This benchmark indexes and searches a 20 M document subset of the New York City taxi ride corpus, in both sparse and dense form. Green taxi rides make up ~11.5% of the 20 M documents, and yellow taxi rides make up the remaining ~88.5%. See this blog post for details.
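
As a rough illustration of the split described above, the following sketch converts the stated percentages into approximate document counts. The names `TOTAL_DOCS`, `SHARE_PER_MILLE`, and `approx_counts` are hypothetical and are not part of the benchmark itself; because the percentages are approximate, the resulting counts are too.

```python
# Approximate per-color document counts implied by the stated percentages.
# All names here are illustrative assumptions, not part of the benchmark code.
TOTAL_DOCS = 20_000_000  # the 20 M document subset

# Shares in tenths of a percent, so the arithmetic stays in integers.
SHARE_PER_MILLE = {"green": 115, "yellow": 885}  # ~11.5% and ~88.5%


def approx_counts(total, shares_per_mille):
    """Return the approximate number of documents for each taxi color."""
    return {
        color: total * share // 1000
        for color, share in shares_per_mille.items()
    }


counts = approx_counts(TOTAL_DOCS, SHARE_PER_MILLE)
for color, n in counts.items():
    print(f"{color}: ~{n:,} documents")
```

Under these assumptions the subset holds roughly 2.3 M green and 17.7 M yellow taxi ride documents.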

