There are many ways to model and index data in MongoDB for efficient querying. This is straightforward in many cases, but in others it may require a bit more thought and insight to get an optimal solution.

In this post, we first take a look at common use cases and their corresponding indexing patterns. Then we examine the challenge of efficient partial and case insensitive keyword searches in MongoDB, along with a proposed solution.

Single field and compound indexes

Simple use cases, such as key-value lookups, require only a single field index on the appropriate fields and MongoDB takes care of the rest. It (generally) does not matter if the field contains integers, strings, dates, arrays, or other data types.
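As a quick sketch (the collection and field names here are illustrative, not from a specific example):

```javascript
// A single field index on "email"
db.users.createIndex({ email: 1 })

// This equality lookup can now be served directly by the index
db.users.find({ email: "alice@example.com" })
```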

Things are more complicated if compound indexes are needed (e.g. when there are multiple fields in the query and/or sorting is required), as the order of the fields in the index can make a big difference in performance. It is best to measure and compare the relative performance on a representative data set to determine the best option, as it varies with data distribution. For details, see Jesse’s blog post.
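For illustration (names are hypothetical), a compound index supporting an equality match plus a sort might look like:

```javascript
// Equality field first, then the sort field (the "equality, sort, range" rule of thumb)
db.orders.createIndex({ status: 1, orderDate: -1 })

// Served efficiently by the index above: equality on status, results sorted by orderDate
db.orders.find({ status: "shipped" }).sort({ orderDate: -1 })
```

Reversing the field order in the index would force an in-memory sort for this query, which is why measuring on representative data matters.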

Dynamic attributes

Some use cases are hard to optimize with compound indexes alone: the field names may not be known upfront, and/or covering them would require an impractically large number of indexes (which can significantly reduce write performance). Fortunately, this can be tackled with the attribute lookup strategy, which is described in more detail on Asya’s blog.
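As a rough sketch of that strategy (field names are illustrative), dynamic attributes are stored as an array of key/value subdocuments and indexed with a single compound index:

```javascript
// One document whose arbitrary attributes are stored as {k, v} pairs
db.products.insertOne({
  name: "widget",
  attrib: [
    { k: "color", v: "red" },
    { k: "size", v: "L" }
  ]
})

// A single compound index covers queries on any attribute name/value pair
db.products.createIndex({ "attrib.k": 1, "attrib.v": 1 })
db.products.find({ attrib: { $elemMatch: { k: "color", v: "red" } } })
```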

Keywords – exact matches

Another common use case is to add a set of keywords or tags to individual documents in MongoDB for searching. This is explained in the keyword search pattern and works great for exact matches, but does not scale well when case-insensitivity and/or partial matches are needed.

For example, consider a collection called items with the following schema:
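The original sample document is not reproduced here; a representative document (with hypothetical keyword values) might look like:

```javascript
// Each document stores its searchable keywords in the "keys" array field
{ _id: 1, keys: ["Berkshire", "Steakhouse"] }
```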

If we create an index on the keys field, we can then do exact matches on it very efficiently [1]:
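A sketch of the index and an exact-match query (the keyword value is hypothetical):

```javascript
db.items.createIndex({ keys: 1 })

// An exact match on an array element is served by the multikey index
db.items.find({ keys: "Berkshire" }).explain(1)
```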

Note that .explain(1) is shorthand for .explain("allPlansExecution"), which actually executes the query (MongoDB 3.0 and above) in order to obtain execution statistics. For details, see cursor.explain().

Keywords – case insensitive searches

However, case insensitive searches (using regular expressions) are far more expensive as MongoDB cannot use the index effectively (there are 8 million documents in this items collection, each with 2 items in the keys array field):
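For example (the search value is hypothetical), a case insensitive match written as a regular expression:

```javascript
// Case insensitive, so MongoDB must examine all index keys instead of seeking directly
db.items.find({ keys: /^Berkshire$/i }).explain(1)
```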

Option #1: Convert the strings in keys to upper case before storing them in MongoDB, and then perform the same conversion on the query value so that an exact match can be used instead. For example:
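A sketch of this approach (keyword and search values are hypothetical):

```javascript
// Keywords are stored upper-cased, e.g. { _id: 1, keys: ["BERKSHIRE", "STEAKHOUSE"] }
// Upper-case the user's search value too, so a fast exact (indexed) match applies
var search = "berkshire"
db.items.find({ keys: search.toUpperCase() })
```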

Option #2: Use text indexes, which are case insensitive. For example:
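A minimal text index sketch (search value is illustrative):

```javascript
// A collection may have at most one text index
db.items.createIndex({ keys: "text" })

// $text matching is case insensitive by default
db.items.find({ $text: { $search: "berkshire" } })
```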

However, take note of the restrictions, and of the fact that text indexes use stemming to determine the root word, so the search results may differ from those of an exact match.

Keywords – partial (and case insensitive) searches

Partial keyword searches can be performed with regular expressions, but unless the expression is left anchored (i.e. a “starts with” match), MongoDB will again not be able to use the index efficiently. For example, using the same search value as in the previous example but with the first and last characters removed:
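Continuing the hypothetical search value from above, the partial, case insensitive query would look like:

```javascript
// Neither left anchored nor case sensitive, so the index cannot be used efficiently
db.items.find({ keys: /erkshir/i }).explain(1)
```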

Performance wise this is similar to the case insensitive regular expression search discussed in the previous section. Text indexes do not perform partial matches/substrings so they cannot be used here. What can we do to make this faster?

One solution is to precompute all (upper cased) suffixes and store them for efficient left anchored regular expression searches. To do so, one can use the following reference JavaScript function:
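The original reference function is not reproduced here; a minimal sketch of such a suffix-computing function (the name computeSuffixes is an assumption for illustration) might be:

```javascript
// Compute all upper-cased suffixes for an array of keywords.
// e.g. computeSuffixes(["Ale"]) returns ["ALE", "LE", "E"]
function computeSuffixes(keys) {
  var suffixes = [];
  keys.forEach(function (key) {
    var upper = key.toUpperCase();
    // Collect every suffix of the upper-cased keyword
    for (var i = 0; i < upper.length; i++) {
      suffixes.push(upper.substring(i));
    }
  });
  return suffixes;
}
```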

This can be copied and pasted into the MongoDB shell and then executed. For example, using the original keys values from the initial example, we can compute the suffixes:

We can now add these computed suffixes to the documents accordingly. For example:
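As a sketch (assuming the suffix function pasted into the shell above, referred to here as computeSuffixes, and hypothetical keyword values):

```javascript
// Store the precomputed, upper-cased suffixes alongside the original keys
db.items.updateOne(
  { _id: 1 },
  { $set: { suffixes: computeSuffixes(["Berkshire", "Steakhouse"]) } }
)

// Index the suffixes for fast left anchored regular expression searches
db.items.createIndex({ suffixes: 1 })
```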

One can now perform partial and case insensitive searches efficiently with left anchored regular expressions. For example:
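For example, upper-casing the earlier partial search value and anchoring it against the suffixes field:

```javascript
// Left anchored and case sensitive, so the suffixes index is used efficiently
db.items.find({ suffixes: /^ERKSHIR/ }).explain(1)
```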

This approach naturally increases the document and index size, but it is a worthwhile trade-off, as it speeds up such partial searches enormously.

Try it out yourself

You can try this (and your own variants) by using the sample data generator suffix-generator.js. For example, save this locally and run it with the default settings:
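For example (assuming the legacy mongo shell and a MongoDB server running on the default local port):

```shell
mongo suffix-generator.js
```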

This will generate 800,000 documents using 4 threads in the test database, items collection. Each document has two random strings in the keys array field, with the computed suffixes in the suffixes field. Single field indexes on the fields keys and suffixes respectively are also created.

With the WiredTiger storage engine, you should get a collection with statistics similar to the following:

Note that the index size for suffixes is about 4x larger than keys.

The default values for THREADS, BATCH_SIZE, COUNT (number of batches per thread), and COLL_NAME (collection name) can be overridden. For example:
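One way to override them (a sketch, using the legacy mongo shell’s --eval option to set variables before the script runs):

```shell
mongo --eval "var THREADS=8, BATCH_SIZE=1000, COUNT=200, COLL_NAME='items2'" suffix-generator.js
```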

Do let me know in the comments, or via a pull request, if any amendments or further improvements should be made.