5 ways to improve MongoDB performance
Ever wondered how to process millions of records efficiently? Or do you want to improve the performance of your application by reducing database query time? While working on a similar set of problems, I have identified a few things that will definitely help you improve the performance of your application. These tips are simple and common, yet very practical.
Without further ado, let’s start.
1. Design your schema
Even though MongoDB doesn’t enforce a schema, it’s vital to design one in a way that improves overall performance. A one-to-one relationship is easily modelled with a single-level schema, but when designing a MongoDB schema for a one-to-N relationship, you need to start with a question like:
Is there any need to access the embedded object outside the context of the parent object?
If not, then you can simply embed the N side into the parent object, provided the number of array elements stays well below four figures. If the N-side objects should stand alone for any reason, you can use an array of references to the N-side objects instead.
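Here is a minimal sketch of both shapes in the mongo shell; the `users` and `addresses` collections and their fields are hypothetical, purely to illustrate the two options:

```javascript
// Option 1: embed the N side directly in the parent document.
// Works well when addresses are only ever read with their user
// and the array stays small.
db.users.insertOne({
  name: "Asha",
  addresses: [
    { street: "12 Lake Rd", city: "Pune" },
    { street: "7 Hill St", city: "Mumbai" }
  ]
});

// Option 2: keep the N side in its own collection and store references.
// Use this when addresses must stand alone or the array could grow large.
const addressIds = db.addresses.insertMany([
  { street: "12 Lake Rd", city: "Pune" },
  { street: "7 Hill St", city: "Mumbai" }
]).insertedIds;

db.users.insertOne({ name: "Asha", addresses: Object.values(addressIds) });
```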
Inside documents, it is best to keep the number of array elements well below four figures. Though MongoDB accommodates large documents of up to 16 MB in collections, it works best if you keep individual documents to a few KB in size. If an array is appended to too frequently, it will outgrow the containing document so that its location on disk has to be moved. A lot of index rewriting takes place when a document with a large array is re-indexed, because MongoDB automatically creates a multi-key index if any indexed field is an array. This re-indexing also happens when such a document is inserted or deleted.
You might think that you could get around this by not indexing arrays. Unfortunately, without the indexes, you can run into other problems. Because documents are scanned from start to end, it takes longer to find elements towards the end of an array. Hence, you should always design your schema around your particular application’s data access patterns.
2. Design & Manage indexes
Once the schema has been created, the next step is to design an index. Imagine you want to start learning algorithms and you have a very large DSA book from which you want one particular topic. What you will generally do is go to the index, look for the topic, and the index will tell you the page number of that topic. But if the book doesn’t have an index, you will probably need to go through each page to find your topic, which is tedious.
Similarly, when you ask for some document in a database, the database tries to use an index to quickly find the results for you. If there’s no index to use as reference, it has to check each document, the same way you would have to if your book didn’t have an index.
Consider the scenario below to see how an index helps improve overall performance. Without an index, a query would scan the whole collection of ~1500 documents to return 24 matching documents. The ratio of documents examined to documents returned is far too high, which means performance is poor.
When run with an index, the query scanned 24 index entries and 24 documents to return 24 matching documents, resulting in a very efficient query.
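A hedged sketch of that before-and-after in the mongo shell; the `movies` collection and the `year` field are hypothetical stand-ins for whatever field you filter on:

```javascript
// Before: no index on "year", so the query does a COLLSCAN.
db.movies.find({ year: 1999 }).explain("executionStats");
// executionStats.totalDocsExamined ≈ 1500, nReturned = 24

// Create a single-field index on the queried field.
db.movies.createIndex({ year: 1 });

// After: the same query now uses an IXSCAN.
db.movies.find({ year: 1999 }).explain("executionStats");
// executionStats.totalKeysExamined = 24, totalDocsExamined = 24, nReturned = 24
```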
There is a tremendous improvement in query performance with just a simple indexing configuration.
- The ratio of documents examined to documents returned is 1, which is perfect.
- You are using a better approach: an index scan instead of a whole-collection scan, which results in better performance.
The majority of use cases can be covered by a single-field index, but you can also create compound indexes on two or more fields. With a compound index, the ratio of documents examined to documents returned is again 1 for a query on the indexed fields.
A compound index takes the order of its fields into account when it is created. Say you create one on release year and title: you have then optimized the cases where the query involves the release year only, or the release year and the title. The release year always needs to be specified first if you want to leverage the compound index you just created, as the sketch below shows.
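A minimal sketch of that prefix rule, with the hypothetical `movies` collection again standing in for your data:

```javascript
// Compound index: field order matters.
db.movies.createIndex({ year: 1, title: 1 });

// These queries can use the index, because they specify the prefix field "year":
db.movies.find({ year: 1999 });
db.movies.find({ year: 1999, title: "The Matrix" });

// This one cannot: "title" alone is not a prefix of { year: 1, title: 1 }.
db.movies.find({ title: "The Matrix" });
```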
Once you create an index, it is very important to manage it efficiently. When creating a compound index, ask yourself this question: “Which property of my find query is the most ‘unique’ one?” The higher the uniqueness of the first properties in the compound index, the better it will perform. Additionally, you have to make sure the indexes fit in the RAM available on the database server, to prevent fetching them from disk. Simply removing unused and redundant indexes can also boost performance; the sketch below shows a few commands that help with that.
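A hedged sketch of routine index housekeeping in the shell, again on the hypothetical `movies` collection:

```javascript
// List all indexes on the collection, and check their sizes against RAM.
db.movies.getIndexes();
db.movies.stats().indexSizes;

// Find indexes that are never used (their "accesses.ops" counter stays at 0).
db.movies.aggregate([{ $indexStats: {} }]);

// Drop an index you have confirmed is unused or redundant.
db.movies.dropIndex({ title: 1 });
```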
3. Analyze the queries
When you have designed the schema effectively and created efficient indexes, query performance should be good as well. But what if queries are still inefficient and you want to find the root cause? You can use explain, which reveals how a database operation worked. There are three ways to run it:
- the collection.explain() method
- the cursor.explain() method
- the explain command
All three can be used to return information on query plans and the execution statistics of those query plans. The result is a large JSON document, but a few values are particularly helpful for evaluation:
- explain.executionStats.nReturned: tells you how many documents in the collection matched
- explain.executionStats.executionTimeMillis: tells you how long the query took
- explain.executionStats.totalKeysExamined: tells you the number of index entries scanned
- explain.executionStats.totalDocsExamined: tells you the number of documents scanned
- queryPlanner.winningPlan.inputStage.stage: displays IXSCAN to indicate index use
- queryPlanner.winningPlan.stage: displays COLLSCAN to indicate a collection scan
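As a minimal sketch of reading those values in the shell (the `movies` collection and the filter are hypothetical):

```javascript
const result = db.movies.find({ year: 1999 }).explain("executionStats");

print(result.executionStats.nReturned);           // documents matched
print(result.executionStats.executionTimeMillis); // time taken
print(result.executionStats.totalKeysExamined);   // index entries scanned
print(result.executionStats.totalDocsExamined);   // documents scanned
print(result.queryPlanner.winningPlan.stage);     // COLLSCAN means no index was used
```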
To manually compare the performance of a query using more than one index, you can use the hint() method in conjunction with the explain() method.
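For example, assuming both a { year: 1 } and a { year: 1, title: 1 } index exist, you could force each in turn and compare the stats:

```javascript
// Force the planner to use a specific index, then compare the results.
db.movies.find({ year: 1999 })
  .hint({ year: 1 })
  .explain("executionStats");

db.movies.find({ year: 1999 })
  .hint({ year: 1, title: 1 })
  .explain("executionStats");
// Compare executionTimeMillis and totalKeysExamined between the two plans.
```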
4. Identify order of aggregation stages
MongoDB does have a query optimizer (its decisions can be inspected via the profiler and explain()), and in most cases it’s effective at picking the best of multiple possible plans. However, in the case of the aggregation pipeline, the sequence in which the stages are executed is largely under your control, and you shouldn’t rely on the optimizer to reorder stages into the optimal sequence to get you out of trouble.
You need to make sure that the data is reduced as early as possible in the pipeline, as this reduces the amount of work that has to be done by each successive stage. Data can be reduced using $match and $project, sorts should happen only once the data is reduced, and lookups should happen in the order you intend. Query performance improves when the stages below are efficiently designed; a sketch of a complete pipeline follows the stage list.
$match: A document has to match the provided criteria in the query for it to pass to the next stage. In order to achieve the best performance of the $match stage, use it early in the aggregation process since it will:
- Take advantage of the indexes and hence become much faster
- Limit the number of documents that will be passed to the next stage
$project: In this stage, the documents are modified to add or remove fields that will be returned. The best performance of the $project stage is achieved when we project only indexed fields, or when a covered query is used.
- Projecting fewer fields will not improve query performance unless all the returned fields can be satisfied by an index.
- An index can improve performance, but a covered query can level it up further: it can perform faster than a normal optimized query using an index scan, because the documents themselves never need to be fetched. A sketch of one follows below.
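A minimal sketch of a covered query, assuming the { year: 1, title: 1 } index from earlier:

```javascript
// Covered query: every filtered and returned field lives in the index.
db.movies.createIndex({ year: 1, title: 1 });

// Filter on "year" and return only "title"; _id must be excluded explicitly,
// otherwise MongoDB would have to fetch each document just to populate it.
db.movies.find(
  { year: 1999 },
  { title: 1, _id: 0 }
);
// explain("executionStats") shows totalDocsExamined: 0 for this query.
```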
$limit: Limit the number of documents passed to the next stage, or into the result set, using $limit. This can improve performance, as the next stage has comparatively fewer documents to process.
$sort: Sorting is done in one of the final stages, after filtering the result, to reduce the amount of data being sorted. Sorting works effectively when you have an index defined on the sorting field(s); either a single-field or a compound index can be suitable. If you don’t have an index defined, MongoDB must sort the result itself, and this can be problematic when analyzing a large set of returned documents, since the database imposes a 32 MB memory limit on the combined size of all the documents in a sorting operation. The best performance from the sorting stage is achieved when:
- The field on which the sort is performed is an indexed field, so that MongoDB does not need an in-memory sort.
- In case multiple fields are used in sorting, a compound index is created on all of those fields.
$lookup: Lookups perform a similar function to a SQL join. To perform well, they require an index on the key value used as the foreign key.
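Putting the ordering advice together, here is a hedged sketch of a pipeline over hypothetical `orders` and `customers` collections; the collection and field names are illustrative only:

```javascript
// Index supports both the $match filter and the $sort below.
db.orders.createIndex({ status: 1, orderDate: -1 });

db.orders.aggregate([
  // 1. Reduce the data first: an early $match can use the index above.
  { $match: { status: "shipped" } },
  // 2. Sort on the indexed field, after the data has been reduced.
  { $sort: { orderDate: -1 } },
  // 3. Cap the result before the remaining stages.
  { $limit: 100 },
  // 4. Keep only the fields the later stages and the caller need.
  { $project: { _id: 0, customerId: 1, orderDate: 1, total: 1 } },
  // 5. Join last, on the smallest possible set of documents;
  //    the foreign key here is customers._id, which is indexed by default.
  { $lookup: {
      from: "customers",
      localField: "customerId",
      foreignField: "_id",
      as: "customer"
  } }
]);
```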
5. Use dedicated hardware and SSDs / rebuild your indexes
MongoDB doesn’t really have many bottlenecks when it comes to CPU, because CPU-bound operations are rare (usually things like building indexes). What really causes problems is CPU steal: other guests on the host competing for the CPU resources. To eliminate the possibility of CPU steal, move the MongoDB server onto dedicated hardware.
In order to reduce time complexity, you may need to increase space complexity or perform pre-computation. Here too, your server hardware plays a very important role if you are working with millions of documents, so make sure the hardware configuration matches your requirements. Also avoid problems with shared storage by deploying the dbpath onto locally mounted SSDs.
If you’re satisfied that your structure is efficient yet queries are still running slowly, you could try rebuilding the indexes on each collection, as sketched below. I’d recommend a full backup before rebuilding indexes, and this should be considered a last option when all other options have been exhausted.
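Since the old reIndex() command is discouraged on modern deployments, one hedged way to rebuild is to drop and recreate each secondary index (the `movies` collection is again just a placeholder):

```javascript
// Rebuild by dropping and recreating each secondary index.
// Take a full backup first; queries lose each index until it is recreated.
db.movies.getIndexes().forEach(idx => {
  if (idx.name === "_id_") return;  // the default _id index cannot be dropped
  db.movies.dropIndex(idx.name);
  // Note: options such as unique or sparse would need to be carried over too.
  db.movies.createIndex(idx.key, { name: idx.name });
});
```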
Thanks for reading! I hope this article helps you take the right steps to improve overall MongoDB performance. With these configurations, you can easily boost MongoDB performance. Please let me know if you have further performance tips.