10 Tips to Improve Your MongoDB Aggregation Pipelines
Tips, tricks & things to avoid when aggregating data from a MongoDB cluster.
Matt Van Dyke
December 22nd, 2020
At Pulse Analytics, aggregation is an important part of our data pipeline. Combining and restructuring data appropriately for our views has unlocked faster load times and cleaner frontend code. Most, if not all, of this aggregation happens in MongoDB's aggregation pipelines, and our team has learned to build pipelines quickly while avoiding common gotchas along the way. Here are some tips, tricks, and things to avoid when building aggregation pipelines.
Setup
You can run the example aggregation pipelines from each section in a number of ways while reading through this post, but I'd suggest downloading Compass and using the New Pipeline From Text import feature. You will, of course, need access to a MongoDB cluster.
The following sample data should be complex enough to easily demonstrate the pipelines that follow.
Sample Data
[
{
"name": "carrot",
"type": "vegetable",
"color": "orange",
"farm": {
"name": "Shady Pastures",
"state": "PA"
},
"vitamins": ["A", "K", "C"]
},
{
"name": "eggplant",
"type": "vegetable",
"color": "purple",
"farm": {
"name": "Shady Pastures"
},
"vitamins": []
},
{
"name": "clementine",
"type": "fruit",
"color": "orange",
"farm": {
"name": "Shady Pastures",
"state": "PA"
}
},
{
"name": "vitelotte",
"type": "vegetable",
"color": "purple",
"farm": {
"name": "Shady Pastures",
"state": "PA"
},
"vitamins": ["C", "B"]
}
]
1. Save time with $addFields over $project
Make sure to take full advantage of the $addFields stage! Instead of using $project and having to include every field that won't change just to replace one existing field, you can use $addFields to quickly replace the value.
Here's an example using our mock data, where we change the farm field from a subdoc to just the farm's name without having to specify all the other fields in our schema.
// Quick and Easy
[
{
'$addFields': {
'farm': '$farm.name'
}
}
]
// Same result, but way too verbose
[
{
'$project': {
'name': 1,
'type': 1,
'color': 1,
'farm': '$farm.name',
'vitamins': 1
}
}
]
2. Use $unwind syntactic sugar for cleaner pipelines
Unless you need to preserve null values when unwinding, or need further customization, make use of the quicker and cleaner $unwind syntax to speed up your aggregation pipeline building process.
// Sugar VS Fully Configurable Stage
// Sugar
[
{
'$unwind': '$vitamins'
}
]
// Fully Configurable Stage
[
{
'$unwind': {
'path': '$vitamins'
}
}
]
3. Match both empty and missing arrays
To match documents with empty or missing arrays, you can dot into the array at the first index and do an $exists: false check. If you instead use array $size to accomplish this, you'll end up matching only empty arrays.
// Will match `vitamins: []` only
[
{
'$match': {
'vitamins': { '$size': 0 }
}
},
]
// Will match `vitamins: []` or docs where the field does not exist
[
{
'$match': {
'vitamins.0': { '$exists': false }
}
}
]
4. Stub missing or null values with the $ifNull operator
In our mock data, some documents are missing the vitamins field. You can use $ifNull in an $addFields stage to stub the field wherever it's missing, bringing the schema to a much more stable state. $ifNull catches actual null values as well as, like in this case, anything that resolves to null, such as missing fields.
[
{
'$addFields': {
'vitamins': {
'$ifNull': [
'$vitamins', []
]
}
}
}
]
5. How to use ObjectId's internal date as a quick createdOn
It's easy to forget timestamps in the schema-less world of MongoDB, but there's an easy way to match the equivalent of createdOn on any doc that has an ObjectId. (Credit: the Stack Overflow post where we learned this trick.)
[
{
'$addFields': {
'createdOn': {
'$toDate': '$_id'
}
}
}, {
'$match': {
'$and': [
{
'createdOn': {
'$gt': new Date('Wed, 18 Sep 2019 00:00:00 GMT')
}
}, {
'createdOn': {
'$lt': new Date('Mon, 07 Dec 2020 05:00:00 GMT')
}
}
]
}
}
]
6. Get month/day/year/etc from timestamps with $dateToParts
If we build on the previous section's aggregation pipeline, where we formed a createdOn field from the document's ObjectId, we can get more granular insight into the "date parts".
[
{
'$addFields': {
'createdOn': {
'$toDate': '$_id'
}
}
}, {
'$addFields': {
'dateParts': {
'$dateToParts': {
'date': '$createdOn'
}
}
}
}
]
The resulting dateParts field will include everything from year to millisecond, providing a much more user-friendly way to group or match data by year or month without caring about the full timestamp string. If you need it, you can also specify a timezone!
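For example, appending one more stage to the pipeline above would bucket the sample documents by creation month (a sketch; the year and month field names come straight from $dateToParts output):

```
// Count documents created in each year/month bucket
[
  {
    '$group': {
      '_id': {
        'year': '$dateParts.year',
        'month': '$dateParts.month'
      },
      'count': {
        '$sum': 1
      }
    }
  }
]
```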
7. Nest docs for easy drill-down with multiple $group stages
When the goal of aggregation is to create a read-optimized data structure, following the logic of the user interface is key to success. A common UI flow allows users to "drill down" into data, starting at more abstract levels of grouping and moving to more specific ones: e.g., a list of produce types, where clicking on a type brings you to a list organized by something like color, and clicking on a color brings you to the most granular level, the produce itself.
The following aggregation pipeline takes advantage of multiple levels of grouping to accomplish this task, following the simple rule that the most specific grouping should always be applied first.
[
{
'$group': {
'_id': {
'color': '$color',
'type': '$type'
},
'docsByColorType': {
'$push': '$$ROOT'
}
}
}, {
'$group': {
'_id': '$_id.type',
'docsByColor': {
'$push': {
'color': '$_id.color',
'docs': '$docsByColorType'
}
}
}
}
]
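Against the sample data, each output document of this pipeline should look roughly like the following (group and array order aren't guaranteed):

```
// Approximate shape of the "vegetable" output document
{
  '_id': 'vegetable',
  'docsByColor': [
    {
      'color': 'orange',
      'docs': [ /* the carrot document */ ]
    },
    {
      'color': 'purple',
      'docs': [ /* the eggplant and vitelotte documents */ ]
    }
  ]
}
```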
8. Create a hash map only using the aggregation pipeline
A slightly more contrived use case than #7 in this list: after grouping, we can create a map solely in the aggregation pipeline by restructuring our data and using $arrayToObject. The following aggregation pipeline will get you a simple map: { purple: [PurpleProduce], orange: [OrangeProduce] }
[
{
'$group': {
'_id': '$color',
'docs': {
'$push': '$$ROOT'
}
}
}, {
'$project': {
'_id': 0,
'arr': [
'$_id', '$docs'
]
}
}, {
'$group': {
'_id': null,
'docs': {
'$addToSet': '$arr'
}
}
}, {
'$project': {
'map': {
'$arrayToObject': '$docs'
}
}
}, {
'$replaceRoot': {
'newRoot': '$map'
}
}
]
Also, take note of the _id: null trick in the grouping stage, used to blindly recombine all documents from the previous stage.
9. Avoid extra groups in $group stage due to unstable subdoc schema
You might have noticed in this article's mock data that the farm subdoc isn't schema stable: farm.state is missing from the eggplant document. This is a common situation that can happen for many different reasons.
Maybe certain write ops against the collection exclude fields from subdocs when they aren't relevant. Or maybe you're in the later stages of an aggregation pipeline where a $lookup stage yielded no data in a few cases, and after converting the empty array (the result of failed lookups) to an object, the field simply disappears on a number of your documents. The point is that missing fields can very easily arise in your data and create schema-unstable situations.
Where this instability can really bite you is when you're using subdocs in a $group stage. Create a $group stage with _id: '$farm' against the mock data, and you'll see what I mean. That missing farm.state field causes two groups in the stage's output instead of one. It's even easier to fall into this trap if you're grouping by multiple fields.
So, what's the solution? I'd recommend grouping by entire subdocs only when you're 100% sure the schema is stable. When you're not, it's safer to dot into the data structure for something like _id. In this situation with the farm subdoc, grouping by farm.name is much more reliable than grouping by the entire subdoc!
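To see the difference for yourself, compare these two stages against the mock data (a minimal sketch of the point above):

```
// Risky: yields two "Shady Pastures" groups, because the eggplant's
// farm subdoc is missing the `state` field
[
  {
    '$group': {
      '_id': '$farm'
    }
  }
]
// Safer: dotting into the subdoc yields a single group
[
  {
    '$group': {
      '_id': '$farm.name'
    }
  }
]
```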
10. When you'll lose sort order after $unwind
We've found that sort order doesn't always stick when building aggregation pipelines. A lot depends on how you group and unwind throughout the pipeline.
[
{
'$sort': {
'name': 1
}
}, {
'$group': {
'_id': '$color',
'docs': {
'$push': '$$ROOT'
}
}
}, {
'$unwind': {
'path': '$docs'
}
}, {
'$replaceRoot': {
'newRoot': '$docs'
}
}
]
In the above aggregation pipeline, we sort by name, then group by color. The documents within each docs array in the $group stage remain sorted by name (though the groups themselves aren't necessarily in any order). Then, when $unwind executes, the docs arrays are flattened in arbitrary group order. Sometimes the first group's subdocs will appear before the second group's subdocs; sometimes it'll be the other way around. Check out the GIF below to see what I mean. Notice how $unwind arbitrarily spreads the docs out.
In cases like these, it's more reliable to save the $sort stage until after the groups are flattened.
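For instance, the earlier pipeline could be rearranged like this, so the $sort runs on the flattened documents (a sketch of the same stages, reordered):

```
[
  {
    '$group': {
      '_id': '$color',
      'docs': {
        '$push': '$$ROOT'
      }
    }
  }, {
    '$unwind': {
      'path': '$docs'
    }
  }, {
    '$replaceRoot': {
      'newRoot': '$docs'
    }
  }, {
    '$sort': {
      'name': 1
    }
  }
]
```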
We hope you've found this an interesting read. If you'd like to join a team that's always striving to build out innovative ways to improve data infrastructure and integrity, check out our openings!