10 Tips to Improve Your MongoDB Aggregation Pipelines
Tips, tricks & things to avoid when aggregating data from a MongoDB cluster.
Matt Van Dyke
December 22nd, 2020
At Pulse Analytics, aggregation is an important part of our data pipeline. Combining and restructuring data appropriately for our views has unlocked faster load times and cleaner frontend code. Most, if not all, of this aggregation happens in MongoDB's aggregation pipelines, and our team has learned to build pipelines quickly while avoiding common gotchas along the way. Here are some tips, tricks, and things to avoid when building aggregation pipelines.
Setup
You can run the example aggregation pipelines from each section in a number of ways while reading through this post, but I'd suggest downloading Compass and using the New Pipeline From Text import feature. You will, of course, need access to a MongoDB cluster.
The following sample data should be complex enough to easily demonstrate the pipelines that follow.
Sample Data
[
{
"name": "carrot",
"type": "vegetable",
"color": "orange",
"farm": {
"name": "Shady Pastures",
"state": "PA"
},
"vitamins": ["A", "K", "C"]
},
{
"name": "eggplant",
"type": "vegetable",
"color": "purple",
"farm": {
"name": "Shady Pastures"
},
"vitamins": []
},
{
"name": "clementine",
"type": "fruit",
"color": "orange",
"farm": {
"name": "Shady Pastures",
"state": "PA"
}
},
{
"name": "vitelotte",
"type": "vegetable",
"color": "purple",
"farm": {
"name": "Shady Pastures",
"state": "PA"
},
"vitamins": ["C", "B"]
}
]
1. Save time with $addFields over $project
Make sure to take full advantage of the $addFields stage! Instead of using $project and having to include every field that won't change just to replace one existing field, you can use $addFields to quickly replace the value.
Here's an example using our mock data, where we change the farm field from a subdoc to just the farm's name without having to specify all the other fields in our schema.
// Quick and Easy
[
{
'$addFields': {
'farm': '$farm.name'
}
}
]
// Same result, but way too verbose
[
{
'$project': {
'name': 1,
'type': 1,
'color': 1,
'farm': '$farm.name',
'vitamins': 1
}
}
]
2. Use $unwind syntactic sugar for cleaner pipelines
Unless you need to preserve null values when unwinding, or need further customization, make use of the quicker and cleaner $unwind syntax to speed up your aggregation pipeline building process.
// Sugar VS Fully Configurable Stage
// Sugar
[
{
'$unwind': '$vitamins'
}
]
// Fully Configurable Stage
[
{
'$unwind': {
'path': '$vitamins'
}
}
]
3. Match both empty and missing arrays
To match documents with empty or missing arrays, you can dot into the array at the first index and do an $exists: false check. If you instead use array $size to accomplish this, you'll end up matching only empty arrays.
// Will match `vitamins: []` only
[
{
'$match': {
'vitamins': { '$size': 0 }
}
},
]
// Will match `vitamins: []` or docs where the field does not exist
[
{
'$match': {
'vitamins.0': { '$exists': false }
}
}
]
4. Stub missing or null values with the $ifNull operator
In our mock data, some documents are missing the vitamins field. You can use $ifNull in an $addFields stage to stub the field wherever it's missing, bringing the schema to a much more stable state. $ifNull catches actual null values as well as, like in this case, anything that resolves to null, such as missing fields.
[
{
'$addFields': {
'vitamins': {
'$ifNull': [
'$vitamins', []
]
}
}
}
]
5. How to use ObjectId's internal date as a quick createdOn
It's easy to forget timestamps in the schema-less world of MongoDB, but there's an easy way to match the equivalent of createdOn on any doc that has an ObjectId. (Credit: the Stack Overflow post where we learned this trick.)
[
{
'$addFields': {
'createdOn': {
'$toDate': '$_id'
}
}
}, {
'$match': {
'$and': [
{
'createdOn': {
'$gt': new Date('Wed, 18 Sep 2019 00:00:00 GMT')
}
}, {
'createdOn': {
'$lt': new Date('Mon, 07 Dec 2020 05:00:00 GMT')
}
}
]
}
}
]
6. Get month/day/year/etc from timestamps with $dateToParts
If we build on the previous section's aggregation pipeline, where we formed a createdOn field from the document's ObjectId, we can get more granular insight into the "date parts".
[
{
'$addFields': {
'createdOn': {
'$toDate': '$_id'
}
}
}, {
'$addFields': {
'dateParts': {
'$dateToParts': {
'date': '$createdOn'
}
}
}
}
]
The resulting dateParts field will include everything from year to millisecond, providing a much more user-friendly way to group or match data by year or month without caring about the full timestamp string. If you need it, you can also specify a timezone!
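For example, appending one more stage to the pipeline above would bucket the sample documents by creation month (a sketch; the year and month field names come straight from $dateToParts output):

```
// Count documents created in each year/month bucket
[
  {
    '$group': {
      '_id': {
        'year': '$dateParts.year',
        'month': '$dateParts.month'
      },
      'count': {
        '$sum': 1
      }
    }
  }
]
```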
7. Nest docs for easy drill-down with multiple $group stages
When the goal of aggregation is to create a read-optimized data structure, following the logic of the user interface is key to success. A common UI flow allows users to "drill down" into data, starting at more abstract levels of grouping and moving to more specific ones: e.g., a list of produce types, where clicking on a type brings you to a list organized by something like color, and clicking on a color brings you to the most granular level, the produce itself.
The following aggregation pipeline takes advantage of multiple levels of grouping to accomplish this task, following the simple rule that the most specific grouping should always be applied first.
[
{
'$group': {
'_id': {
'color': '$color',
'type': '$type'
},
'docsByColorType': {
'$push': '$$ROOT'
}
}
}, {
'$group': {
'_id': '$_id.type',
'docsByColor': {
'$push': {
'color': '$_id.color',
'docs': '$docsByColorType'
}
}
}
}
]
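Against the sample data, each output document of this pipeline should look roughly like the following (group and array order aren't guaranteed):

```
// Approximate shape of the "vegetable" output document
{
  '_id': 'vegetable',
  'docsByColor': [
    {
      'color': 'orange',
      'docs': [ /* the carrot document */ ]
    },
    {
      'color': 'purple',
      'docs': [ /* the eggplant and vitelotte documents */ ]
    }
  ]
}
```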
8. Create a hash map only using the aggregation pipeline
A slightly more contrived use case than #7 in this list: after grouping, we can create a map solely in the aggregation pipeline by restructuring our data and using $arrayToObject. The following aggregation pipeline will get you a simple map: { purple: [PurpleProduce], orange: [OrangeProduce] }
[
{
'$group': {
'_id': '$color',
'docs': {
'$push': '$$ROOT'
}
}
}, {
'$project': {
'_id': 0,
'arr': [
'$_id', '$docs'
]
}
}, {
'$group': {
'_id': null,
'docs': {
'$addToSet': '$arr'
}
}
}, {
'$project': {
'map': {
'$arrayToObject': '$docs'
}
}
}, {
'$replaceRoot': {
'newRoot': '$map'
}
}
]
Also, take note of the _id: null trick in the grouping stage, used to blindly recombine all documents from the previous stage.
9. Avoid extra groups in $group stage due to unstable subdoc schema
You might have noticed in this article's mock data that the farm subdoc isn't schema stable: farm.state is missing from the eggplant document. This is a common situation that can happen for many different reasons.
Maybe certain write ops against the collection exclude fields from subdocs when they aren't relevant. Or maybe you're in the later stages of an aggregation pipeline where a $lookup stage yielded no data in a few cases, and after converting the empty array (the result of failed lookups) to an object, the field simply disappears on a number of your documents. The point is that missing fields can very easily arise in your data and create schema-unstable situations.
Where this instability can really bite you is when you're using subdocs in a $group stage. Create a $group stage with _id: '$farm' against the mock data, and you'll see what I mean. That missing farm.state field causes two groups in the stage's output instead of one. It's even easier to fall into this trap if you're grouping by multiple fields.
So, what's the solution? I'd recommend grouping by entire subdocs only when you're 100% sure the schema is stable. When you're not, it's safer to dot into the data structure for something like _id. In this situation with the farm subdoc, grouping by farm.name is much more reliable than grouping by the entire subdoc!
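To see the difference for yourself, compare these two stages against the mock data (a minimal sketch of the point above):

```
// Risky: yields two "Shady Pastures" groups, because the eggplant's
// farm subdoc is missing the `state` field
[
  {
    '$group': {
      '_id': '$farm'
    }
  }
]
// Safer: dotting into the subdoc yields a single group
[
  {
    '$group': {
      '_id': '$farm.name'
    }
  }
]
```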
10. When you'll lose sort order after $unwind
We've found that sort order doesn't always stick when building aggregation pipelines. A lot depends on how you group and unwind throughout the pipeline.
[
{
'$sort': {
'name': 1
}
}, {
'$group': {
'_id': '$color',
'docs': {
'$push': '$$ROOT'
}
}
}, {
'$unwind': {
'path': '$docs'
}
}, {
'$replaceRoot': {
'newRoot': '$docs'
}
}
]
In the above aggregation pipeline, we sort by name, then group by color. The documents within each docs array in the $group stage remain sorted by name (though the groups themselves aren't necessarily in any order). Then, when $unwind executes, the docs arrays are flattened in arbitrary group order. Sometimes the first group's subdocs will appear before the second group's subdocs; sometimes it'll be the other way around. Check out the GIF below to see what I mean. Notice how $unwind arbitrarily spreads the docs out.
In cases like these, it's more reliable to save the $sort stage until after the groups are flattened.
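For instance, the earlier pipeline could be rearranged like this, so the $sort runs on the flattened documents (a sketch of the same stages, reordered):

```
[
  {
    '$group': {
      '_id': '$color',
      'docs': {
        '$push': '$$ROOT'
      }
    }
  }, {
    '$unwind': {
      'path': '$docs'
    }
  }, {
    '$replaceRoot': {
      'newRoot': '$docs'
    }
  }, {
    '$sort': {
      'name': 1
    }
  }
]
```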
We hope you've found this an interesting read. If you'd like to join a team that's always striving to build out innovative ways to improve data infrastructure and integrity, check out our openings!