
Aggregation Pipeline Intro

FILE  17_aggregation_intro
TOPIC  Pipeline Concept · Stage Order · Options · Limits · find() vs aggregate()
LEVEL  Foundation
01
What is the Pipeline
The aggregation framework conceptual model
concept

The Aggregation Pipeline is MongoDB's framework for data processing and transformation. It is modeled as a conveyor belt: documents enter the pipeline, pass through a series of sequential stages, and emerge as transformed output.

Each stage takes documents as input, applies one specific operation (filter, reshape, group, sort, join, etc.), and passes the result to the next stage.

Collection → $match → $group → $sort → $project → Output

Basic Syntax

db.collection.aggregate([
  { $stage1: { /* options */ } },
  { $stage2: { /* options */ } },
  { $stage3: { /* options */ } }
])
// Array of stage objects — executed top to bottom
// Output of each stage is the input of the next
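The "output of each stage is the input of the next" model can be sketched in plain JavaScript (not mongosh): each stage is a function from an array of documents to an array of documents, and the pipeline is left-to-right function composition. The `match` and `groupSum` helpers below are hypothetical stand-ins for $match and $group, not MongoDB APIs.

```javascript
// Illustrative sketch only: stages as functions, pipeline as composition.
const docs = [
  { category: "A", amount: 10, status: "completed" },
  { category: "B", amount: 5,  status: "pending"   },
  { category: "A", amount: 7,  status: "completed" },
];

// Hypothetical stage helpers standing in for $match and $group:
const match = (pred) => (input) => input.filter(pred);
const groupSum = (keyField, sumField) => (input) => {
  const acc = new Map();
  for (const d of input) {
    const k = d[keyField];
    acc.set(k, (acc.get(k) ?? 0) + d[sumField]);
  }
  return [...acc].map(([k, total]) => ({ _id: k, total }));
};

// "Executed top to bottom": each stage's output feeds the next.
const pipeline = [
  match((d) => d.status === "completed"),
  groupSum("category", "amount"),
];
const result = pipeline.reduce((current, stage) => stage(current), docs);
// result: [ { _id: "A", total: 17 } ]
```

The real server streams documents rather than materializing full arrays between stages, but the composition model is the same.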

Why Use the Aggregation Pipeline?

  • Group documents and compute summaries (GROUP BY equivalent)
  • Join multiple collections (LEFT JOIN equivalent)
  • Compute derived fields with expressions
  • Reshape documents — include, exclude, rename, compute fields
  • Deconstruct arrays into individual documents
  • Date/time extraction and formatting
  • Statistical operations (standard deviation, etc.)
NOTE
aggregate() returns a cursor, just like find(). Documents flow through the pipeline lazily — they are not all loaded into memory at once.
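The lazy, cursor-like flow described in the note can be illustrated with JavaScript generators. This is a conceptual sketch of demand-driven stage chaining, not how the server is actually implemented.

```javascript
// Sketch: each stage pulls documents from the previous one on demand,
// so downstream limits stop upstream work early.
function* source(docs) {
  for (const d of docs) yield d;
}
function* matchStage(input, pred) {
  for (const d of input) if (pred(d)) yield d;
}
function* limitStage(input, n) {
  let seen = 0;
  for (const d of input) {
    if (seen++ >= n) return;   // stop pulling once n docs have passed
    yield d;
  }
}

const docs = Array.from({ length: 1000 }, (_, i) => ({ i, even: i % 2 === 0 }));
const cursor = limitStage(matchStage(source(docs), (d) => d.even), 3);

// Nothing has executed yet; documents flow only as the cursor is consumed:
const results = [...cursor];   // three documents: i = 0, 2, 4
```

Because of the limit, only the first few source documents are ever examined, which is the same reason a $limit after $sort can avoid holding the full input in memory.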
02
How It Works
Sequential execution, stage isolation, performance implications
execution

Execution is strictly sequential — top to bottom. The output of one stage becomes the direct input of the next. No stage has visibility into stages that follow it.

Key Implications

  • Stages that reduce data volume ($match) should appear as early as possible
  • Expensive operations ($lookup, $unwind) should operate on the smallest possible dataset
  • Stage order is the primary performance lever — a misplaced $match can turn a 10ms query into a 10-second query
  • Computed fields created in a stage are not available in prior stages

Recommended Stage Order

// Optimal pipeline ordering pattern
db.orders.aggregate([
  { $match: { ... } },        // 1. Filter first — enables index usage
  { $sort: { ... } },         // 2. Sort early if needed for $first/$last
  { $skip: N },               // 3. Skip (pagination)
  { $limit: N },              // 4. Limit — Top-K optimization with $sort
  { $lookup: { ... } },       // 5. Join — on smallest filtered set
  { $unwind: "$arr" },        // 6. Unwind — after filtering, before group
  { $group: { ... } },        // 7. Group — on reduced dataset
  { $project: { ... } }       // 8. Project last — shape final output
])
WARN
Placing $match after $group loses index access — MongoDB must scan every intermediate document. Always filter before grouping whenever possible.

Index Usage in the Pipeline

Indexes can only be used by the earliest stages, before any transformation occurs. Once a document has been transformed (e.g., a field renamed via $project), the original index on that field can no longer be used.

// Index on "status" IS used — $match is first
db.orders.aggregate([
  { $match: { status: "completed" } },
  { $group: { _id: "$category", total: { $sum: "$amount" } } }
])

// Index on "status" is NOT used — $match is after $group
db.orders.aggregate([
  { $group: { _id: "$category", total: { $sum: "$amount" } } },
  { $match: { status: "completed" } }   // scans ALL grouped results
])
03
Stages Overview
Core stages with SQL equivalents
reference
Order | Stage | SQL Equivalent | Purpose | Performance Note
1 | $match | WHERE | Filter documents by condition | Always first. Enables index usage and reduces downstream load.
2 | $sort | ORDER BY | Reorder documents | Can use indexes if placed early, before transformation stages.
3 | $skip | OFFSET | Skip first N documents | O(n) — costly at large offsets. Prefer keyset pagination.
4 | $limit | LIMIT | Restrict output to N documents | With $sort, enables the Top-K memory optimization.
5 | $lookup | LEFT JOIN | Join another collection | Resource-intensive. Only join filtered documents.
6 | $unwind | UNNEST | Deconstruct an array field into individual documents | Multiplies document count. Apply $match before.
7 | $group | GROUP BY | Aggregate documents into summaries | 100MB RAM limit per stage.
8 | $project | SELECT | Include, exclude, rename, or compute fields | Usually last, to shape the final client output.
any | $addFields / $set | SELECT (add) | Add computed fields without removing existing ones | Aliases of each other. Preferred when only adding fields.
any | $count | COUNT(*) | Count documents entering the stage | Outputs { fieldName: count }.
last | $out | INSERT INTO | Write results to a collection (replace) | Drops and replaces the target collection.
last | $merge | MERGE / UPSERT | Write results to a collection (upsert/merge) | More flexible than $out; can merge into existing docs.
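The Top-K memory optimization mentioned for $sort + $limit can be sketched as keeping only the K best documents seen so far instead of sorting the whole input. This is illustrative JavaScript, not the server's actual implementation (which uses a bounded priority queue; a small sorted array makes the same point for small K).

```javascript
// Sketch: top-K by score without a full sort of the input.
function topK(docs, k, score) {
  const best = [];                       // kept sorted, highest score first
  for (const d of docs) {
    if (best.length < k || score(d) > score(best[k - 1])) {
      best.push(d);
      best.sort((a, b) => score(b) - score(a));
      if (best.length > k) best.pop();   // never hold more than K docs
    }
  }
  return best;
}

const orders = [
  { category: "Electronics", revenue: 84200 },
  { category: "Toys",        revenue: 12100 },
  { category: "Furniture",   revenue: 61500 },
  { category: "Clothing",    revenue: 43800 },
  { category: "Books",       revenue: 9900  },
];
const top3 = topK(orders, 3, (d) => d.revenue);
// top3 categories: Electronics, Furniture, Clothing
```

Memory stays proportional to K rather than to the input size, which is why $sort immediately followed by $limit is much cheaper than $sort alone.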
04
Pipeline Options
Second argument to aggregate()
options
db.collection.aggregate(
  [ /* pipeline stages */ ],
  { /* options object */ }
)
Option | Type | Description
allowDiskUse | Boolean | Allow stages to spill to disk when the 100MB RAM limit is hit. Default: false.
cursor | Object | Control cursor batch size. Example: { batchSize: 100 }.
maxTimeMS | Number | Abort if execution exceeds this duration (milliseconds).
collation | Object | Language-specific string comparison rules (e.g., case-insensitive sort).
hint | String/Object | Force use of a specific index.
comment | String | Attach a label visible in logs and $currentOp. Useful for debugging.
bypassDocumentValidation | Boolean | Allow $out/$merge to insert docs that fail schema validation.
let | Object | Define variables accessible throughout the pipeline via $$varName.

Common Option Combinations

// Production-safe large aggregation
db.orders.aggregate(
  [ { $match: { ... } }, { $group: { ... } } ],
  {
    allowDiskUse: true,      // handle large datasets safely
    maxTimeMS: 30000,        // fail after 30 seconds
    comment: "monthly-sales-report"  // visible in slow query log
  }
)

// Debugging with explain
db.orders.aggregate(
  [ { $match: { status: "completed" } } ]
).explain("executionStats")
05
Key Limits
100MB per stage · 16MB per document · Null handling
limits

100MB Per-Stage RAM Limit

Any single pipeline stage ($group, $sort, $bucket, etc.) is limited to 100MB of RAM. Exceeding this limit throws an error unless allowDiskUse: true is set.

// Error: Exceeded memory limit for $group stage
// Fix: enable disk use
db.events.aggregate(
  [ { $group: { _id: "$userId", count: { $sum: 1 } } } ],
  { allowDiskUse: true }
)
// allowDiskUse: true allows temporary spill to disk at a performance cost
// This is a per-stage limit — two $sort stages = two separate 100MB limits

16MB BSON Document Limit

Any single document at any point during pipeline execution — including intermediate documents — cannot exceed 16MB. $lookup is the most common cause when the joined array is very large.

// PROBLEM — joining 50,000 orders into a user doc exceeds 16MB
db.users.aggregate([
  { $lookup: { from: "orders", localField: "_id", foreignField: "userId", as: "orders" } }
])

// FIX — pre-filter inside $lookup pipeline to keep the embedded array small
db.users.aggregate([
  {
    $lookup: {
      from: "orders",
      let: { uid: "$_id" },
      pipeline: [
        { $match: { $expr: { $eq: ["$userId", "$$uid"] }, status: "active" } },
        { $limit: 10 }
      ],
      as: "recentOrders"
    }
  }
])

Null and Missing Field Handling Across Stages

Stage | Missing/Null Field Behavior
$match | Missing field = no match (unless matching null or using $exists)
$group | Null/missing group key → all such documents grouped into one null group (_id: null)
$project | Projecting a missing field → omitted from output, no error
$lookup | Null/missing localField → matches null values in foreignField
$unwind | Null/missing/empty array → document silently dropped (unless preserveNullAndEmptyArrays: true)
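The $unwind row can be made concrete with a small sketch of its semantics in plain JavaScript. The option name mirrors the real `preserveNullAndEmptyArrays` flag; the helper itself is hypothetical, and it simplifies one detail (real $unwind removes an empty-array field from the preserved document rather than keeping it).

```javascript
// Sketch of $unwind semantics on an array field.
function unwind(docs, field, { preserveNullAndEmptyArrays = false } = {}) {
  const out = [];
  for (const d of docs) {
    const v = d[field];
    if (Array.isArray(v) && v.length > 0) {
      for (const item of v) out.push({ ...d, [field]: item });
    } else if (preserveNullAndEmptyArrays) {
      out.push(d);                 // field is null/missing/empty: kept as-is
    }
    // default: document silently dropped
  }
  return out;
}

const docs = [
  { _id: 1, tags: ["a", "b"] },
  { _id: 2, tags: [] },
  { _id: 3 },                      // missing field
];

unwind(docs, "tags").length;                                        // 2
unwind(docs, "tags", { preserveNullAndEmptyArrays: true }).length;  // 4
```

The default silently shrinking the result set is a classic source of "where did my documents go?" bugs after adding $unwind.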
06
find() vs aggregate()
When to use each — capabilities and trade-offs
comparison
Aspect | find() | aggregate()
Best for | Simple retrieval with filtering and field selection | Grouping, joins, computed fields, multi-stage transforms
Grouping | no | yes — $group
Joins | no | yes — $lookup
Computed fields | Limited (projection expressions only) | yes — full expression language
Array deconstruction | no | yes — $unwind
Index usage | yes | yes (on leading $match/$sort stages)
Overhead | Lower for simple queries | Higher — avoid for trivial lookups
Output | Cursor of collection documents | Cursor of transformed documents
Compare two fields | Via $expr (or the deprecated $where) | $match: { $expr: { $gt: ["$a","$b"] } }
TIP
Use find() for simple document retrieval. Switch to aggregate() the moment you need grouping, multi-field computation, joins, or any multi-stage transformation. aggregate() can do everything find() can, but with higher overhead for trivial cases.

$expr — Using Aggregation Expressions in find()

// Standard find() cannot compare two fields from the same document
// $expr bridges aggregation expressions into query context
db.orders.find({
  $expr: { $gt: ["$sellingPrice", "$costPrice"] }
})
// Finds orders where selling price > cost price
// Also works inside $match in an aggregation pipeline
07
Full Example
Top-3 categories by revenue — all concepts applied
example
// Find top 3 product categories by total revenue from completed orders
db.orders.aggregate([

  // Stage 1: Only completed orders — uses index on "status", reduces set early
  { $match: { status: "completed" } },

  // Stage 2: Group by category, compute totals
  {
    $group: {
      _id: "$category",
      totalRevenue: { $sum: "$amount" },
      orderCount:   { $sum: 1 }
    }
  },

  // Stage 3: Sort by highest revenue — Top-K optimization with $limit next
  { $sort: { totalRevenue: -1 } },

  // Stage 4: Keep only top 3 — triggers priority queue, not full-sort memory
  { $limit: 3 },

  // Stage 5: Rename _id to "category" for clean output
  {
    $project: {
      _id: 0,
      category:     "$_id",
      totalRevenue: 1,
      orderCount:   1
    }
  }

],
{ allowDiskUse: true, maxTimeMS: 15000, comment: "top-3-categories" })
// Sample output
[
  { "category": "Electronics", "totalRevenue": 84200, "orderCount": 312 },
  { "category": "Furniture",   "totalRevenue": 61500, "orderCount": 198 },
  { "category": "Clothing",    "totalRevenue": 43800, "orderCount": 540 }
]
TIP
Build pipelines incrementally. Start with one stage, verify the output, add the next. Never write a 10-stage pipeline and debug it as a whole. Use .explain("executionStats") to verify index usage after adding $match.