
Aggregation Pipeline Intro

FILE  17_aggregation_intro
TOPIC  Pipeline Concept · Stage Order · Options · Limits · find() vs aggregate()
LEVEL  Foundation
01
What is the Pipeline
The aggregation framework conceptual model
concept

The Aggregation Pipeline is MongoDB's framework for data processing and transformation. It is modeled as a conveyor belt: documents enter the pipeline, pass through a series of sequential stages, and emerge as transformed output.

Each stage takes documents as input, applies one specific operation (filter, reshape, group, sort, join, etc.), and passes the result to the next stage.

Collection → $match → $group → $sort → $project → Output

Basic Syntax

db.collection.aggregate([
  { $stage1: { /* options */ } },
  { $stage2: { /* options */ } },
  { $stage3: { /* options */ } }
])
// Array of stage objects — executed top to bottom
// Output of each stage is the input of the next
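The "output of each stage is the input of the next" model can be sketched in plain JavaScript (not mongosh): each stage is a function from an array of documents to an array of documents, and the pipeline is left-to-right function composition. The `match` and `groupSum` helpers below are hypothetical stand-ins for $match and $group, not MongoDB APIs.

```javascript
// Illustrative sketch only: stages as functions, pipeline as composition.
const docs = [
  { category: "A", amount: 10, status: "completed" },
  { category: "B", amount: 5,  status: "pending"   },
  { category: "A", amount: 7,  status: "completed" },
];

// Hypothetical stage helpers standing in for $match and $group:
const match = (pred) => (input) => input.filter(pred);
const groupSum = (keyField, sumField) => (input) => {
  const acc = new Map();
  for (const d of input) {
    const k = d[keyField];
    acc.set(k, (acc.get(k) ?? 0) + d[sumField]);
  }
  return [...acc].map(([k, total]) => ({ _id: k, total }));
};

// "Executed top to bottom": each stage's output feeds the next.
const pipeline = [
  match((d) => d.status === "completed"),
  groupSum("category", "amount"),
];
const result = pipeline.reduce((current, stage) => stage(current), docs);
// result: [ { _id: "A", total: 17 } ]
```

The real server streams documents rather than materializing full arrays between stages, but the composition model is the same.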

Why Use the Aggregation Pipeline?

  • Group documents and compute summaries (GROUP BY equivalent)
  • Join multiple collections (LEFT JOIN equivalent)
  • Compute derived fields with expressions
  • Reshape documents — include, exclude, rename, compute fields
  • Deconstruct arrays into individual documents
  • Date/time extraction and formatting
  • Statistical operations (standard deviation, etc.)
NOTE
aggregate() returns a cursor, just like find(). Documents flow through the pipeline lazily — they are not all loaded into memory at once.
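The lazy, cursor-like flow described in the note can be illustrated with JavaScript generators. This is a conceptual sketch of demand-driven stage chaining, not how the server is actually implemented.

```javascript
// Sketch: each stage pulls documents from the previous one on demand,
// so downstream limits stop upstream work early.
function* source(docs) {
  for (const d of docs) yield d;
}
function* matchStage(input, pred) {
  for (const d of input) if (pred(d)) yield d;
}
function* limitStage(input, n) {
  let seen = 0;
  for (const d of input) {
    if (seen++ >= n) return;   // stop pulling once n docs have passed
    yield d;
  }
}

const docs = Array.from({ length: 1000 }, (_, i) => ({ i, even: i % 2 === 0 }));
const cursor = limitStage(matchStage(source(docs), (d) => d.even), 3);

// Nothing has executed yet; documents flow only as the cursor is consumed:
const results = [...cursor];   // three documents: i = 0, 2, 4
```

Because of the limit, only the first few source documents are ever examined, which is the same reason a $limit after $sort can avoid holding the full input in memory.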
02
How It Works
Sequential execution, stage isolation, performance implications
execution

Execution is strictly sequential — top to bottom. The output of one stage becomes the direct input of the next. No stage has visibility into stages that follow it.

Key Implications

  • Stages that reduce data volume ($match) should appear as early as possible
  • Expensive operations ($lookup, $unwind) should operate on the smallest possible dataset
  • Stage order is the primary performance lever — a misplaced $match can turn a 10ms query into a 10-second query
  • Computed fields created in a stage are not available in prior stages

Recommended Stage Order

// Optimal pipeline ordering pattern
db.orders.aggregate([
  { $match: { ... } },        // 1. Filter first — enables index usage
  { $sort: { ... } },         // 2. Sort early if needed for $first/$last
  { $skip: N },               // 3. Skip (pagination)
  { $limit: N },              // 4. Limit — Top-K optimization with $sort
  { $lookup: { ... } },       // 5. Join — on smallest filtered set
  { $unwind: "$arr" },        // 6. Unwind — after filtering, before group
  { $group: { ... } },        // 7. Group — on reduced dataset
  { $project: { ... } }       // 8. Project last — shape final output
])
WARN
Placing $match after $group loses index access — MongoDB must scan every intermediate document. Always filter before grouping whenever possible.

Index Usage in the Pipeline

Indexes can only be used by the earliest stages, before any transformation occurs. Once a document has been transformed (e.g., a field renamed via $project), the original index on that field can no longer be used.

// Index on "status" IS used — $match is first
db.orders.aggregate([
  { $match: { status: "completed" } },
  { $group: { _id: "$category", total: { $sum: "$amount" } } }
])

// Index on "status" is NOT used — $match is after $group
db.orders.aggregate([
  { $group: { _id: "$category", total: { $sum: "$amount" } } },
  { $match: { status: "completed" } }   // scans ALL grouped results
])
03
Stages Overview
Core stages with SQL equivalents
reference
Order | Stage | SQL Equivalent | Purpose | Performance Note
1 | $match | WHERE | Filter documents by condition | Always first. Enables index usage and reduces downstream load.
2 | $sort | ORDER BY | Reorder documents | Can use indexes if placed early, before transformation stages.
3 | $skip | OFFSET | Skip first N documents | O(n) — costly at large offsets. Prefer keyset pagination.
4 | $limit | LIMIT | Restrict output to N documents | With $sort, enables the Top-K memory optimization.
5 | $lookup | LEFT JOIN | Join another collection | Resource-intensive. Only join filtered documents.
6 | $unwind | UNNEST | Deconstruct an array field into individual documents | Multiplies document count. Apply $match before.
7 | $group | GROUP BY | Aggregate documents into summaries | 100MB RAM limit per stage.
8 | $project | SELECT | Include, exclude, rename, or compute fields | Usually last, to shape the final client output.
any | $addFields / $set | SELECT (add) | Add computed fields without removing existing ones | Aliases of each other. Preferred when only adding fields.
any | $count | COUNT(*) | Count documents entering the stage | Outputs { fieldName: count }.
last | $out | INSERT INTO | Write results to a collection (replace) | Drops and replaces the target collection.
last | $merge | MERGE / UPSERT | Write results to a collection (upsert/merge) | More flexible than $out; can merge into existing docs.
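The Top-K memory optimization mentioned for $sort + $limit can be sketched as keeping only the K best documents seen so far instead of sorting the whole input. This is illustrative JavaScript, not the server's actual implementation (which uses a bounded priority queue; a small sorted array makes the same point for small K).

```javascript
// Sketch: top-K by score without a full sort of the input.
function topK(docs, k, score) {
  const best = [];                       // kept sorted, highest score first
  for (const d of docs) {
    if (best.length < k || score(d) > score(best[k - 1])) {
      best.push(d);
      best.sort((a, b) => score(b) - score(a));
      if (best.length > k) best.pop();   // never hold more than K docs
    }
  }
  return best;
}

const orders = [
  { category: "Electronics", revenue: 84200 },
  { category: "Toys",        revenue: 12100 },
  { category: "Furniture",   revenue: 61500 },
  { category: "Clothing",    revenue: 43800 },
  { category: "Books",       revenue: 9900  },
];
const top3 = topK(orders, 3, (d) => d.revenue);
// top3 categories: Electronics, Furniture, Clothing
```

Memory stays proportional to K rather than to the input size, which is why $sort immediately followed by $limit is much cheaper than $sort alone.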
04
Pipeline Options
Second argument to aggregate()
options
db.collection.aggregate(
  [ /* pipeline stages */ ],
  { /* options object */ }
)
Option | Type | Description
allowDiskUse | Boolean | Allow stages to spill to disk when the 100MB RAM limit is hit. Default: false.
cursor | Object | Control cursor batch size. Example: { batchSize: 100 }.
maxTimeMS | Number | Abort if execution exceeds this duration (milliseconds).
collation | Object | Language-specific string comparison rules (e.g., case-insensitive sort).
hint | String/Object | Force use of a specific index.
comment | String | Attach a label visible in logs and $currentOp. Useful for debugging.
bypassDocumentValidation | Boolean | Allow $out/$merge to insert docs that fail schema validation.
let | Object | Define variables accessible throughout the pipeline via $$varName.

Common Option Combinations

// Production-safe large aggregation
db.orders.aggregate(
  [ { $match: { ... } }, { $group: { ... } } ],
  {
    allowDiskUse: true,      // handle large datasets safely
    maxTimeMS: 30000,        // fail after 30 seconds
    comment: "monthly-sales-report"  // visible in slow query log
  }
)

// Debugging with explain
db.orders.aggregate(
  [ { $match: { status: "completed" } } ]
).explain("executionStats")
05
Key Limits
100MB per stage · 16MB per document · Null handling
limits

100MB Per-Stage RAM Limit

Any single pipeline stage ($group, $sort, $bucket, etc.) is limited to 100MB of RAM. Exceeding this limit throws an error unless allowDiskUse: true is set.

// Error: Exceeded memory limit for $group stage
// Fix: enable disk use
db.events.aggregate(
  [ { $group: { _id: "$userId", count: { $sum: 1 } } } ],
  { allowDiskUse: true }
)
// allowDiskUse: true allows temporary spill to disk at a performance cost
// This is a per-stage limit — two $sort stages = two separate 100MB limits

16MB BSON Document Limit

Any single document at any point during pipeline execution — including intermediate documents — cannot exceed 16MB. $lookup is the most common cause when the joined array is very large.

// PROBLEM — joining 50,000 orders into a user doc exceeds 16MB
db.users.aggregate([
  { $lookup: { from: "orders", localField: "_id", foreignField: "userId", as: "orders" } }
])

// FIX — pre-filter inside $lookup pipeline to keep the embedded array small
db.users.aggregate([
  {
    $lookup: {
      from: "orders",
      let: { uid: "$_id" },
      pipeline: [
        { $match: { $expr: { $eq: ["$userId", "$$uid"] }, status: "active" } },
        { $limit: 10 }
      ],
      as: "recentOrders"
    }
  }
])

Null and Missing Field Handling Across Stages

Stage | Missing/Null Field Behavior
$match | Missing field = no match (unless matching null or using $exists)
$group | Null/missing group key → all such documents grouped into one null group (_id: null)
$project | Projecting a missing field → omitted from output, no error
$lookup | Null/missing localField → matches null values in foreignField
$unwind | Null/missing/empty array → document silently dropped (unless preserveNullAndEmptyArrays: true)
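The $unwind row can be made concrete with a small sketch of its semantics in plain JavaScript. The option name mirrors the real `preserveNullAndEmptyArrays` flag; the helper itself is hypothetical, and it simplifies one detail (real $unwind removes an empty-array field from the preserved document rather than keeping it).

```javascript
// Sketch of $unwind semantics on an array field.
function unwind(docs, field, { preserveNullAndEmptyArrays = false } = {}) {
  const out = [];
  for (const d of docs) {
    const v = d[field];
    if (Array.isArray(v) && v.length > 0) {
      for (const item of v) out.push({ ...d, [field]: item });
    } else if (preserveNullAndEmptyArrays) {
      out.push(d);                 // field is null/missing/empty: kept as-is
    }
    // default: document silently dropped
  }
  return out;
}

const docs = [
  { _id: 1, tags: ["a", "b"] },
  { _id: 2, tags: [] },
  { _id: 3 },                      // missing field
];

unwind(docs, "tags").length;                                        // 2
unwind(docs, "tags", { preserveNullAndEmptyArrays: true }).length;  // 4
```

The default silently shrinking the result set is a classic source of "where did my documents go?" bugs after adding $unwind.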
06
find() vs aggregate()
When to use each — capabilities and trade-offs
comparison
Aspect | find() | aggregate()
Best for | Simple retrieval with filtering and field selection | Grouping, joins, computed fields, multi-stage transforms
Grouping | no | yes — $group
Joins | no | yes — $lookup
Computed fields | Limited (projection expressions only) | yes — full expression language
Array deconstruction | no | yes — $unwind
Index usage | yes | yes (on leading $match/$sort stages)
Overhead | Lower for simple queries | Higher — avoid for trivial lookups
Output | Cursor of collection documents | Cursor of transformed documents
Compare two fields | Via $expr (or the deprecated $where) | $match: { $expr: { $gt: ["$a","$b"] } }
TIP
Use find() for simple document retrieval. Switch to aggregate() the moment you need grouping, multi-field computation, joins, or any multi-stage transformation. aggregate() can do everything find() can, but with higher overhead for trivial cases.

$expr — Using Aggregation Expressions in find()

// Standard find() cannot compare two fields from the same document
// $expr bridges aggregation expressions into query context
db.orders.find({
  $expr: { $gt: ["$sellingPrice", "$costPrice"] }
})
// Finds orders where selling price > cost price
// Also works inside $match in an aggregation pipeline
07
Full Example
Top-3 categories by revenue — all concepts applied
example
// Find top 3 product categories by total revenue from completed orders
db.orders.aggregate([

  // Stage 1: Only completed orders — uses index on "status", reduces set early
  { $match: { status: "completed" } },

  // Stage 2: Group by category, compute totals
  {
    $group: {
      _id: "$category",
      totalRevenue: { $sum: "$amount" },
      orderCount:   { $sum: 1 }
    }
  },

  // Stage 3: Sort by highest revenue — Top-K optimization with $limit next
  { $sort: { totalRevenue: -1 } },

  // Stage 4: Keep only top 3 — triggers priority queue, not full-sort memory
  { $limit: 3 },

  // Stage 5: Rename _id to "category" for clean output
  {
    $project: {
      _id: 0,
      category:     "$_id",
      totalRevenue: 1,
      orderCount:   1
    }
  }

],
{ allowDiskUse: true, maxTimeMS: 15000, comment: "top-3-categories" })
// Sample output
[
  { "category": "Electronics", "totalRevenue": 84200, "orderCount": 312 },
  { "category": "Furniture",   "totalRevenue": 61500, "orderCount": 198 },
  { "category": "Clothing",    "totalRevenue": 43800, "orderCount": 540 }
]
TIP
Build pipelines incrementally. Start with one stage, verify the output, add the next. Never write a 10-stage pipeline and debug it as a whole. Use .explain("executionStats") to verify index usage after adding $match.