29 · Embedding vs Referencing

01

The Core Decision

The fundamental schema question in MongoDB

concept

Every schema decision in MongoDB ultimately comes down to one question: should this related data live inside the same document (embedding) or in a separate collection linked by ID (referencing)?

Unlike relational databases, where normalization is the default, MongoDB has no universally correct layout. The right answer depends mainly on how the application reads and writes the data.

Embedding

Related data inside the same document
Single read → complete data
Atomic updates within one document
Risk: document growth, duplication

Referencing

Data in separate collections, linked by ObjectId
Two queries or $lookup to assemble
One source of truth, no duplication
Risk: slower reads, no atomic cross-doc update

TIP

A common MongoDB rule of thumb: embed by default, and reference only when you have a specific reason not to embed. Most beginner mistakes come from over-referencing (treating MongoDB like a relational DB) rather than over-embedding.

02

Embedding Deep Dive

When sub-documents are the right choice

embed

Embedding places the child data directly inside the parent document, either as a nested object or an array of objects.

// Embedded addresses (user has 1–3 addresses — bounded, always read together)
{
  _id: ObjectId("..."),
  name:  "Alice Johnson",
  email: "alice@example.com",
  addresses: [
    { type: "home",    street: "123 Main St", city: "NYC",    zip: "10001" },
    { type: "billing", street: "456 Oak Ave",  city: "Brooklyn", zip: "11201" }
  ]
}
// Single document read → user + all addresses, no second query
// Update the home address: single atomic write
db.users.updateOne(
  { _id: userId, "addresses.type": "home" },
  { $set: { "addresses.$.city": "Queens" } }
)

Strengths of Embedding

Read performance: one document read from a single B-Tree lookup, with no joins and no extra network round-trips
Atomicity: updates to parent + child happen in a single write operation without transactions
Locality: related data stored physically adjacent on disk — cache-friendly
Simplicity: no foreign key lookups, no $lookup pipelines, no join boilerplate

When Embedding Breaks Down

Situation	Problem
Child array grows without bound	Writes fail once the document would exceed the 16MB BSON limit; very large arrays tend to slow reads and updates well before that
Child data shared across many parents	Updates must be applied to every embedded copy (denormalization drift)
Child data queried independently	Must always load parent document even when only child data is needed
Large child payloads rarely needed	Wasted bandwidth and memory loading infrequently-read data
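
To make the unbounded-growth row concrete, the following minimal sketch estimates how many embedded children fit before a parent document reaches the 16MB limit. It assumes JSON byte length is an acceptable stand-in for BSON size (the real encoding differs), and the helper names `approxBytes` and `estimateMaxChildren` are hypothetical, not part of any MongoDB API.

```javascript
const MAX_BSON_BYTES = 16 * 1024 * 1024; // MongoDB's per-document size limit

// Rough size estimate: JSON byte length stands in for BSON size (assumption).
function approxBytes(value) {
  return Buffer.byteLength(JSON.stringify(value), "utf8");
}

// Estimate how many copies of sampleChild fit before the parent hits the limit.
function estimateMaxChildren(parent, sampleChild, limitBytes = MAX_BSON_BYTES) {
  const parentBytes = approxBytes(parent);
  const childBytes = approxBytes(sampleChild) + 1; // +1 for the separating comma
  return Math.max(0, Math.floor((limitBytes - parentBytes) / childBytes));
}

const post = { title: "MongoDB Tips", body: "...", comments: [] };
const comment = { user: "Bob", text: "Great!", createdAt: "2024-01-01T00:00:00Z" };

// Prints a rough upper bound on embedded comments for this sample shape.
console.log(estimateMaxChildren(post, comment));
```

The number is only an order-of-magnitude guide. In practice the array becomes a problem long before the hard limit, so a result in the hundreds of thousands is a reason to reference, not a budget to spend.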

03

Referencing Deep Dive

Linking documents across collections

reference

Referencing stores a pointer (typically an ObjectId) to a document in another collection, mirroring the foreign-key pattern in relational databases.

// Many-to-one reference: each order stores the customerId of its customer
// orders collection:
{
  _id: ObjectId("order1"),
  customerId: ObjectId("cust1"),   // ← reference, not embedded
  status: "completed",
  total: 149.99
}

// customers collection:
{ _id: ObjectId("cust1"), name: "Alice", email: "alice@example.com" }

// To assemble: two queries
const order    = db.orders.findOne({ _id: orderId })
const customer = db.customers.findOne({ _id: order.customerId })

// Or: $lookup in aggregation (see section 05)
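
When a page shows many orders, running the second query once per order multiplies round-trips. A sketch of one way around that, assuming plain in-memory objects with string ids in place of ObjectIds: collect the distinct `customerId` values, fetch them with a single `$in` filter, and stitch the results together in application code. `buildCustomerFilter` and `attachCustomers` are hypothetical helper names.

```javascript
// Step 1: collect the distinct customer ids referenced by a page of orders.
function buildCustomerFilter(orders) {
  const ids = [...new Set(orders.map((o) => o.customerId))];
  return { _id: { $in: ids } }; // filter for one customers.find(...) call
}

// Step 2: attach each fetched customer to its orders (null when missing).
function attachCustomers(orders, customers) {
  const byId = new Map(customers.map((c) => [c._id, c]));
  return orders.map((o) => ({ ...o, customer: byId.get(o.customerId) ?? null }));
}

// In-memory stand-ins for query results; string ids replace ObjectIds for brevity.
const orders = [
  { _id: "order1", customerId: "cust1", total: 149.99 },
  { _id: "order2", customerId: "cust1", total: 20.0 },
  { _id: "order3", customerId: "cust9", total: 5.0 },
];
const customers = [{ _id: "cust1", name: "Alice" }];

const filter = buildCustomerFilter(orders);
const joined = attachCustomers(orders, customers);
console.log(filter, joined);
```

With real ObjectId values, the `Set` and `Map` keys would need a string form of the id, since two ObjectId instances with the same value are distinct JavaScript objects.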

// Child-side reference: each comment stores postId (more common for 1:many)
// posts collection:
{ _id: ObjectId("post1"), title: "MongoDB Tips", body: "..." }

// comments collection:
{ _id: ObjectId("c1"), postId: ObjectId("post1"), user: "Bob", text: "Great!" }
{ _id: ObjectId("c2"), postId: ObjectId("post1"), user: "Carol", text: "Agreed!" }

// Fetch all comments for a post:
db.comments.find({ postId: ObjectId("post1") })
// Create index on postId for efficient lookup:
db.comments.createIndex({ postId: 1 })

Strengths of Referencing

Unbounded growth: child collection can grow to any size without affecting parent document
Single source of truth: one document to update when shared data changes
Independent queries: child data can be queried, paginated, and indexed on its own
Smaller parent documents: parent stays lean — less RAM to load for common parent-only reads

WARN

Cross-document updates are not atomic by default. If your write must update two documents together (e.g., decrement inventory AND create order), you need multi-document transactions (available since MongoDB 4.0 on replica sets and 4.2 on sharded clusters), which add overhead. Embedding avoids this by placing both pieces of data in one document, where a single write is always atomic.

04

Decision Factors

Six questions that determine embed vs reference

factors

1. Cardinality (Relationship Size)

Relationship	Example	Strategy
One-to-One	User ↔ UserProfile	Always embed (or merge into one doc)
One-to-Few (≤ ~20)	User → Addresses, Order → LineItems	Embed (bounded, predictable)
One-to-Many (hundreds)	BlogPost → Comments	Reference with child-side foreign key
One-to-Squillions (millions)	Server → Log entries	Reference with child-side key + bucket pattern
Many-to-Many	Students ↔ Courses	Reference (junction or array of IDs)
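
The bucket pattern named in the One-to-Squillions row can be sketched as an upsert that appends to a bucket with room and creates a new bucket otherwise. This is an illustration under assumptions: the field names `serverId`, `count`, and `entries`, the bucket size of 200, and the helper `buildBucketAppend` are all hypothetical. The helper only builds the arguments that would be passed to `updateOne`.

```javascript
// Hypothetical helper: builds the arguments for
// db.serverLogs.updateOne(filter, update, options).
function buildBucketAppend(serverId, entry, bucketSize = 200) {
  return {
    // Match a bucket for this server that still has room.
    filter: { serverId, count: { $lt: bucketSize } },
    // Append the entry and track how many entries the bucket holds.
    update: { $push: { entries: entry }, $inc: { count: 1 } },
    // If no bucket with room exists, the upsert creates a fresh one.
    options: { upsert: true },
  };
}

const args = buildBucketAppend("server42", { ts: "2024-01-01T00:00:00Z", msg: "started" });
console.log(JSON.stringify(args, null, 2));
```

Each bucket document stays bounded at roughly `bucketSize` entries, so the log for one server spreads across many mid-sized documents instead of one unbounded array or millions of tiny documents.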

2. Read Frequency (Are They Accessed Together?)

// If you ALWAYS need both → embed
// Order display page always shows line items with the order → embed lineItems

// If they're USUALLY read separately → reference
// Product page shows product info; reviews loaded on demand → reference reviews

3. Update Frequency (Who Changes More?)

// If child rarely changes → safe to embed (historical snapshot is fine)
// Order stores customer name at purchase time → embed (historical)

// If child changes and ALL parents must reflect the update → reference
// Product price should be current on all orders → reference (or accept denorm)

4. Data Size (Will It Stay Bounded?)

// BOUNDED — safe to embed
// A user has at most 5 payment methods → embed

// UNBOUNDED — must reference or use subset/bucket pattern
// An influencer has 10M followers → never embed followers array
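
For the subset pattern mentioned above, one possible shape is to keep a small, capped preview array on the parent while the full set lives in its own collection. The sketch below builds an update using `$push` with the `$each` and `$slice` modifiers, and models the effect in memory. The field names `recentComments` and `commentCount` and both helper names are assumptions for illustration.

```javascript
// Builds an update document that appends a comment to a capped preview array.
function buildPreviewUpdate(comment, keep = 5) {
  if (!Number.isInteger(keep) || keep < 1) {
    throw new RangeError("keep must be a positive integer");
  }
  return {
    // A negative $slice keeps only the last `keep` elements after the push.
    $push: { recentComments: { $each: [comment], $slice: -keep } },
    $inc: { commentCount: 1 },
  };
}

// In-memory model of what $push + $each + $slice does to the array.
function applyPreview(recent, comment, keep = 5) {
  return [...recent, comment].slice(-keep);
}

const update = buildPreviewUpdate({ user: "Bob", text: "Great!" });
const preview = applyPreview(["c1", "c2", "c3", "c4", "c5"], "c6");
console.log(JSON.stringify(update), preview);
```

The preview array can never exceed `keep` entries, so the parent document stays bounded no matter how many comments the referenced collection accumulates.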

5. Atomicity Requirements

// Need atomic update of parent + child together? → embed (free atomicity)
// Order status + inventory count must change together
// → if embedded in same doc: single write, always consistent
// → if referenced: requires transaction (overhead)
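
A sketch of the "free atomicity" case, assuming a product document that embeds both a `stock` counter and a `reservations` array (hypothetical field names, as is the helper `buildReserveStock`). Because the stock check sits in the filter and both changes sit in one update, the check and the write apply to the same document in a single operation.

```javascript
// Hypothetical helper: builds arguments for db.products.updateOne(filter, update).
function buildReserveStock(productId, orderId, qty) {
  if (!Number.isInteger(qty) || qty < 1) {
    throw new RangeError("qty must be a positive integer");
  }
  return {
    // Only match when enough stock remains; otherwise nothing is modified.
    filter: { _id: productId, stock: { $gte: qty } },
    update: {
      $inc: { stock: -qty },                      // decrement inventory
      $push: { reservations: { orderId, qty } },  // record the reservation
    },
  };
}

const reserve = buildReserveStock("p1", "order1", 2);
console.log(JSON.stringify(reserve));
```

When passed to `updateOne`, a result with `modifiedCount` of 0 would indicate that the stock condition did not match, so the caller can report "out of stock" without any partial write to undo.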

6. Independent Query Needs

// Can child data ever be the SUBJECT of a query (not the object)?
// "Find all comments by user Alice across all posts" → reference
// Comments need their own collection with userId index

// "Show this order's line items" → embed (line items have no life outside order)

05

$lookup — MongoDB's JOIN

Assembling referenced collections at query time

$lookup

$lookup is MongoDB's aggregation-pipeline equivalent of a SQL LEFT OUTER JOIN. It joins documents from a foreign collection into the current pipeline's output. It is the primary tool for querying referenced data.

// Basic $lookup: join customer into each order
db.orders.aggregate([
  {
    $lookup: {
      from:         "customers",    // foreign collection
      localField:   "customerId",   // field in orders
      foreignField: "_id",          // field in customers
      as:           "customer"      // output array field name
    }
  },
  { $unwind: "$customer" }           // flatten array to object (1:1 join); drops orders with no matching customer
])
// Output: each order document with the matched customer nested under the "customer" field

// Pipeline $lookup: join with filter and projection (more efficient)
db.orders.aggregate([
  { $match: { status: "active" } },
  {
    $lookup: {
      from: "products",
      let:  { itemIds: "$lineItems.productId" },  // pass local vars
      pipeline: [
        { $match: { $expr: { $in: ["$_id", "$$itemIds"] } } },
        { $project: { name: 1, price: 1 } }        // only needed fields
      ],
      as: "productDetails"
    }
  }
])
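
The semantics of the basic `$lookup` followed by `$unwind` can be modeled in a few lines of plain JavaScript. This is a simplified in-memory emulation for building intuition, not how the server executes the stages; `lookup` and `unwind` are hypothetical helpers, and ids are strings for brevity.

```javascript
// Emulates basic $lookup: every local doc is kept, and `as` receives an
// array of matching foreign docs (possibly empty), like a LEFT OUTER JOIN.
function lookup(localDocs, foreignDocs, { localField, foreignField, as }) {
  return localDocs.map((doc) => ({
    ...doc,
    [as]: foreignDocs.filter((f) => f[foreignField] === doc[localField]),
  }));
}

// Emulates $unwind: one output doc per array element. Docs whose field is
// missing, null, or an empty array are dropped unless preserve is set
// (simplified: preserved docs are passed through unchanged).
function unwind(docs, field, { preserveNullAndEmptyArrays = false } = {}) {
  return docs.flatMap((doc) => {
    const values = doc[field];
    const isEmpty = values == null || (Array.isArray(values) && values.length === 0);
    if (isEmpty) return preserveNullAndEmptyArrays ? [doc] : [];
    return (Array.isArray(values) ? values : [values]).map((v) => ({ ...doc, [field]: v }));
  });
}

const orders = [
  { _id: "order1", customerId: "cust1" },
  { _id: "order2", customerId: "missing" },
];
const customers = [{ _id: "cust1", name: "Alice" }];

const joined = lookup(orders, customers, {
  localField: "customerId", foreignField: "_id", as: "customer",
});
const inner = unwind(joined, "customer");
const outer = unwind(joined, "customer", { preserveNullAndEmptyArrays: true });
console.log(joined.length, inner.length, outer.length);
```

The model shows why a plain `$unwind` after `$lookup` behaves like an inner join: the order with no matching customer survives the lookup with an empty array, then disappears when the array is unwound.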

$lookup Performance Considerations

Factor	Detail
Index requirement	`foreignField` should be indexed; otherwise each input document triggers a full scan of the foreign collection
Pipeline position	Put `$match` before `$lookup` to reduce the number of lookups performed
Memory limit	Joined documents count toward the 100MB per-stage pipeline limit, and each output document (including its joined array) must still fit within 16MB
vs embedding	`$lookup` is generally slower than reading an embedded document; use it for infrequent joins or reporting pipelines

WARN

If you find yourself using $lookup on every single read operation for a hot endpoint, that is a signal that embedding (or the extended reference pattern) would be more appropriate. $lookup is suited for batch reporting, ad-hoc analytics, and infrequent joins — not high-frequency OLTP reads.

06

Mixed Strategies

Combining embedding and referencing in the same schema

mixed

Real schemas almost never use pure embedding or pure referencing. The most effective designs combine both, applying each where it fits best.

Strategy: Embed Summary, Reference Full

// Order: embed product summary, reference full product doc
{
  _id: ObjectId("order1"),
  lineItems: [
    {
      productId: ObjectId("p1"),       // reference to products collection
      // Embedded snapshot at purchase time (extended reference):
      name: "Wireless Headphones",
      sku:  "WH-2024",
      price: 79.99,                    // price locked at purchase time
      qty:  2
    }
  ],
  total: 159.98
}
// Render order receipt: single document read → complete
// Product price changes in products collection won't alter historical order
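
One way to produce such a line item in application code is to copy only the fields the receipt needs from the product at purchase time. The sketch below uses in-memory objects with string ids; `snapshotLineItem` and `orderTotal` are hypothetical helpers, and summing in integer cents is an added assumption to sidestep floating-point drift.

```javascript
// Copy only the fields the order needs; the product document keeps the rest.
function snapshotLineItem(product, qty) {
  return {
    productId: product._id, // reference back to the products collection
    name: product.name,
    sku: product.sku,
    price: product.price,   // value at purchase time
    qty,
  };
}

// Sum in integer cents, then convert back to a decimal amount.
function orderTotal(lineItems) {
  const cents = lineItems.reduce(
    (sum, item) => sum + Math.round(item.price * 100) * item.qty,
    0
  );
  return cents / 100;
}

const product = {
  _id: "p1", name: "Wireless Headphones", sku: "WH-2024",
  price: 79.99, description: "Long marketing copy...",
};
const lineItems = [snapshotLineItem(product, 2)];
const order = { _id: "order1", lineItems, total: orderTotal(lineItems) };

product.price = 89.99; // a later price change in the catalog
console.log(order.total, order.lineItems[0].price);
```

The later price change leaves the order untouched because the snapshot copied the value rather than holding a live link, while `productId` still allows navigating to the current product when needed.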

Strategy: Array of IDs on the Bounded Side

// Many-to-many: Students enrolled in Courses

// students collection — array of enrolled course IDs (bounded ≤ ~20 courses)
{ _id: ObjectId("s1"), name: "Alice", enrolledCourses: [ObjectId("c1"), ObjectId("c2")] }

// courses collection — no student array (would be unbounded)
{ _id: ObjectId("c1"), title: "MongoDB Fundamentals", instructor: "Dr. Smith" }

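// Find Alice's courses (load the student first to get the array of course IDs):
const student = db.students.findOne({ name: "Alice" })
db.courses.find({ _id: { $in: student.enrolledCourses } })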

// Find all students in a course:
db.students.find({ enrolledCourses: ObjectId("c1") })
// Index enrolledCourses for this query:
db.students.createIndex({ enrolledCourses: 1 })

07

Decision Tree

Step-by-step guide to choosing embed vs reference

reference

// Ask these questions in order:

// Q1: Is the related data ALWAYS read with the parent?
//   YES → lean toward embedding
//   NO  → lean toward referencing

// Q2: Can the child array grow WITHOUT BOUND?
//   YES → must reference (or use subset/bucket)
//   NO  → embedding is safe

// Q3: Is the child data SHARED across many parents?
//   YES → reference (one source of truth)
//   NO  → embedding fine (data belongs to one parent)

// Q4: Do parent + child need ATOMIC updates together?
//   YES → strongly prefer embedding (free atomicity)
//   NO  → either works

// Q5: Is child data ever queried INDEPENDENTLY?
//   YES → reference (needs own collection + indexes)
//   NO  → embedding appropriate
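
The five questions can be encoded as a small function. This is one possible encoding, not a canonical algorithm: the tie-breaking is an assumption (any of Q2, Q3, Q5 forces a reference, and "mixed" is returned when the data is also read together), and `chooseStrategy` with its parameter names is hypothetical.

```javascript
// Maps answers to Q1..Q5 onto "embed", "reference", or "mixed".
function chooseStrategy({
  readTogether = false,         // Q1
  unbounded = false,            // Q2
  shared = false,               // Q3
  atomicTogether = false,       // Q4
  queriedIndependently = false, // Q5
} = {}) {
  const reasons = [];
  if (unbounded) reasons.push("child array can grow without bound");
  if (shared) reasons.push("child data is shared across parents");
  if (queriedIndependently) reasons.push("child data is queried on its own");

  if (reasons.length > 0) {
    // Referencing is required; if the data is still read with the parent,
    // embed a small summary or subset alongside the reference.
    return { strategy: readTogether ? "mixed" : "reference", reasons };
  }
  if (readTogether || atomicTogether) {
    return { strategy: "embed", reasons: ["read or updated together with parent"] };
  }
  return { strategy: "reference", reasons: ["not read with parent"] };
}

console.log(chooseStrategy({ readTogether: true }).strategy);
console.log(chooseStrategy({ unbounded: true, queriedIndependently: true }).strategy);
console.log(chooseStrategy({ readTogether: true, shared: true }).strategy);
```

Treat the output as a starting point for discussion. Real schemas weigh read and write volumes that a handful of booleans cannot capture.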

Summary Lookup Table

Scenario	Strategy
User + a few addresses (bounded)	Embed
Order + line items (bounded)	Embed
User + 10M followers	Reference (separate follows collection)
Post + 1000s of comments	Reference — child-side postId
Order + customer (display only)	Mixed — embed name/email snapshot
Post + last-5 comments preview	Mixed — subset pattern
Student ↔ Course (many:many)	Reference — array of IDs in student
User + settings/preferences	Embed (or merge into user doc)