Embedding vs Referencing

FILE  29_embedding_vs_referencing
TOPIC  Decision Factors · Cardinality · $lookup · Mixed Strategy · Decision Tree
LEVEL  Intermediate
01
The Core Decision
The fundamental schema question in MongoDB
concept

Every schema decision in MongoDB ultimately comes down to one question: should this related data live inside the same document (embedding) or in a separate collection linked by ID (referencing)?

Unlike relational databases, where normalization is the default, MongoDB has no single "correct" structure. The right answer depends mainly on how the data is accessed, along with how large it can grow and how it is updated.

Embedding

  • Related data inside the same document
  • Single read → complete data
  • Atomic updates within one document
  • Risk: document growth, duplication

Referencing

  • Data in separate collections, linked by ObjectId
  • Two queries or $lookup to assemble
  • One source of truth, no duplication
  • Risk: slower reads, no atomic cross-doc update
TIP
A common rule of thumb in MongoDB schema design: embed by default, and reference only when you have a specific reason not to embed. Beginners more often go wrong by over-referencing (treating MongoDB like a relational DB) than by over-embedding.
02
Embedding Deep Dive
When sub-documents are the right choice
embed

Embedding places the child data directly inside the parent document, either as a nested object or an array of objects.

// Embedded addresses (user has 1–3 addresses — bounded, always read together)
{
  _id: ObjectId("..."),
  name:  "Alice Johnson",
  email: "alice@example.com",
  addresses: [
    { type: "home",    street: "123 Main St", city: "NYC",    zip: "10001" },
    { type: "billing", street: "456 Oak Ave",  city: "Brooklyn", zip: "11201" }
  ]
}
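// Single document read → user + all addresses, no second query

// Update the home address in place: one atomic write
// (userId is assumed to hold the user's _id; "$" targets the first array element matched by the filter)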
db.users.updateOne(
  { _id: userId, "addresses.type": "home" },
  { $set: { "addresses.$.city": "Queens" } }
)

Strengths of Embedding

  • Read performance: one document fetched with a single index lookup, with no joins and no extra round-trips to the server
  • Atomicity: updates to parent + child happen in a single write operation without transactions
  • Locality: related data stored physically adjacent on disk — cache-friendly
  • Simplicity: no foreign key lookups, no $lookup pipelines, no join boilerplate

When Embedding Breaks Down

| Situation | Problem |
|---|---|
| Child array grows without bound | Document approaches 16MB → write error |
| Child data shared across many parents | Updates must be applied to every embedded copy (denormalization drift) |
| Child data queried independently | Must always load parent document even when only child data is needed |
| Large child payloads rarely needed | Wasted bandwidth and memory loading infrequently-read data |
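
The first row can be made concrete with a rough capacity check. The sketch below is plain JavaScript (runnable in Node) and uses hypothetical helper names, approxBytes and maxEmbeddedItems. It treats the UTF-8 length of the JSON form as a stand-in for BSON size, which is an approximation only.

```javascript
// MongoDB's maximum BSON document size is 16 MiB.
const MAX_DOC_BYTES = 16 * 1024 * 1024;

// Rough size estimate: UTF-8 length of the JSON form.
// Assumption: close enough to BSON size for capacity planning, not exact.
function approxBytes(value) {
  return Buffer.byteLength(JSON.stringify(value), "utf8");
}

// How many child items of a typical size fit next to the parent's other fields?
function maxEmbeddedItems(parentBytes, itemBytes) {
  if (itemBytes <= 0) throw new RangeError("itemBytes must be positive");
  return Math.max(0, Math.floor((MAX_DOC_BYTES - parentBytes) / itemBytes));
}

// Sample comment used only to get a typical per-item size.
const sampleComment = { user: "Bob", text: "Great!", at: "2024-01-01T00:00:00Z" };
const perItem = approxBytes(sampleComment);
console.log(`~${perItem} bytes per comment, room for ~${maxEmbeddedItems(1024, perItem)} comments`);
```

Treat the result as an upper bound: a large array is likely to hurt read and write performance well before it reaches the hard limit.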
03
Referencing Deep Dive
Linking documents across collections
reference

Referencing stores a pointer (typically an ObjectId) to a document in another collection, mirroring the foreign-key pattern in relational databases.

// Parent-side reference: each order stores customerId
// orders collection:
{
  _id: ObjectId("order1"),
  customerId: ObjectId("cust1"),   // ← reference, not embedded
  status: "completed",
  total: 149.99
}

// customers collection:
{ _id: ObjectId("cust1"), name: "Alice", email: "alice@example.com" }

// To assemble: two queries
const order    = db.orders.findOne({ _id: orderId })
const customer = db.customers.findOne({ _id: order.customerId })

// Or: $lookup in aggregation (see section 05)

// Child-side reference: each comment stores postId (more common for 1:many)
// posts collection:
{ _id: ObjectId("post1"), title: "MongoDB Tips", body: "..." }

// comments collection:
{ _id: ObjectId("c1"), postId: ObjectId("post1"), user: "Bob", text: "Great!" }
{ _id: ObjectId("c2"), postId: ObjectId("post1"), user: "Carol", text: "Agreed!" }

// Fetch all comments for a post:
db.comments.find({ postId: ObjectId("post1") })
// Create index on postId for efficient lookup:
db.comments.createIndex({ postId: 1 })

Strengths of Referencing

  • Unbounded growth: child collection can grow to any size without affecting parent document
  • Single source of truth: one document to update when shared data changes
  • Independent queries: child data can be queried, paginated, and indexed on its own
  • Smaller parent documents: parent stays lean — less RAM to load for common parent-only reads
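
The "independent queries" point can be sketched as a paginated read of the comments collection. The helper below is hypothetical and only builds the query arguments, so it runs without a database. Plain strings stand in for ObjectIds, and paging by _id is one possible approach rather than the only one.

```javascript
// Builds the arguments for a paginated "comments for one post" query.
// Hypothetical helper; field names follow the comments example above.
function commentPageQuery(postId, lastSeenId = null, pageSize = 20) {
  const filter = { postId };
  if (lastSeenId !== null) {
    // Resume after the last _id returned on the previous page.
    filter._id = { $gt: lastSeenId };
  }
  return { filter, sort: { _id: 1 }, limit: pageSize };
}

const first = commentPageQuery("post1");
const next = commentPageQuery("post1", "c2");
console.log(JSON.stringify(first));
console.log(JSON.stringify(next));
```

In mongosh the result could be applied as db.comments.find(q.filter).sort(q.sort).limit(q.limit). A compound index such as { postId: 1, _id: 1 } should let each page be read in index order.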
WARN
Cross-document updates are not atomic by default. If your write must update two documents together (e.g., decrement inventory AND create order), you need multi-document transactions (available on replica sets since MongoDB 4.0 and on sharded clusters since 4.2), which add overhead. Embedding eliminates this need by placing both pieces of data in one document.
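
The inventory-and-order case from the warning above might look like the following with the official Node.js driver. This is a sketch under assumptions: client is an already connected MongoClient, the database name "shop" and the collection and field names are hypothetical, and the deployment supports transactions (they are not available on standalone servers).

```javascript
// Sketch: decrement stock and create an order atomically across two collections.
// Assumes `client` is a connected MongoClient from the official Node.js driver.
async function placeOrder(client, productId, qty) {
  const db = client.db("shop");            // hypothetical database name
  const session = client.startSession();
  try {
    await session.withTransaction(async () => {
      const res = await db.collection("inventory").updateOne(
        { _id: productId, stock: { $gte: qty } },   // match only if enough stock
        { $inc: { stock: -qty } },
        { session }                                  // run inside the transaction
      );
      if (res.modifiedCount !== 1) {
        throw new Error("insufficient stock");       // throwing aborts the transaction
      }
      await db.collection("orders").insertOne(
        { productId, qty, status: "created" },
        { session }
      );
    });
  } finally {
    await session.endSession();
  }
}
```

Every operation has to receive the session option. A write issued without it runs outside the transaction and would not be rolled back.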
04
Decision Factors
Six questions that determine embed vs reference
factors

1. Cardinality (Relationship Size)

| Relationship | Example | Strategy |
|---|---|---|
| One-to-One | User ↔ UserProfile | Always embed (or merge into one doc) |
| One-to-Few (≤ ~20) | User → Addresses, Order → LineItems | Embed (bounded, predictable) |
| One-to-Many (hundreds) | BlogPost → Comments | Reference with child-side foreign key |
| One-to-Squillions (millions) | Server → Log entries | Reference with child-side key + bucket pattern |
| Many-to-Many | Students ↔ Courses | Reference (junction or array of IDs) |

2. Read Frequency (Are They Accessed Together?)

// If you ALWAYS need both → embed
// Order display page always shows line items with the order → embed lineItems

// If they're USUALLY read separately → reference
// Product page shows product info; reviews loaded on demand → reference reviews

3. Update Frequency (Who Changes More?)

// If child rarely changes → safe to embed (historical snapshot is fine)
// Order stores customer name at purchase time → embed (historical)

// If child changes and ALL parents must reflect the update → reference
// Product price should be current on all orders → reference (or accept denorm)

4. Data Size (Will It Stay Bounded?)

// BOUNDED — safe to embed
// A user has at most 5 payment methods → embed

// UNBOUNDED — must reference or use subset/bucket pattern
// An influencer has 10M followers → never embed followers array
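
The subset pattern mentioned above keeps a small, capped slice of the child data embedded while the full set lives in its own collection. A minimal sketch, assuming hypothetical field names recentComments and commentCount on the post document; the helper only builds the update document, so it runs without a database.

```javascript
// Subset pattern sketch: keep only the newest N comments embedded in the post,
// while the full history lives in a separate comments collection.
// `recentComments` and `commentCount` are hypothetical field names.
function recentCommentsUpdate(comment, keep = 5) {
  return {
    $push: {
      recentComments: {
        $each: [comment],   // append the new comment
        $slice: -keep,      // negative slice keeps only the last `keep` elements
      },
    },
    $inc: { commentCount: 1 },
  };
}

console.log(JSON.stringify(recentCommentsUpdate({ user: "Bob", text: "Great!" }), null, 2));
```

In mongosh this could be applied as db.posts.updateOne({ _id: postId }, recentCommentsUpdate(comment)), alongside inserting the full comment into the comments collection. The embedded array stays bounded no matter how many comments the post receives.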

5. Atomicity Requirements

// Need atomic update of parent + child together? → embed (free atomicity)
// Order status + inventory count must change together
// → if embedded in same doc: single write, always consistent
// → if referenced: requires transaction (overhead)

6. Independent Query Needs

// Can child data ever be the SUBJECT of a query (not the object)?
// "Find all comments by user Alice across all posts" → reference
// Comments need their own collection with userId index

// "Show this order's line items" → embed (line items have no life outside order)
05
$lookup — MongoDB's JOIN
Assembling referenced collections at query time
$lookup

$lookup is MongoDB's aggregation-pipeline equivalent of a SQL LEFT OUTER JOIN. It joins documents from a foreign collection into the current pipeline's output. It is the primary tool for querying referenced data.

// Basic $lookup: join customer into each order
db.orders.aggregate([
  {
    $lookup: {
      from:         "customers",    // foreign collection
      localField:   "customerId",   // field in orders
      foreignField: "_id",          // field in customers
      as:           "customer"      // output array field name
    }
  },
  { $unwind: "$customer" }           // flatten the one-element array to an object; orders with no match are dropped
])
// Output: each order document with the matching customer nested under "customer"

// Pipeline $lookup: join with a filter and projection (returns only the needed fields)
db.orders.aggregate([
  { $match: { status: "active" } },
  {
    $lookup: {
      from: "products",
      let:  { itemIds: "$lineItems.productId" },  // pass local vars
      pipeline: [
        { $match: { $expr: { $in: ["$_id", "$$itemIds"] } } },
        { $project: { name: 1, price: 1 } }        // only needed fields
      ],
      as: "productDetails"
    }
  }
])
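
The left-outer-join behaviour of the basic form can be modelled in a few lines of plain JavaScript. This is a simplified in-memory sketch, not how the server implements it: it compares scalar fields only, the unwind helper expects the field to hold an array, and both function names are hypothetical.

```javascript
// In-memory model of $lookup with localField/foreignField (left outer join):
// every local document is kept, and matches are collected into an array.
function lookup(localDocs, foreignDocs, { localField, foreignField, as }) {
  return localDocs.map((doc) => ({
    ...doc,
    [as]: foreignDocs.filter((f) => f[foreignField] === doc[localField]),
  }));
}

// In-memory model of { $unwind: "$field" } with default options:
// one output document per array element, so empty arrays produce nothing.
function unwind(docs, field) {
  return docs.flatMap((doc) =>
    (doc[field] || []).map((item) => ({ ...doc, [field]: item }))
  );
}

const orders = [
  { _id: "order1", customerId: "cust1" },
  { _id: "order2", customerId: "missing" },
];
const customers = [{ _id: "cust1", name: "Alice" }];

const joined = lookup(orders, customers, {
  localField: "customerId", foreignField: "_id", as: "customer",
});
console.log(joined.length);                       // 2: both orders survive the join
console.log(unwind(joined, "customer").length);   // 1: the unmatched order is dropped
```

If unmatched orders must survive the flattening step, the real $unwind stage accepts preserveNullAndEmptyArrays: true.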

$lookup Performance Considerations

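| Factor | Detail |
|---|---|
| Index requirement | foreignField should be indexed (_id always is); otherwise each input document triggers a full scan of the foreign collection |
| Pipeline position | Put $match before $lookup to reduce the number of lookups performed |
| Memory and size limits | Joined documents are stored in the "as" array, so each output document must still fit the 16MB document limit; pipeline stages are also subject to the 100MB per-stage memory limit |
| vs embedding | $lookup does extra work on every read compared with an embedded document; use it for infrequent joins or reporting pipelines |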
WARN
If you find yourself using $lookup on every single read operation for a hot endpoint, that is a signal that embedding (or the extended reference pattern) would be more appropriate. $lookup is suited for batch reporting, ad-hoc analytics, and infrequent joins — not high-frequency OLTP reads.
06
Mixed Strategies
Combining embedding and referencing in the same schema
mixed

Real schemas almost never use pure embedding or pure referencing. The most effective designs combine both, applying each where it fits best.

Strategy: Embed Summary, Reference Full

// Order: embed product summary, reference full product doc
{
  _id: ObjectId("order1"),
  lineItems: [
    {
      productId: ObjectId("p1"),       // reference to products collection
      // Embedded snapshot at purchase time (extended reference):
      name: "Wireless Headphones",
      sku:  "WH-2024",
      price: 79.99,                    // price locked at purchase time
      qty:  2
    }
  ],
  total: 159.98
}
// Render order receipt: single document read → complete
// Product price changes in products collection won't alter historical order
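
One way to build such a line item is to copy the needed fields from the product document at purchase time. A minimal sketch in plain JavaScript, with hypothetical helper names snapshotLineItem and orderTotal; a string stands in for the ObjectId, and the stock field is invented here only to show that unrelated fields are not copied.

```javascript
// Copies the fields an order needs from a product document at purchase time.
function snapshotLineItem(product, qty) {
  return {
    productId: product._id,   // reference back to the products collection
    name: product.name,
    sku: product.sku,
    price: product.price,     // frozen: later price changes do not touch this copy
    qty,
  };
}

// Sums price * qty over the line items and rounds to cents.
function orderTotal(lineItems) {
  const sum = lineItems.reduce((acc, li) => acc + li.price * li.qty, 0);
  return Math.round(sum * 100) / 100;
}

const product = { _id: "p1", name: "Wireless Headphones", sku: "WH-2024", price: 79.99, stock: 12 };
const item = snapshotLineItem(product, 2);
product.price = 89.99;                  // catalogue price changes later
console.log(item.price, orderTotal([item]));
```

The snapshot keeps productId, so the full and current product document can still be fetched when a page needs more than the receipt fields.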

Strategy: Array of IDs on the Bounded Side

// Many-to-many: Students enrolled in Courses

// students collection — array of enrolled course IDs (bounded ≤ ~20 courses)
{ _id: ObjectId("s1"), name: "Alice", enrolledCourses: [ObjectId("c1"), ObjectId("c2")] }

// courses collection — no student array (would be unbounded)
{ _id: ObjectId("c1"), title: "MongoDB Fundamentals", instructor: "Dr. Smith" }

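// Find Alice's courses (fetch the student first to get the ID array):
const student = db.students.findOne({ name: "Alice" })
db.courses.find({ _id: { $in: student.enrolledCourses } })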

// Find all students in a course:
db.students.find({ enrolledCourses: ObjectId("c1") })
// Index enrolledCourses for this query:
db.students.createIndex({ enrolledCourses: 1 })
07
Decision Tree
Step-by-step guide to choosing embed vs reference
reference
// Ask these questions in order:

// Q1: Is the related data ALWAYS read with the parent?
//   YES → lean toward embedding
//   NO  → lean toward referencing

// Q2: Can the child array grow WITHOUT BOUND?
//   YES → must reference (or use subset/bucket)
//   NO  → embedding is safe

// Q3: Is the child data SHARED across many parents?
//   YES → reference (one source of truth)
//   NO  → embedding fine (data belongs to one parent)

// Q4: Do parent + child need ATOMIC updates together?
//   YES → strongly prefer embedding (free atomicity)
//   NO  → either works

// Q5: Is child data ever queried INDEPENDENTLY?
//   YES → reference (needs own collection + indexes)
//   NO  → embedding appropriate
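
The five questions can also be written as a small function. The precedence used here (Q2, Q3 and Q5 override the others) and the "mixed" outcome when such data is still always read with the parent are assumptions of this sketch, chosen to agree with the summary table that follows. The function and parameter names are hypothetical.

```javascript
// Encodes Q1 to Q5 as a function. The precedence and the "mixed" outcome are
// assumptions of this sketch, not rules stated by MongoDB.
function chooseStrategy({
  alwaysReadTogether = false,   // Q1
  unbounded = false,            // Q2
  sharedAcrossParents = false,  // Q3
  needsAtomicUpdate = false,    // Q4
  queriedIndependently = false, // Q5
} = {}) {
  const reasons = [];
  if (unbounded) reasons.push("Q2: child array can grow without bound");
  if (sharedAcrossParents) reasons.push("Q3: child data is shared across parents");
  if (queriedIndependently) reasons.push("Q5: child data is queried on its own");

  if (reasons.length > 0) {
    // Still always read with the parent? Embed a small summary and reference the rest.
    return { strategy: alwaysReadTogether ? "mixed" : "reference", reasons };
  }
  if (alwaysReadTogether || needsAtomicUpdate) {
    return { strategy: "embed", reasons: ["Q1/Q4: read or updated together, bounded and owned by one parent"] };
  }
  return { strategy: "reference", reasons: ["Q1: not usually read with the parent"] };
}

console.log(chooseStrategy({ alwaysReadTogether: true }).strategy);                     // user + addresses
console.log(chooseStrategy({ unbounded: true }).strategy);                              // user + followers
console.log(chooseStrategy({ alwaysReadTogether: true, unbounded: true }).strategy);    // post + comment preview
```

A function like this is best used as a checklist. Real schemas usually weigh the answers against measured read and write patterns.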

Summary Lookup Table

| Scenario | Strategy |
|---|---|
| User + 1-5 addresses | Embed |
| Order + line items (bounded) | Embed |
| User + 10M followers | Reference (separate follows collection) |
| Post + 1000s of comments | Reference (child-side postId) |
| Order + customer (display only) | Mixed (embed name/email snapshot) |
| Post + last-5 comments preview | Mixed (subset pattern) |
| Student ↔ Course (many:many) | Reference (array of IDs in student) |
| User + settings/preferences | Embed (or merge into user doc) |