
Schema Design

FILE  28_schema_design
TOPIC  Embed vs Reference · Hybrid · Patterns · 16MB Limit · Unbounded Arrays · Shard Keys
LEVEL  Intermediate/Advanced
01
Design Philosophy
Access-pattern-first, not normalisation-first
concept

MongoDB schema design starts from a fundamentally different premise than relational modelling. In SQL, you normalise first to eliminate redundancy, then add indexes to speed up joins. In MongoDB, you design around your access patterns first — the goal is to co-locate the data each query needs so that a single document read returns everything without joins.

Relational approach

  • Normalise to 3NF — eliminate all redundancy
  • Data is spread across many tables
  • Add JOIN queries to reassemble data at read time
  • Schema designed independently of query patterns

MongoDB approach

  • Design around how data is READ and WRITTEN
  • Co-locate related data — embed what's read together
  • Controlled redundancy is acceptable for speed
  • Schema evolves with access pattern changes

The Three Fundamental Questions

Before writing any schema, answer these three questions for each entity:

  • How is this data written? — insert rate, update frequency, write atomicity requirements
  • How is this data read? — which fields are fetched together, how often, by which query patterns
  • What is the data's lifecycle? — how large will it grow, does it get deleted, is it time-bounded?
TIP
The most important design insight: MongoDB's unit of atomicity is the document. Multi-document transactions exist but are slower. If two pieces of data must be updated together atomically (e.g., order total and line items), embed them in the same document — you get atomic updates for free without transaction overhead.
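A minimal sketch of that idea: because the line items and the total live in the same document, one `updateOne` changes both atomically. The `addLineItemUpdate` helper and field names are assumptions for illustration, not a driver API.

```javascript
// Build one atomic update: push a line item AND adjust the order total.
// Both fields live in the same document, so no transaction is needed.
function addLineItemUpdate(item) {
  return {
    $push: { lineItems: item },              // append to the embedded array
    $inc:  { total: item.qty * item.price }  // keep the pre-computed total in sync
  };
}

// Hypothetical usage from mongosh / a driver:
// db.orders.updateOne({ orderId: "ORD-001" }, addLineItemUpdate(newItem))
```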
02
Embed vs Reference
When to put data inside the document vs link to another collection
decision

The core schema question in MongoDB: should related data live inside the same document (embedding) or in a separate document referenced by ID (referencing)?

Embedding

// Embedded: order with line items in same document
{
  _id: ObjectId("..."),
  orderId: "ORD-001",
  customerId: "C001",
  status: "completed",
  lineItems: [
    { productId: "P1", name: "Widget", qty: 2, price: 19.99 },
    { productId: "P2", name: "Gadget", qty: 1, price: 49.99 }
  ],
  total: 89.97
}
// Single document read → complete order data
// Atomic updates to order + line items — no transaction needed

Referencing

// Referenced: order links to customer by ID
{
  _id: ObjectId("..."),
  orderId: "ORD-001",
  customerId: ObjectId("..."),  // ← reference to customers collection
  status: "completed",
  total: 89.97
}
// Fetching order + customer requires two queries (or $lookup in aggregation)
// But: customer data stays up-to-date in one place — no duplication

Decision Matrix

Factor              | Favour Embedding                   | Favour Referencing
--------------------|------------------------------------|------------------------------------
Read pattern        | Data always read together          | Data read independently
Update pattern      | Data updated together atomically   | Child data updated independently
Relationship type   | One-to-few (1:1, 1:few)            | One-to-many, many-to-many
Data size           | Child data is small and bounded    | Child data is large or unbounded
Duplication concern | Data is unique to this parent      | Data is shared across many parents
16MB limit          | Document stays well under 16MB     | Embedding risks breaching 16MB
Query independence  | Child never queried without parent | Child queried independently
NOTE
There is no universally correct answer. A blog post's tags (bounded, always read with post) should be embedded. A blog post's comments (unbounded, may number thousands) should be referenced — or handled with the subset pattern.
03
The Hybrid Approach
Embed the summary, reference the detail — the MongoDB sweet spot
hybrid

Pure embedding (all data in one document) and pure referencing (everything normalised) are both extremes. The hybrid approach — the recommended MongoDB design pattern — embeds the most frequently accessed summary data directly in the document while keeping the full detail in its own collection.

// ❌ Pure embedding — works until blog gets popular
{
  _id: ObjectId("..."),
  title: "MongoDB Schema Design",
  body: "...",
  comments: [                     // UNBOUNDED — could be 10,000 comments
    { user: "Alice", text: "Great post!", date: ISODate("...") },
    { user: "Bob",   text: "Thanks!",     date: ISODate("...") },
    // ... thousands more
  ]
}

// ❌ Pure referencing — always two queries to render post + comments
// Post doc has no preview of comments — poor UX for comment count display

// ✅ Hybrid: embed summary in post, reference full comments
{
  _id: ObjectId("..."),
  title: "MongoDB Schema Design",
  body: "...",
  commentCount: 247,              // pre-computed — no $count query needed
  recentComments: [               // newest comments — bounded preview (capped at 3)
    { user: "Carol", text: "Very clear!", date: ISODate("2024-03-01") },
    { user: "Dave",  text: "Bookmarked",  date: ISODate("2024-03-02") }
  ]
  // Full comments live in: db.comments.find({ postId: ObjectId("...") })
}
// Render post page: 1 document read → has title, body, count, preview
// Load all comments: second query only when user clicks "load comments"
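On a new comment, both sides of the hybrid must be written: insert the full comment into its collection and refresh the embedded preview. A sketch using the field names from the document above (the `newCommentUpdate` helper itself is an assumption):

```javascript
// Build the update that keeps the embedded preview in sync on a new comment.
function newCommentUpdate(comment, previewSize = 3) {
  return {
    $inc:  { commentCount: 1 },   // maintain the pre-computed count
    $push: {
      recentComments: {
        $each:  [comment],
        $sort:  { date: -1 },     // newest first
        $slice: previewSize       // cap the embedded preview
      }
    }
  };
}

// Hypothetical usage, writing both sides:
// db.posts.updateOne({ _id: postId }, newCommentUpdate(c))
// db.comments.insertOne({ postId, ...c })
```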
TIP
The hybrid pattern maps directly to real page loads: the main document read satisfies 95% of page render needs with one query. The second collection is only queried on demand (e.g., clicking to expand, paginating). This is how most high-traffic MongoDB applications operate in practice.
04
Design Patterns
Extended Reference · Computed · Subset · Bucket · Outlier
patterns

Extended Reference Pattern

When documents reference another collection, embed the most frequently accessed fields from the referenced document directly in the referencing document. This eliminates the need for a second lookup for those fields.

// Problem: every order display requires fetching customer name + email
// Pure reference: two queries every time
{ orderId: "ORD-001", customerId: ObjectId("...") }
db.customers.findOne({ _id: ObjectId("...") })

// ✅ Extended Reference: embed the hot fields directly in the order
{
  orderId:    "ORD-001",
  customerId: ObjectId("..."),      // full reference still kept for joins
  customer: {                       // embedded snapshot of hot fields
    name:  "Alice Johnson",
    email: "alice@example.com"
  },
  total: 89.97
}
// Render order: single document → has customer name + email, no second query
// Tradeoff: if customer changes name/email, existing orders show old values
// This is acceptable for orders (historical snapshot is often correct anyway)

Computed / Pre-calculated Pattern

Instead of counting or summing at read time, maintain the aggregate value in the document and update it on write. Read operations are O(1); aggregation cost is paid once per write instead of on every read.

// ❌ Expensive: counting likes on every post view
db.likes.countDocuments({ postId: ObjectId("...") })   // O(N) on every render

// ✅ Computed: maintain likeCount in post, increment on like action
{
  _id:       ObjectId("..."),
  title:     "MongoDB Schema Design",
  likeCount: 1247,   // pre-computed — updated on every like/unlike
  viewCount: 58300
}

// On like action: atomic increment — O(1) write
db.posts.updateOne(
  { _id: postId },
  { $inc: { likeCount: 1 } }
)
// Read likeCount: single field access — O(1), no aggregation

Subset Pattern

For one-to-many relationships where you need to display only the N most recent or relevant items, embed the latest N items in the parent document and keep the full list in a child collection.

// Product with reviews — could have thousands of reviews

// ✅ Subset: embed last 5 reviews in product doc
{
  _id:      ObjectId("..."),
  name:     "Wireless Headphones",
  avgRating: 4.3,
  reviewCount: 2847,
  recentReviews: [             // BOUNDED array — always exactly 5
    { user: "Alice", rating: 5, text: "Excellent sound quality" },
    { user: "Bob",   rating: 4, text: "Good value for money" },
    // ... 3 more
  ]
}
// Full reviews: db.reviews.find({ productId: ObjectId("...") })

// On new review: $push with $slice to maintain exactly 5 recent reviews
db.products.updateOne(
  { _id: productId },
  {
    $push: {
      recentReviews: {
        $each:  [newReview],
        $sort:  { date: -1 },  // most recent first
        $slice: 5              // keep only 5
      }
    },
    $inc: { reviewCount: 1 },
    $set: { avgRating: newAverage }
  }
)
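The `newAverage` value above can be derived incrementally from the stored `avgRating` and `reviewCount`, without re-reading every review. A sketch (the helper name is an assumption):

```javascript
// Running average: newAvg = (oldAvg * oldCount + newRating) / (oldCount + 1)
function nextAverage(avg, count, rating) {
  return (avg * count + rating) / (count + 1);
}

// e.g., read avgRating and reviewCount from the product doc,
// compute nextAverage(...), and pass the result as newAverage in the $set.
```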

Bucket Pattern

For time-series data with high write throughput (IoT sensors, logs, metrics), group time-adjacent readings into a single "bucket" document instead of one document per reading. Dramatically reduces document count and index size.

// ❌ Naive: one document per sensor reading — millions of tiny docs
{ sensorId: "S1", timestamp: ISODate("..."), temp: 22.4 }
{ sensorId: "S1", timestamp: ISODate("..."), temp: 22.5 }
// 1 reading per minute = 525,600 documents per sensor per year

// ✅ Bucket: group one hour of readings per document
{
  sensorId:  "S1",
  hour:      ISODate("2024-03-01T10:00:00Z"),  // bucket key
  count:     60,
  readings:  [
    { minute: 0, temp: 22.4 },
    { minute: 1, temp: 22.5 },
    // ... 58 more (one per minute)
  ],
  stats: {
    min: 22.1,
    max: 23.0,
    avg: 22.6                 // pre-computed per bucket
  }
}
// 1 reading per minute = 8,760 documents per sensor per year (60x fewer)

// Add readings to current bucket with $push
db.sensor_data.updateOne(
  { sensorId: "S1", hour: currentHour, count: { $lt: 60 } },
  { $push: { readings: newReading }, $inc: { count: 1 } },
  { upsert: true }
)
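The `currentHour` bucket key above can be derived by truncating the reading's timestamp to the start of its UTC hour. A sketch (the helper name is an assumption):

```javascript
// Truncate a timestamp to the start of its UTC hour — the bucket key.
function hourBucket(ts) {
  const d = new Date(ts);
  d.setUTCMinutes(0, 0, 0);   // zero out minutes, seconds, milliseconds
  return d;
}
```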
NOTE
MongoDB has a native Time Series Collection type (v5.0+) which internally applies the bucket pattern automatically — preferred over manual bucket implementation for new IoT/metrics projects. Use db.createCollection("sensor_data", { timeseries: { timeField: "timestamp", metaField: "sensorId" } }).
05
16MB Limit & Unbounded Arrays
Hard document limit and the most common schema anti-pattern
limits

MongoDB enforces a hard 16MB maximum document size. Exceeding this limit causes a write error. The most common cause of hitting this limit is unbounded arrays — arrays that grow without limit as the application runs.

DANGER
The most dangerous schema anti-pattern: embedding an array that grows indefinitely. A comments: [] array on a popular post, a followers: [] array on a viral account, or an events: [] array on a long-lived document will eventually hit 16MB and crash all writes to that document with no recovery other than splitting the document.
// ❌ UNBOUNDED ARRAY ANTI-PATTERNS

// A user with 500K followers → 500K ObjectIds ≈ 10MB with BSON overhead → nearing the limit
{ _id: ObjectId("..."), username: "popular_user", followers: [ id1, id2, ... ] }

// A chat room open for 1 year → thousands of messages
{ _id: ObjectId("..."), roomId: "room-1", messages: [ { text: "hi", ... }, ... ] }

// An order with unlimited audit log entries
{ _id: ObjectId("..."), orderId: "ORD-1", auditLog: [ { action: "created" }, ... ] }

// ✅ FIXES

// 1. Separate collection — follower relationships
db.follows.insertOne({ followerId: ObjectId("..."), followeeId: ObjectId("...") })
db.follows.createIndex({ followeeId: 1 })    // count followers
db.follows.createIndex({ followerId: 1 })    // list who I follow

// 2. Bucket pattern — chat messages grouped by day/hour
db.messages.insertOne({ roomId: "room-1", date: "2024-03-01", messages: [ ... ] })

// 3. Capped/rotating — keep only last N (subset pattern + $slice)
db.orders.updateOne({ _id: id }, {
  $push: { auditLog: { $each: [entry], $slice: -100 } }  // keep last 100 only
})

Working Near the Limit

// Check document sizes in a collection
db.collection.aggregate([
  { $project: { size: { $bsonSize: "$$ROOT" } } },
  { $sort: { size: -1 } },
  { $limit: 10 }
])
// Documents approaching 10MB+ are at risk — refactor immediately

// $bsonSize is available from MongoDB 4.4+
// 16MB = 16,777,216 bytes
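Back-of-envelope capacity math helps when reviewing a design. A rough sketch: the ~19-byte figure is an estimate (12-byte ObjectId plus the type byte and array-index key string each BSON array element carries), and the helper name is an assumption.

```javascript
const MAX_BSON = 16 * 1024 * 1024;   // 16,777,216 bytes — hard limit

// Roughly how many array elements fit before the document breaches 16MB?
function maxElements(baseBytes, perElementBytes) {
  return Math.floor((MAX_BSON - baseBytes) / perElementBytes);
}

// e.g., a followers array of ObjectIds at ~19 bytes per element:
// maxElements(1024, 19) — under a million followers and the document is full
```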
WARN
There is no warning before hitting 16MB — the write simply fails with BSONObj size: ... (0x...) is invalid. Size must be between 0 and 16793600. By the time this error appears in production, the document is already too large to update. Monitor document sizes proactively using $bsonSize aggregation on large collections.
06
Shard Key Design
Distribution, query isolation, and avoiding hotspots
sharding

In a sharded MongoDB cluster, the shard key determines how documents are distributed across shards. A poorly designed shard key is very difficult to change after collection creation (requires resharding in MongoDB 5.0+ or collection recreation in earlier versions).

Shard Key Requirements

Requirement           | Why It Matters
----------------------|----------------------------------------------------------------------
High cardinality      | Too few distinct values = too few chunks = cannot distribute evenly
Low frequency         | If one value dominates (e.g., 80% of docs have status: "active"), one shard becomes a hotspot
Non-monotonic         | ObjectId/_id as shard key → all inserts go to the same "last" chunk → write hotspot
Query-based targeting | Queries including the shard key go to 1-2 shards; without it, broadcast to all (scatter-gather)
// ❌ BAD shard key choices
{ shardKey: { _id: 1 } }           // ObjectId is monotonic → write hotspot on last shard
{ shardKey: { createdAt: 1 } }     // timestamp is monotonic → write hotspot
{ shardKey: { status: 1 } }        // low cardinality (3-5 values) → uneven distribution
{ shardKey: { country: 1 } }       // 200 countries, but 60% of data is "US" → hotspot

// ✅ GOOD shard key choices

// Hashed shard key — uniform distribution even for monotonic fields
sh.shardCollection("mydb.users", { userId: "hashed" })
// Tradeoff: range queries scatter-gather across all shards

// Compound shard key — high cardinality, query-targeted
sh.shardCollection("mydb.orders", { customerId: 1, createdAt: 1 })
// customerId: high cardinality, queries targeted to customer's shard
// createdAt: spreads each customer's documents over time and supports targeted range queries

// Zone-based / geographic shard key
sh.shardCollection("mydb.sessions", { region: 1, userId: 1 })
// Route EU users to EU shards, US users to US shards
DANGER
In MongoDB < 4.2 a document's shard key value cannot be changed at all; in 4.2+ the value can be updated, but the shard key fields themselves are fixed once shardCollection() runs. If you need a different key, MongoDB 5.0+ supports online resharding — but it is expensive. Get the shard key right during design.
07
Anti-Patterns & Quick Reference
Common schema mistakes and design decision summary
reference

Top Schema Anti-Patterns

Anti-Pattern              | Problem                                                          | Fix
--------------------------|------------------------------------------------------------------|-----------------------------------------------------------------
Unbounded arrays          | Eventual 16MB document failure                                   | Separate collection, bucket pattern, or $slice-capped array
Massive documents         | Full document loaded into memory even for small field access     | Split into parent + child collection (subset/reference)
Normalised everything     | Every read requires $lookup — slow, defeats the document model   | Embed frequently-read-together data (extended reference)
No pre-computed fields    | Expensive $count/$sum on every read                              | Maintain counts and sums with $inc on write (computed pattern)
Dynamic field names       | { "2024-01": count, "2024-02": count } — field names cannot be indexed, schema explodes | Use array of objects: [{ month: "2024-01", count: 5 }]
Case-sensitive string IDs | "User123" and "user123" treated as different — collision risk    | Normalise to lowercase on write, or use a collation index
Monotonic shard key       | All writes to one shard = write hotspot                          | Hashed shard key, or compound key with non-monotonic first field
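The dynamic-field-name fix in the table can be mechanised. A sketch, where the document shape and helper name are assumptions:

```javascript
// Rewrite { monthly: { "2024-01": 5, ... } } into an indexable array of objects.
function monthsToArray(doc) {
  return Object.entries(doc.monthly).map(([month, count]) => ({ month, count }));
}

// Stored as e.g. months: [...], the result supports an index on "months.month",
// which is impossible when the month lives in the field NAME itself.
```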

Pattern Quick Reference

Pattern            | Use When                                                   | Key Mechanism
-------------------|------------------------------------------------------------|----------------------------------------------------------
Embedding          | Data always read/written together; 1:1 or 1:few            | Sub-document inside parent document
Referencing        | Data queried independently; 1:many or many:many            | ObjectId stored in child/junction doc
Hybrid             | Display summary + load detail on demand                    | Embed summary fields, reference full data collection
Extended Reference | Reference + frequently need 2-3 fields from referenced doc | Embed hot fields inline alongside the reference ID
Computed           | Aggregate on read is expensive; data changes infrequently  | Store result in document; $inc/$set on every change
Subset             | 1:many but only need N most recent at a glance             | Embed last-N with $push + $slice; full list in child coll
Bucket             | High-frequency time-series writes                          | Group N time-adjacent records per document

Schema Design Checklist

// Before finalising any schema, verify:
// ✓ No unbounded arrays — all arrays are bounded or in separate collections
// ✓ All hot read paths served by single document reads where possible
// ✓ Atomic update requirements satisfied by embedding (or transactions documented)
// ✓ Heavy aggregations (counts, totals) pre-computed in document fields
// ✓ Indexes defined for all query patterns (verify with explain())
// ✓ Shard key designed for high cardinality and non-monotonic distribution
// ✓ Document size estimated at P99 — well under 16MB with room to grow
// ✓ Schema can be read and written in the same operation where needed