
Schema Design

FILE  28_schema_design
TOPIC  Embed vs Reference · Hybrid · Patterns · 16MB Limit · Unbounded Arrays · Shard Keys
LEVEL  Intermediate/Advanced
01
Design Philosophy
Access-pattern-first, not normalisation-first
concept

MongoDB schema design starts from a fundamentally different premise than relational modelling. In SQL, you normalise first to eliminate redundancy, then add indexes to speed up joins. In MongoDB, you design around your access patterns first — the goal is to co-locate the data each query needs so that a single document read returns everything without joins.

Relational approach

  • Normalise to 3NF — eliminate all redundancy
  • Data is spread across many tables
  • Add JOIN queries to reassemble data at read time
  • Schema designed independently of query patterns

MongoDB approach

  • Design around how data is READ and WRITTEN
  • Co-locate related data — embed what's read together
  • Controlled redundancy is acceptable for speed
  • Schema evolves with access pattern changes

The Three Fundamental Questions

Before writing any schema, answer these three questions for each entity:

  • How is this data written? — insert rate, update frequency, write atomicity requirements
  • How is this data read? — which fields are fetched together, how often, by which query patterns
  • What is the data's lifecycle? — how large will it grow, does it get deleted, is it time-bounded?
TIP
The most important design insight: MongoDB's unit of atomicity is the document. Multi-document transactions exist but are slower. If two pieces of data must be updated together atomically (e.g., order total and line items), embed them in the same document — you get atomic updates for free without transaction overhead.
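A minimal sketch of that idea: because the line items and the total live in the same document, one `updateOne` changes both atomically. The `addLineItemUpdate` helper and field names are assumptions for illustration, not a driver API.

```javascript
// Build one atomic update: push a line item AND adjust the order total.
// Both fields live in the same document, so no transaction is needed.
function addLineItemUpdate(item) {
  return {
    $push: { lineItems: item },              // append to the embedded array
    $inc:  { total: item.qty * item.price }  // keep the pre-computed total in sync
  };
}

// Hypothetical usage from mongosh / a driver:
// db.orders.updateOne({ orderId: "ORD-001" }, addLineItemUpdate(newItem))
```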
02
Embed vs Reference
When to put data inside the document vs link to another collection
decision

The core schema question in MongoDB: should related data live inside the same document (embedding) or in a separate document referenced by ID (referencing)?

Embedding

// Embedded: order with line items in same document
{
  _id: ObjectId("..."),
  orderId: "ORD-001",
  customerId: "C001",
  status: "completed",
  lineItems: [
    { productId: "P1", name: "Widget", qty: 2, price: 19.99 },
    { productId: "P2", name: "Gadget", qty: 1, price: 49.99 }
  ],
  total: 89.97
}
// Single document read → complete order data
// Atomic updates to order + line items — no transaction needed

Referencing

// Referenced: order links to customer by ID
{
  _id: ObjectId("..."),
  orderId: "ORD-001",
  customerId: ObjectId("..."),  // ← reference to customers collection
  status: "completed",
  total: 89.97
}
// Fetching order + customer requires two queries (or $lookup in aggregation)
// But: customer data stays up-to-date in one place — no duplication

Decision Matrix

Factor              | Favour Embedding                   | Favour Referencing
--------------------|------------------------------------|------------------------------------
Read pattern        | Data always read together          | Data read independently
Update pattern      | Data updated together atomically   | Child data updated independently
Relationship type   | One-to-few (1:1, 1:few)            | One-to-many, many-to-many
Data size           | Child data is small and bounded    | Child data is large or unbounded
Duplication concern | Data is unique to this parent      | Data is shared across many parents
16MB limit          | Document stays well under 16MB     | Embedding risks breaching 16MB
Query independence  | Child never queried without parent | Child queried independently
NOTE
There is no universally correct answer. A blog post's tags (bounded, always read with post) should be embedded. A blog post's comments (unbounded, may number thousands) should be referenced — or handled with the subset pattern.
03
The Hybrid Approach
Embed the summary, reference the detail — the MongoDB sweet spot
hybrid

Pure embedding (all data in one document) and pure referencing (everything normalised) are both extremes. The hybrid approach — the recommended MongoDB design pattern — embeds the most frequently accessed summary data directly in the document while keeping the full detail in its own collection.

// ❌ Pure embedding — works until blog gets popular
{
  _id: ObjectId("..."),
  title: "MongoDB Schema Design",
  body: "...",
  comments: [                     // UNBOUNDED — could be 10,000 comments
    { user: "Alice", text: "Great post!", date: ISODate("...") },
    { user: "Bob",   text: "Thanks!",     date: ISODate("...") },
    // ... thousands more
  ]
}

// ❌ Pure referencing — always two queries to render post + comments
// Post doc has no preview of comments — poor UX for comment count display

// ✅ Hybrid: embed summary in post, reference full comments
{
  _id: ObjectId("..."),
  title: "MongoDB Schema Design",
  body: "...",
  commentCount: 247,              // pre-computed — no $count query needed
  recentComments: [               // newest comments — bounded preview (capped at 3)
    { user: "Carol", text: "Very clear!", date: ISODate("2024-03-01") },
    { user: "Dave",  text: "Bookmarked",  date: ISODate("2024-03-02") }
  ]
  // Full comments live in: db.comments.find({ postId: ObjectId("...") })
}
// Render post page: 1 document read → has title, body, count, preview
// Load all comments: second query only when user clicks "load comments"
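On a new comment, both sides of the hybrid must be written: insert the full comment into its collection and refresh the embedded preview. A sketch using the field names from the document above (the `newCommentUpdate` helper itself is an assumption):

```javascript
// Build the update that keeps the embedded preview in sync on a new comment.
function newCommentUpdate(comment, previewSize = 3) {
  return {
    $inc:  { commentCount: 1 },   // maintain the pre-computed count
    $push: {
      recentComments: {
        $each:  [comment],
        $sort:  { date: -1 },     // newest first
        $slice: previewSize       // cap the embedded preview
      }
    }
  };
}

// Hypothetical usage, writing both sides:
// db.posts.updateOne({ _id: postId }, newCommentUpdate(c))
// db.comments.insertOne({ postId, ...c })
```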
TIP
The hybrid pattern maps directly to real page loads: the main document read satisfies 95% of page render needs with one query. The second collection is only queried on demand (e.g., clicking to expand, paginating). This is how most high-traffic MongoDB applications operate in practice.
04
Design Patterns
Extended Reference · Computed · Subset · Bucket · Outlier
patterns

Extended Reference Pattern

When documents reference another collection, embed the most frequently accessed fields from the referenced document directly in the referencing document. This eliminates the need for a second lookup for those fields.

// Problem: every order display requires fetching customer name + email
// Pure reference: two queries every time
{ orderId: "ORD-001", customerId: ObjectId("...") }
db.customers.findOne({ _id: ObjectId("...") })

// ✅ Extended Reference: embed the hot fields directly in the order
{
  orderId:    "ORD-001",
  customerId: ObjectId("..."),      // full reference still kept for joins
  customer: {                       // embedded snapshot of hot fields
    name:  "Alice Johnson",
    email: "alice@example.com"
  },
  total: 89.97
}
// Render order: single document → has customer name + email, no second query
// Tradeoff: if customer changes name/email, existing orders show old values
// This is acceptable for orders (historical snapshot is often correct anyway)

Computed / Pre-calculated Pattern

Instead of counting or summing at read time, maintain the aggregate value in the document and update it on write. Read operations are O(1); aggregation cost is paid once per write instead of on every read.

// ❌ Expensive: counting likes on every post view
db.likes.countDocuments({ postId: ObjectId("...") })   // O(N) on every render

// ✅ Computed: maintain likeCount in post, increment on like action
{
  _id:       ObjectId("..."),
  title:     "MongoDB Schema Design",
  likeCount: 1247,   // pre-computed — updated on every like/unlike
  viewCount: 58300
}

// On like action: atomic increment — O(1) write
db.posts.updateOne(
  { _id: postId },
  { $inc: { likeCount: 1 } }
)
// Read likeCount: single field access — O(1), no aggregation

Subset Pattern

For one-to-many relationships where you need to display only the N most recent or relevant items, embed the latest N items in the parent document and keep the full list in a child collection.

// Product with reviews — could have thousands of reviews

// ✅ Subset: embed last 5 reviews in product doc
{
  _id:      ObjectId("..."),
  name:     "Wireless Headphones",
  avgRating: 4.3,
  reviewCount: 2847,
  recentReviews: [             // BOUNDED array — always exactly 5
    { user: "Alice", rating: 5, text: "Excellent sound quality" },
    { user: "Bob",   rating: 4, text: "Good value for money" },
    // ... 3 more
  ]
}
// Full reviews: db.reviews.find({ productId: ObjectId("...") })

// On new review: $push with $slice to maintain exactly 5 recent reviews
db.products.updateOne(
  { _id: productId },
  {
    $push: {
      recentReviews: {
        $each:  [newReview],
        $sort:  { date: -1 },  // most recent first
        $slice: 5              // keep only 5
      }
    },
    $inc: { reviewCount: 1 },
    $set: { avgRating: newAverage }
  }
)
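The `newAverage` value above can be derived incrementally from the stored `avgRating` and `reviewCount`, without re-reading every review. A sketch (the helper name is an assumption):

```javascript
// Running average: newAvg = (oldAvg * oldCount + newRating) / (oldCount + 1)
function nextAverage(avg, count, rating) {
  return (avg * count + rating) / (count + 1);
}

// e.g., read avgRating and reviewCount from the product doc,
// compute nextAverage(...), and pass the result as newAverage in the $set.
```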

Bucket Pattern

For time-series data with high write throughput (IoT sensors, logs, metrics), group time-adjacent readings into a single "bucket" document instead of one document per reading. Dramatically reduces document count and index size.

// ❌ Naive: one document per sensor reading — millions of tiny docs
{ sensorId: "S1", timestamp: ISODate("..."), temp: 22.4 }
{ sensorId: "S1", timestamp: ISODate("..."), temp: 22.5 }
// 1 reading per minute = 525,600 documents per sensor per year

// ✅ Bucket: group one hour of readings per document
{
  sensorId:  "S1",
  hour:      ISODate("2024-03-01T10:00:00Z"),  // bucket key
  count:     60,
  readings:  [
    { minute: 0, temp: 22.4 },
    { minute: 1, temp: 22.5 },
    // ... 58 more (one per minute)
  ],
  stats: {
    min: 22.1,
    max: 23.0,
    avg: 22.6                 // pre-computed per bucket
  }
}
// 1 reading per minute = 8,760 documents per sensor per year (60x fewer)

// Add readings to current bucket with $push
db.sensor_data.updateOne(
  { sensorId: "S1", hour: currentHour, count: { $lt: 60 } },
  { $push: { readings: newReading }, $inc: { count: 1 } },
  { upsert: true }
)
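The `currentHour` bucket key above can be derived by truncating the reading's timestamp to the start of its UTC hour. A sketch (the helper name is an assumption):

```javascript
// Truncate a timestamp to the start of its UTC hour — the bucket key.
function hourBucket(ts) {
  const d = new Date(ts);
  d.setUTCMinutes(0, 0, 0);   // zero out minutes, seconds, milliseconds
  return d;
}
```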
NOTE
MongoDB has a native Time Series Collection type (v5.0+) which internally applies the bucket pattern automatically — preferred over manual bucket implementation for new IoT/metrics projects. Use db.createCollection("sensor_data", { timeseries: { timeField: "timestamp", metaField: "sensorId" } }).
05
16MB Limit & Unbounded Arrays
Hard document limit and the most common schema anti-pattern
limits

MongoDB enforces a hard 16MB maximum document size. Exceeding this limit causes a write error. The most common cause of hitting this limit is unbounded arrays — arrays that grow without limit as the application runs.

DANGER
The most dangerous schema anti-pattern: embedding an array that grows indefinitely. A comments: [] array on a popular post, a followers: [] array on a viral account, or an events: [] array on a long-lived document will eventually hit 16MB and crash all writes to that document with no recovery other than splitting the document.
// ❌ UNBOUNDED ARRAY ANTI-PATTERNS

// A user with 500K followers → 500K ObjectIds ≈ 10MB with BSON overhead → nearing the limit
{ _id: ObjectId("..."), username: "popular_user", followers: [ id1, id2, ... ] }

// A chat room open for 1 year → thousands of messages
{ _id: ObjectId("..."), roomId: "room-1", messages: [ { text: "hi", ... }, ... ] }

// An order with unlimited audit log entries
{ _id: ObjectId("..."), orderId: "ORD-1", auditLog: [ { action: "created" }, ... ] }

// ✅ FIXES

// 1. Separate collection — follower relationships
db.follows.insertOne({ followerId: ObjectId("..."), followeeId: ObjectId("...") })
db.follows.createIndex({ followeeId: 1 })    // count followers
db.follows.createIndex({ followerId: 1 })    // list who I follow

// 2. Bucket pattern — chat messages grouped by day/hour
db.messages.insertOne({ roomId: "room-1", date: "2024-03-01", messages: [ ... ] })

// 3. Capped/rotating — keep only last N (subset pattern + $slice)
db.orders.updateOne({ _id: id }, {
  $push: { auditLog: { $each: [entry], $slice: -100 } }  // keep last 100 only
})

Working Near the Limit

// Check document sizes in a collection
db.collection.aggregate([
  { $project: { size: { $bsonSize: "$$ROOT" } } },
  { $sort: { size: -1 } },
  { $limit: 10 }
])
// Documents approaching 10MB+ are at risk — refactor immediately

// $bsonSize is available from MongoDB 4.4+
// 16MB = 16,777,216 bytes
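Back-of-envelope capacity math helps when reviewing a design. A rough sketch: the ~19-byte figure is an estimate (12-byte ObjectId plus the type byte and array-index key string each BSON array element carries), and the helper name is an assumption.

```javascript
const MAX_BSON = 16 * 1024 * 1024;   // 16,777,216 bytes — hard limit

// Roughly how many array elements fit before the document breaches 16MB?
function maxElements(baseBytes, perElementBytes) {
  return Math.floor((MAX_BSON - baseBytes) / perElementBytes);
}

// e.g., a followers array of ObjectIds at ~19 bytes per element:
// maxElements(1024, 19) — under a million followers and the document is full
```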
WARN
There is no warning before hitting 16MB — the write simply fails with BSONObj size: ... (0x...) is invalid. Size must be between 0 and 16793600. By the time this error appears in production, the document is already too large to update. Monitor document sizes proactively using $bsonSize aggregation on large collections.
06
Shard Key Design
Distribution, query isolation, and avoiding hotspots
sharding

In a sharded MongoDB cluster, the shard key determines how documents are distributed across shards. A poorly designed shard key is very difficult to change after collection creation (requires resharding in MongoDB 5.0+ or collection recreation in earlier versions).

Shard Key Requirements

Requirement           | Why It Matters
----------------------|----------------------------------------------------------------------
High cardinality      | Too few distinct values = too few chunks = cannot distribute evenly
Low frequency         | If one value dominates (e.g., 80% of docs have status: "active"), one shard becomes a hotspot
Non-monotonic         | ObjectId/_id as shard key → all inserts go to the same "last" chunk → write hotspot
Query-based targeting | Queries including the shard key go to 1-2 shards; without it, broadcast to all (scatter-gather)
// ❌ BAD shard key choices
{ shardKey: { _id: 1 } }           // ObjectId is monotonic → write hotspot on last shard
{ shardKey: { createdAt: 1 } }     // timestamp is monotonic → write hotspot
{ shardKey: { status: 1 } }        // low cardinality (3-5 values) → uneven distribution
{ shardKey: { country: 1 } }       // 200 countries, but 60% of data is "US" → hotspot

// ✅ GOOD shard key choices

// Hashed shard key — uniform distribution even for monotonic fields
sh.shardCollection("mydb.users", { userId: "hashed" })
// Tradeoff: range queries scatter-gather across all shards

// Compound shard key — high cardinality, query-targeted
sh.shardCollection("mydb.orders", { customerId: 1, createdAt: 1 })
// customerId: high cardinality, queries targeted to customer's shard
// createdAt: spreads each customer's documents over time and supports targeted range queries

// Zone-based / geographic shard key
sh.shardCollection("mydb.sessions", { region: 1, userId: 1 })
// Route EU users to EU shards, US users to US shards
DANGER
In MongoDB < 4.2 a document's shard key value cannot be changed at all; in 4.2+ the value can be updated, but the shard key fields themselves are fixed once shardCollection() runs. If you need a different key, MongoDB 5.0+ supports online resharding — but it is expensive. Get the shard key right during design.
07
Anti-Patterns & Quick Reference
Common schema mistakes and design decision summary
reference

Top Schema Anti-Patterns

Anti-Pattern              | Problem                                                          | Fix
--------------------------|------------------------------------------------------------------|-----------------------------------------------------------------
Unbounded arrays          | Eventual 16MB document failure                                   | Separate collection, bucket pattern, or $slice-capped array
Massive documents         | Full document loaded into memory even for small field access     | Split into parent + child collection (subset/reference)
Normalised everything     | Every read requires $lookup — slow, defeats the document model   | Embed frequently-read-together data (extended reference)
No pre-computed fields    | Expensive $count/$sum on every read                              | Maintain counts and sums with $inc on write (computed pattern)
Dynamic field names       | { "2024-01": count, "2024-02": count } — field names cannot be indexed, schema explodes | Use array of objects: [{ month: "2024-01", count: 5 }]
Case-sensitive string IDs | "User123" and "user123" treated as different — collision risk    | Normalise to lowercase on write, or use a collation index
Monotonic shard key       | All writes to one shard = write hotspot                          | Hashed shard key, or compound key with non-monotonic first field
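The dynamic-field-name fix in the table can be mechanised. A sketch, where the document shape and helper name are assumptions:

```javascript
// Rewrite { monthly: { "2024-01": 5, ... } } into an indexable array of objects.
function monthsToArray(doc) {
  return Object.entries(doc.monthly).map(([month, count]) => ({ month, count }));
}

// Stored as e.g. months: [...], the result supports an index on "months.month",
// which is impossible when the month lives in the field NAME itself.
```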

Pattern Quick Reference

Pattern            | Use When                                                   | Key Mechanism
-------------------|------------------------------------------------------------|----------------------------------------------------------
Embedding          | Data always read/written together; 1:1 or 1:few            | Sub-document inside parent document
Referencing        | Data queried independently; 1:many or many:many            | ObjectId stored in child/junction doc
Hybrid             | Display summary + load detail on demand                    | Embed summary fields, reference full data collection
Extended Reference | Reference + frequently need 2-3 fields from referenced doc | Embed hot fields inline alongside the reference ID
Computed           | Aggregate on read is expensive; data changes infrequently  | Store result in document; $inc/$set on every change
Subset             | 1:many but only need N most recent at a glance             | Embed last-N with $push + $slice; full list in child coll
Bucket             | High-frequency time-series writes                          | Group N time-adjacent records per document

Schema Design Checklist

// Before finalising any schema, verify:
// ✓ No unbounded arrays — all arrays are bounded or in separate collections
// ✓ All hot read paths served by single document reads where possible
// ✓ Atomic update requirements satisfied by embedding (or transactions documented)
// ✓ Heavy aggregations (counts, totals) pre-computed in document fields
// ✓ Indexes defined for all query patterns (verify with explain())
// ✓ Shard key designed for high cardinality and non-monotonic distribution
// ✓ Document size estimated at P99 — well under 16MB with room to grow
// ✓ Schema can be read and written in the same operation where needed