Design Dropbox / Google Drive (File Storage & Sync)
Step-by-step guide to the Dropbox / Google Drive system design interview question: requirements, capacity estimation, chunking and deduplication, delta sync, metadata schema, conflict handling, and blob storage at scale. Includes an interactive chunking demo.
Why interviewers ask this question
Dropbox looks like "upload files, download files" — a glorified FTP server. That framing fails the interview. The real problem is sync: a user edits a 2 GB video on their laptop, and thirty seconds later the updated file is on their desktop, their phone, and their teammate's machine — without re-uploading 2 GB. That gap between the naive design and the real one is exactly what interviewers are probing.
This question tests three things at once: large binary data handling (you cannot stuff files in a database), bandwidth-efficient protocols (chunking, hashing, delta sync), and multi-device consistency (what happens when two devices edit the same file offline?). Junior candidates are expected to separate metadata from blob storage; senior candidates are expected to design the chunking pipeline and defend a conflict-resolution strategy.
The 30-second answer
Step 1 — Requirements
Functional requirements
- Upload and download files from any device.
- Automatic sync: a change on one device propagates to all of the user's devices.
- Offline edits: users can modify files without connectivity; changes sync on reconnect.
- File versioning: restore a previous version of a file.
- (Clarify) Sharing files/folders with other users — call it out, then de-scope it.
- (Clarify) Max file size — assume up to a few GB; it justifies chunking.
Non-functional requirements
- Durability above all. Losing a user's only copy of their thesis is an extinction-level event for this product. Aim for eleven nines (99.999999999%) of durability.
- Eventual consistency across devices is acceptable — a few seconds of sync lag is fine; losing an edit is not.
- Bandwidth efficiency. Users edit large files constantly; re-uploading whole files would destroy mobile data plans and your egress bill.
- High availability for the sync path; reads and writes are roughly balanced (unlike TinyURL, this is not read-dominated).
Step 2 — Capacity estimation
Assume a mid-size Dropbox:
- Users: 50M registered, 10M daily active.
- Storage: 50M users × 10 GB average = 500 PB raw. This single number reshapes the design — you are not building a database, you are building on top of an object store.
- Deduplication saves real money: identical chunks (shared files, common installers, re-uploads, small edits to big files) mean typical dedupe ratios of 25–50%. At 500 PB, a 30% saving is 150 PB — tens of millions of dollars a year. Dedupe is not an optimization; it is the business model.
- Upload traffic: 10M DAU × 2 file edits/day = 20M uploads/day ≈ 230 uploads/sec on average (peak ~1,000/sec). With delta sync, each "upload" averages a few chunks, not a whole file.
- Metadata: 50M users × ~1,000 files × ~1 KB of metadata ≈ 50 TB. That is big, but a sharded relational database handles it. Metadata QPS (sync checks, listing folders) dwarfs file traffic: budget ~10× the upload rate.
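The arithmetic behind these bullets is worth being able to reproduce on demand. A minimal sketch, assuming decimal units (1 PB = 1,000,000 GB) and a 30% dedupe saving picked from inside the quoted range; the constant names are illustrative:

```python
# Back-of-envelope inputs taken from the estimate above.
USERS = 50_000_000
DAU = 10_000_000
AVG_STORAGE_GB = 10
EDITS_PER_DAY = 2
FILES_PER_USER = 1_000
METADATA_KB_PER_FILE = 1
DEDUPE_SAVING = 0.30  # assumption: one point inside the 25-50% range

raw_pb = USERS * AVG_STORAGE_GB / 1e6            # GB -> PB
saved_pb = raw_pb * DEDUPE_SAVING
uploads_per_sec = DAU * EDITS_PER_DAY / 86_400   # seconds per day
metadata_tb = USERS * FILES_PER_USER * METADATA_KB_PER_FILE / 1e9  # KB -> TB
metadata_qps = 10 * uploads_per_sec              # "budget ~10x the upload rate"

print(f"raw storage:     {raw_pb:.0f} PB")
print(f"dedupe saving:   {saved_pb:.0f} PB")
print(f"average uploads: {uploads_per_sec:.0f}/sec")
print(f"metadata size:   {metadata_tb:.0f} TB")
print(f"metadata QPS:    {metadata_qps:.0f}")
```

Changing a single input (say, average storage per user) and re-running is a quick way to see which assumptions the design is actually sensitive to.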
The estimate that matters most
Step 3 — API and metadata schema
The upload flow is a two-phase conversation: the client tells the server which chunk hashes it has, the server replies with which ones it is missing, and the client uploads only those. That negotiation is the API's whole personality.
POST /api/v1/files/{fileId}/commit
{
  "name": "thesis.docx",
  "parentFolderId": "f_8821",
  "size": 12582912,
  "chunks": [
    { "order": 0, "hash": "sha256:a3f1..." },
    { "order": 1, "hash": "sha256:9c2e..." },
    { "order": 2, "hash": "sha256:d410..." }
  ],
  "baseVersion": 6                       // version this edit was made against
}

200 OK
{
  "missingChunks": ["sha256:d410..."],   // server already has the rest
  "uploadUrls": {                        // pre-signed S3 URLs
    "sha256:d410...": "https://s3.../chunks/d410?X-Amz-Signature=..."
  }
}

// Client PUTs missing chunks directly to blob storage, then:
POST /api/v1/files/{fileId}/commit/complete
{ "baseVersion": 6 }

201 Created
{ "version": 7 }

Note two deliberate choices: chunks go directly to blob storage via pre-signed URLs (your app servers never proxy petabytes), and the commit carries a baseVersion so the server can detect concurrent edits (Step 5).
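The "which chunks are you missing?" negotiation is small enough to sketch. The following is an illustration only, assuming an in-memory dictionary stands in for blob storage; `ChunkStore` and `missing_chunks` are hypothetical names, not part of any real Dropbox or S3 API.

```python
import hashlib


class ChunkStore:
    """In-memory stand-in for blob storage, keyed by SHA-256 hex digest."""

    def __init__(self):
        self.blobs = {}

    def put(self, claimed_hash, data):
        # Re-hash on arrival: the client's claimed hash is never trusted.
        actual = hashlib.sha256(data).hexdigest()
        if actual != claimed_hash:
            raise ValueError("chunk hash mismatch")
        self.blobs[actual] = data


def missing_chunks(commit_hashes, store):
    """Return the hashes the store lacks, in first-seen order, without repeats."""
    seen = set()
    missing = []
    for h in commit_hashes:
        if h not in store.blobs and h not in seen:
            seen.add(h)
            missing.append(h)
    return missing


# Phase 1: the client announces three chunks; the store already holds two.
store = ChunkStore()
chunks = [b"page one", b"page two", b"page three"]
hashes = [hashlib.sha256(c).hexdigest() for c in chunks]
store.put(hashes[0], chunks[0])
store.put(hashes[1], chunks[1])
todo = missing_chunks(hashes, store)

# Phase 2: the client uploads only what was asked for.
for h in todo:
    store.put(h, chunks[hashes.index(h)])
remaining = missing_chunks(hashes, store)
print(len(todo), len(remaining))
```

In the real flow the upload in phase 2 goes to a pre-signed URL rather than through this object, so the hash re-verification would happen on the storage side or in a post-upload check.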
The metadata schema is four tables: files, file versions, the per-version chunk list, and chunks:
CREATE TABLE files (
    file_id          BIGINT PRIMARY KEY,
    owner_id         BIGINT NOT NULL,
    parent_folder_id BIGINT,
    name             VARCHAR(255) NOT NULL,
    current_version  INT NOT NULL,
    is_deleted       BOOLEAN DEFAULT false,
    updated_at       TIMESTAMP DEFAULT now()
);

CREATE TABLE file_versions (
    file_id    BIGINT NOT NULL,
    version    INT NOT NULL,
    size_bytes BIGINT NOT NULL,
    device_id  BIGINT,                 -- which device produced it
    created_at TIMESTAMP DEFAULT now(),
    PRIMARY KEY (file_id, version)
);

CREATE TABLE version_chunks (
    file_id     BIGINT NOT NULL,
    version     INT NOT NULL,
    chunk_order INT NOT NULL,
    chunk_hash  CHAR(64) NOT NULL,     -- SHA-256, FK into chunk store
    PRIMARY KEY (file_id, version, chunk_order)
);

CREATE TABLE chunks (
    chunk_hash CHAR(64) PRIMARY KEY,
    size_bytes INT NOT NULL,
    ref_count  BIGINT NOT NULL DEFAULT 1  -- for garbage collection
);

Versioning falls out for free
Step 4 — Chunking and deduplication (the core of the interview)
This is where the interview is won or lost. Chunking solves four problems with one mechanism:
- Resumable uploads — a dropped connection at 95% of a 2 GB file loses one chunk, not the whole transfer.
- Delta sync — edit page 3 of a document and only the chunks covering page 3 change; the rest hash identically and never leave the device.
- Deduplication — chunks are identified by their SHA-256 content hash. If any user anywhere has already uploaded a chunk with that hash, it is never stored (or transferred) again.
- Parallelism — chunks upload and download over concurrent connections, saturating the pipe.
Dropbox uses 4 MB chunks. The size is a trade-off: smaller chunks dedupe better and localize edits more tightly, but every chunk costs a metadata row and a request — a 2 GB file at 64 KB chunks is 32,768 rows. 4 MB keeps that same file at 512 chunks while still isolating most edits.
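To make delta sync concrete, here is a minimal sketch of fixed-size chunking with SHA-256 hashes. The function names are hypothetical, and the demo uses a 4 KiB chunk size on 16 KiB of data so it runs instantly; the 4 MB constant is the size quoted above.

```python
import hashlib
import io
import random

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MB, the chunk size quoted above


def chunk_hashes(stream, chunk_size=CHUNK_SIZE):
    """Read a binary stream in fixed-size blocks and return one SHA-256 per block."""
    hashes = []
    while True:
        block = stream.read(chunk_size)
        if not block:
            break
        hashes.append(hashlib.sha256(block).hexdigest())
    return hashes


def changed_chunks(old, new):
    """Indexes in `new` whose hash differs from `old`, or that did not exist before."""
    return [i for i, h in enumerate(new) if i >= len(old) or old[i] != h]


# Demo: flip one byte in place and see which chunks must be re-uploaded.
rng = random.Random(7)
original = bytes(rng.getrandbits(8) for _ in range(16 * 1024))
edited = bytearray(original)
edited[5000] ^= 0xFF  # byte 5000 falls inside the second 4 KiB chunk

demo_size = 4096
old = chunk_hashes(io.BytesIO(original), demo_size)
new = chunk_hashes(io.BytesIO(bytes(edited)), demo_size)
to_upload = changed_chunks(old, new)

chunks_in_2gb = (2 * 1024**3) // CHUNK_SIZE
print(f"{len(new)} chunks, re-upload indexes {to_upload}; a 2 GB file is {chunks_in_2gb} chunks")
```

Only the chunk containing the edit hashes differently, which is the whole argument for chunking in one line of output.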
The subtle question is where the chunk boundaries go:
| Approach | How it works | Pros | Cons |
|---|---|---|---|
| Fixed-size chunking | Cut the byte stream every 4 MB, regardless of content | Dead simple, fast, predictable chunk count | Insert one byte at the start and every boundary shifts — all downstream chunks re-hash as "new", killing dedupe |
| Content-defined chunking (CDC) | Slide a rolling hash (e.g., Rabin fingerprint) over the data; cut a boundary where the hash matches a pattern, averaging ~4 MB | Boundaries stick to content — an insertion changes only nearby chunks; dedupe survives shifts | More CPU per file, variable chunk sizes, harder to implement and explain |
A strong answer: start with fixed-size 4 MB chunks — it captures most of the win for append-heavy and in-place edits — then name the boundary-shift problem and offer content-defined chunking as the fix for insert-heavy workloads. Knowing why rsync and modern backup tools use rolling hashes is a senior signal; deriving Rabin fingerprints on the whiteboard is not required.
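The boundary-shift problem is easy to demonstrate. The sketch below swaps in a Gear-style rolling hash (the kind used by FastCDC-like chunkers) rather than a Rabin fingerprint, because it fits in a few lines. The table, mask, and roughly 1 KiB average chunk size are all assumptions chosen for the demo, and real chunkers also enforce minimum and maximum chunk sizes, which this omits.

```python
import hashlib
import random

_rng = random.Random(1)
GEAR = [_rng.getrandbits(64) for _ in range(256)]  # assumed per-byte random table
MASK64 = (1 << 64) - 1
BOUNDARY_MASK = 0x3FF << 40  # 10 bits must be zero: a cut about every 1 KiB on average


def cdc_chunks(data):
    """Cut wherever the rolling hash of the most recent bytes matches the pattern."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # Shifting left ages old bytes out, so h depends only on the last 64 bytes.
        h = ((h << 1) + GEAR[byte]) & MASK64
        if (h & BOUNDARY_MASK) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def fixed_chunks(data, size=1024):
    return [data[i:i + size] for i in range(0, len(data), size)]


def shared(old, new):
    """Number of distinct chunk hashes present in both versions."""
    def digests(chunks):
        return {hashlib.sha256(c).digest() for c in chunks}
    return len(digests(old) & digests(new))


data = bytes(_rng.getrandbits(8) for _ in range(64 * 1024))
shifted = b"\x00" + data  # insert a single byte at the very start

fixed_shared = shared(fixed_chunks(data), fixed_chunks(shifted))
cdc_shared = shared(cdc_chunks(data), cdc_chunks(shifted))
print(f"fixed-size chunks still deduped: {fixed_shared}")
print(f"content-defined chunks still deduped: {cdc_shared} of {len(cdc_chunks(data))}")
```

Because each cut depends only on the bytes just before it, every boundary past the insertion point moves by exactly one byte and the chunks between them stay byte-identical, while every fixed-size chunk now contains different bytes.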
One caveat worth volunteering: cross-user dedupe has a privacy wrinkle. If the server admits "I already have that chunk," an attacker can probe whether anyone has stored a known file. Per-user dedupe or convergent encryption are the standard mitigations — one sentence on this earns real credit.
Step 5 — Sync and consistency across devices
The sync loop has four players:
- Client watcher — a daemon on each device watches the sync folder (inotify/FSEvents), chunks and hashes changed files, and talks to the API from Step 3. It keeps a local database of known hashes so offline edits queue up cleanly.
- Metadata service — owns the metadata DB, which is the single source of truth. A file version does not exist until its commit lands here; blob storage is just a bag of chunks.
- Notification service — when a commit lands, it pushes an invalidation ("file X is now version 7") to the user's other devices. Use long polling or WebSockets: with millions of mostly-idle devices, having clients poll every few seconds would melt the metadata tier, and long polling gets near-instant sync at a fraction of the request volume. The notification is deliberately tiny — just "something changed" — and the device then fetches the new chunk list and downloads only the chunks it lacks.
- Blob storage — stores chunks keyed by hash; covered in Step 6.
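The notification path can be sketched as a long poll on a per-user change counter. This is a single-process illustration using a thread condition variable; the `ChangeFeed` class and its cursor semantics are assumptions, and a real service would hold an HTTP request open rather than block a thread.

```python
import threading


class ChangeFeed:
    """Per-user change counter that waiters can block on (a long-poll stand-in)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._cursor = 0

    def publish(self):
        """Called when a commit lands: bump the cursor and wake every waiter."""
        with self._cond:
            self._cursor += 1
            self._cond.notify_all()
            return self._cursor

    def wait_for_change(self, since, timeout):
        """Block until the cursor passes `since` or the timeout expires."""
        with self._cond:
            self._cond.wait_for(lambda: self._cursor > since, timeout=timeout)
            return self._cursor


feed = ChangeFeed()

# Nothing has changed: the poll times out and returns the same cursor.
idle = feed.wait_for_change(since=0, timeout=0.05)

# A commit lands shortly after another device starts waiting.
threading.Timer(0.05, feed.publish).start()
woke = feed.wait_for_change(since=0, timeout=5)
print(idle, woke)
```

The response carries only the cursor, matching the "something changed" design: the device then asks the metadata service what changed and fetches the chunks it lacks.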
Conflicts. Two devices edit budget.xlsx offline, then both reconnect. Each commit carries the baseVersion it was edited against — the server accepts the first commit (version 6 → 7) and rejects the second because its base (6) is no longer current. Now what?
| Strategy | Behavior | When it fits |
|---|---|---|
| Last-writer-wins | Second commit silently overwrites — the losing edit is gone | Ephemeral data (presence, cursors). Catastrophic for user files: it deletes someone's work without telling them |
| Conflicted copies | Keep both: the second commit becomes "budget (conflicted copy — Bob's MacBook).xlsx" beside the original | Opaque binary files where the system cannot merge — Dropbox's actual choice |
| Operational merge (OT/CRDT) | Merge concurrent edits at the operation level | Structured, format-aware data — Google Docs, not arbitrary files |
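The baseVersion check and the conflicted-copy fallback can be sketched in a few lines. The class and function names are hypothetical, and the exact file-name format is an assumption for illustration.

```python
import os


class StaleBaseVersion(Exception):
    """Raised when a commit was made against a version that is no longer current."""


class FileRecord:
    def __init__(self, name, current_version):
        self.name = name
        self.current_version = current_version


def commit(record, base_version):
    """Accept the commit only if it was edited against the current version."""
    if base_version != record.current_version:
        raise StaleBaseVersion(
            f"base {base_version} != current {record.current_version}")
    record.current_version += 1
    return record.current_version


def conflicted_copy_name(name, device):
    """Build a sibling file name so the losing edit is kept, not overwritten."""
    stem, ext = os.path.splitext(name)
    return f"{stem} (conflicted copy - {device}){ext}"


budget = FileRecord("budget.xlsx", current_version=6)

first = commit(budget, base_version=6)       # Alice's laptop reconnects first
try:
    commit(budget, base_version=6)           # Bob's edit was also made against v6
    saved_as = budget.name
except StaleBaseVersion:
    saved_as = conflicted_copy_name(budget.name, "Bob's MacBook")

print(first, saved_as)
```

The key property is that rejection is explicit: the second client learns its base is stale and chooses what to do, instead of the server silently picking a winner.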
Why Dropbox creates conflicted copies
Step 6 — Architecture at scale
Putting it together:
- Blob storage (S3 or equivalent) holds every chunk, keyed by content hash. It gives you eleven-nines durability, effectively infinite capacity, and pre-signed URLs so app servers stay out of the data path. Dropbox famously ran on S3 for years before building its own storage layer (Magic Pocket) once its scale justified it. That is a cost-at-scale story worth a sentence, not a design you should propose.
- Cold storage tiering. Most chunks are written once and never read again — old versions, forgotten files. Lifecycle rules move chunks untouched for 90+ days to infrequent-access or Glacier tiers, cutting storage cost 2–5× on the long tail. At 500 PB, tiering is worth more than most performance optimizations.
- Metadata DB: sharded relational (e.g., MySQL/Postgres sharded by owner_id so one user's sync touches one shard), with the version tables from Step 3. Transactions matter here: a commit must atomically bump current_version and insert the chunk rows.
- Garbage collection: when versions are pruned, decrement ref_count on their chunks; a background job deletes chunks that hit zero. Getting refcounting right (races between upload and GC) is a classic follow-up.
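The atomic commit and the refcount bookkeeping can be sketched against an in-memory SQLite database. This is a simplified illustration: the tables are trimmed versions of the Step 3 schema, the function names are hypothetical, and a single SQLite connection sidesteps the upload-versus-GC races a sharded production database has to handle.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE files (file_id INTEGER PRIMARY KEY, current_version INTEGER NOT NULL);
CREATE TABLE version_chunks (
    file_id INTEGER, version INTEGER, chunk_order INTEGER, chunk_hash TEXT,
    PRIMARY KEY (file_id, version, chunk_order));
CREATE TABLE chunks (chunk_hash TEXT PRIMARY KEY, ref_count INTEGER NOT NULL);
INSERT INTO files VALUES (1, 6);
""")


def commit_version(conn, file_id, base_version, chunk_hashes):
    """Bump current_version and record the chunk list in one transaction.

    Returns the new version, or None if base_version is stale.
    """
    with conn:  # commits on success, rolls back if an exception escapes
        cur = conn.execute(
            "UPDATE files SET current_version = current_version + 1 "
            "WHERE file_id = ? AND current_version = ?",
            (file_id, base_version))
        if cur.rowcount == 0:
            return None  # someone else committed first
        new_version = base_version + 1
        for order, h in enumerate(chunk_hashes):
            conn.execute("INSERT INTO version_chunks VALUES (?, ?, ?, ?)",
                         (file_id, new_version, order, h))
            bumped = conn.execute(
                "UPDATE chunks SET ref_count = ref_count + 1 WHERE chunk_hash = ?",
                (h,))
            if bumped.rowcount == 0:
                conn.execute("INSERT INTO chunks VALUES (?, 1)", (h,))
        return new_version


def prune_version(conn, file_id, version):
    """Drop a version's chunk list and return hashes whose ref_count reached zero."""
    with conn:
        rows = conn.execute(
            "SELECT chunk_hash FROM version_chunks WHERE file_id = ? AND version = ?",
            (file_id, version)).fetchall()
        for (h,) in rows:
            conn.execute(
                "UPDATE chunks SET ref_count = ref_count - 1 WHERE chunk_hash = ?",
                (h,))
        conn.execute("DELETE FROM version_chunks WHERE file_id = ? AND version = ?",
                     (file_id, version))
        zero = conn.execute(
            "SELECT chunk_hash FROM chunks WHERE ref_count = 0 ORDER BY chunk_hash")
        return [h for (h,) in zero.fetchall()]


v7 = commit_version(conn, 1, 6, ["a", "b"])
stale = commit_version(conn, 1, 6, ["a", "c"])   # same base again: rejected
v8 = commit_version(conn, 1, 7, ["a", "c"])
garbage = prune_version(conn, 1, 7)              # chunk "b" is now unreferenced
print(v7, stale, v8, garbage)
```

The conditional UPDATE doubles as the optimistic-concurrency check from Step 5, and the hashes returned by `prune_version` are what a background job would later delete from blob storage.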
One design decision deserves an explicit trade-off — who does the chunking?
Tradeoff: Client-side chunking (vs server-side)
Pros:
- Delta sync works: the client hashes locally and skips unchanged chunks, which is the only way to avoid uploading data the server already has
- Resumable uploads for free: retry individual chunks, not whole files
- Offloads CPU (hashing, compression) to millions of client devices instead of your fleet
- Chunks can go straight to blob storage via pre-signed URLs, keeping app servers stateless and thin
Cons:
- Requires a smart client: a heavyweight desktop daemon, not just a browser form (web uploads need a server-side fallback path)
- Clients lie: every hash and size must be re-verified server-side or the chunk store gets poisoned
- Chunking logic ships in the client, so protocol changes wait on slow client-upgrade cycles
- Cross-platform consistency (macOS/Windows/Linux/mobile) multiplies engineering cost
Common mistakes that cost offers
- Storing files in the database. BLOBs in Postgres at 500 PB is an instant red flag. Metadata in the DB, bytes in object storage — draw that line early.
- Uploading whole files on every edit. If your design re-transfers 2 GB because one paragraph changed, you have missed the point of the question. Chunking and delta sync are the question.
- No conflict story. "The last write wins" for user files means silently deleting someone's work. Interviewers push on offline edits precisely to see if you have thought about this.
- Proxying file bytes through app servers. Pre-signed URLs exist so your stateless tier never touches chunk data. Streaming petabytes through app servers is an expensive architectural mistake.
- Polling for changes. Ten million devices polling every five seconds is 2M QPS of "anything new?" requests; long polling or WebSockets cut that by orders of magnitude.
- Ignoring durability. Never saying the words "replication" or "durability" on a question about storing people's only copy of their data is disqualifying at senior level.
Frequently asked questions
Why chunk files instead of uploading whole files?
Chunking gives you four wins at once: resumable uploads (a dropped connection retries one chunk, not the whole file), delta sync (only the chunks that changed are re-uploaded after an edit), deduplication (chunks identified by content hash are stored once even if a million users upload them), and parallel transfers. For a sync product where users repeatedly edit large files, transferring whole files would waste enormous bandwidth and storage.
How does Dropbox handle two people editing the same file?
Each commit records which version it was edited against. The first commit wins and creates the new version; the second is rejected because its base version is stale, and the client saves it as a separate conflicted copy file next to the original. Dropbox cannot merge arbitrary binary formats, so keeping both versions and letting a human resolve the conflict is safer than silently overwriting one person's work.
Should I use SQL or NoSQL for Dropbox metadata?
Relational is the stronger default here. Committing a new file version must atomically update the current version pointer and insert the chunk list, which is a natural multi-row transaction, and the data is relational: files reference versions, versions reference chunks. Shard by owner ID so each user's sync traffic hits one shard. A NoSQL store works too, but you must then explain how you get atomic commits, which is harder to defend.
How big should file chunks be?
Around 4 MB, which is what Dropbox uses. Smaller chunks deduplicate better and localize edits more precisely, but each chunk costs a metadata row, a hash, and a request, so tiny chunks explode overhead — a 2 GB file at 64 KB chunks is over 32,000 chunks versus 512 at 4 MB. 4 MB balances dedupe granularity against metadata and request overhead for typical file sizes.