A visual explainer · 11 sections

How a file system turns a billion bytes into a billion files.

A modern computer hands you a disk that is, fundamentally, a numbered list of bytes. It does not know what a file is, or a folder, or a name. The job of turning that list into something you can save, open, and trust falls to a piece of software that runs underneath every program you use.

This page builds a file system from scratch — first the universal pieces, then the four classics in detail (FAT32, ext4, ZFS, Btrfs), and finally a quick visit with NTFS and APFS. Every diagram is interactive; drag the sliders, flip the toggles, and watch the bytes move.

01 · THE FLOOR

The disk underneath

Before any file system can begin, something has to physically remember the bytes. Almost every storage device, from a 1980s floppy to a 2026 NVMe SSD, agrees on a surprisingly simple shape for that memory.

A storage device exposes itself to your computer as a long, ordered list of fixed-size cells called sectors. The drive doesn't know what's in any sector — only how to read or write the cell at index N. To the operating system, asking for sector 1 048 576 is no different from asking for sector 0; both are simple numeric lookups.

Demo · A disk is just numbered cells drag the slider

That's all the hardware promises: random-access cells of fixed size. A spinning hard drive achieves this with concentric tracks under a moving head; an SSD with billions of NAND transistors arranged in pages and erase-blocks; the abstraction your software sees is the same. Pick a number, get back exactly 512 — or, on modern drives, 4 096 — bytes.
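
That contract is small enough to state in code. A toy sketch (the class and method names are illustrative, not any real driver API):

```python
SECTOR_SIZE = 512  # legacy size; modern drives use 4096

class Disk:
    """A block device: nothing but numbered, fixed-size cells."""

    def __init__(self, sectors: int):
        self.cells = bytearray(sectors * SECTOR_SIZE)

    def read(self, n: int) -> bytes:
        # Sector 1_048_576 costs the same as sector 0: one multiply.
        return bytes(self.cells[n * SECTOR_SIZE:(n + 1) * SECTOR_SIZE])

    def write(self, n: int, payload: bytes) -> None:
        assert len(payload) == SECTOR_SIZE, "whole sectors only"
        self.cells[n * SECTOR_SIZE:(n + 1) * SECTOR_SIZE] = payload

disk = Disk(sectors=1024)
disk.write(7, b"\x89PNG".ljust(SECTOR_SIZE, b"\x00"))
```

The drive neither knows nor cares that those four bytes begin a PNG; that meaning lives entirely in the software above it.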

What you do not get is anything like a file. The drive will happily store the bytes P, N, G followed by an image of your dog's birthday party — but it has no idea those bytes form a file called maple-bday.png, or that they belong together. From the drive's point of view, your dog and your tax return are an undifferentiated, addressable smear of bytes across millions of identical cells.

Everything on this page is built out of that single primitive.

02 · UNITS

Sectors and blocks

A file system rarely talks to the disk one sector at a time. It bundles sectors into a slightly larger unit it calls a block, and from that point on it lives in block-numbered land.

A sector is the hardware's word — usually 512 bytes on legacy drives, 4 096 bytes on every drive made in the last decade. A block (sometimes called a cluster) is the file system's word, and it is one or more whole sectors glued together. A typical Linux box uses 4 KiB blocks; a flash drive formatted FAT32 might use 32 KiB; a database might be persuaded to use 16 KiB.

Why bigger blocks? Tracking is cheaper. If a file system has to remember whether each cell is free, it has fewer cells to remember when each cell is bigger. Why not use 1 MiB blocks then? Because a tiny file (say, a 60-byte shortcut) will still cost a whole block, wasting the rest. Block size is a balance between bookkeeping cost and the wasted tail at the end of every file — what we'll later call internal fragmentation.
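
Both sides of that balance are one-line computations. A sketch with made-up sizes:

```python
def tail_waste(file_size: int, block_size: int) -> int:
    """Internal fragmentation: unused bytes in the file's last block."""
    return -file_size % block_size

def bitmap_size(disk_size: int, block_size: int) -> int:
    """Free-space bookkeeping at one bit per block, in bytes."""
    return disk_size // block_size // 8

TIB = 2**40
# Bigger blocks shrink the bookkeeping but waste more per tiny file:
for bs in (4096, 32768, 2**20):
    print(f"{bs:>8}-byte blocks: "
          f"{bitmap_size(TIB, bs) // 1024:>6} KiB of bitmap, "
          f"{tail_waste(60, bs):>7} bytes wasted on a 60-byte file")
```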

Demo · Sectors fold into blocks try other sizes

From here on, when this page draws a small numbered square, that square is one block. The first job of any file system is to decide which blocks belong to which file.

03 · BOOKKEEPING

Tracking free space

If every block is identical to its neighbour, the file system needs a way to remember which ones are taken. The simplest answer is also the one the original designers reached for: a long line of yes/no bits.

This map is called a free-space bitmap. Reserve one bit per block: 0 means free, 1 means in use. A 1 TiB disk with 4 KiB blocks has 268 million blocks, so the bitmap is 32 MiB — about three thousandths of a percent of the disk, well worth the cost. To allocate a new block you scan the bitmap for a zero, flip it to one, and hand the index back.
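
In code, the whole scheme is a dozen lines. A minimal first-fit allocator (names are illustrative):

```python
class FreeBitmap:
    """One bit per block: 0 = free, 1 = in use."""

    def __init__(self, nblocks: int):
        self.nblocks = nblocks
        self.bits = bytearray((nblocks + 7) // 8)

    def in_use(self, n: int) -> bool:
        return bool(self.bits[n // 8] >> (n % 8) & 1)

    def allocate(self) -> int:
        # First-fit: scan for a zero bit, flip it to one, return the index.
        for n in range(self.nblocks):
            if not self.in_use(n):
                self.bits[n // 8] |= 1 << (n % 8)
                return n
        raise OSError("disk full")

    def free(self, n: int) -> None:
        self.bits[n // 8] &= ~(1 << (n % 8))
```

Real file systems keep smarter indexes on top of this (per-group bitmaps, free-extent trees) so allocation is not a linear scan from block zero.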

Demo · A free-space bitmap click cells to allocate

This much we have for free. The interesting decisions begin when the file system has to choose which free block to hand out. The naïve answer — "the first free one" — quickly produces a mess: blocks belonging to a single file end up scattered across the platter, and reading that file later means a great deal of seeking. The good answer is to allocate blocks near each other and near related files; we will see that intuition harden into formal rules in ext4 and Btrfs.

An aside on extents. Tracking each block individually is expensive once files get large. Modern file systems prefer extents: a contiguous run of blocks described by a single (start, length) pair. A 4 GiB movie sitting in one piece costs one extent record instead of a million block pointers. We'll return to this in the ext4 and Btrfs sections.
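
The saving is easy to see in miniature. A sketch of the two encodings (block numbers invented):

```python
from dataclasses import dataclass

@dataclass
class Extent:
    start: int    # first disk block of the run
    length: int   # how many contiguous blocks it covers

def expand(extents):
    """Recover the individual block numbers an extent list describes."""
    for e in extents:
        yield from range(e.start, e.start + e.length)

# A contiguous 4 GiB movie in 4 KiB blocks, both ways:
block_list = list(range(100_000, 100_000 + 1_048_576))  # a million pointers
as_extent = [Extent(start=100_000, length=1_048_576)]   # one record

assert list(expand(as_extent)) == block_list
```
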

04 · NAMING

Names, and the tree that holds them

A bitmap and a sea of blocks gets you raw storage, not files. To find anything again you need a name, and to keep your hundred thousand names organised, you need a tree.

A directory is, at heart, a small lookup table. Each row maps a human-readable name (resume.pdf) to a pointer that says where the rest of that file lives — a block number, a record number, an inode index, a B-tree key. The exact pointer differs between file systems, but the directory's job — name → location — does not.

A directory can contain other directories, and so a single root directory grows downwards into a tree. Every operating system you have ever used presents this tree in some form: /home/you/Documents/resume.pdf on Unix, C:\Users\You\Documents\resume.pdf on Windows. Walking that path is a sequence of name lookups, one per level.
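
Walking a path is nothing more than repeated table lookups. A toy resolver over nested dicts (the tree contents and the inode number are invented):

```python
# Each directory is a table: name -> child directory or file record.
root = {
    "home": {
        "you": {
            "Documents": {"resume.pdf": "inode 4821"},
        },
    },
}

def resolve(tree: dict, path: str):
    """One name lookup per level, exactly like a kernel path walk."""
    node = tree
    for name in path.strip("/").split("/"):
        node = node[name]   # KeyError = "No such file or directory"
    return node

assert resolve(root, "/home/you/Documents/resume.pdf") == "inode 4821"
```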

Demo · A directory tree click folders to expand

One subtlety: the directory entry does not usually store the file's data, or even most of its metadata. It stores a name and a pointer to a separate record that describes the file. This separation is what lets multiple names point at the same underlying file (a hard link), or at a different file every time the link is followed (a symbolic link). On many file systems, including ext4, every entry in /home/you/Documents/ is just name + inode number, and the inode is where the real story lives.

05 · DESCRIPTORS

Metadata, the file's identity card

A file is more than its bytes. Somebody has to remember how big it is, when it was made, who's allowed to read it, and exactly which blocks on disk belong to it.

Every file system keeps a small fixed-size record for each file — Unix calls it an inode, NTFS calls it an MFT entry, FAT calls it a directory entry. The names differ, but they all answer the same questions: how big is the file, when was it created and last changed, who may read or write it, and which blocks on disk hold its bytes.

Demo · An inode-shaped record it points at the bytes

With these five pieces — a sea of blocks, a free-space map, a way to chain blocks into files, a tree of names, and a record per file — you have the skeleton of every file system in this guide. The differences from here on are about how those pieces are encoded and combined, and how they cope when things go wrong.

06 · THE CLASSIC

FAT32: a chain of breadcrumbs

FAT, the File Allocation Table, dates to 1977 and predates the IBM PC itself. The 32-bit version, born in 1996, is still on the SD card in your camera, the EFI partition that boots your laptop, and probably every USB stick you have ever owned. Its design is gloriously, almost defiantly, simple.

FAT32's central idea is a single giant array, stored on the disk, with one entry per cluster. (FAT calls its blocks clusters; same idea.) Each entry is just a 32-bit integer, and it answers exactly one question: "After cluster N, which cluster comes next in the same file?"

That's the whole trick. To find the contents of a file, you look up its first cluster in the directory entry, then walk the FAT like a linked list — cluster 5 → 6 → 12 → 13 → 14 → EOF — reading the data of each cluster in turn.
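
That walk fits in a few lines. A sketch using the chain from the text (a single simplified end-of-chain value stands in for FAT32's real range of reserved markers):

```python
EOF = 0x0FFFFFFF   # simplified; real FAT32 treats >= 0x0FFFFFF8 as end-of-chain

# fat[n] answers exactly one question: after cluster n, which comes next?
fat = {5: 6, 6: 12, 12: 13, 13: 14, 14: EOF}

def chain(fat: dict, first: int) -> list:
    """Walk the FAT like a linked list from a file's first cluster."""
    clusters, n = [], first
    while n != EOF:
        clusters.append(n)
        n = fat[n]
    return clusters

assert chain(fat, 5) == [5, 6, 12, 13, 14]
```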

Demo · Writing a five-cluster file scrub through time

FAT32 is also the simplest possible answer to "where do I write directories?" A directory is just a file whose contents happen to be a list of fixed-size 32-byte entries — name, attributes, first cluster, size. The root directory is the same shape as any other; it just lives at a known starting cluster (usually 2). To list a folder you read its blocks; to add a file you write a new 32-byte row.

Demo · A FAT32 directory entry name + first cluster + size

Deleting is barely deleting

Deleting a file in FAT32 is shockingly cheap, and equally lossy. The directory entry's first byte is overwritten with 0xE5, a special "this slot is empty" marker — but the rest of the name, the size, and the first-cluster number are all still there. The FAT entries for the file's clusters are zeroed (marked free), but the data in those clusters is untouched until something else overwrites them. This is why undelete utilities work so well on FAT, and why secure-deletion takes more than a quick rm.
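
Both halves of the delete are visible in a sketch (the entry is simplified to a dict; real entries are packed 32-byte records):

```python
EOF, FREE, DELETED = 0x0FFFFFFF, 0x00000000, 0xE5

def fat_delete(entry: dict, fat: dict) -> None:
    """FAT32-style delete: tombstone the name, free the chain, touch no data."""
    raw = bytearray(entry["name"])
    raw[0] = DELETED              # only the first byte of the name is lost
    entry["name"] = bytes(raw)
    n = entry["first_cluster"]    # size and start cluster survive intact
    while n != EOF:
        nxt = fat[n]
        fat[n] = FREE             # cluster marked free; its data untouched
        n = nxt

fat = {5: 6, 6: EOF}
entry = {"name": b"MAPLE-BD.PNG", "first_cluster": 5, "size": 8192}
fat_delete(entry, fat)

assert entry["name"] == b"\xe5APLE-BD.PNG"
assert fat == {5: FREE, 6: FREE}
assert entry["first_cluster"] == 5   # why undelete tools work so well
```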

Demo · The same file, deleted scrub the timeline

What FAT32 cannot do

FAT32 is the simplest design in this guide, and the simplicity costs: no file can exceed 4 GiB, because the size field is a 32-bit integer; there are no owners, permissions, or access controls of any kind; there is no journal, so a badly timed crash can corrupt the table itself; and reaching the middle of a large file means walking its cluster chain one entry at a time.

And yet FAT32 endures. Its on-disk format is so simple that every microcontroller, every camera, every BIOS, every printer firmware can read and write it without a 200 KB driver. When you reach for a USB drive that "just works in everything", you are reaching for FAT32 (or its modern sibling exFAT, which raises the size limits but keeps the spirit).

07 · THE WORKHORSE

ext4: inodes, extents, journal

If FAT32 is the family pickup truck, ext4 is the long-haul tractor unit that has been quietly carrying Linux for fifteen years. It learned from ext2 and ext3 and the BSD Fast File System before them, and the result is a design that punches enormously above its conceptual weight.

An ext4 volume is divided into block groups — typically 128 MiB each. Every block group carries its own copy of the bookkeeping it needs: a block bitmap, an inode bitmap, a chunk of the inode table, and finally the data blocks themselves. The point of this division is locality. When you create a file inside a directory, the file system tries hard to place the new inode and its data in the same block group as the directory. Related things end up physically close, which means short head seeks on an HDD and good prefetch behaviour on an SSD.

Demo · ext4 block-group layout hover a group

The inode

Every file in ext4 is described by an inode — a fixed-size record (256 bytes by default) containing the file's metadata and the location of its data. The directory does not own the file's bytes; it owns a name and a 32-bit inode number. Look up the inode, and you find the size, the timestamps, the owner, the mode bits, the extended attributes — and somewhere in there, the addresses of the data.

From indirect blocks to extents

The original ext2 design listed data blocks one by one, with a few clever tricks to handle large files. Each inode held twelve direct pointers (12 × 4 KiB = 48 KiB straight away), then one indirect pointer to a block full of pointers, then a double-indirect, then a triple-indirect. Reaching the far end of a 1 GiB file meant following the inode's double-indirect pointer through two further blocks of pointers — and somewhere in the bookkeeping sat a quarter-million 32-bit pointer values. It worked. It was also fantastically wasteful.
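
The lookup logic is mechanical enough to write down. A sketch with 4 KiB blocks and 4-byte pointers, so 1 024 pointers per indirect block:

```python
PTRS = 1024      # pointers per 4 KiB indirect block
DIRECT = 12      # direct pointers held in the inode itself

def ext2_route(file_block: int) -> tuple:
    """Which pointer(s) must be followed to reach logical block N?"""
    if file_block < DIRECT:
        return ("direct", file_block)
    file_block -= DIRECT
    if file_block < PTRS:
        return ("single", file_block)
    file_block -= PTRS
    if file_block < PTRS * PTRS:
        return ("double", file_block // PTRS, file_block % PTRS)
    file_block -= PTRS * PTRS
    return ("triple", file_block // PTRS**2,
            file_block // PTRS % PTRS, file_block % PTRS)

# The last block of a 1 GiB file (262 144 blocks) sits two hops deep:
assert ext2_route(262_143)[0] == "double"
```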

ext4 replaced this with extents. An extent is a (logical-offset, length-in-blocks, start-block) tuple: it says "the next 32 768 blocks of this file live contiguously starting at block 5 002 113." One extent can cover up to 128 MiB, so even a contiguous 1 GiB file costs eight small records instead of a quarter-million pointers. The inode itself can hold four extents inline; if the file has more, those extents live in a small B-tree rooted in the inode.
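
Translating a file offset through an extent list is one comparison per record. A sketch (the physical block number is the one from the text):

```python
from dataclasses import dataclass

@dataclass
class Ext4Extent:
    logical: int    # first file block this extent covers
    length: int     # run length in blocks (at most 32 768 in ext4)
    physical: int   # first disk block of the run

def to_disk(extents, file_block: int) -> int:
    """Translate a logical file block to a physical disk block."""
    for e in extents:
        if e.logical <= file_block < e.logical + e.length:
            return e.physical + (file_block - e.logical)
    raise KeyError("hole in file, or past EOF")

ext = [Ext4Extent(logical=0, length=32_768, physical=5_002_113)]
assert to_disk(ext, 0) == 5_002_113
assert to_disk(ext, 32_767) == 5_002_113 + 32_767
```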

Demo · Indirect pointers vs extents flip the toggle

The journal

A file system without a journal is one that, when the lights go out mid-write, can corrupt itself. Imagine ext4 is in the middle of "delete file X": it has marked X's blocks free in the bitmap, but it has not yet removed the inode from its directory. Power dies. Now a directory entry points at an inode whose blocks the file system thinks are free — and the next write will scribble over X's old data through the still-living directory entry.

ext4's journal (inherited and refined from ext3) makes the file system crash-consistent. Before any change touches the real metadata, ext4 writes a description of the change to a special journal area. If the system crashes, the journal is replayed on the next mount: every change in it either fully takes effect or is rolled back. The disk is never left mid-thought.
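
The replay rule — committed changes re-applied, uncommitted ones dropped — can be sketched as a toy write-ahead log (the structure is invented for illustration; the real JBD2 journal works on whole blocks):

```python
class ToyJournal:
    """Write-ahead metadata journaling, reduced to its skeleton."""

    def __init__(self):
        self.metadata = {}   # the "real" on-disk structures
        self.log = []        # the journal area: durable, sequential

    def change(self, key, value, crash_before_apply=False):
        self.log.append(("data", key, value))   # 1. describe the change
        self.log.append(("commit", key))        # 2. commit record
        if crash_before_apply:
            return                              # power dies here
        self.metadata[key] = value              # 3. apply in place
        self.log = []                           # 4. checkpoint

    def mount(self):
        """Replay: every logged change takes effect fully or not at all."""
        committed = {rest[0] for op, *rest in self.log if op == "commit"}
        for op, *rest in self.log:
            if op == "data" and rest[0] in committed:
                self.metadata[rest[0]] = rest[1]
        self.log = []

fs = ToyJournal()
fs.change("dir entry X", "removed", crash_before_apply=True)
fs.mount()
assert fs.metadata["dir entry X"] == "removed"   # replayed, not lost
```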

ext4 lets you choose what to journal: data=journal pushes both file data and metadata through the journal (safest, slowest); data=ordered, the default, journals only metadata but writes data blocks before committing the metadata that points at them; data=writeback journals metadata with no ordering promise for the data (fastest, weakest).

Demo · Crash, with and without a journal flip the journal off, then crash

A few other ext4 niceties worth a paragraph: HTree directories hash filenames into a small B-tree so a folder with a million files is still fast to look up; delayed allocation waits before picking blocks for a write so it can pick a long contiguous extent in one shot; persistent preallocation lets a program ask for "give me 4 GiB now, I'll fill it later," which avoids fragmenting growing files. None of this is glamorous; all of it is why ext4 is the default on most Linux distributions.

08 · THE FORTRESS

ZFS: copy-on-write, end-to-end

ZFS, born inside Sun Microsystems in the early 2000s, is the most ambitious file system in this guide. Its designers asked: what if the file system, the volume manager, and the RAID controller were one thing? What if every block carried a checksum? What if every write produced a new tree, leaving the old one untouched?

The starting point is the storage pool. A traditional Unix system has a partition for /, another for /home, another for /var; each is fixed-size, each is its own headache. ZFS takes a heap of physical disks, organises them into vdevs (a mirror, a RAID-Z group, a single drive), and pools all the vdevs into one big zpool. Datasets — what you see as filesystems — carve out chunks of that pool dynamically. There are no fixed partitions; every dataset can grow to fill any free space.

Copy-on-write, all the way down

The defining act of ZFS is that it never overwrites a live block. When you change a single byte in a file, ZFS does not seek to the existing block and rewrite it. Instead it:

  1. writes the new data to a brand-new free block;
  2. writes a new copy of the block pointer that used to point at the old data, now pointing at the new block — and stores it, too, in a new free block;
  3. repeats this all the way up the tree, until it reaches the root.

The very last write, atomically, swaps in the new root. Everything below it is consistent or it is invisible — the disk never holds a partial tree. This is copy-on-write (COW), and it is what makes ZFS crash-safe without a separate journal: the only place a crash can happen is between two atomic root updates, and the previous root is still there, intact.
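
The whole discipline fits in one recursive function. A sketch over nested dicts, where copying a node stands in for writing a new block (tree contents invented):

```python
def cow_set(node: dict, path: list, value) -> dict:
    """Return a NEW root: every node on the path is copied;
    everything else is shared, untouched, with the old tree."""
    fresh = dict(node)                  # "write to a brand-new free block"
    if len(path) == 1:
        fresh[path[0]] = value
    else:
        fresh[path[0]] = cow_set(node[path[0]], path[1:], value)
    return fresh                        # the caller re-points at the copy

old_root = {"home": {"you": {"notes.txt": "v1"}, "them": {"todo": "milk"}}}
new_root = cow_set(old_root, ["home", "you", "notes.txt"], "v2")

assert new_root["home"]["you"]["notes.txt"] == "v2"
assert old_root["home"]["you"]["notes.txt"] == "v1"          # old tree intact
assert new_root["home"]["them"] is old_root["home"]["them"]  # unchanged: shared
```

Swapping `old_root` for `new_root` is the single atomic step; until it happens, readers still see a complete, consistent old tree. Holding on to `old_root` afterwards is exactly what a snapshot is.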

Demo · A copy-on-write update scrub the write

Snapshots: free, instant, immutable

Once everything is copy-on-write, snapshots are almost free. A snapshot is just the previous root pointer, preserved. Normally, when ZFS finishes a transaction, it adds the old root's blocks to the free list. To take a snapshot, ZFS holds onto that root pointer and refuses to free its blocks. The snapshot and the live filesystem now share every unchanged block; they only diverge as new writes create new branches of the tree.

Taking a snapshot of a 10 TB pool takes microseconds and zero new bytes. Holding it costs nothing until something changes; even then, only the changed blocks are duplicated. You can keep a thousand of them.

Demo · A snapshot is a held-onto root write, then snap, then write

Checksums and the bit-rot defense

Every block pointer in ZFS contains, alongside the disk address of the child, a checksum of the child's contents. Reading a block, ZFS verifies the checksum stored in its parent. If the checksum does not match, ZFS knows — without any doubt — that the block has been corrupted, whether by a flaky cable, a cosmic ray, or a slowly failing drive head. On a mirror or RAID-Z pool it then fetches a good copy from another vdev and writes it back, healing the volume in place.

Because parents checksum children, and the parents themselves are in blocks whose checksums sit in their parents, the entire pool is a Merkle tree. The root block (called the uberblock) holds a checksum that verifies, transitively, the entire pool. Bit rot anywhere is detectable, and on redundant vdevs, automatically reparable.
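
The Merkle property is easy to demonstrate with a real hash (a two-leaf toy; ZFS's actual block layout and checksum algorithms differ):

```python
import hashlib

def h(b: bytes) -> bytes:
    return hashlib.sha256(b).digest()

leaf_a, leaf_b = b"your dog's birthday", b"your tax return"

# A parent block stores each child's checksum next to its address:
parent = h(leaf_a) + h(leaf_b)
uberblock = h(parent)          # the root checksum transitively covers it all

def verify(a: bytes, b: bytes) -> bool:
    return h(h(a) + h(b)) == uberblock

assert verify(leaf_a, leaf_b)

# Flip a single bit anywhere below the root and verification fails:
corrupt = bytes([leaf_a[0] ^ 1]) + leaf_a[1:]
assert not verify(corrupt, leaf_b)
```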

Demo · A flipped bit, caught by Merkle flip a bit, watch checksums fail

RAID-Z and the write hole

Classic RAID-5 has a famous flaw called the write hole: a small write must update both a data stripe and the parity stripe, and if the system crashes between those two writes, the parity is wrong and the array can return bad data without knowing it. ZFS sidesteps this with RAID-Z, which writes variable-width stripes: every write is a complete stripe, parity included, allocated as one atomic copy-on-write event. There is no half-updated parity to worry about, because there is no in-place parity update.
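
XOR parity makes both the flaw and the fix concrete. A sketch with two data strips plus parity (RAID-Z's real stripes are variable-width, which is the point):

```python
def xor(*strips: bytes) -> bytes:
    """Parity is byte-wise XOR; XOR of any N-1 strips rebuilds the Nth."""
    out = bytearray(len(strips[0]))
    for s in strips:
        for i, b in enumerate(s):
            out[i] ^= b
    return bytes(out)

d1, d2 = b"\x0f\xf0", b"\x33\xcc"
p = xor(d1, d2)                 # a full-stripe write: data + parity together

assert xor(d2, p) == d1         # lose d1's disk, rebuild it from the rest

# The write hole: update d1 in place, crash before updating p...
d1_new = b"\xff\xff"
assert xor(d2, p) != d1_new     # ...and reconstruction now returns garbage
```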

The price of all this protection is real. ZFS likes a lot of RAM (its ARC cache replaces the kernel's page cache); deduplication tables can be enormous; the on-disk format makes shrinking a pool hard. But on a fileserver where you care about your data, ZFS is hard to beat.

09 · THE B-TREE

Btrfs: the file system as one shape

Btrfs (the "B-tree filesystem", pronounced however you like) is Linux's answer to ZFS. It steals the same big ideas — copy-on-write, snapshots, checksums, integrated volume management — and arranges them around a single, repeated structural primitive: the B-tree.

In ext4, every kind of metadata has its own data structure. The block bitmap is a bitmap; the inode table is an array; directory entries live in linear-then-HTree blocks; extents form their own little B-tree per file. In Btrfs, almost everything is a B-tree. The same code that searches the file tree searches the extent allocation tree searches the directory tree searches the checksum tree.

All of these B-trees are themselves copy-on-write: any modification rewrites the path from leaf to root, just like ZFS. The difference is the building block. Where ZFS's tree is hand-rolled per metadata type, Btrfs's is uniform — one B-tree implementation, parameterised by what it stores in its leaves.

Demo · A copy-on-write B-tree insert step through

Subvolumes and snapshots

A subvolume in Btrfs is a separately-rooted fs tree inside the same pool. You can make a hundred of them; each gets its own root in the master tree of trees. A snapshot is a subvolume that starts life pointing at the same root as another subvolume — and, because COW, immediately diverges as either side writes. Btrfs snapshots are writable by default (ZFS snapshots are read-only; you'd take a clone for that). The implementation is the same trick: keep the old root, and let the new root drift away one COW-write at a time.

Reflinks: a hard link for the data

Because every extent has a reference count, Btrfs offers something a hard link cannot: a reflink copy. cp --reflink creates a new file that initially shares all of its extents with the source. No data is copied. As either file is modified, only the changed extents diverge — exactly the snapshot trick, applied at the file level. This is why a Btrfs cp of a 4 GiB virtual machine image can return in milliseconds.
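
Reference-counted extents are the whole mechanism. A sketch (extent names invented; real Btrfs tracks the counts in its extent tree):

```python
refs = {}   # extent id -> how many files point at it

def new_file(extents: list) -> list:
    for e in extents:
        refs[e] = refs.get(e, 0) + 1
    return list(extents)

def reflink(src: list) -> list:
    """cp --reflink: share every extent; no data moves at all."""
    return new_file(src)

def overwrite(file: list, i: int, fresh: str) -> None:
    """COW on modify: only the touched extent diverges."""
    refs[file[i]] -= 1
    file[i] = fresh
    refs[fresh] = refs.get(fresh, 0) + 1

vm = new_file(["ext-a", "ext-b", "ext-c"])   # a 4 GiB image, say
copy = reflink(vm)                           # returns "instantly"
assert refs["ext-b"] == 2                    # both files share every extent

overwrite(copy, 1, "ext-b2")
assert vm[1] == "ext-b" and copy[1] == "ext-b2"
assert refs["ext-b"] == 1                    # only the changed extent diverged
```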

Online everything

Because Btrfs's logical-to-physical layer is itself a tree, you can rearrange the pool while it is mounted: add a disk, remove one, change RAID level, defragment, balance. ext4 needs to be unmounted (or at least carefully wrangled) for these operations; Btrfs handles them online, transparently, at the cost of a great deal of internal complexity.

Btrfs has had a bumpier ride than ZFS. Its RAID-5/6 implementation was unsafe for years and is still considered fragile; its performance under heavy small-file workloads has been criticised. But on a single disk or a mirror, with snapshots and reflinks turned on, it is by far the easiest way to get ZFS-like ergonomics on a stock Linux kernel — and openSUSE, Fedora's desktop variants, and Synology's NAS line all default to it.

10 · CODA

NTFS: every file is a record in one big table

NTFS, the New Technology File System, has been the default on Windows since Windows 2000. Its central idea is funny when you first hear it: everything is a file, and every file is a row in one big table called the MFT.

The Master File Table is a flat array of 1 KiB records. Record 0 describes the MFT itself (yes, the table that lists files lists itself); record 1 is its mirror; record 5 is the root directory. Your resume.docx sits in some later record, and so does every other file on the volume. Each record holds the file's attributes — name, timestamps, security descriptor, and, crucially, the file's data or pointers to the file's data.

For very small files, NTFS does something clever: it just puts the data inside the MFT record. A 200-byte text file has no separate data blocks at all; the bytes live in the same KiB as the metadata. This is called a resident attribute, and it means that for a directory full of tiny files, NTFS reads the MFT and is done.

Demo · The MFT, with a resident small file drag the size slider

Once a file outgrows its record, the data attribute is moved out to ordinary disk runs (NTFS's word for extents) and the record stores their (start, length) tuples. The file system has the same job as ext4 here; only the bookkeeping shape differs.

NTFS also has features that have proven useful, then occasionally infamous: alternate data streams let a single filename hold multiple parallel streams of bytes (used by Windows for "downloaded from the internet" tagging, and by malware authors for hiding); $LogFile is a metadata journal not unlike ext4's; change journals, shadow copies, and file-level compression are all baked in. The on-disk format is, by file-system standards, surprisingly poorly documented; most non-Microsoft implementations are reverse-engineered.

11 · CODA

APFS: containers, clones, and crypto

Apple's APFS replaced HFS+ in 2017, simultaneously, on every Mac, iPhone, and Apple Watch on Earth — one of the largest live-file-system migrations in history. Its design borrows enthusiastically from ZFS and Btrfs, and adapts the result to the specific needs of flash storage and consumer devices.

An APFS disk holds a single container, which in turn holds one or more volumes. All volumes in a container share the same pool of free space — which is why on a Mac you don't decide how much of your 1 TB SSD goes to the system volume vs. the data volume. They both grow into the same pool. ZFS pioneered this; APFS made it the default for everyone.

APFS is copy-on-write, just like ZFS and Btrfs: never overwrite a live block, swap a new root in atomically. Snapshots fall out of this for free, and Time Machine on modern macOS uses APFS snapshots under the hood instead of the older hard-link forest.

The most distinctive feature is cloning. cp on an APFS volume does not copy data. The new file shares every extent with the original; only when one of the two is modified does the relevant extent diverge. This is exactly Btrfs's reflink, deployed by default at the OS level; copying an 8 GiB video file on macOS therefore takes a few milliseconds and consumes effectively zero new bytes until you edit it.

Demo · A clone shares extents edit, then diverge

APFS also makes full-disk encryption first-class: keys are tracked per file and per extent, not just per volume, and the same disk can hold encrypted and unencrypted volumes side by side. The on-disk structure is a forest of B-trees with copy-on-write semantics, not unlike Btrfs — though the resemblance is more architectural than literal, and the implementation is closed-source.

What APFS does not have, and probably never will, is ZFS-style end-to-end checksumming of file data. Apple checksums metadata but trusts the underlying flash to flag bad reads. On a phone or a laptop, where the storage is one chip and you have backups elsewhere, this is a defensible trade. On a server with thirty disks, it would not be — which is why the file systems on this page exist for different worlds, even when they share most of the same ideas.

Where this leaves us

Run your eye back over the eleven sections. The base layer hasn't changed in fifty years: a list of fixed-size cells, a way to mark some of them taken, a tree of names, a record per file. What every file system from FAT32 to APFS does is add answers to a small set of follow-up questions: how do we keep the bookkeeping cheap, how do we recover from a crash, how do we scale to billions of blocks, how do we know a block is still the block we wrote?

FAT32's answer is "we don't, mostly, and that's the point." ext4's is "track everything, journal the metadata, group things by locality." ZFS's is "trust nothing, copy on write, build a Merkle tree of the world." Btrfs's is "do all of that with one B-tree shape." NTFS's is "put the file inside the file system's table when you can." APFS's is "do most of the above, plus make every cp a clone." The space of sensible designs is small, the trade-offs are timeless, and once you've seen the demos, you can mostly guess where any new file system is going to land.