BTRFS misuses and misinformation

I'm writing this mostly so I can reference it when "people on the Internet" come up with the same misconceptions over and over again about BTRFS.

Misuse #1: using BTRFS on top of LVM/mdadm/hardware/JBOD RAID

For whatever reason, I see a lot of people wanting to use BTRFS on top of another block device abstraction layer. Now the whole point of using a file system like BTRFS is that, thanks to its copy-on-write design and reliable check-summing of the data, it can guarantee data integrity and even repair corrupted data if you let it manage redundancy.

First, LVM is the best way of losing data. It offers redundancy in RAID 1/10/5/6 and can recover from a failed drive, but not from corrupted blocks, which are much more frequent than a whole-drive failure. With LVM, in case of a bad block on a drive, you'll have two copies that don't match, without being able to know which one is the right one. Read this chapter on LVM scrubbing for example.

Because BTRFS has a checksum of the blocks, it is able to tell which copy is good, and will re-write the bad block in-place from the good copy. "In-place" sounds like a bad idea, and with LVM and file systems like ext4, you would give the file system a list of blocks to avoid (so-called "badblocks"). BTRFS uses the more modern approach of relying on the remapping done by modern drives. But remember, BTRFS can only do this if it knows about the multiple copies, so it can identify which one is the right one.
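
To make that concrete, here is a minimal sketch of how you would trigger and check that self-repair on a BTRFS file system that manages its own redundancy (the mount point /mnt/data is just an example):

    # read every block, verify it against its checksum, and rewrite any bad copy
    # from a good one; -B keeps the command in the foreground until it finishes
    btrfs scrub start -B /mnt/data
    btrfs scrub status /mnt/data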

But if you put a layer under BTRFS to abstract the topology away (thus presenting one virtual block device instead of each individual drive), BTRFS won't be able to repair data for you, so what is the point?
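
As an illustration, compare the two setups for a mirrored pair (device names are examples): give BTRFS the raw drives and it keeps two copies it can arbitrate between; give it a single md device and it only ever sees one copy.

    # BTRFS manages the redundancy itself: it can self-repair
    mkfs.btrfs -d raid1 -m raid1 /dev/sdb /dev/sdc

    # BTRFS on top of an mdadm mirror: it sees one block device, can detect
    # corruption via checksums, but has no second copy to repair from
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb /dev/sdc
    mkfs.btrfs /dev/md0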

Now it's absolutely fine to use BTRFS on top of a per-drive layer, like dm-crypt, if you want to encrypt your data.
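
A sketch of such a per-drive stack, assuming two drives and example names: each drive gets its own dm-crypt mapping, and BTRFS still sees two independent devices it can mirror and repair across.

    # encrypt each drive individually
    cryptsetup luksFormat /dev/sdb
    cryptsetup luksFormat /dev/sdc
    cryptsetup open /dev/sdb crypt_b
    cryptsetup open /dev/sdc crypt_c

    # BTRFS RAID1 on top of the two per-drive mappings
    mkfs.btrfs -d raid1 -m raid1 /dev/mapper/crypt_b /dev/mapper/crypt_c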

Misuse #2: using BTRFS to host VM disks

It's going to be soooo slooooow.

VM images have a file system inside them, and you're almost guaranteed to do a lot of random (as in not sequential) reads and writes to them.

Reading a random block is going to be costly because it won't be passed to the VM until its checksum is also read, from a different block, in the checksum tree.

Writing is way worse because the file system is copy-on-write, so BTRFS will not only write a new data block, but also update the checksum btree and the file btree data structures, and it may require a full extent to be read, modified, and written somewhere else, so that everything can be committed to disk in one consistent operation. All these write locations are not contiguous and will be especially time consuming to write to on a HDD. Read more about BTRFS data-structures in the documentation.
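
You can see this relocation effect for yourself on any BTRFS mount with a quick experiment (paths are illustrative): overwrite a small range "in place" and the touched data ends up in a new extent somewhere else.

    dd if=/dev/urandom of=/mnt/data/disk.img bs=1M count=64
    sync
    filefrag -v /mnt/data/disk.img   # note the physical offsets of the extents

    # overwrite 4 KiB in the middle of the file, "in place" from the caller's view
    dd if=/dev/urandom of=/mnt/data/disk.img bs=4K count=1 seek=1000 conv=notrunc
    sync
    filefrag -v /mnt/data/disk.img   # the modified range now lives in a new extent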

Misinformation: BTRFS RAID 5/6 is totally broken (and ZFS isn't)


Here we're talking about using BTRFS RAID 5/6 for data only; metadata should be kept on RAID1 or RAID1c3 respectively instead.
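
A sketch of what that looks like in practice, with example device names (RAID1c3 requires a reasonably recent kernel, 5.5 or later):

    # RAID5 for data, RAID1 for metadata (3 drives)
    mkfs.btrfs -d raid5 -m raid1 /dev/sdb /dev/sdc /dev/sdd

    # RAID6 for data, RAID1c3 for metadata (4 drives)
    mkfs.btrfs -d raid6 -m raid1c3 /dev/sdb /dev/sdc /dev/sdd /dev/sde

    # or convert the metadata of an existing file system
    btrfs balance start -mconvert=raid1c3 /mnt/data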


Even in otherwise very respected sources of documentation, you read stuff like:

The RAID 5 and RAID 6 modes of Btrfs are fatally flawed, and should not be used for "anything but testing with throw-away data".

Let's not even mention Reddit...

First, it's worth noting that the official documentation is a lot milder:

There are some implementation and design deficiencies that make it unreliable for some corner cases and the feature should not be used in production, only for evaluation or testing.

BTRFS was not developed by amateurs, so let's stop for a moment and think about that sentence.

What is the actual problem?

Again quoting the official documentation:

The write hole problem. An unclean shutdown could leave a partially written stripe in a state where the some stripe ranges and the parity are from the old writes and some are new. The information which is which is not tracked. Write journal is not implemented.

So if there is an "unclean shutdown" (power failure, drive physical disconnect, kernel crash) while you are writing some data, you could be unlucky enough to have a stripe mismatch.

It is worth remembering that we still have check-summing in place (as metadata should be kept on RAID1) and that such a problem would be detected during a scrub or a read, so we would know about it.
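
In practice, after such an unclean shutdown you would run a scrub (as shown earlier) and then look at where BTRFS reports the errors it found; a quick sketch, with /mnt/data as an example mount point:

    btrfs device stats /mnt/data      # per-device corruption/read/write error counters
    dmesg | grep -i 'btrfs.*csum'     # checksum failures are also logged by the kernel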

What do they mean by "production"?

Now let's imagine that at the time of such an "unclean shutdown", you were writing to another file-system or in another RAID mode that doesn't have this issue. What would the difference be? Would your write operation have finished? No! The only difference is that you would have consistent data on disk, like maybe an old version of the file, but the file you were writing would still not be on the disk.

To notice the difference, the file system would have to be part of a larger distributed system, in which some of the components do not suffer the same "unclean shutdown" and are able to track file versions or state (as they would have expectations about the data on disk that won't be fulfilled).

So what they mean by "production" is a very serious use case that you find in big infrastructures with a lot of complexity. In all other use cases, like building a NAS, what are you going to do in case of a power failure anyway? If you were in the middle of copying files, you'd probably use some common sense and do the copy again if you can.

What about ZFS?

Alexey Gubin, who writes file system recovery tools for a living, including one for ZFS, wrote a very detailed article about RAID modes on ZFS (so-called RAIDZ) in which he discusses the write hole problem on ZFS. I recommend reading the full article, but to summarise:

ZFS works around the write hole by embracing the complexity. So it is not like RAIDZn does not have a write hole problem per se because it does. However, once you add transactions, copy-on-write, and checksums on top of RAIDZ, the write hole goes away.

The overall tradeoff is a risk of a write hole silently damaging a limited area of the array (which may be more or less critical) versus the risk of losing the entire system to a catastrophic failure if something goes wrong with a ZFS pool. Of course, ZFS fans will say that you never lose a ZFS pool to a simple power failure, but empirical evidence to the contrary is abundant.

What about the data corruption bug?

The most serious criticism I've read (which is often quoted, for example at the top of this sticky thread on Reddit) is the one from Zygo Blaxell in 2020 on the BTRFS mailing list. The most concerning piece is this one:

  • btrfs raid5 does not provide as complete protection against on-disk data corruption as btrfs raid1 does.

When data corruption is present on disks (e.g. when a disk is temporarily disconnected and then reconnected), bugs in btrfs raid5 read and write code may fail to repair the corruption, resulting in permanent data loss.

Often, the quote stops there, but here is the next paragraph:

btrfs raid5 is quantitatively more robust against data corruption than ext4+mdadm (which cannot self-repair corruption at all), but not as reliable as btrfs raid1 (which can self-repair all single-disk corruptions detectable by csum check).

(A list of these bugs and discussions is available in a separate, well put-together email, also by Blaxell.)

But that was 2020; since then, the main culprits have been fixed during the 6.2 kernel cycle (December 2022), when the developers claimed to have fixed the source of "all the reliability problems" (in this context, of course).

Now the developers are doing refactoring and clean-up (see 6.3), and are starting to make backward-incompatible changes (raid-stripe-tree), paving the way not just to fixing bugs, but to correcting what John Duncan famously called "fatally flawed" back in 2016.
