BTRFS misuses and misinformation
July 18, 2024
I'm writing this mostly so I can reference it when "people on the Internet" come up with the same misconceptions about BTRFS over and over again.
Misuse #1: using BTRFS on top of LVM/mdadm/hardware/JBOD RAID
For whatever reason, I see a lot of people wanting to use BTRFS on top of another block device abstraction layer. Now the whole point of using a file system like BTRFS is that, thanks to its copy-on-write design, it can guarantee data integrity through reliable checksumming of the data, and even repair corrupted data if you let it manage redundancy.
First, LVM is the best way of losing data. It offers redundancy in RAID 1/10/5/6 and can recover from a failed drive, but not from failed blocks, which are much more frequent than whole-drive failures. With LVM, when a block fails on a drive, you end up with two copies that don't match and no way to know which one is the right one. Read this chapter on LVM scrubbing for example.
Because BTRFS keeps a checksum of each block, it is able to tell which copy is good, and will rewrite the bad block in place from the good copy. "In place" sounds like a bad idea; with LVM and file systems like ext4, you would give the file system a list of blocks to avoid (so-called "badblocks"). BTRFS uses the more modern approach of relying on the remapping done by modern drives. But remember, BTRFS can only do this if it knows about the multiple copies, so it can identify which one is the right one.
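To make the difference concrete, here is a minimal Python sketch (not BTRFS code; CRC32 is only a stand-in for the real checksum algorithms, and the two-copy mirror is simplified) of why a stored checksum turns an ambiguous mirror mismatch into a repairable one:

```python
import zlib

def repair_mirrored_block(copy_a: bytes, copy_b: bytes, stored_csum: int) -> bytes:
    """Toy model of checksum-guided repair on a two-copy mirror.

    Without a stored checksum (the LVM/mdadm situation), a mismatch between
    the two copies is ambiguous. With one (the BTRFS situation), we can tell
    which copy is good and use it to rewrite the other.
    """
    good = [c for c in (copy_a, copy_b) if zlib.crc32(c) == stored_csum]
    if not good:
        raise IOError("both copies are corrupted, nothing to repair from")
    return good[0]  # the caller would rewrite the bad copy with this data

# One copy got silently corrupted on disk.
original = b"important data"
corrupted = b"importent data"
csum = zlib.crc32(original)

print(repair_mirrored_block(corrupted, original, csum))  # b'important data'
```

Take away the `stored_csum` argument and the function has no way to prefer one copy over the other, which is exactly the position a plain LVM/mdadm mirror is in.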
But if you put a layer under BTRFS to abstract the topology away (thus presenting one virtual block device instead of each individual drive), BTRFS won't be able to repair data for you, so what is the point?
Now it's absolutely fine to use BTRFS on top of a per-drive layer, like dm-crypt, if you want to encrypt your data.
Misuse #2: using BTRFS to host VM disks
It's going to be soooo slooooow.
VM images have a file system inside them, and you're almost guaranteed to do a lot of random (as in not sequential) reads and writes to them.
Reading a random block is going to be costly because it won't be passed to the VM until its checksum is also read, from a different block, in the checksum tree.
Writing is way worse because the file system is copy-on-write: BTRFS will not only write a new data block, but also update the checksum btree and the file btree, and it may need to read a full extent, modify it, and write it somewhere else so that everything can be committed to disk in one consistent operation. All these write locations are not contiguous, which is especially time-consuming on an HDD. Read more about BTRFS data structures in the documentation.
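Here is a deliberately crude back-of-the-envelope model of that write amplification. The numbers are made up for illustration (the assumed btree depth of 3, the "one new node per tree level" accounting) and it ignores the batching BTRFS does per transaction, but it shows the shape of the problem:

```python
def blocks_written_in_place(n_random_writes: int) -> int:
    """In-place file system (roughly the ext4 situation, journal ignored):
    each random 4 KiB guest write dirties about one host block."""
    return n_random_writes

def blocks_written_cow(n_random_writes: int, tree_depth: int = 3) -> int:
    """Copy-on-write toy model: each guest write allocates a new data block,
    plus new nodes along the modified paths of the checksum tree and the
    file tree, since old nodes are never overwritten in place."""
    per_write = 1 + 2 * tree_depth  # data block + csum-tree path + file-tree path
    return n_random_writes * per_write

print(blocks_written_in_place(1000))  # 1000
print(blocks_written_cow(1000))       # 7000, and scattered all over the disk
```

The exact factor matters less than the fact that the extra writes land in completely different places on the disk, which is what kills HDD performance.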
Misinformation: BTRFS RAID 5/6 is totally broken (and ZFS isn't)
Here we're talking about using BTRFS RAID 5/6 for data; metadata should be kept on RAID1 or RAID1c3 respectively instead.
Even in otherwise very respected sources of documentation, you read stuff like:
The RAID 5 and RAID 6 modes of Btrfs are fatally flawed, and should not be used for "anything but testing with throw-away data".
Let's not even mention Reddit...
First, it's worth noting that the official documentation is a lot milder:
There are some implementation and design deficiencies that make it unreliable for some corner cases and the feature should not be used in production, only for evaluation or testing.
BTRFS was not developed by amateurs, so let's stop for a moment and think about that sentence.
What is the actual problem?
Again quoting the official documentation:
The write hole problem. An unclean shutdown could leave a partially written stripe in a state where the some stripe ranges and the parity are from the old writes and some are new. The information which is which is not tracked. Write journal is not implemented.
So if there is an "unclean shutdown" (power failure, physical drive disconnection, kernel crash) while you are writing some data, you could be unlucky enough to end up with a stripe mismatch.
It is worth remembering that we still have checksumming in place (as metadata should be kept on RAID1), so such a problem would be detected during a scrub or a read, and we would know about it.
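To make the write hole concrete, here is a toy model with two data disks and XOR parity (real BTRFS RAID5 stripes are wider and the on-disk format is more involved; this only shows where the stale parity bites):

```python
import zlib

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

# A toy two-data-disk RAID5-style stripe: parity = d1 XOR d2.
d1, d2 = b"AAAA", b"BBBB"
parity = xor(d1, d2)

# Unclean shutdown mid-write: d1 has been updated on disk, parity has not.
d1_after_crash = b"CCCC"

# Later the disk holding d2 fails, and d2 is rebuilt from d1 and the stale parity.
rebuilt_d2 = xor(d1_after_crash, parity)
print(rebuilt_d2 == d2)  # False: the reconstruction is silently wrong...

# ...except that BTRFS still has a checksum of d2 (in metadata kept on RAID1),
# so the mismatch is reported at read or scrub time instead of being handed
# back to the application as valid data.
print(zlib.crc32(rebuilt_d2) == zlib.crc32(d2))  # False -> detected
```

Note that the data put at risk is the old, untouched block in the same stripe, not the block you were writing, which is why the checksums (and a scrub after an unclean shutdown) matter so much here.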
What do they mean by "production"?
Now let's imagine that at the time of such an "unclean shutdown", you were writing to another file system, or using another RAID mode that doesn't have this issue. What would the difference be? Would your write operation have finished? No! The only difference is that you would have consistent data on disk, maybe an old version of the file, but the file you were writing would still not be on the disk.
In order to notice the difference, the file system would have to be part of a larger distributed system, some of whose components do not suffer the same "unclean shutdown" and are able to track file versions or state (as they would have expectations about the data on disk that won't be fulfilled).
So what they mean by "production" is a very serious use case that you find in big infrastructures with a lot of complexity. In all other use cases, like building a NAS, what are you going to do in case of a power failure anyway? If you were in the middle of copying files, you'd probably use some common sense and do the copy again if you can.
What about ZFS?
Alexey Gubin, who writes file system recovery tools for a living, including one for ZFS, wrote a very detailed article about RAID modes on ZFS (so-called RAIDZ) in which he discusses the write hole problem. I recommend you go read the full article, but to summarise:
ZFS works around the write hole by embracing the complexity. So it is not like RAIDZn does not have a write hole problem per se because it does. However, once you add transactions, copy-on-write, and checksums on top of RAIDZ, the write hole goes away.
The overall tradeoff is a risk of a write hole silently damaging a limited area of the array (which may be more or less critical) versus the risk of losing the entire system to a catastrophic failure if something goes wrong with a ZFS pool. Of course, ZFS fans will say that you never lose a ZFS pool to a simple power failure, but empirical evidence to the contrary is abundant.
What about the data corruption bug?
The most serious criticism I've read (which is often quoted, for example at the top of this sticky thread on Reddit) is the one from Zygo Blaxell in 2020 on the BTRFS mailing list. The most concerning piece is this one:
- btrfs raid5 does not provide as complete protection against on-disk data corruption as btrfs raid1 does.
When data corruption is present on disks (e.g. when a disk is temporarily disconnected and then reconnected), bugs in btrfs raid5 read and write code may fail to repair the corruption, resulting in permanent data loss.
Often, the quote stops there, but here is the next paragraph:
btrfs raid5 is quantitatively more robust against data corruption than ext4+mdadm (which cannot self-repair corruption at all), but not as reliable as btrfs raid1 (which can self-repair all single-disk corruptions detectable by csum check).
(A list of these bugs and discussions is available in a separate, well put together email, also by Blaxell.)
But that was in 2020; the main culprits have since been fixed during the 6.2 kernel cycle (December 2022), when the developers claimed to have fixed the source of "all the reliability problems" (in this context, of course).
Now the developers are doing refactoring and cleanup (see 6.3), and are starting to make backward-incompatible changes (raid-stripe-tree), paving the way not just to fixing bugs, but to correcting what John Duncan famously called "fatally flawed" back in 2016.