saurik 2 hours ago

So... one can, on a filesystem that is mirrored using MD RAID, from userspace, and with no special permissions (as it seems O_DIRECT does not require any), create a standard-looking file that has two different possible contents, depending on which RAID mirror it happens to be read from today? And this bug, which has been open for a decade now, has somehow not been considered an all-hands-on-deck security issue that undermines the integrity of every single mechanism people might ever use to validate the content of a file, because... checks notes... we should instead be "teaching [the attacker] not to use [O_DIRECT]"?

(FWIW, I appreciate the performance impact of a full fix here might be brutal, but the suggestion of requiring boot-args opt-in for O_DIRECT in these cases should not have been ignored, as there are a ton of people who might not actively need or even be using O_DIRECT, and the people who do should be required to know what they are getting into.)
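The mechanism is worth spelling out: with O_DIRECT the kernel transfers straight from the caller's buffer, and md RAID1 issues a separate transfer per mirror, so a process that keeps mutating the buffer while the write is in flight can land different bytes on each leg. A minimal sketch of the userspace side (hedged: the file name is illustrative, tmpfs and similar filesystems will refuse O_DIRECT, and actually observing diverging mirrors of course also requires running on md RAID1):

```python
import mmap
import os
import tempfile
import threading

BLOCK = 4096
path = os.path.join(tempfile.gettempdir(), "odirect-race-demo")  # illustrative name

# Anonymous mmap gives a page-aligned buffer, as O_DIRECT requires.
buf = mmap.mmap(-1, BLOCK)
buf[:] = b"A" * BLOCK

stop = threading.Event()

def flip():
    # Mutate the buffer while the O_DIRECT write may be in flight;
    # each RAID mirror can snapshot the buffer at a different moment.
    while not stop.is_set():
        buf[0:1] = b"B" if buf[0:1] == b"A" else b"A"

try:
    fd = os.open(path, os.O_CREAT | os.O_WRONLY | os.O_DIRECT, 0o600)
except (OSError, AttributeError):
    result = "unsupported"        # e.g. tmpfs, or a non-Linux platform
else:
    t = threading.Thread(target=flip)
    t.start()
    try:
        os.write(fd, buf)         # kernel reads directly from buf, no page-cache copy
        result = "wrote"
    except OSError:
        result = "unsupported"
    stop.set()
    t.join()
    os.close(fd)
    os.unlink(path)
print(result)
```

On a buffered write the kernel copies the data into the page cache once, so this race is harmless; O_DIRECT removes that stabilizing copy, which is exactly what the issue is about.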

  • weinzierl an hour ago

    Linus very much opposed O_DIRECT from the start. If I remember correctly, he only introduced it under pressure from the "database people", i.e. his beloved Oracle.

    No wonder O_DIRECT never saw much love.

    "I hope some day we can just rip the damn disaster out."

    -- Linus Torvalds, 2007

    https://lkml.org/lkml/2007/1/10/235

    • jandrewrogers 14 minutes ago

      This is one of several examples where Linus thinks something is bad because he doesn't understand how it is used.

      Something like O_DIRECT is critical for high-performance storage software, for well-understood reasons. It enables entire categories of optimization by breaking a kernel abstraction that is intrinsically unfit for purpose; there is no way to fix it in the kernel, because the existence of the abstraction is itself the problem as a matter of theory.

      As a database performance enjoyer, I've been using O_DIRECT for 15+ years. Something like it will always exist because removing it would make some high-performance, high-scale software strictly worse.
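      The contract that buys that performance is also what makes O_DIRECT demanding: the kernel transfers straight to and from caller-owned memory, so buffers, lengths, and file offsets generally have to be block-aligned and the application manages its own buffer pool. A minimal sketch of that pattern (hedged: the file name is illustrative, and filesystems such as tmpfs will refuse O_DIRECT outright):

```python
import mmap
import os
import tempfile

BLOCK = 4096
path = os.path.join(tempfile.gettempdir(), "odirect-align-demo")  # illustrative name

try:
    fd = os.open(path, os.O_CREAT | os.O_WRONLY | os.O_DIRECT, 0o600)
    # Anonymous mmap is page-aligned, so it satisfies the usual
    # O_DIRECT alignment requirements; a plain bytes object of an
    # odd length and address generally would not.
    buf = mmap.mmap(-1, BLOCK)
    buf[:] = b"x" * BLOCK
    result = os.write(fd, buf)    # one block-sized, block-aligned write
    os.close(fd)
    os.unlink(path)
except (OSError, AttributeError):
    result = None                 # filesystem or platform refused O_DIRECT
print(result)
```

      This is why O_DIRECT code tends to come with its own aligned allocator and I/O scheduler: the page cache's one-size-fits-all policies are replaced by the application's own.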

  • karmakaze 5 minutes ago

    This is nuts. I've used both MD RAID and O_DIRECT though luckily not together on the same system. One system was with btrfs so may have been spared anyway. Footguns/landmines.

  • vbezhenar 2 hours ago

    Please note that some filesystems, namely bcachefs, btrfs, and zfs, seem to be immune to this issue, probably because they don't just directly delegate O_DIRECT writes to the block layer. But it is still important to be aware of it.

    • saurik an hour ago

      While those are all "filesystems", they are also (internally) alternatives to MD RAID; you could run zfs on top of MD RAID, but it feels like a waste of zfs (and the same largely goes for btrfs and bcachefs). It is therefore not at all clear to me that it is the filesystems that are "immune to this issue" rather than their respective RAID-like behaviors, as the latter seems to be what the discussion was focusing on (hence the initial mention of potentially adding btrfs to the issue, which otherwise did not mention any filesystem at all). Put another way: if you did do the unusual thing of running zfs on top of MD RAID, I actually bet you are still vulnerable to this scenario.

      (Oh, unless you are maybe talking about something orthogonal to the fixes mentioned in the discussion thread, such as some property of the extra checksumming done by these filesystems? And so, even if the disks de-synchronize, maybe zfs will detect an error if it reads "the wrong one" off of the underlying MD RAID, rather than ending up with the other content?)

  • summa_tech 2 hours ago

    Wouldn't the performance impact be that of setting the page read-only when the request is submitted, then doing a copy-on-write if the user process does write it? I mean, that's nonzero, TLB flushes being what they are. But they do happen a bunch anyway...
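    For intuition, the kernel already does something very similar for MAP_PRIVATE file mappings: pages effectively start write-protected, and the first store takes a copy-on-write fault. A sketch of that observable behavior (hedged: this illustrates COW semantics generally, not the proposed O_DIRECT fix; the file is a throwaway temp file):

```python
import mmap
import os
import tempfile

# Create a one-page file with known contents.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"original" + b"\0" * 4088)
    path = f.name

fd = os.open(path, os.O_RDWR)
mm = mmap.mmap(fd, 4096, flags=mmap.MAP_PRIVATE)
mm[0:8] = b"modified"          # first store faults; kernel copies the page privately

with open(path, "rb") as f:
    on_disk = f.read(8)        # the file is untouched by the private write
in_map = bytes(mm[0:8])        # the mapping sees the modified copy

mm.close()
os.close(fd)
os.unlink(path)
print(on_disk, in_map)         # b'original' b'modified'
```

    The per-fault cost (plus the TLB shootdown when the protection changes) is the overhead being weighed here, and as noted it is nonzero but not exotic.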

rwaksmunski 2 hours ago

This, the fsync() data corruption saga, BitterFS issues, the lack of audit support on io_uring, the triplicated EXT2/3/4 code bases. For the past 20 years, every time I consider moving mission-critical data away from FreeBSD/ZFS, something like this pops up.

  • zokier an hour ago

    Personally I think these problems are a sign that the POSIX filesystem/IO APIs are just not that good. Or rather, they have been stretched and extended way past their original intent and design. Stuff like zenfs gives an interesting glimpse of what could be.