It isn't, because otherwise it would be showing the ~same performance with and without sync commands, as I showed in the thread. There is a significant performance loss for every drive, but Apple's is way worse.
There is no real excuse for a single sector write to take ~20ms to flush to NAND, all the while the NAND controller is generating some 10MB/s of DRAM traffic. This is a dumb firmware design issue.
It may be interpreting it differently. You aren't comparing apples to apples, quite literally.
Why not compare macOS and Linux on approved x86 Mac hardware, i.e. a Fusion Drive or whatever?
Also, as suggested - try F_BARRIERFSYNC, which flushes anything before the barrier (used for WAL IIRC).
It seems to be pretty apples to apples, they're running the same benchmark using equivalent data storage APIs on both systems. What are you thinking might be different? The Linux+WD drive isn't making the data durable? Or that OSX does something stupid which could be the cause of the slowdown rather than the drive? Both seem implausible.
This affects T2 Macs too, which use the same NVMe controller design as M1 Macs.
We've looked at NVMe command traces from running macOS under a transparent hypervisor. We've issued NVMe commands outside of Linux from a bare-metal environment. The 20ms flush penalty is there for Apple's NVMe implementation. It's not some OS thing. And other drives don't have it. And I checked and Apple's NVMe controller is doing 10MB/s of DRAM memory traffic when issued flushes, for some reason (yes, we can get those stats). And we know macOS does not properly flush with just fsync() because it actively loses data on hard shutdowns. We've been fighting this issue for a while now, it's just that it only just hit us yesterday/today that there is no magic in macOS - it just doesn't flush, and doesn't guarantee data persistence, on fsync().
I've just been scanning through the Linux kernel code (including ext4). Are you sure that it's not issuing a PREFLUSH? What are your barrier options on the mount? I think you will find these are going to be more like F_BARRIERFSYNC.
I couldn't find much info about it, but the official docs are here: https://kernel.org/doc/html/v5.17-rc3/block/writeback_cache_...
Those are Linux concepts. What you're looking for is the actual NVMe commands. There's two things: FLUSH (which flushes the whole cache), and a WRITE with the FUA bit set (which basically turns that write into write-through, but does not guarantee anything about other commands). The latter isn't very useful for most cases, since you usually want at least barrier semantics if not a full flush for previously completed writes. And that leaves you with FLUSH. Which is the one that takes 20ms on these drives.
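To make the FLUSH vs. FUA distinction concrete, here is a toy model. It is only a sketch of the semantics described above, not real NVMe: the `ToyDrive` class, its dictionaries and the LBA numbers are all made up for illustration.

```python
class ToyDrive:
    """Toy drive with a volatile write cache (hypothetical, not real NVMe)."""

    def __init__(self):
        self.cache = {}  # LBA -> data, lost on power failure
        self.nand = {}   # LBA -> data, survives power failure

    def write(self, lba, data, fua=False):
        if fua:
            # FUA: this one write goes straight to persistent media.
            # It says nothing about other writes still sitting in the cache.
            self.nand[lba] = data
            self.cache.pop(lba, None)
        else:
            self.cache[lba] = data

    def flush(self):
        # FLUSH: everything in the volatile cache becomes persistent.
        self.nand.update(self.cache)
        self.cache.clear()

    def power_fail(self):
        # Whatever was only in the cache is gone.
        self.cache.clear()


# Case 1: plain write followed by a FUA write, then power loss.
drive = ToyDrive()
drive.write(0, "journal entry")
drive.write(1, "commit record", fua=True)
drive.power_fail()
after_fua = dict(drive.nand)

# Case 2: two plain writes, a FLUSH, then power loss.
drive2 = ToyDrive()
drive2.write(0, "journal entry")
drive2.write(1, "commit record")
drive2.flush()
drive2.power_fail()
after_flush = dict(drive2.nand)
```

In case 1 only the FUA write survives, so the commit record is on disk without the data it refers to. In case 2 both survive, which is why FLUSH is the command you end up needing.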
> Those are Linux concepts. What you're looking for is the actual NVMe commands.
I'm not sure what commands are being sent to the NVMe drive. But what you are describing as a flush would be F_BARRIERFSYNC - NOT the F_FULLFSYNC which you've been benchmarking.
Sigh, no. A barrier is not a full flush. A barrier does not guarantee data persistence, it guarantees write ordering. A barrier will not make sure the data hits disk and is not lost on power failure. It just makes sure that, after a power failure, later data won't show up without the data written before it. NVMe doesn't even have a concept of barriers in this sense. An OS-level barrier can be faster than a full sync only because it doesn't need to wait for the FLUSH to actually complete, it can just maintain a concept of ordering within the OS and make sure it is maintained with interleaved FLUSH calls.
I don't know why you keep pressing on this issue. macOS has the same performance with F_FULLFSYNC as Linux does with fsync(). Why would they be different things? We're getting the same numbers. This entire thing started because fsync() on these Macs on Linux was dog slow and we couldn't figure out why macOS was fast. Then we found F_FULLFSYNC which has the same semantics as fsync() on Linux. And now both OSes perform equally slowly on this hardware. They're obviously doing the same thing. And the same thing on Linux on non-Apple SSDs is faster. I'm sure I could install macOS on this x86 iMac again and show you how F_FULLFSYNC on macOS also gives better performance on this WD drive than on the M1, but honestly, I don't have the time for that, the issue has been thoroughly proved already.
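For anyone who wants the "same semantics on both OSes" call in one place, a minimal sketch in Python follows. It assumes Python's `fcntl` module only defines `F_FULLFSYNC` on macOS; `full_sync` and `durable_write` are hypothetical helper names, and falling back to `fsync()` when `F_FULLFSYNC` is rejected is my own choice, not something from the thread.

```python
import fcntl
import os
import tempfile

# F_FULLFSYNC is only defined on macOS; elsewhere this is None.
F_FULLFSYNC = getattr(fcntl, "F_FULLFSYNC", None)


def full_sync(fd):
    """Ask the OS to push fd's data to stable storage; return the call used."""
    if F_FULLFSYNC is not None:
        try:
            fcntl.fcntl(fd, F_FULLFSYNC)
            return "F_FULLFSYNC"
        except OSError:
            # Assumption: some filesystems reject F_FULLFSYNC, so fall
            # back to plain fsync() rather than failing the write.
            pass
    os.fsync(fd)
    return "fsync"


def durable_write(path, data):
    """Write data to path and sync it with the strongest call available."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        # A single os.write() is enough for the small payloads used here;
        # a general-purpose helper would loop on short writes.
        os.write(fd, data)
        return full_sync(fd)
    finally:
        os.close(fd)
```

On Linux this ends up in plain `fsync()`, on macOS in `fcntl(fd, F_FULLFSYNC)`, which is the pairing being benchmarked above.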
Actually, I have a better one that won't waste as much of my time.
Plugs in a shitty USB3 flash drive into the M1.
224 IOPS with F_FULLFSYNC. On a shitty flash drive. 58 IOPS with F_FULLFSYNC. On internal NVMe.
Both FAT32.
Are you convinced there's a problem yet?
(I'm pretty sure the USB flash drive has no write cache, so of course it is equally fast/slow with just fsync(), but my point still stands - committing writes to persistent storage is slower on this NVMe controller than on a random USB drive)
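For reference, a rough sketch of the kind of loop that produces IOPS numbers like these: write one sector, sync, repeat. This is not the benchmark used in the thread. The function names, the 512-byte sector size and the iteration count are assumptions, and `sync` defaults to `os.fsync`, so on macOS you would pass a function that issues F_FULLFSYNC instead.

```python
import os
import tempfile
import time


def sync_write_iops(directory, iterations=50, sector=512, sync=os.fsync):
    """Write one sector, sync, repeat; return completed writes per second."""
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        block = b"\0" * sector
        start = time.perf_counter()
        for _ in range(iterations):
            os.pwrite(fd, block, 0)  # overwrite the same sector each time
            sync(fd)                 # wait until the OS reports it as synced
        elapsed = max(time.perf_counter() - start, 1e-9)
    finally:
        os.close(fd)
        os.unlink(path)
    return iterations / elapsed


def ms_per_op(iops):
    """Convert an IOPS figure into average milliseconds per synced write."""
    return 1000.0 / iops
```

Plugging in the figures above, `ms_per_op(58)` is about 17.2 ms per synced write on the internal NVMe and `ms_per_op(224)` is about 4.5 ms on the USB drive, which is roughly in line with the ~20ms flush figure quoted earlier.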
Thank you, you've made this very clear for me.
OK - thanks for humouring me marcan. Sorry to waste your time. Clearly something is not right here.