points by hyc_symas 7 years ago

> Doesn't this affect all databases?

Didn't affect LMDB. If an fsync fails, the entire transaction is aborted/discarded. Retrying was always inherently OS-dependent and unreliable; better to just toss it all and start over. Any dev who actually RTFM'd and followed POSIX specs would have been fine.
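
To illustrate the policy (a minimal Python sketch, not LMDB code; `durable_commit` is a made-up name): a failed fsync() kills the whole commit, and the caller rebuilds the write from scratch rather than ever calling fsync() again on the same data.

```python
import os

def durable_commit(path: str, data: bytes) -> bool:
    """One commit attempt. If fsync() fails, report failure so the
    caller discards its state and redoes the entire write -- never
    just call fsync() again and hope."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)  # raises OSError on failure
        return True
    except OSError:
        return False  # toss the transaction; "retry" means rewriting everything
    finally:
        os.close(fd)
```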

LMDB's crash reliability is flawless.

mehrdadn 7 years ago

> Any dev who actually RTFM'd and followed POSIX specs would have been fine.

So I'm trying to do that [1] and it seems to me the documentation directly implies that a second successful call to fsync() necessarily implies that all data was transferred, even if the previous call to fsync() had failed.

I say this because the sentence says "all data for the open file descriptor" is to be transferred, not merely "all data written since the previous fsync to this file descriptor". It follows that any data not transferred in the previous call due to an error ("outstanding I/O operations are not guaranteed to have been completed") must now be transferred if this call is successful.

What am I missing?

[1] https://pubs.opengroup.org/onlinepubs/009695399/functions/fs...

  • hyc_symas 7 years ago

    Technically, there is nothing besides "all data written since the previous fsync" since the previous fsync already wrote all the outstanding data at that point in time. (Even if it actually failed.) I.e., there is never any to-be-written data remaining after fsync returns. Everything that was queued to be flushed was flushed and dequeued. Whether any particular write failed or not doesn't change that fact.

    • mehrdadn 7 years ago

      > the previous fsync already wrote all the outstanding data at that point in time. (Even if it actually failed.)

      I'm sorry, but this... just doesn't make sense. An event can't both happen and also fail to happen.

      Also, the queue business is an implementation detail. Our debate is over what the spec mandates, not how a particular implementation behaves.

      • hyc_symas 7 years ago
          The fsync() function shall request that all data for the open file descriptor named by fildes is to be transferred to
          the storage device associated with the file described by fildes. The nature of the transfer is implementation-defined.
          The fsync() function shall not return until the system has completed that action or until an error is detected.
        

        All of the data was transferred to the device. The device may have failed to persist some or all of what was transferred, but it all got transferred.

        There's no mention of the OS retrying, or leaving the system in a state that a subsequent fsync can retry from where it left off. So you can't assume anything along those lines.

        • mehrdadn 7 years ago

          So you're suggesting "storage device" here isn't the same thing as "storage medium"? i.e. that merely transferring the data to a volatile buffer on the device is also considered being transferred to the "storage device", rather than to the (nonvolatile) storage medium?

          By that logic even after a successful call you can't rely on the data being persisted, which goes squarely against the entire point of the function and its documentation: "The fsync() function is intended to force a physical write of data from the buffer cache, and to assure that after a system crash or other failure that all data up to the time of the fsync() call is recorded on the disk."

          So this interpretation is wrong...

          • hyc_symas 7 years ago

            Nonsense. If the device returns "success" then everything was persisted. If it returns an error, then some/all of it was not persisted. There is no way for you to determine whether it is some or all or which. The only safe action for a user is to assume all of it failed.

            And by the way, there are devices out there that lie, and claim the data is successfully persisted even though it only made it into a volatile cache.

            • mehrdadn 7 years ago

              Everything you said in this comment is consistent with what I've been saying. Nowhere did I suggest the user can assume data was written if an fsync call fails. I'm saying that if an fsync call fails, the documentation of fsync (which I quoted multiple times) implies the next call will attempt to write the data that wasn't successfully written on the last call.

              • hyc_symas 7 years ago

                Yeah, and we still disagree there. Nothing in the doc implies that just calling fsync() again will do anything useful, there is no implication of retryability.

                • mehrdadn 7 years ago

                  > Nothing in the doc implies that just calling fsync() again will do anything useful

                  "Nothing" in the doc? More like everything in the doc? It literally says "all data for the open file descriptor is to be transferred" and "If the fsync() function fails, outstanding I/O operations are not guaranteed to have been completed." Together these are literally saying that all data that wasn't transferred due to a failure in one call will be transferred in the next successful call.

                  I could maybe buy it if the doc described a notion of "dequeueing" that was separate from writing, but it doesn't. It just talks about queued operations completing. So either they complete successfully (and are subsequently dequeued because that is common sense) or they don't.

                  Like if your boss had assigned you tasks A, B, and then C, and then he ordered you to finish all your tasks, and you failed to do B, and then he made the same request again, you wouldn't then suddenly turn to him and say "I have nothing to do because I already popped B and C off my to-do list". You'd get fired for that (at least if you persisted) because it literally wouldn't make sense.

                  • hyc_symas 7 years ago

                    You're going in circles.

                    https://news.ycombinator.com/item?id=19133714

                    You are equating "transferred" with "completed" but they are clearly not the same.

                    All the data was transferred. After transfer, the data on the device may have been written or may have been lost. The OS doesn't remember what has been transferred after the transfer - there is no language about this anywhere in this text.

      • geocar 7 years ago

        > I'm sorry, but this... just doesn't make sense. An event can't both happen and also fail to happen.

        Consider a multithreaded application where a and b point to the same file. Note this isn't exactly what Postgres does:

            T1                   T2
            -------------------- --------------------
            write(a,x) start
            write(a,x) success
        
            fsync(a) start
                                 read(b,x) start
                                 read(b,x) success
            fsync(a) failed
                                 write(b,y) start
                                 write(b,y) success
        
                                 fsync(b) start
                                 fsync(b) success
        
        

        y has a data dependency on x (for example: table metadata), and the fsync() for y succeeded, so is T2 given to expect that x was recorded on disk correctly?

        • mehrdadn 7 years ago

          No? It says "all data for the open file descriptor", not "all data for all open file descriptors".

          • geocar 7 years ago

            The file descriptors point to the same file and can even be the same descriptor.

            Maybe this is clearer/simpler, writing x and y to the same offset:

                T1                   T2
                -------------------- --------------------
                                     write(b,y) start
                                     write(b,y) success
                write(a,x) start
                write(a,x) success
            
                fsync(a) start
                fsync(a) failed
                                     fsync(b) start
                                     fsync(b) success
            

            At this point, T2 is probably surprised to discover the original contents of the file on the disk. Even after a successful fsync()!

            • mehrdadn 7 years ago

              I'm confused, where is the reading happening that is causing T2 to be surprised? But in any case, I'm saying that if a and b are the same descriptor, then fsync(b)'s success should imply both a and b are written to disk, by the specification of fsync. I don't see where you're seeing a contradiction.

              • geocar 7 years ago

                I understand what you're saying. If a and b are the same descriptor, then fsync(b)'s success DOES NOT mean that the data was written to disk, because the error was already reported by fsync(a)'s failure. I think you're missing that there are two fsync() calls, one that fails and the other that doesn't.

                And FYI: Even if they're not the same descriptor, you still have risk.

                • mehrdadn 7 years ago

                  You're telling me what's actually happening, but that's not where we're disagreeing. I agree that's what happens. What I'm saying is that it's wrong. It goes against the spec of fsync. It says all data is to be written, not merely all data written since the last call to fsync. Your example didn't show any contradiction in this reasoning.

                  • geocar 7 years ago

                    > It [says] all data is to be written, not merely all data written since the last call to fsync.

                    No it doesn't.

                    The spec† says: "If the fsync() function fails, outstanding I/O operations are not guaranteed to have been completed." That clearly permits the Linux behaviour; there is no data that hasn't been "transferred to the storage device" at this point. You can see it from the perspective of the Linux kernel:

                        write() = success
                        fsync() = error: outstanding I/O not guaranteed††.
                        fsync() = success: no outstanding I/O
                    
                    

                    †: http://pubs.opengroup.org/onlinepubs/9699919799/functions/fs...

                    ††: It would've been a different story if the spec said, If "the fsync() function fails, all future fsync() to the same file descriptor must fail." but it doesn't say this.

                    • mehrdadn 7 years ago

                      No, the spec is longer than that one sentence. That sentence of the spec is merely saying that if fsync fails then the I/O operations haven't necessarily been completed. That's completely true and obvious and yes, it's also consistent with Linux. That's not the problem. The problem is the other sentence I already quoted, which is that "all data for the open file descriptor is to be transferred to the storage device". This includes any data that hasn't been written, which includes data that failed to write on a previous call to fsync. Like I already said once, you can't claim data that was sent to the device but failed to write to persistent storage is considered "transferred", since by that logic fsync would never guarantee any data was actually written to persistent storage (read the entire spec). I already explained all this, but I'm repeating myself since it seems like every time you respond you're only choosing to read half of the spec and half of what I'm writing, and I'm tired of continuing.

                      • geocar 7 years ago

                        But the "spec" uses that line, so you can't very well ignore it either. Most unixes also have this behaviour, so this isn't some nonconformant outlier; this is the standard behaviour. It might not be ideal, but most of the UNIX I/O model and heritage isn't.

                        Apologies for assuming you didn't understand the behaviour when you said it "doesn't make sense": I'd assumed you meant that you were having trouble understanding the behaviour, rather than "all the application developers and unix implementors" that expect this behaviour, not understanding "the spec".

macdice 7 years ago

But the spec doesn't imply in any way that you'll need to write() again if fsync() fails. And dropping dirty flags seems way out of spec because now you can read data from cache that is not on disk and never will be even if future fsync() succeeds.

So I don't buy the spec argument.

You have me curious now... I don't know anything about LMDB but I wonder if its msync()-based design really is immune... Could there be a way for an IO error to leave you with a page bogusly marked clean, which differs from the data on disk?

  • hyc_symas 7 years ago

    The spec leaves the system condition undefined after an fsync failure. The safe thing to do is assume everything failed and nothing was written. That's what LMDB does. Expecting anything else would be relying on implementation-specific knowledge, which is always a bad idea.

    > I don't know anything about LMDB but I wonder if its msync()-based design really is immune.

    By default, LMDB doesn't use msync, so this isn't really a realistic scenario.

    If there is an I/O error that the OS does not report, then sure, it's possible for LMDB to have a corrupted view of the world. But that would be an OS bug in the first place.

    Since we're talking about buffered writes - clearly it's possible for a write() to return success before its data actually gets flushed to disk. And it's possible for a background flush to occur independently of the app calling fsync(). The docs specifically state that if an I/O error occurs during writeback, it will be reported on all file descriptors open on that file. So assuming the OS doesn't have a critical bug here, no, there's no way for an I/O error to leave LMDB with an invalid view of the world.

    • nh2 7 years ago

      > The spec leaves the system condition undefined [...]. The safe thing to do is assume everything failed

      This is key.

      Often programmers do 'assumption based programming'.

      "Surely the function will do X, it's the only reasonable thing to do, right?" As much as it is human, this is bad practice and leads to unreliable systems.

      If the spec doesn't say it, don't assume anything about it, and keep asking. To show that this approach is feasible for anyone, here is an example:

      Recently I needed to write an fsync-safe application. The question of whether close()+re-open()+fsync() is safe came up. I found it had been asked on StackOverflow (https://stackoverflow.com/questions/37288453/calling-fsync2-...) but received no answers for a year. I put a 100-reputation bounty on it and quickly got a competent reply quoting the spec and summarising:

      > It is not 100% clear whether the 'all currently queued I/O operations associated with the file indicated by [the] file descriptor' applies across processes. Conceptually, I think it should, but the wording isn't there in black and white.

      With the spec being unclear, I took the question to the kernel developers (https://marc.info/?m=152535098505449), and was immediately pointed at the Postgres fsyncgate results.

      So by spending a few hours on not believing what I wished was true, I managed to avoid writing an unsafe system.
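
      For concreteness, the pattern the question was about looks roughly like this (a hypothetical sketch; in the real scenario the writer and the syncer may even be different processes):

```python
import os

def write_then_reopen_and_fsync(path: str, data: bytes) -> None:
    """The close()+re-open()+fsync() pattern from the question. The
    final fsync() can report success even if writeback of `data`
    failed between close() and the second open(), because on many
    kernels that error was only attached to the old descriptor."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    os.write(fd, data)
    os.close(fd)      # dirty pages may still sit in the page cache
    fd = os.open(path, os.O_WRONLY)
    os.fsync(fd)      # "success" here is NOT proof the data is on disk
    os.close(fd)
```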

      Always ask those in the know (specs and people). Never assume.

loeg 7 years ago

> Any dev who actually RTFM'd and followed POSIX specs would have been fine.

Yeesh. The POSIX manual on fsync is intentionally vague to hide cross-platform differences. There are basically no guarantees once an error happens. I guess that's one interpretation of RTFMing, but... clearly it doesn't match user expectations.

evdubs 7 years ago

Mr. Chu, I hope you never lose your tenacity with respect to writing blurbs on LMDB's performance and reliability. I have enjoyed the articles comparing LMDB with other databases' performance and hope you continue to point the spotlight on the superior design decisions of LMDB.

  • hyc_symas 7 years ago

    Thanks, glad you're getting something out of those writeups. Hopefully it helps other developers learn a better path.

    • diroussel 7 years ago

      What if another process calls fsync? It will get the error. Then when LMDB calls fsync no error will be reported. And thus the transaction will not be retried. Is this scenario dealt with?

      • anarazel 7 years ago

        Newer versions of linux (but not plenty of other OSs) guarantee that each write failure will be signalled once to each file descriptor open before the error occurred. So that ought to be handled, unless the FD is closed in between.

      • hyc_symas 7 years ago

        LMDB is single-writer. Only one process can acquire the writelock at a time, and only the writer can call fsync() so this scenario doesn't arise.

        If you open the datafile with write access from another process without using the LMDB API, you're intentionally shooting yourself in the foot and all bets are off.

        • _urga 7 years ago

          On some systems, fsync is system-wide. Another process fsyncing an unrelated file descriptor can still consume the error meant for the LMDB file descriptor. Same thing goes for a user running the sync command from the terminal. A write lock won't protect you from this, unless you can prevent all other processes from calling fsync. It's got nothing to do with opening the LMDB datafile concurrently. If you share a physical disk device with any other process, you're at risk.

          • hyc_symas 7 years ago

            Yes, acknowledged.

            https://news.ycombinator.com/item?id=19128325

            fsync() is not documented to be system-wide (while sync() is). That behavior would also be an OS bug. The question this person asked was specifically about fsync().

            • _urga 7 years ago

              Sorry, that was confusing, I meant "system-wide" as in file system, not OS, i.e. "if you share a physical disk device with any other process".

              When you flush a particular FD, some file systems just flush every dirty buffer for the entire disk. I wouldn't actually be surprised, though, if there are some kernels that flush all disks as well, regardless of whether it's considered a bug or not.

              "The question this person asked was specifically about fsync()."

              Sure, but as you acknowledged elsewhere, if sync is called, it flushes everything, and this impacts upon the person who "asked specifically about fsync" since there's a chance on some kernels that sync will eat the error that fsync was expecting.

pgaddict 7 years ago

I'm pretty sure it does affect any database/application relying on buffered I/O. Even if you use the fsync() interface correctly, you're still affected by the bugs in error reporting.

  • hyc_symas 7 years ago

    Technically there is no bug in error reporting. fsync() reported an error as it should. The application continued processing instead of stopping. fsync() didn't report the same error a second time, which led to the app having problems.

    The application should have stopped the first time fsync() reported an error. LMDB does this, instead of futile retry attempts that Postgres et al do. Fundamentally, a library should not automatically retry anything - it should surface all errors back to the caller and let the caller decide what to do next. That's just good design.
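
    As a sketch of the difference (hypothetical names, not LMDB's actual API): the library-level commit just lets the error propagate; the retry loop below is the anti-pattern, because on the kernels discussed here a later fsync() typically "succeeds" after the first failure already cleared the error.

```python
import os

def commit(fd: int) -> None:
    """Library style: no retry, no policy. An fsync() failure simply
    propagates as OSError and the caller decides what to do."""
    os.fsync(fd)

def commit_with_retries(fd: int, attempts: int = 3) -> None:
    """Anti-pattern: retrying fsync() in a loop. After the first
    failure the kernel may have marked the pages clean, so a later
    'success' proves nothing about the lost data."""
    last_error = None
    for _ in range(attempts):
        try:
            os.fsync(fd)
            return
        except OSError as err:
            last_error = err
    raise last_error
```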

lmb 7 years ago

We've been using LMDB at Cloudflare to store small-ish configuration data, it has been rock solid.

Thank you and the rest of the contributors for such a great library.

grandinj 7 years ago

In the reported case, fsync does not persist all data to disk, but it reports success. How does LMDB deal with that situation?

  • hyc_symas 7 years ago

    In the reported case, fsync reported an error, then more data may or may not have been written, then fsync is tried again and reports success, which masks the fact that data from the previous fsync didn't get fully written.

    As I already wrote - in LMDB, after fsync fails, the transaction is aborted, so none of the partially written data matters.

    • jhallenworld 7 years ago

      Hold on, they are saying that if sync fails (for example if someone types "sync" at the console), then when the database calls fsync() it will not fail even though the data is gone. I don't see how any database that uses the buffer cache could guard against this case.

      The kernel should never do this. If sync fails, all future syncs should also fail. This could be relaxed in various ways: sync fails, but we record which files have missing data, so any fsync for just those files also fails.

      (Otherwise I agree with LMDB- there should be no retry business on fsync fails).
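
      The latching behaviour proposed above can be sketched in userspace (a hypothetical wrapper, not a real kernel fix - it only sees failures that this process's own fsync() calls observe):

```python
import os

class LatchingFsync:
    """Once fsync() of a file fails, every later fsync() of that file
    through this wrapper fails too, instead of silently succeeding."""
    def __init__(self) -> None:
        self._failed: set[str] = set()

    def fsync(self, fd: int, path: str) -> None:
        if path in self._failed:
            raise OSError(f"an earlier fsync of {path} failed; data may be lost")
        try:
            os.fsync(fd)
        except OSError:
            self._failed.add(path)
            raise
```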

      • hyc_symas 7 years ago

        You're right, that was another error case in the article that I missed the first time.

        In LMDB's case you'd need to be pretty unlucky for that bug to bite; all the writes are done at once during txn_commit, so you have to issue sync() during the couple of milliseconds that might take, before the fsync() at the end of the writes. Also, it has to be a single-sector failure; otherwise a write after the sync will still fail and be caught.

        E.g.

           LMDB       other
          write
          write
          write       sync
          write
          fsync
        

        If the device is totally hosed, all of those writes' destination pages would error out. In that case, the failure can only be missed if you're unlucky enough for the sync to start immediately after the last write and before the fsync. The window is on the order of nanoseconds.

        If only a single page is bad, and the majority of the writes are ok, then you still have to be unlucky enough for the sync to run after the bad write; if it runs before the bad write then fsync will still see the error.

garyclarke27 7 years ago

LMDB's claimed speed and reliability seem remarkable (from a quick glance). I would guess it is easier to achieve such for a KV store than for a much more complex relational database. Got me thinking though. Maybe Postgres could take advantage of LMDB? Maybe by using LMDB as its cache instead of the OS page cache, or maybe by writing the WAL to LMDB?

  • lmb 7 years ago

    LMDB is not a great fit for something like the WAL, where new data is written at the end, and old data discarded at the start. It leads to fragmentation (especially if WAL entries are larger than a single page).

  • hyc_symas 7 years ago

    LMDB itself only uses the OS page cache. The way for LMDB to improve an RDBMS is for it to replace the existing row and index store, and eliminate any WAL. This is what SQLightning does with SQLite.

    Have looked at replacing InnoDB in MySQL, but that code is a lot harder to read, so it's been slow going. Postgres doesn't have a modular storage interface, so it would be even uglier to overhaul.

    • garyclarke27 7 years ago

      Thanks, makes sense. I think Postgres is planning to have a pluggable storage interface in the next version, 12 - would that help? Also, nobody has mentioned data checksums, added in v9.3 - do you know if this helps avoid this kind of fsync-related corruption?

      • anarazel 7 years ago

        > Also, nobody has mentioned data checksums, added in v9.3 - do you know if this helps avoid this kind of fsync-related corruption?

        Not really, I think. Page-level checksums don't protect against entire writes going missing, unfortunately.