Animats 6 years ago

Mainframe designers had this problem under control by 1970. Mainframes had, and have, "channels". A channel is part of the processor architecture. It takes commands, sends them to a peripheral, and manages the data transfer in both directions. Channels have some privileged functions through which the OS tells them where the data is supposed to go in memory. The architecture of channels is well defined, and peripherals are built to talk to channels. The CPU has I/O instructions to control channels in a well defined way.

The peripheral never has access to main memory. There is no peripheral-controlled "direct memory access" (DMA). So it's possible to give control of a channel to a userland program without a memory security risk.
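
A toy model of the property being described, purely for illustration; the class and method names are invented, not real S/360 channel commands:

```python
# Toy model of channel-mediated I/O: the peripheral never touches memory;
# the channel does, within a window only the OS can set. Class and method
# names are invented here, not real S/360 channel commands.
class Channel:
    def __init__(self):
        self._base = None
        self._limit = None

    def os_set_window(self, base, limit):
        """Privileged: the OS tells the channel where data may go."""
        self._base, self._limit = base, limit

    def start_io(self, memory, offset, data):
        """Unprivileged: a program requests a transfer. The channel,
        not the device, bounds-checks and moves the bytes."""
        if self._base is None:
            raise PermissionError("channel not configured by the OS")
        lo = self._base + offset
        if offset < 0 or lo + len(data) > self._limit:
            raise PermissionError("transfer outside OS-granted window")
        memory[lo:lo + len(data)] = data

memory = bytearray(64)
chan = Channel()
chan.os_set_window(base=16, limit=32)   # OS grants bytes 16..31 only
chan.start_io(memory, 0, b"OK")         # lands at memory[16:18]
```

Because the bounds check lives in the channel, a userland program can be handed `start_io` directly without gaining access to the rest of memory.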

Minicomputers of the 1970s had low transistor counts and slow CPUs. So peripherals were usually put directly on the memory bus, with full access to memory. I/O operations were performed by storing into memory addresses, which caused bus transactions detected by the peripheral device. There were no CPU I/O instructions.

Microprocessors copied the minicomputer model. IBM's people knew this was a bad idea, and in the IBM PS/2 they introduced the "Micro Channel". Peripheral vendors, facing a new architecture that required more transistors, screamed. IBM backed down and went back to bus-oriented peripherals.

That model persists today, even though the few thousand transistors required for a channel controller are nothing now. Even though most modern CPUs have I/O-channel-like machinery, it's exposed to the program as registers the program stores into, and as memory accesses by the peripheral device.

So there's no standardization on how to talk to devices at the hardware level. Some CPUs have protection systems, an "I/O MMU", and there have been various channel-like interfaces, especially from Intel, but they have never caught on.

Instead, we mostly have heavy kernel mediation between the hardware and the user program. And way too many "drivers". This has become a problem with "solid state disk", which is really a random access memory device that doesn't write very fast. Mostly, it's used to emulate rotating disks.

Samsung makes a key/value store device that uses SSD-type memory but manages the key/value store itself. You still need a kernel between the device and the user program, though; you can't just open a channel to it and let the user program access it directly.

  • michaelmu 6 years ago

    Thanks for this explanation. Super interesting!

  • kazinator 6 years ago

    > Channels have some privileged functions through which the OS tells them where the data is supposed to go in memory.

    So the channel controller system has DMA; just not the peripheral.

    > Minicomputers of the 1970s had low transistor counts and slow CPUs. So peripherals were usually put directly on the memory bus, with full access to memory.

    That's "bus mastering" DMA. There is also such a thing as "third party" DMA, which uses a generic DMA controller programmed to push data back and forth between memory and relatively dumb peripherals.

    That approximates the channel concept.

    PCs have had DMA controllers since the early 1980s. The IBM PC had one in the form of the Intel 8237 chip, which is documented as having four "channels", wouldn't you know it.

    • Animats 6 years ago

      Those are more like add-on features a driver can use if present. They don't push peripheral interfaces into a standard channel-like format. Nor are they close to one that can be exposed to application programs.

      • dkersten 6 years ago

        So would it be correct to say that functionally, it’s similar to DMA, but the API is different because “channels” expose a consistent interface while DMA doesn’t?

        • kazinator 6 years ago

          Channels only expose a consistent interface as long as the only mainframes are IBM's, or IBM-compatible.

          Once you have two, and you want portable applications, using channel I/O instructions directly inline goes out the window; you need an API.

          • marshray 6 years ago

            One could say this about any library or OS feature. Berkeley Sockets, for example.

            Porting code is easier when there are portable APIs. But that doesn't diminish the quality of the abstraction.

      • kazinator 6 years ago

        We can't expose a DMA controller directly to programs, but an API could be devised for it. Such a thing is needed even just from the mere portability point of view: making different DMA controllers on different systems look the same.

        Applications will typically specify transfers using virtual addresses, but DMA controllers understand physical memory (usually; maybe there are virtually mapped ones out there). There are other issues too, like restricted ranges: DMA that cannot access all physical memory, just a certain range. The API will have to convert ranges of virtual addresses into one or more physical ranges, which are then queued as one or more DMA operations, and somehow deal with the problem that not just any old memory allocation in the application is DMA-able.
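
        As a sketch of the address-translation work such an API would do, assuming a 4 KiB page size and a toy page table (a real kernel would also pin the pages and honor controller range limits):

```python
# Sketch of virtual-to-physical scatter-gather conversion for DMA.
# Assumes a 4 KiB page and a toy page table mapping virtual page
# number -> physical page number; helper names are invented.
PAGE = 4096

def vaddr_to_sg(page_table, vaddr, length):
    """Split [vaddr, vaddr+length) into physically contiguous
    (phys_addr, length) segments suitable for queuing as DMA ops."""
    segments = []
    end = vaddr + length
    while vaddr < end:
        vpn, off = divmod(vaddr, PAGE)
        chunk = min(PAGE - off, end - vaddr)
        phys = page_table[vpn] * PAGE + off
        # Merge with the previous segment if physically contiguous.
        if segments and segments[-1][0] + segments[-1][1] == phys:
            segments[-1] = (segments[-1][0], segments[-1][1] + chunk)
        else:
            segments.append((phys, chunk))
        vaddr += chunk
    return segments

# A buffer crossing a page boundary: two segments if the physical
# pages are scattered, one if they happen to be adjacent.
scattered = vaddr_to_sg({0: 7, 1: 3}, 4000, 200)  # two DMA operations
adjacent = vaddr_to_sg({0: 7, 1: 8}, 4000, 200)   # merges into one
```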

        Applications talking to channel hardware directly using dedicated CPU instructions is only possible in some vendor-locked IBM mainframe world. It's not otherwise feasible simply for business/market/economic reasons having nothing to do with technical feasibility.

        An operating system API is basically a set of extensions to the instruction set available in the application's virtual machine; it's no different from some dedicated I/O instructions, just perhaps less efficient (which may or may not matter).

        • Unklejoe 6 years ago

          > but DMA controllers will understand physical memory

          Yup, this is what I’ve found on many embedded SoCs. It’s unfortunate. The fact that they need physical addresses makes them not worth using in many cases (mainly when trying to move user buffers around).

          • hyc_symas 6 years ago

            Sun supported DMA to virtual addresses in SunOS / Solaris. Made some aspects of drivers easier.

      • CamperBob2 6 years ago

        They don't push peripheral interfaces into a standard channel-like format.

        How does this not describe a PCIe bus master?

        Frankly I don't understand any of this. High-performance I/O works by accessing main memory directly, just like it always has. The CPU then has to wait on main memory, just like it always has. Saying that the CPU is somehow the bottleneck seems to fall into the "not even wrong" area. There is no danger of I/O bandwidth approaching L1 or even L2 speeds anytime soon.

        • ncmncm 6 years ago

          400 Gbps is 50 GBps, or 50 bytes per ns. A 100-byte packet comes every 2ns. The bottleneck is not cache bandwidth, it is the number of instructions you have time to execute per unit of I/O: in 6 cycles you will not get to run 24 instructions, especially if you have to wait 3 cycles for your L1 cache.
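
          Checking that arithmetic (the 3 GHz clock is my assumption; the "6 cycles" for a 2 ns packet implies roughly that):

```python
# Checking the packet-rate numbers above. The 3 GHz clock is an
# assumption, implied by "6 cycles" per 2 ns packet.
link_bps = 400e9
bytes_per_ns = link_bps / 8 / 1e9   # 50.0 bytes per ns
packet_ns = 100 / bytes_per_ns      # 2.0 ns per 100-byte packet
cycles = packet_ns * 3.0            # 6.0 cycles at 3 GHz
```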

          • CamperBob2 6 years ago

            But you aren't waiting on L1, except in the sense that you're waiting on line fills from main memory. If the problem is that you don't have enough time to do something with the data you're getting, how is that solvable by architectural changes external to the CPU?

            In the real world, if you need to deal with 400 Gb/s, you aren't going to send it to a single general-purpose CPU using any type of bus or channel. The CPU won't see it until another ASIC (or FPGA, I suppose) crunches it first.

            Yes, that may impose a limit on the speed of an external network that a CPU can deal with, but that's the way it goes. MCA was never going to save us at this end of the Moore's Law curve.

            • ncmncm 6 years ago

              It is only solvable by architectural changes external to the CPU. Even at 40Gbps, which we live with now, that's 20 ns per packet, or 80 ns at a pokey 10Gbps, certainly not enough time to do a system call per packet.

              So you need the hardware to divvy traffic up to multiple queues -- rings, really -- and a core for each ring. The packets just show up in memory, and the cores had just better keep up. If that looks to you like the same old POSIX architecture, I don't know what to say.
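
              A minimal model of that divvying-up: the CRC here is a stand-in for a NIC's real receive-side-scaling hash (e.g. Toeplitz), and everything is invented for illustration; the point is only that a flow always lands on the same ring, so one core can own it without locks:

```python
# Toy receive-side dispatch: the NIC hashes each packet's flow tuple
# to pick a ring, so one core can own each ring without locking.
# zlib.crc32 stands in for the hardware's real hash (e.g. Toeplitz).
import zlib

NUM_RINGS = 4
rings = [[] for _ in range(NUM_RINGS)]

def rx(src_ip, dst_ip, src_port, dst_port, payload):
    key = f"{src_ip}:{src_port}-{dst_ip}:{dst_port}".encode()
    ring = zlib.crc32(key) % NUM_RINGS   # same flow -> same ring, always
    rings[ring].append(payload)

# Two packets of one flow land, in order, on one ring.
rx("10.0.0.1", "10.0.0.2", 1234, 80, b"a")
rx("10.0.0.1", "10.0.0.2", 1234, 80, b"b")
rx("10.0.0.9", "10.0.0.2", 5555, 80, b"c")
```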

              It was around 2010 that the network pipes got faster than the SAN file servers, and suddenly file output capacity had to be incorporated into network flow control.

  • dbcurtis 6 years ago

    That is an IBM- and Univac-centric view of I/O. Control Data Corp had the "peripheral processor", or PP. PPs only ran OS code, usually called "driver overlays".

    It was actually very elegant. There were 10 copies of PP state (20 in the 7600) and only 1 actual execution unit (2 in the 7600). Hardware multi-threading in 1959! So there were 10 PPs executing PP overlay code (drivers) at 1/10 the instruction rate of the main CPU. Each PP had 4K of 12-bit words, which served for both PP code and I/O buffer space. Main memory was 60 bits wide (12*5) and addresses were 18 bits, so the PPs had 18-bit accumulators for computing addresses.

    Since PPs ran only trusted code, they were allowed to scribble anywhere in main memory that they wanted to. At the end of an I/O operation, the PP computed an address for the main CPU, and that directly became the interrupt vector address. This meant the CPU never had to deal with low-level interrupts, only the much less frequent I/O-completion interrupt at the end of a long operation.
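
    The barrel arrangement can be modeled in a few lines (heavily simplified; names and details here are invented, not CDC terminology):

```python
# Heavily simplified model of the PP "barrel": ten register sets share
# one execution unit, which serves them round-robin, so each PP
# advances at one tenth of the shared unit's rate.
class Barrel:
    def __init__(self, n=10):
        self.pcs = [0] * n   # one program counter per PP state
        self.slot = 0        # which PP the shared execution unit serves

    def tick(self):
        """One slot: execute a single instruction for the current PP."""
        self.pcs[self.slot] += 1
        self.slot = (self.slot + 1) % len(self.pcs)

barrel = Barrel()
for _ in range(25):
    barrel.tick()
# After 25 ticks, PPs 0-4 have executed 3 instructions, PPs 5-9 have 2:
# every PP runs at 1/10 of the shared unit's instruction rate.
```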

    (In a past life, I did system software at CDC, and CPU logic design at Sperry-Univac and Amdahl.)

    • timzentu 6 years ago

      Oh man, CDS/CDC (Control Data Systems/Control Data Corporation) is where my Dad worked. Don't know if you knew him (Ed), but he was one of the last 2-4 employees of the company. He was working up in Michigan until 1996. I still remember we took a vacation to DC, and in the Smithsonian they had 2 of the machines he worked on. (Him swearing at some hardware in a museum because it had once cut him is one of my best memories. I asked him how he could tell it was his, and it had a screwdriver ding from shutting it.)

      • watersb 6 years ago

        Friend of mine ran the CDC 6600 at the University of Texas in Austin.

        Now I need to ask him about DMA architecture...

  • dooglius 6 years ago

    This sounds pretty similar to what goes on now, except that the channel processor sits on the other side of the (e.g.) PCIe connector.

  • kabdib 6 years ago

    IIRC the IBM Microchannel stuff was perceived as a ploy to recover control of the PC industry by licensing a new bus. The licensing terms were pretty bad (it could be that "free" was the expected bar at the time, though) and the implementation was expensive, too. Unfortunately for IBM the industry was pretty happy using the existing buses, and when more performance was needed the industry was equally happy to invent new buses that didn't involve paying royalties to IBM.

    So regardless of MC's attractiveness from a systems standpoint, it was a nonstarter in an industry that was headed towards an ecology of clones and racing to the bottom in manufacturing costs. Many PC manufacturers ditched parity memory, too.

  • temac 6 years ago

    The PS/2 failed not because of technicalities but because IBM did not want it to be cloned. Third parties decided what the PC should be (mostly, continue to be: there were fewer differences between the clones and the original PC than with the PS/2), and IBM followed. In hindsight, the PS/2 could not win in this context, because very early on IBM no longer mattered to the history of the PC, regardless of peripheral vendors. This market was never driven by peripherals anyway, especially not internal expansion boards.

    You can use VFIO. Yes, it needs an IOMMU, but that's basically what you're asking for: hardware support. And yes, I would like the IOMMU to be generalized, also because nowadays it is dangerous to have some external ports without one (even before thinking about efficient access from userspace). Because of that you cannot delegate the core of the thing to the "peripheral" anymore; basically you need an IOMMU, and once you have one, what else do you need?

    On the storage level, you can mmap Optane on DDR4 without an intermediate software copy, I think. But that's merely an implementation detail anyway. When you think about the complexity of modern controllers and the diversity of options for implementing them, and at the same time about the latencies that matter most, the only thing that matters is the end result. Obviously it can be better not to burn CPU cycles in some cases (for some very specific workloads), but it is not trivial to decide what to put in HW or SW to achieve that: RAM is already crazy slow, and the fastest SSD is still even slower, so you will need to be smart in SW anyway to not be bitten by latencies. So in a sense, no: modern CPUs are certainly not slow!

    I also think you misrepresent the common modern interfaces available for (even interoperable) SSDs; they are certainly not "emulat[ing] rotating disks".

    I'd like you to give the following some thought, to finish: 1/ HW backward compatibility is not a prime requirement anymore (because (a) of the smartphone dev model, and (b) even the PC model has shifted enough that it no longer supports complete backward compatibility, mostly due to SoC integration). 2/ There is "sufficient" competition, where sufficient can be defined as effective at driving innovation because it can bring market advantages, and there is a non-trivial number of competitors on the market (I'll mix things a little, but this is still relevant to what we are talking about: Intel, AMD, Arm, Nvidia, Apple, Samsung, etc.). 3/ The designers working on those chips and systems are not stupid. What we have now is of course affected by technical history, but it is already immensely different from what we had in the PS/2 era. Some of the historical impacts are possibly confined to less than a square millimeter of silicon, and not only was your PCI not your ISA, your modern PCIe is already quite different from your PCI... So the impact of PS/2-era peripheral vendors on all of this today: absolutely null, without a doubt.

  • transpute 6 years ago

    > So there's no standardization on how to talk to devices at the hardware level. Some CPUs have protection systems, an "I/O MMU", and there have been various channel-like interfaces, especially from Intel, but they have never caught on.

    From the paper:

    > NVMe SSDs perform I/O faster than the OS can accept new I/O requests and notify their completion. Furthermore, the current POSIX AIO implementations in Linux are ugly and adhoc, and they have limited support from file systems. This has lead to new AIO interfaces that eliminate system calls and leverage polling. However, polling for completion is not suitable for large I/O transfers.

    AWS Nitro SR-IOV I/O virtualization [1] uses Intel VT-d Posted Interrupts [2] to avoid polling, but this CPU feature is price-segmented to Xeon E5 and higher. For other CPUs, polling [3] is necessary to achieve high IOPS with NVME. Intel and AMD should consider making this feature available on all CPUs, to support NVME and NVDIMM.

    [1] SR-IOV: https://www.snia.org/sites/default/files/RonEmerick_PCI_Expr... & https://www.twosixlabs.com/running-thousands-of-kvm-guests-o...

    [2] IOMMU: https://www.linux-kvm.org/images/7/70/2012-forum-nakajima_ap...

    [3] Polling: https://events.static.linuxfound.org/sites/events/files/slid... & https://www.snia.org/sites/default/files/SDC/2018/presentati...

    • _msw_ 6 years ago

      Disclosure: I work at AWS and played a part in building the Nitro system

      The Nitro system does use SR-IOV, and it can take advantage of hardware virtualization features like VT-d posted interrupts to lower interrupt virtualization overhead. But VT-d posted interrupts isn't a factor in avoiding polling. I'd go further to say that the Two Six Labs blog post has some inaccuracies, so I wouldn't depend on the information it contains much.

      Polling is often more efficient, and lower latency, than interrupt driven operation. If a CPU is mostly doing IO, polling for work can be a much better approach. Jens Axboe has been working on adding polling to the I/O stack for a while [1], and has recently added a new API for submitting I/O that is very promising [2] (25-33% performance improvements on I/O heavy workloads).

      [1] https://lwn.net/Articles/705315/

      [2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...
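
      The polling advantage in sketch form (a toy shared completion ring, NOT the real io_uring API): the consumer spins instead of sleeping and taking an interrupt per completion:

```python
# Toy completion ring illustrating interrupt-free I/O reaping (this is
# NOT the real io_uring API): the device side posts completions to
# shared memory and a dedicated core polls, so there is no syscall or
# interrupt per I/O.
from collections import deque

completion_ring = deque()

def device_completes(result):
    """Producer side: device (or kernel) posts a completion."""
    completion_ring.append(result)

def poll_completions(budget=32):
    """Consumer side: reap up to `budget` completions without sleeping."""
    done = []
    while completion_ring and len(done) < budget:
        done.append(completion_ring.popleft())
    return done

device_completes("read#1 ok")
device_completes("read#2 ok")
batch = poll_completions()   # both completions reaped in one pass
```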

  • bogomipz 6 years ago

    >"Mainframe designers had this problem under control by 1970. Mainframes had, and have, "channels". A channel is part of the processor architecture."

    What are some examples of these processors?

    I would be interested in reading more about these mainframe processors architecture. Might you or anyone else have some links?

    • neop1x 6 years ago

      It always surprises me to see how mainframes have been one step ahead. I don't have any experience with them, but I've learned that, for example, live VM migrations between hosts have existed for a long time. And there was a cool system called the IBM AS/400, introduced in 1988, with an integrated database, an "everything is an object" design, and peripherals with processors, which is still used for some critical applications today. Our x86 clouds with Kubernetes may sometimes feel like a bunch of cheap toys in comparison. :P

  • stcredzero 6 years ago

    Mainframes had, and have, "channels". A channel is part of the processor architecture. It takes commands, sends them to a peripheral, and manages the data transfer in both directions. Channels have some privileged functions through which the OS tells them where the data is supposed to go in memory. The architecture of channels is well defined, and peripherals are built to talk to channels. The CPU has I/O instructions to control channels in a well defined way.

    I need these! Not for communication between the program and peripherals. I'd like to have these for communication between processes, threads, and software Actors. I'd like to be able to map such hardware channels to channels in Golang. What we have today are software channels implemented with, "heavy kernel mediation between the hardware and the user program," which means that one may be required to think a bit too much about how processes communicate with each other, and the performance implications and how these mechanisms can break down.

  • p_l 6 years ago

    MCA was a normal DMA bus, including extensions to use it as a memory bus (which added a few pins to help recognize and manage a memory card, with plain MCA transfers for access).

    The closest thing to a complex "I/O channel program" on PCs was I2O, which largely tanked despite leaving a huge mark on SCSI controller design.

jandrewrogers 6 years ago

At least in database kernels, we noticeably reached this threshold around five years ago with typical server hardware. This is an interesting computer science problem in that virtually all of our database literature is based on the presumption that I/O is much slower than CPU.

If you cleanroom a database kernel design based on the assumption that I/O performance is not the bottleneck, you end up with an architecture that looks very different than the classic model you learn at university. It is always a tradeoff of burning a resource you have in abundance to optimize utilization of a resource that is scarce, and older database architectures are quite wasteful of resources that have become relatively scarce on newer hardware.

  • aloer 6 years ago

    What was the situation like before SSDs? Were the fastest hard drives at the time (15k rpm?) faster than the fastest CPUs?

    • insulanus 6 years ago

      Let's take a look at the throughput of a server hard disk vs. a typical server CPU:

      https://serverfault.com/questions/190451/what-is-the-through...

      https://www.anandtech.com/show/9185/intel-xeon-d-review-perf...

      Now, that 45.8 GB/sec can't usually be fully utilized by the application, but it's a lot higher than the 200MB/sec of the server hard drive.

      There's also the complication of RAID, max memory fetch rate of a single core, etc. And the fact that the database has to do some processing besides moving data around.

      But it seems to me that server CPU bandwidth of the last generation is significantly higher than spinning disks.

    • blihp 6 years ago

      Individually no, collectively (i.e. storage arrays) yes.

      • loeg 6 years ago

        Sort of. Intel's server parts top out at on the order of 48 PCIe 3 lanes (Skylake-SP). Those parts have 6-channel DDR4 memory (rated to DDR4-2666), giving a theoretical DRAM throughput limit of about 128 GB/s.

        Meanwhile, a PCIe 3.0 lane has a theoretical throughput limit of about 985MB/s; 48 lanes makes 47 GB/s.

        Both are theoretical numbers and in practice will be lower, and you're right that it doesn't give you a lot of free CPU to do any kind of processing on the data, but it seems like the CPU can keep up (as long as Intel continues to have an anemic number of PCIe lanes and until PCIe 4 comes along).

        (FWIW, on a dual-socket Skylake-SP, Anandtech only achieved 122 GB/s (of theoretical 256 GB/s DRAM maximum). But Intel claims the dual-socket can reach 199 GB/s with an AVX-512 workload.[1] Scaling both of these numbers down by a factor of two gives a very crude estimate for single-socket DRAM bandwidth on a conventional workload (61 GB/s) and AVX-512 (99 GB/s).)

        Meanwhile, AMD's Zen server parts have more than double the PCIe lanes but only a few more memory channels, so I expect you're correct there — huge/fast PCIe arrays could be bottlenecked on DRAM on that platform.

        [1]: https://www.anandtech.com/show/11544/intel-skylake-ep-vs-amd...
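
        The theoretical numbers quoted above can be reproduced:

```python
# Reproducing the theoretical figures quoted above.
ddr4_2666_chan = 2666e6 * 8              # bytes/s per DDR4-2666 channel
dram_gbs = 6 * ddr4_2666_chan / 1e9      # six channels: ~128 GB/s
pcie3_lane = 8e9 / 8 * (128 / 130)       # 8 GT/s with 128b/130b encoding
pcie_gbs = 48 * pcie3_lane / 1e9         # 48 lanes: ~47 GB/s
```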

        • blihp 6 years ago

          The question was about spinning rust drives ('before SSDs') and the corresponding CPUs that were available at the same time.

  • owaislone 6 years ago

    Are you aware of any efforts by existing databases, or entirely new databases, built with this realization in mind?

    • jchrisa 6 years ago

      NoSQL is based on this realization. Rather than optimize for IO with normalization, denormalize and pipe data more or less directly from storage to clients, simplifying the compute and consistency model to make distributed data easier.

      • mike_ivanov 6 years ago

        The point of normalization is logical consistency, not IO optimization.

        • jchrisa 6 years ago

          Sure. But the point of (early) NoSQL was IO optimization, not logical consistency.

          • bogomipz 6 years ago

            I was curious about the parenthetical "early" in your comment. Is this optimization no longer the case then? Could you elaborate on recent developments regarding this? I've been out of the NoSQL loop for some time and am genuinely curious.

            • tracker1 6 years ago

              The problem is, it varies... you have everything from what are effectively key/value stores to document databases and column stores. You have systems built on other systems. RethinkDB and Cockroach take different approaches than Redis, Mongo, Cassandra or others.

              CockroachDB gives an SQL interface over the top of a distributed data store with better consistency and relations. Cassandra has no real relations over a BigTable/Column-Store solution. It really just depends.

              I think that with what's coming out of SSD/NVMe and even Optane DIMMs, there will be databases directly tuned to control/set their own data storage in these environments.

            • jchanimal 6 years ago

              By early NoSQL, I mean the projects that are basically an API wrapped around a data structure, that push the understanding of the data structure onto the developer. Newer products (like my employer FaunaDB) prioritize correctness and offer abstract query interfaces like GraphQL.

          • bunderbunder 6 years ago

            This one gets hard to pick into, given how nebulous the term NoSQL is.

            One could probably argue, quite convincingly, that the point of truly early NoSQL was that SQL hadn't been invented yet.

          • acdha 6 years ago

            I never saw it portrayed that way: it was always a higher-level programmer-ease-of-use pitch along the lines of "isn't this easier than doing a JOIN?", shortly before the speaker had a valuable learning experience about the value of data integrity and consistency.

            • virgilp 6 years ago

              You really don't remember the "web scale" memes about Mongo? It was always primarily about "performance".

              • acdha 6 years ago

                I don't remember seeing that predicated as a low-level I/O savings — it was always things about being easy to cluster or avoiding the performance overheads of ACID. The pitch was usually “it's easier and you don't need a skilled DBA”.

                • adrianhel 6 years ago

                  What I heard was "you can practically scale indefinitely", which I guess is more true of NoSQL. As is I/O savings, but that's not something I've ever heard it advertised for.

              • rhizome 6 years ago

                Performance was an imposed callsign of Mongo; the reason everybody started using it was the declarative schema stuff, so the dev didn't have to know how to start a terminal, which was a not-insignificant hurdle. It doesn't have significant performance differences over other simple KV stores. "Faster" was primarily a measurement of learning curve.

        • sieabahlpark 6 years ago

          It's both.

          • mike_ivanov 6 years ago

            No, it is not. Normalization is an artefact of relational algebra (set theory applied to tabular data). It has nothing to do with concrete implementations and their concerns (such as IO). http://wiki.c2.com/?RelationalModel

            Moreover, ALL existing relational databases do denormalize data behind the scenes (think about caches and indexes) for the sake of IO optimization, which means that normalization does not achieve the IO optimization goal by itself (otherwise, why bother?)

            • bunderbunder 6 years ago

              It really is.

              Originally, schema normalization was just about logical consistency. But that was before query optimizers were invented, and long before they grew to be so very optimized for working with normalized schemata.

              Nowadays, even someone who cares 100% about performance and 0% about data integrity (And is somehow still using a relational database, yeah. So what? It's MY hypothetical.) still has good reason to default to something like 3NF. Insert famous Knuth quote here.

              Caches and indexes and all that stuff may technically count as denormalizations, but, insofar as the definition of 3NF doesn't mention them, that's a tangential issue.

              • crazygringo 6 years ago

                > Nowadays, even someone who cares 100% about performance and 0% about data integrity (And is somehow still using a relational database, yeah. So what? It's MY hypothetical.) still has good reason to default to something like 3NF

                That's just... wrong.

                Best practices with database design are to start with a totally normalized schema because it reduces scope for errors/inconsistencies.

                And then to intentionally denormalize where required to increase performance, in order to reduce lookups from joins or computing aggregate functions -- at the cost of having to manually ensure duplicated data remains consistent. Query optimizers can only do so much.

                (If it weren't to increase performance, I can't really think of any reason why you ever would denormalize on a single database server, unless you have some extremely complicated calculations that SQL isn't capable of expressing?)

                • bunderbunder 6 years ago

                  > And then to intentionally denormalize where required to increase performance

                  Hopefully under profiler guidance, like for any other optimization. For my part, the frequency that I find that trying a denormalization helps performance has been on a downward trend for a while now. Mostly down to just straight-up duplicating data at this point. Databases really are getting very good at what they do.

              • ncmncm 6 years ago

                This makes more sense than it seems to.

                MongoDB's insight was that the overwhelming majority of data stores don't matter enough to justify worry about correctness. It doesn't go faster than other DBs if you turn on the "do it right" flags, but their customers mostly don't.

                Would you be willing to wait an extra five seconds for page load to be sure the banner at the bottom of "people who looked at this also looked at these" list is fully up to date and correct? Or is filling the boxes with any old crap good enough?

                • acdha 6 years ago

                  > Would you be willing to wait an extra five seconds for page load to be sure the banner at the bottom of "people who looked at this also looked at these" list is fully up to date and correct?

                  This is a separate question — do you do a live query or pre-compute it? — and it’s not really significantly easier to do that with Mongo’s model.

                  The pitch I saw was usually based on it being easier than having to think about your data model in advance, which is relevant to your example: everyone I know who picked it did something like that, thought it was less work and that was great, and then had some problem which came down to copies of data getting out of sync and so e.g. the “also looked at” box had the wrong price or a typo which had been fixed elsewhere. Fixing that usually cancelled out what claimed performance advantages and then some.

      • marshray 6 years ago

        I see your analogy, but (NoSQL and its clients) are conceptually much higher up on the stack.

        Interestingly, the paper does say:

            [eBPF] enables applications to perform up to L7 protocol (e.g., HTTP or Memcached) processing. 
            Furthermore, some programmable NICs are able to execute eBPF programs directly on the NIC hardware.
    • dustingetz 6 years ago
      • sdegutis 6 years ago

        Datomic is just a Prolog logical layer on top of traditional RDBMS (e.g. Postgres, H2, etc).

        • dustingetz 6 years ago

          No, Datomic storage is unbundled, prod configurations target basically DynamoDB.

        • lame88 6 years ago

          I'd agree it probably doesn't quite apply to this case, but there are similarities from what I've researched, and I think that description doesn't quite do it justice. First, Datomic is temporal with full on-demand history, which I think is a big deal, and it has a serialized-writer, n-reader architecture. As to how it may relate to this problem: I believe peer applications (in-process, for on-prem) can exploit locality, with the full indexes in memory and no round-trip queries; the same goes for ions with Datomic Cloud. I'll let anyone more experienced clarify or correct me. The concept at least is really cool; I hope to give it a real test drive some time soon.

    • ignoramous 6 years ago

      CMU's Andy Pavlo is probably at the forefront in this area, here's what they were working on (they're re-doing it all over again now as a yet to be announced project): https://news.ycombinator.com/item?id=14391523 Also, I'd encourage you to go through their Advanced Database class: https://www.youtube.com/playlist?list=PLSE8ODhjZXjYutVzTeAds...

      Michael Stonebraker, a Turing awardee, called for complete destruction of the old database order in 2008: https://www.dbms2.com/2008/02/18/mike-stonebraker-calls-for-... after which, I think, they created VoltDB (H-Store) and SciDB.

      Interesting to note that Andy Pavlo worked on Stonebraker's H-Store.

      • hueving 6 years ago

        Why do you suppose nothing useful has come out of the CMU group for actual use? Is it just a case of limited payoff for marginal gain?

        • ignoramous 6 years ago

          I'm in no position to comment on that, unfortunately. Not sure if Andy's group has detailed why their first take didn't come off, apart from this terse note on GitHub, which says that building an autonomous DB is their main goal (rather than an NVMe-first DB, which seems like a sub-goal):

          > The Peloton project is dead. We have abandoned this repository and moved on to build a new DBMS. There are a several engineering techniques and designs that we learned from this first system on how to support autonomous operations that we are doing a much better job at implementing in the second system.

          Here's their alleged unannounced second-take: https://github.com/cmu-db/terrier

    • Jach 6 years ago

      ScyllaDB is probably in the right direction. It at least bypasses the kernel for networking (uses Intel's DPDK).

      • PeterCorless 6 years ago

        Yes. Pekka works at ScyllaDB, and the paper references Avi Kivity's work at Scylla twice in the endnotes, as well as the underlying Seastar engine that Scylla is built upon (that's where DPDK is embedded).

    • jandrewrogers 6 years ago

      Yes, there are efforts in this area. I started a first-principles database kernel design 4-5 years ago which has a couple different licensees and implementations, and should be showing up in a broad platform relatively soon. In open source, ScyllaDB is a credible implementation that goes half-way to addressing this (as an intentional reimplementation of Cassandra, it inherits some of the old style architectural features of Cassandra for better and worse). I am aware of a few other efforts at tech companies but in my opinion these are a mixed bag because they are borrowing too heavily from existing classic architectures for the sake of expediency to fully realize the benefits.

      The archetypical differentiating features, relative to classic designs, of these future-looking (for lack of a better term) kernel designs are the prodigious exploitation of user-space I/O, a dearth of multithreaded synchronization, and a paucity of classic tree-like data structures (not necessarily hash tables, just not trees). The industry will get here eventually; some emerging tech sectors absolutely require it for their data models.

    • AtlasBarfed 6 years ago

      IMO the biggest problem is that SQL joins don't scale on commodity hardware networks due to the CAP theorem, assuming you want availability. Network IO is a lot better these days, and per Scylla it can get to a pretty small instruction window, but still it isn't what XPoint and SSD crunch.

      Database rearchitecture at the single node level can help optimize this, and maybe improve the megamachine SQL node to scale up to fatter sizes, but really big data is still a network IO and reliability challenge.

      And it may be true we can pump a single fat node up to really huge throughput once some db rearchitecture fully leverages SSDs and I/O, but that just exposes more data to downtime if there is a network partition.

  • shereadsthenews 6 years ago

    I don't know ... we used to interleave our hard disk formats because a hard disk could stream data faster than an i386 CPU could ingest it, and there was plenty of database research done prior to 1990.

    • masklinn 6 years ago

      I assume the IO you're talking about is (/was) sequential. And DBs are specifically engineered to sequentialise their IO (e.g. clustered indexes).

      But even if SSDs do blow HDDs out of the water on sequential IO, it's on random IO that the difference is most stark: the cost difference between sequential and random is much lower on SSDs than on HDDs, and both random IO throughput and random IOPS shoot through the roof relative to spinning rust.

      An SSD might be an order of magnitude better than an HDD on sequential IO, but will often be 2 or 3 orders of magnitude better on random IO. That means the loss of throughput going from sequential to random IO (while still there, due to command overhead and the like) is much, much smaller on an SSD than on an HDD. And the story is similar on the latency front: an HDD might have a command latency in the low tens of milliseconds, an SSD in the low-mid tens microseconds.

      • dorlaor 6 years ago

        With the LSM (log-structured merge tree) approach we have at Scylla, we make all the writes batched/sequential and get an awesome boost.

        Some reads can be sequential too if you scan through the clustering keys.
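        The LSM idea described above can be sketched in a few lines: writes land in an in-memory table (the "memtable") and are periodically flushed to disk as one sorted, sequential run (an "SSTable"). This is a toy illustration of the pattern, not Scylla's actual implementation; all names here are made up.

```python
# Toy LSM sketch: random-order writes are absorbed in RAM, then written
# out as one sorted, sequential run. Reads check the memtable first,
# then the runs, newest first. Real systems add indexes, bloom filters,
# and compaction of runs.

class ToyLSM:
    def __init__(self, flush_threshold=4):
        self.memtable = {}        # absorbs writes at RAM speed
        self.sstables = []        # each entry is one sorted run
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # One big sequential write of sorted key/value pairs.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.sstables):   # newest run wins
            for k, v in run:
                if k == key:
                    return v
        return None
```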

      • temac 6 years ago

        > low-mid tens microseconds

        From a modern CPU point of view, that's an eternity.

        • warrenm 6 years ago

          Still 1000x faster than milliseconds for HDD

    • jandrewrogers 6 years ago

      Database algorithms are typically designed for worst case storage throughput i.e. the bandwidth floor, which for hard disks is not sequential access patterns. In the spinning hard disk days you assumed a few MB/s per disk, because you would often see that in real systems. Even with a disk array, it usually wasn't enough to keep the CPU busy if real-world access patterns kept disk throughput near the floor.

  • sytelus 6 years ago

    There are plenty of in-memory database architectures that take full advantage of the fast random access offered by DRAM. I think the paper focuses on OS design, which is still tied to the assumption of slow disks.

    • jandrewrogers 6 years ago

      In-memory databases are not economically scalable. To a first approximation, database volume is the integral of database throughput, so databases that require extreme I/O throughput also tend to require extremely high volumes of online data. Many typical sensor data models, which drive much of the need for high-throughput I/O, grow by petabytes per day, so the multiple-orders-of-magnitude difference in media cost makes DRAM storage infeasible.

      If the bottleneck is not storage bandwidth, and it isn't for many recent systems and modern software architectures, then there is no practical performance advantage to keeping your data purely in-memory. This has been demonstrable ever since PCIe HBAs with arrays of cheap SSDs became practical.

      • gambler 6 years ago

        >In-memory databases are not economically scalable.

        SAP HANA is widely used for very large data sets.

        • jandrewrogers 6 years ago

          How are you defining "very large"? I've used HANA and even SAP doesn't claim real scalability -- the practical limitations are in their own documentation. And in real scalability testing, it struggles long before you reach those limits. For sensor data, I wouldn't use it for more than a few terabytes, and even then it has a few sharp edges when it comes to performance if you are not careful. No one is going to be putting petabytes into it, and petabytes isn't that much operational data these days.

          And to my point you quoted, SAP HANA is extremely expensive to operate compared to alternatives. The licensing costs alone will kill you, never mind the hardware requirements.

          • PeterCorless 6 years ago

            This checks out. A single 12 TiB RAM u-12tb1.metal on AWS EC2 goes for $30.539/hr. For a month (24 hrs * 30 days), that's a bill of $21,988.08 USD.

            A single i3.16xlarge, with 15.2 TB of NVMe SSD, goes for $4.992/hr, or $3,594.24 USD a month: about 16.3% of the price.

            I'm not going to argue that an in-memory database (access measured in ns) wouldn't give you performance improvements over fetching data from an SSD (measured in µs). But not everyone needs that speed or can afford it at that price.

            Sources: https://aws.amazon.com/blogs/aws/now-available-amazon-ec2-hi...

            https://aws.amazon.com/ec2/instance-types/i3/

            https://aws.amazon.com/ec2/pricing/on-demand/
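            The arithmetic above can be checked in a few lines (using the on-demand rates quoted in this comment; actual prices change over time):

```python
# RAM-vs-NVMe monthly cost comparison, from the quoted hourly rates.

HOURS_PER_MONTH = 24 * 30

ram_rate = 30.539     # $/hr, u-12tb1.metal (12 TiB RAM)
nvme_rate = 4.992     # $/hr, i3.16xlarge (15.2 TB NVMe SSD)

ram_monthly = ram_rate * HOURS_PER_MONTH
nvme_monthly = nvme_rate * HOURS_PER_MONTH

print(f"RAM box:  ${ram_monthly:,.2f}/month")
print(f"NVMe box: ${nvme_monthly:,.2f}/month")
print(f"NVMe costs {nvme_monthly / ram_monthly:.1%} of the RAM price")
```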

            • K0SM0S 6 years ago

              At a high level, it seems expected: since the outside world (around the DB) operates on the same tiered storage (cache > RAM > disk > backup(>n)), it economically doesn't make sense to replace any tier with its direct superior, because the bottleneck becomes the outside world (any machine/infra that's not able to keep up anyway).

              Not with storage but with I/O, that's what the makers of the Cell processor (PlayStation 3) had in mind originally (it was culled before the final design, though): huge I/O, non-stop feeding the beast. The arch-goal was to be able to link Cells together to make a "network of CPUs" (the network becomes the socket link) able to parallelize execution. A hard (wild?) computer science dream/problem. (The final Cell CPU has none of that, iirc.) I think NVLink is a decent comparison of that purpose/design.

              Now, running a whole infra on RAM, not just a DB, makes more sense on paper, and I guess that's what devs do daily with e.g. test environments: a bunch of containers over a ton of RAM. At a small enough scale, upgrading a storage tier makes sense as the cost becomes negligible.

              For a business / at scale, short of extremely specific applications where you'd indeed have not just a DB but whatever calls it also on RAM --- and remembering that this is not economical to serve "more" users or "faster", since you'd scale horizontally for that --- the use-case or endgame of whatever this DB serves should have that speed as a hard requirement. Likely to be 'one' monster itself, like, a supercomputer? Assembling deep learning datasets from real-time feeds on-the-fly? Skynet? Big brother? :) Jokes aside, one needs a beast to feed that'll take no less to justify a 7x more expensive RAM-based anything at scale.

              Normal folks, I think we'll do with caching the hell out of our data for cheap, for now at least. Until RAM becomes abundant and CPU/storage/IO extremely expensive by comparison. (was RAM ever abundant? I can't seem to remember a time when I could just buy without counting, unlike storage or FLOPS relatively to everything else).

          • gambler 6 years ago

            I missed the part about "petabytes a day". I'm curious: which companies store petabytes of data a day in a single database?

            • PeterCorless 6 years ago

              Facebook was hitting 4 petabytes a day of user data back in 2014. Back then, there were only a handful of giants needing that. Today, multiple systems are now being scaled to handle that sort of data volume. Think automotive fleet systems and other IIoT applications. Mobile networks (and security systems for them) managing tens of millions of devices.

              https://research.fb.com/facebook-s-top-open-data-problems/

            • cachemiss 6 years ago

              Generally these tend to be systems that work with machine generated data, my experience is with sensor data generated by automobiles (automated car efforts).

              Naive solutions tend to either summarize the data, store as logs and then run batch processes to index in some form (or leave unindexed and just brute force the computation), or limit the incoming data rate to whatever could be indexed.

              These can work for some use cases, but make it very difficult to operationalize these data sources (i.e. use them to make real-time-ish decisions).

              Even human generated data sources (fb / twitter etc.) can generate something close to that data rate.

    • dorlaor 6 years ago

      The advantage of in-memory is also its biggest disadvantage - it's costly. NVMe can serve almost as fast as RAM and is 100x cheaper. Scylla offers single-digit msec (usually 1 msec) 99th-percentile latency and is way more cost effective.

      • fulafel 6 years ago

        Most databases are small though.

        • AtlasBarfed 6 years ago

          And don't need performance, so clearly we aren't talking about an Access database on a laptop.

  • riazrizvi 6 years ago

    I can see how the memory hierarchy will have layers removed, but it still exists because registers are where operations on data predominantly take place. So you still need to identify small working sets of variables that you extract return values from. I can see how a lot of caching logic will be obviated but not the principle of memory hierarchy, which says ‘load small batches of problem set, solve, store results and repeat’.
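    The "load small batches of problem set, solve, store results and repeat" principle can be sketched as a blocking pattern. This is illustrative only; in Python the cache effect itself won't be measurable, but the shape of the computation is what cache-aware code looks like in any language:

```python
# Process a large input in small, cache-sized blocks: each block is a
# working set that is loaded, fully "solved", and retired before the
# next block is touched.

def blocked_sum_of_squares(data, block=1024):
    total = 0
    for start in range(0, len(data), block):
        chunk = data[start:start + block]   # small working set
        total += sum(x * x for x in chunk)  # solve within the working set
    return total
```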

  • stallmanifold 6 years ago

    In what ways does the database architecture change when one discards the assumption that I/O is the bottleneck? Does it obviate the relational model? I always found the relational model to fit working with data sets very well since it's nice and declarative. Does that turn out to be a bottleneck assuming fast I/O?

    • garmaine 6 years ago

      Why would the model be affected?

thaumaturgy 6 years ago

The paper reads like it's suggesting moving the burden of complexity in dealing with varying hardware interfaces from the kernel to userland so that userland can take direct advantage of higher performance hardware when it's available.

I could see that for some very small niches, but in general I think it would be a terrible development for the industry.

Hardware vendors don't like to share. They don't share code, they don't share common interfaces, they don't even share documentation. As it is now, these are all problems which most userland developers don't have to care about -- those problems get dealt with in the kernel, by developers who specialize in building support for uncooperative hardware.

The average application developer doesn't want to have to figure out how many queues are supported by a NIC just to open a connection on the network. Further: the average application developer isn't experienced enough to do this correctly.

Given the niche where these tradeoffs make sense, I'm not sure why the paper bothers to emphasize security at all.

  • johnm1019 6 years ago

    Disclaimer: IANA kernel developer.

    Is there any benefit to having those specialized developers create frameworks or libraries in userland which other developers can leverage? This way they remain the interface to uncooperative hardware, but the code is in userland, so the bold folks can try their own approach.

    • loeg 6 years ago

      That's basically DPDK / SDPK.

    • dorlaor 6 years ago

      Seastar.io is another one: it's an async engine that utilizes all of the cores in a modern system and is the heart of Scylla's DB. However, it's complex to use correctly.

  • Q6T46nT668w6i3m 6 years ago

    I enjoyed the paper. My impression is that you’d shift the burden to the runtimes that, for many applications, currently sit between POSIX and applications (e.g. see the Q&A about POSIX).

    • asn1parse 6 years ago

      Considering the poor quality of userland software, this seems like a terrible idea.

      • ncmncm 6 years ago

        Anyone willing to spend the money can have top quality userland software.

        Getting a better OS is not always accessible, even with deep pockets.

  • josephg 6 years ago

    I don’t think anyone is suggesting that every application explicitly code for each network device. All of that tricky logic could be put into a userland library instead of the kernel. If we wanted, we could even replicate a kernel device driver style API in userland that network device drivers program against, done as a set of userland library files loaded dynamically based on detected hardware.

    The tricky part wouldn’t be sharing code between applications. We know how to do that. The hard part would be figuring out a clean way to share the hardware between all running applications, given that any app could be terminated at any time, apps might be mutually untrustworthy and apps would have to play nice to share resources. I can imagine a hybrid approach where the kernel allocates network queues to applications and suggests userland device drivers. While running, the apps would have direct access to the hardware. And when the app is terminated the kernel would reclaim the assigned hardware for reuse by other applications.

    • sp33der 6 years ago

      Microkernels tend to do a lot of what you're talking about. They accomplish this via interrupts and virtual mapping of the device memory to userspace for the userland app to access.

      "If we wanted, we could even replicate a kernel device driver style API in userland that network device drivers program against, done as a set of userland library files loaded dynamically based on detected hardware."

      This is how many microkernel-based embedded RTOSes work. Common APIs are written for a hardware device, say a NOR flash chip, and the driver for the onboard hardware implements the interface specific to that hardware. You can then dynamically link the driver specific to your hardware. That's what I was doing as an embedded C developer some 10 years ago, and that solution had already been implemented for at least a decade (or two) before that.

  • robbyt 6 years ago

    As I was reading this, I remembered the days of my youth setting the IRQ and DMA address for my soundblaster (compatible) soundcard.

    • organsnyder 6 years ago

      Be sure to avoid IRQ7. That conflict with LPT1 and/or LPT2 can be brutal.

  • baruch 6 years ago

    At least for NVMe SSDs that is already solved: there is a standard that all NVMe SSDs implement, and you only need one driver for all of them. If you want, you can use SPDK or one of the few other drivers and get full-speed block access.

    What you don't get however is sharing the disk between multiple processes.

  • saltcured 6 years ago

    There is a constant dance at the fringe of high performance systems. It leads to a recurring pattern of "revolutionizing" with some kind of bypass or coprocessor architecture, then eventually reverting to traditional structures as the new performance realities reach commodity levels.

    Part of it is that the economics at the fringe can pursue speed at any cost, and part is the heady appeal of doing things differently for researchers and advanced practitioners. But, in the long view, I think you are right that it is a bad idea. If you care about maintenance and sustainability, you usually find that these bypass solutions get abandoned as soon as more conventional approaches can approximate their speed on newer commodity hardware. So there is huge churn in these specialist devices with specialist APIs and tooling.

    There is a recurring theme in high-performance networking where crazy things are tried and all sorts of fancy protocol offloading written, then eventually deprecated because it is seen as a support burden and a source of bugs. Because each of these specialized stacks has a smaller user base, they have less economy of scale to invest in maintenance and stabilization.

    • baybal2 6 years ago

      > There is a constant dance at the fringe of high performance systems. It leads to a recurring pattern of "revolutionizing" with some kind of bypass or coprocessor architecture, then eventually reverting to traditional structures as the new performance realities reach commodity levels.

      Truly, this is what DC hardware has been about for 20+ years.

      Look at the HPC space, and then look at "uncle Joe hosting." The moment Joe starts looking at HPC and invests heavily in some super-duper RDMA-powered disk, a mainstream hardware maker comes and spoils everything by releasing a mass-market product doing just that, if not better, thus ruining the grand scheme of gaining competitive advantage through some "secret sauce hardware".

      Look at video transcoding or that new "AI" neural network thing. After a few years of entertaining themselves with the idea of purpose-made hardware, the big boys simply went the route of just using a lot of mainstream hardware.

  • baybal2 6 years ago

    > Hardware vendors don't like to share. They don't share code, they don't share common interfaces, they don't even share documentation. As it is now, these are all problems which most userland developers don't have to care about -- those problems get dealt with in the kernel, by developers who specialize in building support for uncooperative hardware.

    Thus more money for us :) I think this fact almost begs to be taken advantage of. See, how things work in corporate storage products, with EMC being the prime example

  • TheSoftwareGuy 6 years ago

    An OS is more than just a kernel.

    OSes would simply ship with userland drivers instead of kernel-space drivers.

  • atq2119 6 years ago

    You should take a look at GPUs.

    All of the complexity of 3D rendering is implemented in userspace, and that has been the case universally in production for more than 5 years now -- closer to 10, really. If you replace "all" by "almost all", you can go back much further than that, all the way to the beginning of GPUs' existence. And yet normal application developers don't have to care, because the driver does it for them.

    The point is, drivers don't have to live in kernel space, they can live in user space as well. Networking folks may start being more serious about this nowadays, but it has been the reality in GPUs for a long time.

yingw787 6 years ago

There was a great blog post I read a while back about constructing a caching layer across network by Dan Luu: https://danluu.com/infinite-disk/

I asked a friend who works in a quant firm and he was like yes it’s true, and it is pretty insane.

I think there’s research Microsoft and Google are doing for RDMA over 100G Ethernet for intra data center communication as well. Pretty neat.

  • HillaryBriss 6 years ago

    neat blog post! i like the table at the top comparing latencies. at one point the post says:

    I'm paying $60/month for 100Mb, and if the trend of the last two decades continues, we should see another 50x increase in bandwidth per dollar over the next decade.

    he seems to have more confidence in his bandwidth provider than I have in mine!

  • baybal2 6 years ago

    I worked on something similar 3 years ago as a sub-sub-subcontractor for a company making DCs for Alibaba. It took them almost 2.5 years after I signed off on the work to roll it out in a limited commercial trial in their Alicloud hosting.

    The original idea was to let purpose made hardware be distributed across DC rather than every server having to carry it: video codecs on FPGAs, hardware wire speed crypto/compression, databases and k/v stores exposed over RDMA, and remote block storage on SSDs.

    I was invited to the opening ceremony for the DC. When the company's bosses were shown SFX-infused 3D graphs allegedly representing their AI things running on it, I was unable to restrain myself from ruining the atmosphere by asking how it was running when all the servers in the DC were shut down :D

  • theincredulousk 6 years ago

    Yes! I was surprised the paper didn't specifically mention RDMA, or to a lesser degree SR-IOV, given all its focus on NICs.

    Also, there may be ongoing research, but it isn't theory at all. HPCs, HFTs, and the cloud providers have been leveraging RDMA for a long time - e.g. Infiniband. Doing it over Ethernet (RoCE) is relatively new, and it isn't necessarily any big leap that it happens over 100G instead of 40 or 1.

    However, an interesting point as network links go to 100G+ (esp. for RDMA) is again on the storage/processing side. E.g. a Wireshark capture on a 100G connection? ~12.5 GB/second, near the max bandwidth of DDR3, and it can fill 64GB of RAM in about 5 seconds at full fire-hose. So again the hot potato of the bottleneck will be passed, at least for maximum sustained performance situations.

    Side note, AFAIK RoCE exists mostly due to non-technical arguments, particularly the inertia created by existing familiarity and deployment of Ethernet in data centers. I think Microsoft was the one flexing on a standards-body to push it through. It is somewhat of a kludge as Ethernet wasn't designed with RDMA in mind - no guaranteed predictable latency, frames can and will disappear if switch buffers overflow, etc. So IMO "research" into the topic isn't super profound - akin to studying how your sedan might be heavily modified to go off-roading almost (but not quite) as well as a pick-up truck.

    Even now many that have the luxury are just going Infiniband from the get-go if RDMA/latency are the key priorities rather than tacked on later.
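    The 100G back-of-the-envelope above, spelled out with the same numbers:

```python
# 100 Gbit/s of line rate, converted to bytes per second and to the
# time needed to fill a given amount of RAM at full fire-hose.

link_gbit = 100
ram_gb = 64

bytes_per_sec_gb = link_gbit / 8               # GB/s on the wire
seconds_to_fill = ram_gb / bytes_per_sec_gb    # seconds to fill RAM

print(f"{link_gbit} Gbit/s is {bytes_per_sec_gb} GB/s")
print(f"{ram_gb} GB of RAM fills in about {seconds_to_fill:.1f} s")
```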

    • Veedrac 6 years ago

      The paper does mention RDMA.

  • AtlasBarfed 6 years ago

    All that huge-scale distributed storage is going to run into CAP concerns with imperfect network links, nodes, and storage units. I get that a lot of that is consumer/social network crapdata with low guarantees to the end user. Dropbox would probably like a little more of a guarantee, and enterprise even more so.

deRerum 6 years ago

In the past (like around the time most programming languages were invented) memory speeds were faster than processor speeds. So all variable accesses were instantaneous. Languages like C did not have to worry about memory hierarchies.

If memory speed is 100ns then you would notice the memory bottleneck around the time when your processor speed is 10Mhz. This point was reached in the mid 1980s with the 286 processor. Yet through the addition of cache memory this bottleneck was hidden from most software. They continued to operate in a bubble as if they were still running on the hardware of the 1980s.

It’s a bit like life itself...we land mammals carry around bags of water under our skin and our cells are still batched in fluids as if we are still living in the environment of the oceans hundreds of millions of years ago.

Many programming languages have been invented since the 90s but as far as I know none of them explicitly model memory latency and make reference to memory hierarchies. It’s as if they still need to maintain the illusion that they are running on the hardware of the past.

(Note: I once read about a language called Sequoia developed at Stanford that explicitly modelled the memory hierarchy. I don’t know what happened to it.)

  • bogomipz 6 years ago

    >"If memory speed is 100ns then you would notice the memory bottleneck around the time when your processor speed is 10Mhz."

    Sorry, I'm not following the math there. What's the relation between 100ns and 10Mhz? Why is that the tipping point?

    • VibrantClarity 6 years ago

      1 second = 1 billion nanoseconds and 1 MHz = 1 million times a second, so if something takes 100ns you can only do it 10 million times per second (10MHz).

    • deRerum 6 years ago

      A 10Mhz processor has a clock cycle of 100ns (0.1 millionths of a second). Those are just rough representative numbers I picked...any particular RAM delays would be different and the actual latency would be complicated by bus speeds and protocols etc.
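      The relationship can be worked through for a few clock speeds (the 100ns memory latency is the rough representative figure from the comment above, not any particular part):

```python
# Cycle time shrinks as clock speed rises while memory latency stays
# fixed, so the cycles lost per memory access grow linearly. At 10 MHz
# one cycle is exactly 100 ns, which is the crossover point.

MEM_LATENCY_NS = 100

def cycles_per_access(mhz, mem_latency_ns=MEM_LATENCY_NS):
    cycle_ns = 1e3 / mhz              # 1 MHz -> 1000 ns per cycle
    return mem_latency_ns / cycle_ns

for mhz in (1, 10, 100, 1000):
    print(f"{mhz:>5} MHz: {cycles_per_access(mhz):g} cycles per memory access")
```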

      • bogomipz 6 years ago

        Thanks for the explanations that math makes sense to me now. Cheers.

  • soup10 6 years ago

    more of a hardware problem... if CPUs let you directly address and manage L1, L2, and L3 cache memory, high-performance programmers would love it and languages like C++ would immediately add support.

    • opportune 6 years ago

      There are tricks to pull chunks of memory into cache as is, no? Not that they are ideal.

      • soup10 6 years ago

        yea, it's a lot of "guesswork" and trial and error trying to optimize cache usage though

    • deRerum 6 years ago

      Anyone can prefetch data into the cache, but letting programmers control cache eviction would open the door to all kinds of user errors which would kill performance.

      • soup10 6 years ago

        true but it could also be very efficient, especially in cases where you have the exact hardware to test on like a game console

the8472 6 years ago

Don't kTLS sockets[0] with crypto offloading[1], sendfile/vmsplice, device-to-device DMA transfers[2] and possibly io_uring solve all those things on Linux? Granted, they're not POSIX, but they're incremental extensions.

Netflix implemented similar extensions in FreeBSD[3]

[0] https://www.kernel.org/doc/Documentation/networking/tls.txt [1] https://lwn.net/Articles/734030/ [2] https://lwn.net/Articles/767281/ [3] https://people.freebsd.org/~rrs/asiabsd_2015_tls.pdf

  • loeg 6 years ago

    Not really. These are all incremental performance improvements on POSIX but don't address the author's concerns / desires in the paper. All of them continue to require the kernel to mediate IO between userspace and the hardware. For some reason the author is fixated on direct user access to partitioned hardware queues.

    Netflix's CDN operating system is based on FreeBSD, and they did add a kind of kTLS implementation, but they did not add it to FreeBSD upstream for a host of reasons.

    • the8472 6 years ago

      The means may be different, but the ends they aim for seem to be similar. The ends here are improved latency, throughput, parallelism, non-blocking APIs and security.

      The above-mentioned improvements aim to address the first four without completely bypassing the kernel, instead changing APIs so they step out of the way most of the time, limiting them to coordination tasks and then either offloading to the hardware or directly piping the data to/from userspace mappings without additional context switches where needed.

      VMs using SR-IOV address the security aspect.

      It's basically the difference between a green field design and tacking all those innovations onto the glueball that is linux. The latter may be ugly and complex, but it has the advantage of being backwards-compatible.

      • loeg 6 years ago

        > It's basically the difference between a green field design and tacking all those innovations onto the glueball that is linux. The latter may be ugly and complex, but it has the advantage of being backwards-compatible.

        I'm definitely not trying to argue in favor of the paper's opinion. :-)

        I just believe the author of the article would disagree that the glueball your earlier remarks describe addresses the author's concerns. They specifically say that Linux cannot be fixed to their satisfaction, or at least argue that idea.

        Look for the section on page 5 headed, "Why not use kernel-bypass techniques on Linux?"

        • the8472 6 years ago

          Well, and all I am saying is that the paper, including the section you mention, doesn't take these latest developments into account. They're not kernel bypasses. They're driver-level optimizations and traditional APIs operating on file descriptors and virtual memory, aiming to solve similar problems.

          This stuff is very new of course, so we have to wait for something to actually integrate all those pieces and then for benchmarks.

    • drewg123 6 years ago

      The host of reasons was mostly one reason: "lack of time".

      Now that we have to deal with multiple vendors of inline hardware TLS offload solutions, it is more critical to get it upstream, and so it is being upstreamed as we speak. The first piece of it (fixing send tags so they are reliable and can be used for inline hw tls in addition to hardware pacing) is up for review right now: https://reviews.freebsd.org/D20117

pulkitsh1234 6 years ago

This paper was very accessible compared to other academic papers. Is there a way to find other papers like this? Maybe it's the lack of math equations and benchmarks.

I like how most of the statements are supported by examples, which makes the paper easier to understand (after some Googling ofc), especially for someone like me who is a million miles away from academia and a programmer who rarely has to think about kernel/CPU/memory intricacies, mostly due to working with higher-level languages and abstractions on top of the OS itself.

My uneducated and naive thoughts on this paper: instead of replacing the kernel with `parakernel`, is it possible to implement a POSIX-compatible layer on top of the parakernel itself? That way drivers, linkers, and other abstractions wouldn't have to be re-implemented for the parakernel.

bluetomcat 6 years ago

We need entirely new OS abstractions to replace the dated notions of hierarchical file systems built around the metaphor of file cabinets, I/O as streams of bytes, terminals, and process hierarchy. Essentially, say goodbye to the Unix model after 50 years. It would open up an entirely new world of software experimentation and craftsmanship.

  • Ericson2314 6 years ago

    Amen to that. Unix was never a good design, and now is severely out of date. We can no longer afford to hack around it.

    • jaas 6 years ago

      "Unix was never a good design"

      Nothing's perfect. Unix was a great design that served its purpose well for a long time, and evolved a bit along the way. Saying it was "never good" is trivializing and, in my opinion, arrogant.

      • dllthomas 6 years ago

        There were always objections.

        "Unix went from being the worst operating system available, to being the best operating system available, without getting appreciably better."

        https://news.ycombinator.com/item?id=19416485

        Which isn't to say that those POVs were necessarily correct, just that it isn't all hindsight.

        • the_af 6 years ago

          To be fair, there are always objections to any working piece of technology.

          It's not helpful to claim "it was never a good design". It was a working design that drove technology to the point it is now, and in that sense it was hugely successful. What kind of perfect and pure tech do some people want, anyway? Pick anything, whatever they like -- say, Plan 9 or OS/2 or whatever -- and I can bet you that in a parallel universe where that tech won, someone on para-HN will claim that it sucked, that it was never a good design, and that things would be better if only Unix had won.

          • dllthomas 6 years ago

            I certainly don't disagree that it was successful, and (coming into technology when I did) I've always personally liked it. I just found the writings around what came before interesting.

            > I can bet you in a parallel universe where that tech won, someone on para-HN will claim that it sucked and it was never a good design and if only Unix had won.

            I'd take the same side of that bet as you :-P

    • thaumaturgy 6 years ago

      Yeah, Unix was a terrible design. Composable architecture combined with standardized i/o and IPC models. I can't imagine anything worse, especially compared to today's proliferation of data formats and inscrutable APIs.

      It's a good thing nobody uses it anymore.

      • taborj 6 years ago

        My Online Sarcasm Detector(tm) is beeping wildly

      • pjmlp 6 years ago

        Had Bell Labs been allowed to sell it in the first place, that would have been exactly its outcome.

    • pzs 6 years ago

      Would you care to elaborate why Unix, in your opinion, was never a good design?

      • pjmlp 6 years ago

        It was re-written in an unsafe language that completely disregarded what had happened outside Bell Labs since 1961, at Burroughs, IBM, MIT, Xerox and other research cathedrals.

        It only took off thanks to Bell Labs being forbidden to sell it, so it got offered for a symbolic price, alongside source code, to major universities, which then decided to build on top of it instead of paying OS street prices.

        Had UNIX been sold in the same vein as other mainframe OSes, no one would be talking about whatever quality it might have had.

        • systemBuilder 6 years ago

          I worked on the Xerox product OS (Pilot), written by my manager, and was on the team with the designers of Multics and Swift, MIT's OSes. At MIT we all ran 4.15bsd and then 4.2bsd, and UNIX was so far ahead that we mostly focused on improving it. Pilot, Multics, and CTSS were nowhere near as powerful or developer-friendly as UNIX. UNIX killed off OS research for a long time!

          • pjmlp 6 years ago

            Given what I know from XDE and Cedar, I fail to see how a text-based CLI, without graphical debuggers, a REPL, and GUI workflows, was more developer-friendly.

        • Ericson2314 6 years ago

          Thank you for being a breath of wisdom in the UNIX-worshiping wilderness.

          You can always get mindshare by being the first massively underpriced thing to market.

          • jstimpfle 6 years ago

            Only that this doesn't even touch on design, or what parts could have been done better.

            • pjmlp 6 years ago

              Sure it does, not using C to start with.

              Or, in another form: C should have done the same as other systems languages and had proper bounds checking, and arrays and strings without implicit decay into pointers.

              Second, having a proper UI story like NeXTSTEP or NeWS, and not the X11 Frankenstein.

    • FussyZeus 6 years ago

      What a ridiculous statement. Yeah, if it were possible to just jump to the best solution each time then we'd benefit massively, but that's not how any science works. The new awesome thing is always built on the back of what was done before. Electric cars have been around since the inception of the internal combustion engine, but we didn't have batteries with the energy density of gasoline to make electric cars practical until decades later, and modern life would've been (and remains, unfortunately) impossible without the energy density of fossil fuels. Hopefully someday we'll get past that, but when that happens, it won't mean that fossil fuels were just a bad idea; they were the only stepping stone to what replaced them.

      • Ericson2314 6 years ago

        The other OSes from Unix's era were already as good or better (or at the very least there was a diversity of ideas). Then the regulation that prevented AT&T from selling it like the others caused Unix to win prematurely.

        Maybe that's how Ford beat the earlier luxury cars in raw profit, but that's not how the ICE beat electric cars.

    • dekhn 6 years ago

      UNIX was a great design for a wide range of people. For example, multiple domains of science were transformed because UNIX made it easy to use microcomputers for collecting real-time data and to share the resulting software among a pool of other researchers.

    • wongarsu 6 years ago

      Which is why we have Windows with the NT family of kernels. It's a reimagining that contains great ideas like a flexible file system stack that allows you to dynamically add drivers that add encryption, compression, antivirus checks etc, and a few bad ideas like a flexible file system stack that's orders of magnitude slower when working with small files than unix implementations.

      Windows is a mainstream OS that is very different from Unix in many regards (even if it still has files). And it really shows that the grass isn't greener on the other side: some parts are great, some parts are terrible, but there aren't a lot of things that offer a universally better tradeoff.

      • Ericson2314 6 years ago

        The VMS people Microsoft poached, and then straitjacketed with legacy concerns, did have good ideas, yes.

    • ncmncm 6 years ago

      There are only two kinds of OS: the ones people complain about, and the ones no one uses.

  • thaumaturgy 6 years ago

    I would like to see an updated file system architecture that's closer to a database, with tagging and all that. And I could see that extending to processes too.

    But what are you imagining as alternatives for i/o and terminals?

    • bluetomcat 6 years ago

      What if we move past file systems completely? Imagine every process having its own "non-volatile" memory area, where all persistent state is kept between restarts, with a clean OS interface for sharing that state with other processes.

      Terminals and process hierarchy are a complete no-go in a fresh design. Every entity in the system can be identified through unique IDs which can be handed down to other processes based on different OS policies.

      • thaumaturgy 6 years ago

        > Imagine every process having its own "non-volatile" memory area, where all persistent state is kept between restarts...

        I'll treat this seriously: as software development practices are just now beginning to mature, with more emphasis on test coverage being considered best practice, sure ... maybe.

        But there's still a lot of software out there for which "restart the application" (or even, "restart the stinking OS") is the only practical solution when it's behaving badly.

        > Every entity in the system can be identified through unique IDs which can be handed down to other processes based on different OS policies.

        Okay, but how do you write a process which is capable of communicating with any other process?

        • bluetomcat 6 years ago

          > But there's still a lot of software out there for which "restart the application" (or even, "restart the stinking OS") is the only practical solution when it's behaving badly.

          Which, in principle, doesn't prevent a program from crashing or misbehaving when it encounters a corrupted file left over from its last run. In the imagined system, the application developer would be fully aware whether he is putting a data structure in the "volatile" or in the "non-volatile" memory areas, with some safety guarantees from the OS. Restarting would zero only the volatile area, enabling a clean starting state.
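          A rough sketch of what that split could look like with today's primitives: an mmap-backed file standing in for the "non-volatile" area, with the OS page cache and writeback playing the role of the clean persistence interface. The `NVRegion` name and the counter layout are invented for illustration.

```python
# Hypothetical sketch: approximating a per-process "non-volatile" memory
# area with an mmap-ed file. Loads and stores go straight to the mapping;
# flush() asks the OS to write the pages back.
import mmap
import os
import struct

class NVRegion:
    """A fixed-size byte region whose contents survive process restarts."""
    def __init__(self, path, size=4096):
        new = not os.path.exists(path)
        self.fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
        if new:
            os.ftruncate(self.fd, size)       # zero-filled on creation
        self.mem = mmap.mmap(self.fd, size)   # shared mapping of the file

    def put_counter(self, n):
        self.mem[:8] = struct.pack("<Q", n)   # an ordinary memory write
        self.mem.flush()                      # ask the OS to persist it

    def get_counter(self):
        return struct.unpack("<Q", self.mem[:8])[0]

# First run stores state; a restarted process sees it again.
r = NVRegion("/tmp/nv_demo.bin")
r.put_counter(r.get_counter() + 1)
```

          Ordinary heap allocations would play the role of the volatile area, zeroed on every restart.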

          > Okay, but how do you write a process which is capable of communicating with any other process?

          Why would you need to communicate with any process? Maybe I need to communicate with the currently running instance of "HTTP Server App" or "Database Engine App", not with an unrelated process my program knows nothing about.

          • thaumaturgy 6 years ago

            > Which, in principle, doesn't prevent a program from crashing or misbehaving when it encounters a corrupted file left over from its last run.

            Of course. But saving state to nvram also has this problem, plus the additional problem of saving broken program state. Like I say, I'm not totally pessimistic on this anymore: there are some promising trends that make me think we might get there in the not distant future. But we're not there yet.

            Maybe supporting some way for a user to manually reset broken program state would be a good enough compromise -- as long as application developers didn't do something dumb, like store their license information in their program state. Autodesk immediately comes to mind there.

            > Why would you need to communicate with any process?

            Because some other process wants to communicate with you.

            Composability, common interfaces, and loose coupling are a huge advantage in Unix-like operating systems, or in any other software architecture that embraces those principles.

            They mean that you can write software with a much longer lifespan, and usually for less effort. In your example, someone may come along wanting to write "Firewall App" long after you've abandoned your software. If your software is extensible and supports standardized IPC, "Firewall App" is possible. If it doesn't, then it gets thrown out and replaced with something else.

          • Khoth 6 years ago

            > Why would you need to communicate with any process? Maybe I need to communicate with the currently running instance of "HTTP Server App" or "Database Engine App", not with an unrelated process my program knows nothing about.

            Today, I piped the output of some process I was running into grep. Neither of these programs knew about the existence of the other.

            Later, I used an image editor to create an image, then used a web browser to upload it to a website. The web browser and image editor were unaware of each other's existence.

            • bluetomcat 6 years ago

              > Today, I piped the output of some process I was running into grep. Neither of these programs knew about the existence of the other.

              Only because "an unstructured stream of bytes" is the way for Unix processes to communicate. It allows for composability, but a very fragile and unsafe one, requiring programs to spit out and read free text, with all the crazy filtering and guesswork needed.

              • garmaine 6 years ago

                You could pipe structured data. See: PowerShell.
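                For instance, newline-delimited JSON over an ordinary pipe gets you a crude approximation of PowerShell's object pipeline (the process records here are made up for illustration):

```python
import json

# Producer emits one JSON record per line; the consumer filters on a
# named field instead of regex-matching free text, as grep would.
producer_output = "\n".join(json.dumps(r) for r in [
    {"name": "nginx", "cpu": 2.5},
    {"name": "postgres", "cpu": 11.0},
])

hot = [json.loads(line)["name"]
       for line in producer_output.splitlines()
       if json.loads(line)["cpu"] > 10.0]
print(hot)  # ['postgres']
```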

              • jnurmine 6 years ago

                Free text? You can just as well pipe structured binary data (such as audio or images) through one or more processing elements. The processors need to understand the data, of course.

                There are also other IPC mechanisms, like D-Bus.

      • stcredzero 6 years ago

        > Imagine every process having its own "non-volatile" memory area, where all persistent state is kept between restarts, with a clean OS interface to share that state to other processes?

        People have been thinking about this for decades. (Probably because it's a compelling idea.) Google "orthogonal persistence."

      • wongarsu 6 years ago

        Currently my mother has a Pictures folder which contains her pictures, where she can open them with the default photo viewer or with an editing program, and where she can trivially add files or make backups with the Windows Explorer. Any alternative interface I can imagine that offers the same flexibility of using the same picture in multiple programs just ends up mimicking files.

        There's room to replace traditional folder structures with tags or anything else, and most traditional file system implementations have really slow search. But I don't see a future in replacing the notion of a file; it maps too well to the intentions of the user.

        • laughinghan 6 years ago

          Well, in a traditional file system, the basic unit of data is a specific notion of a file which is an untyped blob of octets with limited metadata like size, filename, some timestamps, and some very coarse access controls.

          You could quibble about whether this counts as "replacing the notion of a file", but I could certainly imagine it might be useful to have a system that talks directly to disk whose basic unit of data has much more useful metadata, such as a canonicalized MIME type, much more granular access controls, much more granular access and edit history, etc. Has your mother ever downloaded a file with the wrong extension and been unable to open it? I know I have.

          Similarly, the abstracted "everything is a file" notion of a file without random access, which includes sockets and named pipes and such, is an untyped, unsized stream of octets. Message-oriented protocols like WebSockets and HTTP can be, and in fact are, built on top of that, but it could have been the reverse: instead of a stream of octets, TCP could have been a stream of arbitrarily large but finitely sized messages. There almost certainly would have been advantages to such an approach, and applications that didn't want the message framing and just wanted a stream of octets could easily have ignored it.
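          The message-stream alternative is easy to sketch: length-prefix framing layered over a plain octet stream recovers message boundaries. A toy illustration, using an in-memory buffer in place of a socket:

```python
import struct
from io import BytesIO

def write_message(stream, payload):
    """Frame a message as a 4-byte little-endian length plus the payload."""
    stream.write(struct.pack("<I", len(payload)))
    stream.write(payload)

def read_message(stream):
    """Read back one framed message, or None at end of stream."""
    header = stream.read(4)
    if len(header) < 4:
        return None
    (length,) = struct.unpack("<I", header)
    return stream.read(length)

buf = BytesIO()
write_message(buf, b"hello")
write_message(buf, b"world!")
buf.seek(0)
print(read_message(buf), read_message(buf))  # b'hello' b'world!'
```

          An application that only wants octets can simply ignore the framing and read the raw stream.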

      • the-rc 6 years ago

        That's the design of the System/38, AS/400, iSeries and System i. The unique IDs are 128-bit pointers or capabilities.

      • warrenm 6 years ago

        >Imagine every process having its own "non-volatile" memory area, where all persistent state is kept between restarts, with a clean OS interface to share that state to other processes?

        So ... something Docker-esque?

    • the_af 6 years ago

      Wasn't this kind of filesystem an abandoned plan for one of the (now old) versions of Windows? Like Windows 2000 or XP? I don't remember why they ditched this plan. Maybe it wasn't such a good idea after all, who knows.

      • thaumaturgy 6 years ago

        There have been a few toy filesystems which have explored this. BeOS back in the late 90s came the closest to broad adoption, even though it didn't fully embrace the paradigm, just aspects of it.

        I think there are a few reasons it hasn't happened:

        1. Filesystems are hard. Every new filesystem architecture ends up requiring a large pool of talented developers.

        2. Nobody wants to break backwards compatibility. Current filesystem design is so integral to all kinds of software that you just can't expect all software in the world to be updated just to work with a new filesystem paradigm.

        3. Databases are also hard, so a database-like filesystem is doubly so.

        4. Current filesystem architecture is good enough for most stuff. The pain of continuing to use it isn't as great as the pain of changing it.

        None of these reasons make a database-like filesystem inherently bad. It's just not practical right now.

        I think we're moving in that direction though, with things like object storage.

        • the_af 6 years ago

          I agree and didn't mean to imply database filesystems are necessarily bad. Just wondering.

          I forgot about BeOS! Do you remember which version of Windows was going to do it, or at least had preliminary plans for it? I distinctly remember reading about it. Was it an early version of Windows XP or what? I can't remember...

  • sytelus 6 years ago

    I am doubtful that hierarchical file systems have anything to do with slow disks. It's a design pattern that you encounter all over real life, a way for the human mind to manage large quantities of information. Using tags is another pattern, with its own pros and cons. Same goes for I/O streams, process hierarchy, etc. There may be better design patterns out there, but I don't see why these existing design patterns would become obsolete even if disks become as fast as RAM.

  • taborj 6 years ago

    Every system is driven by user adoption. At this point, it will be nearly impossible to dethrone the current methodologies.

    Not saying it shouldn't be done, just that it might fail.

    • zwkrt 6 years ago

      The classic way this has always happened in the past is by solving a specific problem in a hacky-but-cheap way compared to existing tech. The issue I see here is that I can already run Unix on a smartwatch or a car radio, so it's difficult to see where new computing paradigms will come from. Google's Fuchsia is a contender, but I wonder how realistic it is that it would ever become 'the standard'. Maybe the future is a world of truly heterogeneous operating system design, but more likely we will employ a one-size-fits-all approach just like every other industry. All car engines are about the same, all shoes are constructed with similar parts, all stoves are basically interchangeable; it is unrealistic to expect that software will be any different.

    • Razengan 6 years ago

      > Every system is driven by user adoption. At this point, it will be nearly impossible to dethrone the current methodologies.

      iOS, when the iPhone first came out, turned many of the entrenched perceptions about computing devices around on their head, and people embraced it.

      Even without the celebrity power of someone like Apple, an experimental system could still thrive today in the shadows with a small cult of followers nurturing and developing it, until it breaks out.

  • strictfp 6 years ago

    One idea I've been throwing around is to replace files with HTTP resources; everything is a resource. Effectively Plan 9's idea, but the time might be right.

    • worble 6 years ago

      I don't know much about OS design, but is that similar to how Redox is approaching it with "everything is a URL"[0]?

      [0]https://doc.redox-os.org/book/design/urls_schemes_resources....

      • PeterCorless 6 years ago

        If you go that route, bind unique objects to Uniform Resource Names (URN), and then have a mapping to instantiations (URLs), so that the same resource can be found in its multiple locations. Would help immeasurably for a distribution model.

    • BGyss 6 years ago

      Believe it or not, there was a system at Sony Pictures Imageworks that worked like this; no idea what it was doing under the hood. It worked great in practice: every possible production resource was a URL that started with spref:// iirc.

  • Spivak 6 years ago

    What would you replace it with? At the very core you need two things: the notion of a unit of data, and some metadata to address it. To be as flexible as possible, that unit of data would probably be modeled as a sequence of individually addressable bytes, but it could be more structured. Such a thing is dangerous, because if your structure isn't sufficiently expressive, 20 years down the road people will end up imposing their own structure on top of obj.binary and there goes all the work.

    Once you have data and handles to access it you might start wanting some convenience features like access control, locking, namespacing, constraints, relationships.

    I don't disagree that we can drop many of the current filesystem semantics, but fundamentally all that really means is changing the query language used to access objects and manipulate their metadata.

    I also don't disagree about process hierarchy. Being able to express relationships between processes beyond parent-child natively without farming out to an external scheduler would be awesome.

    • scroot 6 years ago

      Object systems

      • silversconfused 6 years ago

        Ok, but what does the level below that look like?

        • scroot 6 years ago

          I'm not sure there has to be a lower level at all, aside from asm and machine language

          • Spivak 6 years ago

            Okay so you have a device driver that exposes your underlying hard disk as a single contiguous segment of blocks of a given size. You have objects which are blobs of heterogeneously sized arbitrary data with an addressing scheme.

            The software that marries these two things is basically a filesystem driver (in that you can implement filesystem semantics on top of it -- hell Ceph does it right now).

            Nothing about a modern Linux/BSD system really stands in the way of doing this.

            • scroot 6 years ago

              Of course, you need some agent that can communicate with the hard disk and that knows how to manage memory. Maybe it's one driver, maybe it's several, but these can also just be objects. Live objects, I mean: not just blobs of data, but actors that have behaviors and are always "running." You don't even need the concept of files or filesystems. And if you do, it's better left to a higher level.

              • laughinghan 6 years ago

                Isn't the whole point of persistent storage like hard drives to deal with the fact that computers shut off sometimes, and nothing can be "always 'running'"?

                Something below the object level (possibly part of the object system, possibly a layer below that) needs to read from disk and bootstrap all those agents/live objects/actors into existence.

                • scroot 6 years ago

                  Well of course they aren't running when the computer is off. And yeah, you'd need something to bootstrap the system the same way you need a bootloader

                • fragmede 6 years ago

                  The storage on disk doesn't have to be a traditional hierarchical filesystem for anything after the kernel (plus drivers) is loaded; as long as the kernel is able to recombobulate the agents off the block-addressable storage device, it can store them however it wants. Simply serializing swaths of memory to/from disk may not be the most efficient, but it should show it's possible.
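                  As a toy version of that serialize-and-recombobulate step (pickle standing in for whatever format the kernel would actually use, and `Agent` being an invented example):

```python
import pickle

class Agent:
    """A minimal 'live object' with some state to preserve."""
    def __init__(self, name):
        self.name = name
        self.inbox = []

    def receive(self, msg):
        self.inbox.append(msg)

a = Agent("logger")
a.receive("boot ok")
blob = pickle.dumps(a)           # what would be written to the block device

restored = pickle.loads(blob)    # what the bootstrap layer would do on boot
print(restored.name, restored.inbox)  # logger ['boot ok']
```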

    • stcredzero 6 years ago

      > What would you replace it with? At the very core you need two things: the notion of a unit of data and some metadata to address it

      There were some experiments along these lines. Instead of files, just make everything an object and have orthogonal persistence for every object. In a way, this was what the early Smalltalk implementations were working towards.

      > I don't disagree that we can drop many of the current filesystem semantics but fundamentally all that really means is changing the query language to access objects and manipulate their metadata.

      One thing which Smalltalk demonstrates is that the query language can simply be the same language used to implement the OS. Activity which looks a lot like database querying, but over objects rather than database rows, was an expert-level debugging trick of Smalltalk programmers.

  • jnurmine 6 years ago

    I think I don't understand. Is there something preventing people from experimenting and practicing craftsmanship at this very moment?

  • AnimalMuppet 6 years ago

    Then... what? Do you have a positive to recommend, or just "not what we've been doing"? (It's not necessarily bad if you don't - when you're doing the wrong thing, the first step to improvement is to stop doing it.)

    But it seems to me that much of this can be done within the existing structure. You don't want I/O as streams of bytes? Great. Whatever new thing you think it should be, you can build that on streams of bytes. Knock yourself out. (It may not be as fast as it would be if it were directly supported by the OS, but you can prove the value of the concept by building on top of the OS.)

    Same thing with the metaphor of file cabinets (I presume you mean the hierarchical file system.) Well, does your OS let you read and write raw disk sectors? No? Fine. Create one giant file that takes up the whole disk, and manage it yourself. Try out whatever different way of managing that space that floats your boat. Again, it will be slower, but again, you can experiment and prove out your concepts right now. You don't need to wait.

    • garmaine 6 years ago

      E.g. use the relational model for data storage instead of files (SQLite is a purposeful step in this direction).
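      A small sketch of what that could look like with SQLite from the standard library: application state as queryable rows rather than loose files. The notes schema here is invented for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")           # or a path, for persistence
db.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, tag TEXT, body TEXT)")
db.executemany("INSERT INTO notes (tag, body) VALUES (?, ?)",
               [("todo", "write draft"), ("todo", "review"), ("done", "ship")])

# Queries replace directory walks and filename conventions.
open_items = db.execute(
    "SELECT body FROM notes WHERE tag = ? ORDER BY id", ("todo",)).fetchall()
print([b for (b,) in open_items])          # ['write draft', 'review']
```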

      Use a single system image with no distinction between network, processor, or core communication unless needed.

  • olliej 6 years ago

    ...except humans instinctively organize things into hierarchies (even something as basic as counting is done hierarchically).

    Just saying "we need to change this" without saying what shortcomings you want to address isn't a super useful statement. Multiple platforms have attempted to make the user interface to their data tag-based, but that simply doesn't scale to the amount of data people can manage with hierarchies.

    Finally, what the heck are you talking about in that last sentence: how does changing the representation result in a new age of experimentation and (???) craftsmanship? Why is craftsmanship gated on the user-level abstraction being bytes? Experimentation already happens today; what does this change?

  • robbyt 6 years ago

    Correct me if I'm wrong, but isn't this what Plan 9 does?

    • AsyncAwait 6 years ago

      I think Plan9 still has the notion of files, folders and such?

  • dorlaor 6 years ago

    At Scylla we initially designed a new filesystem that adheres to shard-per-core, so that a single physical hyperthread has sole access to the data and thus there is no contention.

    While it would be better than current XFS, we've made aio improvements to the latter over the years, and today it's good enough for ScyllaDB.

    Practically, even though Scylla has its TCP/IP stack in userspace on top of DPDK, we learned over the years that it's OK to use the less efficient kernel TCP stack. Most of the overhead and the optimizations can still happen within the DB itself, as long as it controls the memory, the cache, and manages the networking queues.

  • notduncansmith 6 years ago

    Why are these insufficient? What abstractions should we be building on?

  • fragmede 6 years ago

    PalmOS had this! Applications connected to databases and read and wrote records. SQLite acts similarly for the modern era, with a traditional hierarchical filesystem underneath the SQLite DB.

    Looking at modern app-based "file" access, Google Docs and its ilk are that reimagining. The UI is a list of recent files, a small number of features, and then a search box. There's no File -> Save, nor am I forced to pick, using a folder metaphor, where I want to put it.

    That there's (likely) an underlying hierarchical filesystem somewhere below in the stack seems like an implementation detail. As a programmer, there's a library/middleware to be used to access resources, but once inside, object-based access already exists. Looking at video game save files, that's been the case for a while, with the state of objects (in fact, the visible objects the user interacts with) being saved and restored from disk.

    I agree it's not as satisfying as a total paradigm shift in computing on every single level, but the notion that filesystem, byte-stream access is a holdover from a previous era ignores the practical, user-facing progress we've made since.

LorenPechtel 6 years ago

This takes an idea I had years ago and goes much farther with it. My idea: disk and file access is handled by the memory paging system. A 64-bit machine's segment registers can point to a space far bigger than the largest hard drives. Thus a drive ID would simply be a segment register value, and the drive would be accessed by reading/writing memory at an offset from that. A file handle would likewise be a segment register value. The result would be that all surplus memory is used for disk caching, and the paging system takes care of all disk buffering; you could efficiently read/write small chunks of data.

Now let's add their approach: when you cause a page fault by accessing stuff not in memory, you get the context switch, but the actual workload could be handled by an auxiliary controller; it need not be on the CPU.

Changes: locking parts of a file would be on a friendly basis; you would be able to get around the rules. Access to remote files in small chunks would still be inefficient, but the vast majority of accesses are local, and remote accesses are generally documents that are read in their entirety.
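The closest existing analog to this idea is mmap: a file mapped into the address space so that reads and writes are plain loads and stores, with the kernel's paging machinery doing the buffering and caching. A minimal sketch:

```python
import mmap
import os

path = "/tmp/mmap_demo.bin"
with open(path, "wb") as f:
    f.write(b"\x00" * 4096)            # one page of backing store

fd = os.open(path, os.O_RDWR)
mem = mmap.mmap(fd, 4096)              # the "file handle as address range"
mem[0:5] = b"hello"                    # a store, not a write() syscall
assert mem[0:5] == b"hello"            # a load, served from the page cache
mem.flush()                            # writeback, as the pager would do
```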

ktpsns 6 years ago

Strictly speaking, the sentence "I/O is faster than the CPU", aka "memory access is faster than computation", is nonsense, because it compares apples with bananas. One could more precisely say "transferring x data between CPU and SSD is faster than performing the computation f(x) on the CPU", where f still remains undefined.

  • otterley 6 years ago

    The issue being discussed in the article and amongst OS designers is that the I/O throughput seen in the wild today cannot be accepted in real time (i.e., without significant queueing delays) even by otherwise unloaded CPUs.

  • masklinn 6 years ago

    You're misreading the article. Its subject is that, because of the way I/O stacks have been built, CPUs are becoming the bottleneck in I/O; this is an issue for both network and non-volatile storage I/O, e.g.

    > a 40 GbE NIC can receive a cache line sized packet every 5 ns, but the last level cache (LLC) access latency is already up to 15 ns, which means a single LLC access can already prevent the OS from keeping up with arriving packets

    and

    > NVMe SSDs perform I/O faster than the OS can accept new (asynchronous) I/O requests and notify their completion.

    They also note, e.g., that while NVMe provides for 65k command queues, OSes generally have one I/O queue per CPU.

    • jeanmichelx 6 years ago

      Latency ≠ throughput

      • masklinn 6 years ago

        The LLC access is a sequential cost to processing the packet.

        • loeg 6 years ago

          Sure, but the CPU can amortize that cost over many packets if it is infrequent, while maintaining the same throughput. Also, cache line sized packets are relatively tiny.
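          A back-of-envelope model of that amortization (the hit cost and the miss rates below are my assumptions; only the 5 ns and 15 ns figures come from the quoted article):

```python
# Average per-packet processing cost when only a fraction of packets
# incurs the 15 ns LLC access; the rest hit in a closer cache.
ARRIVAL_NS = 5.0   # cache-line-sized packet every 5 ns on 40 GbE (quoted)
HIT_NS = 2.0       # assumed cost when the packet's data is in a near cache
MISS_NS = 15.0     # LLC access latency (quoted)

def avg_cost_ns(miss_rate):
    return (1 - miss_rate) * HIT_NS + miss_rate * MISS_NS

# At a 10% miss rate the average is 3.3 ns -- under the 5 ns budget.
# At a 50% miss rate it is 8.5 ns -- the CPU falls behind the NIC.
print(avg_cost_ns(0.1), avg_cost_ns(0.5))
```

So both sides of the subthread can be right: a single 15 ns access exceeds the per-packet budget, yet throughput survives as long as such accesses are rare enough.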

MrTonyD 6 years ago

Well, I spent a number of years writing drivers for PC systems. Some years the I/O chips were faster than the CPU, and some years the CPU was faster than the I/O chips. DMA was usually slower, just because release cycles for CPUs tended to be faster than release cycles for I/O controllers. Eventually, most driver writers decided that it was usually better to use the CPU, even if the I/O controller was faster: when the CPU got upgraded, you would automatically get a speed boost, while programming an I/O controller was both more arcane and more likely to require a complete reimplementation in a couple of years (along with customer complaints and market share losses).

I'm not saying that things are the same today - but it kind of sounds to me like they are. Back in the day, people were always claiming that we should switch to the newest and fastest I/O controller since CPUs were more general purpose and would therefore always be slower. It just didn't work out that way in practice.

pjc50 6 years ago

Interesting. It's long been the case that a "computer" pretends to be a single processor to the programmer while in fact being a cloud of semi-general processors which communicate through messages. This makes that completely explicit, giving the programmer all the power and hassle involved in speaking as directly to the devices as possible while maintaining isolation. Similar esoteric architectures are already available (e.g. Tilera, or all the way back to the Inmos Transputer).

Given the allocation of particular hardware devices - NIC, RAM, NVMe - to particular processors running a (static?) application process, it's not clear how the filesystem abstraction would work or whether that's simply delegated to the application. This is very definitely a server-focused system as no mention is made of GPUs or interactive devices.

laythea 6 years ago

This is kinda like what they did in the graphics API world. Moving from OpenGL to Khronos in order to "cut the fat" between the user program and the hardware.

  • ambrop7 6 years ago

    s/Khronos/Vulkan/

bhouston 6 years ago

Very interesting shift that happened over the last 2 decades.

We likely haven't designed OSes or CPUs to match this new reality.

  • ajross 6 years ago

    Largely because it's not really a new reality. IBM faced the same issues on the 360s half a decade (edit: sorry, century!) ago -- you could stream data off of stacked platters in a drive into core much faster than the CPU could manage the copy. The solution was to invent "I/O channels", which were early DMA controllers. And the VM layer (when it was added) was cognizant of this, so applications could be written directly to the channel interface.

    There's nothing new under the sun, basically. It's an Ecclesiastes design. I haven't read through the whole article, but my guess is that the "parakernel" interface the authors are positing is going to look a lot like the IBM Channel interface.

phkamp 6 years ago

Congratulations!

You have reinvented the Mainframe Channel Processor!

Your next challenge: Try to avoid reinventing the 3745 Frontend Processor.

  • tinktank 6 years ago

    Why so negative and condescending?

    • p_l 6 years ago

      While I don't like that tone, it's at times hard to not fall into it.

      Because all of this had happened before and will happen again, often without learning anything about the past (example case: NoSQL)

oblio 6 years ago

Is this true for most real life workloads? There's that famous rule-of-thumb indicator for latencies: https://www.prowesscorp.com/computer-latency-at-a-human-scal...

It doesn't seem to be that the orders of magnitude are so close as to require totally rethinking mainstream kernels.

Or am I looking at this the wrong way?

  • jcranmer 6 years ago

    > Is this true for most real life workloads?

    No. It's true if you care about NVMe drives, or high-speed networking--which is to say, it's true if you care about a few kinds of server workloads, but it's absolutely not true for most consumer hardware.

    • blattimwind 6 years ago

      NVMe drives with transfer rates of several gigabytes per second are mainstream now.

    • wtallis 6 years ago

      In consumer use cases, it is common to see the storage bottleneck be the CPU rather than the SSD. Firing up a video game isn't much faster on an Intel Optane NVMe SSD than on a SATA SSD, because the data on disk has to be decompressed and parsed on the CPU before it is usable. A lot of software is still written under the assumption that the disk is slow and that capacity is somewhat limited. Taken together, those assumptions usually lead to single-threaded loading, with decompression/parsing on the same thread that makes the system calls for I/O.
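      The single-threaded pattern described above can be avoided by pipelining: read the next compressed chunk on one thread while the previous one decompresses on another. A minimal sketch with synthetic data standing in for a game's compressed assets (not a real asset format):

```python
import queue
import threading
import zlib

# Synthetic "compressed assets": 8 chunks of 10,000 bytes each.
chunks = [zlib.compress(bytes([i]) * 10_000) for i in range(8)]

def reader(out_q):
    # Stands in for sequential file reads feeding the pipeline.
    for c in chunks:
        out_q.put(c)
    out_q.put(None)  # end-of-stream sentinel

def load_pipelined():
    q = queue.Queue(maxsize=4)  # bounded queue applies back-pressure
    threading.Thread(target=reader, args=(q,), daemon=True).start()
    total = 0
    while (c := q.get()) is not None:
        # CPU-bound decompression overlaps the reader's next "I/O".
        total += len(zlib.decompress(c))
    return total

print(load_pipelined())  # 80000
```

With real files the reader thread blocks in the kernel while the consumer burns CPU, so neither the drive nor the core sits idle waiting for the other.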

ObscureScience 6 years ago

I apologize for not reading much of it yet, but could someone give a quick comparison to the exokernel idea?

  • _0ffh 6 years ago

    The parakernel does not multiplex the resources it partitions, which allows applications to maximize the performance obtained from the underlying hardware. For resources that are multiplexed, it looks much the same to me.

sinisa_cyprus 6 years ago

Only mainframes had channels, and the best implementation is in IBM machines. It is nothing like DMA, or that Intel chip, or anything else. No Unix machine or PC, and not even specialised hardware like Tangent, had anything similar.

It is something like a separate FPU or MMU unit, built for total control of the peripherals, so that the CPU had little or no work to do. Don't forget that device drivers run on the CPU.

wmu 6 years ago

BTW, does anybody know of a paper about doing some DB ops directly on disc controllers? The other day my former colleague mentioned that he came across such a paper (maybe a blog post?), but we couldn't find it. It's a really interesting idea, and I believe it's doable, although under very specific circumstances (disc-vendor specific, sector layout aligned to DB needs, etc.).

Blackstone4 6 years ago

Does this mean we go from virtual machines (e.g. VMware) to Kubernetes & containers in data centers? Something similar to RancherOS?

  • SteveNuts 6 years ago

    Even VMWare has support for running containers on their platform now, so yeah probably.

    Cisco Nexus switches can run Docker workloads also.

  • closeparen 6 years ago

    Kubernetes predecessors like Mesos/Aurora tend to already run on metal, but workloads are still going through the kernel for I/O. To take advantage of this, you would expose NICs and SSDs directly to applications, potentially bypassing the controls currently offered around containers (because these are enforced in the kernel).

inetknght 6 years ago

Look to GPUs for better solutions: make more discrete -- less functional overall but higher performance -- cores, and move them closer to the data; then let the CPU just handle coordination of the discrete processors. Think of SIMD but on a massive scale.

Think of blocks of RAM with math processors. Or the same in your NVMe/NIC/etc.

  • gmueckl 6 years ago

    HP announced their ambition to do that a few years ago, framing it as a grand vision of the future of computing. I haven't heard anything since. This kind of offloading or distributed computing isn't quite a new idea, but it hasn't materialized yet. I suspect that it is too tough a nut to crack for the general case.

    • inetknght 6 years ago

      My understanding is that GPUs have moved toward what I described: thousands of discrete cores with large amounts of math performance but generally terrible (or even non-existent) branching...

      • gmueckl 6 years ago

        And transferring data to and from the GPU is also a source of overhead.

        • inetknght 6 years ago

          Despite having an overhead of transferring data to and from the GPU: it's still faster to move the data to the GPU and let the GPU process it locally than it is to leave it in main RAM away from the processing. Transferring data to and from the CPU is also a source of overhead; even your CPU has cache to bring the data more local. So what's your point?

          • gmueckl 6 years ago

            What you say is not true as a blanket statement: the overhead of transferring data to GPU memory is higher than that of accessing main memory from the CPU, so depending on what you do, the data transfer to the GPU might not result in a performance gain at all.

      • deRerum 6 years ago

        GPUs are latency hiding engines...they address the mismatch between processor clock speeds and memory latency by a unique scheme. Since they can’t improve memory latency, instead they parallelize memory and vastly increase the bandwidth available.

        Once they have increased memory system bandwidth to be able to feed the multiprocessor throughput, the rest of the architecture is designed to make the most efficient use of it.

        They spawn thousands of threads and schedule them in and out really quickly, so processor utilization is always as close to 100% as possible. When a thread is waiting for memory, it is put to sleep in a few clock cycles. When its data is available, it wakes up and does its business. Since the workload is split among thousands of threads, there is always somebody ready to be scheduled. This is why GPUs only make sense for massive workloads.

        The same trick can be used to hide SSD latency.
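        Little's law gives a feel for how much concurrency that trick takes for an SSD (both device figures below are assumed round numbers, not from the thread):

```python
# Little's law: outstanding requests = throughput * latency.
# To saturate the device you need at least that many requests in flight.
IOPS = 500_000        # assumed random-read rate of a fast NVMe SSD
LATENCY_S = 80e-6     # assumed per-request latency (80 microseconds)

outstanding = IOPS * LATENCY_S    # ~40 requests needed in flight
blocking_iops = 1 / LATENCY_S     # a single blocking thread: ~12,500 IOPS

print(outstanding, blocking_iops)
```

One request at a time reaches only a few percent of the device; tens of sleeping "threads" with requests outstanding, GPU-style, hide the latency completely.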

  • convolvatron 6 years ago

    'Systems code' like protocol implementations and device drivers tends to be very control-flow centric. On a SIMD machine, in the worst case this means narrowing the 'vector length' to effectively 1; since these are throughput machines, that's often pretty bad.

    I do agree with you that lots of little processors is a good way forward here, with a careful eye towards reducing sharing of state, but maybe it's useful in this case for them to have their own instruction streams.

    • inetknght 6 years ago

      I thought separate cores do have their own separate instruction streams -- sometimes even completely different architectures and/or supported instructions? Is that not the case?

      • p_l 6 years ago

        No, they don't. Generally you have some level of grouping and complex rules about how and when they can branch.

vkaku 6 years ago

C - (stdio) - (net) = what an ideal programming library looks like. IO ring FTW.

Eventually, you realize that you're trying to use line buffers / getc / ungetc to parse lines out of that packet of data on the io_uring to serve that cat picture for teh Internetz. :)

We need to eliminate variable length protocols to make these interfaces go away.

racuna 6 years ago

On the Exadata architecture (and in recent versions of Oracle Database), the search is performed by the hardware, not the software.

I'm not a fan of Oracle, but things like that are awesome.

  • vetinari 6 years ago

    Really? What I vaguely remember is that the storage nodes were embedded Linux boxes. Yes, they understood indices and would return only the minimum needed, but it was still software.

sly010 6 years ago

Isn't the BPF infrastructure an example of this idea?

warrenm 6 years ago

If you actually knew the storage was fast enough, then sure.

But you can't - except in very specialized (i.e. dedicated) designs.

loeg 6 years ago

Is this like the academia equivalent of an op-ed? No methods, no data, just opinion based on some recent trends?

  • scott_s 6 years ago

    That's not a bad way of thinking about it. The HotOS workshop is not a place where people publish completed work, but rather present new, early-stage ideas. From the call-for-papers (https://hotos19.sigops.org/cfp.html):

    > We solicit position papers that propose new directions of systems research, advocate innovative approaches to long-standing problems, or report on deep insights gained from experience with real-world systems. We seek early-stage work, where the authors can benefit from community feedback. An ideal submission has the potential to open a line of inquiry for the community that results in multiple conference papers in related venues, rather than a single follow-on conference paper. The program committee will explicitly favor early work and papers likely to stimulate reflection and discussion over mature ideas on the verge of conference publication.

z3t4 6 years ago

I/O faster than the CPU turns CS best practices and intuition upside down.

kaetemi 6 years ago

So... separating the data and control planes?

toolslive 6 years ago

the OS kernel has been the problem for quite a while now.

m0zg 6 years ago

Some types of I/O have been faster than the CPU for quite a long time. For instance, a cache miss is typically about 200x slower than accessing data already in the register file. What this means in practical terms is that if you miss cache all the time (aren't garbage-collected languages wonderful?) your 4GHz CPU turns into a 20MHz pumpkin (or thereabouts), and a fully sequential read from a modern _spinning drive_ (150+ MB/sec) could produce more throughput. A consumer-grade 10GbE NIC will leave it completely in the dust, as will USB3.
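Rough arithmetic behind the "pumpkin" claim (the 200-cycle penalty is the commenter's order-of-magnitude figure; the 8-useful-bytes-per-miss assumption is mine, for a pointer-chasing workload that uses one word out of each fetched line):

```python
# A core that misses cache on every access completes roughly
# clock / miss_penalty useful operations per second.
CLOCK_HZ = 4e9
MISS_PENALTY_CYCLES = 200   # ~order of magnitude for a main-memory miss
USEFUL_BYTES_PER_MISS = 8   # one 64-bit word out of each fetched cache line

effective_hz = CLOCK_HZ / MISS_PENALTY_CYCLES      # 20 MHz "pumpkin" rate
byte_rate = effective_hz * USEFUL_BYTES_PER_MISS   # 160 MB/s of useful data

# Comparable to a ~150 MB/s sequential spinning-disk read, and far below
# a 10GbE NIC (~1.25 GB/s line rate).
print(effective_hz, byte_rate)
```

Under those assumptions the cache-missing core and the spinning drive really do land in the same ballpark, which is the comment's point.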

Ericson2314 6 years ago

> A prototype parakernel written in Rust is currently under development

...Where?

agumonkey 6 years ago

Are we going to have transputers? :)

ummonk 6 years ago

Except I/O is still slower than CPU...