"The received wisdom suggests that Unix’s unusual combination of fork() and exec() for process creation was an
inspired design. In this paper, we argue that fork was a clever
hack for machines and programs of the 1970s that has long
outlived its usefulness and is now a liability. We catalog the
ways in which fork is a terrible abstraction for the modern programmer to use, describe how it compromises OS
implementations, and propose alternatives."
fork() and exec() reflect the Unix philosophy: small tools for separate tasks, which gives you great composability. You can set up rights, etc. for your new process by calling any number of system calls between fork() and exec().
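To make the "any number of system calls between fork() and exec()" point concrete, here is a minimal Python sketch. It assumes a POSIX system and Python 3.9+; `run_in_dir` is a hypothetical helper name, not something from the thread.

```python
import os
import sys
import tempfile

def run_in_dir(argv, cwd, stdout_path):
    """Run argv with its own working directory and stdout, via fork + exec."""
    pid = os.fork()
    if pid == 0:
        # Child: ordinary system calls customize the process before exec.
        try:
            os.chdir(cwd)
            fd = os.open(stdout_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
            os.dup2(fd, 1)            # make the file the child's stdout
            os.close(fd)
            os.execvp(argv[0], argv)  # does not return on success
        finally:
            os._exit(127)             # only reached if setup or exec failed
    _, status = os.waitpid(pid, 0)
    return os.waitstatus_to_exitcode(status)  # Python 3.9+

# Usage: the child prints its working directory into out.txt.
workdir = tempfile.mkdtemp()
out_path = os.path.join(workdir, "out.txt")
exit_code = run_in_dir(
    [sys.executable, "-c", "import os; print(os.getcwd())"], workdir, out_path
)
```

Nothing here needed a dedicated "spawn with options" API: chdir() and dup2() are the same calls any process would use on itself.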
Windows is a great example of the alternative (following "Windows philosophy"): There are about three different API calls for creating a new process, each with a heap of complicated optional arguments. The API becomes more complex, less composable, less extensible and less powerful. But also easier to reason about and easier for the kernel to provide, and arguably with fewer footguns.
In its original implementation, fork() was pretty trivial. All it did was create a new process entry in the kernel table, with all the pages and capabilities and such copied from the original process. Then mark all pages as copy-on-write, and return to the caller. Maybe not trivial, but much less complicated than loading an executable file from disk.
My understanding of Linux internals is maybe 20 years out of date, so I am legitimately curious what makes fork() so complicated these days.
fork() is not trivial now. Processes are huge now -- they have huge heaps among other things. Copying all that is expensive. In the 80s we tried COW, but that turns out to be very slow as well. What operating systems do now is immediately copy the resident set, then do COW for the rest of writable memory, but in large, multi-threaded processes, this is still too slow.
Hrm. Googling "fork linux copy-on-write" seems to find a lot of stack overflow answers from 2014-2015 claiming Linux marks pages as copy-on-write when fork() is called. I didn't see anything more recent in the first page of results.
I could see it being worthwhile to immediately copy a few pages, like the top of the stack, but copying the whole resident set seems excessive. Especially since some of that data might not even be written to.
So the problem is what happens to the old and new processes after the fork. To CoW, you need to mark all the pages read only in _both_ old and new which means that every memory write in the caller will now pagefault since the OS now has to lazily copy on both sides. So with true copy on write the fixed costs may be low but the marginal cost per memory write may be high in both parent and child.
In this case you can see why the resident set is copied, yes? It’s the smallest amount of memory that guarantees predictable performance subsequent to the call returning.
If the parent is threaded and the host has more than one CPU, then fork() == TLB shootdowns, which are slow.
As well there's the cost of all those page faults that the two processes are likely to take to do the copying.
And lastly there's all sorts of complexity involving multiple parent threads calling fork(), or the child calling fork() again (or vfork()) before calling exec.
It's just much easier to copy the resident set and mark the address space as being CoW, because now you only have to worry about page faults for pages that are not in core anyways and so were going to fault anyways, and that means you don't have to worry about TLB shootdowns either (if a page is not in core, it's not referenced by any TLB either). You still have the multi-fork issues, but now you can use an atomic reference count on the address space.
The classic pattern was that you'd spin up something like Apache, load it full of read-only data, then start forking its children. Hoping that you'd share memory between children.
With what you describe, you'd share nothing because at the point you've loaded it up, all the data you just loaded is resident. :-(
The posix_spawn() and posix_spawnp() functions provide the
functionality of a combined fork(2) and exec(3), with some
optional housekeeping steps in the child process before the
exec(3). These functions are not meant to replace the fork(2)
and execve(2) system calls. In fact, they provide only a subset
of the functionality that can be achieved by using the system
calls.
Also, there's no way to set resource limits in the child process, nor switch user or group ID, using posix_spawn().
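For comparison, a minimal sketch of the posix_spawn() style using Python's `os.posix_spawn` wrapper (POSIX only, Python 3.9+ for `os.waitstatus_to_exitcode`). `spawn_with_stdout` is a hypothetical helper name. The "housekeeping steps" are passed as a declarative list of file actions rather than written as code that runs in the child.

```python
import os
import sys
import tempfile

def spawn_with_stdout(argv, stdout_path):
    """Spawn argv with stdout redirected to stdout_path, without calling fork() ourselves."""
    file_actions = [
        # Open stdout_path as fd 1 in the child before the new program starts.
        (os.POSIX_SPAWN_OPEN, 1, stdout_path,
         os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644),
    ]
    # posix_spawn() needs a full path to the executable; argv[0] is assumed to be one.
    pid = os.posix_spawn(argv[0], argv, os.environ, file_actions=file_actions)
    _, status = os.waitpid(pid, 0)
    return os.waitstatus_to_exitcode(status)

# Usage: run a child interpreter and capture what it prints.
out_path = os.path.join(tempfile.mkdtemp(), "out.txt")
exit_code = spawn_with_stdout([sys.executable, "-c", "print('hi')"], out_path)
```

Anything that cannot be expressed as a file action or spawn attribute (the resource limits and uid/gid switches mentioned above, for instance) is simply not available on this path.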
True. The main problem with the Unix default is that there wasn't a way to set O_CLOEXEC on all new FDs race-free until recently. That's a real problem. FD leaks to children can be bad, but most of the time they are not the end of the world, and often one can steal a closefrom() implementation from a BSD or Illumos as a workaround when you know exactly what you want to allow the child to inherit.
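A small Python sketch of the close-on-exec point, assuming a POSIX system. Here `os.open` is asked for `O_CLOEXEC` explicitly, so the flag is set by the same system call that creates the descriptor. `is_cloexec` is a hypothetical helper name.

```python
import fcntl
import os

def is_cloexec(fd):
    """True if fd will be closed automatically across exec()."""
    return bool(fcntl.fcntl(fd, fcntl.F_GETFD) & fcntl.FD_CLOEXEC)

# One system call creates the descriptor with close-on-exec already set,
# so no other thread can fork+exec in between and leak it to a child.
fd = os.open(os.devnull, os.O_RDONLY | os.O_CLOEXEC)
created_cloexec = is_cloexec(fd)

# Opt in explicitly for the one descriptor a child is meant to inherit.
os.set_inheritable(fd, True)
after_opt_in = is_cloexec(fd)
os.close(fd)
```

The design choice is "inherit nothing unless asked", which is the opposite of the historical Unix default the comment describes.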
Or more importantly, IPC mechanisms like mutexes. If they're in shared memory, you now have two problems. The runtime of a very, very popular scripting language does this.
The Windows approach is not the only alternative. Simply provide an API to create a process in a suspended state, then adjust its properties via its pid/handle, and then start the process execution.
A huge number of tricky thread problems go away if the child thread is blocked at startup, and allowed to run only after the parent allows it. To retrofit this, it is easiest to lock a mutex before spawning and have the child block on that. Then the parent unlocks it to let the child run.
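The retrofit described above (lock before spawning, child blocks on the lock, parent unlocks) can be sketched with Python threads. The names `start_gate`, `worker` and `results` are illustrative only.

```python
import threading

results = []
start_gate = threading.Lock()

def worker():
    with start_gate:      # blocks until the parent releases the gate
        pass
    results.append("worker ran")

start_gate.acquire()      # lock before spawning
t = threading.Thread(target=worker)
t.start()

# The child exists but cannot get past the gate, so the parent can finish
# whatever setup it needs without racing against it.
results.append("parent finished setup")

start_gate.release()      # now let the child run
t.join()
```

The ordering of `results` is fixed by the gate, not by scheduler luck, which is the whole point of starting the child blocked.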
I think shared libraries can spawn threads in their load/init phase that you don't know about. Then you are hosed, but you only find out through sporadic weird problems which, if you restart on failure (e.g. in a pre-forked worker scenario), you might never even really care about.
I always felt that the way to create new processes and new threads should initially start the same. A new process is simply a thread that after creation does a syscall to isolate itself from the other thread and receive a unique copy of all resources.
This could also solve the issue with forking from multithreaded programs, since we can ensure we own all shared resources when we isolate our thread and thus effectively become a new process.
So instead of fork we have clone/isolate.
A new thread can also of course immediately suspend itself and allow other threads to work on its data in some way; they then give it a signal to resume and then execute if need be.
Has Linux gained syscalls equivalent to those Windows API calls yet? Or is linux too different from the windows kernel to make that happen? (In that case, what does WINE do?)
Windows doesn't have fork(). It has a real, fully mature thread and process model. In Windows NT, every process consists of a handle that is a "Process", which in turn points to a structure containing a list of "Threads". A process is done when its main thread exits or all threads exit, whichever is defined by the main process. Fork/Exec is replaced with CreateProcess (or ShellExecute, your choice).
From my perspective watching the various techniques used by a multitude of operating systems over many decades, the Windows-style process+thread model seems to be winning out over the UNIX fork() model.
For example, PostgreSQL seems to suffer greatly from the "forked process per connection" model, necessitating front-ends that do connection pooling. Database engine after database engine seems to go through this phase and then "upgrade" to either a single-process thread pool model, or start using async IO in some way. (Web servers also.)
For reference, Microsoft SQL Server back in the 1990s on Windows NT 4 could handle more connections than PostgreSQL in 2020...
It does (or maybe did, not sure if it still works) have the literal equivalent of fork() - NtCreateProcess() with NULL SectionHandle argument creates a new process which is a clone of the caller. However, it never really worked. The Win32 API did not support forking, and so any process which forked itself and then tried to invoke any Win32 API calls would soon crash or fail mysteriously. In theory forking did work for pure NT API applications, but those are rather limited in their abilities.
If you mix fork with threads, you're going to have a [undefined behavior] time. It seems like if you link with the sqlite that comes with macOS, you're using threads whether you like it or not. I think ending up at "you shouldn't use fork() at all" is a bit of an extreme conclusion, though.
BTW, article title needs a (2016). It appears that the relevant Python bug has long since been closed, by avoiding linking with the system sqlite on macOS.
> I think ending up at "you shouldn't use fork() at all" is a bit of an extreme conclusion, though.
Is it? There are more descriptive (as opposed to procedural) APIs which behave in a safer and more well-defined manner to do it these days. Unless you're implementing a shell, fork has never been a great tool.
As one commenter noted 3 months back:
> The dense fog lifts, tree branches part, a ray of light beams down on a pedestal revealing the hidden intentions of the ancients. A plaque states "The operational semantics of the most basic primitives of your operating system are designed to simplify the implementation of shells." You hesitantly lift your eyes to the item presented upon the pedestal, take a pause in respect, then turn away slumped and disappointed but not entirely surprised. As you walk you shake your head trying to evict the after image of a beam of light illuminating a turd.
IMO, fork is strictly better than threads as a tool for having operations perform off of the main thread; they get all the state they need at the beginning and they can use IPC to return the result.
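A minimal sketch of that pattern in Python, assuming a POSIX system: the child starts with a copy of all parent state and hands its result back over a pipe. `run_in_child` is a hypothetical helper, and error handling is deliberately thin (if the child fails, the parent's `pickle.load` raises `EOFError`).

```python
import os
import pickle

def run_in_child(func):
    """Run func() in a forked child and return its result through a pipe."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: sees a copy of everything the parent had at the fork.
        try:
            os.close(read_fd)
            with os.fdopen(write_fd, "wb") as w:
                pickle.dump(func(), w)
            os._exit(0)
        finally:
            os._exit(1)   # only reached if func() or pickling failed
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as r:
        result = pickle.load(r)
    os.waitpid(pid, 0)
    return result

state = {"count": 1}

def bump():
    state["count"] += 41  # modifies the child's private copy only
    return state["count"]

child_view = run_in_child(bump)
```

After this runs, the child reported its updated count, while the parent's `state` is untouched: the inputs were implicit, the output was explicit.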
TFA does allow for off-process operations, but all of the inputs to the operation would need to be passed explicitly. In this sense, I suppose TFA isn't arguing against multiprocessing per-se, but against the specific type that implicitly includes all of the current process state (which has both up- and down-sides).
That is basically how NGINX works if you run it in daemon mode. When you start or reload the server, the main process initializes common state, then forks to become a worker process. Although I would recommend avoiding any IPC past that if possible.
> In this sense, I suppose TFA isn't arguing against multiprocessing per-se, but against the specific type that implicitly includes all of the current process state (which has both up- and down-sides).
You don't have to suppose anything, TFA specifically says that you should use posix_spawn or immediately exec() after forking.
It doesn't imply or hint, let alone say, that threads are superior, it only mentions them because they interact badly with fork() and that's the issue they'd hit. It's not like threads are the only thing which interacts badly with fork.
Isn't the first use-case a pretty debatable / bad one? By daemonizing internally, you make service management and supervision of the program much more difficult, and if you include a non-daemonizing mode for debugging you now have two different runmodes with a pretty significant semantics difference, only one of which is easily inspectable.
Daemonizing is a thing of the past with modern restarter frameworks, like SMF, systemd, supervisord, etc. But daemonizing was always an option, not a requirement, and as an option, it's safe enough to provide it for those who don't use a restarter.
I love this sysadmin comment and the GP dev comment. The key is to get the sysadmin team putting requirements on the code; whether the restarting ends up in-process or in a nice, small, separate Unix tool that just does restarts well is an outcome of that process.
fork + threads is not undefined behavior. It is safe as long as you only do “async-signal safe” functions in the child. The child will be single-threaded.
Note that this includes most of your standard syscalls, like (importantly) write(), read(), close(), chdir(), as well as certain “obviously safe” library functions like strlen(), memcpy(), etc.
Non-multithreaded programs can fork() how they like and do whatever they want after (mostly).
> fork + threads is not undefined behavior. It is safe as long as you only do “async-signal safe” functions in the child. The child will be single-threaded.
Yes, but the async-signal-safe restriction is pretty severe, so you have to know what you're doing. Yes, that's also true of vfork(), but at least vfork() will be much faster.
> Non-multithreaded programs can fork() how they like and do whatever they want after (mostly).
Only as long as they haven't used libraries that are not fork-safe prior to calling fork(). And you still need to do things like fflush() stdio handles prior to fork()ing.
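The stdio point is easy to reproduce. The sketch below uses Python's buffered file objects as a stand-in for C stdio, on the assumption that the buffering behaves analogously; `write_then_fork` is a hypothetical name. Data still sitting in a user-space buffer at fork() time exists in both processes and gets written twice.

```python
import os
import tempfile

def write_then_fork(flush_first):
    """Write one line through a buffered file, fork, and return the file's final contents."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "out.txt")
        f = open(path, "w")
        f.write("hello\n")        # sits in the user-space buffer
        if flush_first:
            f.flush()             # buffer is empty before the fork
        pid = os.fork()
        if pid == 0:
            try:
                f.flush()         # child flushes its copy of the buffer
            finally:
                os._exit(0)       # skip the parent's cleanup handlers
        os.waitpid(pid, 0)
        f.close()                 # parent flushes its own copy
        with open(path) as g:
            return g.read()

unflushed = write_then_fork(flush_first=False)
flushed = write_then_fork(flush_first=True)
```

Without the flush the line appears twice in the file; with it, once.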
In general, I don't see how one could safely rely on a third-party library spawning or not spawning threads unless they explicitly make guarantees regarding not using them as part of their public contract.
The dense fog lifts, tree branches part, a ray of light shines down on the ruins of a moss-covered pedestal revealing the hidden intentions of the ancients. A plaque states "The operational semantics of the most basic primitives of your operating system are optimized to simplify the implementation of command line shells." You look upon the pedestal, pause in respect, then turn away disappointed but unsurprised. As you walk you shake your head trying to evict the after image of a beam of light illuminating a turd.
BTW, it's unclear whether the turd is... the specific truth revealed, or the revelation itself (since it could be incorrect). It's still a glorious comment.
Thank you! The goal was to evoke the same emotional response I felt when this idea struck me (I think when I was reading earlier comments in that thread in fact), so I guess there's as much in the sequence as the specific objects. The semantic design of fork was always incomprehensible to me (not what but why), so the setting and pedestal and plaque represent how strongly it struck me and the depth of its historical explanatory power. The turd is how I feel about it. I don't resent the designer's history or motivations or incentives and I'm happy to know the truth. I'm just sad/depressed that this is the reason why things are the way they are, that the original design intent is so misaligned with the needs of modern systems, how it limits our capabilities today even down to hardware, and how unlikely it is for this to change. I don't suppose a turd would last very long on a pedestal, or that the ancients would put a turd there on purpose. Maybe they put something beautiful there back then, but times have changed and now we're stuck with it and it's kind of shit.
Don't get too sad about legacy. A lot of things were brilliant once that aren't now. I do still feel that fork+exec was brilliant then, just not now. Deprecation is hard, but we can celebrate what legacy made possible.
fish shell uses posix_spawn sometimes because of its performance benefits. We can't use it in the following cases:
1. No analog to tcsetpgrp, so it's no good if job control is enabled
2. No analog to fchdir, meaning you have to synchronize with fchdir elsewhere in the program
3. Error codes do not convey enough information for good error messages (e.g. if a file doesn't exist, posix_spawn doesn't tell you which file)
4. Inconsistent behavior around dup2 fd redirections and O_CLOEXEC.
5. Inconsistent behavior for shebangless scripts
These are basically deal-breakers so fish also supports a fork/exec path. However the performance benefits of posix_spawn are too real to ignore so fish uses posix_spawn when it can, and fork/exec when it must.
Except that the set of things you are allowed to do after vfork() and before exec() is extremely small. I agree that things should use vfork or posix_spawn when possible, but it wouldn't surprise me at all if there's a similar list of cases when a vfork() path wouldn't be possible.
OK very interesting ... is the performance benefit on certain platforms, or everywhere? A previous thread says Ninja is faster on OS X and Solaris because of it.
Though that does seem like a large number of corner cases, probably learned through painful experience :-/
Not sure if it is worth the platform-specific code, but for #2, macOS 10.15+ has `posix_spawn_file_actions_addfchdir_np()`.
I think most of these are deficiencies in the available posix_spawn actions, not anything inherent. Of course getting all the relevant OSes to add new functionality is a huge pain. The error handling seems bad though.
Another danger of using fork is that it duplicates the internal state of pseudo-random number generators. It's a great way to accidentally take the same random samples in every process, utterly trashing any statistics you were intending to do. Bonus: the Python multiprocessing module silently uses fork by default. Person A writes a "make multiprocessing convenient" library, Person B writes a sampling library, you put them together and... whoops!
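A minimal Python sketch of the duplicated-state problem, assuming a POSIX system. It uses an explicit `random.Random` instance, because (as far as I know) recent CPython versions reseed the module-level generator in the child after fork, while generators you create yourself get no such treatment. `draw_in_child` is a hypothetical helper.

```python
import os
import random

def draw_in_child(rng, reseed=False):
    """Fork, draw one sample from rng in the child, and return it to the parent."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        try:
            if reseed:
                rng.seed()        # reseed the child's copy from OS entropy
            os.write(write_fd, repr(rng.random()).encode())
        finally:
            os._exit(0)
    os.close(write_fd)
    with os.fdopen(read_fd, "r") as r:
        value = float(r.read())
    os.waitpid(pid, 0)
    return value

rng = random.Random(1234)
child_draw = draw_in_child(rng)   # child draws from its copy of the state
parent_draw = rng.random()        # parent draws from the same starting state
```

Here `child_draw` and `parent_draw` are the same number. Passing `reseed=True` avoids that, but only because the caller knew a fork was happening, which is exactly the coordination problem between libraries described above.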
I don't think that's really a viable strategy in practice in an ecosystem as complex as python's. There's too many libraries and too many little corner cases and interactions around what the behavior should be.
For example, suppose I am using library A and I initialized the random number generator with a fixed seed. Clearly when I fork it's not appropriate for A to reseed, because I wanted fixed behavior. Something is very wrong so probably there should be an exception. But now suppose I was using library B which was using A and B handles getting system entropy to seed A. Now it is clear that when I fork I probably want B to reseed A, but alas A has already raised an exception because it was given a (from its perspective) fixed seed. So now A needs to be redesigned to be given a seed and like some sort of intent on what should happen when forking, and oh my god wow this is creating a lot of work for everyone everywhere this is not actually going to be done consistently and cannot be trusted.
pthread_atfork functions aren't called if the application calls the clone syscall directly. The right solution is MADV_WIPEONFORK on Linux, or MINHERIT_ZERO on OpenBSD:
FWIW this is the same reason you can't implement a portable Unix shell in portable Go. (And similar issues with an init daemon.)
Go only exports a combined fork-and-exec, syscall.ForkExec() (which os.StartProcess() wraps) -- there is no standalone Fork(), because the things you can do between the calls could break Go's threaded runtime. (Goroutines are multiplexed onto OS threads.)
That is, the space between fork and exec is where pipelines are implemented, but also entire subinterpreters/subshells. The shell actually uses copy-on-write usefully. (And yes I'm aware that there's a good argument that the shell is almost the ONLY program that needs fork() !)
----
A lot of people have asked me why not implement Oil in Go and various other languages, so I wrote this page:
So the funny thing is that Python is a lower level language than Go for this particular problem. It doesn't do anything weird with regard to syscalls. I'm still looking for help on this (and donations to pay people other than me):
> The shell actually uses copy-on-write usefully. (And yes I'm aware that there's a good argument that the shell is almost the ONLY program that needs fork() !)
It's been a while since I looked at it, but I believe Android uses fork for its copy-on-write semantics to optimize app startup. On boot it initializes a single instance of the app runtime environment. Then when you launch apps, that initial process is forked. As a result you do not need to reinitialize the runtime for every app launch.
Yes I think the argument is that Android (and Chrome) could use something like vfork or posix_spawn().
I'm not sure which, if any; I'd like to see an analysis of that... The issue is what kernel state is preserved/shared across the process creation call.
Every process sort of has a "mirror" in kernel memory. The user memory is CoW, and I suppose you also have to choose whether to copy or reference every kernel data structure as well --- open files in FD tables which point to disk/pipes/sockets, locks which seem to be nonsensical, etc.
But probably you can get the "warmup" property without the full semantics of fork(). That is the CoW of user memory is a somewhat separate choice from the kernel data structures.
So there are definitely cases where a shell uses fork/exec like Ninja, so you could imagine optimizing it. But the subshell/subinterpreter case is probably the most general -- the language semantics depend on it. And it's actually useful, e.g. this "alternative shell challenge":
This is moderately common for environments where you are pushing a lot of startup work into the dynamic linker and will be launching processes frequently. Loading shared libraries for example.
You have a parent process which uses dlopen() to load all the libraries you want to avoid re-linking. When you want to spawn a child, rather than exec() you dlopen() an object with your child's main() and call it. For the case where you have enough libraries this is much faster than an exec(), saving tens of seconds on every application launch if you have a really bad case of C++.
There are some small surprises which become obvious with a little thought. You are responsible for everything that normally happens in your process before main() is called. ASLR is only done once per session. People rarely think to fix up argv[] for ps and friends in the first version.
I think this turns out to be a tangent, but at least superficially it is possible for a C program to "do" shell pipelines without use of fork or vfork (directly) but rather by posix_spawn. I suppose "portable go" does not directly wrap posix_spawn so this option may not be on the table for you.
Typical use: `./a.out seq 3 2 9 -- cat -n` is similar to `seq 3 2 9 | cat -n` except that the return value is nonzero if either side's return value is nonzero.
That said, I wouldn't be surprised if there's something important I'm overlooking here.
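The commenter's program is not shown here, so the following is only a rough Python sketch of the same idea, using `os.posix_spawnp` (POSIX only, Python 3.9+). `spawn_pipeline` is a hypothetical helper, and the "nonzero if either side fails" rule is taken from the description above.

```python
import os
import sys

def spawn_pipeline(left_argv, right_argv):
    """Run `left | right` with posix_spawnp; return nonzero if either side fails."""
    read_fd, write_fd = os.pipe()
    left = os.posix_spawnp(left_argv[0], left_argv, os.environ, file_actions=[
        (os.POSIX_SPAWN_DUP2, write_fd, 1),   # pipe write end becomes stdout
        (os.POSIX_SPAWN_CLOSE, read_fd),
    ])
    right = os.posix_spawnp(right_argv[0], right_argv, os.environ, file_actions=[
        (os.POSIX_SPAWN_DUP2, read_fd, 0),    # pipe read end becomes stdin
        (os.POSIX_SPAWN_CLOSE, write_fd),
    ])
    # The parent must drop both ends, or the reader never sees end-of-file.
    os.close(read_fd)
    os.close(write_fd)
    codes = []
    for pid in (left, right):
        _, status = os.waitpid(pid, 0)
        codes.append(os.waitstatus_to_exitcode(status))
    return next((c for c in codes if c != 0), 0)

# Usage, with child interpreters standing in for `seq` and `cat -n`.
py = sys.executable
ok = spawn_pipeline(
    [py, "-c", "print('ping')"],
    [py, "-c", "import sys; sys.exit(0 if sys.stdin.read().strip() == 'ping' else 5)"],
)
failed = spawn_pipeline(
    [py, "-c", "import sys; sys.exit(3)"],
    [py, "-c", "import sys; sys.stdin.read()"],
)
```

This covers plain pipelines of external commands. It does not cover subshells or anything else that needs the child to keep running the parent's own code, which is the case the shell authors in this thread care about.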
But to be fair, the only times I can recall using fork() without exec() were forking network servers, and that was mostly me learning about doing network stuff, and a forking server was the easiest to implement manually.
Oh yeah, and that one time I accidentally wrote a fork bomb trying to stress test a DNS server. At least I learned something from my mistake. ;-)
EDIT: To me, using fork() without exec() is kind of like operator overloading - there are cases where it absolutely is the right tool, but these aren't very numerous, so one should exercise caution. A lot.
"After a fork() in a multithreaded program, the child can safely call only async-signal-safe functions (see signal-safety(7)) until such time as it calls execve(2)."
I don't know why so many people still hit this issue when it already told you what you can do and not do in the document. I've done this sort of things without any issue.
fork() and the exec system calls exist because they were easy to implement in the 70s on PDPs, and fork() was cheap enough then, but much more importantly, it got the shell developers out of having to write and evolve a more complex API in the kernel. With fork/exec a shell developer could try lots of variations for executing a pipe command w/o having to develop any more kernel code.
For example, until BSD came along, not much had to change in kernel land for any shell. Job control meant that the shell would need to put all the processes for a job in the same pgrp, and also there was a need to add `setsid()`.
Hm a lot of that doesn't match my understanding, which is more like:
1. Unix has fork() because it was influenced by Multics. I don't have the citation now, but I think some parts of Unix were from ITSS, perhaps the hierarchical file system, and some were from Multics. It was a drastic simplification of those systems, but with the same ideas.
---
2. The shell developer and the kernel developer were really the same person -- Ken Thompson. I link to his original paper in this post [1]. The original Thompson shell had pipelines and redirects, which are most of what happens between fork() and exec() in a shell.
Also of note is a video by Stephen Bourne who says that Ken Thompson being away at Berkeley was a good time to turn his shell into a programming language (Bourne shell) [2].
Similarly, I read that Bill Joy added chroot() to Unix simply because he needed it for something he was doing one time (building another Unix system, I would imagine).
---
3. Bill Joy also added job control to the kernel and the terminal for his csh shell. It's a very tightly coupled and ugly design.
So the key point is that we shouldn't assume any notion of "shell developers" or "APIs". The kernel developer and the shell developer were really the same person -- Thompson and Joy.
Unix is more of a holistic system than a modular one; it's porous by design!
---
Since job control, it appears that there have been almost no system calls added for a shell. Although I have a feeling a few fcntl() operations were added for a shell, i.e. the difference between dup2() and fcntl(F_DUPFD). But I don't remember the argument offhand.
So I think the history is basically what usually happens -- once the same person stops working on 2 sides of an interface, the interface calcifies. It's a little like the relationship between ISAs and C. So we will probably have fork() forever, but that doesn't mean there can't be evolution in a better direction.
Re (2), yes, they may have been the same person, but developing user-land code and kernel-land code are still different activities. The latter takes more effort and time -- certainly it must have in the 70s.
I thought the point of vfork is that they do not share an address space. But there are other things still shared and they should really just have a CreateProcess.
That's an implementation detail at this point. The idea is to have a single syscall that takes all the information needed to spawn the process, and does so atomically, without the need to spread it across several calls. On Win32, that's CreateProcess(). On POSIX, the equivalent is posix_spawn().
They still share an address space until exec replaces it for one of them. Particularly awful is that they share the same mutable stack which is a pathway that only leads to the inner circle of hell.
Assuming you call exec, of course. To not call exec after vfork is not an option; one of the many ways the fork family of functions are fundamentally broken.
Well, without undefined behavior you can also call _exit(), continue within the same function, and receive conforming signals. Unfortunately this isn't always spelled out and there's code out there that definitely does other work invoking undefined behavior.
vfork() does the opposite of solving these problems. While there are a few functions that you can call after fork(), there is absolutely no function you can call after vfork() before exec(). You can't even write most local variables.
vfork() solves the problem of not wasting so much time on fork() when you're just going to call exec() afterwards (fork() does A LOT of work - potentially, anyway).
> vfork() does the opposite of solving these problems. While there are a few functions that you can call after fork(), there is absolutely no function you can call after vfork() before exec(). You can't even write most local variables.
Mostly wrong.
You can call functions on the child side of vfork(), but you don't want to exec() in them -- you want to exec() in the same function that called vfork().
And you can write to local variables, but you have to be careful about it.
There's a ton of vfork()-using code that does these things.
Now, it's true that a compiler optimizer that knows nothing about vfork() but knows about _exit()'s semantics, could delete code it thinks is unreachable. So there is some issue, but you can just disable the optimizer if you run into this.
That's all undefined behavior, under POSIX at least [0]:
> The vfork() function has the same effect as
fork(2), except that the behavior is undefined if the process
created by vfork() either modifies any data other than a variable
of type pid_t used to store the return value from vfork(), or
returns from the function in which vfork() was called, or calls
any other function before successfully calling _exit(2) or one of
the exec(3) family of functions.
So sure - you can do these things, but they have very little defined semantics after vfork().
It is true that Linux describes the semantics more clearly, so perhaps on Linux it is safer to use.
If you can call _exit() and exec, then you can call other functions too. I do believe that the Open Group has changed the description of vfork() to discourage its use because of an old and incorrect paper from the 80s. Actual implementations of vfork() are not as dangerous as the Open Group text purports them to be.
Moreover, most posix_spawn() implementations use vfork(), and they call more functions than _exit() and exec on the child side.
An implementation of posix_spawn() is usually owned by the same people who implemented vfork(), so they know what is and isn't safe to call in that particular implementation. But we don't, and we shouldn't assume that the implementation will not change. That's exactly why public APIs and stability guarantees exist.
The problems with fork in the face of threads are caused by threads, not by fork. Fork was there first, and it is part of a system that is designed and integrated well.
Threads were bolted onto Unix in a hamfisted way, breaking more than just fork. For instance, threads broke relative paths, requiring "at" functions like openat to be invented, an ugly stop-gap measure. Threads were badly integrated with signal handling too, another example.
Blaming those existing mechanisms is purely an emotional argument, from the perspective of being infatuated with threads.
The design of threads (coming from various efforts that became POSIX threads) came from such an infatuation: the desire to get any kinds of threads working at any cost, while ignoring the global state that exists in a Unix process, and the need to make a lot of it thread-local, or at least optionally so.
A thread-local working directory or signal mask would have caused difficulties in hack thread implementations that used user space scheduling or M:N (M user space threads to N kernel tasks).
The situation we have today largely comes from the initial reluctance to accept the fact that each thread has to be an entity known to the kernel; the belief that user space threads are viable into the long-term future.
> Only use fork in toy programs. The challenge is that successful toy programs grow into large ones, and large programs eventually use threads. It might be best just to not bother.
How do you create a new process and pipe it data quickly without using fork, exec or posix_spawn?
Since you probably don't know what all the other threads in your process are up to, your only option is to attach a debugger to all of them, halt them all, and copy all their state into brand new threads in the child process.
Do it all correctly and you end up with a multi-threaded-fork.
You still need to fix up signal handlers, interrupted syscalls, various notification APIs that no longer work, memory-mapped temp files used for IPC, pipes and sockets, and a bunch of other things.
But a fork of a complex process is possible. It just isn't easy.
fork() also presents performance issues for programs with a large virtual space. Here vfork() helps, but it has even more pitfalls than fork(). I had written a small doc about converting the recollindex Recoll indexer from fork() to vfork() a while ago: https://www.lesbonscomptes.com/recoll/pages/idxthreads/forki...
New programming language implementations should maybe make fork() and multithreading be mutually exclusive at link time by default, and only allow them together in an unsafe-I-know-what-I’m-doing mode (if at all).
It's only dangerous if you use libraries without fully understanding what they're doing. And most well designed libraries will avoid creating threads, and will do so only when you make it explicit that you want it to happen.
I also find that libraries that absolutely need to make their own threads are better off being their own process. Then you can use proper communication methods to pass data.
Funny side note, Perl fakes fork() on Windows using (I believe) threads. I am not sure if that is better or worse than Windows having fork() natively, though.
Some people would probably argue that if you use Perl, you have much bigger problems to worry about, but that's another debate.
> When I ran into this problem, I was just trying to run all of Bluecore's unit tests on my Mac laptop. We use nose's multiprocess mode, which uses Python's multiprocessing module to utilize multiple CPUs. Unfortunately, the tests hung, even though they passed on our Linux test server.
There will never be a time at which you can reliably expect any program developed on one system to "just work" on a different system. This person wasted a lot of time tracking down what was essentially a portability bug. Did they need this to be portable? Was this time well spent generating business value?
Pick one system for development through production, stick to it. There will be portability bugs hiding in your code, but you will never have to fix them. You will be upset for a minute that you can't use a different system, but you will get over it.
"The received wisdom suggests that Unix’s unusual combination of fork() and exec() for process creation was an inspired design. In this paper, we argue that fork was a clever hack for machines and programs of the 1970s that has long outlived its usefulness and is now a liability. We catalog the ways in which fork is a terrible abstraction for the modern programmer to use, describe how it compromises OS implementations, and propose alternatives."
from A Fork in the Road, <https://www.microsoft.com/en-us/research/uploads/prod/2019/0...>
fork() and exec() is a reflection of the unix philosophy: small tools for separate tasks, and as a result you have great composability. You can set up rights, etc., for your new process by calling any number of system calls between fork() and exec().
Windows is a great example of the alternative (following "Windows philosophy"): There are about three different API calls for creating a new process, each with a heap of complicated optional arguments. The API becomes more complex, less composable, less extensible and less powerful. But also easier to reason about and easier for the kernel to provide, and arguably with fewer footguns.
fork() is not a small tool.
It is a MASSIVE, unwieldy tool that is difficult to use correctly. It happens to have a small interface.
...is it?
In its original implementation, fork() was pretty trivial. All it did was create a new process entry in the kernel table, with all the pages and capabilities and such copied from the original process. Then mark all pages as copy-on-write, and return to the caller. Maybe not trivial, but much less complicated than loading an executable file from disk.
My understanding of Linux internals is maybe 20 years out of date, so I am legitimately curious what makes fork() so complicated these days.
fork() is not trivial now. Processes are huge now -- they have huge heaps among other things. Copying all that is expensive. In the 80s we tried COW, but that turns out to be very slow as well. What operating systems do now is immediately copy the resident set, then do COW for the rest of writable memory, but in large, multi-threaded processes, this is still too slow.
Use vfork() or posix_spawn().
Hrm. Googling "fork linux copy-on-write" seems to find a lot of stack overflow answers from 2014-2015 claiming Linux marks pages as copy-on-write when fork() is called. I didn't see anything more recent in the first page of results.
I could see it being worthwhile to immediately copy a few pages, like the top of the stack, but copying the whole resident set seems excessive. Especially since some of that data might not even be written to.
So the problem is what happens to the old and new processes after the fork. To CoW, you need to mark all the pages read only in _both_ old and new which means that every memory write in the caller will now pagefault since the OS now has to lazily copy on both sides. So with true copy on write the fixed costs may be low but the marginal cost per memory write may be high in both parent and child. In this case you can see why the resident set is copied, yes? It’s the smallest amount of memory that guarantees predictable performance subsequent to the call returning.
If the parent is threaded and the host has more than one CPU, then fork() == TLB shootdowns, which are slow.
As well there's the cost of all those page faults that the two processes are likely to take to do the copying.
And lastly there's all sorts of complexity involving multiple parent threads calling fork(), or the child calling fork() again (or vfork()) before calling exec.
It's just much easier to copy the resident set and mark the address space as being CoW, because now you only have to worry about page faults for pages that are not in core anyways and so were going to fault anyways, and that means you don't have to worry about TLB shootdowns either (if a page is not in core, it's not referenced by any TLB either). You still have the multi-fork issues, but now you can use an atomic reference count on the address space.
The classic pattern was that you'd spin up something like Apache, load it full of read-only data, then start forking its children. Hoping that you'd share memory between children.
With what you describe, you'd share nothing because at the point you've loaded it up, all the data you just loaded is resident. :-(
If loading == mmap(2)ing read-only, then it's not a problem. Read-only pages don't get CoW'ed.
Back when I did this, loading usually meant querying a bunch of constant data in mod_perl.
So write the query results to files and then mmap() them!
You'd be amazed at how difficult it is to make useful use of mmap() from within a variety of common scripting languages.
Hmmm, from <https://www.man7.org/linux/man-pages/man3/posix_spawn.3.html>:
Also, there's no way to set resource limits in the child process, nor switch user or group ID, using posix_spawn().
For that you may need posix_spawn and exec, but still can evade fork completely.
Using fork() also means you end up with shared ownership of resources like file descriptors, which can have some pretty weird consequences.
This is true with all process creation APIs.
Windows defaults to CLOEXEC semantics and you have to opt-in to child process inheriting open file handles, and that has caused problems.
Unix defaults to not-CLOEXEC semantics, and that too has caused problems.
The Windows default can cause problems because of simple logic bugs.
The Unix default can cause unsolvable problems because of races between threads.
You should use CLOEXEC everywhere. Except you can't because you are using libraries.
True. The main problem with the Unix default is that there wasn't a way to set O_CLOEXEC on all new FDs race-free until recently. That's a real problem. FD leaks to children can be bad, but most of the time they are not the end of the world, and often one can steal a closefrom() implementation from a BSD or Illumos as a workaround when you know exactly what you want to allow the child to inherit.
closefrom() comes in handy for this. It's missing on some platforms (notably glibc and mac iirc) but actually not too hard to implement a work-alike.
I've copied a closefrom() many a time.
A common hack I've had to add is an argument FD that is not to be closed because, e.g., an flock is held on it.
Or more importantly, IPC mechanisms like mutexes. If they're in shared memory, you now have two problems. The runtime of a very, very popular scripting language does this.
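A closefrom() work-alike with that "keep this FD" hack fits in a few lines. This is only a minimal Python sketch, not any BSD's implementation: `close_from` and its `keep` parameter are names made up for illustration, and it assumes the open descriptors are listed under /dev/fd (true on Linux and macOS), falling back to a brute-force scan otherwise.

```python
import os


def close_from(lowfd, keep=()):
    """Close every open fd >= lowfd, except those listed in `keep`."""
    keep = set(keep)
    try:
        # Assumption: /dev/fd lists this process's open descriptors.
        candidates = [int(name) for name in os.listdir("/dev/fd")]
    except OSError:
        # Fallback: try every possible descriptor number (can be slow).
        candidates = range(lowfd, os.sysconf("SC_OPEN_MAX"))
    for fd in candidates:
        if fd >= lowfd and fd not in keep:
            try:
                os.close(fd)
            except OSError:
                pass  # not open (e.g. the fd listdir itself used)
```

It is only safe where no other thread can be opening descriptors at the same time, which in practice means the child between fork and exec. In Python proper, `subprocess.Popen(..., close_fds=True, pass_fds=...)` already covers the same need.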
It was simple when processes were simple, but requirements got serious.
> Then mark all pages as copy-on-write, and return to the caller.
Unix actually copied the memory over initially.
Windows approach is not the only alternative. Simply provide API to create a process in a suspended state, then adjust its properties based on pid/handle and then start the process execution.
A huge number of tricky thread problems go away if the child thread is blocked at startup, and allowed to run only after the parent allows it. To retrofit this, it is easiest to lock a mutex before spawning and have the child block on that. Then the parent unlocks it to let the child run.
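The lock-before-spawn trick can be sketched like this. It is a minimal Python illustration under my own naming (`start_gated` is hypothetical), using `threading.Lock`, which may be released by a different thread than the one that acquired it.

```python
import threading


def start_gated(target, *args):
    """Start a thread that does not run `target` until the gate is released."""
    gate = threading.Lock()
    gate.acquire()  # taken by the parent before the child exists

    def runner():
        with gate:  # the child blocks here until the parent releases
            pass
        target(*args)

    thread = threading.Thread(target=runner)
    thread.start()
    return thread, gate
```

The parent finishes whatever bookkeeping it needs (recording the thread, adjusting its properties), then calls `gate.release()` to let the child proceed.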
I think shared libraries can spawn threads on their load/init phase that you don't know about. Then you are hosed but you only know about it due to sporadic weird problems, that if you restart on failure, e.g. a pre-forked worker scenario, you might never even really care about.
I always felt that the way to create new processes and new threads should initially start the same. A new process is simply a thread that after creation does a syscall to isolate itself from the other thread and receive a unique copy of all resources.
This could also solve the issue with forking from multithreaded programs since we can ensure we own all shared resources when we isolate our thread, to effectively thus become a new process.
So instead of fork we have clone/isolate.
A new thread can also of course immediately suspend itself, allow other threads to work on its data in some way, which then give it a signal to resume itself and then execute if need be.
Has Linux gained syscalls equivalent to those Windows API calls yet? Or is linux too different from the windows kernel to make that happen? (In that case, what does WINE do?)
On Linux, the libc fork function(s) are wrappers around the clone system call, which is more flexible (they added some stuff to make Wine work better):
https://lwn.net/Articles/826313/
Clone is also used to start threads, IIRC:
https://stackoverflow.com/questions/4856255/the-difference-b...
Fork is not a reflection of unix philosophy! It is a beautiful hack tho.
Discussed in:
https://hn.algolia.com/?q=https%3A%2F%2Fwww.microsoft.com%2F...
and also here:
https://news.ycombinator.com/item?id=30502392
To present some contrast:
Windows doesn't have fork(). It has a real, fully mature thread and process model. In Windows NT, every process consists of a handle that is a "Process", which in turn points to a structure containing a list of "Threads". A process is done when its main thread exits or all threads exit, whichever is defined by the main process. Fork/Exec is replaced with CreateProcess (or ShellExecute, your choice).
For a very zen-like example of the fork/exec and pipe management that you'd do on a POSIX system done in Windows, the [MSDN Docs](https://docs.microsoft.com/en-us/windows/win32/procthread/cr...) are quite informative.
When I see what most people try to do with "a real, fully mature thread and process model", I wind up shaking my head about bad engineering processes.
Bad... how?
From my perspective watching the various techniques used by a multitude of operating systems over many decades, the Windows-style process+thread model seems to be winning out over the UNIX fork() model.
For example, PostgreSQL seems to suffer greatly from the "forked process per connection" model, necessitating front-ends that do connection pooling. Database engine after database engine seems to go through this phase and then "upgrade" to either a single-process thread pool model, or start using async IO in some way. (Web servers also.)
For reference, Microsoft SQL Server back in the 1990s on Windows NT 4 could handle more connections than PostgreSQL in 2020...
> Windows doesn't have fork()
It does (or maybe did, not sure if it still works) have the literal equivalent of fork() - NtCreateProcess() with NULL SectionHandle argument creates a new process which is a clone of the caller. However, it never really worked. The Win32 API did not support forking, and so any process which forked itself and then tried to invoke any Win32 API calls would soon crash or fail mysteriously. In theory forking did work for pure NT API applications, but those are rather limited in their abilities.
The original POSIX subsystem would have used it.
If you mix fork with threads, you're going to have a [undefined behavior] time. It seems like if you link with the sqlite that comes with macOS, you're using threads whether you like it or not. I think ending up at "you shouldn't use fork() at all" is a bit of an extreme conclusion, though.
BTW, article title needs a (2016). It appears that the relevant Python bug has long since been closed, by avoiding linking with the system sqlite on macOS.
> I think ending up at "you shouldn't use fork() at all" is a bit of an extreme conclusion, though.
Is it? There are more descriptive (as opposed to procedural) APIs which behave in a safer and more well-defined manner to do it these days. Unless you're implementing a shell, fork has never been a great tool.
As one commenter noted 3 months back:
> The dense fog lifts, tree branches part, a ray of light beams down on a pedestal revealing the hidden intentions of the ancients. A plaque states "The operational semantics of the most basic primitives of your operating system are designed to simplify the implementation of shells." You hesitantly lift your eyes to the item presented upon the pedestal, take a pause in respect, then turn away slumped and disappointed but not entirely surprised. As you walk you shake your head trying to evict the after image of a beam of light illuminating a turd.
https://news.ycombinator.com/item?id=30504470
IMO, fork is strictly better than threads as a tool for having operations perform off of the main thread; they get all the state they need at the beginning and they can use IPC to return the result.
TFA does allow for off-process operations, but all of the inputs to the operation would need to be passed explicitly. In this sense, I suppose TFA isn't arguing against multiprocessing per-se, but against the specific type that implicitly includes all of the current process state (which has both up- and down-sides).
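For concreteness, the "fork gets the state for free, IPC returns the result" pattern might look like the following. This is a minimal Python sketch with a made-up helper name (`run_in_fork`); it assumes the result is picklable and that the parent is single-threaded at the time of the fork.

```python
import os
import pickle


def run_in_fork(func, *args):
    """Run func(*args) in a forked child and return its result via a pipe."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: sees a copy-on-write snapshot of the parent's state.
        status = 1
        try:
            os.close(read_fd)
            with os.fdopen(write_fd, "wb") as out:
                pickle.dump(func(*args), out)
            status = 0
        finally:
            os._exit(status)  # never fall back into the parent's code path
    # Parent: read until the child closes its end, then reap it.
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as inp:
        payload = inp.read()
    _, wait_status = os.waitpid(pid, 0)
    if not (os.WIFEXITED(wait_status) and os.WEXITSTATUS(wait_status) == 0):
        raise RuntimeError("child failed")
    return pickle.loads(payload)
```

Anything the child mutates stays in the child; only the pickled return value crosses back. That implicit snapshot is exactly the part TFA objects to once threads are involved.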
That is basically how NGINX works if you run it in daemon mode. When you start or reload the server, the main process initializes common state then forks to become a worker process. Although I would recommend avoiding any IPC past that if possible.
> In this sense, I suppose TFA isn't arguing against multiprocessing per-se, but against the specific type that implicitly includes all of the current process state (which has both up- and down-sides).
You don't have to suppose anything, TFA specifically says that you should use posix_spawn or immediately exec() after forking.
It doesn't imply or hint, let alone say, that threads are superior, it only mentions them because they interact badly with fork() and that's the issue they'd hit. It's not like threads are the only thing which interacts badly with fork.
As the author of a gist that trashes on fork(), I do nonetheless use it, usually early in daemons' lives:
And maybe POSIX-ish shells should use fork() for subshells, naturally.
But I think that's about it for good uses of fork().
For all process spawning uses of fork() I strongly recommend vfork() or posix_spawn() instead.
Isn't the first use-case a pretty debatable / bad one? By daemonizing internally, you make service management and supervision of the program much more difficult, and if you include a non-daemonizing mode for debugging you now have two different runmodes with a pretty significant semantics difference, only one of which is easily inspectable.
Daemonizing is a thing of the past with modern restarter frameworks, like SMF, systemd, supervisord, etc. But daemonizing was always an option, not a requirement, and as an option, it's safe enough to provide it for those who don't use a restarter.
Your program probably knows how (if it wished) to manage its own resources far better than an external program ever could.
Said no sysadmin ever.
I love this sysadmin comment and the GP dev comment. The key is to get the sysadmin team putting requirements to the code, whether the restarting ends up in process or as a nice small unix separate tool that just does restarts well, is an outcome of a process.
fork + threads is not undefined behavior. It is safe as long as you only do “async-signal safe” functions in the child. The child will be single-threaded.
The Linux man pages have a list of safe operations after fork here: https://man7.org/linux/man-pages/man7/signal-safety.7.html
Note that this includes most of your standard syscalls, like (importantly) write(), read(), close(), chdir(), as well as certain “obviously safe” library functions like strlen(), memcpy(), etc.
Non-multithreaded programs can fork() how they like and do whatever they want after (mostly).
> fork + threads is not undefined behavior. It is safe as long as you only do “async-signal safe” functions in the child. The child will be single-threaded.
Yes, but the async-signal-safe restriction is pretty severe, so you have to know what you're doing. Yes, that's also true of vfork(), but at least vfork() will be much faster.
> Non-multithreaded programs can fork() how they like and do whatever they want after (mostly).
Only as long as they haven't used libraries that are not fork-safe prior to calling fork(). And you still need to do things like fflush() stdio handles prior to fork()ing.
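The stdio point is easy to reproduce. A minimal Python sketch (the function name is made up; Python's buffered file objects stand in for C stdio here): data sitting in a userspace buffer at fork() time is duplicated into the child, and both processes eventually flush it.

```python
import os
import tempfile


def fork_with_buffered_write(flush_first):
    """Write one line, fork, close the file in both processes, return the file."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "log.txt")
        f = open(path, "w")
        f.write("hello\n")  # still in the userspace buffer, not yet in the file
        if flush_first:
            f.flush()  # the fflush()-before-fork() rule
        pid = os.fork()
        if pid == 0:
            try:
                f.close()  # flushes the child's copy of the buffer
            finally:
                os._exit(0)
        os.waitpid(pid, 0)
        f.close()  # flushes the parent's copy
        with open(path) as result:
            return result.read()
```

Without the flush the file ends up containing the line twice, because parent and child each write their copy of the buffer through the shared file description; with the flush it contains it once.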
Yes, I’ve read the man page, but thank you for repeating the info here.
It's not just SQLite in macOS.
http://sealiesoftware.com/blog/archive/2017/6/5/Objective-C_...
In general, I don't see how one could safely rely on a third-party library spawning or not spawning threads unless they explicitly make guarantees regarding not using them as part of their public contract.
The dense fog lifts, tree branches part, a ray of light shines down on the ruins of a moss-covered pedestal revealing the hidden intentions of the ancients. A plaque states "The operational semantics of the most basic primitives of your operating system are optimized to simplify the implementation of command line shells." You look upon the pedestal, pause in respect, then turn away disappointed but unsurprised. As you walk you shake your head trying to evict the after image of a beam of light illuminating a turd.
I wrote this silly little bit on a previous fork() thread and touched it up a little. https://news.ycombinator.com/item?id=30504470
It was a glorious comment then, and it is now.
BTW, it's unclear whether the turd is... the specific truth revealed, or the revelation itself (since it could be incorrect). It's still a glorious comment.
Thank you! The goal was to evoke the same emotional response I felt when this idea struck me (I think when I was reading earlier comments in that thread in fact), so I guess there's as much in the sequence as the specific objects. The semantic design of fork was always incomprehensible to me (not what but why), so the setting and pedestal and plaque represent how strongly it struck me and the depth of its historical explanatory power. The turd is how I feel about it. I don't resent the designer's history or motivations or incentives and I'm happy to know the truth. I'm just sad/depressed that this is the reason why things are the way they are, that the original design intent is so misaligned with the needs of modern systems, how it limits our capabilities today even down to hardware, and how unlikely it is for this to change. I don't suppose a turd would last very long on a pedestal, or that the ancients would put a turd there on purpose. Maybe they put something beautiful there back then, but times have changed and now we're stuck with it and it's kind of shit.
Don't get too sad about legacy. A lot of things were brilliant once that aren't now. I do still feel that fork+exec was brilliant then, just not now. Deprecation is hard, but we can celebrate what legacy made possible.
Now, if only we had a time machine...
Agreed. I'm not sure where to go from here, but I think I smell fresh air through the WASM/WASI doorway.
"If in doubt, Meriadoc, always follow your nose." - Gandalf
WASM, making lemonade out of lemons (JS).
They called it RISC iX because it was RISC.
They called it Minix because it was mini.
And they called it POSIX because...
I've literally never heard this one before. Good one!
fish shell uses posix_spawn sometimes because of its performance benefits. We can't use it in the following cases:
1. No analog to tcsetpgrp, so it's no good if job control is enabled
2. No analog to fchdir, meaning you have to synchronize with fchdir elsewhere in the program
3. Error codes do not convey enough information for good error messages (e.g. if a file doesn't exist, posix_spawn doesn't tell you which file)
4. Inconsistent behavior around dup2 fd redirections and CLOEXEC.
5. Inconsistent behavior for shebangless scripts
These are basically deal-breakers so fish also supports a fork/exec path. However the performance benefits of posix_spawn are too real to ignore so fish uses posix_spawn when it can, and fork/exec when it must.
Shells should really use `vfork()` to exec, and `fork()` for subshells (maybe).
Except that the set of things you are allowed to do after vfork() and before exec() is extremely small. I agree that things should use vfork or posix_spawn when possible, but it wouldn't surprise me at all if there's a similar list of cases when a vfork() path wouldn't be possible.
That set was made artificially small by the Open Group, but is not actually that small.
OK very interesting ... is the performance benefit on certain platforms, or everywhere? A previous thread says Ninja is faster on OS X and Solaris because of it.
Though that does seem like a large number of corner cases, probably learned through painful experience :-/
It has performance benefits on both Linux (where it can use vfork) and macOS/BSDs (where it has a kernel implementation).
I tweeted a little about it here, with some perf numbers: https://twitter.com/ridiculous_fish/status/12328893907639336...
ah a good list as a companion to my simple posix_spawn pipe example in another sub-thread here! thank you!
Not sure if it is worth the platform-specific code, but for #2 macOS 10.15+ has `posix_spawn_file_actions_addfchdir_np()`.
I think most of these are deficiencies in the available posix_spawn actions, not anything inherent. Of course getting all the relevant OSes to add new functionality is a huge pain. The error handling seems bad though.
Another danger using fork is it duplicates the internal state of pseudo random number generators. It's a great way to accidentally take the same random samples in every process, utterly trashing any statistics you were intending to do. Bonus: the python multiprocessing module silently uses fork by default. Person A writes a "make multiprocessing convenient" library, Person B writes a sampling library, you put them together and... whoops!.
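Here is a small reproduction of the duplicated-generator problem, as a hedged Python sketch (`sample_in_child` is a name invented for this example). It uses an explicit `random.Random` instance on purpose: as far as I know, recent CPython versions reseed the module-level generator in the child after fork, but generators a library creates for itself get no such treatment.

```python
import os
import random


def sample_in_child(rng):
    """Fork, draw one number from `rng` in the child, and return it."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        try:
            os.close(read_fd)
            os.write(write_fd, repr(rng.random()).encode())
        finally:
            os._exit(0)
    os.close(write_fd)
    with os.fdopen(read_fd) as inp:
        text = inp.read()
    os.waitpid(pid, 0)
    return float(text)


rng = random.Random(1234)
first_child = sample_in_child(rng)
second_child = sample_in_child(rng)
parent = rng.random()
# Every process started from the same generator state, so all three agree.
print(first_child == second_child == parent)
```

Each child inherits the parent's generator state, and the parent's state does not advance when a child draws, so every worker "samples" the same value.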
Libraries like that should use pthread_atfork() to automatically reset/reseed/whatever state as needed at fork() time.
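In Python the closest analog to pthread_atfork() is `os.register_at_fork()`. A minimal sketch of a library-side generator that reseeds itself in every forked child follows; `ForkAwareRandom` and `sample_in_child` are hypothetical names, and whether reseeding is the right policy is the library's call.

```python
import os
import random


class ForkAwareRandom:
    """A PRNG wrapper that reseeds from OS entropy in every forked child."""

    def __init__(self, seed=None):
        self._rng = random.Random(seed)
        # Handlers cannot be unregistered, so this keeps the object alive.
        os.register_at_fork(after_in_child=self._reseed)

    def _reseed(self):
        self._rng.seed()  # no argument: seed from the OS entropy source

    def random(self):
        return self._rng.random()


def sample_in_child(rng):
    """Fork, draw one number from `rng` in the child, and return it."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        try:
            os.close(read_fd)
            os.write(write_fd, repr(rng.random()).encode())
        finally:
            os._exit(0)
    os.close(write_fd)
    with os.fdopen(read_fd) as inp:
        text = inp.read()
    os.waitpid(pid, 0)
    return float(text)
```

With this wrapper two forked children draw different values even though the parent was seeded with a fixed number. The handler only fires for forks that go through Python's `os.fork()`.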
I don't think that's really a viable strategy in practice in an ecosystem as complex as python's. There's too many libraries and too many little corner cases and interactions around what the behavior should be.
For example, suppose I am using library A and I initialized the random number generator with a fixed seed. Clearly when I fork it's not appropriate for A to reseed, because I wanted fixed behavior. Something is very wrong so probably there should be an exception. But now suppose I was using library B which was using A and B handles getting system entropy to seed A. Now it is clear that when I fork I probably want B to reseed A, but alas A has already raised an exception because it was given a (from its perspective) fixed seed. So now A needs to be redesigned to be given a seed and like some sort of intent on what should happen when forking, and oh my god wow this is creating a lot of work for everyone everywhere this is not actually going to be done consistently and cannot be trusted.
If you're writing a simulation or a test, then you'll want the PRNG to stay unchanged, and you'll want to be in control of any reseeding.
For all other RNG uses, you really do want it to reseed.
A cryptographic PRNG vs. a simulation PRNG are very different things, and should be different libraries.
pthread_atfork functions aren't called if the application calls the clone syscall directly. The right solution is MADV_WIPEONFORK on Linux, or MINHERIT_ZERO on OpenBSD:
https://www.metzdowd.com/pipermail/cryptography/2017-Novembe...
That helps with memory mappings, but it doesn't help with file descriptors -- you still have to be careful with those.
Reading up in the python documentation, it seems to seed once from `/dev/urandom`, and then uses its own generator to generate further random bits.
What's the purpose of this strategy, as opposed to deriving every single random value from `/dev/urandom`? Simply performance?
Reading from /dev/urandom requires a syscall, which can be extremely slow compared to running your own prng in-process.
See also this recent 340 (!) comment thread about the issues of fork https://news.ycombinator.com/item?id=30502392
Top comment there is gold (and makes a good point).
FWIW this is the same reason you can't implement a portable Unix shell in portable Go. (And similar issues with an init daemon)
Go only exports syscall.ForkExec() -- there is no standalone Fork(), because the things you can do between the calls could break Go's threaded runtime. (Goroutines are multiplexed onto OS threads.)
Some elaboration on that: https://lobste.rs/s/hj3np3/mvdan_sh_posix_shell_go#c_qszuer
That is, the space between fork and exec is where pipelines are implemented, but also entire subinterpreters/subshells. The shell actually uses copy-on-write usefully. (And yes I'm aware that there's a good argument that the shell is almost the ONLY program that needs fork() !)
----
A lot of people have asked me why not implement Oil in Go and various other languages, so I wrote this page:
https://github.com/oilshell/oil/wiki/FAQ:-Why-Not-Write-Oil-...
So the funny thing is that Python is a lower level language than Go for this particular problem. It doesn't do anything weird with regard to syscalls. I'm still looking for help on this (and donations to pay people other than me):
Oil Is Being Implemented "Middle Out" https://www.oilshell.org/blog/2022/03/middle-out.html
> The shell actually uses copy-on-write usefully. (And yes I'm aware that there's a good argument that the shell is almost the ONLY program that needs fork() !)
It's been a while since I looked at it, but I believe Android uses fork for its copy-on-write semantics to optimize app startup. On boot it initializes a single instance of the app runtime environment. Then when you launch apps that initial process is forked. As a result you do not need to reinitialize the runtime for every app launch.
Yes I think the argument is that Android (and Chrome) could use something like vfork or posix_spawn().
I'm not sure which, if any; I'd like to see an analysis of that... The issue is what kernel state is preserved/shared across the process creation call.
Every process sort of has a "mirror" in kernel memory. The user memory is CoW, and I suppose you also have to choose whether to copy or reference every kernel data structure as well --- open files in FD tables which point to disk/pipes/sockets, locks which seem to be nonsensical, etc.
But probably you can get the "warmup" property without the full semantics of fork(). That is the CoW of user memory is a somewhat separate choice from the kernel data structures.
----
As far as the shell .... In the recent linked thread, Ninja uses posix_spawn because it has a simple use of subprocesses: https://news.ycombinator.com/item?id=30503382
So there are definitely cases where a shell uses fork/exec like Ninja, so you could imagine optimizing it. But the subshell/subinterpreter case is probably the most general -- the language semantics depend on it. And it's actually useful, e.g. this "alternative shell challenge":
https://www.oilshell.org/blog/2020/02/good-parts-sketch.html...
and I used that pattern literally yesterday and this morning, etc.
----
edit: looks like fish already addresses this, i.e. where you can use posix_spawn in a shell, and where you can't! https://news.ycombinator.com/item?id=31743230
This is moderately common for environments where you are pushing a lot of startup work into the dynamic linker and will be launching processes frequently. Loading shared libraries for example.
You have a parent process which uses dlopen() to load all the libraries you want to avoid re-linking. When you want to spawn a child, rather than exec() you dlopen() an object with your child's main() and call it. For the case where you have enough libraries this is much faster than an exec(), saving tens of seconds on every application launch if you have a really bad case of C++.
There are some small surprises which become obvious with a little thought. You are responsible for everything that normally happens in your process before main() is called. ASLR is only done once per session. People rarely think to fix-up argv[] for ps and friends in the first version.
I am always thoroughly amazed when people develop programming languages that are intentionally difficult to use.
Please don’t post superficial and petty comments. Instead think more deeply about tradeoffs in designing for concurrency
I think this turns out to be a tangent, but at least superficially it is possible for a C program to "do" shell pipelines without use of fork or vfork (directly) but rather by posix_spawn. I suppose "portable go" does not directly wrap posix_spawn so this option may not be on the table for you.
Basics: https://gist.github.com/ec8469273c7808d46c7285cd056d0104
Typical use: `./a.out seq 3 2 9 -- cat -n` is similar to `seq 3 2 9 | cat -n` except that the return value is nonzero if either side's return value is nonzero.
that said, I wouldn't be surprised if there's something important I'm overlooking here.
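For readers who don't want to open the gist, the same idea can be sketched through Python's `os.posix_spawnp` wrapper. This is not the gist's code: `run_pipeline` is a made-up name, and the sketch assumes fds 0, 1 and 2 are open in the parent and that the platform provides posix_spawnp.

```python
import os


def run_pipeline(left, right):
    """Run `left | right`; return (right's stdout, (left_exit, right_exit))."""
    pipe_r, pipe_w = os.pipe()  # connects left's stdout to right's stdin
    out_r, out_w = os.pipe()    # lets the parent capture right's stdout

    # Python creates pipe fds close-on-exec, so each child keeps only the
    # ends that the dup2 file actions copy onto its fd 0 or fd 1.
    left_pid = os.posix_spawnp(
        left[0], left, os.environ,
        file_actions=[(os.POSIX_SPAWN_DUP2, pipe_w, 1)])
    right_pid = os.posix_spawnp(
        right[0], right, os.environ,
        file_actions=[(os.POSIX_SPAWN_DUP2, pipe_r, 0),
                      (os.POSIX_SPAWN_DUP2, out_w, 1)])

    # The parent must drop its copies or the readers never see EOF.
    for fd in (pipe_r, pipe_w, out_w):
        os.close(fd)
    with os.fdopen(out_r, "rb") as stream:
        output = stream.read()
    codes = tuple(os.waitstatus_to_exitcode(os.waitpid(pid, 0)[1])
                  for pid in (left_pid, right_pid))
    return output, codes
```

`run_pipeline(["seq", "3", "2", "9"], ["cat", "-n"])` corresponds to the `seq 3 2 9 | cat -n` example, and `any(codes)` gives the "nonzero if either side failed" behavior. No fork() appears in user code; all the plumbing is expressed as file actions.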
Hm very interesting. I haven't used or seen posix_spawn() used before, but I saw in a recent thread that Ninja uses it, so that's a positive sign.
I saved this here! https://github.com/oilshell/oil/issues/1161
The devil, as they say, is in the details.
But to be fair, the only times I can recall using fork() without exec() were forking network servers, and that was mostly me learning about doing network stuff, and a forking server was the easiest to implement manually.
Oh yeah, and that one time I accidentally wrote a fork bomb trying to stress test a DNS server. At least I learned something from my mistake. ;-)
EDIT: To me, using fork() without exec() is kind of like operator overloading - there are cases where it absolutely is the right tool, but these aren't very numerous, so one should exercise caution. A lot.
Let’s also not forget to call out how APIs have to be “fork” aware. I’m surprised fork is still widely used given all of these downsides.
I imagine it comes up fairly often for languages that don't do well with threads. No shortage of those.
Curious why there isn’t an interface in which all required handles and resources could be passed to a child process explicitly. E.g.:
Would remove so many headaches with concurrency and accidental inheritance.
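As a point of comparison, and not the interface proposed above, Python's standard library already exposes something in this spirit: `subprocess.Popen` with `close_fds=True` and an explicit `pass_fds` list. A minimal sketch, where `spawn_with_handles`, `CHILD` and `demo` are names made up for the example:

```python
import os
import subprocess
import sys

# Child program: write b"ok" to the fd number given as its first argument.
CHILD = "import os, sys; os.write(int(sys.argv[1]), b'ok')"


def spawn_with_handles(argv, handles):
    """Start argv; the child inherits stdio plus exactly the fds in `handles`."""
    return subprocess.Popen(argv, close_fds=True, pass_fds=tuple(handles))


def demo():
    read_fd, write_fd = os.pipe()
    proc = spawn_with_handles(
        [sys.executable, "-c", CHILD, str(write_fd)], [write_fd])
    os.close(write_fd)  # the parent no longer needs the write end
    data = os.read(read_fd, 16)
    os.close(read_fd)
    proc.wait()
    return data
```

Everything not named in `handles` (apart from stdin, stdout and stderr) is closed in the child, so accidental inheritance has to be opted into rather than cleaned up afterwards.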
I suppose Linux's clone3 call goes the closest to that.
The suggestions here aren't really great. What you should do is already written in the fork(2) manpage https://man7.org/linux/man-pages/man2/fork.2.html
"After a fork() in a multithreaded program, the child can safely call only async-signal-safe functions (see signal-safety(7)) until such time as it calls execve(2)."
So just use only async-signal-safe functions https://man7.org/linux/man-pages/man7/signal-safety.7.html
I don't know why so many people still hit this issue when the documentation already tells you what you can and cannot do. I've done this sort of thing without any issue.
Because in practice it’s almost impossible to know if you are single threaded still. Plenty of system libraries have background threads.
Not to disbelieve you but which ones?
There are many languages where it's difficult to impossible to be async-signal-safe because you can't avoid memory allocations.
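The hazard behind that man page warning can be shown with a small sketch. A lock held by another thread at the moment of fork() stays locked in the child, where its owner no longer exists. This is Python rather than C, so it only illustrates the shape of the problem; the function name is made up, and recent Python versions (3.12 and later, as far as I know) emit a DeprecationWarning for fork() in a multithreaded process, which the sketch silences.

```python
import os
import threading
import warnings


def child_sees_stuck_lock():
    """Return True if a forked child cannot take a lock held by another thread."""
    lock = threading.Lock()
    holding = threading.Event()
    release = threading.Event()

    def background():
        with lock:
            holding.set()
            release.wait()  # keep the lock held across the fork

    worker = threading.Thread(target=background)
    worker.start()
    holding.wait()  # the background thread now owns the lock

    with warnings.catch_warnings():
        warnings.simplefilter("ignore", DeprecationWarning)
        pid = os.fork()

    if pid == 0:
        # Child: only the forking thread was copied, so nobody will ever
        # release the lock here. Use a timeout instead of hanging forever.
        code = 2
        try:
            code = 0 if lock.acquire(timeout=0.5) else 1
        finally:
            os._exit(code)  # never fall back into the parent's code path

    _, status = os.waitpid(pid, 0)
    release.set()
    worker.join()
    return os.WIFEXITED(status) and os.WEXITSTATUS(status) == 1
```

In a real program the lock is usually hidden inside malloc, a logger, or some library, which is why the child cannot reliably tell which calls are safe.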
Discussed at the time:
Fork() without exec() is dangerous in large programs - https://news.ycombinator.com/item?id=12302539 - Aug 2016 (101 comments)
I asked this a year or so ago. Interesting to read this article in light of that discussion.
https://news.ycombinator.com/item?id=863871 (13 years? Yikes!)
fork() and the exec system calls exist because they were easy to implement in the 70s on PDPs, and fork() was cheap enough then, but much more importantly, it got the shell developers out of having to write and evolve a more complex API in the kernel. With fork/exec a shell developer could try lots of variations for executing a pipe command w/o having to develop any more kernel code.
For example, until BSD came along, not much had to change in kernel land for any shell. Job control meant that the shell would need to put all the processes for a job in the same pgrp, and also there was a need to add `setsid()`.
Hm a lot of that doesn't match my understanding, which is more like:
1. Unix has fork() because it was influenced by Multics. I don't have the citation now, but I think some parts of Unix were from ITSS, perhaps the hierarchical file system, and some were from Multics. It was a drastic simplification of those systems, but with the same ideas.
---
2. The shell developer and the kernel developer were really the same person -- Ken Thompson. I link to his original paper in this post [1]. The original Thompson shell had pipelines and redirects, which are most of what happens between fork() and exec() in a shell.
Also of note is a video by Stephen Bourne who says that Ken Thompson being away at Berkeley was a good time to turn his shell into a programming language (Bourne shell) [2].
Similarly, I read that Bill Joy added chroot() to Unix simply because he needed it for something he was doing one time (building another Unix system, I would imagine).
---
3. Bill Joy also added job control to the kernel and the terminal for his csh shell. It's a very tightly coupled and ugly design.
So the key point is that we shouldn't assume any notion of "shell developers" or "APIs". The kernel developer and the shell developer were really the same person -- Thompson and Joy.
Unix is more of a holistic system than a modular one; it's porous by design!
---
Since job control, it appears that there have been almost no system calls added for a shell. Although I have a feeling a few fcntl() operations were added for a shell, i.e. the difference between dup2() and fcntl(F_DUPFD). But I don't remember the argument offhand.
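For reference, the difference between the two calls, sketched in Python (helper names are made up): `dup2()` places the copy on an exact descriptor number, closing whatever was there, while `fcntl(F_DUPFD)` picks the lowest free number at or above a floor. As far as I know, shells use the latter to park a saved descriptor out of the way of the small numbers a script may redirect.

```python
import fcntl
import os


def park_above(fd, floor=10):
    """Copy fd to the lowest free descriptor >= floor; returns the new fd."""
    return fcntl.fcntl(fd, fcntl.F_DUPFD, floor)


def redirect_onto(fd, target):
    """Make `target` refer to the same open file as fd, closing target first."""
    return os.dup2(fd, target)
```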
So I think the history is basically what usually happens -- once the same person stops working on 2 sides of an interface, the interface calcifies. It's a little like the relationship between ISAs and C. So we will probably have fork() forever, but that doesn't mean there can't be evolution in a better direction.
[1] Unix Shell: History and Trivia
https://www.oilshell.org/blog/2021/08/history-trivia.html
[2] https://www.oilshell.org/blog/2022/03/middle-out.html#more-w...
Re (2), yes, they may have been the same person, but developing user-land code and kernel-land code are still different activities. The latter takes more effort and time -- certainly it must have in the 70s.
Odd to not see any mention of vfork. vfork solves the problems with fork/exec for large programs.
Isn't vfork much worse in terms of the problem the author is talking about, since the child can now acquire locks in the _parent's_ address space?
I thought the point of vfork is that they do not share an address space. But there are other things still shared and they should really just have a CreateProcess.
no, fork creates a new address space, vfork doesn't
the posix_spawn mentioned in the article is effectively the equivalent of CreateProcess
Last time I looked, posix_spawn() just called fork/exec
That's an implementation detail at this point. The idea is to have a single syscall that takes all the information needed to spawn the process, and does so atomically, without the need to spread it across several calls. On Win32, that's CreateProcess(). On POSIX, the equivalent is posix_spawn().
In glibc it uses vfork() in some cases.
In Solaris/Illumos it uses vfork() or vforkx().
In principle posix_spawn() can be a system call.
On macOS it is a system call.
They still share an address space until exec replaces it for one of them. Particularly awful is that they share the same mutable stack which is a pathway that only leads to the inner circle of hell.
Assuming you call exec, of course. To not call exec after vfork is not an option; one of the many ways the fork family of functions are fundamentally broken.
Well, without undefined behavior you can also call _exit(), continue within the same function, and receive conforming signals. Unfortunately this isn't always spelled out and there's code out there that definitely does other work invoking undefined behavior.
vfork() does the opposite of solving these problems. While there are a few functions that you can call after fork(), there is absolutely no function you can call after vfork() before exec(). You can't even write most local variables.
vfork() solves the problem of not wasting so much time on fork() when you're just going to call exec() afterwards (fork() does A LOT of work - potentially, anyway).
> vfork() does the opposite of solving these problems. While there are a few functions that you can call after fork(), there is absolutely no function you can call after vfork() before exec(). You can't even write most local variables.
Mostly wrong.
You can call functions on the child side of vfork(), but you don't want to exec() in them -- you want to exec() in the same function that called vfork().
And you can write to local variables, but you have to be careful about it.
There's a ton of vfork()-using code that does these things.
Now, it's true that a compiler optimizer that knows nothing about vfork() but knows about _exit()'s semantics, could delete code it thinks is unreachable. So there is some issue, but you can just disable the optimizer if you run into this.
That's all undefined behavior, under POSIX at least [0]:
> The vfork() function has the same effect as fork(2), except that the behavior is undefined if the process created by vfork() either modifies any data other than a variable of type pid_t used to store the return value from vfork(), or returns from the function in which vfork() was called, or calls any other function before successfully calling _exit(2) or one of the exec(3) family of functions.
So sure - you can do these things, but they have very little defined semantics after vfork().
It is true that Linux describes the semantics more clearly, so perhaps on Linux it is safer to use.
[0] https://man7.org/linux/man-pages/man2/vfork.2.html
If you can call _exit() and exec, then you can call other functions too. I do believe that the Open Group has changed the description of vfork() to discourage its use because of an old and incorrect paper from the 80s. Actual implementations of vfork() are not as dangerous as the Open Group text purports them to be.
Moreover, most posix_spawn() implementations use vfork(), and they call more functions than _exit() and exec on the child side.
Let's be reasonable about these things.
An implementation of posix_spawn() is usually owned by the same people who implemented vfork(), so they know what is and isn't safe to call in that particular implementation. But we don't, and we shouldn't assume that the implementation will not change. That's exactly why public APIs and stability guarantees exist.
Not on Linux!
Even if that's true, doing so after vfork() is still much harder than after fork(), not easier.
The problems with fork in the face of threads are caused by threads, not by fork. Fork was there first, and it is part of a system that is designed and integrated well.
Threads were bolted onto Unix in a hamfisted way, breaking more than just fork. For instance, threads broke relative paths, requiring "at" functions like openat to be invented, an ugly stop-gap measure. Threads were badly integrated with signal handling too, another example.
Blaming those existing mechanisms is purely an emotional argument, from the perspective of being infatuated with threads.
The design of threads (coming from various efforts that became POSIX threads) came from such an infatuation: the desire to get any kinds of threads working at any cost, while ignoring the global state that exists in a Unix process, and the need to make a lot of it thread-local, or at least optionally so.
A thread-local working directory or signal mask would have caused difficulties in hack thread implementations that used user space scheduling or M:N (M user space threads to N kernel tasks).
The situation we have today largely comes from the initial reluctance to accept the fact that each thread has to be an entity known to the kernel; the belief that user space threads are viable into the long-term future.
> Only use fork in toy programs. The challenge is that successful toy programs grow into large ones, and large programs eventually use threads. It might be best just to not bother.
How do you create a new process and pipe it data in a fast fashion without using fork, exec or posix_spawn ?
Use a shared memory region. For example: https://github.com/erikhvatum/py_interprocess_shared_memory_...
> without using fork, exec or posix_spawn ?
Did you manage to forget what you'd read two paragraphs earlier when you reached this bit? Because the essay's first recommendation is literally:
> Only use fork to immediately call exec (or just use posix_spawn).
It seems difficult to infer "don't use exec or posix_spawn" from this.
Ofc I read that. I think an "only" was missing in the sentence, hence the confusion.
posix_spawn or vfork+exec
One other option is to fork all the threads too.
Since you probably don't know what all the other threads in your process are up to, your only option is to attach a debugger to all of them, halt them all, and copy all their state into brand new threads in the child process.
Do it all correctly and you end up with a multi-threaded-fork.
You still need to fix up signal handlers, interrupted syscalls, various notification APIs that no longer work, memory mapped temp files used for IPC, pipes and sockets, and a bunch of other things.
But a fork of a complex process is possible. It just isn't easy.
fork() also presents performance issues for programs with a large virtual space. Here vfork() helps, but it has even more pitfalls than fork(). I had written a small doc about converting the recollindex Recoll indexer from fork() to vfork() a while ago: https://www.lesbonscomptes.com/recoll/pages/idxthreads/forki...
New programming language implementations should maybe make fork() and multithreading be mutually exclusive at link time by default, and only allow them together in an unsafe-I-know-what-I’m-doing mode (if at all).
It's only dangerous if you use libraries without fully understanding what they're doing. And most well designed libraries will avoid creating threads, and will do so only when you make it explicit that you want it to happen.
I also find that libraries that absolutely need to make their own threads are better off being their own process. Then you can use proper communication methods to pass data.
Per TFA, a large fraction of OS X system libraries use threads, so if you're developing for the Macintosh, fork() is already out.
And it’s explicitly unsupported on macOS at least.
Won't stop the *nix know-nothings from criticizing Windows for not natively supporting fork().
Funny side note, Perl fakes fork() on Windows using (I believe) threads. I am not sure if that is better or worse than Windows having fork() natively, though.
Some people would probably argue that if you use Perl, you have much bigger problems to worry about, but that's another debate.
> When I ran into this problem, I was just trying to run all of Bluecore's unit tests on my Mac laptop. We use nose's multiprocess mode, which uses Python's multiprocessing module to utilize multiple CPUs. Unfortunately, the tests hung, even though they passed on our Linux test server.
There will never be a time at which you can reliably expect any program developed on one system to "just work" on a different system. This person wasted a lot of time tracking down what was essentially a portability bug. Did they need this to be portable? Was this time well spent generating business value?
Pick one system for development through production, stick to it. There will be portability bugs hiding in your code, but you will never have to fix them. You will be upset for a minute that you can't use a different system, but you will get over it.
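For the hang quoted above, one commonly suggested way out is to avoid fork() as the worker start method. A minimal sketch, assuming the workload can be pickled; `square` and `parallel_squares` are placeholder names, and whether nose's multiprocess mode can be configured this way is not something I have checked.

```python
import multiprocessing

# "spawn" starts each worker as a fresh interpreter instead of fork()ing
# the parent, so no locks or threads are inherited. It has been the
# default start method on macOS since Python 3.8.
SPAWN = multiprocessing.get_context("spawn")


def square(n):
    return n * n


def parallel_squares(values, workers=2):
    """Map square() over values in spawn-started worker processes."""
    with SPAWN.Pool(workers) as pool:
        return pool.map(square, list(values))
```

Because spawned workers re-import the main module, a call such as `parallel_squares(range(5))` has to sit under an `if __name__ == "__main__":` guard when used from a script.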
I can write code that works NOW, but there is no guarantee that code will work in the future.