| Svelte Hacker News

JoeAltmaier 10 years ago

Confused. Even GC requires something like malloc - a heap manager for allocating new storage. GC adds complex algorithms on top of that. It has to be slower than malloc.

openasocket 10 years ago

For new objects, a lot of GC systems use a memory slab. Essentially, a buffer where you just allocate new data at the end, and then perform garbage collection once the slab is filled up. With this allocation is basically just a pointer bump. When your program requires lots of small allocations that all go out of use at around the same time, GC will significantly out-perform a naive malloc/free for every piece of data.
- scott_s 10 years ago
  
  I am still just as skeptical as JoeAltmaier was. The fast-path for modern memory allocators are also just pointer-bumps in slabs. And if you keep mallocing and freeing something of the same size on the same thread, you are very likely to keep getting the same memory location on each allocation, which has excellent cache behavior.
  For an excellent paper on state-of-the-art memory allocation techniques (which also has a good overview of prior techniques, and how their works builds on that), see "Fast, Multicore-Scalable, Low-Fragmentation Memory Allocation through Large Virtual Memory and Global Data Structures" from OOSPLA 2015: http://2015.splashcon.org/event/oopsla2015-fast-multicore-sc...
  If you're aware of published works which show garbage collection consistently outperforming manual memory management for the reasons you explained, I'm very interested in reading them.
  
  JoeAltmaier 10 years ago
  
  Agreed, except for the 'excellent cache behavior'. If you're getting cache hits on reallocated memory, you're looking at uninitialized data!
  
  cfallin 10 years ago
  
  Stores access the cache too. When you initialize your newly malloc()'d memory, ignoring the previous data, you still benefit from having the cache block present in-cache.
  
  scott_s 10 years ago
  
  To build on what you said: we're interested in the cache line, not the data in that cache line. Both reads and stores access the cache line, and both can see misses. It's not that you're seeing cached data, it's that your store instruction does not have to wait for the cache line to come into the L1.
  
  openasocket 10 years ago
  
  Some of the state-of-the-art allocators close the performance gap pretty significantly, I was comparing to the classic Doug Lea style allocator. I imagine with something like your link the difference will be much less noticeable for allocation speed. However, not having an explicit free can lead to performance benefits, though it seems counter-intuitive. Imagine you are allocating a large number of small, short-lived objects. In the manual case, you have to explicitly call free() for each object as it goes out of scope. The free() call is rather fast for slab allocators, and free-ing everything runs in time O(dead_objects). A well-designed generational GC uses a copy collector for the eden generation, which essentially allocates a new slab and copies over all the live objects, and then frees the old slab. Copying objects is relatively fast for small objects, and this algorithm runs in time O(live_objects), which is much much smaller than the number of dead objects. Plus The copying step moves all the live objects together on the slab which reduces memory fragmentation.
  
  scott_s 10 years ago
  
  You're discounting the cost of the GC approach necessarily having more live objects at once. That will require hitting more slow-path parts of the GC's allocation algorithms (either finding free slabs, finding freed new slabs from a global cache, or allocating new ones from the OS), and it will have poor cache behavior. The manual memory allocator will be on the fast path for malloc and free, as long the number of live objects is less than that of a slab.
  There may be a crossover point where the GC algorithms are better (say, a pathological case where you keep forcing the manual memory allocator to hit its slow paths by allocating just one more than the size of a slab, freeing all of them, then repeat). But I would also not be surprised if there never is a crossover point.
  
  openasocket 10 years ago
  
  > You're discounting the cost of the GC approach necessarily having more live objects at once.
  No, by live objects I don't mean the objects that currently exist in the heap, I mean the objects currently reachable by the program, which is the same whether or not you are using GC. But you are right that GC will use much more heap on average than manual allocation.
  I agree that you can probably always beat GC using manual allocation if you are sufficiently smart about it. For instance, if you have some data structure you know is going to allocate a bunch of small objects, and none of the small objects can outlive the data structure, you can have that data structure create a special slab just for its objects, so when you free the data structure you just need to call one free() on the slab instead of a bunch of free() calls on every object.
  
  scott_s 10 years ago
  
  I don't you need to be particular smart using manual memory allocation to beat GC. You just need to free your memory once you're done with it. Consider:
  for (size_t i = 0; i < 100000000; ++i) { size_t* val = malloc(sizeof(size_t)); *val = i; free(val); }
  Decent memory allocators will just keep reusing the same memory location over-and-over. All 100 million calls to malloc and free will be fast-path calls. GC-based schemes will most likely not be 100 million fast-path allocations. Yes, this is a contrived example, but freeing memory as you no longer need it is common in languages with manual memory allocation. It is also harder, more error prone, and quite often leads to nasty bugs, but it's usually faster.
  I think you're reasoning about just the cost of the frees versus garbage collection cost. But by ignoring the extra memory required on the heap, you're also ignoring the extra slow-path allocations.
  
  openasocket 10 years ago
  
  I believe in this example GC will actually perform better! With GC, this code will fill up the eden slab with a bunch on these integers. When there is no more space in the eden slab, the GC makes a new slab (one malloc() call), copies over any live data by checking what's accesible from the stack (there are no pointer saved on the stack except for the current val) and the old slab is freed (one free() call). So GC calls free 100M/(number of ints we can fit in the eden slab) times, as opposed to the manual case which calls free 100M times. And, just like the manual case, the GC's malloc and free will be fast-path calls (our eden slab will essentially just switch back and forth between two different memory locations). And the cost for tracing out the live variables here is essentially nothing (10-100s of nanoseconds) since nothing is referenced in local variables. The only cost you have to pay with GC in this situation is increased heap usage and, consequently, a larger cache footprint.
  
  scott_s 10 years ago
  
  I would be shocked if the GC performed better. I think you are severely over-estimating how expensive the fast-path for memory allocations are. For this example, decent memory allocators will execute on the order of a dozen instructions for each malloc and free call - no need to allocate new slabs in the allocator, no need to free old ones, no interactions with the operating system or even the internal slab allocator.
  Calling free is not what is expensive. It's what we do when we call free. GC has to do more work because it has to figure things out. Manual memory allocation does not need to spend any instructions figuring anything out because the programmer has already done that. As I have said before, decent memory allocators will keep returning the exact same address for each malloc call in this situation. That is the ideal situation for this case.
  Also keep in mind that larger heap sizes don't come for free - that means more page faults, which will slow you down.
pjmlp 10 years ago

If the language semantics allow for a top of the tops GC implementation,it means a generational collecting GC.
Allocation is a mere pointer bump.