This is a description of the algorithms and data structures used in our conservative garbage collector. I expect the level of detail to increase with time. For a survey of GC algorithms, see for example Paul Wilson's excellent paper. For an overview of the collector interface, see here.
This description is targeted primarily at someone trying to understand the source code. It specifically refers to variable and function names. It may also be useful for understanding the algorithms at a higher level.
The description here assumes that the collector is used in default mode. In particular, we assume that it used as a garbage collector, and not just a leak detector. We initially assume that it is used in stop-the-world, non-incremental mode, though the presence of the incremental collector will be apparent in the design. We assume the default finalization model, but the code affected by that is very localized.
The remaining sections describe the memory allocation data structures, and then the last 3 collection phases in more detail. We conclude by outlining some of the additional features implemented in the collector.
Most static data used by the allocator, as well as that needed by the rest of the garbage collector is stored inside the _GC_arrays structure. This allows the garbage collector to easily ignore the collectors own data structures when it searches for root pointers. Other allocator and collector internal data structures are allocated dynamically with GC_scratch_alloc. GC_scratch_alloc does not allow for deallocation, and is therefore used only for permanent data structures.
The allocator allocates objects of different kinds. Different kinds are handled somewhat differently by certain parts of the garbage collector. Certain kinds are scanned for pointers, others are not. Some may have per-object type descriptors that determine pointer locations. Or a specific kind may correspond to one specific object layout. Two built-in kinds are uncollectable. One (STUBBORN) is immutable without special precautions. In spite of that, it is very likely that most applications currently use at most two kinds: NORMAL and PTRFREE objects.
The collector uses a two level allocator. A large block is defined to be one larger than half of HBLKSIZE, which is a power of 2, typically on the order of the page size.
Large block sizes are rounded up to the next multiple of HBLKSIZE and then allocated by GC_allochblk. This uses roughly what Paul Wilson has termed a "next fit" algorithm, i.e. first-fit with a rotating pointer. The implementation does check for a better fitting immediately adjacent block, which gives it somewhat better fragmentation characteristics. I'm now convinced it should use a best fit algorithm. The actual implementation of GC_allochblk is significantly complicated by black-listing issues (see below).
Small blocks are allocated in blocks of size HBLKSIZE. Each block is dedicated to only one object size and kind. The allocator maintains separate free lists for each size and kind of object.
In order to avoid allocating blocks for too many distinct object sizes, the collector normally does not directly allocate objects of every possible request size. Instead request are rounded up to one of a smaller number of allocated sizes, for which free lists are maintained. The exact allocated sizes are computed on demand, but subject to the constraint that they increase roughly in geometric progression. Thus objects requested early in the execution are likely to be allocated with exactly the requested size, subject to alignment constraints. See GC_init_size_map for details.
The actual size rounding operation during small object allocation is implemented as a table lookup in GC_size_map.
Both collector initialization and computation of allocated sizes are handled carefully so that they do not slow down the small object fast allocation path. An attempt to allocate before the collector is initialized, or before the appropriate GC_size_map entry is computed, will take the same path as an allocation attempt with an empty free list. This results in a call to the slow path code (GC_generic_malloc_inner) which performs the appropriate initialization checks.
In non-incremental mode, we make a decision about whether to garbage collect whenever an allocation would otherwise have failed with the current heap size. If the total amount of allocation since the last collection is less than the heap size divided by GC_free_space_divisor, we try to expand the heap. Otherwise, we initiate a garbage collection. This ensures that the amount of garbage collection work per allocated byte remains constant.
The above is in fat an oversimplification of the real heap expansion heuristic, which adjusts slightly for root size and certain kinds of fragmentation. In particular, programs with a large root set size and little live heap memory will expand the heap to amortize the cost of scanning the roots.
Versions 5.x of the collector actually collect more frequently in nonincremental mode. The large block allocator usually refuses to split large heap blocks once the garbage collection threshold is reached. This often has the effect of collecting well before the heap fills up, thus reducing fragmentation and working set size at the expense of GC time. 6.x will chose an intermediate strategy depending on how much large object allocation has taken place in the past. (If the collector is configured to unmap unused pages, versions 6.x will use the 5.x strategy.)
(It has been suggested that this should be adjusted so that we favor expansion if the resulting heap still fits into physical memory. In many cases, that would no doubt help. But it is tricky to do this in a way that remains robust if multiple application are contending for a single pool of physical memory.)
At the beginning of the mark phase, all root segments are pushed on the stack by GC_push_roots. If ALL_INTERIOR_PTRS is not defined, then stack roots require special treatment. In this case, the normal marking code ignores interior pointers, but GC_push_all_stack explicitly checks for interior pointers and pushes descriptors for target objects.
The marker is structured to allow incremental marking. Each call to GC_mark_some performs a small amount of work towards marking the heap. It maintains explicit state in the form of GC_mark_state, which identifies a particular sub-phase. Some other pieces of state, most notably the mark stack, identify how much work remains to be done in each sub-phase. The normal progression of mark states for a stop-the-world collection is:
The marker correctly handles mark stack overflows. Whenever the mark stack overflows, the mark state is reset to MS_INVALID. Since there are already marked objects in the heap, this eventually forces a complete scan of the heap, searching for pointers, during which any unmarked objects referenced by marked objects are again pushed on the mark stack. This process is repeated until the mark phase completes without a stack overflow. Each time the stack overflows, an attempt is made to grow the mark stack. All pieces of the collector that push regions onto the mark stack have to be careful to ensure forward progress, even in case of repeated mark stack overflows. Every mark attempt results in additional marked objects.
Each mark stack entry is processed by examining all candidate pointers in the range described by the entry. If the region has no associated type information, then this typically requires that each 4-byte aligned quantity (8-byte aligned with 64-bit pointers) be considered a candidate pointer.
We determine whether a candidate pointer is actually the address of
a heap block. This is done in the following steps:
At the end of the mark phase, mark bits for left-over free lists are cleared, in case a free list was accidentally marked due to a stray pointer.
This initial sweep pass touches only block headers, not the blocks themselves. Thus it does not require significant paging, even if large sections of the heap are not in physical memory.
Nonempty small object pages are swept when an allocation attempt encounters an empty free list for that object size and kind. Pages for the correct size and kind are repeatedly swept until at least one empty block is found. Sweeping such a page involves scanning the mark bit array in the page header, and building a free list linked through the first words in the objects themselves. This does involve touching the appropriate data page, but in most cases it will be touched only just before it is used for allocation. Hence any paging is essentially unavoidable.
Except in the case of pointer-free objects, we maintain the invariant that any object in a small object free list is cleared (except possibly for the link field). Thus it becomes the burden of the small object sweep routine to clear objects. This has the advantage that we can easily recover from accidentally marking a free list, though that could also be handled by other means. The collector currently spends a fair amount of time clearing objects, and this approach should probably be revisited.
In most configurations, we use specialized sweep routines to handle common small object sizes. Since we allocate one mark bit per word, it becomes easier to examine the relevant mark bits if the object size divides the word length evenly. We also suitably unroll the inner sweep loop in each case. (It is conceivable that profile-based procedure cloning in the compiler could make this unnecessary and counterproductive. I know of no existing compiler to which this applies.)
The sweeping of small object pages could be avoided completely at the expense of examining mark bits directly in the allocator. This would probably be more expensive, since each allocation call would have to reload a large amount of state (e.g. next object address to be swept, position in mark bit table) before it could do its work. The current scheme keeps the allocator simple and allows useful optimizations in the sweeper.
The collector makes an initial pass over the table of finalizable objects, pushing the contents of unmarked objects onto the mark stack. After pushing each object, the marker is invoked to mark all objects reachable from it. The object itself is not explicitly marked. This assures that objects on which a finalizer depends are neither collected nor finalized.
If in the process of marking from an object the object itself becomes marked, we have uncovered a cycle involving the object. This usually results in a warning from the collector. Such objects are not finalized, since it may be unsafe to do so. See the more detailed discussion of finalization semantics.
Any objects remaining unmarked at the end of this process are added to a queue of objects whose finalizers can be run. Depending on collector configuration, finalizers are dequeued and run either implicitly during allocation calls, or explicitly in response to a user request.
The collector provides a mechanism for replacing the procedure that is used to mark through objects. This is used both to provide support for Java-style unordered finalization, and to ignore certain kinds of cycles, e.g. those arising from C++ implementations of virtual inheritance.
The most significant modification is that the collector always runs in the allocating thread. There is no separate garbage collector thread. If an allocation attempt either requests a large object, or encounters an empty small object free list, and notices that there is a collection in progress, it immediately performs a small amount of marking work as described above.
This change was made both because we wanted to easily accommodate single-threaded environments, and because a separate GC thread requires very careful control over the scheduler to prevent the mutator from out-running the collector, and hence provoking unneeded heap growth.
In incremental mode, the heap is always expanded when we encounter insufficient space for an allocation. Garbage collection is triggered whenever we notice that more than GC_heap_size/2 * GC_free_space_divisor bytes of allocation have taken place. After GC_full_freq minor collections a major collection is started.
All collections initially run interrupted until a predetermined amount of time (50 msecs by default) has expired. If this allows the collection to complete entirely, we can avoid correcting for data structure modifications during the collection. If it does not complete, we return control to the mutator, and perform small amounts of additional GC work during those later allocations that cannot be satisfied from small object free lists. When marking completes, the set of modified pages is retrieved, and we mark once again from marked objects on those pages, this time with the mutator stopped.
We keep track of modified pages using one of three distinct mechanisms:
In particular, it is very difficult for the collector to stop all other threads in the system and examine the register contents. This is currently accomplished with very different mechanisms for different Pthreads implementations. The Solaris implementation temporarily disables much of the user-level threads implementation by stopping kernel-level threads ("lwp"s). The Irix implementation sends signals to individual Pthreads and has them wait in the signal handler. The Linux implementation is similar in spirit to the Irix one.
The Irix implementation uses only documented Pthreads calls, but relies on extensions to their semantics, notably the use of mutexes and condition variables from signal handlers. The Linux implementation should be far closer to portable, though impirically it is not completely portable.
All implementations must intercept thread creation and a few other thread-specific calls to allow enumeration of threads and location of thread stacks. This is current accomplished with # define's in gc.h, or optionally by using ld's function call wrapping mechanism under Linux.
Comments are appreciated. Please send mail to boehm@acm.org