In the last post, Iâ€™ve been talking about the basics of garbage collection and the generational garbage collector. While the generational garbage collector brought huge performance benefits by getting the large, old generation memory area to be infrequently visited by the collector, it still wasnâ€™t enough for the new era of faster processors and memory sizes which spawned larger applications with multiple threads running concurrently, creating loads of garbage. The original generational collector was single threaded, and was called the serial collector. A parallel collector was required, and since the young and old generation memory spaces are different in the way the objects residing in them are used, so are the collectors implemented for them.
Since explaining multi-threaded implementations is difficult, and more-so when itâ€™s with the topic of garbage collection, Iâ€™ve used illustrations that depict the state of a one or two threads at a time; in reality, the amount of threads running is equal to the number of processors the machine has, which can easily be a two-digit number. I hope the explanations are clear enough, and as always questions are welcomed!
The parallel implementation for the young generation memory space is simple and is as follows: the root objects are divided among the garbage collection threads, making the marking phase much shorter. As soon as an object is found to be live, it is copied to the To area immediately (or to the old generation space, if he survived enough iterations). This can be accomplished by allocating a segment in the To area per thread, allowing the different threads to copy objects to the To area without the need for thread synchronization. Since allocation is done in Eden anyway, the To area is allowed to be a bit fragmented without causing harm.
The following image illustrates how a single thread claims a single object. Obviously, in reality all threads allocated for the collection task are sweeping in a non-synchronized way, unaware of each otherâ€™s work.
Parallel compacting collector
The implementation for the old generation memory space is much more complex. Remember that this memory is significantly larger and is occupied by objects that are less likely to be garbage. This operation is a three steps procedure: first, the memory space is divided into regions. Then, the parallel marking starts, with the root objects being divided among collector threads. When a live object is encountered, the region it is located in is updated so that at the end of the process, each region contains information stating how much live data is available in it, and where live objects are located in it.
After thatâ€™s done, the summary step comes along where each region is examined for its density – the amount of live bytes compared to the total bytes in the region. An assumption is now made: since we know that previous collections had been made which compacts the bytes â€œto the leftâ€, and we know that new objects moved into the old generation are allocated on its â€œrightâ€ side, we can assume that the left side of the memory is filled with older objects than the right side, and in respect with the generational collection idea, the more â€œto the leftâ€ an object is the less likely it for it to be dead. The garbage collection uses this assumption for performance optimization: starting at the left most region it looks for the first region with a density which is worthwhile for collection. All regions to the left of that region are not collected and are called the dense prefix.
In addition, during summary the collector can tell what amount of space is going to remain in each region after its compacting so it can already choose which regions are going to be filling other regions (typically, regions on the â€œrightâ€ are filling regions on the â€œleftâ€). It can also note for each region which regions will be filling it, and for each filling region where itâ€™s going to be located after sweeping, or where the first live byte is going to be located.
Now the sweeping step itself starts. Regions that can be collected without synchronization with other threads are being divided between collector threads. These are regions that do not fill other regions, so theyâ€™re compacted into themselves, or regions that are already completely empty already and can just be claimed by the collector. After a thread finishes claiming a region, it immediately starts filling that region from the regions destined to fill it – this information was determined previously by the summary step. Since it knows no other thread is reading from that source region or writing to the destination region (as the thread is in charge of both) it can make the filling and the claiming without need of synchronization.
Note that in the following image the collection occurs over only 60% of the heap (making the collection twice as fast as if it would collect everything), and reclaiming 82% of the garbage. That means that in order to collect the additional 18%, the garbage collector would have to work just as hard as it did for the first 82%! Obviously, these are not precise calculations, but surely the point is made: this is the reason the dense prefix is an important optimization step.
Concurrent mark-sweep collector
While both prior collectors mentioned are parallel and take a lot less time to operate, they still stop the application threads while working. Some applications might have a different requirement though: they might require a more deterministic Java, and are willing to sacrifice some performance in order to get a shorter to non-existent waiting times caused by the garbage collector. For those applications the Java VM supplies the Concurrent Mark-Sweep garbage collector.
This collector uses four steps: first, the initial mark step stops the application for a very short period of time while a single collection thread marks the first level of objects directly connected to the root objects. After that, the application continues to work normally while the collector performs a concurrent mark, marking the rest of the objects while the application keeps running. Because the application might change references to the object graph by the time the concurrent marking is finished, the collector performs an additional step: the remarking, which stops the application for another brief period. When the application changes an object it Is marked as changed. During the remarking step the application checks only those changed objects, and to make things faster it distributes the load between multiple threads. Finally, the collector performs a concurrent sweep which does not compact the memory area in order to save time and operates concurrently with the application threads.
Since the memory area is not compacted, the memory allocator is different in two ways: first, it needs to keep multiple pointers to free memory blocks categorized by their size to allow allocation within the fragments of free memory. Second, it keeps count of the requested lengths to determine the popular object sizes within the application; then, it uses this information to split and merge the free block fragments to match future requirements by the application, assuming that the popular sizes will continue to be popular and often requested.
This covers everything Iâ€™ve managed to gather about the way garbage collectors work. Note that for some applications (typically small, client-side ones) the parallel collectors are not the best. The Java VM selects which collectors to use automatically when its loading according to the properties of the machine and operating system itâ€™s running on. You can read all about it on Sunâ€™s Java VM ergonomics document.