The Kernel Hacker's Bookshelf: UNIX Internals [LWN.net]

September 3, 2008

This article was contributed by Valerie Henson

Back in 2001, I landed my (then) dream job as a full-time Linux kernel developer and distribution maintainer for a small embedded systems company. I was thrilled - and horrified. I'd only been working as a programmer for a couple of years and I was sure it was only a matter of time before my new employer figured out they'd hired an idiot. The only solution was to learn more about operating systems, and quickly. So I pulled out my favorite operating systems textbook and read and re-read it obsessively over the course of the next year. It worked well enough that my company tried very hard to convince me not to quit when I got bored with my "dream job" and left to work at Sun.

That operating systems textbook was UNIX Internals by Uresh Vahalia. UNIX Internals is a careful, detailed examination of multiple UNIX implementations as they evolved over time, from the perspective of both the academic theorist and the practical kernel developer. What makes this book particularly valuable to the practicing operating systems developer is that the review of each operating systems concept - say, processes and threads - is accompanied by descriptions of specific implementations and their histories - say, threading in Solaris, Mach, and Digital UNIX. Each implementation is then compared on a number of practical levels, including performance, effect on programming interfaces, portability, and long-term maintenance burden - factors that Linux developers care passionately about, but are seldom considered in the academic operating systems literature.

UNIX Internals was published in 1996. A valid question is whether a book on the implementation details of UNIX operating systems published so long ago is still useful today. For example, Linux is only mentioned briefly in the introduction, and many of the UNIX variants described are now defunct. It is true that UNIX Internals holds relatively little value for the developer actively staying up to date with the latest research and development in a particular area. However, my personal experience has been that many of the problems facing today's Linux developers are described in this book - and so are many of the proposed solutions, complete with the unsolved implementation problems. More importantly, the analysis is often detailed enough that it describes exactly the changes needed to improve the technique, if only anyone took the time to implement them.

In the rest of this review, we'll cover two chapters of UNIX Internals in detail, "Kernel Memory Allocation" and "File System Implementations." The chapter on kernel memory allocation is an example of the historical, cross-platform review and analysis that sets this book apart, covering eight popular allocators from several different flavors of UNIX. The chapter on file system implementations shows how lessons learned from the oldest and most basic file system implementations can be useful when solving the latest and hottest file system design problems.

Kernel Memory Allocation

The kernel memory allocator (KMA) is one of the most performance-critical kernel subsystems. A poor KMA implementation will hurt performance in every code path that needs to allocate or free memory. Worse, it will fragment and waste precious kernel memory - memory that can't be easily freed or paged out - and pollute hardware caches with instructions and data used for allocation management. Historically, a KMA was considered pretty good if it only wasted 50% of the total memory allocated by the kernel.

Vahalia begins with a short conceptual description of kernel memory allocation and then immediately dives into practical implementation, starting with page-level allocation in BSD. Next, he describes memory allocation in the very earliest UNIX systems: a collection of fixed-size tables for structures like inodes and process table entries, occasional "borrowing" of blocks from the buffer cache, and a few subsystem-specific ad hoc allocators. This primitive approach required a great deal of tuning, wasted a lot of memory, and made the system fragile.

What constitutes a good KMA? After a quick review of the functional requirements, Vahalia lays out the criteria he'll use to judge the allocators: low waste (fragmentation), good performance, simple interface appropriate for many different users, good alignment, efficient under changing workloads, reassignment of memory allocated for one buffer size to another, and integration with the paging system. He also takes into consideration more subtle points, such as the cache and TLB footprint of the KMA's code, along with cache and lock contention in multi-processor systems.

[PULL QUOTE: This is an example of how even the oldest and clunkiest algorithms can influence the design of the latest and greatest. END QUOTE] The first KMA reviewed is the resource map allocator, an extremely simple allocator using a list of <base, size> pairs describing each free segment of memory, sorted by base address. The charms of the resource map allocator include simplicity and allocation of exactly the size requested; the vices include high fragmentation and poor performance under nearly every workload. Even this allocation algorithm is useful under the right circumstances; Vahalia describes several subsystems that still use it (System V semaphore allocation and management of free space in directory blocks on some systems) and some minor tweaks that improve the algorithm. One tweak to the resource map allocator keeps the description of each free region in the first few bytes of the region, a technique later used in the state-of-the-art SLUB allocator in the Linux kernel. This is an example of how even the oldest and clunkiest algorithms can influence the design of the latest and greatest.

Each following KMA is discussed in terms of the problems it solves from previous allocators, along with the problems it introduces. The resource map's sorted list of base/size pairs is followed by power-of-two free lists with a one-word in-buffer header (better performance, low external fragmentation, but high internal fragmentation, esp. for exact power-of-two allocations), the McKusick-Karels allocator (power-of-two free lists optimized for power-of-two allocation; extremely fast, but prone to external fragmentation), the buddy allocator (buffer splitting on power-of-two boundaries plus coalescing of adjacent free buffers; poor performance due to unnecessary splitting and coalescing), and the lazy buddy allocator (buddy plus delayed buffer coalescing; good steady-state performance but unpredictable under changing workloads). The accompanying diagrams of the data structures and buffers used to implement each allocator are particularly helpful in understanding the structure of the allocators.

After covering the simpler KMAs, we get into more interesting territory: the zone allocator from Mach, the hierarchical allocator from Dynix, and the SLAB allocator, originally implemented on Solaris and later adopted by several UNIXes, including Linux and the BSDs. Mach's zone allocator is the only fully garbage-collected KMA studied, with the concomitant unpredictable system-wide performance slowdowns during garbage collection, which would strike it from most developers' lists of useful KMAs. But as with the resource map allocator, we still have lessons to learn from the zone allocator. Many of the features of the zone allocator also appear in the SLAB allocator, commonly considered the current best-of-breed KMA.

The zone allocator creates a "zone" of memory reserved for each class of object allocated (e.g., inodes), similar to kmem caches in the later SLAB allocator. Pages are allocated to a zone as needed, up to a limit set at zone allocation time. Objects are packed tightly within each zone, even across pages, for very low internal fragmentation. Anonymous power-of-two zones are also available. Each zone has its own free list and once a zone is set up, allocation and freeing simply add and remove items from the per-zone free list (free list structures are also allocated from a zone). Memory is reclaimed on a per-page basis by the garbage collector, which runs as part of the swapper task. It uses a two-pass algorithm: the first pass counts up the number of free objects in each page, and the second pass frees empty pages. Overall, the zone allocator was a major improvement on previous KMAs: fast, space efficient, and easy to use, marred only by the inefficient and unpredictable garbage collection algorithm.

The next KMA on the list is the hierarchical memory allocator for Dynix, which ran on the highly parallel Sequent S2000. One of the major designers and implementers is our own Paul McKenney, familiar to many LWN readers as the progenitor of the read-copy-update (RCU) system used in many places in the Linux kernel. The goal of the Dynix allocator was efficient parallel memory allocation, in particular avoiding lock contention between processors. The solution was to create several layers in the memory allocation system, with per-cpu caches at the bottom and collections of large free segments at the top. As memory is freed or allocated, regions move up and down one level of the hierarchy in batches. For example, each per-cpu cache has two free lists, one in active use and the other in reserve. When the active list runs out of free buffers, the free buffers from the reserve list are moved onto it, and the reserve list replenishes itself with buffers from the global list. All the work requiring synchronization between multiple CPUs happens in one big transaction, rather than incurring synchronization overhead on each buffer allocation.

The Dynix allocator was a major advance: 3 - 5 times faster than the BSD allocator even on a single CPU. Its memory reclamation system was far more efficient than the zone allocator's, performed on an on-going basis with bounded worst case performance on each operation. Performance on SMP systems was unparalleled.

The final KMA in this chapter is the SLAB allocator, initially implemented on Solaris and later re-implemented on Linux and BSD. The SLAB allocator refined some existing techniques (simple allocation/free computations for small cache footprint, per-object caches) and introduced several new ones (cache coloring, efficient object reuse). The result is an allocator that was both the best performing and the most efficient by a wide margin - only 14% fragmentation versus 27% for the SunOS 4.1.3 sequential-fit allocator, 45% for the 4.4BSD McKusick-Karel allocator, and 46% for the SunOS 5.x buddy allocator.

Like the zone allocator, SLAB allocates per-object caches (along with anonymous caches in useful sizes) called kmem caches. Each cache has an associated optional constructor and destructor function run on the objects in a newly allocated and newly freed page, respectively (though the destructor has since been removed in the Linux allocator). Each cache is a doubly-linked list of slabs - large contiguous chunks of memory. Each slab keeps its slab data structure at the end of the slab, and divides the rest of the space into objects. Any leftover free space in the slab is divided between the beginning and end of the objects in order to vary the offset of objects with respect to the CPU cache, improving cache utilization (in other words, cache coloring). Each object has an associated 4-byte free list pointer.

The slabs within each kmem cache are in a doubly linked list, sorted so that free slabs are located at one end, fully allocated slabs at the other, and partially allocated slabs in the middle. Allocations always come from partially allocated slabs before touching free slabs. Freeing an object is simple: since slabs are always the same size and alignment, the base address of the slab can be calculated from the address of the object being freed. This address is used to find the slab on the doubly linked list. Free counts are maintained on an on-going basis. When memory pressure occurs, the slab allocator walks the kmem caches freeing the free slabs at the end of the cache's slab list. Slabs for larger objects are organized differently, with the slab management structure allocated separately and additional buffer management data included.

This section of UNIX Internals has aged particularly well, partly because the SLAB allocator continues to work well on modern systems. As Vahalia notes, the SLAB allocator initially lacked optimizations for multi-processor systems, but these were added shortly afterward, using many of the same techniques as the Dynix hierarchical allocator. Since then, most production kernel memory allocators have been SLAB-based. Recently, Christoph Lameter rewrote SLAB to get the SLUB allocator for Linux; both are available as kernel configuration options. (The third option, the SLOB allocator, is not related to SLAB - it is a simple allocator optimized for small embedded systems.) When viewed in isolation, the SLAB allocator may appear arbitrary or over-complex; when viewed in the context of previous memory allocators and their problems, the motivation behind each design decision is intuitive and clear.

File Systems Implementations

UNIX Internals includes four chapters on file systems, covering the user and kernel file system interface (VFS/vnode), implementations of on-disk and in-memory file systems, distributed/network file systems, and "advanced" file system topics - journaling, log-structured file systems, etc. Despite the intervening years, these four chapters are the most comprehensive and practical description of file systems design and implementation I have yet seen. I definitely recommend it over UNIX File System Design and Implementation - a massive sprawling book which lacks the focus and advanced implementation details of UNIX Internals.

The chapter on file systems implementations is too packed with useful detail to review fully in this article, so I'll focus on the points that are relevant to current hot file system design problems. The chapter describes the System V File System (s5fs) and Berkeley Fast File System (FFS) implementations in great detail, followed by a survey of useful in-memory file systems, including tmpfs, procfs (a.k.a. /proc file system), an early variant of a device file system called specfs, and a sysfs-style interface for managing processors. This chapter also covers the implementation of buffer caches, inode caches, directory entry caches, etc. One of the features of this chapter (as elsewhere in the book) is the carefully chosen bibliography. Bibliographies in research papers serve a double purpose as demonstrations of the authors' breadth of knowledge in the area and tend to be cluttered with more marginal references; the per-chapter bibliographies in UNIX Internals list only the most relevant publications and make excellent supplementary reading guides.

System V File System (s5fs) evolved from the first UNIX file system. The on-disk layout consisted of a boot block followed by a superblock followed by a single monolithic inode table. The remainder of the disk is used for data and indirect blocks. File data blocks are located via a standard single/double/triple indirect block scheme. s5fs has no block or inode allocation bitmaps; instead it maintains on-disk free lists. The inode free list is partial; when no more free inodes are on the list, it is replenished by scanning the inode table. Free blocks are tracked in a singly linked list rooted in the superblock - a truly terrifying design from the point of view of file system repair, especially given the lack of backup superblocks.

In many respects, s5fs is simultaneously the simplest and the worst UNIX file system possible: its throughput was commonly as little as 5% of the raw disk bandwidth, it was easily corrupted, it had a 14 character limit on file names, and so on. On the other hand, elements of the s5fs design have come back into vogue, often without addressing the inherent drawbacks still unsolved in the intervening decades.

The most striking example of a new/old design principle illustrated by s5fs is the placement of most of the metadata in one spot. This turned out to be a key performance problem for s5fs, as every uncached file read virtually guaranteed a disk seek of non-trivial magnitude between the location of the metadata at the beginning of the disk and the file data, located anywhere except the beginning of the disk. One of the major advances of FFS was to distribute inodes and bitmaps evenly across the disk and allocate associated file data and indirect blocks nearby. Recently, collecting metadata in one place has returned as a way to optimize file system check and repair time as well as other metadata-intensive operations. It also appears in designs that keep metadata on a separate high-performance device (usually solid state storage).

The problems with these schemes are the same as the first time around. For the fsck optimization case, most normal workloads will suffer from the required seek for reads of file data from uncached inodes (in particular, system boot time would suffer greatly). In the separate metadata device case, the problem of keeping a single, easily-corrupted copy of important metadata returns. Currently, most solid-state storage is less reliable than disk, yet most proposals to move file system metadata to solid state storage make no provision for backup copies on disk.

Another cutting edge file system design issue first encountered in s5fs is backup, restore, and general manipulation of sparse files. System administrators quickly discovered that it was possible to create a user-level backup that could not be restored because the tools would attempt to actually write (and allocate) the zero-filled unallocated portions of sparse files. Even more intelligent tools that do not explicitly write zero-filled portions of files still had to pointlessly copy pages of zeroes out of the kernel when reading sparse files. In general, the file and socket I/O interface requires a lot of ultimately unnecessary copying of file data into and out of the kernel for common operations. It has only been in the last few years that more sophisticated file system interfaces have been proposed and implemented, including SEEK_HOLE/SEEK_DATA and splice() and friends.

The chapters on file systems are definitely frustratingly out of date, especially with regard to advances in on-disk file system design. You'll find little or no discussion of copy-on-write file systems, extents, btrees, or file system repair outside of the context of non-journaled file systems. Unfortunately, I can't offer much in the way of a follow-up reading list; most of the papers in my file systems reading list are covered in this book (exceptions include the papers on soft updates, WAFL, and XFS). File systems developers seem to publish less often than they used to; often the options for learning about the cutting edge are reading the code, browsing the project wiki, and attending presentations from the developers. Your next opportunity for the latter is the Linux Plumbers Conference, which has a number of file system-related talks.

Another major flaw in the book, and one of the few places where Vahalia was charmed by an on-going OS design fad, is the near-complete lack of coverage of TCP/IP and other networking topics (the index entry for TCP/IP lists only two pages!). Instead, we get an entire chapter devoted to streams, at the time considered the obvious next step in UNIX I/O. If you want to learn more about UNIX networking design and implementation, this is the wrong book; buy some of the Stevens and Comer networking books instead.

Summary

UNIX Internals was the original inspiration for the Kernel Hacker's Bookshelf series, simply because you could always find it on the bookshelf of every serious kernel hacker I knew. As the age of the book is its most serious weakness, I originally intended to wait until the planned second edition was released before reviewing it. To my intense regret, the planned release date came and went and the second edition now appears to have been canceled.

UNIX Internals is not the right operating systems book for everyone; in particular, it is not a good textbook for an introductory operating systems course (although I don't think I suffered too much from the experience). However, UNIX Internals remains a valuable reference book for the practicing kernel developer and a good starting point for the aspiring kernel developer.

(Log in to post comments)

The Kernel Hacker's Bookshelf: UNIX Internals

Posted Sep 3, 2008 22:07 UTC (Wed) by pr1268 (subscriber, #24648) [Link]

It's really hard to make money from books like this. A lot of effort gets put into them, and then you're lucky if they sell 5000 copies, because only a very rarefied audience is interested in them. And these days, people are much more likely to get the information they want from the web.

Which is a shame, really. It's interesting to see how both the quality and availability of information increased substantially after the printing press was invented, but now with the Internet it seems that the quality is decreasing even though the availability continues to increase.

Thanks to Valerie for her contribution. Articles like hers make my subscription to LWN all the more worthwhile.

The Kernel Hacker's Bookshelf: UNIX Internals

Posted Sep 3, 2008 23:54 UTC (Wed) by mattdm (subscriber, #18) [Link]

I'm certain the exact same comment was made around 1450 or so about the printing press vs. hand-lettered books. I think actually the 90%-of-everything rule still applies, just has it always has. It's just a lot easier for you to see more of it, and if you take a certain attitude it's easy to let the junk occlude the good stuff.

The Kernel Hacker's Bookshelf: UNIX Internals

Posted Sep 4, 2008 1:32 UTC (Thu) by pr1268 (subscriber, #24648) [Link]

Good point. I was going on pure intuition when I mentioned that the quality of articles increased after the printing press was invented--my intuitive thought was that errors in hand-scribed articles likely decreased. But then again, with the rapid proliferation of printing press-published books, those few errors that made it through propagated rapidly and extensively. Not unlike the Internet of today....

And thanks to JoeBuck's comment below. I agree that those who publish quality content online aren't doing so for the $$$ (or €, ¥, £, or whatever). :)

90%

Posted Sep 4, 2008 3:03 UTC (Thu) by ncm (subscriber, #165) [Link]

The 90%-of-everything-is-crap rule is known as Sturgeon's Law.

The Kernel Hacker's Bookshelf: UNIX Internals

Posted Sep 4, 2008 0:24 UTC (Thu) by JoeBuck (subscriber, #2330) [Link]

These kinds of works are usually written to make reputations rather than to make money; having written the definitive work on some subject can certainly help with getting hired, or getting tenure.

The Kernel Hacker's Bookshelf: UNIX Internals

Posted Sep 4, 2008 4:24 UTC (Thu) by BrucePerens (guest, #2510) [Link]

Yes. But what you see, sometimes, because of that is that the book is not revised because the goal of tenure, etc., has been reached and the author isn't motivated to make the investment. And some folks don't get the intangible rewards they wanted from being an author (not that tenure is really intangible).

Of the various books in my series, I think only the Samba one got done in Samba's source code repository, and that one is probably the one that has most continued to live. If I do a series again, I'll make sure that all books have online source code repositories. Prentice has not done a good job of making the source available for my first 24 titles. But even with the Samba book it was hard to get the publisher to do a second edition, because there were lots of books that would come back from stores.

I can think of another book on Unix that had high expectations but turned out to be pretty much a flop for its author. I'd better not say which one, it'll start a fight.

The Kernel Hacker's Bookshelf: UNIX Internals

Posted Sep 8, 2008 14:41 UTC (Mon) by branden (subscriber, #7029) [Link]

Why not give a shot, Bruce? Don't muzzle the discussion--your thoughts might trigger the author to barrel into this forum and take stock of his publishing endeavors, rather than getting his breeches in a bunch; he might have been keeping his powder dry for the past few years, and now has some high-caliber insights to volley.

The Kernel Hacker's Bookshelf: UNIX Internals

Posted Sep 8, 2008 20:05 UTC (Mon) by BrucePerens (guest, #2510) [Link]

If you knew who I was talking about you'd not say that :-)

The Kernel Hacker's Bookshelf: UNIX Internals

Posted Sep 8, 2008 20:09 UTC (Mon) by corbet (editor, #1) [Link]

I get the sense the poster knew exactly who you were talking about...

The Kernel Hacker's Bookshelf: UNIX Internals

Posted Sep 8, 2008 20:12 UTC (Mon) by BrucePerens (guest, #2510) [Link]

Oops. I read right over the double-entendres.

The Kernel Hacker's Bookshelf: UNIX Internals

Posted Sep 9, 2008 21:03 UTC (Tue) by branden (subscriber, #7029) [Link]

Bruce,

I must be getting subtle in my old age...

Perish the thought!

The Kernel Hacker's Bookshelf: UNIX Internals

Posted Sep 4, 2008 0:58 UTC (Thu) by quotemstr (subscriber, #45331) [Link]

Sometimes, what's available online is excellent. Consider Ulrich Drepper's series on memory serialized here at LWN:

lwn.net/Articles/250967/

1978 called, wants its garbage collectors back

Posted Sep 4, 2008 13:04 UTC (Thu) by i3839 (subscriber, #31386) [Link]

I thought Python's GC also used reference counting. What better alternatives are there? Scanning memory for anything that looks like a pointer doesn't seem to be one. Perhaps built a reference graph of all allocated memory, and remove all disjunct pieces or something? But that seems to have more overhead than reference counting. I don't know much about GC...

1978 called, wants its garbage collectors back

Posted Sep 4, 2008 13:22 UTC (Thu) by knobunc (subscriber, #4678) [Link]

The "What the Heck Is?" series of posts by Dan Sugalski was good:
www.sidhe.org/~dan/blog/archives/cat_what_the_heck...

In particular he has one about GC:
www.sidhe.org/~dan/blog/archives/000200.html

He was posting good overviews of CS topics so that people could work on Parrot (the proposed Perl VM).

But a more thorough review of GC technology on LWN would be appreciated!

-ben

1978 called, wants its garbage collectors back

Posted Sep 4, 2008 14:06 UTC (Thu) by rwmj (subscriber, #5474) [Link]

I don't know much about GC...

So it seems. Have a look at Wikipedia then for the serious end of garbage collection take a look at Richard Jones [no relation]'s garbage collection resources page and in particular his book.

Rich.

1978 called, wants its garbage collectors back

Posted Sep 5, 2008 13:12 UTC (Fri) by i3839 (subscriber, #31386) [Link]

Umm, Wikipedia just mentions two basic types of GC, a reachability based one and a reference counting based one, and I mentioned both methods without knowing a thing about GC. Same for the "What the heck is: Garbage Collection" article mentioned in the comment above, and the book doesn't seem to mention anything fundamentally different either, just goes into details (same ideas, different algorithms, though interesting ones. Especially the copying one is smart, as it solves the fragmentation problem). Isn't there another paradigm of doing GC out there? Or is the little I know really all that's to know about GC on a high level?

I was totally wrong on the overhead/cost though. ;-)

1978 called, wants its garbage collectors back

Posted Oct 11, 2008 9:24 UTC (Sat) by anomalizer (subscriber, #53112) [Link]

If you are interested in GC options in general, try reading up about the various GC developments that have been happening in the Java world. Things have moved a long way from reference counting algoithms. The caveat of course is that what works for Java is not going to work for the kernel.

But still, no point in using GC if not needed

Posted Sep 4, 2008 15:45 UTC (Thu) by PaulMcKenney (subscriber, #9624) [Link]

There are indeed a number of high-performance garbage collectors, even real-time garbage collectors. However, even these garbage collectors impose significant overhead and slowdown -- the overhead/slowdown is not eliminated, but rather spread out in bite-sized chunks over time. Alternatively, high-priority tasks are permitted to preempt the garbage collector, which has its own set of advantages and disadvantages. And the key point is that KMAs can be constructed without requiring garbage collection in the common case.

Less-common cases where KMA garbage collection can be helpful include: (1) when flat out of memory, reclaiming blocks from the per-CPU caches lets the system creep along a bit farther, perhaps giving OOM a chance to do its job, (2) when shipping blocks from one NUMA node to another upon kfree(), one might use a timeout to avoid leaving blocks pending for KMAs that care about segregating memory on a NUMA basis (and DYNIX/ptx needed to care, given the huge remote latencies of the NUMA-Q hardware), (3) where real-time response is required, high-priority processes might just dump memory onto a per-CPU list and let low-priority processes clean up after them, and (4) if the Linux kerne