Linux/Reading the Linux Kernel Sources
Reading the Linux Kernel Sources
Part of an ongoing Linux Kernel exploration by SVLUG - the Silicon Valley Linux Users Group
You might enjoy our first YouTube video!
Good places to follow along:
Look out for this: sometimes code gets moved from one file or one directory to another.... or converted from raw assembler (in the arch area) into C and categorized. Therefore the LXR site letting you surf the older source trees can be very interesting. A specific example of this is the eventual combining of several closely related architectures into one arch with sub-architectures.
- Wikipedia handy words
- Potentially useful books
- Assembly Language Step by Step by Jeff Duntemann (Wiley 2000) ISBN 0471375233 - section on the gas assembly language syntax (Chapter 12), INTs used in Linux (Chapter 13), etc.
- Linux Device Drivers, 3rd edition, Jonathan Corbet, Alessandro Rubini, and Greg Kroah-Hartman (O'Reilly 2005) ISBN 0-596-00590-3 - covers device drivers for kernel 2.6.10. Full text available for free at LWN
- See also Ralf Brown's x86 interrupt list
Introduction to the Linux Kernel Source
Traditional computer science programming courseware, for the most part, uses source code examples that are over-simplified and academic, giving little insight into how things work in the real world, and into the forces that change source code over time.
The Linux kernel is one of the most widely used pieces of production source code in the world today. It's been under continuous development for over two decades, and has seen ever-growing popularity and usage. In that respect, it's one of the more seminal works of source code for any aspiring engineer to study.
Studying the Linux kernel sources poses a number of challenges. The kernel is comprised of over 15 million lines of code. In addition, the kernel is a unique "program" - one which provides the most fundamental support for the system's hardware, libraries and the applications which run under it. So studying a kernel is considerably more complicated than studying a typical user space C program. We can't just start at int main (int argc, char **argv) ...' and proceed from there.
The Linux kernel has a number of different distinct entry points. It also contains a small amount of assembly code required for firmware loads and kernel boot sequences. Naturally the Linux kernel cannot link with the standard C libraries (since, at their core the C libraries depend on systems calls which are provided by the kernel --- so that would present a classic chicken-and-egg problem).
Here we'll present a number of tips for studying the kernel sources. We'll then provide an outline for one sequence of topics which can be studied in a progressive and productive way. The outline will be primarily presented as links via Cross-referencing Linux
We'll also rely heavily on other online resources such as the LDP Kernel Hacker's Guide (which, although quite dated can give us some historical perspective on the code from a time when the kernel was somewhat smaller and simpler).
Get Linux first release 0.01
Where the CPU Starts Executing
One approach would be to start where the CPU starts executing a freshly loaded kernel image.
Naturally we could ask ourselves: "what is the first part of the compiled kernel that will be executed?" Another way of asking this is: "what is the kernel's 'entry point?'" (Unfortunately search on the obvious Google terms: "Linux kernel 'entry point'" leads to discussions of how system calls transition from user space into kernel space --- which is a different sort of "entry point" (though one well worth studying)).
In order to answer that question we have to consider how a kernel gets loaded into memory and started on a system. When a computer starts up the system memory (RAM) is empty; the only available operating instructions are in ROM (read-only memory) --- also called the firmware.
Naturally the firmware is different for each system architecture. Thus PCs, PowerPC (older Machintosh systems, and IBM POWER, RS/6000 workstations, etc), DEC Alpha, MIPS, SPARC, and other systems each have their own low-level code --- and their own loading conventions. Any given compilation of the kernel will have at least one such entry point.
It's even more complicated than that, however, because there are, in some cases such as the PC, a number of different ways that the kernel can be loaded. For example from hard disk, floppy, CD-ROM or over a network (e.g. via PXE). Older Linux kernels, (before circa 2003) could be dumped onto a raw floppy diskette (using the dd command) and booted from there. We see where that code used to be here:
As the error message here indicates this raw boot model is no longer supported. So we have to use some program like syslinux, lilo, or grub to load the kernel and jump into the code.
Another complication is that the normal PC Linux kernel image is compressed. There is a small header of code which does some minimal memory management and decompresses the rest of the image into RAM. Then that bit of code jumps into the kernel. The PC (x86) code for decompressing a kernel image can be found in .../arch/i386/boot/compressed/head.S, with other architectures' decompression code found by simply swapping out i386 for your architecture of choice.
Of course, this is so tied to the architecture that it is in assembly language making it somewhat more difficult for C programmers to read. There are a few comments at the beginning of this file which reveal a little about the expectations/assumptions that the code must make about how controls has been passed to it. These become requirements for how the bootloader (syslinux, lilo, grub) must prepare things before the processor jumps to the instruction after the startup_32 label. It uses these arguments to find and decompresses the kernel into the correct place and then jumps to the "correct place," which the code only refers to using a register.
According to the comments, the point that the decompression code jumps to is found in arch/$YOUR_ARCH/kernel/head.S and is marked with the macro ENTRY(stext). Again, the trail fades out. Calculated values are jumped to and macros are invoked. The next hint can be found in the Linux Kernel 2.4 Internals page: after the initial assembly, execution jumps to start_kernel() in init/main.c. From there the code becomes much easier to read.
- Is this correct? Is there any point in the Linux kernel sources called before this point when loading a compressed kernel? The filename head.S is the convention used for this meaning.
- Is there currently any way to load an uncompressed Linux kernel on a 32-bit PC? How about under x86_64?
- Is it possible to step through the code with a JTAG based source level debugger in our weekly meeting? Even though this may affect timing, it can tell us what code is actually executed, the calling sequence and the values of variables.
- It is possible to step through the code with a JTAG based source level debugger on some embedded Linux device that boots Linux directly from Flash? (That keeps us from getting bogged down in the details of the MBR/bootloader, which is unnecessarily complicated on many x86 machines). Perhaps b: Tomato_(firmware) ?
Where to Therefrom?
Even if we choose to pursue this path of exploring the kernel ... where does the chain of execution go from there? At the end of head.S we see the code clear the EBX register and jump to the address which is contained at that point in the EBP (base pointer) register. The preceding code was calculating possible offsets between the kernel's compiled in load/start address and the location at which it was actually loaded (if it was built as "relocatable").
Notice that the jump is not to a symbolic label, and the comments in the sources don't tell us where to read next.
It should be obvious that starting our reading where the CPU starts executing might present some challenges that we're not quite ready to tackle.
Where "User Space" is Started
Every good Linux systems administrator should know that the classic UNIX kernel starts exactly one normal user space program. So, perhaps this is a useful place to start reading the kernel sources. We know that several things have to happen before init is started (the root filesystem must be located and mounted, the initial console must be opened and connected to file descriptors 0, 1, and 2 (stdin, stdout, and stderr respectively) and the initial environment must be created.
Each pre-requisites has it's own pre-requisites: the block device on which the rootfs is hosted must be detected and initialized, the memory limits must be scanned (or otherwise auto-detected), any memory management unit (MMU) and programmable interrupt controllers (PICs, APICs, IOAPICs) must be detected, enumerated and programmed; etc.
So we could find where the init process is created and trace backwards from there to learn more about how the system is prepared for its ultimate mission (running our programs, one would think).
We can also ask ourselves what happens after the init process has been started. Of course any competent sysadmin knows what happens out in user space: init reads /etc/inittab and executes all of the processes described there.
From that we can intuit at least some of what the kernel must be doing.
Clearly the scheduler must be running, giving the init process and each of its descendants time to execute code in user space.
Any good UNIX programmer knows that code in user space can only do some very basic operations --- basically computation, arithmetic and string operations, on memory that's already allocated to the process. Everything else involves access to files, devices, or other system calls (requests to the kernel to do provide a service to the program).
So we can also look forward, finding out how the kernel handles system calls, and how it provides filesystems.
Let's start at: http://lxr.linux.no/source/init/main.c#L839
We can see here that the kernel attempts to start /sbin/init, /etc/init, /bin/init and finally falls back to trying /bin/sh. If we read backwards a little bit we see where the kernel tries to execute whatever program was based via the init= kernel argument by the boot loader (and we can follow the hyperlinks to find where the command line was parsed for such an option). Going back a little further we can see where the kernel tries to start the /init (formerly the /linuxrc) if there was one in an initial RAM disk.
We can compare the current version of this file to the oldest one available (from the 1.0.9 kernel version) and see how much is recognizable, and then consider the changes that have accrued over the years, and ponder why those changes where made.
After init Has Been Started
A traditional UNIX kernel only starts one normal user space process, init and thereafter it assumes its roles as the mediator between user space and the system hardware resources.
- Show how the kernel spawns modprobe and hotplug utilities as a counterpoint to the traditional UNIX model
Primarily the kernel schedules CPU time to processes, dispatches signals to them, and handles system calls for them. That's the view of the kernel from the perspective if the applications programmer.
- Show how the kernel handles system calls via entry.S et all
- Compare this to the sysenter VDSO technique
- Other memory mapping tricks?
From another perspective the kernel services interrupts --- hardware events.
The first and most obvious would the periodic events from the system clock --- on a PC those from from a PIT (programmable interrupt timer). The system clock interrupt becomes the heartbeat of the system. During this interrupt the kernel updates the "jiffies" value, possibly updates the kernel real-time clock values, and schedules user space processes and kernels tasks.
(Most interrupts for most I/O devices are divided into a couple of parts, traditionally called the "bottom half" and the "top half" --- the bottom half is normally the minimal amount of work that saves enough state and status that will allow the rest of the work to be deferred until the top half can be scheduled to run --- so these are one source of kernel tasks that can require scheduling; additionally the Linux kernel maintains a number of kernel threads which appear in normal ps listings as processes (with funny names that are enclosed in square brackets. those exist in that form so that the kernel can schedule them using the same mechanisms as for any user space processes).
- The above top half/bottom half explanation is the reverse of what the Linux Device Drivers book says. :-( That is, the top half is normally the minimal amount of work that saves enough state and status that will allow the rest of the work to be deferred until the bottom half can be scheduled to run.
- I see that w: Talk:Interrupt_handler#Wrong also mentions that some people use top/bottom with a swapped meaning of other people. Perhaps, here at Wikiversity, we should use the unambiguous "first level interrupt handler" and "second level interrupt handler" terminology.
- Find the code responsible for handling the system clock interrupt!
- Find one of the simple kernel thread tasks, such as krecoveryd and show how it's initialized and what it does during it's time slices
- Discuss the idle task
End of the Road
While it's useful to start, in some sense, where the user space interactions with the kernel "start" --- it's also useful to have some idea how it all will end. Once the kernel has started and settles into its role, handling system calls, dispatching signals, servicing interrupts, it will do so until ...