Stripping kernel/uboot source to 10% for code reading

When reading kernel or uboot source code, the many duplicate definitions of CPU-specific functions are quite a headache. For example, when I was working on our company's pxa988 uboot code, only ~400 files were relevant, but how many files are there in the whole uboot project? More than 6000!

Here I present a simple but reliable way to get rid of the files that are not immediately needed; it works for both kernel and uboot. It's based on building and file timestamps: after a full build, ideally every relevant file will have been read, and thus had its atime updated, while the irrelevant files keep their old atime.

So here are the steps:

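  1. touch all source code files (this step is mandatory, or else the atime won't update correctly; I don't know exactly why, but I guess it's an optimization, most likely the relatime mount behavior, which skips atime updates when the atime is already newer than the mtime).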
    git ls-tree HEAD -r --name-only|xargs touch
    
  2. touch a sentinel file: touch mark
  3. full build
  4. find all files accessed more recently than mark, like this: find . -anewer mark. These are used (read) by the build.
  5. Similarly, find all files not accessed after mark. These are not used by the build, so they are not so important for your code reading task. Remove them.
  6. After removing those non-build files, do a full build again (with the same kernel/uboot config, of course) to verify it still builds OK.
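
The steps above can be sketched as a few shell functions. This is only a minimal sketch, not my actual scripts: the function names are made up for illustration, and it assumes a find that supports -anewer (GNU or BSD), a sentinel file named mark in the top directory, and a filesystem that is not mounted noatime.

```shell
# Steps 1 and 2: refresh the timestamps of all tracked files, then create the sentinel.
# (-z / -0 keep file names with spaces intact.)
prepare_tree() {
    git ls-tree -r -z --name-only HEAD | xargs -0 touch
    touch mark
}

# Step 4: files read by the build (atime newer than the sentinel's mtime).
# The .git directory is pruned so repository internals are never listed.
list_used() {
    find . -path ./.git -prune -o -type f -anewer mark -print
}

# Step 5: files the build never read. The sentinel itself is skipped too.
list_unused() {
    find . -path ./.git -prune -o -type f ! -anewer mark ! -path ./mark -print
}
```

A run would then be: prepare_tree, do the full build, and inspect the output of list_unused before removing anything.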

But sadly, this idea doesn't work as is. In step 4, you will find that all source files have been accessed (with an atime newer than mark). WTH?

strace(1) comes to the rescue!

(
 cd uboot;
 strace -f time make -j8 helan_ff_config;
 strace -f time make -j8
) 2>&1 | tee ~/strace.log

In ~/strace.log, we can find out who is accessing a file that belongs to a totally different CPU/driver:

$ grep drivers/mmc/arm_pl180_mmci.c ~/strace.log
[pid   684] lstat("drivers/mmc/arm_pl180_mmci.c", {st_mode=S_IFREG|0644, st_size=11286, ...}) = 0
[pid   684] open("drivers/mmc/arm_pl180_mmci.c", O_RDONLY) = 6
[pid   694] lstat("drivers/mmc/arm_pl180_mmci.c", {st_mode=S_IFREG|0644, st_size=11286, ...}) = 0

$ grep ' 684]' ~/strace.log|head -n 5
[pid   684] close(10)                   = 0
[pid   684] execve("/usr/bin/git", ["git", "update-index", "--refresh", "--unmerged"], [/* 78 vars */]) = 0
[pid   684] brk(0)                      = 0x25b6000
[pid   684] access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
[pid   684] mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b5c68ec2000

You can see that it is a git update-index ... command, which is run by a script named setlocalversion (the same script for both kernel and uboot). After hacking this script so that it does not run the git command, the idea works.
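
The two greps above can be combined into one helper. This is a sketch under assumptions: the name who_opened is made up, and it relies on the "[pid N]" prefix that strace -f puts on lines from child processes, matching both open( and openat( calls.

```shell
# who_opened LOG FILE
# Print the execve line of every traced child process that opened FILE.
who_opened() {
    log=$1
    file=$2
    # Keep only open()/openat() calls that mention the quoted path,
    # reduce each line to its pid, then look up what that pid exec'ed.
    grep -F "\"$file\"" "$log" |
        grep -E '^\[pid +[0-9]+\] open(at)?\(' |
        sed -E 's/^\[pid +([0-9]+)\].*/\1/' |
        sort -u |
        while read -r pid; do
            grep -E "^\[pid +$pid\] execve\(" "$log"
        done
}
```

For example, who_opened ~/strace.log drivers/mmc/arm_pl180_mmci.c would print the execve line of pid 684 from the log shown above.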

See my scripts to do it in one command: for uboot and for kernel. You will need to adapt them if you want to use them, of course. Commands not found anywhere else, such as pn (Print Nth field) or file-arg1-arg2 (the file version of arg1-arg2, which does simple set subtraction: echo $(arg1-arg2 'a b c d e' 'e b b') => a c d), can be found in the same github project as the above 2 scripts.
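
If you don't want to fetch those helpers, the set subtraction is easy to approximate. The following is a hypothetical re-implementation for illustration only, not the code from the github project; the function names use underscores so that they are valid in plain sh.

```shell
# Print every word of $1 that does not occur in $2, one per line.
arg1_minus_arg2() {
    for w in $1; do
        case " $2 " in
            *" $w "*) ;;                 # word is in the second set: drop it
            *) printf '%s\n' "$w" ;;
        esac
    done
}

# File version: print the lines of file $1 that are not lines of file $2.
file_arg1_minus_arg2() {
    grep -Fxv -f "$2" "$1"
}

echo $(arg1_minus_arg2 'a b c d e' 'e b b')   # prints: a c d
```

In this workflow the file version is what computes "all tracked files minus the files used by the build".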

Using this idea, the uboot code can be reduced from ~6000 files to ~400, and the kernel from ~40k to ~4k (based on our small embedded system's kernel config, of course), that is, to about 10 percent.

Still, you may want to know about and use beagrep, which is great for reading huge code bases (it can grep 2G of Android source code in 0.23 seconds).