beagrep, grep 2G source code in 0.23 second

Beagrep can grep readlink in Android source code in 0.8 0.231 second when cache is hot, and about 11 seconds when cache is cold.

Using ag or ack or bare grep, the results are a couple of seconds when cache is hot, or a couple of minutes when the cache is cold2.

1 Introduction

I still remember when I read the ``Linux Device Drivers'', 1st Edition, in the preface, the author says:

… The text you are approaching is the result of hours of patient grepping …

I have been using grep to read source code ever since. It's been adequate for a long time, until I started reading Android source code with it. I remember vividly how it took me 30 minutes3 to grep readlink in Android source code (with VCS directory and binary files excluded).

Then I learned about the beagle software which is a desktop search engine, and it occurred to me, what if I use beagle and grep together? I mean, to use beagle first to decide the (relatively a lot smaller) set of possible matching files (i.e., files that contained the word readlink, taking the previous example), then invoke grep on this set only. Usually the job can be done in the blink of an eye.

Beagrep is available on GNU/Linux, Mac OS (requires MacPorts) and Windows (requires CYGWIN). It also works with CJK characters correctly, as long as the document is in utf8.

Because beagrep is so fast, I have also developped some more use cases besides simply grepping through Android source code, for e.g., finding function call sites.

There are other tools similar to beagrep, discussed at the end of this blog.

(Note a Chinese version of this page is available here, but it may be outdated).

2 Install

2.1 Linux install

2.1.1 Using .deb under debian/ubuntu

Update (2014-05-03) I found that building and installing the lastest version of beagrep deb package will sometimes cause runtime exception when running on Ubuntu 12.04. So you are encouraged to build and install from source code directly, which does not show this problem. Another reason is these packages is quite outdated.

I have provided several .deb package files on github, one for debian testing, 3 for ubuntu lucid(10.04), oneiric(11.10) and precise(12.04). They are all for amd64. If you happen to be using one of these, you can download them here. Here's how to install it using ubuntu 12.04 as an example. After you have downloaded the .deb file:

dpkg -i beagrep-for-ubuntu-precise-2012-09-05-ae0a1a2.deb
# The above step will probably get dependency errors, so you fix it:
apt-get -f install
# This will install the dependence packages.

2.1.2 Compile by yourself

If you need compile by yourself, using Ubuntu precise as an example:

git clone https://github.com/baohaojun/beagrep.git -b for-ubuntu-precise
# you can also run "git branch -a" to see other versions of debian/ubuntu supported

# Install build dependencies. You can open debian/control and check
# the build dependencies and install all of them. The following
# command will try to install them automatically, but if it fails then
# you need figure out which packages to install manually. Good luck!
# Or you can contact some debian/ubuntu advanced user for help.
apt-get install $(cat debian/control | perl -ne 'print if m/Build-Depends/..m/Standards-Version/'|grep -v -e Build-Depends:\\\|Standards-Version|perl -npe 's/,/ /g')

# do the build
(set -e; autoreconf -i; ./configure; make -j4; sudo make install; echo OK)

2.1.3 Build .deb by yourself

Update (2014-05-03) I found that building and installing the lastest version of beagrep deb package will sometimes cause runtime exception when running on Ubuntu 12.04. So you are encouraged to build and install from source code directly, which does not show this problem.

Before someone volunteer to make beagrep into debian/ubuntu distribution, you might also find a need to build .deb by yourself (for e.g., to help your colleagues use beagrep easily without needing to compile it themselves).

Here's how I do it:

  1. Make sure you can compile beagrep manually by referring to the previous section.
  2. Make a tar.gz of beagrep (note that the filename must be exact, it is required by dpkg):
    tar czfv ~/.cache/beagrep_0.4.0.orig.tar.gz beagrep --exclude-vcs
    
  3. Untar it and build the .deb package:
    cd ~/tmp
    tar zxfv beagrep_0.4.0.orig.tar.gz
    cd beagrep
    dpkg-buildpackage
    

    The .deb file is generated in the ~/tmp directory.

  4. Refer to 2.1.1 for how to install.

2.2 Windows install

Windows install can only be done compiling by yourself, and be warned, it's not easy…

2.2.1 Install dependencies

You need install CYGWIN, Mono and sqlite for Windows.

For cygwin, you will need the following packages as the minimum (there might be more, though I have tested these should be enough):

nc util-linux git vim rsync inetutils apache2 shutdown make
gnome-common gcc-core gcc-g++ mingw-gcc-core mingw-gcc-g++
mingw64-i686-gcc-core mingw64-i686-gcc-g++
mingw64-x86_64-gcc-core mingw64-x86_64-gcc-g++ screen
cygutils-extra procps wget git-svn

(See after-check-out.sh for how I installed these packages automatically).

For Mono, default download and install from mono project website is OK.

For sqlite, you need download the .zip file for Windows and extract the .dll into your $PATH.

2.2.2 Get beagrep Windows code

git clone git://github.com/baohaojun/beagrep.git -b for-windows

2.2.3 Compile and install

cd beagrep
bash build-win.sh "/cygdrive/c/Program Files (x86)/Mono-3.2.3"

Now, the argument to build-win.sh is the folder where I have installed mono, on my windows7, it is /cygdrive/c/Program Files (x86)/Mono-3.2.3. One thing very important in building and using beagrep on Windows is to get the directory name format correct, because Windows has very weird characters in its pathname, : and Space, which is harmful for writing Makefile and Shell scripts, respectively.

  1. Fixing errors during build

    Note that when you are building, compilation will fail complaining /usr/lib/gcc/i686-pc-mingw32/4.5.2/libstdc++.la not found, this is because it is specified in C:/Program Files (x86)/Mono-3.2.3/lib/pkgconfig/mono.pc, and you need to edit this file to remove it.

3 Usage

3.1 Quick smoke test

cd /tmp; mkdir $$; cd $$; echo main > 1.txt; mkbeagrepidx; beagrep -e 'main' --grep '--color=auto'; true; cd ..; rm $$ -rf

If beagrep has been installed correctly, at the end of the above command's output, you should see something like /.cache/11468/1.txt:1:main.

3.2 Create index

In your source code directory, using android as example:

cd ~/src/android
mkbeagrepidx

This step will cost you quite some time, under my Linux indexing Android cost me about half an hour (it's about the same time you run grep directly on android source). So I'd advise you create a cron job to do it at midnight.

It takes about 8 minutes to index linux kernel (v3.6-rc6):

Debug: IndexWorker Done
Debug: Elapsed time 478.01s.

But the good news is if indexing has already been done before, there re-indexing will only work on those updated files based on file time-stamp. So it will cost you only a few minutes to re-index the whole Android source.

Even better, after an initial indexing, you can do a sub-folder re-index, mkbeagrepidx will ask you if you want to update the index found for upper directory. This generally only takes seconds depending on the size of the sub-folder.

3.3 Man page for mkbeagrepidx

mkbeagrepidx is a simple wrapper over beagrep-build-index. You can configure it for which directories to ignore using --deny-directory-pattern option. By default,

  • The $PWD/out is ignored, because it contains android build output
  • The */.git is ignored, because of well known reason
  • The */.repo is ignored, for the same reason.

The syntax is comma separated shell glob patterns, and you can check how it is converted into regular expression by examining the beginning of mkbeagrepidx output:

Always: Will ignore directories matching regular expression: ^(?:/home/bhj/tmp/test/out)$|^(?:.*/\.repo)$|^(?:.*/\.git)$

You can customize it using several ways, in the order of increasing priority:

  • Not customize it, then the default
    "$PWD/out,*/.repo,*/.git"
    

    will be used.

  • Override it in ~/.mkbeagrepidx.rc, setting the BEAGREP_IGNORE_DIR_PATTERNS environment variable:
    export BEAGREP_IGNORE_DIR_PATTERNS="$PWD/out,*/.repo,*/.git"
    
  • Override it in the .mkbeagrepidx.rc in the current working directory, same as the above.
  • Override it on the command line (you must repeat the default pattern because it won't append):
    mkbeagrepidx --deny-directory-pattern "$PWD/out,*/.repo,*/.git,*/.svn"
    

3.4 Searching using beagrep

Under your source code directory:

cd ~/src/android
beagrep -e "readlink"

3.4.1 Man page for beagrep

Here's a list of all arguments that beagrep takes:

beagrep -e REGEXP_MATCH [-p REGEXP_PATH] [-a ADDITIONAL_WORDS] [-v REGEXP_REVERSE_PATH] [-i] [-f] [-l] [--grep GREP_OPTIONS] [-a]
-e REGEXP_MATCH
This is the minimum required arguments. For e.g., beagrep -e readlink

The REGEXP_MATCH serves 2 purposes:

  • First, it is computed into whole words for querying beagle. For e.g., l] [--grep GREP_OPTIONS] above should be matched with the following REGEXP: l\] \[--grep GREP_OPTIONS\], but it should be converted into 4 words: l grep GREP OPTIONS for beagle.
  • Second, it is used as the regexp for grep to work on.
-a ADDITIONAL_WORDS
means to add more words into the beagle query. This is useful by increasing the work beagle need to do, but reduce the possible work set grep need to work on.
-p REGEXP_PATH
means to limit the search result to those files whose path-name matches REGEXP_PATH.
-v REGEXP_REVERSE_PATH
means to exclude those matched files whose path-name matches REGEXP_REVERSE_PATH.
-i
means to do case insignificant grep.
-f
means to do the match in file-names only. For example, beagrep -e readlink -f will only show results like readlink.h and readlink.c.

This is very useful for finding files. Note that when -f is used, the beagle querying words will be computed differently: only the basename will be used, and filename: is prepended onto each words.

-l
means to list the beagle matched list of files directly, without running grep to match on them.
–grep GREP_OPTIONS
means to pass additional arguments to the grep invocation. For e.g., the -l argument can be passed to beagrep directly, or it can be passed using --grep, they mean different things:

beagrep -e "hello world" -l will show a file containing "hello wonderful world", but beagrep -e "hello world" --grep -l will not show that file as a match.

4 How does it work?

beagrep is a very practical software, it works because of the following observations:

4.1 grep patterns are usually simple

Or rather, they can be decomposed into several simple sub-patterns: whole words.

For example, to grep such a seemingly complex pattern in Android source code:

"JsonToValue(\"\\\\\"hello world\\\\\"\","

In fact, it contained some simpler sub-patterns, i.e., those 3 wholesome English words: JsonToValue hello world. For a file to match this complex pattern, one necessary but not sufficient condition is for this file to contain all these 3 words. And what is good for this job? A search engine! Using beagle, the parent project for beagrep, a desktop search engine, you can find which files (actually, which file in this case) contained these 3 words in the blink of an eye.

Only 1 file contained all 3 words:

$beagrep-files 'JsonToValue hello world '
Beagrep index found at /home/bhj/.cache/for-code-reading//home/bhj/src/gingerbread-tegra/.beagrep
/home/bhj/src/gingerbread-tegra/external/chromium/base/json/json_reader_unittest.cc
/dev/null

So, you can imagine how quick it is to run grep on the set of files containing all required words:

beagrep -e "JsonToValue(\"\\\\\"hello world\\\\\"\","
pat is: 'JsonToValue("\\"hello world\\"",'.
beagrep query argument `JsonToValue hello world '
Beagrep index found at /home/bhj/.cache/for-code-reading//home/bhj/src/gingerbread-tegra/.beagrep
/home/bhj/src/gingerbread-tegra/external/chromium/base/json/json_reader_unittest.cc:168:  root.reset(JSONReader().JsonToValue("\"hello world\"", false, false));
Unmatched ( in regex; marked by <-- HERE in m/JsonToValue( <-- HERE ""hello world"",/ at /home/bhj/bin/beagrep line 98.

To summarize, complete words are what search engines are good for, and fortunately, when grepping source code, we almost always grep using whole words, instead of sub-words. For e.g., this evil pattern r.*e.*a.*d.*l.*i.*n.*k can match our readlink, but do you really need that power of grep?

4.1.1 BTW, creating the regexp pattern automatically in Emacs

From the example above, you can see the actual matched string is:

JsonToValue("\"hello world\"",

but because of meta characters in regexp and shell, the regexp pattern for beagrep to work on is a lot more complex:

"JsonToValue(\"\\\\\"hello world\\\\\"\","

It'd be tragedy if you need type all those \ characters by yourself. So of course I didn't. In fact, when you work in Emacs, after you marked some text and press C-u M-x grep, Emacs will correctly add the \ -s for you, to convert this plain text into a matching regexp (which can be passed to grep by the shell).

Note that last time I checked, the Emacs grep regexp generation code has some bugs, so I rolled my own fix for it, you can check my .emacs for definition of grep-default-command and grep-shell-quote-argument.

4.2 grep keywords are usually interesting

beagrep can greatly quicken the speed of grep, only because it can greatly reduce the working set of files for grep.

Note that you need provide interesting words to search for so as to greatly reduce the working set. By interesting I mean non-common.

For e.g., say you want to grep is. This word is so common in English that almost all files would probably contain it (source code file will probably contain it in comments). Then you are basically running grep nakedly on the whole android source.

Fortunately, this requirement is easy to meet. In the first place, you probably don't want to grep for common words; and even if you do need to, you probably won't grep for one common word alone, which is very uninteresting; thirdly, even if you do need to grep for a common word alone, you can provide more words for beagrep to work on by using its -a option (see the manpage above).

So:

  • Don't grep for include alone, because almost all C/C++ source and header files contain it.
  • Don't grep for import alone, because almost all java source files contain it.

And so on.

5 Other projects using beagrep

Because beagrep is so fast, I have used it in a couple other projects/tools.

5.1 offline Wikipedia

I added CJK character support into beagrep so that both English and Chinese offline Wikipedia can be browsed and subject-searched.

Check it out at https://github.com/baohaojun/system-config, sorry I didn't make it a stand-alone project, it's under the gcode/offline.wikipedia directory.

5.2 grep-func-call and grep-func-call-all

These 2 scripts generate the function usage relationship for easier code reading. Given a function funcA, which is used in a bunch of other functions in the system, these 2 scripts work like this:

  1. Use beagrep to find out in which files funcA appears.
  2. Run ctags-exuberant to tag these files, get the function definition lines: In file FileF, funcB is defined on line N, funcC is defined on line M, and no other functions are between N and M.
  3. If funcA appears in FileF, between line N and line M, then funcB calls funcA. This is not precise, but it's good enough.
  4. It not only works with functions, but also with almost all complete words that beagrep can handle.

The first script works with the current Emacs buffer only, the 2nd works system-wide, searching function call sites though out the whole source repo.

See the following output, note that grep-mode is used for jumping around, and => and <= is used to denote the caller and callee:

-*- mode: grep; default-directory: "~/src/android/kernel/drivers/staging/android/" -*-
Grep started at Mon Jun 30 13:18:42

grep-func-call -e "BINDER_WRITE_READ" --nc -a


Entering directory `/home/bhj/src/android/bionic/libc/kernel/common/linux'
binder.h:61: <=  signed long protocol_version;
binder.h:65: => #define BINDER_WRITE_READ _IOWR('b', 1, struct binder_write_read)
Entering directory `/home/bhj/src/android/external/kernel-headers/original/linux'
binder.h:81: <=         signed long     protocol_version;
binder.h:87: => #define BINDER_WRITE_READ               _IOWR('b', 1, struct binder_write_read)
Entering directory `/home/bhj/src/android/external/valgrind/main/coregrind/m_syswrap'
syswrap-linux.c:4258: <= PRE(sys_ioctl)
syswrap-linux.c:5498: =>            PRE_FIELD_READ("ioctl(BINDER_WRITE_READ).write_buffer",
syswrap-linux.c:5500: =>            PRE_FIELD_READ("ioctl(BINDER_WRITE_READ).write_size",
syswrap-linux.c:5502: =>            PRE_FIELD_READ("ioctl(BINDER_WRITE_READ).write_consumed",
syswrap-linux.c:5504: =>            PRE_FIELD_READ("ioctl(BINDER_WRITE_READ).read_buffer",
syswrap-linux.c:5506: =>            PRE_FIELD_READ("ioctl(BINDER_WRITE_READ).read_size",
syswrap-linux.c:5508: =>            PRE_FIELD_READ("ioctl(BINDER_WRITE_READ).read_consumed",
syswrap-linux.c:5511: =>            PRE_FIELD_WRITE("ioctl(BINDER_WRITE_READ).write_consumed",
syswrap-linux.c:5513: =>            PRE_FIELD_WRITE("ioctl(BINDER_WRITE_READ).read_consumed",
syswrap-linux.c:5517: =>                PRE_MEM_WRITE("ioctl(BINDER_WRITE_READ).read_buffer[]",
syswrap-linux.c:5520: =>                PRE_MEM_READ("ioctl(BINDER_WRITE_READ).write_buffer[]",
Entering directory `/home/bhj/src/android/frameworks/native/cmds/servicemanager'
binder.c:142: <= int binder_write(struct binder_state *bs, void *data, unsigned len)
binder.c:152: =>     res = ioctl(bs->fd, BINDER_WRITE_READ, &bwr);
binder.c:303: <= int binder_call(struct binder_state *bs,
binder.c:339: =>         res = ioctl(bs->fd, BINDER_WRITE_READ, &bwr);
binder.c:357: <= void binder_loop(struct binder_state *bs, binder_handler func)
binder.c:375: =>         res = ioctl(bs->fd, BINDER_WRITE_READ, &bwr);
Entering directory `/home/bhj/src/android/frameworks/native/libs/binder'
IPCThreadState.cpp:795: <= status_t IPCThreadState::talkWithDriver(bool doReceive)
IPCThreadState.cpp:848: =>         if (ioctl(mProcess->mDriverFD, BINDER_WRITE_READ, &bwr) >= 0)
Entering directory `/home/bhj/src/android/kernel/drivers/staging/android'
binder.c:2690: <= static long binder_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
binder.c:2712: =>       case BINDER_WRITE_READ: {
Entering directory `/home/bhj/src/android/kernel/drivers/staging/android'
binder.h:81: <=         signed long     protocol_version;
binder.h:87: => #define BINDER_WRITE_READ               _IOWR('b', 1, struct binder_write_read)

Grep finished (matches found) at Mon Jun 30 13:18:46

5.3 Generate call graph

This is a tool to generate a call graph for the software project you want. I have found its effectiveness is to be questioned, but you can see a picture below:

call-graph.png

This picture is generated for the adb sub-project in android code. You can see which functions are calling adb_connect, and which functions are called by it.

It is generated using beagrep + ctags-exuberant + graphviz, using my wrap scripts like following in the android/system/core directory:

generate-call-graph.pl > call_graph.org
dot-partition.pl call_graph.dot -s adb_connect -m 1 -r 2

6 Other tools similar to beagrep

6.1 Opengrok

This tool is very much like beagrep, it also uses the Lucene search engine! Beagrep uses beagle, which uses the C# implementation of Lucene. I probably wouldn't start working on beagrep had I known opengrok earlier. That said, I'm glad I have beagrep, because it is Emacs friendlier:-)

I do not use opengrok myself since I have beagrep already, but you are encouraged to try it out, it has quite a few nice features. Especially if you do not use Emacs for reading and writing code.

6.2 Ack/Ag

These 2 tools are faster grep, they do not use indexing. Author of Ack has maintained a quite complete list of other grep-like tools at this web page. Both opengrok and beagrep are listed there.

Footnotes:

1

Here's how I improved it from 0.8s to 0.23s: use Mono's AOT option to do the JIT optimization Ahead Of Time. I found out about this option when I was investigating why beagrep on my MacbookAir+Linux is consistently a few tenths of second slower than other machines, which is another story, see beagrep performance tuning on MacbookAir+Linux.

2

Here's how I run grep in Android source tree: time grep -I -r --exclude-dir=.git --exclude-dir=.repo -e readlink ~/src/android. The first time it took 5m20s, second time 1m21s, and third time 3.5s, and it can't be reduced much further. The result is retrieved on a ThinkPad T420 with 8G memory. One thing notable here I think is that it could require multiple runs to reach the minimum 3.5s. Another thing is I tried it multiple times on my MacbookAir with about the same debian installation, grep always takes about 1m as the minimum. Was it because less memory (only 4G for MacbookAir), or was it because SSD and thus different caching strategy? It eludes me. (EDIT: it's because cpu freq is locked at 800MHz, same as the above footnote, see beagrep performance tuning on MacbookAir+Linux.)

3

As stated in the previous footnote, something must have went wrong with this result, 30 minutes is just impossibly long compared to the 5m20s cache cold and 3.5s cache hot stated above. Was it because of a slower machine, or even being run within a virtual machine (on a slower real machine)? I can neither remember nor reproduce now. But I think one problem was the way I used grep, it's something along the line of: find-filter-out-non-src-files | xargs grep readlink. This will cause a lot of grep processes to be created because on most systems there is a max number of characters (typically 128k on 32 bit systems) limit with the command line arguments. When this limit is exceeded, you will see something like bash: XXX: Argument list too long. For this reason, xargs will call its command multiple times. For e.g., when I run this command: seq 1 1000000 | xargs bash -c 'echo -n $#\ ' true, you'd expect 1000000 is printed, because xargs will pass this many arguments to the bash command, but no, instead you will see (try it, see how slow it is!) 23692 21841 21841 21841... printed one by one because xargs must invoke this bash command multiple times and each time pass just as many arguments as possible. This is not the most efficient way for calling grep (with all the arguments creating and passing around and grep processes creating and exiting) and must have contributed to my 30 minutes measurement.