beagrep, grep 2G source code in 2 seconds
Beagrep can grep readlink in Android source code in 0.8 second when
cache is hot, and about 11 seconds when cache is cold.
Using ag or ack or bare grep, the results are a couple of seconds when cache is hot, or a couple of minutes when the cache is cold1.
Table of Contents
1 Introduction
I still remember when I read the ``Linux Device Drivers'', 1st Edition, in the preface, the author says:
… The text you are approaching is the result of hours of patient grepping …
I have been using grep to read source code ever since. It's been
adequate for a long time, until I started reading Android source code
with it. I remember vividly how it took me 30 minutes to grep readlink in Android source code (with VCS directory and binary files
excluded).
Then I learned about the beagle software, and it occurred to me, what
if I use beagle and grep together? I mean, to use beagle first to
decide the (relatively a lot smaller) set of possible matching files
(i.e., files that contained the word readlink, taking the previous
example), then invoke grep on this set only. Usually the job can be
done in a couple of seconds.
Beagrep supports both Linux and Windows (requires CYGWIN).
(Note a Chinese version of this page is available here, but it may be outdated).
2 Install
2.1 Linux install
2.1.1 Using .deb under debian/ubuntu
I have provided several .deb package files on github, one for debian testing, 3 for ubuntu lucid(10.04), oneiric(11.10) and precise(12.04). They are all for amd64. If you happen to be using one of these, you can download them here. Here's how to install it using ubuntu 12.04 as an example:
dpkg -i beagrep-for-ubuntu-precise-2012-09-05-ae0a1a2.deb # will probably get dependency errors apt-get -f install # fix the dependency errors
2.1.2 Compile by yourself
If you need compile by yourself, using Ubuntu precise as an example:
git clone https://github.com/baohaojun/beagrep.git -b for-ubuntu-precise # you can also run "git branch -a" to see other versions of debian/ubuntu supported # Install build dependencies. You can open debian/control and check # the build dependencies and install all of them. The following # command will try to install them automatically, but if it fails then # you need figure out which packages to install manually. Good luck! # Or you can contact some debian/ubuntu advanced user for help. apt-get install $(cat debian/control | perl -ne 'print if m/Build-Depends/..m/Standards-Version/'|grep -v -e Build-Depends:\\\|Standards-Version|perl -npe 's/,/ /g') # do the build (set -e; autoreconf -i; ./configure; make -j4; sudo make install; echo OK)
2.1.3 Build .deb by yourself
Before someone volunteer to make beagrep into debian/ubuntu distribution, you might also find a need to build .deb by yourself (for e.g., to help your colleagues use beagrep easily without needing to compile it themselves).
Here's how I do it:
- Make sure you can compile beagrep manually by referring to the previous section.
- Make a tar.gz of beagrep (note that the filename must be exact, it
is required by dpkg):
tar czfv ~/tmp/beagrep_0.4.0.orig.tar.gz beagrep --exclude-vcs
- Untar it and build the .deb package:
cd ~/tmp tar zxfv beagrep_0.4.0.orig.tar.gz cd beagrep dpkg-buildpackage
The .deb file is generated in the
~/tmpdirectory. - Refer to #using.deb for how to install.
2.2 Windows install
Windows install can only be done compiling by yourself, and be pre-warned, it's not easy…
2.2.1 Install dependencies
You need install CYGWIN and Mono for Windows.
2.2.2 Get beagrep Windows code
git clone git://github.com/baohaojun/beagrep.git -b Windows
2.2.3 Compile and install
cd beagrep
bash build-win.sh
3 Usage
3.1 Quick smoke test
cd /tmp; mkdir $$; cd $$; echo main > 1.txt; mkbeagrepidx; beagrep -e 'main' --grep '--color=auto'; true; cd ..; rm $$ -rf
If beagrep has been installed correctly, at the end of the above
command's output, you should see something like
/tmp/11468/1.txt:1:main.
3.2 Create index
In your source code directory, using android as example:
cd ~/src/android
mkbeagrepidx
This step will cost you quite some time, under my Linux indexing Android cost me about half an hour (it's about the same time you run grep directly on android source). So I'd advise you create a cron job to do it at midnight.
It takes about 8 minutes to index linux kernel (v3.6-rc6):
Debug: IndexWorker Done Debug: Elapsed time 478.01s.
But the good news is if indexing has already been done before, there re-indexing will only work on those updated files based on file time-stamp. So it will cost you only a few minutes to re-index the whole Android source.
Even better, after an initial indexing, you can do a sub-folder
re-index, mkbeagrepidx will ask you if you want to update the index
found for upper directory. This generally only takes seconds depending
on the size of the sub-folder.
3.3 Man page for mkbeagrepidx
mkbeagrepidx is a simple wrapper over beagrep-build-index. You can
configure it for which directories to ignore using
--deny-directory-pattern option. By default,
- The
$PWD/outis ignored, because it contains android build output - The */.git is ignored, because of well known reason
- The */.repo is ignored, for the same reason.
The syntax is comma separated shell glob patterns, and you can check how it is converted into regular expression by examining the beginning of mkbeagrepidx output:
Always: Will ignore directories matching regular expression: ^(?:/home/bhj/tmp/test/out)$|^(?:.*/\.repo)$|^(?:.*/\.git)$
You can customize it using several ways, in the order of increasing priority:
- Not customize it, then the default
"$PWD/out,*/.repo,*/.git"
will be used.
- Override it in
~/.mkbeagrepidx.rc, setting theBEAGREP_IGNORE_DIR_PATTERNSenvironment variable:export BEAGREP_IGNORE_DIR_PATTERNS="$PWD/out,*/.repo,*/.git"
- Override it in the .mkbeagrepidx.rc in the current working directory, same as the above.
- Override it on the command line (you must repeat the default pattern
because it won't append):
mkbeagrepidx --deny-directory-pattern "$PWD/out,*/.repo,*/.git,*/.svn"
3.4 Searching using beagrep
Under your source code directory:
cd ~/src/android beagrep -e "readlink"
3.4.1 Man page for beagrep
Here's a list of all arguments that beagrep takes:
beagrep -e REGEXP_MATCH [-p REGEXP_PATH] [-a ADDITIONAL_WORDS] [-v REGEXP_REVERSE_PATH] [-i] [-f] [-l] [--grep GREP_OPTIONS] [-a]
- -e
REGEXP_MATCH - This is the minimum required arguments. For e.g.,
beagrep -e readlinkThe
REGEXP_MATCHserves 2 purposes:- First, it is computed into whole words for querying beagle. For
e.g.,
l] [--grep GREP_OPTIONS]above should be matched with the following REGEXP:l\] \[--grep GREP_OPTIONS\], but it should be converted into 4 words:l grep GREP OPTIONSfor beagle. - Second, it is used as the regexp for grep to work on.
- First, it is computed into whole words for querying beagle. For
e.g.,
- -a
ADDITIONAL_WORDS - means to add more words into the beagle query. This is useful by increasing the work beagle need to do, but reduce the possible work set grep need to work on.
- -p
REGEXP_PATH - means to limit the search result to those files whose path-name matches
REGEXP_PATH. - -v
REGEXP_REVERSE_PATH - means to exclude those matched files whose path-name matches
REGEXP_REVERSE_PATH. - -i
- means to do case insignificant grep.
- -f
- means to do the match in file-names only. For example,
beagrep -e readlink -fwill only show results like readlink.h and readlink.c.This is very useful for finding files. Note that when
-fis used, the beagle querying words will be computed differently: only the basename will be used, andfilename:is prepended onto each words. - -l
- means to list the beagle matched list of files directly, without running grep to match on them.
- –grep
GREP_OPTIONS - means to pass additional arguments to the
grep invocation. For e.g., the
-largument can be passed to beagrep directly, or it can be passed using--grep, they mean different things:beagrep -e "hello world" -lwill show a file containing "hello wonderful world", butbeagrep -e "hello world" --grep -lwill not show that file as a match.
4 How does it work?
beagrep is a very practical software, it works because of the following observations:
4.1 grep patterns are usually simple
Or rather, they can be decomposed into several simple sub-patterns: whole words.
For example, to grep such a seemingly complex pattern in Android source code:
"JsonToValue(\"\\\\\"hello world\\\\\"\","
In fact, it contained some simpler sub-patterns, i.e., those 3
wholesome English words: JsonToValue hello world. For a file to
match this complex pattern, one necessary but not sufficient condition
is for this file to contain all these 3 words. And what is good for
this job? A search engine! Using beagle, the parent project for
beagrep, a desktop search engine, you can find which files (actually,
which file in this case) contained these 3 words in a blink of the
eyes.
Only 1 file contained all 3 words:
$beagrep-files 'JsonToValue hello world ' Beagrep index found at /home/bhj/.cache/for-code-reading//home/bhj/src/gingerbread-tegra/.beagrep /home/bhj/src/gingerbread-tegra/external/chromium/base/json/json_reader_unittest.cc /dev/null
So, you can imagine how quick it is to run grep on the set of files containing all required words:
beagrep -e "JsonToValue(\"\\\\\"hello world\\\\\"\","
pat is: 'JsonToValue("\\"hello world\\"",'.
beagrep query argument `JsonToValue hello world '
Beagrep index found at /home/bhj/.cache/for-code-reading//home/bhj/src/gingerbread-tegra/.beagrep
/home/bhj/src/gingerbread-tegra/external/chromium/base/json/json_reader_unittest.cc:168: root.reset(JSONReader().JsonToValue("\"hello world\"", false, false));
Unmatched ( in regex; marked by <-- HERE in m/JsonToValue( <-- HERE ""hello world"",/ at /home/bhj/bin/beagrep line 98.
To summarize, complete words are what search engines are good for, and
fortunately, when grepping source code, we almost always grep using
whole words, instead of sub-words. For e.g., this evil pattern
r.*e.*a.*d.*l.*i.*n.*k can match our readlink, but do you really
need that power of grep?
4.1.1 BTW, creating the regexp pattern automatically in Emacs
From the example above, you can see the actual matched string is:
JsonToValue("\"hello world\"",
but because of meta characters in regexp and shell, the regexp pattern for beagrep to work on is a lot more complex:
"JsonToValue(\"\\\\\"hello world\\\\\"\","
It'd be tragedy if you need type all those \ characters by
yourself. So of course I didn't. In fact, when you work in Emacs,
after you marked some text and press C-u M-x grep, Emacs will
correctly add the \ -s for you, to convert this plain text into a
matching regexp (which can be passed to grep by the shell).
Note that last time I checked, the Emacs grep regexp generation code
has some bugs, so I rolled my own fix for it, you can check my .emacs
for definition of grep-default-command and
grep-shell-quote-argument.
4.2 grep keywords are usually interesting
beagrep can greatly quicken the speed of grep, only because it can greatly reduce the working set of files for grep.
Note that you need provide interesting words to search for so as to greatly reduce the working set. By interesting I mean non-common.
For e.g., say you want to grep is. This word is so common in English
that almost all files would probably contain it (source code file will
probably contain it in comments). Then you are basically running grep
nakedly on the whole android source.
Fortunately, this requirement is easy to meet. In the first place, you
probably don't want to grep for common words; and even if you do need
to, you probably won't grep for one common word alone, which is very
uninteresting; thirdly, even if you do need to grep for a common
word alone, you can provide more words for beagrep to work on by
using its -a option (see the manpage above).
So:
- Don't grep for
includealone, because almost all C/C++ source and header files contain it. - Don't grep for
importalone, because almost all java source files contain it.
And so on.
5 Other projects using beagrep
Because beagrep is so fast, I have used it in a couple other projects/tools.
5.1 offline Wikipedia
I added CJK character support into beagrep so that both English and Chinese offline Wikipedia can be browsed and subject-searched.
Check it out at https://github.com/baohaojun/system-config, sorry I
didn't make it a stand-alone project, it's under the
gcode/offline.wikipedia directory.
5.2 grep-func-call and grep-func-call-all
The latter used beagrep and ctags-exuberant to search for which
functions called a specific function. It's under the bin/ directory.
5.3 Generate call graph
This is a tool to generate a call graph for the software project you want. I have found its effectiveness is to be questioned, but you can see a picture below:
This picture is generated for the adb sub-project in android code. You
can see which functions are calling adb_connect, and which functions
are called by it.
It is generated using beagrep + ctags-exuberant + graphviz, using my wrap scripts like following in the android/system/core directory:
generate-call-graph.pl > call_graph.org dot-partition.pl call_graph.dot -s adb_connect -m 1 -r 2
Footnotes:
1 Here's how I run grep in Android source tree: time grep -I -r --exclude-dir=.git --exclude-dir=.repo -e readlink
~/src/android. The first time it took 5m20s, second time
1m21s, and third time 3.5s, and it can't be reduced much
further. The result is retrieved on a ThinkPad T420 with 8G
memory. One thing notable here I think is that it could
require multiple runs to reach the minimum 3.5s. Another
thing is I tried it multiple times on my MacbookAir with
about the same debian installation, grep always takes
about 1m as the minimum. Was it because less memory (only 4G
for MacbookAir), or was it because SSD and thus different
caching strategy? It eludes me. (Edit: it is because of cpu
freq is locked at 800MHz, see beagrep performance tuning on MacbookAir+Linux.