Introducing beagrep, the beast like grep
I just found out about the beagle project couple of months ago, I'm totally excited by it. It's the missing brick that I longed for to write a `grep on steroid' which I can use as a source code reading tool.
Yeah, right, I was using grep to read souce code, often times
finding cscope insufficient (because some files are not source code,
and even cscope's fuzzy syntax parser can not parse them). On the
other hand, with large projects, such as Linux Kernel, or the even
larger Android system, grepping can be very slow on the whole
project. I once searched for readlink
in android source tree, it
took me >30 minutes!
With beagrep (beagle combined with grep), I can grep it for less than 2 seconds!
1 Why is it possible
When you grep for reading source code's sake, you often don't need
complex regexp power: when you search readlink
, you grep readlink
,
not grep r.*e.*a.*d.*l.*i.*n.*k
, that just does not make any sense!
IOW, you 99.9% times search only for whole words like readlink
,
which is a kind of simple regexp, and unlike complex regexps (such as
r.*e.*a.*d.*l.*i.*n.*k
), is something search engines can deal with
perfectly.
2 How is it done
It is really a very simple idea, when you want to grep a target repexp, do the following:
- Break the target regexp into whole words, for e.g.,
grep -e some.*fun.*stuff
should be broken into "some fun stuff". - Query beagle with the whole words, beagle answers with which files in the repository contais these words. These files are the possibe matching files.
- Grep the target regexp in those possible files (which often is only a very small part of the whole repository, thus grep can finish in a blink of the eyes).
3 Modifications to beagle
Here's the details of how I changed beagle to satisfy my need (warning: boring stuff ahead):
- Change all beagle built in filters to FilterText. This is
because I don't want those keywords filtered by those
SourceCode filters. This way, I can beagle-query `extends
CFunny' to see which classes are inheriting from `CFunny' in
Java (The default Java filter will remoke
extends
since it is too common and uninteresting in java source files). - Remove some restricts. For e.g., only the first 100000 tokens in a file would be indexed, which is undesirable for my purpose. Also, I enlarged the memory threshold by 10 times, since I found it causing problems with some large xml files.
- Remove more restricts. Basically, I unremoved anything the NoiseFilter will remove. Also, another filter will remove common English words, I unremoved those as well.
- Added support for indexing Chinese characters (This is because I'm a Chinese).
4 Usage
Here's how I use it:
- Build a static index at the top level dir of the souce code:
mkbeagleidx
- Use beagrep in any directory in the source tree:
~/system-config/bin/beagrep -e "ENGLISH_STOP_WORDS"
The output is like the following:
beagle query argument `ENGLISH STOP WORDS' /src/beagle/beagled/LuceneCommon.cs:1206: ...ENGLISH_STOP_WORDS... ... ...
Note:
ENGLISH_STOP_WORDS
is broken into 3 words before beagle is queried.
5 Where to find everything
I have put the source code at github.
If you checkout the source code, you can find the beagrep and its helper scripts under system-config/bin.
The beagle source code I modified is under system-config/gcode/beagle.
The c# program which breakes ENGLISH_STOP_WORDS
to ENGLISH STOP
WORDS
is under system-config/gcode/BeagleTokenizer.
The simplest way to set things up is to run
~/system-config/bin/after-co-ln-s.sh
and then
~/system-config/bin/Linux/after-check-out.sh
For more details, please RTFS using beagrep!