Introducing beagrep, the beast-like grep

I found out about the beagle project a couple of months ago, and I'm totally excited by it. It's the missing brick I had long been looking for to write a `grep on steroids' that I can use as a source code reading tool.

Yeah, right, I was using grep to read source code, oftentimes finding cscope insufficient (because some files are not source code, and even cscope's fuzzy syntax parser cannot parse them). On the other hand, with large projects such as the Linux kernel, or the even larger Android system, grepping the whole project can be very slow. I once searched for readlink in the Android source tree, and it took me more than 30 minutes!

With beagrep (beagle combined with grep), I can grep it in less than 2 seconds!

1 Why is it possible

When you grep for source-code-reading's sake, you rarely need the full power of complex regexps: when you search for readlink, you grep readlink, not grep r.*e.*a.*d.*l.*i.*n.*k; the latter just does not make any sense!

IOW, 99.9% of the time you search only for whole words like readlink, which is a very simple kind of regexp and, unlike complex regexps (such as r.*e.*a.*d.*l.*i.*n.*k), is something search engines can handle perfectly.

2 How is it done

It is really a very simple idea: when you want to grep a target regexp, do the following (there is a code sketch of it after the list):

  1. Break the target regexp into whole words; e.g., grep -e some.*fun.*stuff should be broken into "some fun stuff".
  2. Query beagle with the whole words; beagle answers with which files in the repository contain these words. These are the possible matching files.
  3. Grep the target regexp in those possible files (which are often only a very small part of the whole repository, so grep can finish in the blink of an eye).
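
To make the three steps concrete, here is a minimal Python sketch of the idea. It is not the real beagrep (that one lives under system-config/bin and talks to beagle); the query_index() helper and the beagle-static-query invocation below are assumptions, only the shape of the pipeline matters:

    #!/usr/bin/env python3
    # Minimal sketch of the beagrep pipeline, NOT the real implementation.
    import re
    import subprocess
    import sys

    def regexp_to_words(regexp):
        """Step 1: keep only the whole words of the regexp, drop the operators."""
        return [w for w in re.split(r'[^A-Za-z0-9]+', regexp) if w]

    def query_index(words):
        """Step 2: ask the search engine which files contain these words.
        Stand-in only: the exact beagle command line here is an assumption."""
        cmd = ['beagle-static-query'] + words
        out = subprocess.run(cmd, capture_output=True, text=True).stdout
        return [line for line in out.splitlines() if line.startswith('/')]

    def beagrep(regexp):
        """Step 3: run the real grep, but only over the candidate files."""
        candidates = query_index(regexp_to_words(regexp))
        if candidates:
            subprocess.run(['grep', '-nH', '-E', regexp] + candidates)

    if __name__ == '__main__':
        beagrep(sys.argv[1])    # e.g. 'some.*fun.*stuff'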

3 Modifications to beagle

Here are the details of how I changed beagle to fit my needs (warning: boring stuff ahead):

  1. Change all beagle built-in filters to FilterText. This is because I don't want keywords thrown away by the source-code filters. This way, I can beagle-query `extends CFunny' to see which Java classes inherit from `CFunny' (the default Java filter would remove extends since it is too common and uninteresting in Java source files; see the toy illustration after this list).
  2. Remove some restrictions. E.g., by default only the first 100000 tokens in a file would be indexed, which is undesirable for my purpose. Also, I enlarged the memory threshold 10 times, since I found it causing problems with some large XML files.
  3. Remove more restrictions. Basically, anything the NoiseFilter would remove, I now keep. Another filter removes common English words; I keep those as well.
  4. Add support for indexing Chinese characters (this is because I'm Chinese).
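
To see why those filter changes matter, here is a toy illustration (plain Python, not beagle's actual C# filter code; the stop-word list is made up): a query like `extends CFunny' can only be answered if the word extends made it into the index in the first place.

    # Toy inverted index, showing what dropping "noise" words costs.
    JAVA_STOP_WORDS = {'public', 'class', 'extends'}   # hypothetical filter list

    def build_index(files, drop_stop_words):
        inverted = {}
        for path, text in files.items():
            for word in text.split():
                if drop_stop_words and word in JAVA_STOP_WORDS:
                    continue                            # what the default filters do
                inverted.setdefault(word, set()).add(path)
        return inverted

    files = {'MyFunny.java': 'public class MyFunny extends CFunny'}
    print('extends' in build_index(files, drop_stop_words=True))    # False: the query misses
    print('extends' in build_index(files, drop_stop_words=False))   # True: every token is searchable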

4 Usage

Here's how I use it:

  1. Build a static index at the top-level dir of the source code:
    mkbeagleidx
    
  2. Use beagrep in any directory in the source tree:
    ~/system-config/bin/beagrep -e "ENGLISH_STOP_WORDS"
    

    The output is like the following:

    beagle query argument `ENGLISH STOP WORDS'
    /src/beagle/beagled/LuceneCommon.cs:1206: ...ENGLISH_STOP_WORDS...
    ...
    ...
    

    Note: ENGLISH_STOP_WORDS is broken into 3 words before beagle is queried.
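
    Conceptually, that word breaking is nothing more than splitting the query argument on non-alphanumeric characters. The real work is done by a small C# helper (see the next section); the snippet below is only a Python illustration of the effect:

    import re
    print([w for w in re.split(r'[^A-Za-z0-9]+', 'ENGLISH_STOP_WORDS') if w])
    # prints ['ENGLISH', 'STOP', 'WORDS']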

5 Where to find everything

I have put the source code on GitHub.

If you check out the source code, you can find beagrep and its helper scripts under system-config/bin.

The beagle source code I modified is under system-config/gcode/beagle.

The C# program which breaks ENGLISH_STOP_WORDS into ENGLISH STOP WORDS is under system-config/gcode/BeagleTokenizer.

The simplest way to set things up is to run

~/system-config/bin/after-co-ln-s.sh

and then

~/system-config/bin/Linux/after-check-out.sh

For more details, please RTFS using beagrep!