Several of the projects I want to pursue some day have to do with automatic examination and/or manipulation of source code. In the past I’ve played with TXL but found that, while it’s tremendously useful in certain circumstances, its focus on transformations between equivalent code is too limiting and the grammars available for my languages of interest (primarily C/C++) are incomplete. For my stack ripper I used the “-fdump-translation-unit” feature of gcc itself, and it mostly worked, but I found that both the format and content of its output were a bit random and buggy. Several other projects, such as GCC-XML and smatch, use basically the same approach, inheriting the same problems and/or requiring patches to specific gcc versions to make the output saner. I’m apparently not the first to go on this kind of search for a decent parser, so maybe there are others who will be interested in what I’ve found.

What I’ve found, via the last link above, is Elkhound and Elsa. Elkhound is a parser generator, and Elsa is a C++ parser created using Elkhound. Both look quite “real” in the sense of being able to handle real C++ (not just an easy-to-implement subset) without choking, and of producing parse trees that can be not only examined but also manipulated within a program without undue pain (this stuff’s never going to be easy). Here’s what the project page has to say about completeness:

Elsa can parse most C++ “in the wild”. It has been tested with some notable large programs, including Mozilla, Qt, ACE, and itself. I have not tried parsing KDE recently, so that’s the next major goal.

In C mode, Elsa can parse most C programs, including the Linux kernel (our highest-priority C program). It handles most gcc extensions, including K&R function definitions and the “implicit int” rule. There is a good chance it will parse your C program.

As for usability, the tutorial (nice that it even has one) includes an example of using Elsa for a semantic grep that can find uses of a particular identifier without being fooled by similar names, commented-out code, or the other things that make programmers grumble when they rely on simple text matching. This kind of functionality is a key building block for the sorts of things I want to do, and its implementation looks short and straightforward enough that I won’t go totally nuts fighting with my tools.
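To make the contrast with plain text matching concrete, here is a rough sketch of the weakest useful step up from grep: a token-aware search that matches only whole identifiers and skips comments and string literals. To be clear, this is not Elsa’s API and not the tutorial’s code; the function name findIdentifier is made up for illustration, and the scan is purely lexical. It ignores raw string literals and the preprocessor, and it can’t tell two different variables with the same name apart, which is exactly the part that needs a real parser like Elsa.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of Elsa): returns the 1-based line numbers
// where `name` appears as a complete identifier token, skipping comments
// and string/character literals. Lexical only; no scoping or types.
std::vector<int> findIdentifier(const std::string& src, const std::string& name) {
    std::vector<int> hits;
    int line = 1;
    std::size_t i = 0;
    const std::size_t n = src.size();
    auto isIdentStart = [](unsigned char c) { return std::isalpha(c) || c == '_'; };
    auto isIdentChar  = [](unsigned char c) { return std::isalnum(c) || c == '_'; };

    while (i < n) {
        const char c = src[i];
        if (c == '\n') {
            ++line;
            ++i;
        } else if (c == '/' && i + 1 < n && src[i + 1] == '/') {
            // Line comment: skip to end of line; the newline is counted above.
            while (i < n && src[i] != '\n') ++i;
        } else if (c == '/' && i + 1 < n && src[i + 1] == '*') {
            // Block comment: skip to the closing "*/", counting newlines.
            i += 2;
            while (i < n && !(src[i] == '*' && i + 1 < n && src[i + 1] == '/')) {
                if (src[i] == '\n') ++line;
                ++i;
            }
            i = (i < n) ? i + 2 : n;
        } else if (c == '"' || c == '\'') {
            // String or character literal: skip to the matching quote.
            const char quote = c;
            ++i;
            while (i < n && src[i] != quote) {
                if (src[i] == '\\' && i + 1 < n) ++i;  // step over the escape
                if (src[i] == '\n') ++line;
                ++i;
            }
            if (i < n) ++i;  // closing quote
        } else if (isIdentStart(static_cast<unsigned char>(c))) {
            // Identifier or keyword: read the whole token, then compare.
            const std::size_t start = i;
            while (i < n && isIdentChar(static_cast<unsigned char>(src[i]))) ++i;
            if (src.compare(start, i - start, name) == 0) hits.push_back(line);
        } else if (std::isdigit(static_cast<unsigned char>(c))) {
            // Numeric literal: consume it whole so "0x1f" never yields "x1f".
            while (i < n && isIdentChar(static_cast<unsigned char>(src[i]))) ++i;
        } else {
            ++i;
        }
    }
    return hits;
}
```

Searching for count in a file that also contains recount, a // count comment, and the string "count" reports only the lines where count is really used as an identifier. What it can’t do is tell you which count you’ve found, and that’s the gap a semantic tool built on a real parse fills.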

Overall, Elsa looks like something that could be very useful as a basis for some of my future projects. If you’re thinking of doing anything that involves parsing real code, such as searching for instances of known bugs or adding language features, Elsa might be a really good place to start.