26
Jun 07

Status Report: Recent Work

New Tool: Prcheck – PRBool’s best friend

Mozilla has a number boolean types and most of them are a form of an int. People expect them to behave like a bool, but since they can be assigned more than 1 value for true, this assumption can lead to bugs. Prcheck will mandate that prbools can only be assigned 1 or 0. Typically static checkers output errors, but prcheck outputs errors & fixes for them in diff format so they can be fixed automatically. See this bug for more info. I think the tool is almost complete. Hopefully, once the MCPP issues described below are addressed, I’ll have one giant patch to eliminate this problem.

MCPP Teething Troubles

Even in the open-source world there are some problems with vendor monoculture. For example we a single vendor providing the C++ compiler and one C preprocessor that is widely used. Even though MCPP is a portable preprocessor that can plug into multiple compilers, it still chokes on some GCCisms. Things are slowly changing and new open source compilers are on the horizon: I can’t wait.

Additionally, since Mozilla is one of the largest projects to be preprocessed with MCPP, we found some scalability bugs in MCPP. The MCPP maintainer addressed those.

Elsa Backend Source Location Work – Works Well?

I am amazed, but hack & slash approach to adding end-of-ast-node info to elkhound still seems to work correctly! It mostly required consulting wikipedia on LR & GLR parsing algorithms to understand how to modify the data structures used in elkhound. I love the combination of Wikipedia and pretty & easy to modify source code.

The preprocessor undo-log appears to function exactly as intended too. That combined with the end-of-ast-node info means that I finally have the ability to easily cut’n'paste code to move it around programmatically. It basically makes the C/C++ pretty printer in oink obsolete for my purposes. This is cool because now I can output prettier source code AND rely on less oink C++ code to do so (ie no need to call the pretty printer[s]), so more refactoring tools could be written in higher level languages such as OCaml or JavaScript in the future.

Publishing My Oink Mods

I have been sending people tarballs on request, due to not being able to integrate my changes into the upstream svn repository. I tried a few svn2hg tools, but they didn’t work very well. hgsvn is good enough to work for OpenJDK, so it might work for me. I’ll try to publish an hg repository of my work within the next week or two.

I’m sorry for not doing this earlier, but it’s extremely time consuming to switch gears from coding oink stuff to trying to package it, especially due to all of the politics involved. Hopefully once the repository is easier to work with I’ll have more people sharing the oink worldload with me. Now that dust settled and the major missing pieces are either implemented or decided upon, a lot of exciting possibilities opened up.

libgcc

Another day, another piece of preprocessor trivia. Turned out there was an alternative to MCPP that I could have used: gcc’s libcpp. It is common knowledge that gcc uses an integrated preprocessor. It is not so well known that the preprocessor is factored out into what appears to be a mostly standalone library inside of gcc called libgcc. Google barely knows about it and there are no docs or other webpages pointing to it, so i missed it in my search.

This could be a useful project, take libcpp turn it back into a standalone preprocessor and add the cpp undo log comments to it. The only downside of libgcc is that it is GPL which would normally be a pain for BSD-licensed projects, but by turning it back into a standalone tool there is no linking to to worry about.

So if anyone finds implementing the macro-undo log with libgcc interesting, please feel free to do so :)

Random Rant on Parallels vs VMWare Fusion vs BootCamp vs 64bit Linux

Continue reading →


19
Jun 07

It works!

Back to Real Life

Just over a month ago I ran into this problem. Before last month I hoped to never have to work on the C preprocessor or a parser generator. So much for that plan. Now my head is full of CPP-expansion-related trivia.

After a month of design and implementing changes to mcpp, elsa, elkhound and oink I can finally move on. During the last week all of the pieces of the puzzle finally came together ( without any nasty surprises other than bugs). Now I can go back to solving real problems.

Upcoming Features

Benefits of having CPP support don’t stop at actually being able to rewrite code. Now that the Oink C++ parser is aware of the C preprocessor, it should be possible to refactor C++ almost as easily as Java. Here are some cool things that are possible now:

  • Nicer UI. Exact source position info allows for eclipse-style context menus for renaming & other refactorings in lxr (or other online code browsers).
  • Richer type system. It should be possible to detect macro constants. Tools will be able to tell the difference between a prnull and 0. Should also be able to detect and maintain NSRESULT and other macros used for declarations.
  • Macro refactoring. Now it’s possible to write a tool to automate the process of converting function-like macro calls to actual function calls. For example, PR_MAX could be converted into std::max calls with all of the accompanying casts to ensure that the resulting C++ is correct.
  • Other nasty tricks like rewriting code within macro declarations.

I’m not sure how much of these features I will work on, but they are relatively easy to implement now.

I plan to write a minimalistic successor to squash and develop more aggressive refactorings than renames.

Additionally, I will continue pushing above changes upstream and trying to facilitate a more open oink development community.


12
Jun 07

Undoing CPP Expansion in 3 simple steps. Say “Hello” to easier C++ rewriting.

This is incredibly exciting: I believe that I finally solved the messy and mind-numbingly boring CPP/C++ integration problem! Having code displaced or generated due to CPP-expansion should no longer be a fatal problem for Squash. I believe macro-expansion is (or was) the single biggest problem between me and large-scale automated refactoring of the Mozilla codebase.

What’s even more exciting is that I think my solution is both incredibly simple to implement and more general than prior work. Most other tools combine the CPP expansion & C parsing into a single step and then integrate (or should I say violently shove?) CPP constructs into the AST. This results in complete lack of separation between preprocessing and program analysis. For example, due to this tight coupling existing solutions were useless to me because the fancy CPP logic could not be separated from the C parser. I would also have a hard time submitting a more convoluted C++ parser upstream to the Elsa maintainer.

Design

There are three parts to my solution:

  1. Critical component. A CPP expansion undo-log injected during CPP-expansion by a modified C preprocessor (upcoming version of MCPP). The statements are wrapped in C comments such that the preprocessed result can be parsed by any C/C++/etc parser or compiler. Implementation-wise this is the hardest part since MCPP(as most other C proprocessors) was never designed it keep track of macro expansion info.
  2. A small modification to the Elsa lexer to parse the undo-log and set it aside in a separate data structure.
  3. Tricky. A function that utilizes the cpp undo-log to map the preprocessed source locations to the unpreprocessed ones. This is a a ridiculously simple solution to a tricky design problem of how to efficiently advertise the fact that every AST node has at least 2 different source positions (pre expansion, post expansion & a stack of positions resulting from expanding nested macros).

The MCPP maintainer is almost done with 1. I have a prototype implementation of 2 & 3 weighing in at less than 500lines. Now that the design phase is complete, the amount of changes to Elsa is trivial, so I should be done with those real soon now.

Looking Ahead

Now I need to modify Elsa to retain more precise source locations. This includes adding end-of-ast-node-location and adding positions to nodes(such as expressions) that don’t even have a start position at the moment. This combined with cpp-undo-log enhanced precise positions should allow for code rewrites to retain as much original source code as possible. This reduces the amount of ugly machine-generated code and results in better correctness (existing code is likely to work).

CPP Undo-log Example

The undo-log took a couple of tries to get right. Now macro-parameters have a notion of scope and sensible names. The following example features macro-induced column displacement and macro-expansion causing line shrinkage.

#define NULL 0L
#define FOO(a, b) a + b
int i = NULL; int j;
int k = FOO(
FOO(NULL , 1),
2);

Preprocessed version

# 1 "testcase4.c"
/*mNULL 1:8-1:15*/
/*mFOO 2:8-2:23*/

int i = /*<NULL 3:8-3:12*/0L/*>*/;
# 3 “testcase4.c”
int j;
int k = /*<FOO 4:8-6:3*//*!FOO#0-0 5:0-5:13*//*!FOO#0-1 6:1-6:2*//*<FOO#0-0*//*<FOO*//*!FOO#1-0*//*!FOO#1-1*//*<FOO#1-0*//*<NULL*/0L/*>*//*>*/ + /*<FOO#1-1*/1/*>*//*>*//*>*/ + /*<FOO#0-1*/2/*>*//*>*/;

Conclusion

It took a lot to arrive at such a simple solution. I expect that all of my work is likely to end up upstream in BSD-licensed projects: MCPP & and Elsa/Oink. I sincerely hope that other people will be able to build on it for their CPP-infested analysis needs and avoid the unbearable mind-numbing discomfort associated with making CPP play along.