CPP Integration Progress

Existing Work

All of the existing work is based on basic C parsers, so it can’t be directly applied to C++.

I found out that someone did a PhD on refactoring C code, resulting in the CRefactory project (down at the moment). It looks like reversing the C preprocessor was by far the hardest task to address. Languages built from stacked lexical syntaxes (like OCaml with camlp4) or without a preprocessor (like Java) don’t even have this problem.

One way to tackle the problem is to violently shove some of the preprocessor markup into the C AST. Unfortunately this is an incredibly hard task because CPP is purely lexical and works at the token level, whereas C compilers work with higher-level syntax trees (ASTs). Combining the two can result in an ambiguous grammar, which is useless for refactoring.
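
To make the token-versus-tree mismatch concrete, here is a minimal sketch. The macro and function names are made up for illustration and do not come from any of the projects mentioned. Neither macro body is a complete statement or expression, so there is no single AST node that either one could be attached to.

```cpp
// Hypothetical macros, for illustration only.
// BEGIN_CLAMP opens an if-block and END_CLAMP closes it, so each macro
// body is an unbalanced run of tokens rather than a complete construct.
#define BEGIN_CLAMP(x) if ((x) < 0) {
#define END_CLAMP }

// After preprocessing, the body reads: if ((value) < 0) { value = 0; }
// The compiler sees one ordinary if-statement, but its opening and
// closing tokens came from two different macro expansions.
int clamp_to_zero(int value) {
    BEGIN_CLAMP(value)
        value = 0;
    END_CLAMP
    return value;
}
```

A refactoring tool that wanted to keep BEGIN_CLAMP in the tree would need a node for "the first half of an if-statement", which the C grammar does not have.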

CRefactory integrates CPP into the AST where possible; it even handles some CPP conditionals. The author’s argument is that every CPP conditional represents a separate configuration, and processing one configuration at a time results in a combinatorial explosion of configurations to process, so conditionals must be integrated into the AST.
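
As a rough illustration of that explosion, assume the conditionals are independent on/off macros (real code is messier, since conditionals can nest and depend on each other). Then n macros give 2^n configurations. The function name below is hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Enumerate every configuration of a set of independent on/off macros.
// Each configuration is the list of macro names treated as defined.
// Sketch only: intended for small inputs, since the result has 2^n entries
// and the shift below assumes n is smaller than the width of std::size_t.
std::vector<std::vector<std::string>> enumerate_configurations(
        const std::vector<std::string>& macros) {
    std::vector<std::vector<std::string>> configs;
    const std::size_t total = std::size_t{1} << macros.size();
    for (std::size_t mask = 0; mask < total; ++mask) {
        std::vector<std::string> defined;
        for (std::size_t bit = 0; bit < macros.size(); ++bit) {
            if (mask & (std::size_t{1} << bit)) {
                defined.push_back(macros[bit]);
            }
        }
        configs.push_back(defined);
    }
    return configs;
}
```

Three macros already yield eight configurations, and twenty yield over a million, which is why running a tool once per configuration does not scale.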

However, the CPP-in-AST solution is error-prone and has trouble scaling to large projects. Besides, I think processing every configuration within the AST is still potentially a combinatorial explosion; the major benefit is that one can eliminate infeasible conditionals if they cause syntax errors. This conditional elimination would be incredibly slow for C++. I also don’t believe that this would be enough to solve Mozilla’s conditionals, since most of the troublesome macros are platform-specific and depend on the system headers. Having said that, I appreciate seeing people prove that with enough effort even seemingly impossible tasks can be accomplished.

A paper on ASTEC presents another solution, which involves translating [lexical] CPP constructs into a CPP-like AST-based language. This works great in theory, but the translation process is only semi-automatic and requires a lot of hand-holding. I have mixed feelings about this approach. A simpler intermediate language is an excellent thing, but as soon as it becomes “CPP sux, so I invented this better language: use it” the world gains yet another troublesome programming language.

My Approach

It took me a few weeks to figure out how to tackle the CPP problem in squash. I think I have a solution that will work real soon(TM).

I like to avoid solving unsolvable problems and stick with a works-for-me approach. Thus I probably won’t make squash aware of CPP conditionals. My existing approach of combining the output of running squash on Mac and Linux (and Windows in the future) should take care of 99% of cases in Mozilla. The rest will be flagged by the compiler and will be trivial to fix by hand.
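
Combining per-platform output amounts to taking a union of the rewrites each run produces. Here is a minimal sketch of that idea. The Edit record and merge_platform_edits are hypothetical and are not squash’s actual data structures or output format.

```cpp
#include <set>
#include <string>
#include <tuple>
#include <vector>

// A hypothetical rewrite record: replace the code at file:line with
// `replacement`. The real squash output format is not shown here.
struct Edit {
    std::string file;
    unsigned line;
    std::string replacement;

    // Order by (file, line, replacement) so edits can live in a std::set.
    bool operator<(const Edit& other) const {
        return std::tie(file, line, replacement) <
               std::tie(other.file, other.line, other.replacement);
    }
};

// Union the edits found on each platform. Identical edits collapse into
// one; edits that only one platform's configuration can see are kept.
std::set<Edit> merge_platform_edits(
        const std::vector<std::vector<Edit>>& per_platform) {
    std::set<Edit> merged;
    for (const auto& edits : per_platform) {
        merged.insert(edits.begin(), edits.end());
    }
    return merged;
}
```

Code that no tested platform compiles is never seen by any run, which is the remainder that gets left to the compiler and a manual fix.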

Macro Expansion

I also don’t see the benefit of a partially integrated CPP within the C++ AST. Instead I am modifying MCPP to log the macro expansions with special markup enclosed in comments. Afterwards I plan to modify elsa to parse the macro-expansion log and adjust the position tags on AST nodes accordingly. Then I’ll be able to do cool things like “cut out the piece of code that was parsed as 0”, which will look up the source locations, figure out that the 0 was the result of a macro expansion and return “NULL”. Or when I try to rewrite a member in PR_MAX(something->GetPresContext()->foo, 0), the code will know that the two GetPresContext() calls in the AST (the macro expands its argument twice) correspond to the same code fragment.

Besides, isn’t this marked-up code pretty?

/*m__CONCAT,"testcase3.c",2,*/
/*m__MATHDECL_1,"testcase3.c",5,*/
/*m__MATH_PRECNAME,"testcase3.c",8,*/
# 10 "testcase3.c"
/*!0x2adfbaa6e010 _*/
/*!0x2adfbaa6e015 b*/
# 10 "testcase3.c"
/*<__CONCAT*/__boo/*>*/
/*!0x2adfbaa6e010 _*/
/*!0x2adfbaa6e029 _*/
# 11 "testcase3.c"
# 11 "testcase3.c"
# 11 "testcase3.c"
testcase3.c:11: error: Not a valid preprocessing token "//"
macro "__CONCAT" defined as: #define __CONCAT(x,y) x ## y /* testcase3.c:*
/
macro "__MATHDECL_1" defined as: #define __MATHDECL_1(function,suffix) __CON
CAT(function,suffix) /* testcase3.c:5 */
from testcase3.c: 11: int __MATHDECL_1( __CONCAT(__, lgamma),_r);
int /*<__MATHDECL_1*/ /*<__CONCAT*//*<0x2adfbaa6e010*/ /*<__CONCAT*/__lgamma/*>
*//*>*// *<0x2adfbaa6e029*/_r/*>*//*>*/ /*>*/ ;

This is the original testcase:

#define __CONCAT(x,y) x ## y

#define __MATHDECL_1(function,suffix) \
__CONCAT(function,suffix)

#define __MATH_PRECNAME(name,r) __CONCAT(name,r)

__CONCAT(__, boo)
int __MATHDECL_1( __CONCAT(__, lgamma),_r);
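
To show how markup like this could be consumed, here is a minimal sketch of a stack-based scanner. It assumes the marker shapes visible in the sample output: "/*<LABEL*/" opens an expansion and "/*>*/" closes the innermost open one. The final MCPP format may differ, and all names in the code are hypothetical, not part of MCPP or elsa. Other comments, including the "/*!...*/" markers, are passed through untouched.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// One expansion region recovered from the marked-up preprocessor output.
struct ExpansionSpan {
    std::string label;  // text after '<' in the opening marker
    std::size_t begin;  // offset of the region in the stripped text
    std::size_t end;    // one past the region's last character
    std::size_t depth;  // nesting level, 0 for an outermost expansion
};

struct StrippedOutput {
    std::string text;                  // input with the markers removed
    std::vector<ExpansionSpan> spans;  // in order of their opening markers
};

// Assumed marker shapes, inferred from the sample output above.
StrippedOutput strip_expansion_markers(const std::string& input) {
    StrippedOutput out;
    std::vector<std::size_t> open;  // indices into out.spans still unclosed
    std::size_t i = 0;
    while (i < input.size()) {
        if (input.compare(i, 5, "/*>*/") == 0) {
            // Closing marker: finish the innermost open region, if any.
            if (!open.empty()) {
                out.spans[open.back()].end = out.text.size();
                open.pop_back();
            }
            i += 5;
            continue;
        }
        if (input.compare(i, 3, "/*<") == 0) {
            const std::size_t close = input.find("*/", i + 3);
            if (close != std::string::npos) {
                // Opening marker: start a region at the current offset.
                ExpansionSpan span{input.substr(i + 3, close - (i + 3)),
                                   out.text.size(), out.text.size(),
                                   open.size()};
                out.spans.push_back(span);
                open.push_back(out.spans.size() - 1);
                i = close + 2;
                continue;
            }
        }
        out.text += input[i++];  // ordinary character: copy through
    }
    // Regions left unclosed by truncated input end at the end of the text.
    for (std::size_t index : open) {
        out.spans[index].end = out.text.size();
    }
    return out;
}
```

Given the input int /*<__CONCAT*/__boo/*>*/; the stripped text is int __boo; with one span labelled __CONCAT covering __boo. A tool can then ask which expansion, if any, covers a given output offset.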

Code

The MCPP maintainer is absolutely awesome to work with. MCPP needs to be modified to preserve horizontal whitespace (which preprocessors don’t normally do) and to provide the above expansion info. He volunteered to do a lot of the MCPP modifications for me and has been a lot of help in guiding me through his code.

At this point it looks like we’ll be able to integrate this work into the MCPP trunk. This is great news because, to the best of my knowledge, all of the other C refactoring tools either use a bitrotting fork of some C preprocessor or reimplement yet another buggy version of CPP, resulting in a lot of collectively wasted effort.

I hope to have this working in squash within two weeks so I can finish the outparam rewrite and hopefully rid squash of a large portion of code that deals with the non-deterministic source positions caused by macro expansion.

1 comment

  1. Robert O'Callahan

    Did you look at how the Eclipse C/C++ tools handle the preprocessor? They integrate it into the AST.

    Actually, given the issues with Oink/Elsa, I wonder if the Eclipse C/C++ tools are worth a closer look…