02
Mar 07

Uno, dos… DeHydra!

A Creature so Fierce…

I’ve been wrestling with control flow graphs like this. I eliminated those pesky “empty” nodes found in previous incarnations, improved branching to track conditions, and realized that what I’m really doing is developing a tool to bite off the excessive necks and heads (otherwise known as edges and basic blocks) of the Hydra monster represented by the bloody graph. Thus, say hello to DeHydra, which was once known as dos.

DeHydra Plans

I am quite happy with how dehydra is turning out. JavaScript is a nice language for pattern matching code in a flow-sensitive fashion, and the uno-style API seems appropriate for the task.

In the near term I really need some value-awareness through abstract interpretation. For example, this graph of a simple function demonstrates several things that dehydra will need to notice:

  1. There is a path from hydra-unfeasable.c:6 to hydra-unfeasable.c:10, and those blocks are part of only a single control flow, not four.
  2. hydra-unfeasable.c:12 requires contradictory conditions for control flow to reach it, and should be dropped or even reported as dead code.

I pondered a number of approaches, ranging from value analysis to using a theorem prover to reversing conditions via Prolog code. A simple symbolic expression interpreter written in C++ seems like the most productive option at this point, so I will focus on that for the next little while.
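
To make that concrete, here is a toy sketch of the kind of path-condition bookkeeping such an interpreter would do (this is just an illustration, not dehydra code; the addFact helper and the flag variable are made up for the example):

#include <cassert>
#include <map>
#include <string>

// Track the truth value asserted for each condition variable along a path,
// and flag the path as infeasible when a later branch contradicts an earlier one.
typedef std::map<std::string, bool> PathCondition;

// Record "var was taken as taken_true"; return false if that contradicts a
// fact already on this path, i.e. the path is infeasible.
static bool addFact(PathCondition &pc, const std::string &var, bool taken_true) {
  PathCondition::iterator it = pc.find(var);
  if (it != pc.end())
    return it->second == taken_true;
  pc[var] = taken_true;
  return true;
}

int main() {
  PathCondition pc;
  // Model a path that enters "if (flag)" and later "if (!flag)": the second
  // block is dead on this path, which is the kind of thing dehydra should notice.
  bool feasible = addFact(pc, "flag", true) && addFact(pc, "flag", false);
  assert(!feasible);
  return 0;
}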

While I was reviewing my options I found a paper on null pointer analysis for Java to be a useful starting point for references to other relevant literature.

Longer Term DeHydra Plans

There is literature, which I will have to absorb, on how source-to-source transformation (like squash, but also like the future C++-to-ES4 tool) is assisted by flow-sensitive static analysis tools (like dehydra).
DeHydra will gain whole-program analysis capability by:

  1. Inlining control flow graphs into the main() function
  2. Supporting user-defined function attributes and using some sort of attribute inference (like in ML) to propagate them. This method would have higher performance, but less precision.

For the above to work, dehydra will also need support for a user-correctable/suggestible call graph to compensate for function pointers and dynamic module loading. This should also be enough to do dead code and module detection. Oh, and I want dehydra to support JavaScript for cross-language analysis.
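
As a concrete illustration of the attribute-propagation idea in the list above, here is a rough sketch of pushing a user-defined attribute from callees to callers over a call graph until a fixpoint is reached. None of this is dehydra code, and all the names are invented:

#include <iostream>
#include <map>
#include <set>
#include <string>
#include <vector>

int main() {
  std::map<std::string, std::vector<std::string> > callees; // caller -> callees
  callees["main"].push_back("ProcessRule");
  callees["ProcessRule"].push_back("AllocateBuffer");

  std::set<std::string> attributed;        // seeded from user annotations
  attributed.insert("AllocateBuffer");

  bool changed = true;
  while (changed) {                        // iterate until nothing changes
    changed = false;
    for (std::map<std::string, std::vector<std::string> >::const_iterator c =
             callees.begin(); c != callees.end(); ++c) {
      if (attributed.count(c->first))
        continue;
      for (size_t i = 0; i < c->second.size(); ++i) {
        if (attributed.count(c->second[i])) {
          attributed.insert(c->first);     // caller inherits the attribute
          changed = true;
          break;
        }
      }
    }
  }

  for (std::set<std::string>::const_iterator it = attributed.begin();
       it != attributed.end(); ++it)
    std::cout << *it << " carries the attribute\n";
  return 0;
}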

There is also Whaley’s paper, which uses Prolog as a concise and expressive query engine over source code. I would really like to see dehydra evolve such capabilities.

All this will take more than a little time on my own, but this stuff is too exciting for other impatient people to idly observe.


28
Feb 07

WebSquash

Looking for developers to test the web frontend for squash
I got the web frontend to squash working. Right now I’m looking for people to test it on my test server before I open it to the wild web. Building it resulted in a further explosion of frontend scripts, but all of the pieces seem to make sense. As it stands right now there are five pieces:

  1. A client-side JavaScript piece that provides progress notifications
  2. A PHP frontend that communicates with the stateful server
  3. A Python server that handles command queuing, progress reporting and error handling
  4. A Python library that builds a list of possible candidates for squashing and produces the necessary .i files and an invocation command for squash
  5. Squash: the friendly neighborhood class member renamer

Passion of CPP: Macros are Considered Painful

In the process of testing the web frontend I updated the Mozilla source code, only to notice that Elsa could no longer parse files for tasks that worked before. At first I got a little discouraged, thinking that I’d have to teach Elkhound about yet another obscure C++ feature that wasn’t handled correctly before. However, it turned out that in one case I was feeding squash a file that didn’t even compile, and in the other two cases CPP was messing with my head.

The first case was the magic of CPP leading to unintentional code duplication and squash confusion: PR_MAX(GetPresContext()->PointsToAppUnits(0.5f), onePixel)
gets expanded and parsed as
GetPresContext()->PointsToAppUnits(0.5f) ? GetPresContext()->PointsToAppUnits(0.5f) : onePixel

I ended up putting in a special case teaching squash not to get upset if it can only find one of the two instances of a class member to replace when PR_MAX is involved.
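
For illustration, here is the general hazard with a textbook max macro of the same shape (my own example, not necessarily PR_MAX’s exact definition):

#include <cstdio>

// After expansion the first argument appears twice, so an expensive call runs
// twice and a rewriting tool sees two call sites where the source had one.
#define MY_MAX(a, b) ((a) > (b) ? (a) : (b))

static int calls = 0;
static int Expensive() { ++calls; return 42; }

int main() {
  int m = MY_MAX(Expensive(), 7);
  std::printf("max=%d, Expensive() ran %d times\n", m, calls);
  return 0;
}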

The second case was exciting. In my innocent wonder at CPP I thought that running g++ on a .cpp file, or on the .i file produced from said .cpp, would result in pretty similar behavior. Not so.

PR_LOG(gLog, PR_LOG_DEBUG,
("xul: %.5d. %s %s=%s",
-1, // XXX pass in line number
NS_ConvertUTF16toUTF8(extraWhiteSpace).get(),
NS_ConvertUTF16toUTF8(qnameC).get(),
NS_ConvertUTF16toUTF8(valueC).get()));

yields

do { if (((gLog)->level >= (PR_LOG_DEBUG))) { PR_LogPrint ("xul: %.5d. %s %s=%s", -1, // XXX pass in line number NS_ConvertUTF16toUTF8(extraWhiteSpace).get(), NS_ConvertUTF16toUTF8(qnameC).get(), NS_ConvertUTF16toUTF8(valueC).get()); } } while (0);

Here the // comment ends up in the middle of a single line because the PR_LOG invocation collapses onto one line, and the resulting line won’t parse since the second half of it is commented out.

This kind of CPP mischief leads me to believe that something has got to give. If we are to embrace automated tools to aid in verification and development, either CPP use has to be reduced considerably or Elsa needs to gain a builtin preprocessor. I suspect the solution will involve a mixture of the two approaches.


24
Jan 07

Will Rename Class Members for Food

Squash may now be ready as a class member renaming tool for early adopters. I would like people to use me as a frontend to squash: email me your requests for renames and I will reply with giant patches. This way squash can be immediately useful, and I can fix bugs in squash and figure out actual use cases while I get the frontend set up.

Progress

Squash can now produce a good-looking 92K patch for renaming nsIFrame::GetPresContext. This means that squash can correctly traverse 167 files and produce a patch that affects 103 of them. I am going to work on the web frontend next.

Some issues below.

Continue reading →


11
Jan 07

Squash Progress and Plans

Out-param Rewriting Work

Since the last post I worked on rewriting functions that use out-parameters to use return values instead. I got as far as rewriting method definitions and simple call sites, but decided to hold off on further work until the rest of squash is more complete.
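
For context, this is the shape of the transformation, shown here with simplified stand-in types and free functions rather than the real nsIFrame methods:

#include <cassert>

// Simplified stand-ins (not the real Mozilla types) just for illustration.
typedef unsigned nsresult;
static const nsresult NS_OK = 0;
struct nsPresContext {};

static nsPresContext gContext;

// Before: out-parameter style. The caller passes a location to fill in and
// checks an nsresult. (In Mozilla this would be a method on nsIFrame.)
static nsresult GetPresContext(nsPresContext** aResult) {
  *aResult = &gContext;
  return NS_OK;
}

// After: return-value style, the form the rewrite aims for.
static nsPresContext* PresContext() {
  return &gContext;
}

int main() {
  nsPresContext* a = 0;
  nsresult rv = GetPresContext(&a);   // old-style call site
  nsPresContext* b = PresContext();   // rewritten call site
  assert(rv == NS_OK && a == b);
  return 0;
}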

Squash Development Roadmap

Robert O’Callahan helped me devise a near-term roadmap. I am going to focus on getting squash to be production quality for member renames and on producing commit-quality patches. An example query would be to rename nsIFrame::GetPresContext to nsIFrame::PresContext. This involves a few big details:

  • Produce aesthetically pleasing code via text substitution instead of Oink pretty printing. The advantage of this is that the original coding style, comments and indentation will all be preserved. This involves reparsing the resulting code to verify correctness (which doubles memory usage and processing time).
  • To produce a complete patch, squash needs to process all of the relevant source code. This increases memory usage and processing time linearly. I’ll use grep to narrow down candidates for processing and in the future will use an AST database of Mozilla to figure out exactly what needs changing.
  • It is useful to be able to process all interesting source code in one invocation, but just processing the layout/generic directory sequentially uses over 2GB of RAM (Elsa’s AST does not support deallocation) and takes 3 minutes on a quad Opteron. So, in order to reduce RAM usage and be a trendy multi-core developer, I’ll fork() a process for every file and use that for both parallelism and memory cleanup (see the sketch after this list).
  • Develop a web frontend that maintains an up-to-date Mozilla source tree and has squash set up on it, where one would be able to enter a rename operation and have the patch emailed back. Rob even had a cool idea to have the user enter a Bugzilla ID and have the patch automatically attached to that bug. This will be useful because I won’t have to work so hard on packaging squash and users will get instant gratification. Plus, people without quad Opterons will be able to test squash too :)
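
Here is a minimal sketch of that fork()-per-file scheme (my own illustration, not squash code); processFile and the limit of four children are placeholders:

#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>
#include <string>
#include <vector>

// Stand-in for "parse one .i file and emit its patch hunks".
static void processFile(const std::string &path) {
  std::printf("child %d processing %s\n", (int)getpid(), path.c_str());
}

int main(int argc, char **argv) {
  std::vector<std::string> files(argv + 1, argv + argc);
  const size_t maxChildren = 4;        // e.g. one child per core
  size_t running = 0;

  for (size_t i = 0; i < files.size(); ++i) {
    if (running == maxChildren) {      // throttle: wait for a child to finish
      wait(NULL);
      --running;
    }
    if (fork() == 0) {                 // child: do the work, then exit so the
      processFile(files[i]);           // OS reclaims all of its AST memory
      _exit(0);
    }
    ++running;                         // parent: keep dispatching files
  }
  for (; running > 0; --running)       // reap the remaining children
    wait(NULL);
  return 0;
}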

All that is Milestone 1. After that I’ll work on infrastructure like AST-node-location info, cleaning up pretty printing and defining the exact goal for the next milestone.

Current Status

Over the past 3 days I refactored squash to be able to do renames without having to go through class squashing, etc. I added the ability to rename class members and now it can produce ugly patches for that.

The current workflow to rename nsIFrame::GetPresContext to nsIFrame::PresContext is:

  1. Identify possible targets
    find ~/work/ff-build -name \*.o |xargs grep nsIFrame > /tmp/output.sh
  2. My sed is rusty so I used regexps in Kate to convert the resulting lines into something like
    make -C ./layout/generic/ nsSpacerFrame.i
    make -C ./layout/generic/ nsFrameSetFrame.i
    make -C ./layout/generic/ nsBlockFrame.i
  3. Run the script to produce the needed .i files
    . /tmp/output.sh
  4. Grand finale:
    find ~/work/ff-build/ -name \*.i |time xargs ./squash -o-lang GNU_Cplusplus -sq-implementation nsIFrame -sq-no-squash -sq-rename-member GetPresContext PresContext > nsiframe.diff
    Note that find outputs absolute filenames, which is essential for squash to resolve relative include files.

The setup and the squashing itself are a bit laborious and RAM/CPU intensive, which is the reason for the web frontend. I am going to be ecstatic once this all works.


03
Jan 07

Squashed and Compiled

Squash Milestone Reached
Squash can now produce a patch that squashes my testcase class nsCSSLoaderImpl into the nsICSSLoader interface such that the resulting code compiles, links and runs!

Gory Details
Patching function bodies turned out to be easier than expected. Since the last post, I’ve added the ability to rewrite variable declarations, casts and static method calls. This was enough to get nsCSSLoader.cpp compiling.

I also ran into an issue where some methods need to remain virtual such that they can be referenced from other modules. I added a -sq-virtual flag to specify method names which need to stay virtual.

I discovered that the implementation class can be used from other source files, so now squash can work on multiple files. Unfortunately, this made me run into another Elsa misfeature: memory allocation. Elsa data structures do not attempt to clean up in their destructors. Once an AST is produced, it remains in memory for the duration of execution. This is an issue because merely parsing all of the .i files in layout/style/ takes over 600MB of memory, even though squash is strictly sequential and processes a single file at a time. Hopefully, converting Elsa to use auto_ptr is feasible and I won’t have to resort to funny fork() tricks to reclaim memory.
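
For illustration only (this is not Elsa code), the kind of per-file ownership I am wishing for looks something like this:

#include <memory>
#include <string>
#include <vector>

// A toy AST node that frees its children, with the root held by an owning
// pointer, so each file's tree disappears once that file has been processed.
struct Node {
  std::string name;
  std::vector<Node*> kids;
  explicit Node(const std::string &n) : name(n) {}
  ~Node() {                              // deleting the root frees the tree
    for (size_t i = 0; i < kids.size(); ++i)
      delete kids[i];
  }
};

static Node* parseFile(const std::string &file) {  // stand-in for a real parse
  Node* root = new Node("TranslationUnit: " + file);
  root->kids.push_back(new Node("FunctionDecl"));
  root->kids.push_back(new Node("ClassDecl"));
  return root;
}

int main() {
  const char* files[] = { "nsCSSLoader.i", "nsLayoutStatics.i" };
  for (size_t i = 0; i < 2; ++i) {
    std::auto_ptr<Node> ast(parseFile(files[i]));
    // ... analyze / rewrite using ast.get() ...
  } // ast goes out of scope here, so the whole tree is reclaimed per file
  return 0;
}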

ML vs C++ for Compilers: Rant
I wonder why people insist on using C++ for symbolic manipulation instead of an ML like O’Caml, and either give up or, more frequently, reinvent features such as the ML type system, list processing, garbage collection or pattern matching. Isn’t it more productive not to have to deal with segfaults or slow compilation times, and to get a tenfold reduction in code size?

Squash Usage
First I grep the build directory for usage of CSSLoaderImpl. This is imprecise and will eventually be handled by squash itself, but first the memory deallocation issue has to be addressed or an index of the whole source tree needs to be built.

find -name \*.o | xargs grep CSSLoaderImpl

This returned nsCSSLoader.o and nsLayoutStatics.o. Now the .i files are produced by running make in their respective directories:

make nsCSSLoader.i
make nsLayoutStatics.i

For convenience I gather the .i files in a moz directory and run squash:

./squash -o-lang GNU_Cplusplus -sq-exclude-include string/nsTString.h -sq-include nsString.h -sq-include nsCOMArray.h -sq-virtual LoadSheetSync -sq-virtual LoadSheet -sq-implementation CSSLoaderImpl moz/nsCSSLoader.i moz/nsLayoutStatics.i > cssloader.patch

It turns out that pretty printing C++ is hard, and Oink/Elsa’s pretty printer still needs a lot of work. By producing patches, squash rewrites only the code that needs changing. This avoids pretty printer bugs and maximally preserves comments and the original code structure. The wackiness of the pretty-printed code is apparent in cssloader.patch, especially in the function bodies.

Future Work
I am happy to see that patching is viable even without precise source coordinates or preprocessor support in Elsa. My near-term goals for squash are:

  • Push squash upstream
  • Add the ability to translate out-parameters to return values where possible
  • Get a list of candidates for DeCOMtamination and improve squash enough to process all of them
  • Work on a source code indexer. This would be useful to squash as a semantic grep database and could also be used to improve LXR.

In the longer term, I would also like to see some Elsa changes:

  • Figure out a memory de-allocation strategy
  • Resolve the pain caused by Elsa having its own “string” class, which makes using STL an exercise in namespace verbosity. If Elsa were to switch to C++ strings, the above de-allocation job would be simplified too.
  • Elsa is lossy when it comes to C++ extensions: the Elsa AST may need to be extended a little
  • It would be nice to improve Elsa memory consumption further. This would be hard.
  • It would be great to make Elsa’s C++ const-correct

21
Dec 06

Oink, Meet Squash

Progress

I spent some time proving to myself that it is possible to automate DeCOMtamination. The result is squash – a tool that aims to accomplish the first step of DeCOMtamination. My near-term goals are to be able to squash together a real-life XPCOM interface and implementation such that:

  1. The resulting code compiles
  2. Firefox runs correctly
  3. Simple functions using out parameters are converted to use return values instead

Currently I am approximately halfway through the first requirement. In its current form squash lives as a patch to the oink suite of tools.

Usage

  • Pick a simple XPCOM implementation class to squash. I like CSSLoaderImpl.
  • Produce a .i file, because Oink/Elsa do not have an integrated preprocessor yet: make CXX="gcc -E" nsCSSLoader.o; mv nsCSSLoader.o nsCSSLoader.i
  • Squash it: ./squash -o-lang GNU_Cplusplus -sq-implementation CSSLoaderImpl nsCSSLoader.i -sq-exclude-include string/nsTString.h -sq-include "nsString.h" > cssloader.patch
  • Apply the patch: patch -p0 < cssloader.patch
  • Run make, deduce the problem with the patch and start over

Current functionality at the header level:

  • Class member merging
    • virtual interface members are replaced with non-virtual ones from the implementation
    • Class members and their parameters are renamed: e.g. s/CSSLoaderImpl/nsICSSLoader/
  • Additional members from the implementation are added to the interface
  • Extra class/struct definitions used by the implementation members are moved up into the header as needed
  • Header and forward declaration inference
    • The resulting class will usually not compile because more headers are needed. Squash computes the set difference of the class definitions used by the implementation and by the interface (a rough sketch of this step follows the list). Part of the list ends up as forward declarations; the other part is translated further into #include statements.
    • The above is a tricky algorithm to implement fully, so in the meantime there are also manual overrides with -sq-include and -sq-exclude-include
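
Here is a rough sketch of that set-difference step (my own illustration, not squash code; the type names are just examples):

#include <algorithm>
#include <iostream>
#include <iterator>
#include <set>
#include <string>

// Types referenced by the implementation members, minus types the interface
// header already knows about, are what need forward declarations or #includes.
int main() {
  std::set<std::string> usedByImpl;
  usedByImpl.insert("nsString");
  usedByImpl.insert("nsCOMArray");
  usedByImpl.insert("nsICSSLoaderObserver");

  std::set<std::string> knownToInterface;
  knownToInterface.insert("nsICSSLoaderObserver");

  std::set<std::string> missing;
  std::set_difference(usedByImpl.begin(), usedByImpl.end(),
                      knownToInterface.begin(), knownToInterface.end(),
                      std::inserter(missing, missing.begin()));

  for (std::set<std::string>::const_iterator it = missing.begin();
       it != missing.end(); ++it)
    std::cout << "needs a forward declaration or #include: " << *it << "\n";
  return 0;
}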

At the source level squash renames the class part of function definitions along with their arguments.

Results

Sample patch output.

Oink patch.

Challenges and Limitations

I am really grateful for Elsa’s ability to parse C++ as used in Mozilla. I think this opens a lot of doors to automation, optimizations and new features that would be too laborious to even consider implementing otherwise. However, Elsa still has some maturing to do. We lose typedef information, pretty printing is still spotty and looks too different from the parsed C++, and the biggest challenge is the lack of “end-of-AST-node” position information. All of these will be addressed in the future, but in the meantime I have to make the unfortunate choice between getting distracted working on Elsa/Oink internals or doing workarounds and continuing to write tools.

My short-term goal is to finish step 1 and to get squash into the oink SVN.


05
Dec 06

Static Analysis

Introduction

My name is Taras Glek. I have been tasked with working on static analysis tools to automate deCOMtamination, verify code, etc.

Progress

C++ is a hard language to parse, and no static analysis can be done before the Mozilla source code is represented as an abstract syntax tree. My first step was to get the Mozilla trunk parsing with Elsa. With a small fix, Elsa now parses all files needed to build Firefox with -DNS_DISABLE_LITERAL_TEMPLATE (Elsa’s template support is incomplete).

Now that the code parses, we can move on to the next step of writing tools that analyze the Mozilla AST. These would prove certain parts of the code correct or perform source-to-source transformations such as the planned patch tool. More applications of static analysis here.
DeCOMtamination of the Gecko core is a time-consuming task and is the main candidate for static analysis. I started formalizing the broad tasks collectively referred to as DeCOMtamination into an algorithm that could be implemented. I am prototyping the analysis in O’Caml using the Olmar binding to Elsa. O’Caml is a great language for this task because it has a lot of features for writing compilers and it allows me to load and traverse parts of the AST in an interactive console, which greatly speeds up development. I am currently able to process class declarations and graph the class hierarchy into a giant pseudo-UML class diagram, and I am working on the ability to move class members up from the implementation class into the interface class while adding forward declarations, etc.

Additionally, I am involved in garbage-collection-related analysis that prevents unsafe uses of the garbage collection API, and in improving LXR precision and functionality by feeding it data from Elsa/Olmar.

Update: This mini-presentation should clarify some of the reasons why we are doing deCOMtamination.