25
Jan 08

Dehydra progress

GCC Dehydra is evolving much faster than the Elsa version did and it is easier to use. Once I implemented virtual methods correctly, Joshua was able to do his thing in no time at all. All it takes is a custom GCC (I’d love to see it packaged) and specifying plugin parameters in CXXFLAGS.

Dehydra has some new tricks now, like a tree representation of types (instead of a string) with full typedef support. Lisp remnants in GCC are getting a new life as JavaScript objects.
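
For example, where a script used to get a flat string like "nsresult", it now gets an object whose typedef chain it can walk. Roughly like this (the exact property names are still in flux, so treat it as a sketch):

// Sketch: follow a Dehydra type object down its typedef chain.
// t.typedef and t.name are illustrative names, not a frozen API.
function resolveTypedefs(t) {
  while (t.typedef)
    t = t.typedef;
  return t;
}

function process_type(t) {
  var underlying = resolveTypedefs(t);
  if (underlying !== t)
    print(t.name + " is really " + underlying.name);
}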

I’m currently working on exposing the full GCC tree structure in JavaScript so one can do any analysis one wants in pure JS. Dynamically typed GCC tree nodes are great for that. I’m starting with the middle-end GIMPLE representation, so in theory one will be able to analyze anything gcc can compile (Java, C++, C, ObjC, ObjC++, FORTRAN?). Eventually this will be expanded to support frontend-specific tree nodes, to be able to look at code closer to the way it was written. Oh, and I expect people will be able to script large parts of C++ -> JavaScript rewrites with Dehydra.
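
Since GCC tree nodes are dynamically typed, a single recursive walker should be able to cover every node kind. A sketch of how that might look from JS once the binding exists (process_tree, tree_code() and operands() are placeholder names, not a real API yet):

// Hypothetical walker over dynamically typed GCC tree nodes.
var counts = {};

function walk(node, visitor) {
  if (!node)
    return;
  visitor(node);
  for each (var child in node.operands())  // operands() is assumed
    walk(child, visitor);
}

function process_tree(fndecl, body) {  // assumed callback name
  walk(body, function (n) {
    var code = n.tree_code();  // assumed accessor for the node kind
    counts[code] = (counts[code] || 0) + 1;
  });
}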

In theory, one could make tree node conversion two-way, which would enable writing optimization passes in JS, but that would be silly.

What’s the point?

I want to be able to do exception-safety analysis in pure JS. I want to enable unit checking (through typedefs and inline conversion functions) in pure JS.
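
The unit-checking idea in a nutshell: give each unit its own typedef and have a script complain when two different unit typedefs meet. A back-of-the-envelope sketch (the process_assign callback and the .type.name properties are assumptions, not the real API):

// Hypothetical unit check: flag assignments mixing unit typedefs.
var unitTypedefs = { pixels: true, twips: true };

function process_assign(lhs, rhs) {  // assumed callback
  var lt = lhs.type.name, rt = rhs.type.name;  // assumed properties
  if (unitTypedefs[lt] && unitTypedefs[rt] && lt != rt)
    error("unit mismatch: assigning " + rt + " to " + lt);
}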

Additionally, Dehydra should be awesome for generating bindings. For example, I’ll be able to use Dehydra to import GCC’s autogenerated enums to get string names for nodes.

Also, it will become easy to extract callgraphs and various other stats from the code once GCC’s data structures are accessible in JS. Eventually we’ll be switching Dehydra to Tamarin to do all of the above really, really fast.

GCC Plugins

While I am messing with the GCC AST, Dave is working on utilizing GCC’s control flow graphs with a separate plugin. Eventually we’ll merge our work, but for now it’s nice to not step on each other’s toes while adding features to the compiler. Given how easy life is with plugins, I am amazed that people chose to go uphill both ways and not collaborate on a plugin interface for their crazy GCC extensions. Yes, I’m looking at you: mygcc and gccxml.

Aren’t there IDEs interested in making use of GCC internals too, or is everybody interested in maintaining yet another crappy C parser like Linux’s Sparse tool?

I’m looking forward to exploring the many ways we can reuse what’s in the compiler to empower developers for Mozilla 2.


17
Jan 08

GCC + SpiderMonkey = GCC Dehydra

Analysis

GCC Dehydra is starting to work. I encourage people to try it out for their code-scanning needs. The main missing feature is control-flow-sensitive traversal, which means that currently function bodies are represented in a sequential fashion. That traversal is the most complicated part of Dehydra to implement, but most of the time it is not needed.

So far I got Benjamin’s stack-nsCOMPtr-finding script to do stuff, which indicates that most of the features are working.
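
The gist of such a script is tiny: look at every local declaration and complain if its type is an nsCOMPtr. Modulo the details of the object model (the .isDecl, .type and .loc properties below are illustrative), it boils down to:

// Sketch of a stack-nsCOMPtr finder as a Dehydra script.
function process_function(decl, body) {
  for each (var stmts in body) {
    for each (var v in stmts.statements) {
      // .isDecl/.type/.loc are assumed property names
      if (v.isDecl && v.type && /^nsCOMPtr</.test(v.type.name))
        warning("stack nsCOMPtr " + v.name + " in " + decl.name, v.loc);
    }
  }
}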

My vision is to switch to the GCC backend for all of our code analysis needs since it is well tested, fairly feature-complete, and works with new versions of GCC (by definition).

Not everything is perfect in GCC land. There are some frustrating typedef issues to solve.

Source Refactoring

Elsa still holds its own when it comes to refactoring code because it has a much cleaner lexer/parser and rarely opts to “optimize away” the original AST structure. We should stick with Elsa’s arcane requirement of having to preprocess files with gcc <= 3.4 until either GCC becomes viable as a platform for refactoring or clang matures.

GCC is not suitable for refactoring work because:

  1. It starts simplifying the AST too early.
  2. Its parser is handwritten and would therefore be hard to modify to maintain end-of-AST-node location info.
  3. It reuses many AST nodes, which means their locations point at the declaration rather than the usage point.
  4. The handwritten nature of GCC makes all of the above improvements time-consuming to implement, and the political issues are something I’d rather not deal with.

Most of these wouldn’t have been an issue if GCC were written in ML :)

What’s Next?

Time to start using GCC Dehydra to enforce GC safety, plus lots of fun exception-rewrite preparation work.

Stay tuned for more exciting developments regarding regaining control over source code here and on Dave Mandelin’s blog.


08
Jan 08

Dehydra as a GCC plugin

Thanks to the 2-fold increase in manpower working on pork, we finally have an opportunity to work on the nice-to-have things.

Progress

Recently I have been working on a GCC plugin to do Mozilla-specific analyses with GCC.

Unfortunately, I didn’t notice that GCC had a plugin branch, so I reinvented the wheel there. Fortunately that part was rather easy, and it turned out that the plugin branch isn’t very useful to work with as it stands in SVN: it doesn’t link GCC with -rdynamic, nor does it install the hooks I need in the C++ frontend. Overall the plugin shim is relatively trivial, and it will be pretty easy to merge with other similar efforts.

My first and only plugin is a C reimplementation of Dehydra. GCC sources are currently fairly hostile to C++, so I elected not to make my head spin by mixing in C++ in addition to C and JavaScript. I think the C Dehydra has reached the hello-world state; to take it for a spin, see the wiki page.
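
By “hello world” I mean a script that just proves GCC is calling into JS at the right spots, something like:

// About the simplest possible Dehydra script: print everything
// the compiler reports. Callback names follow the current,
// still-settling plugin API.
function process_type(t) {
  print("type: " + t.name);
}

function process_function(decl, body) {
  print("function: " + decl.name);
}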

GCC Thoughts

Integrating with GCC is pretty awesome. So far I regret not jumping in earlier. I was reluctant to do so because everyone I’ve talked to (other than Tom Tromey) claimed that GCC is ridiculously complicated and impossible to do stuff with. In fact, academic people are so scared of GCC that they tend to opt for commercial frontends that have ridiculous licensing terms and make it impossible to release their work to the general public.

GCC internals are pretty crazy since everything is done with macros and the AST is dynamically typed, so it’s fairly painful to figure out seemingly simple things like “what AST nodes does this AST node contain”. Additionally, GCC loves rewriting AST nodes in place as compilation progresses, which sucks when one wants to analyze the AST while it looks as close as possible to the source. The GCC parser also sucks to work with, as it is implemented as a C code hodge-podge (a technical term which applies to much code in GCC). Luckily, I am mainly concerned with poking at data that’s already in GCC.

The upside is that GCC is a well-tested production compiler that most source compiles with. Integrating with GCC means that the AST is correct (Elsa is only a frontend, so there is no way of knowing if its AST has mistakes in it). Integration also means that the user doesn’t have to worry about making preprocessed files or maintaining obsolete versions of GCC or old GCC headers. Unlike Elsa, GCC already has useful features like typedef tracking, and it doesn’t implement location tracking with a stupid programming trick. Additionally, I hope to reuse computations from middle-end GCC passes to build my control flow graph, do value numbering and other useful but tricky-to-implement stuff.

GCC isn’t scary at all; it’s just another way of implementing a compiler. Some people elect to have more pain in life by reinventing ML in C++ instead of using ML for compiler writing; others get their pain dosage from working on a C compiler originally generated from LISP sources.

Lastly, I’d like to thank the patient gcc hackers in #gcc, without whom I wouldn’t stand a chance of figuring out how to get this far.


28
Dec 07

Recent Progress

Looks like pork is slowly going to get merged back into oink. This makes me happy, as it will result in decreased merging headaches and give more visibility to my work outside of Mozilla. My elkhound changes are already in!

Recently I added support for retaining GNU attributes to elsa, and corresponding features to dehydra and garburator. Now dehydra can verify things based on attributes, and garburator gained a way to rewrite special cases like classes that are always allocated on the stack. Elsa still drops most attributes, but at least classes, methods and variable declarations are covered.

I also spent a couple of days investigating gcc plugins. Turns out modifying gcc to support plugins is dead easy, but getting anything useful done in GCC involves a steep learning curve. I tried to find out how to enumerate all of the toplevel declarations in the source, but I couldn’t find the correct global variable that corresponds to the toplevel scope (aka the translation unit?). I have a few more ideas of what to try next. Once I do that, it shouldn’t take much work to make a basic gcc-hosted version of dehydra. There is also a gcc plugin branch hosted in the gcc svn, but I can’t find any example code for it. It isn’t a big deal since none of the plugins I’ve seen mentioned venture outside of intra-function analyses.

I am still pondering how to tackle rewriting Mozilla to use exceptions. It is the key to improving the overall readability/perf of Moz C++, but the logistics of writing the corresponding analyses+rewrites, followed by a parallel manual correction step, are still making my head spin. All I’m sure about is that the first step to exceptions would be to enable the OOM exceptions and do the corresponding exception-safety analysis+rewrite.


29
Nov 07

GCC Plugins under my xmas tree?

Over at LWN there is an article on GCC plugins. It touches on how useful it would be to implement static analysis tools as GCC plugins.

It does not mention that certain optimizations are not feasible without interfacing with the compiler, and that there could be a very significant decrease in errors if the compiler were pluggable with API-specific checks. Wouldn’t it be nice if less developer time had to be spent hunting for common bugs and more implementing awesome new features?

Typically, safety is accomplished by executing code in a virtual machine that does extra runtime checks, with just-in-time compilation to make up for performance losses. This approach has many known performance and footprint disadvantages. It is used by languages like Java and Scheme.

Applications in these languages are slow and/or ship with a JIT compiler to optimize them at runtime. C++ is compiled ahead of time and has a reputation for running faster than these dynamic languages. However, it also makes it a lot easier to make mistakes such as memory leaks and buffer overflows. One can use the C++ OO system to perform extra runtime checks to avoid some of these issues, but that tends to cancel out any performance advantages of writing code in C++.

Another approach is to enforce various safety-related properties through the compiler. Awesome existing languages such as OCaml come with a strict type system that ensures that once code compiles, it will run fast and have a lower bug rate than comparable code in other languages. EcmaScript4 will feature a rocking type system similar to OCaml’s.

C++ does not have such an awesome type system. There are many C++ errors that occur frequently and should be detected by the compiler; however, as long as there is no way to specify Mozilla-specific type system restrictions, the compiler has no way of getting that information. Such plugins would provide a certain peace of mind: once code complies with whatever rules we set for it, it is more likely to run correctly. Furthermore, some optimizations that we have in mind for Mozilla 2 (such as incremental garbage collection) will be much easier to work with if the compiler flags memory misuse at compile time.

Currently we can use dehydra to scan the codebase, but it would be much more efficient to be able to plug such verification abilities into every developer’s GCC. I sincerely hope that whoever is in charge of the plugin decision at GCC will realize the massive advantage this would give to GCC over other compilers.

P.S. Another use for plugins is to enable more aggressive optimizations. Small changes in the source code can affect how conservative the generated code is. A clever plugin could warn whenever gcc cancels an optimization due to misbehaving source.


28
Nov 07

Volume of Refactoring Ahead

In the previous post, I described the simple rewriting case that I am working on at the moment. Someone was quick to point out that the approach wouldn’t work for all methods (XPIDL arrays were the example). Indeed, anything more complicated than simple getters can’t be rewritten to the “Succeeded/Failed” pattern without switching to C++ exceptions. However, in a codebase the size of Mozilla there are several megabytes’ worth of outparam getters to be rewritten.

Currently my wimpy little outparams.js script identifies 100 methods that can be rewritten by outparamdel without any manual intervention. However, a search for ::Get with a ** parameter yields over 2700 candidates, of which most look like they can be rewritten. The reason for the laughable detection rate is that the detection script currently refuses to flag methods that are defined in an XPIDL interface or are implemented by more than one class. Soon the script will make heavier use of the class hierarchy, and we will probably change XPIDL to support more efficient getters.

Complete Class Hierarchy

I finally managed to convince Dehydra to serialize the Mozilla class hierarchy into JSON files without running out of virtual memory. This will generate lots of input for refactoring and analysis tools; all kinds of interesting stats can be produced with simple scripts. Generating the index is relatively straightforward. It would be awesome if someone could figure out how to expose this data as a web app. Since there is so much data to load incrementally, I don’t see how one can keep things simple yet use an asynchronous API.
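
As a taste of the kind of one-off stat this enables: assuming the JSON maps each class to its list of bases (the schema below is illustrative, not final), finding interfaces with exactly one implementation is only a few lines of JS:

// Sketch: given hierarchy data like { "nsFoo": ["nsIFoo"], ... }
// (class -> bases; the schema is assumed), list bases that have
// exactly one implementation.
function singleImplementations(hierarchy) {
  var implCount = {};
  for (var cls in hierarchy) {
    for each (var base in hierarchy[cls])
      implCount[base] = (implCount[base] || 0) + 1;
  }
  var result = [];
  for (var base in implCount) {
    if (implCount[base] == 1)
      result.push(base);
  }
  return result;
}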

In the coming months, I am looking forward to extending this to be a complete callgraph to find dead code and other fun data.


26
Nov 07

Mozilla 2: Outparamdel

There will be a lot of under-the-hood code changes in Mozilla 2. Our goal is to end up with a simpler, safer and faster codebase.

This is my perspective on the work ahead with respect to outparamdel.

Outparamdel

In the presence of a garbage collector we will be getting rid of stack nsCOMPtr<> usage (using raw pointers instead), but getter_Addrefs() will still be needed to pass references to heap-allocated nsCOMPtrs so they can be assigned to. Eliminating as many outparams as possible will eliminate much of the getter_Addrefs() footprint/perf/code bloat.

Within the next week I hope to attempt to rewrite all of the auto-detectable outparamdel candidates.

Determining What To Rewrite

outparams.js is a dehydra script for flagging code that can be rewritten to avoid outparameters. It turned out a bit trickier than I initially expected. Here are the required checks:

  1. Check that a function only ever returns one failure error code (i.e., NS_OK and at most one other value). This is enough if the function is not virtual. (A sketch of this check follows the list.)
  2. Ensure that either a) this is the only implementation of a particular virtual function, or b) all other implementations of this function satisfy check 1. Currently I only do a).
  3. Also check that all overloads of this function have the same outparameter type. This is required since C++ (thankfully) doesn’t allow function overloading by varying only the return type.
  4. Checks 1-3 ensure that the function can be rewritten; however, one also needs to determine whether the return type should be wrapped in getter_Addrefs<>. This cannot be determined just by looking at the getter, so one has to scan the code for usages of the function to see if the outparam is ever passed a getter_Addrefs.
  5. Check that none of the callers are within macros to minimize non-automatic rewriting.
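
To give a flavor of check 1: it amounts to collecting the distinct return values of a function and counting the non-NS_OK ones. A simplified Dehydra-style sketch (the .isReturn and .value properties are assumptions about the object model):

// Sketch of check 1: does the function return NS_OK plus at most
// one failure code?
function returnsOneFailureCode(body) {
  var failures = {};
  for each (var stmts in body) {
    for each (var s in stmts.statements) {
      // .isReturn marks return statements; .value is assumed to be
      // the returned expression's source text
      if (s.isReturn && s.value != "NS_OK")
        failures[s.value] = true;
    }
  }
  var count = 0;
  for (var rv in failures)
    ++count;
  return count <= 1;
}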

Checks 2 and 3 require the complete class hierarchy of Mozilla, so I finally made a few more dehydra scripts to produce it. This was on my TODO list for a while and should make a few other interesting analyses possible (my favourite one is finding interfaces which only have one implementation, to get rid of excessive virtual functions).

Checks 4 and 5 were easiest to implement as warnings in outparamdel.

One should keep in mind transitivity. Once the first outparamdel candidates are rewritten, some of their callers should become flagged for rewriting in the same manner.

Rewriting Code

Here is an example of how code will be simplified.

Given a function:

nsresult getFoo(nsIFoo **out);

And usage like:

nsCOMPtr<nsIFoo> bla;
nsresult rv = getFoo(getter_Addrefs(bla));

if (NS_SUCCEEDED(rv) && bla) {
...
} else {
return rv;
}
nsCOMPtr<nsIFoo> bla2;
return getFoo(getter_Addrefs(bla2));

The function definition will become:

nsIFoo* getFoo();

Before this can be done, several issues come up:

  1. It is not clear if the original getFoo() is allowed to always overwrite the value passed to it. This is hard to determine automatically, so we make the assumption that in the general case it is OK.
  2. It isn’t obvious if getFoo() returns null in the outparam to indicate some non-error condition. This is rare, so the parameter shall be annotated. Currently, the plan is to annotate with a NULLABLE_OUTPARAM() macro, which would be detectable by dehydra and serve as documentation for the function’s behavior.

nsCOMPtr<nsIFoo> bla = getFoo(); //after XPCOMGC rewrite the left side will become nsIFoo* bla
if (bla) { // could even merge the above declaration into the if condition
...
} else {
return NS_ERROR;// or NULL if this is another outparamdel candidate
}
nsCOMPtr<nsIFoo> bla2 = getFoo();
return bla2 ? NS_OK : NS_ERROR_SOMETHING; // I'm not sure if outparamdel should use an explicit ternary operator or an inline function to convert new style errors into nsresult

Currently the code is being rewritten in a much uglier way, so this cleaner version will likely be implemented as an optimization pass (probably with a new tool, outparamdel-opt?). There are several tricks here:

  1. Connect the declaration with initialization of bla and bla2
  2. Detect the error check and replace it with “bla” (or !bla for NS_FAILED). Then realize that bla && bla contains a redundant expression and take it out.
  3. Do something similar for return statements.

Result

This should result in prettier code that compiles quicker and to a smaller, more efficient binary. It will also be more GC-friendly.

C++ Exceptions

Right now outparamdel does rewrites that are useful even if C++ exceptions are never introduced. There are further code-reduction gains possible if the above error checks were converted into C++ exceptions, but I am not clear on the performance characteristics of exceptions. We would also need to change tamarin exceptions to match C++ ones before any experimentation can be done.


12
Oct 07

Rewriting Tools for Mozilla 2: Moving Forward as Planned

In the Beginning There Was a Void

Approximately a year ago, Brendan discussed with me the crazy possibility of rewriting most of the Mozilla code automatically to modernize the codebase. The benefits were huge: Gecko would use the C++ standard library to improve code readability and reduce size, XPCOM would be ripped out of the core to improve performance and decrease footprint, etc.

It seemed like a good idea, but in reality no other giant C++ project had attempted this before, so we were not sure how realistic it was. I spent a year in a lonely corner of Mozilla trying to materialize the idea.

Brendan & Graydon pointed me to elsa, the C++ parser that supposedly could parse Mozilla. However, it turned out that it was only able to parse an old version of Mozilla and rejected the new source. One of the elsa maintainers even tried to convince us that it was not designed for source-to-source transformations and wouldn’t work that way.

After I patched up elsa and started devising ways to use it for source rewriting, I ran into more pain. After a few false starts, I realized that C++ in Mozilla is actually a mix of CPP and C++, and one cannot rewrite C++ without dealing with the mess that is macro expansion. MCPP was pointed out to me as a good starting point for hacking on a preprocessor, so I designed an inline log format for macro expansion. To my surprise the maintainer of MCPP, Kiyoshi MATSUI, volunteered to implement the spec and thus saved me from a world of pain. (For which I am eternally grateful, as I can’t imagine a more depressing pastime than working on the root of all evil: the C preprocessor.)

In parallel with Kiyoshi’s work I modified elkhound & elsa to make the C++ parser a lot more suitable for source transformations. I learned about LR & GLR parsing and confirmed my suspicion that I don’t want to write parser generators for a living.

Happy Conclusion

All this work finally got us what we discussed last September: a framework for doing lots of boring code rewrites.

The first big Moz2 task is switching from reference counting to garbage collection. Today, garburator produced a gigantic patch for a subset of the content/ module, and all of the affected files compiled. Hopefully next week I’ll have a multi-megabyte patch for the whole of Mozilla that compiles and possibly runs.


30
Aug 07

Pork: The Brave New World

It doesn’t look like my oink patches are going to reach upstream anytime soon. In fact, in the past year no patches have landed in the oink tree. I think this is unfortunate, but I have my own repository at http://hg.mozilla.org and will continue working on my fork. So allow me to introduce Pork, the Oink fork. If anyone has any exciting source-location, pretty-printing or other elsa improvements, I hope to eventually see those land in pork.

Note, pork is just an informal name for the fork, I am not currently planning to do any source renaming or repository renaming.

Pork: Now in a VMware flavour

bsmedberg pointed out that oink seems a bit intimidating to set up and that it would be nice if it came in a virtual machine. So I have oink/mcpp/etc packaged up in a virtual machine that can build Mozilla with gcc3.4, run dehydra analyses and produce automated patches. If you want to play with it, post a comment and I’ll reply with a download link.

A couple of people asked me about simple code scans to look for problems and optimization opportunities. Next week I am going to post a few simple dehydra recipes for finding patterns in the code. It’s dead simple with JavaScript and some minimal shell knowledge to aggregate the results. The VM should provide a good way to get started.
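
To give a taste of what such a recipe looks like: a scan that flags every call to a particular function is only a few lines of JS (the .isFcall and .loc properties are roughly what dehydra exposes, though the exact names may differ):

// Recipe sketch: report every call to PR_Malloc.
function process_function(decl, body) {
  for each (var stmts in body) {
    for each (var s in stmts.statements) {
      if (s.isFcall && s.name == "PR_Malloc")
        print(decl.name + " calls PR_Malloc at " + s.loc);
    }
  }
}

Run that over the tree, pipe the output through sort and uniq -c, and you have a report.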

Outparams: QueryInterface Rewrite Milestone Reached

I finally got Mozilla to compile with the outparam-less qi call. The outparamdel part of that was surprisingly easy and straightforward. I also learned all kinds of fun details about multiple inheritance and fun ways to construct functions out of macro segments. The next step is to get the code to run. I’ll also have to look into a few minor MCPP issues that the rewrite uncovered. See the QI bug for the gory details.


13
Jul 07

Dehydra, prcheck, squash – in mercurial

New Repository

Since I do not yet have write access to oink svn, I have been doing all of my development in ad-hoc repositories within the svn checkout. This made it rather hard to collaborate with others. I finally got sick of the situation (and stumbled upon hgsvn) and converted all 11 svn repositories to mercurial. To my surprise, mercurial even let me merge my repositories while preserving history (hg has yet to fail me!).

oink uses svn externals to aggregate the repositories into a single checkout. hg doesn’t have anything similar, so to check out all 11 repositories use a script:

checkout.sh http://hg.mozilla.org

Released Differences from Oink Mainline

  • New oink tool – prcheck: ensures that bool-like integer typedefs behave like bools
  • New oink tool – dehydra: source query tool with queries specified in JavaScript
  • New oink tool – squash: source refactoring tool. This is now deprecated since most of the code in it dealt with working around elsa limitations to do with macro expansion & lack of precise locations. The patching engine used in squash lives on to provide a simple refactoring API for use in other tools (like prcheck).
  • Minor grammar changes to parse more of Mozilla
  • Compilation fixes for OSX
  • Elsa fixes to parse OSX headers
  • make -j support for elsa
  • end-of-ast-node location support for elkhound & elsa
  • preprocessor expansion markup support for elsa

Coming Soon

  • Amazing new version of MCPP capable of preprocessing mozilla while outputting refactoring-friendly annotations.
  • Web front-end for squash which will likely be refactored to be tool-agnostic.
  • Front-end to run patch-producing tools in parallel for multi-core machines

Near Future

  • squash will be split up into a library, with each major feature ripped out into a standalone tool. Two tools coming soon: outparam rewriter & class member renamer.
  • RAD for static analysis: oink tool templates to make it trivial to write custom new tools with a minimal amount of boilerplate

Some time in the Future

  • Collaboration with the author of Olmar to provide an OCaml API for Elsa. If everything goes as expected, it will be possible to write analyses that are more powerful and more concise than DeHydra ones, except they will perform at C/C++ speeds. Plus it should be possible to run them from a native interactive OCaml toplevel. Most of this work already exists in bits and pieces; it’s a matter of adding some AST transformations, fixing a few issues and tying it all together.
  • MapReduce-inspired front-end: a generic framework for executing transformations/analyses in parallel and Mozilla-wide without blowing the 32-bit address space (as is typical when static analysis tools meet Mozilla).