Free Software Communities

This post was originally published in Portuguese as a version of the previous article.

Free software projects are very different from one another. Some are very technical and aimed at engineers. For example, the main goal of LLVM is to build a set of modular compiler components. You have to be involved in software development to care about such things.

Others are much more ambitious. Note, for example, how the mozilla mission talks about “building communities”. Any internet user can care about that mission, and many do.

Given the differences among projects, it is interesting to see how they operate. In one of my first articles I wrote about the differences in the processes used to review and accept changes. I would now like to talk about how discussions happen and how they are moderated.

A famous case in this area is glibc and one of its maintainers, Ulrich Drepper. I once had an interesting email exchange with him, but things can get much worse.

glibc is probably an extreme case. In contrast, eglibc has in its mission “Encourage cooperation, communication, civility, and respect among developers”. This seems to be one of the reasons why debian switched to it.

In llvm things are going well. The worst case I have been involved in was the discussion about a change I wrote and about a bit of development strategy: should we implement medium-term solutions, or discuss and wait for possibly better future solutions?

There have been other large discussions, but they seem to be steered in the right direction.

At mozilla I had a somewhat rough start when I reported a series of problems and got comments like this one on almost all of them.

Things have worked well since then, but there was an unexpected conversation in one of the mozilla IRC channels. It started with a piece of local news, moved to a discussion about the exact meaning of “regression” and went downhill from there:

<beltzner>	now we don't have filthy brazilians traipsing through a park

and

<beltzner>	you're a piece of work, espindola
<beltzner>	and another student at the foot of brendan, sadly

Others said they were uncomfortable with the situation, and were mocked for it:

*	mconley is also getting uncomfortable
<beltzner>	get a fucking cushion?

So, who is beltzner? We are in a better situation than glibc: he is not the maintainer of firefox, as no such position exists. Until April he was “Director of Firefox at Mozilla Corporation” and he is still an active and visible member of the community. For example, he was in the last Fireside Chat with Mozilla’s Mitchell Baker.

So where are we in the end? I do not think we will have a “civilized fox” fork like the one that happened with glibc, but it seems we are not doing as well as the less ambitious llvm. I do not understand how tolerating this kind of behavior is productive in “building communities”. Nor do I understand why we have someone with this behavior as a visible part of the community.

Posted in Uncategorized | Comments Off

Open source communities

Open source projects are very different from one another. Some are very technical and engineer-focused. For example, the primary mission of LLVM is to build a set of modular compiler components. You have to be very involved in software development to care about such things.

Others are much more ambitious. Note for example how the mozilla mission statement talks about “building communities”. Any internet user can care about such a statement and many do.

Given the differences among projects, it is interesting to see how they operate. In one of my first posts I observed some differences in the processes used for reviewing and checking in changes. This time I would like to look at how discussions happen and are moderated.

One famous example in this area is glibc and one of its maintainers, Ulrich Drepper. I had an interesting email exchange with him once, but things can get a lot worse.

glibc is probably an extreme case. In contrast, the competing eglibc has a mission statement that includes “Encourage cooperation, communication, civility, and respect among developers”. That looks to be one of the reasons why debian switched to it.

On the llvm side things are working OK. The worst case I have been involved with was a discussion about a patch I wrote and a bit of development strategy: Should we implement medium term solutions or discuss and wait for potentially better ones in the future?

There have been other large threads, but they seem to be nudged in the right direction.

In mozilla I had a bit of a rough start when I reported a series of bugs and got comments like this one in almost all of them.

Things have been working OK ever since, but there was a really unexpected conversation in one of the mozilla IRC channels. It started with a piece of local news, switched to a discussion about the precise meaning of “regression” and then went downhill:

<beltzner>	now we don't have filthy brazilians traipsing through a park

and

<beltzner>	you're a piece of work, espindola
<beltzner>	and another student at the foot of brendan, sadly

Others pointed out that they were uncomfortable with the situation, only to be made fun of:

*	mconley is also getting uncomfortable
<beltzner>	get a fucking cushion?

So, who is beltzner? We are better off than glibc: he is not the maintainer of firefox, as there is no such position. Until mid April he was “Director of Firefox at Mozilla Corporation” and he is still a very active and visible member of the community. For example, he was in the last Fireside Chat with Mozilla’s Mitchell Baker.

So where are we in the end? I don’t see a “civilized fox” fork happening like in glibc, but we seem to be doing worse than the less ambitious llvm. I don’t see how tolerating this kind of behavior is productive in “building communities”. The same goes for having someone with such behavior be a very visible part of the community.


Reading email

Like most people doing software development, I receive a lot of email, both in my personal gmail account (which I also use for llvm development) and in my mozilla.com one (used for firefox, including bugzilla notifications).

For a long time I had just used gmail’s web interface. It is an awesome one. In particular, its keyboard navigation is better than anything I have seen in a desktop client.

For the mozilla account I would use thunderbird via imap. Using two programs for the different accounts was a bit cumbersome. The gmail web interface also has the disadvantage that it is hard to back up. Google’s storage is probably as reliable as it gets, but redundancy is not backup: any mistake I make will be very reliably recorded.

Since gmail also supports imap, I decided to make imap with thunderbird my default for both accounts. The problem was that I have a lot of messages in my gmail account, and gmail’s imap can be a bit slow. That would not have been a big problem, except that thunderbird can get very unresponsive on slow IO.

I tried evolution too. It was a bit better as the UI itself would not freeze. It would reflow if I resized the window for example, but clicking a button would do nothing until the IO operation completed.

This is when I found offlineimap. It is a really nice tool that does two-way synchronization between imap and maildir. I tried downloading my gmail account and it worked; it just took a few days the first time :-)

The problem then was how to read the email. Thunderbird support for maildir is still under development in bug 402392. I tried evolution again. This time the problem was just the UI. Not having thunderbird’s “show threads with unread” and not marking messages as read automatically were show stoppers.

To fix the problem I installed dovecot. It can serve imap from a maildir, closing the loop and providing a fast imap connection for thunderbird to use. This exposed a small bug in offlineimap’s handling of flags, but a small patch fixed it.

So far I am very happy with the results. I have a single interface for all email, backups work and the UI is responsive. I think the only thing that is lost is being shown the summary of the new messages while they are still being downloaded.

I would go as far as suggesting that having a dedicated imap sync process is a reasonable design for a mail client. The process that handles imap knows nothing about the UI. The UI only accesses local storage and cannot block on the network. Any messages being shown are known to be safely stored on disk. Even the display of a summary of the messages still being downloaded should be doable by extending the maildir format a bit.
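As a rough illustration of the UI side of such a design, here is a minimal C++17 sketch that counts unread messages using only the local maildir. The function names are hypothetical and not taken from any mail client. It assumes the usual maildir layout: messages in new/ have not been seen yet, and messages in cur/ carry their flags after a ":2," marker in the file name, with "S" meaning seen.

```cpp
#include <cassert>
#include <cstddef>
#include <filesystem>
#include <initializer_list>
#include <string>

namespace fs = std::filesystem;

// Returns true if a file name from cur/ carries the maildir "S" (seen) flag.
// Flags follow the ":2," marker, e.g. "1234.host:2,RS".
bool has_seen_flag(const std::string& name) {
  const std::size_t pos = name.rfind(":2,");
  if (pos == std::string::npos) return false;
  return name.find('S', pos + 3) != std::string::npos;
}

// Counts unread messages by looking only at local storage; no network access.
std::size_t count_unread(const fs::path& maildir) {
  assert(!maildir.empty());
  std::size_t unread = 0;
  for (const char* sub : {"new", "cur"}) {
    const fs::path dir = maildir / sub;
    if (!fs::is_directory(dir)) continue;  // missing subdirectory: nothing to count
    const bool is_new = std::string(sub) == "new";
    for (const fs::directory_entry& entry : fs::directory_iterator(dir)) {
      if (!entry.is_regular_file()) continue;
      // Everything in new/ is unread; in cur/ it depends on the flag.
      if (is_new || !has_seen_flag(entry.path().filename().string())) ++unread;
    }
  }
  return unread;
}
```

A client built this way would call something like count_unread on the directory that the sync process writes to, and would never wait on the network itself.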


Tracking malloc on OS X

In the firefox debug builds we like to be notified when memory is allocated. That can be used to know whether we are freeing it and, if not, where and when the leaked memory was allocated. Knowing when memory is allocated can be a bit tricky. We use many third party libraries and they allocate memory without using any firefox specific functions.

Fortunately, OS X supports calling back a function every time malloc is used. This solves the problem of third party libraries allocating memory without us knowing about it. In fact, even other libc functions can call malloc and we still get notified about it.

One set of libc functions that can call malloc is the one that does mutex handling. This means that our malloc callback can be called with code up the stack holding locks, which means that whatever the callback does, it must not try to grab those locks again or we deadlock.

As with allocation, we don’t control all the locks. Simple libc functions like printf use locks. A particularly interesting case is that of the mutex handling functions. On Leopard and Snow Leopard, they use a memory pool, which requires locking to access. The function that looks responsible for handling the pool is new_sem_from_pool.

The interesting event that we would see from time to time went like this: our code would try to use a mutex, which would call new_sem_from_pool, which would lock the pool and sometimes have to grow it by allocating memory, which would then call our callback.

Let’s consider the situation our callback is in. Up the stack there is a function holding a lock on the pool where new semaphores come from. Any attempt to create a semaphore will deadlock. Calling printf can deadlock!

The code that we had for some time solved this problem by first checking if new_sem_from_pool was on the stack. If it was, it would return from the malloc callback without doing anything.

The idea works just fine; the problem is how to detect that new_sem_from_pool is on the stack. That function is internal to libc, so we cannot call dlsym on it. Debuggers find it by parsing the object files, which is probably not something we would like to do in firefox.

What was done was to exploit a bug in the Leopard version of dlsym. It would consider a public symbol as extending until the next public symbol, including in that range any private ones that were in between. This allowed us to use dlsym on pthread_cond_wait$UNIX2003 and find a superset of the addresses of new_sem_from_pool.

The “problem” I was faced with was that the bug had been fixed on Snow Leopard. There was no public symbol we could use as a replacement for pthread_cond_wait$UNIX2003.

The good news was that dladdr was also fixed. Calling it on a pointer in new_sem_from_pool would return information about that function. Unfortunately calling dladdr for every address on the stack for every memory allocation was a bit too expensive.

What finally worked was adding a very early initialization step. We create a mutex and force what we know will be the very first call to new_sem_from_pool, which means it will have to allocate memory for the empty pool. We also set a malloc callback that, this one time, walks the stack to find the address in new_sem_from_pool where it calls realloc. Having recorded that address, regular walks just need to compare against it to know that they must ignore this allocation and return quietly.
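The "record once, compare later" idea can be sketched in isolation. The class below is a hypothetical illustration, not the firefox code: it takes the frames of a stack walk as a plain array of return addresses and deliberately leaves out how the walk is done and how the callback is registered.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: remembers one return address seen during the early
// initialization step and later checks stack walks against it.
class PoolCallDetector {
public:
  // Called once, while handling the forced first pool allocation.
  void record(const void* return_address) { recorded_ = return_address; }

  // Called from the allocation callback with the frames of a stack walk.
  // Returns true when the recorded address is on the stack, meaning the
  // callback should return without doing any work.
  bool should_ignore(const void* const* frames, std::size_t count) const {
    assert(frames != nullptr || count == 0);
    if (recorded_ == nullptr) return false;  // nothing recorded yet
    for (std::size_t i = 0; i < count; ++i) {
      if (frames[i] == recorded_) return true;
    }
    return false;
  }

private:
  const void* recorded_ = nullptr;
};
```

Each regular walk then costs one pointer comparison per frame, instead of a dladdr call per frame.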

One last issue was that the code still had to work on Leopard with its broken dladdr. Walking up the stack would show two frames with pthread_cond_wait$UNIX2003. Comparing with a gdb produced backtrace showed that the second call was the one that really was new_sem_from_pool, so we just have to make sure that on Leopard we look for the “wrong” name during initialization and keep the first we find during the walk.

What about Lion? It looks like the pool was reimplemented to use some internal allocation. Our callback is never called with the lock held. Many thanks to whoever fixed this :-)


Clang on the bots

For some time now clang can build firefox and it looks like the resulting binary works fine. For clang to be used “for real” in mozilla, we still need to avoid people breaking the build and we also need to give clang built binaries more testing.

A major step in that direction was the fixing of bug 693605. That bug replaced the version of clang we have on the bots with a newer one that included all the fixes we knew were needed.

Having clang on the bots allowed me to do a try run to test the clang built binaries in the same way we test our regular binaries.

The run found that OS X 64 bits is already all green, and we are now working on getting regular builds with clang (bug 695726). That should avoid the build being broken with clang, as the builders would quickly find that out.

The try run also found some interesting failures I was not seeing locally. This log is a good example.

From the log it looks like a clang built binary fails with

###!!! ASSERTION: anonymous nodes should not be in child lists:
'!aOldChild->IsRootOfAnonymousSubtree()',
file /builds/slave/try-osx-dbg/build/layout/base/nsCSSFrameConstructor.cpp,
line 11458

That is not the case. In Mozilla, assertions are more of an expectation than an assertion. In fact, a gcc built binary fires the very same assertion.

The real failure is what comes next:

INFO | runtests.py |
Received unexpected exception while running application
'need more than 4 values to unpack'

What is happening is that our test infrastructure is trying to read a symbol list, failing to parse it and crashing. The symbol list is created by running make buildsymbols. The script the bots use runs it, but I was not running it locally, which is why I saw no failures. It is a pity that so much of the mozilla infrastructure is spread over so many repositories. It makes it hard to do something the same way the bots do.

Now, why was it failing to parse the symbol list? Because breakpad was failing to read the debug info to produce it. This is an issue that was originally found by the chromium developers.

I have fixed llvm’s debug generation to produce something more in line with the standard and that breakpad can parse. We now have a bug to upgrade clang in the bots again to include that fix. Let’s see if this time we get more greens :-)


Interesting c++ snippet

The snippet

Try compiling and running this code with and without the first declaration of f commented out:

namespace X {
  template<int x> int f(int); // comment this
}
namespace A {
  struct B {};
  int operator>(int, B) {
    return 42;
  }
  template<int X> int f(B) {
    return 24;
  }
}
int f;
A::B x;
namespace X {
  int foo() {
    return f<42>(x);
  }
}
int main() {
  return X::foo();
}

It should return 24 when the declaration is present and 42 when it is not. No, this is not a bug in the compiler you are using. It is not even a bug in the c++ standard (a defect, as they call it). In fact, it is based on a note in [temp.arg.explicit] in a draft of the c++11 standard!

What is going on

To understand this code, one must know Argument Dependent Lookup. When reading code like

 Y y;
 x = foo(y);

most c++ programmers expect foo to be looked up from the innermost context out. First the function we are in, then the class, then the namespaces.

But then, how can the following program work?

#include <iostream>
namespace bar {
  class X {
  };
  void operator << (std::ostream& o, const class X&) {
    o << "zed\n";
  }
}
int main() {
  bar::X foo;
  std::cout << foo;
}

The operator was defined in the bar namespace and nothing in std::cout << foo; mentions bar. The answer is that the compiler also looks in the namespaces of the types of the function call arguments.

Now, is f<42>(x) a function call? It depends. If there is a function template named f in scope, it is. Otherwise it is parsed as the comparison (f < 42) > (x). That is why adding a declaration that is never used changes the meaning of the above program.
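To see both parses side by side, here is a variant of the snippet that puts the two cases in one file. The namespace names WithTemplate and WithoutTemplate and the demo function are mine, chosen for illustration; everything else mirrors the original.

```cpp
#include <cassert>

namespace A {
  struct B {};
  // Chosen when f<42>(x) is parsed as the comparison (f < 42) > (x).
  int operator>(int, B) { return 42; }
  // Found through argument dependent lookup when f<42>(x) is a call.
  template <int N> int f(B) { return 24; }
}

int f;   // a plain global variable that happens to be named f
A::B x;

namespace WithTemplate {
  template <int N> int f(int);    // declared, never defined or called
  // Lookup finds a template named f, so this is a call; ADL picks A::f<42>.
  int foo() { return f<42>(x); }
}

namespace WithoutTemplate {
  // Lookup finds only the global int, so this is (f < 42) > (x).
  int foo() { return f<42>(x); }
}

void demo() {
  assert(WithTemplate::foo() == 24);
  assert(WithoutTemplate::foo() == 42);
}
```

The two foo bodies are token for token identical; only the declarations visible at that point decide which meaning the compiler picks.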

Considerations

The above snippet came up twice in discussions on what a compiler can do to avoid unnecessary work. Once in trying to avoid having to parse all the headers, which can be fairly big, and once in trying to decide if, having parsed them once, the compiler could at least avoid unnecessary recompilation when they change.

In the first question, the example shows that the compiler would have to keep a fairly low level representation of the code. If the f<42>(x) was in an inline function in a header, it could not know if it was a function call or not.

The second case is a bit better, but the compiler would still have to keep a lot of state to remember that a function template showing up somewhere could force a recompilation.

Now, the above result is undesirable for both compiler writers and users. At this point it is unlikely that any change to ADL or the template syntax can be made, so what can we do? The best proposed solution I know is adding modules to c++. This would not fix the above example, but at least it would isolate it inside a module.

Unfortunately, modules are not in c++11, but it is very exciting to see that clang is adding support for them :-)


Building firefox as c++11

The firefox configure script checks if the compiler supports the -std=c++0x option and, if so, uses it when building. Doing so means we get a moving target as compilers implement the new features and restrictions (which is one of the reasons for using it: bug 650304).

One interesting change in the new standard is in [dcl.init.list] which states:

If a narrowing conversion (see below) is required to convert any of
the arguments, the program is ill-formed.

and

A narrowing conversion is an implicit conversion ...
from an integer type or unscoped enumeration type to an integer
type that cannot represent all the values of the original type,
except where the source is a constant expression ...

For example, the following code is now invalid:

struct s { int x; };
void f(unsigned y) {
  s z = { y };
}

and clang correctly rejects it:

test.cc:3:11: error: non-constant-expression cannot be narrowed from type
      'unsigned int' to 'int' in initializer list
  s z = { y };
          ^

This found some interesting cases in the firefox code. Some are because of mismatches in system APIs, like XGetGeometry using unsigned and GdkRectangle having gint fields. Others are because of virtual methods that abstract different operating systems and cannot match all of them. Most of the cases were in our code and we were able to just use more precise types, so I think this was a good change.
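For reference, here is a small sketch of the ways such code can be made valid. The struct u and the function names are hypothetical, made up for this example; they are not from the firefox tree.

```cpp
#include <cassert>
#include <climits>

struct s { int x; };        // same shape as in the rejected example
struct u { unsigned x; };   // hypothetical variant with a more precise field type

// Option 1: make the field type match the value, so no conversion happens.
u make_precise(unsigned y) {
  u z = { y };
  return z;
}

// Option 2: keep the int field and make the conversion explicit.
s make_checked(unsigned y) {
  assert(y <= static_cast<unsigned>(INT_MAX));  // document the expected range
  s z = { static_cast<int>(y) };
  return z;
}

// A constant expression whose value fits in the target type is not narrowing.
s make_constant() {
  s z = { 42u };
  return z;
}
```

The first option corresponds to the "more precise types" route. The explicit cast is the fallback when the mismatch comes from an API that cannot be changed.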


Quick update on clang builds

After the last post I noticed that there were still some problems when building with

ac_add_options --with-macos-sdk=/Developer/SDKs/MacOSX10.7.sdk
ac_add_options --enable-macos-target=10.5

This was fixed as bug 675008. It should now be possible to use the latest SDK and still support 10.5.

Other good news is that thunderbird should soon build with 10.7 (bug 675300) and there is work under way to build for Android with clang (bug 674806).


Building firefox on OS X Lion

I recently got a copy of OS X Lion and have been trying to build firefox on it. Fortunately most of the work was already done, so this is just a quick summary of the last issues in case someone else has similar problems with other programs.

Problems with grep

This is bug 655339.

The 64 bit grep shipping with Lion is broken and complains that some of the expressions used in configure scripts are too big. Steven Michaud fixed this by explicitly using the 32 bit binary in configure.

Problems with python

This is bug 659881

Python fails if setuptools is imported when the environment variable MACOSX_DEPLOYMENT_TARGET is set to a value different from what it was when the module was built. Fortunately we don’t need setuptools during a regular build, so Jeff Hammel fixed this bug by wrapping the import in a try..except.

With the above issues fixed, it was possible to build on 10.7 using the 10.6 sdk.

PowerPC support is really gone

This is bug 673789

Breakpad (which is used by the crash reporter) used to include mach/ppc/thread_status.h, but that file is gone from the 10.7 sdk. Adding ifdefs to disable PowerPC support when targeting 10.7 solved this problem.

Problems with TLS and -dead_strip

This is bug 672501

Thread local storage is new in 10.7 and it looks like there are still some issues to be fixed. In this case the linker crashes if both TLS and the -dead_strip option are used at the same time. The adopted fix was to change the configure test for TLS to use -dead_strip too. Once the linker bug is fixed, TLS will be automatically enabled again.

Next Steps

The patches for the above issues should be in mozilla-central soon. I am curious to see if it is possible to build a copy of firefox using the 10.7 sdk and still support 10.5 and 10.6. If not, we will probably have to build two versions for some time.


Misc updates

There have been some big changes since I last compared gcc and llvm builds of Firefox:

  • The Firefox build now uses -O3
  • LLVM has a new register allocator
  • LLVM uses the cfi directives

I also installed xcode 4. I wanted to do the same 3 builds I did last time; unfortunately, the linker in xcode 4 has a bug when handling LTO with files that use computed gotos. Since the javascript interpreter does, I was unable to test LTO in this run :-(
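For readers who have not met computed gotos: they are a GNU extension, supported by both gcc and clang, that lets code take the address of a label with && and jump to it with goto *. Interpreters use them to dispatch from one opcode straight to the next. The toy interpreter below is only a sketch of the pattern; its opcodes and the run function are invented for this example and have nothing to do with the real javascript interpreter.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Invented opcodes for a one-register toy machine.
enum Op : unsigned char { OP_INC, OP_DOUBLE, OP_HALT };

// Runs a program that must end with OP_HALT and returns the accumulator.
int run(const std::vector<Op>& code) {
  assert(!code.empty() && code.back() == OP_HALT);
  // One label address per opcode, in the same order as the enum.
  static void* const dispatch[] = { &&do_inc, &&do_double, &&do_halt };
  int acc = 0;
  std::size_t pc = 0;
  goto *dispatch[code[pc]];   // jump directly to the first handler
do_inc:
  acc += 1;
  goto *dispatch[code[++pc]]; // each handler jumps straight to the next one
do_double:
  acc *= 2;
  goto *dispatch[code[++pc]];
do_halt:
  return acc;
}
```

For example, the program OP_INC, OP_INC, OP_DOUBLE, OP_HALT leaves 4 in the accumulator. Because this is an extension rather than standard c++, it only builds with compilers that implement it.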

Results

Build Time

Time    Clang         Gcc
Real    42m40.797s    62m22.595s
User    97m22.262s    157m8.463s
Sys     12m13.387s    17m37.419s

Build times went up, most of it probably because of -O3. I don’t know what caused the increase of about 4m in the system time (both gcc and clang). It looks unlikely that just the larger .o files could do it.

Sizes

File (bytes)    Clang       Gcc
.dmg            27988529    29610839
64 bit XUL      26009488    28033084
32 bit XUL      22469120    24897116

The sizes are also up, but not dramatically. The clang dmg is 687k larger and the gcc one 991k.

Dromaeo

Both gcc and clang show a marked improvement. Not sure if it’s because of -O3 or because of improvements in the mozilla source code.

Next

Running the firefox tests in a try bot found bugs both in firefox and clang 2.9. Those have been fixed and we should soon have a 3.0 snapshot installed. It might find more bugs to fix, but if not, it should provide many more benchmark runs to compare clang and gcc.
