Axel Hecht Mozilla in Your Language

June 29, 2009

Roundup report from OTT 09

Filed under: L10n, ott09, by Axel Hecht @ 8:42 am

I spent half of last week in Amsterdam at the “Open Translation Tools 2009” unconference. It was an interesting and diverse crowd, and by far not as tool-author-only as the name suggests. People came from all over, ranging from sex-worker activists and Global Voices to translate.org.za and folks from the “professional translation companies”.

We started out with a few opening ceremonies. The first was an introduction, the regular “who are you and where are you from”, along with a “how do you feel”. I was the only one who didn’t give a geolocation, but disclosed Mozilla as my point of origin. It’s apparently a common theme at such events that localization folks define their cultural background (which is what this should be about, right? This is not couch-surfing.) by their geography, and not so much by what they work on. The “how do you feel” was one of the hippie pieces, along with a lot of twinkling. Reading the urbandictionary on twinkle makes me wonder, but it was just hands-not-clapping. Honestly.

Next up was a round of spectrograms in the room. Two opposing opinions were offered, and you had to stand along a line in the room at the point between the two that matched your own opinion. Then Gunner, our head master of ceremonies, went in and poked people on why they were standing where they were. It’s an interesting exercise to figure out what kind of crowd you’re with, and it scales pretty well.

[Picture: Collecting agenda items]

Agenda-building worked pretty much like it does at most unconferences these days: we created tons of sticky notes and then tried to build themes and agendas from them. The resulting sticky notes are transcribed on the wiki. Mighty job by Lena. I feel quite fortunate that I didn’t have to distill those notes into an agenda. Gunner did that pretty loosely, which was probably a good combination. The resulting schedule is on the wiki, too. In the rest of my coverage, I’ll focus on the sessions that I’ve been in. The schedule links to the full notes of each session. The notes were usually in good shape, so do take a look.

[Picture: Ed talking professional]

The session that Ed Zad (both names are shortened, the parents are not to blame for this one) led about professional translation companies and their ecosystem was OK. The actual translation is almost never done in-house, but contracted out to freelancers. The money you pay the companies goes into project management, and they hire translators, reviewers and editors to do the actual work, who then get paid by the company. The interesting part here was really that those companies make their money from the project management and recruiting part. I didn’t get any useful feedback beyond “you must have hired the wrong guys” on my report that any time we had to contract translation out, the results were not really usable. Maybe I was asking the wrong guy :-). The main takeaway would be that the industry is really fragmented and diverse. I doubt there are any good rules for picking a partner when looking for a localization company, either. The process Ed described for reviews and editing seems rather low-key compared to what you can do if you develop your localized content in the open.

[Picture: Fran on MT]

The next session I was in was about machine translation, led by Francis. He introduced the group to both statistical and rule-based machine translation (MT). Interesting about statistical MT are both the enormous amount of data you need and the different stages involved. Rule-based MT, on the other hand, works well for closely related languages. For those who heard me talking about l10n-fork, that’d be rule-based MT. Francis offered to take a look at whether we can actually do better MT to share work among Mozilla localizations, at least for closely related languages. All our Romance languages, with the various flavors of Spanish, Portuguese and French, could benefit from that, possibly even patch-based.

We “closed” Monday in Vondelpark. Matt broke the aspiration techies on Monday night, though. Amsterdam is good at that. I ended up at Petra’s place, together with Tanya and Pawell. Thanks to Petra for a fun evening.

Tuesday started off with a crazy “crawl on the floor and draw as many workflows as you have”. I figured that I don’t know enough about 90% of the workflows we use at Mozilla, and just sketched out two extremes when it comes down to localizing Firefox. One is localizing patch by patch, like for example the French team does. And then we have a long tail of localizations that work within their toolchain and just occasionally export to hg and update the upstream repos. For the cats among you, there are 40 pictures of the workflow diagrams on flickr.

[Picture: translating wikis]

Next was a pretty interesting session on translating wiki content. That made it a good fit for me to kill the session on “Localizing a hybrid organization – BOF on Mozilla”. I wanted to clone myself three or four times already, so not having to do a session myself was a win. Anyway. We had tikiwiki and mediawiki represented in the group. And me with some experience on those two, plus deki via MDC. The discussion turned up two fundamentally different ways of working:

  • Forking documents into different documents for different languages, with some cross-referencing that localizations exist. You’d know this from wikipedia.
  • Maintaining the different translations as variants of a single document. This is what tikiwiki does, as we see it on SUMO.

The discussion around one single living document in multiple languages was more lively, which gave me a good sense of what’s out there to address our needs at SUMO/MDC etc. There doesn’t seem to be anything blowing tikiwiki out of the water, so in terms of finding a wiki engine with l10n, SUMO made a good choice. We talked quite a bit about multiple edits to the document in various languages, and what tiki defines to be 100% in the end. I showed off the l10n dashboard page we have on SUMO now, which was well received. The idea to not demand that people do as a bot tells them, and instead to empower them with relevant information, seemed to resonate well. There was a different session about CMSes and l10n, read drupal etc. I only overheard the last bits, and they didn’t seem to have great answers over there. Judge for yourself from the notes. Finding the right UX and UI paradigms for keeping a living document in sync across multiple languages seems to be an open item of work, in particular if your document isn’t bound to get value contributed in one single source language. We would want to understand which changes are ports of fixes from other languages, and which are new fixes to the actual document that other translations, including the original source language, would benefit from.

Next up was a round of speed-geeking. That’s similar to speed dating. A few geeks get a table each to present something to the rest of the group. The rest of the group is split up to watch one at a time. Each presentation is 4 minutes, then the groups rotate to the next table. If you’re bored by something presented, you just wasted 4 minutes. I took the challenge to present l20n 8 times in a row. That’s a pretty technical topic and a pretty diverse audience, so apart from being a stress test on one’s vocal cords, it’s also pretty heavy on your brain. I must have been doing all right, though. The feedback ranged from interested to positive. I got out with an action item to work with Dwayne on how we could actually present localization choices so that they’re options to fix and not just hell-bound confusion. On a general note, if you ever find yourself speed geeking: Don’t sit in front of your computer. Don’t make people walk around the table to see something. It’s perfectly fine to sit next to your computer and have your laptop and yourself face your audience. Or do it like Dwayne did, just present without your damn laptop open :-). If feasible.

The last session on Tuesday was about building Volunteer Translation Communities. We had a few people there who are just starting to build such a community, but also a few people from Global Voices Online and yours truly from Mozilla. It’s pretty interesting how easy it is to think “I need to get such and such into some other language, how do I ask for volunteers?” and how easily that fails. The common ground of those with living communities was that you don’t ask for translators, but you need to be open for contributors. At Mozilla, we’re hackable. We offer opportunities for all kinds of volunteer contributions, among which localization is one. That is something different from asking for some unit of work to be done for no pay. Another key is that you find your volunteers among those who are interested in the outcome of the localization work. The project management work you need to do to empower your translation community to actually do some work and get to results shouldn’t be underestimated, either. There’s a reason why people make a living out of this one.

I moved from Tuesday to Wednesday through the Waterhole. As good as it used to be. Getting up in time was tough, but not as bad as it initially felt.

[Picture: Dwayne chats about Africa]

The first session I joined on Wednesday was on localization issues in Africa. We had similar sessions for Central Asia, South Asia, and Asia Pacific, which I didn’t manage to get to. I haven’t even read the notes from those yet. Anyway, back to Africa. The challenges there aren’t all that surprising. Connectivity is really bad, cell phones are really big. During the OTT, though, the first cable made it to Kenya, so in terms of connectivity, things are changing. Fonts in Africa are mostly based on Latin script, so there’s not too much to do there, though a few characters usually need fixing. At least for web content, downloadable fonts offer a smooth upgrade path. In terms of technical abilities, a lot of the techies for African languages end up in Europe or the US and only occasionally visit home. For actual translators, there isn’t enough work to make a living from it, so you likely end up with part-time night shifters. For many people with access to computers and the internet, localization is a good thing, but not something on their own list of priorities, which leaves us with a rather small potential community there. Localizing really obvious things like cell phones or Firefox is a good way to start off a community, though. I’ve had some off-track discussions with Dwayne on how to work together with the ANLoc project he’s running, too.

The discussion about open corpora to be used for linguistic research and statistical machine translation training was OK, but not of that much interest for Mozilla. It’s a good thing to do, and if we can help in asking the right people, that’d be cool, though. There’s tons of politics to resolve first though, and they got enough folks for the initial group.

The next round of speed geeking had me on the consumer side. I already mentioned that you shouldn’t sit in front of the laptop that you use for presenting. John talked about Transifex, which is designed to be a system to bridge various version control systems for localizers, by having write access itself to the upstream repos. They start to offer an interface to actually translate a few strings in place, which they reuse from somewhere. It’s not pootle code, though. That was the one with most immediate touch point to what we do.

The last session for me was one driven by Dwayne again, closing the loop. We tried to find out how to get feedback from the localizers into tools, and into the software they localize. This was pretty interesting, thanks to the input from Rohana and Gisela, who are actually localizers and could give us hints on what they do and how. The main takeaway was that localizers and l10n tool authors don’t talk enough to each other. Gisela, Dwayne and I have a follow-up conference in our heads to actually do that, I’ll talk about that in a different post. The other main point was that we need to get tools to support “l10n briefs” and annotations, and need to establish ways for that information to be exchanged. A localization brief might be something like a file-wide localization note that explains what the context for these strings is. Or that it’s about XSLT error messages, which you should leave in English unless you have a thriving local community in your language on that technology. Annotations are more diverse, and serve to communicate both among localization teams and back to the original author. The idea is to create a system that allows localizers to communicate about a particular string or set of strings in an easier fashion than using hg blame to find the bug, and then having to read through all of the bug to find out how to reproduce a problem. We might want to have annotations as simple as “star a string”. Once it’s known that a string is tricky, someone else can go in and offer help or a more constructive annotation beyond “I didn’t get it”. How to communicate that back and forth is another follow-up project from this session.
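To make the brief and annotation idea a bit more tangible, here is a minimal sketch of what such data could look like. Nothing like this was specified in the session; every class and field name below is hypothetical, and a “star” is simply modeled as an annotation without text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Annotation:
    """A note on one string. All names here are hypothetical."""
    author: str
    text: str = ""      # empty text means the string was only starred
    locale: str = ""    # locale of the annotator, "" for the original author


@dataclass
class FileBrief:
    """A file-wide localization note giving context for all strings in a file."""
    path: str
    brief: str


@dataclass
class StringEntry:
    path: str           # file the string lives in
    key: str            # entity name
    annotations: List[Annotation] = field(default_factory=list)

    def star(self, author, locale=""):
        # "Star a string": the cheapest possible way to say "this one is tricky".
        self.annotations.append(Annotation(author=author, locale=locale))

    def stars(self):
        # Count annotations that carry no text, i.e. plain stars.
        return sum(1 for a in self.annotations if not a.text)


brief = FileBrief("xslt.properties", "XSLT error messages, leave in English if unsure")
entry = StringEntry("xslt.properties", "error.example")
entry.star("localizer-a", locale="de")
entry.star("localizer-b", locale="fr")
entry.annotations.append(Annotation("author", "Shown when a stylesheet fails to load"))
```

A string with several stars and no explanatory annotation would then be an obvious candidate for someone to step in with context.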

Adam Hyde ran a book sprint on open translation tools alongside the sessions, with a real face-to-face book sprinting event that closes today. It’s going to be interesting to see what comes out of that. As I suck at writing (you can tell by reading this post), I didn’t participate in that one myself. There is already a version on the net at flossmanuals.net.

So much for the actual sessions. As always, floor communication was essential, too. I made contact with folks from the Tajik, Khmer, and Nepali localization efforts for Firefox, and there’s already traction on some. If you know someone willing to help with Nepali, please make them introduce themselves in m.d.l10n. I have met a ton of other interesting people, of course. I had some really great conversations with Dwayne on a bunch of different topics, ranging from technical bits in tools to mission statements. Generally, there was a lot of interest in Mozilla, and how we do things. Thanks to Aspiration for inviting me, and thanks to all the people at OTT for the warm welcome to this new community for us.

Last but not least, thanks to Mozilla. In environments like OTT it becomes really obvious how rare organisations like Mozilla are. We had a lot of discussion on how hard it is to do localization as an afterthought, and we just don’t. And on how valuable it is for the localization community to get acknowledged. That happens throughout Mozilla, pretty much independent of whether it’s John and Mitchell most anywhere they talk, or our developers fixing their patches to have a prettier localization note, or our marketing folks empowering our local communities to localize the message. And we’re still learning and eager to get better. It is an honor to represent such an organization.

Pictures in this post are by Lena under CC by-nc-nd.

June 12, 2009

150/2=73

Filed under: L10n, Mozilla, by Axel Hecht @ 9:51 am

Our brave build folks have cut the tags on Firefox 3.5 RC 1, and I figured I’d give a little feedback on that from the l10n side.

As RC 1 was based on new strings, we required each localization to sign off on its status as ready for release. We’re still doing this by opening what we call an “opt-in thread”, a message sent to .l10n after the last l10n-impact landing, to which localizers reply with a link to the revision of their localization that is good to go. Part of that communication is the message about when code freeze is planned, and the message that plans don’t always work out. So we’re actually keeping the opt-in thread open up to the point where we really kick off the builds.

The output of that process is two files which control our release automation, shipped-locales and l10n-changesets. For the curious: the first tracks which locales ship on which platforms and is part of the code repo; the second tracks which locales ship at which hg revision and lives in the buildbot-configs repo.
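As a rough illustration of how the two files relate, here is a small sketch that reads both and reports shipped locales without a signed-off revision. The post doesn’t show the file formats, so the sketch assumes whitespace-separated lines: a locale followed by optional platform names in shipped-locales, and a locale followed by a revision in l10n-changesets. The function names and the sample revisions are made up for the example.

```python
def parse_shipped_locales(text):
    """Map locale -> list of platforms (assumed format: 'locale [platform ...]')."""
    result = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts or parts[0].startswith("#"):
            continue
        # An empty platform list is taken to mean "ships everywhere".
        result[parts[0]] = parts[1:]
    return result


def parse_l10n_changesets(text):
    """Map locale -> revision (assumed format: 'locale revision')."""
    result = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue
        result[parts[0]] = parts[1]
    return result


def missing_changesets(shipped, changesets, source_locale="en-US"):
    """Locales that ship but have no signed-off revision yet."""
    return sorted(
        loc for loc in shipped
        if loc != source_locale and loc not in changesets
    )


# Made-up sample data, not taken from the real release files.
SHIPPED = "en-US\nde\nja linux win32\nja-JP-mac osx\n"
CHANGESETS = "de 0123456789ab\nja ba9876543210\n"

shipped = parse_shipped_locales(SHIPPED)
changesets = parse_l10n_changesets(CHANGESETS)
todo = missing_changesets(shipped, changesets)
```

With the sample data above, `todo` contains only `ja-JP-mac`, the one locale that ships but hasn’t opted in.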

The whole process led to 150 different opt-in sourcestamps, which came in either as public replies in the newsgroup or as private mails in my inbox (or both). I pick those up, click on some buttons on a version of the l10n dashboard running on my home machine, review the changes against previous sign-offs (yes, I do have a web interface that does comparison between two revisions in hg), and accept or reject the sign-off. If I reject, I follow up in .l10n with why I did that. That adds up to 159 posts in that thread, by 72 authors. Depending on how imminent the release is, or seems to be, I “back up” my local data by attaching files to the tracking bug. This led to one version of shipped-locales, and a whopping 16 versions of l10n-changesets. Or, in short …

<bhearsum|afk> Pike: when should i expect an updated l10n-changestets?
<Pike> bhearsum: …
<Pike>
<Pike>
<Pike> now
<bhearsum> heh
<bhearsum> cool!

What’s really cool here is that we’re actually at a point where we pick up improvements to our localization up to the last minute, with tools that make us feel comfortable about that, and with a release environment that is able to digest all that noise and produce builds for 73 localizations in a matter of a few hours.

73

June 10, 2009

Running into builds, just testing

Filed under: L10n, Mozilla, by Axel Hecht @ 10:29 am

I’ve blogged previously on how to set up a staging environment to test the l10n build system, but I didn’t go into any detail on how to actually do builds in that setup. That shall be fixed.

I’m picking you up at the point where

twistd get-pushes -n -t 1 -s l10n_site.settings

is running stable. You probably want to tail -f twistd.log to follow what it’s doing. This piece is going to watch the hg repositories in stage/repos and feed all the pushes to those into the db. Next is to make sure that you actually get builds.

The first thing you need to do is to configure the l10n-master to access the same database as the django-site. Just make sure that the DATABASE_* settings in l10n-master/settings.py and django-site/l10n_site/settings.py match. The other setting to sync is REPOSITORY_BASE, which needs to match in both settings.py files. I suggest setting that to an empty dir next to the l10n-master. I wouldn’t use the stage/repos dir, mostly because I didn’t test that. Now you set up the master for buildbot, by running

buildbot create-master .

inside the l10n-master clone. The next thing is to create a slave dir, which is best put next to the l10n-master. Despite what buildbot usually likes, this slave should be on the same machine that the rest is running on.

mkdir slave
buildbot create-slave slave localhost:9021 cs pwd

So much for the general setup. With the twistd daemon running to get the pushes, you can now switch on the master and the slave, by doing a buildbot start . in both dirs. I suggest that you keep a tail -f twistd.log running on the master. If you decide to set things up to track the upstream repositories, I start the master, then the slave, and if both are up fine, I start the twistd daemon for get-pushes.
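Since mismatched settings between the master and the django site are an easy mistake to make, a quick check before starting anything might look like the following. This is only a sketch: plain dicts stand in for the two settings modules (for real modules you could pass `vars(module)`), the helper name is made up, and the sample values are placeholders.

```python
def settings_mismatches(master, site, extra=("REPOSITORY_BASE",)):
    """Return the names of settings that differ between the two mappings.

    Compares every DATABASE_* name found in either mapping, plus the
    names listed in `extra`.
    """
    names = set(extra)
    for settings in (master, site):
        names.update(k for k in settings if k.startswith("DATABASE_"))
    return sorted(n for n in names if master.get(n) != site.get(n))


# Placeholder values standing in for l10n-master/settings.py
# and django-site/l10n_site/settings.py.
master_settings = {
    "DATABASE_ENGINE": "sqlite3",
    "DATABASE_NAME": "/home/me/stage/l10n.db",
    "REPOSITORY_BASE": "/home/me/stage/master-clones",
}
site_settings = {
    "DATABASE_ENGINE": "sqlite3",
    "DATABASE_NAME": "/home/me/stage/l10n.db",
    "REPOSITORY_BASE": "/home/me/stage/other-clones",
}

problems = settings_mismatches(master_settings, site_settings)
```

Here `problems` would list `REPOSITORY_BASE`, which is exactly the kind of drift that leaves the master and the dashboard looking at different clones.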

Now let’s trigger some builds:

Open stage/workdir/mozilla/browser/locales/en-US/file.properties in your favorite editor, and make a modification. I suggest just doing a whitespace edit, or changing the value of the existing entity, as that is going to keep your localizations green. Check the change in to the working clone, and then push. The get-pushes output should show that it got a new push, and then on the next cycle, feed the push into the database. You’ll notice by the output of an hg pull showing up in the log. On the next poll on the l10n-master, builds for all 4 locales should trigger. You should see an update of 4 builds on the waterfall, and 4 locales on the test tree on the local dashboard.

June 5, 2009

got l10n builds

Filed under: L10n, Mozilla, by Axel Hecht @ 3:22 am

’nuff said? Not even remotely.

We’ve had l10n builds for as long as I’ve been working on l10n. Actually, I got involved around the time when we started to do them upstream. They always were considerably better than each localizer doing their build at home on whatever (virus-infected) hardware they found, with help from other community members for the platforms they didn’t have. But in the light of day, it was more

That? Yeah, I know. That’s crap.

And I know you can hear my voice in your head right now :-).

Those days are gone. For a few days now, we’ve been running Firefox and Fennec builds on the releng infrastructure that are actually sound builds made to serve our l10n community. Some highlights:

  • Builds are finished some 10 minutes after a localizer landing, on all three platforms.
  • There’s no deadlock between different locales, thanks to all l10n builds running on a pool of slaves.
  • Builds are “l10n-merged”, against the actual build that’s repackaged. Independent of missing strings or files, you have a build that can be tested.
  • No more race conditions between nightly and trunk source status.
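The “l10n-merged” point is the one that makes incomplete localizations testable. The idea can be sketched roughly as follows; this is not the actual l10n-merge code, just an illustration on plain dicts of entity name to string, with a made-up function name. Entities that only exist in the localization are ignored here, which is a simplification of this sketch.

```python
def l10n_merge(en_us, localized):
    """Fill gaps in a localization with the en-US strings.

    Returns the merged mapping plus the list of entities that had to
    fall back to en-US.
    """
    merged = {}
    missing = []
    for key, value in en_us.items():
        if key in localized:
            merged[key] = localized[key]
        else:
            # Fall back to the English string so the build still works.
            merged[key] = value
            missing.append(key)
    return merged, missing


en_us = {"open.label": "Open", "close.label": "Close"}
de = {"open.label": "Öffnen"}

merged, missing = l10n_merge(en_us, de)
```

The merged result is complete enough to package and test, while `missing` still tells the localizer what is left to translate.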

The impact of this shouldn’t be under-estimated. We are, for the first time in years, producing builds that allow a localizer to actually immediately test. Localizers can work incrementally, translate one feature, check-in, test. No worries if something landed in en-US in the meantime, or whatnot. With the new builds, I have seen various localizers coming from hundreds of missing strings to a tested build on two or three platforms in a matter of a few hours. Back in the days, that was the waiting time for the first build. The new locales all pull all-nighters to get their final bits in. They want to, and now they actually can.

I want to thank coop and armenzg for their great help in making this happen, and aki for porting it over to fennec. Of course thanks go to joduinn and sethb, too, for bearing with the ongoing meetings we have, trying to battle the crap down. To dynamis for the initial work on l10n-merge. Also thanks to bsmedberg and Chase for the initial work on both automation and build process, and ted for the various reviews on making our build system catch up.

Finally, we’re not going to stop here. Armen is working on creating the necessary files to get l10n builds on a nightly update channel. Yep, you heard right, that’s where we are right now. I know that KaiRo is working on getting the goodness over to the comm-central apps. And yours truly is hacking on the dashboard together with gandalf, more on that in a different post.

June 2, 2009

Searching l10n

Filed under: L10n, Mozilla, by Axel Hecht @ 7:20 am

I’m contemplating adding search in l10n to the dashboard, and I figured I’d put my thoughts out for lazyweb super-review.

Things we might want to search for:

  • Localized strings
    • in a particular locale
    • in all locales
    • going into a particular app
  • entity names
    • in all of the above

As with the rest of the dashboard, I’d favour a pythonic solution. I’ve run across Whoosh, which seems to offer me what I’d need. In particular I can mark up searches in just keys or just values of our localized strings with the Schemas it offers.

All sounds pretty neat and contained, I’m just wondering if there’s something cool and shiny elsewhere that I’m missing, or if someone came back with “ugh, sucks” from trying Whoosh.

Ad-hoc design for the curious:

For each changeset, we’d parse the old and the new version of the file, getting a list of keys and values, and I’d create two searchable TEXT entries for all changed keys, and added entries. We’d tag that “document” with path, locale, apps, revision, branch. That way, you could search even for strings that aren’t currently in the tip, and get a versioned link to where it showed up first, and last, possibly. Given that we have a lot of data and history, I wouldn’t be surprised if that corpus would get large pretty quickly. I’d expect to not only index l10n but en-US, too. Thoughts?
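The per-changeset step of that design could be sketched like this, independent of which search library ends up indexing the result. The function name and the dict layout are assumptions for illustration; parsing the files into key/value dicts is assumed to have happened already, and the sample revision is made up.

```python
def documents_for_changeset(old, new, path, locale, apps, revision, branch):
    """Build one searchable "document" per added or changed key.

    `old` and `new` map entity names to strings for one file, before
    and after the changeset.
    """
    docs = []
    for key in sorted(new):
        if key in old and old[key] == new[key]:
            continue  # unchanged entries are not indexed again
        docs.append({
            "key": key,          # searchable text: entity name
            "value": new[key],   # searchable text: localized string
            "path": path,
            "locale": locale,
            "apps": list(apps),
            "revision": revision,
            "branch": branch,
        })
    return docs


old = {"open.label": "Open", "close.label": "Close"}
new = {"open.label": "Open…", "close.label": "Close", "save.label": "Save"}

docs = documents_for_changeset(
    old, new,
    path="browser/chrome/browser.properties",
    locale="en-US", apps=["browser"],
    revision="0123456789ab", branch="default",
)
```

Each resulting dict would map onto one indexed document, with key and value as the two text fields and the rest as tags for filtering.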

June 1, 2009

L10n ecosystem in a fishbowl

Filed under: L10n, Mozilla, by Axel Hecht @ 7:02 am

Building the infrastructure for our l10n builds is hard, mostly because it consists of a ton of things that you don’t have control over. We’re building 3 and a half applications: Firefox, Thunderbird, Fennec, and Sunbird for calendar. Firefox is built on three versions, one of which is still coming from CVS. Thunderbird is one version on CVS, one on hg. We’re touching some 170 hg repos, and a single check-in can trigger anything from no build at all to one build to almost 200 builds. Rinse, repeat, yeah, 200 builds for a single landing. Worse than that, you don’t have any control over who’s landing what when where. Bottom line, you really can’t test a change in the l10n build automation reliably in our production setup.

You can create a fake ecosystem, though, and I’ll explain a bit how that works. Of course it doesn’t end up being trivial, but it’s contained. It’s not trying to cover the CVS branches, as those would require a setup of bonsai, which I chicken out of. Take this post with a grain of salt, I assume there are some errors here as most of it is typed from memory.

As with any recipe, here is the list of ingredients:

  • A set of hg repositories, both for a fake application and some fake l10n repos.
  • An hg server serving those repositories (make that port 8001).
  • Some buildbot infrastructure working on top of these repositories that you’re trying to test.
  • Possibly an instance of the l10n dashboard presenting both the build and the l10n data.

The initial chunk is creating the repositories. I created a helper script create-stage that does that, which is part of buildbotcustom/bin/l10n-stage. Its main purpose is to get the templates, hooks etc. that are part of our server-side setup on hg.mozilla.org, create some upstream repos for en-US and l10n, and push some initial content from a set of working clones. You call it like

python l10n-stage -l stagedir

The -l keeps the l10n repositories from pushing their initial content, which yields a scenario that is closer to what we have upstream, i.e., a flock of en-US pushes before the l10n repos start. This command creates a bunch of main repositories in the repos subdir of stagedir, and a bunch of working clones in workdir. It also creates a webdir.conf, that you’ll use to run the local hg http server. Let’s run that now, in stagedir:

hg serve -p 8001 --webdir-conf=webdir.conf -t repos/hg_templates

Now you have a local setup of an application repository called mozilla, and 4 localizations in l10n: ab, de, ja-JP-mac, and x-testing. They’re all equipped with the same hooks that we run on hg.m.o; in particular, they support pushlog.

Now on to the buildbot infrastructure. There’s a sibling script to create-stage, create-buildbot, which should create a master setup that is rather close to what we run on releng. It supports various degrees of parallelism for multiple slaves on three platforms, does only dummy builds, though. IIRC. I want to go into more detail on how to set up the new dashboard master, though.

The dashboard master is merely running compare-locales on the actual source repositories. It does come with our bonsai replacement, pushes, though. That’s basically a pushlog spanning repositories, including file and branch indexing. Here are the basic software components you’ll need:

  • django 1.0.x (1.1pre might work, too)
  • buildbot 0.7.10p1, older versions won’t work

and from hg.m.o, you’ll need compare-locales, locale-inspector, l10n-master, buildbotcustom and django-site.

Firstly, you set up the db. sqlite and mysql should both work, mysql is actively tested. Edit the various settings.py files to reference your db, with an absolute path if sqlite, and create the schema. The main entry point to the django site is l10n_site, go in there to run python manage.py syncdb. Another edit you want to do is to point REPOSITORY_BASE to a dir where you can stage another set of clones of your repos. I suggest not sharing the hg master repo dir here.

Next, create a buildbot master and a slave. You do that by running the buildbot create-master command on your local clone of l10n-master. You’ll need to adapt l10nbuilds.ini to the test setup:

[test]
app = browser
type = hg
locales = all
mozilla = mozilla
l10n = l10n
repo = http://localhost:8001/
l10n.ini = browser/locales/l10n.ini
builders = compare

I should put that into the repo somewhere, I guess.
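If you want to sanity-check that section before starting the master, reading it back is straightforward. The sketch below uses Python 3’s `configparser` (on the Python 2 of the time the module was called `ConfigParser`); the helper name is made up, and how l10n-master itself reads the file is not shown in this post.

```python
import configparser

# The [test] section from above, inlined so the sketch is self-contained.
L10NBUILDS_INI = """\
[test]
app = browser
type = hg
locales = all
mozilla = mozilla
l10n = l10n
repo = http://localhost:8001/
l10n.ini = browser/locales/l10n.ini
builders = compare
"""


def load_tree(text, tree):
    """Return the options of one tree section as a plain dict."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return dict(parser[tree])


test_tree = load_tree(L10NBUILDS_INI, "test")
# The repo URL has to point at the local hg server started on port 8001.
serves_locally = test_tree["repo"].startswith("http://localhost:8001")
```

A typo in the port or in the l10n.ini path is the kind of thing that otherwise only shows up as builds silently not triggering.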

Setting up the slave is trivial, you need to make sure it’s on the same machine, though. It will run on the django clones directly.

Before starting master and slave, make sure that all the deps are in your PYTHONPATH.

Last but not least, you need to get all the pushes from your repo setup into your django setup. First, you need to tell the db which repositories it’s supposed to get from where. I created sample data for the test app, which you can load by

python manage.py loaddata localhost

The repositories I use for the production environment are in hg_mozilla_org, fwiw. You fill the database with the actual push data by running a twisted daemon. Inside django-site, run

twistd get-pushes -n -t 1 -s l10n_site.settings

That will ping one repo per second, not update, with data from l10n_site.settings. Now you have everything set up, and you can start to edit en-US and l10n files in your workingdir, and push, and see how that changes your builds and dashboard.

The buildbot waterfall will be on port 8364, and with python manage.py runserver, the dashboard will show up on port 8000. None of this is set up to be on a production server at this point, but it’s good for testing.

Update: Forgot to mention that you need to bootstrap the repo lists.
