Axel Hecht Mozilla in Your Language

October 30, 2007

Localizations for Firefox 3 Beta 1

Filed under: L10n — Axel Hecht @ 4:55 pm

I posted to mozilla.dev.l10n about how we’re going to ship localizations for Firefox 3 Beta 1, aka M9. In a very executive summary,

  • opt in — to do so, follow up on the post on m.d.l10n
  • go green
  • don’t mess up search, etc.
  • help testing

Find the full posting on news.m.o or google groups. Questions, opt-ins and follow-ups should all go there, thus I’m closing comments and pings right away.

Firefox 2 glossary

Filed under: L10n — Axel Hecht @ 3:01 am

As I have blogged before, I’ve been trying to create a glossary for Firefox 2. I’ll leave out the gory details, but it’s been a hard fight between me trying to be clever and just being dumb. I’m still not happy with the code that creates the data, but I think this iteration generates output that looks good enough to share, and to get busted by others.

With all the grumpiness about my code, I’m pretty happy that I didn’t have to do the web part, so thanks to shaver for poking plasticmillion about exhibit. That went pretty slickly, right up to a site of my standards. Read: with full visual suckage. I did find a bug in exhibit, though.

The dataset I have now went through a series of iterations to do the right thing when finding phrases, so by now it should be a nice set of educated guesses. I’m probably wrong about some, so if you find a string that shouldn’t be there, or one that should be and isn’t, that’s likely a bug.

OK, the beef. Here’s the link. It should have all phrases that appear in Firefox more than once, sortable by length and occurrence, but not sequences that are just filler words; I have a short blacklist for those. You can click on each phrase, and it will open an mxr search for it on the MOZILLA_1_8_BRANCH across all localizable files. I didn’t file the mxr bugs, nor have I tried to work around them yet; searching for ‘foo’ with the ticks included doesn’t work, at least. And of course you can search the glossary, it’d suck otherwise, right? Thanks to exhibit, that was easy.
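For illustration, the filler-word blacklist could work roughly like this (the word list and the `keep_phrase` helper are hypothetical, not taken from the actual code):

```python
# Hypothetical blacklist of filler words; the real one isn't published here.
FILLER = {"the", "a", "an", "of", "to", "and", "in", "is"}

def keep_phrase(words):
    """Keep a phrase unless it consists entirely of filler words."""
    return any(w.lower() not in FILLER for w in words)

keep_phrase(["of", "the"])          # False: pure filler, dropped
keep_phrase(["of", "the", "page"])  # True: carries a content word
```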

Before going further into this, is this something worthwhile for the l10n community? Other RFEs? Right now, the source data is in sqlite, which I intend to share if anyone’s interested, though the database schema and the way I create it need work.

And before you ask: yes, I tried to run it on Thunderbird, too, but that seemed to make the story harder and to dominate the results. I guess it’d be better to just create two separate apps. I’m having perf problems with the code, too, so I wasn’t keen to do more than initially necessary. I don’t index security/manager, because the localizable files in there are just yucky, and I didn’t want to special-case stuff like # being replaced by a newline.

October 17, 2007

Localization is hard, the math way

Filed under: L10n — Axel Hecht @ 2:48 pm

I thought I’d share some statistics about our localizable strings. I’m currently trying to create a tool that helps localizers find terms or phrases that show up often in Firefox. Roughly, an automated way to do glossaries. So the basic idea is to go through our localizable files and search for phrases that come up elsewhere. Example:

I love to localize.

I don’t really love to blog.

should factor out “love to”. So, how does one do that? First, you get all localizable strings. Or not quite all: just the ones in properties files and DTDs. And you neglect those that are just single characters, since those are likely accesskeys and commandkeys, not much to see there. Some of those entries contain multiple sentences, so split those up. Then split everything into words. Now you have a sequence of sequences of words, and you “just” have to find the unique subsequences of those sequences.
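The pipeline above can be sketched in a few lines of Python (a toy sketch with made-up function names, not the actual tool; the sentence splitting here is deliberately naive):

```python
import re
from collections import Counter

def sentences(strings):
    """Split each localizable string into sentences (naive split on . ! ?)."""
    for s in strings:
        for sent in re.split(r"[.!?]+", s):
            words = sent.split()
            if len(words) > 1:  # skip bare access/command keys
                yield words

def subsequences(words, min_len=2):
    """All contiguous word sub-sequences (n-grams) of one sentence."""
    for n in range(min_len, len(words) + 1):
        for i in range(len(words) - n + 1):
            yield tuple(words[i:i + n])

strings = ["I love to localize.", "I don't really love to blog."]
counts = Counter(t for ws in sentences(strings) for t in subsequences(ws))
shared = [t for t, c in counts.items() if c > 1]
# shared contains ("love", "to"), the phrase both strings have in common
```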

Let’s do the math. Here’s the number of sentences per length of sentence for Firefox 2, with a logarithmic y-axis.

[Chart: Firefox 2 Sentences]

So the maximum length of a sentence is 71 words, and we have one of those. At the other end of the diagram are 1791 localizable strings that are just one word, 858 of them unique.

Now, we talked about subsequences; how many of those do we have, then? I have a chart for that, too. I dropped the long tail and didn’t use a logarithmic scale, as that would hide some of the more interesting artifacts among the shorter strings.

[Chart: List of tuples in Firefox 2 sentences]

So we start to see repeating strings at a length of 15 words. They remain few down to a length of 4 or 5 words, with a majority of dupes only for single words; for 2-tuples, it’s almost a tie. Yet there are 20k words in our localizable files that need to be compared, and the same goes for all the subsequences. Doing my ad-hoc math, which I assume isn’t totally clueless algorithm-wise, that amounts to half a million tuple compares to find the list of shared n-tuples in our localizable strings, sorted by frequency.
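As a back-of-the-envelope check on that figure (only the 20k word count comes from the post; the average sentence length and the sort-based duplicate detection are assumptions):

```python
import math

WORDS = 20_000   # words in the localizable files, per the post
AVG_LEN = 5      # assumed average sentence length in words (a guess)

num_sentences = WORDS // AVG_LEN
# A sentence of L words has L*(L+1)//2 contiguous subsequences.
subseqs = num_sentences * (AVG_LEN * (AVG_LEN + 1) // 2)
# Finding duplicates by sorting all subsequences costs about n*log2(n) compares.
compares = subseqs * math.log2(subseqs)
# With these assumptions: 60,000 subsequences, on the order of 10^6 compares,
# the same ballpark as the half-million figure in the post.
```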

Which is my lengthy math-way of saying: Making a consistent localization is hard.
