Axel Hecht Mozilla in Your Language

October 30, 2007

Firefox 2 glossary

Filed under: L10n — Axel Hecht @ 3:01 am

As I’ve blogged before, I’m trying to create a glossary for Firefox 2. I’m leaving out all the gory details, but it’s been a hard fight between me trying to be clever and being dumb. I’m still not happy with the code that creates the data, but I think this iteration generates output that looks good enough to share and to get busted by others.

With all the grumpiness about my code, I’m pretty happy that I didn’t have to do the web part, so thanks to shaver for poking plasticmillion about exhibit. That went pretty smoothly, up to a site that meets my standards. Read: with full visual suckage. I did find a bug in exhibit, though.

The dataset I have now went through a series of iterations to do the right thing to find phrases, which, by now, should be a nice set of educated guesses. I’m probably wrong with some, so if you find a string that shouldn’t be there, or should be and isn’t, that’s likely a bug.

Ok, beef. Here’s the link. It should have all phrases that appear in Firefox more than once, sortable by length and occurrence, but not sequences that are just filler words. I have a short blacklist for the latter. You can click on each phrase, and it will open an mxr search for it on the MOZILLA_1_8_BRANCH in all localizable files. I didn’t file the mxr bugs, nor have I tried to work around them yet; searching for ‘foo’ including ticks doesn’t work, at least. And of course, you can search the glossary; it’d suck otherwise, right? Thanks to exhibit, that was easy.

Before going further into this, is this something worthwhile for the l10n community? Other RFEs? Right now, the source data is in sqlite, which I intend to share if anyone’s interested, though the database schema and the way I create it need work.

And before you ask, yes, I tried to run it on Thunderbird, too, but it seemed to make the story harder and dominate the results. I guess it’d be better to just create two separate apps. I’m having perf problems with the code, too, so I wasn’t too keen to do more than initially necessary. I don’t index security/manager, because the localizable files in there are just yucky, and I didn’t want to special case for stuff like # being replaced by a newline.

October 17, 2007

Localization is hard, the math way

Filed under: L10n — Axel Hecht @ 2:48 pm

I thought I’d share some statistics about our localizable strings. I’m currently trying to create a tool that would help localizers find terms or phrases that show up in Firefox often. Roughly, an automated way to do glossaries. So, the basic idea is to go through our localizable files, and search for phrases that come up elsewhere. Example:

I love to localize.

I don’t really love to blog.

should factor out “love to”. So, how does one do that? First, you get all localizable strings. Or not all, just the ones in properties files and DTDs. And neglect those that are just single chars; those are likely accesskeys and commandkeys, not much to see there. Some of those entries have multiple sentences in them, so split those. Then split each sentence into words. Now you have a sequence of sequences of words. So you “just” have to find the subsequences that those sequences share.
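The steps above can be sketched roughly like this in Python. This is not the code behind the glossary, just a minimal illustration: the function names, the sentence-splitting regex, and the "one character or less means accesskey" filter are all assumptions made for the example, and it counts every contiguous word subsequence by brute force.

```python
import re
from collections import Counter


def split_sentences(entry):
    """Split one localizable string on '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", entry.strip()) if s]


def split_words(sentence):
    """Lowercase a sentence and return its words, keeping apostrophes."""
    return re.findall(r"[\w'’]+", sentence.lower())


def shared_phrases(entries, min_count=2):
    """Count contiguous word subsequences, keep those seen min_count times or more.

    Hypothetical helper: a phrase repeated inside one sentence is counted
    once per occurrence, which may or may not be what a real glossary wants.
    """
    counts = Counter()
    for entry in entries:
        # Assumption: single-character entries are accesskeys/commandkeys.
        if len(entry.strip()) <= 1:
            continue
        for sentence in split_sentences(entry):
            words = split_words(sentence)
            for start in range(len(words)):
                for end in range(start + 1, len(words) + 1):
                    counts[tuple(words[start:end])] += 1
    return {phrase: n for phrase, n in counts.items() if n >= min_count}


phrases = shared_phrases(["I love to localize.", "I don't really love to blog."])
longest = max(phrases, key=len)  # the longest shared word tuple
```

For the two example sentences, the longest shared tuple is the "love to" pair; the shared single words "i", "love" and "to" show up as well, which is where a blacklist of filler words would come in.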

Let’s do the math. Here’s the number of sentences per length of sentence for Firefox 2, with a logarithmic y-axis.

Firefox 2 Sentences

So, the maximum length of a sentence is 71 words, and we have one of those. At the other end of the diagram are 1791 localizable strings that are just one word, 858 of which are unique.

Now, we talked about subsequences, so how many of those do we have? I have a chart for that, too. I dropped the long tail, and I don’t use a logarithmic scale, as that hides some of the more interesting artifacts for the shorter strings.

List of tuples in Firefox 2 sentences

So we’re starting to see repeating strings at a length of 15 words. They stay few down to a length of 4 or 5, with a majority of dupes only for single words. For 2-tuples, it’s almost a tie. Yet there are 20k words in our localizable files that need to be compared, and the same goes for all the subsequences. Doing my ad-hoc math, which I assume to be algorithm-wise not totally clueless, that amounts to half a million tuple compares to find the list of shared n-tuples in our localizable strings, sorted by frequency.

Which is my lengthy math-way of saying: Making a consistent localization is hard.

September 11, 2007

Discussing Mozilla-Europe

Filed under: Mozilla — Axel Hecht @ 2:16 am

Remember Mozilla Europe? Tristan does, and poked the board. Mozilla Europe has gone a tad silent with the other organizations rising in our ecosystem, so it’s about time to think about what Mozilla Europe is and what it could be.

The board of Mozilla-Europe is going to meet on September 17th and 18th to find out what to do. All board members have some ideas on what we want to see and do, and even a bit of time to actually do something.

On the other hand, what do we know?

So I’m reaching out to you, Europeans and not, to share your ideas. Visions are good, plans are better, as always with Mozilla. The only restriction we’d have is that it should be about Mozilla and Europe.

You can leave your suggestions on the wiki, as trackbacks, or comments here. Or, pester your favorite member of the board directly. Not necessarily me.

Posted on behalf of the board of Mozilla Europe.

September 7, 2007

Farewell, SourceForge. Again.

Filed under: Mozilla — Axel Hecht @ 4:19 am

SourceForge.net is changing their privacy policy again. You can read the policy in effect right now, and the one coming up on their website. Let’s hope for the sanity of this post that the first is actually the link for the archived version, too.

In a sad attempt to actually comply with their own privacy policy, sf.net announced this change via email. The email to me was sent on Sept. 4th, announcing the switch date to be … hrm … Sept. 4th. This apparently violated

Concurrently with any substantive change to the Privacy Statement, we will email notice of the change to known users at least 15 days in advance (or such shorter or longer time as mandated by law or any judicial or government body).

Apparently, because they changed the date due to at least two tickets filed, one of which was mine.

The rationale for the change in privacy policy, according to that first mail, was

in connection with the ongoing SourceForge.net Marketplace open beta,
and our recent TRUSTe certification…

Just quoting because I found the “TRUST” in there funny. Sort of. Anyway, if you read the two versions linked above, you’ll have a hard time finding the differences, but one difference is pretty remarkable: sf.net will not attempt to notify its users of privacy policy changes by email again. Let’s make that again-again. A long time ago, they did the very same thing: they changed the privacy policy and dropped the mail-notification clause. Back then, I had them delete my account. At some point in time, they re-added the notification clause, and I opened up a new account to participate in buildbot-devel and translation-i18n discussions. Sadly, they lost it again, and I won’t even try to review their proposed privacy policy beyond the point that it can be changed at will and any change to it will be somewhere in the attic.

Shrink-to-fit is not the way to handle privacy. Seems like I have to say farewell to sf.

Again.

August 21, 2007

RFE: ORM for mozstorage

Filed under: Mozilla — Axel Hecht @ 5:33 pm

Is anyone working on an object-relational mapping for js and mozstorage? Say, something like what SQLAlchemy does? Using JS1.7 iterators and generators, that should be totally feasible. That is, for someone who knows SQL, which is not me.

Thus RFE, any volunteers or takers out there?

July 30, 2007

Mail matters

Filed under: Mozilla — Axel Hecht @ 12:43 am

As everybody else, I’d like to give my 2 cts on the current mail/thunderbird discussion, and I’ll do it cent-wise. I’ll start off with mail.

Mail matters. As of today, your email address is the root cert for the vast majority of your online identities.

POP/IMAP matters. Your emails are a valuable asset, and open and interoperable protocols (read, APIs) guarantee users power and choice over what to do with that data.

Email accounts matter. ISPs used to lock in their customers via their browser/connection software. Today, ISPs lock in their customers via email accounts. This is a real competitive advantage by raising the barrier to switch ISPs and thus leading to less competition between ISPs.

Sadly, email software matters. Today’s email traffic has a malicious-content problem, so your email software has to be “safe”, no matter where it runs.

Webmail does not matter. But webmail is vendor lock-in par excellence, and it does not give the user power over her or his data. It’s nifty that you can access your mail through a browser, but in terms of empowerment, it doesn’t matter.

Email is a unique high-value asset of human beings on the internet, with a particular impact on security and choice, and thus, IMHO, has an outstanding role with respect to the Mozilla mission and the manifesto. It’s not outweighing the web, but one without the other is vastly less valuable. Something that I wouldn’t say about calendaring or VOIP.

I’ll do a Thunderbird-specific piece shortly.

July 23, 2007

Hello? World?

Filed under: Uncategorized — Axel Hecht @ 3:32 am

Some ads just puzzle me.

This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2005.

Found on Buildbot-devel, though the archives strip the ads you get in the actual mails.

June 26, 2007

tools don’t help

Filed under: L10n,Mozilla — Axel Hecht @ 4:03 am

Localization tools are a pity. The Hebrew localizers switched from one tool to the other, and attached a patch to bug 373436, +1190/-818 lines for toolkit alone. Given that toolkit ‘only’ has some 2000 strings, that’s a pretty big diff.

I wrote a diff tool that ignores all the formatting and ordering, and oops. The actual changes boil down to +6/-6. Even worse, I can hardly a- that patch, if I want that localization to go forward.
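A diff that ignores formatting and ordering could look roughly like the following Python sketch. This is not the tool mentioned above, just an illustration of the idea, and it is heavily simplified: the function names are made up, and it only handles plain `key=value` lines in properties files, with no line continuations, escapes, or DTD support.

```python
def parse_properties(text):
    """Parse simple key=value lines into a dict.

    Blank lines, '#' and '!' comments, and whitespace around keys and
    values are ignored, so formatting and ordering no longer matter.
    """
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            entries[key.strip()] = value.strip()
    return entries


def semantic_diff(old_text, new_text):
    """Return sorted (added, removed, changed) key lists between two files."""
    old = parse_properties(old_text)
    new = parse_properties(new_text)
    added = sorted(k for k in new if k not in old)
    removed = sorted(k for k in old if k not in new)
    changed = sorted(k for k in old if k in new and old[k] != new[k])
    return added, removed, changed


# Made-up sample data: same entries reordered and respaced, plus real changes.
old_file = "# a comment\nb = Two\na=One\n"
new_file = "a = One\n\nb=Zwei\nc=Three\n"
added, removed, changed = semantic_diff(old_file, new_file)
```

With the made-up sample above, a line-based diff would flag every line, while this comparison reports only the new key `c` and the changed value of `b`.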

To anyone pondering writing l10n tools, would you please consider writing them with openness in mind? Tool lock-in is not an option, even if it spares you half a day of work to write a decent serialization architecture.

Oh, and don’t ask me to endorse your tool until bugs like this are fixed.

</rant>

June 14, 2007

… and windows l10n builds

Filed under: L10n,Mozilla — Axel Hecht @ 4:34 pm

Coop picked up bug 374197 and thus we’re having l10n windows builds on the trunk again. To quote the log,

ar...busted.
bg...busted.
ca...testfailed.
cs...testfailed.
da...busted.
de...busted.
el...busted.
en-GB...busted.
es-AR...busted.
es-ES...testfailed.
eu...busted.
fi...busted.
fr...success.
fy-NL...busted.
ga-IE...busted.
gu-IN...busted.
he...busted.
hu...busted.
hy-AM...busted.
it...busted.
ja...busted.
ja-JP-mac...busted.
ka...busted.
ko...busted.
ku...busted.
lt...busted.
mk...busted.
mn...testfailed.
nb-NO...testfailed.
nl...testfailed.
nn-NO...busted.
pa-IN...busted.
pl...success.
pt-BR...testfailed.
ro...busted.
ru...success.
sk...busted.
sl...busted.
sq...busted.
sv-SE...testfailed.
tr...busted.
zh-CN...busted.
zh-TW...busted.

The corresponding builds are on ftp.

New localizations to test

Filed under: L10n,Mozilla — Axel Hecht @ 7:18 am

We have a few new localization teams putting a whole lot of effort in getting their languages up and running. For those, the l10n server is creating language packs right now, and as always, they can use as many eyes as possible.

The right welcome for Runa, Rajesh and Timothy is a Mozilla welcome, so pound their work and file some bugs ;-). If you know someone speaking Bengali, Hindi, or Ukrainian, or happen to do so yourself, you want to download their language packs, set browser.useragent.locale to bn-IN, hi-IN, or uk, respectively, and restart.

The best way to provide feedback for those can be found on the respective team pages.

Update: Filmil fixed the Serbian language pack now, too. Woot.
