-
Compiling Localizable Objects into Native JavaScript
As promised, here is the second post from Jeremy Hiatt’s work on our l20n project. This is a word-for-word reposting of his essay about compiling localizable objects in native JS.
====================================
One of the goals for my summer internship is to improve performance of l20n. The initial implementation was a parser written entirely in JavaScript that operated on .lol files. For more details about our choices for file formats, see my previous post. After some failed attempts to rework the parser’s use of regular expressions that regressed performance, I experimented with JSON as an alternative file format. The hope was that we could leverage the performance of Gecko’s built-in JSON parser to speed up l20n. We did see some tremendous improvements: on a large testcase constructed from browser.dtd, JSON cut our parsing time from ~140 milliseconds down to just a few ms. Unfortunately, we were still slow when it came to evaluating and displaying all those entities. We still had a big chunk of parsing left that we couldn’t outsource to JSON. Each string value in l20n may contain variable placeholders. Here’s an example (in JSON):
    "droponbookmarksbutton" : {
      "value" : "Drop a link to bookmark it"},
    "popupWarning" : {
      "value" : "${brandShortName}s prevented this site from opening a pop-up window."}

(Line breaks inserted for clarity.) The first string doesn’t use any variables, but the second does. In order to catch all these placeholders, we scanned each string with a regular expression to match the ${…}s syntax, even though many strings don’t use any variables. That translated to a linear traversal of every single string before it could be returned, costing us a lot of time. In tests conducted in the xpcshell, rendering all the elements from browser.properties took roughly 40ms. In comparison, the current framework for properties files can parse and display all the elements in under 20ms. Since we can’t afford to regress overall performance, that meant we still had work to do to get faster.
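To make the cost of that scan concrete, here is a minimal sketch of regex-based placeholder expansion. It is not the actual l20n runtime code: the names `PLACEHOLDER`, `expand`, `entities`, and `context` are assumptions made for illustration, and "Minefield" is simply borrowed as a sample value. The point is that the regular expression has to walk every string, including strings like the first one that contain no placeholder at all.

```javascript
// Hypothetical sketch, not the real l20n runtime API.
// Matches the ${name}s expander syntax and captures the name.
const PLACEHOLDER = /\$\{([^}]+)\}s/g;

// Expand every placeholder in `value` using the variables in `context`.
// The regex engine scans the whole string even if nothing matches.
function expand(value, context) {
  return value.replace(PLACEHOLDER, (match, name) => {
    if (!(name in context)) {
      throw new Error("Unknown variable: " + name);
    }
    return String(context[name]);
  });
}

// The two entities from the example above, already parsed from JSON.
const entities = {
  droponbookmarksbutton: { value: "Drop a link to bookmark it" },
  popupWarning: {
    value: "${brandShortName}s prevented this site from opening a pop-up window."
  }
};
const context = { brandShortName: "Minefield" };

// Every entity pays for a scan, whether or not it uses a variable.
const rendered = {};
for (const id of Object.keys(entities)) {
  rendered[id] = expand(entities[id].value, context);
}
console.log(rendered.popupWarning);
```

With thousands of entities, that per-string scan is repeated every time a value is requested, which is the overhead described above.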
One way to eliminate checking every single string is to add extra information to the encoding for strings. Many languages define different behavior for single- vs. double-quoted strings, performing replacements in one but not the other. We could also have added a special flag to indicate simple (no replacements) vs. complex strings. Either of these approaches would have added further complexity to the localization process, so we did not seriously consider this approach.
Instead, on the advice of the brilliant Staś Małolepszy, we embarked on an experiment to compile our l20n objects into native JavaScript. As a result, we saw another impressive performance jump. In an xpcshell test, we can load and display all of browser.properties in roughly 4ms (an order of magnitude improvement!). Here’s what our previous example looks like as compiled JavaScript:
    this.droponbookmarksbutton="Drop a link to bookmark it";
    this.__defineGetter__("popupWarning", function() {
      return "" + (brandShortName) + " prevented this site from opening a pop-up window.";
    });

Another great thing about compilation is that our runtime performance doesn’t depend on our choice of source file format. Here’s a diagram showing the different ways an l20n file can get inflated into a localization context:
The performance numbers were collected using nsITimelineService in the xpcshell. The l20n runtime infrastructure can inflate a source file directly into a context, or it can load compiled JavaScript definitions for a significant performance boost. For comparison, here’s a diagram of Mozilla’s current l10n scheme:
Again, this time was measured in the xpcshell when loading the browser.properties string bundle. It’s not necessarily representative of performance for DTD files as well. As we can see, compilation now guarantees at least comparable performance to the current approach, no matter what file format we end up using. If you’d like to weigh in on that debate, please leave a comment on my previous post! And finally, we are also working on l20n support in Silme so that it will be easy to migrate existing DTD/.properties files to our new l20n format.
Silme will serve as a critical compatibility layer to ensure a smooth transition to our new l10n framework. Please let me know if you have any questions or comments!
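For readers who want to see the compilation idea from this post in miniature, here is a rough sketch under several assumptions. The function names `compileValue` and `compile` are hypothetical, variables are resolved as properties of `this` rather than as bare names, and the standard `Object.defineProperty` stands in for `__defineGetter__`. The real l20n compiler may differ on all of these points.

```javascript
// Hypothetical sketch of compiling entities to JavaScript source text.
const PLACEHOLDER = /\$\{([^}]+)\}s/g;
const IDENTIFIER = /^[A-Za-z_$][\w$]*$/;

// Turn one string value into a JavaScript expression (as source text).
// Returns the expression and whether it references any variable.
function compileValue(value) {
  const parts = [];
  let last = 0;
  let hasPlaceholder = false;
  for (const m of value.matchAll(PLACEHOLDER)) {
    if (!IDENTIFIER.test(m[1])) {
      throw new Error("Invalid variable name: " + m[1]);
    }
    hasPlaceholder = true;
    parts.push(JSON.stringify(value.slice(last, m.index)));
    parts.push("(this." + m[1] + ")"); // assumption: variables live on `this`
    last = m.index + m[0].length;
  }
  parts.push(JSON.stringify(value.slice(last)));
  return { expr: parts.join(" + "), hasPlaceholder };
}

// Emit one statement per entity: a plain assignment for simple strings,
// a getter for strings that reference other entities.
function compile(entities) {
  const lines = [];
  for (const [id, entity] of Object.entries(entities)) {
    const key = JSON.stringify(id);
    const { expr, hasPlaceholder } = compileValue(entity.value);
    if (hasPlaceholder) {
      lines.push(
        `Object.defineProperty(this, ${key}, ` +
        `{ get: function () { return ${expr}; }, enumerable: true });`
      );
    } else {
      lines.push(`this[${key}] = ${expr};`);
    }
  }
  return lines.join("\n");
}

const entities = {
  brandShortName: { value: "Minefield" },
  droponbookmarksbutton: { value: "Drop a link to bookmark it" },
  popupWarning: {
    value: "${brandShortName}s prevented this site from opening a pop-up window."
  }
};

// "Loading" the compiled file: run the generated source against a context.
const source = compile(entities);
const context = {};
new Function(source).call(context);
console.log(source);
console.log(context.popupWarning);
```

The scan for placeholders now happens once, at compile time. At runtime, simple strings are plain property reads and only the strings that actually use variables pay for a function call.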
-
The L20n Format Shootout
Jeremy Hiatt is our localization summer intern who has been doing some fantastic work to advance the conceptual idea of L20n into something more practical. Below is a word-for-word copy of a post he made on his blog. I am reposting his words to get more people reading what he has been working on. Tomorrow will come a second repost about compiling localizable objects into native JavaScript.
——————————-
L20n (for localization, 2.0) aims to empower localizers to describe complexities and subtleties of their language: gendered nouns, singular/plural forms, and just about any other quirk that might exist in the grammar. Like DTD and .properties formats, which we currently use to encode localizable strings, l20n objects associate entity IDs with string values. Localizers translate these values into the target language. L20n has all the power of the current framework, plus a lot more, and it’s just as simple to use (provided we choose the right format!). You can find some examples of l20n in action here. In the past weeks, we’ve experimented with JSON (JavaScript Object Notation) as a file format to represent localizable objects in hopes of achieving better performance by leveraging the new built-in JSON parser in Firefox. The performance gains were substantial, but still not enough to compete with the current DTD/properties framework in terms of speed. We’ve since adopted a new scheme to compile our l20n source files into native JavaScript (credit to Staś Małolepszy for suggesting this). This compilation now guarantees good performance independent of our choice of source file format. I will discuss the specifics of compilation in an upcoming post; this post will focus on the relative merits of the various formats under consideration.
Meet the contenders
LOL files
Before experimenting with JSON, we developed a novel format for l20n, playfully titled “localizable object lists” (.lol files). A lol file looks like a hybrid of DTD and .properties formats, with entities delimited by angle brackets and colons separating keys from values. Here’s a simple example, constructed from brand.dtd:
    <brandShortName: "Minefield">
    <brandFullName: "Minefield">
    <vendorShortName: "Mozilla">
    <logoCopyright: " ">
In this simple case, the lol file looks a lot like the original brand.dtd, which looks like this:
    <!ENTITY brandShortName "Minefield">
    <!ENTITY brandFullName "Minefield">
    <!ENTITY vendorShortName "Mozilla">
    <!ENTITY logoCopyright " ">
We lost the !ENTITY declaration and added a colon, but otherwise the lol format should look familiar. What if we want to do something more complex, like define an entity that involves a gendered noun? Here’s a German example encoded in a lol file:
    /* This entity references a gendered noun */
    <complex[appName.gender]: {
      male: "Ein hübscher ${appName}s.",
      female: "Ein hübsches ${appName}s."}>

    /* This is a gendered noun */
    <appName: "Jägermeister"
     gender: "male">

In the above example, we indicated the “complex” entity depends on the “gender” property of the “appName” entity. The ${…}s expander within the string is a placeholder that will be replaced with the value of “appName” (Jägermeister). Note that we can insert comments in the familiar /*…*/ style. If you’re curious to see more use cases for l20n and the lol format, be sure to check out the link above to Axel’s examples.
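As an illustration of how such an indexed entity might be resolved, here is a small sketch. The in-memory shape of the entities (a plain object with `value`, `indices`, and extra attributes such as `gender`) and the helper names `lookup` and `getValue` are assumptions for the sake of the example, not the l20n runtime's actual data model.

```javascript
// Hypothetical in-memory form of the lol example above.
const entities = {
  appName: { value: "Jägermeister", gender: "male" },
  complex: {
    indices: ["appName.gender"],
    value: {
      male: "Ein hübscher ${appName}s.",
      female: "Ein hübsches ${appName}s."
    }
  }
};

// Resolve "appName" to the entity's value, or "appName.gender"
// to one of its attributes.
function lookup(path) {
  const [id, attr] = path.split(".");
  const entity = entities[id];
  if (!entity) {
    throw new Error("Unknown entity: " + id);
  }
  return attr === undefined ? getValue(id) : entity[attr];
}

// Pick the right variant using the entity's indices, then expand
// any ${name}s placeholders in the selected string.
function getValue(id) {
  const entity = entities[id];
  if (!entity) {
    throw new Error("Unknown entity: " + id);
  }
  let value = entity.value;
  for (const index of entity.indices || []) {
    value = value[lookup(index)];
    if (value === undefined) {
      throw new Error("No variant of " + id + " for index " + index);
    }
  }
  return value.replace(/\$\{([^}]+)\}s/g, (match, name) => lookup(name));
}

console.log(getValue("complex"));
```

Because “appName” is marked as male, the “male” variant is chosen; changing the noun's `gender` attribute would select the other string without touching the “complex” entity itself.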
JSON
JSON is a well-known data exchange format. It’s simple to understand, and with implementations available in many different languages, simple to use. As mentioned above, our initial attempt to encode localizable objects in JSON was motivated by performance concerns. Even without a speed advantage, JSON still has some attractions, namely its existing implementations. Our JSON-based l20n infrastructure leverages Gecko’s built-in parser to do a lot of heavy lifting, meaning we have less code to maintain on our part. Plus, arrays and hashes, the fundamental datatypes available in JSON, are a natural fit for localizable objects. Still, JSON has some serious shortcomings, which we will see shortly.
As mentioned above, JSON is great for describing key-value pairs. Here’s the same brand.dtd example, now expressed in JSON:
    {"brandShortName" : {"value" : "Minefield"},
     "brandFullName" : {"value" : "Minefield"},
     "vendorShortName" : {"value" : "Mozilla"},
     "logoCopyright" : {"value" : " "}}

Our localizable objects in JSON feature a “value” attribute denoting the string to be displayed. This makes our JSON example slightly more verbose, along with the required quotes surrounding the keys. Now here’s the sample gendered-noun example from above, this time in JSON:

    { "complex" : {"indices" : ["appName.gender"],
                   "value" : {
                     "male" : "Ein hübscher ${appName}s.",
                     "female" : "Ein hübsches ${appName}s."}},
      "appName" : {"value" : "Jägermeister",
                   "gender" : "male"}}

In JSON, we need to reserve some keywords for attributes, like “indices” here, to implement certain l20n features. Still, JSON works pretty well to express the structure of the object. One area where JSON doesn’t work so well is comments. In JSON, our top-level object is a hash that associates entity IDs with their definitions. There are a few apparent ways to integrate comments into this object:
- Assign each comment to the same identifier, e.g. “comment”.
- Assign each comment to a unique identifier, e.g. “comment0″, “comment1″, etc.
- Don’t allow top-level comments: each comment has to be an attribute of an entity
Option 1 makes sense for humans writing JSON, and it’s valid, but a little strange.
Option 2 is a little painful when writing the file, especially when it comes to inserting new comments. This scheme would make it possible to reference specific comments, which might be useful.
Option 3 is somewhat of a straw-man but still deserves some consideration. Most comments in a localizable file give instructions for how to translate a specific entity, and now that relationship would be explicitly enforced. This form of comment is likely the best choice in most situations, but it probably is too restrictive to make it the only choice.

Another shortcoming in JSON is that it doesn’t support multiline strings. This is a serious problem when it comes to presenting long strings to localizers, since line breaks aren’t just for readability; they also give important cues for localization about logical separation between thoughts. As it turns out, the native JSON parser built into Gecko is perfectly content to accept multiline strings, but most other parsers will report an error.
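Both shortcomings are easy to reproduce. The sketch below assumes a current standards-conforming `JSON.parse` (for example in Node.js), which may behave differently from the Gecko parser described in this post; the sample keys and comment texts are made up for illustration.

```javascript
// Option 1: every comment shares the key "comment".
const withComments =
  '{"comment": "About the brand",' +
  ' "brandShortName": {"value": "Minefield"},' +
  ' "comment": "About the vendor",' +
  ' "vendorShortName": {"value": "Mozilla"}}';

// The text is accepted, but an object can hold only one "comment"
// property, so the later value silently replaces the earlier one.
const parsed = JSON.parse(withComments);
console.log(parsed.comment);
console.log(Object.keys(parsed).length);

// Multiline strings: the \n below is a real line break inside the
// JSON text, which a conforming parser rejects with a SyntaxError.
const multiline = '{"about": {"value": "First thought.\nSecond thought."}}';
let rejected = false;
try {
  JSON.parse(multiline);
} catch (e) {
  rejected = e instanceof SyntaxError;
}
console.log(rejected);

// The same content is only accepted with the line break escaped,
// which is exactly what makes long strings hard to read in the file.
const escaped = '{"about": {"value": "First thought.\\nSecond thought."}}';
console.log(JSON.parse(escaped).about.value);
```

So under option 1 the comments survive in the file for a human reader, but any tool that loads the file through a standard parser sees only the last one.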
YAML: A better JSON?
YAML is a data serialization language that is a superset of JSON. It supports comments, multiline strings, and user-defined data types. On the downside, it’s not nearly as well-known as JSON, it’s considerably more complex, and it’s not already built in to the Mozilla platform.
Here’s our first example from above, now in YAML:
    brandShortName: Minefield
    brandFullName: Minefield
    vendorShortName: Mozilla
    logoCopyright:
And the second example:
    complex:
      indices: appName.gender
      value:
        male: Ein hübscher ${appName}s.
        female: Ein hübsches ${appName}s.
    appName: {value: Jägermeister, gender: male}

YAML has the same logical structure as JSON with a much cleaner look, since indentation can be used instead of curly braces to denote objects, and it doesn’t require strings to be delimited with quotes. That’s another attractive feature, since it reduces the chance for errors with improperly escaped quotes within strings, and missing trailing quotes, that cause a lot of frustration. The less rosy side of the picture is that we don’t have a YAML parser that we can simply drop into place like we did with JSON, so it would require a lot of work on our part to get it up and running. YAML does have a fair number of implementations floating around, but development seems to have stalled on many of these. For example, this JavaScript implementation hasn’t seen any updates in nearly 5 years.
Summary
So far we’ve examined three choices: LOL, JSON, and YAML. The first was designed specifically for l20n, so naturally it encodes the complex features of l20n most gracefully. The remaining two are established formats with implementations available in many different programming languages (JSON to a far greater extent than YAML). The lack of comments and multiline strings is probably enough to eliminate JSON from the discussion, since these deficits outweigh any advantage of interoperability, leaving us with LOL and YAML. If you’d like to make a case for one of these, or any other format dear to your heart, don’t hesitate to leave a comment! We’d love to get your input.
-
Northern Sotho
Northern Sotho is an official language of South Africa, and you’ve probably guessed why I am blogging about it. Thanks to the folks at Translate.org.za, Firefox is now available for use in this language (by way of an AMO Collection).
Since one of our localization community leaders, Dwayne Bailey, posted the following message via Facebook, I thought I would repost it on my blog. Sorry for lifting the email and reposting if you’ve already read this note, but I’m hoping to provide maximum coverage.
“Are you a Sepedi speaker, Firefox user or able to help test a Pedi version of Firefox? Yes that probably means all of you
“This work is soon to be part of the African Network for Localisation (ANLoc) (http://www.africanlocalisation.net/) activities where we’ll be localising Firefox into a number of African languages. So your help here can help change the way Africans view the internet, create content, etc, etc. You’re about to change the world!
“OK testers here we go:
- Make sure you have Mozilla Firefox. Visit http://www.mozilla.com and install Firefox if needed.
- Start Firefox
- Visit the Northern Sotho collection https://addons.mozilla.org/en-US/firefox/collection/northern-sotho
- Install both the Northern Sotho language pack and the Locale Switcher
- Restart Firefox
- Change your user interface language by selecting: Tools -> Languages -> Northern Sotho
- Restart Firefox
“Enjoy Firefox in Northern Sotho! Whenever we update the translations you should get new copies. Please provide any feedback on the Translate.org.za Facebook wall (or on Seth’s blog).
“If you would like to get involved in the actual translation or in fixing errors then please contact Dwayne via Facebook (or Seth’s Blog). If you’re interested we could have a Northern Sotho Firefox bug day at our offices and work at fixing any errors. But most of all HAVE FUN!”
-
A look at Microsoft’s website for downloading localized versions of IE8
A while back, Tristan passed me a blog post by the Internet Explorer 8 team that announced the availability of IE8 in over sixty languages. Their opening quote: “We are pleased to announce the availability of Internet Explorer 8 in 20 additional languages today. Internet Explorer 8 is now available in a total of 63 languages!” That’s a nice accomplishment by their team and congratulations.
I went to their site to investigate how Microsoft offers downloads of their localized versions and noticed some distinctions between the Firefox and IE8 experience.
Most interesting for me off the bat, IE8 is offering some languages Firefox does not have, including Konkani, Kyrgyz, Malay, and Uzbek (Latin). Those are the languages listed on their blog, but I am not sure if that is a comprehensive list of the differences, so I could use some help on determining where we fall short. It’s hard for me to tell from the official download page, which lists eighty-five different “country/region” selections, bringing me to my next observation.
Microsoft asks in English for its users to select a “country/region” and operating system. Then, they automatically send users to a localized download page. That seems to be pretty nice; perhaps a user experience person could give me an opinion. Looking at the list, I did see some Indian language fonts, indicating Indian versions, but I don’t know if that corresponds directly to a country or region. Either way, the count in that drop down selector is around eighty-five. Therefore, I don’t know if they offer sixty-three or eighty-five localizations. Either number is impressive.
In addition to all this, Microsoft’s “Worldwide Sites” page lists fifty-four “country-languages” for a user to select that will change the UI of the website to that country-language. Mozilla does not distinguish based on country specifically; we simply list languages. Interestingly, they have several versions of French available for France, Canada, Switzerland, and North Africa. Mozilla lists only one version of French.
At Mozilla, we do not offer downloads based on “country/region” because we feel that it is best to showcase the language of the localization, not the geopolitical boundary. For instance, in India, we have eleven different localization teams, so isolating downloads to country/region didn’t seem to get users exactly what they wanted. Instead, we try to provide the best possible download by looking at the language of the browser in use at the moment of a visit to Mozilla for download and offering the version that matches that language. If we cannot recognize the existing language, we have a series of fall-back options in our website code that tries to offer the best possible download. If that doesn’t work, the en-US page provides an “Other Systems and Languages” link available just under the main download box. That takes users to our all.html page where all of our localizations can be seen.
Just looking at the copy of the Microsoft download site, the IE8 team states that the browser is available in many “locales/languages”. We use a bit different terminology in an attempt to distinguish the term “localization” from “language”. For Mozilla, localizations are partly identified by the language of the UI. But, a localization is customized to the region where the language is most prevalently spoken. For instance, using our eleven Indian localizations as an example again, each team is able to customize their version of the browser so that web services like search or protocol handlers are packaged together in one download. It may be a nuance, but for Mozilla, we try not to interchange language and localization.
Lastly, I am not quite sure how those sixty-three (eighty-five) languages are shipped to end-users. Does Microsoft ship each version simultaneously? Or are versions offered as “downloadable” packs after a major release in English? In the past, I had heard that IE only ships one version (en-US) at the time of a major release. But I suspect that has changed; I just couldn’t find the information anywhere. Please link me in the comments if you know.
-
Help me test two Kiswahili versions of Firefox
Surely, you saw me fire off a response two weeks ago about playing politics with our Kiswahili localization communities. Let’s move on from that flame war by summarizing our situation and presenting a path to a solution.
Presently, we have two communities, the tzLUG and the Kilinux teams, who have translated the Firefox application into Kiswahili (sw-TZ). Unfortunately, we have had tough luck in getting an unbiased, thorough evaluation of each body of work to help us decide which one to use. As it turned out, it was hard to find a number of individuals familiar enough with technical writing and Kiswahili who had time on their hands to volunteer for Mozilla. Furthermore, we didn’t have an easy package to evaluate, except for the “diff” of the code differences between the two. Yeah, that sounds ugly and it was. Still is.
To solve what has become a long-standing debate, we asked each team leader to create a Mozilla language pack of their work as an add-on that we would then host on and promote through our addons.mozilla.org website. Both teams agreed and uploaded their versions. Since then, I created two separate “collections” that bundle each language pack with Ben Smedberg’s Locale Switcher addon. Our hope is that end-users ready to test will install both versions and use the addons.mozilla.org site to provide feedback to each developer team.
If you are interested in testing each version, please install the following two collections:
- Kilinux: https://addons.mozilla.org/en-US/firefox/collection/kiswahili.kilinux
- tzLUG: https://addons.mozilla.org/en-US/firefox/collection/kiswahili.tzlug
Once you have installed these, you can switch between the two versions and your English interface by going to the menu item Tools –> Languages…
Now for testing…
Requirements: You must be able to read Swahili and English fluently and you must use Firefox.
If you choose to test these localization language packs, you’ll need to follow something similar to the “Firefox 3.5 Localizer Test Run” that has been created in Litmus, Mozilla’s testing application. If you use Litmus, please follow the steps I have posted in the first comment on this blog post.
You can also just use each language pack and keep notes of errors you spot. Whether you choose to use Litmus or not, please record any translation errors that you find in the user interface of each version. Please be very descriptive and thorough with any notes you keep, and write the notes in English. Take a look at the word choices, terminology, spelling, grammar, etc. and keep a record of errors you see. When you are finished, you can submit your evaluation to me. Just ping me on this blog.
As always, please ask some questions if you have them. Nothing is off limits.


