Axel Hecht Mozilla in Your Language

April 8, 2010

New version of compare-locales released, please update

Filed under: L10n, Mozilla — Axel Hecht @ 4:29 pm

I’ve released a new version of compare-locales, and you should really update.

The new version of compare-locales adds:

  • support for more than one filter.py
  • support for filter.py returning more than just booleans: “error”, “report”, or “ignore”.

Why?

Lorentz strings. They are missing, but their absence isn’t fatal, so we needed a third state. And some of them are in toolkit, so we’re moving parts of the filter.py logic from all over the place into releases/mozilla-1.9.2/toolkit/locales/filter.py. That means we can remove the hacks in comm-central and mobile-browser, which makes things much more reliable and predictable.
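
To give a rough idea of what that looks like in practice, here’s a sketch of a filter.py using the new return values. The paths and keys are just for illustration (the Lorentz plugin-crash keys happen to show up in the sample output below); the three return values are the actual new bit:

def test(mod, path, entity=None):
    # don't compare anything in this imaginary subdirectory
    if path.startswith("installer/"):
        return "ignore"
    # Lorentz strings: missing them isn't fatal, but they should
    # still show up in the output, so report instead of error
    if entity is not None and entity.startswith("crashedpluginsMessage."):
        return "report"
    # everything else is required
    return "error"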

The changes to 1.9.2 will land shortly, so versions of compare-locales prior to 0.7 will stop working.

Update paths:

easy_install -U compare-locales
is the easiest way to do it. Depending on OS and local settings, you might want to
sudo easy_install -U compare-locales

Or,
hg clone http://hg.mozilla.org/build/compare-locales/
and do whatever you did last time, setting paths or running python setup.py install.

How does it look? Where’s the beef? Or veggies?

Here’s what 0.8 spits out for German on 1.9.2 with a patched filter.py:

de
   browser/chrome/browser
     browser.properties
         +crashedpluginsMessage.learnMore
         +crashedpluginsMessage.reloadButton.accesskey
         +crashedpluginsMessage.reloadButton.label
         +crashedpluginsMessage.submitButton.accesskey
         +crashedpluginsMessage.submitButton.label
         +crashedpluginsMessage.title
     preferences/advanced.dtd
         +submitCrashes.accesskey
         +submitCrashes.label
   toolkit/chrome/mozapps/plugins/plugins.dtd
       +reloadPlugin.middle
       +reloadPlugin.post
       +reloadPlugin.pre
       +report.disabled
       +report.failed
       +report.please
       +report.submitted
       +report.submitting
       +report.unavailable
de:
keys: 940
report: 17
unchanged: 634
changed: 4561
87% of entries changed

You can see the regular output of missing strings, and you’ll recognize all of Lorentz. New here are three things:

  • The strings in the long display are not counted as missing, but are in a new summary item called “report”. Those strings are not fatal, but should get localized.
  • The return value of compare-locales is only dependent on *missing* strings, i.e., code checking the return value will see a successful run of compare-locales if there are reported strings, as long as there are none missing.
  • If you switch l10n-merge on, it won’t merge the reported strings, but rely on the real code falling back as intended.
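
For the second point, a build script checking the exit status might do something roughly like this; the command line is made up for illustration, the point is only the return-code semantics:

import subprocess

# the exit status reflects missing strings only; reported strings
# still leave the run "successful"
ret = subprocess.call(["compare-locales", "l10n.ini", "../../l10n", "de"])
if ret != 0:
    raise RuntimeError("compare-locales found missing strings")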

A not-so-important feature update: compare-dirs now supports l10n-merge, too. That’s sweet for the upcoming Weave stuff.

Questions are welcome here; bug reports are welcome in “Mozilla Localizations”, “Infrastructure” component.

February 5, 2010

L20n presentation at FOSDEM 2010

Filed under: L10n, Mozilla — Axel Hecht @ 4:26 am

I’ll be presenting at FOSDEM on l20n, the infrastructure that we’re hoping to move our localization efforts to. The talk will be in the Mozilla Developers room on Sunday at 13:15. The FOSDEM program might still list a Sunday morning slot; that has changed.

I’ll focus on what tools can do to help localizers use the power that l20n brings, without making things totally obscure. I’ll start with a quick recap for those that are new to it, and then discuss the challenges that l20n brings, and how tools can help. I’ll also present first thoughts on how to communicate data describing languages between tools, using HTML5 and microdata.

See you in Brussels, not just for that talk, of course.

PS: I’ll be giving a lightning talk on the new l10n site, too.

November 12, 2009

Crowdsourcing … exactly what?

Filed under: L10n, Mozilla — Axel Hecht @ 9:45 am

I’ve just run across an interesting suggestion for translating “Smiley” into English; a screenshot is below, whereas the original (triple-licensed) translation suggestion is on l10n.mozilla.org/narro.

[Screenshot: Crowd sourcing exactly what?]

Another interesting aspect of crowd sourcing, box-of-chocolates style. You never know what you get.

November 6, 2009

PS: l10n-merge

Filed under: L10n, Mozilla — Axel Hecht @ 8:02 am

Armen just blogged about this, and as it’s constantly mentioned around l10n, I wanted to add a bit more detail to l10n-merge.

l10n-merge is originally an idea by our Japanese localizer dynamis. The current implementation used in the builds is by me, integrated as an option to compare-locales. There are spin-offs of that algorithm in the silme library, too.

l10n-merge attempts to solve one reason for “yellow screens of death”, i.e., XML parsing errors triggered by incomplete localizations. This is really crucial, as localizations don’t just pop up by swinging magic wands; they’re incremental work, and a huge chunk of it. So in order to test your work, you need to see the strings you have in, say, Firefox, without having the other 4000 strings done yet. Other l10n infrastructures handle this by falling back to the original language at runtime (gettext), but doing that at runtime of course has a cost in performance and size. l10n-merge does the same thing at compile (repackaging) time.

Design goals for l10n-merge were:

  • not mess with any source repositories
  • not do any file I/O that’s not really needed

Thus, in order to not mess with the source repos, l10n-merge doesn’t modify the sources inline, but creates copies of the files it touches in a separate dir. Commonly, we’re using ‘merged’ in the build dir. Now, creating a full copy of everything would be tons of file I/O, so l10n-merge only creates copies for those files which actually need entities added to their existing localized content. This plays together with code in JarMaker.py, which is able to pick up locale chrome content from several source dirs.

A Firefox localization contains some 450 files, and say, for the current 9 B1-to-B2 missing strings in two files, it would copy over those two files from l10n and add the missing entities to the end. Then JarMaker is called with the right options; for those two files, it picks them up from merged, while the rest of the localization comes from l10n. For missing files, it actually looks into the en-US sources, too, so we don’t have to do anything for those. To give an example, for chrome/foo in the browser ‘module’, it searches:

  1. .../merged/browser/chrome/foo
  2. l10n/ab-CD/browser/chrome/foo
  3. mozilla/browser/locales/en-US/chrome/foo
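
In Python terms, the idea boils down to something like this toy sketch (not the actual JarMaker.py code, just the lookup order):

import os

def find_locale_file(relpath, merge_dir, l10n_dir, en_us_dir):
    """Return the first existing copy of a locale file, searching
    merged, then the localization, then en-US. Toy sketch only."""
    for base in (merge_dir, l10n_dir, en_us_dir):
        candidate = os.path.join(base, relpath)
        if os.path.exists(candidate):
            return candidate
    raise IOError("%s not found in any source dir" % relpath)

# for the example above, roughly:
# find_locale_file("chrome/foo",
#                  "merged/browser",
#                  "l10n/ab-CD/browser",
#                  "mozilla/browser/locales/en-US")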

Now it’s time to list some pitfalls that come with l10n-merge:

  • If you’re passing the wrong dir for mergedir, nothing breaks. All build logic breakage would come from missing files, and due to the fallback to en-US, there are no missing files.
  • l10n-merge, like compare-locales, doesn’t cover XML parsing errors inside entity values yet. Bug 504339 is filed; there are some tricky questions on reporting, as well as having to write an XML parser from scratch.
  • l10n-merge only appends entities, but that’s fine 95% of the time. The only counter-examples are DTDs including other DTDs.
  • People using l10n-merge need to manually maintain the merge dir. Pruning it via compare-locales is risky business if you specify the wrong path by accident, so I consider this a feature. But if you’re seeing Spanish in a French build, clobber the mergedir and build again :-)

September 9, 2009

String freezes on 1.9.2

Filed under: L10n, Mozilla — Axel Hecht @ 2:59 pm

As we’re struggling with our string freezes for 1.9.2, I figured I’d put something up on planet for everyone to notice:

  • String freezes happen once, and then they’re done with. No “I’ll get that in the next milestone”. The first string freeze is the last.
  • String freeze for mobile-browser and toolkit was last Friday. For mobile-browser, seriously, this Friday, Sept 11th (*).
  • String freeze for Firefox 3.6 is Sept 14th.

June 29, 2009

Roundup report from OTT 09

Filed under: L10n, ott09 — Axel Hecht @ 8:42 am

I’ve spent half of last week in Amsterdam joining the “Open Translation Tools 2009” unconference. It was a pretty interesting and diverse crowd to be with, and by far not as tool-author-only as it might sound. Folks came from all over, ranging from sex-worker activists via Global Voices and translate.org.za to folks from the “professional translation companies”.

We started out with a few opening ceremonies. The first of these was an introduction, the regular “who are you and where are you from”, along with a “how do you feel”. I was the only one that didn’t give a geolocation, but disclosed Mozilla as my point of origin. It’s obviously a more common theme at such events that localization folks focus much more on their geographical background as defining their cultural background (which is what this should be about, right? This is not couch-surfing.) and not so much on what they work on. The “how do you feel” was one of the hippie-pieces, along with a lot of twinkling. Reading the urbandictionary on twinkle makes me wonder, but it was just hands-not-clapping. Honestly.

Next up was a round of spectrograms in the room. Two opposite opinions were offered, and you had to stand along a line in the room according to where between the two your own opinion fell. Then Gunner, our head master of ceremonies, went in and poked people on why they’d be where they were. It’s an interesting exercise to figure out what kind of crowd you’re with, and it scales pretty well.

[Photo: Collecting agenda items]

Agenda-building worked pretty much like it does in most unconferences these days: we created tons of sticky notes and then tried to build themes and agendas from that. The resulting sticky notes are transcribed on the wiki. Mighty job by Lena. I feel quite fortunate that I didn’t have to distill those notes into an agenda. Gunner did that pretty loosely, which was probably a good combination. The resulting schedule is on the wiki, too. In the rest of my coverage, I’ll focus on those sessions that I’ve been in. The schedule links to the full notes of each session; the notes taken were usually in good shape, so do take a look.

[Photo: Ed talking professional]

The session that Ed Zad (both of which are shortened, the parents are not to blame for this one) led about the professional translation companies and ecosystem was OK. The actual translation is almost never done in-house, but contracted out to freelancers. The money you pay the companies goes into project management; they hire translators, reviewers and editors to do the actual work, who get paid by the company. The interesting part here was really that those companies make their money from the project management and recruiting part. I didn’t get any useful feedback beyond “you must have hired the wrong guys” on my report that any time we had to contract translation out, the results were not really usable. Maybe I was asking that question of the wrong guy :-). The main takeaway would be that the industry is really fragmented and diverse. I doubt there are any good rules for picking a partner when looking for a company for localization, either. The process Ed described around reviews and editing seemed rather low-key compared to what you can do if you develop your localized content in the open.

[Photo: Fran on MT]

The next session I was in was about machine translation, led by Francis. He introduced the group to both statistical machine translation and rule-based MT. Interesting here were both the enormous amount of data you need for statistical MT and the different stages involved. Rule-based MT, on the other hand, works well for closely related languages. For those that heard me talking about l10n-fork, that’d be rule-based MT. Francis offered to take a look at whether we can actually do better MT to share work in Mozilla localizations for at least closely related languages. All our Romance languages, with the Spanishes, Portugueses and French, could benefit from that, possibly even patch-based.

We “closed” Monday in Vondelpark. Matt broke the aspiration techies on Monday night, though. Amsterdam is good at that. I ended up at Petra’s place, together with Tanya and Pawell. Thanks to Petra for a fun evening.

Tuesday started off with a crazy “crawl on the floor and draw as many workflows as you have”. I figured that I don’t know enough about 90% of the workflows we use at Mozilla, and just sketched out two extremes when it comes down to localizing Firefox. One is localizing patch by patch, like for example the French team does. And then we have a long tail of localizations that work within their toolchain and just occasionally export to hg and update the upstream repos. For the cats among you, there are 40 pictures of the workflow diagrams on flickr.

[Photo: translating wikis]

Next was a pretty interesting session on translating wiki content. That made it a good fit for me to kill the session on “Localizing a hybrid organization – BOF on Mozilla”. I wanted to clone myself three or four times already, so not having to do a session myself was a win. Anyway. We had tikiwiki and mediawiki represented in the group. And me with some experience on those two, plus deki via MDC. The discussion turned up two fundamentally different ways of working:

  • Forking documents into different documents for different languages, with some cross-referencing that localizations exist. You’d know this from wikipedia.
  • Maintaining the different translations as variants of a single document. This is what tikiwiki does, as we see it on SUMO.

The discussion around one single living document in multiple languages was more lively, which gave me a good sense of what’s out there to address our needs at SUMO/MDC etc. There doesn’t seem to be anything blowing tikiwiki out of the water, so in terms of finding a wiki engine with l10n, SUMO made a good choice. We talked quite a bit about the multiple edits in various languages of the document, and what tiki defines to be 100% in the end. I showed off the l10n dashboard page we have on SUMO now, which was well received. The idea to not demand that people do as a bot tells them, and instead to empower them with relevant information, seemed to resonate well. There was a different session about CMSes and l10n, read: Drupal etc. I only overheard the last bits; it didn’t seem to have great answers over there. Judge yourself from the notes. Finding the right UE and UI paradigms for keeping a living document in multiple languages in sync seems to be an open item of work. In particular if your document isn’t bound to get value contributed in one single source language. We would want to understand which changes are ports of fixes in other languages, and which are new fixes to the actual document that other translations of this document, including the original source language, would benefit from.

Next up was a round of speed-geeking. That’s similar to speed dating. A few geeks get a table each to present something to the rest of the group. The rest of the group is split up to watch one at a time. Each presentation is 4 minutes, then the groups rotate to the next table. If you’re bored by something presented, you just wasted 4 minutes. I took the challenge to present l20n 8 times in a row. That’s a pretty technical topic and a pretty diverse audience, so apart from being a stress test on one’s vocal cords, it’s also pretty heavy on your brain. I must have been doing all right, though. The feedback was generally interested to positive. I got out with an action item to work with Dwayne on how we could actually present localization choices so that they’re options to fix and not just hell-bound confusion. On a general note, if you ever find yourself speed geeking: don’t sit in front of your computer. Don’t make people walk around the table to see something. It’s perfectly fine to sit next to your computer and have your laptop and yourself face your audience. Or do it like Dwayne did, just present without your damn laptop open :-). If feasible.

The last session on Tuesday was about building Volunteer Translation Communities. We had a few people there that are just starting to build such a community, but also a few people from Global Voices Online and yours truly from Mozilla. It’s pretty interesting how easy it is to think “I need to get such and such in language other, how do I ask for volunteers?” and how easily that fails. The common ground of those with living communities was that you don’t ask for translators, but you need to be open for contributors. At Mozilla, we’re hackable. We offer opportunities for all kinds of volunteer contributions, among which localization is one. That is something different than asking for some unit of work to be done for no pay. Another key is that you find your volunteers among those that are interested in the outcome of the localization work. The project management work you need to do to empower your translation community to actually do some work and get to the results shouldn’t be underestimated, either. There’s a reason why people make a living out of this one.

I moved from Tuesday to Wednesday through the Waterhole. As good as it used to be. Getting up in time was tough, but not as bad as it initially felt.

[Photo: Dwayne chats about Africa]

The first session I joined on Wednesday was on localization issues in Africa. We had similar sessions for Central Asia, South Asia, and Asia Pacific, which I didn’t manage to get to. I didn’t even get to read the notes from those yet. Anyway, back to Africa. The challenges there aren’t all that surprising. Connectivity is really bad, cell phones are really big. During the OTT, though, the first cable made it to Kenya, so in terms of connectivity, things are changing. Fonts in Africa are mostly based on Latin script, so there’s not too much to do there, though a few characters usually need fixing. At least for web content, downloadable fonts offer a smooth upgrade path. In terms of technical abilities, a lot of the techies for African languages end up in Europe or the US and only occasionally visit home. For actual translators, there isn’t enough work to actually make a living off it, so you likely end up with part-time night shifters. For many people with access to computers and internet, localization is a good thing, but not something on their own list of priorities, which leaves us with a rather small potential community there. Localizing really obvious things like cell phones or Firefox is a good way to start off a community, though. I’ve had some off-track discussions with Dwayne on how to work together with the ANLoc project he’s running, too.

The discussion about open corpora to be used for linguistic research and statistical machine translation training was OK, but not of that much interest for Mozilla. It’s a good thing to do, and if we can help by asking the right people, that’d be cool. There are tons of politics to resolve first, though, and they have enough folks for the initial group.

The next round of speed geeking had me on the consumer side. I already mentioned that you shouldn’t sit in front of the laptop that you use for presenting. John talked about Transifex, which is designed to be a system that bridges various version control systems for localizers, by having write access itself to the upstream repos. They’re starting to offer an interface to actually translate a few strings in place, which they reuse from somewhere; it’s not Pootle code, though. That was the one with the most immediate touch point to what we do.

The last session for me was one driven by Dwayne again, closing the loop. We tried to find out how to get feedback from the localizers into tools, and into the software they localize. This was pretty interesting, thanks to the input from Rohana and Gisela; the two are actually localizers and could give us hints at what they do and how. The main takeaway was that localizers and l10n tool authors don’t talk enough to each other. Gisela, Dwayne and I have a follow-up conference in our heads to actually do that, I’ll talk about that in a different post. The other main point was that we need to get tools to support “l10n briefs” and annotations, and need to establish ways for that information to be exchanged. A localization brief might be something like a file-wide localization note that explains what the context for these strings is. Or that it’s about XSLT error messages, which you should leave in English unless you have a thriving local community in your language on that technology. Annotations are more diverse, and serve both to communicate among localization teams and back to the original author. The idea is to create a system that allows localizers to communicate about a particular string or set of strings in an easier fashion than using hg blame to find the bug, and then having to read through all of the bug to find out how to reproduce a problem. We might want to have annotations as simple as “star a string”. If that’s enough to signal that a string is tricky, someone else can go in and offer help or a more constructive annotation beyond “I didn’t get it”. How to communicate that back and forth is another follow-up project from this session.

Adam Hyde ran a book sprint on open translation tools alongside all the sessions, with a real face-to-face book sprinting event that closes today. It’s going to be interesting to see what that comes down to. As I suck at writing (you can tell by reading this post), I didn’t participate in that one myself. There is a version on the net already on flossmanuals.net.

So much for the actual sessions. As always, floor communication was essential, too. I made contact with folks from the Tajik, Khmer, and Nepali localization efforts for Firefox, and there’s already traction on some. If you know someone willing to help with Nepali, please make them introduce themselves in m.d.l10n. I have met a ton of other interesting people, of course. I had some really great conversations with Dwayne on a bunch of different topics, ranging from technical bits in tools to mission statements. Generally, there was a lot of interest in Mozilla, and how we do things. Thanks to Aspiration for inviting me, and thanks to all the people at OTT for the warm welcome to this new community for us.

Last but not least, thanks to Mozilla. In environments like OTT it becomes really obvious how rare organisations like Mozilla are. We had a lot of discussion on how hard it is to do localization as an afterthought, and we just don’t do that. And on how valuable it is for the localization community to get acknowledged. Which happens throughout Mozilla, pretty independently of whether it’s John and Mitchell most anywhere they talk, or our developers fixing their patches to have a prettier localization note, or our marketing folks empowering our local communities to localize the message. And we’re still learning and eager to get better. It is an honor to represent such an organization.

Pictures in this post are by Lena under CC by-nc-nd.

June 12, 2009

150/2=73

Filed under: L10n, Mozilla — Axel Hecht @ 9:51 am

Our brave build folks have cut the tags on Firefox 3.5 RC 1, and I figured I’d give a little feedback on that from the l10n side.

As RC 1 was based on new strings, we required each localization to sign off on the status of their localization being ready for release. We’re still doing this by opening what we call an “opt-in thread”, a message sent to .l10n after the last l10n-impact landing, to which localizers reply with a link to the revision of their localization that is good to go. Part of that communication is the message about when code-freeze is planned to be, and the message that plans don’t always work out. So we’re keeping the opt-in thread open actually up to the point where we really kick off the builds.

The output of that process is two files which control our release automation process, shipped-locales and l10n-changesets. For the curious: we track which locales ship on which platforms in the first, which is part of the code repo, and which locales ship which hg revision in the second, which is in the buildbot-configs repo.
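
To give a rough idea of what those look like (locales and revisions made up for illustration): shipped-locales has one locale per line, optionally restricted to particular platforms, and l10n-changesets pairs each locale with the hg revision to ship. Roughly:

shipped-locales:
    de
    ja linux win32
    ja-JP-mac osx

l10n-changesets:
    de 0123456789ab
    ja 89abcdef0123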

The whole process led to 150 different opt-in sourcestamps, which came in either as public replies in the newsgroup, or as private mails in my inbox (or both). I pick those up, and click on some buttons on a version of the l10n dashboard running on my home machine, review the changes to previous sign-offs (yes, I do have a web interface that does comparisons between two revisions in hg), and accept or reject the sign-off. If I reject, I follow up in .l10n with why I did that. That adds up to 159 posts in that thread, by 72 authors. Depending on how imminent the release is, or seems to be, I “back up” my local data by attaching files to the tracking bug. This led to one version of shipped-locales, and a whopping 16 versions of l10n-changesets. Or, in short …

<bhearsum|afk> Pike: when should i expect an updated l10n-changestets?
<Pike> bhearsum: …
<Pike>
<Pike>
<Pike> now
<bhearsum> heh
<bhearsum> cool!

What’s really cool here is that we’re actually at a point where we pick up improvements to our localization up to the last minute, with tools that make us feel comfortable about that, and with a release environment that is able to digest all that noise and produce builds for 73 localizations in a matter of a few hours.

73

June 10, 2009

Running into builds, just testing

Filed under: L10n, Mozilla — Axel Hecht @ 10:29 am

I’ve blogged previously on how to set up a staging environment to test the l10n build system, but I didn’t go into any detail on how to actually do builds in that setup. That shall be fixed.

I’m picking you up at the point where

twistd get-pushes -n -t 1 -s l10n_site.settings

is running stable. You probably want to tail -f twistd.log to follow what it’s doing. This piece is going to watch the hg repositories in stage/repos and feed all the pushes to those into the db. Next is to make sure that you actually get builds.

The first thing you need to do is to configure the l10n-master to access the same database as the django-site. Just make sure that the DATABASE_* settings in l10n-master/settings.py and django-site/l10n_site/settings.py match (there’s a sketch of the relevant settings a bit further down). The other setting to sync is REPOSITORY_BASE, which needs to match in both settings.py files. I suggest setting that to an empty dir next to the l10n-master. I wouldn’t use the stage/repos dir, mostly because I didn’t test that. Now you set up the master for buildbot, by running

buildbot create-master .

inside the l10n-master clone. The next thing is to create a slave dir, which is best put next to the l10n-master. Despite what buildbot usually likes, this slave should be on the same machine that the rest is running on.

mkdir slave
buildbot create-slave slave localhost:9021 cs pwd
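
For reference, the bits that have to agree between the two settings.py files could look roughly like this; the values are illustrative only, in the flat DATABASE_* style mentioned above:

# the same values need to appear in both l10n-master/settings.py
# and django-site/l10n_site/settings.py
DATABASE_ENGINE = 'sqlite3'
DATABASE_NAME = '/home/you/stage/l10n.db'
DATABASE_USER = ''
DATABASE_PASSWORD = ''
DATABASE_HOST = ''
DATABASE_PORT = ''

# an empty dir next to the l10n-master, not stage/repos
REPOSITORY_BASE = '/home/you/repository-base'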

So much for the general setup. With the twistd daemon running to get the pushes, you can now switch on the master and the slave, by doing a buildbot start . in both dirs. I suggest that you keep a tail -f twistd.log running on the master. If you set things up to track the upstream repositories, the order I use is: start the master, then the slave, and once both are up fine, start the twistd daemon for get-pushes.

Now let’s trigger some builds:

Open stage/workdir/mozilla/browser/locales/en-US/file.properties in your favorite editor, and make a modification. I suggest just doing a whitespace edit, or changing the value of an existing entity, as that is going to keep your localizations green. Check the change in to the working clone, and then push. The get-pushes output should show that it got a new push, and then on the next cycle, feed the push into the database. You’ll notice this by the output of an hg pull showing up in the log. On the next poll of the l10n-master, builds for all 4 locales should trigger. You should see an update of four builds on the waterfall, and 4 locales on the test tree on the local dashboard.

June 5, 2009

got l10n builds

Filed under: L10n, Mozilla — Axel Hecht @ 3:22 am

’nuff said? Not even remotely.

We’ve had l10n builds for as long as I’ve been working on l10n; actually, I got involved around the time when we started to do them upstream. They were always considerably better than each localizer doing their build at home on whatever (virus-infected) hardware they found, with help from other community members for the platforms they didn’t have. But in the light of day, it was more

That? Yeah, I know. That’s crap.

And I know you can hear my voice in your head right now :-).

Those days are gone. For a few days now, we’ve been running Firefox and Fennec builds on the releng infrastructure that are actually sound builds, made to serve our l10n community. Some highlights:

  • Builds are finished some 10 minutes after a localizer landing, on all three platforms.
  • There’s no deadlock between different locales, thanks to all l10n builds running on a pool of slaves.
  • Builds are “l10n-merged”, against the actual build that’s repackaged. Independent of missing strings or files, you have a build that can be tested.
  • No more race conditions between nightly and trunk source status.

The impact of this shouldn’t be underestimated. We are, for the first time in years, producing builds that allow a localizer to actually test immediately. Localizers can work incrementally: translate one feature, check in, test. No worries if something landed in en-US in the meantime, or whatnot. With the new builds, I have seen various localizers coming from hundreds of missing strings to a tested build on two or three platforms in a matter of a few hours. Back in the day, that was the waiting time for the first build. The new locales all pull all-nighters to get their final bits in. They want to, and now they actually can.

I want to thank coop and armenzg for their great help in making this happen, and aki for porting it over to Fennec. Of course thanks go to joduinn and sethb, too, for bearing with the ongoing meetings we have, trying to battle the crap down. To dynamis for the initial work on l10n-merge. Also thanks to bsmedberg and Chase for the initial work on both automation and build process, and ted for the various reviews on making our build system catch up.

Finally, we’re not going to stop here. Armen is working on creating the necessary files to get l10n builds on a nightly update channel. Yep, you heard right, that’s where we are right now. I know that KaiRo is working on getting the goodness over to the comm-central apps. And yours truly is hacking on the dashboard together with gandalf, more on that in a different post.

June 2, 2009

Searching l10n

Filed under: L10n, Mozilla — Axel Hecht @ 7:20 am

I’m contemplating adding search in l10n to the dashboard, and I figured I’d put my thoughts out for lazyweb super-review.

Things we might want to search for:

  • Localized strings
    • in a particular locale
    • in all locales
    • going into a particular app
  • entity names
    • in all of the above

As with the rest of the dashboard, I’d favour a pythonic solution. I’ve run across Whoosh, which seems to offer me what I’d need. In particular I can mark up searches in just keys or just values of our localized strings with the Schemas it offers.

All of this sounds pretty neat and contained; I’m just wondering if there’s something cool and shiny elsewhere that I’m missing, or if someone came back with “ugh, sucks” from trying Whoosh.

Ad-hoc design for the curious:

For each changeset, we’d parse the old and the new version of the file, getting a list of keys and values, and I’d create two searchable TEXT entries (one for keys, one for values) for all changed and added entries. We’d tag that “document” with path, locale, apps, revision, and branch. That way, you could search even for strings that aren’t currently in the tip, and get a versioned link to where a string showed up first, and possibly last. Given that we have a lot of data and history, I wouldn’t be surprised if that corpus got large pretty quickly. I’d expect to index not only l10n but en-US, too. Thoughts?
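
To make the ad-hoc design a bit more concrete, a Whoosh schema along those lines might look like the sketch below. Field names and values are made up; the idea is one searchable TEXT field for keys and one for values, plus the tags mentioned above:

import os
from whoosh import index
from whoosh.fields import Schema, TEXT, ID, KEYWORD

# sketch only -- field names are illustrative
schema = Schema(key=TEXT(stored=True),
                value=TEXT(stored=True),
                path=ID(stored=True),
                locale=ID(stored=True),
                apps=KEYWORD(stored=True),
                revision=ID(stored=True),
                branch=ID(stored=True))

if not os.path.isdir("search-index"):
    os.mkdir("search-index")
ix = index.create_in("search-index", schema)

writer = ix.writer()
writer.add_document(key=u"someEntity.label",
                    value=u"...",  # the localized string would go here
                    path=u"browser/chrome/browser/browser.properties",
                    locale=u"de",
                    apps=u"browser",
                    revision=u"0123456789ab",
                    branch=u"mozilla-1.9.2")
writer.commit()

# searching just the values (or, with "key", just the entity names):
from whoosh.qparser import QueryParser
searcher = ix.searcher()
results = searcher.search(QueryParser("value", ix.schema).parse(u"Plugin"))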

