Axel Hecht Mozilla in Your Language

June 2, 2009

Searching l10n

Filed under: L10n, Mozilla — Axel Hecht @ 7:20 am

I’m contemplating adding search in l10n to the dashboard, and I figured I’d put my thoughts out for lazyweb super-review.

Things we might want to search for:

  • Localized strings
    • in a particular locale
    • in all locales
    • going into a particular app
  • entity names
    • in all of the above
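To make those scopes concrete, here is a rough sketch in plain Python (not Whoosh, and not anything the dashboard has today). The `Entry` record, the `search` function and the sample strings are all made up for illustration; the only point is that "which field", "which locale" and "which app" are independent filters.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    """Hypothetical record for one localized string."""
    key: str          # entity name
    value: str        # localized string
    locale: str       # locale code, e.g. "de"
    apps: frozenset   # apps the string goes into


def search(entries, term, *, field="value", locale=None, app=None):
    """Case-insensitive substring search over keys or values.

    locale=None means all locales, app=None means all apps.
    """
    if field not in ("key", "value"):
        raise ValueError("field must be 'key' or 'value'")
    term = term.lower()
    hits = []
    for entry in entries:
        if locale is not None and entry.locale != locale:
            continue
        if app is not None and app not in entry.apps:
            continue
        if term in getattr(entry, field).lower():
            hits.append(entry)
    return hits


# Made-up sample data, just to exercise the filters.
ENTRIES = [
    Entry("bookmarks.title", "Lesezeichen", "de", frozenset({"browser"})),
    Entry("bookmarks.title", "Marque-pages", "fr", frozenset({"browser"})),
    Entry("compose.title", "Verfassen", "de", frozenset({"mail"})),
]

in_one_locale = search(ENTRIES, "lesezeichen", locale="de")
by_entity_name = search(ENTRIES, "title", field="key")
by_entity_and_app = search(ENTRIES, "title", field="key", app="mail")
```

A real index would tokenize instead of doing substring matches, but the combination of filters would be the same.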

As with the rest of the dashboard, I’d favour a pythonic solution. I’ve run across Whoosh, which seems to offer what I’d need. In particular, the Schemas it offers would let me restrict a search to just the keys or just the values of our localized strings.

All sounds pretty neat and contained, I’m just wondering if there’s something cool and shiny elsewhere that I’m missing, or if someone came back with “ugh, sucks” from trying Whoosh.

Ad-hoc design for the curious:

For each changeset, we’d parse the old and the new version of each file into a list of keys and values, and I’d create two searchable TEXT entries, one for the key and one for the value, for every changed or added string. We’d tag that “document” with path, locale, apps, revision, and branch. That way, you could search even for strings that aren’t currently in the tip, and get a versioned link to where a string first showed up, and possibly where it last did. Given that we have a lot of data and history, I wouldn’t be surprised if that corpus got large pretty quickly. I’d expect to index not only l10n but en-US, too. Thoughts?
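A minimal sketch of the per-changeset step, under a few assumptions: the toy `parse_properties` only understands simple `key = value` lines (real l10n files would need proper parsers), and `StringDoc` and `docs_for_changeset` are hypothetical names, not existing dashboard code. It only shows which entries would become “documents” and what tags they would carry; feeding them into an actual index is left out.

```python
from dataclasses import dataclass


def parse_properties(text):
    """Parse simple `key = value` lines into a dict, skipping # comments."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        entries[key.strip()] = value.strip()
    return entries


@dataclass(frozen=True)
class StringDoc:
    """One searchable document: key and value plus the tags from the design."""
    key: str
    value: str
    path: str
    locale: str
    apps: tuple
    revision: str
    branch: str


def docs_for_changeset(old_text, new_text, *, path, locale, apps, revision, branch):
    """Return one StringDoc per entry that was added or changed."""
    old = parse_properties(old_text)
    new = parse_properties(new_text)
    docs = []
    for key, value in new.items():
        # A missing key yields None, so added entries are caught here too.
        if old.get(key) != value:
            docs.append(
                StringDoc(key, value, path, locale, tuple(apps), revision, branch)
            )
    return docs


# Made-up file contents and revision id, for illustration only.
old_version = "title = Lesezeichen\nclose = Schliessen\n"
new_version = "title = Lesezeichen verwalten\nclose = Schliessen\nopen = Oeffnen\n"

docs = docs_for_changeset(
    old_version,
    new_version,
    path="browser/example.properties",
    locale="de",
    apps=["browser"],
    revision="abc123",
    branch="default",
)
```

Here `docs` holds the changed `title` and the added `open`, while the untouched `close` is skipped. Indexing only changed and added entries is what would keep the corpus from growing with every changeset that leaves a string alone.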

3 Comments

  1. So, it’s basically an MXR, but across all revisions? If so, then yes, please! :)

    Two suggestions:
    – make it easy to search only in the tip
    – make it aware of the opt-in revisions for milestones. It would be awesome to be able to query for translations of a given entity in a given milestone for every locale (especially for things in region.properties and searchplugins, you guessed it)

    Comment by Staś Małolepszy — June 4, 2009 @ 2:14 am

  2. Not sure how easy it would be to search in a particular revision.

    I’m kinda torn on whether each entity needs to be in its own document. Your use cases make it sound more likely that it does. We’ll probably just need to try and see if that’s feasible in terms of index size and performance.

    Comment by Axel Hecht — June 4, 2009 @ 2:42 am

    I wonder if it would make sense to optimize this only for opt-in revisions. I have to admit that I’m not sure what this would look like, but I have a feeling that searching in or between milestones would be more common than on or between specific dates or revision ids. And by taking only opt-in revisions we would radically limit the size of the data to store.

    Comment by Staś Małolepszy — June 4, 2009 @ 6:49 am
