Mozilla L10n’s Google Summer of Code project

The annual Google Summer of Code (GSoC) event attracts some of the brightest minds from around the world to work on projects they are passionate about with the help of a mentor and Google’s support. This year, Mozillian localizer Gautam Akiwate had an idea for a project to help standardize a team’s localization work by leveraging its already localized content in a distinctive, open source way: using machine translation (MT). MT has been a controversial topic between hackers and translators, as many have seen it as an alternative to human translation. Gautam, however, sought to use MT as a supplement and a jump start for new L10n teams.

I had the chance to talk with Gautam about his ambitious project. Here is our conversation.

Gautam Akiwate

Started with Mozilla project: Started contributing to Mozilla in March. Not very long ago 🙂

Nationality: India (Pune, Maharashtra)

Languages: Marathi, Hindi, English

Background: Studied Information Technology at the College of Engineering, Pune (equivalent to Computer Science & Engineering elsewhere)

Role in L10n community: A new localizer 🙂

What inspired you to start this project and submit it to GSoC?

I met Arky in India and I talked with him about the need to help localize into Indic languages and getting involved with the L10n program. I began localizing Firefox into Marathi. I soon discovered linguistic ambiguity issues with terminology and it drove me mad! The Marathi team’s reviewers had to make a lot of corrections to my work because of terminology & orthographical ambiguity as well as a lack of standardization within the L10n team. Even those orthographic standards that existed weren’t part of a central resource for newcomers to refer to. I had to constantly refer back to previous localizations to find those standards, which doubled the time it took for me to localize strings accurately. As you can imagine, I became very interested in leveraging this previous work.

So I went to the #l10n channel and discussed this problem with Pike. I told him that I wanted to do something that would help me develop a termbase for L10n standardization. At the time, I had not considered MT. Pike referred me to Phillipe (French Mozilla) to leverage the content from the Transvision alignments. That idea transformed into using MT as an alignment tool and then utilizing the alignments to create a standard termbase. In other words, I would take target and source content, align them, and create a termbase from the alignment, which would then provide terminology and orthography suggestions. I quickly realized that we could also use this tool to rate the quality of previous localizations, as well as localizations in other projects. If successful, we could begin bringing MT into the localization workflow, opening up an MT post-editing workflow within the Mozilla L10n project.

Around this time, GSoC came up and I saw the opportunity to do this under GSoC with Mozilla.

What is the current state of the project?

The project is not yet complete, but hopefully will be by the end of August.

What experience do you have with machine translation?

I’m a CS student. My only prior experience with MT was while in college. Since I had very limited experience with MT, there was a steep learning curve. But I felt so passionate about using MT in the L10n workflow that I didn’t have any issues with learning it. Since the project started, I’ve become very familiar with MT.

What resources did you use to learn about machine translation?

I found Google Scholar articles on MT, I evaluated existing Mozilla L10n tools, and I read the Moses MT wiki. I also looked into data on which tools are most widely used.

I didn’t, however, spend much time looking into proprietary tools because of licensing issues. It can be really complicated to get tangled up in. Besides, I’m not a fan of proprietary tools anyway :).

How did you hear about the Moses MT project and why did you choose it over other MT systems?

I looked at the range of MT tools available and saw that Moses MT was open source and what I deemed to be the best out there. I also looked into incorporating termbases into MozillaTranslator, and it had a lot of potential. [Another thing that led me to Moses is that] there’s a lot of community support (it’s still a very active project) and it supports more languages than most of the other open source MT projects.

Since your aim was essentially to create a large termbase, why did you decide to use MT to index terminology within the Transvision TMX instead of using a term extractor script and creating a TBX (i.e., what were the advantages of using MT over standard TBX)?

At first I wanted to create scripts that worked like term extractors and then create TBXs (TermBase eXchange files), but MT has a wider scope and offers a better framework for future use than scripted term extractors. With MT, you can reduce your workload through its auto-translate function. Besides, I wasn’t very happy with the term extractor scripts’ results. While running them, I realized that I lacked a sufficient volume of translated strings. I wanted to give the strings to the MT system and let it learn the translations instead. Not only that, but the existing term extractors don’t work well with Indic languages. MT supports a wider range of characters and is simply more scalable.

MT, however, is not yet mature enough to work without manual intervention. Plus, the community needs to have a say in it. I wanted to provide an option and let them choose either to fully implement MT or to use a TBX file created from the MT utility.

What were some of your biggest challenges in this project?

Moses MT is still not very mature and I had installation issues with it. Some people have worked on creating installation packages for Linux (Debian, Fedora, and Ubuntu). Ubuntu seems to be the best supported platform for Moses and natural language processing tools.

MT also requires a lot of existing data. It was difficult to find enough data and convert it into a format that Moses could understand. Unfortunately, the amount of suitable Mozilla L10n data for this type of conversion is very limited. Moses needs this data in what’s known as a parallel corpus. I found that .lang files were the easiest to convert into a parallel corpus.
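To make the conversion concrete, here is a minimal sketch of turning a .lang file into the two line-aligned lists that a parallel corpus needs. It is not Gautam’s actual script. It assumes the common .lang layout in which a line starting with ";" holds the source string, the following line holds the translation, "#" lines are comments, and a trailing "{ok}" marks a translation that intentionally matches the source. The function name lang_to_parallel is hypothetical.

```python
def lang_to_parallel(text):
    """Return (sources, targets): two lists aligned index by index.

    Assumed .lang layout (an assumption, not taken from the interview):
      ;Source string
      Translated string
    with '#' comment lines and an optional trailing '{ok}' marker.
    """
    sources, targets = [], []
    pending = None  # source string still waiting for its translation line
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith(";"):
            # A new source string; an earlier one without a translation is dropped.
            pending = line[1:].strip()
            continue
        if pending is not None:
            if line.endswith("{ok}"):
                line = line[: -len("{ok}")].strip()  # remove the marker
            if pending and line:
                sources.append(pending)
                targets.append(line)
            pending = None
    return sources, targets
```

Writing each list to its own file, one string per line (for example the hypothetical names corpus.en and corpus.mr), gives the sentence-aligned pair of files that MT training tools such as Moses expect.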

Special characters were also an issue when converting files into parallel corpora. The challenge is basically to extract the data from .lang, .properties, and .dtd files, and only the .lang files converted well enough. I even ran into problems with .tmx files that contained special characters (e.g., brackets). I tried creating other data without going to the trouble of converting those files, but it would have required a lot of time. Most of the data that Mozilla has created is not suitable for MT.
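One common way to handle such characters is to escape them before training. The sketch below is an illustration only, not the approach used in the project. The replacement table is modeled on what the escaping script shipped with Moses is understood to do (treat the exact mapping as an assumption), and the names MOSES_ESCAPES and escape_for_moses are hypothetical.

```python
# Characters that tend to clash with MT training formats, mapped to entities.
# The ampersand must be handled first so later entities are not re-escaped.
MOSES_ESCAPES = [
    ("&", "&amp;"),
    ("|", "&#124;"),
    ("<", "&lt;"),
    (">", "&gt;"),
    ("'", "&apos;"),
    ('"', "&quot;"),
    ("[", "&#91;"),
    ("]", "&#93;"),
]


def escape_for_moses(segment):
    """Return the segment with each special character replaced by its entity."""
    for char, entity in MOSES_ESCAPES:
        segment = segment.replace(char, entity)
    return segment
```

Applied to both sides of the corpus, this keeps brackets and markup-like characters from being misread as annotations during training. The same table can be reversed to restore the original characters in the output.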

Finally, I found that I cannot speak enough languages to really know if the resulting termbase data was good enough!

If you would like to learn more about Gautam’s GSoC project, visit the project’s wiki page.
