{"id":1653,"date":"2024-02-07T10:55:23","date_gmt":"2024-02-07T10:55:23","guid":{"rendered":"https:\/\/blog.mozilla.org\/l10n\/?p=1653"},"modified":"2024-02-23T15:33:04","modified_gmt":"2024-02-23T15:33:04","slug":"a-deep-dive-into-the-evolution-of-pretranslation-in-pontoon","status":"publish","type":"post","link":"https:\/\/blog.mozilla.org\/l10n\/2024\/02\/07\/a-deep-dive-into-the-evolution-of-pretranslation-in-pontoon\/","title":{"rendered":"A Deep Dive Into the Evolution of Pretranslation in Pontoon"},"content":{"rendered":"<p>Quite often, an imperfect translation is better than no translation. So why even publish untranslated content when high-quality machine translation systems are fast and affordable? Why not immediately machine-translate content and progressively ship enhancements as they are submitted by human translators?<\/p>\n<p>At Mozilla, we call this process <i>pretranslation<\/i>. We began implementing it in Pontoon before COVID-19 hit, thanks to <a href=\"https:\/\/www.linkedin.com\/in\/vishalol\/\">Vishal<\/a> who landed the first patches. Then we caught some headwinds and didn\u2019t make much progress until 2022 after receiving a significant development boost and finally launched it for the general audience in September 2023.<\/p>\n<p>So far, 20 of our localization teams (locales) have opted to use pretranslation across 15 different localization projects. Over 20,000 pretranslations have been submitted and none of the teams have opted out of using it. These efforts have resulted in a higher translation completion rate, which was one of our main goals.<\/p>\n<p>In this article, we\u2019ll take a look at how we developed pretranslation in Pontoon. Let\u2019s start by exploring how it actually works.<\/p>\n<h2>How does pretranslation work?<\/h2>\n<p>Pretranslation is enabled upon a team\u2019s request (it\u2019s off by default). 
When a new string is added to a project, it gets automatically pretranslated using a 100% match from translation memory (TM), which also includes translations of glossary entries. If a perfect match doesn\u2019t exist, a locale-specific machine translation (MT) engine is used, trained on the locale\u2019s translation memory.<\/p>\n<div id=\"attachment_1656\" style=\"width: 2052px\" class=\"wp-caption alignnone\"><a href=\"https:\/\/blog.mozilla.org\/l10n\/files\/2024\/02\/Screenshot-2024-01-31-at-20.35.08.png\"><img aria-describedby=\"caption-attachment-1656\" decoding=\"async\" loading=\"lazy\" class=\"wp-image-1656 size-full\" src=\"https:\/\/blog.mozilla.org\/l10n\/files\/2024\/02\/Screenshot-2024-01-31-at-20.35.08.png\" alt=\"Pretranslation opt-in form\" width=\"2042\" height=\"2266\" srcset=\"https:\/\/blog.mozilla.org\/l10n\/files\/2024\/02\/Screenshot-2024-01-31-at-20.35.08.png 2042w, https:\/\/blog.mozilla.org\/l10n\/files\/2024\/02\/Screenshot-2024-01-31-at-20.35.08-252x280.png 252w, https:\/\/blog.mozilla.org\/l10n\/files\/2024\/02\/Screenshot-2024-01-31-at-20.35.08-600x666.png 600w, https:\/\/blog.mozilla.org\/l10n\/files\/2024\/02\/Screenshot-2024-01-31-at-20.35.08-768x852.png 768w, https:\/\/blog.mozilla.org\/l10n\/files\/2024\/02\/Screenshot-2024-01-31-at-20.35.08-1384x1536.png 1384w, https:\/\/blog.mozilla.org\/l10n\/files\/2024\/02\/Screenshot-2024-01-31-at-20.35.08-1846x2048.png 1846w\" sizes=\"(max-width: 2042px) 100vw, 2042px\" \/><\/a><p id=\"caption-attachment-1656\" class=\"wp-caption-text\"><i>Pretranslation <\/i><a href=\"https:\/\/pontoon.mozilla.org\/sl\/\"><i>opt-in form<\/i><\/a><i>.<\/i><\/p><\/div>\n<p>After pretranslations are retrieved and saved in Pontoon, they get synced to our primary localization storage (usually a GitHub repository) and hence immediately made available for shipping. Unless they fail our quality checks. 
In that case, they don\u2019t propagate to repositories until errors or warnings are fixed during the review process.<\/p>\n<p>Until reviewed, pretranslations are visually distinguishable from user-submitted suggestions and translations. This makes post-editing much easier and more efficient. Another key factor that influences pretranslation review time is, of course, the quality of pretranslations. So let\u2019s see how we picked our machine translation provider.<\/p>\n<h2>Choosing a machine translation engine<\/h2>\n<p>We selected the machine translation provider based on two primary factors: quality of translations and the number of supported locales. To make translations match the required terminology and style as much as possible, we were also looking for the ability to fine-tune the MT engine by training it on our translation data.<\/p>\n<p>In March 2022, we compared Bergamot, Google\u2019s Cloud Translation API (generic), and Google\u2019s AutoML Translation (with custom models). 
Using these services we translated a collection of 1,000 strings into 5 locales (it, de, es-ES, ru, pt-BR), and used automated scores (<a href=\"https:\/\/en.wikipedia.org\/wiki\/BLEU\">BLEU<\/a>, <a href=\"https:\/\/github.com\/m-popovic\/chrF\">chrF++<\/a>) as well as manual evaluation to compare them with the actual translations.<\/p>\n<div id=\"attachment_1658\" style=\"width: 1788px\" class=\"wp-caption alignnone\"><a href=\"https:\/\/blog.mozilla.org\/l10n\/files\/2024\/02\/performance.png\"><img aria-describedby=\"caption-attachment-1658\" decoding=\"async\" loading=\"lazy\" class=\"wp-image-1658 size-full\" src=\"https:\/\/blog.mozilla.org\/l10n\/files\/2024\/02\/performance.png\" alt=\"Performance of tested MT engines for Italian (it).\" width=\"1778\" height=\"1056\" srcset=\"https:\/\/blog.mozilla.org\/l10n\/files\/2024\/02\/performance.png 1778w, https:\/\/blog.mozilla.org\/l10n\/files\/2024\/02\/performance-252x150.png 252w, https:\/\/blog.mozilla.org\/l10n\/files\/2024\/02\/performance-600x356.png 600w, https:\/\/blog.mozilla.org\/l10n\/files\/2024\/02\/performance-768x456.png 768w, https:\/\/blog.mozilla.org\/l10n\/files\/2024\/02\/performance-1536x912.png 1536w\" sizes=\"(max-width: 1778px) 100vw, 1778px\" \/><\/a><p id=\"caption-attachment-1658\" class=\"wp-caption-text\"><i>Performance of tested MT engines for Italian (it).<\/i><\/p><\/div>\n<p>Google\u2019s AutoML Translation outperformed the other two candidates in virtually all tested scenarios and metrics, so it became the clear choice. It supports over 60 locales. Google\u2019s Generic Translation API supports twice as many, but we currently don\u2019t plan to use it for pretranslation in locales not supported by Google\u2019s AutoML Translation.<\/p>\n<h2>Making machine translation actually work<\/h2>\n<p>Currently, around 50% of pretranslations generated by Google\u2019s AutoML Translation get approved without any changes. For some locales, the rate is around 70%. 
Keep in mind however that machine translation is only used when a perfect translation memory match isn\u2019t available. For pretranslations coming from translation memory, the approval rate is 90%.<\/p>\n<div id=\"attachment_1657\" style=\"width: 1970px\" class=\"wp-caption alignnone\"><a href=\"https:\/\/blog.mozilla.org\/l10n\/files\/2024\/02\/Posnetek-zaslona-2024-01-31-ob-20.29.29.png\"><img aria-describedby=\"caption-attachment-1657\" decoding=\"async\" loading=\"lazy\" class=\"wp-image-1657 size-full\" src=\"https:\/\/blog.mozilla.org\/l10n\/files\/2024\/02\/Posnetek-zaslona-2024-01-31-ob-20.29.29.png\" alt=\"Comparison of pretranslation approval rate between teams.\" width=\"1960\" height=\"910\" srcset=\"https:\/\/blog.mozilla.org\/l10n\/files\/2024\/02\/Posnetek-zaslona-2024-01-31-ob-20.29.29.png 1960w, https:\/\/blog.mozilla.org\/l10n\/files\/2024\/02\/Posnetek-zaslona-2024-01-31-ob-20.29.29-252x117.png 252w, https:\/\/blog.mozilla.org\/l10n\/files\/2024\/02\/Posnetek-zaslona-2024-01-31-ob-20.29.29-600x279.png 600w, https:\/\/blog.mozilla.org\/l10n\/files\/2024\/02\/Posnetek-zaslona-2024-01-31-ob-20.29.29-768x357.png 768w, https:\/\/blog.mozilla.org\/l10n\/files\/2024\/02\/Posnetek-zaslona-2024-01-31-ob-20.29.29-1536x713.png 1536w\" sizes=\"(max-width: 1960px) 100vw, 1960px\" \/><\/a><p id=\"caption-attachment-1657\" class=\"wp-caption-text\"><i>Comparison of <\/i><a href=\"https:\/\/pontoon.mozilla.org\/insights\/\"><i>pretranslation approval rate<\/i><\/a><i> between teams.<\/i><\/p><\/div>\n<p>To reach that approval rate, we had to make a series of adjustments to the way we use machine translation.<\/p>\n<p>For example, we convert multiline messages to single-line messages before machine-translating them. 
Otherwise, each line is treated as a separate message and the resulting translation is of poor quality.<\/p>\n<p><i>Multiline message:<\/i><\/p>\n<pre style=\"background: #dfdfdf; padding: 15px; border-radius: 10px;\">Make this password unique and different from any others you use.\r\nA good strategy to follow is to combine two or more unrelated\r\nwords to create an entire pass phrase, and include numbers and symbols.<\/pre>\n<p><i>Multiline message converted to a single-line message:<\/i><\/p>\n<p style=\"background: #dfdfdf; padding: 15px; border-radius: 10px;\">Make this password unique and different from any others you use. A good strategy to follow is to combine two or more unrelated words to create an entire pass phrase, and include numbers and symbols.<\/p>\n<p>Let\u2019s take a closer look at two of the more time-consuming changes.<\/p>\n<p>The first one is specific to our machine translation provider (Google\u2019s AutoML Translation). During initial testing, we noticed it would often take a long time for the MT engine to return results, up to a minute. Sometimes it even timed out! Such a long response time not only slows down pretranslation, it also makes machine translation suggestions in the translation editor less useful &#8211; by the time they appear, the localizer has already moved on to the next string.<\/p>\n<p>After further testing, we began to suspect that our custom engine shuts down after a period of inactivity, thus requiring a cold start for the next request. We contacted support and our assumption was confirmed. 
To overcome the problem, we were advised to send a dummy query to the service every 60 seconds just to keep the system alive.<\/p>\n<div id=\"attachment_1655\" style=\"width: 280px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/blog.mozilla.org\/l10n\/files\/2024\/02\/giphy.gif\"><img aria-describedby=\"caption-attachment-1655\" decoding=\"async\" loading=\"lazy\" class=\"wp-image-1655 size-full\" src=\"https:\/\/blog.mozilla.org\/l10n\/files\/2024\/02\/giphy.gif\" alt=\"Giphy: Oh No Wow GIF by Little Princess Ember\" width=\"270\" height=\"480\" \/><\/a><p id=\"caption-attachment-1655\" class=\"wp-caption-text\">Image source: <a href=\"https:\/\/giphy.com\/gifs\/xQyL1JfcCjGu3hUm1A\">Giphy<\/a>.<\/p><\/div>\n<p>Of course, it\u2019s reasonable to shut down inactive services to free up resources, but the way to keep them alive isn\u2019t. We have to make (paid) requests to each locale\u2019s machine translation engines every minute just to make sure they work when we need them. And sometimes even that doesn\u2019t help &#8211; we still see about a dozen <i>ServiceUnavailable<\/i> errors every day. It would be so much easier if we could just customize the default inactivity period or pay extra for an always-on service.<\/p>\n<p>The other issue we had to address is quite common in machine translation systems: they are not particularly good at <a href=\"https:\/\/issuetracker.google.com\/issues\/119256504?pli=1\">preserving placeholders<\/a>. 
In particular, extra spaces often get added to variables or markup elements, resulting in broken translations.<\/p>\n<p><i>Message with variables:<\/i><\/p>\n<pre style=\"background: #dfdfdf; padding: 15px; border-radius: 10px;\">{ $partialSize } of { $totalSize }<\/pre>\n<p><i>Message with variables machine-translated to Slovenian (adding space after $ breaks the variable):<\/i><\/p>\n<pre style=\"background: #dfdfdf; padding: 15px; border-radius: 10px;\">{$ partialSize} od {$ totalSize}<\/pre>\n<p>We tried to mitigate this issue by wrapping placeholders in &lt;span translate=&quot;no&quot;&gt;&#8230;&lt;\/span&gt;, which tells Google\u2019s AutoML Translation to <a href=\"https:\/\/cloud.google.com\/translate\/troubleshooting\">not translate the wrapped text<\/a>. This approach requires the source text to be submitted as HTML (rather than plain text), which triggers a whole new set of issues, from adding spaces in other places to escaping quotes, and we couldn\u2019t circumvent those either. So this was a dead end.<\/p>\n<p>The solution was to store every placeholder in the <a href=\"https:\/\/cloud.google.com\/translate\/docs\/advanced\/glossary\">Glossary<\/a> with the same value for both the source string and the translation. That approach worked much better and we still use it today. It\u2019s not perfect, though, so we only use it to pretranslate strings for which the default (non-glossary) machine translation output fails our placeholder quality checks.<\/p>\n<h2>Making pretranslation work with Fluent messages<\/h2>\n<p>On top of the machine translation service improvements, we also had to account for the complexity of Fluent messages, which are used by most of the projects we localize at Mozilla. 
<a href=\"https:\/\/projectfluent.org\/\">Fluent<\/a> is capable of expressing virtually any imaginable message, which means it is the localization system you want to use if you want your software translations to sound natural.<\/p>\n<p>As a consequence, the Fluent message format comes with a syntax that allows for expressing such complex messages. And since machine translation systems (as seen above) already have trouble with simple variables and markup elements, their struggles multiply with messages like this:<\/p>\n<pre style=\"background: #dfdfdf; padding: 15px; border-radius: 10px;\">shared-photos =\r\n { $photoCount -&gt;\r\n    [one]\r\n      { $userGender -&gt;\r\n        [male] { $userName } added a new photo to his stream.\r\n        [female] { $userName } added a new photo to her stream.\r\n       *[other] { $userName } added a new photo to their stream.\r\n      }\r\n   *[other]\r\n      { $userGender -&gt;\r\n        [male] { $userName } added { $photoCount } new photos to his stream.\r\n        [female] { $userName } added { $photoCount } new photos to her stream.\r\n       *[other] { $userName } added { $photoCount } new photos to their stream.\r\n      }\r\n  }<\/pre>\n<p>That means Fluent messages need to be pre-processed before they are sent to the pretranslation systems. Only relevant parts of the message need to be pretranslated, while syntax elements need to remain untouched. 
In the example above, we extract the following message parts, pretranslate them, and replace them with pretranslations in the original message:<\/p>\n<ul>\n<li aria-level=\"1\"><i>{ $userName } added a new photo to his stream.<\/i><\/li>\n<li aria-level=\"1\"><i>{ $userName } added a new photo to her stream.<\/i><\/li>\n<li aria-level=\"1\"><i>{ $userName } added a new photo to their stream.<\/i><\/li>\n<li aria-level=\"1\"><i>{ $userName } added { $photoCount } new photos to his stream.<\/i><\/li>\n<li aria-level=\"1\"><i>{ $userName } added { $photoCount } new photos to her stream.<\/i><\/li>\n<li aria-level=\"1\"><i>{ $userName } added { $photoCount } new photos to their stream.<\/i><\/li>\n<\/ul>\n<p>To be more accurate, this is what happens for languages like German, which use the same <a href=\"https:\/\/cldr.unicode.org\/index\/cldr-spec\/plural-rules\">CLDR plural forms<\/a> as English. For locales without plurals, like Chinese, we drop plural forms completely and only pretranslate the remaining three parts. If the target language is Slovenian, two additional plural forms need to be added (two, few), which in this example results in a total of 12 messages needing pretranslation (four plural forms, with three gender forms each).<\/p>\n<p>Finally, the Pontoon translation editor uses a <a href=\"https:\/\/pontoon.mozilla.org\/sl\/firefox\/all-resources\/?string=192297\">custom UI for translating access keys<\/a>. That means it\u2019s capable of detecting which part of the message is an access key and which is the label the access key belongs to. The access key should ideally be one of the characters included in the label, so the editor generates a list of candidates that translators can choose from. 
In pretranslation, the first candidate is directly used as an access key, so no TM or MT is involved.<\/p>\n<div id=\"attachment_1654\" style=\"width: 526px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/blog.mozilla.org\/l10n\/files\/2024\/02\/inter-keyboard-image3.png\"><img aria-describedby=\"caption-attachment-1654\" decoding=\"async\" loading=\"lazy\" class=\"wp-image-1654 size-full\" src=\"https:\/\/blog.mozilla.org\/l10n\/files\/2024\/02\/inter-keyboard-image3.png\" alt=\"A screenshot of Notepad showing access keys in the menu.\" width=\"516\" height=\"319\" srcset=\"https:\/\/blog.mozilla.org\/l10n\/files\/2024\/02\/inter-keyboard-image3.png 516w, https:\/\/blog.mozilla.org\/l10n\/files\/2024\/02\/inter-keyboard-image3-252x156.png 252w\" sizes=\"(max-width: 516px) 100vw, 516px\" \/><\/a><p id=\"caption-attachment-1654\" class=\"wp-caption-text\"><i>Access keys (not to be confused with shortcut keys) are used for accessibility to interact with all controls or menu items using the keyboard. Windows indicates access keys by underlining the access key assignment when the Alt key is pressed. Source: <\/i><a href=\"https:\/\/learn.microsoft.com\/en-us\/windows\/win32\/uxguide\/inter-keyboard\"><i>Microsoft Learn<\/i><\/a><i>.<\/i><\/p><\/div>\n<h2>Looking ahead<\/h2>\n<p>With every enhancement we shipped, the case for publishing untranslated text instead of pretranslations became weaker and weaker. And there\u2019s still room for improvements in our pretranslation system.<\/p>\n<p><a href=\"https:\/\/www.linkedin.com\/in\/ayanaa-rahman\/\">Ayanaa<\/a> has done extensive research on the impact of Large Language Models (LLMs) on translation efficiency. 
She\u2019s now working on integrating LLM-assisted translations into Pontoon\u2019s Machinery panel, from which localizers will be able to request alternative translations, including formal and informal options.<\/p>\n<p>If the target locale could set the tone to formal or informal on the project level, we could benefit from this capability in pretranslation as well. We might also improve the quality of machine translation suggestions by providing existing translations into other locales as references in addition to the source string.<\/p>\n<p>If you are interested in using pretranslation or already use it, we\u2019d love to hear your thoughts! Please leave a comment, reach out to us on <a href=\"https:\/\/chat.mozilla.org\/#\/room\/#pontoon:mozilla.org\">Matrix<\/a>, or <a href=\"https:\/\/github.com\/mozilla\/pontoon\/issues\">file an issue<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Quite often, an imperfect translation is better than no translation. So why even publish untranslated content when high-quality machine translation systems are fast and affordable? 
Why not immediately machine-translate content &hellip; <a class=\"go\" href=\"https:\/\/blog.mozilla.org\/l10n\/2024\/02\/07\/a-deep-dive-into-the-evolution-of-pretranslation-in-pontoon\/\">Read more<\/a><\/p>\n","protected":false},"author":451,"featured_media":1658,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[12691,137,286406,610],"tags":[],"_links":{"self":[{"href":"https:\/\/blog.mozilla.org\/l10n\/wp-json\/wp\/v2\/posts\/1653"}],"collection":[{"href":"https:\/\/blog.mozilla.org\/l10n\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.mozilla.org\/l10n\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.mozilla.org\/l10n\/wp-json\/wp\/v2\/users\/451"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.mozilla.org\/l10n\/wp-json\/wp\/v2\/comments?post=1653"}],"version-history":[{"count":0,"href":"https:\/\/blog.mozilla.org\/l10n\/wp-json\/wp\/v2\/posts\/1653\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/blog.mozilla.org\/l10n\/wp-json\/wp\/v2\/media\/1658"}],"wp:attachment":[{"href":"https:\/\/blog.mozilla.org\/l10n\/wp-json\/wp\/v2\/media?parent=1653"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.mozilla.org\/l10n\/wp-json\/wp\/v2\/categories?post=1653"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.mozilla.org\/l10n\/wp-json\/wp\/v2\/tags?post=1653"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}