intl_pluralrules is a Rust crate, built to handle pluralization. Pluralization is the foundation for all localization and many internationalization APIs. With the addition of intl_pluralrules, any locale-aware date-, time- or unit-formatting (“1 second” vs “2 seconds”) and many other pluralization-dependent APIs can be added to Rust.
Rust joins the family of mature languages such as C++ and Java (via ICU), JavaScript (via ECMA 402), and Python (via Babel) which make writing multilingual software possible. All of those APIs use the same unified database for storing international plural rules, called Unicode CLDR.
intl_pluralrules determines the CLDR plural category for numeric input by leveraging Unicode Language Plural Rules and plural rules from Unicode CLDR. That category can be used to identify the correct string variant that should be used in localization.
The crate is available on crates.io, and can be used as a library in any Rust program. For example, the Rust implementation of the Fluent Project is the first system using intl_pluralrules for handling pluralization.
Why care about plurals?
In short, numbers can change the way words appear in a string. In a simple example, the English words “page” and “pages” are different because of pluralization. English plural forms are fairly simple; in other languages, the rules can be more complex. If software is built with one plural paradigm in mind, localizing for several languages (each with its own unique paradigm) becomes a complicated process, and unnecessarily so.
The Plural Forms section of Flod’s blog post on the advantages of Fluent describes the problems that the intl_pluralrules crate addresses and the benefits it can bring. If you would like to know more about why crates like intl_pluralrules are vital to i18n and l10n in Rust, start with Flod’s post.
How intl_pluralrules does plurals
The intl_pluralrules crate accepts numeric input and produces the appropriate plural category for that input in a given locale. In completing this process, intl_pluralrules performs several steps outlined here:
Making plural operands from numbers
Unicode’s Language Plural Rules define a set of characteristics of a number that determine which plural category it belongs to in a given language. These characteristics are called operands. The set comprises the absolute value (n), the integer value (i), the fraction value with and without trailing zeros (f and t), and the number of fraction digits with and without trailing zeros (v and w).
The intl_pluralrules crate creates a set of operands from the input and uses those operands when determining the plural category.
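As a rough illustration of what building operands involves, here is a minimal, self-contained sketch that derives them from the textual form of a number. The `Operands` struct and the `operands_from_str` function are hypothetical names made up for this example; they are not the crate's own types, and the crate's implementation may differ.

```rust
/// Plural operands, named as in Unicode's Language Plural Rules.
/// This struct is a hypothetical illustration, not the crate's type.
#[derive(Debug, PartialEq)]
struct Operands {
    n: f64,   // absolute value of the input
    i: u64,   // integer digits
    v: usize, // number of visible fraction digits, with trailing zeros
    w: usize, // number of visible fraction digits, without trailing zeros
    f: u64,   // visible fraction digits, with trailing zeros
    t: u64,   // visible fraction digits, without trailing zeros
}

/// Builds operands from text so that "1.0" keeps its trailing zero.
/// Returns None for anything that is not a plain decimal number.
fn operands_from_str(input: &str) -> Option<Operands> {
    let trimmed_input = input.trim();
    // The operands describe the absolute value, so drop a leading minus sign.
    let abs = trimmed_input.strip_prefix('-').unwrap_or(trimmed_input);
    let (int_part, frac_part) = abs.split_once('.').unwrap_or((abs, ""));

    let all_digits = |s: &str| s.chars().all(|c| c.is_ascii_digit());
    if int_part.is_empty() || !all_digits(int_part) || !all_digits(frac_part) {
        return None;
    }

    let frac_trimmed = frac_part.trim_end_matches('0');
    Some(Operands {
        n: abs.parse().ok()?,
        i: int_part.parse().ok()?,
        v: frac_part.len(),
        w: frac_trimmed.len(),
        f: if frac_part.is_empty() { 0 } else { frac_part.parse().ok()? },
        t: if frac_trimmed.is_empty() { 0 } else { frac_trimmed.parse().ok()? },
    })
}
```

With this sketch, `operands_from_str("1")` yields i = 1 and v = 0, while `operands_from_str("1.0")` yields i = 1 and v = 1.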
Compare the v value (number of visible fraction digits) for 1 and 1.0: it is 0 for 1 and 1 for 1.0. Although 1 and 1.0 represent the same numeric value, the presence of a decimal fraction can change the plural category in some languages.
For example, although in English it is unlikely to see whole count nouns measured in float style (with a trailing zero decimal), the correct plural form for 1.0 is “other”, not “one.” This means that the pages example from the previous section would read, “You have 1.0 open pages” rather than “You have 1.0 open page.” This distinction may seem strange because, as mentioned, this is an unusual use for 1.0 in English. Nonetheless, you will find that it is the proper plural form.
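The pages example can be sketched in a few lines. The English cardinal rule in CLDR assigns “one” only when i = 1 and v = 0; everything else is “other”. The `Category` enum and both functions below are hypothetical helpers written for this illustration, and the sketch does no input validation beyond what the rule needs.

```rust
#[derive(Debug, PartialEq)]
enum Category {
    One,
    Other,
}

/// English cardinal rule: "one" when i = 1 and v = 0, otherwise "other".
/// The input is text so that a trailing zero such as in "1.0" stays visible.
fn english_cardinal(input: &str) -> Category {
    let (int_part, frac_part) = input.split_once('.').unwrap_or((input, ""));
    let i = int_part.parse::<u64>().ok(); // integer digits, None if not an unsigned integer
    let v = frac_part.len(); // number of visible fraction digits
    if i == Some(1) && v == 0 {
        Category::One
    } else {
        Category::Other
    }
}

/// Picks the noun variant that matches the plural category of `count`.
fn page_message(count: &str) -> String {
    let noun = match english_cardinal(count) {
        Category::One => "page",
        Category::Other => "pages",
    };
    format!("You have {} open {}", count, noun)
}
```

Here `page_message("1")` produces “You have 1 open page”, and `page_message("1.0")` produces “You have 1.0 open pages”.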
Rust’s float types do not preserve trailing zeros when stringified (the float 1.0 becomes “1”), so a float alone cannot carry the distinction between 1 and 1.0. For this reason, the from method uses Rust’s ToString trait on its input when generating plural operands. This allows a user of the crate to send string or numeric input to the system, and, so long as it is a valid float or integer value, it will be accepted. Passing the string “1.0” keeps the trailing zero.
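The trailing-zero behavior is easy to check with the standard library alone. The two helper functions below are hypothetical and exist only to show how a generic `ToString` bound lets the same code accept floats, integers, and strings.

```rust
/// Counts the visible fraction digits in the textual form of a number.
fn visible_fraction_digits(text: &str) -> usize {
    text.split_once('.').map_or(0, |(_, frac)| frac.len())
}

/// Accepts anything that can be stringified: floats, integers, or strings.
fn fraction_digits_of<T: ToString>(input: T) -> usize {
    visible_fraction_digits(&input.to_string())
}
```

Because `1.0_f64.to_string()` is `"1"`, `fraction_digits_of(1.0_f64)` returns 0, whereas `fraction_digits_of("1.0")` returns 1.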
CLDR resource parser and code generator
intl_pluralrules depends on two associated crates that reside in the same GitHub repository and are also available on crates.io: cldr_pluralrules_parser and make_pluralrules.
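cldr_pluralrules_parser parses the plural rules from the JSON CLDR repository and builds an AST representation of the rules. In CLDR, each rule is a condition over the plural operands. The English cardinal category “one”, for example, is defined as i = 1 and v = 0, while Russian defines “one” as v = 0 and i % 10 = 1 and i % 100 != 11 and adds separate “few” and “many” categories.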
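To give a feel for what an AST of such a rule could look like, here is a deliberately tiny sketch. The `Expr`, `Operand`, and `Operands` types are invented for this example and are much simpler than whatever cldr_pluralrules_parser actually produces: they cover only `and` chains of equality relations on i and v, with no ranges and no `or`.

```rust
/// Operands a relation may reference (only the two needed here).
enum Operand {
    I,
    V,
}

/// Hypothetical, minimal AST for a CLDR plural condition.
enum Expr {
    /// `operand [% modulus] = value`, or `!=` when `negated` is true.
    /// A modulus, when present, is assumed to be non-zero.
    Relation { operand: Operand, modulus: Option<u64>, negated: bool, value: u64 },
    /// All parts must hold.
    And(Vec<Expr>),
}

struct Operands {
    i: u64,
    v: u64,
}

/// Evaluates the AST against a set of operands.
fn eval(expr: &Expr, ops: &Operands) -> bool {
    match expr {
        Expr::Relation { operand, modulus, negated, value } => {
            let raw = match operand {
                Operand::I => ops.i,
                Operand::V => ops.v,
            };
            let lhs = match modulus {
                Some(m) => raw % m,
                None => raw,
            };
            (lhs == *value) != *negated
        }
        Expr::And(parts) => parts.iter().all(|part| eval(part, ops)),
    }
}

fn rel(operand: Operand, modulus: Option<u64>, negated: bool, value: u64) -> Expr {
    Expr::Relation { operand, modulus, negated, value }
}

/// English "one": i = 1 and v = 0
fn english_one() -> Expr {
    Expr::And(vec![rel(Operand::I, None, false, 1), rel(Operand::V, None, false, 0)])
}

/// Russian "one": v = 0 and i % 10 = 1 and i % 100 != 11
fn russian_one() -> Expr {
    Expr::And(vec![
        rel(Operand::V, None, false, 0),
        rel(Operand::I, Some(10), false, 1),
        rel(Operand::I, Some(100), true, 11),
    ])
}
```

Evaluating `russian_one()` against i = 21, v = 0 gives true, while i = 11, v = 0 gives false, which is the difference between “21 страница” and “11 страниц”.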
make_pluralrules generates a Rust file from that AST, so that each locale’s plural logic is expressed as Rust code.
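The sketch below shows roughly what rules look like once they are written as plain Rust functions. It is hand-written for illustration and is not the actual output of make_pluralrules; the type names, function names, and the lookup function are all assumptions. The conditions follow the CLDR cardinal rules for English and Russian.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum PluralCategory {
    One,
    Few,
    Many,
    Other,
}

/// Only the operands these two locales need.
struct PluralOperands {
    i: u64,   // integer digits
    v: usize, // number of visible fraction digits
}

type PluralRule = fn(&PluralOperands) -> PluralCategory;

/// English: one when i = 1 and v = 0.
fn cardinal_en(po: &PluralOperands) -> PluralCategory {
    if po.i == 1 && po.v == 0 {
        PluralCategory::One
    } else {
        PluralCategory::Other
    }
}

/// Russian: one, few, and many all require v = 0.
fn cardinal_ru(po: &PluralOperands) -> PluralCategory {
    let (mod10, mod100) = (po.i % 10, po.i % 100);
    if po.v != 0 {
        PluralCategory::Other
    } else if mod10 == 1 && mod100 != 11 {
        PluralCategory::One
    } else if (2..=4).contains(&mod10) && !(12..=14).contains(&mod100) {
        PluralCategory::Few
    } else if mod10 == 0 || (5..=9).contains(&mod10) || (11..=14).contains(&mod100) {
        PluralCategory::Many
    } else {
        PluralCategory::Other
    }
}

/// Looks up the compiled rule for a locale (hypothetical helper).
fn cardinal_rule_for(locale: &str) -> Option<PluralRule> {
    match locale {
        "en" => Some(cardinal_en as PluralRule),
        "ru" => Some(cardinal_ru as PluralRule),
        _ => None,
    }
}
```

Selecting a category is then a lookup followed by an ordinary function call, with no rule text to parse at run time.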
The Rust file generated by these crates is used in intl_pluralrules to determine the plural form of a number.
intl_pluralrules crate
Using intl_pluralrules is a two-step process.
- The user must create an IntlPluralRule instance by providing a BCP 47 language tag* and a plural type (cardinal or ordinal) to the create method. This will use the generated Rust file to create an IntlPluralRule object.
- A number needs to be passed to the select method on the IntlPluralRule instance created in step 1.
The value returned from step 2 is the plural category.
*You can use the get_locales method to see what languages are available in the crate
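The two-step flow can be mimicked with a small stand-in that needs no dependencies. `DemoPluralRules` and everything around it is hypothetical: it mirrors the create-then-select shape described above, supports English only, and its names and signatures should not be read as the crate's real API. Check the crate documentation for the actual types.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum PluralCategory {
    One,
    Two,
    Few,
    Other,
}

#[derive(Debug, PartialEq, Clone, Copy)]
enum PluralRuleType {
    Cardinal,
    Ordinal,
}

struct Operands {
    i: u64,   // integer digits
    v: usize, // number of visible fraction digits
    t: u64,   // fraction digits without trailing zeros
}

type Rule = fn(&Operands) -> PluralCategory;

/// English cardinal: one when i = 1 and v = 0.
fn en_cardinal(po: &Operands) -> PluralCategory {
    if po.i == 1 && po.v == 0 { PluralCategory::One } else { PluralCategory::Other }
}

/// English ordinal (1st, 2nd, 3rd, 4th...). The CLDR rule is written on n,
/// so any non-zero fraction (t != 0) falls through to "other".
fn en_ordinal(po: &Operands) -> PluralCategory {
    if po.t != 0 {
        return PluralCategory::Other;
    }
    match (po.i % 10, po.i % 100) {
        (1, r) if r != 11 => PluralCategory::One,
        (2, r) if r != 12 => PluralCategory::Two,
        (3, r) if r != 13 => PluralCategory::Few,
        _ => PluralCategory::Other,
    }
}

/// Hypothetical stand-in for the object created in step 1.
struct DemoPluralRules {
    rule: Rule,
}

impl DemoPluralRules {
    /// Locales this sketch knows about.
    fn get_locales() -> Vec<&'static str> {
        vec!["en"]
    }

    /// Step 1: pick the rule for a language tag and plural type.
    fn create(locale: &str, rule_type: PluralRuleType) -> Result<Self, &'static str> {
        if locale != "en" {
            return Err("unknown locale");
        }
        let rule: Rule = match rule_type {
            PluralRuleType::Cardinal => en_cardinal,
            PluralRuleType::Ordinal => en_ordinal,
        };
        Ok(DemoPluralRules { rule })
    }

    /// Step 2: turn the input into operands and apply the rule.
    fn select<N: ToString>(&self, number: N) -> Result<PluralCategory, &'static str> {
        let text = number.to_string();
        let abs = text.strip_prefix('-').unwrap_or(text.as_str());
        let (int_part, frac_part) = abs.split_once('.').unwrap_or((abs, ""));
        let i = int_part.parse::<u64>().map_err(|_| "not a number")?;
        let trimmed = frac_part.trim_end_matches('0');
        let t = if trimmed.is_empty() {
            0
        } else {
            trimmed.parse::<u64>().map_err(|_| "not a number")?
        };
        Ok((self.rule)(&Operands { i, v: frac_part.len(), t }))
    }
}
```

With this stand-in, `DemoPluralRules::create("en", PluralRuleType::Cardinal)` followed by `select(1)` returns `Ok(PluralCategory::One)`, and the same object returns `Ok(PluralCategory::Other)` for `select("1.0")`.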
Performance and implications
First, intl_pluralrules has landed in fluent-rs, meaning that the Rust implementation uses the crate for handling all plural-concerned instances. Because intl_pluralrules leverages the available data from Unicode, Fluent’s selection process for plural-concerned strings in any FTL file is completely automated. So long as the provided CLDR file has rules for your locale, the developer will not need to hard code plural logic into the software and localizers won’t need to report a bug in order for the correct plural string to be activated.
Second, intl_pluralrules is fast. The crate is still in prerelease because, although fully functional, some optimization features are still being discussed. In spite of intl_pluralrules’ WIP status regarding optimization, the system is still incredibly performant. Compared to ICU’s C PluralRules, intl_pluralrules is approximately 20 times faster in a simple benchmark test.
The comparative speed of intl_pluralrules is due to the decision to store plural rules as compiled Rust code, rather than as CLDR syntax to be parsed at runtime. cldr_pluralrules_parser and make_pluralrules generate the Rust version of the CLDR rules, which is then compiled into the crate. This makes the crate slightly larger but also quicker, because CLDR rules are not parsed at run time (as they are in ICU); that parsing is the main source of the speed disparity. As intl_pluralrules moves towards 1.0, performance is expected to improve further.
In the bigger picture, the release of intl_pluralrules means that the Rust ecosystem gains a higher-level internationalization and localization API, hopefully the first of several. Conversely, the internationalization and localization ecosystem gains use of this API, which leverages the performance benefits of the Rust language.
Relevant Links:
- intl_pluralrules on Crates.io
- GitHub Repo for all three crates
- Unicode CLDR Plural Rules
- Project Fluent
intl_pluralrules Developers: