CS Knowledge Base

Monday, April 17, 2023

ICU4X 1.2: Now With Text Segmentation and More on Low-Resource Devices

By Shane Carr, Chair of the ICU4X Subcommittee

Across the globe, people are coming online with smaller and more varied devices including smartphones, smart watches, and gadgets. An offshoot of the International Components for Unicode (ICU) Committee, the ICU4X Committee is responsible for enabling these next-generation devices to communicate with each other in thousands of languages. Written in Rust, ICU4X brings lightweight, modular, and secure internationalization libraries to low-resource devices and many programming languages.

Since our first big release in September 2022, the ICU4X team has been busy building additional features and infrastructure. Today, the team is excited to announce ICU4X 1.2, featuring the first stable release of the Segmenter component, more Unicode properties, property names, a technology preview of language and script display names, HarfBuzz bindings, CLDR 43, full compliance with the Unicode Bidirectional Algorithm (UAX #9), and many smaller features and improvements to the ICU4X components.

Text segmentation is the process of dividing strings into meaningful units, such as words, sentences, or grapheme clusters (characters). It is a fundamental task in a wide range of applications, including cursor movement, highlighting spans of text, evaluating text for spelling and grammatical correctness, information retrieval, and text layout.
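To make the idea of breakpoints concrete, here is a minimal Python sketch that approximates grapheme cluster boundaries by attaching combining marks to the preceding base character. It is not the UAX #29 algorithm and not the ICU4X API: it ignores Hangul syllables, emoji ZWJ sequences, and many other rules, and the function name `approx_grapheme_breaks` is hypothetical.

```python
import unicodedata

def approx_grapheme_breaks(text):
    """Return code point offsets where an approximate grapheme cluster starts or ends.

    Simplifying assumption: a cluster is a base character followed by any
    characters with a non-zero canonical combining class. Real UAX #29
    segmentation has many more rules.
    """
    if not text:
        return [0]
    breaks = [0]
    for i, ch in enumerate(text):
        # A character with combining class 0 starts a new cluster.
        if i > 0 and unicodedata.combining(ch) == 0:
            breaks.append(i)
    breaks.append(len(text))
    return breaks

# "e" + COMBINING ACUTE ACCENT is two code points but one user-perceived character.
sample = "e\u0301a"
print(len(sample), approx_grapheme_breaks(sample))
```

The point of the sketch is only that a cursor or selection should move between breakpoints rather than between individual code points.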

ICU4X 1.2 supports two standards: Unicode Text Segmentation (UAX #29) for word, sentence, and grapheme cluster segmentation, and the Unicode Line Breaking Algorithm (UAX #14) for line segmentation.

Given ICU4X's focus on being lightweight for deployment in resource-constrained environments, the team focused on ways to reduce data size versus ICU4C. The highest-impact differences come from the use of runtime tailoring (reducing the number of rule tables) and machine learning models (eliminating the need for Southeast Asian word dictionaries). Overall, ICU4X data for segmentation is 20.1% smaller than the equivalent data in ICU4C, and 60.7% smaller for line break segmentation.

In addition to being smaller in size, ICU4X's segmenters are faster than their ICU4C equivalents: in non-complex scripts, the line segmenter is 19.1% faster and the word segmenter is 52.2% faster; in Chinese, they are 46.9% and 32.1% faster, respectively.

The machine learning models in ICU4X are used for word and line breaking in Southeast Asian languages including Thai, Lao, Khmer, and Myanmar. The models use an LSTM, are trained on large datasets, and achieve high accuracy while retaining small model size. By leveraging modern computer architecture features such as SIMD, the team optimized the performance of the LSTM inference to be about 3× faster than the naive implementation. However, the dictionary model remains the fastest, about two orders of magnitude faster than the LSTM. ICU4X offers both types of models so that clients can choose between them.
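For readers unfamiliar with dictionary-based segmentation, the following Python sketch shows a greedy longest-match approach over a toy word list. It is an illustration only: it is not the algorithm used by ICU4C or ICU4X, the two-word dictionary is made up for the example, and the function name is hypothetical. It does show why a dictionary model is fast (simple lookups) but needs word data shipped with it.

```python
def dictionary_word_breaks(text, dictionary):
    """Greedy longest-match word segmentation (illustrative, not ICU's algorithm).

    Returns code point offsets of word boundaries. Characters that match no
    dictionary entry become single-character segments.
    """
    max_len = max((len(word) for word in dictionary), default=1)
    breaks = [0]
    pos = 0
    while pos < len(text):
        step = 1  # fallback when nothing in the dictionary matches
        # Try the longest candidate first, then progressively shorter ones.
        for length in range(min(max_len, len(text) - pos), 0, -1):
            if text[pos:pos + length] in dictionary:
                step = length
                break
        pos += step
        breaks.append(pos)
    return breaks

# Toy dictionary with two Thai words; Thai is written without spaces between words.
toy_dictionary = {"ภาษา", "ไทย"}
print(dictionary_word_breaks("ภาษาไทย", toy_dictionary))
```

A production dictionary holds tens of thousands of entries, which is the data cost that the LSTM models are meant to avoid.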

Another focus of ICU4X 1.2 has been to support your text layout stack. A text layout engine requires more than the scope of either ICU4C or ICU4X, but any layout engine requires at least two ICU features: line break segmentation and the ability to correctly order bidirectional text. ICU4X 1.2 supports the segmentation and bidirectional text needs of Skia’s SkParagraph and HarfBuzz.
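As a small taste of what ordering bidirectional text involves, the Python sketch below implements only the first step of UAX #9: choosing a paragraph's base direction from its first strong character (rules P2 and P3). It is a simplification that ignores isolate initiators, and it is not the ICU4X API; the function name `base_direction` is hypothetical.

```python
import unicodedata

def base_direction(paragraph):
    """Return "ltr" or "rtl" based on the first strong bidi character.

    Simplified form of UAX #9 rules P2/P3: characters inside isolates are
    not skipped here, which the full algorithm requires.
    """
    for ch in paragraph:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    # No strong character found: default to left-to-right.
    return "ltr"

print(base_direction("abc"))
print(base_direction("123 \u05d0\u05d1"))  # digits are weak; Hebrew letters are strong RTL
```

The remaining steps of the algorithm (resolving weak and neutral types, assigning embedding levels, and reordering runs) are what a full implementation such as the one in ICU4X provides.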

Finally, ICU4X 1.2 brings a number of smaller features to other components. The experimental Display Names component now supports language and script display names, in addition to region display names; the Properties component supports converting UCD property and value enum discriminants to their long and short names, and vice-versa; and all components have been upgraded to support CLDR 43.
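The short-name and long-name conversion mentioned above can be pictured with a small Python sketch. It uses the standard library's `unicodedata.category`, which returns the short General_Category alias, and a hand-written four-entry table of long names taken from the Unicode property value aliases. The table and the function name are illustrative assumptions, not the ICU4X API.

```python
import unicodedata

# Partial mapping from short to long General_Category value names.
GC_LONG_NAMES = {
    "Lu": "Uppercase_Letter",
    "Ll": "Lowercase_Letter",
    "Nd": "Decimal_Number",
    "Zs": "Space_Separator",
}
# Reverse mapping for converting long names back to short names.
GC_SHORT_NAMES = {long: short for short, long in GC_LONG_NAMES.items()}

def long_category_name(ch):
    """Return the long General_Category name, or the short one if not in the table."""
    short = unicodedata.category(ch)
    return GC_LONG_NAMES.get(short, short)

print(long_category_name("A"), GC_SHORT_NAMES["Space_Separator"])
```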

Read the full ICU4X 1.2 release notes and then the ICU4X tutorial to start using ICU4X in your project.

To learn more about the latest release, be sure to attend our ICU4X Virtual Open House this Wednesday, April 19th at 9am PT.

Support Unicode
To support Unicode’s mission to ensure everyone can communicate in their languages across all devices, please consider adopting a character, making a gift of stock, or making a donation. As Unicode, Inc. is a US-based open source, open standards, non-profit, 501(c)3 organization, your contribution may be eligible for a tax deduction. Please consult with a tax advisor for details.

Thursday, April 13, 2023

ICU 73 Released

Unicode® ICU 73 has just been released. ICU is the premier library for software internationalization, used by a wide array of companies and organizations to support the world's languages, implementing the latest versions of both the Unicode Standard and the Unicode locale data (CLDR). ICU 73 updates to CLDR 43 locale data with various additions and corrections.

ICU 73 improves Japanese and Korean short-text line breaking, reduces C++ memory use in date formatting, and promotes the Java person name formatter from tech preview to draft.

ICU 73 and CLDR 43 are minor releases, mostly focused on bug fixes and small enhancements. (The fall CLDR/ICU releases will update to Unicode 15.1, which is planned for September.)

ICU 73 updates to the time zone data version 2023c (March 2023). Note that pre-1970 data for a number of time zones has been removed, as has been the case in the upstream tzdata release since 2021b.

For details, please see https://icu.unicode.org/download/73.

Wednesday, April 12, 2023

Unicode CLDR v43 released

CLDR provides key building blocks for software to support the world's languages (dates, times, numbers, sort-order, etc.). For example, all major browsers and all modern mobile phones use CLDR for language support. (See Who uses CLDR?)

Via the online Survey Tool, contributors supply data for their languages — data that is widely used to support much of the world’s software. This data is also a factor in determining which languages are supported on mobile phones and computer operating systems.

It is important to review the Migration section for changes that might require action by implementations using CLDR directly or indirectly (e.g., via ICU).

CLDR 43 is a limited-submission release, focusing on just a few areas:

Formatting Person Names
- Completing the data for formatting person names, allowing it to advance out of “tech preview”. For more information on the benefits of this feature, see Background.
Locales
- Adding substantially to the LikelySubtags data: This is used to find the likely writing system and country for a given language, used in normalizing locale identifiers and inheritance. The data has been contributed by SIL.
- Inheritance: Adding components to parentLocales, and documenting the different inheritance for rgScope data, which inherits primarily by region.
Other data updates
- In English, Türkiye is now the primary country name for the country code TR, and Turkey is available as an alternate. Other locales have been reviewed to see whether similar changes would be appropriate.
- Name for the new timezone Ciudad Juárez.
Structure
- Adding some structure and data needed for ICU4X & JavaScript, for calendar eras and parentLocales.
Collation & Searching
- Treating various quote marks as equivalent at a Primary strength, also including Geresh and Gershayim.
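To illustrate how the LikelySubtags data mentioned under Locales is used, here is a simplified Python sketch of the "add likely subtags" lookup from UTS #35. The four-entry table is a tiny hand-picked sample rather than the real data set, the parser handles only language, script, and region subtags, and the function names are hypothetical.

```python
# Tiny sample of likely-subtags entries: key -> (language, script, region).
LIKELY = {
    "en": ("en", "Latn", "US"),
    "zh": ("zh", "Hans", "CN"),
    "zh-TW": ("zh", "Hant", "TW"),
    "sr": ("sr", "Cyrl", "RS"),
}

def parse(tag):
    """Split a simple locale tag into (language, script, region)."""
    parts = tag.split("-")
    lang, script, region = parts[0].lower(), None, None
    for part in parts[1:]:
        if len(part) == 4 and part.isalpha():
            script = part.title()
        elif (len(part) == 2 and part.isalpha()) or (len(part) == 3 and part.isdigit()):
            region = part.upper()
    return lang, script, region

def maximize(tag):
    """Fill in missing script and region subtags from the sample table."""
    lang, script, region = parse(tag)
    # Lookup order: most specific key first, language alone last.
    keys = []
    if script and region:
        keys.append(f"{lang}-{script}-{region}")
    if region:
        keys.append(f"{lang}-{region}")
    if script:
        keys.append(f"{lang}-{script}")
    keys.append(lang)
    for key in keys:
        if key in LIKELY:
            _, likely_script, likely_region = LIKELY[key]
            return "-".join((lang, script or likely_script, region or likely_region))
    # Unknown language: return the tag with whatever subtags it already had.
    return "-".join(p for p in (lang, script, region) if p)

print(maximize("sr"), maximize("zh-TW"), maximize("en-GB"))
```

Subtags that the caller supplied are always kept; the table only fills the gaps, which is why `en-GB` keeps its region and gains only a script.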

To find out more about these and other changes, see the CLDR v43 release page.
