Support Unicode
To support Unicode’s mission to ensure everyone can communicate in their languages across all devices, please consider adopting a character, making a gift of stock, or making a donation. As Unicode, Inc. is a US-based open source, open standards, non-profit, 501(c)3 organization, your contribution may be eligible for a tax deduction. Please consult with a tax advisor for details.
Monday, April 17, 2023
ICU4X 1.2: Now With Text Segmentation and More on Low-Resource Devices
By Shane Carr, Chair of the ICU4X Subcommittee
Across the globe, people are coming online with smaller and more varied devices including smartphones, smart watches, and gadgets. An offshoot of the International Components for Unicode (ICU) Committee, the ICU4X Committee is responsible for enabling these next-generation devices to communicate with each other in thousands of languages. Written in Rust, ICU4X brings lightweight, modular, and secure internationalization libraries to low-resource devices and many programming languages.
Since our first big release in September 2022, the ICU4X team has been busy building additional features and infrastructure. Today, the team is excited to announce ICU4X 1.2, featuring the first stable release of the Segmenter component, more Unicode properties, property names, a technology preview of language and script display names, HarfBuzz bindings, CLDR 43, full compliance with the Unicode Bidirectional Algorithm (UAX #9), and many smaller features and improvements to the ICU4X components.
Text segmentation is the process of dividing strings into meaningful units, such as words, sentences, or grapheme clusters (characters). It is a fundamental task in a wide range of applications, including cursor movement, highlighting spans of text, evaluating text for spelling and grammatical correctness, information retrieval, and text layout.
ICU4X 1.2 supports two standards: Unicode Text Segmentation (UAX #29) for word, sentence, and grapheme cluster segmentation, and the Unicode Line Breaking Algorithm (UAX #14) for line segmentation.
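To make the idea of a grapheme cluster concrete, here is a minimal Python sketch. It is not ICU4X code and it is not a UAX #29 implementation: it only attaches combining marks to the preceding base character, and the function name `rough_grapheme_clusters` is made up for this illustration. A real segmenter also handles Hangul syllables, emoji sequences, regional indicators, and more.

```python
import unicodedata

def rough_grapheme_clusters(text):
    """Group each base character with the combining marks that follow it.

    Deliberately simplified stand-in for UAX #29 grapheme cluster rules.
    """
    clusters = []
    for ch in text:
        # A nonzero canonical combining class means "attach to the previous base"
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

sample = "cafe\u0301 ok"  # "café ok" written with a combining acute accent
print(len(sample))                           # 8 code points
print(len(rough_grapheme_clusters(sample)))  # 7 user-perceived characters
```

The gap between the two counts is why cursor movement and text selection need segmentation rather than code point indexing.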
Given ICU4X's focus on being lightweight for deployment in resource-constrained environments, the team focused on ways to reduce data size versus ICU4C. The highest-impact differences come from the use of runtime tailoring (reducing the number of rule tables) and machine learning models (eliminating the need for Southeast Asian word dictionaries). Overall, ICU4X data for segmentation is 20.1% smaller than the equivalent data in ICU4C, and 60.7% smaller for line break segmentation.
In addition to being smaller in size, ICU4X's line and word segmenters are, respectively, 19.1% and 52.2% faster in non-complex scripts and 46.9% and 32.1% faster in Chinese than their ICU4C equivalents.
The machine learning models in ICU4X are used for word and line breaking in Southeast Asian languages including Thai, Lao, Khmer, and Myanmar. The models use an LSTM, are trained on large datasets, and achieve high accuracy while retaining small model size. By leveraging modern computer architecture features such as SIMD, the team optimized the performance of the LSTM inference to be about 3× faster than the naive implementation. However, the dictionary model remains the fastest, about two orders of magnitude faster than the LSTM. ICU4X offers both types of models so that clients can choose between them.
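For readers unfamiliar with dictionary-based segmentation, the following Python sketch shows the general idea using greedy longest-match against a word list. This is an assumption-laden illustration: the function name, the four-word `TOY_DICTIONARY`, and the greedy strategy are all invented here, and neither ICU4C nor ICU4X is claimed to work exactly this way.

```python
def longest_match_segment(text, dictionary):
    """Greedy longest-match word segmentation against a word list."""
    max_len = max(len(word) for word in dictionary)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a dictionary hit
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in dictionary:
                words.append(candidate)
                i += size
                break
        else:
            # No dictionary word starts here: emit one code point and move on
            words.append(text[i])
            i += 1
    return words

# Hypothetical toy word list; a real dictionary holds many thousands of entries
TOY_DICTIONARY = {"สวัสดี", "ครับ", "ภาษา", "ไทย"}
print(longest_match_segment("สวัสดีครับ", TOY_DICTIONARY))
print(longest_match_segment("ภาษาไทย", TOY_DICTIONARY))
```

The sketch also hints at the trade-off described above: lookups like these are cheap at run time, but the word list itself has to be shipped as data, which is the cost a compact model avoids.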
Another focus of ICU4X 1.2 has been to support your text layout stack. A text layout engine requires more than the scope of either ICU4C or ICU4X, but any layout engine requires at least two ICU features: line break segmentation and the ability to correctly order bidirectional text. ICU4X 1.2 supports the segmentation and bidirectional text needs of Skia’s SkParagraph and HarfBuzz.
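As a small taste of what bidirectional ordering involves, the sketch below picks a paragraph's base direction from its first strong character, in the spirit of rules P2 and P3 of UAX #9. It is a simplification written for this post's readers, not ICU4X code: it ignores isolate sequences and does none of the actual reordering, and `base_direction` is a hypothetical name.

```python
import unicodedata

def base_direction(text):
    """Return 'rtl' or 'ltr' based on the first strong bidi character."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in ("R", "AL"):  # Hebrew-like or Arabic letters
            return "rtl"
        if bidi_class == "L":          # Latin and other left-to-right letters
            return "ltr"
    return "ltr"  # no strong character found: default to left-to-right

hebrew_word = "\u05e9\u05dc\u05d5\u05dd"       # Hebrew "shalom"
print(base_direction("123 " + hebrew_word))    # rtl: digits and spaces are not strong
print(base_direction("Hello " + hebrew_word))  # ltr: the first strong character is Latin
```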
Finally, ICU4X 1.2 brings a number of smaller features to other components. The experimental Display Names component now supports language and script display names, in addition to region display names; the Properties component supports converting UCD property and value enum discriminants to their long and short names, and vice-versa; and all components have been upgraded to support CLDR 43.
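To illustrate what converting between short and long property value names means, here is a Python sketch for a handful of General_Category values. The table is a hand-copied excerpt and the helper names are made up for this example; it does not show the ICU4X Properties API.

```python
import unicodedata

# Excerpt of General_Category short names and their long names
GENERAL_CATEGORY_LONG_NAMES = {
    "Lu": "Uppercase_Letter",
    "Ll": "Lowercase_Letter",
    "Nd": "Decimal_Number",
    "Zs": "Space_Separator",
    "Po": "Other_Punctuation",
}
# Reverse mapping for the "vice-versa" direction
GENERAL_CATEGORY_SHORT_NAMES = {
    long_name: short for short, long_name in GENERAL_CATEGORY_LONG_NAMES.items()
}

def general_category_long_name(ch):
    """Look up a character's General_Category and return its long name."""
    short = unicodedata.category(ch)
    # Fall back to the short name for values missing from the excerpt
    return GENERAL_CATEGORY_LONG_NAMES.get(short, short)

print(general_category_long_name("A"))                 # Uppercase_Letter
print(general_category_long_name("7"))                 # Decimal_Number
print(GENERAL_CATEGORY_SHORT_NAMES["Space_Separator"]) # Zs
```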
Read the full ICU4X 1.2 release notes and then the ICU4X tutorial to start using ICU4X in your project.
To learn more about the latest release, be sure to attend our ICU4X Virtual Open House this Wednesday, April 19th at 9am PT.
Thursday, April 13, 2023
ICU 73 Released
Unicode® ICU 73 has just been released. ICU is the premier library for software internationalization, used by a wide array of companies and organizations to support the world's languages, implementing both the latest version of the Unicode Standard and of the Unicode locale data (CLDR). ICU 73 updates to CLDR 43 locale data with various additions and corrections.
ICU 73 improves Japanese and Korean short-text line breaking, reduces C++ memory use in date formatting, and promotes the Java person name formatter from tech preview to draft.
ICU 73 and CLDR 43 are minor releases, mostly focused on bug fixes and small enhancements. (The fall CLDR/ICU releases will update to Unicode 15.1, which is planned for September.)
ICU 73 updates to the time zone data version 2023c (March 2023). Note that pre-1970 data for a number of time zones has been removed, as has been the case in the upstream tzdata release since 2021b.
For details, please see https://icu.unicode.org/download/73.
Wednesday, April 12, 2023
Unicode CLDR v43 released
CLDR provides key building blocks for software to support the world's languages (dates, times, numbers, sort-order, etc.). For example, all major browsers and all modern mobile phones use CLDR for language support. (See Who uses CLDR?)
Via the online Survey Tool, contributors supply data for their languages, data that is widely used to support much of the world’s software. This data is also a factor in determining which languages are supported on mobile phones and computer operating systems.
It is important to review the Migration section for changes that might require action by implementations using CLDR directly or indirectly (e.g., via ICU).
CLDR 43 is a limited-submission release, focusing on just a few areas:
- Formatting Person Names
  - Completing the data for formatting person names, allowing it to advance out of “tech preview”. For more information on the benefits of this feature, see Background.
- Locales
  - Adding substantially to the LikelySubtags data: This is used to find the likely writing system and country for a given language, used in normalizing locale identifiers and inheritance. The data has been contributed by SIL.
  - Inheritance: Adding components to parentLocales, and documenting the different inheritance for rgScope data, which inherits primarily by region.
- Other data updates
  - In English, Türkiye is now the primary country name for the country code TR, and Turkey is available as an alternate. Other locales have been reviewed to see whether similar changes would be appropriate.
  - Name for the new timezone Ciudad Juárez.
- Structure
  - Adding some structure and data needed for ICU4X & JavaScript, for calendar eras and parentLocales.
- Collation & Searching
  - Treat various quote marks as equivalent at a Primary strength, also including Geresh and Gershayim.
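The LikelySubtags item above can be pictured with a small Python sketch that fills in a likely script and region for a locale identifier. The five-entry table is a tiny hand-copied excerpt for illustration, the function name is hypothetical, and the parsing handles only language, script, and two-letter region subtags; the full algorithm and data are defined by CLDR.

```python
# Tiny excerpt in the style of CLDR likely subtags data (illustrative only)
LIKELY_SUBTAGS_EXCERPT = {
    "en": "en-Latn-US",
    "zh": "zh-Hans-CN",
    "zh-TW": "zh-Hant-TW",
    "zh-Hant": "zh-Hant-TW",
    "sr": "sr-Cyrl-RS",
}

def add_likely_subtags(locale_id, table=LIKELY_SUBTAGS_EXCERPT):
    """Fill in missing script and region subtags from the lookup table."""
    parts = locale_id.replace("_", "-").split("-")
    language, script, region = parts[0].lower(), None, None
    for part in parts[1:]:
        if len(part) == 4 and part.isalpha():
            script = part.title()   # e.g. "hant" -> "Hant"
        elif len(part) == 2 and part.isalpha():
            region = part.upper()   # e.g. "tw" -> "TW"
    # Most specific key first, falling back to the bare language
    candidates = []
    if script and region:
        candidates.append(f"{language}-{script}-{region}")
    if region:
        candidates.append(f"{language}-{region}")
    if script:
        candidates.append(f"{language}-{script}")
    candidates.append(language)
    for key in candidates:
        if key in table:
            _, likely_script, likely_region = table[key].split("-")
            # Subtags given explicitly always win over the likely ones
            return "-".join([language, script or likely_script, region or likely_region])
    # Unknown language: return the normalized input unchanged
    return "-".join(p for p in (language, script, region) if p)

print(add_likely_subtags("zh"))     # zh-Hans-CN
print(add_likely_subtags("zh-TW"))  # zh-Hant-TW
print(add_likely_subtags("en-GB"))  # en-Latn-GB
```

Checking the more specific keys first matters: with only a bare `zh` entry, `zh-TW` would wrongly come out with the Simplified script.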