Wednesday, September 13, 2023
Source Code Handling: Preventing Spoofing at the Source
By: Mark Davis, Cofounder and CTO
The Unicode Consortium is providing a new resource to help programming tooling developers, programming language developers, and programming language users to deal with Unicode spoofing.
Background
Because Unicode encompasses letters and symbols (over 149,000 in Unicode 15.1) from across the world’s writing systems, it was inevitable that many of them would look similar, and sometimes identical. And of course, there are those who would take advantage of that to swindle. An example of this is “pаypal.com”, where the first ‘а’ is actually a Cyrillic character that is confusable with the Latin letter ‘a’. 😵‍💫
In 2004, the Unicode Consortium began working to address this issue, focusing on URLs and other identifiers that could be spoofed, and produced a specification and technical report with best practices for detecting such cases. Implementations using those specifications have been widely deployed in operating systems.
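As a rough illustration of the “pаypal.com” case, the following Python sketch lists the code points behind the two spellings and flags a mix of scripts. The helper names (`describe`, `scripts_by_name`) are hypothetical. Guessing a script from the first word of a character name is a shortcut assumed here for brevity; the Unicode specifications rely on proper script and confusable data instead.

```python
import unicodedata


def describe(text):
    """Return (character, code point, Unicode name) for each character."""
    return [(ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
            for ch in text]


def scripts_by_name(text):
    """Rough script guess: the first word of each letter's Unicode name.

    Assumption: a heuristic for illustration only, not the Script property.
    """
    return {unicodedata.name(ch).split()[0] for ch in text if ch.isalpha()}


spoof = "p\u0430ypal.com"   # second character is U+0430 CYRILLIC SMALL LETTER A
genuine = "paypal.com"      # all Latin letters

# The strings render alike but compare unequal.
print(spoof == genuine)

# Show what the first two characters of the spoofed string really are.
for row in describe(spoof[:2]):
    print(row)

# A name that mixes scripts is a hint that it deserves a closer look.
print(scripts_by_name(spoof))
print(scripts_by_name(genuine))
```

A mixed-script result is only a hint, since legitimate identifiers can also mix scripts; it narrows down what a reviewer or tool should inspect more closely.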
In November of 2021, another class of problems was documented. It was demonstrated that malicious agents could write source code that would look to human reviewers as if it were secure but would actually contain hidden traps. There are three main categories of these spoofs: line-break spoofs, confusable spoofs, and bidirectional ordering spoofs.
Examples
- Line-break spoofs can cause what appears to be a line of code to be commented out as far as the compiler is concerned. This can happen in C11, for example, with a comment line followed by a statement. To a reviewer, the statement is an active line of code. But when U+2028 Line Separator, rather than a newline, ends the comment line, a C11 compiler interprets both as one line consisting only of a comment.
- The “pаypal.com” above is an example of a confusable spoof.
- As for a bidirectional spoof, take a pair of variables named Aא1 and A1א; these look identical, but the former consists of the letters A and א followed by the digit 1, whereas the latter consists of the letter A, the digit 1, and the letter א, in that order.
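The line-break and bidirectional cases can be sketched in Python as well. The C-like snippet and the helper name `find_unusual_line_breaks` are assumptions made for illustration and are not the example from the original post. The set of characters checked is likewise an illustrative choice rather than a list taken from a specification.

```python
import unicodedata

# Line-ending characters other than LF and CR that tools may disagree on.
# Assumption: an illustrative set, not an official list.
UNUSUAL_LINE_BREAKS = {"\u000B", "\u000C", "\u0085", "\u2028", "\u2029"}


def find_unusual_line_breaks(source):
    """Return (index, code point) for each unusual line-ending character."""
    return [(i, f"U+{ord(ch):04X}")
            for i, ch in enumerate(source)
            if ch in UNUSUAL_LINE_BREAKS]


# Hypothetical snippet: U+2028 sits between the comment and the statement.
source = "// check the argument\u2028if (arg == NULL) return;"

# A tool that only recognizes LF as a line end sees one line, all comment...
lines_for_lf_only_tool = source.split("\n")
# ...while str.splitlines() also breaks on U+2028, like a display honoring it.
lines_for_reviewer = source.splitlines()

print(len(lines_for_lf_only_tool), len(lines_for_reviewer))
print(find_unusual_line_breaks(source))

# The two identifiers from the bidirectional example, in stored order.
first = "A\u05D01"    # A, HEBREW LETTER ALEF, 1
second = "A1\u05D0"   # A, 1, HEBREW LETTER ALEF

# Print each identifier's code points with their bidirectional classes.
for ident in (first, second):
    print([(f"U+{ord(ch):04X}", unicodedata.bidirectional(ch)) for ch in ident])

# Same characters, different order: the identifiers are distinct.
print(first == second)
```

Comparing stored code points, as the last lines do, exposes a difference that the rendered text hides. The sketch does not implement bidirectional reordering itself.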

The earlier work on spoofing identifiers was relevant to this work, but did not explicitly deal with the environment surrounding software development. Moreover, the guidance was aimed at internationalization experts, not programming language and software tooling developers.
Process
In response to this problem, the Consortium started a project in early 2022 to assemble a cross-functional group of experts in Unicode processing, programming languages, and software development tooling. That project resulted in the Source Code Working Group (SCWG), which worked through the possible problems.
The first results of this group were a number of enhancements to core Unicode specifications in September of 2022. UAX #9 provided an extended example of the use of the important higher-level protocol HL4, and emphasized its use to mitigate misleading bidirectional ordering of source code, including potential spoofing attacks. UAX #31 provided important guidance on profiles for default identifiers and clarified that the requirement on Pattern_White_Space and Pattern_Syntax characters applies to programming languages and is relevant to issues of bidirectional ordering and potential spoofing attacks.
Impact
The final output of the group is Unicode Technical Standard #55, Source Code Handling. This new specification brings together in one place a description of the problems specific to source code, together with guidance and best practices for programming language and software tooling developers. Many of the APIs necessary for supporting those best practices were already specified and implemented in ICU, Unicode’s software library that is already in all modern operating systems. However, one useful new API has been added to ICU and will be released in October 2023: the bidiSkeleton function, used to detect identifiers such as Aא1 above.
Coordinated security-related updates have been made to UAX #9, Unicode Bidirectional Algorithm, and UAX #31, Unicode Identifiers and Syntax, along with updates to UTS #39, Unicode Security Mechanisms.
This work would not have been possible without the set of dedicated and knowledgeable people that made up the SCWG, especially Robin Leroy, the vice chair. Others include Alexei Chimendez, Asmus Freytag, Barry Dorrans, Catherine “whitequark”, Chris Ries, Corentin Jabot, Dante Gagne, Deborah Anderson, Ed Schonberg, Elnar Dakeshov, Jan Lahoda, Julie Allen, Ken Whistler, Liang Hai (梁海), Manish Goregaokar, Mark Davis, Markus Scherer, Michael Fanning, Nathan Lawrence, Ned Holbrook, Peter Constable, Randy Brukardt, Rich Gillam, Richard Smith, Roozbeh Pournader, Steve Dower, and Tom Honermann. For more details on their contributions, see Acknowledgements.
Having completed its main task, the SCWG is formally being retired — but we are keeping the list of participants in case we need to call on their expertise in the future!
Support Unicode
To support Unicode’s mission to ensure everyone can communicate in their languages across all devices, please consider adopting a character, making a gift of stock, or making a donation. As Unicode, Inc. is a US-based open source, open standards, non-profit, 501(c)3 organization, your contribution may be eligible for a tax deduction. Please consult with a tax advisor for details.
Labels: bidi, SCWG, source code, spoofing, UTS #55