Copyright © 2004-2021 W3C® (MIT, ERCIM, Keio, Beihang). W3C liability, trademark and permissive document license rules apply.
This document builds upon Character Model for the World Wide Web 1.0: Fundamentals [ CHARMOD ] to provide authors of specifications, software developers, and content developers a common reference on string identity matching on the World Wide Web and thereby increase interoperability.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at https://www.w3.org/TR/.
To make it easier to track comments, please raise separate issues or emails for each comment, and point to the section you are commenting on using a URL.
This document was published by the Internationalization Working Group as an Editor's Draft.
GitHub Issues are preferred for discussion of this specification.
Publication as an Editor's Draft does not imply endorsement by the W3C Membership.
This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This document was produced by a group operating under the 1 August 2017 W3C Patent Policy . The group does not expect this document to become a W3C Recommendation. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy .
This document is governed by the 15 September 2020 W3C Process Document .
The goal of the Character Model for the World Wide Web is to facilitate use of the Web by all people, regardless of their language, script, writing system, or cultural conventions, in accordance with the W3C goal of universal access . One basic prerequisite to achieve this goal is to be able to transmit and process the characters used around the world in a well-defined and well-understood way.
This document builds on Character Model for the World Wide Web: Fundamentals [ CHARMOD ]. Understanding the concepts in that document is important to understanding and applying this document successfully.
This part of the Character Model for the World Wide Web covers string matching—the process by which a specification or implementation defines whether two string values are the same or different from one another. It describes the ways in which texts that are semantically equivalent can be encoded differently and the impact this has on matching operations important to formal languages (such as those used in the formats and protocols that make up the Web).
The main target audience of this specification is W3C specification developers. This specification, or parts of it, can be referenced from other W3C specifications, and it defines conformance criteria for W3C specifications as well as for other specifications.
Other audiences of this specification include software developers, content developers, and authors of specifications outside the W3C . Software developers and content developers implement and use W3C specifications. This specification defines some conformance criteria for implementations (software) and content that implement and use W3C specifications. It also helps software developers and content developers to understand the character-related provisions in W3C specifications.
The character model described in this specification provides authors of specifications, software developers, and content developers with a common reference for consistent, interoperable text manipulation on the World Wide Web. Working together, these three groups can build a globally accessible Web.
This document defines one of the basic building blocks for the Web related to this problem by defining rules and processes for String Identity Matching in document formats. These rules are designed for the identifiers and structural markup ( syntactic content ) used in document formats, to ensure consistent processing of each, and are targeted primarily at specification writers, although much of the material is also relevant to implementers.
This document is divided into two main sections.
The first section lays out the problems involved in string matching and the effects of Unicode and case folding on those problems, and outlines the various issues and normalization mechanisms that might be used to address them.
The second section provides requirements and recommendations for string identity matching for use in formal languages , such as many of the document formats defined in W3C Specifications. It is primarily concerned with making the Web functional and providing document authors with consistent results.
This section provides some historical background on the topics addressed in this specification.
At the core of the character model is the Universal Character Set (UCS), defined jointly by the Unicode Standard [ Unicode ] and ISO/IEC 10646 [ ISO10646 ]. In this document, Unicode is used as a synonym for the Universal Character Set. A successful character model allows Web documents authored in the world's writing systems, scripts, and languages (and on different platforms) to be exchanged, read, and searched by the Web's users around the world.
The first few chapters of the Unicode Standard [ Unicode ] provide useful background reading.
For information about the requirements that informed the development of important parts of this specification, see Requirements for String Identity Matching and String Indexing [ CHARREQ ].
This section contains terminology and notation specific to this document.
The Web is built on text-based formats and protocols. In order to describe string matching or searching effectively, it is necessary to establish terminology that allows us to talk about the different kinds of text within a given format or protocol, as the requirements and details vary significantly.
A Unicode code point (or "code point") refers to the numeric value assigned to each Unicode character. Unicode code points range from 0 to 0x10FFFF. (See Section 4.1 in [ CHARMOD ] for a deeper discussion of character encoding terminology.)
Unicode code points are denoted as U+hhhh, where hhhh is a sequence of at least four, and at most six, hexadecimal digits. For example, the character € [ U+20AC EURO SIGN ] has the code point U+20AC, while the character 😺 [ U+1F63A SMILING CAT FACE WITH OPEN MOUTH ] has the code point U+1F63A.
Some characters used in this document's examples might not appear as intended on your specific device or display. Usually this is due to lack of a script-specific font installed locally or due to other limitations of your specific rendering system. This document uses a Webfont to provide fallback glyphs for many of the non-Latin characters, but your device might not support displaying the font. To the degree possible, the editors have tried to ensure that the examples nevertheless remain understandable.
A legacy character encoding is a character encoding form that does not encode the full repertoire of characters in the Unicode character set.
A transcoder is a process that converts text between two character encodings. Most commonly in this document it refers to a process that converts from a legacy character encoding to a Unicode encoding form , such as UTF-8.
Natural language is the spoken, written, or signed communications used by human beings (see also [ LTLI ]).
Syntactic content is any text in a document format or protocol that belongs to the structure of the format or protocol. This definition includes values that are typically thought of as "markup" but can also include other values, such as the name of a field in an HTTP header. Syntactic content consists of all of the characters that form the structure of a format or protocol. For example, < and > (as well as the element name and various attributes they surround) are part of the syntactic content in an HTML document.
Syntactic content usually is defined by a specification or specifications and includes both the defined, reserved keywords for the given protocol or format as well as string tokens and identifiers that are defined by document authors to form the structure of the document (rather than the "content" of the document).
A vocabulary is the list of reserved keywords and/or rules for assigning user-supplied values (such as identifiers) in a format or protocol. This can include restrictions on range, order, or type of characters that can appear in different places.
For example, HTML defines the names of its elements and attributes, as well as enumerated attribute values, which defines the "vocabulary" of HTML syntactic content . Another example would be ECMAScript, which restricts the range of characters that can appear at the start or in the body of an identifier or variable name. It applies different rules for other cases, such as to the values of string literals.
Values within a vocabulary fall into two broad classes: those that are meant to be seen, read, or interacted with by humans (and thus might be expected to contain natural language text); and those that are application or protocol internal and not intended for human interaction.
A user-facing identifier is an identifier defined by or assigned by a user in a vocabulary that is intended to be at least potentially visible to end-users (and thus is localizable content ).
An application internal identifier is an identifier defined by or assigned by a user in a vocabulary that is internal to the document format or protocol and not intended for human interaction. Such values are generally not localizable content .
A user-supplied value is unreserved syntactic content in a vocabulary that is assigned by users, as distinct from reserved keywords in a given format or protocol. For example, CSS class names are part of the syntax of a CSS style sheet: they are not reserved keywords predefined by any CSS specification, but they are still subject to the syntactic rules of CSS. Users generally expect that their user-supplied values can be words or phrases in their preferred natural language, subject to the limitations of those rules. This is why [ CHARMOD ] recommends that "Specifications SHOULD NOT arbitrarily exclude code points from the full range of Unicode code points from U+0000 to U+10FFFF inclusive."
Localizable content refers to the language-bearing contents of a document intended as human-readable text, and not to any of the surrounding or embedded syntactic content that forms part of the document structure. You can think of it as the actual "content" of the document or the "message" in a given protocol. Note that syntactic content can have localizable content embedded in it, such as when an [ HTML ] img element has an alt attribute containing a description of the image.
A resource , in the context of this document, is a given document, file, or protocol "message" which includes both the localizable content as well as the syntactic content, such as identifiers, surrounding or containing it. For example, in an HTML document that also has some CSS and a few script tags with embedded JavaScript, the entire HTML document, considered as a file, is a resource. This term is intentionally similar to the term 'resource' as used in [ RFC3986 ], although here the term is applied loosely.
A grapheme is a sequence of one or more characters in a visual representation of some text that a typical user would perceive as being a single unit (character). Graphemes are important for a number of text operations such as sorting or text selection, so it is necessary to be able to compute the boundaries between each user-perceived character. Unicode defines the default mechanism for computing graphemes in Unicode Standard Annex #29: Text Segmentation [ UAX29 ] and calls this approximation a grapheme cluster . There are two types of default grapheme cluster defined. Unless otherwise noted, grapheme cluster in this document refers to an extended default grapheme cluster. (A discussion of grapheme clusters is also given in Section 2 of the Unicode Standard [ Unicode ]. Cf. near the end of Section 2.11 in version 8.0 of The Unicode Standard.)
Because different natural languages have different needs, grapheme clusters can also sometimes require tailoring. For example, a Slovak user might wish to treat the default pair of grapheme clusters "ch" as a single grapheme cluster. Note that the interaction between the language of string content and the end-user's preferences might be complex.
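As an illustration of the difference between code points and grapheme clusters, the following sketch uses Python together with the third-party regex module (an assumption; it is not part of the standard library), whose \X pattern matches extended grapheme clusters as described in [ UAX29 ]:

import regex  # third-party module that supports \X (extended grapheme clusters)

decomposed = "e\u0301"   # "é" encoded as a base letter plus a combining acute accent
print(len(decomposed))                   # 2 code points
print(regex.findall(r"\X", decomposed))  # ['é'] : one user-perceived character

hangul = "\u1112\u1161\u11AB"            # "한" encoded as three conjoining jamo
print(len(hangul))                       # 3 code points
print(regex.findall(r"\X", hangul))      # ['한'] : still a single grapheme cluster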
This section illustrates some of the terminology defined above. For illustration purposes we'll use the following small HTML file as an example (line numbers added for reference):
1  <html lang="en" dir="ltr">
2    <head>
3      <meta charset="UTF-8">
4      <title>Shakespeare</title>
5    </head>
6    <body>
7      <img src="shakespeare.jpg" alt="William Shakespeare" id="shakespeare_image">
8      <p>What&rsquo;s in a name? That which we call a rose by any other name would smell as sweet.</p>
9    </body>
10 </html>
The syntactic content in this example includes the element names, the attribute names, and the attribute values, such as the id value shakespeare_image on line 7. (Note that the entity &rsquo; embedded in the sentence on line 8 is part of the syntactic content.)
The localizable content includes the title Shakespeare on line 4, the alt value on line 7 (William Shakespeare), and the sentence on line 8.
All of the text above (all of the text in the file) makes up the resource. It's possible that a given resource will contain no localizable content at all (consider an HTML document consisting of four empty div elements styled to be orange rectangles). It's also possible that a resource will contain no syntactic content and consist solely of localizable content: for example, a plain text file with a soliloquy from Hamlet in it.
Notice too that the HTML entity &rsquo; appears in the localizable content and belongs to both the localizable content and the syntactic content in this resource.
As well as sections marked as non-normative, all authoring guidelines, diagrams, examples, and notes in this specification are non-normative. Everything else in this specification is normative.
The key words MAY , MUST , MUST NOT , OPTIONAL , RECOMMENDED , SHOULD , and SHOULD NOT in this document are to be interpreted as described in BCP 14 [ RFC2119 ] [ RFC8174 ] when, and only when, they appear in all capitals, as shown here.
This document describes best practices for the authors of other specifications, as well as recommendations for implementations and content authors. These best practices can also be found in the Internationalization Working Group's document Internationalization Best Practices for Spec Developers [ INTERNATIONAL-SPECS ], which is intended to serve as a general reference for all Internationalization best practices in W3C specifications.
[C] When a best practice or recommendation appears in this document, it has been styled like this paragraph. Recommendations for specifications and spec authors are labelled [S]. Recommendations for implementations and software developers are labelled [I]. Recommendations for content and content authors are labelled [C].
Best practices in this document use [ RFC2119 ] keywords in order to clarify the Internationalization Working Group's intentions regarding a specific recommendation. Following the recommendations in this document can help avoid issues during the W3C 's "wide review" process, during implementation, or in the content that authors produce. This document is not, itself, normative and can be revised from time to time.
Specifications can claim conformance to this document if they:
Requirements placed on specifications might indirectly cause requirements to be placed on implementations or content that claim to conform to those specifications.
Where this specification contains a procedural description, it is to be understood as a way to specify the desired external behavior. Implementations MAY use other means of achieving the same results, as long as observable behavior is not affected.
The Web is primarily made up of document formats and protocols based on character data. These formats or protocols can be viewed as a set of resources consisting mainly of text files that include some form of structural markup or syntactic content . Processing such syntactic content or document data requires string-based operations such as matching (including regular expressions), indexing, searching, sorting, and so forth.
Users, particularly implementers, sometimes have naïve expectations regarding the matching or non-matching of similar strings or of the efficacy of different transformations they might apply to text, particularly to syntactic content, but including many types of text processing on the Web.
Because the Web is fundamentally sensitive to the different ways in which text can be represented in a document, failing to consider those variations can confuse users or cause unexpected or frustrating results. In the sections below, this document examines the different types of text variation that affect both user perception of text on the Web and the string processing on which the Web relies.
Some scripts and writing systems make a distinction between UPPER, lower, and Title case characters. Most scripts, including the Brahmic scripts of India, the Arabic script, and the scripts used to write Chinese, Japanese, or Korean, do not have a case distinction, but some important ones do. Examples of such scripts include the Latin script used in the majority of this document, as well as scripts such as Greek, Armenian, and Cyrillic.
Case mapping is the process of transforming characters to a specific case, such as UPPER, lower, or Titlecase. For those scripts that have a case distinction, Unicode defines a default UPPER, lower, and Titlecase character mapping for each Unicode code point. Case mapping, at first, appears simple. However there are variations that need to be considered when treating the full range of Unicode in diverse languages.
Case folding is the process of making two texts which differ only in case identical for comparison purposes, that is, it is meant for the purpose of string matching. This is distinct from case mapping , which is primarily meant for display purposes. As with the default case mappings, Unicode defines default case fold mappings ("case folding") for each Unicode code point. Unicode defines two forms of case folding, which we'll examine below.
Since most scripts do not have a case distinction, as with case mappings, most Unicode code points do not require a case folding. For those code points that have a case folding, the majority have a simple, straightforward mapping to another single matching (generally lowercase) code point. Unicode calls this set of foldings common , since both types of case folding defined by Unicode include these.

A few characters have a case folding that maps one Unicode code point to two or more code points. This set of case foldings is called the full case foldings. The full and common case foldings are used together to provide the default case folding for all of Unicode. We refer to this form of case folding as full casefolding or Unicode full in this document.

Because some applications cannot allocate additional storage when performing a case fold operation, Unicode also provides a simple case folding that maps a code point whose full case folding would produce more than one code point to a single code point for comparison purposes instead. Unlike the full folding, this folding invariably alters the content (and potentially the meaning) of the text. As with full casefolding , the simple casefolding , or Unicode simple casefold, is a combination of simple and common mappings so as to cover the full range of Unicode. Unicode simple is not appropriate for use on the Web.
Note that case folding removes information from a string which cannot be recovered later. For example, two s letters in German do not necessarily represent ß in unfolded text.
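For example, Python's built-in str.casefold() applies Unicode full case folding, while str.lower() applies the default lowercase case mapping; a short sketch:

print("Straße".casefold())   # 'strasse' : ß has a full case folding to "ss"
print("STRASSE".casefold())  # 'strasse' : the two strings now compare equal
print("Straße".lower())      # 'straße'  : case mapping (lowercasing) preserves ß

Once folded, it is no longer possible to tell whether the original text contained ß or ss.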
Another aspect of case mapping and case folding is that it can be language sensitive. Unicode defines default case mappings and case foldings for each encoded character, but these are only defaults and are not appropriate in all cases. Some languages need case mapping to be tailored to meet specific linguistic needs. One example of this is the Turkic languages written in the Latin script, which treat the dotless letter ı [ U+0131 LATIN SMALL LETTER DOTLESS I ] and the dotted letter i as distinct letters: the uppercase of i is İ [ U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE ] and the lowercase of I is ı, rather than the default mappings I and i.
While the example above (and this document in general) focuses on case folding for the purpose of matching, note that case mapping is also language-specific. The name of the second largest city in Turkey is "Diyarbakır", which contains both the dotted and dotless letters i .
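A minimal sketch of why tailoring is needed, using Python's default (untailored) case mappings; a tailored mapping, for example via an ICU-based library, would be required to obtain the Turkish result:

print("DİYARBAKIR".lower())   # 'di̇yarbakir' : default mappings, wrong for Turkish
# The expected Turkish lowercase form is 'diyarbakır':
# dotted İ [U+0130] lowercases to i, and dotless I lowercases to ı [U+0131].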
Some document formats or protocols seek to aid interoperability or provide an aid to content authors by ignoring case variations in the vocabulary they define or in user-supplied values permitted by the format or protocol.
Sometimes case can vary in a way that is not semantically meaningful or is not fully under the user's control. This is particularly true when searching a document, but may sometimes also apply when defining rules for matching user- or content-generated values, such as identifiers. In these situations, case-insensitive matching might be desirable instead.
When defining a vocabulary , one important consideration is whether the values are restricted to the ASCII [ ASCII ] subset of Unicode or if the vocabulary permits the use of characters (such as accents on Latin letters or a broad range of Unicode including non-Latin scripts) that potentially have more complex case folding requirements. To address these different requirements, there are four types of casefold matching defined by this document for the purposes of string identity matching in document formats or protocols:
Case sensitive matching : code points are compared directly with no case folding.
ASCII case-insensitive matching is defined in [ INFRA ]. This definition compares two sequences of code points as if all ASCII code points in the range 0x41 to 0x5A (A to Z) were mapped to the corresponding code points in the range 0x61 to 0x7A (a to z). ASCII case-insensitive matching can be required when a vocabulary is itself constrained to ASCII.
Unicode case-insensitive matching compares two sequences of code points as if the Unicode full casefolding (see above) had been applied to both input sequences.
Language-sensitive case-insensitive matching is useful in the rare case where a document format or protocol contains information about the language of the syntactic content and where language-sensitive case folding might sensibly be applied. These case foldings are defined in the Common Locale Data Repository [ UAX35 ] project of the Unicode Consortium.
For advice on how to handle case folding see § 3.2.6 Additional Considerations for Case Folding .
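The differences between these types of matching can be sketched in Python (an illustration, not a normative algorithm): case-sensitive matching is a direct code point comparison, ASCII case-insensitive matching folds only A to Z, and Unicode case-insensitive matching applies the full casefolding to both inputs.

def ascii_ci_equal(a: str, b: str) -> bool:
    # Fold only the ASCII letters A-Z (0x41-0x5A) to a-z (0x61-0x7A).
    fold = {cp: cp + 0x20 for cp in range(0x41, 0x5B)}
    return a.translate(fold) == b.translate(fold)

def unicode_ci_equal(a: str, b: str) -> bool:
    # Compare as if Unicode full casefolding had been applied to both inputs.
    return a.casefold() == b.casefold()

print("CharSet" == "charset")                # False : case-sensitive comparison
print(ascii_ci_equal("CharSet", "charset"))  # True
print(ascii_ci_equal("GRÜSS", "grüss"))      # False : Ü is outside the ASCII range
print(unicode_ci_equal("GRÜSS", "grüß"))     # True  : ß full-casefolds to "ss"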
A different kind of variation can occur in Unicode text: sometimes several different Unicode code point sequences can be used to represent the same abstract character. When searching or matching text by comparing code points, these variations in encoding can cause text values that users expect to be the same not to match.
Because applications need to find the semantic equivalence in texts that use different code point sequences, Unicode defines a means of making two semantically equivalent texts identical: the Unicode Normalization Forms [ UAX15 ].
Resources are often susceptible to the effects of these variations because their specifications and implementations on the Web do not require Unicode normalization of the text, nor do they take into consideration the string matching algorithms used when processing the syntactic content (including user-supplied values ) and localizable content later. For this reason, content developers need to ensure that they have provided a consistent representation in order to avoid problems later.
However, it can be difficult for users to assure that a given resource or set of resources uses a consistent textual representation because the differences are usually not visible when viewed as text. Tools and implementations thus need to consider the difficulties experienced by users when visually or logically equivalent strings that "ought to" match (in the user's mind) are considered to be distinct values. Providing a means for users to see these differences and/or normalize them as appropriate makes it possible for end users to avoid failures that spring from invisible differences in their source documents. For example, the W3C Validator warns when an HTML document is not fully in Unicode Normalization Form C.
Unicode defines two types of equivalence between characters: canonical equivalence and compatibility equivalence .
Canonical equivalence is a fundamental equivalency between Unicode code points or sequences of Unicode code points that represent the same abstract character. Canonically equivalent sequences ideally should have the same visual appearance (although there are many factors that can cause them to appear somewhat differently) and they should be treated and processed as if they were identical. Unicode defines a process called canonical decomposition that removes these primary distinctions between two differently-encoded but canonically equivalent texts.
Examples of canonical equivalence defined by Unicode include:
Compatibility equivalence is a weaker equivalence between Unicode characters or sequences of Unicode characters that represent the same abstract character, but may have a different visual appearance or behavior. Generally the process called compatibility decomposition removes formatting variations, such as superscript, subscript, rotated, circled, and so forth, but other variations also occur. In many cases, characters with compatibility decompositions represent a distinction of a semantic nature; replacing the use of distinct characters with their compatibility decomposition can therefore change the meaning of the text. Texts that are equivalent after compatibility decomposition often were not perceived as being identical beforehand and SHOULD NOT be treated as equivalent by a formal language.
In the above table, it is important to note that the characters illustrated are actual Unicode codepoints , not just presentational variations due to context or style. Each character was encoded into Unicode for compatibility with various legacy character encodings. They should not be confused with the normal kinds of presentational processing used on their non-compatibility counterparts.
For example, most Arabic-script text uses the characters in the Arabic script block of Unicode (starting at U+0600 ). The actual glyphs used to display the text are selected using fonts and text processing logic based on the position inside a word (initial, medial, final, or isolated), in a process called "shaping". In the table above, the four presentation forms of the Arabic letter ه [ U+0647 ARABIC LETTER HEH ] are shown. The characters shown are compatibility characters in the Arabic Presentation Forms-B block ( U+FE70 through U+FEFF ), each of which represents a specific "positional" shape, and each of the four code points shown has a compatibility decomposition to the regular Arabic letter ه [ U+0647 ARABIC LETTER HEH ] . These presentation forms are intended only for the support of round-trip encoding conversions with the legacy character encodings that include equivalent presentation forms. Otherwise a string containing a sequence of letters 'heh' is just encoded as a series of U+0647 code points, with the rendering system and font supplying the appropriate shapes.
Similarly, the variations in half-width and full-width forms and rotated characters (for use in vertical text) are encoded as separate code points, mainly for compatibility with legacy character encodings. In many cases these variations are associated with the Unicode properties described in East Asian Width [ UAX11 ]. See also Unicode Vertical Text Layout [ UTR50 ] for a discussion of vertical text presentation forms.
In the case of characters with compatibility decompositions, such as those shown above, the compatibility ("K") Unicode Normalization forms convert the text to the "normal" or "expected" Unicode code point. But the existence of these compatibility characters cannot be taken to imply that similar appearance variations produced in the normal course of text layout and presentation are affected by Unicode Normalization. They are not.
These two types of Unicode-defined equivalence are then grouped by another pair of variations: "decomposition" and "composition". In "decomposition", separable logical parts of a visual character are broken out into a sequence of base characters and combining marks and the resulting code points are put into a fixed, canonical order. In "composition", the decomposition is performed and then combining marks are recombined according to certain rules with their base characters.
Roughly speaking, NFC is defined such that each combining character sequence (a base character followed by one or more combining characters) is replaced, as far as possible, by a canonically equivalent precomposed character.
It is rather important to notice what this does not mean. The resulting character sequence can still contain combining marks, since not all character sequences have a precomposed equivalent. Indeed, as we've seen, many scripts offer no alternative to the use of combining marks, such as the Devanagari vowels in this example . In other cases, a given base character and combining mark is not replaced with a precomposed character because the combination is blocked by normalization rules. For example, some Indic scripts do not compose certain sequences of base plus diacritic, even though a matching precomposed character exists, due to composition exclusion rules. Composition may also be blocked by another combining mark between the two characters that would otherwise combine.
There are four Unicode Normalization Forms, each named using a letter code: NFD (Canonical Decomposition), NFC (Canonical Decomposition followed by Canonical Composition), NFKD (Compatibility Decomposition), and NFKC (Compatibility Decomposition followed by Canonical Composition).
Unicode Normalization reduces these (and other potential sequences of escapes representing the same character) to just three possible variations. However, Unicode Normalization doesn't remove all textual distinctions and sometimes the application of Unicode Normalization can remove meaning that is distinctive or meaningful in a given context. For example:
The character ½ [ U+00BD VULGAR FRACTION ONE HALF ], when normalized using one of the compatibility normalization forms (NFKD or NFKC), becomes the three-character sequence 1⁄2 ( U+0031 U+2044 U+0032 ), so that the string 8½ becomes 81⁄2.
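A brief sketch of these effects using Python's unicodedata module:

import unicodedata

precomposed = "\u00E9"    # é as a single code point
decomposed  = "e\u0301"   # e followed by a combining acute accent

print(precomposed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True

# The compatibility forms also remove formatting distinctions, with the
# potential change of meaning described above:
print(unicodedata.normalize("NFKC", "8\u00BD"))   # '81⁄2' (8, 1, U+2044, 2)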
Many users are surprised to find that two identical-looking strings (including those that have had a specific Unicode normalization form applied) might not in fact use the same underlying Unicode code points . This includes strings that have had the more-destructive NFKC and NFKD compatibility normalization forms applied to them. Even when strings, tokens, or identifiers appear visually to be the same, they can be encoded differently.
The Unicode canonical normalization forms are concerned with folding the multiple different code point sequences that can be used for a given abstract character or grapheme cluster so that they use the same code point sequence. However, logically distinct characters or grapheme clusters can still look the same or very similar. When a pair of graphemes look identical (or very similar), they are called homographs . When a pair of graphemes look similar or are homographs but actually represent logically different characters or character sequences, they are said to be confusable .
Examples of identical or identical-seeming appearance can appear even within a single script. This can take the form of similarly shaped characters, such as "0" and "O" or "l" and "1". But other scripts or the use of different compatibility characters can present much less readily distinguished variations. In some cases, Unicode Normalization brings these together, but in many other cases it does not.
Characters that are identical or confusable in appearance can present spoofing and other security risks. This can be true within a single script or for similar characters in separate scripts. For further discussion and examples of homoglyphs and confusability, one useful reference is [ UTS39 ].

In addition to identical or similar-appearing characters, the opposite problem also exists: Unicode Normalization, even the NFKC and NFKD Compatibility forms, does not bring together characters that have the same intrinsic meaning or function but which vary in appearance or usage. For example, U+002E (.) and U+3002 (。) both function as sentence-ending punctuation, but the distinction is not removed by normalization because the characters have a distinct identity.
When matching strings in a case-insensitive manner, one complication is that the case folding process can produce strings that are not normalized, even if the original strings were normalized. Since string comparison relies on matching code point sequences, each string must be normalized after the casing operation is applied if the matching process is to be reliable. The Unicode canonical normalization forms ( NFC or NFD) and case folding, when used together, are closed: once a string has been case folded and then had NFC or NFD applied to it, further applications of the same case folding and Unicode normalization form do not result in a different string.
When comparing strings for compatibility equivalence between characters (in other words, the NFKC/NFKD forms), the case fold-and-normalize operation must be performed twice because the compatibility decomposition step can result in characters that need to be case folded and the subsequent case fold can result in a sequence that must then be normalized.
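A sketch of these two operations in Python, following the Unicode definitions of canonical caseless matching (rule [D145]) and compatibility caseless matching (rule [D146]) referenced below; this is an illustration, not a normative procedure:

import unicodedata

def canonical_caseless(s: str) -> str:
    # D145: normalize to NFD, casefold, then normalize to NFD again.
    folded = unicodedata.normalize("NFD", s).casefold()
    return unicodedata.normalize("NFD", folded)

def compatibility_caseless(s: str) -> str:
    # D146: the fold-and-normalize step is applied twice, using the
    # compatibility form NFKD for the second and final normalizations.
    step = unicodedata.normalize("NFD", s).casefold()
    step = unicodedata.normalize("NFKD", step).casefold()
    return unicodedata.normalize("NFKD", step)

# Two strings match caselessly when their respective results are equal:
print(canonical_caseless("Straße") == canonical_caseless("STRASSE"))   # True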
Unicode's definitions of canonical caseless matching (rule [D145] ) and compatibility caseless matching (rule [D146] ) include an additional initial step of normalization to Unicode Normalization Form NFD. This increases the complexity and cost of performing caseless matching. This pre-casefold normalization exists to address the somewhat obscure quirk of certain ancient Greek code point sequences described in this section.

Sixty-three Greek precomposed characters have a decomposition mapping (i.e. normalize to form NFD) that contains the character [ U+0345 COMBINING GREEK YPOGEGRAMMENI ], a diacritical mark representing a subscripted iota (referred to as a prosgegrammeni or ypogegrammeni ). This mark represents an orthographic form found in ancient or classical Greek for a sound not present in more modern forms of the language. The uppercase and titlecase mappings of these characters separate this combining mark as the separate, base, letter iota . For consistency with the title/uppercase mappings, the case fold mapping of these characters therefore contains a ι [ U+03B9 GREEK SMALL LETTER IOTA ] (recall that Unicode case folding generally maps to lowercase).
If and only if one of these 63 characters is followed by a combining mark, failing to apply canonical decomposition before case folding can result in potential comparison mismatches. Such character sequences are not in a normalized form and are difficult to produce "naturally" (through keyboards and other input processes).
For example, if one starts with the precomposed ( NFC ) character ᾌ [ U+1F8C GREEK CAPITAL LETTER ALPHA WITH PSILI AND OXIA AND PROSGEGRAMMENI ] (which is the most common way to represent this combination of base and diacritic characters) and just runs case fold transformations one ends up with: ἄι [ U+1F04 GREEK SMALL LETTER ALPHA WITH PSILI AND OXIA + U+03B9 GREEK SMALL LETTER IOTA ].
If instead one starts from a fully decomposed (NFD) sequence representing the same letter, ᾌ [ U+0391 GREEK CAPITAL LETTER ALPHA + U+0313 COMBINING COMMA ABOVE + U+0301 COMBINING ACUTE ACCENT + U+0345 COMBINING GREEK YPOGEGRAMMENI ] one ends up with ἄι [ U+03B1 GREEK SMALL LETTER ALPHA + U+0313 COMBINING COMMA ABOVE + U+0301 COMBINING ACUTE ACCENT + U+03B9 GREEK SMALL LETTER IOTA ]. Normalizing this string to NFC produces the same character sequence as the first example above:
In both of those cases, the acute accent is associated with the alpha base character rather than the trailing iota.
If, however, one begins with the half-precomposed sequence ᾌ [ U+1F88 GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI + U+0301 COMBINING ACUTE ACCENT ], one ends up with ἀί [ U+1F00 GREEK SMALL LETTER ALPHA WITH PSILI + U+03B9 GREEK SMALL LETTER IOTA + U+0301 COMBINING ACUTE ACCENT ] where the acute accent is associated with the iota. This produces a sequence that can't be normalized to match the others (and is actually incorrect, as it has a different meaning than the original user-perceived character).
As mentioned above, Unicode solves this matching problem by normalizing the text to NFD before performing the case fold operation. Then ᾌ [ U+1F8C GREEK CAPITAL LETTER ALPHA WITH PSILI AND OXIA AND PROSGEGRAMMENI ] and ᾌ [ U+1F88 GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI + U+0301 COMBINING ACUTE ACCENT ] both end up the same as the decomposed version, i.e. ᾌ [ U+0391 GREEK CAPITAL LETTER ALPHA + U+0313 COMBINING COMMA ABOVE + U+0301 COMBINING ACUTE ACCENT + U+0345 COMBINING GREEK YPOGEGRAMMENI ]. If one now case folds that sequence and normalizes, it produces a match in all cases.
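The effect can be sketched in Python (repeating the canonical caseless operation from the earlier sketch); the three encodings of the same user-perceived character only compare equal when NFD is applied before the case fold:

import unicodedata

def canonical_caseless(s: str) -> str:
    folded = unicodedata.normalize("NFD", s).casefold()
    return unicodedata.normalize("NFD", folded)

precomposed      = "\u1F8C"                      # single precomposed code point
half_precomposed = "\u1F88\u0301"                # precomposed base + separate acute accent
decomposed       = "\u0391\u0313\u0301\u0345"    # fully decomposed sequence

# Case folding without the initial NFD step produces sequences that cannot
# be normalized into a match:
print(unicodedata.normalize("NFC", precomposed.casefold()) ==
      unicodedata.normalize("NFC", half_precomposed.casefold()))        # False

# Normalizing to NFD before folding makes all three encodings compare equal:
print(canonical_caseless(precomposed) ==
      canonical_caseless(half_precomposed) ==
      canonical_caseless(decomposed))                                   # True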
Most document formats or protocols provide an escaping mechanism to permit the inclusion of characters that are otherwise difficult to input, process, or encode. These escaping mechanisms provide an additional equivalent means of representing characters inside a given resource. They also allow for the encoding of Unicode characters not represented in the character encoding scheme used by the document.
For further discussion of character escapes, including guidelines for the definition of escaping mechanisms in specifications, see: Section 4.6 of [ CHARMOD ].
The expansion of character escapes and includes is dependent on context, that is, on which syntactic content or programming language is considered to apply when the string matching operation is performed. Consider a search for the string suçon in an XML document containing the escaped form su&#xE7;on but not the literal suçon. If the search is performed in a plain text editor, the context is plain text (no syntactic content or programming language applies), the &#xE7; character escape is not recognized, hence not expanded, and the search fails. If the search is performed in an XML browser, the context is XML , the character escape (defined by XML) is expanded, and the search succeeds.
An intermediate case would be an XML editor that purposefully provides a view of an XML document with entity references left unexpanded. In that case, a search over that pseudo-XML view will deliberately not expand entities: in that particular context, entity references are not considered includes and need not be expanded
For example, € [ U+20AC EURO SIGN ] can also be encoded in HTML as the hexadecimal character reference &#x20ac; or as the decimal character reference &#8364;. In a JavaScript or JSON file it can appear as \u20ac (or, in JavaScript, as \u{20AC}), while in a CSS stylesheet it can appear as \20ac. All of these representations encode the same literal character value: €.
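For illustration, the same value can be recovered from several of these escape mechanisms using Python's standard library:

import html
import json

# Different escaping mechanisms, one literal character: all yield U+20AC.
print(html.unescape("&#x20ac;"))   # '€' from the HTML hexadecimal character reference
print(html.unescape("&#8364;"))    # '€' from the HTML decimal character reference
print(json.loads('"\\u20ac"'))     # '€' from the JSON/JavaScript escape
print("\u20ac")                    # '€' as a literal (escaped in Python source)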
Character escapes are normally interpreted before a document is processed and strings within the format or protocol are matched. Consider, for example, a CSS style sheet that defines a rule for the class name using the escape h\e9llo, applied to an HTML document whose markup assigns the same class name using an HTML character reference (such as h&#xE9;llo) to a paragraph containing the text "Hello world!".
You would expect that text to display like the following: Hello world!
In order for this to work, the user agent (browser) had to match two strings representing the class name héllo , even though the CSS and HTML each used a different escaping mechanism. This example demonstrates one way that text can vary and still be considered "the same" according to a specification: the class name h\e9llo matched the escaped class name in the HTML markup (and would also match the literal value héllo using the code point é [ U+00E9 LATIN SMALL LETTER E WITH ACUTE ] ).
Formal languages and document formats often offer facilities for including a piece of text from one resource inside another. An include is a mechanism for inserting content into the body of a resource . Include mechanisms import content into a resource at processing time. This affects the structure of the document and potentially matching against the vocabulary of the document. Examples of includes are entity references in XML, the XInclude [ XInclude ] specification, and @import rules in CSS.
An include is said to be include normalized if it does not begin with a combining mark (either in the form of a character escape or as a character literal in the included resource).
Unicode provides a number of special-purpose characters that help document authors control the appearance or performance of text. Because many of these characters are invisible or do not have keyboard equivalents, users are not always aware of their presence or absence. As a result, these characters can interfere with string matching when they are part of the encoded character sequence but the expected matching text does not also include them. Some examples of these characters include:
The Unicode control characters U+200D Zero Width Joiner (also known as ZWJ ) and U+200C Zero Width Non-Joiner (also known as ZWNJ ). While these characters can be used to control ligature formation—either preventing the formation of undesirable ligatures or encouraging the formation of desirable ones—their primary use is to control the joining and shape selection in complex scripts such as the Arabic or various of the Indic scripts. Some Indic scripts use the ZWJ and ZWNJ characters to allow authors to control the shape that certain conjuncts take. See the discussion in Chapter 12 of [ Unicode ].
The Zero Width Non-Joiner is used in Persian to prevent certain "normal" Arabic script joining. In these cases, the presence or absence of the character does affect the meaning. For example, the word تنها ("alone") and the word تنها ("bodies" or "corpuses") are encoded as " U+062A U+0646 U+0647 U+0627 " and " U+062A U+0646 U+200C U+0647 U+0627 " respectively, the only difference being the ZWNJ in the latter word.
The ZWJ character is also used in forming certain emoji sequences, which is discussed in more detail below .
Variation selectors ( U+FE00 through U+FE0F ) are characters used to select an alternate appearance or glyph (see Character Model: Fundamentals [ CHARMOD ]). For example, they are used to select between black-and-white and color emoji. These are also used in predefined ideographic variation sequences ( IVS ). Many examples are given in the "Standardized Variants" portion of the Unicode Character Database (UCD).
A few scripts also provide a way to encode visual variation selection: a prominent example of this are the Mongolian script's free variation selectors ( U+180B through U+180D ).
The character U+034F Combining Grapheme Joiner , whose name is misleading (as it does not join graphemes), is used to separate characters that might otherwise be considered a grapheme for the purposes of sorting or to provide a means of maintaining certain textual distinctions when applying Unicode normalization to text.
Whitespace variations can also affect the interpretation and matching of text. For example, the various non-breaking space characters, such as NBSP, NNBSP, etc.
U+200B Zero Width Space is a character used to indicate word boundaries in text where spaces do not otherwise appear. For example, it might be used in a Thai language document to assist with word-breaking.
The U+00AD Soft Hyphen can be used in text to indicate a potential or preferred hyphenation position. It only becomes visible when the text is reflowed to wrap at that position.
The U+2060 WORD JOINER , sometimes called WJ , is a zero-width non-breaking space character. Its purpose is to prevent line breaks between two characters. Except for purposes of line-breaking, it should be ignored. It serves as a replacement for the character U+FEFF ZERO WIDTH NO-BREAK SPACE because U+FEFF is more commonly known as the "Byte Order Mark" (BOM). A byte order mark is used at the start of some plain text files to signal that the file is in a Unicode character encoding.
Finally, most scripts, when written horizontally, proceed from left to right. However, some scripts, such as Arabic and Hebrew, are written predominantly from right to left. Texts can be written in a mix of these scripts or include character sequences, such as numbers or quotes in another script, that run in the opposite direction to other parts of the text. This intermixing of text direction is called bidirectional text or bidi for short. The Unicode Bidirectional Algorithm [ UAX9 ] describes how such mixed-direction text is processed for display. For most text, the directional handling can be derived from the text itself. However, there are many cases in which the algorithm needs additional information in order to present text correctly. For more examples, see [ html-bidi ].
One of the ways that Unicode defines to address the ambiguity of text direction is a set of invisible control characters that mark the start and end of directional runs. While bidirectional controls can have an effect on the appearance of the text (since they help the Unicode Bidirectional Algorithm with the presentation of text), they might have no effect if the text would naturally have fallen into bidirectional runs without the controls. Because these controls are, like the characters mentioned above, invisible, they can have an unintentional effect on matching.
In almost all of these cases, users may not be aware of, or cannot be sure, whether a given document or text string includes or omits one of these characters. Because text matching depends on matching the underlying code points, variation in the encoding of the text due to these markers can cause matches that ought to succeed to mysteriously fail (from the point of view of the user).
A newer feature of Unicode are the emoji characters. In [ UTR51 ], Unicode describes these as:
Emoji are pictographs (pictorial symbols) that are typically presented in a colorful cartoon form and used inline in text. They represent things such as faces, weather, vehicles and buildings, food and drink, animals and plants, or icons that represent emotions, feelings, or activities.
Emoji can be used with a variety of emoji modifiers, including U+200D ZERO WIDTH JOINER or ZWJ , to form more complex emoji.
For example, the emoji ( 👪 [ U+1F46A FAMILY ] ) can also be formed by using ZWJ between emoji characters in the sequence U+1F468 U+200D U+1F469 U+200D U+1F466 . Altering or adding other emoji characters can alter the composition of the family. For example the sequence 👨👩👧👧 U+1F468 U+200D U+1F469 U+200D U+1F467 U+200D U+1F467 results in a composed emoji character for a "family: man, woman, girl, girl" on systems that support this kind of composition. Many common emoji can only be formed using ZWJ sequences. For more information, see [ UTR51 ].
Emoji characters can be followed by emoji modifier characters. These modifiers allow for the selection of skin tones for emoji that represent people. These characters are normally invisible modifiers that follow the base emoji that they modify. For example: 👨 👨🏻 👨🏼 👨🏽 👨🏾 👨🏿
An emoji character can also be followed by a variation selector to indicate text (black and white, indicated by U+FE0E Variation Selector 15 ) or color (indicated by U+FE0F Variation Selector 16 ) presentation of the base emoji.
Still another wrinkle in the use of emoji is flags. National flags can be composed using country codes derived from the [ BCP47 ] registry, such as the sequence 🇿 [ U+1F1FF REGIONAL INDICATOR SYMBOL LETTER Z ] 🇲 [ U+1F1F2 REGIONAL INDICATOR SYMBOL LETTER M ] , which is the country code ( ZM ) for the country Zambia: 🇿🇲. Other regional or special purpose flags can be composed using a flag emoji with various symbols or with tag characters terminating in a cancel tag. For example, the flag of Scotland (🏴) can be composed of 🏴 [ U+1F3F4 WAVING BLACK FLAG ] followed by the tag characters for gbsct ( U+E0067 U+E0062 U+E0073 U+E0063 U+E0074 ) and a terminating [ U+E007F CANCEL TAG ].
Each of these mechanisms can be used together, so quite complex sequences of characters can be used to form a single emoji grapheme or image. Even very similar emoji sequences might not use the same exact encoded sequence. In most cases the modifiers and combinations mentioned above are generated by the end-user's keyboard (where they are presented as a single emoji "character"). This diversity of encoding options is partially addressed to the extent that different vendors use only (and exactly) the sequences that are "recommended for interchange" by Unicode. This helps vendors ensure that fonts and keyboards are prepared to give users the options they expect. Still, users generally won't be aware of the underlying encoding complexity and generative mechanisms are not limited to those that are recommended. Emoji sequences are evolving rapidly, so there could be additional developments to either help or hinder matching of emoji in the near future. Unicode normalization does not reorder these sequences or insert or remove any of the modifiers. Users and implementers are therefore cautioned that users who employ emoji characters in namespaces and other matching contexts can easily encounter unexpected "character" mismatches due to variations in encoding.
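A small sketch of the encoding variation described above: the single code point family emoji and the equivalent ZWJ sequence may render alike on supporting systems, but they do not compare equal, and Unicode normalization does not bring them together.

import unicodedata

single   = "\U0001F46A"                                  # 👪 U+1F46A FAMILY
sequence = "\U0001F468\u200D\U0001F469\u200D\U0001F466"  # man + ZWJ + woman + ZWJ + boy

print(len(single), len(sequence))                        # 1 5 (code point counts)
print(single == sequence)                                # False
print(unicodedata.normalize("NFC", sequence) == single)  # False : not normalization-equivalent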
Resources can use different character encoding schemes, including legacy character encodings , to serialize document formats on the Web. Each character encoding scheme uses different byte values and sequences to represent a given subset of the Universal Character Set.
Choosing a Unicode character encoding, such as UTF-8, for all documents, formats, and protocols is a strongly encouraged recommendation , since there is no additional utility to be gained from using a legacy character encoding and the considerations in the rest of this section would be completely avoided.
For example, € [ U+20AC EURO SIGN ] is encoded as the byte sequence 0xE2.82.AC in the UTF-8 character encoding. This same character is encoded as the byte sequence 0x80 in the legacy character encoding windows-1252 . (Other legacy character encodings may not provide any byte sequence to encode the character.)
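These byte-level differences are easy to observe, for example with Python's codecs:

print("\u20ac".encode("utf-8"))          # b'\xe2\x82\xac' : three bytes in UTF-8
print("\u20ac".encode("windows-1252"))   # b'\x80'         : one byte in windows-1252
print("\u20ac".encode("iso-8859-1", errors="replace"))   # b'?' : no € in this legacy encoding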
Specifications mainly address these resulting variations by considering each document to be a sequence of Unicode characters after converting from the document's character encoding (be it a legacy character encoding or a Unicode encoding such as UTF-8) and then unescaping any character escapes before proceeding to process the document.
Even within a single legacy character encoding there can be variations in implementation. One famous example is the legacy Japanese encoding Shift_JIS . Different transcoder implementations faced choices about how to map specific byte sequences to Unicode. So the byte sequence 0x81.60 ( 0x2141 in the JIS X 0208 character set) was mapped by some implementations to U+301C WAVE DASH while others chose U+FF5E FULLWIDTH TILDE . This means that two reasonable, self-consistent transcoders could produce different Unicode character sequences from the same input. The Encoding [ Encoding ] specification exists, in part, to ensure that Web implementations use interoperable and identical mappings. However, there is no guarantee that transcoders consistent with the Encoding specification will be applied to documents found on the Web or used to process data appearing in a particular document format or protocol.
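The wave dash example can be reproduced using Python's built-in codecs as stand-ins for two self-consistent transcoder implementations:

wave_dash_bytes = b"\x81\x60"                 # JIS X 0208 0x2141 serialized in Shift_JIS
print(wave_dash_bytes.decode("shift_jis"))    # '〜' U+301C WAVE DASH
print(wave_dash_bytes.decode("cp932"))        # '～' U+FF5E FULLWIDTH TILDE (Windows-style mapping)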
One additional consideration in converting to Unicode is the existence of legacy character encodings of bidirectional scripts (such as Hebrew and Arabic) that use a visual storage order. That is, unlike Unicode and other modern encodings, the characters are stored in memory in the order that they are printed on the screen from left-to-right (as with a line printer). When converting these encodings to Unicode or when comparing text in these encodings, care must be taken to place both the source and target text into logical order. For more information, see Section 3.3.1 of [ CHARMOD ]
There are additional kinds of equivalence or processing that are appropriate when performing natural language searching or "find" features. These are described in another part of the Character Model series of documents ([ STRING-SEARCH ]). Specifications for a vocabulary or which define a matching algorithm for use in a formal syntax SHOULD avoid trying to apply additional custom folding, mapping, or processing such as described in that document, since these interfere with producing consistent, predictable results.
In the Web environment, where strings can use different character encodings, use different character sequences within those encodings, and have other variations (such as case) described in this document, it's important to establish a consistent process for evaluating string identity.
This chapter defines the requirements for specifying and implementing string matching in syntactic content .
One of the ways in which string matching can be made more effective and consistent is by applying restrictions to the content that is to be matched. The definition of a vocabulary , especially one that permits user-supplied values within that vocabulary, necessarily includes the rules for what makes a "valid identifier". This usually includes length and content restrictions. Some best practices for defining these restrictions include the following:
[S] Specifications SHOULD NOT allow surrogate code points ( U+D800 to U+DFFF ) or non-character code points in identifiers.
[S] Specifications SHOULD NOT allow the C0 ( U+0000 to U+001F ) and C1 ( U+0080 to U+009F ) control characters in identifiers.
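A hypothetical validation sketch for these two recommendations (the function names and the overall identifier syntax are assumptions; a real vocabulary would add its own length and content rules):

def is_acceptable_identifier_code_point(cp: int) -> bool:
    if 0xD800 <= cp <= 0xDFFF:                              # surrogate code points
        return False
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:   # noncharacter code points
        return False
    if cp <= 0x001F or 0x0080 <= cp <= 0x009F:              # C0 and C1 control characters
        return False
    return True

def is_acceptable_identifier(value: str) -> bool:
    return len(value) > 0 and all(is_acceptable_identifier_code_point(ord(c)) for c in value)

print(is_acceptable_identifier("shakespeare_image"))   # True
print(is_acceptable_identifier("bad\u0000name"))       # False : contains a C0 control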
There are two broad classes of identifier: user-facing identifiers and application internal identifiers .
Application internal identifiers are the part of a document format or protocol's vocabulary that are machine readable and not intended for display. These are often given meaningful names (generally in English) as an affordance for developers or content authors who have to work with or debug the contents of a document format or protocol.
[S] Specifications that define application internal identifiers (which are never shown to users and are always used for matching or processing within an application or protocol) SHOULD limit the content to a printable subset of ASCII. ASCII case-insensitive matching is RECOMMENDED .
[S][I] Application internal identifier fields or values MUST be wrapped with a localizable display value when displayed to end-users.
User-facing identifiers are the part of a document format or protocol's vocabulary that are assigned or edited by users or presented to the user for selection. Examples of user-facing identifiers include network names (such as SSIDs); device names; class, style, or attribute names; or user-defined settings or values. Identifiers of this sort are more complex to match due to the issues described in this document, but provide the best experience, particularly for users who do not speak English or who are less familiar with the Latin script.
Many user-facing identifiers are also user-supplied values and can be assigned by users of the document format or protocol. The ability to use the natural language preferred by the user or the user's community or culture provides a superior user experience and makes features more accessible to audiences that may have limited language skills, particularly in English.
[S] When identifiers are visible or potentially visible to users, specifications SHOULD allow the use of non-ASCII Unicode characters, in order to ensure that users in all languages can use the resulting document format or protocol with equal access. Case sensitivity (i.e. no case folding) is RECOMMENDED .
While a wide range of Unicode characters ought to be permitted, specifications can still impose certain practical limits on the content of user-facing identifiers . One example of a specification that defines content rules of this type can be found in Unicode Identifier and Pattern Syntax [ UAX31 ].
The basic decision for choosing a matching algorithm to use for a given specification is the level of text normalization (which includes both case sensitivity and Unicode normalization) to apply to the strings being matched. Historically on the Web most specifications have opted for case-sensitive matching without Unicode normalization and this is the RECOMMENDED form of matching for all new specifications. However, there are cases where case-insensitivity and normalization are useful.
A specification can choose to be case-insensitive if the benefit to users of the format or protocol in question outweighs the cost and complexity of the implementation. Because both case folding and normalization can affect the values being compared, including the presentation and, in some cases, the meaning of the text, and because these operations are relatively expensive, choosing case-insensitivity is generally discouraged.
A special case of case-insensitivity arises with vocabularies limited to the ASCII/Basic Latin range (that is, code points U+0000 to U+007F ). These specifications can choose to be case-insensitive only over that range of characters. This greatly simplifies the implementation of matching. However, this form of matching is not appropriate for specifications that allow a larger range of Unicode in identifiers or syntax, since it is difficult for users to understand the behavior of the matching and it is a disadvantage to users of non-ASCII scripts and languages. That is, users find it confusing and unpredictable when green matches GREEN but grüß does not match GRÜẞ or possibly GRÜSS (but instead matches GRüß ).
The matching algorithm describes the series of steps needed to compare two strings.
The right text normalization for a given specification depends on requirements in the format or protocol's vocabulary. There are four choices for text normalization:
The default normalization step applies neither Unicode normalization nor case folding. This means that it is sensitive to differences in case and to the code points in the source strings being compared. Content authors need to be aware of this and ensure that they use consistent case and consistent character sequences to encode affected text if they expect such tokens to match.
The ASCII Case Fold normalization step performs case folding only over the ASCII range.
For each string, perform the following steps:
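One possible non-normative rendering of an ASCII-only case fold in Python is sketched below; it is not the normative step list, and the function names are illustrative. The fold maps only the letters A-Z to a-z and leaves every other code point, including all non-ASCII code points, untouched.

    def ascii_case_fold(value: str) -> str:
        # Map only A-Z (U+0041..U+005A) to a-z; leave all other code points alone.
        return "".join(
            chr(ord(ch) + 0x20) if "A" <= ch <= "Z" else ch
            for ch in value
        )

    def ascii_fold_match(a: str, b: str) -> bool:
        return ascii_case_fold(a) == ascii_case_fold(b)

    # ascii_fold_match("CHARSET", "charset") is True,
    # while ascii_fold_match("GRÜSS", "grüss") is False,
    # because Ü and ü are outside the ASCII range.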
Specifications that have vocabularies that allow non-ASCII characters (which should include most new vocabularies) and which do not want to be sensitive to case distinctions SHOULD specify this step. Case insensitivity is not recommended for most specifications. Unicode case folding is affected by the input code point sequence. It can also, by itself, produce a denormalized code point sequence. As a result, this normalization step includes Unicode normalization both before and after case folding, so that matching is consistent with user expectations. See § 2.4 Interaction of Normalization and Case Folding for examples.
[ Unicode ] requirement D145 requires a canonical decomposition (form NFD) normalization before the case fold operation to address the corner case described in § 2.4.1 The Optional Initial Normalization Step . Inclusion of the pre-case fold normalization is optional because of the rarity of denormalized data affected by this. This is a WILLFUL VIOLATION of D145 .
For each string, perform the following steps:
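One possible non-normative rendering in Python is sketched below; it is not the normative step list, and the function names are illustrative. Here str.casefold() performs Unicode full case folding, and canonical normalization is applied optionally before and always after the fold, as described above. This sketch uses NFC for the final normalization; the essential point for matching is that both strings receive identical treatment.

    import unicodedata

    def unicode_canonical_case_fold(value: str, pre_normalize: bool = True) -> str:
        # Optional initial canonical decomposition (see the note on D145 above).
        if pre_normalize:
            value = unicodedata.normalize("NFD", value)
        # Unicode full case folding, followed by canonical normalization,
        # because case folding can leave the text denormalized.
        return unicodedata.normalize("NFC", value.casefold())

    def canonical_caseless_match(a: str, b: str) -> bool:
        return unicode_canonical_case_fold(a) == unicode_canonical_case_fold(b)

    # canonical_caseless_match("grüß", "GRÜẞ") is True.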
Specifications that have vocabularies that allow non-ASCII characters and which need to match Unicode compatibility equivalents might use this normalization step. Because the compatibility normalization forms ( NFKC and NFKD ) change the meaning, appearance, and processing of the text, this step SHOULD NOT be used for most applications on the Web.
Case folding is affected by the input code point sequence. It can also produce a denormalized code point sequence. The interaction of compatibility decomposition with case folding requires multiple passes to produce a consistent match. As a result, this normalization step includes multiple uses of Unicode normalization, including both the NFKD form (which supplies the compatibility mapping) and the NFD form. See § 2.4 Interaction of Normalization and Case Folding for examples.
[ Unicode ] requirement D146 requires a canonical decomposition (form NFD) normalization before the initial case fold operation to address the corner case described in § 2.4.1 The Optional Initial Normalization Step . Inclusion of the pre-case fold normalization is optional because of the rarity of denormalized data affected by this. This is a WILLFUL VIOLATION of D146 .
For each string, perform the following steps:
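One possible non-normative rendering in Python is sketched below; it is not the normative step list. It mirrors the structure of [ Unicode ] definition D146 , with the initial canonical decomposition treated as optional per the note above; the function names are illustrative.

    import unicodedata

    def unicode_compatibility_case_fold(value: str, pre_normalize: bool = True) -> str:
        # Optional initial canonical decomposition (see the note on D146 above).
        if pre_normalize:
            value = unicodedata.normalize("NFD", value)
        # Case fold, apply the compatibility decomposition, and repeat,
        # because each pass can reintroduce foldable or decomposable characters.
        value = unicodedata.normalize("NFKD", value.casefold())
        value = unicodedata.normalize("NFKD", value.casefold())
        return value

    def compatibility_caseless_match(a: str, b: str) -> bool:
        return unicode_compatibility_case_fold(a) == unicode_compatibility_case_fold(b)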
[C] Content authors SHOULD enter and store resources in a Unicode character encoding (generally UTF-8 on the Web).
The first step in comparing text is to ensure that both use the same digital representation. This means that implementations need to convert any text in a legacy character encoding to a sequence of Unicode code points. Normally this is done by applying a transcoder to convert the data to a consistent Unicode encoding form (such as UTF-8 or UTF-16). This allows bitwise comparison of the strings in order to determine string equality.
[C] Content authors SHOULD choose a normalizing transcoder when converting legacy encoded text or resources to Unicode unless the mapping of specific characters interferes with the meaning.
A normalizing transcoder is a transcoder that performs a conversion from a legacy character encoding to Unicode and ensures that the result is in Unicode Normalization Form C ( NFC ). For most legacy character encodings, it is possible to construct a normalizing transcoder (by using any transcoder followed by a normalizer); it is not possible to do so if the legacy character encoding's repertoire contains characters not represented in Unicode. While normalizing transcoders only produce character sequences that are in NFC , the converted character sequence might not be include-normalized (for example, if it begins with a combining mark).
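A non-normative Python sketch of the "any transcoder followed by a normalizer" construction described above; the function name and the windows-1252 encoding label are illustrative.

    import unicodedata

    def normalizing_transcode(data: bytes, legacy_encoding: str = "windows-1252") -> str:
        # Convert from the legacy character encoding to Unicode, then
        # ensure the result is in Normalization Form C (NFC).
        return unicodedata.normalize("NFC", data.decode(legacy_encoding))

    # normalizing_transcode(b"\xc5ngstr\xf6m") returns "Ångström" in NFC.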
Because document formats on the Web often interact with or are processed using additional, external resources (for example, a CSS style sheet being applied to an HTML document), the consistent representation of text becomes important when matching values between documents that use different character encodings. Use of a normalizing transcoder helps ensure interoperability by making legacy encoded documents match the normally expected Unicode character sequence for most languages.
Most transcoders used on the Web produce NFC as their output, but several do not. This is usually to allow the transcoder to be round-trip compatible with the source legacy character encoding, to preserve other character distinctions, or to be consistent with other transcoders in use in user-agents. This means that the Encoding specification [ Encoding ] and various other important transcoding implementations include a number of non-normalizing transcoders. Indeed, most compatibility characters in Unicode exist solely for round-trip conversion from legacy encodings and a number of these have singleton canonical mappings in NFC . You saw an example of this earlier in the document with Å [ U+212B ANGSTROM SIGN ] .
Bear in mind that most transcoders produce NFC output and that even those transcoders that do not produce NFC for all characters produce NFC for the preponderance of characters. In particular, there are no commonly-used transcoders that produce decomposed forms where precomposed forms exist or which produce a different combining character sequence from the normalized sequence (and this is true for all of the transcoders in [ Encoding ]).
[S] Specifications MUST allow a Unicode character encoding.
[S] Specifications MUST specify a default character encoding and SHOULD specify UTF-8 as the default encoding.
[S] Specifications SHOULD disallow encodings other than UTF-8.
Legacy character encodings have generally outlived their usefulness on the Web. New specifications need to support Unicode encodings from the beginning, default to a Unicode encoding (generally UTF-8), and, if at all possible, disallow any other encodings to be used. This not only promotes interoperability, but reduces the range of pointless variation in character and data representation.
Most document formats and protocols provide a means for encoding characters as an escape sequence or including external data, including text, into a resource . This is discussed in detail in Section 4.6 of [ CHARMOD ] as well as above .
When performing matching, it is important to know when to interpret character escapes so that a match succeeds (or fails) appropriately. Normally, escapes, references, and includes are processed or expanded before performing matching (or match-sensitive processing), since these syntaxes exist to allow difficult-to-encode sequences to be put into a document conveniently, while allowing the characters to behave as if they were directly encoded as a code point sequence in the document in question.
One area where this can be complicated is deciding how syntactic content and localizable content interact. For example, consider the following snippet of HTML:
Although technically the combining mark ̀ [ U+0300 COMBINING GRAVE ACCENT ] combines with the preceding quote mark, HTML does not consider the character (whether or not it is encoded as an entity) to form part of the HTML syntax.
When performing a matching operation on a resource, the general rule is to expand escapes at the same "level" as the one the user is interacting with. For example, when considering the example above, a tool to view the source of the HTML would show the escape sequence &#x300; as a string of characters starting with an ampersand. A JavaScript program, by contrast, operates on the browser's interpretation of the document and would match the character U+0300 as the value of the attribute id .
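A non-normative Python illustration of matching at the interpreted level rather than the source level, using a numeric character reference for U+0300 :

    from html import unescape

    source_value = "&#x300;"                     # what a source viewer shows
    interpreted_value = unescape(source_value)   # what the parsed document contains

    print(interpreted_value == "\u0300")   # True: matches U+0300 after expansion
    print(source_value == "\u0300")        # False: no match at the source level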
When processing the syntax of a document format, escapes are usually converted to the character sequence they represent before the processing of the syntax, except where explicitly forbidden by the format's processing rules. This allows resources to include characters of all types into the resource's syntactic structures.
In some cases, pre-processing escapes creates problems. For example, expanding the escape sequence &lt; before parsing an HTML document would produce document errors.
A specific Unicode normalization form is not always appropriate or available to content authors and the text encoding choices of users might not be obvious to downstream consumers of the data. As shown in this document, there are many different ways that content authors or applications could choose to represent the same semantic values when inputting or exchanging text. Normalization can remove distinctions that the users applied intentionally. Therefore, the matching algorithm specifies the use of Unicode normalization when performing case-fold matching of strings—and then only internally to the algorithm. Imposing normalization on content can represent a barrier to users and implementers. Thus:
[S] Specifications SHOULD NOT specify a Unicode normalization form for encoding, storage, or interchange of a given vocabulary.
[I] Implementations MUST NOT alter the normalization form of syntactic content (including user-supplied values ) or localizable content being exchanged, read, parsed, or processed, except when required to do so as a side-effect of text transformation such as transcoding the content to a Unicode character encoding, case folding, or other user-initiated change, as consumers or the content itself might depend on the de-normalized representation.
[I] Authoring tools SHOULD provide a means of normalizing resources and warn the user when a given resource is not in Unicode Normalization Form C.
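A non-normative Python sketch of such a check, using unicodedata.is_normalized() (available in Python 3.8 and later); the function name and warning text are illustrative, and the content itself is left unmodified:

    import unicodedata
    import warnings

    def warn_if_not_nfc(text: str, resource_name: str) -> None:
        # Warn, but do not silently alter the content (see the requirement above).
        if not unicodedata.is_normalized("NFC", text):
            warnings.warn(f"{resource_name} is not in Unicode Normalization Form C")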
A specification that requires storage and interchange of text in a specific normalization form needs to address the requirements in § 3.2.5.1 Requirements When Specifying Normalization in Document Formats .
Specifications are generally discouraged from requiring formats or protocols to store or exchange data in a normalized form unless there are specific, clear reasons why the additional requirement is necessary. As many document formats on the Web do not require normalization, content authors might occasionally rely on denormalized character sequences. A normalization step could negatively affect such content.
The canonical normalization forms (form NFC or form NFD) are intended to preserve the meaning and presentation of the text to which they are applied. This is not always the case, which is one reason why normalization is not recommended. NFC has the advantage that almost all legacy data (if transcoded trivially, one-to-one, to a Unicode encoding), as well as data created by current software or entered by users on most (but not all) keyboards, is already in this form. NFC also has a slight compactness advantage and is a better match to user expectations in most languages with respect to the relationship between characters and graphemes.
[S] Specifications SHOULD NOT specify compatibility normalization forms (NFKC, NFKD).
[I] Implementations MUST NOT apply compatibility normalization forms (NFKC, NFKD) unless specifically requested by the end user.
The compatibility normalization forms (form NFKC and form NFKD) change the structure and lose the meaning of the text in important ways. Users sometimes use characters with a compatibility mapping in Unicode on purpose, or they use characters in a legacy character encoding that have a compatibility mapping when converted to Unicode. This has to be considered intentional on the part of the content author. Although NFKC/NFKD can sometimes be useful in "find" operations or when searching localizable content, erasing compatibility differences is harmful.
Requiring NFC requires additional care on the part of the specification developer, as content on the Web generally is not in a known normalization state. Boundary and error conditions for denormalized content need to be carefully considered and well-specified in these cases.
[S] Specifications MUST document or provide a health-warning if canonically equivalent but disjoint Unicode character sequences represent a security issue.
[C] Content authors SHOULD use Unicode Normalization Form C ( NFC ) wherever possible for content. Note that NFC is not always appropriate to the content or even available to content authors in some languages.
[C] Content authors SHOULD always encode text using consistent Unicode character sequences to facilitate matching, even if a Unicode normalization form is included in the matching performed by the format or implementation.
In order for their content to be processed consistently, content authors should try to use a consistent sequence of code points to represent the same text. While content can be in any normalization form or might use a de-normalized (but valid) Unicode character sequence, inconsistency of representation will cause implementations to treat the different sequences as different. The best way to ensure consistent selection, access, extraction, processing, or display is to always use NFC .
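For example (non-normative), the following Python snippet shows how two encodings of the same visible text fail to match at the code point level unless a consistent form such as NFC is used:

    import unicodedata

    composed = "caf\u00E9"       # é as a single precomposed code point
    decomposed = "cafe\u0301"    # e followed by U+0301 COMBINING ACUTE ACCENT

    print(composed == decomposed)                      # False
    print(unicodedata.normalize("NFC", composed)
          == unicodedata.normalize("NFC", decomposed)) # True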
[C] Content authors SHOULD NOT include combining marks without a preceding base character in a resource.
There can be exceptions to this. For example, when making a list of characters (such as a list of [ Unicode ] characters), an author might want to use combining marks without a corresponding base character. However, use of a combining mark without a base character can cause unintentional display or processing problems, such as when a naïve implementation combines the combining mark with adjacent syntactic, user-supplied, or other localizable content. For example, if you were to use a combining mark, such as the character ́ [ U+0301 COMBINING ACUTE ACCENT ] , as the start of a class attribute value in HTML, the class name might not display properly in your editor and be difficult to edit.
Some recommended base characters include ◌ [ U+25CC DOTTED CIRCLE ] (when the base character needs to be visible) or [ U+00A0 NO-BREAK SPACE ] (when the base character should be invisible).
Since content authors do not always follow these guidelines:
[S] Specifications of vocabularies MUST define the boundaries between syntactic content and character data as well as entity boundaries (if the language has any include mechanism). These need to include any boundary that could create conflicts when instances of the language are processed or matched, while allowing for character escapes designed to express arbitrary characters.
When a specification requires Unicode normalization for storage, transmission, or processing, some additional considerations need to be addressed by the specification authors as well as by implementers of that specification:
[S] Where operations can produce denormalized output from normalized text input, specifications MUST define whether the resulting output is required to be normalized or not. Specifications MAY state that performing normalization is optional for some operations; in this case the default SHOULD be that normalization is performed, and an explicit option SHOULD be used to switch normalization off.
[S] Specifications that require normalization MUST NOT make the implementation of normalization optional. Interoperability cannot be achieved if some implementations normalize while others do not.
An implementation that is required to perform normalization needs to consider these requirements:
[I] Normalization-sensitive operations MUST NOT be performed unless the implementation has first either confirmed through inspection that the text is in normalized form or it has re-normalized the text itself. Private agreements MAY be created within private systems which are not subject to these rules, but any externally observable results MUST be the same as if the rules had been obeyed.
[I] A normalizing text-processing component which modifies text and performs normalization-sensitive operations MUST behave as if normalization took place after each modification, so that any subsequent normalization-sensitive operations always behave as if they were dealing with normalized text.
[I] Authoring tool implementations SHOULD warn users or prevent the input or creation of syntactic content starting with a combining mark that could interfere with processing, display, or interchange.
One important consideration in string identity matching is whether the comparison is case sensitive or case insensitive.
[C] Content authors SHOULD always spell identifiers using consistent upper, lower, and mixed case formatting to facilitate matching, even if case folded matching is supported by the format or implementation.
[S] Case-sensitive matching is RECOMMENDED for matching syntactic content, including user-defined values.
Vocabularies usually put a premium on predictability for content authors and users. Case-sensitive matching is the easiest to implement and introduces the least potential for confusion, since it generally consists of a comparison of the underlying Unicode code point sequence. Because it is not affected by considerations such as language-specific case mappings, it produces the least surprise for document authors that have included words, such as the Turkish examples above, in their syntactic content.
Case insensitivity is usually reserved for processing localizable content , such as performing a natural language textual search. However, cases exist in which case-insensitivity is desirable. When case-insensitive matching is necessary, there are several implementation choices that a formal language needs to consider.
[S] Specifications that define case-insensitive matching in vocabularies that include more than the Basic Latin (ASCII) range of Unicode MUST specify Unicode full casefold matching.
[S] Specifications SHOULD allow the full range of Unicode for user-defined values.
Vocabularies generally should allow for a wide range of Unicode characters, particularly for user-supplied values , so as to enable use by the broadest range of languages and cultures without disadvantage. As a result, text operations such as case folding need to address the full range of Unicode and not just selected portions. When case-insensitive matching is desired, this means using Unicode case folding :
The Unicode simple casefolding form is not appropriate for string identity matching on the Web.
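As a non-normative illustration, Python's str.casefold() performs Unicode full case folding; the difference from simple case folding is visible with ẞ [ U+1E9E LATIN CAPITAL LETTER SHARP S ] , which full folding maps to "ss" while simple folding maps to ß:

    print("GRÜẞE".casefold())              # 'grüsse' (full case folding)
    print("GRÜẞE".casefold() == "grüsse")  # True
    # Under simple case folding, ẞ would fold to ß and the strings would not match.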
[S] Specifications that define case-insensitive matching in vocabularies limited to the Basic Latin (ASCII) subset of Unicode MAY specify ASCII case-insensitive matching.
A formal language whose vocabulary is limited to ASCII and which does not allow user-defined names or identifiers can specify ASCII case-insensitive matching. An example of this is HTML, which defines the use of ASCII case-insensitive comparison for element and attribute names defined by the HTML specification.
A vocabulary is considered to be "ASCII-only" if and only if all tokens and identifiers are defined by the specification directly and these identifiers or tokens use only the Basic Latin subset of Unicode. If user-defined identifiers are permitted, the full range of Unicode characters (limited, as appropriate, for security or interchange concerns, see [ UTR36 ]) should be allowed and Unicode case insensitivity used for identity matching.
An ASCII-only vocabulary can exist inside a document format or protocol that allows a larger range of Unicode in identifiers or values. For example [ CSS-SYNTAX-3 ] defines the format of CSS style sheets in a way that allows the full range of Unicode to be used for identifiers and values. However, CSS specifications always define CSS keywords using a subset of the ASCII range. The vocabulary of CSS is thus ASCII-only, even though many style sheets contain identifiers or data values that are not ASCII.
Locale- or language-specific tailoring is most appropriate when it is part of natural language processing operations (which is beyond the scope of this document). Because language-specific tailoring of case mapping or case folding produces different results from the generic case folding rules, these should be avoided in formal languages, where predictability is at a premium.
[S] Specifications that define case-insensitive matching in vocabularies SHOULD NOT specify language-sensitive case-insensitive matching.
[S] If language-sensitive case-insensitive matching is specified, Unicode case mappings SHOULD be tailored according to language and the source of the language used for each tailoring MUST be specified.
Two strings being matched can be in different languages and might appear in yet a third language context. Which language to use for case folding therefore depends on the application and user expectations.
Language specific tailoring is not recommended for formal languages because the language information can be hard to obtain, verify, or manage and because the resulting operations can produce results that frustrate users or which fail for some users and succeed for others depending on the language configuration that they are using or the configuration of the system where the match is performed.
[S] Operations that are language-specific SHOULD include language-specific case folding where appropriate.
For example, the CSS operation text-transform is language-sensitive when used to case map strings.
Although Unicode case folding is the preferred case-insensitive matching for document formats and protocols, content authors and users of languages that have mappings different from the default can still be surprised by the results, since their expectations are generally consistent with the languages that they speak.
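For example (non-normative), built-in case operations in most programming languages, including Python, apply the default language-independent Unicode mappings rather than Turkish-tailored ones:

    # Default (untailored) Unicode lowercasing of İ [ U+0130 ] yields
    # "i" followed by U+0307 COMBINING DOT ABOVE ...
    print("\u0130".lower() == "i\u0307")   # True
    # ... whereas a Turkish-tailored lowercase mapping would yield plain "i",
    # and would lowercase "I" to the dotless ı [ U+0131 ].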
Language-sensitive string comparison is often referred to as being locale-sensitive , since most programming languages and operating environments access language-specific tailoring using their respective locale-based APIs. For example, see the java.text.Collator class in the Java programming language or Intl.Collator in JavaScript.
[S] Specifications MUST clearly define any additional tailoring done as part of the matching process.
Some specifications might wish to include additional tailoring to assist with matching in a given vocabulary. Examples of this might include removing additional textual differences described in Section 2 , mapping together or removing characters that are part of the syntax, or performing a whitespace trim.
Any additional tailoring needs to avoid interfering with the way that different languages are represented in Unicode. For example, a process that attempts to remove accents from letters by decomposing the text and then removing all of the combining characters will break languages that rely on combining marks. An example of this is the Devanagari text in Example 2 . (Such a process would also fail to remove all of the potential accents and would probably do harm to the meaning and representation of the text.)
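A non-normative Python sketch of the kind of process being cautioned against; the Devanagari word used here is illustrative and is not the document's Example 2:

    import unicodedata

    def strip_marks(text: str) -> str:
        # Naive "accent removal": decompose, then drop every combining mark.
        # This is exactly the kind of tailoring the text above warns against.
        decomposed = unicodedata.normalize("NFD", text)
        return "".join(ch for ch in decomposed
                       if not unicodedata.category(ch).startswith("M"))

    print(strip_marks("café"))     # 'cafe' - appears to "work" for Latin script
    print(strip_marks("हिन्दी"))    # 'हनद' - destroys the Devanagari word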
While matching strings and tokens in a formal language is the primary concern of this document, sometimes a specification needs to consider additional types of matching beyond pure string equality.
[S] Specifications that define a regular expression syntax MUST provide at least Basic Unicode Level 1 support per [ UTS18 ] and SHOULD provide Extended or Tailored (Levels 2 and 3) support.
Regular expression syntaxes are sometimes useful in defining a format or protocol, since they allow users to specify values that are only partially known or which can vary in predictable ways. As seen in the various sections of this document, there is variation in the different ways that characters can be encoded in Unicode and this potentially interferes with how strings are specified or matched in expressions. For example, counting characters might need to depend on grapheme boundaries rather than the number of Unicode code points used; caseless matching might need to consider variations in case folding; or the Unicode normalization of the expression or text being processed might need to be considered.
Unicode Regular Expressions Level 1 support includes the ability to specify Unicode code points in regular expressions, including via the use of escapes, and to access Unicode character properties as well as certain kinds of boundaries common to most regular expression syntaxes.
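As a non-normative illustration, Python's built-in re module provides part of this behavior: character classes such as \w are Unicode-aware for string patterns, and code points can be written as escapes. Full Level 1 support also calls for Unicode property escapes such as \p{L}, which require a property-aware engine.

    import re

    # \w is Unicode-aware by default for str patterns in Python 3.
    print(bool(re.fullmatch(r"\w+", "grüß")))             # True
    # Code points can be written as escapes inside the pattern.
    print(bool(re.fullmatch("gr\u00FC\u00DF", "grüß")))   # True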
Level 2 extends this with a number of important capabilities, notably the ability to select text on certain kinds of grapheme cluster boundary and support for case conversion (two topics mentioned extensively above). Level 3 provides for locale [ LTLI ] based tailoring of regular expressions, which is less useful in formal languages but can be useful in processing localizable content .
Changes to this document (beginning with the Working Draft of 2018-04-20) are available via the github commit log .
The W3C Internationalization Working Group and Interest Group, as well as others, provided many comments and suggestions. The Working Group would like to thank: Mati Allouche, Ebrahim Byagowi, John Cowan, Martin Dürst, Behdad Esfahbod, Asmus Freitag, Richard Ishida, John Klensin, Peter Saint-Andre, Amir Sarabadani, Najib Tounsi, Richard Wordingham, and all of the CharMod contributors over the twenty (!!) years of this document's development.
The previous version of this document was edited by: