Copyright © 2014-2020 W3C® (MIT, ERCIM, Keio, Beihang). W3C liability, trademark and permissive document license rules apply.
This document provides a checklist of internationalization-related considerations when developing a specification. Most checklist items point to detailed supporting information in other documents. Where such information does not yet exist, it can be given a temporary home in this document. The dynamic page Internationalization Techniques: Developing specifications is automatically generated from this document. The current version is still an early draft, and it is expected that the information will change regularly as new content is added and existing content is modified in the light of experience and discussion.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at https://www.w3.org/TR/.
This document provides advice to specification developers about how to incorporate requirements for international use. What is currently available here is expected to be useful immediately, but is still an early draft and the document is in flux, and will grow over time as knowledge applied in reviews and discussions can be crystallized into guidelines.
Sending comments on this document
If you wish to make comments regarding this document, please raise them as GitHub issues. Only send comments by email if you are unable to raise issues on GitHub (see links below).
To make it easier to track comments, please raise separate issues or emails for each comment, and point to the section you are commenting on using a URL for the dated version of the document. All comments are welcome.
This document was published by the Internationalization Working Group as an Editor's Draft.
GitHub Issues are preferred for discussion of this specification. Alternatively, you can send comments to our mailing list. Please send them to www-international@w3.org (archives).
Publication as an Editor's Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This document was produced by a group operating under the W3C Patent Policy. The group does not expect this document to become a W3C Recommendation. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.
This document is governed by the 1 March 2019 W3C Process Document.
Developers of specifications need advice to ensure that what they produce will work for communities around the globe.
The Internationalization (i18n) WG tries to assist working groups by reviewing specifications and engaging in discussion. Often, however, such interventions come later in the process than would be ideal, or mean that the i18n WG has to repeat the same information for each working group it interacts with.
It would be better if specification developers could access a checklist of best practices, which points to explanations, examples and rationales where developers need it. Developers would then be able to build this knowledge into their work from the earliest stages, and could thereby reduce rework needed when the i18n WG reviews their specification.
This document contains the beginnings of a checklist, and points to locations where you can find explanations, examples and rationales for recommendations made. If there is no such other place, that extra information will be added to this document. It is still early days for this document, and it may also be used to develop ideas and organize them.
The guidelines in this document are not intended to be hard and fast requirements. This document will achieve a significant part of its purpose if, where you don't understand the guidelines or disagree with them, you contact the Internationalization WG to discuss what should be done.
You may prefer to use Internationalization Techniques: Developing specifications most of the time, since it uses JavaScript to help you more quickly see what's available and drill down to the information you need. (Where needed, it links to this or other documents.) There is also a non-dynamic version of the document available.
If your spec is GitHub-based, you can now create a snapshot of the checklist items in markdown. If you add that to a GitHub issue, you can check off items that are OK, and add comments while doing a self-review. Generate the code here.
It should be possible to associate a language with any piece of natural language text that will be read by a user.
Where possible, there should be a way to label natural language changes in inline text.
Text is rendered or processed differently according to the language it is in. For example, screen readers need to be prompted when a language changes, and spell checkers should be language-sensitive. When rendering text, knowledge of the language is needed in order to apply correct fonts, hyphenation, line-breaking, upper/lower case changes, and other features.
For example, ideographic characters such as 雪, 刃, 直, 令, 垔 have slight but important differences when used with Japanese vs Chinese fonts, and it's important not to apply a Chinese font to the Japanese text, and vice versa when it is presented to a user.
Consider whether it is useful to express the intended linguistic audience of a resource, in addition to specifying the language used for text processing.
Language information for a given resource can be used with two main objectives in mind: for text-processing, or as a statement of the intended use of the resource. We will explain the difference below.
A language declaration that indicates the text processing language for a range of text must associate a single language value with a specific range of text.
When specifying the text-processing language you are declaring the language in which a specific range of text is actually written, so that user agents or applications that manipulate the text, such as voice browsers, spell checkers, style processors, hyphenators, etc., can apply the appropriate rules to the text in question. So we are, by necessity, talking about associating a single language with a specific range of text.
It is normal to express a text-processing language as the default language, for processing the resource as a whole, but it may also be necessary to indicate where the language changes within the resource.
Use the HTML lang and XML xml:lang language attributes where appropriate to identify the text processing language, rather than creating a new attribute or mechanism.
HTML provides the lang attribute, while XML provides xml:lang, which can be used in all XML formats. It's useful to continue using those attributes for relevant markup formats, since authors recognize them, as do HTML and XML processors.
It may also be useful to describe the language of a resource as a whole. This type of language declaration typically indicates the intended use of the resource. For example, such metadata may be used for searching, serving the right language version, classification, etc.
This type of language declaration differs from that of the text-processing declaration in that (a) the value for such declarations may be more than one language subtag, and (b) the language value declared doesn't indicate which bits of a multilingual resource are in which language.
It should be possible to associate a metadata-type language declaration (which indicates the intended use of the resource rather than the language of a specific range of text) with multiple language values.
The language(s) describing the intended use of a resource do not necessarily include every language used in a document. For example, many documents on the Web contain embedded fragments of content in different languages, whereas the page is clearly aimed at speakers of one particular language. For example, a German city-guide for Beijing may contain useful phrases in Chinese, but it is aimed at a German-speaking audience, not a Chinese one.
On the other hand, it is also possible to imagine a situation where a document contains the same or parallel content in more than one language. For example, a web page may welcome Canadian readers with French content in the left column, and the same content in English in the right-hand column. Here the document is equally targeted at speakers of both languages, so there are two audience languages. Another use case is a blog or a news page aimed at a multilingual community, where some articles on a page are in one language and some in another. In this case, it may make sense to list more than one language tag as the value of the language declaration.
Attributes that express the language of external resources should not use the HTML lang and XML xml:lang language attributes, but should use a different attribute when they represent metadata (which indicates the intended use of the resource rather than the language of a specific range of text).
Using a different attribute to indicate the language of an external resource allows the attribute to specify more than one language. It also works better if the resource pointed to is not in a single language.
This distinction can be seen in HTML in the separation of the lang and hreflang attributes. The former indicates the language of the text within the HTML page; the latter is metadata indicating the expected language of a page that is linked to.
For a longer discussion of this see xml:lang in XML document schemas. This article talks specifically about xml:lang, but the concepts are applicable to other situations.
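As a rough illustration of this separation, the following Python sketch collects the two kinds of declaration from a small HTML fragment using the standard library's html.parser module. The class name LanguageInfo and the sample markup are invented for this example; it is not a conformance checker.

```python
from html.parser import HTMLParser


class LanguageInfo(HTMLParser):
    """Collects lang (text-processing) and hreflang (metadata) values."""

    def __init__(self):
        super().__init__()
        self.text_processing = []  # (element name, lang value)
        self.link_metadata = []    # (href, hreflang value)

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # lang labels the language of the text inside this element.
        if "lang" in attributes:
            self.text_processing.append((tag, attributes["lang"]))
        # hreflang describes the expected language of the linked resource.
        if "hreflang" in attributes:
            self.link_metadata.append(
                (attributes.get("href"), attributes["hreflang"])
            )


parser = LanguageInfo()
parser.feed(
    '<html lang="de"><body>'
    '<p>Nützlicher Satz: <span lang="zh">你好</span></p>'
    '<a href="/guide.zh.html" hreflang="zh">中文版</a>'
    '</body></html>'
)
print(parser.text_processing)  # lang declarations found in the page
print(parser.link_metadata)    # hreflang declarations on links
```

Note that the link element itself carries no lang attribute here, so its link text would still inherit the page's text-processing language; the hreflang value says nothing about that text.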
xml:lang in XML document schemas
When should I use xml:lang and when should I define my own element or attribute for passing language values in an XML document schema (DTD)?
In Internationalization Best Practices for Spec Developers .
Why use the language attribute?
Describes why it is useful to use the lang or xml:lang attribute to label language in web pages.
Use cases for language information in web annotations
Description of use cases for annotations that illustrate the differences between text-processing and metadata types of language declaration.
HTTP headers, meta elements and language information
How the distinction between text-processing language and language metadata plays out in HTML5.
Values for language declarations must use BCP 47.
BCP 47 defines a method to combine subtags in order to create a much more powerful notation for language tags than that provided by the old ISO lists, but it is also backwards compatible with the ISO lists.
For an overview of the key features of BCP 47, see Language tags in HTML and XML.
Refer to BCP 47, not to RFC 5646.
The BCP 47 name and link were created specifically so that there is an unchanging reference to the definition of Tags for the Identification of Languages. RFCs 1766, 3066 and 4646 were previous (superseded) versions, and RFC 5646 is the current version of BCP 47.
Be specific about what level of conformance you expect for language tags: BCP 47 defines two levels of conformance, "valid" and "well-formed".
A well-formed BCP 47 language tag follows the syntax defined for a language tag: implementations check that each language tag consists of hyphen-separated subtags; each subtag has a specific length and specific content (letters, digits or specific combinations) depending on the placement in the tag. A valid BCP 47 language tag is well-formed but additionally ensures that only subtags that are listed in the IANA Subtag Registry are used. Note that the IANA Subtag Registry is frequently updated with new subtags.
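The difference between the two conformance levels can be sketched in Python. This is a simplified illustration, not a complete BCP 47 implementation: the pattern covers only the regular "langtag" syntax (grandfathered tags and private-use-only tags are omitted), and the KNOWN dictionary is a tiny hand-written stand-in for the IANA Subtag Registry that a real validity check would consult. The names LANGTAG, KNOWN, is_well_formed and is_valid are invented for this example.

```python
import re

# Simplified syntax check for the BCP 47 "langtag" production.
LANGTAG = re.compile(
    r"""
    ^(?P<language>[a-z]{2,3}(?:-[a-z]{3}){0,3}|[a-z]{4,8})
    (?:-(?P<script>[a-z]{4}))?
    (?:-(?P<region>[a-z]{2}|[0-9]{3}))?
    (?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*    # variants
    (?:-[0-9a-wy-z](?:-[a-z0-9]{2,8})+)*        # extensions, e.g. -u-...
    (?:-x(?:-[a-z0-9]{1,8})+)?                  # private use
    $
    """,
    re.IGNORECASE | re.VERBOSE,
)

# Hypothetical excerpt standing in for the IANA Subtag Registry.
KNOWN = {
    "language": {"en", "de", "zh"},
    "script": {"latn", "hant"},
    "region": {"us", "ch", "tw"},
}


def is_well_formed(tag):
    """True if the tag follows the (simplified) subtag syntax."""
    return LANGTAG.match(tag) is not None


def is_valid(tag):
    """True if well-formed and language/script/region are in KNOWN.

    Variants and extensions are not checked in this sketch.
    """
    match = LANGTAG.match(tag)
    if match is None:
        return False
    for kind, allowed in KNOWN.items():
        value = match.group(kind)
        if value is not None and value.lower() not in allowed:
            return False
    return True


print(is_well_formed("en-Foob"), is_valid("en-Foob"))        # True False
print(is_well_formed("zh-Hant-TW"), is_valid("zh-Hant-TW"))  # True True
print(is_well_formed("en_US"))                               # False
```

The tag en-Foob has the right shape for a language plus script subtag, so it is well-formed, but "Foob" is not a registered script, so it is not valid. A validity check like this one is only as current as the registry data it ships with.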
Specifications may require implementations to check if language tags are "valid", but in most circumstances should only require that the language tags be "well-formed".
Most specifications are second-order consumers of language metadata: they are using data already provided in the document format (HTML lang, XML xml:lang, or the document format's language fields/attributes).
Generally most specifications are concerned with selecting resources (such as spell checkers, tokenizers, fonts, etc.) or with matching (selecting which string to show, for example) and don't directly care about the content of the language tag. Invalid-but-well-formed tags just don't match anything and usually fallback schemes provide some behavior that is appropriate.
There might be cases where a specification really wants implementation-level checking for validity. In those cases, the result of a tag failing to be valid has to be specified (should the application die, warn the user, etc.). It's also a problem that the registry is sizeable and changes over time, so each implementation is registry-version dependent. The changes over time are often minor, but real users will encounter interoperability issues if random (out of date) implementations of the specification reject language tags that have become valid at a later date.
In addition, BCP 47 has an extension mechanism which defines add-on subtag sequences. For example, one extension [RFC6067] (Unicode Locales, which uses the singleton -u) is commonly used for controlling the internationalization features of JavaScript (and has other uses). Validating these additional subtags is likely out of scope for most specifications.
Specifications should require content and content authors to use "valid" language tags.
Normative language regarding language tags might be different between content and implementation requirements. Specification authors need to carefully consider what conformance requirements and tests are needed for their specification and what implementations are required to do. One solution is to normatively require that "valid" language tags be used by content authors but only require implementations to check for "well-formed" language tags.
Reference BCP 47 for language tag matching.
BCP 47 contains one RFC dedicated to the syntax and subtags of language tags, and another dedicated to how to match two or more subtags. (This topic needs more detail, and may merit being a separate section.)
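As one illustration of matching, the matching part of BCP 47 (RFC 4647) describes a "lookup" scheme in which a requested tag is progressively truncated until it matches an available tag. The Python sketch below follows that idea under simplifying assumptions: it handles a single requested tag, ignores wildcards and priority lists, and the function name lookup and its default parameter are invented for this example.

```python
def lookup(requested, available, default=None):
    """Return the available tag that best matches `requested`.

    Truncates the requested tag from the end, one subtag at a time,
    comparing case-insensitively at each step.
    """
    by_lowercase = {tag.lower(): tag for tag in available}
    candidate = requested.lower()
    while candidate:
        if candidate in by_lowercase:
            return by_lowercase[candidate]
        subtags = candidate.split("-")[:-1]
        # A single-character subtag (such as "u" or "x") only introduces
        # what follows it, so drop it together with the removed subtag.
        if subtags and len(subtags[-1]) == 1:
            subtags = subtags[:-1]
        candidate = "-".join(subtags)
    return default


print(lookup("zh-Hant-TW", ["zh", "zh-Hant", "en"]))  # zh-Hant
print(lookup("en-US-u-ca-gregory", ["en-US", "de"]))  # en-US
print(lookup("fr-CA", ["en", "de"], default="en"))    # en
```

A well-formed tag that is not in the registry simply fails to match anything and falls through to the default, which is the fallback behavior described earlier in this section.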
In Internationalization Best Practices for Spec Developers .
The IETF specification that indicates how to create language subtags and how to match them .
An overview of how to create language tags using BCP 47.
Here we are talking about an independent unit of data that contains structured text. Examples may include a whole HTML page, an XML document, a JSON file, a WebVTT script, an annotation, etc.
The specification should indicate how to define the default text-processing language for the resource as a whole.
It often saves trouble to identify the language, or at least the default language, of the resource as a whole in one place. For example, in an HTML file, this is done by setting the lang attribute on the html element.
Content within the resource should inherit the language of the text-processing declared at the resource level, unless it is specifically overridden.
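A minimal sketch of this inheritance rule, using an XML fragment and Python's standard xml.etree.ElementTree module, might look like the following. The element names in the sample document and the helper name effective_languages are invented for this example; the empty string stands for "no language information".

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def effective_languages(element, inherited=""):
    """Yield (element, language) pairs, applying inheritance.

    An element uses its own xml:lang if present; otherwise it takes
    the language of its nearest ancestor that declared one.
    """
    language = element.get(XML_LANG, inherited)
    yield element, language
    for child in element:
        yield from effective_languages(child, language)


doc = ET.fromstring(
    '<guide xml:lang="de">'
    "<p>Willkommen in Peking.</p>"
    '<p xml:lang="zh">你好</p>'
    "</guide>"
)
langs = [(el.tag, lang) for el, lang in effective_languages(doc)]
print(langs)  # [('guide', 'de'), ('p', 'de'), ('p', 'zh')]
```

The first paragraph carries no declaration of its own and so inherits the resource-level default, while the second overrides it.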
Consider whether it is necessary to have separate declarations to indicate the text-processing language versus metadata about the expected use of the resource.
In many cases a resource contains text in only one language, and in many more cases the language declared as the default language for text-processing is the same as the language that describes the metadata about the resource as a whole. In such cases it makes sense to have a single declaration.
It becomes problematic, however, to use a single declaration when it refers to more than one language unless there is a way to determine which one language should be used as the text-processing default.
If there is only one language declaration for a resource, and it has more than one language tag as a value, it must be possible to identify the default text-processing language for the resource.
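One possible way to satisfy this, shown here purely as an assumption and not as something this document prescribes, is for a format to state that the first tag in a comma-separated list is the text-processing default while the full list serves as audience metadata. The function name split_declaration is hypothetical.

```python
def split_declaration(value):
    """Split a multi-valued language declaration.

    Returns (default text-processing language, all audience languages).
    Assumes a hypothetical format rule: tags are comma-separated and
    the first tag is the text-processing default.
    """
    tags = [tag.strip() for tag in value.split(",") if tag.strip()]
    default = tags[0] if tags else None
    return default, tags


print(split_declaration("fr, en"))  # ('fr', ['fr', 'en'])
print(split_declaration(""))        # (None, [])
```

Whatever rule a specification chooses, the point is that it is written down, so that every consumer picks the same default.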
Use cases for language information in web annotations
Description of use cases for annotations that illustrate the differences between text-processing and metadata types of language declaration.
HTTP headers, meta elements and language information
How the distinction between text-processing language and language metadata plays out in HTML5.
The words block and/or chunk are used here to refer to a structural component within the resource as a whole that groups content together and separates it from adjacent content. Boundaries between one block and another are equivalent to paragraph or section boundaries in text, or discrete data items inside a file.
For example, this could refer to a block or paragraph in XML or HTML, an object declaration in JSON, a cue in WebVTT, a line in a CSV file, etc. Contrast this with inline content, which describes a range within a paragraph, sentence, etc.
The interpretation of which structures defined in a spec are relevant to these requirements may require a little consideration, and will depend on the format of the data involved.
By default, blocks of content should inherit any text-processing language set for the resource as a whole.
See § 2.1 Language basics for guidance related to the default text-processing language information.
It should be possible to indicate a change in language for blocks of content where the language changes.
Here we refer to information that needs to be provided for a range of characters in the middle of a paragraph or string.
It should be possible to indicate language for spans of inline text where the language changes.
Where a switch in language can affect operations on the content, such as spell-checking, rendering, styling, voice production, translation, information retrieval, and so forth, it is necessary to indicate the range of text affected and identify the language of that content.
It is important to establish direction for text written or mixed with right-to-left scripts. Characters in these scripts are stored in memory in the order they are typed and pronounced – called the logical order. The Unicode Bidirectional Algorithm (UBA) provides a lot of support for automatically rendering a sequence of characters stored in logical order so that they are visually ordered as expected. Unfortunately, the UBA alone is not sufficient to correctly render bidirectional text, and additional information has to be provided about the default directional context to apply for a given sequence of characters.
The basic requirements are as follows.
It must be possible to indicate base direction for each individual paragraph-level item of natural language text that will be read by someone.
It must be possible to indicate base direction changes for embedded runs of inline bidirectional text for all natural language text that will be read by someone.
Annotating right-to-left text must require the minimum amount of effort for people who work natively with right-to-left scripts.
Requiring a speaker of Arabic, Divehi, Hebrew, Persian, Urdu, etc. to add markup or control characters to every paragraph or small data item they write is far too much to be manageable. Typically, the format should establish a default direction and require the user to intervene only when exceptions have to be dealt with.
In this section we try to set out some key concepts associated with text direction, so that it will be easier to understand the recommendations that follow.
In order to correctly display text written in a 'right-to-left' script or left-to-right text containing bidirectional elements, it is important to establish the base direction that will be used to dictate the order in which elements of the text will be displayed.
If you are not familiar with what the Unicode Bidirectional Algorithm (UBA) does and doesn't do, and why the base direction is so important, read Unicode Bidirectional Algorithm basics.
In this section, the word paragraph indicates a run of text followed by a hard line-break in plain text, but may signify different things in other situations. In CSV it equates to 'cell', so a single line of comma-separated items is actually a set of comma-separated paragraphs. In HTML it equates to the lowest level of block element, which is often a p element, but may be things such as div, li, etc., if they only contain text and/or inline elements. In JSON, it often equates to a quoted string value, but if a string value uses markup then paragraphs are associated with block elements, and if the string value is multiple lines of plain text then each line is a paragraph.
The term metadata is used here to mean information which could be an annotation or property associated with the data, or could be markup in scenarios that allow that, or could be a higher-level protocol, etc.
There are a number of possible ways of setting the base direction.
ltr, rtl or auto.
dir=auto on an HTML element.)
dir attributes in your HTML file.)
dir attribute on the html tag in HTML. Another example would be a subtitling file containing many cues, all written in Arabic; it would be best to allow the author to say at the start of the file that the default is RTL for all cue text. There should always be a way to override the direction information for a specific paragraph where needed.
auto, since HTML specifies a default direction.)
When capturing text input by a user it is usually necessary to understand the context in which the user was inputting the data to determine the base direction of the input. In HTML, for example, this may be set by the direction inherited from the html tag, or by the user pressing keys to set the base direction for a form field. It is then necessary to find some way of storing the information about base direction or associating it with the string when rendered. Typically, in this situation, any direction changes internal to the string being input are handled by the user and will be captured as part of the string.
Embedded ranges of text within a single paragraph may need to have a different base direction. For example,
"The title was '!NOITASILANOITANRETNI'."
where the span within the single quotes is in Hebrew/Arabic/Divehi, etc., and needs to have a RTL base direction, instead of the LTR base direction of the surrounding paragraph, in order to place the exclamation mark correctly.
If markup is available to the content author, it is likely to be easier and safer to use markup to indicate such inline ranges (see below). In HTML you would usually use an inline element with a dir attribute to establish the base direction for such runs of text.
If you can't mark up the text, such as in HTML's title element, or any environment that handles only plain text content, you have to resort to Unicode's paired control characters to establish the base direction for such an internal range.
Furthermore, inline ranges where the base direction is changed should be isolated from surrounding text, so that the UBA doesn't produce incorrect results due to interference across boundaries. See an example of how this can produce incorrect ordering of things such as text followed by numbers in HTML, or another example of how it can affect lists.
This means that if a content author is using Unicode control codes they should use RLI/LRI...PDI rather than RLE/LRE...PDF. These isolating codes are fairly new, and applications may not yet support them.
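For plain-text environments where such control characters are the only option, wrapping a range in isolating controls can be sketched as follows. The code points are those defined by Unicode (LRI U+2066, RLI U+2067, FSI U+2068, PDI U+2069); the helper name isolate and its direction keywords are invented for this example.

```python
LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
RLI = "\u2067"  # RIGHT-TO-LEFT ISOLATE
FSI = "\u2068"  # FIRST STRONG ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE

_OPENERS = {"ltr": LRI, "rtl": RLI, "auto": FSI}


def isolate(text, direction="auto"):
    """Wrap `text` in an isolate with the given base direction.

    "auto" uses FSI, which asks the renderer to pick the direction
    from the first strong character inside the isolated range.
    """
    return f"{_OPENERS[direction]}{text}{PDI}"


title = "!NOITASILANOITANRETNI"  # stands in for right-to-left text
sentence = f"The title was '{isolate(title, 'rtl')}'."
print(repr(sentence))
```

Because the controls are paired, every opener added this way is closed by a PDI within the same string, which keeps the effect from leaking into surrounding text.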
Reasons to avoid relying on control characters to set direction include the following:
The last two items above may also hold for markup, but implementers often support included markup better than included control codes.
Don't expect users to add control codes at the start and end of every paragraph. That's far too much work.
A word about the Unicode characters U+200F RIGHT-TO-LEFT MARK (RLM) and U+200E LEFT-TO-RIGHT MARK (LRM) is warranted at this point.
The first point to be clear about is that neither RLM nor LRM establish the base direction for a range of text. They are simply invisible characters with strong directional properties.
This means that you cannot use RLM, for example, to make the text W3C appear to the left of the Hebrew text in the following example.
The title is " פעילות הבינאום, W3C ".
For this you can only use metadata or the paired control characters.
Of course, if you are detecting base direction using first-strong heuristics then RLM and LRM can be useful for setting the base direction where the text in question begins with something that would otherwise give the wrong result, e.g.
" نشاط التدويل " is how you say "i18n Activity" in Arabic.
Here an LRM could be placed at the start of the text, before the strong RTL Arabic characters, to prevent the algorithm from assuming that the text should be right-to-left. (Remember that if metadata is used to set the base direction, that character is ignored, unless the metadata specifically says that first-strong heuristics should be used.)
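The effect of a leading LRM on first-strong detection can be sketched with Python's unicodedata module, which reports the Unicode bidirectional class of each character. The helper first_strong is a simplified stand-in for a first-strong heuristic (it ignores isolates and markup) and is not taken from any specification.

```python
import unicodedata

LRM = "\u200e"  # LEFT-TO-RIGHT MARK, bidi class L
RLM = "\u200f"  # RIGHT-TO-LEFT MARK, bidi class R

# Bidi classes that count as strong, mapped to a base direction.
STRONG = {"L": "ltr", "R": "rtl", "AL": "rtl"}


def first_strong(text):
    """Direction of the first strongly directional character, or None."""
    for character in text:
        bidi_class = unicodedata.bidirectional(character)
        if bidi_class in STRONG:
            return STRONG[bidi_class]
    return None


sentence = '"نشاط التدويل" is how you say "i18n Activity" in Arabic.'
print(first_strong(sentence))        # rtl: the Arabic text comes first
print(first_strong(LRM + sentence))  # ltr: the LRM is now the first strong
```

The mark changes only what the heuristic sees first. It does not, by itself, establish a base direction when the direction is supplied by metadata.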
Do not assume that direction can be determined from language information.
The following are all reasons you cannot use language tags to provide information about base direction:
auto value with language tags.
suppressscript: Hebr).
Languages, such as Persian, that are usually written in a RTL script may be written in transcribed form, and it's not possible to guarantee that the necessary script tag would be present to carry the directional information.
In summary, you won't be able to rely on people supplying script tags as part of the language information in order to influence direction.
Values for the default base direction should include left-to-right, right-to-left, and auto.
The auto value allows automatic detection of the base direction for a piece of text. For example, the auto value of dir in HTML looks for the first strong directional character in the text, but ignores certain items of markup also, to guess the base direction of the text.
Note that automatic detection algorithms are far from perfect. First-strong detection is unable to correctly identify text that is really right-to-left, but that begins with a strong LTR character. Algorithms that attempt to judge the base direction based on contents of the text are also problematic. The best scenario is one where the base direction is known and declared.
The spec should indicate how to define a default base direction for the resource as a whole, i.e. set the overall base direction.
The default base direction, in the absence of other information, should be LTR.
The content author must be able to indicate parts of the text where the base direction changes. At the block level, this should be achieved using attributes or metadata, and should not rely on Unicode control characters.
Relying on Unicode control characters to establish direction for every block is not feasible because line breaks terminate the effect of such control characters. It also makes the data much less stable, and unnecessarily difficult to manage if control characters have to appear at every point where they would be needed.
It must be possible to also set the direction for content fragments to auto. This means that the base direction will be determined by examining the content itself.
A typical approach here would be to set the direction based on the first strong directional character outside of any markup, but this is not the only possible method. The algorithm used to determine directionality when direction is set to auto should match that expected by the receiver.
The first-strong algorithm looks for the first character in the paragraph with a strong directional property according to the Unicode definitions. It then sets the base direction of the paragraph according to the direction of that character.
Note that the first-strong algorithm may incorrectly guess the direction of the paragraph when the first character is not typical of the rest of the paragraph, such as when a RTL paragraph or line starts with a LTR brand name or technical term.
For additional information about algorithms for detecting direction, see Estimation algorithms in the document where this was discussed with reference to HTML.
If the overall base direction is set to auto for plain text, the direction of content paragraphs should be determined on a paragraph by paragraph basis.
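The first-strong detection described above, applied paragraph by paragraph to plain text, can be sketched in Python as follows. This is a simplified illustration: it treats each line as a paragraph, ignores the Unicode rule about skipping isolated ranges, and falls back to LTR when no strong character is found. The function names are invented for this example.

```python
import unicodedata


def first_strong_direction(paragraph, default="ltr"):
    """Guess the base direction from the first strong character."""
    for character in paragraph:
        bidi_class = unicodedata.bidirectional(character)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    # No strong character at all (e.g. only digits or punctuation).
    return default


def paragraph_directions(plain_text):
    """Detect a base direction separately for each line of plain text."""
    return [
        (line, first_strong_direction(line))
        for line in plain_text.splitlines()
    ]


text = "Hello world\nשלום עולם\nHTML היא שפת סימון\n12345"
directions = [direction for _, direction in paragraph_directions(text)]
print(directions)  # ['ltr', 'rtl', 'ltr', 'ltr']
```

The third line is a Hebrew sentence that happens to begin with "HTML", so the heuristic reports LTR for it. This is the kind of wrong guess noted above, and the reason declared direction is preferable where it is known.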
To indicate the sides of a block of text relative to the start and end of its contained lines, you should use 'before' and 'after' (or perhaps block-start/block-end, since the terminology is changing), rather than 'top' and 'bottom'.
To indicate the start/end of a line you should use 'start' and 'end' rather than 'left' and 'right'.
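The reason for preferring 'start' and 'end' is that their physical meaning depends on the base direction, which a small Python sketch can make concrete. The function name physical_side is invented for this example, and only horizontal text is considered.

```python
def physical_side(logical, direction):
    """Map a logical line edge ('start' or 'end') to 'left' or 'right'."""
    if logical not in ("start", "end"):
        raise ValueError(f"unknown logical edge: {logical!r}")
    if direction not in ("ltr", "rtl"):
        raise ValueError(f"unknown direction: {direction!r}")
    # In LTR text lines start on the left; in RTL text, on the right.
    starts_on_left = direction == "ltr"
    if logical == "start":
        return "left" if starts_on_left else "right"
    return "right" if starts_on_left else "left"


print(physical_side("start", "ltr"))  # left
print(physical_side("start", "rtl"))  # right
```

A format that stores 'start' keeps working when the same content is displayed in a right-to-left context, whereas a stored 'left' would have to be rewritten.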
Provide dedicated attributes for control of base direction and bidirectional overrides; do not rely on the user applying style properties to arbitrary markup to achieve bidi control.
For example, HTML has a dir attribute that is capable of managing base direction without assistance from CSS styling. XML formats should define dedicated markup to represent directional information, even if they need CSS to achieve the required display, since the text may be used in other ways.
Style sheets such as CSS may not always be used with the data, or carried with the data when it is syndicated, etc. Directional information is fundamentally important to correct display of the data, and should be associated more closely and more permanently with the markup or data.
The information in this section is pulled from Requirements for Language and Direction Metadata in Data Formats . That document is still being written, so these guidelines are likely to change at any time.
Provide metadata constructs that can be used to indicate the base direction of any natural language string.
Specify that consumers of strings should use heuristics, preferably based on the Unicode Standard first-strong algorithm, to detect the base direction of a string except where metadata is provided.
Where possible, define a field to indicate the default direction for all strings in a given resource or document.
Do NOT assume that creating a document-level default without the ability to change direction for any string is sufficient.
If metadata is not available due to legacy implementations and cannot otherwise be provided, specifications MAY allow a base direction to be interpolated from available language metadata.
Specifications MUST NOT require the production or use of paired bidi controls.
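As a rough sketch of how the guidelines above might combine in a JSON-based format, consider a resource with a document-level default and optional per-string overrides. The field names "dir", "strings" and "value" are hypothetical and not taken from any specification; the value "auto" is used here to mean that the consumer should fall back to a first-strong heuristic.

```python
import json

RESOURCE = json.loads(
    """
    {
      "dir": "rtl",
      "strings": [
        {"value": "כותרת"},
        {"value": "HTML & CSS", "dir": "ltr"}
      ]
    }
    """
)


def resolved_direction(string_object, resource):
    """Pick the base direction for one string.

    Order of precedence: string-level metadata, then the
    resource-level default, then "auto" (consumer heuristics).
    """
    return string_object.get("dir") or resource.get("dir") or "auto"


for item in RESOURCE["strings"]:
    print(item["value"], resolved_direction(item, RESOURCE))
```

The second string overrides the resource default, which is the per-string escape hatch that a document-level default alone would not provide. No bidi control characters need to be added to the string values themselves.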
'Inline text' here has a readily understandable meaning in markup. It also applies to strings (e.g. in JSON, CSV, or other plain text formats), meaning runs of characters which don't include all the characters in the string.
It must be possible to indicate spans of inline text where the base direction changes. If markup is available, this is the preferred method. Otherwise your specification must require that Unicode control characters are recognized by the receiving application, and correctly implemented.
It must be possible to also set the direction for a span to auto. This means that the base direction will be determined by examining the content itself. A typical approach here would be to set the direction based on the first strong directional character outside of any markup.
The first-strong algorithm looks for the first character in the paragraph with a strong directional property according to the Unicode definitions. It then sets the base direction of the paragraph according to the direction of that character.
Note that the first-strong algorithm may incorrectly guess the direction of the paragraph when the first character is not typical of the rest of the paragraph, such as when a RTL paragraph or line starts with a LTR brand name or technical term.
For additional information about algorithms for detecting direction, see Estimation algorithms in the document where this was discussed with reference to HTML.
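The first-strong heuristic described above can be sketched in a few lines. This is only a minimal illustration, not a full implementation of the Unicode Bidirectional Algorithm: the function name `first_strong_direction` and its `default` parameter are assumptions made for this example, and the sketch does not skip characters inside isolate pairs as the full algorithm does.

```python
import unicodedata


def first_strong_direction(text, default="ltr"):
    """Return 'ltr' or 'rtl' based on the first strong directional character.

    Hypothetical helper for illustration; falls back to `default` when the
    text contains no strong character (for example, digits only).
    """
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":            # strong left-to-right
            return "ltr"
        if bidi_class in ("R", "AL"):    # strong right-to-left (incl. Arabic letters)
            return "rtl"
    return default
```

Called on a right-to-left sentence that happens to start with a Latin brand name, this sketch returns `ltr`, which is the kind of incorrect guess the note above warns about, and is why metadata should take precedence when it is available.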
If users use Unicode bidirectional control characters, the isolating RLI/LRI/FSI with PDI characters must be supported by the application and recommended (rather than RLE/LRE with PDF) by the spec.
Use of RLM/LRM should be appropriate, and expectations of what those controls can and cannot do should be clear in the spec. more
The Unicode bidirectional control characters U+200F RIGHT-TO-LEFT MARK and U+200E LEFT-TO-RIGHT MARK are not sufficient on their own to manage bidirectional text. They cannot produce a different base direction for embedded text. For that you need to be able to indicate the start and end of the range of the embedded text. This is best done by markup, if available, or failing that using the other Unicode bidirectional controls mentioned just above.
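Where markup is not available and a producer chooses to use control characters, marking the start and end of a range with isolating controls might look like the following sketch. The helper name `isolate` and its `direction` values are assumptions for this example; the four code points are the Unicode isolate controls.

```python
LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
RLI = "\u2067"  # RIGHT-TO-LEFT ISOLATE
FSI = "\u2068"  # FIRST STRONG ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE

_INITIATORS = {"ltr": LRI, "rtl": RLI, "auto": FSI}


def isolate(text, direction="auto"):
    """Wrap `text` in an isolate initiator and the matching PDI.

    Hypothetical helper: 'auto' uses FSI, so the receiving application
    determines the base direction from the first strong character.
    """
    return _INITIATORS[direction] + text + PDI
```

Unlike a lone RLM or LRM, the initiator and PDI delimit the embedded range, so the inserted text gets its own base direction without affecting the surrounding string.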
For markup, provide dedicated attributes for control of base direction and bidirectional overrides; do not rely on the user applying style properties to arbitrary markup to achieve bidi control.
For markup, allow bidi attributes on all inline elements in markup that contain text.
For markup, provide attributes that allow the user to (a) create an embedded base direction or (b) override the bidirectional algorithm altogether; the attribute should allow the user to set the direction to LTR or RTL or the aforementioned Auto in either of these two scenarios.
See the Character Model for the World Wide Web: Fundamentals for basic guidelines related to the use of characters and encodings.
See the Encoding specification for further guidelines related to use of character encodings.
Another Character Model document is currently in development, entitled String Matching . It looks at issues that arise when you try to compare two strings, be it identifiers or authored content.
The term character is often used to mean different things in different contexts: it can variously refer to the visual, logical, or byte-level representation of a given piece of text. This makes the term too imprecise to use when specifying algorithms, protocols, or document formats. Understanding how characters are defined and encoded in computing systems, along with the associated terminology used to make such specification unambiguous, is thus a necessary prerequisite to discussing the processing of string data.
The visual manifestation of a "character"—the shape most people mean when they say "character"—is what we call a user-perceived character . These visual building blocks are usually perceived to be a single unit of the visible text.
At their simplest, user-perceived characters are a single shape that can be tied one-to-one to the underlying computing representation. But a user-perceived character can be formed, in some scripts, from more than one character. And a given logical character can take many different shapes due to such influences as font selection, style, or the surrounding context (such as adjacent characters). In some cases, a single user-perceived character might be formed from a long sequence of logical characters. And some logical characters (so-called "combining marks") are always used in conjunction with another character.
When user-perceived characters are represented visibly (on screen or in print), they are represented by individual rendering units. This visual unit is called a grapheme (the word glyph is also used). Graphemes are the visual units found in fonts and rendering software.
Graphemes are encoded into computer systems using "logical characters". A character set is a set of logical characters: a specific collection of characters that can be used together to encode text. The most important character set is the Universal Character Set, also known as [ Unicode ]. This character set includes all of the characters used to encode text, including historical or extinct writing systems as well as modern usage, private use, typesetting symbols, and other things, such as the emoji. Other character sets are defined subsets of Unicode.
In Unicode, a 'character' is a single abstract logical unit of text. Each character in Unicode is assigned a unique integer number between 0x0000 and 0x10FFFF, which is called its code point. The term code point therefore unambiguously refers to a single logical character and its integer representation.
Specifications SHOULD explicitly define the term 'character' to mean a Unicode code point.
The relationship between code points and graphemes can be complex. In most cases, a code point sequence that forms a single grapheme should be treated as a single textual unit. For example, when cursoring across text, an entire grapheme should be selected together. It shouldn't be possible to cursor into the "middle" of a grapheme or delete only a part of a user-perceived character. Because the relationship between code points and graphemes is not one-to-one and can be somewhat complex, [ Unicode ] defines a specific type of grapheme: the extended grapheme cluster, which most closely matches the mapping of the underlying logical character sequence to a user-perceived character. When referring to 'graphemes' in this document, we mean extended grapheme clusters (unless otherwise called out).
Another example of the complex relationship between code points and graphemes is certain emoji. The emoji character for "family" has a code point in Unicode: 👪 [ U+1F46A FAMILY ] . It can also be formed by using a sequence of code points: U+1F468 U+200D U+1F469 U+200D U+1F466 . Altering or adding other emoji characters can alter the composition of the family. For example the sequence 👨👩👧👧 U+1F468 U+200D U+1F469 U+200D U+1F467 U+200D U+1F467 results in a composed emoji character for a "family: man, woman, girl, girl" on systems that support this kind of composition. Many common emoji can only be formed using sequences of code points, but should be treated as a single user-perceived character when displaying or processing the text. You wouldn't want to put a line-break in the middle of the family!
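The gap between one user-perceived character and its underlying code points can be inspected directly. The sketch below uses the family sequence given above; the helper name `describe` is an assumption for this example, and the character names come from whatever version of the Unicode Character Database ships with the Python runtime.

```python
import unicodedata

# The sequence from the text: man, ZWJ, woman, ZWJ, boy.
FAMILY_SEQUENCE = "\U0001F468\u200D\U0001F469\u200D\U0001F466"


def describe(text):
    """List (code point, character name) pairs for each code point in `text`."""
    return [(ord(ch), unicodedata.name(ch, "<unnamed>")) for ch in text]


# One user-perceived character on systems that support the composition,
# but five code points in the underlying string.
family_length = len(FAMILY_SEQUENCE)
```

Segmenting text into extended grapheme clusters is not provided by the Python standard library, so code that must avoid splitting such a sequence would typically rely on a library implementing the Unicode text segmentation rules.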
Unicode code points are just abstract integer values: they are not the values actually present in the memory of the computer or serialized on the wire. When processing text, computers use an array of fixed-size integer units. One such common unit is the byte (or octet, since bytes have 8 bits per unit). There are also 16-bit, 32-bit, or other size units. In many programming languages, the unit is called a char, which suggests that strings are made of "characters". We use the term code unit to refer unambiguously to the programming and serialized representation of characters. For example, in C, a char is generally an 8-bit byte: each char is an 8-bit code unit. In Java or JavaScript, a char is a 16-bit value.
A set of rules for converting code points to or from code units is called a character encoding form (or just "character encoding" for short).
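To make the distinction between code points and code units concrete, the following sketch counts the code units that the same text occupies in three Unicode encoding forms. The function name `code_unit_counts` and the dictionary keys are assumptions for this example.

```python
def code_unit_counts(text):
    """Count code points and code units of `text` in UTF-8, UTF-16 and UTF-32.

    The explicit little-endian codecs are used so that no byte order mark
    is added to the encoded output.
    """
    return {
        "code_points": len(text),
        "utf8": len(text.encode("utf-8")),            # 8-bit code units
        "utf16": len(text.encode("utf-16-le")) // 2,  # 16-bit code units
        "utf32": len(text.encode("utf-32-le")) // 4,  # 32-bit code units
    }
```

For a single code point such as U+1F46A this gives four UTF-8 code units, two UTF-16 code units and one UTF-32 code unit, which is why a language's `char` type cannot be assumed to hold a whole character.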
Specifications SHOULD use specific terms, when available, instead of the general term 'character'. more
When specifications use the term 'character' the specifications MUST define which meaning they intend, and SHOULD explicitly define the term 'character' to mean a Unicode code point. more
Specifications, software and content MUST NOT require or depend on a one-to-one relationship between characters and units of physical storage. more
Specifications, software and content MUST NOT require or depend on a one-to-one correspondence between characters and the sounds of a language. more
Specifications, software and content MUST NOT require or depend on a one-to-one mapping between characters and units of displayed text. more
Specifications and software MUST NOT require nor depend on a single keystroke resulting in a single character, nor that a single character be input with a single keystroke (even with modifiers), nor that keyboards are the same all over the world. more
In W3C Recommendation, Character Model for the World Wide Web .
Textual data objects defined by protocol or format specifications MUST be in a single character encoding. more
All specifications that involve processing of text MUST specify the processing of text according to the Reference Processing Model described by the rest of the recommendations in this list. more
Specifications MUST define text in terms of Unicode characters, not bytes or glyphs. more
For their textual data objects specifications MAY allow use of any character encoding which can be transcoded to a Unicode encoding form. more
Specifications MAY choose to disallow or deprecate some character encodings and to make others mandatory. Independent of the actual character encoding, the specified behavior MUST be the same as if the processing happened as follows: (a) The character encoding of any textual data object received by the application implementing the specification MUST be determined and the data object MUST be interpreted as a sequence of Unicode characters - this MUST be equivalent to transcoding the data object to some Unicode encoding form, adjusting any character encoding label if necessary, and receiving it in that Unicode encoding form, (b) All processing MUST take place on this sequence of Unicode characters, (c) If text is output by the application, the sequence of Unicode characters MUST be encoded using a character encoding chosen among those allowed by the specification. more
If a specification is such that multiple textual data objects are involved (such as an XML document referring to external parsed entities), it MAY choose to allow these data objects to be in different character encodings. In all cases, the Reference Processing Model MUST be applied to all textual data objects. more
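Steps (a) to (c) of the Reference Processing Model can be pictured as a decode, process, encode pipeline. The sketch below is only an illustration under stated assumptions: the function name `process_text` and its parameters are hypothetical, and a real specification would also say how the source encoding is determined.

```python
def process_text(data, source_encoding, target_encoding, transform):
    """Apply `transform` to text following the decode/process/encode steps."""
    # (a) Interpret the received bytes as a sequence of Unicode characters.
    text = data.decode(source_encoding)
    # (b) All processing takes place on the sequence of Unicode characters.
    result = transform(text)
    # (c) Encode the output using an encoding allowed by the specification.
    return result.encode(target_encoding)
```

For example, `process_text(b"caf\xe9", "iso-8859-1", "utf-8", str.upper)` uppercases text received in a legacy encoding and emits UTF-8, without the processing step ever touching raw bytes.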
Digital Encoding of Characters
In W3C Recommendation, Character Model for the World Wide Web .
Specifications SHOULD NOT arbitrarily exclude code points from the full range of Unicode code points from U+0000 to U+10FFFF inclusive. more
Specifications MUST NOT allow code points above U+10FFFF. more
Specifications SHOULD NOT allow the use of code points reserved by Unicode for internal use. more
Specifications MUST NOT allow the use of surrogate code points. more
Specifications SHOULD exclude compatibility characters in the syntactic elements (markup, delimiters, identifiers) of the formats they define. more
Specifications SHOULD allow the full range of Unicode for user-defined values. more
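The range rules above translate into simple integer checks. In this sketch the function names are assumptions, and "code points reserved by Unicode for internal use" is taken to mean the Unicode noncharacters, which is an interpretation made for the example rather than something stated in this checklist.

```python
def is_surrogate(cp):
    """True for surrogate code points, which must not be allowed."""
    return 0xD800 <= cp <= 0xDFFF


def is_noncharacter(cp):
    """True for U+FDD0..U+FDEF and the last two code points of each plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def is_allowed_code_point(cp):
    """True if `cp` lies in U+0000..U+10FFFF and is not a surrogate."""
    return 0 <= cp <= 0x10FFFF and not is_surrogate(cp)
```

A format could reject input for which `is_allowed_code_point` is false and, separately, warn about noncharacters, while leaving every other code point available for user-defined values.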
Digital Encoding of Characters
In W3C Recommendation, Character Model for the World Wide Web .
Specifications MUST NOT require the use of private use area characters with particular assignments. more
Specifications MUST NOT require the use of mechanisms for defining agreements of private use code points. more
Specifications and implementations SHOULD NOT disallow the use of private use code points by private agreement. more
Specifications MAY define markup to allow the transmission of symbols not in Unicode or to identify specific variants of Unicode characters. more
Specifications SHOULD allow the inclusion of or reference to pictures and graphics where appropriate, to eliminate the need to (mis)use character-oriented mechanisms for pictures or graphics. more
In W3C Recommendation, Character Model for the World Wide Web .
Specifications MUST either specify a unique character encoding, or provide character encoding identification mechanisms such that the encoding of text can be reliably identified. more
When designing a new protocol, format or API, specifications SHOULD require a unique character encoding. more
When basing a protocol, format, or API on a protocol, format, or API that already has rules for character encoding, specifications SHOULD use rather than change these rules. more
When a unique character encoding is required, the character encoding MUST be UTF-8, UTF-16 or UTF-32. more
This guideline needs further consideration: UTF-16 and UTF-32 are not recommended these days. UTF-8 is the recommended encoding.
Specifications SHOULD avoid using the terms 'character set' and 'charset' to refer to a character encoding, except when the latter is used to refer to the MIME charset parameter or its IANA-registered values. The term 'character encoding', or in specific cases the terms 'character encoding form' or 'character encoding scheme', are RECOMMENDED . more
If the unique encoding approach is not taken, specifications SHOULD require the use of the IANA charset registry names, and in particular the names identified in the registry as 'MIME preferred names', to designate character encodings in protocols, data formats and APIs. more
This guideline needs further consideration: the list of character encodings recommended for Web specifications is listed in the Encoding specification.
Character encodings that are not in the IANA registry SHOULD NOT be used, except by private agreement. more
If an unregistered character encoding is used, the convention of using 'x-' at the beginning of the name MUST be followed. more
If the unique encoding approach is not chosen, specifications MUST designate at least one of the UTF-8 and UTF-16 encoding forms of Unicode as admissible character encodings and SHOULD choose at least one of UTF-8 or UTF-16 as required encoding forms (encoding forms that MUST be supported by implementations of the specification). more
Specifications that require a default encoding MUST define either UTF-8 or UTF-16 as the default, or both if they define suitable means of distinguishing them. more
Choice and identification of code points
In W3C Recommendation, Character Model for the World Wide Web .
What is the 'Document Character Set' for XML and HTML, and how does it relate to the encodings I use for my documents?
Specifications MUST NOT propose the use of heuristics to determine the encoding of data. more
Specifications MUST define conflict-resolution mechanisms (e.g. priorities) for cases where there is multiple or conflicting information about character encoding. more
Choice and identification of code points
In W3C Recommendation, Character Model for the World Wide Web .
Specifications should provide a mechanism for escaping characters, particularly those which are invisible or ambiguous. more
It is generally recommended that character escapes be provided so that sequences that are difficult to enter or edit can be introduced using a plain text editor. Escape sequences are particularly useful for invisible or ambiguous Unicode characters, including zero-width spaces, soft-hyphens, various bidi controls, Mongolian vowel separators, etc.
For advice on use of escapes in markup, but which is mostly generalisable to other formats, see Using character escapes in markup and CSS .
Specifications SHOULD NOT invent a new escaping mechanism if an appropriate one already exists. more
The number of different ways to escape a character SHOULD be minimized (ideally to one). more
Escape syntax SHOULD require either explicit end delimiters or a fixed number of characters in each character escape. Escape syntaxes where the end is determined by any character outside the set of characters admissible in the character escape itself SHOULD be avoided. more
Whenever specifications define character escapes that allow the representation of characters using a number, the number MUST represent the Unicode code point of the character and SHOULD be in hexadecimal notation. more
Escaped characters SHOULD be acceptable wherever their unescaped forms are; this does not preclude that syntax-significant characters, when escaped, lose their significance in the syntax. In particular, if a character is acceptable in identifiers and comments, then its escaped form should also be acceptable. more
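As an illustration of the escape guidelines above, the sketch below uses a hypothetical `\u{...}` syntax with an explicit end delimiter and a hexadecimal code point number. The syntax, the function names, and the choice to escape only format characters (general category Cf) are all assumptions for this example; a real format would also need a way to escape the backslash itself.

```python
import re
import unicodedata

# Hypothetical escape syntax: \u{H...} with 1 to 6 hex digits and a closing brace.
_ESCAPE = re.compile(r"\\u\{([0-9A-Fa-f]{1,6})\}")


def escape_invisible(text):
    """Replace format characters (category Cf) with \\u{HEX} escapes."""
    out = []
    for ch in text:
        if unicodedata.category(ch) == "Cf":
            out.append(f"\\u{{{ord(ch):04X}}}")
        else:
            out.append(ch)
    return "".join(out)


def unescape(text):
    """Expand \\u{HEX} escapes; the number is the Unicode code point."""
    def replace(match):
        cp = int(match.group(1), 16)
        if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError(f"not an allowed code point: {match.group(0)}")
        return chr(cp)
    return _ESCAPE.sub(replace, text)
```

Because the closing brace ends the escape explicitly, a following hexadecimal digit in the text can never be absorbed into the escape by accident.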
In W3C Recommendation, Character Model for the World Wide Web .
Protocols, data formats and APIs MUST store, interchange or process text data in logical order. more
Independent of whether some implementation uses logical selection or visual selection, characters selected MUST be kept in logical order in storage. more
Specifications of protocols and APIs that involve selection of ranges SHOULD provide for discontiguous logical selections, at least to the extent necessary to support implementation of visual selection on screen on top of those protocols and APIs. more
Visual rendering and logical order
In W3C Recommendation, Character Model for the World Wide Web .
Specifications SHOULD NOT define a string as a 'byte string'. more
The 'character string' definition SHOULD be used by most specifications. more
In W3C Recommendation, Character Model for the World Wide Web .
Use U+XXXX syntax to represent Unicode code points in the specification. more
The U+XXXX format is well understood when referring to Unicode code points in a specification. These are space separated when appearing in a sequence. No additional decoration is needed. Note that a code point may contain four, five, or six hexadecimal digits. When fewer than four digits are needed, the code point number is zero filled. E.g. U+0020.
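Generating this notation is mechanical, as the following sketch shows. The function name `format_code_points` is an assumption for this example.

```python
def format_code_points(text):
    """Render each code point of `text` as U+XXXX, separated by spaces.

    The hexadecimal number is zero filled to at least four digits and is
    not padded further when five or six digits are needed.
    """
    return " ".join(f"U+{ord(ch):04X}" for ch in text)
```

For instance, a space followed by the family emoji is rendered as `U+0020 U+1F46A`.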
Since specifications in general need both a definition for their characters and the semantics associated with these characters, specifications SHOULD include a reference to the Unicode Standard, whether or not they include a reference to ISO/IEC 10646. more
A generic reference to the Unicode Standard MUST be made if it is desired that characters allocated after a specification is published are usable with that specification. A specific reference to the Unicode Standard MAY be included to ensure that functionality depending on a particular version is available and will not change over time. more
All generic references to the Unicode Standard MUST refer to the latest version of the Unicode Standard available at the date of publication of the containing specification. more
All generic references to ISO/IEC 10646 MUST refer to the latest version of ISO/IEC 10646 available at the date of publication of the containing specification. more
Referencing the Unicode Standard and ISO/IEC 10646
In W3C Recommendation, Character Model for the World Wide Web .
In this section:
The character string is RECOMMENDED as a basis for string indexing. more
Grapheme clusters MAY be used as a basis for string indexing in applications where user interaction is the primary concern. more
Specifications that define indexing in terms of grapheme clusters MUST either: (a) define grapheme clusters in terms of extended grapheme clusters as defined in Unicode Standard Annex #29, Text Boundaries [UTR #29], or (b) define specifically how tailoring is applied to the indexing operation. more
The use of byte strings for indexing is NOT RECOMMENDED . more
A UTF-16 code unit string is NOT RECOMMENDED as a basis for string indexing, even if this results in a significant improvement in the efficiency of internal operations when compared to the use of character string. more
Specifications that need a way to identify substrings or point within a string SHOULD consider ways other than string indexing to perform this operation. more
Specifications SHOULD understand and process single characters as substrings, and treat indices as boundary positions between counting units, regardless of the choice of counting units. more
Specifications of APIs SHOULD NOT specify single characters or single 'units of encoding' as argument or return types. more
When the positions between the units are counted for string indexing, starting with an index of 0 for the position at the start of the string is the RECOMMENDED solution, with the last index then being equal to the number of counting units in the string. more
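The indexing recommendations above can be illustrated with Python strings, which are indexed by code point. In this sketch the function name `substring` is an assumption; indices are treated as boundary positions from 0 to the number of code points, and a single character is simply a substring of length one.

```python
def substring(text, start, end):
    """Return the code points between boundary positions `start` and `end`.

    Position 0 is the start of the string and position len(text) is the end.
    """
    if not 0 <= start <= end <= len(text):
        raise IndexError("boundary positions out of range")
    return text[start:end]


SAMPLE = "a\U0001F46Ab"
# Three counting units as a character string, but four as UTF-16 code units.
utf16_units = len(SAMPLE.encode("utf-16-le")) // 2
```

With code point indexing, `substring(SAMPLE, 1, 2)` returns the whole emoji, whereas an index expressed in UTF-16 code units could land between the two halves of its surrogate pair.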
String identity matching for identifiers and syntactic content should involve the following steps: (a) Ensure the strings to be compared constitute a sequence of Unicode code points (b) Expand all character escapes and includes (c) Perform any appropriate case-folding and Unicode normalization step (d) Perform any additional matching tailoring specific to the specification, and (e) Compare the resulting sequences of code points for identity. more
The default recommendation for matching strings in identifiers and syntactic content is to do no normalization (ie. case folding or Unicode Normalization) of content. more
'ASCII case fold' and 'Unicode canonical case fold' approaches should only be used in special circumstances. more
A 'Unicode compatibility case fold' approach should not be used. more
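Steps (a) to (e) of identity matching can be sketched as follows. The function name `identity_match` and all of its parameters are assumptions for this example; the defaults perform no case folding or normalization, in line with the default recommendation above, and the simple fold-then-normalize order is a simplification of Unicode caseless matching.

```python
import unicodedata


def identity_match(a, b, *, encoding="utf-8", expand=None,
                   casefold=False, normalization=None, tailor=None):
    """Compare two strings for identity after optional preparation steps."""
    def prepare(value):
        # (a) Ensure the value is a sequence of Unicode code points.
        text = value.decode(encoding) if isinstance(value, bytes) else value
        # (b) Expand character escapes and includes (format-specific callable).
        if expand is not None:
            text = expand(text)
        # (c) Optional case folding and Unicode normalization.
        if casefold:
            text = text.casefold()
        if normalization is not None:
            text = unicodedata.normalize(normalization, text)
        # (d) Any additional tailoring defined by the specification.
        if tailor is not None:
            text = tailor(text)
        return text

    # (e) Compare the resulting code point sequences for identity.
    return prepare(a) == prepare(b)
```

With the defaults, U+00E9 and the sequence U+0065 U+0301 do not match even though they look identical; passing `normalization="NFC"` makes them match.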
Specifications of vocabularies MUST define the boundaries between syntactic content and character data as well as entity boundaries (if the language has any include mechanism). more
Specifications SHOULD NOT specify a Unicode normalization form for encoding, storage, or interchange of a given vocabulary. more
Implementations MUST NOT alter the normalization form of syntactic or natural language content being exchanged, read, parsed, or processed except when required to do so as a side-effect of text transformation such as transcoding the content to a Unicode character encoding, case folding, or other user-initiated change, as consumers or the content itself might depend on the de-normalized representation. more
Specifications SHOULD NOT specify compatibility normalization forms (NFKC, NFKD). more
Specifications MUST document or provide a health-warning if canonically equivalent but disjoint Unicode character sequences represent a security issue. more
Where operations can produce denormalized output from normalized text input, specifications MUST define whether the resulting output is required to be normalized or not. Specifications MAY state that performing normalization is optional for some operations; in this case the default SHOULD be that normalization is performed, and an explicit option SHOULD be used to switch normalization off. more
Specifications that require normalization MUST NOT make the implementation of normalization optional. more
Normalization-sensitive operations MUST NOT be performed unless the implementation has first either confirmed through inspection that the text is in normalized form or it has re-normalized the text itself. Private agreements MAY be created within private systems which are not subject to these rules, but any externally observable results MUST be the same as if the rules had been obeyed. more
A normalizing text-processing component which modifies text and performs normalization-sensitive operations MUST behave as if normalization took place after each modification, so that any subsequent normalization-sensitive operations always behave as if they were dealing with normalized text. more
Specifications that perform comparison or matching of string values SHOULD specify the appropriate note or warning regarding Unicode normalization.
The use or adoption of Unicode Normalization in a specification is usually part of defining how matching takes place in a given format or protocol. To help specification authors and implementers understand some of the complexity involved, the Internationalization Working Group has developed a document describing the considerations for the matching and comparison of strings: Character Model for the World Wide Web: String Matching [ CHARMOD-NORM ].
One of the choices specifications need to make is whether (or not) to require Unicode Normalization as part of matching various "values" defined as part of the specification's vocabulary. Values are commonly part of a document format or protocol's syntax, and include such things as: attribute names or values, element names or values, IDs, and so forth. Specifications that follow the recommendation to not employ normalization as part of matching should include the following Note as a reminder to content authors.
Example note. Necessarily this version is non-specific about what constitutes "values": specifications may wish to be more specific.
This specification does not permit Unicode normalization of values for the purposes of comparison. Values that are visually and semantically identical but use different Unicode character sequences will not match. Content authors are advised to use the same encoding sequence consistently or to avoid potentially troublesome characters when choosing values. For more information, see [ CHARMOD-NORM ].
Specifications that choose to require normalization as part of string matching should include the following warning:
Example warning. Necessarily this version is non-specific about what constitutes "values": specifications may wish to be more specific.
This specification applies Unicode normalization during the matching of values. This can have an effect on the appearance and meaning of the affected text. For more information, see [ CHARMOD-NORM ].
Contact the I18N WG for alternatives or assistance if the above do not meet your needs or you're not sure about usage.
Specifications and implementations that define string matching as part of the definition of a format, protocol, or formal language (which might include operations such as parsing, matching, tokenizing, etc.) MUST define the criteria and matching forms used. These MUST be one of: (a) case-sensitive (b) Unicode case-insensitive using Unicode full case-folding (c) ASCII case-insensitive.
Case-sensitive matching is RECOMMENDED for matching syntactic content, including user-defined values. more
Specifications that define case-insensitive matching in vocabularies that include more than the Basic Latin (ASCII) range of Unicode MUST specify Unicode full casefold matching. more
Specifications that define case-insensitive matching in vocabularies limited to the Basic Latin (ASCII) subset of Unicode MAY specify ASCII case-insensitive matching. more
If language-sensitive case-sensitive matching is specified, Unicode case mappings SHOULD be tailored according to language and the source of the language used for each tailoring MUST be specified. more
Specifications that define case-insensitive matching in vocabularies SHOULD NOT specify language-sensitive case-insensitive matching. more
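The difference between ASCII case-insensitive matching and Unicode full case folding can be sketched as below. The function names are assumptions for this example; `str.casefold` is Python's built-in full case folding and is not language-sensitive.

```python
def ascii_lower(text):
    """Lowercase only the letters A-Z; every other code point is unchanged."""
    return "".join(chr(ord(ch) + 32) if "A" <= ch <= "Z" else ch for ch in text)


def ascii_case_insensitive_equal(a, b):
    """ASCII case-insensitive match, suitable for ASCII-only vocabularies."""
    return ascii_lower(a) == ascii_lower(b)


def full_casefold_equal(a, b):
    """Unicode case-insensitive match using full case folding."""
    return a.casefold() == b.casefold()
```

Full case folding can change the length of a string: "Straße" and "STRASSE" compare equal under `full_casefold_equal` but not under the ASCII-only comparison.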
Some specifications, formats, or protocols or their implementations need to specify limits for the size of a given data structure or text field. This could be due to many reasons, such as limits on processing, memory, data structure size, and so forth. When selecting or specifying limits on the length of a given string, specifications or implementations need to ensure that they do not cause corruption in the text.
Specifications SHOULD NOT limit the size of data fields unless there is a specific practical or technical limitation.
There are many reasons why a length limit might be needed in a specification or format. Generally length limits correspond to underlying limits in the implementation, such as the use of fixed-size fields in a database or data store, the desire to fit into practical boundaries such as packet size, or some other implementation detail related to storage allocation or efficiency.
When truncating strings, it's necessary to decide what units to use when counting the size of the string. In many cases this is beyond the control of the specification, since the truncation is occurring for some preordained reason. However, when the choice is available, some general guidelines can be applied.
If the limitation is related to the number of display positions, the grapheme count usually corresponds most closely to the expected limit. Note that proportional width fonts, combining marks, complex scripts, and many other factors complicate counting "screen positions". In Web pages, for example, the CSS text-overflow property provides visual truncation without disturbing the content of the text. Attempts to estimate the size of a given piece of text based on the number of Unicode code points or even the number of grapheme clusters are mostly futile.
Otherwise most limits are expressed in terms of code points in Unicode or code units (such as bytes) in a specific character encoding. Code points provide the best user experience, since all Unicode code points are treated identically: if text is truncated after 40 code points, all languages and scripts get the same number of code points to work with. By contrast, when the size limit is expressed in code units such as bytes in UTF-8, users who write in a language that mostly uses ASCII letters get many more characters (code points) for a given size limit than users whose language is mostly made up of characters that take 2, 3, or 4 bytes per code point.
Specifications that limit the length of a string MUST specify which type of unit (extended grapheme clusters, Unicode code points, or code units) the length limit uses.
Specifications that limit the length of a string SHOULD specify the length in terms of Unicode code points.
If a specification sets a length limit in code units (such as bytes), it MUST specify that truncation can only occur on code point boundaries.
If a specification specifies a length limit, it SHOULD specify that any string that is truncated include an indicator, such as ellipses, that the string has been altered.
When specifying a length limitation in code units (such as bytes), specifications SHOULD set the maximum length in a way that accommodates users whose language requires multibyte code unit sequences.
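When a limit has to be expressed in bytes, truncation on code point boundaries with a visible indicator might be sketched as follows. The function name `truncate_utf8` and its parameters are assumptions for this example, and the sketch can still split an extended grapheme cluster, since it only protects code point boundaries.

```python
def truncate_utf8(text, max_bytes, indicator="\u2026"):
    """Truncate `text` so its UTF-8 form fits in `max_bytes`.

    The cut never falls inside a code point, and an indicator (an ellipsis
    by default) is appended whenever the string has been altered.
    """
    data = text.encode("utf-8")
    if len(data) <= max_bytes:
        return text
    budget = max_bytes - len(indicator.encode("utf-8"))
    if budget < 0:
        raise ValueError("limit too small for the truncation indicator")
    # Decoding with errors="ignore" drops a trailing partial byte sequence.
    return data[:budget].decode("utf-8", errors="ignore") + indicator
```

With a limit of 6 bytes, five ASCII letters fit unchanged, while a string of five two-byte letters is cut to a single letter plus the ellipsis, which shows how byte limits disadvantage users of non-ASCII scripts.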
Software that sorts or searches text for users SHOULD do so on the basis of appropriate collation units and ordering rules for the relevant language and/or application. more
Where searching or sorting is done dynamically, particularly in a multilingual environment, the 'relevant language' SHOULD be determined to be that of the current user, and may thus differ from user to user. more
Software that allows users to sort or search text SHOULD allow the user to select alternative rules for collation units and ordering. more
Specifications and implementations of sorting and searching algorithms SHOULD accommodate text that contains any character in Unicode. more
Resource identifiers must permit the use of characters outside those of plain ASCII. discussion
Specifications MUST define when the conversion from IRI references to URI references (or subsets thereof) takes place, in accordance with Internationalized Resource Identifiers (IRIs). more
Many current specifications already contain provisions in accordance with Internationalized Resource Identifiers (IRIs). For XML 1.0, see Section 4.2.2, External Entities. XML Schema Part 2: Datatypes provides the anyURI datatype (see Section 3.2.17). The XML Linking Language (XLink) provides the href attribute (see Section 5.4, Locator Attribute).
Document formats should allow IRIs to be used; handlers for protocols that do not currently support IRIs can convert the IRI to a URI when the IRI is dereferenced.
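The generic step of converting an IRI to a URI, percent-encoding each non-ASCII character as its UTF-8 bytes, can be sketched as below. The function name `iri_to_uri` is an assumption for this example, and the sketch deliberately ignores host names, which in practice are usually converted using the rules for internationalized domain names instead.

```python
def iri_to_uri(iri):
    """Percent-encode every non-ASCII character of `iri` as UTF-8 bytes.

    ASCII characters, including existing percent-escapes, are left as-is.
    """
    out = []
    for ch in iri:
        if ord(ch) < 0x80:
            out.append(ch)
        else:
            out.extend(f"%{byte:02X}" for byte in ch.encode("utf-8"))
    return "".join(out)
```

A document format can therefore store the readable IRI, and a handler for a protocol that only accepts URIs can apply a conversion of this kind at the moment the IRI is dereferenced.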
Do not define attribute values that will contain user readable content. Use elements for such content. more
If you do define attribute values containing user readable content, provide a means to indicate directional and language information for that text separately from the text contained in the element.
Provide a way for authors to annotate arbitrary inline content using a span-like element or construct. more
Identifiers should be case-sensitive.
Avoid natural language text in elements that only allow for plain text and in attribute values.
Provide a span-like element that can be used for any text content to apply information needed for internationalization. more
Internationalization information may include a change of language, bidirectional text behaviour changes, translate flags, etc.
Text decoration such as underline and overline should allow lines to skip ink.
It should be possible to specify the distance of overlines and underlines from the text. more
Skipping ink for text decoration such as underlines may not be appropriate for some scripts, such as Arabic, which prefers to move the underline further away from the baseline instead.
It should be possible to render text vertically for languages such as Japanese, Chinese, Korean, Mongolian, etc.
Vertical text must support line progression both from left to right (eg. Mongolian) and from right to left (eg. Japanese).
By default, text decoration, ruby, and the like in vertical text where lines are stacked from left to right (eg. Mongolian) should appear on the same side as for CJK vertical text. Placement should not rely on the before and after line locations.
Vertical writing modes that are equivalent to the vertical- values in CSS (only) should use UTR50 to apply default text orientation of characters. (This does not apply to writing modes that are equivalent to sideways- in CSS.)
By default, glyphs of scripts that are normally horizontal should run along a line in vertical text such that the top of the character is toward the right side of the vertical line, but there should also be a mechanism to allow them to progress down the line in upright orientation. Such a mechanism should use grapheme clusters as a minimum text unit, but where necessary allow syllabic clusters to be treated as a unit when they involve more than one grapheme cluster.
Upright Arabic text in vertical lines should use isolated letter forms and the order of text should read top to bottom.
It should be possible for some sequences of characters (particularly digits) to run horizontally within vertical lines (tate chu yoko).
Writing modes should provide values like sideways-lr and sideways-rl in CSS to allow for vertical rotation of lines of horizontal script text. UTR50 is not applicable for these cases.
Overlaps should not be exposed when transparency is applied to the joined letters in cursive text, such as for Arabic, Mongolian, and N'Ko.
When adding a text stroke or shadow, joined letters should not be separated from their neighbors in cursive script text.
Box positioning coordinates must take into account whether the text is horizontal or vertical. more
It is typical, when localizing a user interface or web page, to create mirror-images for the RTL and LTR versions. For example, it is likely that a box that appears near the left side of a window containing English content would appear near the right side of the window if the content is Arabic or Hebrew.
It should preferably be automatic for this to change, based on the base direction of the current context, unless there is a strong reason for using absolute geometry. One way to achieve this is to use keywords such as start and end, rather than left and right, to indicate position.
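The mapping from logical keywords to physical sides can be sketched as follows. This is a minimal Python illustration for horizontal text only; the function name `resolve_edge` and the string values are assumptions for this example, not part of any specification.

```python
def resolve_edge(logical: str, base_direction: str) -> str:
    """Map a logical position ("start" or "end") to a physical side.

    Hypothetical helper: the physical side depends on the base direction
    of the current context, so content mirrors automatically.
    """
    if logical not in ("start", "end"):
        raise ValueError(f"unknown logical position: {logical!r}")
    if base_direction not in ("ltr", "rtl"):
        raise ValueError(f"unknown base direction: {base_direction!r}")
    if base_direction == "ltr":
        return "left" if logical == "start" else "right"
    return "right" if logical == "start" else "left"
```

A box positioned at start resolves to the left side in an English context and to the right side in an Arabic or Hebrew context, without the author changing any value.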
'Ruby' style annotations alongside base text should be supported for Chinese, Japanese, Korean and Mongolian text, in both horizontal and vertical writing modes.
Ruby implementations should support zhuyin fuhao (bopomofo) ruby for Traditional Chinese.
Ruby implementations should support a tabular content model (such that ruby contents can be arranged in a sequence approximating to rb rb rt rt).
Ruby implementations should make it possible to use an explicit rb tag for ruby bases.
Ruby implementations should allow annotations to appear on either or both sides of the base text.
What are 'ruby' annotations?
Line heights must allow for characters that are taller than English.
Box sizes must allow for text expansion in translation.
Line wrapping should take into account the special rules needed for non-Latin scripts. more
Various non-Latin writing systems don't simply wrap text on inter-word spaces. They have additional rules that must be respected. For example
See the CSS Text Level 3 specification for additional background. ( This tutorial provides additional examples, if needed.)
Avoid specifying presentational tags, such as b for bold, and i for italic. more
It is best to avoid presentational markup b, i or u, since it isn't interoperable across writing systems and furthermore may cause unnecessary problems for localisation. In addition, some scripts have native approaches to things such as emphasis, that do not involve, and can be very different from, bolding, italicisation, etc.
In the HTML case, there was a legacy issue, but unless there is one for your specification, the recommendation is that styling be used instead to determine the presentation of the text, and that any markup or tagging should allow for general semantic approaches.
For an explanation of the issues surrounding b and i tags, see Using <b> and <i> elements.
Software systems that support languages and cultural preferences are said to be internationalized. An internationalized system uses APIs to provide language or culturally specific processing, based on user preferences. These user preferences are usually referred to as a locale.
For more information on general internationalization terminology, see Language Tags and Locale Identifiers [ LTLI ].
When defining data formats, use locale-neutral serialization forms.
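As a rough illustration of a locale-neutral serialization form, the Python sketch below writes a number with a plain decimal point and a timestamp in ISO 8601 form relative to UTC. The function name `serialize_reading` and the field names are assumptions made for this example.

```python
import json
from datetime import datetime, timezone
from decimal import Decimal


def serialize_reading(value: Decimal, taken_at: datetime) -> str:
    """Serialize a value and an aware timestamp without locale-specific formatting."""
    return json.dumps({
        # Always "1234.5": no grouping separators, no locale decimal comma.
        "value": str(value),
        # ISO 8601 with an explicit UTC offset, independent of display locale.
        "takenAt": taken_at.astimezone(timezone.utc).isoformat(),
    })
```

Formatting for a particular user (for example "1.234,5" or a localized date) is then applied only at presentation time, by the consumer of the data.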
When defining calendar and date systems, be sure to allow for dates prior to the common era, or at least define handling of dates outside the most common range.
When defining time or date data types, ensure that the time zone or relationship to UTC is always defined.
Provide a health warning for conversion of time or date data types that are "floating" to/from incremental types, referring as necessary to the Time Zones WG Note . more
Allow for leap seconds in date and time data types. more
Leap seconds are inserted occasionally. To accommodate them, the seconds field must be allowed to range from 0 to 60 (ie. there are sixty-ONE seconds in that minute).
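A validator that allows for leap seconds might look like the following Python sketch. The function name `parse_time_of_day` and the hh:mm:ss input format are assumptions for this example; the point is only that a seconds value of 60 is accepted rather than rejected.

```python
import re
from typing import Tuple

_TIME_OF_DAY = re.compile(r"([0-9]{2}):([0-9]{2}):([0-9]{2})")


def parse_time_of_day(text: str) -> Tuple[int, int, int]:
    """Parse hh:mm:ss, accepting second == 60 so leap seconds are representable."""
    match = _TIME_OF_DAY.fullmatch(text)
    if match is None:
        raise ValueError(f"not in hh:mm:ss form: {text!r}")
    hour, minute, second = (int(group) for group in match.groups())
    # Seconds may be 0..60 inclusive; 60 only occurs in a leap-second minute.
    if hour > 23 or minute > 59 or second > 60:
        raise ValueError(f"field out of range: {text!r}")
    return hour, minute, second
```

This sketch does not check whether a leap second actually occurred at the given instant; a specification would need to say whether that check is required.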
Use consistent terminology when discussing date and time values. Use 'floating' time for time zone independent values.
Keep separate the definition of time zone from time zone offset.
Use IANA time zone IDs to identify time zones. Do not use offsets or LTO as a proxy for time zone.
Use a separate field to identify time zone.
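The distinction between a time zone and a time zone offset can be illustrated with a small Python sketch that stores the IANA time zone ID in its own field. The class name `ScheduledEvent` and its fields are hypothetical; the sketch assumes Python 3.9 or later and an available time zone database.

```python
from dataclasses import dataclass
from datetime import datetime
from zoneinfo import ZoneInfo


@dataclass(frozen=True)
class ScheduledEvent:
    local_time: datetime  # naive ("floating") wall-clock time
    zone_id: str          # IANA time zone ID, kept in a separate field

    def utc_offset_hours(self) -> float:
        """Derive the offset for this instant from the zone's rules."""
        aware = self.local_time.replace(tzinfo=ZoneInfo(self.zone_id))
        return aware.utcoffset().total_seconds() / 3600
```

For the zone "Europe/Paris", the derived offset differs between a January date and a July date, which is why a stored offset cannot stand in for the zone itself.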
When defining rules for a "week", allow for culturally specific rules to be applied. more
For example, the weekend is not always Saturday/Sunday; the first day of week is not always Sunday [or Monday or...], etc.
When defining rules for week number of year, allow for culturally specific rules to be applied.
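One way to allow for culturally specific week rules is to look them up per region rather than hard-coding them. The Python sketch below uses a small illustrative table; the region entries and the names `WEEK_RULES` and `is_weekend` are assumptions for this example, and real data should come from a maintained source such as CLDR.

```python
from datetime import date

# Illustrative values only. Weekdays follow Python's convention:
# Monday = 0 ... Sunday = 6.
WEEK_RULES = {
    "US": {"first_day": 6, "weekend": {5, 6}},  # week starts Sunday
    "DE": {"first_day": 0, "weekend": {5, 6}},  # week starts Monday
    "IL": {"first_day": 6, "weekend": {4, 5}},  # weekend is Friday/Saturday
}


def is_weekend(day: date, region: str) -> bool:
    """Return True if the day falls on the weekend for the given region."""
    return day.weekday() in WEEK_RULES[region]["weekend"]
```

The same date can therefore be a working day in one region and part of the weekend in another.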
When non-Gregorian calendars are permitted, note that the "month" field can go to 13 (undecimber).
Check whether you really need to store or access given name and family name separately. more
Avoid placing limits on the length of names, or if you do, make allowance for long strings. more
Try to avoid using the labels 'first name' and 'last name' in non-localized contexts. more
Consider whether it would make sense to have one or more extra fields, in addition to the full name field, where users can provide part(s) of their name that you need to use for a specific purpose. more
Allow for users to be asked separately how they would like to be addressed when someone contacts them. more
If parts of a person's name are captured separately, ensure that the separate items can capture all relevant information. more
Be careful about assumptions built into algorithms that pull out the parts of a name automatically. more
Don't assume that a single letter name is an initial. more
Don't require that people supply a family name. more
Don't forget to allow people to use punctuation such as hyphens, apostrophes, etc. in names. more
Don't require names to be entered all in upper case. more
Allow the user to enter a name with spaces. more
Don't assume that members of the same family will share the same family name. more
It may be better for a form to ask for 'Previous name' rather than 'Maiden name' or 'née'. more
You may want to store the name in both Latin and native scripts, in which case you probably need to ask the user to submit their name in both native script and Latin-only form, as separate items. more
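Several of the recommendations above can be combined in a simple data model. The following Python sketch is one possible shape, not a prescribed one; the class name `PersonName`, its field names, and the `is_acceptable` helper are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PersonName:
    full_name: str                      # free text, in the user's own script
    addressed_as: Optional[str] = None  # how the user wants to be addressed
    latin_form: Optional[str] = None    # optional Latin-only transcription


def is_acceptable(name: PersonName) -> bool:
    """Accept any non-empty name.

    Deliberately no length limit, no required family name, and no
    restrictions on case, spaces, hyphens, apostrophes or single letters.
    """
    return bool(name.full_name.strip())
```

Keeping validation this permissive avoids rejecting legitimate names; any stricter rule would need to be justified by a specific requirement.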
Personal names around the world
How do people's names differ around the world, and what are the implications of those differences on the design of forms, databases, ontologies, etc. for the Web?
When defining email field validation, allow for EAI (smtputf8) names.
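A field check that allows for EAI addresses has to avoid ASCII-only patterns. The Python sketch below is a deliberately permissive structural check; the function name `looks_like_email` is an assumption for this example, and the check does not attempt full address validation (for instance, it rejects quoted local parts that contain spaces).

```python
def looks_like_email(address: str) -> bool:
    """Permissive structural check that does not restrict characters to ASCII."""
    local, separator, domain = address.rpartition("@")
    if not (separator and local and domain):
        return False
    # Simplification: disallow whitespace anywhere in the address.
    return not any(ch.isspace() for ch in address)
```

An address whose local part and domain are written in a non-Latin script passes this check, whereas a pattern such as `[A-Za-z0-9._-]+@...` would reject it.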
When parsing user input of numeric values, allow for digit shaping (non-ASCII digits).
When formatting numeric values for display, allow for culturally sensitive display, including the use of non-ASCII digits (digit shaping).
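The two items above on digit shaping can be sketched in Python as follows. The function names `parse_integer` and `format_integer` are assumptions for this example, and only the Arabic-Indic digit set is shown for output; a real implementation would normally rely on a locale data library.

```python
import unicodedata

# Translation table from ASCII digits to Arabic-Indic digits (U+0660..U+0669).
_ARABIC_INDIC = str.maketrans(
    "0123456789", "".join(chr(0x0660 + i) for i in range(10))
)


def parse_integer(text: str) -> int:
    """Parse an unsigned integer written with any Unicode decimal digits."""
    # unicodedata.decimal raises ValueError for non-digit characters.
    ascii_digits = "".join(str(unicodedata.decimal(ch)) for ch in text.strip())
    return int(ascii_digits)


def format_integer(value: int, digits: str = "latn") -> str:
    """Format an integer using the requested digit set ("latn" or "arab")."""
    text = str(value)
    return text.translate(_ARABIC_INDIC) if digits == "arab" else text
```

Parsing accepts whichever digits the user typed, while the stored value remains a plain integer; the choice of digit shapes is made again only when the value is displayed.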
Localization [ LTLI ] enables users to employ software in the language and locale of their choice. Specifications for protocols and document formats need to consider how to provide the language and formatting that the end-user expects.
APIs and protocols SHOULD provide language independent identifiers for errors. For example, HTTP result codes, such as the familiar 404, help users communicate which error they received or look up a translation.
APIs and protocols SHOULD include language and direction metadata for all natural language messages, including errors, to ensure proper presentation, even if localization is not provided. See also [ STRING-META ].
All natural language fields or messages, including error messages, defined by a given API or protocol SHOULD be localized into the preferred locale of the user or, if that language is not available, supplied with a suitable fallback or default.
Specifications for APIs or protocols SHOULD define how the user's locale is determined (this is called language negotiation ).
Specifications MAY define a specific default language for messages or errors in an API or protocol.
Note that specifications do not need to require that messages be returned in all possible or all available locales.
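The recommendations above (a language-independent identifier, language and direction metadata, localization with a fallback, and a defined default) can be sketched together in Python. Everything named here is hypothetical: the `LocalizedMessage` class, the `not_found` identifier, the message catalogue and the `lookup` function. The fallback is a simplified truncation lookup, not a complete language negotiation algorithm.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LocalizedMessage:
    code: str  # language-independent identifier for the error
    text: str  # natural language message
    lang: str  # language tag of the text actually returned
    dir: str   # base direction of the text: "ltr" or "rtl"


# Hypothetical catalogue; not every locale needs to be available.
_MESSAGES = {
    "not_found": {
        "en": ("The resource was not found.", "ltr"),
        "fr": ("La ressource est introuvable.", "ltr"),
        "ar": ("\u0644\u0645 \u064a\u062a\u0645 \u0627\u0644\u0639\u062b\u0648\u0631 \u0639\u0644\u0649 \u0627\u0644\u0645\u0648\u0631\u062f.", "rtl"),
    },
}
DEFAULT_LANG = "en"  # the specification-defined default language


def lookup(code: str, preferred: List[str]) -> LocalizedMessage:
    """Return the message in the best available language, else the default."""
    available = _MESSAGES[code]
    for tag in preferred:
        candidate = tag.lower()
        while candidate:
            if candidate in available:
                text, direction = available[candidate]
                return LocalizedMessage(code, text, candidate, direction)
            # Drop the last subtag and retry, e.g. "fr-ca" -> "fr".
            candidate = candidate.rpartition("-")[0]
    text, direction = available[DEFAULT_LANG]
    return LocalizedMessage(code, text, DEFAULT_LANG, direction)
```

Because the returned message always carries its own language and direction, a client can present it correctly even when the fallback language differs from what the user requested.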
You can take a snapshot of the checklist, and paste it into an issue in your Github repository. Then you can click on a checkbox alongside a checklist item to indicate whether or not your spec supports it, or you can add your own notes/questions by editing the issue.
To do this, simply click on the button just below, then paste the output it generates into a new issue in your repository.
The following summarises substantive changes since the previous publication, but the material is still subject to significant flux as it develops. This should not be a reason not to use the document. What it so far contains is useful, and any shortfalls can be reported or discussed.
See the github commit log for more details.
Thanks to Addison Phillips for help reviewing old reviews for recommendations.
Other people who contributed through reviews or issues include Steve Atkin, Andrew Cunningham, Martin Dürst, Asmus Freytag, John Klensin, Tomer Mahlin, Chaals McCathieNevile, Florian Rivoal.