W3C Editor's Draft 19 June
Copyright © 2021-2024 World Wide Web Consortium. W3C® liability, trademark and permissive document license rules apply.
This document provides definitions for various terms related to W3C internationalization.
This section describes the status of this document at the time of its publication. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at https://www.w3.org/TR/.
We welcome comments on this document, but to make it easier to track them, please raise separate issues for each comment, and point to the section you are commenting on using a URL.
This document was published by the Internationalization Working Group as an Editor's Draft.
Publication as an Editor's Draft does not imply endorsement by W3C and its Members.
This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This document was produced by a group operating under the W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.
This document is governed by the 03 November 2023 W3C Process Document.
This document can be pointed to for definitions of terms, or these definitions may be copied to other documents and slightly adapted.
The W3C Internationalization Working Group also uses definitions provided by the Unicode Consortium. For more information on how to use this glossary, see Appendix A, How to use this glossary.
Abjad . A writing system in which consonants are indicated, but not short vowels. The term “abjad” is derived from the first four letters of the traditional order of the Arabic script: alef, beh, jeem, dal . (See also the Unicode definition and Section 6.1, Writing Systems .) Alternatives include abugida , alphabet and syllabary .
Abugida . A writing system in which consonants have an inherent vowel, and other vowels are indicated by associating the consonant with one or more combining marks and/or letters. The term “abugida” is derived from the first four letters of the Ethiopic script in the Semitic order: alf, bet, gaml, dant . (See also the Unicode definition and Section 6.1, Writing Systems .) Alternatives include abjad , alphabet and syllabary .
Alphabet . A writing system in which both consonants and vowels are indicated. The term “alphabet” is derived from the first two letters of the Greek script: alpha, beta . (See also the Unicode definition and Section 6.1, Writing Systems .) Alternatives include abugida , abjad and syllabary .
Application internal identifiers . Identifiers defined by or assigned by a user in a vocabulary that is internal to the document format or protocol and not intended for human interaction. Such values are generally not localizable text .
ASCII case-insensitive matching . Defined in INFRA , this compares two sequences of code points as if all ASCII code points in the range 0x41 to 0x5A (A to Z) were mapped to the corresponding code points in the range 0x61 to 0x7A (a to z), but other code points are not case-folded . ASCII case-insensitive matching can be required when a vocabulary is itself constrained to ASCII.
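As a minimal sketch of the comparison described in this entry, the following Python maps only the code points 0x41 to 0x5A onto 0x61 to 0x7A before comparing. The function names are illustrative and are not taken from INFRA.

```python
def ascii_lowercase(text: str) -> str:
    """Map A-Z (0x41-0x5A) to a-z (0x61-0x7A); leave every other code point untouched."""
    return "".join(
        chr(ord(ch) + 0x20) if 0x41 <= ord(ch) <= 0x5A else ch
        for ch in text
    )


def ascii_case_insensitive_match(a: str, b: str) -> bool:
    """Compare two strings as if only the ASCII letters had been lowercased."""
    return ascii_lowercase(a) == ascii_lowercase(b)


print(ascii_case_insensitive_match("Content-Type", "content-TYPE"))  # ASCII letters only
print(ascii_case_insensitive_match("\u00C9", "\u00E9"))              # non-ASCII letters are not folded
```

Because non-ASCII code points pass through unchanged, this kind of matching suits vocabularies that are themselves restricted to ASCII, as the entry notes.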
Auto (direction) . A value used for the paragraph direction of textual data when the actual direction is unknown; it indicates that first-strong detection will be used to estimate the display of the text. See also LTR and RTL .
Base direction . Determines the general arrangement and progression of content when bidirectional text is displayed. The Unicode Bidirectional Algorithm is primarily focused on arranging adjacent characters, based on character properties. Base direction works at a higher level, and dictates (a) the visual order and direction in which runs of strongly-typed LTR and RTL characters are displayed, and (b) where there are weakly-typed characters such as punctuation, the placement of those items relative to the other content.
Basic Multilingual Plane (BMP) . The first 65,536 code point positions in the Unicode character set are said to constitute the Basic Multilingual Plane. The BMP includes most of the more commonly used characters.
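A small sketch of what "the first 65,536 code point positions" means in practice. The helper name `in_bmp` is an assumption made for illustration.

```python
BMP_LAST = 0xFFFF  # the BMP spans U+0000 through U+FFFF, i.e. 65,536 positions


def in_bmp(ch: str) -> bool:
    """Return True if the single character ch lies in the Basic Multilingual Plane."""
    return ord(ch) <= BMP_LAST


print(in_bmp("A"))           # a Latin letter
print(in_bmp("\u6587"))      # a Han character
print(in_bmp("\U0001F600"))  # an emoji encoded beyond U+FFFF
```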
Bidi algorithm , see Unicode Bidirectional Algorithm .
Bidirectional text (often referred to as " bidi text " for short). Text that mixes runs of both LTR and RTL text inline. It is common for right-to-left scripts, such as Arabic and Hebrew, to contain short runs of left-to-right text (most commonly in the Latin script), and several of the scripts that are predominantly right-to-left display numbers from left-to-right. Bidirectional text is the source of many of the difficulties when dealing with RTL scripts.
Basic language range . A language range consisting of a sequence of subtags separated by hyphens. That is, it is identical in appearance to a language tag.
Bicameral . Unicode definition: A script that distinguishes between two cases. (See case.) Usually used to refer to scripts that have an upper- and lowercase distinction, such as many alphabetic scripts of European origin (Latin, Greek, Cyrillic).
Bidirectional isolate . A range of text, bounded by formatting characters or markup, that is treated by the Unicode Bidirectional Algorithm [ UAX9 ] as directionally isolated from its surroundings. The entire range of text inside the isolate is treated by the surrounding text as if it were a single neutral character (such as U+FFFC OBJECT REPLACEMENT CHARACTER), and is assigned the corresponding display position in the surrounding text. Furthermore, the text inside the isolate has no effect on the ordering of the text outside it, and vice versa.
Bidi isolation . The use of bidi isolates in text in order to prevent the automatic rules of the Unicode Bidirectional Algorithm incorrectly ordering that content in relation to the surrounding text. For example, numbers following right-to-left text in memory are automatically positioned to the left of RTL text by the Bidi Algorithm, but sometimes need to appear to the right. Another example occurs when a list of RTL items occurs in an LTR sentence (and vice versa): the Bidi Algorithm will automatically assume that the order of items in the list should be "3, 2, 1", but actually what's needed is "1, 2, 3". In HTML, bidi isolation can be applied to a range of text by enclosing it in an element with a dir attribute. In plain text there are Unicode formatting characters that can do the job. These mechanisms remove unwanted spillover effects.
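For plain text, the Unicode formatting characters in question are the isolate controls U+2066 to U+2069. The sketch below wraps a string in one of them; the `isolate` helper and its `direction` parameter are illustrative names, not part of any standard API.

```python
import unicodedata

LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
RLI = "\u2067"  # RIGHT-TO-LEFT ISOLATE
FSI = "\u2068"  # FIRST STRONG ISOLATE (direction taken from the content)
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE (closes any of the above)

OPENERS = {"ltr": LRI, "rtl": RLI, "auto": FSI}


def isolate(text: str, direction: str = "auto") -> str:
    """Wrap text so that it is directionally isolated from its surroundings."""
    return f"{OPENERS[direction]}{text}{PDI}"


# Inserting a value of unknown direction into a larger message.
message = "Items: " + ", ".join(isolate(item) for item in ["one", "two", "three"])
print([unicodedata.name(ch) for ch in isolate("x")[::2]])
```

Using `"auto"` is a reasonable default when the direction of the inserted string is unknown, since the isolate then picks its own direction from its content.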
Block direction . The initial base direction of a block of text, which resolves to either left-to-right or right-to-left . A block refers to a unit of text as a whole, such as a paragraph in a document or a string in a data file. The name "block" is chosen as a contrast to inline direction . Unicode calls this value the paragraph direction .
Block (Unicode) . Unicode definition: A grouping of characters within the Unicode encoding space used for organizing code charts. Each block is a uniquely named, continuous, non-overlapping range of code points, containing a multiple of 16 code points, and starting at a location that is a multiple of 16. A block may contain unassigned code points, which are reserved. Note that a given script might be split across multiple blocks.
Canonical Unicode locale identifier . A well-formed language tag resulting from the application of the Unicode locale identifier canonicalization rules found in [ UAX35 ]. This process converts any valid [ BCP47 ] language tag into a valid Unicode locale identifier . For example, deprecated subtags or irregular grandfathered tags are replaced with their preferred value from the IANA language subtag registry .
Case mapping . The process of transforming characters to a specific case, such as UPPER, lower, or Titlecase. For those scripts that have a case distinction, Unicode defines a default UPPER, lower, and Titlecase character mapping for each Unicode code point. Case mapping, at first, appears simple. However there are variations that need to be considered when treating the full range of Unicode in diverse languages.
Case folding . The process of making two texts which differ only in case identical for comparison purposes, that is, it is meant for the purpose of string matching. This is distinct from case mapping , which is primarily meant for display purposes. As with the default case mappings, Unicode defines default case fold mappings ("case folding") for each Unicode code point. Unicode defines two forms of case folding.
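To illustrate the difference between folding for matching and mapping for display, here is a short sketch using Python's built-in `str.casefold()`, which applies Unicode full case folding. The helper name `caseless_match` is an assumption.

```python
def caseless_match(a: str, b: str) -> bool:
    """Compare two strings after Unicode full case folding."""
    return a.casefold() == b.casefold()


german = "Stra\u00DFe"  # "Straße"

# Lowercasing is a case mapping: it keeps U+00DF and so fails to match.
print(german.lower() == "STRASSE".lower())

# Case folding expands U+00DF to "ss", so the two spellings match.
print(caseless_match(german, "STRASSE"))
```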
Case sensitive matching . A form of string matching in which code points are compared directly, with no case folding .
Character encoding or, more formally, a character encoding form . The way a coded character set is mapped to bytes for manipulation in a computer. Commonly referred to as just the encoding . For examples and further descriptions see Character encodings: Essential concepts .
Character set or repertoire . The set of characters one might use for a particular purpose – be it those required to support Western European languages in computers, or those a Chinese child will learn at school in the third grade (nothing to do with computers).
Circumgraph . A single vowel code point that produces glyphs on more than one side of its consonant base. For example, in the Odia syllable , /ka/, the character ୋ U+0B4B ORIYA VOWEL SIGN O produces separate glyphs on either side of the base consonant.
CJK . An abbreviation for Chinese, Japanese, and Korean. Sometimes CJKV is used to include the Han characters used in Vietnamese.
Coded character set . A set of characters where a unique number has been assigned to each character. Units of a coded character set are known as code points .
CLDR , see Common Locale Data Repository .
Code point . A code point value represents the position of a character in a coded character set. For example, the code point for the letter á in the Unicode coded character set is 225 in decimal, or 0xE1 in hexadecimal notation. Hexadecimal notation is commonly used for referring to code points. See also Unicode code point .
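The example in this entry can be checked directly. The sketch below uses Python's built-in `ord` and `chr`; the `code_point_label` helper is an illustrative name.

```python
def code_point_label(ch: str) -> str:
    """Format a character's code point in the conventional U+XXXX hexadecimal notation."""
    return f"U+{ord(ch):04X}"


a_acute = "\u00E1"  # the letter á as a single code point

print(ord(a_acute))               # decimal value of the code point
print(hex(ord(a_acute)))          # the same value in hexadecimal
print(code_point_label(a_acute))  # conventional notation
print(chr(0xE1) == a_acute)       # going from the number back to the character
```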
Code unit . The units of data used by a character encoding to encode or serialize characters into a programming language or other serialized form (such as a file). Common code units are 8, 16, and 32 bits in size. On the Web we are mostly concerned with bytes, which are technically 8-bit code units. However, in JavaScript a char is a 16-bit code unit (related to the UTF-16 encoding of Unicode).
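As a rough illustration, the same character can occupy a different number of code units depending on the encoding. The sketch below counts 8-, 16-, and 32-bit code units by encoding to UTF-8, UTF-16, and UTF-32; `code_unit_counts` is an illustrative helper name.

```python
def code_unit_counts(text: str) -> dict:
    """Count the code units needed for text in three Unicode encodings."""
    return {
        "utf-8": len(text.encode("utf-8")),           # 8-bit code units (bytes)
        "utf-16": len(text.encode("utf-16-le")) // 2,  # 16-bit code units
        "utf-32": len(text.encode("utf-32-le")) // 4,  # 32-bit code units
    }


print(code_unit_counts("\u00E1"))      # á
print(code_unit_counts("\U0001F600"))  # a character outside the BMP
```

The little-endian variants are used only so that no byte order mark is added to the count.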
Combining character . Unicode characters such as accents, diacritics, Hebrew points, Arabic vowel signs, and Indic matras. They normally never appear alone unless they are being described, but are combined with a preceding base character. More than one combining character may be associated with the same base character. Many combining characters appear above or below or inside the base character; however, some consume space along the baseline, either before or after the base character, and are referred to as spacing marks, or spacing combining characters. Unicode definition: A character with the General Category of Combining Mark (M). (See definition D52 in Section 3.6, Combination.)
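The Unicode definition above can be tested with the standard `unicodedata` module, whose `category()` function returns the General Category of a character. In this sketch the helper names are assumptions; `Mn` denotes a nonspacing mark and `Mc` a spacing mark.

```python
import unicodedata


def is_combining_mark(ch: str) -> bool:
    """True if the character's General Category is one of the Mark (M) categories."""
    return unicodedata.category(ch).startswith("M")


def marks_in(text: str) -> list:
    """Return the combining marks found in text, in order."""
    return [ch for ch in text if is_combining_mark(ch)]


print(marks_in("e\u0301"))             # base letter followed by COMBINING ACUTE ACCENT
print(unicodedata.category("\u0301"))  # nonspacing mark
print(unicodedata.category("\u093E"))  # DEVANAGARI VOWEL SIGN AA, a spacing mark
```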
Combining character sequence . Unicode definition: A maximal sequence of characters following the pattern Base? (Combining_mark | ZWJ | ZWNJ)+ . Usually a base character that is a letter or digit, followed by one or more combining characters, zero width joiners, and/or zero width non-joiners.
Combining mark . See Combining character .
Common Locale Data Repository (or CLDR ). The Common Locale Data Repository ([ CLDR ]) is a Unicode Consortium project that defines, collects, and curates sets of data needed to enable locales in systems or operating environments. CLDR data and its locale model are widely adopted, particularly in browsers.
Compatibility character . Unicode definition: A character that would not have been encoded except for compatibility and round-trip convertibility with other standards. (See Section 2.3, Compatibility Characters.)
Composite message . A single message, dynamically composed from more than one text string. The usual reason for creating composite messages is that one or more parts of the composite message will change according to the context. See Working with Composite Messages .
Composite vowel . A single vowel sound or diphthong that is represented by more than one code point from the available set of vowel marks, repurposed consonants, and diacritics.
Conjunct . A way of indicating consonant clusters, common in Brahmi-derived scripts, by visually merging or changing the glyphs for the sequence in some way. Conjunct behaviour is generally triggered in Unicode encoded text by adding a virama between the consonants.
Consonant cluster . A sequence of consonants with no intervening vowels. See also conjunct .
Consumer . When talking about strings on the Web, the W3C Internationalization group refers to a consumer as any process that receives natural language strings, either for display or processing.
Cursive . In the context of writing systems, this is applied to orthographies where letters are typically joined at the baseline (although some scripts have a few letters that only join on one side). Usually the font needs to support differences in glyph shape for the various joining contexts, which range from slight to radically different. Cursive scripts include Adlam, Arabic, Hanifi Rohingya, Mongolian, N'Ko, and Syriac. Letters in other scripts, such as Devanagari, Bengali, and Gurmukhi, may also join, often at a hanging baseline, but those scripts are not usually referred to as 'cursive'.
Daylight Savings Time (DST) or Summer Time . An approach to setting times of the day that was adopted as a way of allowing people more sunlight hours in the evening. DST varies from country to country (not to mention locality-to-locality) and often has special one-off changes to accommodate special events. Not all regions observe DST: usually those closer to the equator do not need it. In converting times it is important to know when DST was introduced, and sometimes abandoned, for the local area, as well as on what dates DST starts and ends (which can vary from year to year). For example, Korea Standard Time and Japan Standard Time currently use the same zone offset and neither uses daylight saving. However, Japan abandoned DST in 1951, while South Korea used it last in 1988, so an application that tracks time values that reach back that far might need to track these time zones separately.
Decomposed . Decomposed text is usually the result of applying Unicode normalization form D (NFD), which splits Unicode characters into component parts, typically a base character plus one or more diacritics. However, a decomposed sequence of code points may also be intentionally (or unintentionally) used by a content author where a precomposed alternative exists.
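A brief sketch of decomposition using the standard `unicodedata.normalize` function. The `code_points` helper is an illustrative name.

```python
import unicodedata


def code_points(text: str) -> list:
    """List the code points of text in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


precomposed = "\u00E1"  # á as one code point
decomposed = unicodedata.normalize("NFD", precomposed)

print(code_points(precomposed))  # a single precomposed code point
print(code_points(decomposed))   # base letter plus combining acute accent
```

The two strings look identical when rendered but differ in memory, which is why a decomposed sequence typed by an author may fail to match its precomposed alternative unless both are normalized first.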
Dependent vowel . Vowel_dependent is one of the categories in the Indic_Syllabic_Category property set (see a list). The Unicode Standard definition says: A symbol or sign that represents a vowel and that is attached or combined with another symbol, usually one that represents a consonant. Dependent vowels are usually combining marks, but may also be letters (e.g. in Thai, or New Tai Lue, which has no combining characters).
Document character set . The character set used for processing a document, regardless of what character encoding was used to store it. For XML and HTML (from version 4.0), this is always Unicode. This means that browsers convert all text to Unicode internally and the logical model describing how XML or HTML are processed is described in terms of the set of characters defined by Unicode.
European digits . A term used by the Unicode Standard to refer to ASCII digits. Unicode definition: Forms of decimal digits first used in Europe and now used worldwide. Historically, these digits were derived from the Arabic digits; they are sometimes called “Arabic numerals,” but this nomenclature leads to confusion with the real Arabic-Indic digits. Also called "Western digits" and "Latin digits." See Terminology for Digits for additional information on terminology related to digits.
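To make the distinction between European digits and other decimal digit sets concrete, the following sketch converts any Unicode decimal digit to its ASCII counterpart using `unicodedata.decimal`. The function name `to_european_digits` is an assumption for illustration.

```python
import unicodedata


def to_european_digits(text: str) -> str:
    """Replace every Unicode decimal digit in text with the matching ASCII digit."""
    converted = []
    for ch in text:
        value = unicodedata.decimal(ch, None)  # None when ch is not a decimal digit
        converted.append(str(value) if value is not None else ch)
    return "".join(converted)


arabic_indic = "\u0662\u0660\u0662\u0664"  # Arabic-Indic digits two, zero, two, four
print(to_european_digits(arabic_indic))
```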
Extended grapheme cluster . See grapheme cluster .
Extended language range . A language range consisting of a sequence of hyphen-separated subtags. In an extended language range, a subtag can either be a valid subtag or the wildcard subtag *, which matches any value.
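One way to see how the wildcard behaves is to sketch a matcher. The code below follows the extended filtering rules of RFC 4647, which this entry does not itself cite, so treat that choice as an assumption; the function name is illustrative.

```python
def extended_filter_match(language_range: str, tag: str) -> bool:
    """Return True if tag matches language_range under RFC 4647 extended filtering."""
    range_subtags = language_range.lower().split("-")
    tag_subtags = tag.lower().split("-")

    # The first subtags must match, unless the range starts with the wildcard.
    if range_subtags[0] != "*" and range_subtags[0] != tag_subtags[0]:
        return False

    r, t = 1, 1
    while r < len(range_subtags):
        if range_subtags[r] == "*":
            r += 1                       # a wildcard matches any run of subtags
        elif t >= len(tag_subtags):
            return False                 # the tag ran out before the range did
        elif range_subtags[r] == tag_subtags[t]:
            r += 1
            t += 1
        elif len(tag_subtags[t]) == 1:
            return False                 # never skip past a singleton such as "x"
        else:
            t += 1                       # skip an unmatched subtag in the tag
    return True


print(extended_filter_match("de-*-DE", "de-Latn-DE"))
print(extended_filter_match("de-*-DE", "de-x-DE"))
```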
Featural syllabary . A syllabic writing system where the syllable glyphs are not arbitrary shapes but instead show, usually in a regular way, phonological features that are part of the syllable they represent. Examples include Korean (where a syllabic character is made up of strokes representing the individual sounds of the syllable) and Canadian Syllabics (where the vowel part of the syllable is expressed by rotation of the syllable glyph). See also Wikipedia .
Field-based formats . A time format that divides the date and/or time into separate field values (year, month, day, hour, minute, second, etc.), such as 2016-06-11T06:10:16 (6:10 a.m. British Summer Time on 11 June 2016). Contrast this with an alternative way to express the same time, 1465621816590 , which is not field-based and is rather hard to read. Field-based times may or may not be tied to either UTC or the local time zone – or they may be indeterminate. Field-based times are also typically tied to a specific calendar (such as the Gregorian calendar). The formats described by the ISO 8601 standard are field-based.
First-strong detection . An algorithm that looks for the first strongly-directional character in a string (while ignoring embedded runs of isolated text), and then uses that to guess at the appropriate base direction for the string as a whole. Unicode code points are associated with properties relating to text direction: generally, letters in right-to-left scripts such as Arabic and Hebrew have a strong RTL direction, whereas Latin and Han characters have a strong LTR direction. Other characters, such as punctuation, only have a weak intrinsic directionality, and the actual directionality is determined according to the context in which they are found.
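The following is a simplified sketch of first-strong detection, not a full implementation of the Unicode Bidirectional Algorithm. It relies on `unicodedata.bidirectional`, which returns a character's bidi class, and skips text inside isolates by tracking the isolate initiator and terminator classes. The function name and the `default` parameter are assumptions.

```python
import unicodedata

ISOLATE_OPENERS = {"LRI", "RLI", "FSI"}


def first_strong_direction(text: str, default: str = "ltr") -> str:
    """Guess a base direction from the first strongly-directional character."""
    depth = 0  # nesting level of isolates currently open
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in ISOLATE_OPENERS:
            depth += 1
        elif bidi_class == "PDI":
            depth = max(0, depth - 1)
        elif depth == 0:
            if bidi_class == "L":
                return "ltr"
            if bidi_class in ("R", "AL"):
                return "rtl"
    return default  # no strong character found outside isolates


print(first_strong_direction("Hello"))
print(first_strong_direction("123 \u0645\u0631\u062D\u0628\u0627"))  # digits, then Arabic letters
```

Digits and punctuation are weakly directional, so the second call skips over "123 " and takes its answer from the first Arabic letter.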
Floating times . Times that are not fixed to a specific incremental time value or time zone. When you apply time zone information to floating times they produce a range of acceptable incremental time values , because they represent a nominal time which is described in the same way in all time zones around the world. For example, Saturday 11 June 2016 happens to be the date of the British Queen's official 90th birthday. The specific time when 11th June starts or ends in Britain may actually be on Friday or Sunday in other countries, because their clocks are set differently, but the date of the event is always referred to as Saturday 11 June. Other examples of floating time events include the publication date for an issue of a newspaper, the date the Tokyo Olympics starts, the time the New Year starts, office hours set to "9 to 5" regardless of time zone , and so on.
Fullwidth . Unicode definition: Characters of East Asian character sets whose glyph image extends across the entire character display cell. In legacy character sets, fullwidth characters are normally encoded in two or three bytes. The Japanese term for fullwidth characters is zenkaku. See also halfwidth.
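Width categories are exposed as the East_Asian_Width character property, which Python's `unicodedata.east_asian_width` returns (for example `F` for fullwidth, `H` for halfwidth, `W` for wide, `Na` for narrow). This sketch also shows NFKC normalization mapping width variants to their ordinary counterparts; the variable names are illustrative.

```python
import unicodedata

fullwidth_a = "\uFF21"   # FULLWIDTH LATIN CAPITAL LETTER A
halfwidth_ka = "\uFF71"  # HALFWIDTH KATAKANA LETTER A

print(unicodedata.east_asian_width(fullwidth_a))   # fullwidth variant
print(unicodedata.east_asian_width("A"))           # ordinary ASCII letter
print(unicodedata.east_asian_width(halfwidth_ka))  # halfwidth variant

# Compatibility normalization replaces width variants with the regular characters.
print(unicodedata.normalize("NFKC", fullwidth_a) == "A")
print(unicodedata.normalize("NFKC", halfwidth_ka) == "\u30A2")
```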
General category . Unicode definition: Partition of the characters into major classes such as letters, punctuation, and symbols, and further subclasses for each of the major classes. (See Section 4.5, General Category.)
Glyph . The visual representation of a character when rendered by a particular font. In more complex orthographies a glyph may represent only a part of a character, or may represent more than one character. A font is a collection of glyph shapes, and different fonts or font rules can render a given character using a variety of different glyphs. For example, the letter 'a' can be represented using regular (a), bold ( a ), or italic ( a ) glyphs.
Grapheme . A character or a sequence of characters in a visual representation of some text that a typical user would perceive as being a single unit (character). Graphemes are important for a number of text operations such as sorting or text selection, so it is necessary to be able to compute the boundaries between each user-perceived character. For more information about graphemes and grapheme clusters, with examples, see Character encodings: Essential concepts.
Grapheme cluster . A grapheme cluster is defined by the Unicode Standard as the default mechanism for computing an approximation to graphemes (see Unicode Standard Annex #29: Text Segmentation [ UAX29 ]). Two types of default grapheme cluster are defined. Unless otherwise noted, grapheme cluster in this document refers to an extended default grapheme cluster. (A discussion of grapheme clusters is also given in Section 2 of the Unicode Standard, [ Unicode ]. Cf. near the end of Section 2.11 in version 14.0 of The Unicode Standard.) Because different natural languages have different needs, grapheme clusters can also sometimes require tailoring. For example, a Slovak user might wish to treat the default pair of grapheme clusters "ch" as a single grapheme cluster. Note that the interaction between the language of string content and the end-user's preferences might be complex.
Gregorian calendar . The most widely used way of representing civil time. It is a solar calendar, with years usually consisting of 365 days, plus the concept of a "leap year". This adds an additional day every 4 years, except when the year is evenly divisible by 100 (unless the year is also evenly divisible by 400). There are numerous other calendars in use around the world, some of which are lunar calendars, some that are based on a different start date than the Gregorian calendar, and some that are reset each time a prominent person dies. Often these calendars are still used for religious purposes, but sometimes you will also find them being used in newspapers and emails, or for birth dates. There are technologies, such as ICU or Dojo, that support conversion between different calendaring systems.
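The leap-year rule described above translates into a one-line test. This is a sketch of the rule as stated in this entry; the standard library's `calendar.isleap` is used only as a cross-check.

```python
import calendar


def is_leap_year(year: int) -> bool:
    """Every 4th year is a leap year, except century years not divisible by 400."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


for year in (1900, 2000, 2016, 2023):
    print(year, is_leap_year(year), is_leap_year(year) == calendar.isleap(year))
```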
Halfwidth . Unicode definition: Characters of East Asian character sets whose glyph image occupies half of the character display cell. In legacy character sets, halfwidth characters are normally encoded in a single byte. The Japanese term for halfwidth characters is hankaku. See also fullwidth.
Ideograph . Unicode definition: (1) Any symbol that primarily denotes an idea or concept in contrast to a sound or pronunciation – for example, ♻, which denotes the concept of recycling by a series of bent arrows. (2) A generic term for the unit of writing of a logosyllabic writing system. In this sense, ideograph (or ideogram) is not systematically distinguished from logograph (or logogram). (3) A term commonly used to refer specifically to Han characters, equivalent to the Chinese, Japanese, or Korean terms also sometimes used: hànzì, kanji, or hanja.
Ijam . A diacritic in the Arabic script that is considered to be an integral part of a basic letter form, such as the dots in ث U+062B ARABIC LETTER THEH , pronounced θ . Unicode encodes letter+ijam combinations as atomic characters which are never given equivalent decompositions in the standard . Ijam generally take the form of one-, two-, three- or four-dot markings above or below the basic letter skeleton, although other diacritic forms occur, especially in extensions of the Arabic script in Central and South Asia and in Africa. For example, ۈ U+06C8 ARABIC LETTER YU shows a letter with ijam that represents the vowel y in the Uighur orthography. See Chapter 9 of the Unicode Standard. See also: tashkil .
Incremental time . A way of representing time in computers that is based on a progression of fixed integer units that increase monotonically from a specific point in time (called the "epoch"). Java (and many other systems) count time as the number of milliseconds since midnight (00:00 a.m.) on January 1, 1970 in UTC (less all of the intervening leap seconds). Other systems use different units and/or epochs. For example, the incremental time for 11 June, 2016 at 6.10am BST in JavaScript is 1465621816590 . Most programming languages and operating environments provide or use incremental time for working with time values. However, incremental time is not usually seen directly by users, but is typically mapped to a field-based time format for interchange or for human consumption.
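The example value in this entry can be converted to a field-based form with the standard `datetime` module. In this sketch British Summer Time is modelled as a fixed UTC+1 offset, which is a simplifying assumption, and the helper names are illustrative.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Assumption: BST is represented as a fixed UTC+1 offset rather than a full time zone.
BST = timezone(timedelta(hours=1), "BST")


def from_incremental(millis: int, tz: timezone = timezone.utc) -> datetime:
    """Convert milliseconds since the epoch into a field-based time in tz."""
    return (EPOCH + timedelta(milliseconds=millis)).astimezone(tz)


def to_incremental(moment: datetime) -> int:
    """Convert a timezone-aware datetime back to milliseconds since the epoch."""
    return (moment - EPOCH) // timedelta(milliseconds=1)


stamp = 1465621816590
print(from_incremental(stamp).isoformat())       # 2016-06-11T05:10:16.590000+00:00
print(from_incremental(stamp, BST).isoformat())  # 2016-06-11T06:10:16.590000+01:00
```

Integer arithmetic on `timedelta` avoids the rounding that can occur when dividing the millisecond count into a floating-point number of seconds.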
Independent vowel . Independent letters used to represent standalone vowel sounds. They are typically found in Brahmi-derived Indic scripts, at the beginning of a word or after a word-internal vowel.
Inherent vowel . A vowel sound that is automatically pronounced after a consonant letter, unless suppressed by either indicating another vowel, or using a character specifically designed to kill the vowel sound, or contextual rules. Inherent vowels are commonly found in Brahmi-derived Indic scripts. The sound of the inherent vowel varies by language.
Internationalization . The design and development of a product that is enabled for target audiences that vary in culture, region, or language. Internationalization is sometimes abbreviated i18n because there are eighteen letters between the "I" and the "N" in the English word.
International preferences . A user's particular set of language and formatting preferences and associated cultural conventions. Software can use these preferences to correctly process or present information exchanged with that user.
IANA Language Subtag Registry . A machine-readable text file available via IANA which contains a comprehensive list of all of the subtags valid in language tags. (Link: Registry )
Jamo . The basic unit used to form Hangul syllables, representing vowels and consonants in Korean. In addition to code points for Jamo, Unicode encodes 11,172 Hangul syllables . These represent combinations of Jamo as single, pre-composed characters. In practice, such syllables are the main characters in actual use. (See also: Unicode Korean FAQ )
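The figure of 11,172 syllables follows from the arithmetic used to arrange them: 19 leading consonants, 21 vowels, and 28 trailing positions (27 consonants plus "none"). The sketch below decomposes a precomposed syllable into Jamo using that arithmetic, with the constants taken from the Unicode Standard's Hangul syllable composition scheme; the function name is an assumption, and NFD normalization serves as a cross-check.

```python
import unicodedata

S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
S_COUNT = L_COUNT * V_COUNT * T_COUNT  # number of precomposed syllables


def decompose_hangul(syllable: str) -> str:
    """Split one precomposed Hangul syllable into its conjoining Jamo."""
    index = ord(syllable) - S_BASE
    if not 0 <= index < S_COUNT:
        raise ValueError("not a precomposed Hangul syllable")
    leading, remainder = divmod(index, V_COUNT * T_COUNT)
    vowel, trailing = divmod(remainder, T_COUNT)
    jamo = chr(L_BASE + leading) + chr(V_BASE + vowel)
    if trailing:  # index 0 means there is no trailing consonant
        jamo += chr(T_BASE + trailing)
    return jamo


han = "\uD55C"  # the syllable 한
print(S_COUNT)
print([f"U+{ord(j):04X}" for j in decompose_hangul(han)])
print(decompose_hangul(han) == unicodedata.normalize("NFD", han))
```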
Kana . Unicode definition: A collective term for the two syllabic scripts used (along with kanji and romaji) by the Japanese writing system. The two forms are hiragana and katakana.
Language metadata . When contrasted with the text-processing language , this indicates the intended use of the resource as a whole. For example, such metadata may be used for searching for a relevant resource, for serving the right language version, for classification, etc. This type of language declaration differs from that of the text-processing language declaration in that (a) the value for such declarations may be more than one language subtag, and (b) the language value declared doesn't indicate which bits of a multilingual resource are in which language.
Language tag extension . A system of additional [ BCP47 ] subtags introduced by a single letter or digit subtag registered with IANA and permitting additional types of language identification.
Language negotiation . Any process which selects or filters content based on language. Usually this implies selecting content in a single language (or falling back to some meaningful default language that is available) by finding the best matching values when several languages or locales [ LTLI ] are present in the content. Some common language negotiation algorithms include the Lookup algorithm in [ BCP47 ] or the BestFitMatcher in [ ECMA-402 ].
Language priority list . A collection of one or more language ranges identifying the user's language preferences for use in matching. As the name suggests, such lists are normally ordered or weighted according to the user's preferences. The HTTP [ RFC2616 ] Accept-Language [ RFC3282 ] header is an example of one kind of language priority list.
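As an illustration of a weighted list, here is a simplified parser for an Accept-Language header value. It is a sketch rather than a complete implementation of the HTTP grammar: unparsable weights are treated as 0, and ranges with equal weight keep their original order. The function name is an assumption.

```python
def parse_accept_language(header: str) -> list:
    """Return (language range, weight) pairs ordered from most to least preferred."""
    entries = []
    for position, item in enumerate(header.split(",")):
        parts = [part.strip() for part in item.split(";")]
        language_range = parts[0]
        if not language_range:
            continue
        weight = 1.0  # a range without an explicit q value has full weight
        for parameter in parts[1:]:
            if parameter.startswith("q="):
                try:
                    weight = float(parameter[2:])
                except ValueError:
                    weight = 0.0
        entries.append((language_range, weight, position))
    entries.sort(key=lambda entry: (-entry[1], entry[2]))
    return [(language_range, weight) for language_range, weight, _ in entries]


print(parse_accept_language("da, en-gb;q=0.8, en;q=0.7"))
```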
Language range . A string similar in structure to a language tag that is used for "identifying sets of language tags that share specific attributes".
Language subtag . A sequence of ASCII letters or digits separated from other subtags by the hyphen-minus character and identifying a specific element of meaning within the overall language tag . In [ BCP47 ], subtags can consist of upper or lowercase ASCII letters (the case carries no distinction) or ASCII digits. Subtags are limited to no more than eight characters (although additional length restrictions apply depending on the specific use of the subtag).
Language tag . A string used as an identifier for a language, usually referring explicitly to a [ BCP47 ] language tag. These language tags consist of one or more language subtags .
Legacy character encodings . Character encoding forms that do not encode the full repertoire of characters in the Unicode character set.
Locale . An identifier (such as a language tag ) for a set of international preferences . Usually this identifier indicates the preferred language of the user and possibly includes other information, such as a geographic region (such as a country). A locale is passed in APIs or set in the operating environment to obtain culturally-affected behavior within a system or process.
Locale-aware (or Enabled ). A system that can respond to changes in the locale with culturally and language-specific behavior or content. Generally, systems that are internationalized can support a wide range of locales in order to meet the international preferences of many kinds of users.
Locale fallback . The process of searching for translated content, locale data, or other resources by "falling back" from more-specific resources to more-general ones following a deterministic pattern.
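One common deterministic pattern is to drop subtags from the end of a language tag until a resource is found. The sketch below assumes that simple truncation pattern; real systems may apply additional rules, and the function names and the sample `resources` dictionary are hypothetical.

```python
from typing import Dict, List, Optional


def fallback_chain(tag: str) -> List[str]:
    """List a tag followed by progressively more general tags."""
    subtags = tag.split("-")
    return ["-".join(subtags[:length]) for length in range(len(subtags), 0, -1)]


def find_resource(tag: str, resources: Dict[str, str], default: Optional[str] = None) -> Optional[str]:
    """Return the resource for the most specific tag in the chain that is available."""
    for candidate in fallback_chain(tag):
        if candidate in resources:
            return resources[candidate]
    return default


# Hypothetical translated strings keyed by language tag.
resources = {"en": "Colour", "en-US": "Color", "fr": "Couleur"}

print(fallback_chain("en-Latn-GB"))
print(find_resource("en-Latn-GB", resources))
print(find_resource("de-CH", resources, default="Colour"))
```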
Locale-neutral . A non-linguistic field is said to be locale-neutral when it is stored or exchanged in a format that is not specifically appropriate for any given language, locale, or culture and which can be interpreted unambiguously for presentation in a locale-aware way.
Localizable content . Content that can be adapted to meet the needs of a particular language, culture, or region. It includes both localizable text and non-text content such as icons.
Localizable text . String content intended as human-readable text, as distinct from any of the surrounding or embedded syntactic content that forms part of the document structure. Note that syntactic content can have localizable text embedded in it, such as when an [ HTML ] img element has an alt attribute containing a description of the image. [ CHARMOD-NORM ] gives some additional examples. See also localizable content, syntactic content, and natural language.
Localization . The tailoring of a system to the individual cultural expectations of a specific target market or group of individuals. Localization includes, but is not limited to, the translation of user-facing text and messages. Localization is sometimes abbreviated as l10n because there are ten letters between the "L" and the "N" in the English word. When a particular set of content and preferences corresponding to a specific set of international preferences is operationally available, then the system is said to be localized.
Logical order . Some scripts, in particular Arabic and Hebrew, are written from right to left. Text including characters from these scripts can run in both directions and is therefore called bidirectional text . The Unicode Standard [ Unicode ] requires that characters be stored and interchanged in logical order, i.e. roughly corresponding to the order in which text is typed in via the keyboard or spoken (for a more detailed definition see [ Unicode ], Section 2.2). Logical ordering is important to ensure interoperability of data, and also benefits accessibility, searching, and collation.
LTR . Stands for "left-to-right" and refers to the inline base direction of left-to-right [ UAX9 ]. This is the base text direction used by languages whose starting character progression begins on the left side of the page in horizontal text. It's used for scripts such as Latin, Cyrillic, Devanagari, and many others. See also RTL and auto (direction) .
Metadata . Additional information about data. Key types of metadata for internationalization are language metadata and metadata to support bidirectional text . Metadata has a scope, e.g., a string or a set of strings. In the absence of explicit metadata, defaults might apply, e.g., defaults for the base direction of a text.
Mojibake ( 文字化け ). Garbled or incorrectly rendered or processed characters, generally caused by using the wrong character encoding to interpret the bytes in a string or file. The word is Japanese in origin and is pronounced /mo.d͡ʒi.ba.ke/ . For example, the word 文字化け encoded as UTF-8 might be displayed as æ–‡å—化ã‘ if viewed in an application that thinks (incorrectly) that the character encoding is ISO-8859-1 .
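The effect described above can be reproduced with a short Python sketch. It is only an illustration: the variable names are invented for this example, and the garbled string it produces may not match the sample above character for character, since many applications treat the ISO-8859-1 label as windows-1252.

```python
# The original word and its UTF-8 byte representation.
original = "文字化け"
utf8_bytes = original.encode("utf-8")

# Misreading the UTF-8 bytes as ISO-8859-1 maps every byte to one character,
# so 4 characters (12 bytes) turn into 12 garbled characters.
garbled = utf8_bytes.decode("iso-8859-1")

# ISO-8859-1 maps bytes to characters one-to-one, so this particular damage
# can be undone: re-encode, then decode with the correct encoding.
repaired = garbled.encode("iso-8859-1").decode("utf-8")
```

Not all mojibake is reversible in this way; if the wrong decoder replaces or drops bytes it cannot interpret, the original text is lost.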
Natural Language (sometimes just language ). Refers to the spoken, written, or signed communications used by human beings. See also localizable text and syntactic content .
Non-linguistic Field . Any element of a data structure not intended for the storage or interchange of natural language textual data. This includes non-string data types, such as booleans, numbers, dates, and so forth. It also includes strings, such as program or protocol internal identifiers. This document uses the term field as a shorthand for this concept.
Normalization . The process of removing alternate representations of equivalent sequences from textual data, to convert the data into a form that can be binary-compared for equivalence. In internationalization contexts this usually refers to applying one of the Unicode normalization forms, such as NFC , to a string. For more info, see [ CHARMOD-NORM ] or [ UAX15 ].
Normalization Form C ( NFC ). Unicode definition : A normalization form that erases any canonical differences, and generally produces a composed result. For example, a + umlaut is converted to ä in this form. This form most closely matches legacy usage. The formal definition is D120 in Section 3.11, Normalization Forms . See also normalization .
Normalization Form D ( NFD ). Unicode definition : A normalization form that erases any canonical differences, and produces a decomposed result. For example, ä is converted to a + umlaut in this form. This form is most often used in internal processing, such as in collation. The formal definition is D118 in Section 3.11, Normalization Forms . See also normalization .
Normalization Form KC ( NFKC ). Unicode definition : A normalization form that erases both canonical and compatibility differences, and generally produces a composed result: for example, the single dž character is converted to d + ž in this form. This form is commonly used in matching. The formal definition is D121 in Section 3.11, Normalization Forms . Note that compatibility decomposition removes meaning from the text that it is applied to. Some developers and specification authors find this normalization form attractive because it appears to bring together many strings that are logically similar, but NFKC has limited utility in actual practice and has side effects that confuse users. This definition is provided for completeness, but NFKC is not generally appropriate for use on the Web. See also normalization .
Normalization Form KD ( NFKD ). Unicode definition : A normalization form that erases both canonical and compatibility differences, and produces a decomposed result: for example, the single dž character is converted to d + z + caron in this form. The formal definition is D119 in Section 3.11, Normalization Forms . Some developers and specification authors find this normalization form attractive because it appears to bring together many strings that are logically similar, but NFKD has limited utility in actual practice and has side effects that confuse users. This definition is provided for completeness, but NFKD is not generally appropriate for use on the Web. See also normalization .
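As an illustration of the four normalization forms, the following sketch uses the unicodedata module from the Python standard library to reproduce the examples given in the definitions above. The variable names are chosen for this example only.

```python
import unicodedata

precomposed = "\u00E4"   # ä LATIN SMALL LETTER A WITH DIAERESIS
decomposed = "a\u0308"   # a + COMBINING DIAERESIS
dz_digraph = "\u01C6"    # dž LATIN SMALL LETTER DZ WITH CARON

# Canonical forms: NFC composes, NFD decomposes.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", precomposed)

# Compatibility forms also break the digraph apart.
nfkc = unicodedata.normalize("NFKC", dz_digraph)   # d + ž
nfkd = unicodedata.normalize("NFKD", dz_digraph)   # d + z + combining caron

# The canonical forms leave the digraph untouched, because its
# decomposition is a compatibility mapping, not a canonical one.
nfc_digraph = unicodedata.normalize("NFC", dz_digraph)
```

The last line shows why the compatibility forms change more than the canonical ones: the dž digraph survives NFC unchanged but not NFKC.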
Orthographic syllable . A typographic character unit that includes one or more grapheme clusters . These are most commonly found in Brahmi-derived scripts (such as Devanagari or Balinese) when forming conjuncts or stacks. They commonly demarcate sequences of characters that are different from phonetic syllables, and they may even span word boundaries. See also the definition in the Unicode Standard.
Paragraph direction . The initial base direction of a paragraph or string, which resolves to either left-to-right or right-to-left . Nested embedding controls may be used to change the direction of an inline range of text, but the paragraph direction sets the starting point which the Unicode Bidirectional Algorithm uses to calculate the directions of the embedded levels. For more details, see Unicode Standard Annex #9, Unicode Bidirectional Algorithm [ UAX9 ], especially definitions BD2–BD5. See also the definition in the Unicode Standard .
Percent-encoding . Percent-encoding is the escaping mechanism defined by URI [ RFC3986 ] for encoding arbitrary byte values that are not otherwise permitted in a URI. For example, if a user wishes to include the character / U+002F SOLIDUS in a URI, the byte 0x2F is encoded as the character sequence %2F . If the user wishes to include the character é U+00E9 LATIN SMALL LETTER E WITH ACUTE in a URI, the UTF-8 byte sequence for this character ( 0xC3 0xA9 ) could be encoded as the sequence %C3%A9 .
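The two examples above can be checked with the quote and unquote functions from Python's urllib.parse module. This is a sketch for illustration; passing safe="" is a choice made here so that the solidus is escaped too, since quote leaves it unescaped by default.

```python
from urllib.parse import quote, unquote

# Escape U+002F SOLIDUS; safe="" means no character is exempt from escaping.
solidus = quote("/", safe="")

# Escape U+00E9; quote encodes non-ASCII characters as UTF-8 by default.
e_acute = quote("\u00E9", safe="")

# Decoding reverses the process and interprets the bytes as UTF-8.
round_trip = unquote("%C3%A9")
```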
Plane . Unicode definition : A range of 65,536 (10000₁₆) contiguous Unicode code points, where the first code point is an integer multiple of 65,536 (10000₁₆). Planes are numbered from 0 to 16, with the number being the first code point of the plane divided by 65,536. Thus Plane 0 is U+0000..U+FFFF, Plane 1 is U+10000..U+1FFFF, ..., and Plane 16 (10₁₆) is U+100000..U+10FFFF. (Note that ISO/IEC 10646 uses hexadecimal notation for the plane numbers: for example, Plane B instead of Plane 11). See also Basic Multilingual Plane .
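Because each plane holds 65,536 code points, the plane number of a code point is simply the integer quotient of the code point divided by 65,536. A minimal sketch follows; the function name plane_of is invented for this example.

```python
def plane_of(code_point: int) -> int:
    """Return the plane number (0 to 16) that contains the code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    # Each plane spans 0x10000 (65,536) code points.
    return code_point // 0x10000
```

For example, plane_of(0x20AC) is 0 (the Basic Multilingual Plane) and plane_of(0x1F63A) is 1.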
Pre-base vowel . A pre-base (or prescript) vowel glyph is displayed before the consonant or orthographic syllable after which it is pronounced. If the vowel character is a combining mark, it is still typed and stored in pronunciation order, and the application will render it in the correct location. In some scripts, such as Thai, a pre-base vowel glyph is represented by a normal letter, which is typed and stored in the correct position relative to the base.
Precomposed . A precomposed character is one that can also be broken down into separate code points representing its component parts (decomposition). Typically this will include base characters with diacritics, such as accented Latin characters, or Indic characters with nuktas. Normalization Form C (NFC) produces precomposed characters from many decomposed sequences.
Producer . When talking about strings on the Web, the W3C Internationalization group refers to a producer as any process where natural language string data is created for later storage, processing, or interchange.
Resource . In the context of W3C Internationalization documents, a given document, file, or protocol "message" which includes both the localizable content as well as the syntactic content such as identifiers surrounding or containing it. For example, in an HTML document that also has some CSS and a few script elements with embedded JavaScript, the entire HTML document, considered as a file, is a resource. This term is intentionally similar to the term 'resource' as used in [ RFC3986 ], although here the term is applied loosely.
Resource identifier . A compact string of characters for identifying an abstract or physical resource . On the Web, this mostly means various types of Uniform Resource Identifiers (or URIs ). For wire formats, [ RFC3986 ] defines the structure and serialization. The Internationalized Resource Identifiers (or IRIs ) specification [ RFC3987 ] describes how non-ASCII Unicode characters can be used in resource identifiers. The WHATWG [ URL ] specification describes how browsers handle IRIs and their mapping to URIs.
RTL . Stands for "right-to-left" and refers to the inline base direction of right-to-left [ UAX9 ]. This is the base text direction used by languages whose starting character progression begins on the right side of the page in horizontal text. It's used for a variety of scripts which include Arabic, Hebrew, N'Ko, Adlam, Thaana, and Syriac among others. See also LTR and auto (direction) .
Ruby . A name for small (usually phonetic) annotations that are rendered alongside text. 'Ruby' is a British and Japanese printing term (often also called furigana in Japan). Similar annotations are also used for Chinese, and sometimes Mongolian, and Korean.
Scalar value , see Unicode scalar value .
Script . Unicode definition : A collection of letters and other written signs used to represent textual information in one or more writing systems. For example, Russian is written with a subset of the Cyrillic script; Ukrainian is written with a different subset. The Japanese writing system uses several scripts. See also writing system .
Serialization agreement . When talking about strings on the Web, the W3C Internationalization group refers to serialization agreements as the common understanding between a producer and consumer about the serialization of string metadata: how it is to be understood, serialized, read, transmitted, removed, etc.
Shaping . Making context-sensitive changes to glyph shapes. Shaping may or may not occur at the same time as context-sensitive positioning of glyphs (such as higher diacritics over tall base characters).
Spacing mark . Combining characters that consume space along the baseline, either before or after the base character.
Spillover . Errors in text presentation due to a lack of bidi isolation . When strings appear next to each other or when values are inserted into text without isolation, the bidi algorithm can visually rearrange the text in ways that make the text illegible. See the article Inline markup and bidirectional text in HTML for more info. See also bidi isolation .
Standalone vowel . Vowel sounds that are not immediately preceded by a consonant sound; they typically appear at the beginning or in the middle of a word. In Brahmi-derived scripts standalone vowels are often represented using independent vowel letters.
String direction . The overall direction of a specific string, which indicates the presentation order of string-internal directional runs. Strings transmitted inside various data structures are often inserted into a block (such as a paragraph). In such a case, the string direction is needed as part of the bidi isolation of the string.
Supplementary character . Beyond the Basic Multilingual Plane the Unicode character set also contains space for around a million additional code point positions. Characters in this latter range are referred to as supplementary characters. In the UTF-16 encoding, each supplementary character is encoded using a pair of surrogates .
Surrogate code point . Unicode definition : A Unicode code point in the range U+D800..U+DFFF . Reserved for use by UTF-16, where a pair of surrogate code units (a high surrogate followed by a low surrogate) “stand in” for a supplementary code point . This term is also defined by [ INFRA ].
Surrogate pair . In the UTF-16 character encoding of Unicode, a sequence of two surrogate code points , one from the range U+D800..U+DBFF (a high surrogate ) followed by one from the range U+DC00..U+DFFF (a low surrogate ). Each surrogate pair encodes a single supplementary code point .
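The mapping between a supplementary code point and its surrogate pair is simple arithmetic: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves. The sketch below illustrates this; the function names are invented for this example.

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a supplementary code point into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a (high, low) surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

For U+1F63A this gives the pair 0xD83D, 0xDE3A, which matches the bytes Python produces when encoding the character as UTF-16.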
Syllabary . A type of writing system in which each symbol typically represents both a consonant and a vowel, or in some instances more than one consonant and a vowel. Usually there is also a set of symbols that represent standalone vowel sounds. Alternatives include abugida , alphabet and abjad .
Syllable . A model of syllable structure divides the syllable into an onset followed by a rhyme . The rhyme is typically composed of a nucleus and an optional coda . The nucleus is the most sonorous part of the syllable. A syllable always has a nucleus, but syllables may have no onset and/or coda (e.g., compare 'but', 'an', 'the', 'a').
Syntactic content . Any text in a document format or protocol that belongs to the structure of the format or protocol. This definition includes values that are typically thought of as "markup" but can also include other values, such as the name of a field in an HTTP header. Syntactic content consists of all of the characters that form the structure of a format or protocol. For example, < and > (as well as the element name and various attributes they surround) are part of the syntactic content in an HTML document. Syntactic content usually is defined by a specification or specifications and includes both the defined, reserved keywords for the given protocol or format as well as string tokens and identifiers that are defined by document authors to form the structure of the document (rather than the "content" of the document). See also localizable text and natural language .
Tashkil . An Arabic script mark that indicates vocalization of text or other types of phonetic guide which indicate pronunciation, such as in ثَ U+062B ARABIC LETTER THEH + U+064E ARABIC FATHA , pronounced θa . These include several subtypes: harakat (short vowel marks), tanwin (postnasalized, that is, an extra n sound at the end of a noun marked by a double similar harakat), shaddah (consonant gemination mark), and sukun (to mark lack of a following vowel). A basic Arabic letter plus any of these types of marks is never encoded as an atomic, precomposed character, but must always be represented as a sequence of letter plus separate combining mark. For example, هٰ U+0647 ARABIC LETTER HEH + U+0670 ARABIC LETTER SUPERSCRIPT ALEF pronounced ha , is an example of a letter plus tashkil combination in Arabic (cf. the use of that diacritic in a precomposed Uighur letter). See Chapter 9 of the Unicode Standard. See also ijam .
Text-processing language . The language in which a specific range of text is actually written. This needs to be declared so that user agents or applications that manipulate the text, such as voice browsers, spell checkers, style processors, hyphenators, etc., can apply the appropriate rules to the text in question. So we are, by necessity, talking about associating a single language with a specific range of text. Contrast this with language metadata .
Time zone . A set of rules for determining the local time (wall time) as it relates to incremental time (as used in most computing systems) for a particular geographical region, and vice versa. Time zone rules have to take into account zone offsets plus any daylight savings modifications to wall time that apply.
Time zone identifiers . Identifiers that allow you to refer to a particular difference from UTC that includes both zone offsets and daylight savings time . The most definitive reference for identifying sets of time zone rules is the TZ database (also known as the Olson time zone database), which is used by systems such as various commercial UNIX operating systems, Linux, Java, CLDR, ICU, and many other systems and libraries. Other systems exist: for example, Microsoft Windows uses its own data set and identifiers. In the TZ database, time zones are given IDs that typically consist of a region and exemplar city. An exemplar city is a city in the time zone in question that should be well-known to people using the time zone. For example, the U.S. Pacific time zone has a TZ database ID of America/Los_Angeles . The TZ database also supplies aliases for many IDs; for example, Asia/Ulan_Bator is equivalent to Asia/Ulaanbaatar . The Common Locale Data Repository (CLDR) can be used to provide a localized form for the IDs. Note that some systems, such as Apple's Mac OS, provide additional exemplar cities.
Titlecase . Unicode definition: Uppercased initial letter followed by lowercase letters in words. A casing convention often used in titles, headers, and entries, as exemplified in this glossary. Note that titlecasing rules are language-sensitive. For more information, see Case Mapping and Case Folding in Character Model for the World Wide Web: String Matching .
Transcoder . A process that converts text between two character encodings. Most commonly in W3C internationalization documents it refers to a process that converts from a legacy character encoding to a Unicode encoding form , such as UTF-8 .
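A transcoder can be sketched in a few lines of Python by decoding bytes from the legacy encoding and re-encoding the resulting text. The function name transcode and the choice of Shift_JIS as the legacy encoding are assumptions made for this illustration.

```python
def transcode(data: bytes, source_encoding: str,
              target_encoding: str = "utf-8") -> bytes:
    """Convert bytes from one character encoding to another."""
    # Decoding fails loudly if the bytes are not valid in the source
    # encoding, which is safer than silently producing mojibake.
    text = data.decode(source_encoding)
    return text.encode(target_encoding)


# Sample input: a Japanese word stored in the legacy Shift_JIS encoding.
legacy_bytes = "文字化け".encode("shift_jis")
utf8_bytes = transcode(legacy_bytes, "shift_jis")
```

The transcoder has to be told the correct source encoding; guessing wrongly is a common cause of mojibake.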
Transcription . A transcription is likely to be more phonetically accurate than a transliteration (though usually still only reflects an approximation to the actual sound), but, in particular, it does not usually allow completely reversible conversions.
Transliteration . In a transliteration each native character is associated with an equivalent and unique Latin-script character. The transliteration may not accurately represent pronunciation, but does allow straightforward and reversible conversion between the two scripts. Compare with transcription .
Typographic character unit . CSS definition : the basic unit of text. Even within the realm of text layout, the relevant character unit depends on the operation. For example, line-breaking and letter-spacing will segment a sequence of Thai characters that include ำ U+0E33 THAI CHARACTER SARA AM differently; or the behavior of a conjunct consonant in a script such as Devanagari may depend on the font in use. So the typographic character represents a unit of the writing system, such as a Latin alphabetic letter (including its diacritics), Hangul syllable, Chinese ideographic character, or Myanmar syllable cluster, that is indivisible with respect to a particular typographic operation (line-breaking, first-letter effects, tracking, justification, vertical arrangement, etc.).
Typographic letter unit . CSS definition : a typographic character unit belonging to one of the Letter or Number general categories .
Unicameral or unicase . Unicode definition : A script that has no case distinctions.
Unicode Bidirectional Algorithm or Bidi algorithm . The name for the rules described in the Unicode Standard Annex #9, “Unicode Bidirectional Algorithm” [ UAX9 ]. Those rules describe how inline bidirectional text should be rendered for scripts such as Arabic, Hebrew, Thaana, N'Ko, Adlam, etc. The effects of the bidi algorithm depend on the base direction and the directional properties of the characters to which it is applied.
Unicode code point . The numeric value assigned to each Unicode character. Unicode code points range from 0 to 0x10FFFF . (See Section 4.1 in [ CHARMOD ] for a deeper discussion of character encoding terminology.) Unicode code points are denoted as U+hhhh , where hhhh is a sequence of at least four, and at most six hexadecimal digits. For example, the character € U+20AC EURO SIGN has the code point U+20AC , while the character 😺 U+1F63A SMILING CAT FACE WITH OPEN MOUTH has the code point U+1F63A .
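The U+hhhh notation can be produced from a character with a one-line format string, as in this Python sketch. The function name u_plus is invented for this example.

```python
def u_plus(character: str) -> str:
    """Format a single character's code point in U+hhhh notation."""
    # :04X pads to at least four uppercase hexadecimal digits; larger
    # code points simply use five or six digits.
    return f"U+{ord(character):04X}"
```

For example, u_plus("€") gives U+20AC and u_plus("😺") gives U+1F63A.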
Unicode Locale Identifier or Unicode Locale . A language tag that follows the additional rules and restrictions on subtag choice defined in UTS #35 [ UAX35 ]. Any valid Unicode locale identifier is also a valid [ BCP47 ] language tag , but a few valid language tags are not also valid Unicode locale identifiers.
Unicode scalar value . Unicode definition : Any Unicode code point except high-surrogate and low-surrogate code points. In other words, the ranges of integers 0 to D7FF₁₆ and E000₁₆ to 10FFFF₁₆ inclusive. (See definition D76 in Section 3.9, Unicode Encoding Forms .)
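The two ranges in the definition above translate directly into a range check. The following Python sketch uses an invented function name, is_scalar_value, for illustration.

```python
def is_scalar_value(code_point: int) -> bool:
    """True for any code point that is not a surrogate code point."""
    return 0 <= code_point <= 0xD7FF or 0xE000 <= code_point <= 0x10FFFF
```

Only scalar values can be encoded in well-formed UTF-8 or UTF-16, which is why a lone surrogate code point cannot be serialized in either form.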
Universal Character Set or Unicode . The character set or repertoire defined by the [ Unicode ] Standard and which includes all of the characters used to encode text, including historical or extinct writing systems as well as modern usage, private use, typesetting symbols, and other things, such as the emoji. Other character sets are subsets of Unicode.
Universal Coordinated Time (UTC) . The basis for modern timekeeping. Among other things, it provides a common baseline for converting between incremental time and wall time . UTC is also known as GMT (Greenwich Mean Time). There are some subtle differences between the two, but none that the average person would notice. The time zone offset for UTC is 0. UTC is often indicated in field-based formats using Z .
User-facing identifiers . Identifiers defined by or assigned by a user in a vocabulary that is intended to be at least potentially visible to end-users (and thus is localizable text ).
User-perceived character . See grapheme .
User-supplied value . Unreserved syntactic content in a vocabulary that is assigned by users, as distinct from reserved keywords in a given format or protocol. Users generally expect that their user-supplied values can be words or phrases in their preferred natural language . This is why [ CHARMOD ] recommends that "Specifications SHOULD NOT arbitrarily exclude code points from the full range of Unicode code points from U+0000 to U+10FFFF inclusive." [ CHARMOD-NORM ] gives some examples .
UTF-8 . Unicode definition : A multibyte encoding for text that represents each Unicode character with 1 to 4 bytes, and which is backward-compatible with ASCII. UTF-8 is the predominant form of Unicode in web pages. More technically: (1) The UTF-8 encoding form . (2) The UTF-8 encoding scheme . (3) “UCS Transformation Format 8,” defined in Annex D of ISO/IEC 10646:2003, technically equivalent to the definitions in the Unicode Standard.
UTF-16 . Unicode definition : A multibyte encoding for text that represents each Unicode character with 2 or 4 bytes; it is not backward-compatible with ASCII. It is the internal form of Unicode in many programming languages, such as Java, C#, and JavaScript, and in many operating systems. More technically: (1) The UTF-16 encoding form . (2) The UTF-16 encoding scheme . (3) "Transformation format for 16 planes of Group 00," defined in Annex C of ISO/IEC 10646:2003; technically equivalent to the definitions in the Unicode Standard.
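The byte counts given in the UTF-8 and UTF-16 definitions can be observed by encoding a few sample characters, as in this Python sketch. The sample characters are chosen for this example; utf-16-le is used so that no byte order mark is added to the count.

```python
# One sample character for each possible UTF-8 length.
samples = ["a", "\u00E9", "\u20AC", "\U0001F63A"]   # a, é, €, 😺

# UTF-8 uses 1 to 4 bytes per character.
utf8_lengths = [len(s.encode("utf-8")) for s in samples]

# UTF-16 uses 2 bytes, or 4 bytes (a surrogate pair) beyond the BMP.
utf16_lengths = [len(s.encode("utf-16-le")) for s in samples]
```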
Valid language tag . A language tag that is well-formed and which also conforms to the additional conformance requirements in [ BCP47 ], notably that each of the subtags appears in the IANA Language Subtag Registry. Contrast this with well-formed language tags.
Variation selector . Unicode definition : Any of three ranges of Unicode characters designated for use in defining a variation sequence . Variation selectors in the range U+FE00..U+FE0F are known as general-use variation selectors and are used for standardized variation sequences . Two of these, U+FE0E and U+FE0F, have specialized functions when used with emoji base characters. Variation selectors in the range U+180B..U+180D are known as Mongolian Free Variation Selectors ; their use is limited to standardized variation sequences for the Mongolian script. Variation selectors in the range U+E0100..U+E01EF are known as ideographic variation selectors and are used for ideographic variation sequences . Variation selectors are all nonspacing combining marks (General_Category=Mn). They have no graphic shape of their own; instead they function to pick out a particular, defined subset of potential graphic presentations for the base character to which they are applied. All variation selectors are default ignorable code points (DICP=Yes), meaning that if they are not interpretable in combination with their base character, they should be ignored for display, rather than shown with a nondisplayable glyph box, for example. See Section 23.4, Variation Selectors . The term variation selector is sometimes abbreviated as 'VS'.
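The sketch below checks the three ranges listed in the definition above and builds two emoji variation sequences. The function name is_variation_selector and the choice of U+2764 HEAVY BLACK HEART as the base character are assumptions made for this illustration.

```python
import unicodedata

HEART = "\u2764"                 # HEAVY BLACK HEART, used as the base character
TEXT_STYLE = HEART + "\uFE0E"    # base + VS15 requests text presentation
EMOJI_STYLE = HEART + "\uFE0F"   # base + VS16 requests emoji presentation


def is_variation_selector(character: str) -> bool:
    """Check a character against the three ranges named in the definition."""
    cp = ord(character)
    return (0xFE00 <= cp <= 0xFE0F          # general-use
            or 0x180B <= cp <= 0x180D       # Mongolian
            or 0xE0100 <= cp <= 0xE01EF)    # ideographic


# Variation selectors are nonspacing marks (General_Category=Mn).
vs16_category = unicodedata.category("\uFE0F")
```

Each variation sequence is two code points long, even though it is displayed as a single glyph.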
Virama . Unicode definition : From Sanskrit. The name of a sign used in many Indic and other Brahmi-derived scripts to suppress the inherent vowel of the consonant to which it is applied, thereby generating a dead consonant. (See Section 12.1, Devanagari .) The sign varies in shape from script to script, and may be known by other names in various languages. It may also be visible or hidden in consonant clusters, depending on the language and context. Used for scripts such as Devanagari, Bengali, Tamil, Balinese, etc.
Vocabulary . The list of reserved keywords and/or rules for assigning user-supplied values (such as identifiers) in a format or protocol. This can include restrictions on range, order, or type of characters that can appear in different places. For example, HTML defines the names of its elements and attributes, as well as enumerated attribute values, which defines the "vocabulary" of HTML syntactic content . Another example would be ECMAScript, which restricts the range of characters that can appear at the start or in the body of an identifier or variable name. It applies different rules for other cases, such as to the values of string literals. Values within a vocabulary fall into two broad classes: those that are meant to be seen, read, or interacted with by humans (and thus might be expected to contain natural language text); and those that are application or protocol internal and not intended for human interaction.
Wall time or local time . A moment in time that can be mapped to a specific point in incremental time if you apply any relevant time zone information, but it corresponds to what a person would recognise the time to be if they looked at a clock and/or calendar mounted on a wall in a particular place. So, for example, the time displayed by a computer in the UK may be Sat 11 Jun 06:10 . By applying knowledge about how that time relates to UTC (in this case, adjusting by one hour to account for British Summer Time) it is possible to convert that to the incremental time 1465621816590 . It's also possible to convert that to a wall time in another location, such as San Francisco, where someone looking at their computer's time display at the same time would have seen Fri 10 Jun 22:10 .
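The conversion described above can be sketched in Python with the standard datetime module. The two fixed offsets (+01:00 for British Summer Time and -07:00 for Pacific Daylight Time) are hard-coded assumptions for this example; a real application would derive them from time zone rules, because a fixed offset does not track daylight savings changes.

```python
from datetime import datetime, timedelta, timezone

# Incremental time from the example above, in milliseconds since the epoch.
incremental_ms = 1465621816590

# Interpret the incremental time as a moment in UTC (whole seconds only).
utc_time = datetime.fromtimestamp(incremental_ms // 1000, tz=timezone.utc)

# Assumed fixed offsets in force on that date.
BST = timezone(timedelta(hours=1))    # British Summer Time
PDT = timezone(timedelta(hours=-7))   # Pacific Daylight Time

uk_wall = utc_time.astimezone(BST)    # Saturday 11 June, 06:10
sf_wall = utc_time.astimezone(PDT)    # Friday 10 June, 22:10
```

Both wall times refer to the same instant, so uk_wall == sf_wall is true even though the displayed dates differ.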
Well-formed language tag . A language tag that follows the grammar defined in [ BCP47 ]. That is, it is structurally correct, consisting of ASCII letters and digit language subtags of the prescribed length, separated by hyphens. Contrast this with valid language tag .
Word boundary or Word . The concept of 'word' is difficult to define in any language (see What is a word? ). In these definitions, a word is an often vaguely-defined, but recognisable semantic unit that is typically smaller than a phrase and may comprise one or more syllables. Word boundaries are typically important for text operations such as line-breaking, and for prosodic and phonetic rules.
Writing system . Unicode definition : A set of rules for using one or more scripts to write a particular language. Examples include the American English writing system, the British English writing system, the French writing system, and the Japanese writing system. See also script .
Zone offset . An amount that is added to or subtracted from UTC based on the location of the event around the world relative to the prime meridian. Usually offsets are at one-hour intervals, but offsets can also include other differences, such as 30 or 45 minutes. A common way to express a zone offset in field-based formats is with +/- followed by the offset. So for example, Japan is 9 hours ahead of UTC, so you may see a time written as 2016-06-11 05:10+09:00 . Note, importantly, that the zone offset does not help you convert times to wall time where daylight savings time is in force.
If you are writing a W3C specification, you can use the terminology in this document directly by importing this document as a cross-reference source. This may be especially helpful when using character encoding or locale -related terms found here. Such terms are common to the needs of many specifications and often refer to other definitions found in this glossary.
This glossary is a Working Group NOTE, which means that the definitions found in it are, by definition, non-normative. In most cases, most specifications do not need the definition of terms found here to be formally normative. Using the definitions found here lends clarity when using specialized terminology, but doesn't affect requirements directly and in those cases copying these definitions into your specification won't add value to your readers.
However, you might find that your specification needs a normative dependency on a definition. In those cases, you should copy the definition to your own document (perhaps linking back to this glossary as a source). Slight adjustments can be made to the definitions to fit local circumstances. Please contact the Internationalization Working Group with any questions or concerns when doing this.
This document is meant to be used in conjunction with formal, normative definitions found in [ INFRA ]. When a term is defined in both documents, the [ INFRA ] version SHOULD be preferred. The definitions in both documents are maintained to be consistent with one another. Note that [ INFRA ]'s definitions may be used as normative.
The I18N Glossary has many more terms specific to internationalization than are to be found in Infra, including many terms useful in defining or discussing the handling of text.
If you are using ReSpec , you can import this glossary using the xref keyword in your configuration block. Complete instructions can be found in the ReSpec documentation .
For this document, the xref configuration will look something like this:
xref: [ "i18n-glossary" ],
Adding the above configuration allows you to write references to the terms found in this glossary using the normal ReSpec notation ( <a> tags or [=term=] markup).
If you use a terminology definition inside of a normative statement, ReSpec will complain that you've used an informative reference inside of a normative statement. The I18N Glossary is maintained as a WG Note, rather than on the REC-track, so it is formally informative.
You can suppress the warning, which is only necessary when you reference a term from a normative statement, by including this directive in your configuration block:
lint: { "informative-dfn": false },
If you are using Bikeshed to generate your specification, you can import this glossary using a spec directive that looks like this:
spec: i18n-glossary; urlPrefix: https://www.w3.org/TR/i18n-glossary/
All of the references in the glossary are exported, so you should be able to refer to terms in the glossary by using markup that looks like [= term =] .
In rare cases, you may need to include specific directives following the spec directive to ensure that the definition is found during document processing. Here is an example using the term locale :
spec: i18n-glossary; urlPrefix: https://www.w3.org/TR/i18n-glossary/
type: dfn
text: locale; url: locale