Copyright © 2020 W3C ® ( MIT , ERCIM , Keio , Beihang ). W3C liability , trademark and permissive document license rules apply.
This document describes the best practices for the identification of the natural language of content in document formats, specifications, and implementations on the Web. It also describes how language tags are used to indicate users' locale preferences, which are used to process or display data values and other information on the Web.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at https://www.w3.org/TR/.
This is an updated Public Working Draft of "Language Tags and Locale Identifiers for the World Wide Web". The Working Group expects this to become a Working Group Note.
If you wish to make comments regarding this document, please raise a GitHub issue. You may also send email to the list www-international@w3.org (subscribe, archives) as mentioned below. Please include [ltli] at the start of your email's subject. To make it easier to track comments, please raise separate issues or send separate emails for each comment. All comments are welcome.
This document was published by the Internationalization Working Group as an Editor's Draft.
GitHub Issues are preferred for discussion of this specification. Alternatively, you can send comments to our mailing list. Please send them to www-international@w3.org ( archives ).
Publication as an Editor's Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This document was produced by a group operating under the W3C Patent Policy . The group does not expect this document to become a W3C Recommendation. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy .
This document is governed by the 1 March 2019 W3C Process Document .
This document provides best practices for specification authors who need to identify natural language values in their document formats or protocols via the use of language tags [ BCP47 ], including common operations such as language negotiation . It also provides recommendations for how to specify locale-aware or internationalized behavior and defines core terminology that specifications might need to refer to these behaviors or capabilities.
Tags for identifying the natural language of content or the international preferences of users are one of the fundamental building blocks of the Web. The language tags found in Web and Internet formats and protocols are defined by [ BCP47 ]. Consistent use of language tags provides applications the ability to perform language-specific formatting or processing. For example, a user-agent might use the language to select an appropriate font for displaying text or a Web page designer might style text differently in one language than in another.
Many of the core standards for the Web include support for language tags; these include the xml:lang attribute in [ XML10 ], the lang and hreflang attributes in [ HTML ], the language property in [ XSL10 ], and the :lang pseudo-class in CSS [ CSS3-SELECTORS ].
Language tags can also be used to identify international preferences associated with a given piece of content or user because these preferences are linked to the natural language, regional association, or culture of the end user. Such preferences are applied to processes such as presenting numbers, dates, or times; sorting lists linguistically; providing defaults for items such as the presentation of a calendar, or common units of measurement; selecting between 12- vs. 24-hour time presentation; and many other details that users might find too tedious to set individually. Collectively, these preferences are usually called a locale . The extensions to [ BCP47 ] that define Unicode locales [ CLDR ] provide the basis for internationalization APIs on the Web. Notably, the JavaScript language [ ECMASCRIPT ] uses Unicode locales as the basis for the APIs found in [ ECMA-402 ].
Document formats and protocols often need to provide metadata about the natural language of content or perform language negotiation when selecting appropriate content on the Web. For more information and best practices related to the specification and use of language tags as metadata, see [ STRING-META ].
Users who speak different languages or come from different cultural backgrounds usually require software and services that are adapted to correctly process information using their native languages, writing systems, measurement systems, calendars, and other linguistic rules and cultural conventions.
International Preferences A user's particular set of cultural conventions, language, and formatting choices that software must employ to correctly process or present information exchanged with that user.
Internationalization The design and development of a product that is enabled for target audiences that vary in culture, region, or language. Internationalization is sometimes abbreviated I18N because there are eighteen letters between the "i" and the "n" in the English word.
There are many kinds of international preferences that may be offered on the Web in order for the content or service to be considered usable and acceptable by users around the world. Some of these preferences might include:
Because there are a large number of preferences, software systems (operating environments and programming languages) often use an identifier that combines natural language and other information, such as region or country, as a shorthand indicator for collections of preferences that typify categories of users that share certain cultural preferences.
HTML, for example, uses the lang attribute to indicate the language of segments of content. XML uses the xml:lang attribute for the same purpose.
Localization The tailoring of a system to the individual cultural expectations of a specific target market or group of individuals. Localization includes, but is not limited to, the translation of user-facing text and messages. Localization is sometimes abbreviated as L10N because there are ten letters between the "L" and the "N" in the English word.
When a particular set of content and preferences corresponding to a specific set of international preferences is operationally available, then the system is said to be localized .
Locale A collection of international preferences, generally related to a language and geographic region, that is passed in APIs or set in the operating environment to get culturally affected behavior from a system or process. Usually a locale is identified by an id or shorthand token, such as a language tag.
Generally, systems that are internationalized can support a wide range of locales (collections of languages and locally-tailored behaviors and defaults) in order to meet the international preferences of many kinds of users. When a particular system can respond to changes in the locale by trying to load different resources or by performing culturally appropriate formatting, we say that this system is locale-aware or enabled .
Language tags can provide information about the language, script, region, and various specially-registered variants using subtags. But sometimes there are international preferences that do not correlate directly with any of these. For example, many cultures have more than one way of sorting content items, and so the appropriate sort ordering cannot always be inferred from the language tag by itself. Thus a German language user might want to choose between the sort ordering used in a dictionary versus that used in a phone book.
Historically, locales were identified by the programming language or operating environment of the user. This application-specific identifier was often inferred from language tags. For example, an implementation could map a language tag from an existing protocol, such as HTTP's Accept-Language header, to its locale model.
Common Locale Data Repository or CLDR [ CLDR ] is a Unicode Consortium project that defines, collects, and curates sets of locale data needed to enable systems or operating environments. CLDR data and its locale model are widely adopted, particularly in browsers.
Unicode Locale A combination of language tag extensions ([ RFC6067 ], [ RFC6497 ]) and additional processing rules defined by [ CLDR ] to support locales defined by Unicode. A Unicode locale provides the ability to specify in a language tag a number of the international preference variations that go beyond linguistic or regional variation and that users or content authors might wish to specify directly. These select formatting behavior or content when there are multiple options. Unicode locale identifiers are identical to language tags, but apply additional rules about the content of certain language tags. Unicode locales increasingly form the basis for internationalization on the Web, particularly as part of the Intl locale framework [ ECMA-402 ] in JavaScript [ ECMASCRIPT ].
Unicode's [ CLDR ] project also maintains both [ BCP47 ] extensions related to Unicode locales. The Unicode locale language tag extension [ RFC6067 ] uses the -u- subtag, and provides subtags for selecting different locale-based formats and behaviors. The Transformed Content extension [ RFC6497 ] uses the -t- subtag, and provides subtags for text transformations, such as transliteration between scripts.
It is important to remember that every Unicode locale identifier is also a well-formed [ BCP47 ] language tag.
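As a rough illustration of how the -u- extension carries preferences inside an ordinary language tag, the following Python sketch pulls the keyword pairs out of a tag. The function name unicode_keywords is hypothetical, attributes are ignored, and the input is assumed to be well-formed already; this is not the parsing algorithm defined by [ CLDR ].

```python
from typing import Dict, List


def unicode_keywords(tag: str) -> Dict[str, str]:
    """Return the key/type pairs found in the -u- extension of a tag."""
    keywords: Dict[str, List[str]] = {}
    in_u_extension = False
    key = None
    # The first subtag is always the primary language, so skip it.
    for subtag in tag.lower().split("-")[1:]:
        if len(subtag) == 1:
            if subtag == "x":
                break  # private use: everything after it is opaque
            in_u_extension = subtag == "u"
            key = None
            continue
        if not in_u_extension:
            continue
        if len(subtag) == 2:
            # Two-character subtags inside the extension are keys.
            key = subtag
            keywords[key] = []
        elif key is not None:
            # Longer subtags after a key form that key's type.
            keywords[key].append(subtag)
    return {k: "-".join(v) for k, v in keywords.items()}


print(unicode_keywords("de-DE-u-co-phonebk"))
print(unicode_keywords("th-TH-u-nu-thai-ca-buddhist"))
```

A tag without the extension, such as en-US, simply yields an empty mapping, which reflects the point that the tag is still an ordinary language tag.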
Some preferences are individual and are left to content authors, service providers, operating environments, or user agents to define and manage on behalf of the user.
In this document, data values are any data type used in a document format or application other than natural language string values. These often correspond to data types such as numbers, dates, booleans, etc. Note that on the Web many data values are serialized as strings.
A data value is said to be locale-neutral when it is stored or exchanged in a format that is not specifically appropriate to any given language, locale, or culture and which can be interpreted unambiguously for presentation in a locale-aware way. A locale-neutral representation might itself be linked to a specific cultural preference, but such linkages should be minimized. An example of this is the ISO 8601 serialization of date/time values. Many of these serializations are linked to the Gregorian calendar, but the format, field order, separators, and visual appearance are not specifically suitable to any locale (they are intended to be machine readable), and the value can be converted for display into any calendar or locale.
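The separation between a locale-neutral stored value and its localized presentation can be sketched in Python as follows. The DISPLAY_PATTERNS table and the function names are hypothetical and exist only for illustration; a real implementation would take its formats from locale data such as [ CLDR ] rather than from hand-written patterns.

```python
from datetime import datetime, timezone

# Hypothetical display patterns, for illustration only.
DISPLAY_PATTERNS = {
    "en-US": "%m/%d/%Y",
    "de-DE": "%d.%m.%Y",
    "ja-JP": "%Y/%m/%d",
}


def to_interchange(value: datetime) -> str:
    """Serialize as ISO 8601 in UTC for storage or interchange."""
    return value.astimezone(timezone.utc).isoformat()


def present(serialized: str, language_tag: str) -> str:
    """Parse the locale-neutral value and format its date for display."""
    value = datetime.fromisoformat(serialized)
    return value.strftime(DISPLAY_PATTERNS[language_tag])


stored = to_interchange(datetime(2020, 10, 7, 14, 30, tzinfo=timezone.utc))
print(stored)                    # 2020-10-07T14:30:00+00:00
print(present(stored, "de-DE"))  # 07.10.2020
print(present(stored, "en-US"))  # 10/07/2020
```

Only the stored string is exchanged; the choice of field order and separators is deferred until the value is shown to a user.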
Language negotiation is the process of matching a user's international preferences to available localized resources, content, or processing. The user's preferences are usually expressed as a locale or prioritized list of locales. When negotiating the language, the system follows some sort of algorithm to get the best matching content or functionality from the available resources. In many cases the language negotiation algorithm uses locale fallback that proceeds from more-specific resources to more-general ones following a deterministic pattern.
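One common form of this fallback is the progressive truncation used by the lookup scheme in [ RFC4647 ]. The Python sketch below is a simplified reading of that scheme, not a complete implementation: the function name lookup and the default parameter are assumptions made for the example.

```python
from typing import Optional, Sequence


def lookup(priority_list: Sequence[str], available: Sequence[str],
           default: Optional[str] = None) -> Optional[str]:
    """Return the single best match, falling back from specific to general."""
    by_lowercase = {tag.lower(): tag for tag in available}
    for language_range in priority_list:
        if language_range == "*":
            continue  # the wildcard range is skipped during lookup
        subtags = language_range.lower().split("-")
        while subtags:
            candidate = "-".join(subtags)
            if candidate in by_lowercase:
                return by_lowercase[candidate]
            # Fall back by dropping the last subtag ...
            subtags.pop()
            # ... together with any single-character subtag left at the end.
            while subtags and len(subtags[-1]) == 1:
                subtags.pop()
    return default


print(lookup(["fr-CA", "en"], ["en", "fr", "de"]))  # fr
print(lookup(["ja"], ["en", "fr"], default="en"))   # en
```

In the first call the request for fr-CA falls back to fr before the lower-priority en is considered, which is the "more-specific to more-general" pattern described above.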
This document uses the term language to refer to what is sometimes called a natural language : the spoken, written, or signed communications used by human beings.
There are many ways that languages might be identified and many reasons that software might need to identify the language of content on the Web. Document formats and protocols on the Web generally use the identifiers used in most other parts of the Internet, consisting of the language tags defined in [ BCP47 ].
[ BCP47 ] is a multipart document consisting, at the time this document was published, of two separate RFCs. The first part, called Tags for Identifying Languages [ RFC5646 ], defines the grammar, form, and terminology of language tags. The second part, called Matching of Language Tags [ RFC4647 ], describes several schemes for matching, comparing, and selecting content using language tags and includes useful terminology related to comparison of language preferences to tagged content.
A language tag is a string used as an identifier for a language. In this document, the term language tag always refers explicitly to a [ BCP47 ] language tag. These language tags consist of one or more subtags.
A subtag is a sequence of ASCII letters or digits separated from other subtags by the hyphen-minus character and identifying a specific element of meaning within the overall language tag. In [ BCP47 ], subtags can consist of upper or lowercase ASCII letters (the case carries no distinction) or ASCII digits. Subtags are limited to no more than eight characters (although additional length restrictions apply depending on the specific use of the subtag).
Selecting content or behavior based on the language tag requires a few additional concepts defined by [ RFC4647 ]. In this document, we adopt the following terminology:
The IANA Language Subtag Registry is a machine-readable text file available via IANA which contains a comprehensive list of all of the subtags valid in language tags. (Link: Registry )
[ BCP47 ] defines two different levels of conformance. See classes of conformance in [ BCP47 ] for specifics. For language tags, the levels of conformance correspond to the type of checking that an implementation applies to language tag values.
A well-formed language tag follows the grammar defined in [ BCP47 ]. That is, it is structurally correct, consisting of ASCII letters and digit subtags of the prescribed length, separated by hyphens.
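A structural check of this kind can be approximated with a regular expression. The Python sketch below transcribes the main langtag and private-use productions of the [ BCP47 ] grammar; it deliberately omits the grandfathered tags and the rule that variants and extension singletons must not repeat, so it is an approximation rather than a conformance test. The names LANGTAG and is_well_formed are hypothetical.

```python
import re

LANGTAG = re.compile(r"""
    (?:
        (?:[a-z]{2,3}(?:-[a-z]{3}){0,3}|[a-z]{4}|[a-z]{5,8})  # language
        (?:-[a-z]{4})?                                # script
        (?:-(?:[a-z]{2}|[0-9]{3}))?                   # region
        (?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*      # variants
        (?:-[0-9a-wy-z](?:-[a-z0-9]{2,8})+)*          # extensions
        (?:-x(?:-[a-z0-9]{1,8})+)?                    # private use
      |
        x(?:-[a-z0-9]{1,8})+                          # private-use-only tag
    )
    """, re.IGNORECASE | re.VERBOSE)


def is_well_formed(tag: str) -> bool:
    """Check the tag against the simplified grammar above."""
    return LANGTAG.fullmatch(tag) is not None


for candidate in ["zh-Hant-TW", "de-DE-u-co-phonebk", "en--US", "en_US"]:
    print(candidate, is_well_formed(candidate))
```

Note that the check is purely structural: a tag can pass it while containing subtags that appear in no registry.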
A valid language tag has been checked to ensure that each of the subtags appears in the Language Subtag Registry hosted by IANA.
A canonical Unicode locale identifier is a well-formed language tag that also conforms to the additional rules for Unicode locale identifiers found in [ CLDR ] (see Section 3 ). Unicode locales define additional conformance criteria and normalization steps beyond that found in [ BCP47 ] that help make language tags more consistent and interoperable.
A language range is a string similar in structure to a language tag that is used for "identifying sets of language tags that share specific attributes".
A language priority list is a collection of one or more language ranges identifying the user's language preferences for use in matching. As the name suggests, such lists are normally ordered or weighted according to the user's preferences. The HTTP [ RFC2616 ] Accept-Language [ RFC3282 ] header is an example of one kind of language priority list.
A basic language range is simply a language tag used to express a language preference.
An extended language range allows a more expressive set of language preferences through the use of the wildcard subtag * .
Some language priority lists , such as the Accept-Language [ RFC3282 ] header mentioned earlier, provide "weights" for values appearing in the list. Such weighting cannot be depended on for anything other than ordering the list.
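To make the point about weights concrete, the following Python sketch turns an Accept-Language header value into an ordered language priority list and then discards the weights. It is a simplified parser written for this example (the function name parse_accept_language is hypothetical), and it assumes the usual HTTP convention that a weight of zero marks a range as not acceptable.

```python
from typing import List, Tuple


def parse_accept_language(header: str) -> List[str]:
    """Return the language ranges of a header value in preference order."""
    weighted: List[Tuple[float, int, str]] = []
    for position, item in enumerate(header.split(",")):
        language_range, _, parameter = item.strip().partition(";")
        language_range = language_range.strip()
        if not language_range:
            continue
        quality = 1.0  # ranges without a weight default to 1
        parameter = parameter.strip().lower()
        if parameter.startswith("q="):
            try:
                quality = float(parameter[2:])
            except ValueError:
                continue  # skip entries with a malformed weight
        if quality > 0:
            weighted.append((quality, position, language_range))
    # Higher weights first; ties keep their original header order.
    weighted.sort(key=lambda entry: (-entry[0], entry[1]))
    return [language_range for _, _, language_range in weighted]


print(parse_accept_language("da, en-gb;q=0.8, en;q=0.7"))
print(parse_accept_language("en;q=0.5, fr;q=0.9, de"))
```

The weights are used only as a sort key; nothing in the result suggests, for example, that a range weighted 0.8 is "80% acceptable".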
This section provides specification authors and implementers with best practices recommended by the Internationalization (I18N) Working Group. These (and many other) best practices, along with links to supporting materials, can also be found in the Internationalization Best Practices for Spec Developers [ INTERNATIONAL-SPECS ]. In addition to the best practices found here, additional best practices relating to language metadata on the Web can be found in [ STRING-META ].
Specifications for the Web that require language identification MUST refer to [ BCP47 ].
Specifications SHOULD NOT refer to specific component RFCs.
The "BCP" nomenclature refers to the current set of RFCs that form the "best current practice". At the time this document was published, [ BCP47 ] consisted of two RFCs: Tags for the Identification of Languages [ RFC5646 ] and Matching of Language Tags [ RFC4647 ].
Formulations such as " RFC 5646 or its successor " MAY be used, but only in cases where the specific document version is necessary.
While this style of reference was once popular, using the BCP reference is more accurate. Since the grammar of language tags has been fixed since [ RFC4646 ], referring to the BCP will not incur additional compliance risk to most implementations.
Specifications MUST NOT reference obsolete versions of [ BCP47 ], such as [ RFC1766 ] or [ RFC3066 ].
Specifications that need to preserve compatibility with obsolete versions of [ BCP47 ] MUST reference the production obs-language-tag in [ BCP47 ].
Beginning with [ RFC4646 ], [ BCP47 ] defined a more complex, machine-readable syntax for language tags. Some specifications might desire or require compatibility with the older language tag grammar found in previous versions of BCP47 (specifically [ RFC1766 ] and [ RFC3066 ]). This grammar was more permissive and is described in [ BCP47 ] as the ABNF production obs-language-tag .
[ RFC4646 ], which introduced the current grammar for language tags, is itself obsolete.
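For comparison with the current grammar, the older and more permissive shape can be expressed as a short pattern. The Python sketch below assumes that obs-language-tag is an alphabetic primary subtag of one to eight letters followed by any number of alphanumeric subtags of one to eight characters; that reading should be checked against the ABNF in [ BCP47 ] before being relied upon. The names are hypothetical.

```python
import re

# Assumed shape of obs-language-tag: primary-subtag *( "-" subtag )
OBS_LANGUAGE_TAG = re.compile(r"[A-Za-z]{1,8}(?:-[A-Za-z0-9]{1,8})*")


def matches_obs_language_tag(tag: str) -> bool:
    """Check a tag against the permissive, older-style grammar."""
    return OBS_LANGUAGE_TAG.fullmatch(tag) is not None


for candidate in ["en-US", "i-klingon", "en_US", "1ab"]:
    print(candidate, matches_obs_language_tag(candidate))
```

Older-style tags such as i-klingon fit this pattern directly, which illustrates why the production is useful for compatibility.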
Specifications MAY reference registered extensions to [ BCP47 ] as necessary.
In particular, [ RFC6067 ] defines the BCP 47 Extension U , also known as "Unicode Locales". This extension to [ BCP47 ] provides additional subtag sequences for selecting specific locale variations.
Specifications SHOULD require that language tags be well-formed .
Specifications MAY require that language tags be valid .
Checking if a tag is valid requires access to or a copy of the registry plus additional runtime logic. While content authors are advised to choose, generate, and exchange only valid values, language tag matching and other common language tag operations are designed so that validity checking is not needed. Features or functions that need to understand the specific semantic content of subtags are the main reason that a specification would normatively require valid tags as part of the protocol or document format.
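The extra machinery that validity checking implies can be sketched in Python as follows. The excerpt is a hand-abbreviated sample in the registry's record format (records separated by %%, with most fields omitted), and the check covers only the primary language and two-letter region subtags; scripts, variants, extensions, and deprecation are not examined. All names here are hypothetical.

```python
from typing import Set, Tuple

# Abbreviated sample in the registry's record format; not the real file.
REGISTRY_EXCERPT = """\
Type: language
Subtag: de
Description: German
%%
Type: language
Subtag: en
Description: English
%%
Type: region
Subtag: CH
Description: Switzerland
%%
Type: region
Subtag: US
Description: United States
"""


def load_subtags(text: str) -> Set[Tuple[str, str]]:
    """Collect (type, subtag) pairs from registry-style records."""
    known = set()
    for record in text.split("%%"):
        fields = dict(line.split(": ", 1)
                      for line in record.strip().splitlines() if ": " in line)
        if "Type" in fields and "Subtag" in fields:
            known.add((fields["Type"], fields["Subtag"].lower()))
    return known


def language_and_region_registered(tag: str, known: Set[Tuple[str, str]]) -> bool:
    """Partial validity check: primary language and two-letter region only."""
    subtags = tag.lower().split("-")
    if ("language", subtags[0]) not in known:
        return False
    for subtag in subtags[1:]:
        if len(subtag) == 1:
            break  # extensions and private use are not checked here
        if len(subtag) == 2 and ("region", subtag) not in known:
            return False
    return True


known = load_subtags(REGISTRY_EXCERPT)
print(language_and_region_registered("de-CH", known))
print(language_and_region_registered("de-XX", known))
```

Even this partial check needs a copy of the registry data and position-dependent logic, which is why matching operations are designed to work without it.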
Content authors SHOULD choose language tags that are canonical Unicode locale identifiers .
The additional content restrictions and normalization steps found in Section 3 of [ LDML ] provide for better interoperability and consistency than that afforded by [ BCP47 ] directly.
Implementations SHOULD only emit language tags that are canonical Unicode locale identifiers and SHOULD normalize language tags that they consume using the rules for producing canonical tags.
As above, the additional content restrictions and normalization steps found in Section 3 of [ LDML ] provide for better interoperability and consistency than that afforded by [ BCP47 ] directly. This best practice should not be interpreted as meaning that implementations need to support, generate, process, or understand either of [ CLDR ]'s extensions.
Specifications SHOULD require content authors use valid language tags.
Content validators SHOULD check if content uses valid language tags where feasible.
Specifications SHOULD NOT reference [ BCP47 ]'s underlying standards that contribute to the IANA Language Subtag Registry , such as ISO 639, ISO 15924, ISO 3166, or UN M.49.
Some standards might directly consume one of [ BCP47 ]'s contributory standards, in which case a reference is wholly appropriate. However, in most cases, the purpose of the reference is to specify a valid list of codes and their meanings. [ BCP47 ]'s subtag registry is stabilized and resolves ambiguity in a number of useful ways and so should be the preferred source for this type of reference.
Applications that provide language information as part of URIs (e.g. in the realm of RDF) SHOULD use [ BCP47 ].
Currently, URIs expressing language information often use values from parts of ISO 639. This leads to situations in which there are ambiguities about what the proper value should be, e.g. for German de from ISO 639-1 or ger from ISO 639-2. By using BCP 47 and its language subtag registry, such ambiguities can be avoided, e.g. for German, the registry contains only de .
Specifications that define language tag matching or language negotiation MUST specify whether language ranges used are a basic language range or an extended language range .
Specifications that define language tag matching MUST specify whether the result of a matching operation contains a single result ( lookup as defined in [ RFC4647 ]), or a possibly-empty (zero or more) set of results ( filtering as defined in [ RFC4647 ]).
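The difference in result shape matters to callers. As an illustration, the Python sketch below implements a simplified form of basic filtering, in which a range matches a tag when the two are equal or when the range is a prefix of the tag that ends at a subtag boundary. The function name basic_filter is hypothetical, and the sketch returns a list that may be empty, whereas a lookup operation would return exactly one tag or a default.

```python
from typing import List, Sequence


def basic_filter(priority_list: Sequence[str],
                 available: Sequence[str]) -> List[str]:
    """Return every available tag matched by some basic language range."""
    matches: List[str] = []
    for language_range in priority_list:
        prefix = language_range.lower()
        for tag in available:
            lowered = tag.lower()
            matched = (prefix == "*"
                       or lowered == prefix
                       or lowered.startswith(prefix + "-"))
            if matched and tag not in matches:
                matches.append(tag)
    return matches


content = ["de", "de-CH", "de-CH-1996", "en-US", "fr"]
print(basic_filter(["de-CH", "en"], content))
print(basic_filter(["ja"], content))
```

A specification that leaves the choice open forces implementers to guess whether an empty result, a single tag, or several tags is the expected outcome.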
Specifications that define language tag matching MUST specify the matching algorithms available and the selection mechanism.
For example, JavaScript internationalization [ ECMA-402 ] and [ CLDR ] provide a "best fit" algorithm which can be tailored by implementers.
Specifications SHOULD NOT restrict the length of language tags or permit or encourage the removal of extensions.
Specifications that present data values in a document format SHOULD require that data is formatted according to the language of the surrounding content.
When data values are presented to the user as part of a document or application, the document or application forms the "context" where the data is being viewed. Content authors or application developers need a way to make the data values seem like a natural part of the experience and need a way to control the presentation. This is indicated by the language tag of the context in which the content appears: usually enabled implementations interpret the tag as a locale in order to accomplish this. Using the runtime locale or localization of the user-agent as the locale for presenting data values should only be a last resort.
Specifications that present forms or receive input of data values in a document format or application SHOULD require that the values be presented to the user localized in the format of the language of the content or markup immediately surrounding the value.
Specifications that present, exchange, or allow the input of data values MUST use a locale-neutral format for storage and interchange.
Implementations SHOULD present data values in a document format or application using a format consistent with the language of the surrounding content and are encouraged to provide controls which are localized to the same locale for input or editing.
Users expect form fields and other data inputs to use a presentation for data values that is consistent with the document or application where the values appear. Users usually expect their input to match the document's context rather than that of the user-agent or operating environment, and expect input validation, prompting, and controls to be consistent with the content as well. This gives content authors the ability to create a wholly localized customer experience and is generally in keeping with customer expectations.
The Internationalization WG has additional best practices and other references, such as articles on language tag choice. These include:
The following changes were made since the revision of 2015-04-23.
The following changes were made since the revision of 2006-06-20.
The following log records changes that have been made to this document since the publication in April 2006 .
The informative introductory section has been rewritten thoroughly, including the description of the scope of the document, of application scenarios and of the separation of locale versus natural language.
Terms which rely on [ BCP47 ] are not defined anymore, but only reference these documents. In addition, examples for these terms were created.
The requirements for language and locale values have been taken out of the conformance section and are now placed in the body of the document.
A revision log has been created.
The Internationalization Working Group would like to acknowledge the following contributors to this specification: