Language Tags and Locale Identifiers for the World Wide Web

Abstract

This document describes the best practices for the identification of the natural language of content in document formats, specifications, and implementations on the Web. It also describes how language tags are used to indicate users' locale preferences, which are used to process or display data values and other information on the Web.

1. Introduction

This document provides best practices for specification authors who need to identify natural language values in their document formats or protocols via the use of language tags [ BCP47 ], including common operations such as language negotiation. It also provides recommendations for how to specify locale-aware or internationalized behavior and defines core terminology that specifications might need when referring to these behaviors or capabilities.

Tags for identifying the natural language of content or the international preferences of users are one of the fundamental building blocks of the Web. The language tags found in Web and Internet formats and protocols are defined by [ BCP47 ]. Consistent use of language tags provides applications the ability to perform language-specific formatting or processing. For example, a user-agent might use the language to select an appropriate font for displaying text or a Web page designer might style text differently in one language than in another.

Many of the core standards for the Web include support for language tags; these include the xml:lang attribute in [ XML10 ], the lang and hreflang attributes in [ HTML ], the language property in [ XSL10 ], and the :lang pseudo-class in CSS [ CSS3-SELECTORS ].

Language tags can also be used to identify international preferences associated with a given piece of content or user because these preferences are linked to the natural language, regional association, or culture of the end user. Such preferences are applied to processes such as presenting numbers, dates, or times; sorting lists linguistically; providing defaults for items such as the presentation of a calendar, or common units of measurement; selecting between 12- vs. 24-hour time presentation; and many other details that users might find too tedious to set individually. Collectively, these preferences are usually called a locale. The extensions to [ BCP47 ] that define Unicode locales [ CLDR ] provide the basis for internationalization APIs on the Web. Notably, the JavaScript language [ ECMASCRIPT ] uses Unicode locales as the basis for the APIs found in [ ECMA-402 ].

Document formats and protocols often need to provide metadata about the natural language of content or perform language negotiation when selecting appropriate content on the Web. For more information and best practices related to the specification and use of language tags as metadata, see [ STRING-META ].

2. What are Internationalization and Localization?

Users who speak different languages or come from different cultural backgrounds usually require software and services that are adapted to correctly process information using their native languages, writing systems, measurement systems, calendars, and other linguistic rules and cultural conventions.

International Preferences A user's particular set of cultural conventions, language, and formatting choices that software must employ to correctly process or present information exchanged with that user.

Internationalization The design and development of a product that is enabled for target audiences that vary in culture, region, or language. Internationalization is sometimes abbreviated I18N because there are eighteen letters between the "i" and the "n" in the English word.

There are many kinds of international preferences that may be offered on the Web in order for the content or service to be considered usable and acceptable by users around the world. Some of these preferences might include:

Natural language for text processing: parsing, spell checking, and grammar checking are examples of this;
User interface language, which may include items like images, colors, sounds, formats, and navigational elements as well as the visible strings;
Presentation (human-oriented formatting) of dates, times, numbers, lists, and other values;
Collation, sorting, and organization of content (such as in a phone book or a dictionary);
Alternate time-keeping and calendars, which may include holidays, work rules, weekday/weekend distinctions, the number and organization of months, the numbering of years, and so forth;
Tax or regulatory regime;
Currency

... and many more.

Because there are a large number of preferences, software systems (operating environments and programming languages) often use an identifier that combines natural language and other information, such as region or country, as a shorthand indicator for collections of preferences that typify categories of users that share certain cultural preferences.

HTML for example uses the lang attribute to indicate the language of segments of content. XML uses the xml:lang attribute for the same purpose.

Localization The tailoring of a system to the individual cultural expectations of a specific target market or group of individuals. Localization includes, but is not limited to, the translation of user-facing text and messages. Localization is sometimes abbreviated as L10N because there are ten letters between the "L" and the "N" in the English word. When a particular set of content and preferences corresponding to a specific set of international preferences is operationally available, then the system is said to be localized .

Locale A collection of international preferences, generally related to a language and geographic region, that is passed in APIs or set in the operating environment to get culturally affected behavior from a system or process. Usually a locale is identified by an id or shorthand token, such as a language tag.

Generally, systems that are internationalized can support a wide range of locales (collections of languages and locally-tailored behaviors and defaults) in order to meet the international preferences of many kinds of users. When a particular system can respond to changes in the locale by trying to load different resources or by performing culturally appropriate formatting, we say that this system is locale-aware or enabled .

Language tags can provide information about the language, script, region, and various specially-registered variants using subtags. But sometimes there are international preferences that do not correlate directly with any of these. For example, many cultures have more than one way of sorting content items, and so the appropriate sort ordering cannot always be inferred from the language tag by itself. Thus a German language user might want to choose between the sort ordering used in a dictionary versus that used in a phone book.

Historically, locales were identified by the programming language or operating environment of the user. This application-specific identifier was often inferred from language tags. For example, an implementation could map a language tag from an existing protocol, such as HTTP's Accept-Language header, to its locale model.

Common Locale Data Repository or CLDR [ CLDR ] is a Unicode Consortium project that defines, collects, and curates sets of locale data needed to enable systems or operating environments. CLDR data and its locale model are widely adopted, particularly in browsers.

Unicode Locale . A combination of language tag extensions ([ RFC6067 ], [ RFC6497 ]) and additional processing rules defined by [ CLDR ] to support locales. A Unicode locale provides the ability to specify in a language tag international preference variations that go beyond linguistic or regional variation or to select formatting behavior or content when there are multiple options. Unicode locale identifiers are identical to language tags, but apply additional rules about the content of certain language tags. Unicode Locales increasingly form the basis for internationalization on the Web, particularly as part of the Intl locale framework [ ECMA-402 ] in JavaScript [ ECMASCRIPT ].

Unicode's [ CLDR ] project maintains both [ BCP47 ] extensions related to Unicode locales. The Unicode locale language tag extension [ RFC6067 ] uses the -u- subtag, and provides subtags for selecting different locale-based formats and behaviors.

Example 1

Here are a few selected examples of Unicode locale variations.

In the first example, the value 123456789.5678 is formatted using the locale rules represented by the various language tags. Notice how the u extension and its nu keyword are used to select between Latin and Devanagari digit shapes in the Hindi-as-used-in-India ( hi-IN ) locale and between Latin and Arabic script digit shapes in the Arabic ( ar ) locale.

Variation Type	Value	Language Tag	Formatted Value
Numbering System	`123456789.5678`	en-US	123,456,789.5678
		de	123.456.789,5678
		hi-IN-u-nu-latn	12,34,56,789.5678
		hi-IN-u-nu-deva	१२,३४,५६,७८९.५६७८
		ar-u-nu-latn	123,456,789.5678
		ar-u-nu-arab	١٢٣٬٤٥٦٬٧٨٩٫٥٦٧٨

In the second example, the date value corresponding to 11 July 2020 on the Gregorian calendar is formatted using various different locales. Here, for example, the language tag for Thai ( th ) is extended to select between the Gregorian ( -u-ca-gregory ) and Thai Buddhist ( -u-ca-buddhist ) calendar systems. Other examples show the Japanese Imperial calendar and one type of Islamic calendar. Notice in the last example that the calendar is not restricted to a specific locale: here we show the Islamic calendar system in an English locale.

Variation Type	Value	Language Tag	Formatted Value
Calendar	`2020-07-11T12:00:00Z`	th-u-ca-gregory	11 ก.ค. 2020
		th-u-ca-buddhist	11 ก.ค. 2563
		ja-u-ca-japanese	令和2年7月11日
		ar-u-ca-islamic	٢٠ ذو القعدة ١٤٤١ هـ
		en-u-ca-islamic	Dhuʻl-Q. 20, 1441 AH
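
The keywords carried by the u extension in the tags above can be read with plain string handling. The following Python sketch is illustrative only: the function name unicode_keywords is hypothetical, and the parsing is simplified (attributes are skipped and no keyword is checked against CLDR data).

```python
def unicode_keywords(tag: str) -> dict:
    """Return the keywords of the -u- extension as {key: type}.

    Simplified sketch: attributes are ignored and nothing is validated.
    """
    subtags = tag.lower().split("-")
    start = None
    for i, subtag in enumerate(subtags[1:], start=1):
        if subtag == "x":        # everything after "x" is private use
            break
        if subtag == "u":
            start = i + 1
            break
    if start is None:
        return {}

    collected = {}
    key = None
    for subtag in subtags[start:]:
        if len(subtag) == 1:     # the next singleton ends the extension
            break
        if len(subtag) == 2:     # two-character subtags are keys
            key = subtag
            collected[key] = []
        elif key is not None:    # longer subtags belong to the current key
            collected[key].append(subtag)
    # A key with no type subtags is treated as having the type "true".
    return {k: "-".join(v) or "true" for k, v in collected.items()}


print(unicode_keywords("hi-IN-u-nu-deva"))
print(unicode_keywords("th-u-ca-buddhist"))
```

Because the extension is just a sequence of subtags, a consumer that does not understand it can still process the rest of the tag as an ordinary language tag.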

The Transformed Content extension [ RFC6497 ], which uses the -t- subtag, provides subtags for text transformations, such as transliteration between scripts.

It is important to remember that every Unicode locale identifier is also a well-formed [ BCP47 ] language tag.

Note

Some preferences are individual and are left to content authors, service providers, operating environments, or user agents to define and manage on behalf of the user.

In this document, data values are any data type used in a document format or application other than natural language string values. These often correspond to data types such as numbers, dates, booleans, etc. Note that on the Web many data values are serialized as strings.

A data value is said to be locale-neutral when it is stored or exchanged in a format that is not specifically appropriate to any given language, locale, or culture and which can be interpreted unambiguously for presentation in a locale-aware way. A locale-neutral representation might itself be linked to a specific cultural preference, but such linkages should be minimized. An example of this is the set of ISO 8601 serializations of date/time values. Many of these are linked to the Gregorian calendar, but the format, field order, separators, and visual appearance are not specifically suitable to any locale (they are intended to be machine readable) and, as shown in the example above, the value can be converted for display into any calendar or locale.
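
As a rough illustration of the split between locale-neutral storage and localized presentation, the Python sketch below keeps a date in its ISO 8601 form and formats it only for display. The DATE_PATTERNS table and the present_date function are hypothetical: the patterns are hand-copied from the Format Pattern column of Example 2 and rewritten in strftime syntax, whereas a real implementation would take them from locale data such as [ CLDR ].

```python
from datetime import date

# Hypothetical pattern table (assumption): copied from Example 2 in this
# document and translated to strftime syntax. Not real locale data.
DATE_PATTERNS = {
    "en-GB": "%d/%m/%Y",
    "en-US": "%m/%d/%Y",
    "fr-FR": "%d-%m-%Y",
    "zh-Hans-CN": "%Y-%m-%d",
}


def present_date(stored: str, language_tag: str) -> str:
    """Format a locale-neutral ISO 8601 date for the given language tag."""
    value = date.fromisoformat(stored)             # storage/interchange form
    pattern = DATE_PATTERNS.get(language_tag, "%Y-%m-%d")
    return value.strftime(pattern)                 # presentation form only


stored_value = "2020-01-31"                        # what is stored and exchanged
for tag in ("en-GB", "en-US", "fr-FR", "zh-Hans-CN"):
    print(tag, present_date(stored_value, tag))
```

The stored string never changes; only the presentation step depends on the language tag.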

Example 2

Suppose your application needs to collect and store a data value. The system can use a locale-neutral format for storing and exchanging the value: schema languages such as [ XMLSCHEMA11-2 ] or data formats such as [ JSON ] provide ready-made types for this purpose. When the user is entering or editing the value, however, the user expects to interact with a more human-friendly format. For example, if your application needed to input a user's birth date and the value they were trying to enter were 2020-01-31:

Input	HTML	Output	Format Pattern
`2020-01-31`	<input type=date lang= en-GB ...>	31/01/2020	dd/MM/yyyy
	<input type=date lang= en-US ...>	01/31/2020	MM/dd/yyyy
	<input type=date lang= fr-FR ...>	31-01-2020	dd-MM-yyyy
	<input type=date lang= zh-Hans-CN ...>	2020-01-31	yyyy-MM-dd

Language negotiation is the process of matching a user's international preferences to available localized resources, content, or processing. The user's preferences are usually expressed as a locale or prioritized list of locales. When negotiating the language, the system follows some sort of algorithm to get the best matching content or functionality from the available resources. In many cases the language negotiation algorithm uses locale fallback that proceeds from more-specific resources to more-general ones following a deterministic pattern.
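
One common form of locale fallback truncates the tag one subtag at a time, from most specific to most general. The Python sketch below shows that pattern under the assumption that simple truncation is wanted; fallback_chain is a hypothetical name, and real systems (for example those driven by [ CLDR ] data) may apply additional inheritance rules.

```python
def fallback_chain(tag: str) -> list:
    """List a tag followed by progressively more general forms of it."""
    subtags = tag.split("-")
    chain = []
    while subtags:
        chain.append("-".join(subtags))
        subtags.pop()
        # A singleton such as "u" or "x" left at the end carries no meaning
        # on its own, so it is removed together with its last subtag.
        while subtags and len(subtags[-1]) == 1:
            subtags.pop()
    return chain


print(fallback_chain("zh-Hans-CN"))
print(fallback_chain("de-DE-u-co-phonebk"))
```

A resource loader could walk this list and stop at the first entry for which localized content exists.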

3. Languages, Language Tags and Matching of Language Tags

This document uses the term language to refer to what is sometimes called a natural language : the spoken, written, or signed communications used by human beings.

There are many ways that languages might be identified and many reasons that software might need to identify the language of content on the Web. Document formats and protocols on the Web generally use the identifiers used in most other parts of the Internet, consisting of the language tags defined in [ BCP47 ].

[ BCP47 ] is a multipart document consisting, at the time this document was published, of two separate RFCs. The first part, called Tags for Identifying Languages [ RFC5646 ], defines the grammar, form, and terminology of language tags. The second part, called Matching of Language Tags [ RFC4647 ], describes several schemes for matching, comparing, and selecting content using language tags and includes useful terminology related to comparison of language preferences to tagged content.

A language tag is a string used as an identifier for a language. In this document, the term language tag always refers explicitly to a [ BCP47 ] language tag. These language tags consist of one or more subtags.

A subtag is a sequence of ASCII letters or digits separated from other subtags by the hyphen-minus character and identifying a specific element of meaning within the overall language tag. In [ BCP47 ], subtags can consist of upper or lowercase ASCII letters (the case carries no distinction) or ASCII digits. Subtags are limited to no more than eight characters (although additional length restrictions apply depending on the specific use of the subtag).

Selecting content or behavior based on the language tag requires a few additional concepts defined by [ RFC4647 ]. In this document, we adopt the following terminology:

The IANA Language Subtag Registry is a machine-readable text file available via IANA which contains a comprehensive list of all of the subtags valid in language tags. (Link: Registry )

[ BCP47 ] defines two different levels of conformance. See classes of conformance in [ BCP47 ] for specifics. For language tags, the levels of conformance correspond to the type of checking that an implementation applies to language tag values.

A well-formed language tag follows the grammar defined in [ BCP47 ]. That is, it is structurally correct, consisting of subtags made of ASCII letters and digits, each of the prescribed length, separated by hyphens.
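
A well-formedness check needs only the grammar, not the registry. The Python sketch below approximates the langtag production of [ RFC5646 ] with a regular expression. It is an assumption-laden simplification: the name is_well_formed is hypothetical, and the grandfathered tags that the grammar also permits are deliberately left out.

```python
import re

# Approximation of the RFC 5646 "langtag" and private-use productions.
# Grandfathered tags are intentionally not handled in this sketch.
_LANGTAG = re.compile(
    r"""
    (?:
        (?: [a-z]{2,3} (?:-[a-z]{3}){0,3} | [a-z]{4,8} )  # language (+ extlang)
        (?: -[a-z]{4} )?                                  # script
        (?: -(?:[a-z]{2}|[0-9]{3}) )?                     # region
        (?: -(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}) )*        # variants
        (?: -[0-9a-wy-z](?:-[a-z0-9]{2,8})+ )*            # extensions
        (?: -x(?:-[a-z0-9]{1,8})+ )?                      # private use
      |
        x(?:-[a-z0-9]{1,8})+                              # private-use-only tag
    )
    """,
    re.IGNORECASE | re.VERBOSE,
)


def is_well_formed(tag: str) -> bool:
    """Return True if the whole string matches the simplified grammar."""
    return _LANGTAG.fullmatch(tag) is not None


for candidate in ("en-US", "hi-IN-u-nu-deva", "en_US", "de-u"):
    print(candidate, is_well_formed(candidate))
```

Note that this check says nothing about whether the subtags are registered; "xx-YY" passes because it has the right shape.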

A valid language tag has been checked to ensure that each of the subtags appears in the Language Subtag Registry hosted by IANA.

A canonical Unicode locale identifier is a well-formed language tag that also conforms to the additional rules for Unicode locale identifiers found in [ CLDR ] (see Section 3 ). Unicode locales define additional conformance criteria and normalization steps beyond those found in [ BCP47 ] that help make language tags more consistent and interoperable.

A language range is a string similar in structure to a language tag that is used for "identifying sets of language tags that share specific attributes".

A language priority list is a collection of one or more language ranges identifying the user's language preferences for use in matching. As the name suggests, such lists are normally ordered or weighted according to the user's preferences. The HTTP [ RFC2616 ] Accept-Language [ RFC3282 ] header is an example of one kind of language priority list.

A basic language range is simply a language tag used to express a language preference. An extended language range allows a more expressive set of language preferences through the use of a wildcard subtag * .
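
The practical difference between the two kinds of range shows up when matching. The Python sketch below follows the basic and extended filtering rules described in [ RFC4647 ]; the function names basic_match and extended_match are hypothetical, and the code is a reading of those rules rather than a reference implementation.

```python
def basic_match(language_range: str, tag: str) -> bool:
    """Basic filtering: the range equals the tag or is a prefix of it."""
    r, t = language_range.lower(), tag.lower()
    return r == "*" or t == r or t.startswith(r + "-")


def extended_match(language_range: str, tag: str) -> bool:
    """Extended filtering: wildcards and skipped subtags are allowed."""
    r = language_range.lower().split("-")
    t = tag.lower().split("-")
    if r[0] != "*" and r[0] != t[0]:
        return False                  # the first subtags must agree
    ri, ti = 1, 1
    while ri < len(r):
        if r[ri] == "*":
            ri += 1                   # a wildcard matches zero or more subtags
        elif ti >= len(t):
            return False              # the tag ran out of subtags
        elif r[ri] == t[ti]:
            ri += 1
            ti += 1
        elif len(t[ti]) == 1:
            return False              # never skip past a singleton
        else:
            ti += 1                   # skip a non-matching tag subtag
    return True


print(basic_match("de-DE", "de-Latn-DE"))       # basic range: prefix only
print(extended_match("de-*-DE", "de-Latn-DE"))  # extended range with wildcard
```

With a basic range, "de-DE" does not match "de-Latn-DE" because the script subtag breaks the prefix; the extended range "de-*-DE" does match it.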

Some language priority lists , such as the Accept-Language [ RFC3282 ] header mentioned earlier, provide "weights" for values appearing in the list. Such weighting cannot be depended on for anything other than ordering the list.
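
Since the weights are only good for ordering, a consumer can discard them once the list has been sorted. The Python sketch below turns an Accept-Language value into an ordered language priority list. The function name parse_accept_language is hypothetical, and the parsing is deliberately lenient rather than a full implementation of the header grammar.

```python
def parse_accept_language(header: str) -> list:
    """Return the language ranges of a header value, most preferred first."""
    entries = []
    for position, item in enumerate(header.split(",")):
        parts = [part.strip() for part in item.split(";")]
        language_range = parts[0]
        if not language_range:
            continue
        quality = 1.0                          # default weight when q is absent
        for parameter in parts[1:]:
            name, _, value = parameter.partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0              # unreadable weight: drop the entry
        if quality > 0:                        # q=0 means "not acceptable"
            entries.append((-quality, position, language_range))
    # Sort by descending weight; ties keep their order in the header.
    return [language_range for _, _, language_range in sorted(entries)]


print(parse_accept_language("fr-CH, fr;q=0.9, en;q=0.8, de;q=0.7, *;q=0.5"))
```

The returned list keeps no numeric weights, which reflects the advice above that they should not be relied on for anything but ordering.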

4. Best Practices and Recommendations

This section provides specification authors and implementers with best practices recommended by the Internationalization (I18N) Working Group. These (and many other) best practices, along with links to supporting materials, can also be found in the Internationalization Best Practices for Spec Developers [ INTERNATIONAL-SPECS ]. In addition to the best practices found here, additional best practices relating to language metadata on the Web can be found in [ STRING-META ].

Note

Specifications for the Web that require language identification MUST refer to [ BCP47 ].

Specifications SHOULD NOT refer to specific component RFCs.

The "BCP" nomenclature refers to the current set of RFCs that form the "best current practice". At the time this document was published, [ BCP47 ] consisted of two RFCs: Tags for the Identification of Languages [ RFC5646 ] and Matching of Language Tags [ RFC4647 ].

Formulations such as " RFC 5646 or its successor " MAY be used, but only in cases where the specific document version is necessary.

While this style of reference was once popular, using the BCP reference is more accurate. Since the grammar of language tags has been fixed since [ RFC4646 ], referring to the BCP will not incur additional compliance risk to most implementations.

Specifications MUST NOT reference obsolete versions of [ BCP47 ], such as [ RFC1766 ] or [ RFC3066 ].

Specifications that need to preserve compatibility with obsolete versions of [ BCP47 ] MUST reference the production obs-language-tag in [ BCP47 ].

Beginning with [ RFC4646 ], [ BCP47 ] defined a more complex, machine-readable syntax for language tags. Some specifications might desire or require compatibility with the older language tag grammar found in previous versions of BCP47 (specifically [ RFC1766 ] and [ RFC3066 ]). This grammar was more permissive and is described in [ BCP47 ] as the ABNF production obs-language-tag. [ RFC4646 ], which introduced the current grammar for language tags, is itself obsolete.

Specifications MAY reference registered extensions to [ BCP47 ] as necessary.

In particular, [ RFC6067 ] defines the BCP 47 Extension U , also known as "Unicode Locales". This extension to [ BCP47 ] provides additional subtag sequences for selecting specific locale variations.

Specifications SHOULD require that language tags be well-formed .

Specifications MAY require that language tags be valid .

Checking if a tag is valid requires access to or a copy of the registry plus additional runtime logic. While content authors are advised to choose, generate, and exchange only valid values, language tag matching and other common language tag operations are designed so that validity checking is not needed. Features or functions that need to understand the specific semantic content of subtags are the main reason that a specification would normatively require valid tags as part of the protocol or document format.
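
To give a sense of the extra machinery that validity checking involves, the Python sketch below loads subtags from registry-style records and checks a tag against them. Everything here is illustrative: REGISTRY_EXCERPT is a tiny hand-written excerpt in the style of the registry's record format (not the real file), the function names are hypothetical, and only tags of the shape language[-script][-region] are handled. Folded lines, subtag ranges, variants, and extensions are ignored.

```python
# Hand-written excerpt (assumption) imitating the registry record format.
REGISTRY_EXCERPT = """\
%%
Type: language
Subtag: de
Description: German
%%
Type: language
Subtag: en
Description: English
%%
Type: script
Subtag: Latn
Description: Latin
%%
Type: region
Subtag: DE
Description: Germany
%%
Type: region
Subtag: US
Description: United States
"""


def load_subtags(registry_text: str) -> dict:
    """Map each record type (language, script, region) to its subtags."""
    known = {}
    for record in registry_text.split("%%"):
        fields = {}
        for line in record.splitlines():
            name, _, value = line.partition(":")
            if value:
                fields[name.strip()] = value.strip()
        if "Type" in fields and "Subtag" in fields:
            known.setdefault(fields["Type"], set()).add(fields["Subtag"].lower())
    return known


def is_valid_simple(tag: str, known: dict) -> bool:
    """Check language[-script][-region]; anything else returns False."""
    subtags = tag.lower().split("-")
    if subtags[0] not in known.get("language", set()):
        return False
    rest = subtags[1:]
    if rest and len(rest[0]) == 4 and rest[0].isalpha():
        if rest[0] not in known.get("script", set()):
            return False
        rest = rest[1:]
    if rest and ((len(rest[0]) == 2 and rest[0].isalpha())
                 or (len(rest[0]) == 3 and rest[0].isdigit())):
        if rest[0] not in known.get("region", set()):
            return False
        rest = rest[1:]
    return not rest                  # leftover subtags are not handled here


known = load_subtags(REGISTRY_EXCERPT)
print(is_valid_simple("de-Latn-DE", known), is_valid_simple("de-XX", known))
```

Even this reduced version needs a data source and position-aware logic, which is why matching operations are designed to work without it.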

Content authors SHOULD choose language tags that are canonical Unicode locale identifiers .

The additional content restrictions and normalization steps found in Section 3 of [ LDML ] provide for better interoperability and consistency than that afforded by [ BCP47 ] directly.

Implementations SHOULD only emit language tags that are canonical Unicode locale identifiers and SHOULD normalize language tags that they consume using the rules for producing canonical tags.

As above, the additional content restrictions and normalization steps found in Section 3 of [ LDML ] provide for better interoperability and consistency than that afforded by [ BCP47 ] directly. This best practice should not be interpreted as meaning that implementations need to support, generate, process, or understand either of [ CLDR ]'s extensions.
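
Full canonicalization depends on the rules and data in [ LDML ], but one small part of producing consistent tags, the conventional casing of subtags described in [ RFC5646 ], can be done without any data. The Python sketch below covers only that casing step; format_case is a hypothetical name, and the function does not replace deprecated subtags, reorder keywords, or perform any other canonicalization.

```python
def format_case(tag: str) -> str:
    """Apply conventional subtag casing (not full canonicalization).

    Everything is lowercase, except that before the first singleton a
    two-letter subtag (other than the first) is uppercased and a
    four-letter subtag (other than the first) is titlecased.
    """
    formatted = []
    after_singleton = False
    for index, subtag in enumerate(tag.split("-")):
        subtag = subtag.lower()
        if index > 0 and not after_singleton:
            if len(subtag) == 2:
                subtag = subtag.upper()          # region, e.g. "CN"
            elif len(subtag) == 4:
                subtag = subtag.capitalize()     # script, e.g. "Hans"
        if len(subtag) == 1:
            after_singleton = True               # extension or private use
        formatted.append(subtag)
    return "-".join(formatted)


print(format_case("ZH-hans-cn"))
print(format_case("HI-in-U-NU-DEVA"))
```

Case carries no meaning in a language tag, so this step changes only the appearance of the tag, never what it matches.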

Specifications SHOULD require content authors use valid language tags.

Content validators SHOULD check if content uses valid language tags where feasible.

Specifications SHOULD NOT reference [ BCP47 ]'s underlying standards that contribute to the IANA Language Subtag Registry , such as ISO639, ISO15924, ISO3166, or UN M.49.

Some standards might directly consume one of [ BCP47 ]'s contributory standards, in which case a reference is wholly appropriate. However, in most cases, the purpose of the reference is to specify a valid list of codes and their meanings. [ BCP47 ]'s subtag registry is stabilized and resolves ambiguity in a number of useful ways and so should be the preferred source for this type of reference.

Applications that provide language information as part of URIs (e.g. in the realm of RDF) SHOULD use [ BCP47 ].

Currently, URIs expressing language information often use values from parts of ISO 639. This leads to situations in which there are ambiguities about what the proper value should be, e.g. for German de from ISO 639-1 or ger from ISO 639-2. By using BCP 47 and its language subtag registry, such ambiguities can be avoided, e.g. for German, the registry contains only de.

Specifications that define language tag matching or language negotiation MUST specify whether language ranges used are a basic language range or an extended language range .

Specifications that define language tag matching MUST specify whether a matching operation returns a single result ( lookup as defined in [ RFC4647 ]), or a possibly-empty (zero or more) set of results ( filtering as defined in [ RFC4647 ]).
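
The two result shapes can be seen side by side in the Python sketch below, which follows the basic filtering and lookup schemes of [ RFC4647 ] in simplified form. The names filter_tags and lookup, the list of available tags, and the default value are all hypothetical choices made for illustration.

```python
def filter_tags(priority_list: list, available: list) -> list:
    """Filtering: return every available tag matched by any range."""
    def matches(language_range: str, tag: str) -> bool:
        r, t = language_range.lower(), tag.lower()
        return r == "*" or t == r or t.startswith(r + "-")

    return [tag for tag in available
            if any(matches(r, tag) for r in priority_list)]


def lookup(priority_list: list, available: list, default: str) -> str:
    """Lookup: return exactly one tag, falling back to a default."""
    by_lower = {tag.lower(): tag for tag in available}
    for language_range in priority_list:
        if language_range == "*":
            continue                          # a bare wildcard selects nothing
        subtags = language_range.lower().split("-")
        while subtags:
            candidate = "-".join(subtags)
            if candidate in by_lower:
                return by_lower[candidate]
            subtags.pop()                     # truncate and try again
            while subtags and len(subtags[-1]) == 1:
                subtags.pop()                 # drop a trailing singleton too
    return default


available = ["de", "de-AT", "en", "fr-CA"]
print(filter_tags(["de"], available))         # a list, possibly empty
print(lookup(["de-CH", "fr"], available, "en"))  # always a single tag
```

Filtering suits cases such as listing all matching documents, while lookup suits cases where exactly one resource must be chosen, which is why a specification needs to say which one it means.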

Specifications that define language tag matching MUST specify the matching algorithms available and the selection mechanism.

For example, JavaScript internationalization [ ECMA-402 ] and [ CLDR ] provide a "best fit" algorithm which can be tailored by implementers.

Specifications SHOULD NOT restrict the length of language tags or permit or encourage the removal of extensions.

Specifications that present data values in a document format SHOULD require that data is formatted according to the language of the surrounding content.

When data values are presented to the user as part of a document or application, the document or application forms the "context" where the data is being viewed. Content authors or application developers need a way to make the data values seem like a natural part of the experience and need a way to control the presentation. This is indicated by the language tag of the context in which the content appears: usually enabled implementations interpret the tag as a locale in order to accomplish this. Using the runtime locale or localization of the user-agent as the locale for presenting data values should only be a last resort.

Specifications that present forms or receive input of data values in a document format or application SHOULD require that the values be presented to the user localized in the format of the language of the content or markup immediately surrounding the value.

Specifications that present, exchange, or allow the input of data values MUST use a locale-neutral format for storage and interchange.

Implementations SHOULD present data values in a document format or application using a format consistent with the language of the surrounding content and are encouraged to provide controls which are localized to the same locale for input or editing.

Users expect form fields and other data inputs to use a presentation for data values that is consistent with the document or application where the values appear. Users usually expect their input to match the document's context rather than that of the user-agent or operating environment, so input validation, prompting, and controls should also be consistent with the content. This gives content authors the ability to create a wholly localized customer experience and is generally in keeping with customer expectations.

C. References

C.1 Informative references

[BCP47]: Tags for Identifying Languages . A. Phillips; M. Davis. IETF. September 2009. IETF Best Current Practice. URL: https://tools.ietf.org/html/bcp47
[CLDR]: Common Locale Data Repository . Unicode. URL: https://cldr.unicode.org
[CSS3-SELECTORS]: Selectors Level 3 . Tantek Çelik; Elika Etemad; Daniel Glazman; Ian Hickson; Peter Linss; John Williams. W3C. 6 November 2018. W3C Recommendation. URL: https://www.w3.org/TR/selectors-3/
[ECMA-402]: ECMAScript Internationalization API Specification . Ecma International. URL: https://tc39.es/ecma402/
[ECMASCRIPT]: ECMAScript Language Specification . Ecma International. URL: https://tc39.es/ecma262/
[HTML]: HTML Standard . Anne van Kesteren; Domenic Denicola; Ian Hickson; Philip Jägenstedt; Simon Pieters. WHATWG. Living Standard. URL: https://html.spec.whatwg.org/multipage/
[INTERNATIONAL-SPECS]: Internationalization Best Practices for Spec Developers . Marcos Caceres. W3C. 29 May 2020. W3C Working Draft. URL: https://www.w3.org/TR/international-specs/
[JSON]: The application/json Media Type for JavaScript Object Notation (JSON) . D. Crockford. IETF. July 2006. Informational. URL: https://tools.ietf.org/html/rfc4627
[LDML]: Unicode Technical Standard #35: Locale Data Markup Language . Mark Davis; CLDR Contributors. Unicode. URL: https://www.unicode.org/reports/tr35/
[RFC1766]: Tags for the Identification of Languages . H. Alvestrand. IETF. March 1995. Proposed Standard. URL: https://tools.ietf.org/html/rfc1766
[RFC2119]: Key words for use in RFCs to Indicate Requirement Levels . S. Bradner. IETF. March 1997. Best Current Practice. URL: https://tools.ietf.org/html/rfc2119
[RFC2616]: Hypertext Transfer Protocol -- HTTP/1.1 . R. Fielding; J. Gettys; J. Mogul; H. Frystyk; L. Masinter; P. Leach; T. Berners-Lee. IETF. June 1999. Draft Standard. URL: https://tools.ietf.org/html/rfc2616
[RFC3066]: Tags for the Identification of Languages . H. Alvestrand. IETF. January 2001. Best Current Practice. URL: https://tools.ietf.org/html/rfc3066
[RFC3282]: Content Language Headers . H. Alvestrand. IETF. May 2002. Draft Standard. URL: https://tools.ietf.org/html/rfc3282
[RFC4646]: Tags for Identifying Languages . A. Phillips; M. Davis. IETF. September 2006. Best Current Practice. URL: https://tools.ietf.org/html/rfc4646
[RFC4647]: Matching of Language Tags . A. Phillips; M. Davis. IETF. September 2006. Best Current Practice. URL: https://tools.ietf.org/html/rfc4647
[RFC5646]: Tags for Identifying Languages . A. Phillips, Ed.; M. Davis, Ed. IETF. September 2009. Best Current Practice. URL: https://tools.ietf.org/html/rfc5646
[RFC6067]: BCP 47 Extension U . M. Davis; A. Phillips; Y. Umaoka. IETF. December 2010. Informational. URL: https://tools.ietf.org/html/rfc6067
[RFC6497]: BCP 47 Extension T - Transformed Content . M. Davis; A. Phillips; Y. Umaoka; C. Falk. IETF. February 2012. Informational. URL: https://tools.ietf.org/html/rfc6497
[STRING-META]: Strings on the Web: Language and Direction Metadata . Addison Phillips; Richard Ishida. W3C. 11 June 2019. W3C Working Draft. URL: https://www.w3.org/TR/string-meta/
[WS-I18N-REQ]: Requirements for the Internationalization of Web Services . Addison Phillips. W3C. 16 November 2004. W3C Note. URL: https://www.w3.org/TR/ws-i18n-req/
[WS-I18N-SCENARIOS]: Web Services Internationalization Usage Scenarios . Debasish Banerjee; Martin Dürst; Michael McKenna; Addison Phillips; Takao Suzuki; Tex Texin; Mary Trumble; Andrea Vine; Kentaro Noji et al. W3C. 30 July 2004. W3C Note. URL: https://www.w3.org/TR/ws-i18n-scenarios/
[XML10]: Extensible Markup Language (XML) 1.0 (Fifth Edition) . Tim Bray; Jean Paoli; Michael Sperberg-McQueen; Eve Maler; François Yergeau et al. W3C. 26 November 2008. W3C Recommendation. URL: https://www.w3.org/TR/xml/
[XMLSCHEMA11-2]: W3C XML Schema Definition Language (XSD) 1.1 Part 2: Datatypes . David Peterson; Sandy Gao; Ashok Malhotra; Michael Sperberg-McQueen; Henry Thompson; Paul V. Biron et al. W3C. 5 April 2012. W3C Recommendation. URL: https://www.w3.org/TR/xmlschema11-2/
[XSL10]: Extensible Stylesheet Language (XSL) Version 1.0 . Sharon Adler; Anders Berglund; Jeffrey Caruso; Stephen Deach; Tony Graham; Paul Grosso; Eduardo Gutentag; Alex Milowski; Scott Parnell; Jeremy Richman; Steve Zilles et al. W3C. 15 October 2001. W3C Recommendation. URL: https://www.w3.org/TR/xsl/

W3C Editor's Draft 16 July 2020
