ICANN IDN Glossary
Internationalized Domain Names – Glossary
To ensure that discussions regarding IDNs
take place in a consistent manner, ICANN has published an IDN Glossary.
The glossary terms can be used freely, and the glossary is expected to be expanded.
If you have suggestions for additions and/or changes to the glossary, please submit
these to firstname.lastname@example.org. Comments
will be posted publicly in the discussion forum at http://forum.icann.org/lists/idn-glossary/.
Historically, domain names on the Internet were restricted to using a limited
set of ASCII characters (i.e. a-z, 0-9 and "-"). However, with the increasing
use of the Internet in all regions and by diverse linguistic groups of the
world, the demand for multilingual domain names has become more intense. Various
acronyms are used widely in communications around internationalizing the domain
name space. Explanations for many of these acronyms are provided below to help
make this topic simpler to understand.
ACE (ASCII Compatible Encoding)
ACE is a system for encoding Unicode so each character can be transmitted using only a limited set of ASCII characters (i.e. a-z, 0-9 and "-"). This is used because applications that use the DNS protocol may not reliably handle other values.
ASCII (American Standard Code for Information Interchange)
ASCII is a common numerical code for computers and other devices that work with text. Computers can only understand numbers, so an ASCII code is the numerical representation of a character such as ‘a’ or ‘@’. When mentioned in relation to domain names or strings, ASCII refers to the fact that before internationalization only the letters a-z, digits 0-9, and the hyphen "-", were allowed in domain names.
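As a small illustration of the "numerical representation" idea, the following Python sketch prints the ASCII code of a few characters and checks whether a string stays within the ASCII range. The helper name is_ascii_only is an assumption made for this example, not part of any standard.

```python
def is_ascii_only(text: str) -> bool:
    """Return True if every character has a code in the 7-bit ASCII range (0-127)."""
    return all(ord(ch) < 128 for ch in text)


# ord() gives the numeric code of a character; chr() goes the other way.
for ch in ("a", "@", "-"):
    print(repr(ch), "->", ord(ch))

print(is_ascii_only("icann"))   # all letters a-z
print(is_ascii_only("東京"))     # contains non-ASCII characters
```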
Character
For the purposes of discussing IDNs, a “character” can best
be seen as the basic graphic unit of a writing system, which is a script
plus a set of rules determining how it is used for representing a specific
language. However, domain labels do not convey any intrinsic information
about the language with which they are intended to be associated, although
they do reveal the script on which they are based. This language dependency
can unfortunately not be eliminated by restricting the definition to script
because in several cases (see examples below) languages that share the
same script differ in the way they regard its individual elements. The term
character can therefore not be defined independently of the context in
which it is used.
In phonetically based writing systems, a character is typically a letter or
represents a syllable, and in ideographic systems (or alternatively, pictographic
or logographic systems) a character may represent a concept or word.
The following examples are intended to illustrate that the definition of
a character is at least two-fold, one part being a linguistic base unit and the
other being the associated code point.
U-label 酒 : Jiu; the Chinese word for ‘alcoholic beverage’; Unicode
code point is U+9152 (also referred to as: CJK UNIFIED IDEOGRAPH-9152); A-label
U-label 北京 : the Chinese word for ‘Beijing’;
Unicode code points are U+5317 U+4EAC; A-label is xn--1lq90i
U-label 東京 : the Japanese word for ‘Tokyo’;
Unicode code points are U+6771 U+4EAC; A-label is xn--1lqs71d
U-label ایكوم : Farsi acronym for ICOM; Unicode
code points are U+0627 U+06CC U+0643 U+0648 U+0645; A-label is xn--mgb0dgl27d.
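The two-fold nature of a "character" can be sketched in Python with the standard unicodedata module. The helper describe is a hypothetical name used only for this illustration. The second part uses an accented Latin letter, an example not taken from the glossary, to show that what a reader sees as one character may be stored as more than one code point.

```python
import unicodedata


def describe(text: str) -> list[tuple[str, int, str]]:
    """Pair each code point in text with its number and its Unicode name."""
    return [(ch, ord(ch), unicodedata.name(ch, "<unnamed>")) for ch in text]


for ch, number, name in describe("酒"):
    print(ch, hex(number), name)

# One visible character, two possible encodings as code points:
decomposed = "e\u0301"                      # 'e' + COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), len(composed))       # number of code points in each form
```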
DNS (Domain Name System)
The DNS makes using the Internet easier by allowing a familiar string of letters
(the "domain name") to be used instead of the arcane IP address. So instead
of typing 18.104.22.168, you can type www.internic.net.
IDNA (Internationalized Domain Names in Applications)
IDNA is a protocol defined in RFC 3490 by the Internet Engineering Task Force
(http://www.ietf.org) that makes it possible
for applications to handle domain names with non-ASCII characters. IDNA converts
domain name strings with non-ASCII characters to ASCII domain name labels that
applications that use the DNS can accurately understand. Not all characters
used in the world’s languages will be available for use in domain names. Hence
IDNA is not able to convert all such characters into ASCII labels.
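A minimal sketch of this conversion, assuming Python's built-in "idna" codec (which implements the RFC 3490 version of IDNA), is shown below. The function names and the ".test" domain are illustrative assumptions. Names containing characters that the protocol does not permit raise UnicodeError, which the loop catches.

```python
def domain_to_ascii(domain: str) -> str:
    """Convert each label of a domain name to its ASCII form (RFC 3490 rules)."""
    return domain.encode("idna").decode("ascii")


def domain_to_unicode(domain: str) -> str:
    """Convert an ASCII domain name back to its Unicode form."""
    return domain.encode("ascii").decode("idna")


for name in ("www.internic.net", "北京.test"):
    try:
        print(name, "->", domain_to_ascii(name))
    except UnicodeError as exc:
        # Raised when a label cannot be represented under IDNA.
        print(name, "cannot be converted:", exc)
```

Note that plain ASCII names pass through unchanged, so the same function can be applied to every domain name an application handles.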
IDN (Internationalized Domain Name)
IDNs are domain names represented by local language characters. Such domain
names could contain characters with diacritical marks as required by many European
languages, or characters from non-Latin scripts (for example, Arabic or Chinese).
IDNs make the domain name label as it is displayed and viewed by the end
user different from that transmitted in the DNS. To avoid confusion, the following
terminology is used:
The A-label is what is transmitted in the DNS protocol and
this is the ASCII-compatible (ACE) form of an IDNA string; for example "xn--11b5bs1di".
The U-label is what should be displayed to the user and is the representation
of the Internationalized Domain Name (IDN) in Unicode; for example "परीका" ("test"
in Hindi, Devanagari script). Lastly, the LDH-label strictly
refers to an all-ASCII label that obeys the "hostname" (LDH) conventions
and that is not an IDN; for example "icann" in the domain name "icann.org".
(The above label definitions are extracted from: http://www.ietf.org/internet-drafts/draft-klensin-idnabis-issues-01.txt)
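The three label types can be told apart roughly as follows. This is a simplified sketch, not a validator: the function classify_label is a hypothetical helper that only looks at whether a label is ASCII and whether it carries the "xn--" prefix, and it does not check that the label is well formed.

```python
def classify_label(label: str) -> str:
    """Roughly classify a single label as U-label, A-label, or LDH-label."""
    if not label.isascii():
        return "U-label"      # contains non-ASCII characters, shown to users
    if label.lower().startswith("xn--"):
        return "A-label"      # ASCII-compatible encoded form sent in the DNS
    return "LDH-label"        # ordinary all-ASCII label (assumed well formed)


for label in ("xn--11b5bs1di", "東京", "icann"):
    print(label, "->", classify_label(label))
```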
IDN SLDs or IDN 2LDs
Usually a reference for domain names with local characters at the second
level, while the top level remains in ASCII-only characters. For example: [παράδειγμα.test]
("example.test" in Greek).
IDN TLDs
Usually the short reference for internationalized top-level domains, thus
allowing the entire domain name to be represented by local characters. For
example: [실례.테스트] ("example.test" in Korean).
Label
A label is an individual part of a domain name. Labels are usually shown separated by dots; for example, the domain name “example.com” is composed of two labels: “example” and “com”.
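In code, splitting a domain name into its labels is a matter of splitting on the dots. The sketch below uses a hypothetical helper, split_labels, and assumes that a trailing dot (sometimes written to mark the DNS root) should be ignored.

```python
def split_labels(domain: str) -> list[str]:
    """Split a domain name into its labels, ignoring any trailing root dot."""
    return domain.rstrip(".").split(".")


print(split_labels("example.com"))
print(split_labels("www.internic.net."))
```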
Languages | Scripts | Alphabets
Languages are used by speech communities. Scripts are used to write down information in the various languages and this is done by using the corresponding alphabets or alternative writing systems.
LDH (Letter, Digit, Hyphen)
The hostname convention defined in RFC 952 (later modified by RFC 1123) was used by top-level domain Registries before internationalization. This meant that domain names could only practically contain the letters a-z, digits 0-9 and the hyphen "-". The term "LDH code points" refers to this subset. With the introduction of IDNs this rule is no longer relevant for all domain names although with the use of IDNA, what appears in the DNS remains LDH.
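The LDH rule can be expressed as a short pattern. The sketch below assumes the RFC 1123 reading of the convention: letters, digits, and hyphens only, with no hyphen at the start or end of a label. The names LDH_LABEL and is_ldh_label are chosen for this example, and other DNS constraints such as length limits are deliberately left out.

```python
import re

# One letter or digit, optionally followed by letters, digits, or hyphens,
# and ending with a letter or digit.
LDH_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?")


def is_ldh_label(label: str) -> bool:
    """Return True if label uses only LDH code points in a permitted arrangement."""
    return LDH_LABEL.fullmatch(label) is not None


for label in ("icann", "xn--1lqs71d", "-bad", "東京"):
    print(label, "->", is_ldh_label(label))
```

The A-label passes this check, which illustrates the closing point above: with IDNA, what appears in the DNS remains LDH.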
Punycode
Punycode is the LDH-compatible encoding algorithm described in Internet standard RFC 3492, and it is in use today. It is the method used to encode IDNs into sequences of LDH ASCII characters so that applications using the Domain Name System (DNS) can understand and manage the names. The intention is that domain name registrants and users will never see this encoded form of a domain name. Its sole purpose is to allow the DNS to resolve, for example, a URL containing local characters. For examples, see A-label under "IDN".
The prefix in a Punycode A-label is always "xn--". Hence this prefix is recommended to be reserved by top-level domain Registries in order to avoid confusion when/if registrations of IDNs are introduced under the respective top level domain.
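The relationship between Punycode and the "xn--" prefix can be sketched with Python's built-in "punycode" codec. The helpers u_to_a and a_to_u are hypothetical names, and the sketch assumes the input label is already normalized; a full IDNA implementation performs additional preparation steps before encoding.

```python
ACE_PREFIX = "xn--"


def u_to_a(u_label: str) -> str:
    """Encode a Unicode label with Punycode and add the ACE prefix."""
    if u_label.isascii():
        return u_label  # nothing to encode
    return ACE_PREFIX + u_label.encode("punycode").decode("ascii")


def a_to_u(a_label: str) -> str:
    """Strip the ACE prefix and decode the remaining Punycode."""
    if not a_label.lower().startswith(ACE_PREFIX):
        return a_label  # not an encoded label
    return a_label[len(ACE_PREFIX):].encode("ascii").decode("punycode")


encoded = u_to_a("東京")
print(encoded)
print(a_to_u(encoded))
```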
The Unicode Consortium
A not-for-profit organization founded to develop, extend and promote use
of the Unicode standard. For more information, please visit http://www.unicode.org.
Unicode
Unicode is a commonly used single encoding scheme that provides a unique number for each character across a wide variety of languages and scripts. The Unicode standard contains tables that list the "code points" (unique numbers) for each local character identified. These tables continue to expand as more and more characters are digitized.
In Unicode, characters are assigned codes that uniquely define every character
in many of the scripts in the world. These “code points” are unique numbers for a character or some character aspect such as an accent mark or ligature. Unicode supports more than a million code points, which are written with a “U” followed by a plus sign and the unique number in hexadecimal notation; for example, the word “Hello” is written U+0048 U+0065 U+006C U+006C U+006F.
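The "U+" notation described above can be reproduced with a few lines of Python. The helper to_code_points is a name chosen for this sketch; sys.maxunicode is the largest code point Python supports, which matches the upper limit of the Unicode code space.

```python
import sys


def to_code_points(text: str) -> str:
    """Write each character as "U+" followed by its number in hexadecimal."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


print(to_code_points("Hello"))
print(sys.maxunicode + 1)   # total number of code points in the code space
```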
URL
URL is an acronym for "Uniform Resource Locator", a string that describes the address of documents and other resources on the Internet.
Defined by the IETF in RFC 2396, a URL is composed of two parts separated by a colon (":"). The first part of the address indicates what protocol to use (e.g., http or ftp), and the second part specifies the IP address or the domain name where the resource is located.
UTF-8 (8-bit Unicode Transformation Format)
UTF-8 is a system for encoding Unicode
so each character can be transmitted using 8-bit numerical values. This is
commonly used as 8-bit data transmission is prevalent on the Internet.
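As a brief illustration, the Python sketch below encodes text as UTF-8 and lists the resulting 8-bit values. The helper utf8_bytes is a name chosen for this example. It shows that an ASCII letter occupies one byte while a character such as 酒 (U+9152) occupies three.

```python
def utf8_bytes(text: str) -> list[int]:
    """Return the sequence of 8-bit values that encode text in UTF-8."""
    return list(text.encode("utf-8"))


for sample in ("a", "酒"):
    values = utf8_bytes(sample)
    print(sample, [f"0x{v:02X}" for v in values])
```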