From Wikipedia, the free encyclopedia
Microsoft was one of the first companies to implement Unicode in its products. Windows NT was the first operating system to use "wide characters" in system calls. It initially used the (now obsolete) UCS-2 encoding scheme, then upgraded to the variable-width encoding UTF-16 starting with Windows 2000, allowing representation of the additional planes with surrogate pairs. However, Microsoft did not add support for UTF-8 to its API until May 2019.
Before 2019, Microsoft emphasized UTF-16 (i.e. the -W API), but has since recommended UTF-8 (at least in some cases),[1] on Windows and Xbox (and in other of its products), even stating "UTF-8 is the universal code page for internationalization [and] UTF-16 [..] a unique burden that Windows places on code that targets multiple platforms. [..] Windows [is] moving forward to support UTF-8 to remove this unique burden [resulting] in fewer internationalization issues in apps and games".[2]
A large amount of Microsoft documentation uses the word "Unicode" to refer specifically to the UTF-16 encoding; in this outdated usage, anything else, including UTF-8, is not "Unicode" (even though UTF-8 and UTF-16 are both Unicode encodings, or "transformation formats", according to the Unicode Standard).
In various Windows families[edit]
Windows NT based systems[edit]
Current Windows versions, and all versions back to Windows XP and the prior Windows NT line (3.x, 4.0), ship with system libraries that support two types of string encoding: 16-bit "Unicode" (UTF-16 since Windows 2000) and a (sometimes multibyte) encoding called the "code page" (often, if imprecisely, referred to as the ANSI code page). 16-bit functions have names suffixed with 'W' (from "wide"), such as SetWindowTextW. Code-page-oriented functions use the suffix 'A' for "ANSI", such as SetWindowTextA (some other conventions were used for APIs copied from other systems, such as _wfopen/fopen or wcslen/strlen). This split was necessary because many languages, including C, did not provide a clean way to pass both 8-bit and 16-bit strings to the same function.
Microsoft attempted to support Unicode "portably" by providing a "UNICODE" compiler switch that redirects unsuffixed "generic" calls from the 'A' to the 'W' interface and converts all string constants to "wide" UTF-16 versions.[3][4] This does not actually work for UTF-8 because the switch does not translate UTF-8 outside of string constants, with the result that code that attempts to open files with 8-bit strings simply does not compile.[citation needed]
Earlier, and independently of the "UNICODE" switch, Windows also provided the Multibyte Character Sets (MBCS) API switch.[5] It changes functions that do not work in MBCS, such as strrev, to MBCS-aware ones, such as _mbsrev.[6][7]
Windows CE[edit]
In (the now discontinued) Windows CE, UTF-16 was used almost exclusively, with the ‘A’ API mostly missing.[8] A limited set of ANSI API is available in Windows CE 5.0, for use on a reduced set of locales that may be selectively built onto the runtime image.[9]
Windows 9x[edit]
In 2001, Microsoft released a special supplement for its old Windows 9x systems. It includes a dynamic-link library, unicows.dll (only 240 KB), containing the 16-bit flavor (the ones with the letter W on the end) of all the basic Windows API functions. It is merely a translation layer: SetWindowTextW simply converts its input using the current code page and calls SetWindowTextA.
UTF-8[edit]
Microsoft Windows (Windows XP and later) has a code page designated for UTF-8, code page 65001[10] or CP_UTF8. For a long time, it was impossible to set the locale code page to 65001, leaving this code page available only for (a) explicit conversion functions such as MultiByteToWideChar and (b) the Win32 console command chcp 65001, which translates stdin/stdout between UTF-8 and UTF-16. This meant that "narrow" functions, in particular fopen (which opens files), could not be called with UTF-8 strings; in fact, there was no way to open all possible files using fopen no matter what the locale was set to or what bytes were put in the string, as none of the available locales could produce all possible UTF-16 characters. This problem also applied to all other APIs that take or return 8-bit strings, including Windows ones such as SetWindowText.
Programs that wanted to use UTF-8, in particular code intended to be portable to other operating systems, needed a workaround for this deficiency. The usual workaround was to add new file-opening functions that convert UTF-8 to UTF-16 using MultiByteToWideChar and call the "wide" function instead of fopen.[11] Dozens of multi-platform libraries added wrapper functions to do this conversion on Windows (and pass UTF-8 through unchanged on other systems); one example is a proposed addition to Boost, Boost.Nowide.[12] Another popular workaround was to convert the name to its 8.3 filename equivalent, which is necessary if the fopen call is inside a library. None of these workarounds is considered good, as they require changes to code that already works on non-Windows systems.
In April 2018 (or possibly November 2017[13]), with insider build 17035 (nominal build 17134) for Windows 10, a "Beta: Use Unicode UTF-8 for worldwide language support" checkbox appeared for setting the locale code page to UTF-8.[a] This allows calling "narrow" functions, including fopen and SetWindowTextA, with UTF-8 strings. However, this is a system-wide setting, and a program cannot assume it is set.
In May 2019, Microsoft added the ability for a program to set the code page to UTF-8 itself,[1][14] allowing programs written to use UTF-8 to be run by non-expert users.
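For packaged apps this opt-in goes in the appxmanifest; the fragment below follows the shape described in Microsoft's documentation cited above for the fusion (application) manifest of an unpackaged app, and should be checked against the current docs before use:

```xml
<?xml version="1.0" encoding="utf-8"?>
<assembly manifestVersion="1.0" xmlns="urn:schemas-microsoft-com:asm.v1">
  <application>
    <windowsSettings>
      <activeCodePage xmlns="http://schemas.microsoft.com/SMI/2019/WindowsSettings">UTF-8</activeCodePage>
    </windowsSettings>
  </application>
</assembly>
```

With this manifest in place, GetACP() reports CP_UTF8 for the process on Windows 1903 and later, so the 'A' functions accept and return UTF-8.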
As of 2019, Microsoft recommends that programmers use UTF-8 (e.g. instead of any other 8-bit encoding)[1] on Windows and Xbox, and may be recommending it instead of UTF-16, even stating "UTF-8 is the universal code page for internationalization [and] UTF-16 [..] is a unique burden that Windows places on code that targets multiple platforms."[2] Microsoft does appear to be transitioning to UTF-8, stating that it previously emphasized its alternative; in Windows 11 some system files are required to use UTF-8 and do not require a byte order mark.[15] Notepad can now recognize UTF-8 without a byte order mark and can be told to write UTF-8 without one.[citation needed] Some other Microsoft products use UTF-8 internally, including Visual Studio[citation needed] and SQL Server 2019, with Microsoft claiming a 35% speed increase from the use of UTF-8 and a "nearly 50% reduction in storage requirements."[16]
Programming platforms[edit]
Microsoft's compilers often fail at producing UTF-8 string constants from UTF-8 source files. The most reliable method is to turn off UNICODE, not mark the input file as UTF-8 (i.e. not use a BOM), and arrange the string constants to contain the UTF-8 bytes. If a BOM is added, a Microsoft compiler will interpret the strings as UTF-8, convert them to UTF-16, then convert them back into the current locale, destroying the UTF-8.[17] Without a BOM, and using a single-byte locale, Microsoft compilers leave the bytes in a quoted string unchanged. On modern systems, setting the code page to UTF-8 helps considerably, but invalid byte sequences are still not preserved (using \x escapes can work around this).
See also[edit]
- Bush hid the facts, a text encoding mojibake
Notes[edit]
- ^ Found under Control Panel, "Region" entry, "Administrative" tab, "Change system locale" button.
References[edit]
- ^ a b c "Use UTF-8 code pages in Windows apps". learn.microsoft.com. Retrieved 2020-06-06.
As of Windows version 1903 (May 2019 update), you can use the ActiveCodePage property in the appxmanifest for packaged apps, or the fusion manifest for unpackaged apps, to force a process to use UTF-8 as the process code page. […] CP_ACP equates to CP_UTF8 only if running on Windows version 1903 (May 2019 update) or above and the ActiveCodePage property described above is set to UTF-8. Otherwise, it honors the legacy system code page. We recommend using CP_UTF8 explicitly.
- ^ a b "UTF-8 support in the Microsoft Game Development Kit (GDK) — Microsoft Game Development Kit". learn.microsoft.com. 19 August 2022. Retrieved 2023-03-05.
By operating in UTF-8, you can ensure maximum compatibility [..] Windows operates natively in UTF-16 (or WCHAR), which requires code page conversions by using MultiByteToWideChar and WideCharToMultiByte. This is a unique burden that Windows places on code that targets multiple platforms. [..] The Microsoft Game Development Kit (GDK) and Windows in general are moving forward to support UTF-8 to remove this unique burden of Windows on code targeting or interchanging with multiple platforms and the web. Also, this results in fewer internationalization issues in apps and games and reduces the test matrix that’s required to get it right.
- ^ "Unicode in the Windows API". Retrieved 7 May 2018.
- ^ "Conventions for Function Prototypes (Windows)". MSDN. Retrieved 7 May 2018.
- ^ "Support for Multibyte Character Sets (MBCSs)". Retrieved 2020-06-15.
- ^ "Double-byte Character Sets". MSDN. 2018-05-31. Retrieved 2020-06-15.
our applications use DBCS Windows code pages with the "A" versions of Windows functions.
- ^ _strrev, _wcsrev, _mbsrev, _mbsrev_l Microsoft Docs
- ^ "Differences Between the Windows CE and Windows NT Implementations of TAPI". MSDN. 28 August 2006. Retrieved 7 May 2018.
Windows CE is Unicode-based. You might have to recompile source code that was written for a Windows NT-based application.
- ^ "Code Pages (Windows CE 5.0)". Microsoft Docs. 14 September 2012. Retrieved 7 May 2018.
- ^ "Code Page Identifiers (Windows)". msdn.microsoft.com. 7 January 2021.
- ^ "UTF-8 in Windows". Stack Overflow. Retrieved 1 July 2011.
- ^ "Boost.Nowide". GitHub.
- ^ "Windows 10 Insider Preview Build 17035 Supports UTF-8 as ANSI". Hacker News. Retrieved 7 May 2018.
- ^ "Windows 10 1903 and later versions finally support UTF-8 with the A forms of the Win32 functions".
- ^ "Customize the Windows 11 Start menu". docs.microsoft.com. Retrieved 2021-06-29.
Make sure your LayoutModification.json uses UTF-8 encoding.
- ^ "Introducing UTF-8 support for SQL Server". techcommunity.microsoft.com. 2019-07-02. Retrieved 2021-08-24.
For example, changing an existing column data type from NCHAR(10) to CHAR(10) using an UTF-8 enabled collation, translates into nearly 50% reduction in storage requirements. [..] In the ASCII range, when doing intensive read/write I/O on UTF-8, we measured an average 35% performance improvement over UTF-16 using clustered tables with a non-clustered index on the string column, and an average 11% performance improvement over UTF-16 using a heap.
- ^ UTF-8 Everywhere FAQ: How do I write a UTF-8 string literal in my C++ code?
External links[edit]
- "Unicode". MSDN. Microsoft. Retrieved November 10, 2016.
[Image caption: The first 2^16 Unicode code points. The stripe of solid gray near the bottom comprises the surrogate halves used by UTF-16; the white region below the stripe is the Private Use Area.]

| Language(s) | International |
|---|---|
| Standard | Unicode Standard |
| Classification | Unicode Transformation Format, variable-width encoding |
| Extends | UCS-2 |
| Transforms / Encodes | ISO 10646 (Unicode) |
UTF-16 (16-bit Unicode Transformation Format) is a character encoding capable of encoding all 1,112,064 valid character code points of Unicode (in fact this number of code points is dictated by the design of UTF-16). The encoding is variable-length, as code points are encoded with one or two 16-bit code units. UTF-16 arose from an earlier obsolete fixed-width 16-bit encoding, now known as UCS-2 (for 2-byte Universal Character Set), once it became clear that more than 2^16 (65,536) code points were needed.[1]
UTF-16 is used internally by systems such as Microsoft Windows, the Java programming language and JavaScript/ECMAScript. It is also often used for plain text and for word-processing data files on Microsoft Windows. It is rarely used for files on Unix-like systems. As of May 2019, Microsoft has reversed its course of emphasizing only UTF-16 for Unicode; for Windows applications it recommends and supports UTF-8, e.g. for Universal Windows Platform (UWP) apps.[2]
UTF-16 is the only web encoding incompatible with ASCII,[3] and it never gained popularity on the web, where it is used by under 0.002% (a little over 1 thousandth of 1 percent) of web pages.[4] UTF-8, by comparison, is used by 97% of all web pages.[5] The Web Hypertext Application Technology Working Group (WHATWG) considers UTF-8 "the mandatory encoding for all [text]" and holds that for security reasons browser applications should not use UTF-16.[6]
History
In the late 1980s, work began on developing a uniform encoding for a "Universal Character Set" (UCS) that would replace earlier language-specific encodings with one coordinated system. The goal was to include all required characters from most of the world's languages, as well as symbols from technical domains such as science, mathematics, and music. The original idea was to replace the typical 256-character encodings, which required 1 byte per character, with an encoding using 65,536 (2^16) values, which would require 2 bytes (16 bits) per character.
Two groups worked on this in parallel, ISO/IEC JTC 1/SC 2 and the Unicode Consortium, the latter representing mostly manufacturers of computing equipment. The two groups attempted to synchronize their character assignments so that the developing encodings would be mutually compatible. The early 2-byte encoding was originally called "Unicode", but is now called "UCS-2".[7]
When it became increasingly clear that 2^16 characters would not suffice,[1] the IEEE introduced a larger 31-bit space and an encoding (UCS-4) that would require 4 bytes per character. This was resisted by the Unicode Consortium, both because 4 bytes per character wasted a lot of memory and disk space, and because some manufacturers were already heavily invested in 2-byte-per-character technology. The UTF-16 encoding scheme was developed as a compromise and introduced with version 2.0 of the Unicode standard in July 1996.[8] It is fully specified in RFC 2781, published in 2000 by the IETF.[9][10]
In the UTF-16 encoding, code points less than 2^16 are encoded with a single 16-bit code unit equal to the numerical value of the code point, as in the older UCS-2. The newer code points greater than or equal to 2^16 are encoded by a compound value using two 16-bit code units. These two 16-bit code units are chosen from the UTF-16 surrogate range 0xD800–0xDFFF, which had not previously been assigned to characters. Values in this range are not used as characters, and UTF-16 provides no legal way to code them as individual code points. A UTF-16 stream, therefore, consists of single 16-bit code units outside the surrogate range for code points in the Basic Multilingual Plane (BMP), and pairs of 16-bit code units within the surrogate range for code points above the BMP.
UTF-16 is specified in the latest versions of both the international standard ISO/IEC 10646 and the Unicode Standard. "UCS-2 should now be considered obsolete. It no longer refers to an encoding form in either 10646 or the Unicode Standard."[11] There are no plans as of 2021 to extend UTF-16 to support a larger number of code points or the code points replaced by surrogates, as this would violate the Unicode Stability Policy with respect to general category or surrogate code points.[12] (Any scheme that remains a self-synchronizing code would require allocating at least one BMP code point to start a sequence, and changing the purpose of a code point is disallowed.)
Description
Each Unicode code point is encoded either as one or two 16-bit code units. How these 16-bit codes are stored as bytes then depends on the ‘endianness’ of the text file or communication protocol.
A "character" may need anywhere from two to fourteen[13] or even more bytes to be recorded. For instance, an emoji flag character takes 8 bytes, since it is "constructed from a pair of Unicode scalar values"[14] (and those values are outside the BMP, requiring 4 bytes each).
U+0000 to U+D7FF and U+E000 to U+FFFF
U+D800 to U+DFFF have a special purpose, see below.
Both UTF-16 and UCS-2 encode code points in this range as single 16-bit code units that are numerically equal to the corresponding code points. These code points in the Basic Multilingual Plane (BMP) are the only code points that can be represented in UCS-2.[citation needed] As of Unicode 9.0, some modern non-Latin Asian, Middle-Eastern, and African scripts fall outside this range, as do most emoji characters.
Code points from U+010000 to U+10FFFF
Code points from the other planes (called Supplementary Planes) are encoded as two 16-bit code units called a surrogate pair, by the following scheme:
| High \ Low | DC00 | DC01 | … | DFFF |
|---|---|---|---|---|
| D800 | 010000 | 010001 | … | 0103FF |
| D801 | 010400 | 010401 | … | 0107FF |
| ⋮ | ⋮ | ⋮ | ⋱ | ⋮ |
| DBFF | 10FC00 | 10FC01 | … | 10FFFF |
- 0x10000 is subtracted from the code point (U), leaving a 20-bit number (U') in the hex number range 0x00000–0xFFFFF. Note that for these purposes, U is defined to be no greater than 0x10FFFF.
- The high ten bits (in the range 0x000–0x3FF) are added to 0xD800 to give the first 16-bit code unit or high surrogate (W1), which will be in the range 0xD800–0xDBFF.
- The low ten bits (also in the range 0x000–0x3FF) are added to 0xDC00 to give the second 16-bit code unit or low surrogate (W2), which will be in the range 0xDC00–0xDFFF.
Illustrated visually, the distribution of U’ between W1 and W2 looks like:[15]
U' = yyyyyyyyyyxxxxxxxxxx   // U - 0x10000
W1 = 110110yyyyyyyyyy       // 0xD800 + yyyyyyyyyy
W2 = 110111xxxxxxxxxx       // 0xDC00 + xxxxxxxxxx
The high surrogate and low surrogate are also known as "leading" and "trailing" surrogates, respectively, analogous to the leading and trailing bytes of UTF-8.[16]
Since the ranges for the high surrogates (0xD800–0xDBFF), low surrogates (0xDC00–0xDFFF), and valid BMP characters (0x0000–0xD7FF, 0xE000–0xFFFF) are disjoint, it is not possible for a surrogate to match a BMP character, or for two adjacent code units to look like a legal surrogate pair. This simplifies searches a great deal. It also means that UTF-16 is self-synchronizing on 16-bit words: whether a code unit starts a character can be determined without examining earlier code units (i.e. the type of code unit can be determined by the ranges of values in which it falls). UTF-8 shares these advantages, but many earlier multi-byte encoding schemes (such as Shift JIS and other Asian multi-byte encodings) did not allow unambiguous searching and could only be synchronized by re-parsing from the start of the string (UTF-16 is not self-synchronizing if one byte is lost or if traversal starts at a random byte).
Because the most commonly used characters are all in the BMP, handling of surrogate pairs is often not thoroughly tested. This leads to persistent bugs and potential security holes, even in popular and well-reviewed application software (e.g. CVE-2008-2938, CVE-2012-2135).
The Supplementary Planes contain emoji, historic scripts, less used symbols, less used Chinese ideographs, etc. Since the encoding of Supplementary Planes contains 20 significant bits (10 of 16 bits in each of the high and low surrogates), 2^20 code points can be encoded, divided into 16 planes of 2^16 code points each. Including the separately-handled Basic Multilingual Plane, there are a total of 17 planes.
U+D800 to U+DFFF
The Unicode standard permanently reserves these code point values for UTF-16 encoding of the high and low surrogates, and they will never be assigned a character, so there should be no reason to encode them. The official Unicode standard says that no UTF forms, including UTF-16, can encode these code points.[citation needed]
However, UCS-2, UTF-8, and UTF-32 can encode these code points in trivial and obvious ways, and a large amount of software does so,[citation needed] even though the standard states that such arrangements should be treated as encoding errors.[citation needed]
It is possible to unambiguously encode an unpaired surrogate (a high surrogate code point not followed by a low one, or a low one not preceded by a high one) in the format of UTF-16 by using a code unit equal to the code point. The result is not valid UTF-16, but the majority of UTF-16 encoder and decoder implementations do this when translating between encodings.[citation needed] Windows allows unpaired surrogates in filenames and other places,[citation needed] which generally means they have to be supported by software in spite of their exclusion from the Unicode standard.
Examples
To encode U+10437 (𐐷) to UTF-16:
- Subtract 0x10000 from the code point, leaving 0x0437.
- For the high surrogate, shift right by 10 (divide by 0x400), then add 0xD800, resulting in 0x0001 + 0xD800 = 0xD801.
- For the low surrogate, take the low 10 bits (remainder of dividing by 0x400), then add 0xDC00, resulting in 0x0037 + 0xDC00 = 0xDC37.
To decode U+10437 (𐐷) from UTF-16:
- Take the high surrogate (0xD801) and subtract 0xD800, then multiply by 0x400, resulting in 0x0001 × 0x400 = 0x0400.
- Take the low surrogate (0xDC37) and subtract 0xDC00, resulting in 0x37.
- Add these two results together (0x0437), and finally add 0x10000 to get the final decoded UTF-32 code point, 0x10437.
The following table summarizes this conversion, as well as others, showing how the bits of each code point are distributed among the UTF-16 code units and bytes.

| Character | Binary code point | Binary UTF-16 | UTF-16 hex code units | UTF-16BE hex bytes | UTF-16LE hex bytes |
|---|---|---|---|---|---|
| $ (U+0024) | 0000 0000 0010 0100 | 0000 0000 0010 0100 | 0024 | 00 24 | 24 00 |
| € (U+20AC) | 0010 0000 1010 1100 | 0010 0000 1010 1100 | 20AC | 20 AC | AC 20 |
| 𐐷 (U+10437) | 0001 0000 0100 0011 0111 | 1101 1000 0000 0001 1101 1100 0011 0111 | D801 DC37 | D8 01 DC 37 | 01 D8 37 DC |
| 𤭢 (U+24B62) | 0010 0100 1011 0110 0010 | 1101 1000 0101 0010 1101 1111 0110 0010 | D852 DF62 | D8 52 DF 62 | 52 D8 62 DF |
Byte-order encoding schemes
UTF-16 and UCS-2 produce a sequence of 16-bit code units. Since most communication and storage protocols are defined for bytes, and each unit thus takes two 8-bit bytes, the order of the bytes may depend on the endianness (byte order) of the computer architecture.
To assist in recognizing the byte order of code units, UTF-16 allows a Byte Order Mark (BOM), a code point with the value U+FEFF, to precede the first actual coded value.[nb 1] (U+FEFF is the invisible zero-width non-breaking space/ZWNBSP character.)[nb 2] If the endian architecture of the decoder matches that of the encoder, the decoder detects the 0xFEFF value, but an opposite-endian decoder interprets the BOM as the non-character value U+FFFE reserved for this purpose. This incorrect result provides a hint to perform byte-swapping for the remaining values.
If the BOM is missing, RFC 2781 recommends[nb 3] that big-endian encoding be assumed. In practice, due to Windows using little-endian order by default, many applications assume little-endian encoding. It is also reliable to detect endianness by looking for null bytes, on the assumption that characters less than U+0100 are very common: if more of the even-indexed bytes (counting from 0) are null, then the text is big-endian.
The standard also allows the byte order to be stated explicitly by specifying UTF-16BE or UTF-16LE as the encoding type. When the byte order is specified explicitly this way, a BOM is specifically not supposed to be prepended to the text, and a U+FEFF at the beginning should be handled as a ZWNBSP character. Most applications ignore a BOM in all cases despite this rule.
For Internet protocols, IANA has approved «UTF-16», «UTF-16BE», and «UTF-16LE» as the names for these encodings (the names are case insensitive). The aliases UTF_16 or UTF16 may be meaningful in some programming languages or software applications, but they are not standard names in Internet protocols.
Similar designations, UCS-2BE and UCS-2LE, are used to show versions of UCS-2.
Usage
UTF-16 is used for text in the OS API of all currently supported versions of Microsoft Windows (including at least all versions since Windows CE/2000/XP/2003/Vista/7[17]), including Windows 10. In Windows XP, no code point above U+FFFF is included in any font delivered with Windows for European languages.[18][19] Older Windows NT systems (prior to Windows 2000) only support UCS-2.[20] Files and network data tend to be a mix of UTF-16, UTF-8, and legacy byte encodings.
While there has been some UTF-8 support even for Windows XP,[21] it was improved (in particular the ability to name a file using UTF-8) in Windows 10 insider build 17035 and the April 2018 update, and as of May 2019 Microsoft recommends software use it instead of UTF-16.[2]
The IBM i operating system designates CCSID (code page) 13488 for UCS-2 encoding and CCSID 1200 for UTF-16 encoding, though the system treats them both as UTF-16.[22]
UTF-16 is used by the Qualcomm BREW operating systems; the .NET environments; and the Qt cross-platform graphical widget toolkit.
Symbian OS used in Nokia S60 handsets and Sony Ericsson UIQ handsets uses UCS-2. iPhone handsets use UTF-16 for Short Message Service instead of UCS-2 described in the 3GPP TS 23.038 (GSM) and IS-637 (CDMA) standards.[23]
The Joliet file system, used in CD-ROM media, encodes file names using UCS-2BE (up to sixty-four Unicode characters per file name).
The Python language environment has officially only used UCS-2 internally since version 2.0, but the UTF-8 decoder to "Unicode" produces correct UTF-16. Since Python 2.2, "wide" builds of Unicode are supported, which use UTF-32 instead;[24] these are primarily used on Linux. Python 3.3 no longer ever uses UTF-16; instead, an encoding that gives the most compact representation for the given string is chosen from ASCII/Latin-1, UCS-2, and UTF-32.[25]
Java originally used UCS-2, and added UTF-16 supplementary character support in J2SE 5.0.
JavaScript may use UCS-2 or UTF-16.[26] As of ES2015, string methods and regular expression flags have been added to the language that permit handling strings from an encoding-agnostic perspective.
In many languages, quoted strings need a new syntax for quoting non-BMP characters, as the C-style "\uXXXX" syntax explicitly limits itself to 4 hex digits. The following examples illustrate the syntax for the non-BMP character "𝄞" (U+1D11E, MUSICAL SYMBOL G CLEF). The most common convention (used by C++, C#, D, and several other languages) is an upper-case 'U' with 8 hex digits, such as "\U0001D11E".[27] In Java 7 regular expressions, ICU, and Perl, the syntax "\x{1D11E}" must be used; similarly, in ECMAScript 2015 (JavaScript), the escape format is "\u{1D11E}". In many other cases (such as Java outside of regular expressions),[28] the only way to get non-BMP characters is to enter the surrogate halves individually: for example, "\uD834\uDD1E" for U+1D11E.
String implementations based on UTF-16 typically define string lengths, and allow indexing, in terms of these 16-bit code units, not in terms of code points. Neither code points nor code units correspond to anything an end user might recognize as a "character"; the things users identify as characters may in general consist of a base code point and a sequence of combining characters (or may be a sequence of code points of some other kind, for example Hangul conjoining jamos). Unicode refers to this construct as a grapheme cluster.[29] As such, applications dealing with Unicode strings, whatever the encoding, must cope with the fact that this limits their ability to arbitrarily split and combine strings.
UCS-2 is also supported by the PHP language[30] and MySQL.[7]
Swift, version 5, Apple’s preferred application language, switched from UTF-16 to UTF-8 as the preferred encoding.[31]
See also
- Comparison of Unicode encodings
- Plane (Unicode)
- UTF-8
- UTF-32
Notes
- ^ UTF-8 encoding produces byte values strictly less than 0xFE, so either byte in the BOM sequence also identifies the encoding as UTF-16 (assuming that UTF-32 is not expected).
- ^ Use of U+FEFF as the character ZWNBSP instead of as a BOM has been deprecated in favor of U+2060 (WORD JOINER); see Byte Order Mark (BOM) FAQ at unicode.org. But if an application interprets an initial BOM as a character, the ZWNBSP character is invisible, so the impact is minimal.
- ^ RFC 2781 section 4.3 says that if there is no BOM, «the text SHOULD be interpreted as being big-endian.» According to section 1.2, the meaning of the term «SHOULD» is governed by RFC 2119. In that document, section 3 says «… there may exist valid reasons in particular circumstances to ignore a particular item, but the full implications must be understood and carefully weighed before choosing a different course».
References
- ^ a b "What is UTF-16?". The Unicode Consortium. Unicode, Inc. Retrieved 29 March 2018.
- ^ a b "Use the Windows UTF-8 code page — UWP applications". docs.microsoft.com. Retrieved 2020-06-06.
As of Windows Version 1903 (May 2019 Update), you can use the ActiveCodePage property in the appxmanifest for packaged apps, or the fusion manifest for unpackaged apps, to force a process to use UTF-8 as the process code page. [..] CP_ACP equates to CP_UTF8 only if running on Windows Version 1903 (May 2019 Update) or above and the ActiveCodePage property described above is set to UTF-8. Otherwise, it honors the legacy system code page. We recommend using CP_UTF8 explicitly.
- ^ "HTML Living Standard". w3.org. 2020-06-10. Retrieved 2020-06-15.
UTF-16 encodings are the only encodings that this specification needs to treat as not being ASCII-compatible encodings.
- ^ "Usage Statistics of UTF-16 for Websites, June 2021". w3techs.com. Retrieved 2021-06-17.
- ^ "Usage Statistics of UTF-8 for Websites, February 2021". w3techs.com. Retrieved 2021-02-25.
- ^ "Encoding Standard". encoding.spec.whatwg.org. Retrieved 2018-10-22.
The UTF-8 encoding is the most appropriate encoding for interchange of Unicode, the universal coded character set. Therefore for new protocols and formats, as well as existing formats deployed in new contexts, this specification requires (and defines) the UTF-8 encoding. [..] The problems outlined here go away when exclusively using UTF-8, which is one of the many reasons that UTF-8 is now the mandatory encoding for all text things on the Web.
- ^ a b "MySQL :: MySQL 5.7 Reference Manual :: 10.1.9.4 The ucs2 Character Set (UCS-2 Unicode Encoding)". dev.mysql.com.
- ^ "Questions about encoding forms". Retrieved 2010-11-12.
- ^ ISO/IEC 10646:2014 "Information technology – Universal Coded Character Set (UCS)" sections 9 and 10.
- ^ The Unicode Standard version 7.0 (2014) section 2.5.
- ^ "The Unicode® Standard Version 10.0 – Core Specification. Appendix C Relationship to ISO/IEC 10646" (PDF). Unicode Consortium. section C.2 page 913 (pdf page 10)
- ^ "Unicode Character Encoding Stability Policies". unicode.org.
- ^ "It's not wrong that "🤦🏼‍♂️".length == 7". hsivonen.fi. Retrieved 2021-03-15.
- ^ "Apple Developer Documentation". developer.apple.com. Retrieved 2021-03-15.
- ^ Yergeau, Francois; Hoffman, Paul. "UTF-16, an encoding of ISO 10646". tools.ietf.org. Retrieved 2019-06-18.
- ^ Allen, Julie D.; Anderson, Deborah; Becker, Joe; Cook, Richard, eds. (2014). "3.8 Surrogates" (PDF). The Unicode Standard, Version 7.0—Core Specification. Mountain View: The Unicode Consortium. p. 118. Retrieved 3 November 2014.
- ^ Unicode (Windows). Retrieved 2011-03-08. "These functions use UTF-16 (wide character) encoding (…) used for native Unicode encoding on Windows operating systems."
- ^ "Unicode". microsoft.com. Retrieved 2009-07-20.
- ^ "Surrogates and Supplementary Characters". microsoft.com. Retrieved 2009-07-20.
- ^ "Description of storing UTF-8 data in SQL Server". microsoft.com. 7 December 2005. Retrieved 2008-02-01.
- ^ "[Updated] Patch for cmd.exe for windows xp for cp 65001 — Page 2 — DosTips.com". www.dostips.com. Retrieved 2021-06-17.
- ^ "UCS-2 and its relationship to Unicode (UTF-16)". IBM. Retrieved 2019-04-26.
- ^ Selph, Chad (2012-11-08). "Adventures in Unicode SMS". Twilio. Archived from the original on 2012-11-09. Retrieved 2015-08-28.
- ^ "PEP 261 – Support for "wide" Unicode characters". Python.org. Retrieved 2015-05-29.
- ^ "PEP 0393 – Flexible String Representation". Python.org. Retrieved 2015-05-29.
- ^ "JavaScript's internal character encoding: UCS-2 or UTF-16? · Mathias Bynens".
- ^ "ECMA-334: 9.4.1 Unicode escape sequences". en.csharp-online.net. Archived from the original on 2013-05-01.
- ^ Lexical Structure: Unicode Escapes in "The Java Language Specification, Third Edition". Sun Microsystems, Inc. 2005. Retrieved 2019-10-11.
- ^ "Glossary of Unicode Terms". Retrieved 2016-06-21.
- ^ "PHP: Supported Character Encodings — Manual". php.net.
- ^ "UTF-8 String". Swift.org. 2019-03-20. Retrieved 2020-08-20.
External links
- A very short algorithm for determining the surrogate pair for any code point
- Unicode Technical Note #12: UTF-16 for Processing
- Unicode FAQ: What is the difference between UCS-2 and UTF-16?
- Unicode Character Name Index
- RFC 2781: UTF-16, an encoding of ISO 10646
- java.lang.String documentation, discussing surrogate handling
The values stored in memory for Windows are UTF-16 little-endian, always. But that’s not what you’re talking about — you’re looking at file contents. Windows itself does not specify the encoding of files, it leaves that to individual applications.
The 0xfe 0xff you see at the start of the file is a Byte Order Mark or BOM. It not only indicates that the file is most probably Unicode, but it tells you which variant of Unicode encoding.
BOM bytes | Encoding
---|---
0xfe 0xff | UTF-16 big-endian
0xff 0xfe | UTF-16 little-endian
0xef 0xbb 0xbf | UTF-8
A file that doesn’t have a BOM should be assumed to contain 8-bit characters unless you know how it was written. That still doesn’t tell you whether it’s UTF-8 or some other Windows character encoding; you’ll just have to guess.
You may use Notepad as an example of how this is done. If the file has a BOM then Notepad will read it and process the contents appropriately. Otherwise you must specify the encoding yourself with the «Encoding» dropdown list.
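Here’s roughly what that sniffing looks like in Python; `sniff_bom` is just a throwaway helper for illustration, not anything Windows or Notepad actually provides:

```python
def sniff_bom(data: bytes) -> str:
    """Guess the encoding of a file from its first bytes, per the BOM table above."""
    if data.startswith(b"\xef\xbb\xbf"):
        return "UTF-8"
    if data.startswith(b"\xfe\xff"):
        return "UTF-16 big-endian"
    if data.startswith(b"\xff\xfe"):
        return "UTF-16 little-endian"
    # No BOM: assume some 8-bit code page and guess from there.
    return "unknown"

# Python's "utf-16" codec writes a BOM, so the helper can recognise its output.
print(sniff_bom("test".encode("utf-16")))
```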
Edit: the reason Windows documentation isn’t more specific about the encoding is that Windows was a very early adopter of Unicode, and at the time there was only one encoding of 16 bits per code point. When 65536 code points were determined to be inadequate, surrogate pairs were invented as a way to extend the range and UTF-16 was born. Microsoft was already using Unicode to refer to their encoding and never changed.
UTF-16 (Unicode Transformation Format) is one of the ways of encoding Unicode characters as sequences of 16-bit words. This encoding can represent Unicode characters in the ranges U+0000..U+D7FF and U+E000..U+10FFFF (1,112,064 in total), with each character written as one or two words (a surrogate pair).
UTF-16 is described in Annex Q of the international standard ISO/IEC 10646 and is also the subject of IETF RFC 2781, «UTF-16, an encoding of ISO 10646».
History
The first version of Unicode (1991) was a fixed-width 16-bit encoding; the total number of distinct characters was 2¹⁶ (65,536). For the second version of Unicode (1996) it was decided to expand the code space considerably, and UTF-16 was created to preserve compatibility with systems that had already implemented 16-bit Unicode. The range 0xD800–0xDFFF, now reserved for surrogate pairs, had previously belonged to the «private use» area.
Since UTF-16 can represent 2²⁰ + 2¹⁶ − 2048 (1,112,064) characters, that number was chosen as the new size of the Unicode code space.
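The arithmetic behind this count can be verified directly: 2²⁰ code points reachable through surrogate pairs, plus 2¹⁶ single-word codes, minus the 2,048 word values reserved for the surrogates themselves.

```python
# 0xD800..0xDFFF are the 2048 reserved surrogate word values.
surrogates = 0xE000 - 0xD800
assert surrogates == 2048
assert 2**20 + 2**16 - surrogates == 1_112_064
# The 2**20 term is exactly the supplementary range U+10000..U+10FFFF:
assert 0x10FFFF - 0x10000 + 1 == 2**20
```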
Encoding principle
Leading \ trailing | DC00 | … | DFFE | DFFF
---|---|---|---|---
D800 | 010000 | … | 0103FE | 0103FF
D801 | 010400 | … | 0107FE | 0107FF
… | … | … | … | …
DBFF | 10FC00 | … | 10FFFE | 10FFFF
In UTF-16, characters are encoded as two-byte words, using the full range of possible values (0x0000 to 0xFFFF). Characters can be encoded in the ranges 0x0000..0xD7FF and 0xE000..0x10FFFF. The range 0xD800..0xDFFF excluded here is used precisely to encode so-called surrogate pairs: characters that are written with two 16-bit words.
Unicode characters up to and including 0xFFFF (excluding the surrogate range) are written as-is in a single 16-bit word.
Characters in the range 0x10000..0x10FFFF (which need more than 16 bits) are encoded according to the following scheme:
- The character’s code is arithmetically shifted towards zero (the minimum value 0x10000 is subtracted from it). The result is a value from zero to 0xFFFFF, which occupies up to 20 bits.
- The high 10 bits (a value in the range 0x000..0x3FF) are added to 0xD800; the result forms the leading (first) word, which falls in the range 0xD800..0xDBFF.
- The low 10 bits (also a value in the range 0x000..0x3FF) are added to 0xDC00; the result forms the trailing (second) word, which falls in the range 0xDC00..0xDFFF.
In both words the high 6 bits mark the surrogate: bits 11 through 15 (counting from zero) hold the value 11011₂, while bit 10 is 0 in the leading word and 1 in the trailing word. This makes it easy to tell which role each word plays.
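The three steps above, applied in Python to U+1F600 (😀) as a worked example:

```python
cp = 0x1F600
v = cp - 0x10000               # step 1: shift towards zero; fits in 20 bits
lead = 0xD800 + (v >> 10)      # step 2: high 10 bits -> leading word
trail = 0xDC00 + (v & 0x3FF)   # step 3: low 10 bits -> trailing word
assert (lead, trail) == (0xD83D, 0xDE00)
# Cross-check against Python's own UTF-16 encoder:
assert chr(cp).encode("utf-16-be") == b"\xd8\x3d\xde\x00"
```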
Byte order
A single UTF-16 character is represented by a sequence of two bytes or of two byte pairs. Which of the two comes first, the high-order byte or the low-order one, depends on the byte order. Systems compatible with x86 processors are called little-endian; those compatible with m68k and SPARC processors are called big-endian.
A byte order mark (BOM) is used to determine the byte order: the code U+FEFF is written at the beginning of the text. If on reading U+FFFE is found instead of U+FEFF, the byte order is reversed, since Unicode has no character with the code U+FFFE. Because the byte values 0xFE and 0xFF never occur in UTF-8, the byte order mark can also be used to distinguish UTF-16 from UTF-8.
UTF-16LE and UTF-16BE
The byte order can also be specified externally: for that, the encoding must be declared as UTF-16LE or UTF-16BE (little-endian / big-endian) rather than simply UTF-16. In that case the byte order mark (U+FEFF) is not needed.
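Python’s codecs follow the same naming convention, which makes the distinction easy to see (a sketch using CPython’s standard codecs, not any Windows API):

```python
s = "A"
# Byte order declared externally by the label: no BOM is written.
assert s.encode("utf-16-le") == b"A\x00"
assert s.encode("utf-16-be") == b"\x00A"
# Plain "utf-16": the encoder writes U+FEFF first, in either byte order.
assert s.encode("utf-16")[:2] in (b"\xff\xfe", b"\xfe\xff")
```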
UTF-16 in Windows
The Win32 API, used by modern versions of the Microsoft Windows operating system, offers two ways of representing text: as traditional 8-bit code pages and as UTF-16.
When UTF-16 is used, Windows places no restrictions on how applications encode their text files, allowing them to use either UTF-16LE or UTF-16BE by writing and honouring the corresponding byte order mark. The internal Windows format, however, is always UTF-16LE. This should be kept in mind when working with executable files that use the Unicode versions of WinAPI functions: the strings in them are always encoded in UTF-16LE.
In the NTFS file system, as well as in FAT with long file name support, file names are likewise stored in UTF-16LE.
Example procedures
The examples below are written in pseudocode and ignore the byte order mark; they only illustrate the essence of the encoding. The byte order is little-endian (Intel x86). The type `Word` is a two-byte word (a 16-bit unsigned integer), and the type `UInt32` is a 32-bit unsigned integer. Hexadecimal values begin with a dollar sign («$»).
Encoding
In this example `WriteWord()` is a notional procedure that writes one word (advancing an internal pointer as it does so), and the function `LoWord()` returns the low word of a 32-bit integer (the high bits are discarded without inspection).

```
// Valid values of Code: $0000..$D7FF, $E000..$10FFFF.
Procedure WriteUTF16Char(Code: UInt32)
  If (Code < $10000) Then
    WriteWord(LoWord(Code))
  Else
    Code = Code - $10000
    Var Lo10: Word = LoWord(Code And $3FF)
    Var Hi10: Word = LoWord(Code Shr 10)
    WriteWord($D800 Or Hi10)
    WriteWord($DC00 Or Lo10)
  End If
End Procedure
```
Decoding
In this example `ReadWord()` reads a word from the stream (advancing an internal pointer), correcting the byte order if necessary. The function `WordToUInt32()` widens a two-byte word to a four-byte unsigned integer, filling the high bits with zeros. `Error()` aborts execution (essentially, it throws an exception).

```
// On success, the returned values lie in the
// ranges $0000..$D7FF and $E000..$10FFFF.
Function ReadUTF16Char: UInt32
  Var Leading: Word   // Leading (first) word.
  Var Trailing: Word  // Trailing (second) word.
  Leading = ReadWord()
  If (Leading < $D800) Or (Leading > $DFFF) Then
    Return WordToUInt32(Leading)
  Else If (Leading >= $DC00) Then
    Error("Invalid code sequence.")
  Else
    Var Code: UInt32
    Code = WordToUInt32(Leading And $3FF) Shl 10
    Trailing = ReadWord()
    If ((Trailing < $DC00) Or (Trailing > $DFFF)) Then
      Error("Invalid code sequence.")
    Else
      Code = Code Or WordToUInt32(Trailing And $3FF)
      Return (Code + $10000)
    End If
  End If
End Function
```
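The decoding arithmetic is the same in any language; here is a minimal Python counterpart of `ReadUTF16Char` (a hypothetical helper operating on a list of words rather than a stream, byte order already resolved):

```python
def decode_utf16_words(words):
    """Yield code points from a sequence of 16-bit words."""
    it = iter(words)
    for w in it:
        if w < 0xD800 or w > 0xDFFF:
            yield w                      # ordinary BMP character
        elif w >= 0xDC00:
            raise ValueError("invalid code sequence: unpaired trailing surrogate")
        else:
            t = next(it, None)           # leading surrogate: fetch its partner
            if t is None or not 0xDC00 <= t <= 0xDFFF:
                raise ValueError("invalid code sequence")
            yield 0x10000 + (((w & 0x3FF) << 10) | (t & 0x3FF))

assert list(decode_utf16_words([0x0041, 0xD83D, 0xDE00])) == [0x41, 0x1F600]
```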
Windows code pages are sets of characters or code pages (known as character encodings in other operating systems) used in Microsoft Windows from the 1980s and 1990s. Windows code pages were gradually superseded when Unicode was implemented in Windows,[citation needed] although they are still supported both within Windows and other platforms, and still apply when Alt code shortcuts are used.
There are two groups of system code pages in Windows systems: OEM and Windows-native («ANSI») code pages.
(ANSI is the American National Standards Institute.) Code pages in both of these groups are extended ASCII code pages. Additional code pages are supported by standard Windows conversion routines, but not used as either type of system code page.
ANSI code page[edit]
Alias(es) | ANSI (misnomer) |
---|---|
Standard | WHATWG Encoding Standard |
Extends | US-ASCII |
Preceded by | ISO 8859 |
Succeeded by | Unicode UTF-16 (in Win32 API) |
ANSI code pages (officially called «Windows code pages»[1] after Microsoft accepted that the former term is a misnomer[2]) are used for native non-Unicode (i.e. byte-oriented) applications using a graphical user interface on Windows systems. The term «ANSI» is a misnomer because these Windows code pages do not comply with any ANSI (American National Standards Institute) standard; code page 1252 was merely based on an early ANSI draft that became the international standard ISO 8859-1,[2] which adds a further 32 control codes and space for 96 printable characters. Among other differences, Windows code pages allocate printable characters to the supplementary control code space, making them at best illegible to standards-compliant operating systems.
Most legacy «ANSI» code pages have code page numbers in the pattern 125x. However, 874 (Thai) and the East Asian multi-byte «ANSI» code pages (932, 936, 949, 950), all of which are also used as OEM code pages, are numbered to match IBM encodings, none of which are identical to the Windows encodings (although most are similar). While code page 1258 is also used as an OEM code page, it is original to Microsoft rather than an extension of an existing encoding. IBM has assigned its own, different numbers to Microsoft’s variants; these are given for reference in the lists below where applicable.
All of the 125x Windows code pages, as well as 874 and 936, are labelled by the Internet Assigned Numbers Authority (IANA) as «Windows-<number>», although «Windows-936» is treated as a synonym for «GBK». Windows code page 932 is instead labelled as «Windows-31J».[3]
ANSI Windows code pages, and especially the code page 1252, were so called since they were purportedly based on drafts submitted or intended for ANSI. However, ANSI and ISO have not standardized any of these code pages. Instead they are either:[2]
- Supersets of the standard sets, such as those of ISO 8859 and the various national standards (like Windows-1252 vs. ISO-8859-1),
- Major modifications of these (making them incompatible to various degrees, like Windows-1250 vs. ISO-8859-2), or
- Encodings with no parallel standard (like Windows-1257 vs. ISO-8859-4; ISO-8859-13 was introduced much later). Also, Windows-1251 follows neither the ISO-standardised ISO-8859-5 nor the then-prevailing KOI-8.
Microsoft assigned about twelve typography and business characters (including, notably, the euro sign, €) in CP1252 to the code points 0x80–0x9F that, in ISO 8859, are assigned to C1 control codes. These assignments are also present in many other ANSI/Windows code pages at the same code points. Windows did not use the C1 control codes, so this decision had no direct effect on Windows users. However, if included in a file transferred to a standards-compliant platform like Unix or macOS, the information was invisible and potentially disruptive.[citation needed]
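The effect is easy to demonstrate: the same byte is the euro sign under Windows-1252 but an invisible C1 control code under ISO 8859-1 (a sketch using Python’s codec names):

```python
b = b"\x80"
assert b.decode("cp1252") == "€"        # U+20AC, printable
assert b.decode("latin-1") == "\x80"    # U+0080, a C1 control code
```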
OEM code page[edit]
The OEM (original equipment manufacturer) code pages are used by Win32 console applications and by virtual DOS machines, and can be considered a holdover from DOS and the original IBM PC architecture. A separate suite of code pages was implemented not only for compatibility, but also because the fonts of VGA (and descendant) hardware suggest encoding line-drawing characters so as to be compatible with code page 437. Most OEM code pages share many code points, particularly for non-letter characters, with the second (non-ASCII) half of CP437.
A typical OEM code page, in its second half, does not resemble any ANSI/Windows code page even roughly. Nevertheless, two single-byte, fixed-width code pages (874 for Thai and 1258 for Vietnamese) and four multibyte CJK code pages (932, 936, 949, 950) are used as both OEM and ANSI code pages. Code page 1258 uses combining diacritics, as Vietnamese requires more than 128 letter-diacritic combinations. This is in contrast to VISCII, which replaces some of the C0 (i.e. ASCII) control codes.
History[edit]
Initially, computer systems and system programming languages did not make a distinction between characters and bytes: for the segmental scripts used in most of Africa, the Americas, southern and south-east Asia, the Middle East and Europe, a character needs just one byte, but two or more bytes are needed for the ideographic sets used in the rest of the world. This subsequently led to much confusion. Microsoft software and systems prior to the Windows NT line are examples of this, because they use the OEM and ANSI code pages that do not make the distinction.
Since the late 1990s, software and systems have adopted Unicode as their preferred storage format; this trend has been reinforced by the widespread adoption of XML, which defaults to UTF-8 but also provides a mechanism for labelling the encoding used.[4] All current Microsoft products and application program interfaces use Unicode internally,[citation needed] but some applications continue to use the default encoding of the computer’s ‘locale’ when reading and writing text data to files or standard output.[citation needed] Therefore, files may still be encountered that are legible and intelligible in one part of the world but unintelligible mojibake in another.
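A classic instance of such mojibake can be reproduced with Python’s codecs (an illustration of the general problem, not of any specific Windows API):

```python
# Cyrillic text written under ANSI code page 1251 on one machine…
data = "Привет".encode("cp1251")   # Russian «Hello»
# …and read back under code page 1252 on another:
assert data.decode("cp1252") == "Ïðèâåò"   # intact bytes, unintelligible text
```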
UTF-8, UTF-16[edit]
Microsoft adopted a Unicode encoding for all its operating systems from Windows NT onwards: first the now-obsolete UCS-2 (which was then Unicode’s only encoding), then UTF-16. In addition, UTF-8 (aka `CP_UTF8`) has been supported since Windows 10 version 1803.[5]
UTF-16 uniquely encodes all Unicode characters in the Basic Multilingual Plane (BMP) using 16 bits, while the remaining Unicode characters (e.g. emoji) are encoded with a 32-bit (four-byte) surrogate pair. The rest of the industry (Unix-like systems and the web), and now Microsoft as well, has chosen UTF-8, which uses one byte for the 7-bit ASCII character set, two or three bytes for the other characters in the BMP, and four bytes for the remainder.
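The byte counts involved can be checked with any UTF-8/UTF-16 encoder; a quick sketch using Python’s:

```python
# One-, two-, three- and four-byte UTF-8 sequences:
for ch, n in [("A", 1), ("é", 2), ("中", 3), ("😀", 4)]:
    assert len(ch.encode("utf-8")) == n
# UTF-16: two bytes for any BMP character, four for the rest:
assert len("A".encode("utf-16-le")) == 2
assert len("😀".encode("utf-16-le")) == 4   # a surrogate pair
```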
List[edit]
The following Windows code pages exist:
Windows-125x series[edit]
These nine code pages are all extended ASCII 8-bit SBCS encodings, and were designed by Microsoft for use as ANSI code pages on Windows. They are commonly known by their IANA-registered[6] names as windows-<number>, but are also sometimes called cp<number>, «cp» for «code page». They are all used as ANSI code pages; Windows-1258 is also used as an OEM code page.
The Windows-125x series includes nine of the ANSI code pages, and mostly covers scripts from Europe and West Asia with the addition of Vietnam. System encodings for Thai and for East Asian languages were numbered to match similar IBM code pages and are used as both ANSI and OEM code pages; these are covered in following sections.
ID | Description | Relationship to ISO 8859 or other established encodings |
---|---|---|
1250[7][8] | Latin 2 / Central European | Similar to ISO-8859-2 but moves several characters, including multiple letters. |
1251[9][10] | Cyrillic | Incompatible with both ISO-8859-5 and KOI-8. |
1252[11][12] | Latin 1 / Western European | Superset of ISO-8859-1 (without C1 controls). Letter repertoire accordingly similar to CP850. |
1253[13][14] | Greek | Similar to ISO 8859-7 but moves several characters, including a letter. |
1254[15][16] | Turkish | Superset of ISO 8859-9 (without C1 controls). |
1255[17][18] | Hebrew | Almost a superset of ISO 8859-8, but with two incompatible punctuation changes. |
1256[19][20] | Arabic | Not compatible with ISO 8859-6; rather, OEM Code page 708 is an ISO 8859-6 (ASMO 708) superset. |
1257[21][22] | Baltic | Not ISO 8859-4; the later ISO 8859-13 is closely related, but with some differences in available punctuation. |
1258[23][24] | Vietnamese (also OEM) | Not related to VSCII or VISCII, uses fewer base characters with combining diacritics. |
DOS code pages[edit]
These are also ASCII-based. Most of these are included for use as OEM code pages; code page 874 is also used as an ANSI code page.
- 437 – IBM PC US, 8-bit SBCS extended ASCII.[25] Known as OEM-US, the encoding of the primary built-in font of VGA graphics cards.
- 708 – Arabic, extended ISO 8859-6 (ASMO 708)
- 720 – Arabic, retaining box drawing characters in their usual locations
- 737 – «MS-DOS Greek». Retains all box drawing characters. More popular than 869.
- 775 – «MS-DOS Baltic Rim»
- 850 – «MS-DOS Latin 1». Full (re-arranged) repertoire of ISO 8859-1.
- 852 – «MS-DOS Latin 2»
- 855 – «MS-DOS Cyrillic». Mainly used for South Slavic languages. Includes (re-arranged) repertoire of ISO-8859-5. Not to be confused with cp866.
- 857 – «MS-DOS Turkish»
- 858 – Western European with euro sign
- 860 – «MS-DOS Portuguese»
- 861 – «MS-DOS Icelandic»
- 862 – «MS-DOS Hebrew»
- 863 – «MS-DOS French Canada»
- 864 – Arabic
- 865 – «MS-DOS Nordic»
- 866 – «MS-DOS Cyrillic Russian», cp866. Sole purely OEM code page (rather than ANSI or both) included as a legacy encoding in WHATWG Encoding Standard for HTML5.
- 869 – «MS-DOS Greek 2», IBM869. Full (re-arranged) repertoire of ISO 8859-7.
- 874 – Thai, also used as the ANSI code page, extends ISO 8859-11 (and therefore TIS-620) with a few additional characters from Windows-1252. Corresponds to IBM code page 1162 (IBM-874 is similar but has different extensions).
East Asian multi-byte code pages[edit]
These often differ from the IBM code pages of the same number: code pages 932, 949 and 950 only partly match the IBM code pages of the same number, the number 936 was used by IBM for another Simplified Chinese encoding which is now deprecated, and Windows-951, as part of a kludge, is unrelated to IBM-951. IBM equivalent code pages are given in the second column. Code pages 932, 936, 949 and 950/951 are used as both ANSI and OEM code pages on the locales in question.
ID | Language | Encoding | IBM Equivalent | Difference from IBM CCSID of same number | Use |
---|---|---|---|---|---|
932 | Japanese | Shift JIS (Microsoft variant) | 943[26] | IBM-932 is also Shift JIS, has fewer extensions (but those extensions it has are in common), and swaps some variant Chinese characters (itaiji) for interoperability with earlier editions of JIS C 6226. | ANSI/OEM (Japan) |
936 | Chinese (simplified) | GBK | 1386 | IBM-936 is a different Simplified Chinese encoding with a different encoding method, which has been deprecated since 1993. | ANSI/OEM (PRC, Singapore) |
949 | Korean | Unified Hangul Code | 1363 | IBM-949 is also an EUC-KR superset, but with different (colliding) extensions. | ANSI/OEM (Republic of Korea) |
950 | Chinese (traditional) | Big5 (Microsoft variant) | 1373[27] | IBM-950 is also Big5, but includes a different subset of the ETEN extensions, adds further extensions with an expanded trail byte range, and lacks the Euro. | ANSI/OEM (Taiwan, Hong Kong) |
951 | Chinese (traditional) including Cantonese | Big5-HKSCS (2001 ed.) | 5471[28] | IBM-951 is the double-byte plane from IBM-949 (see above), and unrelated to Microsoft’s internal use of the number 951. | ANSI/OEM (Hong Kong, 98/NT4/2000/XP with HKSCS patch) |
A few further multiple-byte code pages are supported for decoding or encoding using operating system libraries, but not used as either sort of system encoding in any locale.
ID | IBM Equivalent | Language | Encoding | Use |
---|---|---|---|---|
1361 | — | Korean | Johab (KS C 5601-1992 annex 3) | Conversion |
20000 | — | Chinese (traditional) | An encoding of CNS 11643 | Conversion |
20001 | — | Chinese (traditional) | TCA | Conversion |
20002 | — | Chinese (traditional) | Big5 (ETEN variant) | Conversion |
20003 | 938 | Chinese (traditional) | IBM 5550 | Conversion |
20004 | — | Chinese (traditional) | Teletext | Conversion |
20005 | — | Chinese (traditional) | Wang | Conversion |
20932 | 954 (roughly) | Japanese | EUC-JP | Conversion |
20936 | 5479 | Chinese (simplified) | GB 2312 | Conversion |
20949, 51949 | 970 | Korean | Wansung (8-bit with ASCII, i.e. EUC-KR)[29] | Conversion |
EBCDIC code pages[edit]
- 37 – IBM EBCDIC US-Canada, 8-bit SBCS[30]
- 500 – Latin 1
- 870 – IBM870
- 875 – cp875
- 1026 – EBCDIC Turkish
- 1047 – IBM01047 – Latin 1
- 1140 – IBM01140
- 1141 – IBM01141
- 1142 – IBM01142
- 1143 – IBM01143
- 1144 – IBM01144
- 1145 – IBM01145
- 1146 – IBM01146
- 1147 – IBM01147
- 1148 – IBM01148
- 1149 – IBM01149
- 20273 – EBCDIC Germany
- 20277 – EBCDIC Denmark/Norway
- 20278 – EBCDIC Finland/Sweden
- 20280 – EBCDIC Italy
- 20284 – EBCDIC Latin America/Spain
- 20285 – EBCDIC United Kingdom
- 20290 – EBCDIC Japanese
- 20297 – EBCDIC France
- 20420 – EBCDIC Arabic
- 20423 – EBCDIC Greek
- 20424 – x-EBCDIC-KoreanExtended
- 20833 – Korean
- 20838 – EBCDIC Thai
- 20924 – IBM00924 – IBM EBCDIC Latin 1/Open System (1047 + Euro symbol)
- 20871 – EBCDIC Icelandic
- 20880 – EBCDIC Cyrillic
- 20905 – EBCDIC Turkish
- 21025 – EBCDIC Cyrillic
- 21027 – Japanese EBCDIC (incomplete,[31] deprecated)[32]
Unicode code pages[edit]
- 1200 – Unicode (BMP of ISO 10646, UTF-16LE). Available only to managed applications.[32]
- 1201 – Unicode (UTF-16BE). Available only to managed applications.[32]
- 12000 – UTF-32. Available only to managed applications.[32]
- 12001 – UTF-32. Big-endian. Available only to managed applications.[32]
- 65000 – Unicode (UTF-7)
- 65001 – Unicode (UTF-8)
Macintosh compatibility code pages[edit]
- 10000 – Apple Macintosh Roman
- 10001 – Apple Macintosh Japanese
- 10002 – Apple Macintosh Chinese (traditional) (BIG-5)
- 10003 – Apple Macintosh Korean
- 10004 – Apple Macintosh Arabic
- 10005 – Apple Macintosh Hebrew
- 10006 – Apple Macintosh Greek
- 10007 – Apple Macintosh Cyrillic
- 10008 – Apple Macintosh Chinese (simplified) (GB 2312)
- 10010 – Apple Macintosh Romanian
- 10017 – Apple Macintosh Ukrainian
- 10021 – Apple Macintosh Thai
- 10029 – Apple Macintosh Roman II / Central Europe
- 10079 – Apple Macintosh Icelandic
- 10081 – Apple Macintosh Turkish
- 10082 – Apple Macintosh Croatian
ISO 8859 code pages[edit]
- 28591 – ISO-8859-1 – Latin-1 (IBM equivalent: 819)
- 28592 – ISO-8859-2 – Latin-2
- 28593 – ISO-8859-3 – Latin-3 or South European
- 28594 – ISO-8859-4 – Latin-4 or North European
- 28595 – ISO-8859-5 – Latin/Cyrillic
- 28596 – ISO-8859-6 – Latin/Arabic
- 28597 – ISO-8859-7 – Latin/Greek
- 28598 – ISO-8859-8 – Latin/Hebrew
- 28599 – ISO-8859-9 – Latin-5 or Turkish
- 28600 – ISO-8859-10 – Latin-6
- 28601 – ISO-8859-11 – Latin/Thai
- 28602 – ISO-8859-12 – reserved for Latin/Devanagari but abandoned (not supported)
- 28603 – ISO-8859-13 – Latin-7 or Baltic Rim
- 28604 – ISO-8859-14 – Latin-8 or Celtic
- 28605 – ISO-8859-15 – Latin-9
- 28606 – ISO-8859-16 – Latin-10 or South-Eastern European
- 38596 – ISO-8859-6-I – Latin/Arabic (logical bidirectional order)
- 38598 – ISO-8859-8-I – Latin/Hebrew (logical bidirectional order)
ITU-T code pages[edit]
- 20105 – 7-bit IA5 IRV (Western European)[33][34][35]
- 20106 – 7-bit IA5 German (DIN 66003)[33][34][36]
- 20107 – 7-bit IA5 Swedish (SEN 850200 C)[33][34][37]
- 20108 – 7-bit IA5 Norwegian (NS 4551-2)[33][34][38]
- 20127 – 7-bit US-ASCII[33][34][39]
- 20261 – T.61 (T.61-8bit)
- 20269 – ISO-6937
KOI8 code pages[edit]
- 20866 – Russian – KOI8-R
- 21866 – Ukrainian – KOI8-U (or KOI8-RU in some versions)[40]
Problems arising from the use of code pages[edit]
Microsoft strongly recommends using Unicode in modern applications, but many applications or data files still depend on the legacy code pages.
- Programs need to know what code page to use in order to display the contents of (pre-Unicode) files correctly. If a program uses the wrong code page it may show text as mojibake.
- The code page in use may differ between machines, so (pre-Unicode) files created on one machine may be unreadable on another.
- Data is often improperly tagged with the code page, or not tagged at all, making determination of the correct code page to read the data difficult.
- These Microsoft code pages differ to various degrees from some of the standards and other vendors’ implementations. This isn’t a Microsoft issue per se, as it happens to all vendors, but the lack of consistency makes interoperability with other systems unreliable in some cases.
- The use of code pages limits the set of characters that may be used.
- Characters expressed in an unsupported code page may be converted to question marks (?) or other replacement characters, or to a simpler version (such as removing accents from a letter). In either case, the original character may be lost.
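This loss can be sketched with Python’s cp1252 codec (an illustration of the general problem, not of any specific Windows API); characters absent from the target code page are replaced and cannot be recovered:

```python
# «€» and «°» exist in cp1252 and encode cleanly; «Ω» does not.
assert "€°".encode("cp1252") == b"\x80\xb0"
assert "Ω15°".encode("cp1252", errors="replace") == b"?15\xb0"   # Ω is lost
```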
See also[edit]
- AppLocale – a utility to run non-Unicode (code page-based) applications in a locale of the user’s choice.
References[edit]
- ^ «Code Pages». 2016-03-07. Archived from the original on 2016-03-07. Retrieved 2021-05-26.
- ^ a b c «Glossary of Terms Used on this Site». December 8, 2018. Archived from the original on 2018-12-08.
The term «ANSI» as used to signify Windows code pages is a historical reference, but is nowadays a misnomer that continues to persist in the Windows community. The source of this comes from the fact that the Windows code page 1252 was originally based on an ANSI draft—which became International Organization for Standardization (ISO) Standard 8859-1. «ANSI applications» are usually a reference to non-Unicode or code page–based applications.
- ^ «Character Sets». www.iana.org. Archived from the original on 2021-05-25. Retrieved 2021-05-26.
- ^ «Extensible Markup Language (XML) 1.1 (Second Edition): Character encodings». W3C. 29 September 2006. Archived from the original on 19 April 2021. Retrieved 5 October 2020.
- ^ hylom (2017-11-14). «Windows 10のInsider PreviewでシステムロケールをUTF-8にするオプションが追加される» [The option to make UTF-8 the system locale added in Windows 10 Insider Preview]. スラド (in Japanese). Archived from the original on 2018-05-11. Retrieved 2018-05-10.
- ^ «Character Sets». IANA. Archived from the original on 2016-12-03. Retrieved 2019-04-07.
- ^ Microsoft. «Windows 1250». Archived from the original on 2014-07-14. Retrieved 2014-07-06.
- ^ IBM. «SBCS code page information document CPGID 01250». Archived from the original on 2014-07-14. Retrieved 2014-07-06.
- ^ Microsoft. «Windows 1251». Archived from the original on 2014-07-14. Retrieved 2014-07-06.
- ^ IBM. «SBCS code page information document CPGID 01251». Archived from the original on 2014-07-14. Retrieved 2014-07-06.
- ^ Microsoft. «Windows 1252». Archived from the original on 2013-05-04. Retrieved 2014-07-06.
- ^ IBM. «SBCS code page information document CPGID 01252». Archived from the original on 2014-07-14. Retrieved 2014-07-06.
- ^ Microsoft. «Windows 1253». Archived from the original on 2014-07-14. Retrieved 2014-07-06.
- ^ IBM. «SBCS code page information document CPGID 01253». Archived from the original on 2014-07-14. Retrieved 2014-07-06.
- ^ Microsoft. «Windows 1254». Archived from the original on 2014-07-14. Retrieved 2014-07-06.
- ^ IBM. «SBCS code page information document CPGID 01254». Archived from the original on 2014-07-14. Retrieved 2014-07-06.
- ^ Microsoft. «Windows 1255». Archived from the original on 2014-07-14. Retrieved 2014-07-06.
- ^ IBM. «SBCS code page information document CPGID 01255». Archived from the original on 2014-07-14. Retrieved 2014-07-06.
- ^ Microsoft. «Windows 1256». Archived from the original on 2014-07-14. Retrieved 2014-07-06.
- ^ IBM. «SBCS code page information document CPGID 01256». Archived from the original on 2014-07-14. Retrieved 2014-07-06.
- ^ Microsoft. «Windows 1257». Archived from the original on 2013-03-16. Retrieved 2014-07-06.
- ^ IBM. «SBCS code page information document CPGID 01257». Archived from the original on 2014-07-14. Retrieved 2014-07-06.
- ^ Microsoft. «Windows 1258». Archived from the original on 2013-10-25. Retrieved 2014-07-06.
- ^ IBM. «SBCS code page information document CPGID 01258». Archived from the original on 2014-07-14. Retrieved 2014-07-06.
- ^ IBM. «SBCS code page information document — CPGID 00437». Archived from the original on 2016-06-09. Retrieved 2014-07-04.
- ^ «IBM-943 and IBM-932». IBM Knowledge Center. IBM. Archived from the original on 2018-08-18. Retrieved 2020-07-08.
- ^ «Converter Explorer: ibm-1373_P100-2002». ICU Demonstration. International Components for Unicode. Archived from the original on 2021-05-26. Retrieved 2020-06-27.
- ^ «Coded character set identifiers – CCSID 5471». IBM Globalization. IBM. Archived from the original on 2014-11-29.
- ^ Julliard, Alexandre. «dump_krwansung_codepage: build Korean Wansung table from the KSX1001 file». make_unicode: Generate code page .c files from ftp.unicode.org descriptions. Wine Project. Archived from the original on 2021-05-26. Retrieved 2021-03-14.
- ^ IBM. «SBCS code page information document — CPGID 00037». Archived from the original on 2014-07-14. Retrieved 2014-07-04.
- ^ Steele, Shawn (2005-09-12). «Code Page 21027 «Extended/Ext Alpha Lowercase»«. MSDN. Archived from the original on 2019-04-06. Retrieved 2019-04-06.
- ^ a b c d e «Code Page Identifiers». docs.microsoft.com. Archived from the original on 2019-04-07. Retrieved 2019-04-07.
- ^ a b c d e «Code Page Identifiers». Microsoft Developer Network. Microsoft. 2014. Archived from the original on 2016-06-19. Retrieved 2016-06-19.
- ^ a b c d e «Web Encodings — Internet Explorer — Encodings». WHATWG Wiki. 2012-10-23. Archived from the original on 2016-06-20. Retrieved 2016-06-20.
- ^ Foller, Antonin (2014) [2011]. «Western European (IA5) encoding — Windows charsets». WUtils.com — Online web utility and help. Motobit Software. Archived from the original on 2016-06-20. Retrieved 2016-06-20.
- ^ Foller, Antonin (2014) [2011]. «German (IA5) encoding – Windows charsets». WUtils.com – Online web utility and help. Motobit Software. Archived from the original on 2016-06-20. Retrieved 2016-06-20.
- ^ Foller, Antonin (2014) [2011]. «Swedish (IA5) encoding — Windows charsets». WUtils.com — Online web utility and help. Motobit Software. Archived from the original on 2016-06-20. Retrieved 2016-06-20.
- ^ Foller, Antonin (2014) [2011]. «Norwegian (IA5) encoding — Windows charsets». WUtils.com — Online web utility and help. Motobit Software. Archived from the original on 2016-06-20. Retrieved 2016-06-20.
- ^ Foller, Antonin (2014) [2011]. «US-ASCII encoding — Windows charsets». WUtils.com — Online web utility and help. Motobit Software. Archived from the original on 2016-06-20. Retrieved 2016-06-20.
- ^ Nechayev, Valentin (2013) [2001]. «Review of 8-bit Cyrillic encodings universe». Archived from the original on 2016-12-05. Retrieved 2016-12-05.
External links[edit]
- National Language Support (NLS) API Reference. Table showing ANSI and OEM codepages per language (from web-archive since Microsoft removed the original page)
- IANA Charset Name Registrations
- Unicode mapping table for Windows code pages
- Unicode mappings of windows code pages with «best fit»