UTF-8 code page number in Windows

From Wikipedia, the free encyclopedia

Windows code pages are sets of characters or code pages (known as character encodings in other operating systems) used in Microsoft Windows from the 1980s and 1990s. Windows code pages were gradually superseded when Unicode was implemented in Windows,^{[citation needed]} although they are still supported both within Windows and other platforms, and still apply when Alt code shortcuts are used.

There are two groups of system code pages in Windows systems: OEM and Windows-native ("ANSI") code pages. (ANSI is the American National Standards Institute.) Code pages in both of these groups are extended ASCII code pages. Additional code pages are supported by standard Windows conversion routines, but are not used as either type of system code page.

ANSI code page

Windows-125x series

Alias(es)	ANSI (misnomer)
Standard	WHATWG Encoding Standard
Extends	US-ASCII
Preceded by	ISO 8859
Succeeded by	Unicode UTF-16 (in Win32 API)

ANSI code pages (officially called "Windows code pages"^[1] after Microsoft accepted that the former term is a misnomer^[2]) are used for native non-Unicode (that is, byte-oriented) applications using a graphical user interface on Windows systems. The term "ANSI" is a misnomer because these Windows code pages do not comply with any ANSI (American National Standards Institute) standard; code page 1252 was based on an early ANSI draft that became the international standard ISO 8859-1,^[2] which adds a further 32 control codes and space for 96 printable characters. Among other differences, Windows code pages allocate printable characters to the supplementary control code space, making them at best illegible to standards-compliant operating systems.

Most legacy "ANSI" code pages have code page numbers in the pattern 125x. However, 874 (Thai) and the East Asian multi-byte "ANSI" code pages (932, 936, 949, 950), all of which are also used as OEM code pages, are numbered to match IBM encodings, none of which are identical to the Windows encodings (although most are similar). While code page 1258 is also used as an OEM code page, it is original to Microsoft rather than an extension to an existing encoding. IBM has assigned its own, different numbers to Microsoft’s variants; these are given for reference in the lists below where applicable.

All of the 125x Windows code pages, as well as 874 and 936, are labelled by the Internet Assigned Numbers Authority (IANA) as "Windows-number", although "Windows-936" is treated as a synonym for "GBK". Windows code page 932 is instead labelled as "Windows-31J".^[3]

ANSI Windows code pages, and especially the code page 1252, were so called since they were purportedly based on drafts submitted or intended for ANSI. However, ANSI and ISO have not standardized any of these code pages. Instead they are either:^[2]

Supersets of the standard sets such as those of ISO 8859 and the various national standards (like Windows-1252 vs. ISO-8859-1),
Major modifications of these (making them incompatible to various degrees, like Windows-1250 vs. ISO-8859-2)
Having no parallel encoding (like Windows-1257 vs. ISO-8859-4; ISO-8859-13 was introduced much later). Also, Windows-1251 follows neither the ISO-standardised ISO-8859-5 nor the then-prevailing KOI-8.

Microsoft assigned about twelve of the typography and business characters (including notably, the euro sign, €) in CP1252 to the code points 0x80–0x9F that, in ISO 8859, are assigned to C1 control codes. These assignments are also present in many other ANSI/Windows code pages at the same code-points. Windows did not use the C1 control codes, so this decision had no direct effect on Windows users. However, if included in a file transferred to a standards-compliant platform like Unix or MacOS, the information was invisible and potentially disruptive.^{[citation needed]}
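The difference in the 0x80–0x9F range can be observed with Python's built-in codecs. This is a minimal sketch, assuming Python's `cp1252` and `latin-1` codecs as stand-ins for Windows-1252 and ISO 8859-1; the variable names are illustrative only.

```python
import unicodedata

# Three bytes from the 0x80-0x9F range.
data = bytes([0x80, 0x93, 0x94])

# Windows-1252 assigns printable characters to these bytes.
as_cp1252 = data.decode("cp1252")

# ISO 8859-1 (as implemented by Python's latin-1 codec) maps the same
# bytes to the C1 control codes U+0080, U+0093 and U+0094.
as_latin1 = data.decode("latin-1")

print(as_cp1252)
print([f"U+{ord(c):04X} ({unicodedata.category(c)})" for c in as_latin1])
```

Under Windows-1252 the bytes decode to the euro sign and a pair of curly quotation marks, while under ISO 8859-1 they decode to control characters (Unicode category `Cc`) that most software does not display.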

OEM code page

The OEM code pages (original equipment manufacturer) are used by Win32 console applications, and by virtual DOS, and can be considered a holdover from DOS and the original IBM PC architecture. A separate suite of code pages was implemented not only due to compatibility, but also because the fonts of VGA (and descendant) hardware suggest encoding of line-drawing characters to be compatible with code page 437. Most OEM code pages share many code points, particularly for non-letter characters, with the second (non-ASCII) half of CP437.

A typical OEM code page, in its second half, does not resemble any ANSI/Windows code page even roughly. Nevertheless, two single-byte, fixed-width code pages (874 for Thai and 1258 for Vietnamese) and four multibyte CJK code pages (932, 936, 949, 950) are used as both OEM and ANSI code pages. Code page 1258 uses combining diacritics, as Vietnamese requires more than 128 letter-diacritic combinations. This is in contrast to VISCII, which replaces some of the C0 (i.e. ASCII) control codes.

History

Initially, computer systems and system programming languages did not make a distinction between characters and bytes: for the segmental scripts used in most of Africa, the Americas, southern and south-east Asia, the Middle East and Europe, a character needs just one byte, but two or more bytes are needed for the ideographic sets used in the rest of the world. This subsequently led to much confusion. Microsoft software and systems prior to the Windows NT line are examples of this, because they use the OEM and ANSI code pages that do not make the distinction.

Since the late 1990s, software and systems have adopted Unicode as their preferred storage format; this trend has been reinforced by the widespread adoption of XML, which defaults to UTF-8 but also provides a mechanism for labelling the encoding used.^[4] All current Microsoft products and application program interfaces use Unicode internally,^{[citation needed]} but some applications continue to use the default encoding of the computer’s locale when reading and writing text data to files or standard output.^{[citation needed]} Therefore, files may still be encountered that are legible and intelligible in one part of the world but unintelligible mojibake in another.

UTF-8, UTF-16

Microsoft adopted a Unicode encoding for all its operating systems from Windows NT onwards: first the now-obsolete UCS-2, which was then Unicode’s only encoding, and later UTF-16. Windows additionally supports UTF-8 (code page 65001, CP_UTF8) as a system code page since Windows 10 version 1803.^[5] UTF-16 encodes every Unicode character in the Basic Multilingual Plane (BMP) as a single 16-bit code unit, while the remaining characters (e.g. emoji) are encoded as a pair of 16-bit code units (four bytes). The rest of the industry (Unix-like systems and the web), and now Microsoft as well, chose UTF-8, which uses one byte for the 7-bit ASCII character set, two or three bytes for other characters in the BMP, and four bytes for the remainder.
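The size differences between the two encodings can be checked directly. The following is a small sketch using Python's standard `utf-16-le` and `utf-8` codecs; the sample characters and the `sizes` dictionary are chosen only for illustration.

```python
# One character from ASCII, two from elsewhere in the BMP, one outside the BMP.
samples = ["A", "И", "€", "😀"]

sizes = {}
for ch in samples:
    utf16 = ch.encode("utf-16-le")  # little-endian UTF-16, no byte order mark
    utf8 = ch.encode("utf-8")
    sizes[ch] = (len(utf16), len(utf8))
    print(f"U+{ord(ch):04X}: UTF-16 {len(utf16)} bytes, UTF-8 {len(utf8)} bytes")
```

BMP characters always take two bytes in UTF-16 but one to three bytes in UTF-8; the emoji, which lies outside the BMP, takes four bytes in both.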

List

The following Windows code pages exist:

Windows-125x series

These nine code pages are all extended ASCII 8-bit SBCS encodings, and were designed by Microsoft for use as ANSI codepages on Windows. They are commonly known by their IANA-registered^[6] names as windows-<number>, but are also sometimes called cp<number>, «cp» for «code page». They are all used as ANSI code pages; Windows-1258 is also used as an OEM code page.

The Windows-125x series includes nine of the ANSI code pages, and mostly covers scripts from Europe and West Asia with the addition of Vietnam. System encodings for Thai and for East Asian languages were numbered to match similar IBM code pages and are used as both ANSI and OEM code pages; these are covered in following sections.

ID	Description	Relationship to ISO 8859 or other established encodings
1250^[7]^[8]	Latin 2 / Central European	Similar to ISO-8859-2 but moves several characters, including multiple letters.
1251^[9]^[10]	Cyrillic	Incompatible with both ISO-8859-5 and KOI-8.
1252^[11]^[12]	Latin 1 / Western European	Superset of ISO-8859-1 (without C1 controls). Letter repertoire accordingly similar to CP850.
1253^[13]^[14]	Greek	Similar to ISO 8859-7 but moves several characters, including a letter.
1254^[15]^[16]	Turkish	Superset of ISO 8859-9 (without C1 controls).
1255^[17]^[18]	Hebrew	Almost a superset of ISO 8859-8, but with two incompatible punctuation changes.
1256^[19]^[20]	Arabic	Not compatible with ISO 8859-6; rather, OEM Code page 708 is an ISO 8859-6 (ASMO 708) superset.
1257^[21]^[22]	Baltic	Not ISO 8859-4; the later ISO 8859-13 is closely related, but with some differences in available punctuation.
1258^[23]^[24]	Vietnamese (also OEM)	Not related to VSCII or VISCII, uses fewer base characters with combining diacritics.

DOS code pages

These are also ASCII-based. Most of these are included for use as OEM code pages; code page 874 is also used as an ANSI code page.

437 – IBM PC US, 8-bit SBCS extended ASCII.^[25] Known as OEM-US, the encoding of the primary built-in font of VGA graphics cards.
708 – Arabic, extended ISO 8859-6 (ASMO 708)
720 – Arabic, retaining box drawing characters in their usual locations
737 – «MS-DOS Greek». Retains all box drawing characters. More popular than 869.
775 – «MS-DOS Baltic Rim»
850 – «MS-DOS Latin 1». Full (re-arranged) repertoire of ISO 8859-1.
852 – «MS-DOS Latin 2»
855 – «MS-DOS Cyrillic». Mainly used for South Slavic languages. Includes (re-arranged) repertoire of ISO-8859-5. Not to be confused with cp866.
857 – «MS-DOS Turkish»
858 – Western European with euro sign
860 – «MS-DOS Portuguese»
861 – «MS-DOS Icelandic»
862 – «MS-DOS Hebrew»
863 – «MS-DOS French Canada»
864 – Arabic
865 – «MS-DOS Nordic»
866 – «MS-DOS Cyrillic Russian», cp866. Sole purely OEM code page (rather than ANSI or both) included as a legacy encoding in WHATWG Encoding Standard for HTML5.
869 – «MS-DOS Greek 2», IBM869. Full (re-arranged) repertoire of ISO 8859-7.
874 – Thai, also used as the ANSI code page, extends ISO 8859-11 (and therefore TIS-620) with a few additional characters from Windows-1252. Corresponds to IBM code page 1162 (IBM-874 is similar but has different extensions).

East Asian multi-byte code pages

These often differ from the IBM code pages of the same number: code pages 932, 949 and 950 only partly match the IBM code pages of the same number, while the number 936 was used by IBM for another Simplified Chinese encoding which is now deprecated. Windows-951, which is part of a kludge, is unrelated to IBM-951. IBM equivalent code pages are given in the fourth column. Code pages 932, 936, 949 and 950/951 are used as both ANSI and OEM code pages on the locales in question.

ID	Language	Encoding	IBM Equivalent	Difference from IBM CCSID of same number	Use
932	Japanese	Shift JIS (Microsoft variant)	943^[26]	IBM-932 is also Shift JIS, has fewer extensions (but those extensions it has are in common), and swaps some variant Chinese characters (itaiji) for interoperability with earlier editions of JIS C 6226.	ANSI/OEM (Japan)
936	Chinese (simplified)	GBK	1386	IBM-936 is a different Simplified Chinese encoding with a different encoding method, which has been deprecated since 1993.	ANSI/OEM (PRC, Singapore)
949	Korean	Unified Hangul Code	1363	IBM-949 is also an EUC-KR superset, but with different (colliding) extensions.	ANSI/OEM (Republic of Korea)
950	Chinese (traditional)	Big5 (Microsoft variant)	1373^[27]	IBM-950 is also Big5, but includes a different subset of the ETEN extensions, adds further extensions with an expanded trail byte range, and lacks the Euro.	ANSI/OEM (Taiwan, Hong Kong)
951	Chinese (traditional) including Cantonese	Big5-HKSCS (2001 ed.)	5471^[28]	IBM-951 is the double-byte plane from IBM-949 (see above), and unrelated to Microsoft’s internal use of the number 951.	ANSI/OEM (Hong Kong, 98/NT4/2000/XP with HKSCS patch)

A few further multiple-byte code pages are supported for decoding or encoding using operating system libraries, but not used as either sort of system encoding in any locale.

ID	IBM Equivalent	Language	Encoding	Use
1361	—	Korean	Johab (KS C 5601-1992 annex 3)	Conversion
20000	—	Chinese (traditional)	An encoding of CNS 11643	Conversion
20001	—	Chinese (traditional)	TCA	Conversion
20002	—	Chinese (traditional)	Big5 (ETEN variant)	Conversion
20003	938	Chinese (traditional)	IBM 5550	Conversion
20004	—	Chinese (traditional)	Teletext	Conversion
20005	—	Chinese (traditional)	Wang	Conversion
20932	954 (roughly)	Japanese	EUC-JP	Conversion
20936	5479	Chinese (simplified)	GB 2312	Conversion
20949, 51949	970	Korean	Wansung (8-bit with ASCII, i.e. EUC-KR)^[29]	Conversion

EBCDIC code pages

37 – IBM EBCDIC US-Canada, 8-bit SBCS^[30]
500 – Latin 1
870 – IBM870
875 – cp875
1026 – EBCDIC Turkish
1047 – IBM01047 – Latin 1
1140 – IBM01140
1141 – IBM01141
1142 – IBM01142
1143 – IBM01143
1144 – IBM01144
1145 – IBM01145
1146 – IBM01146
1147 – IBM01147
1148 – IBM01148
1149 – IBM01149
20273 – EBCDIC Germany
20277 – EBCDIC Denmark/Norway
20278 – EBCDIC Finland/Sweden
20280 – EBCDIC Italy
20284 – EBCDIC Latin America/Spain
20285 – EBCDIC United Kingdom
20290 – EBCDIC Japanese
20297 – EBCDIC France
20420 – EBCDIC Arabic
20423 – EBCDIC Greek
20424 – EBCDIC Hebrew
20833 – x-EBCDIC-KoreanExtended – EBCDIC Korean Extended
20838 – EBCDIC Thai
20924 – IBM00924 – IBM EBCDIC Latin 1/Open System (1047 + Euro symbol)
20871 – EBCDIC Icelandic
20880 – EBCDIC Cyrillic
20905 – EBCDIC Turkish
21025 – EBCDIC Cyrillic
21027 – Japanese EBCDIC (incomplete,^[31] deprecated)^[32]

Unicode-related code pages

1200 – Unicode (BMP of ISO 10646, UTF-16LE). Available only to managed applications.^[32]
1201 – Unicode (UTF-16BE). Available only to managed applications.^[32]
12000 – UTF-32. Available only to managed applications.^[32]
12001 – UTF-32. Big-endian. Available only to managed applications.^[32]
65000 – Unicode (UTF-7)
65001 – Unicode (UTF-8)

Macintosh compatibility code pages

10000 – Apple Macintosh Roman
10001 – Apple Macintosh Japanese
10002 – Apple Macintosh Chinese (traditional) (BIG-5)
10003 – Apple Macintosh Korean
10004 – Apple Macintosh Arabic
10005 – Apple Macintosh Hebrew
10006 – Apple Macintosh Greek
10007 – Apple Macintosh Cyrillic
10008 – Apple Macintosh Chinese (simplified) (GB 2312)
10010 – Apple Macintosh Romanian
10017 – Apple Macintosh Ukrainian
10021 – Apple Macintosh Thai
10029 – Apple Macintosh Roman II / Central Europe
10079 – Apple Macintosh Icelandic
10081 – Apple Macintosh Turkish
10082 – Apple Macintosh Croatian

ISO 8859 code pages

28591 – ISO-8859-1 – Latin-1 (IBM equivalent: 819)
28592 – ISO-8859-2 – Latin-2
28593 – ISO-8859-3 – Latin-3 or South European
28594 – ISO-8859-4 – Latin-4 or North European
28595 – ISO-8859-5 – Latin/Cyrillic
28596 – ISO-8859-6 – Latin/Arabic
28597 – ISO-8859-7 – Latin/Greek
28598 – ISO-8859-8 – Latin/Hebrew
28599 – ISO-8859-9 – Latin-5 or Turkish
28600 – ISO-8859-10 – Latin-6
28601 – ISO-8859-11 – Latin/Thai
28602 – ISO-8859-12 – reserved for Latin/Devanagari but abandoned (not supported)
28603 – ISO-8859-13 – Latin-7 or Baltic Rim
28604 – ISO-8859-14 – Latin-8 or Celtic
28605 – ISO-8859-15 – Latin-9
28606 – ISO-8859-16 – Latin-10 or South-Eastern European
38596 – ISO-8859-6-I – Latin/Arabic (logical bidirectional order)
38598 – ISO-8859-8-I – Latin/Hebrew (logical bidirectional order)

ITU-T code pages

20105 – 7-bit IA5 IRV (Western European)^[33]^[34]^[35]
20106 – 7-bit IA5 German (DIN 66003)^[33]^[34]^[36]
20107 – 7-bit IA5 Swedish (SEN 850200 C)^[33]^[34]^[37]
20108 – 7-bit IA5 Norwegian (NS 4551-2)^[33]^[34]^[38]
20127 – 7-bit US-ASCII^[33]^[34]^[39]
20261 – T.61 (T.61-8bit)
20269 – ISO-6937

KOI8 code pages

20866 – Russian – KOI8-R
21866 – Ukrainian – KOI8-U (or KOI8-RU in some versions)^[40]

Problems arising from the use of code pages

Microsoft strongly recommends using Unicode in modern applications, but many applications or data files still depend on the legacy code pages.

Programs need to know what code page to use in order to display the contents of (pre-Unicode) files correctly. If a program uses the wrong code page it may show text as mojibake.
The code page in use may differ between machines, so (pre-Unicode) files created on one machine may be unreadable on another.
Data is often improperly tagged with the code page, or not tagged at all, making determination of the correct code page to read the data difficult.
These Microsoft code pages differ to various degrees from some of the standards and other vendors’ implementations. This isn’t a Microsoft issue per se, as it happens to all vendors, but the lack of consistency makes interoperability with other systems unreliable in some cases.
The use of code pages limits the set of characters that may be used.
Characters expressed in an unsupported code page may be converted to question marks (?) or other replacement characters, or to a simpler version (such as removing accents from a letter). In either case, the original character may be lost.
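Two of the problems listed above, mojibake from a wrong code page and lossy conversion to question marks, can be reproduced with Python's built-in codecs. This is an illustrative sketch only; the sample word and the choice of Windows-1251 and Windows-1252 are assumptions made for the example.

```python
text = "Привет"

# Written on a machine whose ANSI code page is Windows-1251 (Cyrillic).
raw = text.encode("cp1251")

# Read on a machine that assumes Windows-1252 (Western European):
# every byte is valid there, so the result is silent mojibake.
wrong = raw.decode("cp1252")
print(wrong)

# The bytes themselves are intact, so decoding with the right code page
# still recovers the original text.
recovered = wrong.encode("cp1252").decode("cp1251")

# Converting to a code page that lacks the characters is lossy:
# with errors="replace", each unsupported character becomes "?".
lossy = text.encode("cp1252", errors="replace")
print(lossy)
```

The first case is recoverable as long as the raw bytes survive; the second is not, because the original characters are discarded during conversion.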

References

^ «Code Pages». 2016-03-07. Archived from the original on 2016-03-07. Retrieved 2021-05-26.
^ ^a ^b ^c «Glossary of Terms Used on this Site». December 8, 2018. Archived from the original on 2018-12-08. The term «ANSI» as used to signify Windows code pages is a historical reference, but is nowadays a misnomer that continues to persist in the Windows community. The source of this comes from the fact that the Windows code page 1252 was originally based on an ANSI draft—which became International Organization for Standardization (ISO) Standard 8859-1. «ANSI applications» are usually a reference to non-Unicode or code page–based applications.
^ «Character Sets». www.iana.org. Archived from the original on 2021-05-25. Retrieved 2021-05-26.
^ «Extensible Markup Language (XML) 1.1 (Second Edition): Character encodings». W3C. 29 September 2006. Archived from the original on 19 April 2021. Retrieved 5 October 2020.
^ hylom (2017-11-14). «Windows 10のInsider PreviewでシステムロケールをUTF-8にするオプションが追加される» [The option to make UTF-8 the system locale added in Windows 10 Insider Preview]. スラド (in Japanese). Archived from the original on 2018-05-11. Retrieved 2018-05-10.
^ «Character Sets». IANA. Archived from the original on 2016-12-03. Retrieved 2019-04-07.
^ Microsoft. «Windows 1250». Archived from the original on 2014-07-14. Retrieved 2014-07-06.
^ IBM. «SBCS code page information document CPGID 01250». Archived from the original on 2014-07-14. Retrieved 2014-07-06.
^ Microsoft. «Windows 1251». Archived from the original on 2014-07-14. Retrieved 2014-07-06.
^ IBM. «SBCS code page information document CPGID 01251». Archived from the original on 2014-07-14. Retrieved 2014-07-06.
^ Microsoft. «Windows 1252». Archived from the original on 2013-05-04. Retrieved 2014-07-06.
^ IBM. «SBCS code page information document CPGID 01252». Archived from the original on 2014-07-14. Retrieved 2014-07-06.
^ Microsoft. «Windows 1253». Archived from the original on 2014-07-14. Retrieved 2014-07-06.
^ IBM. «SBCS code page information document CPGID 01253». Archived from the original on 2014-07-14. Retrieved 2014-07-06.
^ Microsoft. «Windows 1254». Archived from the original on 2014-07-14. Retrieved 2014-07-06.
^ IBM. «SBCS code page information document CPGID 01254». Archived from the original on 2014-07-14. Retrieved 2014-07-06.
^ Microsoft. «Windows 1255». Archived from the original on 2014-07-14. Retrieved 2014-07-06.
^ IBM. «SBCS code page information document CPGID 01255». Archived from the original on 2014-07-14. Retrieved 2014-07-06.
^ Microsoft. «Windows 1256». Archived from the original on 2014-07-14. Retrieved 2014-07-06.
^ IBM. «SBCS code page information document CPGID 01256». Archived from the original on 2014-07-14. Retrieved 2014-07-06.
^ Microsoft. «Windows 1257». Archived from the original on 2013-03-16. Retrieved 2014-07-06.
^ IBM. «SBCS code page information document CPGID 01257». Archived from the original on 2014-07-14. Retrieved 2014-07-06.
^ Microsoft. «Windows 1258». Archived from the original on 2013-10-25. Retrieved 2014-07-06.
^ IBM. «SBCS code page information document CPGID 01258». Archived from the original on 2014-07-14. Retrieved 2014-07-06.
^ IBM. «SBCS code page information document — CPGID 00437». Archived from the original on 2016-06-09. Retrieved 2014-07-04.
^ «IBM-943 and IBM-932». IBM Knowledge Center. IBM. Archived from the original on 2018-08-18. Retrieved 2020-07-08.
^ «Converter Explorer: ibm-1373_P100-2002». ICU Demonstration. International Components for Unicode. Archived from the original on 2021-05-26. Retrieved 2020-06-27.
^ «Coded character set identifiers – CCSID 5471». IBM Globalization. IBM. Archived from the original on 2014-11-29.
^ Julliard, Alexandre. «dump_krwansung_codepage: build Korean Wansung table from the KSX1001 file». make_unicode: Generate code page .c files from ftp.unicode.org descriptions. Wine Project. Archived from the original on 2021-05-26. Retrieved 2021-03-14.
^ IBM. «SBCS code page information document — CPGID 00037». Archived from the original on 2014-07-14. Retrieved 2014-07-04.
^ Steele, Shawn (2005-09-12). "Code Page 21027 'Extended/Ext Alpha Lowercase'". MSDN. Archived from the original on 2019-04-06. Retrieved 2019-04-06.
^ ^a ^b ^c ^d ^e «Code Page Identifiers». docs.microsoft.com. Archived from the original on 2019-04-07. Retrieved 2019-04-07.
^ ^a ^b ^c ^d ^e «Code Page Identifiers». Microsoft Developer Network. Microsoft. 2014. Archived from the original on 2016-06-19. Retrieved 2016-06-19.
^ ^a ^b ^c ^d ^e «Web Encodings — Internet Explorer — Encodings». WHATWG Wiki. 2012-10-23. Archived from the original on 2016-06-20. Retrieved 2016-06-20.
^ Foller, Antonin (2014) [2011]. «Western European (IA5) encoding — Windows charsets». WUtils.com — Online web utility and help. Motobit Software. Archived from the original on 2016-06-20. Retrieved 2016-06-20.
^ Foller, Antonin (2014) [2011]. «German (IA5) encoding – Windows charsets». WUtils.com – Online web utility and help. Motobit Software. Archived from the original on 2016-06-20. Retrieved 2016-06-20.
^ Foller, Antonin (2014) [2011]. «Swedish (IA5) encoding — Windows charsets». WUtils.com — Online web utility and help. Motobit Software. Archived from the original on 2016-06-20. Retrieved 2016-06-20.
^ Foller, Antonin (2014) [2011]. «Norwegian (IA5) encoding — Windows charsets». WUtils.com — Online web utility and help. Motobit Software. Archived from the original on 2016-06-20. Retrieved 2016-06-20.
^ Foller, Antonin (2014) [2011]. «US-ASCII encoding — Windows charsets». WUtils.com — Online web utility and help. Motobit Software. Archived from the original on 2016-06-20. Retrieved 2016-06-20.
^ Nechayev, Valentin (2013) [2001]. «Review of 8-bit Cyrillic encodings universe». Archived from the original on 2016-12-05. Retrieved 2016-12-05.

External links

National Language Support (NLS) API Reference. Table showing ANSI and OEM codepages per language (from web-archive since Microsoft removed the original page)
IANA Charset Name Registrations
Unicode mapping table for Windows code pages
Unicode mappings of windows code pages with «best fit»


UTF-8

Standard	Unicode Standard
Classification	Unicode Transformation Format, extended ASCII, variable-length encoding
Extends	ASCII
Transforms / Encodes	ISO/IEC 10646 (Unicode)
Preceded by	UTF-1

UTF-8 is a variable-length character encoding standard used for electronic communication. Defined by the Unicode Standard, the name is derived from Unicode (or Universal Coded Character Set) Transformation Format – 8-bit.^[1]

UTF-8 is capable of encoding all 1,112,064^[a] valid character code points in Unicode using one to four one-byte (8-bit) code units. Code points with lower numerical values, which tend to occur more frequently, are encoded using fewer bytes. It was designed for backward compatibility with ASCII: the first 128 characters of Unicode, which correspond one-to-one with ASCII, are encoded using a single byte with the same binary value as ASCII, so that valid ASCII text is valid UTF-8-encoded Unicode as well.

UTF-8 was designed as a superior alternative to UTF-1, a proposed variable-length encoding with partial ASCII compatibility which lacked some features including self-synchronization and fully ASCII-compatible handling of characters such as slashes. Ken Thompson and Rob Pike produced the first implementation for the Plan 9 operating system in September 1992.^[2]^[3] This led to its adoption by X/Open as its specification for FSS-UTF,^[4] which would first be officially presented at USENIX in January 1993^[5] and subsequently adopted by the Internet Engineering Task Force (IETF) in RFC 2277 (BCP 18)^[6] for future internet standards work, replacing Single Byte Character Sets such as Latin-1 in older RFCs.

UTF-8 results in fewer internationalization issues^[7]^[8] than any alternative text encoding, and it has been implemented in all modern operating systems, including Microsoft Windows, and standards such as JSON, where, as is increasingly the case, it is the only allowed form of Unicode.

UTF-8 is the dominant encoding for the World Wide Web (and internet technologies), accounting for 98.0% of all web pages, 99.0% of the top 10,000 pages, and up to 100% for many languages, as of 2023.^[9] Virtually all countries and languages have 95% or more use of UTF-8 encodings on the web.

Naming

The official name for the encoding is UTF-8, the spelling used in all Unicode Consortium documents. Most standards officially list it in upper case as well, but all that do are also case-insensitive and utf-8 is often used in code.^{[citation needed]}

Some other spellings may also be accepted by standards, e.g. web standards (which include CSS, HTML, XML, and HTTP headers) explicitly allow utf8 (and disallow «unicode») and many aliases for encodings.^[10] Spellings with a space e.g. «UTF 8» should not be used. The official Internet Assigned Numbers Authority also lists csUTF8 as the only alias,^[11] which is rarely used.

In Windows, UTF-8 is codepage 65001^[12] (i.e. CP_UTF8 in source code).
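As a quick check of the code page number, Python accepts `cp65001` as a codec name. The sketch below assumes Python 3.8 or later, where `cp65001` is documented as an alias of the `utf_8` codec on every platform; earlier versions provided it only on Windows.

```python
import codecs

# Look up the codec registered under the Windows code page name.
info = codecs.lookup("cp65001")
print(info.name)

# Encoding through the alias gives ordinary UTF-8 bytes.
encoded = "€".encode("cp65001")
print(encoded.hex(" ").upper())
```

On Python 3.8+ the lookup resolves to the standard UTF-8 codec, so text encoded under either name is byte-for-byte identical.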

In MySQL, UTF-8 is called utf8mb4^[13] (with utf8mb3, and its alias utf8, being a subset encoding for characters in the Basic Multilingual Plane^[14]). In HP PCL, the Symbol-ID for UTF-8 is 18N.^[15]

In Oracle Database (since version 9.0), AL32UTF8^[16] means UTF-8. See also CESU-8, a near-synonym of UTF-8 that should rarely be used.

UTF-8-BOM and UTF-8-NOBOM are sometimes used for text files which contain or do not contain a byte order mark (BOM), respectively.^{[citation needed]} In Japan especially, UTF-8 encoding without a BOM is sometimes called UTF-8N.^[17]^[18]

Encoding

UTF-8 encodes code points in one to four bytes, depending on the value of the code point. In the following table, the x characters are replaced by the bits of the code point:

Code point ↔ UTF-8 conversion

First code point	Last code point	Byte 1	Byte 2	Byte 3	Byte 4
U+0000	U+007F	0xxxxxxx
U+0080	U+07FF	110xxxxx	10xxxxxx
U+0800	U+FFFF	1110xxxx	10xxxxxx	10xxxxxx
U+10000	^[b]U+10FFFF	11110xxx	10xxxxxx	10xxxxxx	10xxxxxx

The first 128 code points (ASCII) need one byte. The next 1,920 code points need two bytes to encode, which covers the remainder of almost all Latin-script alphabets, and also IPA extensions, Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac, Thaana and N’Ko alphabets, as well as Combining Diacritical Marks. Three bytes are needed for the remaining 61,440 code points of the Basic Multilingual Plane (BMP), including most Chinese, Japanese and Korean characters. Four bytes are needed for the 1,048,576 code points in the other planes of Unicode, which include emoji (pictographic symbols), less common CJK characters, various historic scripts, and mathematical symbols.

A «character» can take more than 4 bytes because it is made of more than one code point. For instance a national flag character takes 8 bytes since it is «constructed from a pair of Unicode scalar values» both from outside the BMP.^[19]^[c]
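The byte counts described above follow directly from the code point ranges in the table. The helper `utf8_length` below is a hypothetical function written for this sketch (it is not part of any library), and it does not reject the surrogate range U+D800–U+DFFF, which cannot be encoded.

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 needs for a code point (illustrative helper)."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    return 4

# Compare the helper with Python's own encoder for one code point per range.
for cp in (0x24, 0xA3, 0x20AC, 0x10348):
    print(f"U+{cp:04X}: {utf8_length(cp)} byte(s)")

# A flag is two regional indicator code points, each outside the BMP.
flag = "\U0001F1FA\U0001F1F8"
flag_bytes = len(flag.encode("utf-8"))
print(flag_bytes)
```

The flag is displayed as one "character" but consists of two code points of four bytes each, which is why it occupies eight bytes.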

Encoding process

The following steps show how bits from the code point are distributed among the UTF-8 bytes, and which additional bits are added by the UTF-8 encoding process.

The Unicode code point for the euro sign € is U+20AC.
As this code point lies between U+0800 and U+FFFF, this will take three bytes to encode.
Hexadecimal 20AC is binary 0010 0000 1010 1100. The two leading zeros are added because a three-byte encoding needs exactly sixteen bits from the code point.
Because the encoding will be three bytes long, its leading byte starts with three 1s, then a 0 (1110…)
The four most significant bits of the code point are stored in the remaining low order four bits of this byte (11100010), leaving 12 bits of the code point yet to be encoded (…0000 1010 1100).
All continuation bytes contain exactly six bits from the code point. So the next six bits of the code point are stored in the low order six bits of the next byte, and 10 is stored in the high order two bits to mark it as a continuation byte (so 10000010).
Finally the last six bits of the code point are stored in the low order six bits of the final byte, and again 10 is stored in the high order two bits (10101100).

The three bytes 11100010 10000010 10101100 can be more concisely written in hexadecimal, as E2 82 AC.

The following table summarizes this conversion, as well as others with different lengths in UTF-8.

UTF-8 encoding process

Character	Code point	Binary code point	Binary UTF-8	Hex UTF-8
$	U+0024	010 0100	00100100	24
£	U+00A3	000 1010 0011	11000010 10100011	C2 A3
И	U+0418	100 0001 1000	11010000 10011000	D0 98
ह	U+0939	0000 1001 0011 1001	11100000 10100100 10111001	E0 A4 B9
€	U+20AC	0010 0000 1010 1100	11100010 10000010 10101100	E2 82 AC
한	U+D55C	1101 0101 0101 1100	11101101 10010101 10011100	ED 95 9C
𐍈	U+10348	0 0001 0000 0011 0100 1000	11110000 10010000 10001101 10001000	F0 90 8D 88
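The bit manipulation walked through above can be written out by hand. The function `encode_utf8` below is a hypothetical name used for this sketch; in practice Python's built-in `str.encode("utf-8")` does the same job, and the sketch is compared against it.

```python
def encode_utf8(code_point: int) -> bytes:
    """Encode one code point as UTF-8 by hand (illustrative sketch)."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate halves cannot be encoded in UTF-8")
    if code_point <= 0x7F:
        # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

# The code points from the table above.
for cp in (0x24, 0xA3, 0x418, 0x939, 0x20AC, 0xD55C, 0x10348):
    hex_bytes = " ".join(f"{b:02X}" for b in encode_utf8(cp))
    print(f"U+{cp:04X} -> {hex_bytes}")
```

Each branch masks out six bits at a time for the continuation bytes (`& 0x3F`) and sets the marker bits with a bitwise OR, mirroring the `x` positions in the conversion table.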

Example

In these examples, ASCII characters are encoded as single bytes equal to their code points, while characters beyond ASCII are encoded as multi-byte sequences.

As an example, the Vietnamese phrase Mình nói tiếng Việt (𨉟呐㗂越, «I speak Vietnamese») is encoded as follows:

Character	M	ì	n	h		n	ó	i		t	i	ế	n	g		V	i	ệ	t
Code point	4D	EC	6E	68	20	6E	F3	69	20	74	69	1EBF	6E	67	20	56	69	1EC7	74
Hex UTF-8	4D	C3 AC	6E	68	20	6E	C3 B3	69	20	74	69	E1 BA BF	6E	67	20	56	69	E1 BB 87	74

Character	𨉟	呐	㗂	越
Code point	2825F	5450	35C2	8D8A
Hex UTF-8	F0 A8 89 9F	E5 91 90	E3 97 82	E8 B6 8A
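The byte values in the tables above can be reproduced with Python's built-in encoder. In this sketch the accented letters are written as escape sequences so that the precomposed code points from the table (U+00EC, U+00F3, U+1EBF, U+1EC7) are used regardless of how the source file is normalized.

```python
# "Mình nói tiếng Việt" with precomposed accented letters.
phrase = "M\u00ecnh n\u00f3i ti\u1ebfng Vi\u1ec7t"

# Show each character with its code point and its UTF-8 bytes.
for ch in phrase:
    hex_bytes = " ".join(f"{b:02X}" for b in ch.encode("utf-8"))
    print(f"{ch!r}\tU+{ord(ch):04X}\t{hex_bytes}")

encoded = phrase.encode("utf-8")
hex_dump = " ".join(f"{b:02X}" for b in encoded)
print(len(phrase), "characters,", len(encoded), "bytes")
```

The 19 characters occupy 25 bytes: the 15 ASCII characters take one byte each, `ì` and `ó` take two bytes each, and `ế` and `ệ` take three bytes each.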

Codepage layout

The following table summarizes the usage of UTF-8 code units (individual bytes or octets) by byte value. Bytes 00–7F are used only in single-byte codes; bytes 80–FF are continuation bytes, leading bytes, or invalid, as explained further in the legend below.

UTF-8

Byte range	Usage
00–7F	Single-byte (7-bit ASCII) code points
80–BF	Continuation bytes
C0–C1	Not valid in UTF-8
C2–DF	Leading bytes of 2-byte sequences
E0–EF	Leading bytes of 3-byte sequences
F0–F4	Leading bytes of 4-byte sequences
F5–FF	Not valid in UTF-8

7-bit (single-byte) code points. They must not be followed by a continuation byte.^[20]

Continuation bytes.^[21] The cell shows in hexadecimal the value of the 6 bits they add.^[d]

Leading bytes for a sequence of N bytes, which must be followed by exactly N−1 continuation bytes.^[22]

Leading bytes where not all arrangements of continuation bytes are valid. E0 and F0 could start overlong encodings. F4 can start code points greater than U+10FFFF. ED can start code points in the range U+D800–U+DFFF, which are invalid UTF-16 surrogate halves.^[23]

Bytes that do not appear in a valid UTF-8 sequence. C0 and C1 could be used only for an «overlong» encoding of a 1-byte character.^[24] F5 to FD are leading bytes of 4-byte or longer sequences that can only encode code points larger than U+10FFFF.^[23] FE and FF were never assigned any meaning.^[25]
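The byte classes described in this legend can be sketched as a small lookup function. This is an illustrative Python sketch only; the function names and the class labels ("single", "lead2" and so on) are made up for the example.

```python
def classify_byte(b):
    """Classify one byte value (0-255) by the role it can play in UTF-8."""
    if b <= 0x7F:
        return "single"         # 0xxxxxxx: 7-bit ASCII
    if b <= 0xBF:
        return "continuation"   # 10xxxxxx
    if b in (0xC0, 0xC1):
        return "invalid"        # could only start an overlong two-byte form
    if b <= 0xDF:
        return "lead2"          # 110xxxxx
    if b <= 0xEF:
        return "lead3"          # 1110xxxx
    if b <= 0xF4:
        return "lead4"          # 11110xxx, up to U+10FFFF
    return "invalid"            # F5..FF never appear


def continuation_payload(b):
    """Return the 6 payload bits that a continuation byte contributes."""
    return b & 0x3F


never_valid = [b for b in range(256) if classify_byte(b) == "invalid"]
print(len(never_valid), hex(continuation_payload(0x9D)))
```

The second helper mirrors the «+1D» label of cell 9D: masking off the two marker bits of 0x9D leaves 0x1D.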

Overlong encodings

In principle, it would be possible to inflate the number of bytes in an encoding by padding the code point with leading 0s. To encode the euro sign € from the above example in four bytes instead of three, it could be padded with leading 0s until it was 21 bits long (000 000010 000010 101100) and encoded as 11110000 10000010 10000010 10101100 (or F0 82 82 AC in hexadecimal). This is called an overlong encoding.

The standard specifies that the correct encoding of a code point uses only the minimum number of bytes required to hold the significant bits of the code point. Longer encodings are called overlong and are not valid UTF-8 representations of the code point. This rule maintains a one-to-one correspondence between code points and their valid encodings, so that there is a unique valid encoding for each code point. This ensures that string comparisons and searches are well-defined.
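The padding described above can be reproduced with a deliberately non-conforming encoder. The sketch below is in Python; encode_in_n_bytes and is_valid_utf8 are hypothetical helpers written for this illustration, and the validity check simply relies on Python's built-in strict decoder.

```python
def encode_in_n_bytes(cp, nbytes):
    """Pack code point cp into exactly nbytes (2 to 4) UTF-8-style bytes,
    padding with leading zero bits even when a shorter form exists."""
    lead_mark = {2: 0xC0, 3: 0xE0, 4: 0xF0}[nbytes]
    lead_bits = {2: 5, 3: 4, 4: 3}[nbytes]
    tail = []
    for _ in range(nbytes - 1):
        tail.append(0x80 | (cp & 0x3F))  # low 6 bits go into a continuation byte
        cp >>= 6
    if cp >> lead_bits:
        raise ValueError("code point does not fit in that many bytes")
    return bytes([lead_mark | cp] + tail[::-1])


def is_valid_utf8(data):
    """Return True if Python's strict decoder accepts data."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


euro = 0x20AC
shortest = encode_in_n_bytes(euro, 3)  # the only valid form
overlong = encode_in_n_bytes(euro, 4)  # padded with zeros, hence invalid

print(shortest.hex(" "), is_valid_utf8(shortest))
print(overlong.hex(" "), is_valid_utf8(overlong))
```

A conforming decoder accepts only the three-byte form, which is what keeps the mapping between code points and byte sequences one-to-one.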

Invalid sequences and error handling

Not all sequences of bytes are valid UTF-8. A UTF-8 decoder should be prepared for:

invalid bytes
an unexpected continuation byte
a non-continuation byte before the end of the character
the string ending before the end of the character (which can happen in simple string truncation)
an overlong encoding
a sequence that decodes to an invalid code point
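Each kind of malformed input in the list above can be checked against a strict decoder. The following Python sketch uses byte strings chosen for this illustration (they are not taken from any standard test suite) and only reports whether the built-in decoder rejects them.

```python
CASES = {
    "invalid byte": b"\xff",
    "unexpected continuation byte": b"\x80",
    "non-continuation byte before the end": b"\xe2\x82A",
    "string ends mid-character": b"\xe2\x82",
    "overlong encoding": b"\xc0\xaf",
    "invalid code point (surrogate)": b"\xed\xa0\x80",
    "invalid code point (above U+10FFFF)": b"\xf4\x90\x80\x80",
}


def decode_error(data):
    """Return the decoder's error reason, or None if data is valid UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.reason
    return None


results = {name: decode_error(data) for name, data in CASES.items()}
for name, reason in results.items():
    print(f"{name}: {reason}")
```

The wording of each reason is implementation-specific; what matters is that none of these inputs is silently accepted.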

Many of the first UTF-8 decoders would decode these, ignoring incorrect bits and accepting overlong results. Carefully crafted invalid UTF-8 could make them either skip or create ASCII characters such as NUL, slash, or quotes. Invalid UTF-8 has been used to bypass security validations in high-profile products including Microsoft’s IIS web server^[26] and Apache’s Tomcat servlet container.^[27] RFC 3629 states «Implementations of the decoding algorithm MUST protect against decoding invalid sequences.»^[23] The Unicode Standard requires decoders to «…treat any ill-formed code unit sequence as an error condition. This guarantees that it will neither interpret nor emit an ill-formed code unit sequence.»

Since RFC 3629 (November 2003), the high and low surrogate halves used by UTF-16 (U+D800 through U+DFFF) and code points not encodable by UTF-16 (those after U+10FFFF) are not legal Unicode values, and their UTF-8 encoding must be treated as an invalid byte sequence. Not decoding unpaired surrogate halves makes it impossible to store invalid UTF-16 (such as Windows filenames or UTF-16 that has been split between the surrogates) as UTF-8,^[28] while it is possible with WTF-8.

Some implementations of decoders throw exceptions on errors.^[29] This has the disadvantage that it can turn what would otherwise be harmless errors (such as a «no such file» error) into a denial of service. For instance, early versions of Python 3.0 would exit immediately if the command line or environment variables contained invalid UTF-8.^[30]

Since Unicode 6^[31] (October 2010), the standard (chapter 3) has recommended a «best practice» where the error is either one byte long, or ends before the first byte that is disallowed. In these decoders E1,A0,C0 is two errors (2 bytes in the first one). This means an error is no more than three bytes long and never contains the start of a valid character, and there are 21,952 different possible errors.^[32] The standard also recommends replacing each error with the replacement character «�» (U+FFFD).

These recommendations are not often followed. It is common to consider each byte to be an error, in which case E1,A0,C0 is three errors (each 1 byte long). This means there are only 128 different errors, and it is also common to replace them with 128 different characters, to make the decoding «lossless».^[33]
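The two ways of counting errors can be compared on the sequence E1,A0,C0. The sketch below assumes a recent CPython, whose built-in "replace" handler appears to follow the «best practice» grouping, and uses the "surrogateescape" handler as an example of the one-error-per-byte approach.

```python
data = bytes([0xE1, 0xA0, 0xC0])

# "replace": one U+FFFD per error. E1 A0 is a truncated three-byte
# sequence (one error) and C0 is a byte that is never valid (a second error).
replaced = data.decode("utf-8", errors="replace")

# "surrogateescape": every undecodable byte becomes its own code point
# in U+DC80..U+DCFF, so the three bytes yield three code points.
escaped = data.decode("utf-8", errors="surrogateescape")

# Only the per-byte form can be turned back into the original bytes.
restored = escaped.encode("utf-8", errors="surrogateescape")

print(len(replaced), [hex(ord(c)) for c in replaced])
print(len(escaped), [hex(ord(c)) for c in escaped])
print(restored == data)
```

The replacement-character form loses the original bytes, whereas the per-byte form keeps enough information to reconstruct them.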

Byte order mark

If the Unicode byte order mark (BOM, U+FEFF) character is at the start of a UTF-8 file, the first three bytes will be 0xEF, 0xBB, 0xBF.

The Unicode Standard neither requires nor recommends the use of the BOM for UTF-8, but warns that it may be encountered at the start of a file trans-coded from another encoding.^[34] While ASCII text encoded using UTF-8 is backward compatible with ASCII, this is not true when Unicode Standard recommendations are ignored and a BOM is added. A BOM can confuse software that isn’t prepared for it but can otherwise accept UTF-8, e.g. programming languages that permit non-ASCII bytes in string literals but not at the start of the file. Nevertheless, there was and still is software that always inserts a BOM when writing UTF-8, and refuses to correctly interpret UTF-8 unless the first character is a BOM (or the file only contains ASCII).^[35]
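As an illustration of how software can tolerate a BOM without requiring one, Python provides a separate "utf-8-sig" codec next to the plain "utf-8" codec. The sample strings below are made up for the example.

```python
import codecs

bom = "\ufeff".encode("utf-8")       # the three bytes EF BB BF
data = bom + "hi".encode("utf-8")

plain = data.decode("utf-8")         # the BOM survives as U+FEFF
stripped = data.decode("utf-8-sig")  # a single leading BOM is dropped
no_bom = b"hi".decode("utf-8-sig")   # input without a BOM is accepted too

print(bom.hex(" "), bom == codecs.BOM_UTF8)
print(repr(plain), repr(stripped), repr(no_bom))
```

A reader that decodes with "utf-8-sig" handles files from BOM-writing and BOM-free software alike, while the plain codec leaves U+FEFF in the text for the application to deal with.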

Adoption

Declared character set for the 10 million most popular websites since 2010

Use of the main encodings on the web from 2001–2012 as recorded by Google,^[36] with UTF-8 overtaking all others in 2008 and over 60% of the web in 2012 (since then approaching 100%). UTF-8 is the only encoding of Unicode (explicitly) listed there, and the rest only provide subsets of Unicode. The ASCII-only figure includes all web pages that only contain ASCII characters, regardless of the declared header.

UTF-8 has been the most common encoding for the World Wide Web since 2008.^[37] As of October 2023, UTF-8 is used by 98.0% of surveyed web sites.^[9]^[e] Although many pages only use ASCII characters to display content, few websites now declare their encoding to only be ASCII instead of UTF-8.^[38] Over 50% of the languages tracked have 100% UTF-8 use.

Many standards only support UTF-8, e.g. JSON exchange requires it (without a byte order mark (BOM)).^[39] The WHATWG also recommends UTF-8 for HTML and DOM specifications, stating that «UTF-8 encoding is the most appropriate encoding for interchange of Unicode»,^[8] and the Internet Mail Consortium recommends that all e‑mail programs be able to display and create mail using UTF-8.^[40]^[41] The World Wide Web Consortium recommends UTF-8 as the default encoding in XML and HTML (and not just using UTF-8, also declaring it in metadata), «even when all characters are in the ASCII range … Using non-UTF-8 encodings can have unexpected results».^[42]

Much software can read and write UTF-8, although it may require the user to change options from the default settings, or may require a BOM (byte order mark) as the first character in order to read the file. Examples of software supporting UTF-8 include Microsoft Word,^[43]^[44]^[45] Microsoft Excel (2016 and later),^[46]^[47] Google Drive, LibreOffice and most databases.

However, UTF-8 is less prevalent for local text files, where legacy single-byte (and a few CJK multi-byte) encodings remain in use. The primary cause of this is outdated text editors that refuse to read UTF-8 unless the first bytes of the file encode a byte order mark character (BOM).^[48]

Some recent software can only read and write UTF-8, or at least does not require a BOM.^[49] Windows Notepad, in all currently supported versions of Windows, defaults to writing UTF-8 without a BOM (a change from the outdated/unsupported Windows 7), bringing it into line with most other text editors.^[50] Some system files on Windows 11 require UTF-8^[51] with no requirement for a BOM, and almost all files on macOS and Linux are required to be UTF-8 without a BOM.^{[citation needed]} Java 18 defaults to reading and writing files as UTF-8,^[52] and in older versions (e.g. LTS versions) only the NIO API was changed to do so. Many other programming languages default to UTF-8 for I/O, including Ruby 3.0^[53]^[54] and R 4.2.2.^[55] All currently supported versions of Python support UTF-8 for I/O, even on Windows (where it is opt-in for the open() function^[56]), and plans exist to make UTF-8 I/O the default in Python 3.15 on all platforms.^[57]^[58] C++23 adopts UTF-8 as the only portable source code file format (there was none before).^[59]

Usage of UTF-8 in memory is much lower than in other areas; UTF-16 is often used instead. This occurs particularly in Windows, but also in JavaScript, Python,^[f] Qt, and many other cross-platform software libraries. Compatibility with the Windows API is the primary reason for this; that choice was originally made in the belief that direct indexing of the BMP would improve speed. Translating from/to external text which is in UTF-8 slows software down, and more importantly introduces bugs when different pieces of code do not do the exact same translation.

Back-compatibility is a serious impediment to changing code to use UTF-8 instead of a 16-bit encoding, but this is happening. The default string primitive in Go,^[61] Julia, Rust, Swift 5,^[62] and PyPy^[63] uses UTF-8 internally in all cases, while Python, since Python 3.3, uses UTF-8 internally in some cases (for Python C API extensions);^[60]^[64] a future version of Python is planned to store strings as UTF-8 by default;^[65]^[66] and modern versions of Microsoft Visual Studio use UTF-8 internally.^[67] Microsoft’s SQL Server 2019 added support for UTF-8, and using it results in a 35% speed increase, and «nearly 50% reduction in storage requirements.»^[68]

All currently supported Windows versions support UTF-8 in some way (including Xbox);^[7] partial support has existed since at least Windows XP. As of May 2019, Microsoft has reversed its previous position of only recommending UTF-16; the capability to set UTF-8 as the «code page» for the Windows API was introduced; and Microsoft recommends programmers use UTF-8,^[69] and even states «UTF-16 [..] is a unique burden that Windows places on code that targets multiple platforms.»^[7]

History

The International Organization for Standardization (ISO) set out to compose a universal multi-byte character set in 1989. The draft ISO 10646 standard contained a non-required annex called UTF-1 that provided a byte stream encoding of its 32-bit code points. This encoding was not satisfactory on performance grounds, among other problems. The biggest problem was probably that it did not have a clear separation between ASCII and non-ASCII: new UTF-1 tools would be backward compatible with ASCII-encoded text, but UTF-1-encoded text could confuse existing code expecting ASCII (or extended ASCII), because it could contain continuation bytes in the range 0x21–0x7E that meant something else in ASCII, e.g., 0x2F for ‘/’, the Unix path directory separator. This example is reflected in the name and introductory text of its replacement. The table below was derived from a textual description in the annex.

UTF-1

Number of bytes	First code point	Last code point	Byte 1	Byte 2	Byte 3	Byte 4	Byte 5
1	U+0000	U+009F	00–9F
2	U+00A0	U+00FF	A0	A0–FF
2	U+0100	U+4015	A1–F5	21–7E, A0–FF
3	U+4016	U+38E2D	F6–FB	21–7E, A0–FF	21–7E, A0–FF
5	U+38E2E	U+7FFFFFFF	FC–FF	21–7E, A0–FF	21–7E, A0–FF	21–7E, A0–FF	21–7E, A0–FF

In July 1992, the X/Open committee XoJIG was looking for a better encoding. Dave Prosser of Unix System Laboratories submitted a proposal for one that had faster implementation characteristics and introduced the improvement that 7-bit ASCII characters would only represent themselves; all multi-byte sequences would include only bytes where the high bit was set. The name File System Safe UCS Transformation Format (FSS-UTF) and most of the text of this proposal were later preserved in the final specification.^[70]^[71]^[72]^[73]

FSS-UTF

FSS-UTF proposal (1992)

Number of bytes	First code point	Last code point	Byte 1	Byte 2	Byte 3	Byte 4	Byte 5
1	U+0000	U+007F	0xxxxxxx
2	U+0080	U+207F	10xxxxxx	1xxxxxxx
3	U+2080	U+8207F	110xxxxx	1xxxxxxx	1xxxxxxx
4	U+82080	U+208207F	1110xxxx	1xxxxxxx	1xxxxxxx	1xxxxxxx
5	U+2082080	U+7FFFFFFF	11110xxx	1xxxxxxx	1xxxxxxx	1xxxxxxx	1xxxxxxx

In August 1992, this proposal was circulated by an IBM X/Open representative to interested parties. A modification by Ken Thompson of the Plan 9 operating system group at Bell Labs made it self-synchronizing, letting a reader start anywhere and immediately detect character boundaries, at the cost of being somewhat less bit-efficient than the previous proposal. It also abandoned the use of biases and instead added the rule that only the shortest possible encoding is allowed; the additional loss in compactness is relatively insignificant, but readers now have to look out for invalid encodings to avoid reliability and especially security issues. Thompson’s design was outlined on September 2, 1992, on a placemat in a New Jersey diner with Rob Pike. In the following days, Pike and Thompson implemented it and updated Plan 9 to use it throughout, and then communicated their success back to X/Open, which accepted it as the specification for FSS-UTF.^[72]

FSS-UTF (1992) / UTF-8 (1993)^[2]

Number of bytes	First code point	Last code point	Byte 1	Byte 2	Byte 3	Byte 4	Byte 5	Byte 6
1	U+0000	U+007F	0xxxxxxx
2	U+0080	U+07FF	110xxxxx	10xxxxxx
3	U+0800	U+FFFF	1110xxxx	10xxxxxx	10xxxxxx
4	U+10000	U+1FFFFF	11110xxx	10xxxxxx	10xxxxxx	10xxxxxx
5	U+200000	U+3FFFFFF	111110xx	10xxxxxx	10xxxxxx	10xxxxxx	10xxxxxx
6	U+4000000	U+7FFFFFFF	1111110x	10xxxxxx	10xxxxxx	10xxxxxx	10xxxxxx	10xxxxxx

UTF-8 was first officially presented at the USENIX conference in San Diego, from January 25 to 29, 1993. The Internet Engineering Task Force adopted UTF-8 in its Policy on Character Sets and Languages in RFC 2277 (BCP 18) for future internet standards work, replacing Single Byte Character Sets such as Latin-1 in older RFCs.^[6]

In November 2003, UTF-8 was restricted by RFC 3629 to match the constraints of the UTF-16 character encoding: explicitly prohibiting code points corresponding to the high and low surrogate characters removed more than 3% of the three-byte sequences, and ending at U+10FFFF removed more than 48% of the four-byte sequences and all five- and six-byte sequences.
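The percentages quoted above follow from the sizes of the code point ranges in the FSS-UTF / UTF-8 table. The short Python calculation below is a worked check of that arithmetic; the variable names are chosen for the example.

```python
# Three-byte sequences cover U+0800..U+FFFF; the surrogates are U+D800..U+DFFF.
three_byte_total = 0xFFFF - 0x0800 + 1    # 63,488 code points
surrogates = 0xDFFF - 0xD800 + 1          # 2,048 code points
removed_3 = surrogates / three_byte_total

# Four-byte sequences originally covered U+10000..U+1FFFFF;
# only U+10000..U+10FFFF remains allowed.
four_byte_total = 0x1FFFFF - 0x10000 + 1  # 2,031,616 code points
four_byte_kept = 0x10FFFF - 0x10000 + 1   # 1,048,576 code points
removed_4 = (four_byte_total - four_byte_kept) / four_byte_total

# Encodable code points that remain: 17 planes minus the surrogates.
encodable = 17 * 2**16 - 2**11

print(f"{removed_3:.2%} of three-byte sequences removed")
print(f"{removed_4:.2%} of four-byte sequences removed")
print(encodable)
```

The fractions are exactly 1/31 and 15/31, about 3.2% and 48.4%.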

Standards

There are several current definitions of UTF-8 in various standards documents:

RFC 3629 / STD 63 (2003), which establishes UTF-8 as a standard internet protocol element
RFC 5198 defines UTF-8 NFC for Network Interchange (2008)
ISO/IEC 10646:2014 §9.1 (2014)^[74]
The Unicode Standard, Version 15.0.0 (2022)^[75]

They supersede the definitions given in the following obsolete works:

The Unicode Standard, Version 2.0, Appendix A (1996)
ISO/IEC 10646-1:1993 Amendment 2 / Annex R (1996)
RFC 2044 (1996)
RFC 2279 (1998)
The Unicode Standard, Version 3.0, §2.3 (2000) plus Corrigendum #1 : UTF-8 Shortest Form (2000)
Unicode Standard Annex #27: Unicode 3.1 (2001)^[76]
The Unicode Standard, Version 5.0 (2006)^[77]
The Unicode Standard, Version 6.0 (2010)^[78]

They are all the same in their general mechanics, with the main differences being on issues such as allowed range of code point values and safe handling of invalid input.

Comparison with other encodings

Some of the important features of this encoding are as follows:

Backward compatibility: Backward compatibility with ASCII and the enormous amount of software designed to process ASCII-encoded text was the main driving force behind the design of UTF-8. In UTF-8, single bytes with values in the range of 0 to 127 map directly to Unicode code points in the ASCII range. Single bytes in this range represent characters, as they do in ASCII. Moreover, 7-bit bytes (bytes where the most significant bit is 0) never appear in a multi-byte sequence, and no valid multi-byte sequence decodes to an ASCII code-point. A sequence of 7-bit bytes is both valid ASCII and valid UTF-8, and under either interpretation represents the same sequence of characters. Therefore, the 7-bit bytes in a UTF-8 stream represent all and only the ASCII characters in the stream. Thus, many text processors, parsers, protocols, file formats, text display programs, etc., which use ASCII characters for formatting and control purposes, will continue to work as intended by treating the UTF-8 byte stream as a sequence of single-byte characters, without decoding the multi-byte sequences. ASCII characters on which the processing turns, such as punctuation, whitespace, and control characters will never be encoded as multi-byte sequences. It is therefore safe for such processors to simply ignore or pass-through the multi-byte sequences, without decoding them. For example, ASCII whitespace may be used to tokenize a UTF-8 stream into words; ASCII line-feeds may be used to split a UTF-8 stream into lines; and ASCII NUL characters can be used to split UTF-8-encoded data into null-terminated strings. Similarly, many format strings used by library functions like «printf» will correctly handle UTF-8-encoded input arguments.
Fallback and auto-detection: Only a small subset of possible byte strings are valid UTF-8 strings: several bytes cannot appear; a byte with the high bit set cannot be alone; and further requirements mean that it is extremely unlikely that a readable text in any extended ASCII is valid UTF-8. Part of the popularity of UTF-8 is due to it providing a form of backward compatibility for these as well. A UTF-8 processor which erroneously receives extended ASCII as input can thus «auto-detect» this with very high reliability. A UTF-8 stream may simply contain errors, resulting in the auto-detection scheme producing false positives; but auto-detection is successful in the vast majority of cases, especially with longer texts, and is widely used. It also works to «fall back» or replace 8-bit bytes using the appropriate code-point for a legacy encoding when errors in the UTF-8 are detected, allowing recovery even if UTF-8 and a legacy encoding are concatenated in the same file.
Prefix code: The first byte indicates the number of bytes in the sequence. Reading from a stream can instantaneously decode each individual fully received sequence, without first having to wait for either the first byte of a next sequence or an end-of-stream indication. The length of multi-byte sequences is easily determined by humans as it is simply the number of high-order 1s in the leading byte. An incorrect character will not be decoded if a stream ends mid-sequence.
Self-synchronization: The leading bytes and the continuation bytes do not share values (continuation bytes start with the bits 10 while single bytes start with 0 and longer lead bytes start with 11). This means a search will not accidentally find the sequence for one character starting in the middle of another character. It also means the start of a character can be found from a random position by backing up at most 3 bytes to find the leading byte. An incorrect character will not be decoded if a stream starts mid-sequence, and a shorter sequence will never appear inside a longer one.
Sorting order: The chosen values of the leading bytes mean that a list of UTF-8 strings can be sorted in code point order by sorting the corresponding byte sequences.
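Two of the properties above, byte-wise sorting and ASCII-based tokenizing, can be checked with a few lines of Python. The word list and phrase are arbitrary samples chosen for this sketch; the UTF-16 ordering is included only for contrast.

```python
words = ["z", "\u00e9", "\u4e2d", "\U0001F600", "a", "\uffff"]

# Python compares str values code point by code point.
by_code_point = sorted(words)

# Sorting the raw UTF-8 byte strings gives the same order.
by_utf8_bytes = [b.decode("utf-8") for b in sorted(w.encode("utf-8") for w in words)]

# Sorting by UTF-16 bytes does not: surrogates (D800..DFFF) sort below U+FFFF.
by_utf16_bytes = sorted(words, key=lambda w: w.encode("utf-16-be"))

# Splitting on the ASCII space byte never cuts through a multi-byte sequence,
# because bytes below 0x80 occur only as ASCII characters.
phrase = "M\u00ecnh n\u00f3i ti\u1ebfng Vi\u1ec7t"
tokens = [t.decode("utf-8") for t in phrase.encode("utf-8").split(b" ")]

print(by_code_point == by_utf8_bytes, by_code_point == by_utf16_bytes)
print(tokens)
```

The tokenizer here works on bytes without decoding anything until after the split, which is how ASCII-oriented tools can process UTF-8 unmodified.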

Single-byte

UTF-8 can encode any Unicode character, avoiding the need to figure out and set a «code page» or otherwise indicate what character set is in use, and allowing output in multiple scripts at the same time. For many scripts there has been more than one single-byte encoding in use, so even knowing the script was insufficient information to display it correctly.
The bytes 0xFE and 0xFF do not appear, so a valid UTF-8 stream never matches the UTF-16 byte order mark and thus cannot be confused with it. The absence of 0xFF (0377) also eliminates the need to escape this byte in Telnet (and FTP control connection).
UTF-8 encoded text is larger than specialized single-byte encodings except for plain ASCII characters. In the case of scripts which used 8-bit character sets with non-Latin characters encoded in the upper half (such as most Cyrillic and Greek alphabet code pages), characters in UTF-8 will be double the size. For some scripts, such as Thai and Devanagari (which is used by various South Asian languages), characters will triple in size. There are even examples where a single byte turns into a composite character in Unicode and is thus six times larger in UTF-8. This has caused objections in India and other countries.^{[citation needed]}
It is possible in UTF-8 (or any other multi-byte encoding) to split or truncate a string in the middle of a character. If the two pieces are not re-appended later before interpretation as characters, this can introduce an invalid sequence at both the end of the previous section and the start of the next, and some decoders will not preserve these bytes and result in data loss. Because UTF-8 is self-synchronizing this will however never introduce a different valid character, and it is also fairly easy to move the truncation point backward to the start of a character.
If the code points are all the same size, measuring a fixed number of them is easy. Due to ASCII-era documentation where «character» is used as a synonym for «byte» this is often considered important. However, by measuring string positions using bytes instead of «characters» most algorithms can be easily and efficiently adapted for UTF-8. Searching for a string within a long string can for example be done byte by byte; the self-synchronization property prevents false positives.
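Moving a truncation point backward to the start of a character, as mentioned above, only requires skipping continuation bytes. The following Python sketch assumes its input is already valid UTF-8; truncate_utf8 is a hypothetical helper name.

```python
def truncate_utf8(data, max_bytes):
    """Cut data to at most max_bytes without splitting a character.

    Assumes data is valid UTF-8, so at most three steps back are needed.
    """
    if len(data) <= max_bytes:
        return data
    cut = max_bytes
    # data[cut] is the first byte that would be dropped. If it is a
    # continuation byte (10xxxxxx), the cut falls inside a character.
    while cut > 0 and (data[cut] & 0xC0) == 0x80:
        cut -= 1
    return data[:cut]


data = "a\u20acb".encode("utf-8")          # 61 E2 82 AC 62
pieces = [truncate_utf8(data, n) for n in range(6)]

# Byte-by-byte search finds the character at its byte offset.
offset = data.find("\u20ac".encode("utf-8"))

print([p.hex(" ") for p in pieces], offset)
```

Every returned piece is itself valid UTF-8, and positions are measured in bytes throughout, as the text suggests.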

Other multi-byte

UTF-8 can encode any Unicode character. Files in different scripts can be displayed correctly without having to choose the correct code page or font. For instance, Chinese and Arabic can be written in the same file without specialized markup or manual settings that specify an encoding.
UTF-8 is self-synchronizing: character boundaries are easily identified by scanning for well-defined bit patterns in either direction. If bytes are lost due to error or corruption, one can always locate the next valid character and resume processing. If there is a need to shorten a string to fit a specified field, the previous valid character can easily be found. Many multi-byte encodings such as Shift JIS are much harder to resynchronize. This also means that byte-oriented string-searching algorithms can be used with UTF-8 (as a character is the same as a «word» made up of that many bytes); optimized versions of byte searches can be much faster due to hardware support and lookup tables that have only 256 entries. Self-synchronization does however require that bits be reserved for these markers in every byte, increasing the size.
UTF-8 is efficient to encode using simple bitwise operations. It does not require slower mathematical operations such as multiplication or division (unlike Shift JIS, GB 2312 and other encodings).
UTF-8 will take more space than a multi-byte encoding designed for a specific script. East Asian legacy encodings generally used two bytes per character yet take three bytes per character in UTF-8.
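The size differences described in this section and the previous one can be measured directly. The Python sketch below uses three short sample words chosen for the illustration and the legacy codecs shipped with Python (cp1251, cp874 and shift_jis).

```python
samples = {
    "Russian / cp1251": ("\u041f\u0440\u0438\u0432\u0435\u0442", "cp1251"),  # Привет
    "Thai / cp874": ("\u0e44\u0e17\u0e22", "cp874"),                          # ไทย
    "Japanese / shift_jis": ("\u65e5\u672c\u8a9e", "shift_jis"),              # 日本語
}

# For each sample: (bytes in the legacy encoding, bytes in UTF-8).
sizes = {
    name: (len(text.encode(codec)), len(text.encode("utf-8")))
    for name, (text, codec) in samples.items()
}

for name, (legacy, utf8) in sizes.items():
    print(f"{name}: {legacy} bytes legacy, {utf8} bytes UTF-8")
```

The Cyrillic sample doubles in size, the Thai sample triples, and the Japanese sample grows from two to three bytes per character.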

UTF-16

Byte encodings and UTF-8 are represented by byte arrays in programs, and often nothing needs to be done to a function when converting source code from a byte encoding to UTF-8. UTF-16 is represented by 16-bit word arrays, and converting to UTF-16 while maintaining compatibility with existing ASCII-based programs (such as was done with Windows) requires every API and data structure that takes a string to be duplicated, one version accepting byte strings and another version accepting UTF-16. If backward compatibility is not needed, all string handling still must be modified.
Text encoded in UTF-8 will be smaller than the same text encoded in UTF-16 if there are more code points below U+0080 than in the range U+0800..U+FFFF. This is true for all modern European languages. It is often true even for languages like Chinese, due to the large number of spaces, newlines, digits, and HTML markup in typical files.
Most communication (e.g. HTML and IP) and storage (e.g. for Unix) was designed for a stream of bytes. A UTF-16 string must use a pair of bytes for each code unit:
- The order of those two bytes becomes an issue and must be specified in the UTF-16 protocol, such as with a byte order mark.
- If an odd number of bytes is missing from UTF-16, the whole rest of the string will be meaningless text. Any bytes missing from UTF-8 will still allow the text to be recovered accurately starting with the next character after the missing bytes.
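The points about size, byte order and lost bytes can be illustrated with a short Python sketch. The sample text is made up for the example, and the damaged copies are produced by deleting a single byte from each encoding.

```python
text = "<p>\u4e2d\u6587</p>"        # 7 ASCII characters and 2 CJK characters

utf8 = text.encode("utf-8")         # 7 * 1 + 2 * 3 bytes
utf16 = text.encode("utf-16-le")    # 9 * 2 bytes

# The same code unit has two possible byte orders in UTF-16.
little = "A".encode("utf-16-le")
big = "A".encode("utf-16-be")

# Lose one byte from each encoding and decode what is left.
damaged8 = (utf8[:4] + utf8[5:]).decode("utf-8", errors="replace")
damaged16 = utf16[1:].decode("utf-16-le", errors="replace")

print(len(utf8), len(utf16))
print(little, big)
print(repr(damaged8))
print(repr(damaged16))
```

After the missing byte, the UTF-8 text resumes correctly at the next character, while every later UTF-16 code unit is misaligned.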

Derivatives

The following implementations show slight differences from the UTF-8 specification. They are incompatible with the UTF-8 specification and may be rejected by conforming UTF-8 applications.

CESU-8

Unicode Technical Report #26^[79] assigns the name CESU-8 to a nonstandard variant of UTF-8, in which Unicode characters in supplementary planes are encoded using six bytes, rather than the four bytes required by UTF-8. CESU-8 encoding treats each half of a four-byte UTF-16 surrogate pair as a two-byte UCS-2 character, yielding two three-byte UTF-8 characters, which together represent the original supplementary character. Unicode characters within the Basic Multilingual Plane appear as they would normally in UTF-8. The Report was written to acknowledge and formalize the existence of data encoded as CESU-8, despite the Unicode Consortium discouraging its use, and notes that a possible intentional reason for CESU-8 encoding is preservation of UTF-16 binary collation.

CESU-8 encoding can result from converting UTF-16 data with supplementary characters to UTF-8, using conversion methods that assume UCS-2 data, meaning they are unaware of four-byte UTF-16 supplementary characters. It is primarily an issue on operating systems which extensively use UTF-16 internally, such as Microsoft Windows.^{[citation needed]}
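The conversion path described above can be imitated in Python, which has no built-in CESU-8 codec. In this sketch encode_cesu8 is a hypothetical helper: it goes through UTF-16 code units and encodes each one separately, using the "surrogatepass" error handler to let surrogate halves through. It assumes the input string contains no unpaired surrogates.

```python
import struct


def encode_cesu8(text):
    """Encode text as CESU-8 by treating every UTF-16 code unit,
    including each surrogate half, as if it were a UCS-2 character."""
    utf16 = text.encode("utf-16-be")
    units = struct.unpack(f">{len(utf16) // 2}H", utf16)
    return b"".join(chr(u).encode("utf-8", "surrogatepass") for u in units)


supplementary = "\U0001F600"                 # a code point outside the BMP
standard = supplementary.encode("utf-8")     # four bytes
cesu = encode_cesu8(supplementary)           # six bytes: two three-byte halves

bmp_text = "a\u20ac"
print(standard.hex(" "))
print(cesu.hex(" "))
print(encode_cesu8(bmp_text) == bmp_text.encode("utf-8"))
```

Text confined to the Basic Multilingual Plane is byte-identical in both encodings, which is why the difference often goes unnoticed until a supplementary character appears.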

In Oracle Database, the UTF8 character set uses CESU-8 encoding, and is deprecated. The AL32UTF8 character set uses standards-compliant UTF-8 encoding, and is preferred.^[80]^[81]

CESU-8 is prohibited for use in HTML5 documents.^[82]^[83]^[84]

MySQL utf8mb3

In MySQL, the utf8mb3 character set is defined to be UTF-8 encoded data with a maximum of three bytes per character, meaning only Unicode characters in the Basic Multilingual Plane (i.e. from UCS-2) are supported. Unicode characters in supplementary planes are explicitly not supported. utf8mb3 is deprecated in favor of the utf8mb4 character set, which uses standards-compliant UTF-8 encoding. utf8 is an alias for utf8mb3, but is intended to become an alias to utf8mb4 in a future release of MySQL.^[14] It is possible, though unsupported, to store CESU-8 encoded data in utf8mb3, by handling UTF-16 data with supplementary characters as though it is UCS-2.

Modified UTF-8

Modified UTF-8 (MUTF-8) originated in the Java programming language. In Modified UTF-8, the null character (U+0000) uses the two-byte overlong encoding 11000000 10000000 (hexadecimal C0 80), instead of 00000000 (hexadecimal 00).^[85] Modified UTF-8 strings never contain any actual null bytes but can contain all Unicode code points including U+0000,^[86] which allows such strings (with a null byte appended) to be processed by traditional null-terminated string functions. All known Modified UTF-8 implementations also treat the surrogate pairs as in CESU-8.
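The two deviations described above, the two-byte null character and the CESU-8 style surrogate pairs, can be sketched as an encoder. This is an illustration in Python rather than Java's own implementation; encode_modified_utf8 is a hypothetical name.

```python
def encode_modified_utf8(text):
    """Encode text following the two Modified UTF-8 rules described above."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp == 0:
            out += b"\xc0\x80"                # overlong form of U+0000
        elif cp >= 0x10000:
            cp -= 0x10000
            high = 0xD800 | (cp >> 10)        # leading surrogate
            low = 0xDC00 | (cp & 0x3FF)       # trailing surrogate
            out += chr(high).encode("utf-8", "surrogatepass")
            out += chr(low).encode("utf-8", "surrogatepass")
        else:
            out += ch.encode("utf-8", "surrogatepass")
    return bytes(out)


with_null = encode_modified_utf8("A\x00B")
emoji = encode_modified_utf8("\U0001F600")

print(with_null.hex(" "), b"\x00" in with_null)
print(emoji.hex(" "))
```

Because the output never contains a zero byte, a terminating null can be appended and the result handed to null-terminated string functions.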

In normal usage, the language supports standard UTF-8 when reading and writing strings through InputStreamReader and OutputStreamWriter (if it is the platform’s default character set or as requested by the program). However it uses Modified UTF-8 for object serialization^[87] among other applications of DataInput and DataOutput, for the Java Native Interface,^[88] and for embedding constant strings in class files.^[89]

The dex format defined by Dalvik also uses the same modified UTF-8 to represent string values.^[90] Tcl also uses the same modified UTF-8^[91] as Java for internal representation of Unicode data, but uses strict CESU-8 for external data.

WTF-8

In WTF-8 (Wobbly Transformation Format, 8-bit) unpaired surrogate halves (U+D800 through U+DFFF) are allowed.^[92] This is necessary to store possibly-invalid UTF-16, such as Windows filenames. Many systems that deal with UTF-8 work this way without considering it a different encoding, as it is simpler.^[93]
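Python's "surrogatepass" error handler gives a rough feel for this behaviour, although it is only an approximation: it also accepts surrogate pairs encoded as two three-byte sequences, which WTF-8 does not allow. The file name in this sketch is made up.

```python
lone = "\ud800"                     # an unpaired leading surrogate
name = "file" + lone                # e.g. an ill-formed name from a UTF-16 API


def strict_encode_fails(s):
    """Return True if the standard UTF-8 encoder rejects s."""
    try:
        s.encode("utf-8")
        return False
    except UnicodeEncodeError:
        return True


permissive = name.encode("utf-8", "surrogatepass")
round_trip = permissive.decode("utf-8", "surrogatepass")

print(strict_encode_fails(name))
print(permissive.hex(" "))
print(round_trip == name)
```

The unpaired surrogate is stored as the three bytes ED A0 80, which a strict UTF-8 decoder would reject.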

The term «WTF-8» has also been used humorously to refer to erroneously doubly-encoded UTF-8^[94]^[95] sometimes with the implication that CP1252 bytes are the only ones encoded.^[96]

PEP 383

Version 3 of the Python programming language treats each byte of an invalid UTF-8 bytestream as an error (see also changes with new UTF-8 mode in Python 3.7^[97]); this gives 128 different possible errors. Extensions have been created to allow any byte sequence that is assumed to be UTF-8 to be losslessly transformed to UTF-16 or UTF-32, by translating the 128 possible error bytes to reserved code points, and transforming those code points back to error bytes to output UTF-8. The most common approach is to translate the codes to U+DC80…U+DCFF which are low (trailing) surrogate values and thus «invalid» UTF-16, as used by Python’s PEP 383 (or «surrogateescape») approach.^[33] Another encoding called MirBSD OPTU-8/16 converts them to U+EF80…U+EFFF in a Private Use Area.^[98] In either approach, the byte value is encoded in the low eight bits of the output code point.
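The lossless round trip can be demonstrated with the "surrogateescape" error handler. In this Python sketch the input is a made-up file name containing a Latin-1 byte, which is not valid UTF-8.

```python
raw = b"caf\xe9.txt"                               # E9 is not valid UTF-8 here

text = raw.decode("utf-8", "surrogateescape")      # the bad byte becomes U+DCE9
restored = text.encode("utf-8", "surrogateescape") # and turns back into E9

escaped = text[3]
print(hex(ord(escaped)), hex(ord(escaped) & 0xFF))
print(restored == raw)
```

The byte value sits in the low eight bits of the escape code point, so the original bytes can be rebuilt exactly when the text is written back out.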

These encodings are very useful because they avoid the need to deal with «invalid» byte strings until much later, if at all, and allow «text» and «data» byte arrays to be the same object. If a program wants to use UTF-16 internally these are required to preserve and use filenames that can use invalid UTF-8;^[99] as the Windows filesystem API uses UTF-16, the need to support invalid UTF-8 is less there.^[33]

For the encoding to be reversible, the standard UTF-8 encodings of the code points used for erroneous bytes must be considered invalid. This makes the encoding incompatible with WTF-8 or CESU-8 (though only for 128 code points). When re-encoding it is necessary to be careful of sequences of error code points which convert back to valid UTF-8, which may be used by malicious software to get unexpected characters in the output, though this cannot produce ASCII characters so it is considered comparatively safe, since malicious sequences (such as cross-site scripting) usually rely on ASCII characters.^[99]

Notes

^ 17 planes times 2¹⁶ code points per plane, minus 2¹¹ technically-invalid surrogates.
^ There are enough x bits to encode up to 0x1FFFFF, but the current RFC 3629 §3 limits UTF-8 encoding to code point U+10FFFF, to match the limits of UTF-16. The obsolete RFC 2279 allowed UTF-8 encoding up to (then legal) code point U+7FFFFFFF.
^ Some complex emoji characters can take even more than this; the transgender flag emoji (🏳️‍⚧️), which consists of the five-codepoint sequence U+1F3F3 U+FE0F U+200D U+26A7 U+FE0F, requires sixteen bytes to encode, while that for the flag of Scotland (🏴󠁧󠁢󠁳󠁣󠁴󠁿) requires a total of twenty-eight bytes for the seven-codepoint sequence U+1F3F4 U+E0067 U+E0062 U+E0073 U+E0063 U+E0074 U+E007F.
^ For example, cell 9D says +1D. The hexadecimal number 9D in binary is 10011101, and since the 2 highest bits (10) are reserved for marking this as a continuation byte, the remaining 6 bits (011101) have a hexadecimal value of 1D.
^ W3Techs.com survey^[9] is based on the encoding as declared in the server’s response, see https://w3techs.com/forum/topic/22994
^ Python uses a number of encodings for what it calls «Unicode», however none of these encodings are UTF-8. Python 2 and early version 3 used UTF-16 on Windows and UTF-32 on Unix. More recent implementations of Python 3 use three fixed-length encodings: ISO-8859-1, UCS-2, and UTF-32, depending on the maximum code point needed.^[60]

References

^ «Chapter 2. General Structure». The Unicode Standard (6.0 ed.). Mountain View, California, US: The Unicode Consortium. ISBN 978-1-936213-01-6.
^ ^a ^b Pike, Rob (30 April 2003). «UTF-8 history».
^ Pike, Rob; Thompson, Ken (1993). «Hello World or Καλημέρα κόσμε or こんにちは世界» (PDF). Proceedings of the Winter 1993 USENIX Conference.
^ «File System Safe UCS — Transformation Format (FSS-UTF) — X/Open Preliminary Specification» (PDF). unicode.org.
^ «USENIX Winter 1993 Conference Proceedings». usenix.org.
^ ^a ^b Alvestrand, Harald T. (January 1998). IETF Policy on Character Sets and Languages. IETF. doi:10.17487/RFC2277. BCP 18. RFC 2277.
^ ^a ^b ^c «UTF-8 support in the Microsoft Game Development Kit (GDK) — Microsoft Game Development Kit». learn.microsoft.com. Retrieved 2023-03-05. By operating in UTF-8, you can ensure maximum compatibility [..] Windows operates natively in UTF-16 (or WCHAR), which requires code page conversions by using MultiByteToWideChar and WideCharToMultiByte. This is a unique burden that Windows places on code that targets multiple platforms. [..] The Microsoft Game Development Kit (GDK) and Windows in general are moving forward to support UTF-8 to remove this unique burden of Windows on code targeting or interchanging with multiple platforms and the web. Also, this results in fewer internationalization issues in apps and games and reduces the test matrix that’s required to get it right.
^ ^a ^b «Encoding Standard». encoding.spec.whatwg.org. Retrieved 2020-04-15.
^ ^a ^b ^c «Usage Survey of Character Encodings broken down by Ranking». w3techs.com. Retrieved 2023-10-01.
^ «Encoding Standard § 4.2. Names and labels». WHATWG. Retrieved 2018-04-29.
^ «Character Sets». Internet Assigned Numbers Authority. 2013-01-23. Retrieved 2013-02-08.
^ Liviu (2014-02-07). «UTF-8 codepage 65001 in Windows 7 — part I». Retrieved 2018-01-30. Previously under XP (and, unverified, but probably Vista, too) for loops simply did not work while codepage 65001 was active
^ «MySQL :: MySQL 8.0 Reference Manual :: 10.9.1 The utf8mb4 Character Set (4-Byte UTF-8 Unicode Encoding)». MySQL 8.0 Reference Manual. Oracle Corporation. Retrieved 2023-03-14.
^ ^a ^b «MySQL :: MySQL 8.0 Reference Manual :: 10.9.2 The utf8mb3 Character Set (3-Byte UTF-8 Unicode Encoding)». MySQL 8.0 Reference Manual. Oracle Corporation. Retrieved 2023-02-24.
^ «HP PCL Symbol Sets | Printer Control Language (PCL & PXL) Support Blog». 2015-02-19. Archived from the original on 2015-02-19. Retrieved 2018-01-30.
^ «Database Globalization Support Guide». docs.oracle.com. Retrieved 2023-03-16.
^ «BOM». suikawiki (in Japanese). Archived from the original on 2009-01-17.
^ Davis, Mark. «Forms of Unicode». IBM. Archived from the original on 2005-05-06. Retrieved 2013-09-18.
^ «Apple Developer Documentation». developer.apple.com. Retrieved 2021-03-15.
^ «Chapter 3» (PDF), The Unicode Standard, p. 54
^ «Chapter 3» (PDF), The Unicode Standard, p. 55
^ «Chapter 3» (PDF), The Unicode Standard, p. 55
^ ^a ^b ^c Yergeau, F. (November 2003). UTF-8, a transformation format of ISO 10646. IETF. doi:10.17487/RFC3629. STD 63. RFC 3629. Retrieved August 20, 2020.
^ «Chapter 3» (PDF), The Unicode Standard, p. 54
^ «Chapter 3» (PDF), The Unicode Standard, p. 55
^ Marin, Marvin (2000-10-17). «Web Server Folder Traversal MS00-078».
^ «Summary for CVE-2008-2938». National Vulnerability Database.
^ «PEP 529 — Change Windows filesystem encoding to UTF-8». Python.org. Retrieved 2022-05-10. This PEP proposes changing the default filesystem encoding on Windows to utf-8, and changing all filesystem functions to use the Unicode APIs for filesystem paths. [..] can correctly round-trip all characters used in paths (on POSIX with surrogateescape handling; on Windows because str maps to the native representation). On Windows bytes cannot round-trip all characters used in paths
^ «DataInput (Java Platform SE 8)». docs.oracle.com. Retrieved 2021-03-24.
^ «Non-decodable Bytes in System Character Interfaces». python.org. 2009-04-22. Retrieved 2014-08-13.
^ «Unicode 6.0.0».
^ 128 1-byte, (16+5)×64 2-byte, and 5×64×64 3-byte. There may be somewhat fewer if more precise tests are done for each continuation byte.
^ ^a ^b ^c von Löwis, Martin (2009-04-22). «Non-decodable Bytes in System Character Interfaces». Python Software Foundation. PEP 383.
^ «Chapter 2» (PDF), The Unicode Standard — Version 15.0.0, p. 39
^ «UTF-8 and Unicode FAQ for Unix/Linux».
^ Davis, Mark (2012-02-03). «Unicode over 60 percent of the web». Official Google blog. Archived from the original on 2018-08-09. Retrieved 2020-07-24.
^ Davis, Mark (2008-05-05). «Moving to Unicode 5.1». Official Google Blog. Retrieved 2023-03-13.
^ «Usage statistics and market share of ASCII for websites, October 2021». w3techs.com. Retrieved 2020-10-01.
^ Bray, Tim (December 2017). Bray, T. (ed.). The JavaScript Object Notation (JSON) Data Interchange Format. IETF. doi:10.17487/RFC8259. RFC 8259. Retrieved 16 February 2018.
^ «Usage of Internet mail in the world characters». washingtonindependent.com. 1998-08-01. Retrieved 2007-11-08.
^ «Encoding Standard». encoding.spec.whatwg.org. Retrieved 2018-11-15.
^ «Specifying the document’s character encoding». HTML 5.2 (Report). World Wide Web Consortium. 14 December 2017. Retrieved 2018-06-03.
^ «Choose text encoding when you open and save files». support.microsoft.com. Retrieved 2021-11-01.
^ «utf 8 — Character encoding of Microsoft Word DOC and DOCX files?». Stack Overflow. Retrieved 2021-11-01.
^ «Exporting a UTF-8 .txt file from Word».
^ «excel — Are XLSX files UTF-8 encoded by definition?». Stack Overflow. Retrieved 2021-11-01.
^ «How to open UTF-8 CSV file in Excel without mis-conversion of characters in Japanese and Chinese language for both Mac and Windows?». answers.microsoft.com. Retrieved 2021-11-01.
^ «How can I make Notepad to save text in UTF-8 without the BOM?». Stack Overflow. Retrieved 2021-03-24.
^ Galloway, Matt (October 2012). «Character encoding for iOS developers. Or, UTF-8 what now?». www.galloway.me.uk. Retrieved 2021-01-02. in reality, you usually just assume UTF-8 since that is by far the most common encoding.
^ «Windows 10 Notepad is getting better UTF-8 encoding support». BleepingComputer. Retrieved 2021-03-24. Microsoft is now defaulting to saving new text files as UTF-8 without BOM, as shown below.
^ «Customize the Windows 11 Start menu». docs.microsoft.com. Retrieved 2021-06-29. Make sure your LayoutModification.json uses UTF-8 encoding.
^ «JEP 400: UTF-8 by default». openjdk.java.net. Retrieved 2022-03-30.
^ «Feature #16604: Set default for Encoding.default_external to UTF-8 on Windows». bugs.ruby-lang.org. Ruby master – Ruby Issue Tracking System. Retrieved 2022-08-01.
^ «Feature #12650: Use UTF-8 encoding for ENV on Windows». bugs.ruby-lang.org. Ruby master – Ruby Issue Tracking System. Retrieved 2022-08-01.
^ «New features in R 4.2.0». The Jumping Rivers Blog. R bloggers. 2022-04-01. Retrieved 2022-08-01.
^ «PEP 540 – add a new UTF-8 mode». peps.python.org. Retrieved 2022-09-23.
^ «PEP 686 – Make UTF-8 mode default | peps.python.org». peps.python.org. Retrieved 2023-07-26.
^ «PEP 597 – add optional EncodingWarning». Python.org. Retrieved 2021-08-24.
^ «Support for UTF-8 as a portable source file encoding» (PDF).
^ ^a ^b «PEP 393 – Flexible String Representation». Python.org. Retrieved 2022-05-18. As interaction with other libraries will often require some sort of internal representation, the specification chooses UTF-8 as the recommended way of exposing strings to C code. [..] The data and utf8 pointers point to the same memory if the string uses only ASCII characters (using only Latin-1 is not sufficient). [..] The recommended way to create a Unicode object is to use the function PyUnicode_New [..] A new function PyUnicode_AsUTF8 is provided to access the UTF-8 representation.
^ «Source code representation». The Go Programming Language Specification. golang.org (Report). Retrieved 2021-02-10.
^ Tsai, Michael J. (21 March 2019). «UTF-8 string in Swift 5» (blog). Retrieved 2021-03-15. Switching to UTF-8 fulfills one of string’s long-term goals, to enable high-performance processing, […] also lays the groundwork for providing even more performant APIs in the future.
^ «PyPy v7.1 released; now uses UTF-8 internally for Unicode strings». Mattip. PyPy status blog. 2019-03-24. Retrieved 2020-11-21.
^ «Unicode Objects and Codecs». Python documentation. Retrieved 2023-08-19. UTF-8 representation is created on demand and cached in the Unicode object.
^ «PEP 623 – remove wstr from Unicode». Python.org. Retrieved 2020-11-21. Until we drop [the] legacy Unicode object, it is very hard to try other Unicode implementation[s], like UTF-8 based implementation in PyPy.
^ Wouters, Thomas (2023-07-11). «Python Insider: Python 3.12.0 beta 4 released». Python Insider. Retrieved 2023-07-26. The deprecated wstr and wstr_length members of the C implementation of unicode objects were removed, per PEP 623.
^ «/validate-charset (validate for compatible characters)». docs.microsoft.com. Retrieved 2021-07-19. Visual Studio uses UTF-8 as the internal character encoding during conversion between the source character set and the execution character set.
^ «Introducing UTF-8 support for SQL Server». techcommunity.microsoft.com. 2019-07-02. Retrieved 2021-08-24. For example, changing an existing column data type from NCHAR(10) to CHAR(10) using an UTF-8 enabled collation, translates into nearly 50% reduction in storage requirements. [..] In the ASCII range, when doing intensive read/write I/O on UTF-8, we measured an average 35% performance improvement over UTF-16 using clustered tables with a non-clustered index on the string column, and an average 11% performance improvement over UTF-16 using a heap.
^ «Use the Windows UTF-8 code page – UWP applications». docs.microsoft.com. Retrieved 2020-06-06. As of Windows version 1903 (May 2019 update), you can use the ActiveCodePage property in the appxmanifest for packaged apps, or the fusion manifest for unpackaged apps, to force a process to use UTF-8 as the process code page. […] CP_ACP equates to CP_UTF8 only if running on Windows version 1903 (May 2019 update) or above and the ActiveCodePage property described above is set to UTF-8. Otherwise, it honors the legacy system code page. We recommend using CP_UTF8 explicitly.
^ «Appendix F. FSS-UTF / File System Safe UCS Transformation format» (PDF). The Unicode Standard 1.1. Archived (PDF) from the original on 2016-06-07. Retrieved 2016-06-07.
^ Whistler, Kenneth (2001-06-12). «FSS-UTF, UTF-2, UTF-8, and UTF-16». Archived from the original on 2016-06-07. Retrieved 2006-06-07.
^ ^a ^b Pike, Rob (2003-04-30). «UTF-8 history». Retrieved 2012-09-07.
^ Pike, Rob (2012-09-06). «UTF-8 turned 20 years old yesterday». Retrieved 2012-09-07.
^ ISO/IEC 10646:2014 §9.1, 2014.
^ The Unicode Standard, Version 15.0 §3.9 D92, §3.10 D95, 2021.
^ Unicode Standard Annex #27: Unicode 3.1, 2001.
^ The Unicode Standard, Version 5.0 §3.9–§3.10 ch. 3, 2006.
^ The Unicode Standard, Version 6.0 §3.9 D92, §3.10 D95, 2010.
^ McGowan, Rick (2011-12-19). «Compatibility Encoding Scheme for UTF-16: 8-Bit (CESU-8)». Unicode Consortium. Unicode Technical Report #26.
^ «Character Set Support». Oracle Database 19c Documentation, SQL Language Reference. Oracle Corporation.
^ «Supporting Multilingual Databases with Unicode § Support for the Unicode Standard in Oracle Database». Database Globalization Support Guide. Oracle Corporation.
^ «8.2.2.3. Character encodings». HTML 5.1 Standard. W3C.
^ «8.2.2.3. Character encodings». HTML 5 Standard. W3C.
^ «12.2.3.3 Character encodings». HTML Living Standard. WHATWG.
^ «Java SE documentation for Interface java.io.DataInput, subsection on Modified UTF-8». Oracle Corporation. 2015. Retrieved 2015-10-16.
^ «The Java Virtual Machine Specification, section 4.4.7: "The CONSTANT_Utf8_info Structure"». Oracle Corporation. 2015. Retrieved 2015-10-16.
^ «Java Object Serialization Specification, chapter 6: Object Serialization Stream Protocol, section 2: Stream Elements». Oracle Corporation. 2010. Retrieved 2015-10-16.
^ «Java Native Interface Specification, chapter 3: JNI Types and Data Structures, section: Modified UTF-8 Strings». Oracle Corporation. 2015. Retrieved 2015-10-16.
^ «The Java Virtual Machine Specification, section 4.4.7: "The CONSTANT_Utf8_info Structure"». Oracle Corporation. 2015. Retrieved 2015-10-16.
^ «ART and Dalvik». Android Open Source Project. Archived from the original on 2013-04-26. Retrieved 2013-04-09.
^ «UTF-8 bit by bit». Tcler’s Wiki. 2001-02-28. Retrieved 2022-09-03.
^ Sapin, Simon (2016-03-11) [2014-09-25]. «The WTF-8 encoding». Archived from the original on 2016-05-24. Retrieved 2016-05-24.
^ Sapin, Simon (2015-03-25) [2014-09-25]. «The WTF-8 encoding § Motivation». Archived from the original on 2020-08-16. Retrieved 2020-08-26.
^ «WTF-8.com». 2006-05-18. Retrieved 2016-06-21.
^ Speer, Robyn (2015-05-21). «ftfy (fixes text for you) 4.0: changing less and fixing more». Archived from the original on 2015-05-30. Retrieved 2016-06-21.
^ «WTF-8, a transformation format of code page 1252». Archived from the original on 2016-10-13. Retrieved 2016-10-12.
^ «PEP 540 — Add a new UTF-8 Mode». Python.org. Retrieved 2021-03-24.
^ «RTFM optu8to16(3), optu8to16vis(3)». www.mirbsd.org.
^ ^a ^b Davis, Mark; Suignard, Michel (2014). «3.7 Enabling Lossless Conversion». Unicode Security Considerations. Unicode Technical Report #36.

External links

Original UTF-8 paper (or pdf) for Plan 9 from Bell Labs
History of UTF-8 by Rob Pike
UTF-8 test pages:
- Andreas Prilop Archived 2017-11-30 at the Wayback Machine
- Jost Gippert
- World Wide Web Consortium
Unix/Linux: UTF-8/Unicode FAQ, Linux Unicode HOWTO, UTF-8 and Gentoo
Characters, Symbols and the Unicode Miracle on YouTube


The ASCII table (American Standard Code for Information Interchange) is the worldwide standard for encoding the letters of the English alphabet, common special characters (! $ # % & and so on), and some non-printable characters (for example, carriage return 0x0D and line feed 0x0A).

The table was created at a time when it became necessary to map characters to numbers. Such a mapping was needed so that a text message could be sent as numbers between different devices over a digital link.

The CP1251 (windows-1251) table

This code table is called either CP1251 or Windows-1251. It is the standard encoding for Cyrillic characters in Windows operating systems with a Russian-language interface.

The first part of the table (up to byte 0x7F) repeats the ASCII table, and the second part (0x80 to 0xFF) holds the Cyrillic characters. The basic Russian letters А to я occupy 0xC0 to 0xFF in alphabetical order; Ё and ё sit apart from them at 0xA8 and 0xB8.

CP1251 (windows-1251)
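The layout of the table can be checked with Python's built-in `cp1251` codec. This is a small sketch for illustration; the variable names are hypothetical.

```python
# Bytes 0x00-0x7F decode to the same characters in cp1251 and in ASCII.
ascii_bytes = bytes(range(0x80))
same_as_ascii = ascii_bytes.decode("cp1251") == ascii_bytes.decode("ascii")

# The letters А..я (U+0410..U+044F) map to 0xC0..0xFF in alphabetical order.
letters = "".join(chr(cp) for cp in range(0x0410, 0x0450))
contiguous = letters.encode("cp1251") == bytes(range(0xC0, 0x100))

# Ё and ё are the exception: they are stored outside that run.
yo = "Ёё".encode("cp1251")

print(same_as_ascii, contiguous, yo.hex())
```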

The ISO-8859-5 table

This encoding is used in Nextion displays to encode Cyrillic characters.

Note that in this table the Cyrillic letters are also arranged in alphabetical order, and each one sits exactly 16 positions (0x10) lower than in the windows-1251 code table: А is 0xB0 here and 0xC0 in windows-1251.
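The 16-position shift can be confirmed with Python's `iso8859_5` and `cp1251` codecs. This is a minimal sketch; the names are hypothetical, and it only covers the letters А to я, not Ё/ё or punctuation.

```python
# The letters А..я as a string (U+0410..U+044F).
letters = "".join(chr(cp) for cp in range(0x0410, 0x0450))

iso = letters.encode("iso8859_5")
win = letters.encode("cp1251")

# Collect the distinct differences between the two byte values of each letter.
offsets = {w - i for w, i in zip(win, iso)}

print(hex(iso[0]), hex(win[0]), offsets)
```

For text headed to such a display, encoding the string directly with `text.encode("iso8859_5")` is simpler than shifting windows-1251 bytes by hand, since characters outside the А to я run do not follow the fixed offset.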

The UTF-8 encoding
(Unicode Transformation Format)

A very widespread character encoding format that encodes each character with a variable number of bytes.

For example, if the character's number needs 21 bits, 4 bytes are used. If up to 16 bits are needed, 3 bytes are used. If 11 bits are enough, 2 bytes are used. And if the number fits in 7 bits, a single byte is used.
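The rule above can be written as a small function. This is a sketch, not a reference implementation: `utf8_length` is a hypothetical helper, and the 16-bit (3-byte) step comes from the UTF-8 definition rather than from the paragraph itself. The loop cross-checks the helper against Python's own encoder.

```python
def utf8_length(code_point: int) -> int:
    """Number of bytes UTF-8 needs for one code point (hypothetical helper)."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode range")
    bits = code_point.bit_length()
    if bits <= 7:
        return 1
    if bits <= 11:
        return 2
    if bits <= 16:
        return 3
    return 4  # up to 21 bits


samples = {"A": 0x41, "Ж": 0x0416, "€": 0x20AC, "😀": 0x1F600}
for char, cp in samples.items():
    # Cross-check against the built-in encoder.
    assert utf8_length(cp) == len(char.encode("utf-8"))
    print(char, cp.bit_length(), utf8_length(cp))
```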

UTF-8 encoding

All ASCII characters are encoded unchanged in UTF-8, that is, in 1 byte, exactly as in the standard ASCII table.

All other characters are encoded in 2 to 4 bytes.

Cyrillic characters are encoded in two bytes.


The difference between utf-8 and windows 1251. Let's look at how the two encodings differ in theory and in practice, and how to overcome some of the problems with Cyrillic in utf-8.

About the utf-8 and windows 1251 encodings

The main thing that interests us is how the utf-8 and windows 1251 encodings differ. For Russian text, only the Cyrillic differs: ASCII characters have the same bytes in both.

How utf-8 and windows 1251 differ

UTF-8 is a multi-byte encoding, while Windows-1251 is single-byte. For Russian text, moreover, the only difference is in the Cyrillic.

A Cyrillic letter in UTF-8 takes twice as many bytes as (1) a Latin letter in UTF-8 and (2) a Latin or Cyrillic letter in Windows-1251 (see the example further down).

The main difference between the encodings is the set of characters they cover. UTF-8 can represent far more characters than Windows-1251. Windows-1251 is a single-byte encoding, so it has only 256 byte values to assign and cannot represent more characters than that. For Cyrillic this is quite enough, which is why single-byte encodings are still so widely used.
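The difference in character sets is easy to observe: a string either fits into Windows-1251 or it does not, while UTF-8 accepts all of them. Here is a minimal Python sketch; `fits_cp1251` and the sample strings are hypothetical.

```python
def fits_cp1251(text: str) -> bool:
    """True if every character of `text` exists in Windows-1251."""
    try:
        text.encode("cp1251")
        return True
    except UnicodeEncodeError:
        return False


samples = ["Привет", "Hello", "©", "ü", "日本語"]
for sample in samples:
    # UTF-8 can encode every sample; cp1251 only some of them.
    print(sample, fits_cp1251(sample), len(sample.encode("utf-8")))
```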

What the windows 1251 encoding is

Windows-1251 is a character set and encoding that is the standard 8-bit encoding for all Russian versions of Microsoft Windows. It is quite popular. Windows-1251 compares favorably with other 8-bit Cyrillic encodings (such as CP866, KOI8-R and ISO 8859-5) in that it contains practically all the characters used in Russian typography for ordinary text; it also contains all the characters for languages close to Russian: Ukrainian, Belarusian, Serbian and Bulgarian.

What the UTF-8 encoding is

UTF-8 is a currently widespread encoding that implements a representation of Unicode compatible with 8-bit text encoding. It is widely used in operating systems and on the web. Text consisting only of Unicode characters with numbers below 128 turns into plain ASCII text when written in UTF-8. The remaining Unicode characters are represented by sequences of 2 to 4 bytes (the original design allowed up to 6, but RFC 3629 limits UTF-8 to code points up to U+10FFFF).

In the original design a UTF-8 character could take as many as 6 bytes, but only up to 4 are used and no more are planned. A Russian letter, for example, takes 2 bytes. Every character in the Unicode table is supported by this encoding. For instance, if you need the copyright sign (©), you do not have to look for a special font or render the character as an image.

Example: outputting Latin text in utf-8

If you have read the theory about the difference between utf-8 and windows 1251, that is already a victory!

And if you also understood what it was about, you are a regular Einstein, and there is not much point in reading further.

For everyone else, let's continue.

How text differs in the utf-8 and windows 1251 encodings

Theory is all well and good, but how do things look in practice?

How can we show the difference between the two encodings?

Our site's main encoding is utf-8, so we can easily see what happens to text in this encoding.

We need some text in Latin script.

And we need a word that has the same number of letters in Latin and in Cyrillic, so let it be my name.

Let it be the word «Marat».

Next we need the var_dump function.

We output this construct right here:

var_dump('Marat');

Result:

string(5) "Marat"

What can we read from this?

That it is a string, and that its length is 5 (var_dump reports the length in bytes).

Example: outputting Cyrillic text in utf-8

Now let's do the same with a string in Cyrillic.

The encoding is still utf-8.

But now we need text in Cyrillic.

Let it be the word «Марат».

var_dump again.

We output this construct right here:

var_dump('Марат');

Result:

string(10) "Марат"

And what do we see here?

The length of the string is 10. If you read the theory carefully, this is the proof that each Cyrillic letter consists of two bytes, while Latin letters are not affected.
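The same counts can be reproduced outside PHP. Here is a minimal Python sketch (`lengths` is a hypothetical helper) that reports the number of characters next to the number of bytes in each encoding:

```python
def lengths(word: str) -> tuple[int, int, int]:
    """Return (characters, UTF-8 bytes, Windows-1251 bytes) for `word`."""
    return (
        len(word),
        len(word.encode("utf-8")),
        len(word.encode("cp1251")),
    )


for word in ("Marat", "Марат"):
    print(word, lengths(word))
```

The 10 that var_dump prints for the Cyrillic word corresponds to the UTF-8 byte count, not to the number of letters.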

This is why problems arise with Cyrillic text in utf-8: many byte-oriented string functions simply do not work on it.

As an example, I once got so fed up with strtolower on utf-8 Cyrillic that I decided to write my own strtolower function, so as not to stack up several functions every time.
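The effect can be imitated in Python, whose `bytes.lower()` only touches ASCII letters while `str.lower()` works on characters. This is an analogy only, assuming the PHP function processes the string byte by byte; the variable names are hypothetical.

```python
word = "МАРАТ Marat"

# Byte-level lowercasing: only the ASCII letters change, because every
# byte of a UTF-8 encoded Cyrillic letter is 0x80 or above.
byte_lower = word.encode("utf-8").lower().decode("utf-8")

# Character-level lowercasing handles both scripts.
char_lower = word.lower()

print(byte_lower)
print(char_lower)
```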

Example of the difference between the utf-8 and windows 1251 encodings

If you were too lazy to read the two sections above, here again are the results for Latin and Cyrillic text with the same number of letters.

Result of var_dump('Marat');:
string(5) "Marat"

Result of var_dump('Марат');:
string(10) "Марат"

What to do if functions do not work for Cyrillic in utf-8?

Since I have been building websites for a long time, I can say that in practice there are not many cases where you need a special function for handling Cyrillic in utf-8.

But when the need did arise, there were several ways to solve it.

The first is the functions with the «mb_» prefix (the mbstring extension); naturally, you need to check that they are available on your hosting.

The second option is to write your own function that works for both Latin and Cyrillic, as I showed with strtolower.

The third option is to convert the string from utf-8 to windows 1251.

Let's look at the first example that comes to mind.

Let it be the str_split function and its counterpart mb_str_split.

print_r(str_split('Марат')); outputs:

Array
(
    [0] => �
    [1] => �
    [2] => �
    [3] => �
    [4] => �
    [5] => �
    [6] => �
    [7] => �
    [8] => �
    [9] => �
)

print_r(mb_str_split('Марат')); outputs:

As you can see, it is a complete mess.

We looked into this further elsewhere.
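Why a byte-wise split produces ten unreadable pieces can be shown with a short Python sketch (an illustration of the idea, not of the PHP functions themselves; the names are hypothetical). Each piece is one byte of a two-byte sequence, so none of them is valid UTF-8 on its own, whereas splitting by characters gives the five letters.

```python
word = "Марат"
data = word.encode("utf-8")

# Split into single bytes, the way a byte-oriented split would.
byte_pieces = [data[i:i + 1] for i in range(len(data))]

# Shown on a UTF-8 page, every lone byte becomes the replacement character.
shown = [piece.decode("utf-8", errors="replace") for piece in byte_pieces]

# Split by characters instead.
chars = list(word)

print(len(byte_pieces), shown)
print(len(chars), chars)
```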

How to convert a string from utf-8 to windows 1251

So, there is a third way to fight the little squares (the encoding not being understood): convert the string from utf-8 to windows 1251:

iconv("UTF-8", "windows-1251", $text)

After you have done everything you intended with the text, convert it back to the original encoding:

iconv("windows-1251", "UTF-8", $text)
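The same round trip can be sketched in Python, where converting between encodings means decoding bytes to text and encoding them again. `reencode` is a hypothetical helper, not part of any library.

```python
def reencode(data: bytes, source: str, target: str) -> bytes:
    """Decode `data` from `source` and encode it as `target`."""
    return data.decode(source).encode(target)


utf8_text = "Марат".encode("utf-8")

# UTF-8 -> Windows-1251: one byte per letter.
win_text = reencode(utf8_text, "utf-8", "cp1251")

# ... byte-oriented processing would happen here ...

# Windows-1251 -> UTF-8: back to the original bytes.
back = reencode(win_text, "cp1251", "utf-8")

print(len(utf8_text), len(win_text), back == utf8_text)
```

The conversion is only lossless for characters that exist in Windows-1251; in this sketch anything else raises `UnicodeEncodeError`.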

Let's look at an example of converting text from UTF-8 to windows-1251 and back

We used var_dump, and it counted bytes rather than letters. Because var_dump output cannot simply be placed into the page, we used this workaround:

ob_start();

var_dump('Марат');

echo ob_get_clean();

Now let's try converting the string right inside:

ob_start();

var_dump(iconv("UTF-8", "windows-1251", 'Марат'));

echo ob_get_clean();

The character count is now correct, but the word was not converted back, so it is unreadable on a utf-8 page:

string(5) "��"

Let's fix that:

ob_start();

var_dump(iconv("UTF-8", "windows-1251", 'Марат'));

echo iconv("windows-1251", "UTF-8", ob_get_clean());

Result:

string(5) "Марат"

So, you have seen text being converted from utf-8 to windows 1251 and then back again!
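The unreadable intermediate result is what happens whenever windows-1251 bytes are interpreted as utf-8. Here is a minimal Python sketch of that effect, with hypothetical variable names:

```python
# The word as Windows-1251 bytes, one byte per letter.
win_bytes = "Марат".encode("cp1251")

# Interpreted as UTF-8, none of these bytes forms a valid sequence,
# so a tolerant reader shows only replacement characters.
misread = win_bytes.decode("utf-8", errors="replace")

# Interpreted with the encoding it was written in, the text is intact.
fixed = win_bytes.decode("cp1251")

print(len(win_bytes), ascii(misread), fixed)
```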

You probably thought:

What on earth is going on here!? It is not madness! When you are on the inside rather than the outside, everything seems not just simple but very simple.

And the deeper you are in the subject, the more it is like eating, drinking, breathing: you just do not think about it.

I am not saying it is always like that; sometimes a problem is very hard to solve.

Which is better for Cyrillic: utf-8 or…

An interesting search query: «Which is better for Cyrillic, utf-8 or…»

The thing is, I chose «utf-8» about 14 years ago (the number updates automatically), and by now it is hard to remember exactly why. But I can tell you for certain that I once used «windows-1251», and it had some quirks in the form of garbled output, so that willy-nilly I switched to «utf-8».

What are the drawbacks of utf-8?

One of the main problems of «utf-8» is that it is multi-byte.

Yes, this is somewhat inconvenient at the very beginning, but for every function that refuses to work with Cyrillic there is a replacement.

While building a site you may run into a few problems, which you will solve and then simply forget about.

Have I thought about switching from utf-8 to another encoding?

There is no point in thinking about switching from utf-8 to another encoding when everything works as it should!


Decimal	Code point	UTF-8 (hex)	Character	Name
0	U+0000	00		Control character: Null
1	U+0001	01		Control character: Start Of Heading
2	U+0002	02		Control character: Start Of Text
3	U+0003	03		Control character: End Of Text
4	U+0004	04		Control character: End Of Transmission
5	U+0005	05		Control character: Enquiry
6	U+0006	06		Control character: Acknowledge
7	U+0007	07		Control character: Bell
8	U+0008	08		Control character: Backspace
9	U+0009	09		Control character: Character Tabulation
10	U+000A	0A		Control character: Line Feed (lf)
11	U+000B	0B		Control character: Line Tabulation
12	U+000C	0C		Control character: Form Feed (ff)
13	U+000D	0D		Control character: Carriage Return (cr)
14	U+000E	0E		Control character: Shift Out
15	U+000F	0F		Control character: Shift In
16	U+0010	10		Control character: Data Link Escape
17	U+0011	11		Control character: Device Control One
18	U+0012	12		Control character: Device Control Two
19	U+0013	13		Control character: Device Control Three
20	U+0014	14		Control character: Device Control Four
21	U+0015	15		Control character: Negative Acknowledge
22	U+0016	16		Control character: Synchronous Idle
23	U+0017	17		Control character: End Of Transmission Block
24	U+0018	18		Control character: Cancel
25	U+0019	19		Control character: End Of Medium
26	U+001A	1A		Control character: Substitute
27	U+001B	1B		Control character: Escape
28	U+001C	1C		Control character: Information Separator Four
29	U+001D	1D		Control character: Information Separator Three
30	U+001E	1E		Control character: Information Separator Two
31	U+001F	1F		Control character: Information Separator One
32	U+0020	20		Space
33	U+0021	21	!	Exclamation Mark
34	U+0022	22	"	Quotation Mark
35	U+0023	23	#	Number Sign
36	U+0024	24	$	Dollar Sign
37	U+0025	25	%	Percent Sign
38	U+0026	26	&	Ampersand
39	U+0027	27	'	Apostrophe
40	U+0028	28	(	Left Parenthesis
41	U+0029	29	)	Right Parenthesis
42	U+002A	2A	*	Asterisk
43	U+002B	2B	+	Plus Sign
44	U+002C	2C	,	Comma
45	U+002D	2D	-	Hyphen-minus
46	U+002E	2E	.	Full Stop
47	U+002F	2F	/	Solidus
48	U+0030	30	0	Digit Zero
49	U+0031	31	1	Digit One
50	U+0032	32	2	Digit Two
51	U+0033	33	3	Digit Three
52	U+0034	34	4	Digit Four
53	U+0035	35	5	Digit Five
54	U+0036	36	6	Digit Six
55	U+0037	37	7	Digit Seven
56	U+0038	38	8	Digit Eight
57	U+0039	39	9	Digit Nine
58	U+003A	3A	:	Colon
59	U+003B	3B	;	Semicolon
60	U+003C	3C	<	Less-than Sign
61	U+003D	3D	=	Equals Sign
62	U+003E	3E	>	Greater-than Sign
63	U+003F	3F	?	Question Mark
64	U+0040	40	@	Commercial At
65	U+0041	41	A	Latin Capital Letter A
66	U+0042	42	B	Latin Capital Letter B
67	U+0043	43	C	Latin Capital Letter C
68	U+0044	44	D	Latin Capital Letter D
69	U+0045	45	E	Latin Capital Letter E
70	U+0046	46	F	Latin Capital Letter F
71	U+0047	47	G	Latin Capital Letter G
72	U+0048	48	H	Latin Capital Letter H
73	U+0049	49	I	Latin Capital Letter I
74	U+004A	4A	J	Latin Capital Letter J
75	U+004B	4B	K	Latin Capital Letter K
76	U+004C	4C	L	Latin Capital Letter L
77	U+004D	4D	M	Latin Capital Letter M
78	U+004E	4E	N	Latin Capital Letter N
79	U+004F	4F	O	Latin Capital Letter O
80	U+0050	50	P	Latin Capital Letter P
81	U+0051	51	Q	Latin Capital Letter Q
82	U+0052	52	R	Latin Capital Letter R
83	U+0053	53	S	Latin Capital Letter S
84	U+0054	54	T	Latin Capital Letter T
85	U+0055	55	U	Latin Capital Letter U
86	U+0056	56	V	Latin Capital Letter V
87	U+0057	57	W	Latin Capital Letter W
88	U+0058	58	X	Latin Capital Letter X
89	U+0059	59	Y	Latin Capital Letter Y
90	U+005A	5A	Z	Latin Capital Letter Z
91	U+005B	5B	[	Left Square Bracket
92	U+005C	5C	\	Reverse Solidus
93	U+005D	5D	]	Right Square Bracket
94	U+005E	5E	^	Circumflex Accent
95	U+005F	5F	_	Low Line
96	U+0060	60	`	Grave Accent
97	U+0061	61	a	Latin Small Letter A
98	U+0062	62	b	Latin Small Letter B
99	U+0063	63	c	Latin Small Letter C
100	U+0064	64	d	Latin Small Letter D
101	U+0065	65	e	Latin Small Letter E
102	U+0066	66	f	Latin Small Letter F
103	U+0067	67	g	Latin Small Letter G
104	U+0068	68	h	Latin Small Letter H
105	U+0069	69	i	Latin Small Letter I
106	U+006A	6A	j	Latin Small Letter J
107	U+006B	6B	k	Latin Small Letter K
108	U+006C	6C	l	Latin Small Letter L
109	U+006D	6D	m	Latin Small Letter M
110	U+006E	6E	n	Latin Small Letter N
111	U+006F	6F	o	Latin Small Letter O
112	U+0070	70	p	Latin Small Letter P
113	U+0071	71	q	Latin Small Letter Q
114	U+0072	72	r	Latin Small Letter R
115	U+0073	73	s	Latin Small Letter S
116	U+0074	74	t	Latin Small Letter T
117	U+0075	75	u	Latin Small Letter U
118	U+0076	76	v	Latin Small Letter V
119	U+0077	77	w	Latin Small Letter W
120	U+0078	78	x	Latin Small Letter X
121	U+0079	79	y	Latin Small Letter Y
122	U+007A	7A	z	Latin Small Letter Z
123	U+007B	7B	{	Left Curly Bracket
124	U+007C	7C	|	Vertical Line
125	U+007D	7D	}	Right Curly Bracket
126	U+007E	7E	~	Tilde
127	U+007F	7F		Control character: Delete
128	U+0080	C2 80	€	Control Character or Euro Sign, See Note 1
129	U+0081	C2 81		Control character: Unknown
130	U+0082	C2 82	‚	Control character: Break Permitted Here
131	U+0083	C2 83	ƒ	Control character: No Break Here
132	U+0084	C2 84	„	Control character: Unknown
133	U+0085	C2 85	…	Control character: Next Line (nel)
134	U+0086	C2 86	†	Control character: Start Of Selected Area
135	U+0087	C2 87	‡	Control character: End Of Selected Area
136	U+0088	C2 88	ˆ	Control character: Character Tabulation Set
137	U+0089	C2 89	‰	Control character: Character Tabulation With Justification
138	U+008A	C2 8A	Š	Control character: Line Tabulation Set
139	U+008B	C2 8B	‹	Control character: Partial Line Forward
140	U+008C	C2 8C	Œ	Control character: Partial Line Backward
141	U+008D	C2 8D		Control character: Reverse Line Feed
142	U+008E	C2 8E	Ž	Control character: Single Shift Two
143	U+008F	C2 8F		Control character: Single Shift Three
144	U+0090	C2 90		Control character: Device Control String
145	U+0091	C2 91	‘	Control character: Private Use One
146	U+0092	C2 92	’	Control character: Private Use Two
147	U+0093	C2 93	“	Control character: Set Transmit State
148	U+0094	C2 94	”	Control character: Cancel Character
149	U+0095	C2 95	•	Control character: Message Waiting
150	U+0096	C2 96	–	Control character: Start Of Guarded Area
151	U+0097	C2 97	—	Control character: End Of Guarded Area
152	U+0098	C2 98	˜	Control character: Start Of String
153	U+0099	C2 99	™	Control character: Unknown
154	U+009A	C2 9A	š	Control character: Single Character Introducer
155	U+009B	C2 9B	›	Control character: Control Sequence Introducer
156	U+009C	C2 9C	œ	Control character: String Terminator
157	U+009D	C2 9D		Control character: Operating System Command
158	U+009E	C2 9E	ž	Control character: Privacy Message
159	U+009F	C2 9F	Ÿ	Control character: Application Program Command
160	U+00A0	C2 A0		No-break Space
161	U+00A1	C2 A1	¡	Inverted Exclamation Mark
162	U+00A2	C2 A2	¢	Cent Sign
163	U+00A3	C2 A3	£	Pound Sign
164	U+00A4	C2 A4	¤	Currency Sign
165	U+00A5	C2 A5	¥	Yen Sign
166	U+00A6	C2 A6	¦	Broken Bar
167	U+00A7	C2 A7	§	Section Sign
168	U+00A8	C2 A8	¨	Diaeresis
169	U+00A9	C2 A9	©	Copyright Sign
170	U+00AA	C2 AA	ª	Feminine Ordinal Indicator
171	U+00AB	C2 AB	«	Left-pointing Double Angle Quotation Mark
172	U+00AC	C2 AC	¬	Not Sign
173	U+00AD	C2 AD		Soft Hyphen
174	U+00AE	C2 AE	®	Registered Sign
175	U+00AF	C2 AF	¯	Macron
176	U+00B0	C2 B0	°	Degree Sign
177	U+00B1	C2 B1	±	Plus-minus Sign
178	U+00B2	C2 B2	²	Superscript Two
179	U+00B3	C2 B3	³	Superscript Three
180	U+00B4	C2 B4	´	Acute Accent
181	U+00B5	C2 B5	µ	Micro Sign
182	U+00B6	C2 B6	¶	Pilcrow Sign
183	U+00B7	C2 B7	·	Middle Dot
184	U+00B8	C2 B8	¸	Cedilla
185	U+00B9	C2 B9	¹	Superscript One
186	U+00BA	C2 BA	º	Masculine Ordinal Indicator
187	U+00BB	C2 BB	»	Right-pointing Double Angle Quotation Mark
188	U+00BC	C2 BC	¼	Vulgar Fraction One Quarter
189	U+00BD	C2 BD	½	Vulgar Fraction One Half
190	U+00BE	C2 BE	¾	Vulgar Fraction Three Quarters
191	U+00BF	C2 BF	¿	Inverted Question Mark
192	U+00C0	C3 80	À	Latin Capital Letter A With Grave
193	U+00C1	C3 81	Á	Latin Capital Letter A With Acute
194	U+00C2	C3 82	Â	Latin Capital Letter A With Circumflex
195	U+00C3	C3 83	Ã	Latin Capital Letter A With Tilde
196	U+00C4	C3 84	Ä	Latin Capital Letter A With Diaeresis
197	U+00C5	C3 85	Å	Latin Capital Letter A With Ring Above
198	U+00C6	C3 86	Æ	Latin Capital Letter Ae
199	U+00C7	C3 87	Ç	Latin Capital Letter C With Cedilla
200	U+00C8	C3 88	È	Latin Capital Letter E With Grave
201	U+00C9	C3 89	É	Latin Capital Letter E With Acute
202	U+00CA	C3 8A	Ê	Latin Capital Letter E With Circumflex
203	U+00CB	C3 8B	Ë	Latin Capital Letter E With Diaeresis
204	U+00CC	C3 8C	Ì	Latin Capital Letter I With Grave
205	U+00CD	C3 8D	Í	Latin Capital Letter I With Acute
206	U+00CE	C3 8E	Î	Latin Capital Letter I With Circumflex
207	U+00CF	C3 8F	Ï	Latin Capital Letter I With Diaeresis
208	U+00D0	C3 90	Ð	Latin Capital Letter Eth
209	U+00D1	C3 91	Ñ	Latin Capital Letter N With Tilde
210	U+00D2	C3 92	Ò	Latin Capital Letter O With Grave
211	U+00D3	C3 93	Ó	Latin Capital Letter O With Acute
212	U+00D4	C3 94	Ô	Latin Capital Letter O With Circumflex
213	U+00D5	C3 95	Õ	Latin Capital Letter O With Tilde
214	U+00D6	C3 96	Ö	Latin Capital Letter O With Diaeresis
215	U+00D7	C3 97	×	Multiplication Sign
216	U+00D8	C3 98	Ø	Latin Capital Letter O With Stroke
217	U+00D9	C3 99	Ù	Latin Capital Letter U With Grave
218	U+00DA	C3 9A	Ú	Latin Capital Letter U With Acute
219	U+00DB	C3 9B	Û	Latin Capital Letter U With Circumflex
220	U+00DC	C3 9C	Ü	Latin Capital Letter U With Diaeresis
221	U+00DD	C3 9D	Ý	Latin Capital Letter Y With Acute
222	U+00DE	C3 9E	Þ	Latin Capital Letter Thorn
223	U+00DF	C3 9F	ß	Latin Small Letter Sharp S
224	U+00E0	C3 A0	à	Latin Small Letter A With Grave
225	U+00E1	C3 A1	á	Latin Small Letter A With Acute
226	U+00E2	C3 A2	â	Latin Small Letter A With Circumflex
227	U+00E3	C3 A3	ã	Latin Small Letter A With Tilde
228	U+00E4	C3 A4	ä	Latin Small Letter A With Diaeresis
229	U+00E5	C3 A5	å	Latin Small Letter A With Ring Above
230	U+00E6	C3 A6	æ	Latin Small Letter Ae
231	U+00E7	C3 A7	ç	Latin Small Letter C With Cedilla
232	U+00E8	C3 A8	è	Latin Small Letter E With Grave
233	U+00E9	C3 A9	é	Latin Small Letter E With Acute
234	U+00EA	C3 AA	ê	Latin Small Letter E With Circumflex
235	U+00EB	C3 AB	ë	Latin Small Letter E With Diaeresis
236	U+00EC	C3 AC	ì	Latin Small Letter I With Grave
237	U+00ED	C3 AD	í	Latin Small Letter I With Acute
238	U+00EE	C3 AE	î	Latin Small Letter I With Circumflex
239	U+00EF	C3 AF	ï	Latin Small Letter I With Diaeresis
240	U+00F0	C3 B0	ð	Latin Small Letter Eth
241	U+00F1	C3 B1	ñ	Latin Small Letter N With Tilde
242	U+00F2	C3 B2	ò	Latin Small Letter O With Grave
243	U+00F3	C3 B3	ó	Latin Small Letter O With Acute
244	U+00F4	C3 B4	ô	Latin Small Letter O With Circumflex
245	U+00F5	C3 B5	õ	Latin Small Letter O With Tilde
246	U+00F6	C3 B6	ö	Latin Small Letter O With Diaeresis
247	U+00F7	C3 B7	÷	Division Sign
248	U+00F8	C3 B8	ø	Latin Small Letter O With Stroke
249	U+00F9	C3 B9	ù	Latin Small Letter U With Grave
250	U+00FA	C3 BA	ú	Latin Small Letter U With Acute
251	U+00FB	C3 BB	û	Latin Small Letter U With Circumflex
252	U+00FC	C3 BC	ü	Latin Small Letter U With Diaeresis
253	U+00FD	C3 BD	ý	Latin Small Letter Y With Acute
254	U+00FE	C3 BE	þ	Latin Small Letter Thorn
255	U+00FF	C3 BF	ÿ	Latin Small Letter Y With Diaeresis
256	U+0100	C4 80	Ā	Latin Capital Letter A With Macron
257	U+0101	C4 81	ā	Latin Small Letter A With Macron
258	U+0102	C4 82	Ă	Latin Capital Letter A With Breve
259	U+0103	C4 83	ă	Latin Small Letter A With Breve
260	U+0104	C4 84	Ą	Latin Capital Letter A With Ogonek
261	U+0105	C4 85	ą	Latin Small Letter A With Ogonek
262	U+0106	C4 86	Ć	Latin Capital Letter C With Acute
263	U+0107	C4 87	ć	Latin Small Letter C With Acute
264	U+0108	C4 88	Ĉ	Latin Capital Letter C With Circumflex
265	U+0109	C4 89	ĉ	Latin Small Letter C With Circumflex
266	U+010A	C4 8A	Ċ	Latin Capital Letter C With Dot Above
267	U+010B	C4 8B	ċ	Latin Small Letter C With Dot Above
268	U+010C	C4 8C	Č	Latin Capital Letter C With Caron
269	U+010D	C4 8D	č	Latin Small Letter C With Caron
270	U+010E	C4 8E	Ď	Latin Capital Letter D With Caron
271	U+010F	C4 8F	ď	Latin Small Letter D With Caron
272
U+0110 C4 90 Đ Latin Capital Letter D With Stroke 273 U+0111 C4 91 đ Latin Small Letter D With Stroke 274 U+0112 C4 92 Ē Latin Capital Letter E With Macron 275 U+0113 C4 93 ē Latin Small Letter E With Macron 276 U+0114 C4 94 Ĕ Latin Capital Letter E With Breve 277 U+0115 C4 95 ĕ Latin Small Letter E With Breve 278 U+0116 C4 96 Ė Latin Capital Letter E With Dot Above 279 U+0117 C4 97 ė Latin Small Letter E With Dot Above 280 U+0118 C4 98 Ę Latin Capital Letter E With Ogonek 281 U+0119 C4 99 ę Latin Small Letter E With Ogonek 282 U+011A C4 9A Ě Latin Capital Letter E With Caron 283 U+011B C4 9B ě Latin Small Letter E With Caron 284 U+011C C4 9C Ĝ Latin Capital Letter G With Circumflex 285 U+011D C4 9D ĝ Latin Small Letter G With Circumflex 286 U+011E C4 9E Ğ Latin Capital Letter G With Breve 287 U+011F C4 9F ğ Latin Small Letter G With Breve 288 U+0120 C4 A0 Ġ Latin Capital Letter G With Dot Above 289 U+0121 C4 A1 ġ Latin Small Letter G With Dot Above 290 U+0122 C4 A2 Ģ Latin Capital Letter G With Cedilla 291 U+0123 C4 A3 ģ Latin Small Letter G With Cedilla 292 U+0124 C4 A4 Ĥ Latin Capital Letter H With Circumflex 293 U+0125 C4 A5 ĥ Latin Small Letter H With Circumflex 294 U+0126 C4 A6 Ħ Latin Capital Letter H With Stroke 295 U+0127 C4 A7 ħ Latin Small Letter H With Stroke 296 U+0128 C4 A8 Ĩ Latin Capital Letter I With Tilde 297 U+0129 C4 A9 ĩ Latin Small Letter I With Tilde 298 U+012A C4 AA Ī Latin Capital Letter I With Macron 299 U+012B C4 AB ī Latin Small Letter I With Macron 300 U+012C C4 AC Ĭ Latin Capital Letter I With Breve 301 U+012D C4 AD ĭ Latin Small Letter I With Breve 302 U+012E C4 AE Į Latin Capital Letter I With Ogonek 303 U+012F C4 AF į Latin Small Letter I With Ogonek 304 U+0130 C4 B0 İ Latin Capital Letter I With Dot Above 305 U+0131 C4 B1 ı Latin Small Letter Dotless I 306 U+0132 C4 B2 Ĳ Latin Capital Ligature Ij 307 U+0133 C4 B3 ĳ Latin Small Ligature Ij 308 U+0134 C4 B4 Ĵ Latin Capital Letter J With Circumflex 309 U+0135 C4 B5 ĵ Latin Small 
Letter J With Circumflex 310 U+0136 C4 B6 Ķ Latin Capital Letter K With Cedilla 311 U+0137 C4 B7 ķ Latin Small Letter K With Cedilla 312 U+0138 C4 B8 ĸ Latin Small Letter Kra 313 U+0139 C4 B9 Ĺ Latin Capital Letter L With Acute 314 U+013A C4 BA ĺ Latin Small Letter L With Acute 315 U+013B C4 BB Ļ Latin Capital Letter L With Cedilla 316 U+013C C4 BC ļ Latin Small Letter L With Cedilla 317 U+013D C4 BD Ľ Latin Capital Letter L With Caron 318 U+013E C4 BE ľ Latin Small Letter L With Caron 319 U+013F C4 BF Ŀ Latin Capital Letter L With Middle Dot 320 U+0140 C5 80 ŀ Latin Small Letter L With Middle Dot 321 U+0141 C5 81 Ł Latin Capital Letter L With Stroke 322 U+0142 C5 82 ł Latin Small Letter L With Stroke 323 U+0143 C5 83 Ń Latin Capital Letter N With Acute 324 U+0144 C5 84 ń Latin Small Letter N With Acute 325 U+0145 C5 85 Ņ Latin Capital Letter N With Cedilla 326 U+0146 C5 86 ņ Latin Small Letter N With Cedilla 327 U+0147 C5 87 Ň Latin Capital Letter N With Caron 328 U+0148 C5 88 ň Latin Small Letter N With Caron 329 U+0149 C5 89 ŉ Latin Small Letter N Preceded By Apostrophe 330 U+014A C5 8A Ŋ Latin Capital Letter Eng 331 U+014B C5 8B ŋ Latin Small Letter Eng 332 U+014C C5 8C Ō Latin Capital Letter O With Macron 333 U+014D C5 8D ō Latin Small Letter O With Macron 334 U+014E C5 8E Ŏ Latin Capital Letter O With Breve 335 U+014F C5 8F ŏ Latin Small Letter O With Breve 336 U+0150 C5 90 Ő Latin Capital Letter O With Double Acute 337 U+0151 C5 91 ő Latin Small Letter O With Double Acute 338 U+0152 C5 92 Œ Latin Capital Ligature Oe 339 U+0153 C5 93 œ Latin Small Ligature Oe 340 U+0154 C5 94 Ŕ Latin Capital Letter R With Acute 341 U+0155 C5 95 ŕ Latin Small Letter R With Acute 342 U+0156 C5 96 Ŗ Latin Capital Letter R With Cedilla 343 U+0157 C5 97 ŗ Latin Small Letter R With Cedilla 344 U+0158 C5 98 Ř Latin Capital Letter R With Caron 345 U+0159 C5 99 ř Latin Small Letter R With Caron 346 U+015A C5 9A Ś Latin Capital Letter S With Acute 347 U+015B C5 9B ś Latin Small Letter 
S With Acute 348 U+015C C5 9C Ŝ Latin Capital Letter S With Circumflex 349 U+015D C5 9D ŝ Latin Small Letter S With Circumflex 350 U+015E C5 9E Ş Latin Capital Letter S With Cedilla 351 U+015F C5 9F ş Latin Small Letter S With Cedilla 352 U+0160 C5 A0 Š Latin Capital Letter S With Caron 353 U+0161 C5 A1 š Latin Small Letter S With Caron 354 U+0162 C5 A2 Ţ Latin Capital Letter T With Cedilla 355 U+0163 C5 A3 ţ Latin Small Letter T With Cedilla 356 U+0164 C5 A4 Ť Latin Capital Letter T With Caron 357 U+0165 C5 A5 ť Latin Small Letter T With Caron 358 U+0166 C5 A6 Ŧ Latin Capital Letter T With Stroke 359 U+0167 C5 A7 ŧ Latin Small Letter T With Stroke 360 U+0168 C5 A8 Ũ Latin Capital Letter U With Tilde 361 U+0169 C5 A9 ũ Latin Small Letter U With Tilde 362 U+016A C5 AA Ū Latin Capital Letter U With Macron 363 U+016B C5 AB ū Latin Small Letter U With Macron 364 U+016C C5 AC Ŭ Latin Capital Letter U With Breve 365 U+016D C5 AD ŭ Latin Small Letter U With Breve 366 U+016E C5 AE Ů Latin Capital Letter U With Ring Above 367 U+016F C5 AF ů Latin Small Letter U With Ring Above 368 U+0170 C5 B0 Ű Latin Capital Letter U With Double Acute 369 U+0171 C5 B1 ű Latin Small Letter U With Double Acute 370 U+0172 C5 B2 Ų Latin Capital Letter U With Ogonek 371 U+0173 C5 B3 ų Latin Small Letter U With Ogonek 372 U+0174 C5 B4 Ŵ Latin Capital Letter W With Circumflex 373 U+0175 C5 B5 ŵ Latin Small Letter W With Circumflex 374 U+0176 C5 B6 Ŷ Latin Capital Letter Y With Circumflex 375 U+0177 C5 B7 ŷ Latin Small Letter Y With Circumflex 376 U+0178 C5 B8 Ÿ Latin Capital Letter Y With Diaeresis 377 U+0179 C5 B9 Ź Latin Capital Letter Z With Acute 378 U+017A C5 BA ź Latin Small Letter Z With Acute 379 U+017B C5 BB Ż Latin Capital Letter Z With Dot Above 380 U+017C C5 BC ż Latin Small Letter Z With Dot Above 381 U+017D C5 BD Ž Latin Capital Letter Z With Caron 382 U+017E C5 BE ž Latin Small Letter Z With Caron 383 U+017F C5 BF ſ Latin Small Letter Long S 384 U+0180 C6 80 ƀ Latin Small Letter B 
With Stroke 385 U+0181 C6 81 Ɓ Latin Capital Letter B With Hook 386 U+0182 C6 82 Ƃ Latin Capital Letter B With Topbar 387 U+0183 C6 83 ƃ Latin Small Letter B With Topbar 388 U+0184 C6 84 Ƅ Latin Capital Letter Tone Six 389 U+0185 C6 85 ƅ Latin Small Letter Tone Six 390 U+0186 C6 86 Ɔ Latin Capital Letter Open O 391 U+0187 C6 87 Ƈ Latin Capital Letter C With Hook 392 U+0188 C6 88 ƈ Latin Small Letter C With Hook 393 U+0189 C6 89 Ɖ Latin Capital Letter African D 394 U+018A C6 8A Ɗ Latin Capital Letter D With Hook 395 U+018B C6 8B Ƌ Latin Capital Letter D With Topbar 396 U+018C C6 8C ƌ Latin Small Letter D With Topbar 397 U+018D C6 8D ƍ Latin Small Letter Turned Delta 398 U+018E C6 8E Ǝ Latin Capital Letter Reversed E 399 U+018F C6 8F Ə Latin Capital Letter Schwa 400 U+0190 C6 90 Ɛ Latin Capital Letter Open E 401 U+0191 C6 91 Ƒ Latin Capital Letter F With Hook 402 U+0192 C6 92 ƒ Latin Small Letter F With Hook 403 U+0193 C6 93 Ɠ Latin Capital Letter G With Hook 404 U+0194 C6 94 Ɣ Latin Capital Letter Gamma 405 U+0195 C6 95 ƕ Latin Small Letter Hv 406 U+0196 C6 96 Ɩ Latin Capital Letter Iota 407 U+0197 C6 97 Ɨ Latin Capital Letter I With Stroke 408 U+0198 C6 98 Ƙ Latin Capital Letter K With Hook 409 U+0199 C6 99 ƙ Latin Small Letter K With Hook 410 U+019A C6 9A ƚ Latin Small Letter L With Bar 411 U+019B C6 9B ƛ Latin Small Letter Lambda With Stroke 412 U+019C C6 9C Ɯ Latin Capital Letter Turned M 413 U+019D C6 9D Ɲ Latin Capital Letter N With Left Hook 414 U+019E C6 9E ƞ Latin Small Letter N With Long Right Leg 415 U+019F C6 9F Ɵ Latin Capital Letter O With Middle Tilde 416 U+01A0 C6 A0 Ơ Latin Capital Letter O With Horn 417 U+01A1 C6 A1 ơ Latin Small Letter O With Horn 418 U+01A2 C6 A2 Ƣ Latin Capital Letter Oi 419 U+01A3 C6 A3 ƣ Latin Small Letter Oi 420 U+01A4 C6 A4 Ƥ Latin Capital Letter P With Hook 421 U+01A5 C6 A5 ƥ Latin Small Letter P With Hook 422 U+01A6 C6 A6 Ʀ Latin Letter Yr 423 U+01A7 C6 A7 Ƨ Latin Capital Letter Tone Two 424 U+01A8 C6 A8 ƨ Latin Small 
Letter Tone Two 425 U+01A9 C6 A9 Ʃ Latin Capital Letter Esh 426 U+01AA C6 AA ƪ Latin Letter Reversed Esh Loop 427 U+01AB C6 AB ƫ Latin Small Letter T With Palatal Hook 428 U+01AC C6 AC Ƭ Latin Capital Letter T With Hook 429 U+01AD C6 AD ƭ Latin Small Letter T With Hook 430 U+01AE C6 AE Ʈ Latin Capital Letter T With Retroflex Hook 431 U+01AF C6 AF Ư Latin Capital Letter U With Horn 432 U+01B0 C6 B0 ư Latin Small Letter U With Horn 433 U+01B1 C6 B1 Ʊ Latin Capital Letter Upsilon 434 U+01B2 C6 B2 Ʋ Latin Capital Letter V With Hook 435 U+01B3 C6 B3 Ƴ Latin Capital Letter Y With Hook 436 U+01B4 C6 B4 ƴ Latin Small Letter Y With Hook 437 U+01B5 C6 B5 Ƶ Latin Capital Letter Z With Stroke 438 U+01B6 C6 B6 ƶ Latin Small Letter Z With Stroke 439 U+01B7 C6 B7 Ʒ Latin Capital Letter Ezh 440 U+01B8 C6 B8 Ƹ Latin Capital Letter Ezh Reversed 441 U+01B9 C6 B9 ƹ Latin Small Letter Ezh Reversed 442 U+01BA C6 BA ƺ Latin Small Letter Ezh With Tail 443 U+01BB C6 BB ƻ Latin Letter Two With Stroke 444 U+01BC C6 BC Ƽ Latin Capital Letter Tone Five 445 U+01BD C6 BD ƽ Latin Small Letter Tone Five 446 U+01BE C6 BE ƾ Latin Letter Inverted Glottal Stop With Stroke 447 U+01BF C6 BF ƿ Latin Letter Wynn 448 U+01C0 C7 80 ǀ Latin Letter Dental Click 449 U+01C1 C7 81 ǁ Latin Letter Lateral Click 450 U+01C2 C7 82 ǂ Latin Letter Alveolar Click 451 U+01C3 C7 83 ǃ Latin Letter Retroflex Click 452 U+01C4 C7 84 Ǆ Latin Capital Letter Dz With Caron 453 U+01C5 C7 85 ǅ Latin Capital Letter D With Small Letter Z With Caron 454 U+01C6 C7 86 ǆ Latin Small Letter Dz With Caron 455 U+01C7 C7 87 Ǉ Latin Capital Letter Lj 456 U+01C8 C7 88 ǈ Latin Capital Letter L With Small Letter J 457 U+01C9 C7 89 ǉ Latin Small Letter Lj 458 U+01CA C7 8A Ǌ Latin Capital Letter Nj 459 U+01CB C7 8B ǋ Latin Capital Letter N With Small Letter J 460 U+01CC C7 8C ǌ Latin Small Letter Nj 461 U+01CD C7 8D Ǎ Latin Capital Letter A With Caron 462 U+01CE C7 8E ǎ Latin Small Letter A With Caron 463 U+01CF C7 8F Ǐ Latin Capital Letter I With 
Caron 464 U+01D0 C7 90 ǐ Latin Small Letter I With Caron 465 U+01D1 C7 91 Ǒ Latin Capital Letter O With Caron 466 U+01D2 C7 92 ǒ Latin Small Letter O With Caron 467 U+01D3 C7 93 Ǔ Latin Capital Letter U With Caron 468 U+01D4 C7 94 ǔ Latin Small Letter U With Caron 469 U+01D5 C7 95 Ǖ Latin Capital Letter U With Diaeresis And Macron 470 U+01D6 C7 96 ǖ Latin Small Letter U With Diaeresis And Macron 471 U+01D7 C7 97 Ǘ Latin Capital Letter U With Diaeresis And Acute 472 U+01D8 C7 98 ǘ Latin Small Letter U With Diaeresis And Acute 473 U+01D9 C7 99 Ǚ Latin Capital Letter U With Diaeresis And Caron 474 U+01DA C7 9A ǚ Latin Small Letter U With Diaeresis And Caron 475 U+01DB C7 9B Ǜ Latin Capital Letter U With Diaeresis And Grave 476 U+01DC C7 9C ǜ Latin Small Letter U With Diaeresis And Grave 477 U+01DD C7 9D ǝ Latin Small Letter Turned E 478 U+01DE C7 9E Ǟ Latin Capital Letter A With Diaeresis And Macron 479 U+01DF C7 9F ǟ Latin Small Letter A With Diaeresis And Macron 480 U+01E0 C7 A0 Ǡ Latin Capital Letter A With Dot Above And Macron 481 U+01E1 C7 A1 ǡ Latin Small Letter A With Dot Above And Macron 482 U+01E2 C7 A2 Ǣ Latin Capital Letter Ae With Macron 483 U+01E3 C7 A3 ǣ Latin Small Letter Ae With Macron 484 U+01E4 C7 A4 Ǥ Latin Capital Letter G With Stroke 485 U+01E5 C7 A5 ǥ Latin Small Letter G With Stroke 486 U+01E6 C7 A6 Ǧ Latin Capital Letter G With Caron 487 U+01E7 C7 A7 ǧ Latin Small Letter G With Caron 488 U+01E8 C7 A8 Ǩ Latin Capital Letter K With Caron 489 U+01E9 C7 A9 ǩ Latin Small Letter K With Caron 490 U+01EA C7 AA Ǫ Latin Capital Letter O With Ogonek 491 U+01EB C7 AB ǫ Latin Small Letter O With Ogonek 492 U+01EC C7 AC Ǭ Latin Capital Letter O With Ogonek And Macron 493 U+01ED C7 AD ǭ Latin Small Letter O With Ogonek And Macron 494 U+01EE C7 AE Ǯ Latin Capital Letter Ezh With Caron 495 U+01EF C7 AF ǯ Latin Small Letter Ezh With Caron 496 U+01F0 C7 B0 ǰ Latin Small Letter J With Caron 497 U+01F1 C7 B1 Ǳ Latin Capital Letter Dz 498 U+01F2 C7 B2 ǲ Latin 
Capital Letter D With Small Letter Z 499 U+01F3 C7 B3 ǳ Latin Small Letter Dz 500 U+01F4 C7 B4 Ǵ Latin Capital Letter G With Acute 501 U+01F5 C7 B5 ǵ Latin Small Letter G With Acute 502 U+01F6 C7 B6 Ƕ Latin Capital Letter Hwair 503 U+01F7 C7 B7 Ƿ Latin Capital Letter Wynn 504 U+01F8 C7 B8 Ǹ Latin Capital Letter N With Grave 505 U+01F9 C7 B9 ǹ Latin Small Letter N With Grave 506 U+01FA C7 BA Ǻ Latin Capital Letter A With Ring Above And Acute 507 U+01FB C7 BB ǻ Latin Small Letter A With Ring Above And Acute 508 U+01FC C7 BC Ǽ Latin Capital Letter Ae With Acute 509 U+01FD C7 BD ǽ Latin Small Letter Ae With Acute 510 U+01FE C7 BE Ǿ Latin Capital Letter O With Stroke And Acute 511 U+01FF C7 BF ǿ Latin Small Letter O With Stroke And Acute 512 U+0200 C8 80 Ȁ Latin Capital Letter A With Double Grave 513 U+0201 C8 81 ȁ Latin Small Letter A With Double Grave 514 U+0202 C8 82 Ȃ Latin Capital Letter A With Inverted Breve 515 U+0203 C8 83 ȃ Latin Small Letter A With Inverted Breve 516 U+0204 C8 84 Ȅ Latin Capital Letter E With Double Grave 517 U+0205 C8 85 ȅ Latin Small Letter E With Double Grave 518 U+0206 C8 86 Ȇ Latin Capital Letter E With Inverted Breve 519 U+0207 C8 87 ȇ Latin Small Letter E With Inverted Breve 520 U+0208 C8 88 Ȉ Latin Capital Letter I With Double Grave 521 U+0209 C8 89 ȉ Latin Small Letter I With Double Grave 522 U+020A C8 8A Ȋ Latin Capital Letter I With Inverted Breve 523 U+020B C8 8B ȋ Latin Small Letter I With Inverted Breve 524 U+020C C8 8C Ȍ Latin Capital Letter O With Double Grave 525 U+020D C8 8D ȍ Latin Small Letter O With Double Grave 526 U+020E C8 8E Ȏ Latin Capital Letter O With Inverted Breve 527 U+020F C8 8F ȏ Latin Small Letter O With Inverted Breve 528 U+0210 C8 90 Ȑ Latin Capital Letter R With Double Grave 529 U+0211 C8 91 ȑ Latin Small Letter R With Double Grave 530 U+0212 C8 92 Ȓ Latin Capital Letter R With Inverted Breve 531 U+0213 C8 93 ȓ Latin Small Letter R With Inverted Breve 532 U+0214 C8 94 Ȕ Latin Capital Letter U With Double 
Grave 533 U+0215 C8 95 ȕ Latin Small Letter U With Double Grave 534 U+0216 C8 96 Ȗ Latin Capital Letter U With Inverted Breve 535 U+0217 C8 97 ȗ Latin Small Letter U With Inverted Breve 536 U+0218 C8 98 Ș Latin Capital Letter S With Comma Below 537 U+0219 C8 99 ș Latin Small Letter S With Comma Below 538 U+021A C8 9A Ț Latin Capital Letter T With Comma Below 539 U+021B C8 9B ț Latin Small Letter T With Comma Below 540 U+021C C8 9C Ȝ Latin Capital Letter Yogh 541 U+021D C8 9D ȝ Latin Small Letter Yogh 542 U+021E C8 9E Ȟ Latin Capital Letter H With Caron 543 U+021F C8 9F ȟ Latin Small Letter H With Caron 544 U+0220 C8 A0 Ƞ Latin Capital Letter N With Long Right Leg 545 U+0221 C8 A1 ȡ Latin Small Letter D With Curl 546 U+0222 C8 A2 Ȣ Latin Capital Letter Ou 547 U+0223 C8 A3 ȣ Latin Small Letter Ou 548 U+0224 C8 A4 Ȥ Latin Capital Letter Z With Hook 549 U+0225 C8 A5 ȥ Latin Small Letter Z With Hook 550 U+0226 C8 A6 Ȧ Latin Capital Letter A With Dot Above 551 U+0227 C8 A7 ȧ Latin Small Letter A With Dot Above 552 U+0228 C8 A8 Ȩ Latin Capital Letter E With Cedilla 553 U+0229 C8 A9 ȩ Latin Small Letter E With Cedilla 554 U+022A C8 AA Ȫ Latin Capital Letter O With Diaeresis And Macron 555 U+022B C8 AB ȫ Latin Small Letter O With Diaeresis And Macron 556 U+022C C8 AC Ȭ Latin Capital Letter O With Tilde And Macron 557 U+022D C8 AD ȭ Latin Small Letter O With Tilde And Macron 558 U+022E C8 AE Ȯ Latin Capital Letter O With Dot Above 559 U+022F C8 AF ȯ Latin Small Letter O With Dot Above 560 U+0230 C8 B0 Ȱ Latin Capital Letter O With Dot Above And Macron 561 U+0231 C8 B1 ȱ Latin Small Letter O With Dot Above And Macron 562 U+0232 C8 B2 Ȳ Latin Capital Letter Y With Macron 563 U+0233 C8 B3 ȳ Latin Small Letter Y With Macron 564 U+0234 C8 B4 ȴ Latin Small Letter L With Curl 565 U+0235 C8 B5 ȵ Latin Small Letter N With Curl 566 U+0236 C8 B6 ȶ Latin Small Letter T With Curl 567 U+0237 C8 B7 ȷ Latin Small Letter Dotless J 568 U+0238 C8 B8 ȸ Latin Small Letter Db Digraph 569 U+0239 
C8 B9 ȹ Latin Small Letter Qp Digraph 570 U+023A C8 BA Ⱥ Latin Capital Letter A With Stroke 571 U+023B C8 BB Ȼ Latin Capital Letter C With Stroke 572 U+023C C8 BC ȼ Latin Small Letter C With Stroke 573 U+023D C8 BD Ƚ Latin Capital Letter L With Bar 574 U+023E C8 BE Ⱦ Latin Capital Letter T With Diagonal Stroke 575 U+023F C8 BF ȿ Latin Small Letter S With Swash Tail 576 U+0240 C9 80 ɀ Latin Small Letter Z With Swash Tail 577 U+0241 C9 81 Ɂ Latin Capital Letter Glottal Stop 578 U+0242 C9 82 ɂ Latin Small Letter Glottal Stop 579 U+0243 C9 83 Ƀ Latin Capital Letter B With Stroke 580 U+0244 C9 84 Ʉ Latin Capital Letter U Bar 581 U+0245 C9 85 Ʌ Latin Capital Letter Turned V 582 U+0246 C9 86 Ɇ Latin Capital Letter E With Stroke 583 U+0247 C9 87 ɇ Latin Small Letter E With Stroke 584 U+0248 C9 88 Ɉ Latin Capital Letter J With Stroke 585 U+0249 C9 89 ɉ Latin Small Letter J With Stroke 586 U+024A C9 8A Ɋ Latin Capital Letter Small Q With Hook Tail 587 U+024B C9 8B ɋ Latin Small Letter Q With Hook Tail 588 U+024C C9 8C Ɍ Latin Capital Letter R With Stroke 589 U+024D C9 8D ɍ Latin Small Letter R With Stroke 590 U+024E C9 8E Ɏ Latin Capital Letter Y With Stroke 591 U+024F C9 8F ɏ Latin Small Letter Y With Stroke 592 U+0250 C9 90 ɐ Latin Small Letter Turned A 593 U+0251 C9 91 ɑ Latin Small Letter Alpha 594 U+0252 C9 92 ɒ Latin Small Letter Turned Alpha 595 U+0253 C9 93 ɓ Latin Small Letter B With Hook 596 U+0254 C9 94 ɔ Latin Small Letter Open O 597 U+0255 C9 95 ɕ Latin Small Letter C With Curl 598 U+0256 C9 96 ɖ Latin Small Letter D With Tail 599 U+0257 C9 97 ɗ Latin Small Letter D With Hook 600 U+0258 C9 98 ɘ Latin Small Letter Reversed E 601 U+0259 C9 99 ə Latin Small Letter Schwa 602 U+025A C9 9A ɚ Latin Small Letter Schwa With Hook 603 U+025B C9 9B ɛ Latin Small Letter Open E 604 U+025C C9 9C ɜ Latin Small Letter Reversed Open E 605 U+025D C9 9D ɝ Latin Small Letter Reversed Open E With Hook 606 U+025E C9 9E ɞ Latin Small Letter Closed Reversed Open E 607 U+025F C9 9F ɟ 
Latin Small Letter Dotless J With Stroke 608 U+0260 C9 A0 ɠ Latin Small Letter G With Hook 609 U+0261 C9 A1 ɡ Latin Small Letter Script G 610 U+0262 C9 A2 ɢ Latin Letter Small Capital G 611 U+0263 C9 A3 ɣ Latin Small Letter Gamma 612 U+0264 C9 A4 ɤ Latin Small Letter Rams Horn 613 U+0265 C9 A5 ɥ Latin Small Letter Turned H 614 U+0266 C9 A6 ɦ Latin Small Letter H With Hook 615 U+0267 C9 A7 ɧ Latin Small Letter Heng With Hook 616 U+0268 C9 A8 ɨ Latin Small Letter I With Stroke 617 U+0269 C9 A9 ɩ Latin Small Letter Iota 618 U+026A C9 AA ɪ Latin Letter Small Capital I 619 U+026B C9 AB ɫ Latin Small Letter L With Middle Tilde 620 U+026C C9 AC ɬ Latin Small Letter L With Belt 621 U+026D C9 AD ɭ Latin Small Letter L With Retroflex Hook 622 U+026E C9 AE ɮ Latin Small Letter Lezh 623 U+026F C9 AF ɯ Latin Small Letter Turned M 624 U+0270 C9 B0 ɰ Latin Small Letter Turned M With Long Leg 625 U+0271 C9 B1 ɱ Latin Small Letter M With Hook 626 U+0272 C9 B2 ɲ Latin Small Letter N With Left Hook 627 U+0273 C9 B3 ɳ Latin Small Letter N With Retroflex Hook 628 U+0274 C9 B4 ɴ Latin Letter Small Capital N 629 U+0275 C9 B5 ɵ Latin Small Letter Barred O 630 U+0276 C9 B6 ɶ Latin Letter Small Capital Oe 631 U+0277 C9 B7 ɷ Latin Small Letter Closed Omega 632 U+0278 C9 B8 ɸ Latin Small Letter Phi 633 U+0279 C9 B9 ɹ Latin Small Letter Turned R 634 U+027A C9 BA ɺ Latin Small Letter Turned R With Long Leg 635 U+027B C9 BB ɻ Latin Small Letter Turned R With Hook 636 U+027C C9 BC ɼ Latin Small Letter R With Long Leg 637 U+027D C9 BD ɽ Latin Small Letter R With Tail 638 U+027E C9 BE ɾ Latin Small Letter R With Fishhook 639 U+027F C9 BF ɿ Latin Small Letter Reversed R With Fishhook 640 U+0280 CA 80 ʀ Latin Letter Small Capital R 641 U+0281 CA 81 ʁ Latin Letter Small Capital Inverted R 642 U+0282 CA 82 ʂ Latin Small Letter S With Hook 643 U+0283 CA 83 ʃ Latin Small Letter Esh 644 U+0284 CA 84 ʄ Latin Small Letter Dotless J With Stroke And Hook 645 U+0285 CA 85 ʅ Latin Small Letter Squat Reversed 
Esh 646 U+0286 CA 86 ʆ Latin Small Letter Esh With Curl 647 U+0287 CA 87 ʇ Latin Small Letter Turned T 648 U+0288 CA 88 ʈ Latin Small Letter T With Retroflex Hook 649 U+0289 CA 89 ʉ Latin Small Letter U Bar 650 U+028A CA 8A ʊ Latin Small Letter Upsilon 651 U+028B CA 8B ʋ Latin Small Letter V With Hook 652 U+028C CA 8C ʌ Latin Small Letter Turned V 653 U+028D CA 8D ʍ Latin Small Letter Turned W 654 U+028E CA 8E ʎ Latin Small Letter Turned Y 655 U+028F CA 8F ʏ Latin Letter Small Capital Y 656 U+0290 CA 90 ʐ Latin Small Letter Z With Retroflex Hook 657 U+0291 CA 91 ʑ Latin Small Letter Z With Curl 658 U+0292 CA 92 ʒ Latin Small Letter Ezh 659 U+0293 CA 93 ʓ Latin Small Letter Ezh With Curl 660 U+0294 CA 94 ʔ Latin Letter Glottal Stop 661 U+0295 CA 95 ʕ Latin Letter Pharyngeal Voiced Fricative 662 U+0296 CA 96 ʖ Latin Letter Inverted Glottal Stop 663 U+0297 CA 97 ʗ Latin Letter Stretched C 664 U+0298 CA 98 ʘ Latin Letter Bilabial Click 665 U+0299 CA 99 ʙ Latin Letter Small Capital B 666 U+029A CA 9A ʚ Latin Small Letter Closed Open E 667 U+029B CA 9B ʛ Latin Letter Small Capital G With Hook 668 U+029C CA 9C ʜ Latin Letter Small Capital H 669 U+029D CA 9D ʝ Latin Small Letter J With Crossed-tail 670 U+029E CA 9E ʞ Latin Small Letter Turned K 671 U+029F CA 9F ʟ Latin Letter Small Capital L 672 U+02A0 CA A0 ʠ Latin Small Letter Q With Hook 673 U+02A1 CA A1 ʡ Latin Letter Glottal Stop With Stroke 674 U+02A2 CA A2 ʢ Latin Letter Reversed Glottal Stop With Stroke 675 U+02A3 CA A3 ʣ Latin Small Letter Dz Digraph 676 U+02A4 CA A4 ʤ Latin Small Letter Dezh Digraph 677 U+02A5 CA A5 ʥ Latin Small Letter Dz Digraph With Curl 678 U+02A6 CA A6 ʦ Latin Small Letter Ts Digraph 679 U+02A7 CA A7 ʧ Latin Small Letter Tesh Digraph 680 U+02A8 CA A8 ʨ Latin Small Letter Tc Digraph With Curl 681 U+02A9 CA A9 ʩ Latin Small Letter Feng Digraph 682 U+02AA CA AA ʪ Latin Small Letter Ls Digraph 683 U+02AB CA AB ʫ Latin Small Letter Lz Digraph 684 U+02AC CA AC ʬ Latin Letter Bilabial Percussive 
685 U+02AD CA AD ʭ Latin Letter Bidental Percussive 686 U+02AE CA AE ʮ Latin Small Letter Turned H With Fishhook 687 U+02AF CA AF ʯ Latin Small Letter Turned H With Fishhook And Tail 688 U+02B0 CA B0 ʰ Modifier Letter Small H 689 U+02B1 CA B1 ʱ Modifier Letter Small H With Hook 690 U+02B2 CA B2 ʲ Modifier Letter Small J 691 U+02B3 CA B3 ʳ Modifier Letter Small R 692 U+02B4 CA B4 ʴ Modifier Letter Small Turned R 693 U+02B5 CA B5 ʵ Modifier Letter Small Turned R With Hook 694 U+02B6 CA B6 ʶ Modifier Letter Small Capital Inverted R 695 U+02B7 CA B7 ʷ Modifier Letter Small W 696 U+02B8 CA B8 ʸ Modifier Letter Small Y 697 U+02B9 CA B9 ʹ Modifier Letter Prime 698 U+02BA CA BA ʺ Modifier Letter Double Prime 699 U+02BB CA BB ʻ Modifier Letter Turned Comma 700 U+02BC CA BC ʼ Modifier Letter Apostrophe 701 U+02BD CA BD ʽ Modifier Letter Reversed Comma 702 U+02BE CA BE ʾ Modifier Letter Right Half Ring 703 U+02BF CA BF ʿ Modifier Letter Left Half Ring 704 U+02C0 CB 80 ˀ Modifier Letter Glottal Stop 705 U+02C1 CB 81 ˁ Modifier Letter Reversed Glottal Stop 706 U+02C2 CB 82 ˂ Modifier Letter Left Arrowhead 707 U+02C3 CB 83 ˃ Modifier Letter Right Arrowhead 708 U+02C4 CB 84 ˄ Modifier Letter Up Arrowhead 709 U+02C5 CB 85 ˅ Modifier Letter Down Arrowhead 710 U+02C6 CB 86 ˆ Modifier Letter Circumflex Accent 711 U+02C7 CB 87 ˇ Caron 712 U+02C8 CB 88 ˈ Modifier Letter Vertical Line 713 U+02C9 CB 89 ˉ Modifier Letter Macron 714 U+02CA CB 8A ˊ Modifier Letter Acute Accent 715 U+02CB CB 8B ˋ Modifier Letter Grave Accent 716 U+02CC CB 8C ˌ Modifier Letter Low Vertical Line 717 U+02CD CB 8D ˍ Modifier Letter Low Macron 718 U+02CE CB 8E ˎ Modifier Letter Low Grave Accent 719 U+02CF CB 8F ˏ Modifier Letter Low Acute Accent 720 U+02D0 CB 90 ː Modifier Letter Triangular Colon 721 U+02D1 CB 91 ˑ Modifier Letter Half Triangular Colon 722 U+02D2 CB 92 ˒ Modifier Letter Centred Right Half Ring 723 U+02D3 CB 93 ˓ Modifier Letter Centred Left Half Ring 724 U+02D4 CB 94 ˔ Modifier Letter Up Tack 725 
U+02D5 CB 95 ˕ Modifier Letter Down Tack 726 U+02D6 CB 96 ˖ Modifier Letter Plus Sign 727 U+02D7 CB 97 ˗ Modifier Letter Minus Sign 728 U+02D8 CB 98 ˘ Breve 729 U+02D9 CB 99 ˙ Dot Above 730 U+02DA CB 9A ˚ Ring Above 731 U+02DB CB 9B ˛ Ogonek 732 U+02DC CB 9C ˜ Small Tilde 733 U+02DD CB 9D ˝ Double Acute Accent 734 U+02DE CB 9E ˞ Modifier Letter Rhotic Hook 735 U+02DF CB 9F ˟ Modifier Letter Cross Accent 736 U+02E0 CB A0 ˠ Modifier Letter Small Gamma 737 U+02E1 CB A1 ˡ Modifier Letter Small L 738 U+02E2 CB A2 ˢ Modifier Letter Small S 739 U+02E3 CB A3 ˣ Modifier Letter Small X 740 U+02E4 CB A4 ˤ Modifier Letter Small Reversed Glottal Stop 741 U+02E5 CB A5 ˥ Modifier Letter Extra-high Tone Bar 742 U+02E6 CB A6 ˦ Modifier Letter High Tone Bar 743 U+02E7 CB A7 ˧ Modifier Letter Mid Tone Bar 744 U+02E8 CB A8 ˨ Modifier Letter Low Tone Bar 745 U+02E9 CB A9 ˩ Modifier Letter Extra-low Tone Bar 746 U+02EA CB AA ˪ Modifier Letter Yin Departing Tone Mark 747 U+02EB CB AB ˫ Modifier Letter Yang Departing Tone Mark 748 U+02EC CB AC ˬ Modifier Letter Voicing 749 U+02ED CB AD ˭ Modifier Letter Unaspirated 750 U+02EE CB AE ˮ Modifier Letter Double Apostrophe 751 U+02EF CB AF ˯ Modifier Letter Low Down Arrowhead 752 U+02F0 CB B0 ˰ Modifier Letter Low Up Arrowhead 753 U+02F1 CB B1 ˱ Modifier Letter Low Left Arrowhead 754 U+02F2 CB B2 ˲ Modifier Letter Low Right Arrowhead 755 U+02F3 CB B3 ˳ Modifier Letter Low Ring 756 U+02F4 CB B4 ˴ Modifier Letter Middle Grave Accent 757 U+02F5 CB B5 ˵ Modifier Letter Middle Double Grave Accent 758 U+02F6 CB B6 ˶ Modifier Letter Middle Double Acute Accent 759 U+02F7 CB B7 ˷ Modifier Letter Low Tilde 760 U+02F8 CB B8 ˸ Modifier Letter Raised Colon 761 U+02F9 CB B9 ˹ Modifier Letter Begin High Tone 762 U+02FA CB BA ˺ Modifier Letter End High Tone 763 U+02FB CB BB ˻ Modifier Letter Begin Low Tone 764 U+02FC CB BC ˼ Modifier Letter End Low Tone 765 U+02FD CB BD ˽ Modifier Letter Shelf 766 U+02FE CB BE ˾ Modifier Letter Open Shelf 767 U+02FF CB BF ˿ 
Modifier Letter Low Left Arrow 768 U+0300 CC 80 ̀ Combining Grave Accent 769 U+0301 CC 81 ́ Combining Acute Accent 770 U+0302 CC 82 ̂ Combining Circumflex Accent 771 U+0303 CC 83 ̃ Combining Tilde 772 U+0304 CC 84 ̄ Combining Macron 773 U+0305 CC 85 ̅ Combining Overline 774 U+0306 CC 86 ̆ Combining Breve 775 U+0307 CC 87 ̇ Combining Dot Above 776 U+0308 CC 88 ̈ Combining Diaeresis 777 U+0309 CC 89 ̉ Combining Hook Above 778 U+030A CC 8A ̊ Combining Ring Above 779 U+030B CC 8B ̋ Combining Double Acute Accent 780 U+030C CC 8C ̌ Combining Caron 781 U+030D CC 8D ̍ Combining Vertical Line Above 782 U+030E CC 8E ̎ Combining Double Vertical Line Above 783 U+030F CC 8F ̏ Combining Double Grave Accent 784 U+0310 CC 90 ̐ Combining Candrabindu 785 U+0311 CC 91 ̑ Combining Inverted Breve 786 U+0312 CC 92 ̒ Combining Turned Comma Above 787 U+0313 CC 93 ̓ Combining Comma Above 788 U+0314 CC 94 ̔ Combining Reversed Comma Above 789 U+0315 CC 95 ̕ Combining Comma Above Right 790 U+0316 CC 96 ̖ Combining Grave Accent Below 791 U+0317 CC 97 ̗ Combining Acute Accent Below 792 U+0318 CC 98 ̘ Combining Left Tack Below 793 U+0319 CC 99 ̙ Combining Right Tack Below 794 U+031A CC 9A ̚ Combining Left Angle Above 795 U+031B CC 9B ̛ Combining Horn 796 U+031C CC 9C ̜ Combining Left Half Ring Below 797 U+031D CC 9D ̝ Combining Up Tack Below 798 U+031E CC 9E ̞ Combining Down Tack Below 799 U+031F CC 9F ̟ Combining Plus Sign Below 800 U+0320 CC A0 ̠ Combining Minus Sign Below 801 U+0321 CC A1 ̡ Combining Palatalized Hook Below 802 U+0322 CC A2 ̢ Combining Retroflex Hook Below 803 U+0323 CC A3 ̣ Combining Dot Below 804 U+0324 CC A4 ̤ Combining Diaeresis Below 805 U+0325 CC A5 ̥ Combining Ring Below 806 U+0326 CC A6 ̦ Combining Comma Below 807 U+0327 CC A7 ̧ Combining Cedilla 808 U+0328 CC A8 ̨ Combining Ogonek 809 U+0329 CC A9 ̩ Combining Vertical Line Below 810 U+032A CC AA ̪ Combining Bridge Below 811 U+032B CC AB ̫ Combining Inverted Double Arch Below 812 U+032C CC AC ̬ Combining Caron Below 
813 U+032D CC AD ̭ Combining Circumflex Accent Below
814 U+032E CC AE ̮ Combining Breve Below
815 U+032F CC AF ̯ Combining Inverted Breve Below
816 U+0330 CC B0 ̰ Combining Tilde Below
817 U+0331 CC B1 ̱ Combining Macron Below
818 U+0332 CC B2 ̲ Combining Low Line
819 U+0333 CC B3 ̳ Combining Double Low Line
820 U+0334 CC B4 ̴ Combining Tilde Overlay
821 U+0335 CC B5 ̵ Combining Short Stroke Overlay
822 U+0336 CC B6 ̶ Combining Long Stroke Overlay
823 U+0337 CC B7 ̷ Combining Short Solidus Overlay
824 U+0338 CC B8 ̸ Combining Long Solidus Overlay
825 U+0339 CC B9 ̹ Combining Right Half Ring Below
826 U+033A CC BA ̺ Combining Inverted Bridge Below
827 U+033B CC BB ̻ Combining Square Below
828 U+033C CC BC ̼ Combining Seagull Below
829 U+033D CC BD ̽ Combining X Above
830 U+033E CC BE ̾ Combining Vertical Tilde
831 U+033F CC BF ̿ Combining Double Overline
832 U+0340 CD 80 ̀ Combining Grave Tone Mark
833 U+0341 CD 81 ́ Combining Acute Tone Mark
834 U+0342 CD 82 ͂ Combining Greek Perispomeni
835 U+0343 CD 83 ̓ Combining Greek Koronis
836 U+0344 CD 84 ̈́ Combining Greek Dialytika Tonos
837 U+0345 CD 85 ͅ Combining Greek Ypogegrammeni
838 U+0346 CD 86 ͆ Combining Bridge Above
839 U+0347 CD 87 ͇ Combining Equals Sign Below
840 U+0348 CD 88 ͈ Combining Double Vertical Line Below
841 U+0349 CD 89 ͉ Combining Left Angle Below
842 U+034A CD 8A ͊ Combining Not Tilde Above
843 U+034B CD 8B ͋ Combining Homothetic Above
844 U+034C CD 8C ͌ Combining Almost Equal To Above
845 U+034D CD 8D ͍ Combining Left Right Arrow Below
846 U+034E CD 8E ͎ Combining Upwards Arrow Below
847 U+034F CD 8F ͏ Combining Grapheme Joiner
848 U+0350 CD 90 ͐ Combining Right Arrowhead Above
849 U+0351 CD 91 ͑ Combining Left Half Ring Above
850 U+0352 CD 92 ͒ Combining Fermata
851 U+0353 CD 93 ͓ Combining X Below
852 U+0354 CD 94 ͔ Combining Left Arrowhead Below
853 U+0355 CD 95 ͕ Combining Right Arrowhead Below
854 U+0356 CD 96 ͖ Combining Right Arrowhead And Up Arrowhead Below
855 U+0357 CD 97 ͗ Combining Right Half Ring Above
856 U+0358 CD 98 ͘ Combining Dot Above Right
857 U+0359 CD 99 ͙ Combining Asterisk Below
858 U+035A CD 9A ͚ Combining Double Ring Below
859 U+035B CD 9B ͛ Combining Zigzag Above
860 U+035C CD 9C ͜ Combining Double Breve Below
861 U+035D CD 9D ͝ Combining Double Breve
862 U+035E CD 9E ͞ Combining Double Macron
863 U+035F CD 9F ͟ Combining Double Macron Below
864 U+0360 CD A0 ͠ Combining Double Tilde
865 U+0361 CD A1 ͡ Combining Double Inverted Breve
866 U+0362 CD A2 ͢ Combining Double Rightwards Arrow Below
867 U+0363 CD A3 ͣ Combining Latin Small Letter A
868 U+0364 CD A4 ͤ Combining Latin Small Letter E
869 U+0365 CD A5 ͥ Combining Latin Small Letter I
870 U+0366 CD A6 ͦ Combining Latin Small Letter O
871 U+0367 CD A7 ͧ Combining Latin Small Letter U
872 U+0368 CD A8 ͨ Combining Latin Small Letter C
873 U+0369 CD A9 ͩ Combining Latin Small Letter D
874 U+036A CD AA ͪ Combining Latin Small Letter H
875 U+036B CD AB ͫ Combining Latin Small Letter M
876 U+036C CD AC ͬ Combining Latin Small Letter R
877 U+036D CD AD ͭ Combining Latin Small Letter T
878 U+036E CD AE ͮ Combining Latin Small Letter V
879 U+036F CD AF ͯ Combining Latin Small Letter X
880 U+0370 CD B0 Ͱ Greek Capital Letter Heta
881 U+0371 CD B1 ͱ Greek Small Letter Heta
882 U+0372 CD B2 Ͳ Greek Capital Letter Archaic Sampi
883 U+0373 CD B3 ͳ Greek Small Letter Archaic Sampi
884 U+0374 CD B4 ʹ Greek Numeral Sign
885 U+0375 CD B5 ͵ Greek Lower Numeral Sign
886 U+0376 CD B6 Ͷ Greek Capital Letter Pamphylian Digamma
887 U+0377 CD B7 ͷ Greek Small Letter Pamphylian Digamma
888 U+0378 CD B8 ͸ (unassigned)
889 U+0379 CD B9 ͹ (unassigned)
890 U+037A CD BA ͺ Greek Ypogegrammeni
891 U+037B CD BB ͻ Greek Small Reversed Lunate Sigma Symbol
892 U+037C CD BC ͼ Greek Small Dotted Lunate Sigma Symbol
893 U+037D CD BD ͽ Greek Small Reversed Dotted Lunate Sigma Symbol
894 U+037E CD BE ; Greek Question Mark
895 U+037F CD BF Ϳ Greek Capital Letter Yot
896 U+0380 CE 80 ΀ (unassigned)
897 U+0381 CE 81 ΁ (unassigned)
898 U+0382 CE 82 ΂ (unassigned)
899 U+0383 CE 83 ΃ (unassigned)
900 U+0384 CE 84 ΄ Greek Tonos
901 U+0385 CE 85 ΅ Greek Dialytika Tonos
902 U+0386 CE 86 Ά Greek Capital Letter Alpha With Tonos
903 U+0387 CE 87 · Greek Ano Teleia
904 U+0388 CE 88 Έ Greek Capital Letter Epsilon With Tonos
905 U+0389 CE 89 Ή Greek Capital Letter Eta With Tonos
906 U+038A CE 8A Ί Greek Capital Letter Iota With Tonos
907 U+038B CE 8B ΋ (unassigned)
908 U+038C CE 8C Ό Greek Capital Letter Omicron With Tonos
909 U+038D CE 8D ΍ (unassigned)
910 U+038E CE 8E Ύ Greek Capital Letter Upsilon With Tonos
911 U+038F CE 8F Ώ Greek Capital Letter Omega With Tonos
912 U+0390 CE 90 ΐ Greek Small Letter Iota With Dialytika And Tonos
913 U+0391 CE 91 Α Greek Capital Letter Alpha
914 U+0392 CE 92 Β Greek Capital Letter Beta
915 U+0393 CE 93 Γ Greek Capital Letter Gamma
916 U+0394 CE 94 Δ Greek Capital Letter Delta
917 U+0395 CE 95 Ε Greek Capital Letter Epsilon
918 U+0396 CE 96 Ζ Greek Capital Letter Zeta
919 U+0397 CE 97 Η Greek Capital Letter Eta
920 U+0398 CE 98 Θ Greek Capital Letter Theta
921 U+0399 CE 99 Ι Greek Capital Letter Iota
922 U+039A CE 9A Κ Greek Capital Letter Kappa
923 U+039B CE 9B Λ Greek Capital Letter Lamda
924 U+039C CE 9C Μ Greek Capital Letter Mu
925 U+039D CE 9D Ν Greek Capital Letter Nu
926 U+039E CE 9E Ξ Greek Capital Letter Xi
927 U+039F CE 9F Ο Greek Capital Letter Omicron
928 U+03A0 CE A0 Π Greek Capital Letter Pi
929 U+03A1 CE A1 Ρ Greek Capital Letter Rho
930 U+03A2 CE A2 ΢ (unassigned)
931 U+03A3 CE A3 Σ Greek Capital Letter Sigma
932 U+03A4 CE A4 Τ Greek Capital Letter Tau
933 U+03A5 CE A5 Υ Greek Capital Letter Upsilon
934 U+03A6 CE A6 Φ Greek Capital Letter Phi
935 U+03A7 CE A7 Χ Greek Capital Letter Chi
936 U+03A8 CE A8 Ψ Greek Capital Letter Psi
937 U+03A9 CE A9 Ω Greek Capital Letter Omega
938 U+03AA CE AA Ϊ Greek Capital Letter Iota With Dialytika
939 U+03AB CE AB Ϋ Greek Capital Letter Upsilon With Dialytika
940 U+03AC CE AC ά Greek Small Letter Alpha With Tonos
941 U+03AD CE AD έ Greek Small Letter Epsilon With Tonos
942 U+03AE CE AE ή Greek Small Letter Eta With Tonos
943 U+03AF CE AF ί Greek Small Letter Iota With Tonos
944 U+03B0 CE B0 ΰ Greek Small Letter Upsilon With Dialytika And Tonos
945 U+03B1 CE B1 α Greek Small Letter Alpha
946 U+03B2 CE B2 β Greek Small Letter Beta
947 U+03B3 CE B3 γ Greek Small Letter Gamma
948 U+03B4 CE B4 δ Greek Small Letter Delta
949 U+03B5 CE B5 ε Greek Small Letter Epsilon
950 U+03B6 CE B6 ζ Greek Small Letter Zeta
951 U+03B7 CE B7 η Greek Small Letter Eta
952 U+03B8 CE B8 θ Greek Small Letter Theta
953 U+03B9 CE B9 ι Greek Small Letter Iota
954 U+03BA CE BA κ Greek Small Letter Kappa
955 U+03BB CE BB λ Greek Small Letter Lamda
956 U+03BC CE BC μ Greek Small Letter Mu
957 U+03BD CE BD ν Greek Small Letter Nu
958 U+03BE CE BE ξ Greek Small Letter Xi
959 U+03BF CE BF ο Greek Small Letter Omicron
960 U+03C0 CF 80 π Greek Small Letter Pi
961 U+03C1 CF 81 ρ Greek Small Letter Rho
962 U+03C2 CF 82 ς Greek Small Letter Final Sigma
963 U+03C3 CF 83 σ Greek Small Letter Sigma
964 U+03C4 CF 84 τ Greek Small Letter Tau
965 U+03C5 CF 85 υ Greek Small Letter Upsilon
966 U+03C6 CF 86 φ Greek Small Letter Phi
967 U+03C7 CF 87 χ Greek Small Letter Chi
968 U+03C8 CF 88 ψ Greek Small Letter Psi
969 U+03C9 CF 89 ω Greek Small Letter Omega
970 U+03CA CF 8A ϊ Greek Small Letter Iota With Dialytika
971 U+03CB CF 8B ϋ Greek Small Letter Upsilon With Dialytika
972 U+03CC CF 8C ό Greek Small Letter Omicron With Tonos
973 U+03CD CF 8D ύ Greek Small Letter Upsilon With Tonos
974 U+03CE CF 8E ώ Greek Small Letter Omega With Tonos
975 U+03CF CF 8F Ϗ Greek Capital Kai Symbol
976 U+03D0 CF 90 ϐ Greek Beta Symbol
977 U+03D1 CF 91 ϑ Greek Theta Symbol
978 U+03D2 CF 92 ϒ Greek Upsilon With Hook Symbol
979 U+03D3 CF 93 ϓ Greek Upsilon With Acute And Hook Symbol
980 U+03D4 CF 94 ϔ Greek Upsilon With Diaeresis And Hook Symbol
981 U+03D5 CF 95 ϕ Greek Phi Symbol
982 U+03D6 CF 96 ϖ Greek Pi Symbol
983 U+03D7 CF 97 ϗ Greek Kai Symbol
984 U+03D8 CF 98 Ϙ Greek Letter Archaic Koppa
985 U+03D9 CF 99 ϙ Greek Small Letter Archaic Koppa
986 U+03DA CF 9A Ϛ Greek Letter Stigma
987 U+03DB CF 9B ϛ Greek Small Letter Stigma
988 U+03DC CF 9C Ϝ Greek Letter Digamma
989 U+03DD CF 9D ϝ Greek Small Letter Digamma
990 U+03DE CF 9E Ϟ Greek Letter Koppa
991 U+03DF CF 9F ϟ Greek Small Letter Koppa
992 U+03E0 CF A0 Ϡ Greek Letter Sampi
993 U+03E1 CF A1 ϡ Greek Small Letter Sampi
994 U+03E2 CF A2 Ϣ Coptic Capital Letter Shei
995 U+03E3 CF A3 ϣ Coptic Small Letter Shei
996 U+03E4 CF A4 Ϥ Coptic Capital Letter Fei
997 U+03E5 CF A5 ϥ Coptic Small Letter Fei
998 U+03E6 CF A6 Ϧ Coptic Capital Letter Khei
999 U+03E7 CF A7 ϧ Coptic Small Letter Khei
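
Every row above pairs a code point with two UTF-8 bytes, because the whole range U+0080..U+07FF uses the two-byte form 110xxxxx 10xxxxxx. The following is a minimal Python sketch of that mapping, not part of the original table; the function name utf8_two_byte is hypothetical and chosen only for illustration, and the result is cross-checked against Python's built-in encoder.

```python
def utf8_two_byte(code_point: int) -> bytes:
    """Encode a code point in U+0080..U+07FF as two UTF-8 bytes.

    Hypothetical helper for illustration; real code would simply call
    chr(code_point).encode("utf-8").
    """
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("the two-byte form covers only U+0080..U+07FF")
    lead = 0xC0 | (code_point >> 6)      # 110xxxxx carries the top 5 bits
    trail = 0x80 | (code_point & 0x3F)   # 10xxxxxx carries the low 6 bits
    return bytes([lead, trail])


if __name__ == "__main__":
    # A few code points taken from the rows above.
    for cp in (0x032D, 0x0391, 0x03A9, 0x03E7):
        encoded = utf8_two_byte(cp)
        # Cross-check against the built-in UTF-8 encoder.
        assert encoded == chr(cp).encode("utf-8")
        print(cp, f"U+{cp:04X}", encoded.hex(" ").upper())
```

The first three printed fields correspond to the decimal number, code point, and byte columns of the table.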

CP1251 (windows-1251) table

ISO-8859-5 table

UTF-8 encoding (Unicode Transformation Format)

About the UTF-8 and windows-1251 encodings

How UTF-8 and windows-1251 differ

What the windows-1251 encoding is

What the UTF-8 encoding is

Example of Latin text output in UTF-8

How text differs between UTF-8 and windows-1251

Example of Cyrillic text output in UTF-8

Example of the difference between UTF-8 and windows-1251

What to do if Cyrillic functions do not work with UTF-8?

How to re-encode a string from UTF-8 to windows-1251

An example of re-encoding text from UTF-8 to windows-1251 and back

Which is better for Cyrillic: UTF-8 or…

What are the drawbacks of UTF-8?

Have I considered switching from UTF-8 to another encoding?
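
The re-encoding between UTF-8 and windows-1251 named in the headings above can be sketched in Python as follows. This is an illustration under assumptions, not the source article's own example: the helper names utf8_to_cp1251 and cp1251_to_utf8 are hypothetical, while "utf-8" and "cp1251" are the standard Python codec names.

```python
def utf8_to_cp1251(data: bytes) -> bytes:
    """Re-encode UTF-8 bytes as windows-1251 (hypothetical helper).

    Raises UnicodeEncodeError if the text contains a character that
    windows-1251 cannot represent.
    """
    return data.decode("utf-8").encode("cp1251")


def cp1251_to_utf8(data: bytes) -> bytes:
    """Re-encode windows-1251 bytes as UTF-8 (hypothetical helper)."""
    return data.decode("cp1251").encode("utf-8")


if __name__ == "__main__":
    text = "Привет"
    utf8_bytes = text.encode("utf-8")
    cp1251_bytes = utf8_to_cp1251(utf8_bytes)

    # Each Cyrillic letter takes two bytes in UTF-8 and one in windows-1251.
    print(len(utf8_bytes), len(cp1251_bytes))

    # Converting back restores the original UTF-8 bytes.
    assert cp1251_to_utf8(cp1251_bytes) == utf8_bytes
```

Because windows-1251 is a single-byte code page, the conversion from UTF-8 is lossy for any character outside its repertoire; passing errors="replace" to encode is one way to substitute such characters instead of raising.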

UTF-8 encoding
(Unicode Transformation Format)