|
Original |
Translation |
|
19
|
|
|
|
20
|
The largest new feature in Python 2.0 is a new fundamental data type: Unicode strings. Unicode uses 16-bit numbers to represent characters instead of the 8-bit number used by ASCII, meaning that 65,536 distinct characters can be supported.
|
|
|
21
|
The final interface for Unicode support was arrived at through countless often- stormy discussions on the python-dev mailing list, and mostly implemented by Marc-André Lemburg, based on a Unicode string type implementation by Fredrik Lundh. A detailed explanation of the interface was written up as :pep:`100`, "Python Unicode Integration". This article will simply cover the most significant points about the Unicode interfaces.
|
|
|
22
|
In Python source code, Unicode strings are written as ``u"string"``. Arbitrary Unicode characters can be written using a new escape sequence, ``\uHHHH``, where *HHHH* is a 4-digit hexadecimal number from 0000 to FFFF. The existing ``\xHHHH`` escape sequence can also be used, and octal escapes can be used for characters up to U+01FF, which is represented by ``\777``.
|
|
|
23
|
|
24
|
Combining 8-bit and Unicode strings always coerces to Unicode, using the default ASCII encoding; the result of ``'a' + u'bc'`` is ``u'abc'``.
|
|
|
25
|
New built-in functions have been added, and existing built-ins modified to support Unicode:
|
|
|
26
|
``unichr(ch)`` returns a Unicode string 1 character long, containing the character *ch*.
|
|
|
27
|
``ord(u)``, where *u* is a 1-character regular or Unicode string, returns the number of the character as an integer.
|
|
|
28
|
``unicode(string [, encoding] [, errors] )`` creates a Unicode string from an 8-bit string. ``encoding`` is a string naming the encoding to use. The ``errors`` parameter specifies the treatment of characters that are invalid for the current encoding; passing ``'strict'`` as the value causes an exception to be raised on any encoding error, while ``'ignore'`` causes errors to be silently ignored and ``'replace'`` uses U+FFFD, the official replacement character, in case of any problems.
|
|