|
Original |
Translation |
|
22
|
In Python source code, Unicode strings are written as ``u"string"``. Arbitrary Unicode characters can be written using a new escape sequence, ``\uHHHH``, where *HHHH* is a 4-digit hexadecimal number from 0000 to FFFF. The existing ``\xHHHH`` escape sequence can also be used, and octal escapes can be used for characters up to U+01FF, which is represented by ``\777``.
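A brief sketch of these escapes follows. The variable names are purely illustrative, and a small octal value is used so the literal stays well inside the documented ``\777`` limit.

```python
# \uHHHH takes exactly four hexadecimal digits.
arabic_zero = u"\u0660"       # ARABIC-INDIC DIGIT ZERO

# Octal escapes work as well; \101 is octal for 65, the letter A.
# (The largest octal escape, \777, would give U+01FF.)
octal_a = u"\101"

print(len(arabic_zero))        # 1
print(hex(ord(arabic_zero)))   # 0x660
print(octal_a == u"A")         # True
```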
|
|
|
23
|
Unicode strings, just like regular strings, are an immutable sequence type. They can be indexed and sliced, but not modified in place. Unicode strings have an ``encode( [encoding] )`` method that returns an 8-bit string in the desired encoding. Encodings are named by strings, such as ``'ascii'``, ``'utf-8'``, or ``'iso-8859-1'``. A codec API is defined for implementing and registering new encodings that are then available throughout a Python program. If an encoding isn't specified, the default encoding is usually 7-bit ASCII, though it can be changed for your Python installation by calling the :func:`sys.setdefaultencoding(encoding)` function in a customised version of :file:`site.py`.
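The following sketch illustrates indexing, immutability, and ``encode()`` with an explicit encoding name. It assumes the standard ``utf-8``, ``iso-8859-1``, and ``ascii`` codecs are available, and it uses ``b"..."`` literals for the 8-bit results so that it also runs on later interpreters; the variable names are only for illustration.

```python
text = u"caf\u00e9"

# Indexing and slicing work as they do for regular strings.
first = text[0]
tail = text[1:]

# In-place modification is rejected because the type is immutable.
try:
    text[0] = u"C"
    modified = True
except TypeError:
    modified = False

# Encoding produces an 8-bit string in the named encoding.
utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("iso-8859-1")

# U+00E9 has no ASCII representation, so an ASCII encode fails.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeError:
    ascii_ok = False
```

Passing the encoding explicitly, as done here, avoids depending on whatever default encoding the installation happens to use.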
|
|
|
24
|
Combining 8-bit and Unicode strings always coerces the 8-bit string to Unicode, using the default ASCII encoding; the result of ``'a' + u'bc'`` is ``u'abc'``.
|
|
|
25
|
New built-in functions have been added, and existing built-ins modified to support Unicode:
|
|
|
26
|
``unichr(ch)`` returns a Unicode string 1 character long, containing the character *ch*.
|
|
|
27
|
|
28
|
``unicode(string [, encoding] [, errors] )`` creates a Unicode string from an 8-bit string. ``encoding`` is a string naming the encoding to use. The ``errors`` parameter specifies the treatment of characters that are invalid for the current encoding; passing ``'strict'`` as the value causes an exception to be raised on any encoding error, while ``'ignore'`` causes errors to be silently ignored and ``'replace'`` substitutes U+FFFD, the official replacement character, for any input that cannot be decoded.
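A minimal sketch of ``unichr()`` and of the three ``errors`` settings of ``unicode()`` is shown below. The fallback at the top is an assumption added here so the sketch also runs on interpreters where these built-ins are spelled ``chr`` and ``str``; the sample byte string and variable names are illustrative.

```python
# Assumption: where unichr/unicode no longer exist (Python 3),
# chr and str are used as stand-ins with the same call shape.
try:
    unichr, unicode
except NameError:
    unichr, unicode = chr, str

# unichr() builds a 1-character Unicode string from a code point.
digit = unichr(0x0660)

# An 8-bit string whose last byte (0xFF) is not valid ASCII.
raw = b"abc\xff"

ignored = unicode(raw, "ascii", "ignore")     # invalid byte dropped
replaced = unicode(raw, "ascii", "replace")   # invalid byte -> U+FFFD

# 'strict' raises an exception instead of producing a result.
try:
    unicode(raw, "ascii", "strict")
    strict_failed = False
except UnicodeError:
    strict_failed = True
```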
|
|
|
29
|
The :keyword:`exec` statement and various built-ins such as ``eval()``, ``getattr()``, and ``setattr()`` accept Unicode strings as well as regular strings. (It's possible that the process of fixing this missed some built-ins; if you find a built-in function that accepts strings but doesn't accept Unicode strings at all, please report it as a bug.)
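As a small illustration, the built-ins named above can be called with Unicode string arguments. The ``Point`` class and the attribute name are hypothetical, chosen only for this sketch.

```python
class Point(object):
    """Hypothetical empty class used only to hold an attribute."""
    pass

p = Point()

# Attribute names may be given as Unicode strings.
setattr(p, u"x", 3)
value = getattr(p, u"x")

# eval() accepts a Unicode string containing an expression.
total = eval(u"1 + 2")
```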
|
|
|
30
|
A new module, :mod:`unicodedata`, provides an interface to Unicode character properties. For example, ``unicodedata.category(u'A')`` returns the 2-character string ``'Lu'``, the 'L' denoting that it's a letter, and 'u' meaning that it's uppercase. ``unicodedata.bidirectional(u'\u0660')`` returns ``'AN'``, meaning that U+0660 is an Arabic number.
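The two lookups mentioned above can be tried directly. The lowercase comparison is an extra illustration that is not in the text, and the variable names are arbitrary.

```python
import unicodedata

# General category: 'L' for letter, 'u' for uppercase.
upper_category = unicodedata.category(u"A")

# For comparison, a lowercase letter has category 'Ll'.
lower_category = unicodedata.category(u"a")

# Bidirectional class of ARABIC-INDIC DIGIT ZERO.
bidi_class = unicodedata.bidirectional(u"\u0660")

print(upper_category, lower_category, bidi_class)
```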
|
|
|
31
|
The :mod:`codecs` module contains functions to look up existing encodings and register new ones. Unless you want to implement a new encoding, you'll most often use the :func:`codecs.lookup(encoding)` function, which returns a 4-element tuple: ``(encode_func, decode_func, stream_reader, stream_writer)``.
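A short sketch of unpacking the result of ``codecs.lookup()`` and using the first two elements follows. It relies on the codec API convention that the encode and decode functions each return a pair of the converted data and the amount of input consumed. The ``utf-8`` codec and the sample text are just examples.

```python
import codecs

# Unpack the four elements described above.
encode_func, decode_func, stream_reader, stream_writer = codecs.lookup("utf-8")

# Each function returns (converted data, amount of input consumed).
encoded, chars_consumed = encode_func(u"caf\u00e9")
decoded, bytes_consumed = decode_func(encoded)

print(chars_consumed, bytes_consumed)
```

The stream reader and writer elements are classes for wrapping file-like objects; they are left unused in this sketch.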
|
|