Text reported | Email | Print
Because Unicode is designed to handle multibyte characters, you can also use the
special u and U escapes to encode binary character values that are larger than 8 bits:
>>> u'abx20cd' # 8-bit/1-byte characters
u'ab cd'
>>> u'abu0020cd' # 2-byte characters
u'ab cd'
>>> u'abU00000020cd' # 4-byte characters
u'ab cd'
The first of these statements embeds the binary code for a space character; its binary
value in hexadecimal notation is x20. The second and third statements do the same,
but give the value in 2-byte and 4-byte Unicode escape notation.
Even if you don’t think you will need Unicode strings, you might use them without
knowing it. Because some programming interfaces (e.g., the COM API on Windows,
and some XML parsers) represent text as Unicode, it may find its way into your
scripts as API inputs or results, and you may sometimes need to convert back and
forth between normal and Unicode types.
Because Python treats the two string types interchangeably in most contexts, though,
the presence of Unicode strings is often transparent to your code—you can largely
ignore the fact that text is being passed around as Unicode objects and use normal
string operations.
Unicode is a useful addition to Python, and because support is built-in, it’s easy to
handle such data in your scripts when needed. Of course, there’s a lot more to say
about the Unicode story. For example:
• Unicode objects provide an encode method that converts a Unicode string into a
normal 8-bit string using a specific encoding.
• The built-in function unicode and module codecs support registered Unicode
“codecs” (for “COders and DECoders”).
• The open function within the codecs module allows for processing Unicode text
files, where each character is stored as more than one byte.
• The unicodedata module provides access to the Unicode character database.
• The sys module includes calls for fetching and setting the default Unicode
encoding scheme (the default is usually ASCII).
• You may combine the raw and Unicode string formats (e.g., ur'ac').
Because Unicode is a relatively advanced and not so commonly used tool, we won’t
discuss it further in this introductory text. See the Python standard manual for the
rest of the Unicode story.
Views - Today : 328 Total : 328