Pragmatic Unicode
~ or ~
How Do I Stop the Pain?

Ned Batchelder   @nedbat

bit.ly/unipain

Hi, I'm Ned Batchelder. I've been writing in Python for over ten years, which means at least a half-dozen times, I've made the same Unicode mistakes that everyone else has.

The past

Wrote a nice program

It worked!

Accented chars

UnicodeError!

😁
👽
💥
😞

If you're like most Python programmers, you've done it too: you've built a nice application, and everything seemed to be going fine. Then one day an accented character appeared out of nowhere, and your program started belching UnicodeErrors.

You kind of knew what to do with those, so you added an encode or a decode where the error was raised, but the UnicodeError happened somewhere else. You went to the new place, and added a decode, maybe an encode. After playing whack-a-mole like this for a while, the problem seemed to be fixed.

Then a few days later, another accent appeared in another place, and you had to play a little bit more whack-a-mole until the problem finally stopped.

You

Annoyed

Angry

Uninterested

😠
😕

So now you have a program that works, but you're annoyed and uncomfortable, it took too long, you know it isn't "right," and you hate yourself. And the main thing you know about Unicode is that you don't like Unicode.

You don't want to know about weirdo character sets, you just want to be able to write a program that doesn't make you feel bad.

This talk

5 Facts of Life

3 Pro Tips

You don't have to play whack-a-mole. Unicode isn't simple, but it isn't difficult either. With knowledge and discipline, you can deal with Unicode easily and with grace.

I'll teach you five Facts of Life, and give you three pro tips that will solve your Unicode problems. We're going to cover the basics of Unicode, and how both Python 2 and Python 3 work. They are different, but the strategies you'll use are basically the same.

The World & Unicode

🌎   🌏

We'll start with the basics of Unicode.

Bytes

Fact of Life #1

Computers are built on bytes

Files + Networks

Everything

The first Fact of Life: everything in a computer is bytes. Files on disk are a series of bytes, and network connections transmit only bytes. Almost without exception, all the data going into or out of any program you write is bytes.

Bytes by themselves are meaningless; we need conventions to give them meaning.

ASCII

To represent text, we've been using the ASCII code for nearly 50 years. Every byte is assigned one of 95 symbols. When I send you a byte 65, you know that I mean an upper-case A.

ISO 8859-1

ISO Latin 1, or 8859-1, extended ASCII with 96 more symbols. This is pretty much the best you can do to represent text as single bytes, because there's not much room left to add more symbols.

Windows-1252

Windows added 27 more symbols to produce CP1252.
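
To make the point about conventions concrete, here is a small sketch in Python 3 (the variable names are mine, just for illustration). It decodes the same byte values with the three single-byte codes mentioned so far, and shows where they agree and where they part ways.

```python
# One byte, three conventions. Python 3 syntax; names are illustrative.
letter = bytes([65])      # the single byte with value 65
mystery = bytes([0x80])   # a byte outside the ASCII range

# Byte 65 means "A" in ASCII, and in the codes that extend ASCII.
as_ascii = letter.decode("ascii")
as_latin1 = letter.decode("iso-8859-1")
as_cp1252 = letter.decode("cp1252")

# Byte 0x80 is where the conventions disagree.
latin1_char = mystery.decode("iso-8859-1")   # a control character, U+0080
cp1252_char = mystery.decode("cp1252")       # the euro sign, U+20AC

print(as_ascii, as_latin1, as_cp1252)
print(hex(ord(latin1_char)), hex(ord(cp1252_char)))
```

The byte is the same in both cases; only the convention used to read it changes.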

Tower of Babel

Fact of Life #2

The world needs more than 256 symbols

Hello, world!   •   Здравствуй, мир!

Բարեւ, աշխարհի!   •   !مرحبا ، العالم

!שלום, עולם   •   여보세요 세계!

नमस्ते, दुनिया!   •   你好,世界!

But Fact of Life #2 is that there are way more symbols in the world's text than 256. A single byte simply can't represent text world-wide. During your darkest whack-a-mole moments, you may have wished that everyone spoke English, but it simply isn't so. People need lots of symbols to communicate.

Fact of Life #1 and Fact of Life #2 together create a fundamental conflict between the structure of our computing devices, and the needs of the world's people.

Character codes

Map single bytes to characters

Pretend FoL#2 doesn't exist

8859-1 through -16

cp850, cp1252, etc

EBCDIC, APL, BBQ, OMG, WTF

Chaos!

There have been a number of attempts to resolve this conflict. Single-byte character codes like ASCII map bytes to symbols, or characters. Each one pretends that Fact of Life #2 doesn't exist.

There are many single-byte codes, and they don't solve the problem. Each is only good for representing one small slice of human language. They can't solve the global text problem.

Character codes

Map two bytes to characters

Shift-JIS, GB2312, Big5, etc.

Still no agreement

People tried creating double-byte character sets, but they were still fragmented, serving different subsets of people. There were multiple standards in place, and ironically, they weren't large enough to deal with all the symbols needed.

Unicode

Assigns characters to code points (integers)

1.1M code points

110K assigned

A-Z …

☃ …

Жณ賃 …

💩 …

Unicode was designed to deal decisively with the issues with older character codes. Unicode assigns integers, known as code points, to characters. It has room for 1.1 million code points, of which 110,000 are already assigned, so there's plenty of room for future growth.

Unicode's goal is to have everything. It starts with ASCII, and includes thousands of symbols, including the famous Snowman, covers all the writing systems of the world, and is constantly being expanded. For example, the latest update gave us the symbol PILE OF POO.

Sample Unicode

ℙƴ☂ℌøἤ

U+2119:   DOUBLE-STRUCK CAPITAL P

U+01B4:   LATIN SMALL LETTER Y WITH HOOK

U+2602:   UMBRELLA

U+210C:   BLACK-LETTER CAPITAL H

U+00F8:   LATIN SMALL LETTER O WITH STROKE

U+1F24:   GREEK SMALL LETTER ETA WITH PSILI AND OXIA

Here is a string of six exotic Unicode characters. Unicode code points are written as four, five, or six hex digits with a U+ prefix. Every character has an unambiguous full name which is always in uppercase ASCII.
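
If you want to look up code points and names yourself, the standard library's unicodedata module can do it. This is a minimal Python 3 sketch; the describe helper is a name I made up for the example.

```python
import unicodedata

# The six-character sample string, written with escapes.
sample = "\u2119\u01b4\u2602\u210c\xf8\u1f24"

def describe(text):
    """Return a list of 'U+XXXX: NAME' strings, one per code point."""
    return ["U+%04X: %s" % (ord(ch), unicodedata.name(ch)) for ch in text]

for line in describe(sample):
    print(line)
```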

Encodings

Have to map Unicode code points to bytes somehow

UTF-16, UTF-32, UCS-2, UCS-4, UTF-8

So Unicode makes room for all of the characters we could ever need, but we still have Fact of Life #1 to deal with: computers need bytes. We need a way to represent Unicode code points as bytes in order to store or transmit them.

The Unicode standard defines a number of ways to represent code points as bytes. These are called encodings.
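
As a rough illustration, here is the same text encoded three different ways in Python 3. The choice of the three codec names is mine; the sketch only compares how many bytes each encoding needs for the same nine code points.

```python
text = "Hi \u2119\u01b4\u2602\u210c\xf8\u1f24"   # 9 code points

# The "-le" variants fix the byte order, so no byte-order mark is added.
sizes = {name: len(text.encode(name))
         for name in ("utf-8", "utf-16-le", "utf-32-le")}

print(sizes)
```

Same code points, three different byte sequences: the encoding is part of what you need to know to read the bytes back.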

UTF-8

The king of encodings

Variable length

ASCII characters are still one byte

4869e28499c6b4e29882e2848cc3b8e1bca4
Hiℙƴ☂ℌøἤ

UTF-8 is easily the most popular encoding for storage and transmission of Unicode. It uses a variable number of bytes for each code point. ASCII characters in particular are one byte each, using the same values as ASCII, so ASCII is a subset of UTF-8.

Here we show our exotic string as UTF-8. The ASCII characters H and i are single bytes; other characters use two or three bytes depending on their code point value. Some code points require four bytes, though we aren't using any of those here.
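
You can check the byte counts yourself. This Python 3 sketch (variable names are just for illustration) encodes each character on its own and reports how many UTF-8 bytes it takes, then shows one code point that needs four.

```python
text = "Hi\u2119\u01b4\u2602\u210c\xf8\u1f24"

# Number of UTF-8 bytes each character needs.
widths = [(ch, len(ch.encode("utf-8"))) for ch in text]

for ch, n in widths:
    print("U+%04X -> %d byte(s)" % (ord(ch), n))

# The whole string as hex.
as_hex = text.encode("utf-8").hex()
print(as_hex)

# A code point above U+FFFF needs four bytes (U+1F4A9, PILE OF POO).
four = len("\U0001F4A9".encode("utf-8"))
print(four)
```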

Python 2

🐍   🐍

OK, enough theory, let's talk about Python 2.

Str vs Unicode

str: a sequence of bytes

unicode: a sequence of code points (unicode)

2
    >>> my_string = "Hello World"
    >>> type(my_string)
    <type 'str'>
     
    >>> my_unicode = u"Hi \u2119\u01b4\u2602\u210c\xf8\u1f24"
    >>> type(my_unicode)
    <type 'unicode'>
    

In Python 2, there are two different string data types. A plain-old string literal gives you a "str" object, which stores bytes. If you use a "u" prefix, you get a "unicode" object, which stores code points. In a unicode string literal, you can use backslash-u to insert any Unicode code point.

Notice that the word "string" is problematic. Both "str" and "unicode" are kinds of strings, and it's tempting to call either or both of them "string," but better to use more specific terms to keep things straight.

.encode() and .decode()

unicode .encode() → bytes

bytes .decode() → unicode

2
    >>> my_unicode = u"Hi \u2119\u01b4\u2602\u210c\xf8\u1f24"
    >>> len(my_unicode)
    9
     
    >>> my_utf8 = my_unicode.encode('utf-8')
    >>> len(my_utf8)
    19
    >>> my_utf8
    'Hi \xe2\x84\x99\xc6\xb4\xe2\x98\x82\xe2\x84\x8c\xc3\xb8\xe1\xbc\xa4'
     
    >>> my_utf8.decode('utf-8')
    u'Hi \u2119\u01b4\u2602\u210c\xf8\u1f24'
    

To convert between bytes and unicode, each has a method. Unicode strings have a .encode() method that produces bytes, and byte strings have a .decode() method that produces unicode. Each takes an argument, which is the name of the encoding to use for the operation.

We can define a Unicode string named my_unicode, and see that it has 9 characters. We can encode it to UTF-8 to create the my_utf8 byte string, which has 19 bytes. As you'd expect, decoding the UTF-8 string produces the original Unicode string.

Encoding errors

Many encodings only do a subset of Unicode

2
    >>> my_unicode.encode('ascii')
    Traceback (most recent call last):
    UnicodeEncodeError: 'ascii' codec can't encode characters in
              position 3-8: ordinal not in range(128)
    

Unfortunately, encoding and decoding can produce errors if the data isn't appropriate for the specified encoding. Here we try to encode our exotic Unicode string to ASCII. It fails because ASCII can only represent characters in the range 0 to 127, and our Unicode string has code points well outside that range.

The UnicodeEncodeError that's raised indicates the encoding being used, in the form of the "codec", for coder/decoder, and the actual position of the character that caused the problem.
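
The same details are available as attributes on the exception object, which is handy if you want to log or report the problem. This sketch is written for Python 3, where a plain string literal is already unicode; the variable names are mine.

```python
text = "Hi \u2119\u01b4\u2602\u210c\xf8\u1f24"

try:
    text.encode("ascii")
except UnicodeEncodeError as err:
    # encoding: the codec in use; start/end: the offending slice
    # (end is exclusive); reason: the human-readable explanation.
    details = (err.encoding, err.start, err.end, err.reason)
    offending = err.object[err.start:err.end]
    print(details)
    print(ascii(offending))
```

Note that end is exclusive, which is why the traceback says "position 3-8" while the attributes say 3 and 9.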

Decoding errors

Not all byte sequences are valid

2
    >>> my_utf8.decode("ascii")
    Traceback (most recent call last):
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in
              position 3: ordinal not in range(128)
     
    >>> "\x78\x9a\xbc\xde\xf0".decode("utf-8")
    Traceback (most recent call last):
        return codecs.utf_8_decode(input, errors, True)
    UnicodeDecodeError: 'utf8' codec can't decode byte 0x9a in
              position 1: invalid start byte
    

Decoding can also produce errors. Here we try to decode our UTF-8 string as ASCII and get a UnicodeDecodeError because again, ASCII can only accept values up to 127, and our UTF-8 string has bytes outside that range.

Even UTF-8 can't decode just any sequence of bytes. Here we try to decode some random junk, and it also produces a UnicodeDecodeError. Actually, one of UTF-8's advantages is that there are invalid sequences of bytes, which helps to build robust systems: mistakes in data won't be accepted as if they were valid.
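
One way to take advantage of that property is a quick validity check. This is a minimal Python 3 sketch, and is_valid_utf8 is a hypothetical helper name, not something from the standard library.

```python
def is_valid_utf8(data):
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

good = "Hi \u2119\u01b4\u2602".encode("utf-8")
junk = b"\x78\x9a\xbc\xde\xf0"

print(is_valid_utf8(good), is_valid_utf8(junk))
```

A single-byte code like ISO 8859-1 can't offer this kind of check, because every byte sequence is valid in it.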

Error handling

2
    >>> my_unicode.encode("ascii", "replace")
    'Hi ??????'
     
    >>> my_unicode.encode("ascii", "xmlcharrefreplace")
    'Hi &#8473;&#436;&#9730;&#8460;&#248;&#7972;'
     
    >>> my_unicode.encode("ascii", "ignore")
    'Hi '
    

When encoding or decoding, you can specify what should happen when the codec can't handle the data. An optional second argument to encode or decode specifies the policy. The default value is "strict", which means raise an error, as we've seen.

A value of "replace" means, give me a standard replacement character. When encoding, the replacement character is a question mark, so any code point that can't be encoded using the specified encoding will simply produce a "?".

Other error handlers are more useful. "xmlcharrefreplace" produces an HTML/XML character entity reference, so that \u01B4 becomes "&#436;" (hex 01B4 is decimal 436.) This is very useful if you need to output unicode for an HTML file.

Notice that different error policies are used for different reasons. "Replace" is a defensive mechanism against data that cannot be interpreted, and loses information. "Xmlcharrefreplace" preserves all the original information, and is used when outputting data where XML escapes are acceptable.
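
Here is a small Python 3 sketch of that difference. It uses html.unescape from the standard library to turn the character references back into characters; the variable names are just for illustration.

```python
import html

text = "Hi \u2119\u01b4\u2602\u210c\xf8\u1f24"

# "replace" is lossy: every unencodable character becomes "?".
lossy = text.encode("ascii", "replace")

# "xmlcharrefreplace" keeps the information as numeric references.
escaped = text.encode("ascii", "xmlcharrefreplace")

# html.unescape turns the numeric references back into characters.
recovered = html.unescape(escaped.decode("ascii"))

print(lossy)
print(escaped)
print(recovered == text)
```

There is no way to get the original text back from the "replace" output, but the "xmlcharrefreplace" output round-trips.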

Error handling

2
    >>> my_utf8.decode("ascii", "ignore")
    u'Hi '
     
    >>> my_utf8.decode("ascii", "replace")
    u'Hi \ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd'
    

Hi ����������������

You can also specify error handling when decoding. "Ignore" will drop bytes that can't decode properly. "Replace" will insert a Unicode U+FFFD, "REPLACEMENT CHARACTER" for problem bytes. Notice that since the decoder can't decode the data, it doesn't know how many Unicode characters were intended. Decoding our UTF-8 bytes as ASCII produces 16 replacement characters, one for each byte that couldn't be decoded, while those bytes were meant to only produce 6 Unicode characters.
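
Counting makes the mismatch easy to see. This is a Python 3 sketch with illustrative names: the same bytes are decoded once with the right codec and once with the wrong codec plus "replace".

```python
raw = "Hi \u2119\u01b4\u2602\u210c\xf8\u1f24".encode("utf-8")

right = raw.decode("utf-8")             # the intended text
wrong = raw.decode("ascii", "replace")  # one U+FFFD per undecodable byte

replacements = wrong.count("\ufffd")
print(len(raw), len(right), len(wrong), replacements)
```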

Implicit conversion

Mixing bytes and unicode implicitly decodes

2
    >>> u"Hello " + "world"
    u'Hello world'
     
    >>> u"Hello " + ("world".decode("ascii"))
    u'Hello world'
     
    >>> import sys
    >>> sys.getdefaultencoding()
    'ascii'
    

Python 2 tries to be helpful when working with unicode and byte strings. If you try to perform a string operation that combines a unicode string with a byte string, Python 2 will automatically decode the byte string to produce a second unicode string, then will complete the operation with the two unicode strings.

For example, we try to concatenate a unicode "Hello " with a byte string "world". The result is a unicode "Hello world". On our behalf, Python 2 is decoding the byte string "world" using the ASCII codec. The encoding used for these implicit decodings is the value of sys.getdefaultencoding().

The implicit encoding is ASCII because it's the only safe guess: ASCII is so widely accepted, and is a subset of so many encodings, that it's unlikely to be wrong.

Implicit decoding errors

2
    >>> u"Hello " + my_utf8
    Traceback (most recent call last):
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in
              position 3: ordinal not in range(128)
     
    >>> u"Hello " + (my_utf8.decode("ascii"))
    Traceback (most recent call last):
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in
              position 3: ordinal not in range(128)
    

Of course, these implicit decodings are not immune to decoding errors. If you try to combine a byte string with a unicode string and the byte string can't be decoded as ASCII, then the operation will raise a UnicodeDecodeError.

This is the source of those painful UnicodeErrors. Your code mixes unicode strings and byte strings, and as long as the data is all ASCII, the implicit conversions silently succeed. Once a non-ASCII character finds its way into your program, an implicit decode will fail, causing a UnicodeDecodeError.

Python 2 is “helpful”

Converting implicitly: helpful?

Works great when everything is ASCII

When that fails: PAIN

Python 2's philosophy was that unicode strings and byte strings are confusing, and it tried to ease your burden by automatically converting between them, just as it does for ints and floats. But the conversion from int to float can't fail, while byte string to unicode string can.

Python 2 silently glosses over byte to unicode conversions, making it much easier to write code that deals with ASCII. The price you pay is that it will fail with non-ASCII data.

Other implicit conversions

2
    >>> "Title: %s" % my_unicode
    u'Title: Hi \u2119\u01b4\u2602\u210c\xf8\u1f24'
     
    >>> u"Title: %s" % my_string
    u'Title: Hello World'
     
    >>> print my_unicode
    Traceback (most recent call last):
    UnicodeEncodeError: 'ascii' codec can't encode characters in
              position 3-8: ordinal not in range(128)
     
    >>> my_utf8.encode('utf-8')
    Traceback (most recent call last):
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in
              position 3: ordinal not in range(128)
     
    >>> my_string.encode('utf-8')
    'Hello World'
    

There are lots of ways to combine two strings, and all of them will decode bytes to unicode, so you have to watch out for them.

First we use an ASCII format string, with unicode data. The format string will be decoded to unicode, then the formatting performed, resulting in a unicode string.

Next we switch the two: A unicode format string and a byte string again combine to produce a unicode string, because the byte string data is decoded as ASCII.

Simply attempting to print a unicode string will cause an implicit encoding: output is always bytes, so the unicode string has to be encoded into bytes before it can be printed.

The next one is truly confusing: we ask to encode a byte string to UTF-8, and get an error about not being able to decode as ASCII! The problem here is that byte strings can't be encoded: remember encode is how you turn unicode into bytes. So to perform the encoding you want, Python 2 needs a unicode string, which it tries to get by implicitly decoding your bytes as ASCII.

Lastly, we encode an ASCII string to UTF-8. Here we're performing the same implicit decode to get a unicode string we can encode, but since the string is ASCII, it succeeds, and then goes on to encode it as UTF-8, producing the original byte string, since ASCII is a subset of UTF-8.

Bytes and Unicode

Fact of Life #3

Need to keep them straight

Need to deal with both

🙈   🙉   🙊

This is the most important Fact of Life: bytes and unicode are both important, and you need to deal with both of them. You can't pretend that everything is bytes, or everything is unicode. You need to use each for their purpose, and explicitly convert between them as needed.
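
As a rough sketch of what "explicitly convert" can look like, here is a tiny function that takes bytes, decodes them, works on unicode, and encodes the result. The function name and the choice of UTF-8 as the default are assumptions for the example, not a recommendation for your data.

```python
def shout(raw, encoding="utf-8"):
    """Hypothetical helper: bytes in, bytes out, unicode in between."""
    text = raw.decode(encoding)      # bytes -> unicode, explicitly
    text = text.upper()              # do the real work on code points
    return text.encode(encoding)     # unicode -> bytes, explicitly

incoming = "h\xe9llo".encode("utf-8")
outgoing = shout(incoming)
print(outgoing)
```

Each conversion names its encoding, so nothing depends on an implicit guess.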

Python 3

🐍   🐍   🐍

We've seen the source of Unicode pain in Python 2, now let's take a look at Python 3. The biggest change from Python 2 to Python 3 is their treatment of Unicode.

Str vs bytes

str: a sequence of code points (unicode)

bytes: a sequence of bytes

3
    >>> my_string = "Hi \u2119\u01b4\u2602\u210c\xf8\u1f24"
    >>> type(my_string)
    <class 'str'>
     
    >>> my_bytes = b"Hello World"
    >>> type(my_bytes)
    <class 'bytes'>
    

Just as in Python 2, Python 3 has two string types, one for unicode and one for bytes, but they are named differently.

Now the "str" type that you get from a plain string literal stores unicode, and the "bytes" type stores bytes. You can create a bytes literal with a b prefix.

So "str" in Python 2 is now called "bytes," and "unicode" in Python 2 is now called "str". This makes more sense than the Python 2 names, since Unicode is how you want all text stored, and byte strings are only for when you are dealing with bytes.

No coercion!

Python 3 won’t implicitly change bytes ↔ unicode

3
    >>> "Hello " + b"world"
    Traceback (most recent call last):
    TypeError: Can't convert 'bytes' object to str implicitly
     
    >>> "Hello" == b"Hello"
    False
     
    >>> d = {"Hello": "world"}
    >>> d[b"Hello"]
    Traceback (most recent call last):
    KeyError: b'Hello'
    

The biggest change in the Unicode support in Python 3 is that there is no automatic decoding of byte strings. If you try to combine a byte string with a unicode string, you will get an error all the time, regardless of the data involved!

All of those operations I showed where Python 2 silently converted byte strings to unicode strings to complete an operation, every one of them is an error in Python 3.

In addition, Python 2 considers a Unicode string and a byte string equal if they contain the same ASCII bytes, and Python 3 won't. A consequence of this is that Unicode dictionary keys can't be found with byte strings, and vice-versa, as they can be in Python 2.
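
The way out in each of these cases is to convert explicitly before combining. Here is a minimal Python 3 sketch; the variable names are mine, and ASCII is used as the encoding only because the sample data is ASCII.

```python
greeting = "Hello "
raw = b"world"

# Mixing str and bytes raises TypeError, whatever the data is.
try:
    greeting + raw
    mixed_ok = True
except TypeError:
    mixed_ok = False

# Decoding first makes the operation well defined.
fixed = greeting + raw.decode("ascii")

# The same goes for dictionary lookups with a bytes key.
d = {"Hello": "world"}
key = b"Hello"
found = d[key.decode("ascii")]

print(mixed_ok, fixed, found)
```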

Python 3 pain

Mixing bytes and unicode is always PAIN

You are forced to keep them straight

This drastically changes the nature of Unicode pain in Python 3. In Python 2, mixing Unicode and bytes succeeds so long as you only use ASCII data. In Python 3, it fails immediately regardless of the data.

So Python 2's pain is deferred: you think your program is correct, and find out later that it fails with exotic characters.

With Python 3, your code fails right off the bat, so even if you are only dealing with ASCII, you have to explicitly deal with the difference between bytes and Unicode.

Python 3 is strict about the difference between bytes and unicode. You are forced to be clear in your code which you are dealing with. This has been controversial.

Reading files
