2008년 12월 19일 금요일

Encoding name normalization in Python

Python provides codecs module, which provides the codec registry. You can query the registry with codecs.lookup function. codecs.lookup function receives an encoding name as an argument.

The venerable IANA, our Internet Assigned Numbers Authority, maintains official names for character sets. On the other hand, nobody really cares. According to IANA, "UTF-8" is an encoding name, with no aliases. Therefore, "UTF8", "UTF_8" or any such aliases should be invalid. That doesn't really work in the real world.

So Python normalizes encoding names received from codecs.lookup. How exactly this is done isn't really specified. It turns out that CPython does normalization in two separate places: one in Python in the standard library, and one in C in the implementation. There are normalize_encoding function in Lib/encodings/__init__.py, and normalizestring function in Python/codecs.c. Moreover, these two functions perform different normalizations.

IronPython, being an implementation of Python trying to be compatible with CPython, needs to cope with this. You will be surprised by the number of ways things can go wrong if you don't exactly match how this is done. But did you know that the following code work with CPython? (I don't recommend this!!!)

import codecs
codecs.lookup('utf!!!8')


Yes, those are three exclamation marks. I'm not kidding...