Switching to Python 3 causing UnicodeDecodeError

Python 3 decodes text files when reading and encodes them when writing. The default encoding is taken from locale.getpreferredencoding(False), which on your setup evidently returns ASCII. See the open() function documentation:

In text mode, if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding.

Instead of relying on a system setting, you should open your text files using an explicit codec:

currentFile = open(filename, 'rt', encoding='latin1')

where you set the encoding parameter to match the file you are reading.
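
As a minimal sketch of the difference, assuming a small Latin-1 sample file created just for the demonstration (read_text and sample_path are hypothetical names, not part of the question), something like this shows the explicit codec working where ASCII fails:

```python
import os
import tempfile


def read_text(path, encoding):
    """Read a whole text file, decoding it with an explicitly named codec."""
    with open(path, 'rt', encoding=encoding) as handle:
        return handle.read()


# Hypothetical sample file: 'cafe' with an accented e, stored as Latin-1 bytes.
sample_path = os.path.join(tempfile.mkdtemp(), 'sample_latin1.txt')
with open(sample_path, 'wb') as handle:
    handle.write('caf\u00e9'.encode('latin1'))  # the accented e becomes byte 0xE9

# Naming the codec that matches the file decodes it correctly.
text = read_text(sample_path, 'latin1')

# Decoding the same bytes as ASCII raises the error from the question.
try:
    read_text(sample_path, 'ascii')
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True
```

The file on disk is identical in both calls; only the codec passed to open() changes, which is why setting it explicitly removes the dependence on the system locale.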

Python 3 uses UTF-8 as the default encoding for source code, but that default does not extend to the files you open.

The same applies to writing to a writeable text file; data written will be encoded, and if you rely on the system encoding you are liable to get UnicodeEncodeError exceptions unless you explicitly set a suitable codec. Which codec to use when writing depends on what text you are writing and what you plan to do with the file afterward.
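
A rough illustration of the writing side, assuming throwaway temporary files and made-up sample text (utf8_path and ascii_path are hypothetical names): a codec that can represent the text succeeds, while ASCII raises UnicodeEncodeError.

```python
import os
import tempfile

# Sample text with characters outside ASCII (i with diaeresis, euro sign).
text = 'na\u00efve \u20ac5'

out_dir = tempfile.mkdtemp()
utf8_path = os.path.join(out_dir, 'out_utf8.txt')
ascii_path = os.path.join(out_dir, 'out_ascii.txt')

# An explicit codec that covers every character in the text works.
with open(utf8_path, 'w', encoding='utf-8') as handle:
    handle.write(text)

# Read the raw bytes back to see what was actually stored.
with open(utf8_path, 'rb') as handle:
    raw = handle.read()

# A codec that cannot represent the text fails while writing.
try:
    with open(ascii_path, 'w', encoding='ascii') as handle:
        handle.write(text)
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True
```

UTF-8 is a safe choice when you control both ends; if another program will read the file, use whatever encoding that program expects.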

You may want to read up on Python 3 and Unicode in the Unicode HOWTO, which explains both about source code encoding and reading and writing Unicode data.

"as far as I know Python3 is supposed to support utf-8 everywhere …"
Not true. I have Python 3.6 and my default encoding is NOT UTF-8.
To change it to UTF-8 in my code I use:

import locale

# Replace the locale lookup so code that calls it gets 'utf-8' back.
def getpreferredencoding(do_setlocale=True):
    return 'utf-8'

locale.getpreferredencoding = getpreferredencoding

as explained in
Changing the “locale preferred encoding” in Python 3 in Windows

In general, I have found three ways to fix Unicode-related errors in Python 3:

  1. Pass the encoding explicitly, like currentFile = open(filename, 'rt', encoding='utf-8')

  2. Because bytes carry no encoding, encode the string to bytes yourself and write them to a file opened in binary mode, like data = string.encode('utf-8')

  3. Especially on Linux, check $LANG. This issue usually arises when LANG=C, which makes the default encoding ASCII instead of UTF-8. You can change it to a UTF-8 locale, for example LANG=en_IN.UTF-8
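
Here is a small sketch of the second and third points, assuming a temporary file and a made-up message (path and message are hypothetical names). It prints what Python would fall back to on your machine, then writes pre-encoded bytes so that the locale no longer matters:

```python
import locale
import os
import tempfile

# Point 3: see which encoding Python uses when none is given, and what
# LANG is set to (it may be unset, for example on Windows).
default_encoding = locale.getpreferredencoding(False)
lang = os.environ.get('LANG', '(unset)')
print('default encoding:', default_encoding, '| LANG:', lang)

# Point 2: encode the string yourself and write the bytes in binary mode.
message = 'gr\u00fc\u00df dich'
data = message.encode('utf-8')

path = os.path.join(tempfile.mkdtemp(), 'message.bin')
with open(path, 'wb') as handle:
    handle.write(data)

# Reading back in binary mode and decoding with the same codec round-trips.
with open(path, 'rb') as handle:
    round_trip = handle.read().decode('utf-8')
```

If the printed default encoding is not UTF-8, any open() call without an encoding argument is a candidate for the errors described above.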
