unicode – u'\ufeff' in Python string
I ran into this on Python 3 and found this question (and solution).
In Python 3, open() accepts an encoding keyword that handles the decoding automatically.
Without it (and assuming the platform default encoding is UTF-8), the BOM is included in the read result:
>>> f = open('file', mode='r')
>>> f.read()
'\ufefftest'
Giving the correct encoding, the BOM is omitted in the result:
>>> f = open('file', mode='r', encoding='utf-8-sig')
>>> f.read()
'test'
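For a version that can be run without an existing file, here is a minimal Python 3 sketch. The file name bom_demo.txt and the helper read_text are made up for illustration; the file is written to a temporary directory with a UTF-8 BOM in front of the text, then read back with both codecs.

```python
import os
import tempfile


def read_text(path, encoding):
    """Read the whole file as text using the given encoding (hypothetical helper)."""
    with open(path, mode="r", encoding=encoding) as f:
        return f.read()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "bom_demo.txt")  # illustrative file name

    # Write the raw bytes: UTF-8 BOM (EF BB BF) followed by "test".
    with open(path, "wb") as f:
        f.write(b"\xef\xbb\xbftest")

    with_bom = read_text(path, "utf-8")         # BOM stays in the string
    without_bom = read_text(path, "utf-8-sig")  # BOM is stripped on decode

print(repr(with_bom))     # '\ufefftest'
print(repr(without_bom))  # 'test'
```

Passing the encoding explicitly in both calls keeps the result independent of the platform default.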
Just my 2 cents.
The Unicode character U+FEFF is the byte order mark, or BOM, and is used to tell big-endian and little-endian UTF-16 encoding apart. If you decode the web page using the right codec, Python will remove it for you. Examples:
#!python2
#coding: utf8
u = u'ABC'
e8 = u.encode('utf-8')        # encode without BOM
e8s = u.encode('utf-8-sig')   # encode with BOM
e16 = u.encode('utf-16')      # encode with BOM
e16le = u.encode('utf-16le')  # encode without BOM
e16be = u.encode('utf-16be')  # encode without BOM
print 'utf-8     %r' % e8
print 'utf-8-sig %r' % e8s
print 'utf-16    %r' % e16
print 'utf-16le  %r' % e16le
print 'utf-16be  %r' % e16be
print
print 'utf-8  w/ BOM decoded with utf-8     %r' % e8s.decode('utf-8')
print 'utf-8  w/ BOM decoded with utf-8-sig %r' % e8s.decode('utf-8-sig')
print 'utf-16 w/ BOM decoded with utf-16    %r' % e16.decode('utf-16')
print 'utf-16 w/ BOM decoded with utf-16le  %r' % e16.decode('utf-16le')
Note that EF BB BF is a UTF-8-encoded BOM. It is not required for UTF-8, but serves only as a signature (usually on Windows).
Output:
utf-8     'ABC'
utf-8-sig '\xef\xbb\xbfABC'
utf-16    '\xff\xfeA\x00B\x00C\x00'    # Adds BOM and encodes using native processor endian-ness.
utf-16le  'A\x00B\x00C\x00'
utf-16be  '\x00A\x00B\x00C'

utf-8  w/ BOM decoded with utf-8     u'\ufeffABC'    # doesn't remove BOM if present.
utf-8  w/ BOM decoded with utf-8-sig u'ABC'          # removes BOM if present.
utf-16 w/ BOM decoded with utf-16    u'ABC'          # reads the BOM to pick the byte order, then removes it.
utf-16 w/ BOM decoded with utf-16le  u'\ufeffABC'    # doesn't remove BOM if present.
Note that the utf-16 codec relies on the BOM to determine the byte order. Without one, Python can't know whether the data is big- or little-endian and falls back to the platform's native byte order.
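To see how the BOM steers the plain utf-16 codec, here is a small Python 3 sketch (the example above is Python 2, so the syntax differs slightly). It builds the same text in both byte orders by hand, using the BOM constants from the standard codecs module; the variable names are only illustrative.

```python
import codecs

# Build UTF-16 data by hand: an explicit BOM followed by BOM-less payload bytes.
le_bytes = codecs.BOM_UTF16_LE + "ABC".encode("utf-16-le")
be_bytes = codecs.BOM_UTF16_BE + "ABC".encode("utf-16-be")

# The generic utf-16 codec reads the BOM, picks the byte order, and drops it.
decoded_le = le_bytes.decode("utf-16")
decoded_be = be_bytes.decode("utf-16")

# An explicit-endian codec treats the BOM as ordinary data and keeps it.
kept_bom = le_bytes.decode("utf-16-le")

print(repr(decoded_le))  # 'ABC'
print(repr(decoded_be))  # 'ABC'
print(repr(kept_bom))    # '\ufeffABC'
```

Both inputs decode to the same string even though their bytes differ, which is the job the BOM does for UTF-16.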
That character is the BOM, or byte order mark. It usually arrives encoded as the first few bytes of a file and tells you how to interpret the encoding of the rest of the data. You can simply remove the character to continue. However, since the error says you were trying to convert to ASCII, you should probably pick another encoding for whatever you were trying to do.
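If the text has already been decoded and the BOM is sitting at the front of the string, removing it by hand could look like the following Python 3 sketch. The helper name strip_bom is hypothetical, not a standard library function.

```python
BOM = "\ufeff"


def strip_bom(text):
    """Return text without a single leading U+FEFF, if one is present."""
    return text[1:] if text.startswith(BOM) else text


raw = "\ufefftest"
cleaned = strip_bom(raw)

print(repr(cleaned))                  # 'test'
print(repr(cleaned.encode("ascii")))  # b'test'
```

Encoding raw to ASCII directly would raise UnicodeEncodeError, because U+FEFF has no ASCII representation. When you control how the data is read, decoding with utf-8-sig is usually simpler than stripping afterwards.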