How to remove \xa0 from string in Python?

\xa0 is actually a non-breaking space in Latin1 (ISO 8859-1), also chr(160). You should replace it with a space.

string = string.replace(u'\xa0', u' ')

When you call .encode('utf-8'), the Unicode string is encoded as UTF-8, in which every code point is represented by 1 to 4 bytes. In this case, \xa0 is represented by the 2 bytes \xc2\xa0.
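To make the byte counts concrete, here is a minimal sketch assuming Python 3, where every str is already Unicode. The sample string is made up for illustration and is not from the question.

```python
# U+00A0 (NO-BREAK SPACE) has code point 160, the same value as chr(160).
nbsp = "\xa0"
code_point = ord(nbsp)

# UTF-8 needs two bytes for this character; Latin-1 needs only one.
utf8_bytes = nbsp.encode("utf-8")
latin1_bytes = nbsp.encode("latin-1")

# Hypothetical sample text containing a no-break space.
sample = "Dear Parent,\xa0Thanks"
cleaned = sample.replace("\xa0", " ")

print(code_point, utf8_bytes, latin1_bytes, repr(cleaned))
```

The replacement works on the decoded string, so it should be done before encoding to bytes, not after.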

Read up on http://docs.python.org/howto/unicode.html.

Please note: this answer is from 2012. Python has moved on, and you should be able to use unicodedata.normalize now.

There are many useful things in Python's unicodedata library. One of them is the .normalize() function.

Try:

new_str = unicodedata.normalize("NFKD", unicode_str)

Replace "NFKD" with any of the other forms listed in the link above if you don't get the results you're after.
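As a rough check of how the four normalization forms treat \xa0, the sketch below runs each one over a made-up sample string (an assumption for illustration, not text from the question). Only the compatibility forms, NFKC and NFKD, turn the no-break space into a plain space. They also rewrite other compatibility characters, such as the "fi" ligature, which may or may not be what you want.

```python
import unicodedata

# Hypothetical sample: "café", a no-break space, then the "fi" ligature (U+FB01) and "le".
sample = "caf\u00e9\xa0\ufb01le"

# Apply every normalization form and record the result.
results = {
    form: unicodedata.normalize(form, sample)
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}

for form, value in results.items():
    # The last column shows whether the no-break space survived.
    print(form, repr(value), "\xa0" in value)
```

If only the no-break space should change, a plain .replace() is the narrower tool.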

After trying several methods, here is a summary of how I did it. The following are two ways of avoiding or removing \xa0 characters from a parsed HTML string.

Assume our raw HTML is as follows:

raw_html = '<p>Dear Parent,&nbsp;</p><p><span style="font-size: 1rem;">This is a test message,&nbsp;</span><span style="font-size: 1rem;">kindly ignore it.&nbsp;</span></p><p><span style="font-size: 1rem;">Thanks</span></p>'

So let's try to clean this HTML string:

from bs4 import BeautifulSoup
raw_html = '<p>Dear Parent,&nbsp;</p><p><span style="font-size: 1rem;">This is a test message,&nbsp;</span><span style="font-size: 1rem;">kindly ignore it.&nbsp;</span></p><p><span style="font-size: 1rem;">Thanks</span></p>'
text_string = BeautifulSoup(raw_html, "lxml").text
print(repr(text_string))
# 'Dear Parent,\xa0This is a test message,\xa0kindly ignore it.\xa0Thanks'

The above code leaves \xa0 characters in the string. There are two ways to remove them properly.

Method # 1 (Recommended):
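The first one is BeautifulSoup's get_text method with the strip argument set to True. It strips whitespace, including \xa0, from each text fragment, so the fragments are joined without any space between them.
So our code becomes:

clean_text = BeautifulSoup(raw_html, "lxml").get_text(strip=True)
print(clean_text)
# Dear Parent,This is a test message,kindly ignore it.Thanks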

Method # 2:
The other option is to use Python's unicodedata library, which replaces each \xa0 with a regular space:

import unicodedata
text_string = BeautifulSoup(raw_html, "lxml").text
clean_text = unicodedata.normalize("NFKD", text_string)
print(clean_text)
# Dear Parent, This is a test message, kindly ignore it. Thanks
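If neither method fits, a third option is a small standard-library helper. This is only a sketch: clean_nbsp is a hypothetical name, not part of BeautifulSoup or unicodedata, and it assumes Python 3 and text that has already been extracted from the HTML.

```python
import re


def clean_nbsp(text):
    """Replace no-break spaces with plain spaces, then collapse whitespace runs."""
    # Make the substitution explicit so the intent is obvious to readers.
    text = text.replace("\xa0", " ")
    # Collapse any run of whitespace into one space and trim both ends.
    return re.sub(r"\s+", " ", text).strip()


# The parsed text from the example above, written out as a literal.
text_string = "Dear Parent,\xa0This is a test message,\xa0kindly ignore it.\xa0Thanks"
clean_text = clean_nbsp(text_string)
print(clean_text)
```

Unlike NFKD normalization, this leaves accented letters and other characters untouched.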

I have also detailed these methods on this blog, which you may want to refer to.
