How do I unescape HTML entities in a string in Python 3.1?
You could use the function html.unescape:
In Python 3.4+ (thanks to J.F. Sebastian for the update):
import html
html.unescape('Suzy &amp; John')
# 'Suzy & John'
html.unescape('&quot;')
# '"'
In Python 3.3 or older:
import html.parser
# HTMLParser.unescape was deprecated in Python 3.4 and removed in 3.9
html.parser.HTMLParser().unescape('Suzy &amp; John')
In Python 2:
import HTMLParser
HTMLParser.HTMLParser().unescape('Suzy &amp; John')
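If the same code has to run on several of the versions above, the three variants can be folded into one import fallback. This is only a sketch: the module-level name `unescape` is my own choice, and the nested `try` blocks assume that a missing module raises `ImportError` on the interpreter in question.

```python
# Pick whichever unescape implementation this interpreter provides.
try:
    from html import unescape  # Python 3.4+
except ImportError:
    try:
        from html.parser import HTMLParser  # Python 3.0 to 3.3
    except ImportError:
        from HTMLParser import HTMLParser  # Python 2
    # Older versions expose unescape as a method on a parser instance.
    unescape = HTMLParser().unescape

print(unescape("Suzy &amp; John"))
print(unescape("&quot;"))
```

After this, the rest of the code can call `unescape(...)` without caring which branch was taken.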
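You can use xml.sax.saxutils.unescape for this purpose. This module is included in the Python standard library and is portable between Python 2.x and Python 3.x. Note that by default it only unescapes &amp;, &lt; and &gt;; other entities have to be supplied through its optional entities dictionary argument.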
>>> import xml.sax.saxutils as saxutils
>>> saxutils.unescape("Suzy &amp; John")
'Suzy & John'
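Since `saxutils.unescape` only knows `&amp;`, `&lt;` and `&gt;` out of the box, quotes need the optional second argument. A small sketch follows; the dictionary name `extra_entities` and the two entities in it are just my choices for illustration, and any entity not listed is left as-is.

```python
from xml.sax.saxutils import unescape

# Extra replacements beyond the built-in &amp;, &lt;, &gt;.
# Keys are the escaped forms, values are the characters to produce.
extra_entities = {"&quot;": '"', "&apos;": "'"}

default_result = unescape("&quot;Suzy &amp; John&quot;")
extended_result = unescape("&quot;Suzy &amp; John&quot;", extra_entities)

print(default_result)   # &quot; is left untouched by default
print(extended_result)  # quotes are unescaped as well
```

This works well for a short, known set of entities; for the full HTML entity table, `html.unescape` is the less tedious route.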
Apparently I don't have a high enough reputation to do anything but post this. unutbu's answer does not unescape quotations. The only thing that I found that did was this function:
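# Python 2 code: htmlentitydefs and unichr became html.entities and chr in Python 3
import re
from htmlentitydefs import name2codepoint as n2cp

def decodeHtmlentities(string):
    def substitute_entity(match):
        ent = match.group(2)
        if match.group(1) == "#":
            # numeric entity such as &#39;
            return unichr(int(ent))
        else:
            # named entity such as &quot;
            cp = n2cp.get(ent)
            if cp:
                return unichr(cp)
            else:
                # unknown name: leave the text unchanged
                return match.group()
    entity_re = re.compile(r"&(#?)(\d{1,5}|\w{1,8});")
    return entity_re.subn(substitute_entity, string)[0]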
Which I got from this page.
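Since the question is about Python 3.1 and the function above uses Python 2 names, here is a rough Python 3 port. It is a sketch, not the original author's code: the name `decode_html_entities` and the `ValueError` guard are my additions, and like the original it only handles decimal numeric entities, so hexadecimal forms such as `&#x27;` are left unchanged.

```python
import re
from html.entities import name2codepoint

# Same pattern as the function above: optional '#', then digits or a name.
ENTITY_RE = re.compile(r"&(#?)(\d{1,5}|\w{1,8});")

def decode_html_entities(text):
    def substitute_entity(match):
        ent = match.group(2)
        if match.group(1) == "#":
            try:
                return chr(int(ent))  # decimal numeric entity
            except ValueError:
                return match.group()  # e.g. hex form, left as-is
        cp = name2codepoint.get(ent)
        # Unknown names are returned unchanged.
        return chr(cp) if cp else match.group()
    return ENTITY_RE.sub(substitute_entity, text)

print(decode_html_entities("Suzy &amp; John said &quot;hi&quot;"))
```

On Python 3.4 and later `html.unescape` covers all of this, so the port is mainly of interest on older 3.x interpreters.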