How do I unescape HTML entities in a string in Python 3.1?

You could use the function html.unescape:

In Python 3.4+ (thanks to J.F. Sebastian for the update):

import html
html.unescape('Suzy &amp; John')
# 'Suzy & John'

html.unescape('&quot;')
# '"'

In Python 3.3 or older:

import html.parser
html.parser.HTMLParser().unescape('Suzy &amp; John')

In Python 2:

import HTMLParser
HTMLParser.HTMLParser().unescape('Suzy &amp; John')
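
If the same code has to run on several of these versions, the three variants above can be folded into one import fallback. This is only a sketch: the module-level name unescape is my own choice, and it relies on HTMLParser().unescape still being present on the older interpreters (it was deprecated in Python 3.4 and removed in Python 3.9, where the first branch is taken instead).

```python
try:
    # Python 3.4+: the documented function
    from html import unescape
except ImportError:
    try:
        # Python 3.0 - 3.3
        from html.parser import HTMLParser
    except ImportError:
        # Python 2
        from HTMLParser import HTMLParser
    # Bind the undocumented method of a parser instance to the same name
    unescape = HTMLParser().unescape

print(unescape('Suzy &amp; John'))
```

After this, callers can use unescape(...) without caring which interpreter they are on.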

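You can use xml.sax.saxutils.unescape for this purpose. This module is included in the Python standard library, and is portable between Python 2.x and Python 3.x. Note that by default it only unescapes &amp;, &lt; and &gt;; any other entities have to be supplied through its optional entities dictionary.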

>>> import xml.sax.saxutils as saxutils
>>> saxutils.unescape("Suzy &amp; John")
'Suzy & John'
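
For entities beyond the XML basics, saxutils.unescape accepts a second argument, a dictionary mapping entity strings to their replacements. A minimal sketch, where the extra_entities mapping and the unescape_with_quotes helper are illustrative names of mine and not part of the library:

```python
from xml.sax.saxutils import unescape

# Hypothetical extra mapping: only the entities listed here are added
# on top of the built-in &amp;, &lt; and &gt; handling.
extra_entities = {"&quot;": '"', "&apos;": "'"}

def unescape_with_quotes(text):
    """Unescape the XML basics plus the quote entities listed above."""
    return unescape(text, extra_entities)

print(unescape_with_quotes("&quot;Suzy &amp; John&quot;"))
```

This only covers what you list in the dictionary, so for arbitrary HTML entities html.unescape is the more complete choice.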

Apparently I don't have a high enough reputation to do anything but post this. unutbu's answer does not unescape quotations. The only thing that I found that did was this function:

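import re
# Python 2 module; in Python 3 the same table is html.entities.name2codepoint
from htmlentitydefs import name2codepoint as n2cp

def decodeHtmlentities(string):
    def substitute_entity(match):
        ent = match.group(2)
        if match.group(1) == "#":
            # numeric reference such as &#34; (unichr is chr in Python 3)
            return unichr(int(ent))
        else:
            # named entity such as &quot;
            cp = n2cp.get(ent)
            if cp:
                return unichr(cp)
            else:
                # unknown name: leave the text untouched
                return match.group()
    entity_re = re.compile(r"&(#?)(\d{1,5}|\w{1,8});")
    return entity_re.subn(substitute_entity, string)[0]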

Which I got from this page.
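
Since the question is about Python 3 and the function above uses Python 2 names (htmlentitydefs, unichr), here is a rough Python 3 port. It is my own adaptation, not the code from that page: the name decode_html_entities is made up, and I have added handling for hexadecimal references like &#x41; and for malformed input, which the original does not have.

```python
import re
from html.entities import name2codepoint

# "&name;" or "&#digits;"; hex forms like "&#x41;" are caught by the \w branch
_ENTITY_RE = re.compile(r"&(#?)(\d{1,5}|\w{1,8});")

def decode_html_entities(text):
    """Replace named and numeric HTML entities in text with their characters."""
    def substitute_entity(match):
        is_numeric = match.group(1) == "#"
        ent = match.group(2)
        try:
            if is_numeric:
                if ent[0] in "xX":
                    # hexadecimal reference, e.g. &#x41;
                    return chr(int(ent[1:], 16))
                # decimal reference, e.g. &#39;
                return chr(int(ent))
            # named entity, e.g. &quot;
            return chr(name2codepoint[ent])
        except (KeyError, ValueError, OverflowError):
            # unknown name or invalid number: leave the text untouched
            return match.group()
    return _ENTITY_RE.sub(substitute_entity, text)

print(decode_html_entities("&quot;Suzy &amp; John&quot;"))
```

On Python 3.4 or newer there is little reason to prefer this over html.unescape; it is mainly useful on older 3.x versions or when you want control over which entities get replaced.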
