Python – How to open an HTML file?

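import codecs

# Open the file for reading; an encoding such as "utf-8" can be passed as a third argument.
f = codecs.open("test.html", "r")
print(f.read())
f.close()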

Try something like this.

I ran into this problem today as well. I am on Windows and my default system language is Chinese, so open() does not default to UTF-8 and reading a UTF-8 file raises a Unicode error. Others may hit the same error. Simply add encoding="utf-8":

with open("test.html", "r", encoding="utf-8") as f:
    text = f.read()
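
To see the explicit encoding at work, here is a minimal sketch that writes a small UTF-8 file containing Chinese characters and reads it back. The helper name read_html, the sample markup, and the temporary path are all made up for illustration.

```python
import os
import tempfile


def read_html(path, encoding="utf-8"):
    """Return the file's contents decoded with the given encoding."""
    with open(path, "r", encoding=encoding) as f:
        return f.read()


# Hypothetical sample file, written to a temporary directory for the demo.
sample = "<html><body><p>你好, world</p></body></html>"
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "test.html")
with open(sample_path, "w", encoding="utf-8") as f:
    f.write(sample)

# Reading with the same encoding the file was written in avoids decode errors.
text = read_html(sample_path)
print(text)
```

Passing the encoding explicitly makes the result independent of the system locale, which is the point of the fix above.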

You can use the following code:

from __future__ import division, unicode_literals
import codecs
from bs4 import BeautifulSoup

f = codecs.open("test.html", "r", "utf-8")
# Parse the HTML and keep only its text content.
document = BeautifulSoup(f.read(), "html.parser").get_text()
print(document)
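
If installing bs4 is not an option, roughly the same idea can be sketched with the standard library's html.parser module. This is a swapped-in alternative, not what BeautifulSoup does internally; the TextExtractor class, the html_to_text helper, and the sample markup are hypothetical names used only for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        return "".join(self._chunks)


def html_to_text(html):
    """Return the text content of an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample_html = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><p>Hello <b>world</b></p></body></html>"
)
extracted = html_to_text(sample_html)
print(extracted)
```

This sketch is far less forgiving of malformed markup than BeautifulSoup, so it is best suited to simple, well-formed pages.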

If you also want to drop the blank lines and collect the words into a single string (skipping special characters and numbers), add:

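import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Assumption: stop is nltk's English stop-word list; the original does not show its definition.
stop = set(stopwords.words("english"))
st = ""
docwords = word_tokenize(document)
for line in docwords:
    line = line.rstrip()
    if line:
        # Keep purely alphabetic tokens only.
        if re.match(r"^[A-Za-z]*$", line):
            if line not in stop and len(line) > 1:
                st = st + " " + line
print(st)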

*st must be defined as a string before the loop, e.g. st = ""
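
The same filtering idea can be tried without nltk. The sketch below is a simplification under stated assumptions: it splits on whitespace instead of using word_tokenize, compares lowercased tokens against a tiny hand-written stop-word set (STOP_WORDS is a hypothetical stand-in for nltk's list), and builds the result with join instead of repeated string concatenation.

```python
import re

# Hypothetical, deliberately tiny stop-word set; nltk's English list is much longer.
STOP_WORDS = {"the", "and", "is", "a", "of", "to", "in"}


def keep_words(text, stop=STOP_WORDS):
    """Return alphabetic, non-stop words longer than one character, joined by spaces."""
    kept = []
    for token in text.split():  # split() also discards blank lines and extra spaces
        if (
            re.match(r"^[A-Za-z]+$", token)
            and len(token) > 1
            and token.lower() not in stop
        ):
            kept.append(token)
    return " ".join(kept)


cleaned = keep_words("The  page has 3 links\n\nand a title !")
print(cleaned)
```

Because this version splits on whitespace only, a word with punctuation attached (such as "title!") is dropped entirely, whereas a real tokenizer would separate the punctuation first.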
