Get html using Python requests?

The server in question is giving you a gzipped response. The server is also very broken; it sends the following headers:

$ curl -D - -o /dev/null -s -H 'Accept-Encoding: gzip, deflate' 'http://www.wrcc.dri.edu/WRCCWrappers.py?sodxtrmts+028815+por+por+pcpn+none+mave+5+01+F'
HTTP/1.1 200 OK
Date: Tue, 06 Jan 2015 17:46:49 GMT
Server: Apache
<!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "DTD/xhtml1-transitional.dtd"><html xmlns="http: //www.w3.org/1999/xhtml" lang="en-US">
Vary: Accept-Encoding
Content-Encoding: gzip
Content-Length: 3659
Content-Type: text/html

The <!DOCTYPE..> line there is not a valid HTTP header. As such, the remaining headers past Server are ignored. Why the server interjects that line is unclear; in all likelihood WRCCWrappers.py is a CGI script that doesn't output headers but does include a double newline after the doctype line, duping the Apache server into inserting additional headers there.
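To illustrate why everything after Server gets dropped on Python 3, here is a minimal sketch that feeds a similar header block to the standard library's `http.client.parse_headers`. It assumes the HTTP stack underneath requests relies on the same parser, and the doctype line is shortened for readability; both are assumptions of this example, not something shown in the server output.

```python
import http.client
import io

# Header block modelled on the curl output above. The doctype line is
# shortened for this sketch; what matters is that it does not look like
# "Name: value".
RAW_HEADERS = (
    b"Date: Tue, 06 Jan 2015 17:46:49 GMT\r\n"
    b"Server: Apache\r\n"
    b"<!DOCTYPE HTML PUBLIC -//W3C//DTD XHTML 1.0 Transitional//EN>\r\n"
    b"Vary: Accept-Encoding\r\n"
    b"Content-Encoding: gzip\r\n"
    b"Content-Length: 3659\r\n"
    b"Content-Type: text/html\r\n"
    b"\r\n"
)

# parse_headers reads up to the blank line, then hands the block to the
# email header parser, which stops at the first line that is not a header.
parsed = http.client.parse_headers(io.BytesIO(RAW_HEADERS))

print(parsed["Server"])            # header before the bad line
print(parsed["Content-Encoding"])  # header after the bad line
```

The lookup for Server succeeds, while the lookup for Content-Encoding comes back as None, which matches the behaviour described above.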

As such, requests also doesn't detect that the data is gzip-encoded. The data is all there; you just have to decode it. Or you could, if it weren't rather incomplete.
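If you did want to decode the body by hand, a sketch along these lines might do. The helper name `gunzip_lenient` is hypothetical; it uses `zlib.decompressobj` rather than `gzip.decompress` because the former returns whatever it managed to decode from a truncated stream instead of raising.

```python
import gzip
import zlib

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def gunzip_lenient(body):
    """Return (data, complete) for a possibly truncated gzip body.

    Hypothetical helper: bodies that do not start with the gzip magic
    bytes are returned unchanged and reported as complete.
    """
    if not body.startswith(GZIP_MAGIC):
        return body, True
    # wbits = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decoder = zlib.decompressobj(16 + zlib.MAX_WBITS)
    data = decoder.decompress(body)
    # eof is True only if the end of the compressed stream was reached.
    return data, decoder.eof


# Local demonstration with made-up content, no network involved.
original = b"<html>" + b"precipitation data " * 200 + b"</html>"
compressed = gzip.compress(original)

full, full_ok = gunzip_lenient(compressed)
partial, partial_ok = gunzip_lenient(compressed[:-20])  # simulate truncation
```

With a response where the encoding went undetected, you would presumably pass `r.content` (the raw bytes) to the helper and check the second return value to see whether the stream was cut short.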

The work-around is to tell the server not to bother with compression:

headers = {'Accept-Encoding': 'identity'}
r = requests.get(url, headers=headers)

and an uncompressed response is returned.

Incidentally, on Python 2 the HTTP header parser is not so strict and manages to declare the doctype a header:

>>> pprint(dict(r.headers))
{'<!doctype html public "-//w3c//dtd xhtml 1.0 transitional//en" "dtd/xhtml1-transitional.dtd"><html xmlns="http': '//www.w3.org/1999/xhtml" lang="en-US">',
 'connection': 'Keep-Alive',
 'content-encoding': 'gzip',
 'content-length': '3659',
 'content-type': 'text/html',
 'date': 'Tue, 06 Jan 2015 17:42:06 GMT',
 'keep-alive': 'timeout=5, max=100',
 'server': 'Apache',
 'vary': 'Accept-Encoding'}

and the content-encoding information survives, so there requests decodes the content for you, as expected.

The HTTP headers for this URL have now been fixed.

>>> import requests
>>> print requests.__version__
2.5.1
>>> r = requests.get('http://www.wrcc.dri.edu/WRCCWrappers.py?sodxtrmts+028815+por+por+pcpn+none+mave+5+01+F')
>>> r.text[:100]
u'\n<!DOCTYPE html>\n<HTML>\n<HEAD><TITLE>Monthly Average of Precipitation, Station id: 028815</TITLE></H'
>>> r.headers
{'content-length': '3672', 'content-encoding': 'gzip', 'vary': 'Accept-Encoding', 'keep-alive': 'timeout=5, max=100', 'server': 'Apache', 'connection': 'Keep-Alive', 'date': 'Thu, 12 Feb 2015 18:59:37 GMT', 'content-type': 'text/html; charset=utf-8'}

I'd solve that problem in a simpler way. Just import the html library to decode HTML special characters:

import html
import requests

r = requests.get('http://www.wrcc.dri.edu/WRCCWrappers.py?sodxtrmts+028815+por+por+pcpn+none+mave+5+01+F')

print(html.unescape(r.text))
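For reference, here is a small offline sketch of what `html.unescape` does. The sample string is made up for illustration; the function converts HTML character references back into the characters they stand for and works on text that has already been decoded.

```python
import html

# Made-up sample containing named and numeric character references.
escaped = "Precipitation &amp; Temperature &lt;b&gt;028815&lt;/b&gt; &#176;F"

# unescape replaces each reference with the character it represents.
unescaped = html.unescape(escaped)
print(unescaped)
```

Here `&amp;` becomes `&`, `&lt;` and `&gt;` become angle brackets, and `&#176;` becomes the degree sign.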