Get html using Python requests?
The server in question is giving you a gzipped response. The server is also very broken; it sends the following headers:
$ curl -D - -o /dev/null -s -H 'Accept-Encoding: gzip, deflate' 'http://www.wrcc.dri.edu/WRCCWrappers.py?sodxtrmts+028815+por+por+pcpn+none+mave+5+01+F'
HTTP/1.1 200 OK
Date: Tue, 06 Jan 2015 17:46:49 GMT
Server: Apache
<!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "DTD/xhtml1-transitional.dtd"><html xmlns="http: //www.w3.org/1999/xhtml" lang="en-US">
Vary: Accept-Encoding
Content-Encoding: gzip
Content-Length: 3659
Content-Type: text/html
The <!DOCTYPE..> line there is not a valid HTTP header. As such, the remaining headers past Server are ignored. Why the server interjects that line is unclear; in all likelihood WRCCWrappers.py is a CGI script that doesn't output headers but does include a double newline after the doctype line, duping the Apache server into inserting additional headers there.
As such, requests also doesn't detect that the data is gzip-encoded. The data is all there, you just have to decode it. Or you could, if it wasn't rather incomplete.
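As a rough illustration of decoding the body by hand, here is a minimal sketch using the standard zlib module. It assumes the raw, still-compressed bytes are available (with requests that would typically be r.content when no decoding took place); the helper name gunzip_lenient and the sample payload are made up for the example. A decompression object is used because it returns whatever can be decoded from a truncated stream instead of raising.

```python
import gzip
import zlib


def gunzip_lenient(raw: bytes) -> bytes:
    """Decompress gzip data, returning what can be decoded if the stream is cut short."""
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    return decompressor.decompress(raw)


# Hypothetical payload standing in for the server's compressed HTML.
sample = b"<html><body>" + b"precipitation data " * 50 + b"</body></html>"
compressed = gzip.compress(sample)

full = gunzip_lenient(compressed)
# Simulate an incomplete body by dropping the second half of the stream.
partial = gunzip_lenient(compressed[: len(compressed) // 2])

print(len(full), len(partial))
```

With a complete stream the original bytes come back; with a truncated one only a prefix of them does, which is why the work-around that follows is the more practical route.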
The work-around is to tell the server not to bother with compression:
headers = {'Accept-Encoding': 'identity'}
r = requests.get(url, headers=headers)
and an uncompressed response is returned.
Incidentally, on Python 2 the HTTP header parser is not so strict and manages to declare the doctype a header:
>>> pprint(dict(r.headers))
{'<!doctype html public "-//w3c//dtd xhtml 1.0 transitional//en" "dtd/xhtml1-transitional.dtd"><html xmlns="http': '//www.w3.org/1999/xhtml" lang="en-US">',
 'connection': 'Keep-Alive',
 'content-encoding': 'gzip',
 'content-length': '3659',
 'content-type': 'text/html',
 'date': 'Tue, 06 Jan 2015 17:42:06 GMT',
 'keep-alive': 'timeout=5, max=100',
 'server': 'Apache',
 'vary': 'Accept-Encoding'}
and the content-encoding information survives, so there requests decodes the content for you, as expected.
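The stricter Python 3 behaviour can be reproduced offline. Python 3's http.client hands header parsing to the email package, so feeding a similar header block to email.parser.Parser shows where parsing stops. This is only a sketch: the header text is retyped from the curl output with the doctype line shortened, LF line endings are used for brevity, and requests itself is not involved.

```python
from email.parser import Parser

# Header block modelled on the curl output; the doctype line is abbreviated.
raw_headers = (
    "Date: Tue, 06 Jan 2015 17:46:49 GMT\n"
    "Server: Apache\n"
    '<!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN">\n'
    "Vary: Accept-Encoding\n"
    "Content-Encoding: gzip\n"
    "Content-Length: 3659\n"
    "Content-Type: text/html\n"
    "\n"
)

message = Parser().parsestr(raw_headers)

# Names of the headers the parser accepted before hitting the invalid line.
parsed_names = message.keys()
# None when the header was never parsed.
content_encoding = message["Content-Encoding"]

print(parsed_names)
print(content_encoding)
```

The parser should accept only Date and Server and treat everything from the doctype line onward as body text, so the Content-Encoding lookup comes back empty.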
The HTTP headers for this URL have now been fixed.
>>> import requests
>>> print requests.__version__
2.5.1
>>> r = requests.get('http://www.wrcc.dri.edu/WRCCWrappers.py?sodxtrmts+028815+por+por+pcpn+none+mave+5+01+F')
>>> r.text[:100]
u'\n<!DOCTYPE html>\n<HTML>\n<HEAD><TITLE>Monthly Average of Precipitation, Station id: 028815</TITLE></H'
>>> r.headers
{'content-length': '3672', 'content-encoding': 'gzip', 'vary': 'Accept-Encoding', 'keep-alive': 'timeout=5, max=100', 'server': 'Apache', 'connection': 'Keep-Alive', 'date': 'Thu, 12 Feb 2015 18:59:37 GMT', 'content-type': 'text/html; charset=utf-8'}
I'd solve that problem in a simpler way. Just import the html library to decode HTML special characters:
import html
import requests

r = requests.get('http://www.wrcc.dri.edu/WRCCWrappers.py?sodxtrmts+028815+por+por+pcpn+none+mave+5+01+F')
print(html.unescape(r.text))
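For reference, a small offline sketch of what html.unescape does. The sample string is made up for the example; the function converts named and numeric character references back into characters and works on text that has already been decoded.

```python
import html

# Hypothetical escaped snippet; not taken from the actual page.
sample = "&lt;TITLE&gt;Monthly Average of Precipitation &amp; more&lt;/TITLE&gt;"
decoded = html.unescape(sample)

print(decoded)
```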