Parse HTML table to Python list?

Use an HTML parsing library such as lxml:

from lxml import etree

s = """<table>
  <tr><th>Event</th><th>Start Date</th><th>End Date</th></tr>
  <tr><td>a</td><td>b</td><td>c</td></tr>
  <tr><td>d</td><td>e</td><td>f</td></tr>
  <tr><td>g</td><td>h</td><td>i</td></tr>
</table>"""

# etree.HTML() wraps the fragment in <html><body>, hence the "body/table" path
table = etree.HTML(s).find("body/table")
rows = iter(table)
headers = [col.text for col in next(rows)]  # the first row holds the column names
for row in rows:
    values = [col.text for col in row]
    print(dict(zip(headers, values)))

prints (on Python 3.7+, where dicts keep insertion order)

{'Event': 'a', 'Start Date': 'b', 'End Date': 'c'}
{'Event': 'd', 'Start Date': 'e', 'End Date': 'f'}
{'Event': 'g', 'Start Date': 'h', 'End Date': 'i'}

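Hands down the easiest way to parse an HTML table is pandas.read_html(). It accepts a URL, a file-like object, or a string of HTML (recent pandas versions expect literal HTML to be wrapped in io.StringIO), and it returns one DataFrame per table found.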

import pandas as pd

url = "https://en.wikipedia.org/wiki/List_of_S%26P_500_companies"
tables = pd.read_html(url)  # returns a list with one DataFrame per table on the page
sp500_table = tables[0]     # select the table of interest

The only downside is that read_html() doesn't preserve hyperlinks by default. pandas 1.5 and later add an extract_links parameter for that.
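Where the links matter and pandas is not an option, the standard library's html.parser can keep them. The following is only a minimal sketch: the class name LinkTableParser and the sample markup are made up for illustration, and it assumes every cell and row has an explicit closing tag and that tables are not nested.

```python
from html.parser import HTMLParser


class LinkTableParser(HTMLParser):
    """Collect table cells as (text, href) pairs; href is None without a link.

    Hypothetical helper for illustration, not part of pandas or lxml.
    """

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # cells of the row currently being read
        self._text = None   # text fragments of the cell currently being read
        self._href = None   # first link target seen in the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._text, self._href = [], None
        elif tag == "a" and self._text is not None and self._href is None:
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        # Text outside a cell (e.g. whitespace between tags) is ignored.
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._text is not None:
            self._row.append(("".join(self._text).strip(), self._href))
            self._text = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._text = None


# Made-up sample markup, not taken from the Wikipedia page
sample = """<table>
  <tr><th>Company</th><th>Symbol</th></tr>
  <tr><td><a href="/wiki/Example">Example Corp</a></td><td>EXM</td></tr>
</table>"""

parser = LinkTableParser()
parser.feed(sample)
parser.close()
link_rows = parser.rows
```

Each entry of link_rows is one table row, so a link target can be read next to its cell text, for example link_rows[1][0][1].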

Sven Marnach's excellent solution translates directly to ElementTree, which is part of the Python standard library. Note that ElementTree is an XML parser, so this only works when the table markup is well-formed:

from xml.etree import ElementTree as ET

s = """<table>
  <tr><th>Event</th><th>Start Date</th><th>End Date</th></tr>
  <tr><td>a</td><td>b</td><td>c</td></tr>
  <tr><td>d</td><td>e</td><td>f</td></tr>
  <tr><td>g</td><td>h</td><td>i</td></tr>
</table>"""

table = ET.XML(s)
rows = iter(table)
headers = [col.text for col in next(rows)]
for row in rows:
    values = [col.text for col in row]
    print(dict(zip(headers, values)))

This prints the same output as Sven Marnach's answer.
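Since the question asks for a Python list, the same ElementTree approach can be wrapped in functions that return the rows instead of printing them. This is a sketch only: the function names table_to_rows and table_to_records are made up here, and it assumes well-formed markup with a single table whose first row is the header.

```python
from xml.etree import ElementTree as ET


def table_to_rows(markup):
    """Return every row of a well-formed <table> as a list of cell strings."""
    table = ET.XML(markup)
    # iter("tr") also finds rows nested inside <thead> or <tbody>;
    # itertext() gathers text from nested tags such as <b> or <a>.
    return [
        ["".join(cell.itertext()).strip() for cell in row]
        for row in table.iter("tr")
    ]


def table_to_records(markup):
    """Return the data rows as dicts keyed by the first (header) row."""
    headers, *data = table_to_rows(markup)
    return [dict(zip(headers, row)) for row in data]


sample = """<table>
  <tr><th>Event</th><th>Start Date</th><th>End Date</th></tr>
  <tr><td>a</td><td>b</td><td>c</td></tr>
  <tr><td>d</td><td>e</td><td>f</td></tr>
</table>"""

rows = table_to_rows(sample)        # list of lists, header row first
records = table_to_records(sample)  # list of dicts, one per data row
```

Here rows[0] holds the header cells and records[0]["Event"] is the first data row's event. For HTML that is not well-formed, the lxml version above is the safer starting point.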
