DataFrame of DataFrames in Python (Pandas)

I think that pandas offers better alternatives to what you're suggesting (rationale below).

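For one, there's the pandas.Panel data structure, which was meant for things like what you're doing here. Note, though, that Panel was deprecated in pandas 0.20 and removed in 0.25, so it is only available in older versions.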

However, as Wes McKinney (the Pandas author) noted in his book Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython, multi-dimensional indices, to a large extent, offer a better alternative.

Consider the following alternative to your code:

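import pandas as pd

dfs = []
for year in range(1967, 2014):
    # ....some code that generates df1, df2 and df3 for this year
    # Tag each frame with its year and with a label for where it came from.
    df1['year'] = year
    df1['origin'] = 'df1'
    df2['year'] = year
    df2['origin'] = 'df2'
    df3['year'] = year
    df3['origin'] = 'df3'
    dfs.extend([df1, df2, df3])
# Stack all the per-year frames into one long DataFrame.
df = pd.concat(dfs)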

This gives you a DataFrame with 4 columns: firm, price, year, and origin.

This gives you the flexibility to:

  • Organize hierarchically by, say, year and origin: df.set_index(['year', 'origin']), or by, say, origin and price: df.set_index(['origin', 'price'])

  • Do groupbys according to different levels

  • In general, slice and dice the data in many different ways.
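As a rough sketch of those three points, here is a small example. The data is a made-up stand-in (two years, three origins, two firms each) rather than anything from the question; only the column names firm, price, year and origin come from the answer above.

```python
import pandas as pd

# Hypothetical stand-in data: two years, three origins, two firms each.
frames = []
for year in (2012, 2013):
    for i, origin in enumerate(("df1", "df2", "df3")):
        frames.append(pd.DataFrame({
            "firm": ["A", "B"],
            "price": [10.0 + i, 20.0 + i],
            "year": year,
            "origin": origin,
        }))
df = pd.concat(frames, ignore_index=True)

# Hierarchical index on (year, origin); sorting keeps label lookups fast.
indexed = df.set_index(["year", "origin"]).sort_index()

# All rows for one year, then for one (year, origin) pair.
rows_2013 = indexed.loc[2013]
rows_2013_df2 = indexed.loc[(2013, "df2"), :]

# Mean price per origin, and per (year, origin) pair.
mean_by_origin = df.groupby("origin")["price"].mean()
mean_by_year_origin = df.groupby(["year", "origin"])["price"].mean()

# Cross-section: every year's rows for a single origin.
only_df3 = indexed.xs("df3", level="origin")
```

Because year and origin are ordinary columns until you call set_index, you can pick a different pair of index levels later without rebuilding the data.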

What you're suggesting in the question makes one dimension (origin) arbitrarily different, and it's hard to think of an advantage to this. If a split along some dimension is necessary (due to performance, for example), you can combine DataFrames better with standard Python data structures:

  • A dictionary mapping each year to a DataFrame with the other three dimensions.

  • Three DataFrames, one for each origin, each having three dimensions.
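Both splits can be sketched with a plain dict. The data below is again a made-up stand-in, and the names by_year, by_origin and recombined are only illustrative.

```python
import pandas as pd

# Hypothetical stand-in data with the four columns from the answer.
df = pd.DataFrame({
    "firm": ["A", "B", "A", "B"],
    "price": [10.0, 20.0, 11.0, 21.0],
    "year": [2012, 2012, 2013, 2013],
    "origin": ["df1", "df2", "df1", "df2"],
})

# One DataFrame per year, keyed by year in a plain dict.
by_year = {year: group.drop(columns="year")
           for year, group in df.groupby("year")}

# One DataFrame per origin, built the same way.
by_origin = {origin: group.drop(columns="origin")
             for origin, group in df.groupby("origin")}

# The pieces can be stitched back together; the dict keys become
# the outer index level, named "year" here.
recombined = pd.concat(by_year, names=["year"])
```

Since pd.concat accepts a dict directly, splitting this way does not lock you out of the single-DataFrame layout if you need it again.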
