python – How to read a Parquet file into Pandas DataFrame?

pandas 0.21 introduces new functions for Parquet:

import pandas as pd

pd.read_parquet('example_pa.parquet', engine='pyarrow')

or

pd.read_parquet('example_fp.parquet', engine='fastparquet')

The pandas documentation explains:

These engines are very similar and should read/write nearly identical parquet format files. These libraries differ by having different underlying dependencies (fastparquet by using numba, while pyarrow uses a c-library).

Update: since I wrote this answer, a lot of work has gone into this area. Look at Apache Arrow for better reading and writing of Parquet. See also: http://wesmckinney.com/blog/python-parquet-multithreading/

There is a Python Parquet reader that works relatively well: https://github.com/jcrobak/parquet-python

It creates Python objects that you then have to move into a pandas DataFrame, so the process will be slower than, for example, pd.read_csv.

Aside from pandas, Apache pyarrow also provides a way to read a Parquet file into a DataFrame.

The code is simple:

import pyarrow.parquet as pq

df = pq.read_table(source=your_file_path).to_pandas()

For more information, see "Reading and Writing Single Files" in the Apache pyarrow documentation.
