How to read parquet data from S3 to spark dataframe Python?

You have to use SparkSession instead of sqlContext since Spark 2.0:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
                    .master("local") \
                    .appName("app name") \
                    .config("spark.some.config.option", "true") \
                    .getOrCreate()

df = spark.read.parquet("s3://path/to/parquet/file.parquet")

The file scheme (s3) that you are using is not correct. You'll need to use the s3n scheme, or s3a (for bigger S3 objects):

// use sqlContext instead for Spark < 2
val df = spark.read
              .load("s3n://bucket-name/object-path")
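Since the question asks about Python, switching schemes in PySpark is just a matter of rewriting the URI prefix before calling spark.read.parquet. As a minimal sketch, here is a small helper (to_s3a is a hypothetical name, not a Spark API) that normalizes s3:// or s3n:// URIs to the s3a:// scheme:

```python
from urllib.parse import urlparse, urlunparse

def to_s3a(uri: str) -> str:
    """Rewrite an s3:// or s3n:// URI to use the s3a:// scheme."""
    parts = urlparse(uri)
    if parts.scheme in ("s3", "s3n"):
        # ParseResult is a namedtuple, so _replace returns a modified copy
        parts = parts._replace(scheme="s3a")
    return urlunparse(parts)

# to_s3a("s3://bucket-name/object-path") -> "s3a://bucket-name/object-path"
```

You would then read the file with spark.read.parquet(to_s3a(path)), assuming the s3a connector (the hadoop-aws JAR) is on the classpath.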

I suggest that you read more about the Hadoop-AWS module: Integration with Amazon Web Services Overview.
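The Hadoop-AWS module is also where the s3a credentials configuration lives. As a sketch (the key names come from the Hadoop-AWS documentation; the app name and credential values are placeholders), you can pass them through the session builder with the spark.hadoop. prefix:

```python
from pyspark.sql import SparkSession

# Configuration sketch: supply AWS credentials to the s3a connector
# via Hadoop configuration keys. Replace the placeholder values with
# your own credentials, or rely on the default provider chain instead.
spark = (SparkSession.builder
         .appName("s3a example")
         .config("spark.hadoop.fs.s3a.access.key", "YOUR_ACCESS_KEY")
         .config("spark.hadoop.fs.s3a.secret.key", "YOUR_SECRET_KEY")
         .getOrCreate())
```

In production, prefer instance profiles or environment-based credentials over hard-coding keys in the job.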
