Apache Spark – importing pyspark in Python shell

Assuming one of the following:

  • Spark is downloaded on your system and you have an environment variable SPARK_HOME pointing to it
  • You have run pip install pyspark

Here is a simple method (if you don't care how it works):

Use findspark

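  1. Install findspark from a terminal

    pip install findspark
  2. Go to your Python shell and initialize findspark, which adds pyspark to sys.path

    import findspark
    findspark.init()
  3. Import the necessary modules

    from pyspark import SparkContext
    from pyspark import SparkConf
  4. Done!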

If you see an error like this:

ImportError: No module named py4j.java_gateway

Add $SPARK_HOME/python/build to PYTHONPATH:

export SPARK_HOME=/Users/pzhang/apps/spark-1.1.0-bin-hadoop2.4
export PYTHONPATH=$SPARK_HOME/python/build:$PYTHONPATH
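
Before editing PYTHONPATH, it can help to check which module Python actually fails to find. The following is a minimal sketch using only the standard library; `missing_modules` is a hypothetical helper name for illustration, not part of Spark or py4j.

```python
import importlib.util


def missing_modules(names=("pyspark", "py4j")):
    """Return the top-level module names from `names` that cannot be found.

    Hypothetical helper: it only looks the modules up on sys.path,
    it does not import them.
    """
    missing = []
    for name in names:
        # find_spec returns None when a top-level module is not on sys.path
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing
```

Calling `missing_modules()` in the Python shell returns an empty list once both packages are reachable. If `py4j` is still listed, the PYTHONPATH entry above is probably not pointing at the directory that contains it.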

It turns out that the bin/pyspark script launches Python with the correct library paths already set. Check out $SPARK_HOME/bin/pyspark:

export SPARK_HOME=/some/path/to/apache-spark
# Add the PySpark classes to the Python path:
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH

I added these lines to my .bashrc file and the modules are now correctly found!
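
If you would rather not touch .bashrc, roughly the same effect can be had from inside the Python shell by extending sys.path yourself. This is only a sketch: `add_spark_to_path` is a hypothetical helper, and it assumes the distribution keeps its Python sources in $SPARK_HOME/python and bundles py4j as python/lib/py4j-*-src.zip, which may not match your Spark version.

```python
import glob
import os
import sys


def add_spark_to_path(spark_home):
    """Prepend Spark's Python sources (and any bundled py4j zip) to sys.path.

    Returns the list of entries that were newly added.
    """
    python_dir = os.path.join(spark_home, "python")
    if not os.path.isdir(python_dir):
        raise FileNotFoundError("no python/ directory under " + spark_home)

    # Assumed layout: py4j is shipped as python/lib/py4j-*-src.zip
    py4j_zips = sorted(glob.glob(os.path.join(python_dir, "lib", "py4j-*-src.zip")))

    added = []
    for entry in [python_dir] + py4j_zips:
        if entry not in sys.path:
            sys.path.insert(0, entry)
            added.append(entry)
    return added
```

With SPARK_HOME exported, running `add_spark_to_path(os.environ["SPARK_HOME"])` before `from pyspark import SparkContext` should make the import resolve for the current session only. Unlike the .bashrc approach, nothing persists once the shell exits.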

