Install Jupyter Notebook
$ pip3 install jupyter
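You can check that the install worked with:
$ jupyter --version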
Install PySpark
Make sure you have Java 8 or higher installed on your computer, then visit the Spark download page.
Select the latest Spark release, a prebuilt package for Hadoop, and download it directly.
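If you prefer to stay in the terminal, you can verify your Java version and download the archive directly; the URL below points at the 2.4.0 release used in this guide and is only an example, so adjust it to match the release you picked:
$ java -version
$ curl -O https://archive.apache.org/dist/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz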

Unzip it and move it to your /opt folder:
$ tar -xzf spark-2.4.0-bin-hadoop2.7.tgz
$ sudo mv spark-2.4.0-bin-hadoop2.7 /opt/spark-2.4.0
A symbolic link is like a shortcut from one file to another: its contents are the path of the actual file or folder being linked to. Create one for Spark (this will let you keep multiple Spark versions):
$ sudo ln -s /opt/spark-2.4.0 /opt/spark
Check that the link was indeed created:
$ ls -l /opt/spark
lrwxr-xr-x 1 root wheel 16 Dec 26 15:08 /opt/spark -> /opt/spark-2.4.0
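Later, if you install another Spark release (the 2.4.3 path below is only a hypothetical example), you can switch versions by repointing the link instead of touching your configuration:
$ sudo ln -sfn /opt/spark-2.4.3 /opt/spark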
Finally, tell your shell where to find Spark. To find out which shell you are using, type:
$ echo $SHELL
/bin/bash
If it prints /bin/bash, edit your bash profile (if you are on zsh, add the same lines to ~/.zshrc instead):
$ nano ~/.bash_profile
Configure your $PATH variables by adding the following lines to your ~/.bash_profile file:
export SPARK_HOME=/opt/spark
export PATH=$SPARK_HOME/bin:$PATH
# For Python 3, you have to add the line below or you will get an error
export PYSPARK_PYTHON=python3
Now to run PySpark in Jupyter you’ll need to update the PySpark driver environment variables. Just add these lines to your ~/.bash_profile file:
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
Your ~/.bash_profile file may now look like this:
export SPARK_HOME=/opt/spark
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
Restart your terminal, or simply source the file to pick up the changes in your current session.
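For example:
$ source ~/.bash_profile
Then launch PySpark: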
$ pyspark
This command should start a Jupyter Notebook in your web browser. Create a new notebook by clicking on ‘New’ > ‘Notebooks Python [default]’.
Running PySpark in Jupyter Notebook
The PySpark context can be created (or retrieved, if one already exists) with:
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
To check that your notebook is initialized with a SparkContext, try the following code in your notebook:
from pyspark import SparkContext
import numpy as np

sc = SparkContext.getOrCreate()

TOTAL = 10000
# Sample TOTAL points uniformly from the square [-1, 1] x [-1, 1]
dots = sc.parallelize([2.0 * np.random.random(2) - 1.0 for i in range(TOTAL)]).cache()
print("Number of random points:", dots.count())

stats = dots.stats()
print('Mean:', stats.mean())
print('stdev:', stats.stdev())
The result: the notebook prints the number of random points (10000), followed by the mean and standard deviation reported by stats().

Running PySpark in your favorite IDE
Sometimes you need a full IDE to create more complex code, and PySpark isn’t on sys.path by default, but that doesn’t mean it can’t be used as a regular library. You can address this by adding PySpark to sys.path at runtime. The package findspark does that for you.
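If you are curious what findspark automates, here is a minimal sketch of the idea, assuming SPARK_HOME points at your install; the py4j zip name changes between Spark versions, so the glob below is an assumption:
import glob
import os
import sys

# Add the PySpark sources bundled with the Spark install to sys.path
spark_home = os.environ.get("SPARK_HOME", "/opt/spark")
sys.path.insert(0, os.path.join(spark_home, "python"))
# py4j is the Java<->Python bridge PySpark relies on; its version differs per release
sys.path.insert(0, glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*-src.zip"))[0])

import pyspark  # now importable without pip-installing pyspark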
To install findspark just type:
$ pip3 install findspark
Then, in your IDE (I use Eclipse with PyDev), initialize PySpark by calling:
import findspark
findspark.init()

import pyspark

sc = pyspark.SparkContext(appName="myAppName")
Here is a full example of a standalone application to test PySpark locally:
import findspark
findspark.init()

import random

from pyspark import SparkContext

sc = SparkContext(appName="EstimatePi")

def inside(p):
    # Draw a random point in the unit square and test whether it lands inside the unit circle
    x, y = random.random(), random.random()
    return x*x + y*y < 1

NUM_SAMPLES = 1000000
count = sc.parallelize(range(0, NUM_SAMPLES)) \
          .filter(inside).count()
print("Pi is roughly %f" % (4.0 * count / NUM_SAMPLES))
sc.stop()
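If you save this script to a file (estimate_pi.py is just an example name), you can also run it from the terminal with spark-submit instead of the IDE:
$ spark-submit estimate_pi.py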
The result: a line such as "Pi is roughly 3.14..." (the exact digits vary from run to run).

Enjoy!