Click on the name of the role that is attached to your cluster’s Amazon Elastic Compute Cloud (Amazon EC2) instances (for example, EMR_EC2_DefaultRole) and click Attach policies.
Attach the AmazonSNSFullAccess policy to the role. This policy allows the cluster to publish SNS notifications based on state changes in your Amazon EMR cluster.
A summary page is presented with the message “Policy AmazonSNSFullAccess has been attached for the EMR_EC2_DefaultRole.”
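If you prefer to script this step, the same attachment can be done with boto3. This is a minimal sketch, not the only way to do it; it assumes your credentials have IAM permissions and reuses the role name from the example above:
import boto3

# Attach the AWS-managed AmazonSNSFullAccess policy to the cluster's EC2 role
iam = boto3.client("iam")
iam.attach_role_policy(
    RoleName="EMR_EC2_DefaultRole",
    PolicyArn="arn:aws:iam::aws:policy/AmazonSNSFullAccess",
)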
Choose Event Pattern, then Build event pattern to match events by service. For Service Name, choose the service that emits the event that triggers the rule. For Event Type, choose the specific event that triggers the rule.
For Targets, choose Add Target and choose the AWS service that should act when an event of the selected type is detected.
Choose Configure details. For Rule definition, type a name and description for the rule.
Once the rule is created, the message “Rule emr_state_change_SNS was created.” is displayed.
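If you would rather create the rule programmatically, here is a minimal boto3 sketch of one possible equivalent configuration. The SNS topic ARN and target Id are placeholders; the rule name matches the one above:
import boto3

events = boto3.client("events")
sns_topic_arn = "arn:aws:sns:us-east-1:123456789012:emr-notifications"  # placeholder topic ARN

# Rule that matches Amazon EMR cluster state-change events
events.put_rule(
    Name="emr_state_change_SNS",
    EventPattern='{"source": ["aws.emr"], "detail-type": ["EMR Cluster State Change"]}',
    State="ENABLED",
    Description="Notify an SNS topic when the EMR cluster changes state",
)

# Send matching events to the SNS topic
events.put_targets(
    Rule="emr_state_change_SNS",
    Targets=[{"Id": "emr-sns-target", "Arn": sns_topic_arn}],
)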
Make sure you have Java 8 or higher installed on your computer, then visit the Spark download page.
Select the latest Spark release, a prebuilt package for Hadoop, and download it directly.
Extract it and move it to your /opt folder:
$ tar -xzf spark-2.4.0-bin-hadoop2.7.tgz
$ sudo mv spark-2.4.0-bin-hadoop2.7 /opt/spark-2.4.0
A symbolic link is like a shortcut from one file to another. The contents of a symbolic link are the address of the actual file or folder that is being linked to.
Create a symbolic link (this will let you have multiple Spark versions):
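$ sudo ln -s /opt/spark-2.4.0 /opt/spark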
Finally, tell your shell where to find Spark. To find out what shell you are using, type:
$ echo $SHELL
/bin/bash
Assuming bash, edit your bash profile:
$ nano ~/.bash_profile
Configure your $PATH variables by adding the following lines to your ~/.bash_profile file:
export SPARK_HOME=/opt/spark
export PATH=$SPARK_HOME/bin:$PATH
# For Python 3, you have to add the line below or you will get an error
export PYSPARK_PYTHON=python3
Now to run PySpark in Jupyter you’ll need to update the PySpark driver environment variables. Just add these lines to your ~/.bash_profile file:
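export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'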
Restart your terminal (or just source your ~/.bash_profile) and launch PySpark:
$ pyspark
This command should start a Jupyter Notebook in your web browser. Create a new notebook by clicking on ‘New’ > ‘Notebooks Python [default]’.
Running PySpark in Jupyter Notebook
The PySpark context can be obtained with:
sc = SparkContext.getOrCreate()
To check that your notebook is initialized with a SparkContext, you can try the following code in your notebook:
from pyspark import SparkContext
import numpy as np

# Reuse the SparkContext created by the pyspark driver (or create one)
sc = SparkContext.getOrCreate()

TOTAL = 10000
# TOTAL random points in the square [-1, 1] x [-1, 1], kept as a cached RDD
dots = sc.parallelize([2.0 * np.random.random(2) - 1.0 for i in range(TOTAL)]).cache()
print("Number of random points:", dots.count())

# Element-wise statistics of the point coordinates
stats = dots.stats()
print('Mean:', stats.mean())
print('stdev:', stats.stdev())
The result shows the number of random points (10,000), followed by the element-wise mean and standard deviation of the generated coordinates.
Running PySpark in your favorite IDE
Sometimes you need a full IDE to create more complex code, and PySpark isn’t on sys.path by default, but that doesn’t mean it can’t be used as a regular library. You can address this by adding PySpark to sys.path at runtime. The package findspark does that for you.
To install findspark just type:
$ pip3 install findspark
Then, in your IDE (I use Eclipse and PyDev), initialize PySpark by calling:
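import findspark
findspark.init()  # locates your Spark installation (via SPARK_HOME) and adds PySpark to sys.path

import pyspark
sc = pyspark.SparkContext(appName="myAppName")  # the app name here is just an example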