How to run PySpark 2.4.0 in Jupyter Notebook on Mac

Install Jupyter notebook

$ pip3 install jupyter

Install PySpark

Make sure you have Java 8 or higher installed on your computer and visit the Spark download page

Select the latest Spark release, a prebuilt package for Hadoop, and download it directly.

Unzip it and move it to your /opt folder:

$ tar -xzf spark-2.4.0-bin-hadoop2.7.tgz
$ sudo mv spark-2.4.0-bin-hadoop2.7 /opt/spark-2.4.0

A symbolic link is like a shortcut from one file to another. The contents of a symbolic link are the address of the actual file or folder that is being linked to.

Create a symbolic link (this will let you have multiple spark versions):

$ sudo ln -s /opt/spark-2.4.0 /opt/spark̀

Check that the link was indeed created

$ ls -l /opt/spark̀

lrwxr-xr-x 1 root wheel 16 Dec 26 15:08 /opt/spark̀ -> /opt/spark-2.4.0

Finally, tell your bash where to find Spark. To find what shell you are using, type:

$ echo $SHELL
/bin/bash

To do so, edit your bash file:

$ nano ~/.bash_profile

configure your $PATH variables by adding the following lines to your ~/.bash_profile file:

export SPARK_HOME=/opt/spark
export PATH=$SPARK_HOME/bin:$PATH
# For python 3, You have to add the line below or you will get an error
export PYSPARK_PYTHON=python3

Now to run PySpark in Jupyter you’ll need to update the PySpark driver environment variables. Just add these lines to your ~/.bash_profile file:

export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'

Your ~/.bash_profile file may look like this:

Restart (our just source) your terminal and launch PySpark:

$ pyspark

This command should start a Jupyter Notebook in your web browser. Create a new notebook by clicking on ‘New’ > ‘Notebooks Python [default]’.

Running PySpark in Jupyter Notebook

The PySpark context can be

sc = SparkContext.getOrCreate()

To check if your notebook is initialized with SparkContext, you could try the following codes in your notebook:

sc = SparkContext.getOrCreate()
import numpy as np
TOTAL = 10000
dots = sc.parallelize([2.0 * np.random.random(2) - 1.0 for i in range(TOTAL)]).cache()
print("Number of random points:", dots.count())
stats = dots.stats()
print('Mean:', stats.mean())
print('stdev:', stats.stdev())

The result:

Running PySpark in your favorite IDE

Sometimes you need a full IDE to create more complex code, and PySpark isn’t on sys.path by default, but that doesn’t mean it can’t be used as a regular library. You can address this by adding PySpark to sys.path at runtime. The package findspark does that for you.

To install findspark just type:

$ pip3 install findspark

And then on your IDE (I use Eclipse and Pydev) to initialize PySpark, just call:

import findspark
findspark.init()
import pyspark
sc = pyspark.SparkContext(appName="myAppName")

Here is a full example of a standalone application to test PySpark locally 

import findspark
findspark.init()
import random
from pyspark import SparkContext
sc = SparkContext(appName="EstimatePi")
def inside(p):
x, y = random.random(), random.random()
return x<em>x + y</em>y &lt; 1
NUM_SAMPLES = 1000000
count = sc.parallelize(range(0, NUM_SAMPLES)) \
.filter(inside).count()
print("Pi is roughly %f" % (4.0 * count / NUM_SAMPLES))
sc.stop()

The result:

Enjoy!

Based on this article and on this article

Error when executing `jupyter notebook` (-bash: jupiter: command not found) [Mac]

After installing jupyter

pip3 install --upgrade pip
pip3 install jupyter

and trying to launch 

jupyter notebook

the following error message appeared

-bash: jupyter: command not found

The solution:  

pip3 install --upgrade --force-reinstall --no-cache-dir jupyter 

Fighting Digital Payments Fraud with Deep Learning

Interesting presentation today at the DataScience SG meet-up

Conventional fraud prevention methods are rule based, expansive and slow to implement

Q1 2016: $5 of every $100 subject to fraud attack!

Key fraud types: account takeover, friendly fraud & fraud due to stolen card information

Consumers want: easy, instant, customized, mobile and dynamic options to suit their personal situation. Consumers do NOT want to be part of the fraud detection process.

Key technology enablers:

Historically fraud detection systems have relied on rues hand-curated by fraud experts to catch fraudulent activity.

An auto-encoder is a neural network trained to reconstruct its inputs, which forces a hidden layer to try and to learn good representations of the input

Kaggle dataset:

Train Autoencoder on normal transactions and using the Autoencoder transformation there is now a clear separation between the normal and the fraudulent transactions.

The Secret Recipe Behind GO-FOOD’s Recommendations (PyData Meetup)

The December PyData Meetup started with Luis Smith, Data Scientist at GO-JEK, sharing the Secret Recipe Behind GO-FOOD’s Recommendations:

“For GO-FOOD, we believe the key to unlocking good recommendations is to derive vector representations for our users, dishes, and merchants. This way we are able to capture our users’ food preferences and recommend them the most relevant merchants and dishes.”

How do people think about the food?

  • Flavor profile
  • Trendy
  • Value for money
  • Portion size
  • Ingredients

… and much more

The preferred approach is to let the transactional data discover the pattern.

A sample ETL workflow:

Using StarSpace to learn the vector representations:

Go-Jek formulation of the problem:

User-to-dish similarity is surfaced in the app via the “dishes you might like”. The average vector of customer’s purchases represents the recommended dish.

Due to data sparsity, item-based collaborative filtering is used for merchant recommendation.

The cold start problem is still an issue, for inactive users or users that purchase infrequently.

(published here)

Understanding the Unpacking Operators (* and **) in Python 3.x

The * operator unpack the arguments out of a list or tuple.
> args = [3, 6]
> list(range(*args))
[3, 4, 5]

As an example, when we have a list of three arguments, we can use the * operator inside a function call to unpack it into the three arguments:

def f(a,b,c):
    print('a={},b={},c={}'.format(a,b,c))

> z = ['I','like','Python']
> f(*z)
a=I,b=like,c=Python

> z = [['I','really'],'like','Python']
> f(*z)
a=['I', 'really'],b=like,c=Python

In Python 3 it is possible to use the operator * on the left side of an assignment, allowing to specify a “catch-all” name which will be assigned a list of all items not assigned to a “regular” name:

> a, *b, c = range(5)
> a
0
> c
4
> b
[1, 2, 3]

The ** operator can be used to unpack a dictionary of arguments as a collection of keyword arguments. Calling the same function f that we defined above:

 

> d = {'c':'Python','b':'like', 'a':'I'}
> f(**d)
a=I,b=like,c=Python

and when there is a missing argument in the dictionary (‘a’ in this example),  the following error message will be printed:


> d2 = {'c':'Python','b':'like'}
> f(**d2)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: f() missing 1 required positional argument: 'a'

 
Tried with: Python 3.6.5

Installing Wand (0.4) and ImageMagick v6 on Mac (macOS High Sierra v 10.13.5)

wizard

ImageMagick® is used to create, edit, compose, or convert bitmap images. It can read and write images in a variety of formats (over 200) including PNG, JPEG, GIF, HEIC, TIFF, DPX, EXR, WebP, Postscript, PDF, and SVG. Use ImageMagick to resize, flip, mirror, rotate, distort, shear and transform images, adjust image colors, apply various special effects, or draw text, lines, polygons, ellipses and Bézier curves.

Wand is a ctypes-based simple ImageMagick binding for Python, so go through the step-by-step guide on how to install it.

Let’s start by installing ImageMagic:

brew install imagemagick@6

Next, create a symbolic link, with the following command (replace <your specific 6 version> with your specific version):

ln -s /usr/local/Cellar/imagemagick@6/<your specific 6 version>/lib/libMagickWand-6.Q16.dylib /usr/local/lib/libMagickWand.dylib

In my case, it was:

ln -s /usr/local/Cellar/imagemagick@6/6.9.10-0/lib/libMagickWand-6.Q16.dylib /usr/local/lib/libMagickWand.dylib

Let’s install Wand

pip3 install Wand

Now, let’s try to run the code

from wand.image import Image

with Image(filename=sourceFullPathFilename) as img:
img.save(filename=targetFilenameFull)

Unfortunately, I got the following error message:

wand.exceptions.DelegateError: FailedToExecuteCommand `’gs’ -sstdout=%stderr -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 ‘-sDEVICE=pngalpha’ -dTextAlphaBits=4 -dGraphicsAlphaBits=4 ‘-r72x72’ ‘-sOutputFile=/var/folders/n7/9xyh2rj14qvf3hrmr7g9b4gm0000gp/T/magick-31607l23fY21KEi6b%d’ ‘-f/var/folders/n7/9xyh2rj14qvf3hrmr7g9b4gm0000gp/T/magick-31607_nNNZjiBBusp’ ‘-f/var/folders/n7/9xyh2rj14qvf3hrmr7g9b4gm0000gp/T/magick-31607Zfemn9tWrdiY” (1) @ error/pdf.c/InvokePDFDelegate/292
Exception ignored in: <bound method Resource.__del__ of <wand.image.Image: (empty)>>

It seems that ghostscript is not installed by default, so let’s install it:

brew install ghostscript

Now we will need to create a soft link to /usr/bin, but /usr/bin/ in OS X 10.11+ is protected.

Just follow these steps:

1. Reboot to Recovery Mode. Reboot and hold “Cmd + R” after start sound.
2. In Recovery Mode go to Utilities -> Terminal.
3. Run: csrutil disable
4. Reboot in Normal Mode.
5. Do the “sudo ln -s /usr/local/bin/gs /usr/bin/gs” in terminal.
6. Do the 1 and 2 step. In terminal enable back csrutil by run: csrutil enable

(based on this)

Now it works – Enjoy!

 

Introduction to Survival Analysis

Introduction

Survival analysis is generally defined as a set of methods for analysing data where the outcome variable is the time until the occurrence of an event of interest. For example, if the event of interest is heart attack, then the survival time can be the time in years until a person develops a heart attack. For simplicity, we will adopt the terminology of survival analysis, referring to the event of interest as ‘death’ and to the waiting time as ‘survival’ time, but this technique has much wider applicability. The event can be death, occurrence of a disease, marriage, divorce, etc. The time to event or survival time can be measured in days, weeks, years, etc.

The specific difficulties relating to survival analysis arise largely from the fact that only some individuals have experienced the event and, subsequently, survival times will be unknown for a subset of the study group. This phenomenon is called censoring.

In longitudinal studies exact survival time is only known for those individuals who show the event of interest during the follow-up period. For others (those who are disease free at the end of the observation period or those that were lost) all we can say is that they did not show the event of interest during the follow-up period. These individuals are called censored observations. An attractive feature of survival analysis is that we are able to include the data contributed by censored observations right up until they are removed from the risk set.

Survival and Hazard

T  –  a non-negative random variable representing the waiting time until the occurrence of an event.

The survival function, S(t), of an individual is the probability that they survive until at least time t, where t is a time of interest and T is the time of event.

F001

The survival curve is non-increasing (the event may not reoccur for an individual) and is limited within [0,1].

survival-graph-crop

F(t) – the probability that the event has occurred by duration t:

F002

the probability density function (p.d.f.) f(t):

F003

An alternative characterisation of the distribution of T is given by the hazard function, or instantaneous rate of occurrence of the event, defined as

F004

The numerator of this expression is the conditional probability that the event will occur in the interval [t,t+dt] given that it has not occurred before, and the denominator is the width of the interval. Dividing one by the other we obtain a rate of event occurrence per unit of time. Taking the limit as the width of the interval goes down to zero, we obtain an instantaneous rate of occurrence.

Applying Bayes’ Rule

F005

on the numerator of the hazard function:

F006

Given that the event happened between time t to t+dt, the conditional probability of this event happening after time t is 1:

F007

Dividing by dt and passing to the limit gives the useful result:

F008

In words, the rate of occurrence of the event at duration t equals the density of events at t, divided by the probability of surviving to that duration without experiencing the event.

We will soon show that there is a one-to-one relation between the hazard and the survival function.

The derivative of S(t) is:

F009

We will now show that the hazard function is the derivative of -log S(t):

F010

If we now integrate from 0 to time t:

F011

F012

F013

 and introduce the boundary condition S(0) = 1 (since the event is sure not to have occurred by duration 0):

F014

F015

we can solve the above expression to obtain a formula for the probability of surviving to duration t as a function of the hazard at all durations up to t:

F016

One approach to estimating the survival probabilities is to assume that the hazard function follow a specific mathematical distribution. Models with increasing hazard rates may arise when there is natural aging or wear. Decreasing hazard functions are much less common but find occasional use when there is a very early likelihood of failure, such as in certain types of electronic devices or in patients experiencing certain types of transplants. Most often, a bathtub-shaped hazard is appropriate in populations followed from birth.

The figure below hows the relationship between four parametrically specified hazards and the corresponding survival probabilities. It illustrates (a) a constant hazard rate over time (e.g. healthy persons) which is analogous to an exponential distribution of survival times, (b) strictly increasing (c) decreasing hazard rates based on a Weibull model, and (d) a combination of decreasing and increasing hazard rates using a log-Normal model. These curves are illustrative examples and other shapes are possible.

different_hazard_functions

Example

The simplest possible survival distribution is obtained by assuming a constant risk over time:

survival-constant-risk

Censoring and truncation

One of the distinguishing feature of the field of survival analysis is censoring: observations are called censored when the information about their survival time is incomplete; the most commonly encountered form is right censoring.

censor_truncation

Right censoring occurs when a subject leaves the study before an event occurs, or the study ends before the event has occurred. For example, we consider patients in a clinical trial to study the effect of treatments on stroke occurrence. The study ends after 5 years. Those patients who have had no strokes by the end of the year are censored. Another example of right censoring is when a person drops out of the study before the end of the study observation time and did not experience the event. This person’s survival time is said to be censored, since we know that the event of interest did not happen while this person was under observation.

Left censoring is when the event of interest has already occurred before enrolment. This is very rarely encountered.

In a truncated sample, we do not even “pick up” observations that lie outside a certain range.

Unlike ordinary regression models, survival methods correctly incorporate information from both censored and uncensored observations in estimating important model parameters

Non-parametric Models

The very simplest survival models are really just tables of event counts: non-parametric, easily computed and a good place to begin modelling to check assumptions, data quality and end-user requirements etc. When no event times are censored, a non-parametric estimator of S(t) is 1 − F(t), where F(t) is the empirical cumulative distribution function.

Kaplan–Meier

When some observations are censored, we can estimate S(t) using the Kaplan-Meier product-limit estimator. An important advantage of the Kaplan–Meier curve is that the method can take into account some types of censored data, particularly right-censoring, which occurs if a patient withdraws from a study, is lost to follow-up, or is alive without event occurrence at last follow-up.

Suppose that 100 subjects of a certain type were tracked over a period of time to determine how many survived for one year, two years, three years, and so forth. If all the subjects remained accessible throughout the entire length of the study, the estimation of year-by-year survival probabilities for subjects of this type in general would be an easy matter. The survival of 87 subjects at the end of the first year would give a one-year survival probability estimate of 87/100=0.87; the survival of 76 subjects at the end of the second year would yield a two-year estimate of 76/100=0.76; and so forth.

But in real-life longitudinal research it rarely works out this neatly. Typically there are subjects lost along the way (censored) for reasons unrelated to the focus of the study.

Suppose that 100 subjects of a certain type were tracked over a period of two years determine how many survived for one year and for two years. Of the 100 subjects who are “at risk” at the beginning of the study, 3 become unavailable (censored) during the first year and 3 are known to have died by the end of the first year. Another 2 become unavailable during the second year and another 10 are known to have died by the end of the second year.

KM_experiment_table_died

Kaplan and Meier proposed that subjects who become unavailable during a given time period be counted among those who survive through the end of that period, but then deleted from the number who are at risk for the next time period.

The table below shows how these conventions would work out for the present example. Of the 100 subjects who are at risk at the beginning of the study, 3 become unavailable during the first year and 3 die. The number surviving the first year (Year 1) is therefore 100 (at risk) – 3 (died) = 97 and the number at risk at the beginning of the second year (Year 2) is 100 (at risk) – 3 (died) – 3 (unavailable) = 94. Another 2 subjects become unavailable during the second year and another 10 die. So the number surviving Year 2 is 94 (at risk) – 10 (died) = 84.

KM_experiment_table_survived

As illustrated in the next table, the Kaplan-Meier procedure then calculates the survival probability estimate for each of the t time periods, except the first, as a compound conditional probability.

KM_experiment_table

The estimate for surviving through Year 1 is simply 97/100=0.97. And if one does survive through Year 1, the conditional probability of then surviving through Year 2 is 84/94=0.8936. The estimated probability of surviving through both Year 1 and Year 2 is therefore (97/100) x (84/94)=0.8668.

Incorporating covariates: proportional hazards models

Up to now we have not had information for each individual other than the survival time and censoring status ie. we have not considered information such as the weight, age, or smoking status of individuals, for example. These are referred to as covariates or explanatory variables.

Cox Proportional Hazards Modelling

The most interesting survival-analysis research examines the relationship between survival — typically in the form of the hazard function — and one or more explanatory variables (or covariates).

F017

where λ0(t) is the non-parametric baseline hazard function and βx is a linear parametric model using features of the individuals, transformed by an exponential function. The baseline hazard function λ0(t) does not need to be specified for the Cox model, making it semi-parametric. The baseline hazard function is appropriately named because it describes the risk at a certain time when x = 0, which is when the features are not incorporated. The hazard function describes the relationship between the baseline hazard and features of a specific sample to quantify the hazard or risk at a certain time.

The model only needs to satisfy the proportional hazard assumption, which is that the hazard of one sample is proportional to the hazard of another sample. Two samples xi and xj satisfy this assumption when the ratio is not dependent on time as shown below:

F018

The parameters can be estimated by maximizing the partial likelihood.

 

Sources:
https://www.cscu.cornell.edu/news/statnews/stnews78.pdf
https://www.nature.com/articles/6601118#t2
http://blog.applied.ai/survival-analysis-part1/#fn:3
http://data.princeton.edu/wws509/notes/c7s1.html
https://en.wikipedia.org/wiki/Kaplan%E2%80%93Meier_estimator
http://www.stats.ox.ac.uk/~mlunn/lecturenotes1.pdf
Kaplan-Meier methods and Parametric Regression methods, Kristin Sainani Ph.D.
http://vassarstats.net/survival.html
http://www.mas.ncl.ac.uk/~nmf16/teaching/mas3311/week09.pdf

Using Deep Neural Networks for NLP Applications – MAS

Really enjoyed visiting the Monetary Authority of Singapore (MAS) and talking on the applications of Deep Neural Networks for Natural Language Processing (NLP).

IMG_0993

During the talk, there were some great questions from the audience, one of them was “can a character level  model capture the unique structure of words and sentences? ” The answer is YES, and I hope that the demo, showing a three-layers 512-units LSTM model trained on publicly-available Regulatory and Supervisory Framework documents downloaded from the MAS website, predicting the next character and repeating it many times, helped to clarify the answer.

MAS Video Capture

Training the same model on Shakespeare’s works and running both models side by side was fun!  

LSTM

 

Install MongoDB Community Edition and PyMongo on OS X

  • Install Homebew, a free and open-source software package management system that simplifies the installation of software on Apple’s macOS operating system.

/usr/bin/ruby -e “$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)

  • Ensure that you’re running the newest version of Homebrew and that it has the newest list of formulae available from the main repository

brew update

  • To install the MongoDB binaries, issue the following command in a system shell:

brew install mongodb

  • Create a data directory (-p create nested directories, but only if they don’t exist already)

mkdir -p ./data/db

  • Before running mongodb for the first time, ensure that the user account running mongodb has read and write permissions for the directory

sudo chmod 765 data

  • Run MongoDB

mongod –dbpath data/db

  • To stop MongoDB, press Control+C in the terminal where the mongo instance is running

Install PyMongo

pip install pymongo

  • In a Python interactive shell:

import pymongo

from pymongo import MongoClient

RoboMongo

  • Create a Connection

client = MongoClient()

  • Access Database Objects

MongoDB creates new databases implicitly upon their first use.

db = client.test

  • Query for All Documents in a Collection

cursor = db.restaurants.find()

for document in cursor: print(document)

  • Query by a Top Level Field

cursor = db.restaurants.find({“borough”: “Manhattan”})

for document in cursor: print(document)

  • Query by a Field in an Embedded Document

cursor = db.restaurants.find({“address.zipcode”: “10075”})

for document in cursor: print(document)

  • Query by a Field in an Array

cursor = db.restaurants.find({“grades.grade”: “B”})

for document in cursor: print(document)

 

  • Insert a Document

Insert a document into a collection named restaurants. The operation will create the collection if the collection does not currently exist.

result = db.restaurants.insert_one(

{

“address”: {            “street”: “2 Avenue”,            “zipcode”: “10075”,            “building”: “1480”,            “coord”: [-73.9557413, 40.7720266]        },

“borough”: “Manhattan”,

“cuisine”: “Italian”,

“grades”: [

{                “date”: datetime.strptime(“2014-10-01”, “%Y-%m-%d“),                “grade”: “A”,                “score”: 11            },

{                “date”: datetime.strptime(“2014-01-16”, “%Y-%m-%d“),                “grade”: “B”,                “score”: 17            }        ],

“name”: “Vella”,

“restaurant_id”: “41704620”

})

result.inserted_id