# Causal Inference – What If

Judea Pearl, a pioneering figure in artificial intelligence, argues that AI has been stuck in a decades-long rut. His prescription for progress? Teach machines to understand the question why.

All the impressive achievements of deep learning amount to just curve fitting

Judea Pearl

Yoshua Bengio added, in a recent interview:

Now, Bengio says deep learning needs to be fixed. He believes it won’t realize its full potential, and won’t deliver a true AI revolution, until it can go beyond pattern recognition and learn more about cause and effect. In other words, he says, deep learning needs to start asking why things happen.

https://www.wired.com/story/ai-pioneer-algorithms-understand-why/

When we look at observational metrics, our Machine Learning models are doing great at predicting a certain outcome given a treatment, but they are good at exactly that and not at the counterfactual: what the outcome would have been given no treatment.

Causal Inference by Miguel Hernán and James Robins provides a great introduction to causal inference. You can download the latest draft from their website.

The book is divided in three parts of increasing difficulty: Part I is about causal inference without models (i.e., nonparametric identification of causal effects), Part II is about causal inference with models (i.e., estimation of causal effects with parametric models), and Part III is about causal inference from complex longitudinal data (i.e., estimation of causal effects of time-varying treatments).

Here are the top four reasons why I think it’s a great book:

#### Detailed introduction to the key concepts including many examples

The first four chapters (a definition of causal effect, randomised experiments, observational studies and effect modification) cover key concepts such as potential outcomes (the outcome variable that would have been observed under a certain treatment value), individual and average causal effects, randomisation, identifiability conditions, exchangeability, positivity and consistency. You will get to know Zeus’s extended family, with many examples covering their various health conditions and treatment options. As an example, Table 1.1 shows the counterfactual outcomes (die or not) under both treatment (a = 1, a heart transplant) and no treatment (a = 0). Providing practical examples along with the definitions helps cement the learning by identifying the key attributes associated with each concept.
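
To make these definitions concrete, here is a minimal sketch in Python, with made-up counterfactual outcomes in the spirit of Table 1.1 (the names and values are illustrative):

import pandas as pd

# Hypothetical counterfactual outcomes (1 = death, 0 = survival):
# Y_a1 = outcome under treatment (a = 1, a heart transplant)
# Y_a0 = outcome under no treatment (a = 0)
family = pd.DataFrame({
    "name": ["Zeus", "Hera", "Artemis", "Apollo"],
    "Y_a1": [1, 0, 1, 0],
    "Y_a0": [0, 0, 1, 1],
})

# Individual causal effect: Y_a1 - Y_a0 for each person
family["effect"] = family["Y_a1"] - family["Y_a0"]

# Average causal effect: Pr[Y_a1 = 1] - Pr[Y_a0 = 1]
ate = family["Y_a1"].mean() - family["Y_a0"].mean()
print(family)
print("Average causal effect:", ate)

In real data only one of the two counterfactual outcomes is observed for each person, which is exactly why the identifiability conditions discussed in the book are needed.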

#### Practical approach

Starting from the introduction, the authors are quite clear about their goals:

Importantly, this is not a philosophy book. We remain agnostic about metaphysical concepts like causality and cause. Rather, we focus on the identification and estimation of causal effects in populations, that is, numerical quantities that measure changes in the distribution of an outcome under different interventions. For example, we discuss how to estimate the mortality risk in patients with serious heart failure if they received a heart transplant versus if they did not receive a heart transplant. Our main goal is to help decision makers make better decisions.

Introduction: Towards less casual causal inferences

On top of that, the book comes with a large number of code examples in both R and Python, covering the first two parts, including chapters 11-17. It would be great to see additional code examples covering part three (causal inference from complex longitudinal data). Download the notebooks, launch

jupyter notebook

and start playing with the code.

#### The validity of causal inference models

The authors discuss a large number of non-parametric and parametric techniques and algorithms to calculate causal effects. But they keep reminding us that all of these techniques rely on untestable assumptions and on expert knowledge. As an example:

Unfortunately, no matter how many variables are included in L, there is no way to test that the assumption (conditional exchangeability) is correct, which makes causal inference from observational data a risky task. The validity of causal inferences requires that the investigators’ expert knowledge is correct

and

Causal inference generally requires expert knowledge and untestable assumptions about the causal network linking treatment, outcome, and other variables.

#### A (geeky) sense of humor

Technical books tend to be concise and dry; telling an anecdote or adding a joke can make difficult content more enjoyable and understandable.

As an example, when discussing the potential outcomes of the heart transplant treatment in Zeus’s extended family, here is how the authors introduced the issue of sampling variability:

At this point you could complain that our procedure to compute effect measures is somewhat implausible. Not only did we ignore the well known fact that the immortal Zeus cannot die, but more to the point – our population in Table 1.1 had only 20 individuals.

Chapter 1.4

As another example, chapter 7 introduces the topic of confounding variables using an observational study designed to answer the causal question “does one’s looking up to the sky make other pedestrians look up too?”. The plot develops and new details are shared in chapter 8 (selection bias), chapter 9 (measurement bias) and chapter 10 (random variability), until the authors announce the following:

Do not worry. No more chapter introductions around the effect of your looking up on other people’s looking up. We squeezed that example well beyond what seemed possible

Chapter 11

I hope that you will find this book useful and that you will enjoy learning about Causal Inference as much as I did!

# How to fix “Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA”

After installing TensorFlow using pip3 install:

sudo pip3 install tensorflow

I’ve received the following warning message:

I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA

Advanced Vector Extensions (AVX) are extensions to the x86 instruction set architecture for microprocessors from Intel and AMD proposed by Intel in March 2008 and first supported by Intel with the Sandy Bridge processor shipping in Q1 2011 and later on by AMD with the Bulldozer processor shipping in Q3 2011. AVX provides new features, new instructions, and a new coding scheme.

AVX introduces fused multiply-accumulate (FMA) operations, which speed up linear algebra computations such as dot products, matrix multiplication and convolutions. Almost all machine-learning training involves a great deal of these operations, and hence runs faster (by up to 300%) on a CPU that supports AVX and FMA.
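
As a rough illustration of the kind of operation these instructions accelerate, here is a tiny NumPy sketch; the matrix sizes are arbitrary:

import time
import numpy as np

# Matrix multiplication is dominated by multiply-add operations,
# exactly what AVX/FMA accelerate
a = np.random.rand(2000, 2000).astype("float32")
b = np.random.rand(2000, 2000).astype("float32")

start = time.time()
c = a @ b
print("2000x2000 matmul took %.3f seconds" % (time.time() - start))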

We won’t ignore the warning message; instead, we will compile TF from source. First, uninstall the existing packages:

sudo pip3 uninstall protobuf
sudo pip3 uninstall tensorflow

In a temp folder, clone TensorFlow and check out the release branch:

git clone https://github.com/tensorflow/tensorflow
cd tensorflow
git checkout r2.0

Install the TensorFlow pip package dependencies:

pip3 install -U --user pip six numpy wheel setuptools mock 'future>=0.17.1'
pip3 install -U --user keras_applications==1.0.6 --no-deps
pip3 install -U --user keras_preprocessing==1.0.5 --no-deps

Download the Bazel installer for macOS, make it executable, run it, add it to your PATH and verify the installation:

chmod +x bazel-0.26.0-installer-darwin-x86_64.sh
./bazel-0.26.0-installer-darwin-x86_64.sh --user
export PATH="$PATH:$HOME/bin"
bazel version

Configure your system build by running the following at the root of your TensorFlow source tree:

./configure

The TensorFlow build options expose flags to enable building for platform-specific CPU instruction sets. Use bazel to make the TensorFlow package builder with CPU-only support, enabling the AVX, AVX2, FMA and SSE4.2 instruction sets:

bazel build -c opt --copt=-mavx --copt=-mavx2 --copt=-mfma --copt=-msse4.2 //tensorflow/tools/pip_package:build_pip_package

The bazel build command creates an executable named build_pip_package, the program that builds the pip package. Run the executable as shown below to build a .whl package in the /tmp/tensorflow_pkg directory.

To build from a release branch:

./bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg

The output wheel file is in /tmp/tensorflow_pkg.

You can then install the generated wheel file directly:

pip3 install /tmp/tensorflow_pkg/tensorflow-2.0.0b1-cp37-cp37m-macosx_10_14_x86_64.whl

cd out of that directory; now running the following should not produce any warning:

python3 -c "import tensorflow as tf;print(tf.reduce_sum(tf.random.normal([1000, 1000])))"

Enjoy!

# Learn the Basics of Git and Version Control

## Introduction

There are four fundamental elements in the Git Workflow.
Working Directory, Staging Area, Local Repository and Remote Repository.

If you consider a file in your Working Directory, it can be in three possible states.

1. It can be staged. Which means the files with the updated changes are marked to be committed to the local repository but not yet committed.
2. It can be modified. Which means the files with the updated changes are not yet stored in the local repository.
3. It can be committed. Which means that the changes you made to your file are safely stored in the local repository.
• git add is a command used to add a file that is in the working directory to the staging area.
• git commit is a command used to add all files that are staged to the local repository.
• git push is a command used to add all committed files in the local repository to the remote repository. So in the remote repository, all files and changes will be visible to anyone with access to the remote repository.
• git fetch is a command used to get files from the remote repository to the local repository but not into the working directory.
• git merge is a command used to get the files from the local repository into the working directory.
• git pull is a command used to get files from the remote repository directly into the working directory. It is equivalent to a git fetch followed by a git merge.
First, check that Git is installed:

git --version

and review your global configuration:

git config --global --list

Check your machine for existing SSH keys:

ls -al ~/.ssh

If you already have an SSH key, you can skip the next step of generating a new SSH key.

## Generating a new SSH key and adding it to the ssh-agent

ssh-keygen -t rsa -b 4096 -C "your_email@example.com"


When adding your SSH key to the agent, use the default macOS ssh-add command. Start the ssh-agent in the background:

eval "$(ssh-agent -s)"

Then add your SSH private key to the ssh-agent:

ssh-add -K ~/.ssh/id_rsa

To add a new SSH key to your GitHub account, copy the SSH key to your clipboard:

pbcopy < ~/.ssh/id_rsa.pub


Then, on GitHub, go to Settings > SSH and GPG keys and click “New SSH key”. In the “Title” field, add a descriptive label for the new key. For example, if you’re using a personal Mac, you might call this key “Personal MacBook Air”.

Paste your key into the “Key” field.

Finally, test your SSH connection:

ssh -T git@github.com


## Let’s Git

Now, navigate in your terminal to the folder you want to place under Git, and initialize a new repository:

git init

Create a README file:

echo "# testGit" >> README.md

Now to add the files to the git repository for commit:

git add .

git status

Commit the staged files to the local repository:

git commit -m "First commit"

Run git status again to confirm that the working directory is clean:

git status

### Add a remote origin and Push:

Now, each time you make changes to your files and save them, they won’t be automatically updated on GitHub. So far, all the changes we made are stored only in the local repository.

To add a new remote, use the git remote add command on the terminal, in the directory your repository is stored at.

The git remote add command takes two arguments:

• A remote name, for example, origin
• A remote URL, for example, https://github.com/user/repo.git

Now, to connect your local repository to the remote repository on GitHub:

git remote add origin https://github.com/ofirsh/testGit.git
git remote -v


Now the git push command pushes the changes in your local repository up to the remote repository you specified as the origin.

git push -u origin master


And now if we go and check our https://github.com/ofirsh/testGit repository page on GitHub, we should see the newly pushed README.md file.

Once you start making changes to your files and save them, the files won’t match the last version that was committed to Git.

Let’s modify README.md by adding some more text.

To see the changes you just made:

git diff


#### Markers for changes

--- a/README.md
+++ b/README.md


These lines are a legend that assigns symbols to each diff input source: a/README.md (the old version of the file) is marked with ---, and b/README.md (the new version) is marked with +++.

#### Diff chunks

The remaining diff output is a list of diff ‘chunks’. A diff only displays the sections of the file that have changes. In our current example, we only have one chunk as we are working with a simple scenario. Chunks have their own granular output semantics.
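
For instance, after adding one line to README.md, the single chunk might look like this (the added text is illustrative):

@@ -1 +1,2 @@
 # testGit
+Some new line of text

The @@ header says the chunk covers line 1 of the old file and lines 1-2 of the new file; lines prefixed with a space are unchanged context, and lines prefixed with + were added.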

### Revert to the last committed version in the Git repo:

Now you can choose to revert to the last committed version by entering:

git checkout .


### View Commit History:

You can use the git log command to see the history of commits you made to your files:

git log

# Install PySpark

Make sure you have Java 8 or higher installed on your computer, and visit the Spark download page to download the latest Spark release.

Unzip it and move it to your /opt folder:

$ tar -xzf spark-2.4.0-bin-hadoop2.7.tgz
$ sudo mv spark-2.4.0-bin-hadoop2.7 /opt/spark-2.4.0

A symbolic link is like a shortcut from one file to another. The contents of a symbolic link are the address of the actual file or folder that is being linked to.

Create a symbolic link (this will let you have multiple spark versions):

$ sudo ln -s /opt/spark-2.4.0 /opt/spark

Check that the link was indeed created:

$ ls -l /opt/spark
lrwxr-xr-x  1 root  wheel  16 Dec 26 15:08 /opt/spark -> /opt/spark-2.4.0

Finally, tell your bash where to find Spark. To find what shell you are using, type:

$ echo $SHELL
/bin/bash

To do so, edit your bash file:

$ nano ~/.bash_profile

Configure your $PATH variables by adding the following lines to your ~/.bash_profile file:

export SPARK_HOME=/opt/spark
export PATH=$SPARK_HOME/bin:$PATH
# For python 3, you have to add the line below or you will get an error
export PYSPARK_PYTHON=python3

Now to run PySpark in Jupyter you’ll need to update the PySpark driver environment variables. Just add these lines to your ~/.bash_profile file:

export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'


Restart (or just source) your terminal and launch PySpark:

$ pyspark

This command should start a Jupyter Notebook in your web browser. Create a new notebook by clicking on ‘New’ > ‘Notebooks Python [default]’.

### Running PySpark in Jupyter Notebook

The PySpark context can be created with sc = SparkContext.getOrCreate(). To check if your notebook is initialized with SparkContext, you could try the following code in your notebook:

from pyspark import SparkContext
sc = SparkContext.getOrCreate()

import numpy as np
TOTAL = 10000
dots = sc.parallelize([2.0 * np.random.random(2) - 1.0 for i in range(TOTAL)]).cache()
print("Number of random points:", dots.count())
stats = dots.stats()
print('Mean:', stats.mean())
print('stdev:', stats.stdev())

### Running PySpark in your favorite IDE

Sometimes you need a full IDE to create more complex code, and PySpark isn’t on sys.path by default, but that doesn’t mean it can’t be used as a regular library. You can address this by adding PySpark to sys.path at runtime. The package findspark does that for you. To install findspark just type:

$ pip3 install findspark

And then on your IDE (I use Eclipse and Pydev) to initialize PySpark, just call:

import findspark
findspark.init()
import pyspark
sc = pyspark.SparkContext(appName="myAppName")


Here is a full example of a standalone application to test PySpark locally:

import findspark
findspark.init()

import random
from pyspark import SparkContext

sc = SparkContext(appName="EstimatePi")

# Sample random points in the unit square; the fraction that falls
# inside the quarter circle approximates pi/4
def inside(p):
    x, y = random.random(), random.random()
    return x*x + y*y < 1

NUM_SAMPLES = 1000000
count = sc.parallelize(range(0, NUM_SAMPLES)) \
    .filter(inside).count()
print("Pi is roughly %f" % (4.0 * count / NUM_SAMPLES))
sc.stop()


The result is an estimate of Pi, close to 3.14.

Enjoy!

# Error when executing jupyter notebook (-bash: jupyter: command not found) [Mac]

After installing jupyter:

pip3 install --upgrade pip
pip3 install jupyter

and trying to launch

jupyter notebook

the following error message appeared

-bash: jupyter: command not found

The solution:

pip3 install --upgrade --force-reinstall --no-cache-dir jupyter

# Fighting Digital Payments Fraud with Deep Learning

Interesting presentation today at the DataScience SG meet-up

Conventional fraud prevention methods are rule-based, expensive and slow to implement.

Q1 2016: $5 of every $100 subject to fraud attack!

Key fraud types: account takeover, friendly fraud & fraud due to stolen card information

Consumers want: easy, instant, customized, mobile and dynamic options to suit their personal situation. Consumers do NOT want to be part of the fraud detection process.

Key technology enablers:

Historically, fraud detection systems have relied on rules hand-curated by fraud experts to catch fraudulent activity.

An autoencoder is a neural network trained to reconstruct its inputs, which forces a hidden layer to learn good representations of the input.

Kaggle dataset:

Train the autoencoder on normal transactions only; after applying the autoencoder transformation, there is a clear separation between the normal and the fraudulent transactions.
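
A minimal sketch of this idea, assuming a Keras setup and made-up transaction features (the 29-feature width and layer sizes are illustrative assumptions, not the presenter’s actual model):

import numpy as np
from tensorflow import keras

# Hypothetical data: rows are transactions, columns are normalized features
normal_tx = np.random.rand(1000, 29).astype("float32")

# A small autoencoder: compress 29 features into 8 and reconstruct them
autoencoder = keras.Sequential([
    keras.layers.Dense(16, activation="relu", input_shape=(29,)),
    keras.layers.Dense(8, activation="relu"),  # bottleneck representation
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(29, activation="linear"),
])
autoencoder.compile(optimizer="adam", loss="mse")

# Train on normal transactions only: the network learns to reconstruct
# normal patterns well
autoencoder.fit(normal_tx, normal_tx, epochs=10, batch_size=64, verbose=0)

# Fraudulent transactions reconstruct poorly, so a high reconstruction
# error flags a transaction as suspicious
def reconstruction_error(x):
    return np.mean((x - autoencoder.predict(x)) ** 2, axis=1)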

# The Secret Recipe Behind GO-FOOD’s Recommendations (PyData Meetup)

The December PyData Meetup started with Luis Smith, Data Scientist at GO-JEK, sharing the Secret Recipe Behind GO-FOOD’s Recommendations:

“For GO-FOOD, we believe the key to unlocking good recommendations is to derive vector representations for our users, dishes, and merchants. This way we are able to capture our users’ food preferences and recommend them the most relevant merchants and dishes.”

How do people think about food?

• Flavor profile
• Trendy
• Value for money
• Portion size
• Ingredients

… and much more

The preferred approach is to let the transactional data reveal the patterns.

A sample ETL workflow:

Using StarSpace to learn the vector representations:

Go-Jek formulation of the problem:

User-to-dish similarity is surfaced in the app via the “dishes you might like” section. A user is represented by the average vector of the dishes they purchased, and the dishes closest to that vector are recommended.
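
Here is a minimal sketch of that lookup, with made-up dish vectors standing in for the learned StarSpace embeddings (dish names, dimensions and values are hypothetical):

import numpy as np

# Hypothetical embeddings: dish -> vector learned by StarSpace
dish_vectors = {
    "nasi goreng": np.array([0.9, 0.1, 0.0]),
    "satay":       np.array([0.8, 0.2, 0.1]),
    "salad":       np.array([0.0, 0.9, 0.4]),
}

def user_vector(purchased):
    # A user is represented by the average vector of their purchases
    return np.mean([dish_vectors[d] for d in purchased], axis=0)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def dishes_you_might_like(purchased, k=2):
    u = user_vector(purchased)
    scores = {d: cosine(u, v) for d, v in dish_vectors.items() if d not in purchased}
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(dishes_you_might_like(["nasi goreng"]))  # closest dishes first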

Due to data sparsity, item-based collaborative filtering is used for merchant recommendation.

The cold-start problem is still an issue for inactive users or users who purchase infrequently.

(published here)