Learn the Basics of Git and Version Control

Introduction

There are four fundamental elements in the Git workflow: the Working Directory, the Staging Area, the Local Repository and the Remote Repository.

If you consider a file in your Working Directory, it can be in three possible states.

  1. It can be staged, which means the file with its updated changes is marked to be committed to the local repository but is not yet committed.
  2. It can be modified, which means the file has updated changes that are not yet stored in the local repository.
  3. It can be committed, which means the changes you made to your file are safely stored in the local repository.
  • git add is a command used to add a file that is in the working directory to the staging area.
  • git commit is a command used to add all files that are staged to the local repository.
  • git push is a command used to add all committed files in the local repository to the remote repository. So in the remote repository, all files and changes will be visible to anyone with access to the remote repository.
  • git fetch is a command used to get files from the remote repository to the local repository but not into the working directory.
  • git merge is a command used to get the files from the local repository into the working directory.
  • git pull is a command used to get files from the remote repository directly into the working directory. It is equivalent to a git fetch followed by a git merge.
First, verify that Git is installed and check your global configuration:

git --version
git config --global --list

Check your machine for existing SSH keys:

ls -al ~/.ssh

If you already have an SSH key, you can skip the next step of generating a new SSH key.

Generating a new SSH key and adding it to the ssh-agent

ssh-keygen -t rsa -b 4096 -C "your_email@example.com"

When adding your SSH key to the agent, use the default macOS ssh-add command, after starting the ssh-agent in the background.

Adding a new SSH key to your GitHub account

To add a new SSH key to your GitHub account, copy the SSH key to your clipboard:

pbcopy < ~/.ssh/id_rsa.pub

In your GitHub account settings, open “SSH and GPG keys” and choose “New SSH key”. In the “Title” field, add a descriptive label for the new key. For example, if you’re using a personal Mac, you might call this key “Personal MacBook Air”.

Paste your key into the “Key” field.

After you’ve set up your SSH key and added it to your GitHub account, you can test your connection:

ssh -T git@github.com

Let’s Git

Create a new repository on GitHub. Follow this link.
Now, navigate in your terminal to the folder you want to place under Git, and initialize it with git init if it is not already a Git repository.

echo "# testGit" >> README.md

Now add the files to the staging area, ready to be committed:

git add . 
git status

Now commit the files you added to your local repository:

git commit -m "First commit"
git status

Add a remote origin and Push:

Now, each time you make changes to your files and save them, they won’t automatically be updated on GitHub. All the changes we made so far are stored only in the local repository.

To add a new remote, use the git remote add command on the terminal, in the directory your repository is stored at.

The git remote add command takes two arguments:

  • A remote name, for example, origin
  • A remote URL, for example, https://github.com/user/repo.git

Now add the remote and verify it:

git remote add origin https://github.com/ofirsh/testGit.git
git remote -v

Now the git push command pushes the changes in your local repository up to the remote repository you specified as the origin.

git push -u origin master

And now if we go and check our https://github.com/ofirsh/testGit repository page on GitHub it should look something like this:

See the Changes you made to your file:

Once you start making changes on your files and you save them, the file won’t match the last version that was committed to git.

Let’s modify README.md to include the following text:

To see the changes you just made:

git diff

Markers for changes

--- a/README.md
+++ b/README.md

These lines are a legend that assigns symbols to each diff input source. In this case, changes from a/README.md are marked with a --- and the changes from b/README.md are marked with the +++ symbol.

Diff chunks

The remaining diff output is a list of diff ‘chunks’. A diff only displays the sections of the file that have changes. In our current example, we only have one chunk as we are working with a simple scenario. Chunks have their own granular output semantics.

Revert to the last committed version in the Git repo:

Now you can choose to discard your changes and revert to the last committed version by entering:

git checkout .

View Commit History:

You can use the git log command to see the history of commits you made to your files:

git log

Now make a second commit and push it to the remote repository:

echo 'testGit #2' > README.md 
git add .
git commit -m 'second commit'
git push origin master

Pushing Changes to the Git Repo:

Now you can work on the files you want and commit the changes locally. If you want to push changes to that repository you either have to be added as a collaborator for the repository or you have to create something known as a pull request. Go and check out how to do one here and give me a pull request with your code file.

So to make sure that changes are reflected on my local copy of the repo:

git pull origin master

Two more useful commands:

git fetch
git merge

In the simplest terms, git fetch followed by a git merge equals a git pull. But then why do these exist?

When you use git pull, Git tries to automatically do your work for you. It is context sensitive, so Git will merge any pulled commits into the branch you are currently working in. git pull automatically merges the commits without letting you review them first.

When you git fetch, Git gathers any commits from the target branch that do not exist in your current branch and stores them in your local repository. However, it does not merge them with your current branch. This is particularly useful if you need to keep your repository up to date, but are working on something that might break if you update your files. To integrate the commits into your master branch, you use git merge.

Pull Request

Pull requests let you tell others about changes you’ve pushed to a GitHub repository. Once a pull request is sent, interested parties can review the set of changes, discuss potential modifications, and even push follow-up commits if necessary.


Pull requests are GitHub’s way of modeling that you’ve made commits to a copy of a repository, and you’d like to have them incorporated in someone else’s copy. Usually the way this works is like so:

  1. Lady Ada publishes a repository of code to GitHub.
  2. Brennen uses Lady Ada’s repo, and decides to fix a bug or add a feature.
  3. Brennen forks the repo, which means copying it to his GitHub account, and clones that fork to his computer.
  4. Brennen changes his copy of the repo, makes commits, and pushes them up to GitHub.
  5. Brennen submits a pull request to the original repo, which includes a human-readable description of the changes.
  6. Lady Ada decides whether or not to merge the changes into her copy.

Creating a Pull Request

There are two main workflows when dealing with pull requests:

  1. Pull Request from a forked repository
  2. Pull Request from a branch within a repository

Here we are going to focus on the second one.

Creating a Topical Branch

First, we will need to create a branch from the latest commit on master. Make sure your repository is up to date first using

git pull origin master

To create a branch, use git checkout -b <new-branch-name> [<base-branch-name>], where base-branch-name is optional and defaults to master. I’m going to create a new branch called pull-request-demo from the master branch and push it to GitHub.

git checkout -b pull-request-demo
git status
git push origin pull-request-demo

Now you can see the two branches in the GitHub web interface.

Make some changes to README.md:

echo "test git #3 pull-request-demo" >> README.md
cat README.md

Commit the changes:

git add README.md
git commit -m 'commit to pull-request-demo'

…and push your new commit back up to your copy of the repo on GitHub:

git push --set-upstream origin pull-request-demo

Back to the web interface:

You can press the “Compare” button, and now you can create the pull request:

Go ahead and click the big green “Create Pull Request” button. You’ll get a form with space for a title and longer description:

Like most text inputs on GitHub, the description can be written in GitHub Flavored Markdown. Fill it out with a description of your changes. If you especially want a user’s attention in the pull request, you can use the “@username” syntax to mention them (just like on Twitter).

GitHub has a handy guide to writing the perfect pull request that you may want to read before submitting work to other repositories, but for now a description like the one I wrote should be ok. You can see my example pull request here.

Press the green “Create pull request” button:

And now, press the “Merge pull request” button:

Confirm merge:

Switch your local repo back to master:

git checkout master
git pull origin master

And now the local repo is pointing to master and contains the merged files.

Enjoy!

BTW please find below a nice Git cheat sheet

References:

Learn the Basics of Git in Under 10 Minutes

Pull Request Tutorial

Submitting a Pull Request on GitHub

Fighting Digital Payments Fraud with Deep Learning

Interesting presentation today at the DataScience SG meet-up

Conventional fraud prevention methods are rule based, expensive and slow to implement.

Q1 2016: $5 of every $100 subject to fraud attack!

Key fraud types: account takeover, friendly fraud & fraud due to stolen card information

Consumers want: easy, instant, customized, mobile and dynamic options to suit their personal situation. Consumers do NOT want to be part of the fraud detection process.

Key technology enablers:

Historically, fraud detection systems have relied on rules hand-curated by fraud experts to catch fraudulent activity.

An auto-encoder is a neural network trained to reconstruct its inputs, which forces a hidden layer to learn good representations of the input.

Kaggle dataset:

Train the autoencoder on normal transactions only; after applying the autoencoder transformation there is a clear separation between the normal and the fraudulent transactions.
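To make the idea concrete, here is a minimal sketch of the approach (my own illustration, not the code shown at the meet-up): train a small autoencoder on normal transactions only, then flag transactions whose reconstruction error is unusually high. The data below is random stand-in data; in practice it would be the feature-scaled Kaggle credit-card transactions.

import numpy as np
from tensorflow.keras import layers, models

rng = np.random.default_rng(0)
# Stand-in data: rows are feature-scaled transactions.
X_normal = rng.normal(size=(5000, 29)).astype("float32")   # assumed legitimate, used for training
X_test = rng.normal(size=(1000, 29)).astype("float32")     # unseen transactions to score
n_features = X_normal.shape[1]

# A small autoencoder: compress to a narrow bottleneck and reconstruct the input.
autoencoder = models.Sequential([
    layers.Dense(16, activation="relu", input_shape=(n_features,)),
    layers.Dense(4, activation="relu"),        # bottleneck representation
    layers.Dense(16, activation="relu"),
    layers.Dense(n_features, activation="linear"),
])
autoencoder.compile(optimizer="adam", loss="mse")

# Train to reconstruct normal transactions only.
autoencoder.fit(X_normal, X_normal, epochs=10, batch_size=256, verbose=0)

# Score unseen transactions by reconstruction error;
# fraudulent transactions tend to reconstruct poorly.
errors = np.mean(np.square(X_test - autoencoder.predict(X_test)), axis=1)
suspected_fraud = errors > np.percentile(errors, 99)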

The Secret Recipe Behind GO-FOOD’s Recommendations (PyData Meetup)

The December PyData Meetup started with Luis Smith, Data Scientist at GO-JEK, sharing the Secret Recipe Behind GO-FOOD’s Recommendations:

“For GO-FOOD, we believe the key to unlocking good recommendations is to derive vector representations for our users, dishes, and merchants. This way we are able to capture our users’ food preferences and recommend them the most relevant merchants and dishes.”

How do people think about the food?

  • Flavor profile
  • Trendy
  • Value for money
  • Portion size
  • Ingredients

… and much more

The preferred approach is to let the transactional data discover the pattern.

A sample ETL workflow:

Using StarSpace to learn the vector representations:

Go-Jek formulation of the problem:

User-to-dish similarity is surfaced in the app via “dishes you might like”. The average vector of a customer’s purchases is used to find the recommended dishes.
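A toy sketch of that lookup (the dish names, vectors and dimensions below are made up for illustration; the real embeddings would come from StarSpace):

import numpy as np

# Hypothetical learned embeddings: dish name -> vector.
dish_vectors = {
    "nasi goreng": np.array([0.9, 0.1, 0.0]),
    "mie ayam":    np.array([0.8, 0.2, 0.1]),
    "es teler":    np.array([0.1, 0.9, 0.3]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# The customer is represented by the average vector of their past purchases.
purchases = ["nasi goreng", "mie ayam"]
user_vector = np.mean([dish_vectors[d] for d in purchases], axis=0)

# "Dishes you might like": rank all dishes by similarity to the user vector.
ranked = sorted(dish_vectors, key=lambda d: cosine(user_vector, dish_vectors[d]), reverse=True)
print(ranked)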

Due to data sparsity, item-based collaborative filtering is used for merchant recommendation.

The cold start problem is still an issue for inactive users or users who purchase infrequently.

(published here)

Introduction to Survival Analysis

Introduction

Survival analysis is generally defined as a set of methods for analysing data where the outcome variable is the time until the occurrence of an event of interest. For example, if the event of interest is heart attack, then the survival time can be the time in years until a person develops a heart attack. For simplicity, we will adopt the terminology of survival analysis, referring to the event of interest as ‘death’ and to the waiting time as ‘survival’ time, but this technique has much wider applicability. The event can be death, occurrence of a disease, marriage, divorce, etc. The time to event or survival time can be measured in days, weeks, years, etc.

The specific difficulties relating to survival analysis arise largely from the fact that only some individuals have experienced the event and, subsequently, survival times will be unknown for a subset of the study group. This phenomenon is called censoring.

In longitudinal studies exact survival time is only known for those individuals who show the event of interest during the follow-up period. For others (those who are disease free at the end of the observation period or those that were lost) all we can say is that they did not show the event of interest during the follow-up period. These individuals are called censored observations. An attractive feature of survival analysis is that we are able to include the data contributed by censored observations right up until they are removed from the risk set.

Survival and Hazard

T  –  a non-negative random variable representing the waiting time until the occurrence of an event.

The survival function, S(t), of an individual is the probability that they survive until at least time t, where t is a time of interest and T is the time of event.

S(t) = \Pr(T > t)

The survival curve is non-increasing (the event may not reoccur for an individual) and is limited within [0,1].

survival-graph-crop

F(t) – the probability that the event has occurred by duration t:

F(t) = \Pr(T \le t) = 1 - S(t)

the probability density function (p.d.f.) f(t):

f(t) = \frac{dF(t)}{dt}

An alternative characterisation of the distribution of T is given by the hazard function, or instantaneous rate of occurrence of the event, defined as

\lambda(t) = \lim_{dt \to 0} \frac{\Pr(t \le T < t + dt \mid T \ge t)}{dt}

The numerator of this expression is the conditional probability that the event will occur in the interval [t,t+dt] given that it has not occurred before, and the denominator is the width of the interval. Dividing one by the other we obtain a rate of event occurrence per unit of time. Taking the limit as the width of the interval goes down to zero, we obtain an instantaneous rate of occurrence.

Applying Bayes’ Rule

\Pr(A \mid B) = \frac{\Pr(B \mid A)\,\Pr(A)}{\Pr(B)}

on the numerator of the hazard function:

\Pr(t \le T < t + dt \mid T \ge t) = \frac{\Pr(T \ge t \mid t \le T < t + dt)\,\Pr(t \le T < t + dt)}{\Pr(T \ge t)}

Given that the event happened between time t and t + dt, the conditional probability that it happened at or after time t is 1:

\Pr(T \ge t \mid t \le T < t + dt) = 1, \qquad \text{so} \qquad \Pr(t \le T < t + dt \mid T \ge t) = \frac{\Pr(t \le T < t + dt)}{\Pr(T \ge t)} \approx \frac{f(t)\,dt}{S(t)}

Dividing by dt and passing to the limit gives the useful result:

\lambda(t) = \frac{f(t)}{S(t)}

In words, the rate of occurrence of the event at duration t equals the density of events at t, divided by the probability of surviving to that duration without experiencing the event.

We will soon show that there is a one-to-one relation between the hazard and the survival function.

The derivative of S(t) is:

\frac{dS(t)}{dt} = \frac{d}{dt}\bigl(1 - F(t)\bigr) = -f(t)

We will now show that the hazard function is the derivative of -log S(t):

-\frac{d}{dt}\log S(t) = -\frac{S'(t)}{S(t)} = \frac{f(t)}{S(t)} = \lambda(t)

If we now integrate from 0 to time t:

\int_0^t \lambda(u)\,du = -\int_0^t \frac{d}{du}\log S(u)\,du

= -\bigl[\log S(t) - \log S(0)\bigr]

= \log S(0) - \log S(t)

 and introduce the boundary condition S(0) = 1 (since the event is sure not to have occurred by duration 0):

\log S(0) = \log 1 = 0

\int_0^t \lambda(u)\,du = -\log S(t)

we can solve the above expression to obtain a formula for the probability of surviving to duration t as a function of the hazard at all durations up to t:

S(t) = \exp\left(-\int_0^t \lambda(u)\,du\right)

One approach to estimating the survival probabilities is to assume that the hazard function follows a specific mathematical distribution. Models with increasing hazard rates may arise when there is natural aging or wear. Decreasing hazard functions are much less common but find occasional use when there is a very early likelihood of failure, such as in certain types of electronic devices or in patients experiencing certain types of transplants. Most often, a bathtub-shaped hazard is appropriate in populations followed from birth.
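To make this concrete, a standard example (added here for illustration): a Weibull model with shape k and scale b has hazard

\lambda(t) = \frac{k}{b}\left(\frac{t}{b}\right)^{k-1}

which is increasing for k > 1, decreasing for k < 1, and constant (the exponential case) for k = 1.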

The figure below shows the relationship between four parametrically specified hazards and the corresponding survival probabilities. It illustrates (a) a constant hazard rate over time (e.g. healthy persons), which is analogous to an exponential distribution of survival times, (b) strictly increasing and (c) decreasing hazard rates based on a Weibull model, and (d) a combination of decreasing and increasing hazard rates using a log-Normal model. These curves are illustrative examples and other shapes are possible.

different_hazard_functions

Example

The simplest possible survival distribution is obtained by assuming a constant risk over time:

survival-constant-risk
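With a constant hazard \lambda(t) = \lambda, the general relation derived above gives (a standard result, spelled out here for completeness)

S(t) = \exp\left(-\int_0^t \lambda\,du\right) = e^{-\lambda t}

so survival times follow an exponential distribution with mean 1/\lambda.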

Censoring and truncation

One of the distinguishing features of the field of survival analysis is censoring: observations are called censored when the information about their survival time is incomplete; the most commonly encountered form is right censoring.

censor_truncation

Right censoring occurs when a subject leaves the study before an event occurs, or the study ends before the event has occurred. For example, consider patients in a clinical trial to study the effect of treatments on stroke occurrence. The study ends after 5 years. Those patients who have had no strokes by the end of the study are censored. Another example of right censoring is when a person drops out of the study before the end of the study observation time and did not experience the event. This person’s survival time is said to be censored, since we know that the event of interest did not happen while this person was under observation.

Left censoring is when the event of interest has already occurred before enrolment. This is very rarely encountered.

In a truncated sample, we do not even “pick up” observations that lie outside a certain range.

Unlike ordinary regression models, survival methods correctly incorporate information from both censored and uncensored observations in estimating important model parameters.

Non-parametric Models

The very simplest survival models are really just tables of event counts: non-parametric, easily computed and a good place to begin modelling to check assumptions, data quality and end-user requirements etc. When no event times are censored, a non-parametric estimator of S(t) is 1 − F(t), where F(t) is the empirical cumulative distribution function.

Kaplan–Meier

When some observations are censored, we can estimate S(t) using the Kaplan-Meier product-limit estimator. An important advantage of the Kaplan–Meier curve is that the method can take into account some types of censored data, particularly right-censoring, which occurs if a patient withdraws from a study, is lost to follow-up, or is alive without event occurrence at last follow-up.

Suppose that 100 subjects of a certain type were tracked over a period of time to determine how many survived for one year, two years, three years, and so forth. If all the subjects remained accessible throughout the entire length of the study, the estimation of year-by-year survival probabilities for subjects of this type in general would be an easy matter. The survival of 87 subjects at the end of the first year would give a one-year survival probability estimate of 87/100=0.87; the survival of 76 subjects at the end of the second year would yield a two-year estimate of 76/100=0.76; and so forth.

But in real-life longitudinal research it rarely works out this neatly. Typically there are subjects lost along the way (censored) for reasons unrelated to the focus of the study.

Suppose that 100 subjects of a certain type were tracked over a period of two years to determine how many survived for one year and for two years. Of the 100 subjects who are “at risk” at the beginning of the study, 3 become unavailable (censored) during the first year and 3 are known to have died by the end of the first year. Another 2 become unavailable during the second year and another 10 are known to have died by the end of the second year.

KM_experiment_table_died

Kaplan and Meier proposed that subjects who become unavailable during a given time period be counted among those who survive through the end of that period, but then deleted from the number who are at risk for the next time period.

The table below shows how these conventions would work out for the present example. Of the 100 subjects who are at risk at the beginning of the study, 3 become unavailable during the first year and 3 die. The number surviving the first year (Year 1) is therefore 100 (at risk) – 3 (died) = 97 and the number at risk at the beginning of the second year (Year 2) is 100 (at risk) – 3 (died) – 3 (unavailable) = 94. Another 2 subjects become unavailable during the second year and another 10 die. So the number surviving Year 2 is 94 (at risk) – 10 (died) = 84.

KM_experiment_table_survived

As illustrated in the next table, the Kaplan-Meier procedure then calculates the survival probability estimate for each of the t time periods, except the first, as a compound conditional probability.

KM_experiment_table

The estimate for surviving through Year 1 is simply 97/100=0.97. And if one does survive through Year 1, the conditional probability of then surviving through Year 2 is 84/94=0.8936. The estimated probability of surviving through both Year 1 and Year 2 is therefore (97/100) x (84/94)=0.8668.
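The same product-limit calculation for the numbers above, as a small plain-Python illustration (written for this example):

# Per-period data from the example: subjects censored during a period count as
# surviving it, and are removed from the risk set of the next period.
periods = [
    {"at_risk": 100, "died": 3},   # Year 1 (3 more were censored during the year)
    {"at_risk": 94,  "died": 10},  # Year 2: 94 = 100 - 3 died - 3 censored
]

survival = 1.0
for year, p in enumerate(periods, start=1):
    conditional = (p["at_risk"] - p["died"]) / p["at_risk"]
    survival *= conditional
    print(f"Year {year}: conditional survival = {conditional:.4f}, "
          f"cumulative estimate = {survival:.4f}")

# Year 1: conditional survival = 0.9700, cumulative estimate = 0.9700
# Year 2: conditional survival = 0.8936, cumulative estimate = 0.8668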

Incorporating covariates: proportional hazards models

Up to now we have not had information for each individual other than the survival time and censoring status, i.e. we have not considered information such as the weight, age, or smoking status of individuals. These are referred to as covariates or explanatory variables.

Cox Proportional Hazards Modelling

The most interesting survival-analysis research examines the relationship between survival — typically in the form of the hazard function — and one or more explanatory variables (or covariates).

\lambda(t \mid x) = \lambda_0(t)\,e^{\beta x}

where λ0(t) is the non-parametric baseline hazard function and βx is a linear parametric model using features of the individuals, transformed by an exponential function. The baseline hazard function λ0(t) does not need to be specified for the Cox model, making it semi-parametric. The baseline hazard function is appropriately named because it describes the risk at a certain time when x = 0, which is when the features are not incorporated. The hazard function describes the relationship between the baseline hazard and features of a specific sample to quantify the hazard or risk at a certain time.

The model only needs to satisfy the proportional hazard assumption, which is that the hazard of one sample is proportional to the hazard of another sample. Two samples xi and xj satisfy this assumption when the ratio is not dependent on time as shown below:

\frac{\lambda(t \mid x_i)}{\lambda(t \mid x_j)} = \frac{\lambda_0(t)\,e^{\beta x_i}}{\lambda_0(t)\,e^{\beta x_j}} = e^{\beta (x_i - x_j)}

The parameters can be estimated by maximizing the partial likelihood.
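For illustration only (not part of the original notes), fitting a Cox model takes a few lines with the lifelines Python package; the example below uses the Rossi recidivism dataset that ships with lifelines, so the column names are specific to that dataset:

from lifelines import CoxPHFitter
from lifelines.datasets import load_rossi

df = load_rossi()          # columns: week (duration), arrest (event indicator), covariates
cph = CoxPHFitter()
cph.fit(df, duration_col="week", event_col="arrest")   # estimated by maximizing the partial likelihood
cph.print_summary()        # coefficients, hazard ratios exp(beta), confidence intervals
cph.check_assumptions(df)  # checks the proportional hazards assumption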

 

Sources:
https://www.cscu.cornell.edu/news/statnews/stnews78.pdf
https://www.nature.com/articles/6601118#t2
http://blog.applied.ai/survival-analysis-part1/#fn:3
http://data.princeton.edu/wws509/notes/c7s1.html
https://en.wikipedia.org/wiki/Kaplan%E2%80%93Meier_estimator
http://www.stats.ox.ac.uk/~mlunn/lecturenotes1.pdf
Kaplan-Meier methods and Parametric Regression methods, Kristin Sainani Ph.D.
http://vassarstats.net/survival.html
http://www.mas.ncl.ac.uk/~nmf16/teaching/mas3311/week09.pdf

Using Deep Neural Networks for NLP Applications – MAS

Really enjoyed visiting the Monetary Authority of Singapore (MAS) and talking on the applications of Deep Neural Networks for Natural Language Processing (NLP).

IMG_0993

During the talk, there were some great questions from the audience. One of them was: “Can a character-level model capture the unique structure of words and sentences?” The answer is YES, and I hope that the demo, showing a three-layer, 512-unit LSTM model trained on publicly available Regulatory and Supervisory Framework documents downloaded from the MAS website, predicting the next character and repeating it many times, helped to clarify the answer.
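For readers who want a feel for the architecture, here is a rough Keras sketch of a model of that shape (my own reconstruction for illustration, not the actual demo code); vocab_size and seq_len are assumptions, and the corpus is assumed to be already mapped to integer character ids:

from tensorflow.keras import layers, models

vocab_size = 96   # number of distinct characters in the corpus (assumed)
seq_len = 100     # length of each training window (assumed)

# Three stacked 512-unit LSTM layers predicting the next character.
model = models.Sequential([
    layers.Embedding(vocab_size, 64),
    layers.LSTM(512, return_sequences=True),
    layers.LSTM(512, return_sequences=True),
    layers.LSTM(512),
    layers.Dense(vocab_size, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# X: (n_windows, seq_len) integer character ids; y: id of the character following each window.
# model.fit(X, y, batch_size=128, epochs=20)
# Text is then generated by repeatedly sampling the predicted next character
# and appending it to the context.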

MAS Video Capture

Training the same model on Shakespeare’s works and running both models side by side was fun!  

LSTM

 

Install GPU TensorFlow on AWS Ubuntu 16.04

 TensorFlow™ is an open source software library for numerical computation using data flow graphs. Nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) communicated between them.

On a typical system, there are multiple computing devices. In TensorFlow, the supported device types are CPU and GPU.  GPUs offer 10 to 100 times more computational power than traditional CPUs, which is one of the main reasons why graphics cards are currently being used to power some of the most advanced neural networks responsible for deep learning.

The environment setup is often the hardest part of getting a deep learning setup going, so hopefully you will find this step-by-step guide helpful.

Launch a GPU-enabled Ubuntu 16.04 AWS instance

Choose an Amazon Machine Image (AMI) – Ubuntu Server 16.04 LTS

AWS-Ubuntu

Choose an instance type

The smallest GPU-enabled machine is p2.xlarge

AWS-Ubuntu-GPUs

You can find more details here.

Configure Instance Details, Add Storage (choose storage size), Add Tags, Configure Security Group and Review Instance Launch and Launch.

launch-status

Open the terminal on your local machine and connect to the remote machine (ssh -i)

Update the package lists, so you get upgrades for packages that need upgrading as well as new packages that have just come to the repositories:

sudo apt-get --assume-yes update

Install the newer versions of the packages

sudo apt-get --assume-yes upgrade

Install the CUDA 8 drivers

CUDA is a parallel computing platform and application programming interface (API) model created by Nvidia. GPU-accelerated CUDA libraries enable drop-in acceleration across multiple domains such as linear algebra, image and video processing, deep learning and graph analytics.

Verify that you have a CUDA-Capable GPU

lspci | grep -i nvidia
00:1e.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)

Verify You Have a Supported Version of Linux

uname -m && cat /etc/*release

x86_64
DISTRIB_ID=Ubuntu
…..

The x86_64 line indicates you are running on a 64-bit system. The remainder gives information about your distribution.

 Verify the System Has gcc Installed

gcc --version

If the message is “The program ‘gcc’ is currently not installed. You can install it by typing: sudo apt install gcc”

sudo apt-get install gcc

gcc --version

gcc (Ubuntu 5.4.0-6ubuntu1~16.04.5) 5.4.0 20160609

….

Verify the System has the Correct Kernel Headers and Development Packages Installed

uname -r

4.4.0-1038-aws

CUDA support

Download the CUDA-8 driver (CUDA 9 is not yet supported by TensorFlow 1.4)

The driver can be downloaded from here:

CUDA-download-toolikit

CUDA-download-toolikit-installer

Or, downloaded directly to the remote machine:

wget -O ./cuda-repo-ubuntu1604-8-0-local-ga2_8.0.61-1_amd64.deb https://developer.nvidia.com/compute/cuda/8.0/Prod2/local_installers/cuda-repo-ubuntu1604-8-0-local-ga2_8.0.61-1_amd64-deb

Downloading patch 2 as well:

wget -O ./cuda-repo-ubuntu1604-8-0-local-cublas-performance-update_8.0.61-1_amd64.deb https://developer.nvidia.com/compute/cuda/8.0/Prod2/patches/2/cuda-repo-ubuntu1604-8-0-local-cublas-performance-update_8.0.61-1_amd64-deb

Install the CUDA 8 driver and patch 2

Extract, analyse, unpack and install the downloaded .deb files

sudo dpkg -i cuda-repo-ubuntu1604-8-0-local-ga2_8.0.61-1_amd64.deb

sudo dpkg -i cuda-repo-ubuntu1604-8-0-local-cublas-performance-update_8.0.61-1_amd64.deb

apt-key is used to manage the list of keys used by apt to authenticate packages. Packages which have been authenticated using these keys will be considered trusted.

sudo apt-key add /var/cuda-repo-8-0-local-ga2/7fa2af80.pub
sudo apt-key add /var/cuda-repo-8-0-local-cublas-performance-update/7fa2af80.pub

sudo apt-get update

Then install CUDA from the local repository:

sudo apt-get install cuda

Once completed (~10 min), reboot the system to load the NVIDIA drivers.

sudo shutdown -r now

Install cuDNN v6.0

The NVIDIA CUDA® Deep Neural Network library (cuDNN) is a GPU-accelerated library of primitives for deep neural networks. cuDNN provides highly tuned implementations for standard routines such as forward and backward convolution, pooling, normalization, and activation layers.

Download the cuDNN v6.0 driver

The driver can be downloaded from here; please note that you will need to register first.

cuDNN-download2

Copy the driver to the AWS machine (scp -r -i)

Extract the cuDNN files and copy them to the target directory

tar xvzf cudnn-8.0-linux-x64-v6.0.tgz  

sudo cp -P cuda/include/cudnn.h /usr/local/cuda/include

sudo cp -P cuda/lib64/libcudnn* /usr/local/cuda/lib64

sudo chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*

Update your bash file

nano ~/.bashrc

Add the following lines to the end of the bash file:

export CUDA_HOME=/usr/local/cuda

export LD_LIBRARY_PATH=${CUDA_HOME}/lib64:$LD_LIBRARY_PATH

export PATH=${CUDA_HOME}/bin:${PATH}

bashrc

Save the file and exit.

Install TensorFlow

Install the libcupti-dev library

The libcupti-dev library is the NVIDIA CUDA Profile Tools Interface. This library provides advanced profiling support. To install this library, issue the following command:

sudo apt-get install libcupti-dev

Install pip

Pip is a package management system used to install and manage software packages written in Python which can be found in the Python Package Index (PyPI).

sudo apt-get install python-pip

sudo pip install --upgrade pip

Install TensorFlow

sudo pip install tensorflow-gpu

Test the installation

Run the following within the Python command line:

from tensorflow.python.client import device_lib

def get_available_gpus():
    local_device_protos = device_lib.list_local_devices()
    return [x.name for x in local_device_protos if x.device_type == 'GPU']

get_available_gpus()

The output should look similar to this:

2017-11-22 03:18:15.187419: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA

2017-11-22 03:18:17.986516: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:892] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero

2017-11-22 03:18:17.986867: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:

name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235

pciBusID: 0000:00:1e.0

totalMemory: 11.17GiB freeMemory: 11.10GiB

2017-11-22 03:18:17.986896: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)

[u'/device:GPU:0']

 

 

Twitter’s real-time stack: Processing billions of events with Heron and DistributedLog

On the first day of Strata+Hadoop, Maosong Fu, Tech Lead for Realtime Compute at Twitter, shared some details on Twitter’s real-time stack.

img_6462

There are many industries where optimizing in real-time can have a large impact on overall business performance, leading to instant benefits in customer acquisition, retention, and marketing.

valueofdata

But how fast is real-time? It depends on the context, whether it’s financial trading, tweeting, ad impression count or monthly dashboard.

what-is-real-time

 

Earlier Twitter messaging stack

twittermessaging

Kestrel is a message queue server we use to asynchronously connect many of the services and functions underlying the Twitter product. For example, when users update, any tweets destined for SMS delivery are queued in a Kestrel; the SMS service then reads tweets from this queue and communicates with the SMS carriers for delivery to phones. This implementation isolates the behavior of SMS delivery from the behavior of the rest of the system, making SMS delivery easier to operate, maintain, and scale independently.

Scribe is a server for aggregating log data streamed in real time from a large number of servers.

Some of Kestrel’s limitations are listed below:

  • Durability is hard to achieve
  • Read-behind degrades performance
  • Adding subscribers is expensive
  • Scales poorly as number of queues increase
  • Cross DC replication

kestrellimitations

From Twitter Github:

We’ve deprecated Kestrel because internally we’ve shifted our attention to an alternative project based on DistributedLog, and we no longer have the resources to contribute fixes or accept pull requests. While Kestrel is a great solution up to a certain point (simple, fast, durable, and easy to deploy), it hasn’t been able to cope with Twitter’s massive scale (in terms of number of tenants, QPS, operability, diversity of workloads etc.) or operating environment (an Aurora cluster without persistent storage).

Kafka™ is used for building real-time data pipelines and streaming apps. It is horizontally scalable, fault-tolerant, wicked fast, and runs in production in thousands of companies.

Kafka relies on the file system page cache, with performance degradation when subscribers fall behind – too much random I/O.

kafkalimitations

Rethinking messaging

rethinkingmessaging

Apache DistributedLog (DL) is a high-throughput, low-latency replicated log service, offering durability, replication and strong consistency as essentials for building reliable real-time applications.

distributedlogs

Event Bus

eventbus

Features of DistributedLog at Twitter:

High Performance

DL is able to provide milliseconds latency on durable writes with a large number of concurrent logs, and handle high volume reads and writes per second from thousands of clients.

Durable and Consistent

Messages are persisted on disk and replicated to store multiple copies to prevent data loss. They are guaranteed to be consistent among writers and readers in terms of strict ordering.

Efficient Fan-in and Fan-out

DL provides an efficient service layer that is optimized for running in a multi-tenant datacenter environment such as Mesos or Yarn. The service layer is able to support large scale writes (fan-in) and reads (fan-out).

Various Workloads

DL supports various workloads from latency-sensitive online transaction processing (OLTP) applications (e.g. WAL for distributed database and in-memory replicated state machines), real-time stream ingestion and computing, to analytical processing.

Multi Tenant

To support a large number of logs for multi-tenants, DL is designed for I/O isolation in real-world workloads.

Layered Architecture

DL has a modern layered architecture design, which separates the stateless service tier from the stateful storage tier. To support large scale writes (fan-in) and reads (fan-out), DL allows scaling storage independent of scaling CPU and memory.

 

 

distibutedlogs

Storm was no longer able to support Twitter’s requirements, and although Twitter improved Storm’s performance, eventually Twitter decided to develop Heron.

Heron is a realtime, distributed, fault-tolerant stream processing engine from Twitter. Heron is built with a wide array of architectural improvements that contribute to high efficiency gains.

heron

Heron has powered all realtime analytics with varied use cases at Twitter since 2014. Incident reports dropped by an order of magnitude, demonstrating proven reliability and scalability.

 

heronusecases

Heron has been in production for the last 3 years, reducing hardware requirements by 3x. Heron is highly scalable, both in the ability to execute large numbers of components for each topology and in the ability to launch and track large numbers of topologies.

 

heronattwitter

Lambda architecture is a data-processing architecture designed to handle massive quantities of data by taking advantage of both batch– and stream-processing methods. This approach to architecture attempts to balance latency, throughput, and fault-tolerance by using batch processing to provide comprehensive and accurate views of batch data, while simultaneously using real-time stream processing to provide views of online data. The two view outputs may be joined before presentation.

The way this works is that an immutable sequence of records is captured and fed into a batch system and a stream processing system in parallel. You implement your transformation logic twice, once in the batch system and once in the stream processing system. You stitch together the results from both systems at query time to produce a complete answer.
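A toy sketch of that stitching step (not Twitter’s code; keys and counts are made up): a batch job has produced complete counts up to some cut-off time, a streaming job maintains counts since then, and the serving layer merges the two at query time.

# Hypothetical per-key event counts produced by the two layers.
batch_counts = {"#strata": 10000, "#heron": 2500}         # batch layer: complete up to time T
streaming_counts = {"#strata": 42, "#distributedlog": 7}  # speed layer: events after time T

def query(key):
    # Serving layer: stitch the two views together at query time.
    return batch_counts.get(key, 0) + streaming_counts.get(key, 0)

print(query("#strata"))          # 10042
print(query("#distributedlog"))  # 7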

lambda-16338c9225c8e6b0c33a3f953133a4cb

Lambda Architecture: the good

lambdathegood

The problem with the Lambda Architecture is that maintaining code that needs to produce the same result in two complex distributed systems is exactly as painful as it seems like it would be.

LambdaTheBad.png

Summingbird to the Rescue! Summingbird is a library that lets you write MapReduce programs that look like native Scala or Java collection transformations and execute them on a number of well-known distributed MapReduce platforms, including Storm and Scalding.

Summingbird.png

Curious to Learn More?

curioustolearnmore

 

Interested in Heron?

Code at: https://github.com/twitter/heron

http://twitter.github.io/heron/

 

inerestedinheron

Install MongoDB Community Edition and PyMongo on OS X

  • Install Homebrew, a free and open-source software package management system that simplifies the installation of software on Apple’s macOS operating system.

/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

  • Ensure that you’re running the newest version of Homebrew and that it has the newest list of formulae available from the main repository

brew update

  • To install the MongoDB binaries, issue the following command in a system shell:

brew install mongodb

  • Create a data directory (-p creates nested directories, but only if they don’t exist already)

mkdir -p ./data/db

  • Before running mongodb for the first time, ensure that the user account running mongodb has read and write permissions for the directory

sudo chmod 765 data

  • Run MongoDB

mongod --dbpath data/db

  • To stop MongoDB, press Control+C in the terminal where the mongo instance is running

Install PyMongo

pip install pymongo

  • In a Python interactive shell:

import pymongo

from pymongo import MongoClient

RoboMongo

  • Create a Connection

client = MongoClient()

  • Access Database Objects

MongoDB creates new databases implicitly upon their first use.

db = client.test

  • Query for All Documents in a Collection

cursor = db.restaurants.find()

for document in cursor: print(document)

  • Query by a Top Level Field

cursor = db.restaurants.find({"borough": "Manhattan"})

for document in cursor: print(document)

  • Query by a Field in an Embedded Document

cursor = db.restaurants.find({"address.zipcode": "10075"})

for document in cursor: print(document)

  • Query by a Field in an Array

cursor = db.restaurants.find({"grades.grade": "B"})

for document in cursor: print(document)

 

  • Insert a Document

Insert a document into a collection named restaurants. The operation will create the collection if the collection does not currently exist.

from datetime import datetime

result = db.restaurants.insert_one(
    {
        "address": {
            "street": "2 Avenue",
            "zipcode": "10075",
            "building": "1480",
            "coord": [-73.9557413, 40.7720266]
        },
        "borough": "Manhattan",
        "cuisine": "Italian",
        "grades": [
            {
                "date": datetime.strptime("2014-10-01", "%Y-%m-%d"),
                "grade": "A",
                "score": 11
            },
            {
                "date": datetime.strptime("2014-01-16", "%Y-%m-%d"),
                "grade": "B",
                "score": 17
            }
        ],
        "name": "Vella",
        "restaurant_id": "41704620"
    }
)

result.inserted_id

 

Changing the Game with Data and Insights – Data Science Singapore

Another great Data Science Singapore (DSSG) event! Hong Cao from McLaren Applied Technologies shared his insights on applications of data science at McLaren.

The first project is using economic sensors for continuous human condition monitoring, including sleep quality, gait and activities, perceived stress and cognitive performance.

DataScience

Gait outlier analysis provides unique insight into fatigue levels while exercising, the probability of injury, and post-surgery performance and recovery.

Gait Analysis Data Science


A related study looks into how biotelemetry can assist in patient treatment, such as ALS (Amyotrophic Lateral Sclerosis) disease progression monitoring. The prototype tools collect heart rate, activity and speech data to analyse disease progression.

DataScience(3)

HRV (Heart Rate Variability) features are extracted from both the time and the frequency domains.
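For a flavour of what time-domain HRV features look like, here is a small illustrative sketch (my own, not the presenter’s code) computing two common ones from a made-up series of RR intervals in milliseconds; frequency-domain features would additionally require a spectral estimate of the same series.

import numpy as np

# Hypothetical RR intervals (ms) between consecutive heart beats.
rr = np.array([812, 790, 805, 830, 799, 815, 788, 802], dtype=float)

sdnn = np.std(rr, ddof=1)                    # SDNN: overall variability
rmssd = np.sqrt(np.mean(np.diff(rr) ** 2))   # RMSSD: short-term, beat-to-beat variability
mean_hr = 60000.0 / rr.mean()                # mean heart rate in beats per minute

print(f"SDNN = {sdnn:.1f} ms, RMSSD = {rmssd:.1f} ms, HR = {mean_hr:.1f} bpm")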

DataScience(4)

Activity score is derived from the three-axis accelerometer data.


The second project was a predictive failure POC, to help determine the condition of haul trucks in order to predict when a failure might happen. The cost of having an excavator go down in the field is $5 million a day, while the cost of losing a haul truck is $1.8 million per day. If you can prevent it from going down in the field, that makes a huge difference.
