Regular expressions are such an incredibly convenient tool, available across so many languages that most developers will learn them sooner or later.
But regular expressions can become quite complex. The syntax is terse, subtle, and subject to combinatorial explosion.
The best way to improve your skills is to write a regular expression, test it on real data, debug it, improve it, and repeat.
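As a sketch of that write-test-refine loop in Python (the date-matching task here is a made-up example, not from any particular dataset):

```python
import re

text = 'released 2018-12-31, patched 2019-01-15, invalid 2019-13-99'

# First attempt: any dash-separated digit groups that look like a date
loose = re.findall(r'\d{4}-\d{2}-\d{2}', text)

# Testing on the data reveals it also matches the impossible date 2019-13-99,
# so refine: constrain the month (01-12) and day (01-31) ranges
strict = re.findall(r'\d{4}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12]\d|3[01])', text)

print(loose)   # still includes the invalid date
print(strict)  # only the plausible dates
```

Each iteration of testing against real input exposes a case the previous pattern handled wrongly, which is exactly the practice loop described above.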
Click the name of the role attached to your cluster’s Amazon Elastic Compute Cloud (Amazon EC2) instances (for example, EMR_EC2_DefaultRole) and click Attach policies.
Attach the AmazonSNSFullAccess policy to the role. This policy allows the role to publish Amazon SNS notifications based on state changes in your Amazon EMR cluster.
A summary page is presented with the message “Policy AmazonSNSFullAccess has been attached for the EMR_EC2_DefaultRole.”
Choose Event Pattern, Build event pattern to match events by service. For Service Name, choose the service that emits the event to trigger the rule. For Event Type, choose the specific event that is to trigger the rule.
For Targets, choose Add Target and choose the AWS service that is to act when an event of the selected type is detected
Choose Configure details. For Rule definition, type a name and description for the rule.
Once the rule is created, the message “Rule emr_state_change_SNS was created.” is displayed.
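For reference, an event pattern matching Amazon EMR cluster state changes typically looks like the following (the `detail-type` string and the list of states are defined by AWS; the specific states shown are an illustrative subset):

```json
{
  "source": ["aws.emr"],
  "detail-type": ["EMR Cluster State Change"],
  "detail": {
    "state": ["TERMINATED", "TERMINATED_WITH_ERRORS"]
  }
}
```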
Make sure you have Java 8 or higher installed on your computer and visit the Spark download page
Select the latest Spark release, a prebuilt package for Hadoop, and download it directly.
Unzip it and move it to your /opt folder:
$ tar -xzf spark-2.4.0-bin-hadoop2.7.tgz
$ sudo mv spark-2.4.0-bin-hadoop2.7 /opt/spark-2.4.0
A symbolic link is like a shortcut from one file to another. The contents of a symbolic link are the address of the actual file or folder that is being linked to.
Create a symbolic link (this will let you have multiple Spark versions):

$ ln -s /opt/spark-2.4.0 /opt/spark
Finally, tell your bash where to find Spark. To find what shell you are using, type:
$ echo $SHELL
/bin/bash
To do so, edit your bash file:
$ nano ~/.bash_profile
Configure your $PATH variable by adding the following lines to your ~/.bash_profile file:
export SPARK_HOME=/opt/spark
export PATH=$SPARK_HOME/bin:$PATH
# For Python 3, you have to add the line below or you will get an error
export PYSPARK_PYTHON=python3
Now to run PySpark in Jupyter you’ll need to update the PySpark driver environment variables. Just add these lines to your ~/.bash_profile file:
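The driver variables in question are typically set like this (jupyter must already be installed for this to work):

```shell
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
```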
Restart (or just source) your terminal and launch PySpark:
$ pyspark
This command should start a Jupyter Notebook in your web browser. Create a new notebook by clicking on ‘New’ > ‘Notebooks Python [default]’.
Running PySpark in Jupyter Notebook
The PySpark context can be obtained with:
sc = SparkContext.getOrCreate()
To check that your notebook is initialized with a SparkContext, try the following code in your notebook:
sc = SparkContext.getOrCreate()
import numpy as np
TOTAL = 10000
dots = sc.parallelize([2.0 * np.random.random(2) - 1.0 for i in range(TOTAL)]).cache()
print("Number of random points:", dots.count())
stats = dots.stats()
print('Mean:', stats.mean())
print('stdev:', stats.stdev())
The result:
Running PySpark in your favorite IDE
Sometimes you need a full IDE to create more complex code, and PySpark isn’t on sys.path by default, but that doesn’t mean it can’t be used as a regular library. You can address this by adding PySpark to sys.path at runtime. The package findspark does that for you.
To install findspark just type:
$ pip3 install findspark
And then on your IDE (I use Eclipse and Pydev) to initialize PySpark, just call:
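A minimal sketch, assuming findspark and Spark are installed as described above (the app name is arbitrary):

```python
import findspark

# Locate Spark and add it to sys.path; you can also pass the path
# explicitly, e.g. findspark.init('/opt/spark')
findspark.init()

# Only after findspark.init() can pyspark be imported as a regular library
import pyspark

sc = pyspark.SparkContext(appName='myApp')
print(sc.parallelize(range(10)).sum())
sc.stop()
</imports>
```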
Interesting presentation today at the DataScience SG meet-up
Conventional fraud prevention methods are rule-based, expensive, and slow to implement.
Q1 2016: $5 of every $100 subject to fraud attack!
Key fraud types: account takeover, friendly fraud & fraud due to stolen card information
Consumers want: easy, instant, customized, mobile and dynamic options to suit their personal situation. Consumers do NOT want to be part of the fraud detection process.
Key technology enablers:
Historically, fraud detection systems have relied on rules hand-curated by fraud experts to catch fraudulent activity.
An autoencoder is a neural network trained to reconstruct its inputs, which forces a hidden layer to learn good representations of the input.
Kaggle dataset:
Train the autoencoder on normal transactions only; after applying the autoencoder transformation, there is a clear separation between the normal and the fraudulent transactions.
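To illustrate the idea (this is a toy numpy sketch with synthetic data, not the presenter's actual model): train an autoencoder on "normal" points only, then flag points with high reconstruction error as anomalous. A linear autoencoder with a 1-unit bottleneck is enough to show the separation.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Normal" transactions: two strongly correlated features (near a 1-D subspace)
t = rng.normal(size=(500, 1))
normal = np.hstack([t, t + 0.05 * rng.normal(size=(500, 1))])

# "Fraudulent" transactions: break the correlation pattern
fraud = rng.normal(size=(20, 2)) * [1.0, -1.0] + [0.0, 3.0]

# Linear autoencoder 2 -> 1 -> 2 with tied weights, trained only on normal data
w = rng.normal(size=(2, 1)) * 0.1
lr = 0.01
for _ in range(2000):
    err = normal @ w @ w.T - normal            # reconstruction residual
    grad = 2 * (normal.T @ err @ w + err.T @ normal @ w) / len(normal)
    w -= lr * grad

def reconstruction_error(x):
    """Per-sample mean squared reconstruction error."""
    return np.mean((x @ w @ w.T - x) ** 2, axis=1)

print('normal error:', reconstruction_error(normal).mean())
print('fraud error :', reconstruction_error(fraud).mean())
```

Because the bottleneck can only encode the pattern seen during training, fraudulent points reconstruct poorly, giving the clear separation described above.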
The December PyData Meetup started with Luis Smith, Data Scientist at GO-JEK, sharing the Secret Recipe Behind GO-FOOD’s Recommendations:
“For GO-FOOD, we believe the key to unlocking good recommendations is to derive vector representations for our users, dishes, and merchants. This way we are able to capture our users’ food preferences and recommend them the most relevant merchants and dishes.”
How do people think about the food?
Flavor profile
Trendy
Value for money
Portion size
Ingredients
… and much more
The preferred approach is to let the transactional data discover the pattern.
A sample ETL workflow:
Using StarSpace to learn the vector representations:
Go-Jek formulation of the problem:
User-to-dish similarity is surfaced in the app via “dishes you might like”. The average vector of a customer’s purchases is used to find the recommended dishes.
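The averaging step can be sketched as follows (a toy numpy example with made-up dish vectors standing in for StarSpace embeddings):

```python
import numpy as np

# Toy stand-ins for learned dish embeddings; real vectors come from StarSpace
dish_vectors = {
    'nasi goreng':  np.array([0.9, 0.1, 0.0]),
    'mie goreng':   np.array([0.8, 0.2, 0.1]),
    'es teh manis': np.array([0.0, 0.1, 0.9]),
}

def recommend(purchased, k=1):
    """Rank unpurchased dishes by cosine similarity to the mean purchase vector."""
    user_vec = np.mean([dish_vectors[d] for d in purchased], axis=0)

    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    candidates = [d for d in dish_vectors if d not in purchased]
    return sorted(candidates, key=lambda d: -cos(user_vec, dish_vectors[d]))[:k]

print(recommend(['nasi goreng']))
```

A user who bought nasi goreng gets the nearest dish in embedding space (here, mie goreng) rather than the unrelated drink.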
Due to data sparsity, item-based collaborative filtering is used for merchant recommendation.
The cold start problem is still an issue for inactive users or users who purchase infrequently.
The * operator unpacks the arguments out of a list or tuple.
> args = [3, 6]
> list(range(*args))
[3, 4, 5]
As an example, when we have a list of three arguments, we can use the * operator inside a function call to unpack it into the three arguments:
def f(a,b,c):
print('a={},b={},c={}'.format(a,b,c))
> z = ['I','like','Python']
> f(*z)
a=I,b=like,c=Python
> z = [['I','really'],'like','Python']
> f(*z)
a=['I', 'really'],b=like,c=Python
In Python 3 it is possible to use the * operator on the left side of an assignment, allowing you to specify a “catch-all” name which will be assigned a list of all items not assigned to a “regular” name:
> a, *b, c = range(5)
> a
0
> c
4
> b
[1, 2, 3]
The ** operator can be used to unpack a dictionary of arguments as a collection of keyword arguments. Calling the same function f that we defined above:
> d = {'c':'Python','b':'like', 'a':'I'}
> f(**d)
a=I,b=like,c=Python
and when there is a missing argument in the dictionary (‘a’ in this example), a TypeError is raised:

> d = {'c':'Python','b':'like'}
> f(**d)
TypeError: f() missing 1 required positional argument: 'a'
What a great symposium! Thank you Dr. Linlin Li, Prof. Ido Dagan, Prof. Noah Smith and the rest of the speakers for the interesting talks and thank you Singapore University of Technology and Design (SUTD) for hosting this event. Here is a quick summary of the first half of the symposium, you can learn more by looking for the papers published by these research groups:
Linlin Li: The text processing engine that powers Alibaba’s business applications
Dr. Linlin Li from Alibaba presented the mission of Alibaba’s NLP group and spoke about AliNLP, a large scale NLP technology platform for the entire Alibaba Eco-system, dealing with data collection and multilingual algorithms for lexical, syntactic, semantic, discourse analysis and distributed representation of text.
Alibaba is also helping to improve the quality of the Electronic Medical Records (EMRs) in China, traditionally done by labour intensive methods.
Ido Dagan: Consolidating Textual Information
Prof. Ido Dagan gave an excellent presentation on Natural Knowledge Consolidating Textual Information. Texts come in large multitudes, such as news stories, search results, and product reviews. Search interfaces haven’t changed much in decades, which makes them accessible but hard to consume. For example, the news tweets illustration in the slide below shows that there is a lot of redundancy and complementary information, so there is a need to consolidate the knowledge within multiple texts.
Generic knowledge representation via structured knowledge graphs and semantic representation are often being used, where both approaches require an expert to annotate the dataset, which is expensive and hard to replicate.
The structure of a single sentence will look like this:
The information can be consolidated across the various data sources via Coreference
To conclude
Noah A. Smith: Syncretizing Structured and Learned Representation
Prof. Noah Smith described new ways to use representation learning for NLP.
Some promising results
Prof. Noah presented different approaches to solve backpropagation with structure in the middle, where the intermediate representation is non-differentiable.
ImageMagick® is used to create, edit, compose, or convert bitmap images. It can read and write images in a variety of formats (over 200) including PNG, JPEG, GIF, HEIC, TIFF, DPX, EXR, WebP, Postscript, PDF, and SVG. Use ImageMagick to resize, flip, mirror, rotate, distort, shear and transform images, adjust image colors, apply various special effects, or draw text, lines, polygons, ellipses and Bézier curves.
Wand is a ctypes-based simple ImageMagick binding for Python, so go through the step-by-step guide on how to install it.
Let’s start by installing ImageMagick:
brew install imagemagick@6
Next, create a symbolic link, with the following command (replace <your specific 6 version> with your specific version):
ln -s /usr/local/Cellar/imagemagick@6/<your specific 6 version>/lib/libMagickWand-6.Q16.dylib /usr/local/lib/libMagickWand.dylib
It seems that ghostscript is not installed by default, so let’s install it:
brew install ghostscript
Now we will need to create a soft link to /usr/bin, but /usr/bin/ in OS X 10.11+ is protected.
Just follow these steps:
1. Reboot into Recovery Mode: restart and hold “Cmd + R” after the startup sound.
2. In Recovery Mode go to Utilities -> Terminal.
3. Run: csrutil disable
4. Reboot in Normal Mode.
5. Run “sudo ln -s /usr/local/bin/gs /usr/bin/gs” in the terminal.
6. Repeat steps 1 and 2, then re-enable SIP by running: csrutil enable
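With ImageMagick, Ghostscript and Wand in place, a minimal sketch of using Wand looks like this (report.pdf is a hypothetical input file; Ghostscript is what enables the PDF reading):

```python
from wand.image import Image

# Open the PDF at 150 DPI and export its first page as a PNG
with Image(filename='report.pdf', resolution=150) as pdf:
    with Image(image=pdf.sequence[0]) as page:
        page.format = 'png'
        page.save(filename='report-page-1.png')
```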
Step 1: Launch a new Finder window by choosing New Finder Window under the Finder’s File menu.
Step 2: Navigate to a desired file or folder and click the item in the Finder window while holding the Control (⌃) key, which will bring up a contextual menu populated with various file-related operations.
Step 3: Now hold down the Option (⌥) key to reveal a hidden option in the contextual menu, labeled “Copy (file/folder name) as Pathname”.
Step 4: Selecting this option will copy the complete, not relative, pathname of your item into the system clipboard.