Data Scientists, With Great Power Comes Great Responsibility

It is a good time to be a data scientist.

In 2012 the Harvard Business Review hailed the role of data scientist as “the sexiest job of the 21st century”. Data scientists work at both start-ups and well-established companies like Twitter, Facebook, LinkedIn and Google, receiving an average total salary of $98k ($144k for US respondents only).

Data, and the insights it provides, gives a business the upper hand in understanding its clients, prospects and overall operation. Until recently, it was not uncommon for million- and billion-dollar deals to be accepted or rejected based on intuition and instinct. Data scientists add value to the business by enabling informed and timely decision-making based on quantifiable, data-driven evidence, and by translating data into actionable insights.

So you have a rewarding corporate day job. How about doing data science for social good?

You have been endowed with tremendous data science and leadership powers and the world needs them! Mission-driven organizations are tackling huge social issues like poverty, global warming and public health. Many have tons of unexplored data that could help them make a bigger impact, but don’t have the time or skills to leverage it. Data science has the power to move the needle on critical issues but organizations need access to data superheroes like you to use it.

DataKind Blog 

A few programs exist specifically to facilitate this; the United Nations #VisualizeChange challenge is the one I’ve just taken part in.

As the Chief Information Technology Officer, I invite the global community of data scientists to partner with the United Nations in our mandate to harness the power of data analytics and visualization to uncover new knowledge about UN related topics such as human rights, environmental issues, and political affairs.

Ms. Atefeh Riazi – Chief Information Technology Officer at United Nations

The United Nations Unite Ideas platform has published a number of data visualization challenges. For the latest challenge, #VisualizeChange: A World Humanitarian Summit Data Challenge, we were provided with unstructured information from nearly 500 documents that the consultation process had generated as of July 2015. The qualitative data is categorized into emerging themes and sub-themes that were identified according to a purpose-built taxonomy. The challenge was to process the consultation data in order to develop an original and thought-provoking illustration of the information collected through the consultation process.

Over the weekend I built an interactive visualization using open-source tools (R and Shiny) to help identify innovative ideas and technologies in humanitarian action, especially in communication and information technology. Because it made the top 10 finalists, the solution is showcased here, as well as on the Unite Ideas platform and at other related events worldwide, so I hope that this visualization will be used to uncover new knowledge.

#VisualizeChange Top 10 Visualizations

Opening these challenges to the public helps raise awareness. While analysing the data and designing the visualization, I learned about some of the most pressing humanitarian needs, such as Damage and Need Assessment, Communication and Online Payment, among others, and about the most promising technologies, such as Mobile, Data Analytics, Social Media and Crowdsourcing, among others.

#VisualizeChange Innovative Ideas and Technologies
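
To give a feel for the kind of processing behind such a view, here is a minimal sketch in Python of tallying how many documents mention each technology. It is not the R/Shiny code used for the submission: the function name `count_mentions`, the matching rule and the three sample snippets are all made up for illustration; only the technology names come from the post.

```python
from collections import Counter
import re

# Technology names taken from the post; treating them as plain keywords
# is an assumption made for this sketch.
TECHNOLOGIES = ["mobile", "data analytics", "social media", "crowdsourcing"]

def count_mentions(documents, keywords):
    """Count how many documents mention each keyword at least once."""
    counts = Counter()
    for text in documents:
        lowered = text.lower()
        for keyword in keywords:
            # Word boundaries avoid matching inside longer words.
            if re.search(r"\b" + re.escape(keyword) + r"\b", lowered):
                counts[keyword] += 1
    return counts

# Made-up snippets standing in for the consultation documents.
documents = [
    "Mobile payments helped families after the flood.",
    "Crowdsourcing and social media sped up damage assessment.",
    "Responders asked for better data analytics and mobile coverage.",
]

# Print keywords from most to least frequently mentioned.
for keyword, n in count_mentions(documents, TECHNOLOGIES).most_common():
    print(keyword, n)
```

A per-document count (rather than a raw word count) keeps one very long document from dominating the ranking, which seemed the safer choice for a sketch like this.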

Kaggle is another great platform where you can apply your data science skills for social good. How about applying image classification algorithms to automate the right whale recognition process using a dataset of aerial photographs of individual whales? With fewer than 500 North Atlantic right whales left in the world’s oceans, knowing the health and status of each whale is integral to the efforts of researchers working to protect the species from extinction.

Right Whale Recognition

There are other excellent programs.

The Data Science for Social Good (DSSG) program, run by the University of Chicago, is where aspiring data scientists take on real-world problems in education, health, energy, transportation, economic development and international development, and work for three months on data mining, machine learning, big data and data science projects with social impact.

DataKind brings together top data scientists with leading social change organizations to collaborate on cutting-edge analytics and advanced algorithms to maximize social impact.

Bayes Impact is a group of practical idealists who believe that, applied properly, data can be used to solve the world’s biggest problems.

Are you aware of any other organizations and platforms doing data science for social good? Feel free to share.

Tools & Technologies

R for analysis & visualization
Shiny.io for hosting the interactive R script
The complete source code and the data are hosted here

 

Bioinformatics Is Cool, I Mean Really Cool

I’ve been fascinated by genomic research for years. While I have successfully implemented a fairly large and diverse set of algorithms (segmentation, image processing and machine learning) on text, image and other semi-structured datasets, until recently I didn’t have much exposure to the exciting field of bioinformatics and the processing of DNA and RNA sequences.

After completing Bioinformatic Methods I and Bioinformatic Methods II (thank you Professor Nicholas Provart, University of Toronto), I have a better appreciation of the important role of bioinformatics in the medical sciences and in drug discovery, diagnosis and disease management, and also of the complexity involved in processing large biological datasets.


Topics covered in these two courses include multiple sequence alignments, phylogenetics, gene expression data analysis and protein interaction networks, split across two separate parts. The first part, Bioinformatic Methods I, dealt with databases, BLAST, multiple sequence alignments, phylogenetics, selection analysis and metagenomics. The second part, Bioinformatic Methods II, dealt with motif searching, protein-protein interactions, structural bioinformatics, gene expression data analysis and cis-element predictions.
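
To make one of these topics concrete, here is a minimal Python sketch of motif searching: scanning a DNA sequence for every occurrence of a short motif, with IUPAC degenerate codes allowed. It is not taken from the course labs, which used the web tools listed below; the function name `find_motif` and the promoter fragment are made up for illustration. CACGTG is the well-known G-box element found in many plant promoters.

```python
import re

# IUPAC nucleotide codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def find_motif(sequence, motif):
    """Return 0-based start positions of every match, including overlaps."""
    pattern = "".join(IUPAC[base] for base in motif.upper())
    # A zero-width lookahead lets overlapping occurrences be reported.
    return [m.start() for m in re.finditer(f"(?=({pattern}))", sequence.upper())]

# Hypothetical promoter fragment, made up for illustration.
promoter = "TTGACACGTGGCATACGTGTCA"

print(find_motif(promoter, "CACGTG"))  # exact G-box
print(find_motif(promoter, "NACGTG"))  # any base followed by the ACGTG core
```

This only scans the forward strand; a fuller search would also scan the reverse complement, and real tools score matches against position weight matrices rather than requiring exact pattern hits.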

Please find below a short list of tools and resources I’ve used while completing the different labs:

NCBI/BLAST I

http://www.ncbi.nlm.nih.gov/

Multiple Sequence Alignments

http://megasoftware.net (download tool)

http://dialign.gobics.de/

http://mafft.cbrc.jp/alignment/server/

Phylogenetics

https://code.google.com/p/jmodeltest2/

http://www.megasoftware.net/

http://evolution.genetics.washington.edu/phylip.html

http://bar.utoronto.ca/webphylip/

http://mobyle.pasteur.fr/cgi-bin/portal.py#forms::fastdnaml

Selection Analysis

http://selecton.tau.ac.il/

http://bar.utoronto.ca/EMBOSS

http://www.datamonkey.org/                 

Next Generation Sequencing Applications: RNA-Seq and Metagenomics

http://mpss.udel.edu

http://mockler-jbrowse.mocklerlab.org/jbrowse.athal/?loc=Chr2%3A12414112..12415692

http://www.ncbi.nlm.nih.gov/pubmed/20410051

Metagenomics

http://metagenomics.anl.gov/

Protein Domain, Motif and Profile Analysis

http://www.ncbi.nlm.nih.gov/guide/domains-structures/

http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi               

http://smart.embl-heidelberg.de/

http://pfam.xfam.org/search

http://www.ebi.ac.uk/InterProScan/

Protein-protein interactions               

http://mips.gsf.de/proj/ppi/  

http://dip.doe-mbi.ucla.edu/dip/Main.cgi

http://www.thebiogrid.org/index.php

http://www.cytoscape.org/download.html   (download tool)             

Structural Bioinformatics

http://pdb.org  

http://www.ncbi.nlm.nih.gov/Structure/VAST/vastsearch.html  

http://pymol.org/edu (download tool)            

Gene Expression Analysis

http://www.ncbi.nlm.nih.gov/sra  

http://bar.utoronto.ca/BioC/

Gene Expression Data Analysis

http://bar.utoronto.ca/ntools/cgi-bin/ntools_expression_angler.cgi

http://bar.utoronto.ca/affydb/cgi-bin/affy_db_exprss_browser_in.cgi

http://bioinfo.cau.edu.cn/agriGO/analysis.php

http://bar.utoronto.ca/efp/

http://atted.jp/

http://bar.utoronto.ca/ntools/cgi-bin/ntools_venn_selector.cgi

Cis regulatory element mapping and prediction

http://bar.utoronto.ca/ntools/cgi-bin/BAR_Promomer.cgi

http://www.bioinformatics2.wsu.edu/cgi-bin/Athena/cgi/home.pl

http://jaspar.genereg.net/

http://meme-suite.org/tools/meme

Additional Courses

https://class.coursera.org/molevol-002 Computational Molecular Evolution

https://www.coursera.org/course/pkubioinfo Bioinformatics

https://www.coursera.org/course/programming1 Python

https://www.coursera.org/course/webapplications Web Applications (Ruby on Rails)

https://www.coursera.org/course/epigenetics Epigenetics

https://www.coursera.org/course/genomicmedicine Genomic & Precision Medicine

https://www.coursera.org/course/usefulgenetics Useful Genetics

Book

Zvelebil & Baum (2008). Understanding Bioinformatics. Garland Science, New York.
