Data Science Team
Accelerate your Analytics Projects Using Pivotal Data Scientists
To help businesses develop actionable insights and grow their skill base more rapidly, Pivotal has assembled a team of experienced data scientists available for analytics-focused engagements.
Sarah Aerni joined us from Stanford University, where she performed interdisciplinary research at the interface of biomedicine and computer science, specifically machine learning. She focused her efforts on building computational models enabling research across a broad range of fields in biomedicine. She holds a B.S. in Biology with a specialization in Bioinformatics and a minor in French Literature from UCSD, and an M.S. and Ph.D. in Biomedical Informatics from Stanford University. In addition to academic research, she has experience in consulting, having co-founded a company offering expert services in informatics for both academic and industry settings. Beyond her interests in biomedical informatics, she is passionate about education and fostering interdisciplinary collaboration.
Michael Brand has been a researcher for the past 20 years. He is passionate about finding new insights in data and new algorithms for handling data, irrespective of data domain. He holds patents and pending patents in topics ranging from information retrieval to data clustering to speech and video processing to in-database analytics, and has worked on some highly successful and award-winning products, such as the Xbox Kinect camera (made by PrimeSense, a previous employer) and Verint Systems’ IntelliFind.
He holds a B.Sc. in Engineering from Tel-Aviv University, an M.Sc. in Applied Mathematics (Game Theory) from the Weizmann Institute of Science and is nearing completion of a Ph.D. in Information Technology (Theory of Computation) from Monash University. He has published papers on strategy analysis (in monitoring experiment safety), bioinformatics (compressed genotyping), object oriented design, number theory, mathematical design theory and more, as well as a book entitled “The Mathematics of Justice”.
Kaushik Das is an expert at applying mathematical models to solve business problems. He has more than 10 years of experience designing and deploying analytical software, working for enterprise software companies such as Rapt, DemandTec and M-Factor, as well as in business consulting with McKinsey.
After joining EMC’s Data Computing Division as Director of Analytics, Kaushik led several projects to provide actionable insights from the analysis of Big Data for customers in sectors ranging from utilities and oil & gas to banking, retail and digital media. Kaushik is leveraging his PhD-level academic work in geophysics to lead our efforts to build out an analytics practice for the energy sector, where Big Data holds great promise.
Hulya has extensive experience in the application of algorithmic approaches to complex problems in multiple verticals. Before joining Pivotal, she held positions at IBM and M-Factor, where she helped her customers make optimal business decisions under uncertainty by marrying machine learning algorithms with optimization routines. She is currently a senior principal data scientist at Pivotal, where she leads the health care vertical. She holds a Ph.D. in Operations Research from the University of Florida.
Victor holds a Ph.D. in Computer Science from the University of Cincinnati, where his research covered machine learning, computer vision and social network mining. In his previous role as Senior Scientist at Riverain Medical, he designed machine learning predictive models for FDA-approved chest X-ray imaging diagnosis and enhancement. During his PhD, he was also associated with Cincinnati Children’s Hospital on various biomedical image mining research projects, such as brain MRI analysis for Parkinson’s disease. Victor is deepening his knowledge in graph mining topics, especially link prediction in dynamic social networks. Prior to his PhD work, he was a Research SDE at the National Lab of Pattern Recognition, Chinese Academy of Sciences, where he was involved in building large-scale intelligent video analytics systems. Victor obtained a bachelor’s degree in Electrical Engineering from the University of Science & Technology of China.
Hong comes to Pivotal with 20 years of experience in insurance and retail banking analytics. His past employers include IAG, the largest general insurer in Australia, and ANZ Bank, one of the big four Australian banks. His work has spanned a number of areas in actuarial science, statistics and financial mathematics, including risk scorecards, loss reserving and economic capital. He has a BEc in actuarial science and a Master’s in applied statistics from Macquarie University, and a PhD in statistics from the Australian National University.
Ian has a background in numerical analysis and simulation, having been a postdoctoral researcher in theoretical cosmology for a number of years. His expertise includes high-performance computing for scientific applications, perturbative analysis of large systems of differential equations, and the differential geometry underlying relativistic physics. He completed a PhD in theoretical cosmology at Queen Mary, University of London, and received an MSc in theoretical physics from Imperial College London. Ian’s work has been published in leading international physics journals, and he has released the Python numerical package used in his research to the community.
Annika is a seasoned leader of analytics initiatives, coming to Pivotal after over six years in data leadership roles at Yahoo!. At Pivotal since April 2011, she has built the “Data Science Dream Team,” an industry-leading group of data scientists representing a rich combination of vertical domain and horizontal analytical expertise, to facilitate Data Science-driven transformations for Pivotal customers. During her time at Yahoo!, she led Audience and International data solutions for Yahoo!’s central data organization, Strategic Data Solutions, and led Insights Services, a team of 40 researchers covering Web analytics, satisfaction/brand health metrics, and audience/ad measurement. Annika is a recognized evangelist for “applied data” and well known for her acute focus on action-enablement.
Woo Jae comes to Pivotal with a diverse background in practical applications of both humble and advanced inferential statistics, and is committed to helping customers make smarter decisions driven by a healthy mix of Big Data, domain expertise, predictive modeling, and curiosity. He was previously a Senior Statistician at Bay Area startup M-Factor (now IBM DemandTec) where he built and delivered go-live demand analysis solutions powered by Bayesian hierarchical models. Woo Jae holds a M.Sc. in Statistics from Stanford, and a B.Sc. in Industrial & Labor Relations with a minor in Biometry & Statistics from Cornell.
Alexander Kagoshima received an M.S. (Dipl.-Ing.) in Economics and Engineering from TU Berlin. In graduate school, his focus was on machine learning and statistics with applications to time series analysis, specifically financial time series. For his thesis, Alex developed and evaluated a novel change-point detection algorithm that combines ideas from physics, machine learning and statistics. This algorithm is intended for wind-turbine sensor data and enables intelligent wind-turbine control systems to dynamically adapt to changing wind conditions. During his internship at Volkswagen, he predicted malfunctions in a test fleet of fuel-cell cars. In his spare time, he tries to find new ways to analyze soccer games with statistical methods.
Niels has an extensive background in natural language processing, machine learning, and web-scale information extraction. He holds a Ph.D. in Computer Science from UMBC and an M.S. in Computer Science from East Carolina University. Previously, he focused on developing automated methods to mine and construct commonsense knowledge from unstructured, web-scale data to support cognitive tasks such as planning, reasoning, and prediction for intelligent machines. Prior to Pivotal, Niels was at the Johns Hopkins University Applied Physics Laboratory, where he developed and optimized interplanetary communications protocols and mission-critical space-flight software. At Pivotal, Niels is involved in advancing customer-centric analytics solutions for major clients in the finance industry.
Anirudh Kondaveeti is a graduate of Arizona State University (ASU), specializing in machine learning and spatio-temporal data mining. His PhD research at ASU focused on developing analytic models for spatio-temporal data, specifically trajectory data. Moving-object trajectory analysis is a growing area of interest owing to its applications in domains such as marketing, production systems, security surveillance (e.g. border security), traffic monitoring and management, and social media. He has developed models for clustering, outlier detection and change detection in real unstructured trajectory data obtained from the GPS traces of vehicles (e.g. cabs in San Francisco). Currently, Anirudh is solving Big Data problems in IT operations and security analytics.
Derek is a seasoned data scientist specializing in building data-driven defenses against security threats and fraud. He joined Pivotal after six years with RSA building behavior-based risk engines to predict online banking fraud. Prior to RSA, Derek had 10+ years of research experience in voice biometrics, security, and speech and language processing with various startups. He received an MSEE in signal and image processing from the University of Southern California. Today he leads an A-team of passionate data scientists solving Big Data problems in IT operations and security analytics.
Mariann Micsinai is a member of the Data Science team at Pivotal’s New York City location. She holds a Ph.D. in Computational Biology from NYU/Yale and pursued Master’s degrees in Computational Biology, Mathematics, Economics, International Studies and Linguistics. Most recently, she focused on developing novel computational methods in human cancer genetics and on analyzing and integrating next-generation sequencing experimental data (ChIP-Seq, RNA-Seq, Exome-Seq, 4C-Seq etc.). Prior to her experience in the bioinformatics field, she worked for Lehman Brothers’ Emerging Market Trading desk in a market risk management role. In parallel, she taught Econometrics and Mathematics for Economists at Barnard College. At Pivotal, Mariann is involved in solving Big Data problems in finance and health care analytics.
Kee Siong Ng has been involved in machine learning and artificial intelligence R&D for the last 14 or so years. He has a PhD from the Australian National University, with vertical experience in traffic analysis, government, retail, and energy. He is the zeroth member of the APJ data science team and recently returned from a stint as consulting lead data scientist at Reliance Industries in Mumbai. Kee Siong has served on the program committees of several international machine learning conferences, and he continues to hold adjunct appointments at universities in Australia and Singapore. He is tickled by the idea that he has the sexiest job of the 21st century.
Rashmi Raghu holds a PhD in Mechanical Engineering and a PhD Minor in Management Science & Engineering from Stanford University. Her doctoral research focused on the development of novel computational (fluid-structure interaction) models of the cardiovascular system to aid disease research. Prior to that she completed M.E. and B.E. degrees in Engineering Science from the University of Auckland, New Zealand. Her professional interests include mathematical modeling and computational techniques for applications ranging from modeling physical systems to aiding decision analysis. Among other things she enjoys reading, listening to and playing music (she plays the Veena and is happy to describe it to people who have not heard of it).
He completed his PhD in Operations Research at UC Berkeley and also holds an MA in Statistics from UC Berkeley, as well as an undergraduate degree in engineering from IIT Bombay. His doctoral research focused on online, data-driven learning approaches to operations management problems, with applications to inventory control, capacity allocation and dynamic assortment optimization.
Previously a Data Scientist at Sony Mobile Communications in Redwood City, he led Sony Mobile’s Data Science initiatives spanning Statistical Machine Learning and NLP with text, including Named Entity Recognition, Topic Models and Sentiment Analysis. He also worked with Sony’s PlayStation group on User Persona Analysis, Ad Recommendation and Inventory Forecasting. Before joining Sony, he was an engineer on the Analytics team at Salesforce.com. He received a Master’s in Computer Science from the University of Texas at Austin, completing his thesis and research in NLP, where he focused on graphical models for weakly supervised sequence prediction problems such as supertagging. While in graduate school, he also interned at IBM Research in Almaden. He loves mountaineering and is a native speaker of Python.
Regunathan Radhakrishnan received his M.S. and Ph.D. degrees in EE from Polytechnic University, Brooklyn, NY, in 2002 and 2004 respectively. He was a research fellow in the ECE department and also an intern at Mitsubishi Electric Research Labs, Cambridge, MA, during his graduate studies. He was with Mitsubishi Electric Research Laboratories (MERL) until May 2006 as a Visiting Researcher. He was one of the leading contributors to MERL's video summarization algorithm based on audio classification, which is now a differentiating feature of Mitsubishi Electric's DVD recorder in Japan. He joined the sound technology research team at Dolby Laboratories in June 2006 and applied machine learning methods to audio and video data for intelligent metadata creation for ecosystem-wide solutions. His research interests include statistical machine learning, video summarization, multimedia content identification, watermarking, and spatial audio. He has published several conference papers, 7 journal papers, 5 book chapters and a book on multimedia content analysis and security, and has filed about 40 patents in the areas of multimedia content analysis, multimedia security and content identification. He is currently serving as an associate editor for the Journal of Multimedia and as a Program Committee member of the SPIE Forensics and Media Security conference. He has received the Valuable Invention Award from Mitsubishi Electric Corporation for his work on video summarization using audio analysis, and the SMPTE Journal paper award for his work on audio-video synchronization.
Noelle Sio has a background in mathematics, statistics, and data mining with an emphasis on digital media. She is currently a Senior Data Scientist at Pivotal, a division of EMC. Her work has mainly focused on helping companies extend their analytical capabilities by exploring and modeling digital data, from enabling a digital media agency to hypertarget its online campaigns to discovering new insights into online conversion drivers for a large retail bank. Previously, she worked as a researcher at eHarmony and Fox Interactive Media, where she leveraged massive datasets up to the petabyte level for marketing optimization, fraud detection, and ad monetization products. Noelle holds an A.B. in Applied Mathematics and Physical Anthropology from Washington University in St. Louis and an M.S. in Applied Mathematics from Cal Poly Pomona.
Jarrod came to Pivotal after working as an Analytics Consultant at a marketing company. He received an undergraduate degree in mathematics from Kennesaw State University, with additional study in statistics and business. His experience has centered on marketing analytics: preparing, analyzing and modeling data for a wide range of marketing applications. Past projects have included churn/attrition modeling, lifetime value estimation, net lift modeling, market segmentation, and market basket analysis. He has a strong interest in distributed computing and has spent time working with Hadoop and MPP databases. He also finds technology integration fascinating and has worked extensively on extending tools such as SAS by incorporating Java, R, and Excel functionality.
Prior to joining Pivotal, Cao Yi worked at Energy Market Company in Singapore for two years, where he was responsible for maintaining Singapore's wholesale electricity market clearing engine. He has a Ph.D. in Operations Research from the National University of Singapore and a master's degree in Statistics. His areas of expertise are optimization and statistical data mining.
Jin received a Master’s degree in Artificial Intelligence from the Katholieke Universiteit Leuven, Belgium and a Ph.D. in Machine Learning from the Australian National University. Before joining Pivotal in 2011, she was a postdoctoral researcher with the School of Computer Science at the University of Adelaide (UoA), Australia. Her research is in stochastic (online) learning, optimization, and robust statistics for Machine Learning and Computer Vision problems. Jin’s work has been published in top international computer science journals and conferences. She continues to hold an adjunct lecturer position with UoA.
Healthcare Fraud/Waste/Abuse (FWA) Detection
Fraudulent, wasteful, and abusive behaviors account for an estimated one-third of the $2.2 trillion spent on healthcare in the US each year. Unfortunately, medical fraud, waste, and abuse (FWA) is difficult to detect, with companies resorting to a “pay and chase” approach to adjudicating claims. As such approaches are ineffective, a Specialty Benefits Management (SBM) company approached the Pivotal Data Science team to develop real-time models to detect FWA prior to authorization.
Since there is no standardized set of training data labeled as fraudulent or wasteful by medical professionals, the solution needed to link physicians' real-time ordering data to patient history and physician profile data. The Pivotal Data Science team was able to build hundreds of models with custom feature sets, trained in three minutes. They developed models to predict physician responses, allowing the SBM to triage deviations from expected responses.
Cross-Channel Customer Engagement
A major health insurance company engaged Pivotal Data Science to reduce call center costs through cross-channel customer engagement. Since each call to the call center represents a significant cost, the company wanted to determine when customers were using the call center when they could have otherwise used the company’s website. The call logs consisted of unstructured text requiring considerable preprocessing, which presented a significant challenge to the company. Pivotal Data Science used logistic regression to predict whether a customer was unlikely to find relevant information on the web, prompting a call. The team also created a topic model over the call logs to learn the common questions and issues customers were calling about, identifying topics they had trouble finding on the company’s website.
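A minimal sketch of such a classifier, assuming invented call snippets and a plain bag-of-words logistic regression trained by gradient descent (the production model and its features are not described in the source):

```python
import math

# Invented call-log snippets, labeled 1 if the need could have been met
# on the website, 0 if it genuinely required an agent (toy data).
calls = [
    ("reset my password", 1),
    ("where do I find my member id card", 1),
    ("update my mailing address online", 1),
    ("dispute a denied claim with an agent", 0),
    ("complex billing question about two policies", 0),
    ("speak to someone about coverage appeal", 0),
]

vocab = sorted({w for text, _ in calls for w in text.split()})

def featurize(text):
    words = set(text.split())
    return [1.0 if w in words else 0.0 for w in vocab]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Batch gradient descent on the logistic (cross-entropy) loss.
w, b, lr = [0.0] * len(vocab), 0.0, 0.5
for _ in range(500):
    gw, gb = [0.0] * len(vocab), 0.0
    for text, y in calls:
        x = featurize(text)
        err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
        gw = [gi + err * xi for gi, xi in zip(gw, x)]
        gb += err
    w = [wi - lr * gi / len(calls) for wi, gi in zip(w, gw)]
    b -= lr * gb / len(calls)

def p_web_deflectable(text):
    """Probability the caller could have self-served on the website."""
    x = featurize(text)
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
```

In practice one would regularize, use TF-IDF or richer features, and train in-database at scale; this toy version only illustrates the shape of the approach.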
Network Intrusion Detection
Malware and advanced cyber threats can remain undetected on a large network for a long time, posing significant threats to enterprises. According to Mandiant, in 2012 malware remained on compromised networks for a median of 416 days. Covert threats employ increasingly sophisticated techniques to bypass traditional security appliances. One of the world’s largest health care providers engaged Pivotal Data Science to develop predictive models to detect risks of advanced cyber threats within the company’s large heterogeneous environment and reduce malware’s undetected “free time” on the network.
Pivotal Data Science built a new behavioral intrusion detection framework for the customer based on machine learning, graph theory and security research. The team designed operational components of the health care provider’s next-generation SIEM, and engineered a full-featured, custom social graph-based intrusion model to help identify risky behavior on the network. As a result, the solution engineered by Pivotal Data Science identified breaches that went undetected by all of the health care provider’s existing security products.
Predicting Commodity Futures through Twitter
Always seeking new sources for tips, commodity traders have turned to Twitter to tease out hot leads and emerging trends soon to affect the market. A major agribusiness company turned to the Pivotal Data Science team to develop models to predict the price of commodity futures from tweets. While ingesting a stream of tweets filtered by keyword or user is a common use case, reaping insight from the Twitter stream is a greater challenge. Language on Twitter doesn’t adhere to common rules of spelling and grammar, and is free-form and unstructured. Moreover, there is no domain-specific labeled corpus of tweet sentiment, requiring semi-supervised machine learning techniques.
To address these challenges, the Pivotal Data Science team developed a solution for the customer blending sentiment analysis, tweet metadata, and text regression algorithms. A filtered PowerTrack stream of tweets is ingested into the Pivotal Greenplum DCA, which runs a blend of models to deliver predictive insight to a dashboard interface. The team established a foundation for blending structured data, such as market fundamentals, with the unstructured data from tweets, enabling the agribusiness company to predict commodity futures more efficiently and effectively.
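One simple way to couple lexicon-based sentiment with text regression can be sketched as follows; the lexicon, tweets, and price moves below are all invented for illustration, and the engagement's actual models were certainly richer:

```python
# Toy sentiment lexicon for agricultural commodities (invented).
POS = {"bumper", "surplus", "strong", "record", "rally"}
NEG = {"drought", "shortage", "blight", "weak", "frost"}

def sentiment(tweet):
    """Lexicon score: positive hits minus negative hits."""
    words = tweet.lower().split()
    return sum(w in POS for w in words) - sum(w in NEG for w in words)

# (tweets for the day, next-day futures price change) pairs, invented.
days = [
    (["drought spreading in the corn belt", "weak yield reports"], -1.9),
    (["record surplus forecast", "strong export demand"], 2.0),
    (["frost warning for wheat", "shortage fears grow"], -1.5),
    (["bumper harvest expected", "futures rally on supply news"], 1.7),
]

# Aggregate sentiment per day, then ordinary least squares: y = a*x + b.
xs = [sum(sentiment(t) for t in tweets) / len(tweets) for tweets, _ in days]
ys = [change for _, change in days]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
    sum((x - mx) ** 2 for x in xs)
b = my - a * mx

def predict_change(tweets):
    """Predicted price change from a day's tweets."""
    x = sum(sentiment(t) for t in tweets) / len(tweets)
    return a * x + b
```

A real pipeline would add semi-supervised sentiment labeling and tweet metadata as features; the sketch just shows the sentiment-to-regression plumbing.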
Vaccine Potency Prediction
There is an industry-wide need to keep the cost of vaccine manufacturing down in order to remain profitable, while reengineering processes to enable the delivery of drugs to patients on different continents. Pivotal Data Science worked with a major pharmaceutical company to predict the potency of vaccines and gain insights into the manufacturing process in order to fine-tune vaccine production.
The company aimed to predict potency and antigen levels of live virus vaccines based on manufacturing sensor data and manual data collected throughout the process. Unfortunately, there were data quality issues due to manual data entry performed with varying consistency. Moreover, the company’s data model was not well suited to analytical queries. Pivotal Data Science introduced a new data model optimized for accessibility and analytics. The team built automated outlier detection and correction methods to address the manual data entry quality issues, and devised imputation methods to deal with data completeness issues. As a result, they were able to build predictive models yielding highly accurate results, helping improve manufacturing efficiency and vaccine quality.
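The outlier-correction and imputation step might look like the following sketch, using a median/MAD (median absolute deviation) rule, a common robust choice; the engagement's actual methods are not specified in the source, and the readings below are invented:

```python
from statistics import median

def clean_series(values, k=3.5):
    """Flag manual-entry outliers via the modified z-score (median/MAD),
    then impute flagged and missing readings with the inlier median."""
    present = [v for v in values if v is not None]
    med = median(present)
    mad = median(abs(v - med) for v in present)

    def is_outlier(v):
        if mad == 0:
            return False
        return 0.6745 * abs(v - med) / mad > k

    inliers = [v for v in present if not is_outlier(v)]
    fill = median(inliers)
    return [fill if v is None or is_outlier(v) else v for v in values]

# Example: a sensor series with one missing entry and one entry error.
cleaned = clean_series([10.1, 9.8, 10.0, None, 55.0, 10.2])
```

Here the 55.0 entry error and the missing reading are both replaced by the median of the remaining inliers before modeling.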
Credit Risk Assessment and Stress Testing
A global financial services provider wanted to speed up the process of compliance reporting and stress testing for Basel III, a global regulatory standard. Running the calculation procedures on the customer’s legacy database proved so time-consuming that analytics had to be performed in an overnight batch mode. By working with the Pivotal Data Science team, the provider was able to implement risk asset calculation and stress testing using the Greenplum database. Three years of data were processed in well under two minutes, significantly faster than the customer’s existing procedures. The team also connected an “in-database” visualization tool to the Greenplum database via ODBC, enabling on-demand reporting and visualization and speeding up the process of testing and reporting.
Network User Behavior Anomaly Detection
Data science presents the opportunity to proactively discover security threats that might otherwise go undetected. Pivotal Data Science worked with a major financial services enterprise that wished to improve detection of anomalous user behavior on its global network. No existing SIEM solution offered the scale, performance, or breadth of data sources that the enterprise required: its network comprises thousands of devices and generates billions of events within a six-month period.
The enterprise required a predictive analytics solution to aid in spotting new threats quickly and effectively. The solution would model a baseline for user network behavior, and enable anomaly detection in an adaptive and scalable architecture. Pivotal Data Science built an innovative Graph Mining-based algorithmic framework using advanced machine learning which modeled both network topology and temporal behaviors. Through parallel model training and the risk scoring of network behavior, the solution proactively identified anomalous user behavior.
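As a toy illustration of graph-based behavioral scoring (not the production framework), one can baseline a bipartite user-to-host access graph and score a new event by whether it departs from the user's history and how rare the host is among peers. All users, hosts, and events below are invented:

```python
from collections import defaultdict

# Hypothetical baseline window of (user, host) access events.
baseline = [
    ("alice", "mail"), ("alice", "wiki"),
    ("bob", "mail"), ("bob", "wiki"),
    ("carol", "mail"), ("carol", "hr-db"),
]

# Build both sides of the bipartite access graph.
user_hosts = defaultdict(set)   # hosts each user normally touches
host_users = defaultdict(set)   # users who normally touch each host
for user, host in baseline:
    user_hosts[user].add(host)
    host_users[host].add(user)
n_users = len(user_hosts)

def anomaly_score(user, host):
    """0 for behavior already in the user's baseline; otherwise a score
    in (0, 1] that grows as fewer peers ever touch that host."""
    if host in user_hosts[user]:
        return 0.0
    return 1.0 - len(host_users[host]) / n_users
```

A production system would add temporal features, parallel model training, and risk aggregation per user; the sketch only shows why graph structure helps separate routine access from novel, rare access.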
Traffic Velocity Prediction
The Pivotal Data Science team was approached by the research and development division of a major automotive company to help solve the problem of predicting traffic velocity patterns. Using multiple data sources, including GPS records and weather conditions, the division wished to find causal relationships between velocity and external factors in an automated fashion. The available data presented a number of challenges: multiple data feeds of arbitrary sizes, often unrelated to traffic volume, and limited access to metadata on roads and individual vehicles.
The Pivotal Data Science team built an in-database model workflow using supervised and unsupervised machine learning techniques simultaneously, which was capable of modeling months of data in minutes. Working with the automotive company’s R&D division, the team established a framework for predicting traffic velocity that provides interpretable and actionable results.
Real-time alert system for high-risk respiratory patients
The cost of treating respiratory patients is greatly increased by urgent care visits, many of which can be avoided through preventative treatment. A large vertically-integrated healthcare provider engaged Pivotal Data Science to help identify urgent care risk factors and propose interventions to reduce the likelihood that patients would require urgent care visits. The team built models to predict the risk of acute care encounters using prescription refill history, air quality data, and socioeconomic indicators. This enabled a real-time application that alerts patients and physicians to take preventative action, reducing costly and preventable urgent care visits.
Text Analytics for Churn Prediction
A major telecom company wanted to reduce customer churn, the rate at which customers reduce or stop their business with an organization, through more accurate predictive models. The company’s previous attempts were limited in their effectiveness, with existing models using only structured features; manually entered call center memos were unstructured and full of typos. Pivotal Data Science built sentiment analysis models to predict churn and topic models to understand the topics of conversation in call center memos. As a result, the telecom company achieved a 16% improvement in area under the ROC curve for churn prediction, allowing it to better identify common reasons for customer loss and proactively address those issues.
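The ROC metric used to report that gain can be made concrete: area under the ROC curve (AUC) is the probability that a randomly chosen churner is scored higher than a randomly chosen non-churner. A minimal computation, with invented labels and scores:

```python
def auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity:
    the fraction of (positive, negative) pairs the model orders correctly,
    counting ties as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy churn labels (1 = churned) and model scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.91, 0.74, 0.45, 0.62, 0.30, 0.70]
```

An AUC of 0.5 is chance-level ranking and 1.0 is perfect separation, which is why a 16% lift in this metric translates directly into better-prioritized retention outreach.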
Shrinkage Reduction in Retail
A major supermarket chain approached the Pivotal Data Science team with the goal of reducing shrinkage, the loss of inventory due to administrative errors, employee mistakes, or theft. The chain’s existing system relied upon a manually prescribed set of rules based on model stores, and explained only slightly more than half of the observed shrinkage. The company wanted to reduce shrinkage by identifying and better understanding clusters of products and stores with similar shrinkage characteristics. The Pivotal Data Science team built a clustering algorithm over shrinkage-to-sales data from the chain’s stores, grouped by product type. The algorithm located themes common among particular product clusters, enabling the chain to identify similar shrinkage patterns across stores.
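A minimal sketch of the grouping step, assuming invented shrinkage-to-sales ratios and a simple one-dimensional k-means with two clusters (the actual algorithm and data are not described in the source):

```python
# Hypothetical shrinkage-to-sales ratios by product category (invented).
ratios = {
    "razor blades": 0.080, "infant formula": 0.070, "spirits": 0.075,
    "canned soup": 0.010, "pasta": 0.012, "cereal": 0.009,
}

def two_means(data, iters=20):
    """1-D k-means with k=2: split products into low- and high-shrinkage
    groups by alternating assignment and center updates."""
    vals = sorted(data.values())
    centers = [vals[0], vals[-1]]          # initialize at the extremes
    groups = ([], [])
    for _ in range(iters):
        groups = ([], [])
        for name, v in data.items():
            idx = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            groups[idx].append(name)
        centers = [
            sum(data[n] for n in g) / len(g) if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return groups

low, high = two_means(ratios)
```

With realistic data the clusters surface the "theft-prone" categories (small, high-value items) without anyone hand-writing rules per store.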
Analytics and Insight for the Smart Power Grid
Silver Spring Networks, the leading provider of smart grid platforms for power utilities, required a data analytics platform to transform the continuous two-way stream of data between a utility and devices on a grid into actionable insight. With these real-time data transactions producing multiple petabytes of data, Silver Spring Networks engaged the Pivotal Data Science team to implement a highly scalable compute and analytics infrastructure. The team implemented a solution using the Pivotal Data Computing Appliance (DCA) and worked with Silver Spring Networks to develop intricate analytic schemas, producing useful datasets that help utilities detect issues ranging from power outages to theft. The Pivotal Data Science team continues to collaborate with Silver Spring Networks to develop analytic solutions that help realize a more intelligent, efficient and reliable power utility system.
Obama 2012 CTO Harper Reed: “Big Answers Rather Than Big Data”
The Obama 2012 Presidential campaign has been hailed as the most innovative and data-driven to date. Headed by CTO Harper Reed, the campaign placed a premium on multivariate testing and user experience, demanding that his team reap insights from a huge amount of data. This required people who thrived in a high-pressure, constantly changing work environment, facing a hard deadline with no room for error. It demanded a technology infrastructure that wouldn’t go down on election day. And it required that the campaign run analytics on a wealth of data from many sources — social media, SMS, volunteer canvassing, poll results — and deliver actionable insights in real time.
As the keynote speaker at the Pivotal HD announcement on Monday, February 25, 2013, Reed delivered what he called “a Big Data intervention.” Reed urged the audience to move the conversation beyond Big Data, toward what he called “Big Answers.” He noted that technologists are “often bad at listening when it comes to data,” and said that practitioners “should be using these insights from data to do more listening.” He stated that technologists must ask themselves, “‘How do we use targeting to have a conversation?’”
Big Answers Rather Than Big Data
“It seems to me that the conversation about Big Data came out of it being hard to store,” Reed said in an interview with Pivotal P.O.V. “But the thing is, storage doesn’t matter, because these and other companies have solved this problem. Because it’s largely solved, it pushes us to this idea that I think people invest less in, which is Big Answers: How do you couple this data with answers? I think people say Big Data when they really mean answers. This should be a conversation of how you get the best answers.”
“On the campaign, it was very important to us that we focused on the answers, so our analytics team was all about giving us answers. Every day they would give us a brief that would say, ‘we need to put more people in Florida’ or ‘we need to do more media here,’ or ‘when you’re on the TV, here is who your audience is.’ It was about giving us actual information that we could react to and act upon, answers to an actual question. The question wasn’t ‘how big is your database?’ The questions we focus on as technologists, we forget that the reason we’re here is to get better answers and insights.”
From Many-to-Many to Many-to-One: How Data Can Push Conversations Closer to Individuals
“I realized that my entire career has been about asking, ‘how do we push the conversation closer and closer to an individual?’ The idea of microlistening came from Tim O’Reilly. I was at a Foo Camp, and Tim O’Reilly said, ‘I’m tired of targeting, I want more listening.’ I started looking closely at all the listening we were doing, whether it was on Twitter or knocking on doors.”
“If you knock on a door and somebody says, ‘I’m really interested in health care,’ then when you ask, ‘who’s interested in health care in this area,’ you have that person and you have something to react to, and you can move that conversation closer. You can then target them to have a loop, a conversation.”
“People are focusing more on, ‘how can I show them an ad?’ They’ll say that a person needs to see it 12 times or 10 times or whatever to impact them, and that’s great — we still need ads and microtargeting is cool — but I think what’s important is asking, ‘how do we have a conversation, and how do we make it so it’s on an individual level?’”
Trusting Your Users
“If you have microtargeting in your organization, how do you reflect that into listening? First of all, you have to trust your users. You have to want to have that conversation. In a campaign, that is the most important thing, for us to listen to people, because we’re representing people — we’re trying to participate in a representative government.”
“That’s where I think it gets interesting, and technology is bringing us towards that, where we can target so specifically that you start to have microconversations and these tiny, many-to-one interactions. Twitter obviously does this on a grand social scale, and we can take advantage of that, but there’s a lot of opportunities there that people are missing.”
Transforming medicine through cutting-edge genomics research
The Human Genome Project, completed in 2003, stands as one of humanity’s greatest achievements, a 13-year effort to sequence all of the 20,000–25,000 genes and the three billion chemical base pairs that make up human DNA. As impressive as this accomplishment was at the time, mapping the genome was only the beginning. The Human Genome Project revealed that genes comprise only a small portion of the human genome, and that many more fundamental elements remain to be identified.
Identifying these elements, and understanding how they operate, is one of the top priorities for the Broad Institute, a genomic medicine research center composed of scientists from the Massachusetts Institute of Technology (MIT) and Harvard University. The Broad Institute seeks to achieve a comprehensive understanding of the elements in the human genome and to identify how cell circuits process information. This research not only expands our understanding of the genome, but also directly informs the Broad Institute’s search for therapeutic applications. Through a number of collaborative projects, the Institute aims to identify the cellular mutations that cause cancer, uncover the molecular basis of the viruses, bacteria, and other pathogens that cause infectious disease, and transform how genomic research informs the development of pharmaceutical drugs.
The Broad Institute is the outcome of collaboration between Harvard and MIT scientists over the past decade. During the ’90s, the Whitehead Institute/MIT Center for Genome Research served as the flagship of the Human Genome Project and inspired early genomic medicine collaborations with Harvard Medical School. The success of these projects, as well as those of Harvard Medical School’s Institute of Chemistry and Cell Biology, demonstrated the need for a formalized partnership between Harvard and MIT scientists. The Broad Institute was founded in 2003 to tackle major challenges in molecular medicine using methodologies that are open, collaborative, interdisciplinary, and scalable.
Genome biology and cell circuits research pose one of the most significant big data challenges today. The Broad Institute’s data footprint grew to eight petabytes in the past year, with a throughput that doubles every five months. Matthew Trunnell, Manager of Research Computing at the Broad Institute, describes the unique data challenges posed by this research: “Applications of next-generation DNA sequencing at the Broad Institute and elsewhere require an unprecedented amount of data to be stored and managed in a high-performance research and bioinformatics pipeline.” To contend with these ever-growing genome-related datasets, researchers at the Broad Institute rely on Isilon clustered storage and have developed more than 30 custom software tools for highly specialized analysis of the data. In line with its commitment to open and collaborative research, the Institute makes the software and data available for all researchers to download from its website.
We enjoy a unique opportunity in human history to understand the molecular building blocks of life and apply these insights to the improvement of therapeutic medicine. The Broad Institute understands this, boldly transforming our understanding of the human genome while accelerating the treatment of cancer, viruses and other serious diseases.
Data Science for the Public Good
Over the course of four seminars, Code for America’s Big Data for the Public Good series presented a rare opportunity for leading data science thinkers, innovators, and practitioners to explore how the field can serve the public interest. The series hosted Michal Migurski and Eric Rodenbeck of Stamen Design, Jake Porway of DataKind and formerly The New York Times, wiki inventor and Nike Code for a Better World Fellow Ward Cunningham, and Jeremy Howard, President and Chief Scientist at Kaggle. Though the diverse selection of speakers explored the topic from a variety of perspectives, a set of recurring themes arose during the talks.
Data Science is Storytelling
As big data proliferates, new approaches are required to communicate the insights it reveals. Interactive maps, data visualization, and infographics are tools to clarify complexity, placing data scientists in the role of storyteller. Referencing Hans Rosling, Stamen’s Rodenbeck emphasized that “narrative is critical” in order to provide context and effectively communicate what a particular dataset demonstrates to governments, social organizations, and citizens. Stamen learned this lesson when a city growth visualization tool the studio built for Trulia drew a backlash from residents, who believed the application visualized their community as if it were a missile target in a video game.
Cunningham emphasized the “storytelling aspect” of data science while discussing the Smallest Federated Wiki platform he developed at Nike, which allows companies and the public to share and collaborate on the analysis of data sets. The platform’s federated approach, Cunningham explained, allows other users to assess the quality and accuracy of analyses. “There are a lot of different stories to tell about any particular piece of data,” he said of the advantages of federation. Through analysis, narratives emerge. “As we find our way through the data,” he explained, “we can say, ‘here’s a visualization that can tell this story and there’s a visualization that can tell that story.’”
Data Empowers Citizens
Stamen’s Migurski emphasized the value of establishing a dialogue between citizens and government institutions based upon data. Data empowers citizens to advocate for the needs of their communities, and reveals what needs are not being addressed. Sharing what he learned while building the crime-mapping application Oakland Crimespotting, Migurski identified four best practices for working with government data. He stated that tools “must demonstrate the impact by linking to truths shared within the communities served,” be stable and reliable, refer to an official version that can be verified and supported, and remain contextually relevant. Porway spoke of the wealth of public data that goes untapped, lamenting undirected government or organization data dumps that are “like giving crude oil to people.” “Open data is not usable data,” he warned, advocating for an ongoing dialogue between government agencies, social organizations, and data scientists. “By bridging these communities, you’re starting to make that data usable,” he said, increasing the likelihood it can serve citizens.
Bridging the Data Science Gap
There is an abundance of public data, but a lack of skilled practitioners to make sense of it. This presents an opportunity for data scientists to use their skills to serve the public interest. Porway noted that in many social organizations, “data and skills are often siloed from one another.” This creates a risk that the wealth of information these organizations produce will become irretrievable data exhaust.
“On the one hand, we have a group of people who are really good at looking at data, really good at analyzing things, but don’t have a lot of social outputs for it,” Porway said. “On the other hand, we have social organizations that are surrounded by data and are trying to do really good things for the world but don’t have anybody to look at it.” Porway sees a network of “transformative communities” emerging to address this issue, within which government officials, representatives of social organizations, data scientists, researchers, and journalists “are coming together for a common goal and sharing across those boundaries to do more.”
One way to connect data scientists with institutions and organizations that lack skilled practitioners is the competition model established by Kaggle. Howard explained that Kaggle harnesses practitioners’ competitive impulse and their desire “to hack at interesting problems and interesting code.” He noted that in “cause organizations where they don’t have people working on this stuff, they often don’t see the forest for the trees,” unaware of the value of the data available.
Howard cited the EMC Data Science Global Hackathon for Air Quality Prediction, a weekend-long competition that offered participants access to EPA Air Quality Index data for Chicago, as an example of how the Kaggle model can serve the public interest. Revealing the transformative potential of data science in service of the social good, Howard noted that the competitive hackathon worked with “a data set which is local in scope,” which “you can use at a local level, yet you can also take the results and apply them really powerfully throughout the world.” As grand as that may sound, we are only on the cusp of what can be done with big data.
As demonstrated by the thinkers and practitioners who spoke at the Big Data for the Public Good seminar series, there exists a community of data scientists who are as passionate about serving the public interest as they are about datasets. The potential of such collaborations is nothing less than transformative.
Data-driven defense against global pandemics
The world is connected on an unprecedented scale, and as a result the threat of global pandemics has increased dramatically. Traditionally, disease control efforts have reacted to outbreaks, but these delayed response times are too slow for a global population that is always on the move. Pandemics such as HIV/AIDS and H1N1 spread from one region to the next at a speed inconceivable a century ago. Escalating the threat, bacteria and other pathogens are becoming increasingly resistant to antibiotics and other drugs, rendering traditional countermeasures ineffective. It’s a new world in need of a new approach to disease control.
Instead of reacting, Global Viral Forecasting (GVF) harnesses big data to prevent global pandemics before they start. In GVF’s view, “Dramatic failures in such pandemic control, such as the ongoing lack of success in HIV vaccine development twenty-five years into the pandemic, have shown that the wait-and-respond approach is not sufficient.” Founded by world-renowned virologist Dr. Nathan Wolfe, GVF gathers its data from multiple sources, including viral discovery in the field, anthropological research in disease hotspots to identify how viruses cross from wildlife to humans, and tracking social media trends to predict and prevent outbreaks. Nearly 75% of all emerging diseases are passed between animals and humans; cross-species transmission is believed to be the origin of HIV. Using a network of viral listening posts located in Cameroon, the Democratic Republic of Congo, Madagascar, China, Malaysia, Sierra Leone, and Gabon, GVF aims to stop viruses before they pass from animal to human.
For GVF, both the physical and the digital worlds are the lab, creating large sets of data that enable real preventative measures. But there is always a need for more data to help predict and prevent the next global pandemic. In “Crunching Digital Data Can Help The World”, an opinion piece Wolfe wrote with Lucky Gunasekara and Zachary Bogue for CNN.com, they characterize big data as “a source and multiplier of social good. Big data can help us change our world for the better.” Leveraging large sets of data from multiple sources — field work, research, and trends on the open web — Wolfe and his team are able to make accurate and actionable predictions using engineering and software techniques that were nascent a decade ago, when Wolfe founded GVF. To improve the effectiveness of its predictions, GVF is constantly looking for new sources of relevant data. Moving forward, GVF aims to track purchases of over-the-counter medications and to build social interaction models from anonymous mobile phone data. As Wolfe notes, “the GVF team incessantly talks about needing more–they need big data. Data has power, but it is difficult to predict precise benefits without actually crunching the data.”
GVF’s efforts have borne real results. On the strength of its field work, GVF has tracked how viruses spread from bushmeat to humans. It has identified a fifth form of human malaria and studied how the disease originated. It has also identified how disease control efforts failed to contain swine flu, and has proposed more proactive approaches to preventing and treating future outbreaks.