Moore Foundation selects Matthew Stephens for Data-Driven Discovery grant

The Gordon and Betty Moore Foundation has announced the University of Chicago’s Matthew Stephens as the recipient of a Moore Investigator in Data-Driven Discovery award. Stephens, a professor in statistics and human genetics, is among 14 scientists from academic institutions nationwide who will receive a total of $21 million over five years to catalyze new data-driven scientific discoveries. Stephens’ grant is for $1.5 million.

These Moore Investigator Awards are part of a $60 million, five-year Data-Driven Discovery Initiative within the Gordon and Betty Moore Foundation’s Science Program. The initiative, one of the largest privately funded data scientist programs of its kind, is committed to enabling new types of scientific breakthroughs by supporting interdisciplinary, data-driven researchers.

“Science is generating data at unprecedented volume, variety and velocity, but many areas of science don’t reward the kind of expertise needed to capitalize on this explosion of information,” said Chris Mentzel, program director of the Data-Driven Discovery Initiative. “We are proud to recognize these outstanding scientists, and we hope these awards will help cultivate a new type of researcher and accelerate the use of interdisciplinary, data-driven science in academia.”

Stephens is a data scientist who develops statistical and computational analysis tools for the large datasets being generated in the biological sciences. Over the last 15 years, Stephens and his collaborators have made seminal contributions to several problems in population genetics, including identifying structure, or clusters, in genetic data, and modeling correlations among genetic variants.

The methods for identifying structure, which Stephens developed with collaborators Jonathan Pritchard, Peter Donnelly and Daniel Falush, have driven scientific discoveries in hundreds of organisms. Science papers in 2002, 2003 and 2004 used their method to elucidate the genetic structure of human populations, the Helicobacter pylori stomach bacterium and domestic dog breeds, respectively. The original paper of Stephens and his collaborators has been cited more than 11,000 times. And, in an example of the potential for cross-fertilization of ideas across disciplines, similar methods have also become popular in machine learning to identify structure in large collections of text documents.

Stephens’ work modeling correlations among genetic variants began with a paper in 2003, with graduate student Na Li, PhD’03. At the time, scientists were grappling with a problem. They had an elegant model relating these correlations to the underlying recombination process, which mixes a parent’s genetic material before transmission to an offspring. But the model, based on work by UChicago’s Richard Hudson, professor in ecology & evolution, was computationally intractable for even small datasets.

Li and Stephens solved this problem by simplifying the model enough to make it computationally tractable. This new simplified model has found widespread application in the last 10 years: Stephens, Li and their collaborators used their model to demonstrate that most recombination in human genes occurs in relatively narrow channels, called “hotspots,” rather than being spread uniformly. And thousands of scientists conducting genomic studies now make regular use of these models to impute missing genotype data to substantially improve the efficacy of their studies.

Stephens’ recent focus has been on developing methods for data integration: combining information on multiple related processes. An important application of these methods, which he has been pursuing with collaborators including Pritchard, Yoav Gilad and Anna DiRienzo, is to combine information measured on cellular processes, such as gene expression and transcription factor binding, to help understand the mechanisms of genetic regulation within living cells.

The Data-Driven Discovery award will help Stephens to develop and apply improved statistical methods for this problem. The statistical challenges he faces are many. These include identifying networks of genes or regions of DNA that interact with one another; identifying the scientifically significant associations among the astronomically large numbers of associations that could occur; and integrating different types of data to understand the dynamic and causal links between them.

“The substantial, long-term funding from the Moore Foundation allows me to bring on a team of students and postdoctoral researchers to give this complex problem the attention it needs,” Stephens said.

Stephens also plans to use the Moore Foundation funding to help create web-based resources for data scientists to collaborate and compare methods, helping to improve the productivity of data-driven discovery across the wider data science community. “I believe that we can substantially improve the way that statistical and computational methods research progresses, by exploiting new internet technologies and ideas from crowdsourcing, open science and reproducible research,” Stephens said.

Intel co-founder Gordon Moore and his wife Betty established the foundation, based in Palo Alto, Calif., to create positive change in science, environmental conservation and patient care.