by Mark van der Laan.
My father told me the most important thing about solving a problem is to formulate it accurately, and one would think that, as statisticians, most of us would agree with that advice. Suppose we were to build a spaceship that can fly to Mars and return safely to Earth.
It would be folly indeed to make simplifying assumptions in its construction that science tells us are false. Such assumptions could spell death for the astronauts and failure for their mission. And yet that is what many statisticians often do, sometimes citing the remark by the great 20th-century English statistician George E.P. Box that “Essentially, all models are wrong, but some are useful.”
To understand why the claim that “all models are wrong” is outdated in statistics is to understand how we are building the foundations for a revolution in method, one that uses machine learning in ways that could scarcely have been imagined by Box writing three decades ago, let alone the great progenitors of computer algorithms, such as Alan Turing.
It is a revolution that has the power to revitalize the connection between scientists and statisticians, and one that will be as central to making sense of Big Data as Big Data is central to the future of statistics and science. But in order to arrive at what I have called “targeted learning,” we need to start with the basic problem in statistical modeling.
Almost all the statistical software tools available to scientists encourage parametric modeling, and thus the design and analysis of experiments based on highly simplified assumptions about the distribution of the data, assumptions that are very wrong.
The resulting epidemic of false positives (claimed findings that aren’t true) has been recognized by many, not least John Ioannidis, whose 2005 paper in PLOS Medicine, “Why Most Published Research Findings Are False,” made a compelling case for reform and drew the attention of many people beyond the practice of science and statistics to a signal problem in the production of knowledge.
One can show that using parametric models that are guaranteed to be misspecified also guarantees that, for a large enough sample size, the reported confidence interval will not contain the estimand (e.g., the true effect size of a new treatment for heart disease): the interval keeps narrowing, but around the wrong value.
That is, we statisticians pride ourselves on going beyond data mining, while in truth our confidence intervals are wrong all the time.
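The point about misspecified parametric models can be illustrated with a small simulation. The sketch below is only an example under stated assumptions, none of which come from the essay: the data are drawn from an Exponential(1) distribution, the analyst wrongly assumes a normal model, and the estimand is the median. Under the normal model the median equals the mean, so the reported 95% interval is centered on the sample mean. The function names are hypothetical.

```python
import math
import random

def normal_model_ci_for_median(sample):
    """95% interval for the median under an assumed normal model,
    where the median coincides with the mean."""
    n = len(sample)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)
    half_width = 1.96 * math.sqrt(var / n)
    return mean - half_width, mean + half_width

def coverage(n, n_sims=300, seed=0):
    """Fraction of simulated intervals that contain the true median."""
    rng = random.Random(seed)
    true_median = math.log(2.0)  # median of Exponential(rate=1)
    hits = 0
    for _ in range(n_sims):
        # Assumption for illustration: the truth is exponential, not normal.
        sample = [rng.expovariate(1.0) for _ in range(n)]
        lo, hi = normal_model_ci_for_median(sample)
        hits += lo <= true_median <= hi
    return hits / n_sims

# Coverage of the nominal 95% interval at increasing sample sizes.
coverage_by_n = {n: coverage(n) for n in (10, 100, 1000)}
for n, cov in coverage_by_n.items():
    print(f"n = {n:5d}   coverage = {cov:.3f}")
```

In this setup the interval shrinks around the mean of the exponential distribution (1) rather than its median (log 2), so coverage falls as the sample grows instead of approaching 95%.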
Targeted Learning and Big Data
At the same time, we have reached a moment in history where technology can help us to transcend the limitations of the parametric model and tackle the hard estimation problems defined by a realistic statistical model and a clear definition of the desired target estimand representing the answer to the question of interest.
Starting in 2006, we developed a general statistical learning approach, targeted maximum likelihood learning, that integrates the state of the art in machine learning and data-adaptive estimation with all the incredible advances in causal inference, censored data, efficiency and empirical process theory. The integration of machine learning is done through what we called “super learning.” By being highly adaptive to the data and by targeting the learning towards the target estimand, targeted learning provides a truthful estimate and confidence interval.
The first step in super-learning is the creation of a library of parametric model-based estimators and data-adaptive estimators. There are a lot of these automated machine learning algorithms, and the body of machine learning algorithms grows every year. The algorithms go through an iterative updating process that aims to balance bias (due to the model not being data adaptive enough) against variance (due to the model being too data adaptive).
The super-learning algorithm uses the data to choose among all weighted combinations of these algorithms. The data set is split into many different “training samples” and “validation samples”; the algorithms are fit on the training samples, and their performance is evaluated on the corresponding validation samples. The weighted combination that performs the best, on average, across the validation samples is the winner.
Our research showed that for large samples, this super-learner process performs as well as the best weighted combination of all these algorithms. The lesson is that one should not bet on one algorithm alone, but that one should use them all to build a diverse, powerful library of candidate algorithms, and then deploy them all competitively on the data.
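The cross-validation scheme described above can be sketched in a few lines. This is a toy illustration, not the authors' implementation: the three learners in the library, the simulated data, the five folds, and the coarse grid search over weights are all assumptions made for the example. Super-learner software typically finds the weights with a constrained optimizer rather than a grid.

```python
import itertools
import math
import random

def fit_mean(xs, ys):
    mu = sum(ys) / len(ys)
    return lambda x: mu

def fit_linear(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)

def fit_knn(xs, ys, k=5):
    pairs = list(zip(xs, ys))
    def predict(x):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / len(nearest)
    return predict

# Hypothetical library: one crude, one parametric, one data-adaptive learner.
LIBRARY = [fit_mean, fit_linear, fit_knn]

def cross_validated_predictions(xs, ys, n_folds=5):
    """Row i holds each learner's prediction for observation i,
    made by a fit that never saw observation i."""
    n = len(xs)
    preds = [[0.0] * len(LIBRARY) for _ in range(n)]
    for fold in range(n_folds):
        valid = [i for i in range(n) if i % n_folds == fold]
        train = [i for i in range(n) if i % n_folds != fold]
        tx, ty = [xs[i] for i in train], [ys[i] for i in train]
        for j, fit in enumerate(LIBRARY):
            model = fit(tx, ty)
            for i in valid:
                preds[i][j] = model(xs[i])
    return preds

def simplex_grid(n_learners, steps=10):
    """Non-negative weight vectors summing to 1, in increments of 1/steps."""
    for counts in itertools.product(range(steps + 1), repeat=n_learners):
        if sum(counts) == steps:
            yield tuple(c / steps for c in counts)

def cv_risk(weights, preds, ys):
    """Cross-validated mean squared error of a weighted combination."""
    return sum((sum(w * p for w, p in zip(weights, row)) - y) ** 2
               for row, y in zip(preds, ys)) / len(ys)

def super_learner(xs, ys):
    preds = cross_validated_predictions(xs, ys)
    # The winner is the combination with the lowest cross-validated risk.
    weights = min(simplex_grid(len(LIBRARY)), key=lambda w: cv_risk(w, preds, ys))
    fits = [fit(xs, ys) for fit in LIBRARY]  # refit each learner on all data
    predict = lambda x: sum(w * f(x) for w, f in zip(weights, fits))
    return weights, predict, preds

# Simulated data (an assumption for the example): a noisy sine curve.
rng = random.Random(0)
xs = [rng.uniform(0.0, 6.0) for _ in range(200)]
ys = [math.sin(x) + rng.gauss(0.0, 0.3) for x in xs]
weights, predict, preds = super_learner(xs, ys)
print("weights (mean, linear, k-NN):", weights)
```

Because each single learner corresponds to a corner of the weight grid, the selected combination can never have a higher cross-validated risk than the best individual learner in the library.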
This field of targeted learning is open for anyone to contribute to, and the truth is that anybody who honestly formulates the estimation problem and cares about learning the answer to the scientific question of interest will end up having to learn about these approaches and can make important contributions to our field.
In sum, science needs big data and statistical targeted learning—but statisticians and data scientists will have to rise to the challenge if science as a whole is to thrive.
Mark van der Laan is the Jiann-Ping Hsu/Karl E. Peace Professor of Biostatistics and Statistics at the University of California, Berkeley. His research group is responsible for developing the super and targeted learning statistical approaches.