Inter-American Development Bank

Gente Saludable

Why We Need a Statistical Revolution

July 20, 2015 by Guest Author


by Mark van der Laan.

My father told me the most important thing about solving a problem is to formulate it accurately, and one would think that, as statisticians, most of us would agree with that advice. Suppose we were to build a spaceship that can fly to Mars and return safely to Earth.

It would be folly indeed to make simplifying assumptions in its construction that science tells us are false. Such assumptions could spell death for the astronauts and failure for their mission. And yet, that is what many statisticians often do, sometimes citing the remark by the great 20th-century English statistician George E. P. Box that “Essentially, all models are wrong, but some are useful.”

To understand why the claim that “all models are wrong” is outdated in statistics is to understand how we are building the foundations for a revolution in method, one that uses machine learning in ways that could scarcely have been imagined by Box writing three decades ago, let alone the great progenitors of computer algorithms, such as Alan Turing.

It is a revolution that has the power to revitalize the connection between scientists and statisticians, and one that will be as central to making sense of Big Data as Big Data is central to the future of statistics and science. But in order to arrive at what I have called “targeted learning,” we need to start with the basic problem in statistical modeling.

Almost all the statistical software tools available to scientists encourage parametric modeling, and thus designing and analyzing experiments based on highly simplifying assumptions about the distribution of data that are very wrong.

The resulting epidemic of false positives (claimed findings that aren’t true) has been recognized by many, not least John Ioannidis, whose 2005 paper in PLOS Medicine, “Why Most Published Research Findings Are False,” made a compelling case for reform and drew the attention of many people beyond the practice of science and statistics to a signal problem in the production of knowledge.

One can show that the use of such guaranteed misspecified parametric models will also guarantee that for large enough sample size, the reported confidence interval will not contain the estimand (e.g., the true effect size of a new treatment for heart disease).

That is, we statisticians pride ourselves on going beyond data mining, while in truth our confidence intervals are wrong all the time.
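
To make the point about confidence intervals concrete, here is a minimal simulation sketch. It is only an illustration, and every specific in it is an assumption rather than part of the argument above: the data are drawn from an exponential distribution, the estimand is the median, and the analyst wrongly assumes a normal model (under which the median equals the mean). The function names `normal_model_interval` and `coverage` are hypothetical.

```python
import math
import random
import statistics

# Assumed truth for this illustration: data are Exponential(rate=1),
# and the estimand is the median of that distribution.
TRUE_MEDIAN = math.log(2)

def normal_model_interval(sample):
    """95% interval for the median under a (wrong) normal model.

    Under normality the median equals the mean, so the analyst reports
    the usual mean +/- 1.96 * standard error interval.
    """
    center = statistics.mean(sample)
    half_width = 1.96 * statistics.stdev(sample) / math.sqrt(len(sample))
    return center - half_width, center + half_width

def coverage(n, reps=200, seed=0):
    """Fraction of simulated intervals that contain the true median."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        sample = [rng.expovariate(1.0) for _ in range(n)]
        lower, upper = normal_model_interval(sample)
        hits += lower <= TRUE_MEDIAN <= upper
    return hits / reps

# Coverage should fall toward zero as the sample size grows, because the
# interval shrinks around the mean (1.0) rather than the median (about 0.69).
for n in (5, 50, 500, 2000):
    print(n, coverage(n))
```

In this sketch the interval becomes narrower as the sample grows, but it concentrates around the wrong value, so more data makes the reported interval more confidently wrong.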

 

Targeted Learning and Big Data

 

At the same time, we have reached a moment in history where technology can help us to transcend the limitations of the parametric model and tackle the hard estimation problems defined by a realistic statistical model and a clear definition of the desired target estimand representing the answer to the question of interest.

Starting in 2006, we developed a general statistical learning approach, targeted maximum likelihood learning, that integrates the state of the art in machine learning and data-adaptive estimation with all the incredible advances in causal inference, censored data, efficiency and empirical process theory. The integration of machine learning is done through what we called “super learning.” By being highly adaptive to the data and by targeting the learning towards the target estimand, targeted learning provides a truthful estimate and confidence interval.

The first step in super learning is the creation of a library of parametric model-based estimators and data-adaptive estimators. There are many of these automated machine learning algorithms, and the body of machine learning algorithms grows every year. The algorithms go through an iterative updating process that aims to balance bias (from not being data-adaptive enough) against variance (from being too data-adaptive).

The super-learning algorithm uses the data to decide between all weighted combinations of these algorithms. The data set is split into many different “training samples” and “validation samples” and the algorithms compete on the training samples, while their performance is evaluated on the validation samples. The weighted combination that performs the best, on average, is the winner.

Our research showed that, for large samples, this super-learner process performs as well as the best weighted combination of all these algorithms. The lesson is that one should not bet on one algorithm alone; rather, one should build a diverse, powerful library of candidate algorithms and then deploy them all competitively on the data.
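
The competition described above can be sketched in a few lines of code. This is a simplified illustration under stated assumptions, not the actual super-learner software: the three-algorithm library (`fit_mean`, `fit_linear`, `fit_knn`), the squared-error loss, the five-fold split, the coarse grid of candidate weights, and the simulated data are all hypothetical choices made for the example.

```python
import math
import random

# A tiny, hypothetical library: each "fit" function returns a predictor.
def fit_mean(xs, ys):
    m = sum(ys) / len(ys)
    return lambda x: m

def fit_linear(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return lambda x: my + slope * (x - mx)

def fit_knn(xs, ys, k=5):
    pairs = list(zip(xs, ys))
    def predict(x):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / len(nearest)
    return predict

LIBRARY = [fit_mean, fit_linear, fit_knn]

def cross_validated_predictions(xs, ys, library, folds=5):
    """Train on each training sample, predict on its validation sample."""
    n = len(xs)
    preds = [[0.0] * n for _ in library]
    for v in range(folds):
        train = [i for i in range(n) if i % folds != v]
        valid = [i for i in range(n) if i % folds == v]
        tx, ty = [xs[i] for i in train], [ys[i] for i in train]
        for j, fit in enumerate(library):
            model = fit(tx, ty)
            for i in valid:
                preds[j][i] = model(xs[i])
    return preds

def simplex_grid(m, steps):
    """All tuples of m non-negative integers that sum to `steps`."""
    if m == 1:
        yield (steps,)
        return
    for k in range(steps + 1):
        for rest in simplex_grid(m - 1, steps - k):
            yield (k,) + rest

def cv_risk(weights, preds, ys):
    """Average squared error of a weighted combination on validation data."""
    n = len(ys)
    return sum(
        (sum(w * p[i] for w, p in zip(weights, preds)) - ys[i]) ** 2
        for i in range(n)
    ) / n

def super_learner(xs, ys, library=LIBRARY, folds=5, steps=10):
    preds = cross_validated_predictions(xs, ys, library, folds)
    candidates = [tuple(c / steps for c in counts)
                  for counts in simplex_grid(len(library), steps)]
    # The weighted combination with the best average validation performance wins.
    weights = min(candidates, key=lambda w: cv_risk(w, preds, ys))
    models = [fit(xs, ys) for fit in library]  # refit on all the data
    return weights, lambda x: sum(w * m(x) for w, m in zip(weights, models))

# Simulated data, assumed purely for illustration.
rng = random.Random(0)
xs = [rng.uniform(-3, 3) for _ in range(120)]
ys = [math.sin(x) + rng.gauss(0, 0.3) for x in xs]
weights, predict = super_learner(xs, ys)
print("weights (mean, linear, knn):", weights)
```

Because the grid of candidate weights includes the combinations that put all weight on a single algorithm, the selected combination can never have a worse cross-validated error than any individual algorithm in the library, which is the finite-sample analogue of the large-sample result described above.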

This field of targeted learning is open for anyone to contribute to, and the truth is that anybody who honestly formulates the estimation problem and cares about learning the answer to the scientific question of interest will end up having to learn about these approaches and can make important contributions to our field.

In sum, science needs big data and statistical targeted learning—but statisticians and data scientists will have to rise to the challenge if science as a whole is to thrive.

Mark van der Laan is the Jiann-Ping Hsu/Karl E. Peace Professor of Biostatistics and Statistics at the University of California, Berkeley. His research group is responsible for developing the super and targeted learning statistical approaches.

