The Research Behind the Parenting Books Doesn’t Check Its Own Work
Training Minds | Essay 3
Here is a fact about education research that the parenting publishing industry would prefer you not think about too carefully.
Less than two-tenths of one percent of papers published in education’s top journals are replication attempts. Not two percent. Not twenty percent. Two-tenths of one percent. For every five hundred papers claiming to show that something works, roughly one paper has gone back to check whether that’s true.
In psychology, which is itself in the middle of a well-documented credibility crisis, the replication rate is about 1 percent. Education research, by this measure, runs at least five times lower. The field that produces the evidence base for how we teach children is one of the fields least structured to verify its own findings.
I didn’t know this when I walked into a bookstore with an eighteen-month-old at home and a list of questions I needed answered. I just knew that the books on the shelf disagreed with each other in ways that felt more like ideology than science. The Montessori advocates and the structured-learning advocates and the screen-time alarmists were not having a scientific disagreement. They were having a values disagreement dressed up in the language of research, and the research underneath it was thinner than any of them were admitting.
The replication crisis in psychology hit the mainstream in the early 2010s. In 2015, a large-scale project that had systematically tried to reproduce a hundred published findings reported its results. Fewer than forty percent replicated. This was shocking to people outside the field. It was not shocking to people inside it, who had been quietly aware for decades that the incentive structure (publish novel findings, not boring confirmations) was producing a literature full of results that looked true until someone checked.
Education research shares the same incentive structure and shows similar weaknesses, often with less systematic verification. When studies are replicated at all, the results are not reassuring. A large-scale study published in *Nature* in April 2026 by Tyner and colleagues attempted to replicate 274 claims from 164 papers across 54 social and behavioral science journals. Only 55 percent of claims replicated at the claim level. At the paper level it drops further: nearly half of the tested papers failed to replicate even one of their core claims. And among the findings that did replicate, the effect sizes were roughly half the size of the originals. The field is not just producing findings that don’t hold up. It is producing findings that, even when they hold up, overstate how much the effect matters.
(A note on sources: the 40 percent psychology figure is from the 2015 Open Science Collaboration paper, which tried to reproduce a hundred published psychology findings. The Tyner et al. paper covers social and behavioral sciences broadly, including but not limited to education.)
And these numbers only cover the rare cases where someone bothered to check. For the vast majority of education findings, we simply don’t know. They’ve been stated once, cited by everyone who came after, and never tested again.
The specific mechanism matters. It’s not primarily fraud, although that happens too. It’s something more structural: small sample sizes, flexible analysis choices, publication pressure, and the simple fact that an exciting result gets published and then repeated widely in citations while a null result gets filed away. The system is optimized to produce interesting claims, not reliable ones.
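For readers who want to see that mechanism in action, here is a minimal simulation sketch. It is an illustration, not a model of any real study. Every number in it is an assumption chosen for the example: a small true effect of 0.2 standard deviations, twenty children per group, and a crude rule that only results clearing a conventional significance bar get "published." The function name `simulate_published_effects` is hypothetical.

```python
import random
import statistics


def simulate_published_effects(true_effect=0.2, n_per_group=20,
                               n_studies=2000, seed=0):
    """Run many small two-group studies and 'publish' only the significant ones.

    All defaults are illustrative assumptions, not estimates from real data.
    Returns (mean of all estimates, mean of published estimates,
             share of studies published).
    """
    rng = random.Random(seed)
    all_estimates = []
    published = []
    for _ in range(n_studies):
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        treated = [rng.gauss(true_effect, 1.0) for _ in range(n_per_group)]
        diff = statistics.mean(treated) - statistics.mean(control)
        # Standard error of the difference between two group means.
        se = (statistics.variance(treated) / n_per_group
              + statistics.variance(control) / n_per_group) ** 0.5
        all_estimates.append(diff)
        # Crude publication filter: only positive results that clear
        # roughly the usual 5 percent significance threshold get through.
        if diff / se > 1.96:
            published.append(diff)
    return (statistics.mean(all_estimates),
            statistics.mean(published),
            len(published) / n_studies)


if __name__ == "__main__":
    avg_all, avg_published, share = simulate_published_effects()
    print(f"True effect:                 0.20")
    print(f"Average across all studies:  {avg_all:.2f}")
    print(f"Average across 'published':  {avg_published:.2f}")
    print(f"Share of studies published:  {share:.0%}")
```

Under these assumptions, the average across all the simulated studies lands near the true effect, while the average across the "published" subset comes out well above it. Nobody in the simulation cheats. The filter alone inflates the effect, which is one plausible reason replications tend to find smaller effects than the originals.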
Here is the part that is relevant to parents specifically.
The parenting advice industry draws heavily on educational psychology and social psychology — the parts of the research literature with the worst replication records. The famous study that gets cited to tell you screen time damages your child’s brain. The claim that reading aloud for twenty minutes a day produces specific developmental outcomes. The research suggesting that a particular toy or curriculum or teaching style gives children an advantage. These claims travel through parenting books and pediatric guidelines and mommy blogs with the authority of scientific consensus. Many of them have never been replicated. Some of them have been tested and failed.
This is not an argument against science. It is an argument for being precise about which science you are using.
The developmental science this newsletter draws on sits in a different part of the literature. Statistical learning, categorical cognition, perceptual development, early word learning — these are domains where the canonical findings have been tested across many labs, populations, languages, and decades. Saffran’s statistical learning paper has been replicated so many times it is now a textbook example. DeLoache’s dual-representation work has generated decades of follow-on research. Waxman’s word learning findings have been extended cross-linguistically. This is not the same literature as “we ran forty families through a six-week curriculum and found a statistically significant effect on vocabulary scores.” This is cumulative science with a track record. The difference is not topic. It is methodology and accumulation.
The machine learning framework adds something the developmental science alone cannot. Engineering disciplines are validated by deployment, not by p-values. A model either generalizes or it doesn’t. The finding either transfers to production or it doesn’t. You cannot p-hack your way indefinitely to a model that works in deployment. Eventually the model has to generalize outside the training set, and the result is not only a PDF with a favorable confidence interval but a system that either functions or fails in the real world. When I borrow vocabulary from ML to describe how toddlers learn, I am borrowing from a discipline that has built its credibility on a fundamentally different accountability structure.
The practical implication is not that you should distrust everything. It is that you should hold the claims in this field — including the ones in this newsletter — to a higher standard than the field typically holds itself to.
That means preferring findings that have been replicated over those that haven’t. It means being skeptical of any claim that “X minutes of Y per day produces Z outcome,” because that level of precision almost never survives contact with replication. It means noticing when a recommendation is based on a single study with forty participants, versus a finding that has been observed across dozens of labs and many populations over time. It means being especially skeptical of alarming claims about harm (“screens damage attention,” “early reading instruction is harmful,” “structured play prevents creativity”), because those claims tend to generate more fear than evidence.
And it means understanding why a framework grounded in first principles from a more reliable discipline is worth something. Not because machine learning has all the answers about child development — it doesn’t — but because it provides a way of thinking about learning that doesn’t depend on which education study happened to get published last month.
The parenting shelf is full of books that cite research with confidence. Most of them are citing a literature that, by its own admission, doesn’t check its own work.
You deserve to know that.
Training Minds is a Substack about categorical learning, the Signal Stack, and what the research actually says about how children ages 12–36 months build knowledge. Written for analytically minded parents who want frameworks and evidence, not parenting philosophy.
The next essay: why the sequence matters — why letters come after animals, why numbers are actually two separate problems — and what happens to that sequence once the 2D channel is open.
— Sandra

