An essay · ~10 min read

The Two Thousand Five Hundred Words That Unlock Reading

Elfrieda Hiebert and the corpus that reframed vocabulary instruction

Of the four hundred thousand words in English, which ones actually matter? A reading researcher's decades of corpus analysis turned an impossible memorisation problem into a tractable, network-based one — and made an equity argument that should be impossible to ignore.


The Impossible Math of English

Start with the number. There are roughly four hundred thousand words in English (Nagy & Anderson, 1984) [1]. Even if a teacher could somehow teach five new words every school day from kindergarten through twelfth grade, the student would graduate having learned around twelve thousand words. That leaves three hundred and eighty-eight thousand untaught. Vocabulary instruction, if you think of it as "teach more words," is a problem with no solution.
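The arithmetic is easy to check. The sketch below assumes a 180-day school year and thirteen school years (kindergarten plus grades one to twelve); neither figure is stated in the essay, and the variable names are illustrative only.

```python
# Figures taken from the essay
WORDS_IN_ENGLISH = 400_000
WORDS_PER_DAY = 5

# Assumptions, not stated in the essay
SCHOOL_DAYS_PER_YEAR = 180
SCHOOL_YEARS = 13  # kindergarten plus grades 1-12

words_taught = WORDS_PER_DAY * SCHOOL_DAYS_PER_YEAR * SCHOOL_YEARS
words_untaught = WORDS_IN_ENGLISH - words_taught
share_taught = words_taught / WORDS_IN_ENGLISH

print(f"taught:   {words_taught:,}")    # 11,700, i.e. "around twelve thousand"
print(f"untaught: {words_untaught:,}")  # 388,300
print(f"share of the lexicon taught: {share_taught:.1%}")
```

Under these assumptions, thirteen years of daily instruction reaches under three percent of the lexicon, which is the sense in which the "teach more words" framing has no solution.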

Most of the field has, for most of its history, given up on solving the math directly. Vocabulary, the conventional wisdom holds, is mostly absorbed incidentally through wide reading. Direct instruction can supply a few hundred high-value words per year. The rest takes care of itself.

Elfrieda Hiebert, founder of TextProject and a member of the Reading Hall of Fame, looked at the same problem and asked a different question. She did not ask how many words children needed to learn. She asked which words actually appeared, and how often, in the texts children were being asked to read. The answer turned out to reframe the entire discipline.

The Two Thousand Five Hundred

Working with Amanda Goodwin and Gina Cervetti, Hiebert ran a corpus analysis of thousands of school texts spanning kindergarten through college (Hiebert, Goodwin, & Cervetti, 2018) [2]. Their finding, later synthesised in The Reading Teacher, was that approximately 2,500 morphological word families account for an average of 91.5% of the words students encounter in text from grade one through college (Hiebert, 2020) [3].

Of the four hundred thousand words in English, fewer than three thousand families do nine-tenths of the work in everything a child will ever be asked to read in school.

That is a remarkable claim, and it deserves a moment's reflection. The rest of the lexicon, roughly 99.4% of it, appears rarely or only once, often in genre-specific or topic-specific contexts.

The word "family" matters here. Hiebert is not talking about 2,500 individual words. A morphological word family includes a root word and all of its inflected and derived forms. The "help" family includes help, helps, helped, helping, helper, helpless, helpful, helpfully, unhelpful, and so on. A child who has acquired the morphological logic of "help" has a generative system, not a memorised list. Master 2,500 of these generative systems, and you can read essentially anything an academic text will throw at you.
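To make the idea of family-level coverage concrete, here is a minimal sketch of how one might measure it on a scrap of text. It is not the procedure used in the published analysis. The two families in FAMILIES, the function family_coverage, and the sample sentence are all hypothetical, and a real core list would also include function words such as "the" and "a".

```python
import re

# Hypothetical toy families; the research describes roughly 2,500 of them.
FAMILIES = {
    "help": {"help", "helps", "helped", "helping", "helper",
             "helpless", "helpful", "helpfully", "unhelpful"},
    "form": {"form", "forms", "formed", "forming",
             "formal", "informal", "formation"},
}

# Invert the table so every family member points back to its root.
WORD_TO_ROOT = {
    word: root for root, members in FAMILIES.items() for word in members
}


def family_coverage(text):
    """Return the share of word tokens in `text` that belong to a known family."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    covered = sum(1 for token in tokens if token in WORD_TO_ROOT)
    return covered / len(tokens)


sample = "The helper helped form a helpful plan"
print(f"{family_coverage(sample):.0%} of tokens fall inside the toy families")
```

Counting by family rather than by individual word is the design choice that matters: helper, helped, and helpful all count as hits for a single entry, which is how a short list of families can account for a large share of running text.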

This finding alone should change how schools think about vocabulary. Most curricula still teach words as arbitrary weekly lists, often selected from frequency tables that count individual words rather than families, and often disconnected from the texts students are actually reading. Hiebert's work suggests this is a misuse of instructional time. The high-leverage target is small, identifiable, and stable across grade levels. We could teach it deliberately. Mostly, we don't.

Words Are a Network, Not a List

Hiebert's second contribution is more subtle, and in some ways more important. Even within the core 2,500 families, she argues, words are not learned as individual items. They are learned in relationship to one another, through three kinds of connections: semantic (words that share meaning), morphological (words that share a root), and what she calls multiple-meaning (the same word used differently across contexts) (Hiebert, 2020) [3].

Consider the word "form". By itself, it is a fairly abstract noun and verb. Place it inside its morphological family and it becomes the root of inform, reform, transform, formation, conform, formal, informal, deform, uniform. Place it inside its semantic family and it sits beside shape, structure, mould, pattern. Place it inside its multiple-meaning family and you discover that "form" in a doctor's office and "form" in a sculpture studio share an etymological core but operate in entirely different conceptual worlds. A student who has learned "form" as a single dictionary entry has learned almost nothing. A student who has learned "form" as a node in a network has acquired access to dozens of related words and concepts at once.
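One way to picture the difference between a list and a network is to write the "form" example down as a small graph. The sketch below is only an illustration: the words come from the paragraph above, but the data structure, the connect and neighbours helpers, and the wording of the two senses are assumptions made for the example.

```python
from collections import defaultdict

# word -> kind of connection -> set of connected words
network = defaultdict(lambda: defaultdict(set))


def connect(word, other, kind):
    """Record an undirected link of the given kind between two words."""
    network[word][kind].add(other)
    network[other][kind].add(word)


for derived in ["inform", "reform", "transform", "formation", "conform",
                "formal", "informal", "deform", "uniform"]:
    connect("form", derived, "morphological")

for synonym in ["shape", "structure", "mould", "pattern"]:
    connect("form", synonym, "semantic")

# Multiple meanings sit on the node itself rather than on an edge.
# The glosses are illustrative paraphrases, not dictionary definitions.
senses = {
    "form": [
        "a document to fill in (the doctor's office)",
        "the shape given to a material (the sculpture studio)",
    ],
}


def neighbours(word):
    """Return every word linked to `word`, grouped by kind of connection."""
    return {kind: sorted(others) for kind, others in network[word].items()}


print(neighbours("form"))
print(senses["form"])
```

A flashcard corresponds to a node with no edges. Looking up "form" in this structure returns thirteen related words and two senses in one step.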

This network view is the practical reason Hiebert is so insistent that vocabulary cannot be taught from flashcards or weekly lists. A flashcard isolates a word from precisely the connections that make it learnable. As Hiebert herself has put it, "We don't learn words as you might think of as a file cabinet; we learn words in relation to other words." The unit of instruction is not the word. It is the cluster.

In the Service of Knowledge

Even with the right unit of instruction, there remains the question of how to organise the curriculum across a year. Hiebert's answer is that vocabulary should be taught "in the service of knowledge." Words should be clustered around coherent topics in which students can build conceptual depth, rather than rotated through unrelated themes that produce breadth without depth.

The reasoning is rooted in a finding the field has known for decades but rarely acted on. What a reader knows about a topic is one of the strongest predictors of how well they will comprehend a text on that topic — often stronger than any general measure of reading ability. (This is the "knowledge effect" first formalised in the 1970s and reaffirmed across cognitive science.) Vocabulary growth and knowledge growth are not separate goals; they are the same goal seen from two angles. A child who learns ten new words about ecosystems is also learning about ecosystems. A child who learns ten new arbitrary words is learning ten new words.

The implication is that arbitrary word lists, even good ones drawn from the core vocabulary, leave value on the table. The same instructional minutes can produce both a richer vocabulary and a stronger conceptual base, but only if the words are chosen and sequenced to build into something. Vocabulary curriculum, in Hiebert's hands, becomes inseparable from knowledge curriculum.

Generative Instruction for the Long Tail

What about the remaining 8 to 10% of words, the rare ones that fall outside the core 2,500 families? Hiebert's analysis finds that around 30% of the rare words in children's books are proper nouns, and the rest are scattered across topics and genres in patterns too varied to teach directly. There is no clever curriculum that can pre-teach the long tail.

What can be done is to give students the tools to generate meaning for unfamiliar words on the fly. Hiebert calls this generative vocabulary instruction (Hiebert & Pearson, n.d.; Hiebert, 2020) [4]. The tools are morphological awareness (recognising that arborist shares a root with arbor), context analysis (using surrounding sentences to constrain meaning), and awareness of word relationships (knowing that an unfamiliar word in a list of synonyms inherits constraints from its neighbours). These strategies do not replace direct instruction in the core 2,500 families. They sit on top of it, leveraging the network the student has built to make sense of words she has never encountered.

This is why, in Hiebert's framework, deep teaching of the core is the precondition for everything else. A student without a strong core has nothing to generate from when she meets arborist on the page. A student with a strong core can decompose it, recognise the root, and arrive at "someone who works with trees" without ever having been taught the word.
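The decomposition step can be sketched in a few lines. This is a deliberately naive illustration of morphological awareness, not a model of how readers or any published program actually parse words. The ROOTS and SUFFIXES tables and the guess_meaning function are hypothetical, and real English morphology involves spelling changes that simple suffix-stripping ignores.

```python
# Hypothetical mini-lexicon standing in for a student's known roots and suffixes.
ROOTS = {"arbor": "trees", "help": "aid"}
SUFFIXES = {"ist": "someone who works with", "less": "without", "ful": "full of"}


def guess_meaning(word):
    """Try to split `word` into a known root plus a known suffix.

    Returns a rough gloss, or None when nothing in the lexicon fits.
    """
    word = word.lower()
    for suffix, gloss in SUFFIXES.items():
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if stem in ROOTS:
                return f"{gloss} {ROOTS[stem]}"
    return None


print(guess_meaning("arborist"))   # someone who works with trees
print(guess_meaning("xylophone"))  # None: no known root to generate from
```

The second call returns nothing, which mirrors the point above: with no matching root in the lexicon, there is nothing to generate a meaning from.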

What It Means, and the Equity Case

Vocabulous! is built on this view of vocabulary. The curriculum is sequenced not as a long list of words to memorise but as a structured progression through high-utility morphological word families, with each family treated as a generative system rather than a closed set. Words within a level are clustered around coherent topics, so that students build conceptual knowledge alongside vocabulary, not in parallel to it. And the platform's pedagogical machinery — spaced repetition, varied modality, multiple senses across multiple contexts — is designed to teach words in the network-rich way Hiebert's research says they are actually learned, not in the flashcard-isolated way most digital vocabulary tools default to. None of this is original to us; almost all of it traces, directly or indirectly, to Hiebert's analysis of what the texts students actually read are made of.

Hiebert's work does not prescribe a single program. It does something subtler. It tells the field, with corpus-level precision, what the high-leverage target actually is. It explains why most vocabulary instruction misses that target. And it offers a coherent alternative that schools and platforms can adopt at any scale.

If her findings are even approximately right, and the corpus evidence is overwhelming that they are, then a great deal of the vocabulary instruction children receive today is, in a quiet way, malpractice. We are teaching words that do not appear in the texts children read, in formats that do not match how words are actually learned, organised around topics that do not build cumulative knowledge. The cost is paid most heavily by the children who arrived at school with the smallest vocabularies and have the most to gain from being taught well. That is the equity case for taking Elfrieda Hiebert's research seriously, and it is the case Vocabulous! was built to act on.

References

Sources are open-access where possible — TextProject working papers and publisher open-access pages — with paywalled DOIs only where no free version exists. Last verified May 2026.

  1. Nagy, W. E., & Anderson, R. C. (1984). How Many Words Are There in Printed School English? Reading Research Quarterly, 19(3), 304–330.
  2. Hiebert, E. H., Goodwin, A. P., & Cervetti, G. N. (2018). Core Vocabulary: Its Morphological Content and Presence in Exemplar Texts. Reading Research Quarterly, 53(1), 29–49.
  3. Hiebert, E. H. (2020). The Core Vocabulary: The Foundation of Proficient Comprehension. The Reading Teacher, 73(6), 757–768. Free PDF available from TextProject.
  4. Hiebert, E. H., & Pearson, P. D. (n.d.). Generative Vocabulary Instruction. TextProject. Free PDF available from TextProject.

We encourage parents, educators, and researchers to dig into the original sources, and to push back where they disagree. The work is too important to take on faith.