Learning to read is a complex process, especially in a language like English, where the mappings between the printed and spoken forms of words are often unpredictable. Many have assumed that the optimal way to teach children these mappings is to account exhaustively for the relevant ones and express them as rules that children can apply in their reading and writing. This project examines assumptions like this one using computational models of a child reading. We investigate how to optimize learning not through exhaustive sampling of the printed patterns of the language, but by minimizing the amount of training necessary for successful generalization to words not yet learned. The experiments take two forms: manipulating the number of words a given model trains on, to understand the learner's capacity to generalize, and structuring the sequence in which those words are learned over development, to understand what might comprise an optimal learning environment for the developing child.
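
As a rough illustration of the two manipulations described above (how many words a learner trains on, and the order in which it sees them), here is a minimal Python sketch. It is not the project's actual model or code: the toy corpus, the function names, and the letter-bigram "coverage" score are all hypothetical stand-ins chosen only to make the idea concrete. A real simulation would train an orthography-to-phonology learner and measure its error on the held-out words.

```python
import random


def sample_training_sets(corpus, sizes, n_samples, seed=0):
    """Yield (size, train, held_out) splits: n_samples random subsets per size.

    Assumes `corpus` is a list of unique words and every size <= len(corpus).
    """
    rng = random.Random(seed)
    for size in sizes:
        for _ in range(n_samples):
            train = rng.sample(corpus, size)
            train_set = set(train)
            held_out = [w for w in corpus if w not in train_set]
            yield size, train, held_out


def bigrams(word):
    """Return the set of adjacent letter pairs in a word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def bigram_coverage(train, held_out):
    """Toy proxy for generalization (an assumption, not the project's metric):
    the share of held-out words whose letter bigrams all occur in training."""
    if not held_out:
        return 0.0
    seen = set().union(*(bigrams(w) for w in train))
    covered = sum(1 for w in held_out if bigrams(w) <= seen)
    return covered / len(held_out)


def order_short_first(train):
    """One hypothetical sequencing: shorter words first, ties alphabetical."""
    return sorted(train, key=lambda w: (len(w), w))


if __name__ == "__main__":
    # Hypothetical toy corpus, for illustration only.
    corpus = ["cat", "hat", "chat", "that", "cart", "heart", "teach", "reach"]
    for size, train, held_out in sample_training_sets(corpus, [2, 4, 6], 3):
        score = bigram_coverage(train, held_out)
        print(size, order_short_first(train), round(score, 2))
```

Comparing scores across training-set sizes, and across different orderings of the same training set, mirrors the two experimental manipulations in miniature.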

This set of studies is aimed at investigating aspects of the structure of printed language input (orthographic, phonological, and semantic) that contribute to differential developmental outcomes in language and reading. This is a collaboration with Chris Cox (former lab member, faculty member at LSU).

This project has two arms. There is an education-facing arm, in which we analyze real reading curricula, and a related but separate set of projects geared toward computational models that vary the structure of the input to a learner (“machine teaching”). This page has information for all the sub-projects associated with our work on curriculum analyses and modeling.

Status updates

For status updates on the curriculum modeling workflow, see the wiki page for the specific project you are interested in. For example, if you are interested in unordered candide, see the wiki page for unordered candide.


As we publish data from activities under these projects, you should be able to find it here. Items are labelled by the type of analysis performed:

  • 3k corpus characteristics
    • This notebook provides summary data about the corpus that we are using (at the time of writing) for all our models of orth-phon learning.
  • By-item error analyses for candide unordered (brute force 2)
    • These analyses look at the itemwise generalization error from our simulations, in which we trained the same learner in many different learning environments.
  • Summary of candide simulations (candide1/ in GitHub)
    • These were the analyses that informed the CogSci 2018 poster.

Cloud resources

  • The protocol for how to code and transcribe curricula can be found here
  • This file outlines the different tasks that we will target in our transcriptions
  • Google Drive for the team (if you need permission contact Lauren or Matt)
  • Box drive with educational curricular materials (let us know if you need permission)
  • The phonological code conventions being used are here
    • And here are some examples of transcribed words as a reference
  • GitHub
    • candide_unordered repo
      • Here is the wiki for status updates on this work.
    • OrthPhon parent repository. This is the original GitHub repo for the project. We are migrating to separate repos for each subproject, so changes will be ongoing.
      • OrthPhon Wiki within our GitHub repo.
      • The older companion work in dialect modeling can be found here

Communication & Other Resources

  • General access information (copy code, key details, etc) can be found here: LCNL access information
  • Slack information:
    • LCNL Reading Team Slack: note that this is a different Slack team from the primary LCNL Slack team.
    • For the curriculum team, see the channel “curriculum”
    • For the modeling team, see the channel “orthphon”
