Do Infants Learn Grammar with Algebra or Statistics?
The report "Rule learning by seven-month-old infants" by
G. F. Marcus et al. (1 Jan., p. 77)
adds to a growing body of evidence concerning the remarkable learning
abilities of infants. This evidence indicates that children acquire
much more knowledge of language from experience than one might assume
(1). However, the conclusion by Marcus et
al. that the infants had learned rules rather than merely statistical
regularities is unwarranted.
In the experiments in the report by Marcus et al., infants
were familiarized with sequences of syllables that conformed to
patterns such as ABB or AAB (for example, "wo fe fe" versus
"wo wo fe"). They were then tested on sequences containing
different syllables that either matched these patterns or not. Infants
preferred (2) novel sequences that violated
the pattern to which they had been pre-exposed, and so were said
to have learned the rule governing the sequences' "grammar."
This conclusion rests on the fact that the test sequences contained
novel syllables; thus, the infants could not have learned anything
about their statistical properties. However, these "grammatical
rules" created other statistical regularities. AAB, for example,
indicated that a syllable would be followed by another instance
of the same syllable and then a different syllable. Thus, in the
pretraining phase, the infant was exposed to a statistical regularity
governing sequences of perceptually similar and different events.
The report's discussion focused on what the infants could learn
about the particular syllables used in training, but there is no
reason to deny these infants the capacity to learn these same-different
contingencies.
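The same-different argument can be made concrete by recoding each sentence in terms of whether neighboring syllables match. What follows is a minimal sketch, not the model described in note 3. The function name `same_different` and every sentence other than "wo wo fe" are illustrative assumptions.

```python
def same_different(sentence):
    """Recode a sentence as same/different relations between neighboring syllables."""
    syllables = sentence.split()
    return tuple(
        "same" if first == second else "different"
        for first, second in zip(syllables, syllables[1:])
    )

# Familiarization with the AAB pattern. "wo wo fe" is quoted from the report;
# the other sentences are made up for illustration.
familiarization = ["wo wo fe", "li li na", "ta ta gi"]

# Test sentences built from syllables absent from familiarization (also made up).
consistent_test = "ba ba po"    # AAB
inconsistent_test = "ba po po"  # ABB

learned = {same_different(s) for s in familiarization}
print(learned)                                       # {('same', 'different')}
print(same_different(consistent_test) in learned)    # True
print(same_different(inconsistent_test) in learned)  # False
```

Under this recoding, the test sentences are novel at the level of syllables but not at the level of same-different sequences, which is the regularity at issue here.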

[Figure: "Wo fe fe" or "wo wo fe"? Credit: Gary F. Marcus]
There is also no reason to deny connectionist neural network models
this capacity. In our view, the goal of modeling is to understand
children's behavior by endowing networks with the same capacities
and experiences as children. The networks that Marcus et al.
studied were not provided with either, so it is not unexpected that
they behaved differently. A 7-month-old child has already developed
a rich representation of the structure of acoustic and speech events
on the basis of several thousand hours of exposure to examples,
including the "novel" test syllables. In the model used
by Marcus et al., in contrast, there was no knowledge of
the structure of utterances, no exposure to these syllables, and
no way to represent phonological similarity.
A model with the same kinds of capacities and experiences as infants
will perform in a similar manner. To demonstrate this, we implemented
a simple model (3), which is not a general
account of all aspects of the phenomena, but serves to illustrate
that the limitations that Marcus et al. described are not
intrinsic to all connectionist models.
Rather than showing that rule learning is "there from the
start" (4), the findings in Marcus et
al.'s report indicate that infants are able to encode multiple
types of statistical regularities. This feat places them squarely
on the path toward acquiring a central aspect of the adult's linguistic
competence (5).
Mark S. Seidenberg
Neuroscience Program,
University of Southern California,
Los Angeles, CA 90089-2520, USA.
E-mail: marks@gizmo.usc.edu
Jeff L. Elman
Department of Cognitive Science,
University of California, San Diego,
La Jolla, CA 92093-0515, USA
References and Notes
1. N. Chomsky, Knowledge of Language (Praeger, New York,
1986).
2. As described on page 78 of the report, preference was indicated
by an infant "looking longer at the flashing side light during
presentations of [novel] sentences."
3. Discussion and model are at crl.ucsd.edu/~elman/Papers/MVRVsim.html.
4. As stated in the Perspective "Out of the minds of babes"
(S. Pinker, p. 40) that accompanied the report.
5. M. S. Seidenberg, Science 275, 1599 (1997);
J. L. Elman et al., Rethinking Innateness
(MIT Press, Cambridge, MA, 1996).
Marcus et al. report that 7-month-old infants learned
language tasks that required rule learning. They also state that
these tasks are not learnable by statistical algorithms, including
simple recurrent networks (SRNs). This statement is not correct.
After noting that some stimuli could be learned statistically
(such as those used in experiment 1 in the report), Marcus et
al. used a refined phoneme set (for their experiments 2 and
3). The fact that they assumed that the refined phoneme set was
not statistically learnable indicates that their experimental paradigm
was based on binary feature representations. For instance, vowel
height (1, 2) would be
represented by two features, +/-high and
+/-low. However, if one adopts a continuous
vowel height as in the cardinal vowel scale (English low, middle,
and high vowels would be represented by 0.00, 0.67, and 1.00), statistical
algorithms can accomplish the learning (3).
I conducted computer simulations with the use of a variant of
SRN with continuous vowel height and place of articulation (POA)
(3, 4). In all cases,
as expected, the network made larger prediction errors with the
inconsistent sentences (3). These results suggest
that the report's experimental design does not exclude the possibility
that children used a statistical learning strategy. I agree with
Marcus et al. that standard SRNs cannot generalize learned
rules to novel independent features; however, SRNs can apply learned
mappings [for example, f(x, y) = x] to novel
real values.
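The last point can be illustrated with a deliberately small stand-in for the SRN: a single linear unit that predicts the third item of a sentence from the first two, with each item coded as one real number, as with the continuous vowel height above. This is a sketch under stated assumptions, not the simulation reported in note 3. The training values, learning rate, and names such as `train_linear_predictor` are hypothetical.

```python
def train_linear_predictor(sentences, learning_rate=0.2, epochs=3000):
    """Fit third = w1*first + w2*second + bias by batch gradient descent."""
    w1 = w2 = bias = 0.0
    n = len(sentences)
    for _ in range(epochs):
        g1 = g2 = gb = 0.0
        for first, second, third in sentences:
            error = (w1 * first + w2 * second + bias) - third
            g1 += 2 * error * first / n
            g2 += 2 * error * second / n
            gb += 2 * error / n
        w1 -= learning_rate * g1
        w2 -= learning_rate * g2
        bias -= learning_rate * gb
    return w1, w2, bias

def prediction_error(model, sentence):
    """Squared error of the predicted third item for one sentence."""
    w1, w2, bias = model
    first, second, third = sentence
    return (w1 * first + w2 * second + bias - third) ** 2

# ABB familiarization over a small set of real-valued items (hypothetical values).
train_values = [0.1, 0.3, 0.5, 0.7, 0.9]
abb_sentences = [(a, b, b) for a in train_values for b in train_values if a != b]
model = train_linear_predictor(abb_sentences)

# Test items use real values that never occurred in training.
consistent = (0.2, 0.8, 0.8)    # ABB
inconsistent = (0.2, 0.2, 0.8)  # AAB
print(prediction_error(model, consistent) < prediction_error(model, inconsistent))  # True
```

Because the unit computes with the values themselves, the mapping it learns (approximately "third item equals second item") carries over to values it never saw. Whether the same holds for the full network and feature set is what the simulations in note 3 address.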
Michiro Negishi
Department of Cognitive and Neural Systems,
Boston University,
Boston, MA 02215, USA.
E-mail: negishi@cns.bu.edu
References and Notes
1. Vowel height is the index of the vertical position of the tongue
body with respect to the roof of the mouth: /i/ as in "bee"
is a high vowel, whereas /a/ as in "Sam" is a low vowel
(2).
2. H. J. Giegerich, English Phonology: An Introduction
(Cambridge Univ. Press, Cambridge, 1992).
3. Examples, simulations, and results are at cns-web.bu.edu/pub/mnx/sci.html.
4. The rationale for my using continuous POA, which helps the learning
in experiment 1 in the report, but not in experiments 2 and 3,
comes from the sonority scale [section 6.2 in (2)]
and the distribution of ejectives and implosives [J. Greenberg,
Int. J. Am. Linguist. 36, 123 (1970)].
Marcus et al. propose that 7-month-old infants, when
listening to speech, can extract abstract algebraic rules "that
represent relationships between placeholders...such as 'the first
item X is the same as the third item Y'...." Marcus et
al. refer to an earlier report, "Statistical learning
by 8-month-old infants" (13 Dec. 1996, p. 1926),
in which Saffran et al. showed that infants of like age
abstracted statistical relationships from speech in order to segment
words. These two reports ascribe to infants (among other cognitive
achievements) two powerful means to acquire language: associative
and rule-learning procedures. However, the evidence for algebraic
rule learning in the report by Marcus et al. is open to
serious question.
Marcus et al. state (note 18 in the report), "In
principle, an infant who paid attention only to the final two syllables
[words] of each sentence could distinguish the AAB grammar from
the ABB grammar purely on the basis of reduplication...." We
would add that this is a strong possibility, in that syllables were
separated by 250-millisecond pauses and each three-syllable sentence
was separated by a 1-second pause. Moreover, there is evidence that
7-month-old infants can discriminate objects by means of the abstract
relations, same or different (1). Marcus et
al. then state, in note 18, "but [the infants] could not
have succeeded in the experiment of Saffran et al."
in demonstrating "word" segmentation if they had been
using a strategy of reduplication. Consequently, Marcus et al.
apparently did not explore or eliminate this possibility in their
own studies of rule learning. This comparison, however, is highly
problematic--there are important differences in procedural details
between these two studies. Saffran et al. presented their
infants with frequently repeated, randomly ordered sequences of
four trisyllabic "words." There was, moreover, no pause
between syllables or between words, and the syllables were coarticulated,
making it highly unlikely, and perhaps impossible, that only the
final two syllables of each word were perceived as the final two
syllables of each sentence, as they might have been in the studies
of Marcus et al.
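The kind of continuous stream described above, and the statistical cue it offers for segmentation, can be sketched as follows. This is an illustration only: the four "words" are invented rather than the actual stimuli, immediate repetition of a word is allowed for simplicity, and the stream length, seed, and function name `transitional_probabilities` are assumptions.

```python
import random
from collections import Counter

def transitional_probabilities(stream):
    """Estimate P(next syllable | current syllable) from adjacent pairs."""
    pair_counts = Counter(zip(stream, stream[1:]))
    first_counts = Counter(stream[:-1])
    return {pair: count / first_counts[pair[0]] for pair, count in pair_counts.items()}

# Four invented trisyllabic "words", each written as a tuple of syllables.
words = [("ka", "mo", "ze"), ("fi", "lu", "ta"), ("ne", "so", "vi"), ("bo", "ri", "du")]

# A stream of 300 randomly ordered words with no pauses marking word boundaries.
rng = random.Random(0)
stream = [syllable for _ in range(300) for syllable in rng.choice(words)]

tp = transitional_probabilities(stream)
word_final = {w[2] for w in words}
within = [tp[(w[i], w[i + 1])] for w in words for i in range(2)]
across = [p for (x, _), p in tp.items() if x in word_final]

print(min(within))  # 1.0: inside a word, the next syllable is fully predictable
print(max(across))  # noticeably lower: any of the four words may come next
```

In such a stream there is no sentence-final position to attend to, which is the procedural difference emphasized above.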
A control study of the following nature is needed to begin to
eliminate the strategy of reduplication as one that infants could
be using: familiarize infants with an AAB sentence format and test
with new sentences with BAB and AAB formats. If there is a preference
for the novel format despite the unchanging arrangement of the final
two syllables, as would be expected had infants acquired an algebraic
rule, there would then be support for the conclusion made by Marcus
et al. Until such control experiments are performed, we
cannot conclude that infants at the age at which word segmentation
has been evidenced are also able to acquire an algebraic rule.
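The logic of the proposed control can be sketched by comparing a strategy that attends only to the final two syllables with one that codes the whole sentence. The function names and the sentence "fe wo fe" are illustrative assumptions; nothing here models infant behavior.

```python
def final_two_repeat(sentence):
    """Strategy 1: attend only to whether the last two syllables are identical."""
    syllables = sentence.split()
    return syllables[-1] == syllables[-2]

def full_pattern(sentence):
    """Strategy 2: code every syllable by the position of its first occurrence."""
    syllables = sentence.split()
    return tuple(syllables.index(s) for s in syllables)

# Example sentences; the BAB sentence is constructed for illustration.
aab, abb, bab = "wo wo fe", "wo fe fe", "fe wo fe"

# Original contrast: the final two syllables alone separate AAB from ABB.
print(final_two_repeat(aab), final_two_repeat(abb))  # False True
# Proposed control: the final two syllables no longer distinguish the formats...
print(final_two_repeat(aab), final_two_repeat(bab))  # False False
# ...but a code over the whole sentence still does.
print(full_pattern(aab), full_pattern(bab))          # (0, 0, 2) (0, 1, 0)
```

A preference for the novel format in the AAB versus BAB test therefore could not be produced by the final-two-syllable strategy alone.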
Peter D. Eimas
Department of Cognitive and Linguistic Science,
Brown University,
Providence, RI 02912, USA.
E-mail: peter-eimas@brown.edu
References
1. D. J. Tyrrell, L. B. Stauffer, L. B. Snowman, Infant Behav.
Dev. 14, 125 (1991).
Response
Eimas suggests an additional control to rule out the possibility
that infants could have relied only on the final two syllables.
Although we maintain that such a control could bear only on the
question of which rules an infant can learn, rather than the question
of whether an infant could learn rules (because the generalization
of identity itself requires a rule that holds for all instances
in a class), we are grateful for the suggestion. We have now run
that control, and the results (1) are consistent
with our previous findings.
The other two letters state that various modifications of the
simple recurrent network can handle our results, but no such network
provides a genuine, empirically adequate alternative to our proposal.
Seidenberg and Elman
present a model that can capture our data, but only by resorting
to a technique that Elman has criticized elsewhere (2):
the incorporation of an all-knowing "external teacher"
that provides the network with information that is not otherwise
available in the environment. As we noted in our report, and as
Negishi acknowledges in his letter, the standard version of the
simple recurrent network--which uses a "prediction task"
that does not depend on information that is not directly available
in the environment--does not succeed in generalizing our ABA or
ABB patterns to novel words (3). Seidenberg
and Elman appear to abandon (without comment) the usual "prediction
task" version of the network model in favor of a different
kind of model, in which an external teacher decides whether each
pair of successive words is identical. Such information is not "directly
observable from the environment" (4);
instead, it is provided by an external teacher (built by Seidenberg
and Elman) that itself builds in an algebraic rule. Because, in
the human, that external device must be something inside the child
rather than something provided by the environment, Seidenberg
and Elman have not gotten rid of the rule; they have simply hidden
it (5).
We find Negishi's model to be more interesting. Negishi points
out, quite rightly, that an SRN that uses real numbers rather than
binary encoding can capture our results. Why should that be the
case? As we noted in our report, "algebraic" rules are
"open-ended abstract relationships for which we can substitute
arbitrary items." Models that use real-number encoding use
their nodes as variables and incorporate operations that treat all
instances of a given variable equally. In other words, rather than
presenting an alternative to rules, such devices wind up implementing
them (6).
This is a subtle point, perhaps best understood in a comparison
(3) between two models of a function such as the identity function
mentioned by Negishi, f(x) = x: one that represents numbers as sets
of discrete binary features, and another that represents numbers
as analog values. Neither architecture is
inherently superior: Models that represent inputs as sets of nonarbitrary
discrete features can capture transitional probabilities between
words such as would be present in the experiments in the 1996 report
by Saffran et al., but cannot freely generalize the identity
relationships that underlie our studies; models that use nodes as
registers can freely generalize identity relationships, but cannot
capture the transitional probabilities between words that underlie
the experiments in that report. In some broad sense, both architectures
might be characterized as "statistical," but the two architectures
are suited to different problems.
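The identity half of this comparison can be sketched with two toy learners that are trained to copy their input. The task (4-bit numbers, training on even numbers only), the delta-rule learners, and all names below are assumptions chosen for illustration; this is not the code posted at the address in note 3, and it does not address transitional probabilities.

```python
def to_bits(n, width=4):
    """Most-significant-bit-first binary feature vector for n."""
    return [(n >> shift) & 1 for shift in range(width - 1, -1, -1)]

def train_binary_copy(numbers, width=4, learning_rate=0.1, epochs=500):
    """One linear unit per output bit, trained with the delta rule to copy its input."""
    weights = [[0.0] * width for _ in range(width)]
    biases = [0.0] * width
    for _ in range(epochs):
        for n in numbers:
            x = to_bits(n, width)
            for i in range(width):
                out = biases[i] + sum(w * xj for w, xj in zip(weights[i], x))
                delta = learning_rate * (x[i] - out)
                biases[i] += delta
                weights[i] = [w + delta * xj for w, xj in zip(weights[i], x)]
    return weights, biases

def binary_copy(model, n, width=4):
    """Run the binary-feature model on n and decode its thresholded output."""
    weights, biases = model
    x = to_bits(n, width)
    outs = [biases[i] + sum(w * xj for w, xj in zip(weights[i], x)) for i in range(width)]
    bits = [1 if o > 0.5 else 0 for o in outs]
    return int("".join(map(str, bits)), 2)

def train_analog_copy(numbers, scale=16.0, learning_rate=0.5, epochs=100):
    """A single node used as a register: output = w * input, one real number in and out."""
    w = 0.0
    for _ in range(epochs):
        for n in numbers:
            x = n / scale
            w += learning_rate * (x - w * x) * x
    return w

evens = [0, 2, 4, 6, 8, 10, 12, 14]  # training set: the last bit is always 0
binary_model = train_binary_copy(evens)
analog_w = train_analog_copy(evens)

print(binary_copy(binary_model, 15))  # 14: the never-trained last bit stays off
print(round(analog_w * 15))           # 15: the register treats every value alike
```

In the binary-feature learner, what is learned about one feature does not transfer to a feature that was never active in training. The single analog node applies one operation to whatever value it holds, which is the sense in which it behaves like a variable.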
Our results, in tandem with those of Saffran et al.,
suggest that infants are capable of discerning both rules and transitional
probabilities. As we said in our report (note 24), we aimed "not
to deny the importance of neural networks but rather to try to characterize
what properties the right sort of neural network architecture must
have."
Gary F. Marcus
Department of Psychology,
New York University,
New York, NY 10003, USA.
E-mail: gary.marcus@nyu.edu
References and Notes
1. In the control experiment, we trained eight 7-month-old infants
on sentences from a BAB or an AAB grammar and tested on BAB and
AAB sentences made up of novel words. Seven of the eight infants
looked longer at the inconsistent sentences than at the consistent
sentences.
2. J. L. Elman, in Mind as Motion: Explorations in the Dynamics
of Cognition, R. F. Port and T. van Gelder, Eds. (MIT Press,
Cambridge, MA, 1995), pp. 195-223.
3. Discussion, examples, and models are at psych.nyu.edu/~gary/science/discussion.html.
4. The model was also given "negative evidence"; that
is, in the habituation phase, the model was told not only which
sentences were ABB sentences (positive evidence), but also which
sentences were not (negative evidence). In contrast, the infants
in our experiment were given only positive evidence, and were not exposed
to examples of "ungrammatical patterns." Our experiment,
but not the Elman-Seidenberg
model, is consistent with the assumption that children are able
to learn grammar without negative evidence [R. W. Brown and C.
Hanlon, in Cognition and the Development of Language,
R. Hayes, Ed. (Wiley, New York, 1970); J. L. Morgan and L. L.
Travis, J. Child Lang. 16, 531 (1989);
G. F. Marcus, Cognition 46, 53 (1993)].
5. Seidenberg and Elman
appear to use the term "statistics" to refer to regularity,
thus counting rules as statistical regularities. Weakening the
terminology in this way does not take away from our point that
infants can learn rules.
6. A recent paper of ours, cited in note 22 in our report, made
this point explicitly (7, p. 275): "While
most networks represent inputs by pattern of activation across
sets of nodes, in principle one could use a single node to represent
all possible inputs, assigning each possible input to some real
number...incorporating what is a transparent implementation of
a register....The node in question would represent a variable;
its value would represent the instantiation of that variable."
For two other examples of neural network architectures that explicitly
implement relationships between variables and that could capture
our findings without a hidden teacher or negative evidence, see
K. J. Holyoak and J. E. Hummel, in Cognitive Dynamics: Conceptual
Change in Humans and Machines, E. Dietrich and A. Markman,
Eds. (Erlbaum, Mahwah, NJ, 1999) and L. Shastri and V. Ajjanagadde,
Behav. Brain Sci. 16, 417 (1993). For
further discussion, see (7) and G. F. Marcus,
The Algebraic Mind (MIT Press, Cambridge, MA, in press).
7. G. F. Marcus, Cognit. Psychol. 37,
243 (1998).