Line 244: Line 244:
 \\ \\
 \\ \\
-{{ ::screenshot_2016-02-06_20.41.52.png?nolink&200 |}}+{{ :screenshot_2016-02-06_20.43.30.png?nolink&300 |}} 
 +**Model 3: //multiple rules under noise//** 
 +Model 3 loosens an additional assumption: that all the 
 +strings in the input data are the product of a single rule. Instead, 
 +it considers the possibility that there are multiple 
 +rules, each consistent with a subset of the training data. 
 +We encode a weak bias to have fewer rules via a prior 
 +probability distribution that favors more compact partitions 
 +of the input. This prior is known as a Chinese Restaurant 
 +Process (CRP) prior (Rasmussen, 2000); it introduces a 
 +second free parameter, c, which controls the bias over clusterings. 
 +A low value of c encodes a bias that there are likely 
 +to be many small clusters, while a high value of c encodes a 
 +bias that there are likely to be a small number of large clusters.
 +The joint probability of the training data T and a partition 
 +Z of those strings into rule clusters is given by 
 +P(T,Z) = P(T|Z)P(Z) 
 +neglecting the parameters a and c. The probability of a 
 +clustering P(Z) is given by CRP(Z,c). 
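As a rough illustration of the P(Z) term, the sketch below computes the log probability of a partition under the standard textbook Chinese Restaurant Process. The function name `crp_log_prob` and the argument `concentration` are hypothetical and not from the source. In this standard parameterization a larger concentration favors more clusters, which may not match the direction described for c above, so the mapping between `concentration` and c is an assumption.

```python
import math


def crp_log_prob(cluster_sizes, concentration):
    """Log probability of one partition under a standard CRP.

    cluster_sizes: list of positive ints, one entry per cluster.
    concentration: positive float (assumed stand-in for the source's c;
                   the source may parameterize c differently).
    """
    n = sum(cluster_sizes)
    k = len(cluster_sizes)
    log_p = k * math.log(concentration)
    # lgamma(size) equals log((size - 1)!), the within-cluster seating term.
    log_p += sum(math.lgamma(size) for size in cluster_sizes)
    # Normalizer: Gamma(concentration) / Gamma(concentration + n).
    log_p += math.lgamma(concentration) - math.lgamma(concentration + n)
    return log_p


# Three strings in a single cluster, concentration 1: probability 1/3.
p_single = math.exp(crp_log_prob([3], 1.0))
```

Adding this value to log P(T|Z) gives the unnormalized log joint for a partition, which is all that a sampler needs.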
 +Unlike in Models 1 and 2, inference by exact enumeration 
 +is not possible and so we are not able to compute the 
 +normalizing constant. But we are still able to compute the 
 +relative posterior probability of a partition of strings into 
 +clusters (and hence the posterior probability distribution 
 +over rules for that cluster). Thus, we can use a Markov chain
 +Monte Carlo (MCMC) scheme to find the posterior 
 +distribution over partitions. In practice we use a Gibbs 
 +sampler, an MCMC method for drawing repeated samples 
 +from the posterior probability distribution via iteratively 
 +testing all possible cluster assignments for each string 
 +(MacKay, 2003). 
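The sampling scheme described above might be sketched as follows. This is a minimal sketch, not the authors' implementation: the names `gibbs_sweep`, `log_joint`, `crp_log_prior`, and `toy_cluster_log_lik` are hypothetical, the CRP uses the standard parameterization (which may differ from the source's c), and the toy likelihood is only a stand-in for the rule-based likelihood of Eq. (4), which is not reproduced here.

```python
import math
import random
from collections import Counter


def crp_log_prior(labels, concentration):
    """Standard CRP log prior of a partition given as a list of labels."""
    sizes = Counter(labels).values()
    return (len(sizes) * math.log(concentration)
            + sum(math.lgamma(s) for s in sizes)
            + math.lgamma(concentration)
            - math.lgamma(concentration + len(labels)))


def log_joint(strings, labels, cluster_log_lik, concentration):
    """Unnormalized log P(T, Z) = log P(T|Z) + log P(Z)."""
    clusters = {}
    for s, z in zip(strings, labels):
        clusters.setdefault(z, []).append(s)
    log_lik = sum(cluster_log_lik(m) for m in clusters.values())
    return log_lik + crp_log_prior(labels, concentration)


def gibbs_sweep(strings, labels, cluster_log_lik, concentration, rng):
    """One sweep: resample each string's cluster given all the others."""
    labels = list(labels)
    for i in range(len(strings)):
        others = set(labels[:i] + labels[i + 1:])
        # Candidates: every cluster used by other strings, plus a new one.
        candidates = sorted(others) + [max(labels) + 1]
        scores = []
        for z in candidates:
            labels[i] = z
            scores.append(log_joint(strings, labels, cluster_log_lik,
                                    concentration))
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        labels[i] = rng.choices(candidates, weights=weights, k=1)[0]
    return labels


def toy_cluster_log_lik(members):
    """Hypothetical stand-in likelihood: strings are likely (0.9 each)
    when the whole cluster shares one repetition pattern, else 0.1 each."""
    patterns = {tuple(s.index(x) for x in s) for s in members}
    per_string = 0.9 if len(patterns) == 1 else 0.1
    return len(members) * math.log(per_string)


strings = [("ga", "ti", "ga"), ("li", "na", "li"),
           ("ga", "ti", "ti"), ("li", "na", "na")]
rng = random.Random(0)
labels = [0, 0, 0, 0]
for _ in range(20):
    labels = gibbs_sweep(strings, labels, toy_cluster_log_lik, 1.0, rng)
```

Because each conditional is proportional to the joint, the normalizing constant never needs to be computed. Collecting `labels` across many sweeps would approximate the posterior over partitions.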
 +In all simulations we calculate the posterior probability 
 +distribution over rules given the set of unique string types 
 +used in the experimental stimuli. We use types rather than
 +individual string tokens because a number of
 +computational and experimental investigations have suggested 
 +that types rather than tokens may be a psychologically 
 +natural unit for generalization (Gerken & Bollt, 2008; 
 +Goldwater, Griffiths, & Johnson, 2006; Richtsmeier, Gerken, 
 +& Ohala, in press). 
 +To assess the probability of a set of test items 
 +E = e1 ... en (again computed over types rather than tokens)
 +after a particular training sequence, we calculate the total 
 +probability that those items would be generated under a 
 +particular posterior distribution over hypotheses. This 
 +probability is 
 +{{ ::screenshot_2016-02-06_20.54.45.png?nolink&300 |}} 
 +which is the product over examples of the probability of a 
 +particular example, summed across the posterior distribution 
 +over rules p(R|T). For Model 1 we compute p(ek|rj) 
 +using Eq. (2); for Models 2 and 3 we use Eq. (4). 
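The equation above appears only as an image in this revision. Read from the verbal description alone, it would take roughly the following form; the index conventions (k over test items, j over rules) are assumptions inferred from the symbols p(ek|rj) and p(R|T) in the text.

```latex
p(E \mid T) \;=\; \prod_{k=1}^{n} \sum_{j} p(e_k \mid r_j)\, p(r_j \mid T)
```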
 +We use surprisal as our main measure linking posterior 
 +probabilities to the results of looking time studies. Surprisal 
 +(negative log probability) is an information-theoretic 
 +measure of how unlikely a particular outcome is. It has 
 +been used previously to model adult reaction time data 
 +in sentence processing tasks (Hale, 2001; Levy, 2008) as 
 +well as infant looking times (Frank, Goodman, & 
 +Tenenbaum, 2009). 
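A minimal sketch of how surprisal could be computed from a posterior over rules is given below. The names `item_set_probability` and `surprisal`, the dictionary representation of the posterior, and the choice of log base 2 (bits) are assumptions for illustration; the source specifies only "negative log probability".

```python
import math


def item_set_probability(test_items, posterior, likelihood):
    """Product over test items of the posterior-weighted likelihood.

    posterior: dict mapping each rule to p(rule | training data).
    likelihood: function (item, rule) -> p(item | rule); hypothetical
                stand-in for Eq. (2) or Eq. (4).
    """
    total = 1.0
    for item in test_items:
        total *= sum(p_rule * likelihood(item, rule)
                     for rule, p_rule in posterior.items())
    return total


def surprisal(probability, base=2.0):
    """Negative log probability; base 2 (bits) is an assumed convention."""
    return -math.log(probability, base)


# Toy example with made-up numbers: two rules, two test items.
posterior = {"r1": 0.5, "r2": 0.5}
p_items = item_set_probability(
    ["e1", "e2"], posterior,
    lambda item, rule: 1.0 if rule == "r1" else 0.0)
bits = surprisal(p_items)
```

Higher surprisal corresponds to test items that are less expected under the learned posterior.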
 +{{ :screenshot_2016-02-06_21.03.21.png?nolink |}}
 \\ \\
 \\ \\
 ---- ----
 **Conclusions:** **Conclusions:**
 +The infant language learning literature has often been
 +framed around the question “rules or statistics?” We suggest
 +that this is the wrong question. Even if infants represent
 +symbolic rules with relations like identity—and there
 +is every reason to believe they do—there is still the question
 +of how they learn these rules, and how they converge
 +on the correct rule so quickly in a large hypothesis space.
 +This challenge requires statistics for guiding generalization
 +from sparse data.
 +In our work here we have shown how domain-general
 +statistical inference principles operating over minimal
 +rule-like representations can explain a broad set of results
 +in the rule learning literature.
 +The inferential principles encoded in our models—the
 +size principle (or in its more general form, Bayesian Occam’s
 +razor) and the non-parametric tradeoff between
 +complexity and fit to data encoded in the Chinese Restaurant
 +Process—are not only useful in modeling rule learning
 +within simple artificial languages. They are also the same
 +principles that are used in computational systems for natural
 +language processing that are engineered to scale to
 +large datasets. These principles have been applied to tasks
 +as varied as unsupervised word segmentation (Brent,
 +1999; Goldwater, Griffiths, & Johnson, 2009), morphology
 +learning (Albright & Hayes, 2003; Goldwater et al., 2006;
 +Goldsmith, 2001), and grammar induction (Bannard,
 +Lieven, & Tomasello, 2009; Klein & Manning, 2005; Perfors,
 +Tenenbaum, & Regier, 2006).
 +First, our models assumed the minimal machinery
 +needed to capture a range of findings. Rather than making
 +a realistic guess about the structure of the hypothesis
 +space for rule learning, where evidence was limited we assumed
 +the simplest possible structure. For example,
 +although there is some evidence that infants may not always
 +encode absolute positions (Lewkowicz & Berent,
 +2009), there have been few rule learning studies that go
 +beyond three-element strings. We therefore defined our
 +rules based on absolute positions in fixed-length strings.
 +For the same reason, although previous work on adult concept
 +learning has used infinitely expressive hypothesis
 +spaces with prior distributions that penalize complexity
 +(e.g. Goodman, Tenenbaum, Feldman, & Griffiths, 2008;
 +Kemp, Goodman, & Tenenbaum, 2008), we chose a simple
 +uniform prior over rules instead. With the collection of
 +more data from infants, however, we expect that both
 +more complex hypothesis spaces and priors that prefer
 +simpler hypotheses will become necessary.
 +Second, our models operated over unique string types
 +as input rather than individual tokens. This assumption
 +highlights an issue in interpreting the a parameter of Models
 +2 and 3: there are likely different processes of forgetting
 +that happen over types and tokens. While individual tokens
 +are likely to be forgotten or misperceived with constant
 +probability, the probability of a type being
 +misremembered or corrupted will grow smaller as more
 +tokens of that type are observed (Frank et al., 2010). An
 +interacting issue concerns serial position effects. Depending
 +on the location of identity regularities within sequences,
 +rules vary in the ease with which they can be
 +learned (Endress, Scholl, & Mehler, 2005; Johnson et al.,
 +2009). Both of these sets of effects could likely be captured
 +by a better understanding of how limits on memory interact
 +with the principles underlying rule learning. Although a
 +model that operates only over types may be appropriate
 +for experiments in which each type is nearly always heard
 +the same number of times, models that deal with linguistic
 +data must include processes that operate over both types
 +and tokens (Goldwater et al., 2006; Johnson, Griffiths, &
 +Goldwater, 2007).
 +Finally, though the domain-general principles we have
 +identified here do capture many results, there is some
 +additional evidence for domain-specific effects. Learners
 +may acquire expectations for the kinds of regularities that
 +appear in domains like music compared with those that
 +appear in speech (Dawson & Gerken, 2009); in addition, a
 +number of papers have described a striking dissociation
 +between the kinds of regularities that can be learned from
 +vowels and those that can be learned from consonants
 +(Bonatti, Peña, Nespor, & Mehler, 2005; Toro, Nespor,
 +Mehler, & Bonatti, 2008). Both sets of results point to a
 +need for a hierarchical approach to rule learning, in which
 +knowledge of what kinds of regularities are possible in a
 +domain can itself be learned from the evidence. Only
 +through further empirical and computational work can
 +we understand which of these effects can be explained
 +through acquired domain expectations and which are best
 +explained as innate domain-specific biases or constraints.
 \\ \\