\\
\\
{{ :
\\
\\
**Model 3: //multiple rules under noise//**
\\
Model 3 loosens an additional assumption: that all the strings in the input data are the
product of a single rule. Instead, it considers the possibility that there are multiple
rules, each consistent with a subset of the training data. We encode a weak bias toward
fewer rules via a prior probability distribution that favors more compact partitions of
the input. This prior is known as a Chinese Restaurant Process (CRP) prior (Rasmussen,
2000); it introduces a second free parameter, c, which controls the bias over clusterings.
A low value of c encodes a bias that there are likely to be many small clusters, while a
high value of c encodes a bias that there are likely to be a small number of large
clusters. The joint probability of the training data T and a partition Z of those strings
into rule clusters is given by P(T,Z) = P(T|Z)P(Z), neglecting the parameters a and c. The
probability of a clustering P(Z) is given by CRP(Z,c).
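\\
As a rough illustration of these two ingredients, the sketch below computes a standard CRP
prior over partitions and the joint score P(T,Z) = P(T|Z)P(Z). It is not the authors'
implementation: the function names are invented here, the mapping between the paper's c and
the concentration parameter used below is an assumption, and rule_log_likelihood is a
hypothetical stand-in for the per-cluster rule likelihood of Model 2.
<code python>
# Illustrative sketch only (not the authors' code).
from math import lgamma, log

def crp_log_prob(cluster_sizes, conc):
    """Log probability of a partition, given as a list of cluster sizes,
    under a Chinese Restaurant Process with concentration parameter conc > 0."""
    n = sum(cluster_sizes)
    logp = len(cluster_sizes) * log(conc) + lgamma(conc) - lgamma(conc + n)
    for n_k in cluster_sizes:
        logp += lgamma(n_k)          # log((n_k - 1)!)
    return logp

def joint_log_prob(train_strings, assignments, conc, rule_log_likelihood):
    """log P(T, Z) = log P(T | Z) + log P(Z), with other parameters suppressed
    as in the text.  rule_log_likelihood scores the strings of one cluster."""
    clusters = {}
    for s, z in zip(train_strings, assignments):
        clusters.setdefault(z, []).append(s)
    log_prior = crp_log_prob([len(c) for c in clusters.values()], conc)
    log_like = sum(rule_log_likelihood(c) for c in clusters.values())
    return log_like + log_prior
</code>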
\\
Unlike in Models 1 and 2, inference by exact enumeration is not possible, and so we are
not able to compute the normalizing constant. But we are still able to compute the
relative posterior probability of a partition of strings into clusters (and hence the
posterior probability distribution over rules for each cluster). Thus, we can use a
Markov chain Monte Carlo (MCMC) scheme to find the posterior distribution over partitions.
In practice we use a Gibbs sampler, an MCMC method for drawing repeated samples from the
posterior probability distribution by iteratively testing all possible cluster assignments
for each string (MacKay, 2003).
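\\
A minimal sketch of one such Gibbs sweep, reusing the hypothetical joint_log_prob from the
sketch above; because only relative posterior probabilities are compared, the unknown
normalizing constant is never needed:
<code python>
# Illustrative sketch only (not the authors' code).
import math
import random

def gibbs_sweep(train_strings, assignments, conc, rule_log_likelihood):
    """One pass of the Gibbs sampler: for each string, test every possible
    cluster assignment (including a brand-new cluster) and resample the
    assignment in proportion to the joint probability of the result."""
    for i in range(len(train_strings)):
        other_labels = {z for j, z in enumerate(assignments) if j != i}
        candidates = sorted(other_labels) + [max(assignments) + 1]  # fresh cluster
        log_scores = []
        for z in candidates:
            proposal = list(assignments)
            proposal[i] = z
            log_scores.append(
                joint_log_prob(train_strings, proposal, conc, rule_log_likelihood))
        # Turn relative (unnormalized) log posteriors into sampling weights.
        m = max(log_scores)
        weights = [math.exp(s - m) for s in log_scores]
        assignments[i] = random.choices(candidates, weights=weights)[0]
    return assignments
</code>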
\\
In all simulations we calculate the posterior probability distribution over rules given
the set of unique string types used in the experimental stimuli. We use types rather than
individual string tokens because a number of computational and experimental investigations
have suggested that types rather than tokens may be a psychologically natural unit for
generalization (Gerken & Bollt, 2008; Goldwater, Griffiths, & Johnson, 2006; Richtsmeier,
& Ohala, in press).
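\\
Purely for illustration (the strings below are invented ABB-style items, not the actual
experimental stimuli), this is what collapsing a token stream to the set of unique types
that the models take as input looks like:
<code python>
# Hypothetical token stream; repeated presentations collapse to one type each.
tokens = ["le di di", "wi je je", "le di di", "de li li", "wi je je"]

types = sorted(set(tokens))   # the models' input: unique string types
print(types)                  # ['de li li', 'le di di', 'wi je je']
</code>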
\\
To assess the probability of a set of test items E = e1 ... en (again computed over types
rather than tokens) after a particular training sequence, we calculate the total
probability that those items would be generated under a particular posterior distribution
over hypotheses. This probability is
\\
\\
{{ ::
\\
\\
which is the product over examples of the probability of a particular example, summed
across the posterior distribution over rules p(R|T). For Model 1 we compute p(ek|rj)
using Eq. (2); for Models 2 and 3 we use Eq. (4).
\\
We use surprisal as our main measure linking posterior probabilities to the results of
looking time studies. Surprisal (negative log probability) is an information-theoretic
measure of how unlikely a particular outcome is. It has been used previously to model
adult reaction time data in sentence processing tasks (Hale, 2001; Levy, 2008) as well
as infant looking times (Frank, Goodman, & Tenenbaum, 2009).
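\\
The equation embedded above did not survive in this copy of the page. Judging from the
description, the quantity being computed is presumably the following, shown together with
the surprisal that links it to looking times (a reconstruction, not the paper's typeset
formula; I(E) is shorthand introduced here):
<code latex>
% Probability of the test items E = e_1 ... e_n given training data T:
P(E \mid T) \;=\; \prod_{k=1}^{n} \sum_{j} p(e_k \mid r_j)\, p(r_j \mid T)

% Surprisal (negative log probability), the linking measure for looking times:
I(E) = -\log P(E \mid T)
</code>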
\\
\\
**Results**
\\
\\
{{ :
\\
\\
----
**Conclusions**
\\
\\
The infant language learning literature has often been framed around the question
“rules or statistics?” We have argued that this is the wrong question. Even if infants
represent symbolic rules with relations like identity (and there is every reason to
believe they do), there is still the question of how they learn these rules, and how
they converge on the correct rule so quickly in a large hypothesis space. This challenge
requires statistics for guiding generalization from sparse data.
\\
In our work here we have shown how domain-general statistical inference principles
operating over minimal rule-like representations can explain a broad set of results in
the rule learning literature.
\\
The inferential principles encoded in our models are the size principle (or, in its more
general form, Bayesian Occam's razor) and the non-parametric tradeoff between complexity
and fit to data encoded in the Chinese Restaurant Process. These principles are not only
useful in modeling rule learning within simple artificial languages; they are also the
same principles used in computational systems for natural language processing that are
engineered to scale to large datasets. They have been applied to tasks as varied as
unsupervised word segmentation (Brent, 1999; Goldwater, Griffiths, & Johnson, 2009),
morphology learning (Albright & Hayes, 2003; Goldwater et al., 2006; Goldsmith, 2001),
and grammar induction (Bannard, Lieven, & Tomasello, 2009; Klein & Manning, 2005;
Perfors, Tenenbaum, & Regier, 2006).
\\
At the same time, several simplifying assumptions of our models deserve comment. First,
our models assumed the minimal machinery needed to capture a range of findings. Rather
than making a realistic guess about the structure of the hypothesis space for rule
learning, we assumed the simplest possible structure wherever evidence was limited. For
example, although there is some evidence that infants may not always encode absolute
positions (Lewkowicz & Berent, 2009), there have been few rule learning studies that go
beyond three-element strings. We therefore defined our rules based on absolute positions
in fixed-length strings. For the same reason, although previous work on adult concept
learning has used infinitely expressive hypothesis spaces with prior distributions that
penalize complexity (e.g. Goodman, Tenenbaum, Feldman, & Griffiths, 2008; Kemp, Goodman,
& Tenenbaum, 2008), we chose a simple uniform prior over rules instead. As more data are
collected from infants, however, we expect that both more complex hypothesis spaces and
priors that prefer simpler hypotheses will become necessary.
\\
Second, our models operated over unique string types as input rather than individual
tokens. This assumption highlights an issue in interpreting the a parameter of Models 2
and 3: there are likely different processes of forgetting that happen over types and
tokens. While individual tokens are likely to be forgotten or misperceived with constant
probability, the probability that a type will be misremembered or corrupted will grow
smaller as more tokens of that type are observed (Frank et al., 2010). An interacting
issue concerns serial position effects: depending on the location of identity regularities
within sequences, rules vary in the ease with which they can be learned (Endress, Scholl,
& Mehler, 2005; Johnson et al., 2009). Both sets of effects could likely be captured by a
better understanding of how limits on memory interact with the principles underlying rule
learning. Although a model that operates only over types may be appropriate for
experiments in which each type is nearly always heard the same number of times, models
that deal with linguistic data must include processes that operate over both types and
tokens (Goldwater et al., 2006; Johnson, Griffiths, & Goldwater, 2007).
\\
Finally, though the domain-general principles we have identified here do capture many
results, there is some additional evidence for domain-specific effects. Learners may
acquire expectations for the kinds of regularities that appear in domains like music
compared with those that appear in speech (Dawson & Gerken, 2009); in addition, a number
of papers have described a striking dissociation between the kinds of regularities that
can be learned from vowels and those that can be learned from consonants (Bonatti, Peña,
Nespor, & Mehler, 2005; Toro, Nespor, Mehler, & Bonatti, 2008). Both sets of results
point to a need for a hierarchical approach to rule learning, in which knowledge of what
kinds of regularities are possible in a domain can itself be learned from the evidence.
Only through further empirical and computational work can we understand which of these
effects can be explained through acquired domain expectations and which are best
explained as innate domain-specific biases or constraints.
\\