Silvia Rădulescu

\\
\\
{{ :screenshot_2016-02-06_20.43.30.png?nolink&300 |}}
\\
\\
**Model 3: //multiple rules under noise//**
\\
Model 3 loosens an additional assumption: that all the strings in the input data are the product of a single rule. Instead, it considers the possibility that there are multiple rules, each consistent with a subset of the training data. We encode a weak bias to have fewer rules via a prior probability distribution that favors more compact partitions of the input. This prior is known as a Chinese Restaurant Process (CRP) prior (Rasmussen, 2000); it introduces a second free parameter, c, which controls the bias over clusterings. A low value of c encodes a bias that there are likely to be many small clusters, while a high value of c encodes a bias that there are likely to be a small number of large clusters.
\\
The joint probability of the training data T and a partition Z of those strings into rule clusters is given by P(T,Z) = P(T|Z)P(Z), neglecting the parameters a and c. The probability of a clustering P(Z) is given by CRP(Z,c).
\\
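As a rough sketch (not the authors' code), the prior term P(Z) = CRP(Z,c) can be computed as below, assuming the standard CRP partition probability with concentration parameter c; the function name and log-space formulation are illustrative, and the paper's c may be scaled differently. The joint P(T,Z) would then add log P(T|Z), presumably the product over clusters of the probability of each cluster's strings under its own rule posterior, which is not sketched here.
<code python>
# Illustrative sketch: log CRP(Z, c) for a partition Z, assuming the standard
# CRP partition probability  P(Z) = c^K * prod_k (n_k - 1)! * Gamma(c) / Gamma(c + n)
from math import lgamma, log
from collections import Counter

def log_crp_prior(assignments, c):
    """assignments: cluster label for each training string; c: CRP parameter."""
    sizes = list(Counter(assignments).values())   # cluster sizes n_k
    n, k = sum(sizes), len(sizes)                 # number of strings, number of clusters
    return (k * log(c)
            + sum(lgamma(nk) for nk in sizes)     # sum_k log (n_k - 1)!
            + lgamma(c) - lgamma(c + n))

# e.g. six training strings partitioned into two rule clusters
print(log_crp_prior([0, 0, 0, 1, 1, 1], c=1.0))
</code>
\\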
Unlike in Models 1 and 2, inference by exact enumeration is not possible and so we are not able to compute the normalizing constant. But we are still able to compute the relative posterior probability of a partition of strings into clusters (and hence the posterior probability distribution over rules for that cluster). Thus, we can use a Markov chain Monte Carlo (MCMC) scheme to find the posterior distribution over partitions. In practice we use a Gibbs sampler, an MCMC method for drawing repeated samples from the posterior probability distribution via iteratively testing all possible cluster assignments for each string (MacKay, 2003).
\\
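To make the Gibbs step concrete, here is an illustrative sweep over cluster assignments (again not the paper's code; all names are assumptions). The helper cluster_score(s, members) stands in for p(s | cluster members), i.e. the string's probability under that cluster's posterior over rules (Eq. (4)); its implementation is not given here.
<code python>
# Illustrative Gibbs sweep: resample each string's cluster assignment in turn,
# with probability proportional to (CRP predictive prior) x (cluster likelihood).
import random
from collections import defaultdict

def gibbs_sweep(strings, assignments, c, cluster_score):
    """assignments: list of integer cluster labels, one per string (modified in place)."""
    n = len(strings)
    for i, s in enumerate(strings):
        assignments[i] = None                      # remove s from its current cluster
        members = defaultdict(list)
        for j, z in enumerate(assignments):
            if z is not None:
                members[z].append(strings[j])
        new_label = max(members, default=-1) + 1   # label for a brand-new cluster
        candidates, weights = [], []
        for z, mem in members.items():             # option: join an existing cluster
            candidates.append(z)
            weights.append(len(mem) / (n - 1 + c) * cluster_score(s, mem))
        candidates.append(new_label)               # option: start a new cluster
        weights.append(c / (n - 1 + c) * cluster_score(s, []))
        assignments[i] = random.choices(candidates, weights=weights)[0]
    return assignments
</code>
Only unnormalized weights are needed for each string, which is why the intractable normalizing constant never has to be computed.
\\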
In all simulations we calculate the posterior probability distribution over rules given the set of unique string types used in the experimental stimuli. We use types rather than individual string tokens because a number of computational and experimental investigations have suggested that types rather than tokens may be a psychologically natural unit for generalization (Gerken & Bollt, 2008; Goldwater, Griffiths, & Johnson, 2006; Richtsmeier, Gerken, & Ohala, in press).
\\
To assess the probability of a set of test items E = e_1 ... e_n (again computed over types rather than tokens) after a particular training sequence, we calculate the total probability that those items would be generated under a particular posterior distribution over hypotheses. This probability is
\\
\\
{{ ::screenshot_2016-02-06_20.54.45.png?nolink&300 |}}
\\
\\
which is the product over examples of the probability of a particular example, summed across the posterior distribution over rules p(R|T). For Model 1 we compute p(e_k|r_j) using Eq. (2); for Models 2 and 3 we use Eq. (4).
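\\
In code, this quantity is just a product over test types of a posterior-weighted mixture. A minimal sketch, assuming the posterior over rules is available as a dictionary and that likelihood(e, r) stands in for Eq. (2) or Eq. (4):
<code python>
# Illustrative computation of log P(E): for each test type e_k, average
# p(e_k | r_j) over the posterior p(r_j | T), then take the product (sum of logs).
from math import log

def log_prob_test_set(test_items, posterior, likelihood):
    """posterior: dict rule -> p(rule | training data), summing to 1
       likelihood: function (item, rule) -> p(item | rule)"""
    total = 0.0
    for e in test_items:
        p_e = sum(p_r * likelihood(e, r) for r, p_r in posterior.items())
        total += log(p_e)
    return total
</code>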
\\
We use surprisal as our main measure linking posterior probabilities to the results of looking time studies. Surprisal (negative log probability) is an information-theoretic measure of how unlikely a particular outcome is. It has been used previously to model adult reaction time data in sentence processing tasks (Hale, 2001; Levy, 2008) as well as infant looking times (Frank, Goodman, & Tenenbaum, 2009).
\\
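The linking measure is then just the negative of the log probability above; a trivial sketch (the log base, natural log here, is a presentation choice, not taken from the paper):
<code python>
# Surprisal: negative log probability, the measure linked to looking times.
import math

def surprisal(p):
    return -math.log(p)   # natural log here; the base is a presentation choice

# a less probable (rule-inconsistent) test set gets higher surprisal
print(surprisal(0.25), surprisal(0.01))
</code>
\\
\\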
**Results**
\\
\\
{{ :screenshot_2016-02-06_21.03.21.png?nolink |}}
\\
\\
----
**Conclusions:**
\\
\\
The infant language learning literature has often been framed around the question "rules or statistics?" We suggest that this is the wrong question. Even if infants represent symbolic rules with relations like identity—and there is every reason to believe they do—there is still the question of how they learn these rules, and how they converge on the correct rule so quickly in a large hypothesis space. This challenge requires statistics for guiding generalization from sparse data.
\\
In our work here we have shown how domain-general statistical inference principles operating over minimal rule-like representations can explain a broad set of results in the rule learning literature.
\\
The inferential principles encoded in our models—the size principle (or in its more general form, Bayesian Occam’s razor) and the non-parametric tradeoff between complexity and fit to data encoded in the Chinese Restaurant Process—are not only useful in modeling rule learning within simple artificial languages. They are also the same principles that are used in computational systems for natural language processing that are engineered to scale to large datasets. These principles have been applied to tasks as varied as unsupervised word segmentation (Brent, 1999; Goldwater, Griffiths, & Johnson, 2009), morphology learning (Albright & Hayes, 2003; Goldwater et al., 2006; Goldsmith, 2001), and grammar induction (Bannard, Lieven, & Tomasello, 2009; Klein & Manning, 2005; Perfors, Tenenbaum, & Regier, 2006).
\\
First, our models assumed the minimal machinery needed to capture a range of findings. Rather than making a realistic guess about the structure of the hypothesis space for rule learning, where evidence was limited we assumed the simplest possible structure. For example, although there is some evidence that infants may not always encode absolute positions (Lewkowicz & Berent, 2009), there have been few rule learning studies that go beyond three-element strings. We therefore defined our rules based on absolute positions in fixed-length strings. For the same reason, although previous work on adult concept learning has used infinitely expressive hypothesis spaces with prior distributions that penalize complexity (e.g. Goodman, Tenenbaum, Feldman, & Griffiths, 2008; Kemp, Goodman, & Tenenbaum, 2008), we chose a simple uniform prior over rules instead. With the collection of more data from infants, however, we expect that both more complex hypothesis spaces and priors that prefer simpler hypotheses will become necessary.
\\
Second, our models operated over unique string types as input rather than individual tokens. This assumption highlights an issue in interpreting the a parameter of Models 2 and 3: there are likely different processes of forgetting that happen over types and tokens. While individual tokens are likely to be forgotten or misperceived with constant probability, the probability of a type being misremembered or corrupted will grow smaller as more tokens of that type are observed (Frank et al., 2010). An interacting issue concerns serial position effects. Depending on the location of identity regularities within sequences, rules vary in the ease with which they can be learned (Endress, Scholl, & Mehler, 2005; Johnson et al., 2009). Both of these sets of effects could likely be captured by a better understanding of how limits on memory interact with the principles underlying rule learning. Although a model that operates only over types may be appropriate for experiments in which each type is nearly always heard the same number of times, models that deal with linguistic data must include processes that operate over both types and tokens (Goldwater et al., 2006; Johnson, Griffiths, & Goldwater, 2007).
\\
Finally, though the domain-general principles we have identified here do capture many results, there is some additional evidence for domain-specific effects. Learners may acquire expectations for the kinds of regularities that appear in domains like music compared with those that appear in speech (Dawson & Gerken, 2009); in addition, a number of papers have described a striking dissociation between the kinds of regularities that can be learned from vowels and those that can be learned from consonants (Bonatti, Peña, Nespor, & Mehler, 2005; Toro, Nespor, Mehler, & Bonatti, 2008). Both sets of results point to a need for a hierarchical approach to rule learning, in which knowledge of what kinds of regularities are possible in a domain can itself be learned from the evidence. Only through further empirical and computational work can we understand which of these effects can be explained through acquired domain expectations and which are best explained as innate domain-specific biases or constraints.
\\