Silvia Rădulescu

Trace:

Differences

This shows you the differences between two versions of the page.

Link to this comparison view

Both sides previous revisionPrevious revision
Next revision
Previous revision
frank_tenenbaum_2011 [2016/02/06 18:28] silviafrank_tenenbaum_2011 [2016/02/08 15:48] (current) silvia
Line 19: Line 19:
 **Introduction:** **Introduction:**
 \\ \\
-1. Introduction: from ‘‘rules vs. statistics’’ to statistics +\\ 
-over rules+**1. Introduction: from ‘‘rules vs. statistics’’ to statistics 
 +over rules**
 \\ \\
 A central debate in the study of language acquisition A central debate in the study of language acquisition
Line 106: Line 107:
 learners are doing in these tasks.** learners are doing in these tasks.**
 \\ \\
 +Our models are ideal observer models: they provide a 
 +description of the learning problem and show what the 
 +correct inference would be, under a given set of assumptions. 
 +The ideal observer approach has a long history in 
 +the study of perception and is typically used for understanding 
 +the ways in which performance conforms to or 
 +deviates from the ideal (Geisler, 2003). 
 +\\ 
 +With few exceptions (Dawson & Gerken, 2009; Johnson 
 +et al., 2009), empirical work on rule learning has been 
 +geared towards showing what infants can do, rather than 
 +providing a detailed pattern of successes and failures 
 +across ages. 
 +\\ 
 +\\ 
 +**Models** 
 +\\ 
 +The hypothesis space is constant 
 +across all three models, but the inference procedure 
 +varies depending on the assumptions of each model. 
 +\\ 
 +Our approach is to make the simplest possible assumptions 
 +about representational components, including the 
 +structure of the hypothesis space and the prior on hypotheses. 
 +As a consequence, the hypothesis space of our models 
 +is too simple to describe the structure of interesting phenomena 
 +in natural language, and our priors do not capture 
 +any of the representational biases that human learners 
 +may brings to language acquisition. 
 +\\ 
 +Nevertheless, our hope is that this approach will help in 
 +articulating the principles of generalization underlying 
 +experimental results on rule learning. 
 +\\ 
 +**2.1. Hypothesis space** 
 +\\ 
 +This hypothesis space is based 
 +on the idea of a rule as a restriction on strings. We define 
 +the set of strings S as the set of ordered triples of elements 
 +s1, s2, s3 where all s are members of vocabulary of elements, 
 +V. There are thus |V|3 possible elements in S. 
 +\\ 
 +For each set of simulations, we define S as the total set 
 +of string elements used in a particular experiment. 
 +\\ 
 +For example, in Marcus et al (1999): set of elements S = {ga, gi, ta, ti, na, ni, la, li}. These elements 
 +are treated by our models as unique identifiers that do not 
 +encode any information about phonetic relationships between 
 +syllables. 
 +\\ 
 +A rule defines a subset of S. Rules are written as ordered 
 +triples of primitive functions (f1, f2, f3). Each function operates 
 +over an element in the corresponding position in a 
 +string and returns a truth value. For example, f1 defines a 
 +restriction on the first string element, x1. The set F of functions 
 +is a set which for our simulations includes ^ (a function 
 +which is always true of any element) and a set of 
 +functions is y(x) which are only true if x = y where y is a 
 +particular element. The majority of the experiments addressed 
 +here make use of only one other function: the 
 +identity function =a which is true if x = xa. For example, in 
 +Marcus et al. (1999), learners heard strings like ga ti ti 
 +and ni la la, which are consistent with (^, ^, =2) (ABB, or ‘‘second 
 +and third elements equal’’). The stimuli in that experiment 
 +were also consistent with another regularity, 
 +however: (^,^,^), which is true of any string in S. One additional 
 +set of experiments makes use of musical stimuli 
 +for which the functions >a and <a (higher than and lower 
 +than) are defined. They are true when x > xa and x < xa 
 +respectively. 
 +\\ 
 +\\ 
 +**Model 1: //single rule//** 
 +\\ 
 +\\ 
 +Model 1 begins with the framework for generalization 
 +introduced by Tenenbaum and Griffiths (2001). It uses exact 
 +Bayesian inference to calculate the posterior probability 
 +of a particular rule r given the observed set of training 
 +sentences T = t1 . . . tm. This probability can be factored via 
 +Bayes’ rule into the product of the likelihood of the training 
 +data being generated by a particular rule p(T|r), and a prior 
 +probability of that rule p(r), normalized by the sum of 
 +these over all rules: 
 +\\ 
 +\\ 
 +{{ ::screenshot_2016-02-06_20.28.40.png?nolink&200 |}} 
 +\\ 
 +\\ 
 +We assume a uniform prior p(r) = 1/|R|, meaning that no 
 +rule is a priori more probable than any other. For human 
 +learners the prior over rules is almost certainly not uniform  
 +and could contain important biases about the kinds of 
 +structures that are used preferentially in human language 
 +(whether these biases are learned or innate, domaingeneral 
 +or domain-specific). 
 +\\ 
 +We assume that training examples are generated by 
 +sampling uniformly from the set of sentences that are congruent 
 +with one rule. This assumption is referred to as 
 +strong sampling, and leads to the size principle: the probability 
 +of a particular string being generated by a particular 
 +rule is inversely proportional to the total number of strings 
 +that are congruent with that rule (which we notate |r|). 
 +\\ 
 +**Model 2: //single rule under noise//** 
 +\\ 
 +Model 1 assumed that every data point must be accounted 
 +for by the learner’s hypothesis. However, there 
 +are many reasons this might not hold for human learners: 
 +the learner’s rules could permit exceptions, the data could 
 +be perceived noisily such that a training example might 
 +be lost or mis-heard, or data could be perceived correctly 
 +but not remembered at test. Model 2 attempts to account 
 +for these sources of uncertainty by consolidating them all 
 +within a single parameter. While future research will almost 
 +certainly differentiate these factors (for an example 
 +of this kind of work, see Frank, Goldwater, Griffiths, & 
 +Tenenbaum, 2010), here we consolidate them for 
 +simplicity. 
 +\\ 
 +To add noise to the input data, we add an additional 
 +step to the generative process: after strings are sampled 
 +from the set consistent with a particular rule, we flip a 
 +biased coin with weight a. With probability a, the string 
 +remains the same, while with probability 1 - a, the string 
 +is replaced with another randomly chosen element. 
 +\\ 
 +Under Model 1, a rule had likelihood zero if any string in 
 +the set T was inconsistent with it. With any appreciable level 
 +of input uncertainty, this likelihood function would result 
 +in nearly all rules having probability zero. To deal with 
 +this issue, we assume in Model 2 that learners know that 
 +their memory is fallible, and that strings may be misremembered 
 +with probability 1 - a. 
 +\\ 
 +\\ 
 +{{ :screenshot_2016-02-06_20.43.30.png?nolink&300 |}} 
 +\\ 
 +\\ 
 +**Model 3: //multiple rules under noise//** 
 +\\ 
 +Model 3 loosens an additional assumption: that all the 
 +strings in the input data are the product of a single rule. Instead, 
 +it considers the possibility that there are multiple 
 +rules, each consistent with a subset of the training data. 
 +We encode a weak bias to have fewer rules via a prior 
 +probability distribution that favors more compact partitions 
 +of the input. This prior is known as a Chinese Restaurant 
 +Process (CRP) prior (Rasmussen, 2000); it introduces a 
 +second free parameter, c, which controls the bias over clusterings. 
 +A low value of c encodes a bias that there are likely 
 +to be many small clusters, while a high value of c encodes a 
 +bias that there are likely to be a small number of large 
 +clusters. 
 +The joint probability of the training data T and a partition 
 +Z of those strings into rule clusters is given by 
 +P(T,Z) = P(T|Z)P(Z) 
 +neglecting the parameters a and c. The probability of a 
 +clustering P(Z) is given by CRP(Z,c). 
 +\\ 
 +Unlike in Models 1 and 2, inference by exact enumeration 
 +is not possible and so we are not able to compute the 
 +normalizing constant. But we are still able to compute the 
 +relative posterior probability of a partition of strings into 
 +clusters (and hence the posterior probability distribution 
 +over rules for that cluster). Thus, we can use a Markovchain 
 +Monte Carlo (MCMC) scheme to find the posterior 
 +distribution over partitions. In practice we use a Gibbs 
 +sampler, an MCMC method for drawing repeated samples 
 +from the posterior probability distribution via iteratively 
 +testing all possible cluster assignments for each string 
 +(MacKay, 2003). 
 +\\ 
 +In all simulations we calculate the posterior probability 
 +distribution over rules given the set of unique string types 
 +used in the experimental stimuli. We use types rather than 
 +rather than individual string tokens because a number of 
 +computational and experimental investigations have suggested 
 +that types rather than tokens may be a psychologically 
 +natural unit for generalization (Gerken & Bollt, 2008; 
 +Goldwater, Griffiths, & Johnson, 2006; Richtsmeier, Gerken, 
 +& Ohala, in press). 
 +\\ 
 +To assess the probability of a set of test items 
 +E = e1 . . . en (again computed over types rather than tokens) 
 +after a particular training sequence, we calculate the total 
 +probability that those items would be generated under a 
 +particular posterior distribution over hypotheses. This 
 +probability is 
 +\\ 
 +\\ 
 +{{ ::screenshot_2016-02-06_20.54.45.png?nolink&300 |}} 
 +\\ 
 +\\ 
 +which is the product over examples of the probability of a 
 +particular example, summed across the posterior distribution 
 +over rules p(R|T). For Model 1 we compute p(ek|rj) 
 +using Eq. (2); for Models 2 and 3 we use Eq. (4). 
 +We use surprisal as our main measure linking posterior 
 +probabilities to the results of looking time studies. Surprisal 
 +(negative log probability) is an information-theoretic 
 +measure of how unlikely a particular outcome is. It has 
 +been used previously to model adult reaction time data 
 +in sentence processing tasks (Hale, 2001; Levy, 2008) as 
 +well as infant looking times (Frank, Goodman, & 
 +Tenenbaum, 2009). 
 +\\ 
 +\\ 
 +**Results** 
 +\\ 
 +\\ 
 +{{ :screenshot_2016-02-06_21.03.21.png?nolink |}}
 \\ \\
 \\ \\
Line 112: Line 325:
 ---- ----
 **Conclusions:** **Conclusions:**
 +\\
 +\\
 +The infant language learning literature has often been
 +framed around the question ‘‘rules or statistics?’’ We suggest
 +that this is the wrong question. Even if infants represent
 +symbolic rules with relations like identity—and there
 +is every reason to believe they do—there is still the question
 +of how they learn these rules, and how they converge
 +on the correct rule so quickly in a large hypothesis space.
 +This challenge requires statistics for guiding generalization
 +from sparse data.
 +\\
 +from sparse data.
 +In our work here we have shown how domain-general
 +statistical inference principles operating over minimal
 +rule-like representations can explain a broad set of results
 +in the rule learning literature.
 +\\
 +The inferential principles encoded in our models—the
 +size principle (or in its more general form, Bayesian Occam’s
 +razor) and the non-parametric tradeoff between
 +complexity and fit to data encoded in the Chinese Restaurant
 +Process—are not only useful in modeling rule learning
 +within simple artificial languages. They are also the same
 +principles that are used in computational systems for natural
 +language processing that are engineered to scale to
 +large datasets. These principle have been applied to tasks
 +as varied as unsupervised word segmentation (Brent,
 +1999; Goldwater, Griffiths, & Johnson, 2009), morphology
 +learning (Albright & Hayes, 2003; Goldwater et al., 2006;
 +Goldsmith, 2001), and grammar induction (Bannard,
 +Lieven, & Tomasello, 2009; Klein & Manning, 2005; Perfors,
 +Tenenbaum, & Regier, 2006).
 +\\
 +First, our models assumed the minimal machinery
 +needed to capture a range of findings. Rather than making
 +a realistic guess about the structure of the hypothesis
 +space for rule learning, where evidence was limited we assumed
 +the simplest possible structure. For example,
 +although there is some evidence that infants may not always
 +encode absolute positions (Lewkowicz & Berent,
 +2009), there have been few rule learning studies that go
 +beyond three-element strings. We therefore defined our
 +rules based on absolute positions in fixed-length strings.
 +For the same reason, although previous work on adult concept
 +learning has used infinitely expressive hypothesis
 +spaces with prior distributions that penalize complexity
 +(e.g. Goodman, Tenenbaum, Feldman, & Griffiths, 2008;
 +Kemp, Goodman, & Tenenbaum, 2008), we chose a simple
 +uniform prior over rules instead. With the collection of
 +more data from infants, however, we expect that both
 +more complex hypothesis spaces and priors that prefer
 +simpler hypotheses will become necessary.
 +\\
 +Second, our models operated over unique string types
 +as input rather than individual tokens. This assumption
 +highlights an issue in interpreting the a parameter of Models
 +2 and 3: there are likely different processes of forgetting
 +that happen over types and tokens. While individual tokens
 +are likely to be forgotten or misperceived with constant
 +probability, the probability of a type being
 +misremembered or corrupted will grow smaller as more
 +tokens of that type are observed (Frank et al., 2010). An
 +interacting issue concerns serial position effects. Depending
 +on the location of identity regularities within sequences,
 +rules vary in the ease with which they can be
 +learned (Endress, Scholl, & Mehler, 2005; Johnson et al.,
 +2009). Both of these sets of effects could likely be captured
 +by a better understanding of how limits on memory interact
 +with the principles underlying rule learning. Although a
 +model that operates only over types may be appropriate
 +for experiments in which each type is nearly always heard
 +the same number of times, models that deal with linguistic
 +data must include processes that operate over both types
 +and tokens (Goldwater et al., 2006; Johnson, Griffiths, &
 +Goldwater, 2007).
 +\\
 +Finally, though the domain-general principles we have
 +identified here do capture many results, there is some
 +additional evidence for domain-specific effects. Learners
 +may acquire expectations for the kinds of regularities that
 +appear in domains like music compared with those that
 +appear in speech (Dawson & Gerken, 2009); in addition, a
 +number of papers have described a striking dissociation
 +between the kinds of regularities that can be learned from
 +vowels and those that can be learned from consonants
 +(Bonatti, Peña, Nespor, & Mehler, 2005; Toro, Nespor,
 +Mehler, & Bonatti, 2008). Both sets of results point to a
 +need for a hierarchical approach to rule learning, in which
 +knowledge of what kinds of regularities are possible in a
 +domain can itself be learned from the evidence. Only
 +through further empirical and computational work can
 +we understand which of these effects can be explained
 +through acquired domain expectations and which are best
 +explained as innate domain-specific biases or constraints.
 \\ \\