The Role of Distributional Information in Linguistic Category Formation

Abstract

A crucial component of language acquisition involves organizing words into grammatical categories and discovering relations between them. Many studies have argued that phonological or semantic cues, or multiple correlated cues, are required for learning. Here we examine how distributional variables shift learners from forming a category of lexical items to maintaining lexical specificity. In a series of artificial language learning experiments, we vary a number of distributional cues to category structure and test how adult learners use this information to inform their hypotheses about categorization. Our results show that learners are sensitive to the contexts in which each word occurs, the overlap in contexts across words, the non-overlap of contexts (or systematic gaps), and the size of the data set. Taken together, these variables determine whether learners fully generalize or preserve lexical specificity.

Introduction

Language acquisition crucially involves finding the grammatical categories of words in the input. The organization of elements into categories, and the generalization of patterns from some seen element combinations to novel ones, account for important aspects of the expansion of linguistic knowledge in early stages of language acquisition. One hypothesis of how learners approach the problem of categorization is that the categories (but not their contents) are innately specified prior to experiencing any linguistic input, with the assignment of tokens to categories accomplished with minimal exposure. A second possibility is that the categories are formed around a semantic definition. A third hypothesis, explored in the present research, is that the distributional information in the environment is sufficient (along with a set of learning biases) to extract the categorical structure of natural language. While it is likely that each of these sources of evidence makes important contributions to language acquisition, this third hypothesis regarding distributional learning has often been thought to be an unlikely contributor, given the information processing limitations of young children and the complexity of the computational processes that would be entailed.

Furthermore, it has been difficult to test the importance of such a distributional learning mechanism because the cues to category structure in natural languages are highly correlated. In fact, it has been argued in many artificial language studies that the formation of linguistic categories (e.g., noun, verb) depends crucially on some perceptual property linking items within the category (Braine, 1987). This perceptual similarity relation might arise from identity or repetition of elements in grammatical sequences, or from a phonological or semantic cue identifying words across different sentences as similar to one another (for example, words ending in –a are feminine, or words referring to concrete objects are nouns). Without such cues, learners of artificial languages have been unable to acquire grammatical categories and to extend their linguistic contexts to new items correctly (Braine et al., 1990; Frigo & McDonald, 1998; Gomez & Gerken, 2000). However, this has been something of a puzzle: Maratsos & Chalkley (1980) argued that in natural languages, grammatical categories do not have reliable phonological or semantic cues; rather, learners must use distributional cues about the linguistic contexts in which words occur to acquire such categories. Mintz, Newport & Bever (2002), as well as several other researchers, have shown that computational procedures using distributional contexts can form elementary linguistic categories from corpora of mothers’ speech to young children in the CHILDES database, and Mintz (2002) and Gerken et al. (2005) have shown that both adults and infants can learn a simple version of this paradigm in the laboratory, at least when there are multiple correlated distributional cues. In the present series of experiments we also begin by demonstrating that there are distributional properties that lead to successful learning of linguistic categories in artificial language paradigms.
Importantly, however, to understand how this mechanism works in human learners and why many previous experiments have not found such learning, we present a series of experiments that manipulate various aspects of these distributional variables and thereby identify the computational requirements for successful category learning.
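
As an illustration only, the kind of distributional tally at issue (the contexts in which each word occurs, the overlap in contexts across words, and the gaps) can be sketched as follows. This is not the procedure used in the studies cited above, nor the design of the present experiments. The toy corpus, the word labels (w1, w2, w3), the function names, and the choice of Jaccard overlap as the measure are all assumptions made for the example.

```python
from itertools import combinations

# Hypothetical toy corpus (an assumption for illustration):
# each item is (preceding word, target word, following word).
corpus = [
    ("a", "w1", "b"),
    ("a", "w2", "b"),
    ("c", "w1", "d"),
    ("c", "w2", "d"),
    ("a", "w3", "b"),
]

def contexts_by_word(sentences):
    """Map each target word to the set of (preceding, following) frames it occurs in."""
    table = {}
    for before, word, after in sentences:
        table.setdefault(word, set()).add((before, after))
    return table

def overlap(ctx_a, ctx_b):
    """Jaccard overlap: shared frames divided by all distinct frames of the two words."""
    union = ctx_a | ctx_b
    return len(ctx_a & ctx_b) / len(union) if union else 0.0

def gaps(table):
    """For each word, the frames attested for some other word but not for that word."""
    all_frames = set().union(*table.values())
    return {word: all_frames - frames for word, frames in table.items()}

table = contexts_by_word(corpus)
pairwise = {
    (a, b): overlap(table[a], table[b])
    for a, b in combinations(sorted(table), 2)
}
missing = gaps(table)
```

In this toy input, w1 and w2 occur in identical frames, whereas w3 is attested in only one of the two frames; a learner tracking such tallies would have to decide whether the unattested frame for w3 is an accidental or a systematic gap, a decision that plausibly depends on how much data has been observed.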