Cornell Cognitive Studies Symposium

Statistical Learning across Cognition

Knowledge-lean Approaches to Statistical Natural Language Processing

Lillian Lee
Cornell University
llee@CS.cornell.edu


The goal of natural language processing is to enable computers to use human language as a communication medium accurately, robustly, and gracefully. It is clear that a massive amount of knowledge, linguistic and otherwise, is needed to achieve this goal. As a result, much recent research has focused on getting computers to automatically learn high-quality information about language, and about the world, directly from the statistics of unprocessed or minimally processed language samples alone. We are not particular: any regularities in such samples that enable us to predict, classify, or otherwise characterize the apparent complexity of language for computational use are fair game. As examples, I will focus on two lines of work. The first uses information-theoretic distributional clustering methods, trained on large language samples, to induce sophisticated probabilistic models of linguistic co-occurrences. The second, in contrast, uses very simple statistics, and no model at all, to learn rules for segmenting Japanese kanji character sequences into words; surprisingly, it achieves accuracy rates rivaling those of grammar-based methods.
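To make the first idea concrete, here is a minimal Python sketch of the representation at the heart of distributional clustering: each word is summarized by the empirical distribution of the contexts it co-occurs with, and two words are compared by the Kullback-Leibler divergence between those distributions. The co-occurrence pairs, function names, and smoothing floor below are illustrative assumptions for this sketch, not the talk's models or data.

    from collections import Counter, defaultdict
    import math

    def context_distributions(pairs):
        """Estimate p(context | word) from (word, context) co-occurrence pairs."""
        counts = defaultdict(Counter)
        for word, context in pairs:
            counts[word][context] += 1
        dists = {}
        for word, ctr in counts.items():
            total = sum(ctr.values())
            dists[word] = {c: n / total for c, n in ctr.items()}
        return dists

    def kl_divergence(p, q, eps=1e-9):
        """D(p || q), with q floored at eps so the divergence stays finite."""
        return sum(px * math.log(px / max(q.get(c, 0.0), eps))
                   for c, px in p.items())

    # Toy verb-object pairs: nouns as words, the verbs that take them as
    # direct objects as contexts (an illustrative pairing scheme).
    pairs = [("beer", "drink"), ("beer", "pour"), ("wine", "drink"),
             ("wine", "pour"), ("bread", "eat"), ("bread", "bake")]
    dists = context_distributions(pairs)
    print(kl_divergence(dists["beer"], dists["wine"]))   # 0.0: identical contexts
    print(kl_divergence(dists["beer"], dists["bread"]))  # large: no shared contexts

A clustering procedure would then group words whose context distributions are close under this divergence, yielding soft classes of distributionally similar words.

The second line of work can likewise be illustrated with a toy sketch. The simplified voting rule below, an assumption for illustration rather than the exact procedure from the talk, proposes a word boundary at a location whenever the character n-grams lying entirely to one side of it occur more often in an unsegmented corpus than the n-grams straddling it. The corpus, names, and threshold are invented for the example, and Latin characters stand in for kanji.

    from collections import Counter

    def ngram_counts(corpus, n):
        """Count all character n-grams in a list of unsegmented strings."""
        counts = Counter()
        for line in corpus:
            for i in range(len(line) - n + 1):
                counts[line[i:i + n]] += 1
        return counts

    def boundary_votes(text, counts, n):
        """Score each location k (between text[k-1] and text[k]) by how often
        the adjacent n-grams out-count the n-grams straddling the location."""
        votes = [0.0] * (len(text) + 1)
        for k in range(n, len(text) - n + 1):
            left, right = text[k - n:k], text[k:k + n]
            straddling = [text[k - j:k - j + n] for j in range(1, n)]
            wins = sum(counts[left] > counts[s] for s in straddling)
            wins += sum(counts[right] > counts[s] for s in straddling)
            votes[k] = wins / (2 * (n - 1))
        return votes

    def segment(text, counts, n=2, threshold=0.5):
        """Insert a boundary wherever the vote exceeds the threshold."""
        votes = boundary_votes(text, counts, n)
        return "".join((" " if k > 0 and votes[k] > threshold else "") + ch
                       for k, ch in enumerate(text))

    corpus = ["thecatsatdown", "thecatranaway", "thedogsatdown", "adogranaway"]
    counts = ngram_counts(corpus, 2)
    print(segment("thecatran", counts))  # -> "thecat ran"

On this tiny corpus only the strongest boundary clears the threshold; with a larger corpus, and with votes averaged over several n-gram lengths, more boundaries become separable. The point of the example is only that raw counts, with no grammar and no probabilistic model, already carry segmentation information.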

Portions of this talk are based on joint work with Fernando Pereira and with Rie Kubota Ando.

