Probabilistic Grammars and Data-Oriented Parsing 

Lecturers: Detlef Prescher, Remko Scha, and Khalil Sima'an
Date: Tuesday, 4-6 p.m.
Location: P.018
First Lecture: February 3, 2004

This course will be given in English!

Content: Probabilistic grammars start where formal language theory stops. Ambiguity resolution, robustness, efficiency, learning/adaptation, and estimation from data are the starting points for probabilistic grammars. The subject matter for these mathematical models is Natural Language Processing (as opposed to Formal Language Processing, e.g. compilation). The course concentrates on syntactic parsing, with ambiguity resolution as the core subject for probabilistic grammars. We will look mainly at the following subjects for Probabilistic Context-Free Grammars (PCFGs): (1) mathematical aspects of probabilistic grammars, (2) estimation theory for supervised and unsupervised learning, (3) algorithms for the estimation of PCFGs, and (4) efficient parsing under PCFGs. We will then consider aspects of Data-Oriented Parsing and how it resembles and deviates from PCFGs: parameter estimation, efficiency, and linguistic issues. The course has a slot for student presentations of a paper from a list provided by the lecturers.
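To give a flavour of "estimation from data": for a PCFG estimated from a fully annotated treebank, the maximum-likelihood estimate of a rule probability turns out to be the rule's relative frequency among all rules with the same left-hand side. The following minimal sketch (the toy treebank and tree encoding are invented for illustration, not taken from the course material) shows this in Python:

```python
from collections import Counter

# Toy "treebank": each tree is a nested tuple (label, child, child, ...);
# leaves are plain strings. These two trees are invented for illustration.
treebank = [
    ("S", ("NP", "she"), ("VP", ("V", "eats"), ("NP", "fish"))),
    ("S", ("NP", "fish"), ("VP", ("V", "swims"))),
]

def rules(tree):
    """Yield the CFG rules (lhs, rhs) used in a tree."""
    if isinstance(tree, str):          # a leaf carries no rule
        return
    lhs, children = tree[0], tree[1:]
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield (lhs, rhs)
    for c in children:
        yield from rules(c)

def mle_pcfg(trees):
    """Relative-frequency (= maximum-likelihood) estimate of rule probabilities."""
    rule_count = Counter(r for t in trees for r in rules(t))
    lhs_count = Counter()
    for (lhs, _), n in rule_count.items():
        lhs_count[lhs] += n
    # P(lhs -> rhs) = count(lhs -> rhs) / count(lhs)
    return {r: n / lhs_count[r[0]] for r, n in rule_count.items()}

pcfg = mle_pcfg(treebank)
# NP occurs three times, twice rewriting to "fish", so P(NP -> fish) = 2/3
```

Why MLE coincides with relative frequency for treebank training, and why this breaks down in the unsupervised setting (motivating EM), is one of the themes of the first part of the course.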


Besides Remko Scha's background material on Data-Oriented Parsing (DOP), (parts of) the following papers and books are especially interesting for the class:


FIRST PART (Lectures 1-7, by Detlef Prescher)
Keywords: Motivation for working with probabilistic grammars (the need to deal with uncertainty); definition of probabilistic context-free grammars (PCFGs); statistical estimation theory and how it applies to PCFGs; formal notions of a corpus and a probability model; maximum-likelihood estimation (MLE); relation of MLE to relative-frequency estimation; other kinds of estimators: maximum-entropy estimation and the expectation-maximization (EM) algorithm; treebank training for PCFGs and unsupervised training of PCFGs using the EM algorithm; efficient parsing under PCFGs: the Viterbi, inside, and outside algorithms; the inside-outside algorithm for unsupervised training of PCFGs and its relation to EM.
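As a preview of "efficient parsing under PCFGs", here is a minimal sketch of Viterbi parsing with the CKY algorithm, which finds the probability of the most probable parse of a sentence under a PCFG in Chomsky normal form. The toy grammar and probabilities are invented for illustration:

```python
from collections import defaultdict

# Toy PCFG in Chomsky normal form, invented for illustration.
# Each left-hand side maps to a list of (rhs, probability) pairs;
# rhs is a 1-tuple (terminal) or a 2-tuple of nonterminals.
GRAMMAR = {
    "S":  [(("NP", "VP"), 1.0)],
    "VP": [(("V", "NP"), 1.0)],
    "NP": [(("she",), 0.5), (("fish",), 0.5)],
    "V":  [(("eats",), 1.0)],
}

def viterbi_cky(words, grammar, start="S"):
    """Probability of the best parse of `words` under `grammar` (Viterbi/CKY)."""
    n = len(words)
    # best[(i, j)][A] = max probability that A derives words[i:j]
    best = defaultdict(dict)
    for i, w in enumerate(words):                      # terminal rules
        for lhs, rhss in grammar.items():
            for rhs, p in rhss:
                if rhs == (w,):
                    best[(i, i + 1)][lhs] = max(best[(i, i + 1)].get(lhs, 0.0), p)
    for span in range(2, n + 1):                       # binary rules, bottom-up
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):                  # split point
                for lhs, rhss in grammar.items():
                    for rhs, p in rhss:
                        if len(rhs) == 2:
                            pb = best[(i, k)].get(rhs[0], 0.0)
                            pc = best[(k, j)].get(rhs[1], 0.0)
                            if pb and pc and p * pb * pc > best[(i, j)].get(lhs, 0.0):
                                best[(i, j)][lhs] = p * pb * pc
    return best[(0, n)].get(start, 0.0)

p = viterbi_cky(["she", "eats", "fish"], GRAMMAR)      # 1.0 * 0.5 * 1.0 * 1.0 * 0.5
```

Replacing the maximization over split points and rules with a summation yields the inside algorithm, which computes the total probability of the sentence; combined with the outside algorithm it gives the inside-outside (EM) procedure for unsupervised training discussed in the lectures.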
MID-TERM PROJECT (Three weeks: From March 16 to April 6)
The mid-term project consists of an experiment and a report on that experiment; it counts as the homework for about seven lectures. [Note also that there are no lectures in the week of March 22-26.] The submission deadline for the project report is April 13.

SECOND PART (Lectures 8-14, mainly by Remko Scha)
Keywords: Introduction to data-oriented parsing (DOP). This final part will also consist of presentations given by the students (Relevant papers will be put on this page. The topics are: Shallow parsing, parsing with PCFGs and unification grammars, data-oriented parsing, etc.).
FINAL PROJECT (Four weeks: From May 11 to June 8)
The final project is similar to the mid-term project but somewhat larger in scope. The submission deadline for the project report is June 15.

The final mark consists of: 40% mid-term project + 40% final project + 20% presentation.