Introduction to Statistical Natural Language Processing

Lecturer: Detlef Prescher
Time: Mondays, 2:15 - 3:45 p.m.
Location: SR 24, INF 325, Department of Computational Linguistics, University of Heidelberg
First Lecture: Monday, April 16, 2007

Exercises
Time: Wednesdays, 4:15 - 5:45 p.m.
Location: SR 7 (or CIP pool), INF 325, Department of Computational Linguistics, University of Heidelberg
First Exercise: Wednesday, April 25, 2007

Announcements

2007-08-30: The final exam produced one "sehr gut", six "gut", four "befriedigend", and two "mangelhaft". Here are the final grades...
2007-07-19: The script now covers slots 2 to 10
2007-07-19: Assignment 12, due July 23
2007-07-19: The midterm exam produced two "sehr gut", five "gut", four "befriedigend", and two "ausreichend"
2007-07-12: Assignment 11, due July 17
2007-06-08: The new version of our script now covers slots 2 to 7
2007-07-04: Assignment 10, due July 9
2007-06-26: No assignment this week (because of the midterm exam on July 4, 2007)
2007-06-18: Assignment 9, due June 25
2007-06-11: No assignment this week
2007-06-08: First version of our script (covering slots 2 to 4)
2007-06-01: Assignment 8, due June 18
2007-05-26: Assignment 7, due June 11
2007-05-26: Tentative date of the final exam is July 25, 2007
2007-05-26: Tentative date of the midterm exam is July 4, 2007
2007-05-15: Assignment 6, due May 29
2007-05-11: Assignment 5, due May 21
2007-05-09: Please sign up! You can register for this course only until June 15...
2007-05-05: Assignment 4, due May 14
2007-04-28: Assignment 3, due May 7
2007-04-23: Please solve the assignments in teams
2007-04-23: Andreas Dörr and Mateusz Dworaczek will provide us with a script
2007-04-23: Assignment 2, due April 30
2007-04-16: Assignment 1, due April 23
2007-04-16: Website published

Course Description

The course has three parts. We start with an introduction to Probability Theory and Statistics and get to know such terms as frequency, relative frequency, probability distribution, corpus, probability model, estimation, and supervised/unsupervised learning. Then we give an introduction to Statistical NLP. We present the most important models and training methods of the field (such as Markov models (MMs), hidden Markov models (HMMs), and probabilistic context-free grammars (PCFGs) for language modeling, POS tagging, and parsing), focusing on the fact that these training methods can be identified as instances of maximum-likelihood estimation (MLE), the most important estimation method in Statistics. Finally, we present important evaluation methods within NLP (such as cross-entropy measures for language modeling, PARSEVAL for parsing, and BLEU for machine translation). Exercises accompany the course. We also intend to download and experiment with some NLP tools and corpora from the World Wide Web, and participants may be asked to implement some small things (such as smoothing methods).
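As a small illustration of the link between relative frequencies and MLE mentioned above, here is a minimal sketch in Python. It is not taken from the course materials: the toy corpus, the unigram model, and the function names are assumptions chosen for illustration. It estimates a unigram model by relative frequency and checks that this estimate gives the toy corpus a higher log-likelihood than a uniform model over the same words.

```python
import math
from collections import Counter


def relative_frequencies(corpus):
    """Estimate a unigram model: each word's count divided by the corpus size."""
    counts = Counter(corpus)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def log_likelihood(corpus, model):
    """Log-probability of the corpus under a unigram model (natural log)."""
    return sum(math.log(model[word]) for word in corpus)


# Toy corpus (an assumption for illustration only).
corpus = "the cat saw the dog".split()

# Relative-frequency estimate, which is the maximum-likelihood estimate
# for an unrestricted unigram model.
mle = relative_frequencies(corpus)

# A competing model: uniform over the observed word types.
uniform = {word: 1 / len(mle) for word in mle}

print(mle)
print(log_likelihood(corpus, mle), log_likelihood(corpus, uniform))
```

The uniform model is only one possible competitor. The sketch compares two models on one corpus and does not prove the general statement.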

Syllabus

April 16, 2007. Slot 1

- Course description
- Statistical NLP at First Glance

April 23, 2007. Slot 2

- Estimation Theory at First Glance

April 30, 2007. Slot 3

- Maximum-Likelihood Estimation at First Glance

May 7, 2007. Slot 4

- Relative entropy
- MLE and relative entropy
- The Information Inequality of Information Theory

May 14, 2007. Slot 5

- From context-free grammars to probabilistic context-free grammars
- From rule probabilities to tree probabilities

May 21, 2007. Slot 6

- Disambiguation with PCFGs
- Treebank Training

May 28, 2007. Pfingstmontag

June 4, 2007. Slot 7

- Treebank training is maximum-likelihood estimation

June 11, 2007. Slot 8

- Unsupervised training of PCFGs (iterated treebank training)

June 18, 2007. Slot 9

- Symbolic and statistical analysers
- Input and procedure of the EM algorithm

June 25, 2007. Slot 10

- Output of the EM algorithm
- Why EM?

July 2, 2007. Slot 11

- Statistical Machine Translation (SMT) and the EM algorithm
Readings: Kevin Knight (1999), A Statistical MT Tutorial Workbook. JHU Summer Workshop.

July 9, 2007. Slot 12

- Language Modeling
Readings: Chapter 4 of Jurafsky and Martin (to appear), Speech and Language Processing.

July 16, 2007. Slot 13

- Evaluation (parsing, tagging, SMT, language modeling, and in general)

July 23, 2007. Slot 14

- Wrap-up
- Evaluation of the lecturer

Grading

10% - participation (active/passive)
30% - first written exam
60% - second written exam
Note that you also have to solve at least 50% of the assignments

Script

Thanks to Andreas Dörr and Mateusz Dworaczek, a script (a set of lecture notes) is available for this course, in German only.