Forecast, Detlef Prescher

"Chance favors the prepared mind." (Louis Pasteur)

What does Computational Linguistics have to do with weather forecasts?!

Satellite pictures of Europe provided by METEO FRANCE and Institut für Meteorologie, FU Berlin:

(today)
(tomorrow)

The picture on the left shows the current weather, whereas the picture on the right shows a forecast of tomorrow's weather. Meteorologists produce weather forecasts on the basis of Probability Theory and Statistics. They use a probability model to determine the most probable outcome among the different alternatives for tomorrow's weather, and they use Statistics to infer the required probability model from a large corpus of empirically observed weather data. In our experience, meteorologists are quite successful in predicting tomorrow's weather with this method.

A similar situation arises in Computational Linguistics. Most sentences in natural language are ambiguous, i.e., they have more than one possible reading / analysis (or outcome, in the weather terminology). For example, the famous sentence "the man saw the woman with the telescope" has at least two readings: 'saw with the telescope' versus 'woman with the telescope'. The situation gets even worse when using a formal grammar mimicking natural language. (One job, maybe even the job, of computational linguists is to create such grammars...) Then sentences typically have millions of readings! The crucial problem is that someone has to select the correct one (or create a disambiguator that is able to do so).

Although some computational linguists still believe that Probability Theory and Statistics are not appropriate for their discipline, Probability Theory is the theory of uncertainty / ambiguity. Thus this theory offers the best chance of resolving the natural-language disambiguation problem presented above: the most probable among the different alternatives is taken to be the reading of a given sentence. As in weather forecasting, Statistics helps to infer the necessary probabilities from a large corpus of linguistic data.
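The idea of choosing the most probable reading can be illustrated with a very small sketch. It assumes a hypothetical hand-annotated corpus in which each entry records the reading chosen for a "saw ... with the telescope" sentence. The counts are invented for illustration, and the function names are not taken from any existing system. The probabilities are plain relative frequencies, which is the simplest possible statistical estimate and only a stand-in for a real probabilistic grammar model.

```python
from collections import Counter

# Hypothetical toy corpus (invented for illustration): each entry is the
# reading an annotator chose for one occurrence of the ambiguous sentence.
annotated_readings = [
    "saw with the telescope",
    "saw with the telescope",
    "woman with the telescope",
    "saw with the telescope",
    "woman with the telescope",
]


def estimate_probabilities(observations):
    """Relative-frequency estimate: count(reading) / number of observations."""
    counts = Counter(observations)
    total = sum(counts.values())
    return {reading: n / total for reading, n in counts.items()}


def most_probable_reading(candidates, probabilities):
    """Return the candidate reading with the highest estimated probability.

    Readings never seen in the corpus get probability 0.0 (an assumption of
    this sketch; real models would smooth unseen events).
    """
    return max(candidates, key=lambda reading: probabilities.get(reading, 0.0))


probabilities = estimate_probabilities(annotated_readings)
candidates = ["saw with the telescope", "woman with the telescope"]
best = most_probable_reading(candidates, probabilities)
print(best, probabilities[best])
```

With these made-up counts the disambiguator prefers 'saw with the telescope', simply because that reading was observed more often. A realistic model would assign probabilities to whole analyses produced by a grammar rather than to a fixed list of labels.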

Last updated: February 2007.