Discovery and Translation of Collocations from English to German

Adam Goforth
EECS 595
Fall 2005

Project Overview

What's a collocation?

Collocations are defined as "recurrent combinations of words that co-occur more often than expected by chance and that correspond to arbitrary word usages" [1]. Basically, they're groups of words that often go together and usually mean something different together than they do apart. Examples are "The United Nations" and "Natural Language Processing".

What's the project about?

I implemented two pieces of software concerning Computational Linguistics and collocations. Both are reimplementations of previous work. The first is a reimplementation of Frank Smadja's Xtract program, described in the paper "Retrieving Collocations from Text: Xtract" [1]. It takes a corpus and extracts collocations from it based on statistical information about the proximity of words to each other. The second is a reimplementation of Champollion, a program described by Smadja, McKeown and Hatzivassiloglou in "Translating Collocations for Bilingual Lexicons: A Statistical Approach" [2]. Champollion takes collocations and a sentence-aligned bilingual corpus and finds translations of the collocations using statistical methods.
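
To make "statistical information about the proximity of words" more concrete, here is a minimal sketch of the kind of counting involved. For one target word, it records how often every other word appears at each relative offset, using a window of five words on either side as in Xtract [1]. This is not the JXtract source: the class name PositionCounts, the method count, and the whitespace tokenization are assumptions made for illustration, and the later statistical filtering that Xtract applies to these counts is left out.

```java
import java.util.HashMap;
import java.util.Map;

final class PositionCounts {
    // Number of words considered on each side of the target word.
    static final int WINDOW = 5;

    /**
     * For every word seen near the target, returns an array of counts.
     * Index (offset + WINDOW) holds how often the word occurred at that
     * offset from the target, for offsets in -WINDOW..+WINDOW.
     */
    static Map<String, int[]> count(String target, String[] sentences) {
        Map<String, int[]> counts = new HashMap<>();
        for (String sentence : sentences) {
            // Naive tokenization: lowercase and split on whitespace (assumption).
            String[] tokens = sentence.toLowerCase().trim().split("\\s+");
            for (int i = 0; i < tokens.length; i++) {
                if (!tokens[i].equals(target)) {
                    continue;
                }
                int from = Math.max(0, i - WINDOW);
                int to = Math.min(tokens.length - 1, i + WINDOW);
                for (int j = from; j <= to; j++) {
                    if (j == i) {
                        continue; // skip the target word itself
                    }
                    int[] histogram =
                        counts.computeIfAbsent(tokens[j], k -> new int[2 * WINDOW + 1]);
                    histogram[j - i + WINDOW]++;
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] corpus = {
            "most of us agree",
            "the most vulnerable regions",
            "most of us disagree"
        };
        Map<String, int[]> counts = count("most", corpus);
        // How often "of" directly follows "most" (offset +1).
        System.out.println(counts.get("of")[WINDOW + 1]);
    }
}
```

A word whose counts pile up at one offset (here "of" right after "most") is a candidate for a fixed slot in a collocation, while a word spread evenly over all offsets is not.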

Implementation Details

The implementations are written as command line tools in Java.

The sentence-aligned corpus used is the Europarl Corpus. One year's worth of proceedings (2000) from the European Parliament was taken in English and German and used as the source and target corpora. Each language's corpus contains approximately 4.1 million words in 143,000 sentences and is about 23 MB of text. The exact files used are included with the project files download as ep-00-en.txt and ep-00-de.txt.

JXtract

JavaDoc

JXtract accepts a word and a corpus as input, and it outputs collocations found in the corpus containing the word. First, I found the most frequently used words in the corpus. Then, I gave this list to JXtract to try to find collocations. It runs rather slowly (it is not optimized for speed at all) and it takes on the order of 15 seconds to a minute to find each collocation. The output format looks like this:

_ the european union _ most _ _ _ _ _
_ _ _ _ _ most of us _ _ _
_ _ _ _ the most vulnerable _ _ _ _
_ _ _ _ the report presented _ _ _ _
_ _ _ _ _ report on competition policy _ _
_ _ _ _ the most _ _ in the world

Each underscore stands for a variable word, that is, a position that can be filled by any word. The outside underscores can generally be cut off, and then you are (hopefully) left with collocations. Notice that some of these collocations have holes in the middle of them. For more extensive output from JXtract, see this file: collocations.txt
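
The trimming step described above can be sketched in a few lines. This is only an illustration of how the output format can be post-processed, not part of JXtract; the class name TrimPlaceholders and the method trim are hypothetical.

```java
import java.util.Arrays;

final class TrimPlaceholders {
    /** Drops leading and trailing "_" tokens; holes in the middle are kept. */
    static String trim(String line) {
        String[] tokens = line.trim().split("\\s+");
        int start = 0;
        int end = tokens.length;
        // Skip placeholders at the front.
        while (start < end && tokens[start].equals("_")) {
            start++;
        }
        // Skip placeholders at the back.
        while (end > start && tokens[end - 1].equals("_")) {
            end--;
        }
        return String.join(" ", Arrays.copyOfRange(tokens, start, end));
    }

    public static void main(String[] args) {
        // prints: report on competition policy
        System.out.println(trim("_ _ _ _ _ report on competition policy _ _"));
        // prints: the most _ _ in the world
        System.out.println(trim("_ _ _ _ the most _ _ in the world"));
    }
}
```

Holes inside the collocation are deliberately preserved, since they are part of the pattern that was found.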

Limitations

The original Xtract uses a part-of-speech tagger and a parser to both increase accuracy and create phrasal templates. JXtract is self-contained and does not have these features.

JChampollion

JavaDoc

JChampollion accepts a sentence-aligned, bilingual corpus and a collocation in the source text (such as those produced by Xtract) and produces a translation of the collocation in the target language. The original Champollion was written for English as the source text and French as the target text, and used the Hansards Corpus for evaluation. JChampollion uses the Europarl Corpus with English as the source language and German as the target language. There is a preprocessing step to index the corpus. The Lucene indexer is used by JChampollion for this purpose.
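
As a rough illustration of the statistics involved: the Champollion paper [2] scores a candidate translation by how strongly it co-occurs with the source collocation across aligned sentence pairs, using the Dice coefficient. The sketch below computes that score by scanning the sentence pairs directly. It is a simplified stand-in, not the JChampollion code: the class DiceScore, its methods, the whitespace-based matching, and the three made-up sentence pairs are all assumptions for illustration, and no Lucene index is involved.

```java
import java.util.Arrays;
import java.util.List;

final class DiceScore {
    /**
     * Dice coefficient between a source phrase and a target word:
     * 2 * (pairs containing both) / (pairs with the phrase + pairs with the word).
     * source.get(i) and target.get(i) must be translations of each other.
     */
    static double dice(String sourcePhrase, String targetWord,
                       List<String> source, List<String> target) {
        int inSource = 0;
        int inTarget = 0;
        int inBoth = 0;
        for (int i = 0; i < source.size(); i++) {
            boolean s = contains(source.get(i), sourcePhrase);
            boolean t = contains(target.get(i), targetWord);
            if (s) inSource++;
            if (t) inTarget++;
            if (s && t) inBoth++;
        }
        int total = inSource + inTarget;
        return total == 0 ? 0.0 : 2.0 * inBoth / total;
    }

    /** True if the phrase occurs in the sentence as whole, contiguous words. */
    static boolean contains(String sentence, String phrase) {
        String padded = " " + sentence.toLowerCase().trim().replaceAll("\\s+", " ") + " ";
        return padded.contains(" " + phrase.toLowerCase() + " ");
    }

    public static void main(String[] args) {
        // Tiny invented sample, not taken from Europarl.
        List<String> en = Arrays.asList(
            "the member states agree",
            "member states must act",
            "the vote is closed");
        List<String> de = Arrays.asList(
            "die mitgliedstaaten stimmen zu",
            "die mitgliedstaaten sollen handeln",
            "die abstimmung ist geschlossen");
        System.out.println(dice("member states", "mitgliedstaaten", en, de)); // 2*2 / (2+2)
        System.out.println(dice("member states", "die", en, de));             // 2*2 / (2+3)
    }
}
```

Even in this tiny sample, a very frequent word such as "die" still gets a high score, which hints at why frequent function words need special handling.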

A few samples of the output:

Source Language Collocation                          JChampollion Output
Madam President                                      frau präsidentin
member states                                        mitgliedstaaten
the committee on agriculture and rural development   landwirtschaft ländliche
report on competition policy                         wettbewerbspolitik

Limitations

JChampollion is as close to the original implementation of Champollion as could be achieved from reading the paper describing its algorithm. One detail is left vague there: the authors mention that they do not return closed-class words from the target language in their translations, because the high frequency of these words distorts the statistical correlation data for the rest of the corpus. However, they don't specify exactly which closed-class words they exclude. In JChampollion, most of the German articles and prepositions are excluded (with some morphological variants accounted for), but nothing else. Even so, the missing prepositions and articles greatly reduce the accuracy of the translations.
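
The exclusion of closed-class words amounts to a stop-word filter over the candidate target words. The sketch below shows the idea with a small, illustrative subset of German articles and prepositions, including a few contracted forms. The word list, the class name ClosedClassFilter and the method filter are assumptions for illustration; the actual list used by JChampollion is not reproduced here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class ClosedClassFilter {
    // Illustrative subset only (assumption), not JChampollion's real list.
    static final Set<String> EXCLUDED = new HashSet<>(Arrays.asList(
        // articles and their inflected forms
        "der", "die", "das", "den", "dem", "des",
        "ein", "eine", "einen", "einem", "einer", "eines",
        // prepositions, including contracted preposition + article forms
        "in", "im", "an", "am", "auf", "mit", "von", "vom", "zu", "zum", "zur"));

    /** Returns the candidates that are not on the exclusion list, in order. */
    static List<String> filter(List<String> candidates) {
        List<String> kept = new ArrayList<>();
        for (String word : candidates) {
            if (!EXCLUDED.contains(word.toLowerCase())) {
                kept.add(word);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> candidates = Arrays.asList("der", "bericht", "zur", "wettbewerbspolitik");
        // prints: [bericht, wettbewerbspolitik]
        System.out.println(filter(candidates));
    }
}
```

The example also shows the cost: once "der" and "zur" are dropped, the remaining words no longer form a grammatical German phrase.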

The index files for the corpus are rather large, about 50% larger than the corpus itself. Ideally the index would be kept in memory, but as the corpus size grows, this becomes impractical, so the index is read from the hard disk instead.
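
To give a feel for what such an index provides and why it grows with the corpus, here is a minimal in-memory inverted index that maps each word to the numbers of the sentences containing it. It is not Lucene and does not reflect Lucene's on-disk format or API; the class SentenceIndex and its methods are hypothetical names used only for this sketch.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

final class SentenceIndex {
    // word -> sorted set of sentence numbers in which the word occurs
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    /** Adds one sentence under the given sentence number. */
    void add(int sentenceId, String sentence) {
        for (String token : sentence.toLowerCase().trim().split("\\s+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(sentenceId);
            }
        }
    }

    /** Returns the sentence numbers that contain every one of the given words. */
    SortedSet<Integer> sentencesWithAll(String... words) {
        SortedSet<Integer> result = null;
        for (String word : words) {
            SortedSet<Integer> ids = postings.getOrDefault(
                word.toLowerCase(), Collections.<Integer>emptySortedSet());
            if (result == null) {
                result = new TreeSet<>(ids); // copy so the index is not modified
            } else {
                result.retainAll(ids);       // intersect with the next word's sentences
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        SentenceIndex index = new SentenceIndex();
        index.add(0, "the member states agree");
        index.add(1, "member states must act");
        index.add(2, "the vote is closed");
        // prints: [0, 1]
        System.out.println(index.sentencesWithAll("member", "states"));
    }
}
```

Every distinct word carries its own list of sentence numbers, so the structure grows along with the corpus, which is the trade-off behind keeping the real index on disk.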

Files

JChampollion

Download Binary Jar & Source Code

Usage: unzip the files to a directory, then run "java -jar jchamp.jar" for help information.
Note: the first time JChampollion is run with a new corpus, it must be run with the -index option, which will build a new index. This operation will take several minutes to complete. From then on it can be run without the -index option, unless the corpus changes.

JXtract

Download Binary Jar & Source Code

Usage: unzip the files to a directory, then run "java -jar jxtract.jar" for help information.

Poster

OpenDocument Format
PDF Format

References

[1] Smadja, F. 1993. Retrieving collocations from text: Xtract. Comput. Linguist. 19, 1 (Mar. 1993), 143-177.
[2] Smadja, F., McKeown, K. R., and Hatzivassiloglou, V. 1996. Translating collocations for bilingual lexicons: a statistical approach. Comput. Linguist. 22, 1 (Mar. 1996), 1-38.