Collocations are defined as "recurrent combinations of words that co-occur more often than expected by chance and that correspond to arbitrary word usages." [1] Informally, they are groups of words that often go together and usually mean something different together than they do apart. Examples are "The United Nations" and "Natural Language Processing".
I implemented two pieces of software dealing with collocations in Computational Linguistics. Both are reimplementations of previous work. The first is a reimplementation of Frank Smadja's Xtract program, described in the paper "Retrieving Collocations from Text: Xtract". [1] It takes a corpus and extracts collocations from it based on statistical information about the proximity of words to each other. The second is a reimplementation of Champollion, a program described by Smadja, McKeown and Hatzivassiloglou in "Translating Collocations for Bilingual Lexicons: A Statistical Approach". [2] Champollion takes collocations and a sentence-aligned bilingual corpus and finds translations of the collocations using statistical methods.
The implementations are written as command line tools in Java.
The sentence-aligned corpus used is the Europarl Corpus. One year's worth of proceedings (2000) from the European Parliament was taken in English and German and used as the source and target corpora. Each language's corpus contains approximately 4.1 million words in 143,000 sentences and is about 23MB of text. The exact files used are included with the project files download as ep-00-en.txt and ep-00-de.txt.
JXtract accepts a word and a corpus as input, and it outputs collocations found in the corpus containing the word. First, I found the most frequently used words in the corpus. Then, I gave this list to JXtract to try to find collocations. It runs rather slowly (it is not optimized for speed at all) and it takes on the order of 15 seconds to a minute to find each collocation. The output format looks like this:
_ the european union _ most _ _ _ _ _ _ _ _ _ _ most of us _ _ _ _ _ _ _ the most vulnerable _ _ _ _ _ _ _ _ the report presented _ _ _ _ _ _ _ _ _ report on competition policy _ _ _ _ _ _ the most _ _ in the world
Each underscore stands for a variable word. The outer underscores can generally be cut off, and what is left is (hopefully) a collocation. Notice that some of these collocations have holes in the middle of them. For more extensive output from JXtract, see this file: collocations.txt
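Cutting off the outer underscores is easy to automate. The following is a minimal sketch, not part of JXtract itself; the class name `PlaceholderTrimmer` is hypothetical, and it assumes one space-separated pattern per line.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class PlaceholderTrimmer {
    /** Drops leading and trailing "_" tokens but keeps any "_" between words. */
    static String trim(String line) {
        List<String> tokens = new ArrayList<>(Arrays.asList(line.trim().split("\\s+")));
        // Remove placeholders from the front.
        while (!tokens.isEmpty() && tokens.get(0).equals("_")) {
            tokens.remove(0);
        }
        // Remove placeholders from the back.
        while (!tokens.isEmpty() && tokens.get(tokens.size() - 1).equals("_")) {
            tokens.remove(tokens.size() - 1);
        }
        return String.join(" ", tokens);
    }

    public static void main(String[] args) {
        System.out.println(trim("_ the european union _"));
        System.out.println(trim("_ _ the most _ _ in the world"));
    }
}
```

Holes in the middle of a pattern are left alone on purpose, since they are part of the collocation.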
The original Xtract uses a part-of-speech tagger and a parser both to increase accuracy and to create phrasal templates. JXtract is self-contained and does not have these features.
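To give an idea of the kind of proximity statistics involved, here is a minimal sketch that counts, for a given word, how often every other word appears at each relative position around it. It is an illustration only, not the code in jxtract.jar. The class name `PositionHistogram`, the window of five words on either side, and the pre-tokenized input are all assumptions. The `spread` method is simply the variance of the position counts; the Xtract paper uses a comparable quantity to tell words that sit at a fixed distance from the target from words that merely appear somewhere nearby.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class PositionHistogram {
    static final int WINDOW = 5; // assumed window: five words on either side

    /**
     * For every word seen within WINDOW tokens of target, counts how often it
     * occurs at each relative offset. Slots 0..4 hold offsets -5..-1 and
     * slots 5..9 hold offsets +1..+5.
     */
    static Map<String, int[]> histogram(List<String[]> sentences, String target) {
        Map<String, int[]> hist = new HashMap<>();
        for (String[] tokens : sentences) {
            for (int i = 0; i < tokens.length; i++) {
                if (!tokens[i].equals(target)) continue;
                for (int offset = -WINDOW; offset <= WINDOW; offset++) {
                    int j = i + offset;
                    if (offset == 0 || j < 0 || j >= tokens.length) continue;
                    int slot = offset < 0 ? offset + WINDOW : offset + WINDOW - 1;
                    hist.computeIfAbsent(tokens[j], k -> new int[2 * WINDOW])[slot]++;
                }
            }
        }
        return hist;
    }

    /** Variance of the position counts; a peaked histogram gives a high value. */
    static double spread(int[] counts) {
        double mean = 0;
        for (int c : counts) mean += c;
        mean /= counts.length;
        double sum = 0;
        for (int c : counts) sum += (c - mean) * (c - mean);
        return sum / counts.length;
    }

    public static void main(String[] args) {
        List<String[]> corpus = Arrays.asList(
            "the european union is large".split(" "),
            "a european union summit".split(" "));
        int[] counts = histogram(corpus, "european").get("union");
        System.out.println(Arrays.toString(counts) + " spread=" + spread(counts));
    }
}
```

A word such as "union" that almost always sits directly after "european" produces one tall bar in its histogram and therefore a high spread, which is what makes it a collocation candidate.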
JChampollion accepts a sentence-aligned, bilingual corpus and a collocation in the source text (such as those produced by Xtract) and produces a translation of the collocation in the target language. The original Champollion was written for English as the source text and French as the target text, and used the Hansards Corpus for evaluation. JChampollion uses the Europarl Corpus with English as the source language and German as the target language. There is a preprocessing step to index the corpus. The Lucene indexer is used by JChampollion for this purpose.
A few samples of the output:
Source Language Collocation | JChampollion Output |
Madam President | frau präsidentin |
member states | mitgliedstaaten |
the committee on agriculture and rural development | landwirtschaft ländliche |
report on competition policy | wettbewerbspolitik |
JChampollion is as close to the original implementation of Champollion as could be achieved from reading the paper describing its algorithm. The only detail left vague concerns closed-class words: the authors mention that they do not return closed-class words from the target language in their translations, because the high frequency of these words distorts the statistical correlation data for the rest of the corpus. However, they don't specify exactly which closed-class words they exclude. In JChampollion, most of the German articles and prepositions are excluded (with some morphological differences accounted for), but nothing else. As a consequence, the translations lack prepositions and articles, which greatly reduces their accuracy.
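The correlation measure used in the Champollion paper is the Dice coefficient, 2·f(X,Y) / (f(X) + f(Y)), where the frequencies are counted over aligned sentence pairs. The sketch below shows only the scoring of single target words against a source collocation, together with a closed-class filter; the full algorithm goes on to combine words into larger groups, which is not shown. The class name `DiceScorer`, the tiny stop list and the substring match on the source side are simplifications for illustration, not what JChampollion actually ships.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class DiceScorer {
    // Illustrative stop list only; the list JChampollion really uses is not reproduced here.
    static final Set<String> CLOSED_CLASS = new HashSet<>(Arrays.asList(
        "der", "die", "das", "den", "dem", "des", "ein", "eine", "in", "mit", "von", "zu"));

    /**
     * Scores each target-language word against a source collocation.
     * source.get(i) and target.get(i) are assumed to be aligned sentences.
     */
    static Map<String, Double> score(List<String> source, List<String> target, String collocation) {
        String needle = collocation.toLowerCase();
        int sourceCount = 0;                                  // f(X)
        Map<String, Integer> targetCount = new HashMap<>();   // f(Y)
        Map<String, Integer> jointCount = new HashMap<>();    // f(X,Y)
        for (int i = 0; i < source.size(); i++) {
            // Crude substring match; a real implementation would match on tokens.
            boolean hit = source.get(i).toLowerCase().contains(needle);
            if (hit) sourceCount++;
            // Each word counts at most once per sentence.
            Set<String> words = new HashSet<>(Arrays.asList(target.get(i).toLowerCase().split("\\s+")));
            for (String w : words) {
                if (w.isEmpty() || CLOSED_CLASS.contains(w)) continue;
                targetCount.merge(w, 1, Integer::sum);
                if (hit) jointCount.merge(w, 1, Integer::sum);
            }
        }
        Map<String, Double> dice = new HashMap<>();
        for (Map.Entry<String, Integer> e : jointCount.entrySet()) {
            dice.put(e.getKey(), 2.0 * e.getValue() / (sourceCount + targetCount.get(e.getKey())));
        }
        return dice;
    }

    public static void main(String[] args) {
        List<String> en = Arrays.asList("member states must act", "the member states agreed", "the report was adopted");
        List<String> de = Arrays.asList("die mitgliedstaaten handeln", "die mitgliedstaaten stimmten zu", "der bericht wurde angenommen");
        System.out.println(score(en, de, "member states"));
    }
}
```

Because articles and prepositions never enter the score table, they can never appear in a translation, which is exactly the effect visible in the sample output.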
The index files for the corpus are rather large, about 50% larger than the corpus itself. Ideally the index would be kept in memory, but this becomes impractical as the corpus grows, so the index is read from the hard disk instead.
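For readers unfamiliar with what such an index does, here is a toy in-memory version: a map from each word to the numbers of the sentences that contain it. It is a stand-in written with plain Java collections to show the idea, not the Lucene index that JChampollion builds, and the class name `SentenceIndex` is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class SentenceIndex {
    private final Map<String, List<Integer>> postings = new HashMap<>();

    /** Records which sentence numbers each lowercase word occurs in. */
    SentenceIndex(List<String> sentences) {
        for (int id = 0; id < sentences.size(); id++) {
            Set<String> seen = new HashSet<>(Arrays.asList(sentences.get(id).toLowerCase().split("\\s+")));
            for (String word : seen) {
                if (!word.isEmpty()) {
                    postings.computeIfAbsent(word, k -> new ArrayList<>()).add(id);
                }
            }
        }
    }

    /** Sentence numbers containing every word of the query, in ascending order. */
    List<Integer> lookup(String query) {
        List<Integer> result = null;
        for (String word : query.toLowerCase().split("\\s+")) {
            List<Integer> ids = postings.getOrDefault(word, Collections.<Integer>emptyList());
            if (result == null) {
                result = new ArrayList<>(ids);
            } else {
                result.retainAll(ids); // keep only sentences that contain all words so far
            }
        }
        return result == null ? Collections.<Integer>emptyList() : result;
    }

    public static void main(String[] args) {
        SentenceIndex index = new SentenceIndex(Arrays.asList(
            "Madam President thank you", "the member states agreed", "member of the committee"));
        System.out.println(index.lookup("member states"));
    }
}
```

Note that this lookup only checks that all query words occur in a sentence, not that they are adjacent. Holding such a map for the whole corpus in memory is what becomes impractical as the corpus grows.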
Download Binary Jar & Source Code
Usage: unzip the files to a directory, then run "java -jar jchamp.jar" for help information.
Note: the first time JChampollion is run with a new corpus, it must be run with the -index option, which will build a new index. This operation will take several minutes to complete. From then on it can be run without the -index option, unless the corpus changes.
Download Binary Jar & Source Code
Usage: unzip the files to a directory, then run "java -jar jxtract.jar" for help information.