jxtract
Class BigramCollection

java.lang.Object
  extended by jxtract.BigramCollection

public class BigramCollection
extends java.lang.Object

BigramCollection is the data structure that holds information about a word and all of the bigrams that occur within a distance of 5 words of it in any sentence in the corpus.

Author:
Adam Goforth

Constructor Summary
BigramCollection()
          Constructor
 
Method Summary
 boolean addSentence(java.lang.String w, java.lang.String s, boolean includeClosedClass)
          Takes a sentence and adds all bigrams in the "phrase" (+/- 5 words) to the collection.
 double getFbar()
          Returns fbar, the average frequency of all bigrams for this word.
 double getSigma()
          Returns sigma, the standard deviation of the frequency of all bigrams for this word.
 java.util.Vector getStageOneBigrams(double k0, double k1, double U0)
          Returns a Vector with the bigrams left after Stage 1 of the algorithm and their characteristics.
 java.lang.String getTable2()
          Gets a string representing the contents of the BigramCollection similar to the format presented in Table 2 of the Smadja paper.
 java.lang.String getTable4()
          Presents the results of Step 1.3 in the Smadja algorithm.
 void stage2(double T)
           
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

BigramCollection

public BigramCollection()
Constructor

Parameters:
w_ - The word for which bigrams will be collected.
Method Detail

addSentence

public boolean addSentence(java.lang.String w,
                           java.lang.String s,
                           boolean includeClosedClass)
                    throws java.lang.Exception
Takes a sentence and adds all bigrams in the "phrase" (+/- 5 words) to the collection. Note: The p value of the word wi follows the examples in Table 2 of the Smadja paper, not the examples in the Step 1.2 description. The two sets of examples seem to contradict each other, and the convention of "p+1 means wi is one word to the right of w" seems more intuitive.

Parameters:
s - The sentence to be added.
Returns:
true if insert is successful, false otherwise.
Throws:
java.lang.Exception - If given sentence does not contain the word this BigramCollection tracks.
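
The window and the sign convention for p described above can be illustrated with a minimal sketch. This is not the jxtract implementation: the class WindowSketch, the method neighbors, whitespace tokenization, and the "wi@p" output format are all assumptions made for illustration, and closed-class filtering is left out.

```java
import java.util.ArrayList;
import java.util.List;

public class WindowSketch {
    /**
     * Hypothetical helper: for every occurrence of w in the sentence, lists each
     * other token within +/- 5 positions as "wi@p", where a positive p means wi
     * is to the right of w (p = 1 is the word immediately to the right).
     */
    public static List<String> neighbors(String w, String sentence) {
        // Assumption: tokens are separated by whitespace.
        String[] tokens = sentence.trim().split("\\s+");
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (!tokens[i].equals(w)) {
                continue;
            }
            int lo = Math.max(0, i - 5);
            int hi = Math.min(tokens.length - 1, i + 5);
            for (int j = lo; j <= hi; j++) {
                if (j != i) {
                    out.add(tokens[j] + "@" + (j - i));
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [the@-2, stock@-1, fell@1, sharply@2, today@3]
        System.out.println(neighbors("market", "the stock market fell sharply today"));
    }
}
```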

getFbar

public double getFbar()
Returns fbar, the average frequency of all bigrams for this word.

Returns:
fbar, the average frequency

getSigma

public double getSigma()
Returns sigma, the standard deviation of the frequency of all bigrams for this word.

Returns:
sigma, the standard deviation
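
As a rough illustration of the two statistics returned by getFbar and getSigma, the sketch below computes a mean and a standard deviation over an array of bigram frequencies. The class FrequencyStats and its methods are hypothetical, and the use of the population form (dividing by n rather than n - 1) is an assumption; the documentation does not say which form the class uses.

```java
public class FrequencyStats {
    /** Average of the bigram frequencies (fbar). */
    public static double mean(int[] freqs) {
        double sum = 0.0;
        for (int f : freqs) {
            sum += f;
        }
        return sum / freqs.length;
    }

    /** Standard deviation of the frequencies (sigma), population form by assumption. */
    public static double sigma(int[] freqs) {
        double fbar = mean(freqs);
        double sumSq = 0.0;
        for (int f : freqs) {
            sumSq += (f - fbar) * (f - fbar);
        }
        return Math.sqrt(sumSq / freqs.length);
    }

    public static void main(String[] args) {
        // Made-up frequencies, one entry per bigram of the tracked word.
        int[] freqs = {2, 4, 4, 4, 5, 5, 7, 9};
        System.out.println("fbar  = " + mean(freqs));
        System.out.println("sigma = " + sigma(freqs));
    }
}
```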

getTable2

public java.lang.String getTable2()
Gets a string representing the contents of the BigramCollection similar to the format presented in Table 2 of the Smadja paper.

Returns:
The String containing the table-format contents of the bigrams.

getTable4

public java.lang.String getTable4()
Presents the results of Step 1.3 in the Smadja algorithm. The output is similar to Table 4 in the paper.

Returns:
A string containing the table.

getStageOneBigrams

public java.util.Vector getStageOneBigrams(double k0,
                                           double k1,
                                           double U0)
Returns a Vector with the bigrams left after Stage 1 of the algorithm and their characteristics.

Returns:
A Vector containing all of the S1Bigrams
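
The parameters k0, k1 and U0 are not described here. In the Smadja paper they are the Stage 1 thresholds: a bigram is kept if its strength (freq - fbar) / sigma is at least k0 and its spread U is at least U0, and a position j is reported as a peak if its count is at least pbar + k1 * sqrt(U). The sketch below applies those conditions to a single bigram. It is written from the paper's definitions, not from the jxtract source; the class StageOneSketch, its methods, and the layout of the position histogram are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

public class StageOneSketch {
    /** Mean count per position (pbar) for a 10-slot position histogram. */
    public static double mean(int[] p) {
        double sum = 0.0;
        for (int v : p) {
            sum += v;
        }
        return sum / p.length;
    }

    /** Spread U: variance of the counts around pbar. */
    public static double spread(int[] p) {
        double pbar = mean(p);
        double sumSq = 0.0;
        for (int v : p) {
            sumSq += (v - pbar) * (v - pbar);
        }
        return sumSq / p.length;
    }

    /** Strength: how many standard deviations the bigram frequency lies above fbar. */
    public static double strength(int[] p, double fbar, double sigma) {
        int freq = 0;
        for (int v : p) {
            freq += v;
        }
        return (freq - fbar) / sigma;
    }

    /** True if the bigram meets both the strength (k0) and spread (U0) thresholds. */
    public static boolean passes(int[] p, double fbar, double sigma, double k0, double u0) {
        return strength(p, fbar, sigma) >= k0 && spread(p) >= u0;
    }

    /** Indices j whose count is at least pbar + k1 * sqrt(U). */
    public static List<Integer> peaks(int[] p, double k1) {
        double threshold = mean(p) + k1 * Math.sqrt(spread(p));
        List<Integer> out = new ArrayList<>();
        for (int j = 0; j < p.length; j++) {
            if (p[j] >= threshold) {
                out.add(j);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed layout: indices 0..4 hold p = -5..-1, indices 5..9 hold p = +1..+5.
        int[] p = {0, 0, 0, 0, 0, 10, 0, 0, 0, 0};
        System.out.println("spread = " + spread(p));
        System.out.println("passes = " + passes(p, 4.0, 2.0, 1.0, 5.0));
        System.out.println("peaks  = " + peaks(p, 1.0));
    }
}
```

In this made-up histogram all ten occurrences fall at index 5 (p = +1 under the assumed layout), so the spread is high and that single position is the only peak.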

stage2

public void stage2(double T)