What are "Related Words"?
If two words appear in similar contexts in a large corpus, they are regarded as distributionally
similar. We say that two such words are related to each other.
What corpus is used to generate "Related Words"?
The AP News corpus compiled by NIST is used for the calculation. The corpus contains AP news from 1988 to 1990,
about 240,000 news stories in total.
What algorithm is used in this program?
The program generates a contextual word list for each word in the corpus, consisting of the words that appear
in a small window around the original word. The TF*IDF value of each word in the list is calculated
and attached as its weight. Treating the word lists as vectors, cosine similarity is used to measure how close
each word list is to all other word lists (about 900,000). Words whose word lists are close are considered related words.
What language is used to implement the algorithm?
The whole program is written in C.
Is this program easy to use?
Examples: