Automatic English Sentence Segmenter


More and more NLP tasks involve processing Web documents. All of these tasks require a reliable and effective HTML/text sentence segmenter. Many sentence segmentation tools have been written in Perl using regular expressions and pattern matching. They work only with simple plain text, not with complicated text documents or web pages in HTML format. A number of problems related to Web-page parsing need to be addressed, including:
  1. A sentence may end with '?' or '!' as well as a dot '.'. A blank or carriage return can also be a sentence boundary in special cases.
  2. A phrase or even a single word should be counted as a sentence if it is not related to its context.
  3. A dot '.' is sometimes not a sentence boundary, for example a dot in a URL or an email address.
  4. Non-contextual content in a web page should be excluded. This includes JavaScript code, images, HTML comments, and other HTML tags.
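
The four points above can be illustrated with a minimal Python sketch. This is not the segmenter's actual implementation, only an illustration under stated assumptions: the names `TextExtractor`, `html_to_text`, and `segment` are hypothetical, the set of tags treated as hard boundaries is a guess, and the rule "split only when whitespace follows the punctuation" is a simple stand-in for whatever logic the real tool uses.

```python
import re
from html.parser import HTMLParser

# Tags whose content is code rather than readable text (point 4).
SKIP_TAGS = {"script", "style"}
# Tags treated as hard boundaries, so that titles, headings and list
# items come out as standalone sentences (point 2). Assumed set.
BLOCK_TAGS = {"title", "p", "div", "br", "li", "tr", "td",
              "h1", "h2", "h3", "h4", "h5", "h6"}


class TextExtractor(HTMLParser):
    """Collects readable text and marks block boundaries with newlines."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")
        # Other tags (for example <img>) contribute no text.

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if self.skip_depth == 0:
            # Line breaks in the HTML source are ordinary whitespace.
            self.parts.append(re.sub(r"\s+", " ", data))

    # HTML comments are dropped: the default handle_comment does nothing.


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


# Split after '.', '?' or '!' (point 1) only when whitespace follows,
# which leaves dots inside URLs and email addresses alone (point 3).
SENTENCE_END = re.compile(r"(?<=[.?!])\s+")


def segment(html):
    sentences = []
    for line in html_to_text(html).split("\n"):
        for piece in SENTENCE_END.split(line):
            piece = piece.strip()
            if piece:
                sentences.append(piece)
    return sentences
```

A sketch this small has obvious limits: it would wrongly split after abbreviations such as "Dr." and it does not handle plain-text input where a line break alone marks a boundary.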
This sentence segmenter was originally designed for the AnswerBus Question Answering System. It is now also used in the Seven Tones Search Engine and several other online NLP applications. Feel free to download and use the local command-line versions for different operating systems. The output of these versions may differ slightly. Check back for updated versions.