More and more NLP tasks involve processing Web documents. All of
these tasks require a reliable and effective HTML/TEXT sentence segmenter.
Many sentence segmentation tools have been written in Perl using regular
expressions and pattern matching. They only work with simple plain text,
not with complicated text documents or web pages in HTML format. A
number of problems related to Web-page parsing need to be addressed, including:
A sentence may end with '?' or '!' as well as a dot '.'. A blank or
carriage return can also be a sentence boundary in special cases.
A phrase or even a single word should be counted as a sentence if it
is not related to its surrounding context.
A dot '.' is sometimes not a sentence boundary, for example a dot in a
URL or an email address.
Non-contextual content in a web page should be excluded. This includes