When a page is stripped of its HTML, the residual text still contains menu text, headers, footers and pieces of advertisements. Our Article Cleaner removes these pieces reliably so that the text can be stored or sent to the Summariser. Because it is self-adapting, it works best when it gets additional pages from the same website for comparison (e.g. via RSS feeds), but it has a fall-back mechanism for when that is not possible.
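The cross-page comparison idea can be sketched as follows. This is a minimal illustration, not the Article Cleaner itself: text lines that recur verbatim across several pages of the same site are treated as template boilerplate and dropped; the function name and threshold are hypothetical.

```python
from collections import Counter

def clean_article(page, reference_pages, min_overlap=2):
    """Keep only the lines of `page` that appear in fewer than
    `min_overlap` of the reference pages from the same site.
    Lines shared by the references are assumed to be boilerplate
    (menus, footers, ad slots)."""
    counts = Counter()
    for ref in reference_pages:
        # count each line at most once per reference page
        for line in set(ref.splitlines()):
            counts[line.strip()] += 1
    kept = [ln for ln in page.splitlines()
            if counts[ln.strip()] < min_overlap]
    return "\n".join(kept)
```

A menu line such as "Home | News | Contact" that appears on every page of a site is removed, while the article body, which is unique to the page, survives.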
A component that analyzes a collection of documents and extracts their typical structure. If the collection contains several different document types, it produces a list of those types together with their typical structure. It is agnostic about which elements are used to structure documents and works on plain-text files as well as XML and HTML of various kinds. A prototype is under active development and already shows promising results.
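One way to picture "typical structure" for the XML case is to collect each document's element paths and keep those shared by most of the collection. This is a hedged sketch under that assumption, not the prototype's actual algorithm; the names and the threshold are invented for illustration.

```python
from collections import Counter
import xml.etree.ElementTree as ET

def element_paths(elem, prefix=""):
    """Yield the slash-separated path of every element in the tree."""
    path = f"{prefix}/{elem.tag}"
    yield path
    for child in elem:
        yield from element_paths(child, path)

def typical_structure(xml_docs, threshold=0.8):
    """Return the element paths present in at least `threshold`
    of the documents, i.e. the collection's typical skeleton."""
    counts = Counter()
    for doc in xml_docs:
        counts.update(set(element_paths(ET.fromstring(doc))))
    n = len(xml_docs)
    return sorted(p for p, c in counts.items() if c / n >= threshold)
```

Paths that only occur in a minority of documents (an occasional `aside`, say) fall below the threshold and are excluded from the typical structure.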
We can reliably determine the language in which a text is written. In contrast to other engines, we also have an explicit category “unknown”, so that texts are not forced into a language; this improves the quality of subsequent processing steps. Recognition rates are near-perfect even for strings as short as 100 characters. With longer texts, the engine can also be trained to distinguish domains within the same language. The implementation is very efficient and allows real-time recognition even on a smartphone.
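A common approach to this kind of language identification is character n-gram profiles with a confidence cut-off; the sketch below assumes that approach and is not the production engine. The key point it illustrates is the “unknown” category: a text whose best score falls below a threshold is not forced into any language.

```python
from collections import Counter

def trigram_profile(text):
    """Counter of all character trigrams of the lowercased text."""
    text = text.lower()
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def build_profile(training_text, k=300):
    """A language profile: the set of its k most common trigrams."""
    return {g for g, _ in trigram_profile(training_text).most_common(k)}

def detect_language(text, profiles, min_score=0.6):
    """Return the best-matching language, or "unknown" if no
    profile covers at least `min_score` of the text's trigrams."""
    grams = trigram_profile(text)
    total = sum(grams.values()) or 1
    best_lang, best_score = "unknown", min_score
    for lang, top_grams in profiles.items():
        score = sum(c for g, c in grams.items() if g in top_grams) / total
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang
```

Because the profiles are just small sets of trigrams, lookup is cheap enough for real-time use on constrained hardware, which is in the spirit of the efficiency claim above.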
Breaks a text into tokens quickly and with high quality. Removes sentence punctuation from words, but keeps addresses, URLs and time expressions intact, and understands measurement units. All of this is optimized for maximum quality and speed at the same time, and has been extensively tested across many different scenarios.
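The "keep special expressions intact" behaviour can be sketched as a pattern-priority tokenizer: URL, time, and measurement patterns are tried before the generic word rule, so those expressions survive as single tokens while sentence punctuation is split off ordinary words. This is an illustrative simplification, not the shipped implementation.

```python
import re

TOKEN_RE = re.compile(r"""
    https?://\S+                                  # URLs stay intact
  | \d{1,2}:\d{2}                                 # times like 14:30
  | \d+(?:\.\d+)?\s?(?:km|kg|cm|mm|ms|m|s)\b      # measurements, e.g. 5 km
  | \w+(?:[-']\w+)*                               # ordinary words
  | [.,;:!?]                                      # sentence punctuation
""", re.VERBOSE)

def tokenize(text):
    """Return the tokens of `text`; alternation order gives the
    special patterns priority over the generic word rule."""
    return TOKEN_RE.findall(text)
```

Note that a real tokenizer would need many more rules (addresses, trailing punctuation after URLs, locale-specific number formats); the point here is only the priority ordering of the patterns.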
Morphological decomposition that works across languages and can also be applied to “less” complex languages such as German. It also produces morphologically based PoS tags for morphologically rich languages.
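For the German case, one classic sub-problem is compound splitting. The toy greedy splitter below is an assumption-laden sketch, not the actual component: it recursively splits a word into known lexicon parts and allows the German linking element "s" between them.

```python
def split_compound(word, lexicon, min_part=3):
    """Return a list of lexicon parts making up `word`,
    or None if no complete split is found."""
    word = word.lower()
    if word in lexicon:
        return [word]
    # try the longest head first
    for i in range(len(word) - min_part, min_part - 1, -1):
        head, rest = word[:i], word[i:]
        if head not in lexicon:
            continue
        # try the remainder as-is, and with a linking "s" removed
        candidates = [rest]
        if rest.startswith("s"):
            candidates.append(rest[1:])
        for tail in candidates:
            sub = split_compound(tail, lexicon, min_part)
            if sub:
                return [head] + sub
    return None
```

With a tiny lexicon, "Staubsauger" splits cleanly, and "Arbeitsmarkt" shows the linking-"s" case (Arbeit + s + Markt).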
Tags each word of a text with a PoS tag so that further analyses can use this information as features. Unsupervised training also produces categories that differ from human textbook categories, but within our processing pipeline this is actually a strength, as it improves overall quality.
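How unsupervised training can yield non-textbook word classes is easiest to see in a toy distributional setting. The sketch below is a drastic simplification of distributional clustering, not the production tagger: words that occur with exactly the same set of left neighbours in the corpus land in the same induced class, whatever a grammar book would call them.

```python
from collections import defaultdict

def induce_classes(sentences):
    """Group words by the set of their left neighbours.
    Words with identical left-context sets form one induced class."""
    left = defaultdict(set)
    for sent in sentences:
        tokens = ["<s>"] + sent.lower().split()
        for prev, word in zip(tokens, tokens[1:]):
            left[word].add(prev)
    classes = defaultdict(set)
    for word, ctx in left.items():
        classes[frozenset(ctx)].add(word)
    return [sorted(c) for c in classes.values()]
```

On a toy corpus the nouns "cat"/"dog" and the verbs "sat"/"ran" each fall into one class purely from their contexts; a real system would use richer context statistics and a proper clustering step, but the induced classes serve the same role as PoS features downstream.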