A module that recognizes addresses and tags their parts with the correct semantics (street, number, city, zip code, etc.). Covering the whole world adds complexity, because different languages and cultures have different ways of denoting a place.
A very robust, deep technology stack that combines unsupervised methods (unsupervised morphology and unsupervised part-of-speech tagging) with supervised methods.
- Genes (and their mutations)
This is a place recognition system that is geared towards cities but can also cover states, other locations, etc. It works independently of lists, i.e. it can detect locations that were not present in the training data.
Recognizes names of natural persons in texts. This is done not via extensive lists but by deep linguistic analysis of the surrounding text, so new names can also be identified with high precision. The system can also extract longer names like “John F. Kennedy”.
Same as Persons but for Companies. Here, the problem is that many normal words (“Apple”) and phrases can be names. Since our system uses the linguistic context, such problems are greatly reduced.
Determines the keywords of a text, at a quality beyond the textbook tf-idf methods.
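For comparison, the textbook tf-idf baseline mentioned above can be sketched in a few lines. This is only the baseline, not the module's own method; the function names `tokenize` and `tfidf_keywords` and the simple lowercase tokenizer are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Very simple tokenizer (an assumption): lowercase ASCII letter runs."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_keywords(docs, doc_index, top_k=3):
    """Rank the terms of docs[doc_index] by tf * idf and return the top_k."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    # Term frequency within the chosen document.
    tf = Counter(tokenized[doc_index])
    total = sum(tf.values())

    scores = {
        term: (count / total) * math.log(n_docs / df[term])
        for term, count in tf.items()
    }
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_k]
```

A term that occurs in every document gets an idf of zero and therefore never ranks above a term that is specific to the document, which is the whole idea of the baseline.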
Language-independent extraction of list items such as bullet point lists, especially in the context of email handling. It can also handle broken lines, embedded emails, forwards, etc., and is proven to work with different languages and the different characters they use for bulleted lists.
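To make the task concrete, here is a minimal rule-based sketch of bullet extraction. It is not the module's implementation: the bullet character set, the handling of `>` quote prefixes from forwarded emails, and the rule that an indented line continues the previous item are all assumptions chosen for illustration.

```python
import re

# Optional email quote prefixes ("> "), then a bullet character or a
# numbered marker such as "1." or "2)", then the item text.
# The set of bullet characters is an assumption and can be extended.
BULLET = re.compile(r"^\s*(?:>\s*)*(?:[-*+•·‣–]|\d+[.)])\s+(.*\S)\s*$")


def extract_list_items(text):
    """Return the list items found in text, joining broken lines."""
    items = []
    in_item = False
    for line in text.splitlines():
        match = BULLET.match(line)
        if match:
            items.append(match.group(1))
            in_item = True
        elif in_item and line.strip() and line[:1].isspace():
            # Indented, non-empty line right after an item: treat it as
            # the continuation of a broken line.
            items[-1] += " " + line.strip()
        else:
            in_item = False
    return items
```

A purely rule-based approach like this breaks down quickly on real email, which is why the continuation and quoting rules are kept explicit and easy to replace.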
Like the time extractor, this module understands all kinds of measurement units, such as kg, mmol or km/h, and extracts them as factual information that can then be used elsewhere.
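As an illustration of what such an extraction produces, the following sketch pulls value and unit pairs out of a text with a regular expression. The unit list is a tiny hypothetical sample, and the names `UNITS` and `extract_measurements` are assumptions; the actual module covers far more units than this.

```python
import re

# Hypothetical sample of supported unit symbols.
UNITS = ["kg", "mmol", "km/h", "km"]

# Longer symbols come first so that "km/h" is preferred over "km".
_unit_alternation = "|".join(
    re.escape(u) for u in sorted(UNITS, key=len, reverse=True)
)

# A number (optionally with a decimal point), optional whitespace, a unit.
# The lookarounds keep the match from starting or ending inside a longer token.
PATTERN = re.compile(
    r"(?<![\w.])(\d+(?:\.\d+)?)\s*(" + _unit_alternation + r")(?![\w/])"
)


def extract_measurements(text):
    """Return a list of (value, unit) tuples found in text."""
    return [(float(value), unit) for value, unit in PATTERN.findall(text)]
```

Returning structured pairs instead of raw substrings is what makes the extracted facts usable in later steps, for example unit conversion or range checks.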
A combination of a statistical and a rule-based self-learning system that extracts the signature at the end of emails so that this text can be removed from further analysis. This is important for high-quality document clustering and sensitive NER.
Language-independent summarization of texts with a variable, customizable target length. It can also work on multiple documents, i.e. summarize across documents (for example, different newspapers writing about the same incident).
A language-neutral but rule-based system for time extraction. It also understands complex phrases like “next Sunday in the morning” by actually doing arithmetic based on what the text says. It uses a reference time so that old messages can also be understood correctly.
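The role of the reference time can be shown with a small sketch that resolves phrases of the form "next <weekday> [in the <part of day>]". It is not the module's rule set: the function name `resolve`, the hours assigned to each part of the day, and the reading of "next Sunday" as the first Sunday strictly after the reference date are all assumptions.

```python
import re
from datetime import datetime, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]

# Hypothetical mapping from a part of the day to a representative hour.
DAYPART_HOUR = {"morning": 9, "afternoon": 15, "evening": 19}

PHRASE = re.compile(r"next\s+(\w+)(?:\s+in\s+the\s+(\w+))?", re.IGNORECASE)


def resolve(phrase, reference):
    """Resolve 'next <weekday> [in the <daypart>]' relative to reference.

    Returns a datetime, or None if the phrase is not understood.
    """
    match = PHRASE.search(phrase)
    if not match or match.group(1).lower() not in WEEKDAYS:
        return None

    target = WEEKDAYS.index(match.group(1).lower())
    # Days until the first matching weekday strictly after the reference
    # date (always between 1 and 7).
    days_ahead = (target - reference.weekday() - 1) % 7 + 1
    day = (reference + timedelta(days=days_ahead)).replace(
        hour=0, minute=0, second=0, microsecond=0
    )

    hour = DAYPART_HOUR.get((match.group(2) or "").lower())
    return day if hour is None else day.replace(hour=hour)
```

Passing the send date of an old message as `reference` yields the date the author meant, rather than a date computed from the moment the text is analyzed.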