Address Extraction

A module that recognizes addresses and tags their parts with the correct semantics (street, number, city, zip code, etc.). Covering the whole world adds complexity, because different languages and cultures denote a place in different ways.

Entity Extraction

A robust, deep technology stack that combines unsupervised methods (unsupervised morphology and unsupervised part-of-speech tagging) with supervised methods. It covers the following entity types:

  • Diseases
  • Chemicals/drugs
  • Genes (and their mutations)
  • Places

This is a place recognition system that is geared towards cities but can also cover states and other locations. It works independently of lists, i.e. it can detect locations that were not present in the training data.

  • Persons

Recognizes names of natural persons in texts. This is done not via extensive lists but by deep linguistic analysis of the surrounding text, so new names can also be identified with high precision. The system can also extract longer names like “John F. Kennedy”.

  • Organizations

Same as Persons, but for companies. The problem here is that many ordinary words (“Apple”) and phrases can also be names. Because our system uses the linguistic context, such problems are greatly reduced.

Keyword Extraction

Determines the keywords of a text, at a quality beyond that of textbook tf-idf methods.
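For orientation, the textbook tf-idf baseline mentioned above can be sketched in a few lines of Python. This is only the baseline that the module is compared against, not the module's own method; the function names `tokenize` and `tfidf_keywords` and the ASCII-only tokenizer are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Simplistic assumption: lowercase ASCII words only.
    return re.findall(r"[a-z]+", text.lower())


def tfidf_keywords(docs, index, top_n=3):
    """Rank the terms of docs[index] by tf-idf against the whole collection."""
    tokenized = [tokenize(d) for d in docs]

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    tf = Counter(tokenized[index])
    total = sum(tf.values())
    scores = {
        term: (count / total) * math.log(len(docs) / df[term])
        for term, count in tf.items()
    }
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_n]
```

A term that occurs in every document (such as "the" in a small English collection) gets a score of zero, which is the main effect the baseline relies on.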

List Detector

Language-independent extraction of list items such as bullet-point lists, especially in the context of email handling. It can also handle broken lines, embedded emails, forwards, etc., and is proven to work with different languages and the different characters they use for bulleted lists.
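As a rough illustration of the idea, the following Python sketch finds list items in an email body, tolerating reply quote markers and indented continuation lines. The set of bullet characters, the helper name `extract_list_items`, and the continuation rule are assumptions for this example, not the detector's actual rules.

```python
import re

# Quote markers that mail clients add to replies and forwards ("> > ").
QUOTE = re.compile(r"^(?:\s*>)+\s?")

# A bullet character or a short numeric enumerator ("1." or "2)"),
# followed by whitespace and the item text. The character set is an assumption.
BULLET = re.compile(r"^\s*(?:[-*+•·‣◦–—]|\d{1,3}[.)])\s+(?P<item>\S.*)$")


def extract_list_items(text):
    """Return the list items found in text, joining broken lines."""
    items = []
    in_item = False
    for raw in text.splitlines():
        line = QUOTE.sub("", raw)
        match = BULLET.match(line)
        if match:
            items.append(match.group("item").strip())
            in_item = True
        elif in_item and line.strip() and line[:1].isspace():
            # An indented, non-empty line continues the previous item.
            items[-1] += " " + line.strip()
        else:
            in_item = False
    return items
```

Because the pattern only looks at the marker and not at the words, it makes no assumption about the language of the item text.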

Measurement Units Extraction

Like the time extractor, this module understands all kinds of measurement units, such as kg, mmol, or km/h, and extracts them as factual information that can then be used elsewhere.
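A minimal Python sketch of what such an extraction might look like is shown here. The tiny unit inventory `UNITS`, the `Measurement` tuple, and the function `extract_measurements` are hypothetical and chosen only for illustration; a real module would need a far larger inventory and unit normalization.

```python
import re
from typing import NamedTuple


class Measurement(NamedTuple):
    value: float
    unit: str


# Hypothetical, deliberately tiny unit inventory.
UNITS = ["km/h", "mmol", "kg", "km", "mg", "g", "m"]

# Try longer units first so that "km/h" is preferred over "km".
UNIT_ALT = "|".join(re.escape(u) for u in sorted(UNITS, key=len, reverse=True))

# A number (with "." or "," as decimal separator), optional whitespace, a unit.
# The lookarounds stop matches inside longer numbers or longer words.
PATTERN = re.compile(
    rf"(?<![\w.])(\d+(?:[.,]\d+)?)\s*({UNIT_ALT})(?![\w/])"
)


def extract_measurements(text):
    """Return all (value, unit) pairs found in text."""
    return [
        Measurement(float(value.replace(",", ".")), unit)
        for value, unit in PATTERN.findall(text)
    ]
```

For example, the sentence "The patient weighs 72.5 kg, glucose 5,4 mmol, speed 30km/h." yields three measurements, while "30 meters" yields none because "m" is followed by further letters.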

Signature Detector

A self-learning system that combines statistical and rule-based methods to extract the signature at the end of an email, so that this text can be removed from further analysis. This is important for high-quality document clustering and sensitive NER.
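The rule-based half of such a system can be pictured with the following Python sketch. It covers only two simple rules (the conventional "-- " delimiter and a handful of English closing phrases) and contains no statistical or self-learning component; the phrase list, the `max_sig_lines` window, and the function name `strip_signature` are assumptions for illustration.

```python
import re

# A line consisting only of a closing phrase (assumed, English-only list).
CLOSINGS = re.compile(
    r"^\s*(best regards|kind regards|regards|sincerely|cheers|thanks)\b[,.!]?\s*$",
    re.IGNORECASE,
)

# The conventional signature delimiter: two dashes, optionally followed by a space.
SIG_DELIM = re.compile(r"^--\s*$")


def strip_signature(body, max_sig_lines=8):
    """Cut the email body at the first signature marker near its end."""
    lines = body.rstrip().splitlines()
    # Only look at the last few lines, where signatures normally appear.
    start = max(0, len(lines) - max_sig_lines)
    for i in range(start, len(lines)):
        if SIG_DELIM.match(lines[i]) or CLOSINGS.match(lines[i]):
            return "\n".join(lines[:i]).rstrip()
    return body
```

Restricting the search to the last lines keeps a phrase such as "Thanks" early in the message from truncating the body.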

Text Summarization

Language-independent summarization of texts with a customizable target length. It can also work across multiple documents, for example different newspapers writing about the same incident.
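To make the idea of a target length and cross-document input concrete, here is a very simple frequency-based extractive summarizer in Python. It is a generic textbook technique used purely as an illustration, not the method of this module; the function name `summarize` and the `max_sentences` parameter are hypothetical.

```python
import re
from collections import Counter


def _words(sentence):
    # \w+ is Unicode-aware in Python 3, so no language-specific rules are used.
    return re.findall(r"\w+", sentence.lower())


def summarize(documents, max_sentences=2):
    """Pick the highest-scoring sentences across all documents."""
    sentences = []
    for doc in documents:
        parts = re.split(r"(?<=[.!?])\s+", doc)
        sentences += [p.strip() for p in parts if p.strip()]

    # Words that recur across the documents are treated as important.
    freq = Counter(w for s in sentences for w in _words(s))

    def score(sentence):
        tokens = _words(sentence)
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(
        range(len(sentences)), key=lambda i: (-score(sentences[i]), i)
    )
    # Emit the chosen sentences in their original order.
    keep = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in keep)
```

Passing several articles about the same incident as `documents` lets sentences that repeat the shared vocabulary rise to the top, and `max_sentences` plays the role of the target length.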

Time Extraction

A language-neutral, rule-based system for time extraction. It also understands complex phrases like “next Sunday in the morning” by actually doing arithmetic based on what the text says. It uses a reference time, so that old messages are also interpreted correctly.
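The arithmetic against a reference time can be sketched in Python as follows. This is a toy English-only resolver under stated assumptions: "next Sunday" is read as the first Sunday strictly after the reference date, each part of the day is mapped to one representative hour, and the names `WEEKDAYS`, `DAYPARTS`, and `resolve` are hypothetical.

```python
import re
from datetime import datetime, timedelta

WEEKDAYS = [
    "monday", "tuesday", "wednesday", "thursday",
    "friday", "saturday", "sunday",
]

# Assumed representative hours for parts of the day.
DAYPARTS = {"morning": 9, "afternoon": 15, "evening": 19}


def resolve(phrase, reference):
    """Resolve a relative time phrase against a reference datetime."""
    words = re.findall(r"[a-z]+", phrase.lower())
    day = reference.replace(hour=0, minute=0, second=0, microsecond=0)
    found_day = False

    for word in words:
        if word == "today":
            found_day = True
        elif word == "tomorrow":
            day += timedelta(days=1)
            found_day = True
        elif word in WEEKDAYS:
            # Days until that weekday; 0 would mean today, so jump a full week.
            ahead = (WEEKDAYS.index(word) - reference.weekday()) % 7
            day += timedelta(days=ahead or 7)
            found_day = True

    hour = next((DAYPARTS[w] for w in words if w in DAYPARTS), None)
    if not found_day and hour is None:
        return None  # nothing recognizable in the phrase
    return day.replace(hour=hour) if hour is not None else day
```

With a message sent on Wednesday, 15 May 2024, the call `resolve("next Sunday in the morning", datetime(2024, 5, 15, 14, 30))` returns 19 May 2024 at 09:00. Passing the original send time of an old message as `reference` resolves the same phrase relative to that date instead of today.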