NLP Modules


Information Extraction Technology - What Is the State of the Art?

NORMALIZATION

The most basic form of normalization resolves spelling and conjugation variants: colors, coloring, and coloured all map to color. All forms are resolved through hand-edited lists and procedural rules, a step vital to performing information extraction.
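
As a rough illustration of how a hand-edited list and procedural rules can work together, here is a minimal Python sketch. The `VARIANTS` table, the `SUFFIXES` tuple, and the `normalize` function are hypothetical names invented for this example and are not taken from InfoXtract or Semantex™; a production system would need far larger lists and more careful rules than this naive suffix stripping.

```python
# Hypothetical hand-edited list: spelling variants mapped to a canonical form.
VARIANTS = {"colour": "color", "coloured": "color", "colouring": "color"}

# Hypothetical procedural rule: inflectional suffixes, tried in this order.
SUFFIXES = ("ing", "ed", "s")


def normalize(token: str) -> str:
    """Return a canonical form for a token (toy example only)."""
    word = token.lower()
    # Step 1: the hand-edited list takes priority over the rules.
    if word in VARIANTS:
        return VARIANTS[word]
    # Step 2: strip one suffix, keeping at least three characters of stem,
    # then consult the list again for the resulting stem.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            return VARIANTS.get(stem, stem)
    return word


for form in ("colors", "coloring", "coloured", "Colours"):
    print(form, "->", normalize(form))
```

Checking the list before applying rules lets irregular or regional spellings override the generic suffix logic, which is one plausible reason to combine the two mechanisms.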


  • Time Normalization: Most temporal references in text do not contain the complete month-day-year and hour:minute:second representation needed to place events on a timeline. Pattern-matching rules, together with a procedural module that applies an algorithm to the document reference time, translate temporal mentions into meaningful units, combining days with months and years or assigning a specific date to "next Monday," "three days ago," or "now."

  • Location Normalization: City and place names add more value when they are resolved against the world at large: is Augusta in Maine or in Georgia? This information, while not always available in the text, is helpful when intersecting searches and visualizing information on a map. In InfoXtract, geographic references are decoded through local pattern matching, discourse co-occurrence analysis, and weighted default senses.

  • Word Sense Disambiguation (WSD): Extracting meaning from text is especially difficult when one word has many senses or definitions: a river bank or a financial bank, to run a mile or to run a company. Through domain-specific machine learning, Semantex™ decides which sense of a word is most likely in context.
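
The time normalization step described above (pattern-matching rules plus arithmetic on the document reference time) can be sketched in Python as follows. This is only an illustration under stated assumptions: the `resolve` function, its three patterns, and the lookup tables are hypothetical and do not reproduce InfoXtract's actual rules. In particular, "next Monday" is assumed to mean the first Monday strictly after the reference date, and a bare month and day is assumed to fall in the reference year.

```python
import re
from datetime import date, timedelta
from typing import Optional

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]
MONTHS = {name: i for i, name in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"], start=1)}
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4,
                "five": 5, "six": 6, "seven": 7}


def resolve(mention: str, reference: date) -> Optional[date]:
    """Map a temporal mention to a calendar date (hypothetical rule set)."""
    text = mention.lower().strip()
    if text in ("now", "today"):
        return reference

    # "three days ago", "2 days ago": subtract from the reference date.
    m = re.fullmatch(r"(\w+) days? ago", text)
    if m:
        word = m.group(1)
        count = int(word) if word.isdigit() else NUMBER_WORDS.get(word)
        return reference - timedelta(days=count) if count is not None else None

    # "next Monday": assumed to be the first such weekday after the reference.
    m = re.fullmatch(r"next (\w+)", text)
    if m and m.group(1) in WEEKDAYS:
        target = WEEKDAYS.index(m.group(1))
        days_ahead = (target - reference.weekday() - 1) % 7 + 1
        return reference + timedelta(days=days_ahead)

    # "March 3": combine day and month with the (assumed) reference year.
    m = re.fullmatch(r"([a-z]+) (\d{1,2})", text)
    if m and m.group(1) in MONTHS:
        return date(reference.year, MONTHS[m.group(1)], int(m.group(2)))

    return None  # no rule matched; leave the mention unresolved


reference_time = date(2024, 5, 15)  # a Wednesday, used as document time
for mention in ("now", "three days ago", "next Monday", "March 3"):
    print(mention, "->", resolve(mention, reference_time))
```

Returning `None` for unmatched mentions keeps unresolved expressions visible instead of guessing a date.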

COREFERENCE
One of the most difficult challenges in information extraction, coreference is handled by InfoXtract with a multi-stage hybrid approach. Named entity coreference, accomplished through patterned string matching, creates links between named entity mentions. In rule-based coreference, anaphoric links subject to "hard" grammatical constraints (e.g. syntactically bound reflexives) are resolved, creating links between pronouns and entity mentions. For any mention that remains unlinked, the probability of each possible antecedent is calculated with a maximum-entropy model for statistical coreference.
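
To make the final, statistical stage more concrete, the Python sketch below scores candidate antecedents in the maximum-entropy style: each candidate gets a weighted sum of its active features, and the scores are normalized with a softmax so that they form a probability distribution. The feature names and weights are invented for illustration; a real model would learn its weights from annotated data, and nothing here reflects InfoXtract's actual feature set.

```python
import math
from typing import Dict, Set

# Hypothetical feature weights (a trained model would learn these).
WEIGHTS = {"gender_match": 1.5, "number_match": 1.0, "same_sentence": 0.5}


def antecedent_probabilities(candidates: Dict[str, Set[str]]) -> Dict[str, float]:
    """Return P(antecedent | mention) for each candidate.

    `candidates` maps a candidate antecedent to the set of features that
    fire for the (mention, candidate) pair.
    """
    # Linear score per candidate: sum of the weights of its active features.
    scores = {name: sum(WEIGHTS.get(f, 0.0) for f in feats)
              for name, feats in candidates.items()}
    # Softmax normalization turns the scores into probabilities.
    z = sum(math.exp(s) for s in scores.values())
    return {name: math.exp(s) / z for name, s in scores.items()}


# Candidates for an unlinked pronoun such as "he" (toy feature sets).
probs = antecedent_probabilities({
    "John": {"gender_match", "number_match"},
    "Mary": {"number_match", "same_sentence"},
})
best = max(probs, key=probs.get)
print(best, probs)
```

In a cascade like the one described, this stage would run only for mentions that string matching and the hard grammatical rules left unlinked.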


PARSING
Semantex™ goes beyond shallow parsing to work out the logical relations between sentence components. Regardless of its initial composition, each semantically valuable phrase is translated into a Subject-Verb-Object-Complement (SVOC) clause structure. When performing event or relationship searches, this arrangement is vital to finding the correct links between elements. Understanding that "The information was transferred to headquarters by John" and "John transferred the information to headquarters" are equivalent, while "Headquarters transferred the information to John" is decidedly not, enables information understanding rather than simple extraction, for quicker, more accurate results.
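
The value of a common SVOC form can be shown with a deliberately tiny Python sketch. It uses two regular expressions that cover only the sentence shapes in the example above; a real parser such as the one described would rely on full grammatical analysis, not regexes. The `Clause` type, the two patterns, and `to_svoc` are hypothetical names introduced for this illustration.

```python
import re
from typing import NamedTuple, Optional


class Clause(NamedTuple):
    subject: str
    verb: str
    obj: str
    complement: str


# Toy patterns, assumed for this example only.
# Passive: "<object> was <verb> to <complement> by <subject>"
PASSIVE = re.compile(
    r"^(?P<obj>.+?) was (?P<verb>\w+) to (?P<comp>.+?) by (?P<subj>.+?)\.?$",
    re.IGNORECASE)
# Active: "<subject> <verb-ed> <object> to <complement>"
ACTIVE = re.compile(
    r"^(?P<subj>.+?) (?P<verb>\w+ed) (?P<obj>.+?) to (?P<comp>.+?)\.?$",
    re.IGNORECASE)


def to_svoc(sentence: str) -> Optional[Clause]:
    """Map an active or passive sentence to one logical SVOC clause."""
    text = sentence.strip()
    # Try the passive pattern first so "was" is not read as part of a subject.
    match = PASSIVE.match(text) or ACTIVE.match(text)
    if match is None:
        return None
    return Clause(*(match.group(key).lower()
                    for key in ("subj", "verb", "obj", "comp")))


passive = to_svoc("The information was transferred to headquarters by John.")
active = to_svoc("John transferred the information to headquarters.")
reversed_roles = to_svoc("Headquarters transferred the information to John.")

print(passive == active)         # same logical clause
print(active == reversed_roles)  # different subject and complement
```

Because both voices collapse to the same tuple, a search for "who transferred what to whom" can compare clauses directly instead of matching surface word order.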


For more information, see the Semantex™ product page.