NLP Modules


Information Extraction Technology - What is the State of the Art?

NORMALIZATION

Normalization is the process of converting differing values to a standard type or format. Once values are normalized, they can be easily compared, clustered, or used by another application. The most basic form of normalization handles spelling and conjugation variations: colors, coloring, coloured, and color are all reduced to a single form through hand-edited lists and procedural rules, a step vital to performing information extraction.
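
As a rough illustration of how a hand-edited list and procedural rules can work together, the Python sketch below reduces the four example words to one form. The variant table, the suffix rules, and the function name normalize are all hypothetical and are not taken from Semantex; a production system would rely on far larger lists and more careful morphology.

```python
# Hypothetical hand-edited list: maps spelling variants to a standard spelling.
VARIANTS = {
    "colour": "color",
    "coloured": "colored",
    "colouring": "coloring",
}

# Hypothetical procedural rules: inflectional suffixes to strip, tried in order.
SUFFIXES = ("ing", "ed", "s")


def normalize(word: str) -> str:
    """Return a normalized form of `word` (a toy sketch, not a real stemmer)."""
    w = word.lower()
    # Step 1: the hand-edited list resolves spelling variation.
    w = VARIANTS.get(w, w)
    # Step 2: a procedural rule strips one suffix, keeping at least 3 characters.
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


if __name__ == "__main__":
    for token in ["colors", "coloring", "coloured", "color"]:
        print(token, "->", normalize(token))
```

Once every variant maps to the same string, comparing or clustering the values becomes a simple equality check.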


  • Time Normalization: Most temporal references in text do not contain the complete month-day-year and hour:minute:second representation necessary to place events on a timeline. Through pattern-matching rules and a procedural module that applies an algorithm to the document reference time, temporal mentions in the text are translated into meaningful units, combining days with months and years or assigning a specific date to "next Monday," "three days ago," or "now."

  • Location Normalization: Cities or place names add value to information when they are referred to in relation to the world at large. Is "Augusta" in Maine or in Georgia? This information, while not always available in the text, is helpful when intersecting searches and visualizing information on a map. In Semantex™, geographic references are decoded through local pattern matching and discourse co-occurrence analysis as well as weighted default senses.

  • Word Sense Disambiguation (WSD): Extracting the meaning from text is especially difficult when one word has many senses or definitions, such as "bank" (a river bank or a financial bank) or "run" (to run a mile or to run a company). Through domain-specific machine learning, Semantex can decide which sense of a word is most likely in context.
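
To make the time normalization step above more concrete, here is a minimal Python sketch that resolves the three example mentions against a document reference date. The function name normalize_time, the small number-word table, and the reading of "next Monday" as the first Monday strictly after the reference date are assumptions made for illustration; they do not describe the actual Semantex rules.

```python
import re
from datetime import date, timedelta
from typing import Optional

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]
# Hypothetical, deliberately tiny number-word table.
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4,
                "five": 5, "six": 6, "seven": 7}


def normalize_time(mention: str, reference: date) -> Optional[date]:
    """Resolve a relative temporal mention against the document reference date."""
    text = mention.lower().strip()

    # "now" / "today" resolve to the reference date itself.
    if text in ("now", "today"):
        return reference

    # "<N> days ago", where N is a digit string or a small number word.
    m = re.fullmatch(r"(\w+) days? ago", text)
    if m:
        token = m.group(1)
        n = NUMBER_WORDS.get(token)
        if n is None and token.isdigit():
            n = int(token)
        if n is not None:
            return reference - timedelta(days=n)

    # "next <weekday>": assumed to mean the first such day after the reference.
    m = re.fullmatch(r"next (\w+)", text)
    if m and m.group(1) in WEEKDAYS:
        target = WEEKDAYS.index(m.group(1))
        days_ahead = (target - reference.weekday() - 1) % 7 + 1  # 1 to 7
        return reference + timedelta(days=days_ahead)

    return None  # pattern not covered by this sketch


if __name__ == "__main__":
    doc_date = date(2010, 5, 12)  # assumed document reference date
    for phrase in ["next Monday", "three days ago", "now"]:
        print(phrase, "->", normalize_time(phrase, doc_date))
```

The key design point is that every rule takes the reference date as input, so the same mention resolves to different calendar dates in different documents.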

COREFERENCE
Coreference is one of the most difficult challenges posed in information extraction. Semantex handles coreference using a multi-stage hybrid approach. Named entity coreference, accomplished through patterned string matching, creates links between named entity mentions. Rule-based coreference resolves anaphoric links that are subject to "hard" grammatical constraints (e.g., syntactically bound reflexives), creating links between pronouns and entity mentions. For any mention that remains unlinked, the probability of each possible antecedent is calculated with a maximum-entropy model for statistical coreference.
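
The first stage, named entity coreference by string matching, can be sketched in a few lines of Python. The matching rule used here (one mention's words are a subset of an earlier mention's words) and the function name link_named_entities are assumptions chosen for illustration only; the patterns Semantex actually uses are not described in this text, and the rule-based and statistical stages are not shown.

```python
from typing import Dict, List, Set


def _tokens(mention: str) -> Set[str]:
    """Lowercase a mention, drop periods, and split it into a set of words."""
    return set(mention.lower().replace(".", "").split())


def link_named_entities(mentions: List[str]) -> Dict[int, int]:
    """Map the index of each mention to the index of its nearest antecedent.

    Two mentions are linked when the words of one are a subset of the
    words of the other, e.g. "Smith" and "John Smith".
    """
    links: Dict[int, int] = {}
    for i, mention in enumerate(mentions):
        current = _tokens(mention)
        if not current:
            continue
        # Search backwards so the nearest earlier mention wins.
        for j in range(i - 1, -1, -1):
            earlier = _tokens(mentions[j])
            if earlier and (current <= earlier or earlier <= current):
                links[i] = j
                break
    return links


if __name__ == "__main__":
    doc_mentions = ["John Smith", "Janya", "Smith", "Mr. Smith"]
    print(link_named_entities(doc_mentions))
```

A rule this simple will over-link distinct people who share a surname, which is one reason later, more constrained stages are needed.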


PARSING
Semantex goes beyond shallow parsing to work out logical relations between sentence components. Regardless of its initial composition, each semantically valuable phrase is translated into a Subject-Verb-Object-Complement (SVOC) clause structure. When performing event or relationship searches, this arrangement is vital to finding the correct links between elements. Understanding that "The information was transferred to headquarters by John" and "John transferred the information to headquarters" are equivalent, while "Headquarters transferred the information to John" is decidedly not, enables information understanding rather than simple extraction, for quicker, more accurate results.
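
The effect of SVOC normalization on the three example sentences can be illustrated with a toy Python sketch. It uses two regular expressions in place of a real parser, so it only covers sentences shaped like the examples; the Clause type, the patterns, and the function name to_svoc are hypothetical and are not how Semantex parses text.

```python
import re
from typing import NamedTuple, Optional


class Clause(NamedTuple):
    subject: str
    verb: str
    obj: str
    complement: str


# Toy patterns that only cover the shapes of the example sentences.
PASSIVE = re.compile(
    r"^(?P<obj>.+?) was (?P<verb>\w+) to (?P<comp>.+?) by (?P<subj>.+?)\.?$",
    re.IGNORECASE,
)
ACTIVE = re.compile(
    r"^(?P<subj>.+?) (?P<verb>\w+ed) (?P<obj>.+?) to (?P<comp>.+?)\.?$",
    re.IGNORECASE,
)


def to_svoc(sentence: str) -> Optional[Clause]:
    """Map an active or passive sentence onto one SVOC clause structure."""
    s = sentence.strip()
    # Try the passive pattern first, since it is the more specific one.
    for pattern in (PASSIVE, ACTIVE):
        m = pattern.match(s)
        if m:
            return Clause(*(m.group(k).lower()
                            for k in ("subj", "verb", "obj", "comp")))
    return None


if __name__ == "__main__":
    a = to_svoc("The information was transferred to headquarters by John.")
    b = to_svoc("John transferred the information to headquarters")
    c = to_svoc("Headquarters transferred the information to John")
    print(a == b)  # the passive and active forms yield the same clause
    print(a == c)  # the reversed roles yield a different clause
```

Because both surface forms reduce to the same tuple, an event search for "who transferred what to whom" can match on clause fields rather than on word order.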


For more information, see the Semantex product page.

News & Events

Meet with Janya representatives at DoDIIS 2010 in Phoenix, Arizona!



Janya Webinar on Customizable Text Analytics Solutions.



Janya joins partners to create Savanna Solution.



Mark Logic's Open Enrichment Framework features Semantex