NORMALIZATION
The most basic form of normalization reduces spelling and conjugation variations to a single base form: colors, coloring, and coloured all become color. All forms are disambiguated through hand-edited lists and procedural rules, a step vital to performing information extraction.
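As a rough illustration of how a hand-edited list and procedural rules can work together, here is a minimal Python sketch. The names SPELLING_VARIANTS, SUFFIX_RULES, and normalize are hypothetical, and the rules are far smaller than any production list; this is not the product's actual implementation.

```python
# Hypothetical hand-edited list: spelling variants mapped to a preferred spelling.
SPELLING_VARIANTS = {
    "colour": "color",
    "colours": "colors",
    "coloured": "colored",
    "colouring": "coloring",
}

# Hypothetical procedural rules: suffixes to strip, tried in order.
SUFFIX_RULES = ["ing", "ed", "s"]

MIN_STEM_LENGTH = 3  # avoid stripping short words such as "is" or "red"


def normalize(token: str) -> str:
    """Map a token to a base form using the list first, then the suffix rules."""
    word = token.lower()
    word = SPELLING_VARIANTS.get(word, word)
    for suffix in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: len(word) - len(suffix)]
    return word


forms = ["colors", "coloring", "coloured", "color"]
print({form: normalize(form) for form in forms})
```

The list handles cases that rules cannot predict (regional spellings), while the rules cover regular inflection. A real system would need many more exceptions than this sketch suggests.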
COREFERENCE
One of the most difficult challenges posed in information extraction, coreference is handled by
InfoXtract through a multi-stage hybrid approach. Named entity coreference, accomplished through patterned
string matching, creates links between named entity mentions. In rule-based coreference, anaphoric links
constrained by "hard" grammatical constraints (e.g. syntactically bound reflexives) are resolved,
creating links between pronouns and entity mentions. For any mention that remains unlinked,
the probability of each possible antecedent is calculated with a maximum-entropy model for
statistical coreference.
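The three stages can be sketched in Python as follows. Everything here is an assumption made for illustration: the Mention type, the two features, the weights, and the simplification that a reflexive binds to the nearest gender-compatible mention in its own sentence. It shows the shape of such a cascade, not InfoXtract's actual rules or model.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    text: str
    kind: str            # "name" or "pronoun"
    sentence: int        # index of the sentence containing the mention
    gender: str = "unknown"


def name_match(a: Mention, b: Mention) -> bool:
    """Stage 1: string matching, here 'one name's tokens are a subset of the other's'."""
    if a.kind != "name" or b.kind != "name":
        return False
    ta, tb = set(a.text.lower().split()), set(b.text.lower().split())
    return ta <= tb or tb <= ta


REFLEXIVES = {"himself": "male", "herself": "female"}


def reflexive_antecedent(pronoun: Mention, candidates: list):
    """Stage 2: hard constraint, approximated as same sentence plus gender agreement."""
    gender = REFLEXIVES.get(pronoun.text.lower())
    if gender is None:
        return None
    for cand in reversed(candidates):  # nearest candidate first
        if cand.sentence == pronoun.sentence and cand.gender == gender:
            return cand
    return None


# Hypothetical feature weights; a real model would learn these from annotated data.
WEIGHTS = {"same_gender": 2.0, "sentence_distance": -0.5}


def antecedent_probabilities(mention: Mention, candidates: list) -> list:
    """Stage 3: maximum-entropy scoring, P(a|m) proportional to exp(w . f(m, a))."""
    if not candidates:
        return []
    scores = []
    for cand in candidates:
        features = {
            "same_gender": 1.0 if cand.gender == mention.gender else 0.0,
            "sentence_distance": float(mention.sentence - cand.sentence),
        }
        scores.append(sum(WEIGHTS[name] * value for name, value in features.items()))
    total = sum(math.exp(s) for s in scores)
    return [math.exp(s) / total for s in scores]


john = Mention("John Smith", "name", 0, "male")
mary = Mention("Mary Jones", "name", 1, "female")
smith = Mention("Smith", "name", 1, "male")
he = Mention("he", "pronoun", 2, "male")
print(name_match(john, smith))
print(antecedent_probabilities(he, [john, mary, smith]))
```

Ordering the stages this way lets the cheap, high-precision steps settle the easy cases, so the statistical model only has to rank antecedents for mentions that the string matching and hard constraints left unlinked.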
PARSING
Semantex™ goes beyond shallow parsing to work out logical relations between sentence components.
Regardless of the initial composition, each semantically valuable phrase is translated into a
Subject-Verb-Object-Complement (SVOC) clause structure. When performing event or relationship searches,
this arrangement is vital to finding the correct links between elements. Understanding that "The
information was transferred to headquarters by John" and "John transferred the information to
headquarters" are equivalent, while "Headquarters transferred the information to John" is decidedly not, enables information understanding rather than simple extraction, for quicker, more accurate results.
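To make the active/passive equivalence concrete, the following Python sketch maps the three example sentences onto an SVOC tuple. It is a toy under strong assumptions: the Clause type and the two regular expressions are hypothetical and cover only this one sentence shape, whereas a real parser relies on full syntactic analysis rather than surface patterns.

```python
import re
from typing import NamedTuple, Optional


class Clause(NamedTuple):
    subject: str
    verb: str
    obj: str
    complement: str


# Hypothetical surface patterns for a single verb frame: "X <verb> Y to Z".
PASSIVE = re.compile(
    r"^(?P<obj>.+?) was (?P<verb>\w+) to (?P<comp>.+?) by (?P<subj>.+?)\.?$",
    re.IGNORECASE,
)
ACTIVE = re.compile(
    r"^(?P<subj>.+?) (?P<verb>\w+ed) (?P<obj>.+?) to (?P<comp>.+?)\.?$",
    re.IGNORECASE,
)


def to_svoc(sentence: str) -> Optional[Clause]:
    """Return a case-normalized SVOC clause, or None if neither pattern applies."""
    text = sentence.strip()
    for pattern in (PASSIVE, ACTIVE):  # try the more specific pattern first
        match = pattern.match(text)
        if match:
            return Clause(
                subject=match.group("subj").lower(),
                verb=match.group("verb").lower(),
                obj=match.group("obj").lower(),
                complement=match.group("comp").lower(),
            )
    return None


passive = to_svoc("The information was transferred to headquarters by John.")
active = to_svoc("John transferred the information to headquarters.")
reversed_roles = to_svoc("Headquarters transferred the information to John.")

print(passive == active)          # same logical roles
print(active == reversed_roles)   # subject and complement swapped
```

Because both the passive and active sentences reduce to the same tuple, a search for "who transferred what to whom" can match either wording, while the third sentence yields a different subject and complement and is correctly kept apart.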
For more information, see the Semantex™ product page.