Pre-Processing


Janya has developed pre-processing techniques that improve the precision of information extraction across a wide variety of document formats.


CASE RESTORATION

Case restoration is performed by a machine learning component. Because much of the text used by the intelligence community is entirely in uppercase, Semantex™ had to be adapted to perform well on this type of data. Using a bigram hidden Markov model (HMM), the system automatically restores an all-uppercase document to natural mixed case.
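
To make the idea concrete, the following is a minimal sketch of bigram-HMM case restoration, not Janya's implementation. It assumes the hidden states are the cased forms of each word seen in a mixed-case training corpus, the observations are the uppercase tokens, transitions are add-one smoothed bigram probabilities, and decoding uses the Viterbi algorithm. The names `TRAINING`, `train`, and `restore_case` are hypothetical, and the tiny corpus is for illustration only.

```python
import math
from collections import defaultdict

# Hypothetical mixed-case training sentences (illustration only).
TRAINING = [
    "The president met Bill Clinton in New York",
    "Please pay the bill in full",
    "A new bill was sent to the president",
]

def train(sentences):
    """Collect unigram/bigram counts over cased tokens and the casings seen per word."""
    unigrams = defaultdict(int)
    bigrams = defaultdict(int)
    variants = defaultdict(set)  # uppercase form -> cased forms seen in training
    for sentence in sentences:
        prev = "<s>"  # sentence-start state
        unigrams[prev] += 1
        for token in sentence.split():
            unigrams[token] += 1
            bigrams[(prev, token)] += 1
            variants[token.upper()].add(token)
            prev = token
    return unigrams, bigrams, variants

def restore_case(text, model):
    """Viterbi decoding: pick the most probable sequence of cased forms."""
    unigrams, bigrams, variants = model
    vocab_size = len(unigrams)

    def log_transition(prev, cur):
        # Add-one smoothed bigram probability P(cur | prev).
        count = bigrams.get((prev, cur), 0) + 1
        return math.log(count / (unigrams.get(prev, 0) + vocab_size))

    # best maps each current state to (log score, best path ending in that state).
    best = {"<s>": (0.0, [])}
    for observed in text.split():
        # Unknown words fall back to lowercase (a simplifying assumption).
        candidates = variants.get(observed.upper()) or {observed.lower()}
        step = {}
        for cand in candidates:
            score, path = max(
                ((s + log_transition(prev, cand), p) for prev, (s, p) in best.items()),
                key=lambda item: item[0],
            )
            step[cand] = (score, path + [cand])
        best = step
    return " ".join(max(best.values(), key=lambda item: item[0])[1])

if __name__ == "__main__":
    model = train(TRAINING)
    print(restore_case("THE PRESIDENT MET BILL CLINTON IN NEW YORK", model))
    print(restore_case("PLEASE PAY THE BILL IN FULL", model))
```

In this sketch the emission step is deterministic (a cased form can only produce its own uppercase spelling), so the bigram context alone decides between, for example, "Bill" before "Clinton" and "bill" after "the". A production system would need a far larger corpus and a better treatment of unseen words.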


TEXT ZONING

Text zoning identifies structure within a document and distinguishes metadata and reference information from the sections that Semantex™ should process. To accomplish this, a unique rule-specification language and control structure were developed that enable users to specify their own rules for text zoning. The TextZoner has been used to detect page breaks and section headers within HUMINT documents.
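
The TextZoner rule language itself is not described here, so the following is only an illustrative sketch of rule-driven zoning, with ordinary regular expressions standing in for user-specified rules. The rule set `RULES`, the labels, the sample text, and the functions `zone` and `body_text` are all hypothetical. Each line receives the label of the first rule that matches it, and only lines labeled "body" are passed on for further processing.

```python
import re

# Hypothetical user-defined rules: (zone label, pattern). The first match wins.
RULES = [
    ("metadata", re.compile(r"^(?:DATE|FROM|TO|SUBJECT|REF)\s*:")),
    ("page_break", re.compile(r"^\s*-+\s*Page\s+\d+\s*-+\s*$")),
    ("header", re.compile(r"^[A-Z][A-Z0-9 /&-]{2,}:?\s*$")),
]

# Hypothetical sample document (illustration only).
SAMPLE = "\n".join([
    "DATE: 12 JAN",
    "SUMMARY",
    "The source reported a meeting.",
    "--- Page 2 ---",
    "DETAILS",
    "The meeting lasted two hours.",
])

def zone(text, rules=RULES):
    """Label every line with the first matching rule, or 'body' if none match."""
    zoned = []
    for line in text.split("\n"):
        label = next((name for name, pattern in rules if pattern.search(line)), "body")
        zoned.append((label, line))
    return zoned

def body_text(zoned):
    """Keep only the non-empty lines that should go on to extraction."""
    return "\n".join(line for label, line in zoned if label == "body" and line.strip())

if __name__ == "__main__":
    for label, line in zone(SAMPLE):
        print(f"{label:<10} {line}")
    print(body_text(zone(SAMPLE)))
```

Keeping the rules in a plain list means a user can add, remove, or reorder them without touching the control loop, which is the general idea behind letting users specify their own zoning rules.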


For more information, see Pre-Processing Tools.