Janya has experience processing a wide variety of document formats and sources, from military message traffic to commercial content providers. Every format has its own quirks, so Janya provides preprocessing modules that operate on the document source before extraction. These modules improve the precision of information extraction across formats and provide the following capabilities.
FORMAT CONVERSION
Semantex supports a variety of XML and HTML source types and converts the data before extraction. Conversion includes identifying and storing pertinent metadata to aid extraction, such as using the document date as a reference point for time normalization.
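To illustrate how stored document metadata can anchor time normalization, here is a minimal Python sketch. It is not Semantex code: the XML layout (a `date` element under the root), the function names, and the small table of relative expressions are all assumptions made for the example.

```python
from datetime import date, timedelta
import xml.etree.ElementTree as ET

# Hypothetical source document; the <date> element is an assumed layout.
SAMPLE = (
    "<doc><date>2009-03-15</date>"
    "<body>The shipment arrived yesterday.</body></doc>"
)

# Assumed, deliberately tiny table of relative time expressions (in days).
RELATIVE_OFFSETS = {"yesterday": -1, "today": 0, "tomorrow": 1}


def extract_document_date(xml_text: str) -> date:
    """Read the document date from the (assumed) <date> child of the root."""
    root = ET.fromstring(xml_text)
    return date.fromisoformat(root.findtext("date"))


def normalize_relative_date(expression: str, document_date: date) -> date:
    """Resolve a relative expression against the document date."""
    return document_date + timedelta(days=RELATIVE_OFFSETS[expression.lower()])


doc_date = extract_document_date(SAMPLE)
resolved = normalize_relative_date("yesterday", doc_date)
print(resolved.isoformat())
```

Without the stored document date, "yesterday" in the body text could not be tied to a calendar day, which is why the metadata is captured during conversion.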
FILTERING
In many document types, there are blocks of text that do not contribute useful information to the overall extraction process. Examples include standard boilerplate text in press releases, corporate address information for your company, and other fixed or repetitive data. Semantex provides rule-based capabilities for filtering selected areas of text so that extraction skips them. This improves the efficiency of extraction and the usefulness of the resulting data.
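The idea of rule-based filtering can be sketched in a few lines of Python. This is an illustration only and does not reflect the Semantex rule syntax: the three regular-expression rules and the `filter_lines` helper are hypothetical, and a real deployment would use rules written for its own sources.

```python
import re

# Hypothetical filter rules: each pattern marks a line to skip.
FILTER_RULES = [
    re.compile(r"^for immediate release\b", re.IGNORECASE),
    re.compile(r"^\d+ .+ (street|avenue|road)\b", re.IGNORECASE),
    re.compile(r"^all rights reserved\b", re.IGNORECASE),
]


def filter_lines(text: str, rules=FILTER_RULES) -> str:
    """Drop every line that matches at least one filter rule."""
    kept = [
        line
        for line in text.splitlines()
        if not any(rule.search(line.strip()) for rule in rules)
    ]
    return "\n".join(kept)


press_release = (
    "FOR IMMEDIATE RELEASE\n"
    "Acme hired Jane Doe as CEO.\n"
    "123 Main Street, Springfield\n"
    "All rights reserved."
)
print(filter_lines(press_release))
```

Only the sentence about the hiring remains, so the extraction step never spends effort on the release banner, the address line, or the rights notice.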
CASE RESTORATION
Many government and commercial data sources produce text in all uppercase letters, which can reduce the effectiveness of information extraction. Case restoration is the process of restoring proper mixed case to text, thereby improving extraction performance. In Semantex, Case Restoration uses machine learning algorithms to automatically restore proper case.
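The source does not say which learning algorithms Semantex uses. As a rough stand-in, the Python sketch below learns the most frequent surface form of each word from a small mixed-case sample and applies it to uppercase input. The training sentences and function names are hypothetical, and a production system would need a far larger corpus and a context-aware model.

```python
from collections import Counter, defaultdict

# Hypothetical mixed-case training sentences (punctuation omitted for brevity).
TRAINING = [
    "The president visited Boston on Monday",
    "Officials in Boston met the president",
    "Last Monday the team flew to Boston",
]


def train_truecaser(sentences):
    """Map each lowercased word to its most frequent observed casing."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        # Skip the first token: its capital letter only marks the sentence start.
        for token in sentence.split()[1:]:
            counts[token.lower()][token] += 1
    return {word: forms.most_common(1)[0][0] for word, forms in counts.items()}


def restore_case(upper_text, model):
    """Apply learned casing; unknown words fall back to lowercase."""
    tokens = [model.get(tok.lower(), tok.lower()) for tok in upper_text.split()]
    if tokens:
        # Capitalize the first word of the sentence.
        tokens[0] = tokens[0][0].upper() + tokens[0][1:]
    return " ".join(tokens)


model = train_truecaser(TRAINING)
print(restore_case("THE TEAM MET IN BOSTON ON MONDAY", model))
```

Restoring capitals on names such as "Boston" and "Monday" matters because capitalization is a strong cue for recognizing named entities in English text.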
For more information, see Pre-Processing Tools.