Task: Inferring structure in a data stream of unknown content requires substantial machine learning strategies. However, if some limited information about the structure is known, it can be used to prime the search algorithm and so reduce its search space. The fully general inference problem is too large to be tractable for data of any significant commercial volume, but priming offers real potential for producing worthwhile results on large data sets. Many data streams in the commercial world require analysis, such as undocumented program code, web documents and stock exchange prices.
We have built a prototype system that performs basic inference functions and has been adapted to analyse text streams in documents. It expresses the structures primed by the user as XML codes in an output file. However, it needs significant development to become a complete, releasable product.
This project aims to convert the parser-editor from its basic form into a releasable product. The software requires expansion to provide for the specification of XML markup codes by the user, the output of XML-compliant data files, the processing of partially marked-up files, and the inference of structures in the data stream not identified by the priming process.
The conversion of reference books (such as encyclopaedias and dictionaries) from paper form into electronic formats is a difficult task. Initially they have to be read using a scanner and then converted into text using OCR software. This process typically misidentifies 1-5% of characters. Subsequently the reference entries have to be loaded into a database. This is also a difficult problem, as the structure of the data is ill-defined and not always consistent across all entries in the book. In addition, the information that demarcates attributes is most often implied by the text's typographic formats rather than by explicit symbols. The domain task is to convert two dictionaries from RTF format into XML format using the Intelligent Editor.
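To illustrate how typographic formats can imply attribute boundaries, the following is a minimal Python sketch, not the Intelligent Editor's actual implementation. It assumes a heavily simplified RTF-like entry in which bold marks the headword and italic marks the part of speech. The mapping table, the element names and the function `entry_to_xml` are all hypothetical, and real RTF (nested groups, font tables, escapes) would need a proper parser.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical mapping from RTF character-format control words to XML
# element names; a real dictionary would need its own table.
FORMAT_TO_TAG = {"b": "headword", "i": "pos"}

# Matches a simple, non-nested RTF group such as {\b aardvark}.
GROUP = re.compile(r"\{\\([a-z]+) ([^{}]*)\}")

def entry_to_xml(rtf_line):
    """Convert one simplified RTF entry line into an <entry> element."""
    entry = ET.Element("entry")
    cursor = 0
    for match in GROUP.finditer(rtf_line):
        # Unformatted text between groups is treated as definition text
        # (an assumption made for this sketch).
        plain = rtf_line[cursor:match.start()].strip()
        if plain:
            ET.SubElement(entry, "definition").text = plain
        tag = FORMAT_TO_TAG.get(match.group(1), "unknown")
        ET.SubElement(entry, tag).text = match.group(2).strip()
        cursor = match.end()
    tail = rtf_line[cursor:].strip()
    if tail:
        ET.SubElement(entry, "definition").text = tail
    return entry

sample = r"{\b aardvark} {\i n.} a nocturnal burrowing mammal"
print(ET.tostring(entry_to_xml(sample), encoding="unicode"))
```

Under these assumptions the sample line yields an `entry` element containing `headword`, `pos` and `definition` children. Formats missing from the table are emitted as `unknown`, which is the kind of case that would need a user-defined code or a prompt.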
This software currently satisfies a number of functional requirements:
1. Read records of data and identify mark-up codes (RTF) in the text. Further codes need to be definable both interactively and in advance, as they will vary from one piece of text to another.
2. Define the action to be taken on identifying a particular mark-up code.
3. Accept a specification for the structure of the codes and check that the text codes conform to that structure.
4. Prompt users to resolve ambiguous or unresolvable grammar structures.
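Requirements 3 and 4 above can be sketched roughly as follows. This is an illustrative Python sketch under stated assumptions, not the prototype's own design. The structure specification is written as a regular expression over code names, and the interactive prompt is replaced by a callback. `ENTRY_SPEC`, `conforms`, `check_entries` and `fix_order` are all hypothetical names.

```python
import re

# Hypothetical structure specification: each entry is one headword,
# an optional part of speech, then one or more definitions.
ENTRY_SPEC = re.compile(r"headword (pos )?(definition )+")

def conforms(tags):
    """Return True if the sequence of code names matches ENTRY_SPEC."""
    return ENTRY_SPEC.fullmatch("".join(tag + " " for tag in tags)) is not None

def check_entries(entries, ask_user):
    """Check each entry's code sequence; defer non-conforming ones to ask_user.

    ask_user is a callback standing in for an interactive prompt. It receives
    the offending sequence and returns a corrected one, or None to skip it.
    """
    accepted = []
    for tags in entries:
        if not conforms(tags):
            tags = ask_user(tags)
        if tags is not None and conforms(tags):
            accepted.append(tags)
    return accepted

entries = [
    ["headword", "pos", "definition"],
    ["headword", "definition", "definition"],
    ["pos", "headword", "definition"],  # codes out of order
]

# A non-interactive stand-in for the user: move the headword to the front.
def fix_order(tags):
    return ["headword"] + [t for t in tags if t != "headword"]

print(check_entries(entries, fix_order))
```

Passing the prompt in as a callback keeps the checking logic testable without a user present. An interactive version could supply a function that displays the entry and reads a correction.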