Theme: Natural Language Processing

Intelligent Self-Learning Parser-editor for Data Stream Analysis (Ferret III)

Task: Inferring structure in a data stream of unknown content requires substantial machine learning strategies. However, if some limited information about the structure is known, it can be used to prime the search algorithm and so reduce its search space. The fully general inference problem is too large to be tractable for data of any significant commercial volume, but priming offers real potential for producing worthwhile results on large data sets. Many data streams in the commercial world require analysis, such as undocumented program code, web documents and stock exchange prices.

We have built a prototype system that performs basic inference functions and has been adapted to analyse text streams in documents. It expresses the structures that are primed by the user as XML codes in an output file. However, it needs significant development to become a complete, releasable product.

Project Aim:

This project aims to convert the parser-editor from its basic form into a releasable product. The software requires expansion to provide for the specification of XML markup codes by the user, the output of XML-compliant data files, the processing of partially marked-up files, and the inference of structures in the data stream not identified by the priming process.

Problem Context:

The conversion of reference books (such as encyclopaedias and dictionaries) from paper form into electronic formats is a difficult task. Initially they have to be read using a scanner and then converted into text using OCR software. This process typically misidentifies 1-5% of characters. Subsequently the reference entries have to be loaded into a database. This is also difficult, as the structure of the data is ill-defined and not always consistent across all entries in the book. In addition, the boundaries between attributes are most often implied by the text's typographic formats rather than by explicit symbols. The domain task is to convert two dictionaries from RTF format into XML format using the Intelligent Editor.
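
As an illustration of attribute boundaries being implied by typography, the following minimal Python sketch turns bold and italic runs in a flat RTF fragment into XML elements. It is not the prototype's code: the tag names (headword, pos, gloss), the PRIMING table and the function rtf_runs_to_xml are hypothetical, and only the RTF control words \b, \b0, \i and \i0 are handled.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical priming rules: which typographic run marks which attribute.
PRIMING = {"b": "headword", "i": "pos"}

# Either an RTF control word (backslash, letters, optional numeric
# parameter, optional single-space delimiter) or a run of plain text.
TOKEN = re.compile(r"\\([a-z]+)(-?\d+)? ?|([^\\{}]+)")

def rtf_runs_to_xml(fragment, default_tag="gloss"):
    """Convert a flat RTF fragment into one <entry> element (sketch only)."""
    active = None  # control word of the style currently switched on
    parts = []
    for match in TOKEN.finditer(fragment):
        word, param, text = match.groups()
        if word is not None:
            if word in PRIMING:
                # "\b" switches bold on, "\b0" switches it off.
                active = None if param == "0" else word
            continue  # other control words are ignored in this sketch
        text = text.strip()
        if text:
            tag = PRIMING.get(active, default_tag)
            parts.append(f"<{tag}>{escape(text)}</{tag}>")
    return "<entry>" + "".join(parts) + "</entry>"
```

Text outside any primed style falls back to the default tag, which is one simple way of handling runs that the priming rules do not cover.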

Project Resources:

Two bilingual dictionaries in RTF/Word formats have to be analysed.

Current Functionality:

This software currently satisfies a number of functional requirements:

1. Read records of data and identify mark-up codes (RTF) in the text. Further codes need to be definable both interactively and in advance, as they will vary from one piece of text to another.

2. Define the action to be taken on identifying a particular mark-up code.

3. Accept a specification for the structure of the codes and check that the text codes conform to that structure.

4. Prompt users to resolve ambiguous or unresolvable grammar structures.

5. Design an appropriate user interface.
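
Requirements 3 and 4 above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the prototype's implementation: the structure specification is written as a regular expression over hypothetical code names (headword, pos, gloss), and entries that do not conform are set aside for the user to resolve.

```python
import re

# Hypothetical structure specification: one headword, an optional part of
# speech, then one or more glosses. Code names are separated by spaces.
ENTRY_SPEC = re.compile(r"headword( pos)?( gloss)+")

def check_entry(codes):
    """Return True when the sequence of codes conforms to ENTRY_SPEC."""
    return ENTRY_SPEC.fullmatch(" ".join(codes)) is not None

def partition(entries):
    """Split entries into conforming ones and ones needing user review."""
    conforming, needs_review = [], []
    for codes in entries:
        (conforming if check_entry(codes) else needs_review).append(codes)
    return conforming, needs_review
```

A regular expression is enough only when the specification is a regular language; a nested or recursive structure would need a proper grammar instead.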

Enhanced Functionality:

1. Ability to assign a full range of XML features to any part of the text.

2. Ability to write text files with fully compliant XML.

3. Inference of structure in the automata modelling the structure of the priming rules and appropriate adjustment of the XML tags.

4. Ability to infer the XML DTD for the document.
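
One naive way to approach the DTD inference in item 4 is sketched below in Python. It is an assumption-laden illustration rather than the intended algorithm: the function infer_dtd is hypothetical, child elements are assumed to appear in a consistent order, attributes and mixed content are ignored, and any element without children is declared as #PCDATA. Occurrence indicators (?, + and *) are derived from the minimum and maximum number of times each child appears across all instances of its parent.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def infer_dtd(xml_text):
    """Infer simple <!ELEMENT ...> declarations from one XML document."""
    root = ET.fromstring(xml_text)
    order = {}   # element name -> child names in first-seen order
    counts = {}  # element name -> one Counter of child names per instance
    for elem in root.iter():
        counts.setdefault(elem.tag, []).append(
            Counter(child.tag for child in elem))
        seen = order.setdefault(elem.tag, [])
        for child in elem:
            if child.tag not in seen:
                seen.append(child.tag)

    # (child is always present, child may repeat) -> DTD occurrence indicator
    suffixes = {(True, False): "", (False, False): "?",
                (True, True): "+", (False, True): "*"}
    lines = []
    for name, children in order.items():
        if not children:
            lines.append(f"<!ELEMENT {name} (#PCDATA)>")
            continue
        terms = []
        for child in children:
            occurrences = [c[child] for c in counts[name]]
            key = (min(occurrences) >= 1, max(occurrences) > 1)
            terms.append(child + suffixes[key])
        lines.append(f"<!ELEMENT {name} ({', '.join(terms)})>")
    return "\n".join(lines)
```

Because the content model is built from the observed instances only, it describes the sample rather than the dictionary as a whole, so the result would still need review against entries not yet seen.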

Deliverables:

1. An executable system on CD-ROM including all reports, code, databases and the run-time system.

2. The fully XML encoded dictionaries.