Workflow and Revision Control System for Multi-media Multi-Publication Reference Materials - Beta Version

GENERAL TASK

The task of publishing complex document groups (multiple books requiring elaborate interlinking) taken from a variety of media (text, sound, figure, video) for publication in a variety of formats (e.g., book, web or executable CD-ROM) requires the fusing of two conceptual technologies: workflow management and version control. This project requires the development of an integrated workflow management and version control system so that, after updates to source materials, new publications and run-time systems can be generated automatically.

SPECIFIC CASE STUDY

The general task described above can best be approached by studying a particular case to illuminate the details of the problems. In our case we have chosen the publication of reference materials for second language learning and translation. The processes of second language learning and translation are difficult and laborious, and can be made significantly more comfortable with fast computer-based aids. At the same time, these reference materials are used by different people in different circumstances. While a professional translator might want a fast and sophisticated information system to ensure high productivity, a student may prefer to have the books to study in a more leisurely way. Hence we are most interested in publishing all the reference materials in this case both as books and as an integrated and sophisticated run-time system. Currently, for this project we have available three resources for learning the Basque language that can be usefully co-ordinated for language learning.

Project Aims

1. To develop a run-time system that integrates text and sound resources from various sources in a seamless way to support second language learning.

2. To develop a revision control system so that modifications in the documents are effectively managed and satisfy the needs of multiple modes of publication, such as book, web or CD-ROM.

3. To develop a workflow management system for automating the regeneration of the publication version or run-time system after documents have been updated.

Project Resources

The resources available to the project are a Basque-English dictionary, an English-Basque dictionary and a Basque grammar book (written in English). In addition, half of the example Basque and English sentences in the grammar book have been recorded by native speakers; these recordings will have to be linked into the software environment so that the user has seamless access to them. The final product should be delivered on a CD-ROM.

Project Components

The project needs to be separated into a variety of components. As this project is directed at producing a Beta version of the software, a sophisticated solution for each component needs to be provided. The prototype systems will be available for scrutiny and reuse. The product is expected to be completed to a professional standard. Broadly, the components and issues can be stated as:

System Management Issues

1. Data preparation: Prepare the data for the final storage organisation.

2. Update strategy: Design update procedures for each of the documents and construct computer controlled processes where possible.

3. System regeneration: Design the procedures and processes to automatically regenerate each publication version of the system, that is, book, web pages and CD-ROM.

Run-time System Issues

1. Storage organisation: Design a storage organisation for each of the different types of data.

2. Design indexes: Devise a cross-indexing strategy for each of the five data sources.

3. Retrieval strategies: Create retrieval functions given the storage structures and indexing strategies.

4. Design display interface: The interface should provide for exploitation of the technology to give extensive services to the user.

5. Operating systems issues {NT}: Most of the technical decisions will be dictated by the operating system for implementation.

6. Create CD-ROM: The final system should be free standing and run automatically after a single load from the CD-ROM.

Detailed Description

RESOURCE MATERIALS

The resource materials consist of 3 books, and 2 sound files.

Book1. This is a Basque-English dictionary. This book exists in electronic form as a word processing file (WORD).

Book2. This is an English-Basque dictionary. This book exists in electronic form as a word processing file (WORD).

Book3. This is a grammar book written in English. This book exists in electronic form, as a word processing file in Framemaker. It is a descriptive grammar of the Basque language. It is divided into chapters, sections, subsections and appendices. Throughout the book there are an extensive number of examples of Basque sentences with their English equivalents, each forming a pair of sentences that has to be indexed as one object.

Soundfile 1 is a sound file of each of the Basque sentences referred to in Book 3. 50% of the sentences are recorded and the remainder need to be recorded. Arrangements will be made to get a native speaker to do this.

Soundfile 2 is a sound file of each of the English sentences referred to in Book 3. 50% of the sentences are recorded and the remainder need to be recorded. Your group will have to do the remainder.

INTERFACES

The interface has to treat each of the 5 resources as independent windows with synchronised semantic displays.

Book 3 will act as a type of master display that the other windows will have to keep in time with. Book 3 will allow the user to open any page of the book and read the contents using typical text file scrolling functions (horizontal and vertical scrolling). The user should be able to select a text-example in either Basque or English. If they select a whole sentence then the sound file record for that sentence should be played; if multiple examples are selected then the sound records should be played from first to last. If they select a word then the dictionary entry for the word should be displayed along with the surrounding entries, that is, the display should have the appearance of a book.
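The selection behaviour described above could be sketched roughly as follows in Python. This is only an illustration under stated assumptions: the function names, the `context` parameter and the idea of holding headwords in a sorted list are hypothetical, and examples without a recording are simply skipped because only half of the sentences are recorded so far.

```python
import bisect


def surrounding_entries(headwords, word, context=2):
    """Return the headwords around `word` in a sorted headword list.

    If `word` is not a headword, the slice is centred on the position
    where it would appear, so the display still looks like a book page.
    """
    i = bisect.bisect_left(headwords, word)
    start = max(0, i - context)
    end = min(len(headwords), i + context + 1)
    return headwords[start:end]


def sound_queue(example_numbers, sound_index):
    """Return sound records for the selected examples, first to last.

    `sound_index` maps an example number to a sound record name
    (hypothetical layout). Unrecorded examples are left out.
    """
    return [sound_index[n] for n in sorted(example_numbers) if n in sound_index]
```

A word selection would call `surrounding_entries` to fill the dictionary window, while a sentence or multi-example selection would pass the selected example numbers to `sound_queue` and play the result in order.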

Issues for the interfaces

1. The form of the display of the books 1, 2 & 3 has to be considered so that it retains a display appearance that is at least similar to the typographical and display layout of the paper versions of the books.

2. The display of the sound records needs to show continuous progress through the length of the record, such as a moving scroll bar, percentage count, vibrating spectral analysis graph, etc.

3. Selection of any word in either of the dictionaries implies its entry from the other dictionary is wanted.

4. A user has to have a window pane for typing in a word of interest. There should be a window pane on each book for this function.

5. There is a conceptual difference in the nature of a query on the grammar book vis-a-vis the dictionaries. The grammar book has an Index and a Table of Contents. These are the topics it covers. A user may be interested in a topic or in examples of word usage. These require different processing activities and both need to be implemented.

6. Selection of words or lines from the dictionary cannot result in playing of sound as all sound is prerecorded. Sound then is a pane of the grammar book window.

INTERNAL STRUCTURAL ELEMENTS

Storage structures

Storage structures need to be developed for the three types of data, that is, the dictionaries, the grammar book and the sound files. Each is indexed for a different type of object. The dictionaries have the headword of each entry as the primary key, along with the part-of-speech (POS) of the word, as the same word can have multiple POS. The grammar book will have two types of indices. The book is divided into sections running down 4 hierarchical levels, which constitute one type of index. The second index will be on the text examples, and in turn the Basque and English versions of each example need an index. The sound files need to be indexed according to the text example numbering used in the grammar book. Suitable storage structures, indexing mechanisms and retrieval methods need to be chosen.
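As one possible starting point, the indexing scheme described above can be sketched in Python with in-memory dictionaries. The class name `Store`, its method names and the field layouts are hypothetical; a real system would more likely sit on a database, and this sketch only shows which keys each index needs.

```python
from dataclasses import dataclass, field


@dataclass
class Store:
    # (headword, POS) -> entry text; one headword may have several POS
    entries: dict = field(default_factory=dict)
    # example number -> (Basque sentence, English sentence), one object
    examples: dict = field(default_factory=dict)
    # section path of up to 4 levels, e.g. (2, 1, 3) -> example numbers
    sections: dict = field(default_factory=dict)
    # (example number, language) -> name of the sound record
    sounds: dict = field(default_factory=dict)

    def add_entry(self, headword, pos, body):
        self.entries[(headword, pos)] = body

    def add_example(self, number, basque, english, section):
        if not 1 <= len(section) <= 4:
            raise ValueError("section path must have 1 to 4 levels")
        self.examples[number] = (basque, english)
        self.sections.setdefault(tuple(section), []).append(number)

    def lookup(self, headword):
        """Return {POS: entry text} for every POS of a headword."""
        return {p: b for (h, p), b in self.entries.items() if h == headword}

    def examples_under(self, prefix):
        """Return example numbers in a section and all its subsections."""
        prefix = tuple(prefix)
        return [n for path, nums in self.sections.items()
                if path[:len(prefix)] == prefix for n in nums]
```

Keying the sound index on the example number plus a language label keeps the sound records dependent on the grammar book numbering, so renumbering examples only requires rebuilding that one index.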

It is important to determine the exact storage image of each of the books, as they have to be displayed through the interface in some aesthetically acceptable style and the computation for conversion to that style has to be performed on the fly. For example, the dictionaries may be stored in some form (such as a database) which has to be passed through an HTML converter, whose output is in turn delivered to HTML display software.
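The on-the-fly conversion step might look something like the following Python sketch, which turns stored dictionary rows into HTML. The row layout `(headword, pos, body)`, the function names and the `entry` class name are assumptions made for illustration; only `html.escape` from the standard library is relied on.

```python
from html import escape


def entry_to_html(headword, pos, body):
    """Render one stored dictionary entry as an HTML fragment."""
    return (
        '<p class="entry">'
        f"<b>{escape(headword)}</b> "
        f"<i>{escape(pos)}</i> "
        f"{escape(body)}"
        "</p>"
    )


def page_to_html(rows):
    """Render a list of (headword, pos, body) rows as one HTML page."""
    parts = [entry_to_html(*row) for row in rows]
    return "<html><body>\n" + "\n".join(parts) + "\n</body></html>"
```

Keeping the typographical decisions in the converter rather than in the stored data means the same stored image could feed the book, web and CD-ROM versions with different converters.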

Preprocessing

There are requirements to perform some intelligent processing on the text-examples from the grammar book. Firstly, both the English and the Basque sentences need to be tagged for their part-of-speech and grammar features. This can be done by external programs that will not reside in the final system. Hence the system will need to retain stored versions of the tagged sentences, independently of their version embedded in the grammar book text. This arrangement requires the book text to point not only to sound files but also to the part-of-speech tagged text, which in turn points into the dictionaries.

The Basque tagging will be provided by our collaborators in the Basque Country. We will use one of the English taggers to do the English examples. Note that the tagging scheme used by electronic taggers (both English and Basque) may use POS terminology that is not the same as in the dictionaries. Heuristics for conversion between the two systems will have to be developed.
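A conversion heuristic of this kind could be sketched as a lookup table with a prefix fallback, as in the Python fragment below. The source does not say which tagger or which dictionary labels are used, so the Penn Treebank-style tags and the labels "n", "v", "adj" and "adv" here are purely illustrative assumptions.

```python
# Hypothetical exact mappings from tagger tags to dictionary POS labels.
EXACT = {"NN": "n", "NNS": "n", "VB": "v", "VBD": "v", "JJ": "adj", "RB": "adv"}

# Hypothetical fallback rules: any tag starting with the prefix maps to the label.
PREFIX_RULES = [("NN", "n"), ("VB", "v"), ("JJ", "adj"), ("RB", "adv")]


def to_dictionary_pos(tag):
    """Convert a tagger tag to a dictionary POS label, or None if unknown."""
    if tag in EXACT:
        return EXACT[tag]
    for prefix, pos in PREFIX_RULES:
        if tag.startswith(prefix):
            return pos
    return None  # caller should flag the word for manual review
```

Returning `None` rather than guessing keeps unmapped tags visible, so the table can be extended as the real tag sets become known.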

Moving between documents

Users may want to select a word in any of the three books to retrieve from another book. There needs to be a standard method of passing a word to a receiving module that directs a request to the correct book. The dictionaries on receiving a word need to retrieve it from their database. The grammar book will have to direct it to an appropriate grammar section (this is a hard problem).
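One way to picture the standard passing method is a small router that each book registers with, sketched below in Python. The class name `Router`, the book names and the normalisation step (trimming and lower-casing) are assumptions; the grammar book handler is deliberately left to the caller, since mapping a word to a grammar section is the hard part.

```python
class Router:
    """Pass a selected word to the module that serves the target book."""

    def __init__(self):
        self._handlers = {}

    def register(self, book, handler):
        """Register a callable that takes a word and returns a result."""
        self._handlers[book] = handler

    def send(self, book, word):
        """Normalise the word and direct the request to the named book."""
        handler = self._handlers.get(book)
        if handler is None:
            raise KeyError(f"no book registered as {book!r}")
        return handler(word.strip().lower())
```

For example, a dictionary module could register its own retrieval function under a name such as "basque-english", and any window could then call `send("basque-english", selected_word)` without knowing how that dictionary is stored.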

Some functional features that may be added to the prototype system.

1. Any highlighting activity can use colour variation or grey scale gradation.

2. Users may wish to have bookmarks to important pages in the grammar book and/or important words in the dictionary. As these will be dynamic, it is important to make them easy to create and delete.

3. A thumbindex on the side of the dictionary and the grammar book would be useful.

4. Ability to search from the Table of Contents and the Keyword Index of the grammar book.

5. Ability to enter a keyword for retrieval from any book.

6. Ability for the user to add information to books by way of annotation/margin comments.

Revision Control and Workflow Management Systems

Project Emphasis

One of the important facets of this project is to understand the difference between a solution that solves the static problem and one that solves the dynamic problem. The static solution merely gets a system up and running without consideration for the long-term life of the system. The dynamic problem is the issue of ensuring the finished product can have a long life where it has to be improved and adapted to changing needs. It is expected that the product will have an indefinite life. Hence the final system has to be designed in terms of revisability and maintainability. Revisions constitute major changes where things are either reconceptualised or significantly extended in functionality. Maintenance refers to tasks that are directed to maintaining the current functionality in the context of updates and additions to the current data contents. To maximize the efficiency of these operations it is necessary to develop strategies for revision control, and regeneration of the system requires a good workflow management strategy.

It is important to realise that there are 3 different structural elements of the system that require maintenance and revisability: the first is the text and sound in all documents, the second is the code that is necessary for the run-time system, and the third is the code that controls workflow management.

Revisions

Reconceptualisations represent a change in the broader purpose of the system. In this particular case the initial concept was to provide browsing and fast retrieval for 3 separate but interlinked data sources (2 dictionaries and 1 grammar book) and 2 dependent data sources (2 sound tracks) with the aim of supporting the learning of Basque by an English speaker. A reconceptualisation could be to develop the system for such things as:

a. a translation assistance tool,

b. language tutoring,

c. reference source for key Basque literature,

d. reference collection of dictionaries and grammar sources.

To support reconceptualisation the system must be designed so that, minimally, its data sources are readily available for use with a variety of software tools (note: this means you should use standard data storage techniques and data description notations), and the code must be written in a manner that does not exclude its reuse in as yet unforeseen software environments (note: this means you should write industry standard code).

Maintenance

Maintenance in this context refers to the process of maintaining a single source document, whether for corrections or for extensions or rewrites of its sections. In the case of the dictionaries, minor corrections to spelling, descriptive content, layout formats and organisation are likely to be common. The grammar book is certainly likely to be revised where translations of examples are reinterpreted and as new material becomes available for inclusion. In these circumstances the dependent material (the 2 sound databases) plus all indexing mechanisms will need to be adjusted. The problems in this work include issues such as:

a. cross-indexing between databases,

b. parsing of data examples for both English and Basque,

c. updating the various databases,

d. updating text files,

e. automatic generation of working systems with minimal user intervention.
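Item e, regeneration with minimal user intervention, could be approached by recording a fingerprint of each source at build time and rebuilding only the publication versions whose sources have changed. The Python sketch below illustrates the idea; the source names, the dependency table and the choice of SHA-256 are all hypothetical, and the actual build steps are left out.

```python
import hashlib

# Hypothetical dependency table: publication version -> source documents.
TEXTS = ["dictionary_be", "dictionary_eb", "grammar"]
SOUNDS = ["sound_basque", "sound_english"]
DEPENDENCIES = {
    "book": TEXTS,
    "web": TEXTS + SOUNDS,
    "cdrom": TEXTS + SOUNDS,
}


def fingerprint(content):
    """Return a stable fingerprint for the bytes of one source document."""
    return hashlib.sha256(content).hexdigest()


def changed_sources(sources, recorded):
    """Compare current source bytes with fingerprints from the last build.

    `sources` maps a source name to its bytes; `recorded` maps a source
    name to the fingerprint stored when the system was last generated.
    """
    return {name for name, content in sources.items()
            if recorded.get(name) != fingerprint(content)}


def targets_to_regenerate(changed, dependencies=DEPENDENCIES):
    """Return the publication versions that depend on any changed source."""
    return sorted(t for t, deps in dependencies.items()
                  if changed.intersection(deps))
```

With this table, a correction to the grammar book would schedule all three versions for regeneration, while a newly recorded sound file would leave the printed book untouched.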

Project Requirements

The design and implementation have to achieve all of these requirements. The project report needs to show consideration for the needs of Revision and Maintenance, and the solution has to satisfy these needs. The prototypes which were produced in 1998 and 1999 will be available for analysis and reuse. The project report must discuss the full ramifications of these issues.

RESEARCH QUESTIONS

These issues are important to making this system highly sophisticated. They represent questions at the cutting edge of research.

a. How will books of straight text be added to the system and what indexing will be provided?

b. What dynamic language processing could be useful?

c. Addition of new texts: For Basque, morphological analysis is a problem, so it may be sensible to take a text and analyse each word against a list of dictionary entries, and then only permit indexing into those words that match the dictionary and flag other words as unavailable.

d. How are computing functions for the run-time system introduced in an automated way?
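The dictionary-matching idea in question c could be tried out with something as simple as the Python sketch below. It uses exact, case-insensitive matching against a headword list, which is an assumption made to keep the example short; it does no morphological analysis, so inflected Basque forms would be flagged as unavailable. The function name and result layout are hypothetical.

```python
import re


def index_text(text, headwords):
    """Flag each word of a new text as indexable or unavailable.

    Returns a list of (word, character offset, available) tuples, where
    `available` is True only if the word matches a dictionary headword.
    """
    known = {h.lower() for h in headwords}
    result = []
    for match in re.finditer(r"\w+", text):
        word = match.group()
        result.append((word, match.start(), word.lower() in known))
    return result
```

The proportion of words flagged as unavailable on a sample text would give a first measure of how much morphological analysis is needed before new texts can be usefully linked to the dictionaries.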