Basser Seminar Series

Million Books to the Web - Technological Challenges and Research Issues

Professor N. Balakrishnan
Chairman, Division of Information Sciences
Indian Institute of Science, Bangalore, India

TUESDAY, 26 July 2005, 2-3pm (Note: different day)

Basser Conference Room (Madsen Building G92)

Abstract

The stunning growth in storage technology prompted Prof Raj Reddy of Carnegie Mellon University to envision that it would be possible, in the very near future, to store in digital form all of the knowledge ever produced by the human race. As a first step in realizing this grandiose vision, it was proposed to create the Digital Library with a free-to-read, searchable collection of one million books, predominantly in Indian languages, available to everyone over the Internet. The Digital Library of India is expected to foster creativity and free access to all human knowledge. The portal is also intended to become an aggregator of the knowledge and digital content created by other digital library initiatives in India and in partner countries such as China. This portal (http://dli.ernet.in and http://dli.iiit.ac.in) is slowly and steadily becoming a gateway to Indian digital libraries in science, arts, culture, music, movies, traditional medicine, palm leaves and many more. Currently more than 140,000 books (around 50 million pages) have been scanned, and most of them are available on the web for free browsing.

One of the goals of the Digital Library of India is to provide support for full-text indexing and searching based on OCR (optical character recognition) technologies where available. The availability of online search allows users to locate relevant information quickly and reliably, and possibly in a language-independent way, thus enhancing students' success in their research endeavors. To achieve this mission, CMU and the Indian Institute of Science, along with 21 partner institutions, first established technologies and processes for the selection of books, their scanning and cropping, OCR, and the storage architectures. Besides acting as a repository of information, the Digital Library of India has also become one of the finest test beds for Indian language processing research in areas such as machine translation, optical character recognition, summarization, speech and handwriting recognition, intelligent indexing, and information retrieval in Indian languages.

In this talk, we address the technological challenges and research issues in the Digital Library of India: Million Books to the Web project, and also discuss the Indian language technology research that was stimulated by the vast information base made available by the DLI project. In the first part of the talk, we present a brief overview of technological advances in processors, storage and connectivity, show the paradigm shift in computing from recall to recognition, and establish the feasibility of storing all of the human race’s knowledge in digital form. We then present the technologies developed for scanning the books and taking them to the web. The Indian language technology research addressing issues in the development of OCR systems, example-based machine translation, summarization and search engines is then discussed. The talk concludes with a discussion of the issue of copyright and presents a novel idea of a “Consortium for Compensating for Creating Contents”, the FourCs.