Basser Seminar Series

Scalable Data Processing for High-Throughput Genomics

Speaker: Dr Uwe Roehm
School of Information Technologies, University of Sydney

Time: Friday 22 October 2010, 4:00-5:00pm
Refreshments will be available from 3:30pm

Location: The University of Sydney, School of IT Building, Lecture Theatre (Room 123), Level 1


While information technology has long been mainstream in business, the scientific community is now facing a similar revolution: automated scientific instruments allow large sample sets to be screened at unprecedentedly low cost.

For example, today's DNA sequencing technology can sequence an individual genome within a few weeks for a fraction of the cost of the original Human Genome Project (an estimated $3 billion over 10 years).

The ultimate goal is the personal genome within a few hours and for $1000, which would revolutionise modern health care and research areas such as cancer and HIV research. This also means that genomics labs are facing several terabytes of data per week that have to be automatically processed and made available to the research community.

This talk explores the potential and the limitations of using modern database systems as a data processing platform for high-throughput genomics. In particular, we are interested in storage management for high-throughput DNA sequence data and in leveraging declarative SQL queries and user-defined functions for DNA data analysis tasks inside a database system. We report on a feasibility study around the 1000 Genomes Project using advanced features of SQL Server 2008 (such as FILESTREAM storage and CLR-based user-defined functions and aggregates), as well as our ongoing work with CSIRO on building k-mer indexes and using SQL to efficiently cross-compare bacterial genomes.

Speaker's biography

Uwe Roehm is a senior lecturer in the School of Information Technologies at the University of Sydney. He is a computer science graduate of the University of Passau, Germany, and received his Ph.D. in 2002 from ETH Zurich, Switzerland, for his work on OLAP with database clusters. His research interests are scientific data management, quality-of-service guarantees for web information systems, scalable data replication, and data processing in cloud computing.