8th EDBT Summer School

Database Technologies for Novel Applications

September 3-7, 2007    ♦    Bozen-Bolzano, Italy

Tutorial Summaries and Slides

Database Tuning (Philippe Bonnet, University of Copenhagen, Denmark)
Database tuning is the activity of making a database application run faster (i.e., with higher throughput or lower response time). To make a system run faster, the database tuner may have to change the way applications are constructed, the data structures and parameters of the database system, the configuration of the operating system, or the hardware. The best database tuners can therefore solve problems that require broad knowledge of both the application and the underlying computer systems. The tutorial has three objectives: (1) review database internals and their impact on performance, (2) describe and illustrate the recurring principles underlying database tuning, and (3) present a principled approach to database tuning problems.
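
One of the data-structure changes the tutorial refers to is adding an index. As a minimal, language-neutral illustration (table and column names are invented for the example), the sketch below uses SQLite to show how creating an index changes the access path the query planner chooses:

```python
import sqlite3

# Illustration of one tuning principle: changing a data structure
# (adding an index) changes the access path the planner picks.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer INTEGER, amount REAL)")
con.executemany("INSERT INTO orders (customer, amount) VALUES (?, ?)",
                [(i % 100, float(i)) for i in range(10_000)])

q = "SELECT SUM(amount) FROM orders WHERE customer = 42"
# The last column of EXPLAIN QUERY PLAN output describes the access path.
plan_before = " ".join(row[-1] for row in con.execute("EXPLAIN QUERY PLAN " + q))

con.execute("CREATE INDEX idx_customer ON orders (customer)")
plan_after = " ".join(row[-1] for row in con.execute("EXPLAIN QUERY PLAN " + q))

print(plan_before)  # a full table scan
print(plan_after)   # a search using idx_customer
```

Real tuning, of course, weighs the index's benefit against its maintenance cost on updates, which is part of what the tutorial's principled approach covers.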

Tutorial slides

Biological Data Management (H.V. Jagadish, University of Michigan, USA)
Biological and medical data is diverse, complex and challenging. It makes for a compelling driver for innovation in database research. In this segment, we will introduce the fundamentals of biology and bioinformatics, consider a case study in the integration and management of biological data to understand the challenges that arise, and finally discuss three specific challenges in depth: provenance, nomenclature/ontologies, and usability.

Tutorial slides: part 1, part 2

Database Performance (Patrick and Betty O'Neil, University of Massachusetts, USA)
We have recently completed an effort to measure data warehousing performance on three major commercial database systems, using a new star schema benchmark we designed. Data warehousing is the major application under development in companies today, and unlike OLTP, which has been a solved problem for many years, the data requirements of data warehousing ensure that disk I/O will be the main performance problem to be dealt with.

Indexing has been the classic solution to speeding up disk access, allowing indexed lookup to replace sequential access to all the rows. But measurements we have performed on modern processors show that, while indexing is still important, it has lost much of its edge over sequential access since the early days of DB2, when "filter factor" calculations over a group of index predicates in a query's WHERE clause were used to determine the speedup of indexed access. We explain why clustering the data by indexed columns has become so much more crucial, and why new indexing capabilities such as DB2's Multi-Dimensional Clustering (MDC) provide such an important clustering approach. However, there are a number of rather subtle issues to be solved in properly using MDC (or our own variant, Multi-Column Clustering (MCC), which can be used with other products as well), and we provide a motivated and detailed explanation of these issues in our tutorial. From comments we have had from warehouse practitioners, our explanation of these issues seems to add value to current practice.
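
The filter-factor arithmetic and the clustering effect can be sketched with a back-of-the-envelope I/O model (the figures are invented, and real optimizers are far more elaborate):

```python
def page_reads(n_rows, rows_per_page, filter_factors, clustered):
    """Crude I/O estimate: a sequential scan reads every page; an index
    read touches roughly one page per matching row when the matches are
    scattered across the table, but only matching_rows / rows_per_page
    pages when the table is clustered on the indexed columns."""
    ff = 1.0
    for f in filter_factors:
        ff *= f                      # classic independence assumption
    matching = ff * n_rows
    scan = n_rows / rows_per_page
    indexed = matching / rows_per_page if clustered else matching
    return scan, indexed

# 10M rows, 100 rows/page, two predicates with filter factor 0.1 each:
scan, unclustered = page_reads(10_000_000, 100, [0.1, 0.1], clustered=False)
_, clustered_reads = page_reads(10_000_000, 100, [0.1, 0.1], clustered=True)
print(scan, unclustered, clustered_reads)
```

In this toy setting the unclustered index reads about as many pages as the full scan (and with random rather than sequential I/O), while clustering cuts the page count by a factor of rows_per_page, which is the crux of the argument above.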

Tutorial slides

Data Integration in Bioinformatics and Life Sciences (Erhard Rahm, University of Leipzig, Germany)
New advances in the life sciences, e.g. molecular biology, biodiversity, drug discovery and medical research, increasingly depend on the management and analysis of vast amounts of highly diverse data. Relevant data is distributed across many sources on the web, with a high degree of semantic heterogeneity and different levels of quality. Such data often needs to be integrated with application-specific experimental data and clinical data to support new scientific discoveries. The tutorial will provide an overview of major integration approaches and systems that deal with these integration problems for the life sciences, and discuss selected research problems. Covered topics include:
  • Alternatives for data integration for life science applications
  • Warehouse-based integration approaches
  • Virtual integration approaches (including peer data management)
  • Data quality
  • Matching of life science ontologies
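
Ontology matching, the last topic above, often starts from simple lexical similarity between term names. A minimal sketch (the term names are invented, Gene-Ontology-style examples; real matchers combine lexical, structural and instance-based evidence):

```python
def jaccard(term_a, term_b):
    """Token-level Jaccard similarity between two ontology term names."""
    ta, tb = set(term_a.lower().split()), set(term_b.lower().split())
    return len(ta & tb) / len(ta | tb)

def match_terms(onto1, onto2, threshold=0.2):
    """Naive lexical ontology matching: pair up terms from two
    ontologies whose names share enough tokens."""
    return [(s, t) for s in onto1 for t in onto2 if jaccard(s, t) >= threshold]

matches = match_terms(["mitochondrial membrane", "cell nucleus"],
                      ["membrane of mitochondrion", "nucleolus"])
print(matches)
```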

Tutorial slides

Stream Data Management (Divesh Srivastava, AT&T Labs-Research, USA)
Measuring and monitoring complex dynamic phenomena, such as IP network traffic, produces highly detailed and voluminous data streams. The applications that need to analyze these massive data streams require sophisticated query capabilities. This tutorial provides a practitioner's perspective on data streams, illustrating the desired functionality and scalability issues using the Gigascope data stream management system.
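
Gigascope itself exposes a SQL-like query language over packet streams; as a language-neutral sketch of the core idea, the snippet below aggregates a stream into fixed, non-overlapping time windows, touching each record once in arrival order (the packet-like records are invented):

```python
from collections import Counter

def tumbling_counts(records, window):
    """Aggregate (timestamp, key) stream records into fixed-size,
    non-overlapping time windows, counting occurrences per key.
    The stream is processed one record at a time, in arrival order,
    so nothing needs to be buffered beyond the per-window counters."""
    windows = {}
    for ts, key in records:
        windows.setdefault(ts // window, Counter())[key] += 1
    return windows

# Invented packet-like records: (timestamp_seconds, source_address)
stream = [(0, "10.0.0.1"), (3, "10.0.0.2"), (9, "10.0.0.1"), (12, "10.0.0.1")]
per_window = tumbling_counts(stream, window=10)
print(per_window)
```

The scalability issues the tutorial discusses arise when such aggregation must keep up with multi-gigabit line rates, which is where careful system design comes in.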

Tutorial slides

Techniques for Managing Probabilistic Data (Dan Suciu, University of Washington, USA)
Many applications today need to manage large volumes of uncertain data, and represent the degree of uncertainty as probabilities. Examples of such applications are fuzzy object matching, uncertain schema mappings, exploratory queries in databases, RFID and sensor data. This tutorial will explore some fundamental techniques for managing probabilistic data. We will study the following:

  • The probabilistic data model based on possible worlds, and disjoint/independent databases.
  • The evaluation problem for conjunctive queries: query evaluation is hard in general but can be done efficiently (i.e., in PTIME) for some queries; we will study these tractable queries in detail.
  • The pragmatics of building a query processor over a probabilistic database, and how to optimize it on "hard" queries so as to compute the top-k answers efficiently.
  • Finally, we will discuss several advanced topics: probabilistic views, and probabilistic databases modeled as random graphs.

While the tutorial mostly emphasizes the formal aspects of managing probabilistic data, it will also be accessible to practitioners. No prior background in probability theory or statistics is needed.
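
The possible-worlds semantics can be made concrete with a few lines of code. This sketch computes an exact query probability over a tuple-independent database by brute-force enumeration (the tuple names and probabilities are invented):

```python
from itertools import product

def query_probability(tuples, probs, holds):
    """Exact query probability over a tuple-independent database:
    enumerate every possible world (subset of the tuples), weight it by
    the product of per-tuple presence/absence probabilities, and sum the
    weights of the worlds in which the Boolean query `holds`."""
    total = 0.0
    for present in product([False, True], repeat=len(tuples)):
        world = [t for t, keep in zip(tuples, present) if keep]
        weight = 1.0
        for p, keep in zip(probs, present):
            weight *= p if keep else 1.0 - p
        if holds(world):
            total += weight
    return total

# P(at least one of two independent candidate matches is real):
p = query_probability(["m1", "m2"], [0.5, 0.4],
                      lambda world: len(world) > 0)
print(p)  # 1 - 0.5 * 0.6
```

The enumeration is exponential in the number of tuples, which is exactly why the PTIME-evaluable queries studied in the tutorial matter in practice.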

Tutorial slides

Text, XML, and Multimedia Information Retrieval (Arjen de Vries, Centre of Mathematics and Computer Science, The Netherlands)
The tutorial starts by discussing several differences between information retrieval and database research. After a quick overview of information retrieval basics, we focus on the language modelling approach to information retrieval and its application to textual, XML and multimedia information sources. The tutorial concludes with a discussion of issues and challenges in the integration of information retrieval and databases.
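
The language modelling approach mentioned above scores documents by the likelihood of the query under a per-document unigram model. A minimal sketch with linear (Jelinek-Mercer style) smoothing against the collection model (the toy corpus is invented):

```python
import math
from collections import Counter

def lm_score(query, doc, collection, lam=0.5):
    """Query-likelihood scoring: log P(query | document) under a unigram
    language model, linearly smoothed with the collection model so terms
    unseen in the document do not zero out the score."""
    doc_tf, coll_tf = Counter(doc), Counter(collection)
    score = 0.0
    for term in query:
        p_doc = doc_tf[term] / len(doc)
        p_coll = coll_tf[term] / len(collection)
        score += math.log(lam * p_doc + (1 - lam) * p_coll)
    return score

# Toy corpus of two pre-tokenized documents:
d1 = ["xml", "retrieval", "xml"]
d2 = ["image", "retrieval", "image"]
corpus = d1 + d2
print(lm_score(["xml"], d1, corpus), lm_score(["xml"], d2, corpus))
```

The same ranking principle extends to XML elements and to multimedia, where the "language model" is estimated over structural units or visual features rather than plain terms.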

Tutorial slides

Web, Semantic, and Social Information Retrieval (Gerhard Weikum, Max Planck Institute, Germany)
On one hand, Web search is a mature technology that can index tens of billions of Web pages and provides fast and effective Internet search for the daily information needs of many millions of users. On the other hand, the continuing rapid growth of digital information on the Internet, and also in enterprises and digital libraries, poses tremendous challenges regarding scalability and information quality. This entails technical issues like index partitioning, caching, and top-k query processing on the scalability side, and advanced link analysis, combating Web spam, and mining query and click logs for personalization on the quality side. The first part of the tutorial discusses recent approaches that aim to address these challenges.
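
Link analysis of the kind referred to above is classically exemplified by PageRank-style power iteration. A self-contained sketch over a tiny invented link graph (production systems work on billions of pages with far more refined models):

```python
def pagerank(out_links, d=0.85, iterations=50):
    """Power-iteration PageRank over a link graph given as
    {page: [pages it links to]}. Dangling pages distribute their rank
    uniformly; d is the usual damping (teleportation) factor."""
    pages = list(out_links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        nxt = {p: (1.0 - d) / n for p in pages}
        for p, outs in out_links.items():
            if outs:
                share = d * rank[p] / len(outs)
                for q in outs:
                    nxt[q] += share
            else:  # dangling page: spread its rank over all pages
                for q in pages:
                    nxt[q] += d * rank[p] / n
        rank = nxt
    return rank

# Invented three-page web: a <-> b, and c links to a.
ranks = pagerank({"a": ["b"], "b": ["a"], "c": ["a"]})
print(ranks)
```

Page c, with no in-links, ends up with only the teleportation mass, while a and b reinforce each other, which is the intuition behind using link structure as a quality signal.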

In addition to these ongoing efforts to further scale up today's keyword-oriented Web search functionality and maintain its high quality, there are also major trends to advance the functionality itself to a more expressive semantic level. Faceted Information Retrieval organizes search results along dimension hierarchies based on metadata or automatic topic detection. Digital libraries, e-science archives, and Deep-Web portals provide steadily growing structured datasets, along with Semantic-Web-style ontologies and other kinds of knowledge sources. Information extraction technology has become more efficient and robust so as to enable large-scale extraction of entities and relationships from natural-language text sources; search and ranking could then be performed in terms of entities rather than pages. Finally, social networks provide means for tagging and organizing Web pages, photos, and videos in a way that resembles structured, albeit schemaless, data. The second part of the tutorial discusses these issues, presents selected approaches towards such richer search experiences, and points out research opportunities and challenges.

Tutorial slides: part 1, part 2