Tutorial Summaries and Slides
|Database Tuning (Philippe Bonnet, University of Copenhagen, Denmark)|
|Database tuning is the activity of making a database application run faster (i.e., with higher throughput or lower response time). To make a system run faster, the database tuner may have to change the way applications are constructed, the data structures and parameters of a database system, the configuration of the operating system, or the hardware. The best database tuners therefore can solve problems requiring broad knowledge of an application and of computer systems. The tutorial has three objectives: (1) review database internals and their impact on performance, (2) describe and illustrate the recurring principles underlying database tuning, and (3) present a principled approach to database tuning problems.|
|Biological Data Management (H.V. Jagadish, University of Michigan, USA)|
|Biological and medical data is diverse, complex and challenging. It makes for a compelling driver for innovation in database research. In this segment, we will introduce the fundamentals of biology and bioinformatics, consider a case study in the integration and management of biological data to understand the challenges that arise, and finally discuss three specific challenges in depth: provenance, nomenclature/ontologies, and usability.|
|Database Performance (Patrick and Betty O'Neil, University of Massachusetts, USA)|
We have recently completed an effort to measure data warehousing
performance on three major commercial database systems, using a new
star schema benchmark we designed. Data warehousing is the major
application under development in companies today, and unlike OLTP,
which has been a solved problem for many years, the data requirements
of data warehousing ensure that disk I/O will be the main performance
bottleneck to be dealt with.
Indexing has been the classic solution to speeding up disk access, allowing indexed lookup to replace sequential access to all the rows. But measurements we have performed on modern processors show that, while indexing is still important, it has lost much of its edge over sequential access since the early days of DB2, when "filter factor" calculations over a group of index predicates in a query WHERE clause were used to determine indexed access speedup. We explain why data clustering by index has become so much more crucial, and why new indexing capabilities such as DB2's Multi-Dimensional Clustering (MDC) provide such an important clustering approach. However, there are a number of rather subtle issues to be solved in properly using MDC (or our own variant, Multi-Column Clustering (MCC), which can be used with other products as well), and we provide a motivated and detailed explanation of these issues in our tutorial. From comments we have received from warehouse practitioners, our explanation of these issues seems to add value to current practice.
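The "filter factor" reasoning mentioned above can be illustrated with a toy cost model (a hypothetical sketch for intuition only, not DB2's actual optimizer logic; all cost constants and function names here are assumptions): combined selectivity is the product of the individual predicate selectivities, and indexed access wins only when the matching rows are few enough that random page reads beat one sequential pass.

```python
def filter_factor(selectivities):
    """Combined selectivity of ANDed predicates, assuming independence."""
    ff = 1.0
    for s in selectivities:
        ff *= s
    return ff

def prefer_index(num_rows, selectivities,
                 random_io_cost=10.0, seq_io_cost=1.0, rows_per_page=100):
    """Return True if indexed access is estimated cheaper than a full scan.

    With unclustered data, every matching row may cost a random page read,
    while a sequential scan pays one cheap sequential read per page --
    which is why clustering matters so much on modern hardware.
    """
    ff = filter_factor(selectivities)
    index_cost = ff * num_rows * random_io_cost       # one random I/O per match
    scan_cost = (num_rows / rows_per_page) * seq_io_cost
    return index_cost < scan_cost

# A highly selective predicate group still favours the index ...
print(prefer_index(1_000_000, [0.001, 0.01]))   # True
# ... but a mildly selective one favours the sequential scan.
print(prefer_index(1_000_000, [0.1, 0.5]))      # False
```

Under clustering (MDC/MCC-style), matching rows share pages, so the effective per-match random-I/O cost drops sharply and the index regains its edge; the toy model captures this by lowering `random_io_cost`.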
|Data Integration in Bioinformatics and Life Sciences (Erhard Rahm, University of Leipzig, Germany)|
New advances in life sciences, e.g. molecular biology, biodiversity, drug
discovery and medical research, increasingly depend on the management and
analysis of vast amounts of highly diverse data. Relevant data is
distributed across many sources on the web, with a high degree of
semantic heterogeneity and varying levels of quality. Such data often
needs to be integrated with application-specific experimental data and
clinical data to support new scientific discoveries. The tutorial will
provide an overview of major integration approaches and systems that
address these integration problems for the life sciences, and discuss
selected research problems.
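One small building block of such integration systems is schema matching across heterogeneous sources. As a minimal illustration (a naive name-based matcher of my own construction, not any system described in the tutorial; attribute names and the threshold are assumptions), source attributes can be paired with target attributes by string similarity:

```python
from difflib import SequenceMatcher

def match_attributes(src_attrs, tgt_attrs, threshold=0.6):
    """Naive schema-matching sketch: pair each source attribute name with
    its most similar target attribute name, keeping pairs above a
    similarity threshold -- a tiny stand-in for the name-based matchers
    used in data-integration systems."""
    matches = []
    for s in src_attrs:
        best, score = None, 0.0
        for t in tgt_attrs:
            sim = SequenceMatcher(None, s.lower(), t.lower()).ratio()
            if sim > score:
                best, score = t, sim
        if score >= threshold:
            matches.append((s, best, round(score, 2)))
    return matches

# Both source attributes find their structural counterpart despite
# different naming conventions; "species" is correctly left unmatched.
print(match_attributes(["GeneName", "ProteinID"],
                       ["gene_name", "prot_id", "species"]))
```

Real-world matchers combine many such evidence sources (names, instances, structure, ontologies); name similarity alone fails exactly where life-science nomenclature is most heterogeneous.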
|Stream Data Management (Divesh Srivastava, AT&T Labs-Research, USA)|
|Measuring and monitoring complex dynamic phenomena, such as IP network traffic, produces highly detailed and voluminous data streams. The applications that need to analyze these massive data streams require sophisticated query capabilities. This tutorial provides a practitioner's perspective on data streams, illustrating the desired functionality and scalability issues using the Gigascope data stream management system.|
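The kind of continuous aggregate such systems evaluate can be sketched in a few lines (a schematic illustration under my own assumptions, not Gigascope's GSQL or implementation): group a stream of timestamped packet records into tumbling windows and count packets per source IP, without ever storing the full stream.

```python
from collections import defaultdict

def tumbling_window_counts(packets, window_secs):
    """Count packets per source IP per tumbling time window.

    Each record is a (timestamp, src_ip) pair; the window identifier is
    the timestamp divided (integer division) by the window length, so
    state per window is tiny compared to the raw stream.
    """
    counts = defaultdict(int)
    for ts, src in packets:
        window = ts // window_secs          # tumbling-window identifier
        counts[(window, src)] += 1
    return dict(counts)

packets = [(0, "10.0.0.1"), (3, "10.0.0.1"), (7, "10.0.0.2"), (12, "10.0.0.1")]
print(tumbling_window_counts(packets, 10))
# {(0, '10.0.0.1'): 2, (0, '10.0.0.2'): 1, (1, '10.0.0.1'): 1}
```

A real stream engine evaluates such queries incrementally and emits each window's result as soon as the window closes, rather than materializing the input as done here.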
|Techniques for Managing Probabilistic Data (Dan Suciu, University of Washington, USA)|
Many applications today need to manage large volumes of uncertain
data, and represent the degree of uncertainty as probabilities.
Examples of such applications are fuzzy object matching, uncertain
schema mappings, exploratory queries in databases, RFID and sensor
data. This tutorial will explore some fundamental techniques for
managing probabilistic data.
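As one small illustration of the flavor of these techniques (a textbook sketch under the tuple-independence assumption, not drawn from the tutorial itself): on a tuple-independent probabilistic table, a Boolean existence query is true if at least one matching tuple is present, so its probability follows from the complement of the joint absence.

```python
def prob_exists(match_probs):
    """Probability that a Boolean query is true on a tuple-independent
    probabilistic table: the answer holds if at least one matching tuple
    is present, so P = 1 - prod(1 - p_i)."""
    none_present = 1.0
    for p in match_probs:
        none_present *= (1.0 - p)
    return 1.0 - none_present

# Three uncertain sensor readings match the query with these probabilities:
print(round(prob_exists([0.9, 0.5, 0.2]), 3))   # 0.96
```

For more complex queries with joins, correlations between intermediate results make exact evaluation #P-hard in general, which is what drives much of the work on safe plans and approximation in this area.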
|Text, XML, and Multimedia Information Retrieval (Arjen de Vries, Centre of Mathematics and Computer Science, The Netherlands)|
|The tutorial starts by discussing several differences between information retrieval and database research. After a quick overview of information retrieval basics, we focus on the language modelling approach to information retrieval and its application to retrieval over textual, XML, and multimedia information sources. The tutorial concludes with a discussion of issues and challenges for the integration of information retrieval and databases.|
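The language modelling approach mentioned above scores each document by the likelihood that its language model generated the query. A minimal query-likelihood scorer with Jelinek-Mercer smoothing (a standard textbook formulation; the toy documents, token lists, and lambda value are my own assumptions, not from the tutorial) looks like this:

```python
import math

def lm_score(query, doc, collection, lam=0.5):
    """Query-likelihood scoring with Jelinek-Mercer smoothing:
    score(d) = sum over query terms t of
               log( lam * P(t|d) + (1 - lam) * P(t|collection) ).
    Smoothing with collection statistics avoids zero probabilities
    for query terms absent from the document."""
    score = 0.0
    for t in query:
        p_doc = doc.count(t) / len(doc)
        p_coll = collection.count(t) / len(collection)
        score += math.log(lam * p_doc + (1 - lam) * p_coll)
    return score

docs = [["xml", "retrieval", "model"], ["database", "query", "language"]]
collection = [t for d in docs for t in d]   # collection model from all docs
query = ["retrieval", "model"]
scores = [lm_score(query, d, collection) for d in docs]
print(scores.index(max(scores)))   # 0 -> the XML-retrieval document ranks first
```

The same generative view extends naturally to XML elements and multimedia objects by swapping in element-level or feature-based language models, which is part of what makes the approach attractive across media types.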
|Web, Semantic, and Social Information Retrieval (Gerhard Weikum, Max Planck Institute, Germany)|
On one hand, Web search is a mature technology that can index tens of
billions of Web pages and provides fast and effective Internet search
for the daily information needs of many millions of users. On the
other hand, the continuing rapid growth of digital information on the
Internet, and also in enterprises and digital libraries, poses
tremendous challenges regarding scalability and information quality.
This entails technical issues like index partitioning, caching, and
top-k query processing on the scalability side, and advanced link
analysis, combating Web spam, and mining query and click logs for
personalization on the quality side. The first part of the tutorial
discusses recent approaches that aim to address these challenges.
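On the link-analysis side, the canonical technique is PageRank. A minimal power-iteration version (a textbook sketch over a toy three-page graph of my own choosing, not any production engine's implementation) can be written as:

```python
def pagerank(links, damping=0.85, iters=50):
    """Power-iteration PageRank over an adjacency dict {page: [outlinks]}.

    Each iteration redistributes rank along outlinks, mixes in a uniform
    teleport component (1 - damping), and spreads the rank of dangling
    pages uniformly so that total rank is conserved.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - damping) / n for p in pages}
        for p, outs in links.items():
            if outs:
                share = rank[p] / len(outs)
                for q in outs:
                    new[q] += damping * share
            else:                       # dangling page: spread rank everywhere
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
r = pagerank(links)
print(max(r, key=r.get))   # 'c' accumulates the most rank
```

Page "c" wins because it is pointed to by both "a" and "b"; real engines combine such link scores with text relevance, spam signals, and log-based personalization, as the tutorial discusses.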
In addition to these ongoing efforts to further scale up today's keyword-oriented Web search functionality and maintain its high quality, there are also major trends to advance the functionality itself to a more expressive semantic level. Faceted Information Retrieval organizes search results along dimension hierarchies based on metadata or automatic topic detection. Digital libraries, e-science archives, and Deep-Web portals provide steadily growing structured datasets, along with Semantic-Web-style ontologies and other kinds of knowledge sources. Information extraction technology has become more efficient and robust so as to enable large-scale extraction of entities and relationships from natural-language text sources; search and ranking could then be performed in terms of entities rather than pages. Finally, social networks provide means for tagging and organizing Web pages, photos, and videos in a way that resembles structured, albeit schemaless, data. The second part of the tutorial discusses these issues, presents selected approaches towards such richer search experiences, and points out research opportunities and challenges.
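The faceted-retrieval idea above reduces, at its core, to grouping search results along metadata dimensions. A minimal sketch (the result records and facet names are invented for illustration; real systems work over dimension hierarchies, not flat values):

```python
from collections import defaultdict

def facet_counts(results, facet):
    """Faceted-retrieval sketch: group search results by one metadata
    dimension and count the hits per facet value, so the UI can offer
    drill-down options alongside the ranked list."""
    counts = defaultdict(int)
    for r in results:
        counts[r.get(facet, "unknown")] += 1
    return dict(counts)

results = [
    {"title": "Keynote talk",      "year": 2006, "type": "talk"},
    {"title": "Stream joins",      "year": 2006, "type": "paper"},
    {"title": "Probabilistic DBs", "year": 2007, "type": "paper"},
]
print(facet_counts(results, "year"))   # {2006: 2, 2007: 1}
print(facet_counts(results, "type"))   # {'talk': 1, 'paper': 2}
```

Entity-centric search pushes the same idea further: once information extraction yields typed entities and relationships, the facets become semantic dimensions rather than stored metadata fields.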