Data Warehousing and Data Mining (DWDM)

Academic Year:

2011/12, 1st semester

Lecturer:

Johann Gamper and Mouna Kacimi

Teaching assistant:

Matteusz Pawlik

Lectures:

TU 10:30-12:30, TH 10:30-12:30, Room D003

Exercises:

Office hours:

Gamper: TU and TH 09:00-10:00 or email arrangement

Pawlik: MO 13:00-14:00 or email arrangement




Objectives: Enable students to understand and implement classical algorithms in data mining and data warehousing. Students will learn how to analyze the data, identify the problems, and choose the relevant algorithms to apply. Then, they will be able to assess the strengths and weaknesses of the algorithms and analyze their behavior on real datasets.

Course Content: The course covers classical data warehousing (DW) and data mining (DM) algorithms. It is organized in two parts, following the lecture schedule below: the first part (the first six weeks) covers DW, and the second part (the second six weeks) covers DM.

Mini-Projects: The DWDM course emphasizes the application of data warehousing and data mining algorithms. During the semester, students work on a concrete mini-project. Students are welcome to choose any topic from the course or to pick a mini-project from the list of mini-project proposals. Projects are carried out in small groups of two to four people. The mini-project is an integral part of the final exam (cf. the exam web page).

Textbooks:

Data Warehousing

·       M. Golfarelli, S. Rizzi. Data Warehouse Design: Modern Principles and Methodologies. McGraw-Hill, 2009. (recommended!)

·       R. Kimball, "The Data Warehouse Toolkit", 2nd edition.

·       W. H. Inmon, "Building the Data Warehouse", 3rd edition.

·       Selected papers

Data Mining

·       Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", Second Edition, 2006

·       Margaret H. Dunham, "Data Mining: Introductory and Advanced Topics", Prentice Hall, 2003, ISBN: 0-13-088892-3

·       Simon Haykin, "Neural Networks: A Comprehensive Foundation", Prentice Hall, 2005, ISBN: 0-13-147139-2

·       Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, "Introduction to Data Mining", Pearson Addison Wesley, 2005, ISBN: 0-32-132136-7

Lecture Notes: The lecture notes for this course will be updated as we progress through the semester. The lecture notes for the DM part can be found in the syllabus on the data mining web page, and those for the DW part in the syllabus on the data warehousing web page.

Syllabus

·       Data warehousing

·       SQL OLAP extensions

·       Multi-dimensional Join

·       Data warehouse performance

·       Data Analysis and Uncertainty

·       Classification and Prediction

·       Cluster Analysis

·       Association rules

Lectures

Data Warehousing Part

1.     TH, 29.09.11: Data warehousing: introduction, business intelligence, data integration, OLTP vs. OLAP, methodological framework, DW definition [slides]

2.     TU, 04.10.11: Data warehousing: multidimensional modeling, cubes, facts, dimensions, DW design [slides]

3.     TH, 06.10.11: Data warehousing: more about dimensions, star schema, snowflake schema, DW implementation, DW applications [previous lecture]

4.     TU, 11.10.11: Data warehousing: case studies [slides]

5.     TH, 13.10.11: SQL OLAP extensions: SQL query expression, crosstabs, group by extensions, rollup, cube, grouping sets [slides] [sql]

6.     TU, 18.10.11: SQL OLAP extensions: hierarchical cube, ranking, window functions [previous lecture]

7.     TH, 20.10.11: Generalized multi-dimensional join: GMDJ definition, evaluation algorithms [slides] [Akinde et al. 11] [Chatziantoniou et al. 01] [Akinde et al. 02] [sql]

8.     TH, 03.11.11: Generalized multi-dimensional join: subqueries, optimization rules, distributed evaluation [previous lecture]

9.     TU, 08.11.11: DW performance: pre-aggregation, lattice framework, view selection [slides] [Harinarayan et al. 96] [Wu and Buchmann 98]

10.  TH, 10.11.11: DW performance: view selection, view maintenance, bitmap indexing [previous lecture]

11.  TU, 15.11.11: ETL and advanced modeling: ETL process [slides]

12.  TH, 17.11.11: ETL and advanced modeling: changing dimensions, large-scale dimensional modeling [previous lecture]

Data Mining Part

13.  Lecture 1: Introduction (slides). 29/11/2011

14.  Lecture 2: Data (slides). 01/12/2011

15.  Lecture 3: Statistics (slides). 06/12/2011

16.  Lecture 4: Classification - Decision Trees (slides). 13/12/2011

17.  Lecture 5: Classification - Decision Trees + Bayesian Classifiers (slides). 15/12/2011

18.  Lecture 6: Classification - Rule-Based Classification + Lazy Learners (slides). 20/12/2011

19.  Lecture 7: Classification - Prediction + Evaluation (slides). 22/12/2011

20.  Lecture 8: Clustering - Partitioning Methods (slides). 10/01/2012

21.  Lecture 9: Clustering - Hierarchical Methods (slides). 12/01/2012

22.  Lecture 10: Clustering - High Dimensional Clustering + Outlier Mining (slides). 13/01/2012

23.  Lectures 11-12: Frequent Pattern Mining (slides). 17/01/2012 and 19/01/2012

Projects

Description

Working in groups, you will design and implement an example data warehouse. Each group chooses a project domain and goes through several stages of the design process. The project is divided into milestones, each containing a set of tasks to be solved by the group. Each milestone has to be submitted in the form of a report.

Milestones

·       Milestone 1 - Data warehouse design (start: 03.10.2011 14:00, due: 06.11.2011 23:59). Delivery consists of a report that addresses all Milestone 1 tasks and an SQL script (plus any additional files) for creating and populating your data warehouse. If the process differs from simply running a script, remember to include a README file with detailed instructions.

·       Milestone 2 - Data warehouse querying (start: 07.11.2011 14:00, due: 18.12.2011 23:59). Delivery consists of the Milestone 1 report, extended to address all Milestone 2 tasks.

·       Milestone 3 - Data mining challenge (start: 12.12.2011 14:00, due: 15.01.2012 23:59). Delivery consists of a report of at least four pages describing the results obtained.

Exercises

·       Exercises description

·       Milestone 1 exercises

·       Milestone 2 exercises

·       Milestone 3 exercises

Exam

The assessment of the course consists of two parts:

·       project part (60%)

·       theory part (40%)

The project work is assessed through a presentation, a demo, and a final report. The theory part is assessed with a written multiple-choice exam. Both parts must be passed to pass the course.

The written exam is an open-book exam. You are allowed to use lecture notes and course books. The use of notebook computers is not allowed!

A successful project is required to be admitted to the theoretical exam.

A successful project remains valid even if the student fails the theoretical exam.

If a student fails the project part, they have to do a new project for the next exam session. In this case, the teaching assistant does not guarantee support or supervision for the project.

Here are some examples of exam questions: Data Mining