Data Warehousing and Data Mining (DWDM)
Academic Year: 2012/13, 1st semester
Lecturer:
Teaching assistant:
Lectures: TU 10:30-12:30, WE 08:30-10:30, Room E411
Exercises:
Office hours:
- Gamper: WE 10:30-12:30 or by email arrangement
- Pawlik: MO and WE 13:00-14:00 or by email arrangement
Objectives: Students will learn to understand and implement classical models and algorithms in data warehousing and data mining. They will learn how to analyze data, identify problems, and choose the relevant models and algorithms to apply. They will also be able to assess the strengths and weaknesses of various methods and algorithms and to analyze their behavior.
Syllabus
- Data warehousing
- SQL OLAP extensions
- Multi-dimensional Join
- Data warehouse performance
- Data Analysis and Uncertainty
- Classification and Prediction
- Cluster Analysis
- Association rules
Organization
The course is divided into two parts that are taught in parallel: a data warehousing part and a data mining part. The exercises consist of a project done alone or in groups of 2-3 students (more details below).
Textbooks
Data Warehousing
- M. Golfarelli, S. Rizzi. Data Warehouse Design: Modern Principles and Methodologies. McGraw-Hill, 2009. (recommended!)
- R. Kimball, "The Data Warehouse Toolkit", 2nd edition.
- W. H. Inmon, "Building the Data Warehouse", 3rd edition.
- Selected papers
Data Mining
- Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", Second Edition, 2006
- Margaret H. Dunham, "Data Mining: Introductory and Advanced Topics", Prentice Hall, 2003, ISBN: 0-13-088892-3
- Simon Haykin, "Neural Networks: A Comprehensive Foundation", Prentice Hall, 2005, ISBN: 0-13-147139-2
- Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, "Introduction to Data Mining", Pearson Addison Wesley, 2005, ISBN: 0-32-132136-7
Lectures and Lecture Notes
The lecture notes for this course will be updated as we progress through the semester. The lecture notes of the DM part can be found in the syllabus of the data mining web page, while the lecture notes of the DW part can be found in the syllabus of the data warehousing web page.
Data Warehousing Part
| 1. | WE, 03.10.2012 | Data warehousing: introduction, business intelligence, data integration, OLTP vs. OLAP, methodological framework, DW definition [slides] |
| 2. | WE, 10.10.2012 | Data warehousing: multidimensional modeling, cubes, facts, dimensions, DW design [slides] |
| 3. | WE, 17.10.2012 | Data warehousing: more about dimensions, star schema, snowflake schema, DW implementation, DW applications [previous lecture] |
| 4. | WE, 24.10.2012 | Data warehousing: case studies [slides] |
| 5. | WE, 31.10.2012 | SQL OLAP extensions: SQL query expression, crosstabs, group by extensions, rollup, cube, grouping sets [slides] [sql] |
| 6. | WE, 07.11.2012 | SQL OLAP extensions: analytic/window functions, ranking, moving window aggregates, densification [slides] |
| 7. | WE, 14.11.2012 | Generalized multi-dimensional join: GMDJ definition, evaluation algorithms [slides] [Akinde et al. 11] [Chatziantoniou et al. 01] [Akinde et al. 02] [sql] |
| 8. | WE, 21.11.2012 | Generalized multi-dimensional join: subqueries, optimization rules, reducing range to point queries, late initialization of result table, distributed evaluation [slides] |
| 9. | WE, 28.11.2012 | DW performance: pre-aggregation, lattice framework, view selection [slides] [Harinarayan et al. 96] [Wu and Buchmann 98] |
| 10. | WE, 05.12.2012 | DW performance: view selection, view maintenance, bitmap indexing [previous lecture] |
| 11. | WE, 12.12.2012 | Extract-Transform-Load: ETL process, building dimensions and fact tables, extract, transform, load. [slides] |
| 12. | WE, 19.12.2012 | Advanced modeling: changing dimensions, large-scale dimensional modeling, project management. [slides] |
Data Mining Part
| 1. | Tuesday, 09.10.2012 | Data Mining: Introduction [slides] |
| 2. | Tuesday, 16.10.2012 | Data Mining: Getting to know your data [slides] |
| 3. | Tuesday, 23.10.2012 | Data Mining: Statistics [slides] |
| 4-5. | Tuesday, 06.11.2012 and 13.11.2012 | Data Mining: Pattern Mining [slides] |
| 6. | Tuesday, 20.11.2012 | Data Mining: Clustering: Partitioning Methods [slides] |
| 7. | Tuesday, 27.11.2012 | Data Mining: Clustering: Hierarchical Methods [slides] |
| 8. | Tuesday, 04.12.2012 | Data Mining: Density-based Methods and High Dimensional Clustering [slides] |
| 9. | Tuesday, 11.12.2012 | Data Mining: Classification: Decision Trees [slides] |
| 10. | Tuesday, 08.01.2013 | Data Mining: Classification: Bayes Classifier [slides] |
| 11-12. | Tuesday, 15.01.2013 | Data Mining: Classification: Rule-based Classification, Lazy Learners, Prediction, Evaluation (to be updated next week) [slides] |
Projects
Description: During the semester, students do a project that is divided into two modules. Each module lasts six weeks and can be done in either DW or DM. The following options exist:
- 2 modules in DW;
- 2 modules in DM;
- 1 module in DW and 1 module in DM.
The project can be done alone or in groups of 2-3 students.
More details will be explained during the first exercise on Tuesday, October 9, 2012.
Deadline for the decision about the project and the groups: October 19 (send an email to both Mouna Kacimi and Mateusz Pawlik)
Data Warehousing Part
Introduction
Module 1
[NEW] The deadline for Task 6 has been extended till Friday, 30.11.2012, 23:59.
- Module 1 Task 1
- Module 1 Task 2
- Module 1 Task 3
- Module 1 Task 4
- Module 1 Task 5
- Module 1 Task 6
Module 2
Project requirements and guidelines
Data Mining Part
Part 1: Task 1
Organization: you can work alone or team up with other students (groups of 2 students are preferred)
Deadline for the deliverable: November 2, 2012 at 23:59
Part 1: Task 2
Additional references: Apriori algorithm, FP-growth algorithm
Organization: you can work alone or team up with other students (groups of 2 students are preferred)
Deadline for the deliverable: November 26, 2012 at 23:59
Part 2: Task 1 & Task 2
Additional references: k-means, DBSCAN, BIRCH
Organization: you can work alone or team up with other students (groups of 2 students are preferred)
Deadline for the first task: December 21, 2012 at 23:59
Deadline for the second task: January 15, 2013 at 23:59
Exam
The assessment of the course consists of two parts:
- project part (40%): assessed through a presentation, demo and a final report;
- theory part (60%): assessed with a written exam (multiple choice).
Both parts must receive a passing grade to pass the exam. A successful project is required to be admitted to the theoretical exam.
A successful project remains valid even if the student fails the theoretical exam. A student who fails the project part has to do a new project for the next exam session. In this case, supervision by the teaching assistant is not guaranteed.
The written exam is a closed-book exam. The only resources allowed are blank paper, pens, and your head.
Here is an example of an exam [pdf].
Here are the solutions to the questions of the DM part [Data Mining].