Approximation: Theory and Algorithms (ATA)

Lecturer: Nikolaus Augsten
Teaching assistant: Nikolaus Augsten
Teaching language:English
Prerequisites: Basic concepts in databases and XML.
Office hours: Friday 9:00–10:30 (Feb 27 – May 29, unless class is canceled)
Academic Year 2008 / 2009, 2nd Semester


Home | General | Lecture Notes | Mini-Project | Milestones | Project Proposals | Exam |

General

NEW! Course evaluation by students 2006/2007 [PDF], 2007/2008 [PDF], and 2008/2009 [PDF]

Start date: February 27, 2009
Lectures: Friday 10:30-12:30, room E411
Labs: Friday 14:00-15:00, room E331
Exception:

Prerequisites: Students should be familiar with basic concepts in databases (including relational databases, SQL, and relational algebra) and algorithms, as well as having good programming skills in Java. Further basic knowledge about XML is required. This material is taught in the following courses:

Course content: This course will discuss approximate matching techniques for flat strings and hierarchical data (for example, XML). Selected methods will be presented, their effectiveness and efficiency will be discussed. Filtering techniques to improve the efficiency will be introduced. The students will implement approximate matching techniques in a relational database management system.

Syllabus

Lecture notes: The lecture notes for this course will be published as we progress through the semester. There is a set of lecture notes for each week.

Project/exercise: The exercise part of the course consists in the elaboration of a mini-project, which is done in groups of 2 to 3 students.

References: There is no single textbook that covers the entire course. The course material is collected from the following book chapters and research papers:

See also: Course Presentation Form

Lecture Notes

The lecture notes for this course will be created as we progress through the semester. There is a set of lecture notes for each week.
Date Topics Slides   Reading

1. Fr 2009-02-27, 10:30-12:30    General Introduction: Approximate Matching, Course Organization, Introductory Example and Demo [1up] [4up] -
2. Fr 2009-03-06, 10:30-12:30    Edit Distance: Definition, Brute Force Algorithm, Dynamic Programming Algorithm, Edit Distance Variants [1up] [4up] [Nav01]
3. Fr 2009-03-13, 10:30-12:30    q-Gram Distance: Approximate String Join, Lower Bound Filtering, Length Filter, q-Gram Count Filter, q-Gram Position Filter, q-Gram Distance, Experiments [1up] [4up] [Gra01]
4. Fr 2009-03-20, 10:30-12:30    Trees and Relational Databases I: Tree Definition, Adjacency List Encoding [1up] [4up] [Cel04]
5. Mo 2009-03-30, 10:30-12:30    Trees and Relational Databases II: Dewey Encoding, Interval Encoding, XML as a Tree [1up] [4up] [Tat02]
6. Fr 2009-04-03, 10:30-12:30    Tree Edit Distance: Definition, Edit Cost, Edit Mapping, Deriving the Recursive Formulas (I) [1up] [4up] [Apo97]
[Zha89]
7. Fr 2009-04-17, 10:30-12:30    Tree Edit Distance: Deriving the Recursive Formulas (II), Dynamic Programming Algorithm [1up] [4up] [Apo97]
[Zha89]
8. Fr 2009-04-24, 10:30-12:30    Tree Edit Distance: Key Roots, Left-Most Leaf Descendant, Example [1up] [4up] [Zha89]
9. Fr 2009-05-08, 10:30-12:30    Tree Edit Distance: Complexity Pruning: Lower Bound (Traversal Strings), Upper Bound (Constrained Edit Distance) [1up] [4up] [Zha89]
[Guh02]
10. Fr 2009-05-15, 10:30-12:30    Reference Sets: Pruning with Reference Sets [1up] [4up] [Guh02]
11. Fr 2009-05-22, 10:30-12:30    Binary Branch Distance: Lower Bound Proof and Complexity; pq-Gram Distance: Algorithm and Lower Bound Theorem. [1up] [4up] [Yan05]
[Aug05]
12. Fr 2009-05-29, 10:30-12:30    Windowed pq-Grams: Approximate Joins for Unordered Trees (Data-Centric XML) [1up] [4up]
[animation]
[Aug08]

Reading

With the lecture notes for each week also the main references to literature are given. You do not need to study all these papers, but they may be useful to delve deeper into some of the topics.

[Nav01]
Gonzalo Navarro. A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31–88, 2001.  [PDF]
[Gra01]
Luis Gravano, Panagiotis G. Ipeirotis, H. V. Jagadish, Nick Koudas, S. Muthukrishnan, and Divesh Srivastava. Approximate string joins in a database (almost) for free. In Proc. of VLDB, pages 491–500, Roma, Italy, September 2001. Morgan Kaufmann Publishers Inc. [PDF]
[Cel04] (Adjacency List and Interval Encoding)
Joe Celko. Trees and Hierarchies in SQL for Smarties. Morgan Kaufmann, 2004. [library]
[Tat02] (Dewey Encoding)
Igor Tatarinov, Stratis Viglas, Kevin S. Beyer, Jayavel Shanmugasundaram, Eugene J. Shekita, Chun Zhang. Storing and Querying Ordered XML Using a Relational Database System. In Proc. of SIGMOD 2002, pages 204–215. Madison, Wisconsin, June 2002. Press. [PDF]
[Apo97] (Tree Edit Distance)
Alberto Apostolico and Zvi Galil, editors. Pattern Matching Algorithms, chapter Tree Pattern Matching, pages 341–371. Oxford University Press, 1997. [library]
[Zha89] (Tree Edit Distance)
Kaizhong Zhang and Dennis Shasha. Simple fast algorithms for the editing distance between trees and related problems. SIAM Journal on Computing, 18(6):1245–1262, 1989. [SIAM Journals Online]
[Guh02] (Tree Edit Distance: Upper and Lower Bound, Reference Sets)
Sudipto Guha, H. V. Jagadish, Nick Koudas, Divesh Srivastava, and Ting Yu. Approximate XML joins. In Proc. of SIGMOD, pages 287–298,Madison, Wisconsin, 2002. ACM Press. [PDF]
[Yan05] (Binary Branch Distance)
Rui Yang, Panos Kalnis, and Anthony K. H. Tung. Similarity evaluation on tree-structured data. In Proc. of SIGMOD, pages 754–765, Baltimore, Maryland, USA, June 2005. ACM Press. [PDF]
[Aug05] (pq-Gram Distance)
Nikolaus Augsten, Michael Böhlen, and Johann Gamper. Approximate matching of hierarchical data using pq-grams. In Proceedings of the International Conference on Very Large Databases (VLDB), pages 301–312, Trondheim, Norway, September 2005. Morgan Kaufmann Publishers Inc. [PDF]
[Aug08] (pq-Gram Distance)
Nikolaus Augsten, Michael Böhlen, Cyrtis Dyreson, and Johann Gamper. Approximate Joins for Data Centric XML. In Proceedings of the International Conference on Data Engineering (ICDE-08), CancĂșn, Mexico, April 2008. IEEE Computer Society.  [PDF]

Mini-Project

The exercise part of the course consists in the elaboration of a mini-project, which is done in groups of 2 to 3 students. You choose a topic from the list of project proposals, you implement the required algorithms, write a report, and give a presentation.

Report

Hint: Schedule 50% of your time for the report and the experiments.

Project Presentation

Downloads

Download: Basic Java classes to access a MySQL database. Example program (LoadXMLForest) that parses XML and stores it in a relational table. [ZIP]

Usage of xmlgen: You will find xmlgen useful to produce large XML documents for scalability tests. However, you have no control over the structure of the XML documents produced by the tool. xmlgen takes two parameters. The first parameter, size, is a positive number and controls the size of the output XML. The second parameter, filename, is the name of the output file.

Hint: The XML produced with size=1.0 is already 112MB…

How to Set Up Good Experiments?

Setting up good experiments is a challenging task and is usually underestimated. Here are some hints on how to set up your experiments:

  1. What do you want to measure?
    E.g., the run time of a tree algorithm.
  2. What are the parameters?
    E.g., tree size, tree shape (deep/flat), maximum/average fanout.
  3. Start with a hypothesis.
    E.g., the tested algorithm is faster for flat trees than for deep trees.
  4. Create a synthetic data set or use real world data that is useful to verify/falsify your hypothesis.
    E.g., create flat and deep trees of similar size.
  5. Run the experiment and display the result.
    E.g., you show the the depth of the tree on the x-axis and the runtime on the y-axis.
  6. Does the experiment approve your hypothesis, i.e., is the result expected?
    • Expected result: Why did you expect this result?
      E.g., the analytic complexity result for the algorithm depends on the depth of the tree.
    • Unexpected result: Why are the results different from what you have expected? What did you not consider in your hypothesis? Are there characteristics in the test data that hide your test parameters?
      E.g., for the tested algorithm, the number of different labels in the tree has more influence on the runtime than the depth – to isolate the influence of the tree depth, you must construct trees with similar label distribution.
    • Eye-catching characteristics of the results: Peaks, sharp bends, noise in the graph.
      E.g., the peak is due to a nightly backup that happened during the execution of experiment; the sharp upward bend indicates the start of main memory swapping.
  7. Showing the graph is not enough. The crucial question that should always be answered: What can we learn from the experiment?
    E.g., although the algorithm performs slightly better for flat trees than for deep trees, this parameter is negligible compared to other parameters and does not influence the choice of the algorithm.

Milestones

Projects outside the lecture time span. In exceptional cases you can hand in a project outside the standard schedule, respecting the following deadlines:

  1. Agree with me on a project topic at least two months before the exam session in which you would like to defend the exam.
  2. Hand in the implementation and your report at least three weeks before the actual exam date.
You will give your presentation during the oral exam, following the
usual rules. While you have full support during the lecture time span (exercise hours and office hours), I can not guarantee support the rest of the year.

Project Proposals

Proposal 1: q-Gram Distance as a Filter for the Edit Distance (assigned)

The final report defines the problem, shortly introduces edit distance and q-gram distance, explains the filter mechanism and presents the experimental results. The report includes a usage manual for the software-deliverable.

References

Proposal 2: Tree Edit Distance and pq-Gram Distance (assigned)

The final report defines the problem, shortly introduces the tree edit distance and the pq-gram distance, compares the matching quality of both approaches, and presents the experimental results. The report includes a usage manual for the software-deliverable.

References

Proposal 3: Interval Encoding and Dewey Encoding

The final report defines the problem, shortly introduces the Interval and the Dewey Encoding, and presents and discusses the experimental results. The report includes a usage manual for the software-deliverable.

References

Proposal 4: Adjacency List Encoding and Dewey Encoding

The final report defines the problem, shortly introduces the Adjacency List Encoding and the Dewey Encoding, and presents and discusses the experimental results. The report includes a usage manual for the software-deliverable.

References

Proposal 5: Binary Branch Edit Distance Filter

The final report defines the problem, shortly introduces the Tree Edit Distance and the Binary Branch Filter, and presents and discusses the experimental results. The report includes a usage manual for the software-deliverable.

References

Proposal 6: Traversal String Edit Distance Filter

The final report defines the problem, shortly introduces the Tree Edit Distance and the Traversal String Filter, and presents and discusses the experimental results. The report includes a usage manual for the software-deliverable.

References

Proposal 7: Global Greedy Matching and Stable Marriage

The final report defines the problem, shortly introduces the Global Greedy Matching and the Stable Marriage algorithm, and presents and discusses the experimental results. The report includes a usage manual for the software-deliverable.

References

Proposal 8: Tree Edit Distance and Windowed pq-Gram Distance

The final report defines the problem, shortly introduces the tree edit distance and the windowed pq-gram distance, compares the robustness to sibling permutations of both approaches, and presents the experimental results. The report includes a usage manual for the software-deliverable.

References

Exam

Project (40%) and oral exam (60%).

Project: The group defends the project by

The positive grade for the project counts for all 3 regular exam sessions. After that, the student has to defend a new project.

Oral Exam: There will be no midterms. The oral exam at the end will cover all topics treated in the course. Each student will be examined individually for 15 minutes. The student will have to answer a question by example.


Oral Exam Details: Here is a list of exam questions [PDF]. The student will randomly choose one question from a list of exam questions (which will be published) and present the topic in 15 minutes. The student will be evaluated according to the correctness, clarity, relevance, and completness of the answer, and the proper use of the terminology. The examiner may ask follow-up questions to verify the thorough understanding of the chosen topic by the student.


Final grade. Both parts (project and oral exam) are graded with up to 30 points. Each part has to be passed (at least 18 points). The student has to pass the project before taking the oral exam. The final grade weights the project with 40% and the oral exam with 60%:

final_grade = grade_project * 0.4 + grade_oral_exam * 0.6