Approximation: Theory and Algorithms (ATA)
Lecturer: | Nikolaus Augsten |
Teaching assistant: | Nikolaus Augsten |
Teaching language: | English |
Prerequisites: | Basic concepts in databases and XML. |
Office hours: | Friday 9:00–10:30 (Feb 27 – May 29, unless class is canceled) |
Academic Year | 2008 / 2009, 2nd Semester |
General
NEW! Course evaluation by students 2006/2007 [PDF], 2007/2008 [PDF], and 2008/2009 [PDF]
Start date: February 27, 2009
Lectures: Friday 10:30-12:30, room E411
Labs: Friday 14:00-15:00, room E331
Exception:
- lab (E331) Wed, March 25 (same time)
- lecutre (E411)/office hour on Mon, March 30 (same time)
- lecutre (E221)/lab (E331)/office hour on Wed, April 22 (same time)
- no lecture/lab/office hour on March 27 and April 24
Prerequisites: Students should be familiar with basic concepts in databases (including relational databases, SQL, and relational algebra) and algorithms, as well as having good programming skills in Java. Further basic knowledge about XML is required. This material is taught in the following courses:
- Introduction to Programming
- Introduction to Databases
- Data Structures and Algorithms
- (XML and Semistructured Databases)
Course content: This course will discuss approximate matching techniques for flat strings and hierarchical data (for example, XML). Selected methods will be presented, their effectiveness and efficiency will be discussed. Filtering techniques to improve the efficiency will be introduced. The students will implement approximate matching techniques in a relational database management system.
Syllabus
- Introduction: Approximate Matching
- Approximate Joins and Stable Matching Algorithms
- String Edit Distance: Algorithm and Complexity
- q-Gram Distance for Strings: Algorithm and Complexity
- Implementation of q-Grams in a Relational Database
- q-Grams as a Filter for the Edit Distance
- Hierarchical Data and XML in Relational Database Systems
- Tree Edit Distance: Algorithms and Complexity
- Filters for the Tree Edit Distance
- pq-Gram Distance for Ordered Trees: Algorithm and Complexity
- Similarity of Unordered Trees (Data-Centric XML)
Lecture notes: The lecture notes for this course will be published as we progress through the semester. There is a set of lecture notes for each week.
Project/exercise: The exercise part of the course consists in the elaboration of a mini-project, which is done in groups of 2 to 3 students.
References: There is no single textbook that covers the entire course. The course material is collected from the following book chapters and research papers:
- Alberto Apostolico and Zvi Galil, editors. Pattern Matching Algorithms, chapter Tree Pattern Matching, pages 341–371. Oxford University Press, 1997. [library]
- Nikolaus Augsten, Michael Böhlen, and Johann Gamper. Approximate matching of hierarchical data using pq-grams. In Proceedings of the International Conference on Very Large Databases (VLDB), pages 301–312, Trondheim, Norway, September 2005. Morgan Kaufmann Publishers Inc. [PDF]
- Nikolaus Augsten, Michael Böhlen, Curtis Dyreson, and Johann Gamper. Approximate Joins for Data-Centric XML. In Proceedings of the International Conference on Data Engineering (ICDE), Cancun, Mexico, April 2008. [PDF]
- Minos Garofalakis and Amit Kumar. XML stream processing using tree-edit distance embeddings. ACM Transactions on Database Systems, 30(1):279–332, 2005. [ACM Portal]
- Luis Gravano, Panagiotis G. Ipeirotis, H. V. Jagadish, Nick Koudas, S. Muthukrishnan, and Divesh Srivastava. Approximate string joins in a database (almost) for free. In Proc. of VLDB, pages 491–500, Roma, Italy, September 2001. Morgan Kaufmann Publishers Inc. [PDF]
- Sudipto Guha, H. V. Jagadish, Nick Koudas, Divesh Srivastava, and Ting Yu. Approximate XML joins. In Proc. of SIGMOD, pages 287–298,Madison, Wisconsin, 2002. ACM Press. [PDF]
- Gonzalo Navarro. A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31–88, 2001. [PDF]
- Esko Ukkonen. Approximate string-matching with q-grams and maximal matches. Theoretical Computer Science, 92(1):191–211, January 1992. [PDF]
- Rui Yang, Panos Kalnis, and Anthony K. H. Tung. Similarity evaluation on tree-structured data. In Proc. of SIGMOD, pages 754–765, Baltimore, Maryland, USA, June 2005. ACM Press. [PDF]
- Kaizhong Zhang and Dennis Shasha. Simple fast algorithms for the editing distance between trees and related problems. SIAM Journal on Computing, 18(6):1245–1262, 1989. [SIAM Journals Online]
See also: Course Presentation Form
Lecture Notes
The lecture notes for this course will be created as we progress through the semester. There is a set of lecture notes for each week.
Date | Topics | Slides | Reading | |||
1. | Fr 2009-02-27, 10:30-12:30 | General Introduction: Approximate Matching, Course Organization, Introductory Example and Demo | [1up] [4up] | - | ||
2. | Fr 2009-03-06, 10:30-12:30 | Edit Distance: Definition, Brute Force Algorithm, Dynamic Programming Algorithm, Edit Distance Variants | [1up] [4up] | [Nav01] | ||
3. | Fr 2009-03-13, 10:30-12:30 | q-Gram Distance: Approximate String Join, Lower Bound Filtering, Length Filter, q-Gram Count Filter, q-Gram Position Filter, q-Gram Distance, Experiments | [1up] [4up] | [Gra01] | ||
4. | Fr 2009-03-20, 10:30-12:30 | Trees and Relational Databases I: Tree Definition, Adjacency List Encoding | [1up] [4up] | [Cel04] | ||
5. | Mo 2009-03-30, 10:30-12:30 | Trees and Relational Databases II: Dewey Encoding, Interval Encoding, XML as a Tree | [1up] [4up] | [Tat02] | ||
6. | Fr 2009-04-03, 10:30-12:30 | Tree Edit Distance: Definition, Edit Cost, Edit Mapping, Deriving the Recursive Formulas (I) | [1up] [4up] | [Apo97] [Zha89] |
||
7. | Fr 2009-04-17, 10:30-12:30 | Tree Edit Distance: Deriving the Recursive Formulas (II), Dynamic Programming Algorithm | [1up] [4up] | [Apo97] [Zha89] |
||
8. | Fr 2009-04-24, 10:30-12:30 | Tree Edit Distance: Key Roots, Left-Most Leaf Descendant, Example | [1up] [4up] | [Zha89] | ||
9. | Fr 2009-05-08, 10:30-12:30 | Tree Edit Distance: Complexity Pruning: Lower Bound (Traversal Strings), Upper Bound (Constrained Edit Distance) | [1up] [4up] | [Zha89] [Guh02] |
||
10. | Fr 2009-05-15, 10:30-12:30 | Reference Sets: Pruning with Reference Sets | [1up] [4up] | [Guh02] | ||
11. | Fr 2009-05-22, 10:30-12:30 | Binary Branch Distance: Lower Bound Proof and Complexity; pq-Gram Distance: Algorithm and Lower Bound Theorem. | [1up] [4up] | [Yan05] [Aug05] | ||
12. | Fr 2009-05-29, 10:30-12:30 | Windowed pq-Grams: Approximate Joins for Unordered Trees (Data-Centric XML) |
[1up] [4up]
[animation] |
[Aug08] | ||
Reading
With the lecture notes for each week also the main references to literature are given. You do not need to study all these papers, but they may be useful to delve deeper into some of the topics.
- [Nav01]
- Gonzalo Navarro. A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31–88, 2001. [PDF]
- [Gra01]
- Luis Gravano, Panagiotis G. Ipeirotis, H. V. Jagadish, Nick Koudas, S. Muthukrishnan, and Divesh Srivastava. Approximate string joins in a database (almost) for free. In Proc. of VLDB, pages 491–500, Roma, Italy, September 2001. Morgan Kaufmann Publishers Inc. [PDF]
- [Cel04] (Adjacency List and Interval Encoding)
- Joe Celko. Trees and Hierarchies in SQL for Smarties. Morgan Kaufmann, 2004. [library]
- [Tat02] (Dewey Encoding)
- Igor Tatarinov, Stratis Viglas, Kevin S. Beyer, Jayavel Shanmugasundaram, Eugene J. Shekita, Chun Zhang. Storing and Querying Ordered XML Using a Relational Database System. In Proc. of SIGMOD 2002, pages 204–215. Madison, Wisconsin, June 2002. Press. [PDF]
- [Apo97] (Tree Edit Distance)
- Alberto Apostolico and Zvi Galil, editors. Pattern Matching Algorithms, chapter Tree Pattern Matching, pages 341–371. Oxford University Press, 1997. [library]
- [Zha89] (Tree Edit Distance)
- Kaizhong Zhang and Dennis Shasha. Simple fast algorithms for the editing distance between trees and related problems. SIAM Journal on Computing, 18(6):1245–1262, 1989. [SIAM Journals Online]
- [Guh02] (Tree Edit Distance: Upper and Lower Bound, Reference Sets)
- Sudipto Guha, H. V. Jagadish, Nick Koudas, Divesh Srivastava, and Ting Yu. Approximate XML joins. In Proc. of SIGMOD, pages 287–298,Madison, Wisconsin, 2002. ACM Press. [PDF]
- [Yan05] (Binary Branch Distance)
- Rui Yang, Panos Kalnis, and Anthony K. H. Tung. Similarity evaluation on tree-structured data. In Proc. of SIGMOD, pages 754–765, Baltimore, Maryland, USA, June 2005. ACM Press. [PDF]
- [Aug05] (pq-Gram Distance)
- Nikolaus Augsten, Michael Böhlen, and Johann Gamper. Approximate matching of hierarchical data using pq-grams. In Proceedings of the International Conference on Very Large Databases (VLDB), pages 301–312, Trondheim, Norway, September 2005. Morgan Kaufmann Publishers Inc. [PDF]
- [Aug08] (pq-Gram Distance)
- Nikolaus Augsten, Michael Böhlen, Cyrtis Dyreson, and Johann Gamper. Approximate Joins for Data Centric XML. In Proceedings of the International Conference on Data Engineering (ICDE-08), CancĂșn, Mexico, April 2008. IEEE Computer Society. [PDF]
Mini-Project
The exercise part of the course consists in the elaboration of a mini-project, which is done in groups of 2 to 3 students. You choose a topic from the list of project proposals, you implement the required algorithms, write a report, and give a presentation.
Report
- Elements:
- Motivation
- Problem definition
- Short introduction to the solution (1-2 paragraphs per method)
- Experiments and Discussion
- Manual (short installation and usage instruction for your code deliverable)
- Approximately 6 pages
- Be concise!
Hint: Schedule 50% of your time for the report and the experiments.
Project Presentation
- Implementation demo: 5 minutes
- Presentation: 10 minutes
- Discussion: 5-10 minutes
Downloads
Download: Basic Java classes to access a MySQL database. Example program (LoadXMLForest) that parses XML and stores it in a relational table. [ZIP]
Usage of xmlgen: You will find xmlgen useful to produce large XML documents for scalability tests. However, you have no control over the structure of the XML documents produced by the tool. xmlgen takes two parameters. The first parameter, size, is a positive number and controls the size of the output XML. The second parameter, filename, is the name of the output file.
- Windows: xmlgen /f size /o filename
- Linux: ./xmlgen -f size -o filename
Hint: The XML produced with size=1.0 is already 112MB…
How to Set Up Good Experiments?
Setting up good experiments is a challenging task and is usually underestimated. Here are some hints on how to set up your experiments:
- What do you want to measure?
E.g., the run time of a tree algorithm. - What are the parameters?
E.g., tree size, tree shape (deep/flat), maximum/average fanout. - Start with a hypothesis.
E.g., the tested algorithm is faster for flat trees than for deep trees. - Create a synthetic data set or use real world data that is
useful to verify/falsify your hypothesis.
E.g., create flat and deep trees of similar size. - Run the experiment and display the result.
E.g., you show the the depth of the tree on the x-axis and the runtime on the y-axis. - Does the experiment approve your hypothesis, i.e., is the result
expected?
- Expected result: Why did you expect this result?
E.g., the analytic complexity result for the algorithm depends on the depth of the tree. - Unexpected result: Why are the results different from
what you have expected? What did you not consider in your hypothesis?
Are there characteristics in the test data that hide your test parameters?
E.g., for the tested algorithm, the number of different labels in the tree has more influence on the runtime than the depth – to isolate the influence of the tree depth, you must construct trees with similar label distribution. - Eye-catching characteristics of the results: Peaks, sharp
bends, noise in the graph.
E.g., the peak is due to a nightly backup that happened during the execution of experiment; the sharp upward bend indicates the start of main memory swapping.
- Expected result: Why did you expect this result?
- Showing the graph is not enough. The crucial question that
should always be answered:
What can we learn from the experiment?
E.g., although the algorithm performs slightly better for flat trees than for deep trees, this parameter is negligible compared to other parameters and does not influence the choice of the algorithm.
Milestones
- Milestone 1 — Mar 13 (Lab 3)
- form groups
- choose project
- Milestone 2 — Apr 22 (Lab 8)
- hand in implementation
- hand in report (without experimental part)
- Milestone 3 — May 29 (Lab 12)
- presentation
- hand in final version of report
Projects outside the lecture time span. In exceptional cases you can hand in a project outside the standard schedule, respecting the following deadlines:
- Agree with me on a project topic at least two months before the exam session in which you would like to defend the exam.
- Hand in the implementation and your report at least three weeks before the actual exam date.
Project Proposals
Proposal 1: q-Gram Distance as a Filter for the Edit Distance (assigned)
- Implement the edit distance for strings.
- Implement the q-gram distance for strings (database implementation).
- Perform an approximate join using the edit distance and use the q-gram distance as a filter.
- Experiments:
- compare the efficieny of the edit and the q-gram distance (vary the string length and the number of strings);
- runtime performance of the edit distance with and without filter;
- effectiveness of the filter on the real world datasets.
The final report defines the problem, shortly introduces edit distance and q-gram distance, explains the filter mechanism and presents the experimental results. The report includes a usage manual for the software-deliverable.
References
- Luis Gravano, Panagiotis G. Ipeirotis, H. V. Jagadish, Nick Koudas, S. Muthukrishnan, and Divesh Srivastava. Approximate string joins in a database (almost) for free. In Proc. of VLDB, pages 491–500, Roma, Italy, September 2001. Morgan Kaufmann Publishers Inc. [PDF]
- Gonzalo Navarro. A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31–88, 2001. [PDF]
- Esko Ukkonen. Approximate string-matching with q-grams and maximal matches. Theoretical Computer Science, 92(1):191–211, January 1992. [PDF]
Proposal 2: Tree Edit Distance and pq-Gram Distance (assigned)
- Implement the tree edit distance.
- Implement the pq-gram distance for ordered trees.
- Compare the efficiency of the pq-gram distance and the tree edit distance for XML files of different size. The XML files can be produced with xmlgen.
- Find examples trees where
- the tree edit distance is intuitive / not intuitive;
- the pq-gram distance is intuitive / not intuitive;
- the pq-gram distance is a good / bad approximation of the tree edit distance.
References
- Nikolaus Augsten, Michael Böhlen, and Johann Gamper. Approximate matching of hierarchical data using pq-grams. In Proceedings of the International Conference on Very Large Databases (VLDB), pages 301–312, Trondheim, Norway, September 2005. Morgan Kaufmann Publishers Inc. [PDF]
- Alberto Apostolico and Zvi Galil, editors. Pattern Matching Algorithms, chapter Tree Pattern Matching, pages 341–371. Oxford University Press, 1997. [library]
Proposal 3: Interval Encoding and Dewey Encoding
- Implement the Interval Encoding (sparse numbering) for storing ordered trees in a database.
- Implement the Dewey Encoding for storing ordered trees in a database.
- Compare the efficiency of both encodings with respect to:
- Node rename, node insertion, node deletion
- Preorder, postorder, and breath first traversal
References
- Quanzhong Li, Bongki Moon. Indexing and Querying XML Data for Regular Path Expressions. In Proceedings of the International Conference on Very Large Databases (VLDB), pages 261–370. Roma, Italy, September 2001. Morgan Kaufmann. [PDF]
- Igor Tatarinov, Stratis Viglas, Kevin S. Beyer, Jayavel Shanmugasundaram, Eugene J. Shekita, Chun Zhang. Storing and Querying Ordered XML Using a Relational Database System. In Proc. of SIGMOD 2002, pages 204–215. Madison, Wisconsin, June 2002. Press. [PDF]
Proposal 4: Adjacency List Encoding and Dewey Encoding
- Implement the Adjacency List Encoding for storing ordered trees in a database.
- Implement the Dewey Encoding for storing ordered trees in a database.
- Compare the efficiency of both encodings with respect to:
- Node rename, node insertion, node deletion
- Preorder, postorder, and breath first traversal
References
- Igor Tatarinov, Stratis Viglas, Kevin S. Beyer, Jayavel Shanmugasundaram, Eugene J. Shekita, Chun Zhang. Storing and Querying Ordered XML Using a Relational Database System. In Proc. of SIGMOD 2002, pages 204–215. Madison, Wisconsin, June 2002. Press. [PDF]
Proposal 5: Binary Branch Edit Distance Filter
- Implement the Tree Edit Distance.
- Implement the Binary Branch Filter (a lower bound of the Tree Edit Distance).
- Experiments:
- runtime performance of the tree edit distance with and without filter;
- effectiveness of the filter on example datasets.
References
- Rui Yang, Panos Kalnis, and Anthony K. H. Tung. Similarity evaluation on tree-structured data. In Proc. of SIGMOD, pages 754–765, Baltimore, Maryland, USA, June 2005. ACM Press. [PDF]
- Alberto Apostolico and Zvi Galil, editors. Pattern Matching Algorithms, chapter Tree Pattern Matching, pages 341–371. Oxford University Press, 1997. [library]
Proposal 6: Traversal String Edit Distance Filter
- Implement the Tree Edit Distance.
- Implement the Traversal String Filter (a lower bound of the Tree Edit Distance).
- Experiments:
- runtime performance of the tree edit distance with and without filter;
- effectiveness of the filter on example datasets.
References
- Sudipto Guha, H. V. Jagadish, Nick Koudas, Divesh Srivastava, and Ting Yu. Approximate XML Joins. In Proc. of SIGMOD, pages 287–298,Madison, Wisconsin, 2002. ACM Press. [PDF]
- Alberto Apostolico and Zvi Galil, editors. Pattern Matching Algorithms, chapter Tree Pattern Matching, pages 341–371. Oxford University Press, 1997. [library]
Proposal 7: Global Greedy Matching and Stable Marriage
- Implement the Global Greedy Matching algorithm.
- Implement the Stable Marriage algorithm.
- Experiments:
- compare the runtime performance of both matching algorithms;
- find examples where the Stable Marriage algorithm depends on the matching order;
- compare the matching results of Stable Marriage with the results of Global Greedy, in particular where the matching order matters;
References
- D. Gale, L. S. Shapley, College Admissions and the Stability of Marriage, The American Mathematical Monthly 69(1):9–15, 1962. [PDF]
Proposal 8: Tree Edit Distance and Windowed pq-Gram Distance
- Implement the tree edit distance.
- Implement the windowed pq-gram distance.
- Compare the robustness of the windowed pq-gram distance and the tree edit distance (with tree sorting) to permutations in the sibling order. Find examples where both approaches do well and where one of them fails.
References
- Nikolaus Augsten, Michael Böhlen, Curtis Dyreson, and Johann Gamper. Approximate Joins for Data-Centric XML. In Proceedings of the International Conference on Data Engineering (ICDE), Cancun, Mexico, April 2008. [PDF]
- Alberto Apostolico and Zvi Galil, editors. Pattern Matching Algorithms, chapter Tree Pattern Matching, pages 341–371. Oxford University Press, 1997. [library]
Exam
Project (40%) and oral exam (60%).Project: The group defends the project by
- handing in the resulting code;
- handing in the project report;
- giving a presentation of the project. The presentation is followed by a discussion.
Oral Exam: There will be no midterms. The oral exam at the end will cover all topics treated in the course. Each student will be examined individually for 15 minutes. The student will have to answer a question by example.
Oral Exam Details: Here is a list of exam questions [PDF]. The student will randomly choose one question from a list of exam questions (which will be published) and present the topic in 15 minutes. The student will be evaluated according to the correctness, clarity, relevance, and completness of the answer, and the proper use of the terminology. The examiner may ask follow-up questions to verify the thorough understanding of the chosen topic by the student.
Final grade. Both parts (project and oral exam) are graded with up to 30 points. Each part has to be passed (at least 18 points). The student has to pass the project before taking the oral exam. The final grade weights the project with 40% and the oral exam with 60%:
final_grade = grade_project * 0.4 + grade_oral_exam * 0.6