Similarity Search

Lecturer: Nikolaus Augsten
Teaching assistant: Nikolaus Augsten
Teaching language:English
Prerequisites: Basic concepts in databases and XML.
Office hours: Friday 8:45–10:00, March 2–April 27, except: April 6; Wednesday 8:45–10:00, May 9–June 6 (office 2.19)
Academic Year 2011 / 2012, 2nd Semester


Home | General | Lecture Notes | Mini-Project | Milestones | Project Proposals | Exam |

General

NEW: course evaluation by students 2011/2012 [PDF].

Course evaluation by students 2009/2010 [PDF].

Schedule: Usually both lecture and lab take place on Thursdays:

Please check the official schedule for exceptions.

Prerequisites: Students should be familiar with basic concepts in databases (including relational databases, SQL, and relational algebra) and algorithms, as well as having good programming skills in Java. Further basic knowledge about XML is required. This material is taught in the following courses:

Course content: This course will discuss similarity search techniques for flat strings and hierarchical data (for example, XML). Selected methods will be presented, their effectiveness and efficiency will be discussed. Filtering techniques to improve the efficiency will be introduced. The students will implement similarity joins in a relational database management system.

Lecture notes: The lecture notes for this course will be published as we progress through the semester. There is a set of lecture notes for each week.

Literature: There is no single textbook that covers the entire course. The course material is collected from various book chapters and research papers. For each topic, a link to relevant literature is provided with the respective lecture notes.

Project/exercise: The exercise part of the course consists in the elaboration of a mini-project, which is done in groups of 2 to 3 students.

See also: Course Presentation Form

Lecture Notes

The lecture notes for this course will be created as we progress through the semester. There is a set of lecture notes for each week.
Date Topics Slides   Literature

1. Thu 2012-03-01 General Introduction: Similarity Search, Course Organization, Introductory Example and Demo [1up] [4up] -
2. Thu 2012-03-08 Edit Distance: Definition, Brute Force Algorithm, Dynamic Programming Algorithm, Edit Distance Variants [1up] [4up] [Nav01]
3. Thu 2012-03-15 q-Gram Distance: Approximate String Join, Lower Bound Filtering, Length Filter, q-Gram Count Filter, q-Gram Position Filter, q-Gram Distance, Experiments [1up] [4up] [Gra01]
4. Thu 2012-03-22 Trees and Relational Databases: Tree Definition, Adjacency List Encoding, Dewey Encoding, Interval Encoding, XML as a Tree [1up] [4up] [Cel04][Tat02]
5. Thu 2012-03-29 Tree Edit Distance: Definition, Edit Cost, Edit Mapping, Deriving the Recursive Formulas (I) [1up] [4up] [Apo97] [Zha89][Paw11]
6. Thu 2012-04-12 Tree Edit Distance: Deriving the Recursive Formulas (II), Dynamic Programming Algorithm (→ Unit 5)
7. Thu 2012-04-19 Tree Edit Distance: Example, Complexity (→ Unit 5)
8. Thu 2012-04-26 Pruning: Traversal String Lower Bound, Constrained Edit Distance Upper Bound [1up] [4up] [Guh02]
9. Thu 2012-05-10 Reference Sets: Pruning with Reference Sets [1up] [4up] [Guh02]
10. Thu 2012-05-17 Binary Branch Distance: Algorithm, Lower Bound Proof, Complexity [1up] [4up] [Yan05]
11. Thu 2012-05-24 pq-Gram Distance: Algorithm and Lower Bound Theorem. [1up] [4up] [Aug05]
12. Thu 2012-06-07 Metric Index Structures: Insertion and Range Queries with M-Trees [1up] [4up] [Zez06]

Literature

With the lecture notes for each week also the main references to literature are given. You do not need to study all these papers, but they may be useful to delve deeper into some of the topics.

[Nav01]
Gonzalo Navarro. A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31–88, 2001.  [PDF]
[Gra01]
Luis Gravano, Panagiotis G. Ipeirotis, H. V. Jagadish, Nick Koudas, S. Muthukrishnan, and Divesh Srivastava. Approximate string joins in a database (almost) for free. In Proc. of VLDB, pages 491–500, Roma, Italy, September 2001. Morgan Kaufmann Publishers Inc. [PDF]
[Cel04] (Adjacency List and Interval Encoding)
Joe Celko. Trees and Hierarchies in SQL for Smarties. Morgan Kaufmann, 2004. [library]
[Tat02] (Dewey Encoding)
Igor Tatarinov, Stratis Viglas, Kevin S. Beyer, Jayavel Shanmugasundaram, Eugene J. Shekita, Chun Zhang. Storing and Querying Ordered XML Using a Relational Database System. In Proc. of SIGMOD 2002, pages 204–215. Madison, Wisconsin, June 2002. Press. [PDF]
[Apo97] (Tree Edit Distance)
Alberto Apostolico and Zvi Galil, editors. Pattern Matching Algorithms, chapter Tree Pattern Matching, pages 341–371. Oxford University Press, 1997. [library]
[Zha89] (Tree Edit Distance)
Kaizhong Zhang and Dennis Shasha. Simple fast algorithms for the editing distance between trees and related problems. SIAM Journal on Computing, 18(6):1245–1262, 1989. [SIAM Journals Online]
[Paw11] (Tree Edit Distance)
Mateusz Pawlik, Nikolaus Augsten. RTED: A Robust Algorithm for the Tree Edit Distance. PVLDB 5(4):334–345, 2011. [PDF]
Tree edit distance resources: tutorial, algorithms, source code
[Guh02] (Tree Edit Distance: Upper and Lower Bound, Reference Sets)
Sudipto Guha, H. V. Jagadish, Nick Koudas, Divesh Srivastava, and Ting Yu. Approximate XML joins. In Proc. of SIGMOD, pages 287–298,Madison, Wisconsin, 2002. ACM Press. [PDF]
[Yan05] (Binary Branch Distance)
Rui Yang, Panos Kalnis, and Anthony K. H. Tung. Similarity evaluation on tree-structured data. In Proc. of SIGMOD, pages 754–765, Baltimore, Maryland, USA, June 2005. ACM Press. [PDF]
[Aug05] (pq-Gram Distance)
Nikolaus Augsten, Michael Böhlen, and Johann Gamper. Approximate matching of hierarchical data using pq-grams. In Proceedings of the International Conference on Very Large Databases (VLDB), pages 301–312, Trondheim, Norway, September 2005. Morgan Kaufmann Publishers Inc. [PDF]
[Zez06] (M-Trees)
Pavel Zezula, Giuseppe Amato, Vlastislav Dohnal, Michal Batko. Similarity Search - The Metric Space Appraoch. Springer 2006.

Mini-Project

Slides of Lab 1 -- Project Proposals, Database Connection.

The exercise part of the course consists in the elaboration of a mini-project, which is done in groups of 2 to 3 students. You choose a topic from a list of project proposals, you implement the required algorithms, write a report, and give a presentation.

Report

Hint: Schedule 50% of your time for the report and the experiments.

Project Presentation

How to Set Up Good Experiments?

Setting up good experiments is a challenging task and is usually underestimated. Here are some hints on how to set up your experiments:

  1. What do you want to measure?
    E.g., the run time of a tree algorithm.
  2. What are the parameters?
    E.g., tree size, tree shape (deep/flat), maximum/average fanout.
  3. Start with a hypothesis.
    E.g., the tested algorithm is faster for flat trees than for deep trees.
  4. Create a synthetic data set or use real world data that is useful to verify/falsify your hypothesis.
    E.g., create flat and deep trees of similar size.
  5. Run the experiment and display the result.
    E.g., you show the the depth of the tree on the x-axis and the runtime on the y-axis.
  6. Does the experiment approve your hypothesis, i.e., is the result expected?
    • Expected result: Why did you expect this result?
      E.g., the analytic complexity result for the algorithm depends on the depth of the tree.
    • Unexpected result: Why are the results different from what you have expected? What did you not consider in your hypothesis? Are there characteristics in the test data that hide your test parameters?
      E.g., for the tested algorithm, the number of different labels in the tree has more influence on the runtime than the depth – to isolate the influence of the tree depth, you must construct trees with similar label distribution.
    • Eye-catching characteristics of the results: Peaks, sharp bends, noise in the graph.
      E.g., the peak is due to a nightly backup that happened during the execution of experiment; the sharp upward bend indicates the start of main memory swapping.
  7. Showing the graph is not enough. The crucial question that should always be answered: What can we learn from the experiment?
    E.g., although the algorithm performs slightly better for flat trees than for deep trees, this parameter is negligible compared to other parameters and does not influence the choice of the algorithm.

Milestones

Projects outside the lecture time span. In exceptional cases you can hand in a project outside the standard schedule, respecting the following deadlines:

  1. Agree with me on a project topic at least two months before the exam session in which you would like to defend the exam.
  2. Hand in the implementation and your report at least three weeks before the actual exam date.
The deadlines are firm - no exceptions. You will give your presentation during the oral exam, following the
usual rules. While you have full support during the lecture time span (exercise hours and office hours), I can not guarantee support the rest of the year.

Project Proposals

Proposal 1: q-Gram Filter for String Edit Distance (assigned)

The final report defines the problem, shortly introduces edit distance and q-gram distance, explains the filter mechanism and presents the experimental results. The report includes a usage manual for the software-deliverable.

References

Proposal 2: Binary Branch Filter for Tree Edit Distance (assigned)

The final report defines the problem, shortly introduces the tree edit distance and the binary branch filter, and presents and discusses the experimental results. The report includes a usage manual for the software-deliverable.

References

Proposal 3: pq-Gram Filter for Tree Edit Distance (assigned)

The final report defines the problem, shortly introduces the tree edit distance and the pq-gram distance, and presents and discusses the experimental results. The report includes a usage manual for the software-deliverable.

References

Proposal 4: Traversal String Filter for Tree Edit Distance

The final report defines the problem, shortly introduces the tree edit distance and the traversal string filter, and presents and discusses the experimental results. The report includes a usage manual for the software-deliverable.

References

Proposal 5: Tree Edit Distance and Windowed pq-Gram Distance

The final report defines the problem, shortly introduces the tree edit distance and the windowed pq-gram distance, compares the robustness to sibling permutations of both approaches, and presents the experimental results. The report includes a usage manual for the software-deliverable.

References

Proposal 6: Comparison of Matching Algorithms (assigned)

The final report defines the problem, shortly introduces the matching algorithms, and presents and discusses the experimental results. The report includes a usage manual for the software-deliverable.

References

Some Useful Links to Test Data

Here are some links to test data that you might find useful. The list is not exhaustive: you might find other datasets more useful in your specific project. For some projects you will need to produce synthetic test data with a data generator that you write.

Strings

Wikipedia Missellings: Wikipedia provides a list of common misspellings (http://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings).

Corpora for Missellings: Roger Mitton provides files with misspelled words (http://www.dcs.bbk.ac.uk/~ROGER/corpora.html).

Trees

Generator for large XML documents: You will find xmlgen (http://www.xml-benchmark.org/downloads.html) useful to produce large XML documents for scalability tests. However, you have no control over the structure of the XML documents produced by the tool. xmlgen takes two parameters. The first parameter, size, is a positive number and controls the size of the output XML. The second parameter, filename, is the name of the output file.

Hint: The XML produced with size=1.0 is already 112MB…

Bolzano Address Trees: The address trees of Bolzano (http://www.inf.unibz.it/~augsten/publ/tods10/bolzano-address-trees.zip) are useful for tree matching, but you may also use the street names (labels of the tree root nodes) for string matching (see slides of Lecture 1).

XML Documents: You find a collection of real world XML documents at the UW XML Repository. Often these documents consist of one single, very large XML file. You get a set of small documents by removing the root node, which often glues together many small documents.

Exam

Project (40%) and oral exam (60%).

Project: The group defends the project by

The positive grade for the project counts for all 3 regular exam sessions. After that, the student has to defend a new project.

Oral Exam: There will be no midterms. The oral exam at the end will cover all topics treated in the course.

Oral Exam Details: Here is a list of exam questions [PDF]. Each student will be examined individually for 15 minutes. The student will randomly pick one question from a list of exam questions (which will be published). The student has to present the respective topic and answer the question by example. The student will be evaluated according to the correctness, clarity, relevance, and completness of the answer, and the proper use of the terminology. The examiner may ask follow-up questions to verify the thorough understanding of the topic.

Final grade. Both parts (project and oral exam) are graded with up to 30 points. Each part has to be passed (at least 18 points). The student has to pass the project before taking the oral exam. The final grade weights the project with 40% and the oral exam with 60%:

final_grade = grade_project * 0.4 + grade_oral_exam * 0.6


Home | General | Lecture Notes | Mini-Project | Milestones | Project Proposals | Exam |