Similarity Search (WS)
| Lecturer: | Nikolaus Augsten |
| Teaching assistant: | Nikolaus Augsten |
| Teaching language: | English |
| Prerequisites: | Basic concepts in databases and XML. |
| Office hours: | Wednesday 8:45–10:15 (Oct 7 – Dec 23) |
| Academic Year | 2009 / 2010, 1st Semester |
NEW! Course evaluation by students
2009/2010 [PDF]
Start date: October 7, 2009 Prerequisites:
Students should be familiar with basic concepts in databases
(including relational databases, SQL, and relational algebra) and
algorithms, as well as having good programming skills in Java. Further
basic knowledge about XML is required. This material is taught in the
following courses:
Course content:
This course will discuss similarity search techniques for flat strings
and hierarchical data (for example, XML). Selected methods will be
presented, their effectiveness and efficiency will be
discussed. Filtering techniques to improve the efficiency will be
introduced. The students will implement similarity joins in a
relational database management system.
Syllabus Lecture notes: The lecture notes for this course will be
published as we progress through the semester. There is a set of lecture notes for each week.
Project/exercise: The exercise part of the course consists
in the elaboration of a mini-project, which is done in groups of 2
to 3 students.
References:
There is no single textbook that covers the entire course. The course
material is collected from the following book chapters and research
papers:
See also: Course Presentation FormGeneral
Lectures: Wednesday 10:30-12:30, room D003
Labs: Wednesday 17:00-18:00, room E531
Lecture Notes
The lecture notes for this course will be created as we progress through the semester. There is a set of lecture notes for each week.
| Date | Topics | Slides | Reading | ||||
| 1. | Wed 2009-10-07 | General Introduction: Similarity Search, Course Organization, Introductory Example and Demo | [1up] [4up] | - | |||
| 2. | Wed 2009-10-14 | Edit Distance: Definition, Brute Force Algorithm, Dynamic Programming Algorithm, Edit Distance Variants | [1up] [4up] | [Nav01] | |||
| 3. | Wed 2009-10-21 | q-Gram Distance: Approximate String Join, Lower Bound Filtering, Length Filter, q-Gram Count Filter, q-Gram Position Filter, q-Gram Distance, Experiments | [1up] [4up] | [Gra01] | |||
| 4. | Wed 2009-10-28 | Trees and Relational Databases: Tree Definition, Adjacency List Encoding, Dewey Encoding, Interval Encoding, XML as a Tree | [1up] [4up] | [Cel04][Tat02] | |||
| 5. | Wed 2009-11-04 | Tree Edit Distance: Definition, Edit Cost, Edit Mapping, Deriving the Recursive Formulas (I) | [1up] [4up] | [Apo97] [Zha89] |
|||
| 6. | Wed 2009-11-11 | Tree Edit Distance: Deriving the Recursive Formulas (II), Dynamic Programming Algorithm, Example, Complexity | (→ Unit 5) | ||||
| 7. | Wed 2009-11-18 | Pruning: Traversal String Lower Bound, Constrained Edit Distance Upper Bound | [1up] [4up] | [Guh02] | |||
| 8. | Wed 2009-11-25 | Reference Sets: Pruning with Reference Sets | [1up] [4up] | [Guh02] | |||
| 9. | Wed 2009-12-02 | Binary Branch Distance: Algorithm, Lower Bound Proof, Complexity | [1up] [4up] | [Yan05] |
|||
| 10. | Wed 2009-12-09 | pq-Gram Distance: Algorithm and Lower Bound Theorem. | [1up] [4up] | [Aug05] | |||
| 11. | Wed 2009-12-16 | Windowed pq-Grams: Approximate Joins for Unordered Trees (Data-Centric XML) |
[1up] [4up]
[animation] |
[Aug08] | |||
| 12. | Wed 2009-12-23 | Project Presentations | |||||
Reading
With the lecture notes for each week also the main references to literature are given. You do not need to study all these papers, but they may be useful to delve deeper into some of the topics.
- [Nav01]
- Gonzalo Navarro. A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31–88, 2001. [PDF]
- [Gra01]
- Luis Gravano, Panagiotis G. Ipeirotis, H. V. Jagadish, Nick Koudas, S. Muthukrishnan, and Divesh Srivastava. Approximate string joins in a database (almost) for free. In Proc. of VLDB, pages 491–500, Roma, Italy, September 2001. Morgan Kaufmann Publishers Inc. [PDF]
- [Cel04] (Adjacency List and Interval Encoding)
- Joe Celko. Trees and Hierarchies in SQL for Smarties. Morgan Kaufmann, 2004. [library]
- [Tat02] (Dewey Encoding)
- Igor Tatarinov, Stratis Viglas, Kevin S. Beyer, Jayavel Shanmugasundaram, Eugene J. Shekita, Chun Zhang. Storing and Querying Ordered XML Using a Relational Database System. In Proc. of SIGMOD 2002, pages 204–215. Madison, Wisconsin, June 2002. Press. [PDF]
- [Apo97] (Tree Edit Distance)
- Alberto Apostolico and Zvi Galil, editors. Pattern Matching Algorithms, chapter Tree Pattern Matching, pages 341–371. Oxford University Press, 1997. [library]
- [Zha89] (Tree Edit Distance)
- Kaizhong Zhang and Dennis Shasha. Simple fast algorithms for the editing distance between trees and related problems. SIAM Journal on Computing, 18(6):1245–1262, 1989. [SIAM Journals Online]
- [Guh02] (Tree Edit Distance: Upper and Lower Bound, Reference Sets)
- Sudipto Guha, H. V. Jagadish, Nick Koudas, Divesh Srivastava, and Ting Yu. Approximate XML joins. In Proc. of SIGMOD, pages 287–298,Madison, Wisconsin, 2002. ACM Press. [PDF]
- [Yan05] (Binary Branch Distance)
- Rui Yang, Panos Kalnis, and Anthony K. H. Tung. Similarity evaluation on tree-structured data. In Proc. of SIGMOD, pages 754–765, Baltimore, Maryland, USA, June 2005. ACM Press. [PDF]
- [Aug05] (pq-Gram Distance)
- Nikolaus Augsten, Michael Böhlen, and Johann Gamper. Approximate matching of hierarchical data using pq-grams. In Proceedings of the International Conference on Very Large Databases (VLDB), pages 301–312, Trondheim, Norway, September 2005. Morgan Kaufmann Publishers Inc. [PDF]
- [Aug08] (pq-Gram Distance)
- Nikolaus Augsten, Michael Böhlen, Cyrtis Dyreson, and Johann Gamper. Approximate Joins for Data Centric XML. In Proceedings of the International Conference on Data Engineering (ICDE-08), CancĂșn, Mexico, April 2008. IEEE Computer Society. [PDF]
Mini-Project
The exercise part of the course consists in the elaboration of a mini-project, which is done in groups of 2 to 3 students. You choose a topic from the list of project proposals, you implement the required algorithms, write a report, and give a presentation.
Report
- Elements:
- Motivation
- Problem definition
- Short introduction to the solution (1-2 paragraphs per method)
- Experiments and Discussion
- Manual (short installation and usage instruction for your code deliverable)
- Approximately 6 pages
- Be concise!
Hint: Schedule 50% of your time for the report and the experiments.
Project Presentation
- Implementation demo: 5 minutes
- Presentation: 10 minutes
- Discussion: 5-10 minutes
Downloads
Download: Basic Java classes to access a MySQL database. Example program (LoadXMLForest) that parses XML and stores it in a relational table. [ZIP][README]
Usage of xmlgen: You will find xmlgen useful to produce large XML documents for scalability tests. However, you have no control over the structure of the XML documents produced by the tool. xmlgen takes two parameters. The first parameter, size, is a positive number and controls the size of the output XML. The second parameter, filename, is the name of the output file.
- Windows: xmlgen /f size /o filename
- Linux: ./xmlgen -f size -o filename
Hint: The XML produced with size=1.0 is already 112MB…
How to Set Up Good Experiments?
Setting up good experiments is a challenging task and is usually underestimated. Here are some hints on how to set up your experiments:
- What do you want to measure?
E.g., the run time of a tree algorithm. - What are the parameters?
E.g., tree size, tree shape (deep/flat), maximum/average fanout. - Start with a hypothesis.
E.g., the tested algorithm is faster for flat trees than for deep trees. - Create a synthetic data set or use real world data that is
useful to verify/falsify your hypothesis.
E.g., create flat and deep trees of similar size. - Run the experiment and display the result.
E.g., you show the the depth of the tree on the x-axis and the runtime on the y-axis. - Does the experiment approve your hypothesis, i.e., is the result
expected?
- Expected result: Why did you expect this result?
E.g., the analytic complexity result for the algorithm depends on the depth of the tree. - Unexpected result: Why are the results different from
what you have expected? What did you not consider in your hypothesis?
Are there characteristics in the test data that hide your test parameters?
E.g., for the tested algorithm, the number of different labels in the tree has more influence on the runtime than the depth – to isolate the influence of the tree depth, you must construct trees with similar label distribution. - Eye-catching characteristics of the results: Peaks, sharp
bends, noise in the graph.
E.g., the peak is due to a nightly backup that happened during the execution of experiment; the sharp upward bend indicates the start of main memory swapping.
- Expected result: Why did you expect this result?
- Showing the graph is not enough. The crucial question that
should always be answered:
What can we learn from the experiment?
E.g., although the algorithm performs slightly better for flat trees than for deep trees, this parameter is negligible compared to other parameters and does not influence the choice of the algorithm.
Milestones
- Milestone 1 — Oct 21 (Lab 3)
- form groups
- choose project
- Milestone 2 — Nov 25 (Lab 8)
- hand in implementation
- hand in report (without experimental part)
- Milestone 3 — Dec 23 (Lab 12)
- presentation
- hand in final version of report
Projects outside the lecture time span. In exceptional cases you can hand in a project outside the standard schedule, respecting the following deadlines:
- Agree with me on a project topic at least two months before the exam session in which you would like to defend the exam.
- Hand in the implementation and your report at least three weeks before the actual exam date.
The final report defines the problem, shortly introduces edit distance and
q-gram distance, explains the filter mechanism and presents the
experimental results. The report includes a usage manual for the
software-deliverable.
References
References
References
References
References
References
References
References
Project Proposals
Proposal 1: q-Gram Distance as a Filter for the Edit
Distance (assigned)
Proposal 2: Tree Edit Distance and pq-Gram
Distance (assigned)
The final report defines the problem, shortly introduces the tree edit distance and
the pq-gram distance, compares the matching quality of both
approaches, and presents the experimental results. The report includes
a usage manual for the software-deliverable.
Proposal 3: Interval Encoding and Dewey
Encoding
The final report defines the problem, shortly introduces the Interval and the
Dewey Encoding, and presents and discusses the experimental results. The
report includes a usage manual for the software-deliverable.
Proposal 4: Adjacency List Encoding and Dewey
Encoding
The final report defines the problem, shortly introduces the Adjacency List
Encoding and the Dewey Encoding, and presents and discusses the
experimental results. The report includes a usage manual for the
software-deliverable.
Proposal 5: Binary Branch Edit Distance
Filter
The final report defines the problem, shortly introduces the Tree Edit
Distance and the Binary Branch Filter, and presents and discusses the
experimental results. The report includes a usage manual for the
software-deliverable.
Proposal 6: Traversal String Edit Distance
Filter (assigned)
The final report defines the problem, shortly introduces the Tree Edit
Distance and the Traversal String Filter, and presents and discusses the
experimental results. The report includes a usage manual for the
software-deliverable.
Proposal 7: Global Greedy Matching and Stable Marriage
(assigned)
The final report defines the problem, shortly introduces the Global Greedy
Matching and the Stable Marriage algorithm, and presents and discusses
the experimental results. The report includes a usage manual for the
software-deliverable.
Proposal 8: Tree Edit Distance and Windowed pq-Gram
Distance (assigned)
The final report defines the problem, shortly introduces the tree edit
distance and the windowed pq-gram distance, compares the
robustness to sibling permutations of both approaches, and presents
the experimental results. The report includes a usage manual for the
software-deliverable.
Exam
Project (40%) and oral exam (60%).
Project: The group defends the project by
- handing in the resulting code;
- handing in the project report;
- giving a presentation of the project. The presentation is followed by a discussion.
Oral Exam: There will be no midterms. The oral exam at the end will cover all topics treated in the course. Each student will be examined individually for 15 minutes. The student will have to answer a question by example.
Oral Exam Details: Here is a list of exam questions [PDF]. The student will randomly choose one question from a list of exam questions (which will be published) and present the topic in 15 minutes. The student will be evaluated according to the correctness, clarity, relevance, and completness of the answer, and the proper use of the terminology. The examiner may ask follow-up questions to verify the thorough understanding of the chosen topic by the student.
Final grade. Both parts (project and oral exam) are graded with up to 30 points. Each part has to be passed (at least 18 points). The student has to pass the project before taking the oral exam. The final grade weights the project with 40% and the oral exam with 60%:
final_grade = grade_project * 0.4 + grade_oral_exam * 0.6