MSc Thesis Proposals
Supervisor: Nikolaus Augsten
Experimenatal Evaluation of Tree Edit Distance Algorithms
(finished — Vaidas Komskis)
The tree edit distance is the minimum number of node edit operations that transform one tree into the other.
Example: The tree edit distance between the following trees is 2 (delete node e, insert node x).
a a
/ \ / \
b c b x
/ \ |
d e c
| / \
f d f
Tree edit distance algorithms have gained importance with the advent of XML documents. XML documents are represented as ordered, labeled trees and the tree edit distance is used to find similar XML documents.
For a long time the tree edit distance algorithm by Zhang and Shasha has been the best algorithm available (O(n4) time and O(n3) space complexity). Only very recently the results where improved:
- Klein improves the runtime to O(n3log n) in 1998.
- Demaine et al. improve the runtime to O(n3log n) in 2007 and they show the optimality of their algorithm among all decomposition algorithms.
While Demaine's algorithm is the clear winner in terms of runtime complexity, no experimental comparison of the three tree edit distance algorithms has been published so far. The goal of this thesis is to implement the tree edit distance algorithms by Zhang, Klein, and Demaine, and compare them experimentally.
Storing XML using Interval Encoding with Sparse Numbering
(finished — Eigminas Dagys)
The interval encoding is a technique for storing hierarchical data in relations and it has been used to store and query XML data. Each node of a tree is then represented as a triple (label, lft, rgt) of node label, and left and right endpoint of the node's interval.
We get an interval encoding for a tree by traversing the tree in preorder, using an incremental counter that assigns the left interval value lft to each node when it is visited first, and the right value rgt when it is visited last.
Example:
a
/|\
b c d
/ \
e f
The tree above can be stored as
(a, 1, 12) (b, 2, 3) (c, 4, 5) (d, 6, 11) (e, 7, 8) (f, 9, 10)
A problem with this approach is that node insertion and deletion are expensive, as all nodes that follow the inserted/deleted node in the preorder traversal have to be updated. Using sparse numbering instead of dense numbering has been proposed as a solution. A sparse numbering of the tree above could, for example, be the following:
(a, 1, 240) (b, 40, 60) (c, 80, 100) (d, 120, 220) (e, 140, 160) (f, 180, 200)
Newly inserted nodes use the gaps in the numbering. Only if no numbers are left, the following nodes in preorder have to be updated.
The goal is to implement an interval encoding with sparse numbering and evaluate its efficiency in terms of
- node insertion and deletion
- loading the XML in document order
- ancestor/descendant and sibling queries
Adjacency list encoding and interval encoding with dense numbering are used as a reference in the testing (code provided).