MSc Thesis Proposals

Supervisor: Nikolaus Augsten

Experimenatal Evaluation of Tree Edit Distance Algorithms
(finished — Vaidas Komskis)

The tree edit distance is the minimum number of node edit operations that transform one tree into the other.

Example: The tree edit distance between the following trees is 2 (delete node e, insert node x).

   a         a    
  / \       / \   
 b   c     b   x    
    / \        |  
   d   e       c
       |      / \
       f     d   f

Tree edit distance algorithms have gained importance with the advent of XML documents. XML documents are represented as ordered, labeled trees and the tree edit distance is used to find similar XML documents.

For a long time the tree edit distance algorithm by Zhang and Shasha has been the best algorithm available (O(n4) time and O(n3) space complexity). Only very recently the results where improved:

While Demaine's algorithm is the clear winner in terms of runtime complexity, no experimental comparison of the three tree edit distance algorithms has been published so far. The goal of this thesis is to implement the tree edit distance algorithms by Zhang, Klein, and Demaine, and compare them experimentally.

Storing XML using Interval Encoding with Sparse Numbering
(finished — Eigminas Dagys)

The interval encoding is a technique for storing hierarchical data in relations and it has been used to store and query XML data. Each node of a tree is then represented as a triple (label, lft, rgt) of node label, and left and right endpoint of the node's interval.

We get an interval encoding for a tree by traversing the tree in preorder, using an incremental counter that assigns the left interval value lft to each node when it is visited first, and the right value rgt when it is visited last.

Example:

   a
  /|\
 b c d
    / \
   e   f
The tree above can be stored as
(a, 1, 12)
(b, 2, 3)
(c, 4, 5)
(d, 6, 11)
(e, 7, 8)
(f, 9, 10)

A problem with this approach is that node insertion and deletion are expensive, as all nodes that follow the inserted/deleted node in the preorder traversal have to be updated. Using sparse numbering instead of dense numbering has been proposed as a solution. A sparse numbering of the tree above could, for example, be the following:

(a, 1, 240)
(b, 40, 60)
(c, 80, 100)
(d, 120, 220)
(e, 140, 160)
(f, 180, 200)

Newly inserted nodes use the gaps in the numbering. Only if no numbers are left, the following nodes in preorder have to be updated.

The goal is to implement an interval encoding with sparse numbering and evaluate its efficiency in terms of

Adjacency list encoding and interval encoding with dense numbering are used as a reference in the testing (code provided).

References

[Zhang and Shasha] (Tree Edit Distance)
Kaizhong Zhang and Dennis Shasha. Simple fast algorithms for the editing distance between trees and related problems. SIAM Journal on Computing, 18(6):1245–1262, 1989. [SIAM Journals Online]