DIMACS :: Details

Simple, Deterministic, and Fast (but Weak) Approximation for Edit Distance and Dyck Edit Distance

March 08, 2023, 11:00 AM - 12:00 PM

Location:

Conference Room 301

Rutgers University

CoRE Building

96 Frelinghuysen Road

Piscataway, NJ 08854

Mike Saks, Rutgers University

Abstract

The edit distance between two strings, equal to the minimum number of operations (insertions, deletions or substitutions) needed to transform one to the other, is a standard measure of similarity of strings. The classic dynamic programming algorithm for edit distance requires time quadratic in n to compute the edit distance between two strings of length n, and there is evidence (via the strong exponential time hypothesis) that it may be impossible to improve substantially on this time complexity.

Recently, there has been considerable progress in developing approximation algorithms for edit distance (and the more general problem of Dyck edit distance) that are fast and have constant, or near constant approximation factors. These algorithms run in near linear time, but are logically complex, and the constants in both the running time and the approximation factor are huge, making the algorithms impractical.

In this work, we seek algorithms with weaker but still useful approximation guarantees that are practical: simple, fast and space efficient. We introduce a class of algorithms called single pass algorithms. In such an algorithm we maintain a single pointer within each string, starting at the left. In each step, if the current symbols match we advance both pointers, otherwise we have a mismatch and choose one of the pointers to advance. Such an algorithm is specified by its advancement rule, which determines which pointer to advance. We consider particularly simple (possibly randomized) advancement rules where at each mismatch step the pointer advanced depends only on the number of mismatches seen so far and the randomness of the algorithm. It is easy to see that the total number of mismatches is always an upper bound on edit distance. Saha (2014) showed that the simple randomized rule (on mismatch advance a pointer at random) when run on two strings of edit distance d returns (with high probability) an upper bound of O(d^2).

In this work we (1) present a deterministic single pass algorithm that achieves similar performance and (2) prove that no algorithm (even randomized) in this class can give a better approximation factor.

For the Dyck edit distance problem, Saha gave a complicated randomized reduction from Dyck edit distance to standard edit distance at a cost of a O(log d) factor where d is the Dyck edit distance. I will present a simple deterministic reduction with a similar (slightly better) approximation guarantee.

This is joint work with Michal Koucky.

See: https://theory.cs.rutgers.edu/theory_seminar