Understanding Levenshtein Distance: Applications to AI-Generated Text

•

Summary•6 min read

Levenshtein distance is a metric for comparing strings, providing a score for the difference between them. While it's been around for decades, it's found new applications in modern computing for comparing AI-generated text to it source material in ways that can help programmers preserve the privacy of data in large language models.

Artificial Intelligence, Large Language Models, Natural Language Processing, ChatGPT. Do I have your attention now? Indeed, as of this blog’s creation date, ChatGPT and AI-based technologies have swept the tech industry. Nvidia (maker of GPUs) has achieved the valuation mark of $3.335 trillion on June 18, 2024. Microsoft has invested over $10 billion into OpenAI, and Meta (Formerly Facebook) has created its own family of LLMs called Llama. Even Docusign has incorporated AI into its platform, such as with Agreement Summarization.

Problem

With so much AI-generated output these days, how can we developers work with AI-generated output without intruding on user privacy? Surely, we want to know how well our AI models are doing, but if we were to read exactly what the AI generated, or what the user types, it would be similar to breaching the privacy of the customer’s data. For example, if developers were to read the actual text of a customer’s AI-generated “summary” of a private document, it would be akin to “reading” the document itself.

Thus, in comes the Levenshtein Distance algorithm, an algorithm that calculates a numerical metric that measures the “distance” between two strings. Perhaps this “distance” between an AI-generated output and a user-generated input could be used as a privacy protection solution to detect how “off” our models are, but at the same time NOT read any of the text the AI generated.

Levenshtein Distance

History

The Levenshtein distance algorithm was introduced by the Soviet mathematician Vladimir Levenshtein in 1965. Originally formulated for error correction on binary codes, the algorithm has since found application in various fields, including spell checking, DNA sequencing, and plagiarism detection.

Definition

Levenshtein distance, also known as edit distance, is a metric used to measure the difference between two strings. It calculates the minimum number of “single-character” edits (insertions, deletions, or substitutions) required to transform one string into another.

How it works

The Levenshtein distance algorithm can be visualized as filling out a matrix where the rows represent the characters of the first string, and the columns represent the characters of the second string. The value at each cell of the metric represents the minimum number of edits required to match the substrings up to that point. A basic step-by-step pseudocode algorithm is as follows:

Initialization: Given input string1 and string2 to compare. Create matrix with rows of length (string1) + 1 and column of length (string2)+1
Base case: Fill the first row and column with the indices of the strings; this represents the cost of transforming a string into an empty string.
Matrix filling: For each character in the strings, compute the cost of the three possible operations: insertion, deletion, and substitution, then fill the value in the cell with the minimum of those three.
Typically the rules for filling in are:
1. If i == 0, matrix[i][j] = j
2. If j == 0, matrix[i][j] = i
3. If str1[i-1] == str2[j-1], matrix[i][j] = matrix[i-1][j-1]
4. Otherwise, matrix[i][j] = 1 + min(matrix[i-1][j], matrix[i][j-1], matrix[i-1][j-1])
Final value: The value in the bottom right represents the Levenshtein distance between the two strings.

Example

Here is an example comparing “samples” with “example”: Step 1: Initialize

"" s a m p l e s '' 0 1 2 3 4 5 6 7 e 1 0 0 0 0 0 0 0 x 2 0 0 0 0 0 0 0 a 3 0 0 0 0 0 0 0 m 4 0 0 0 0 0 0 0 p 5 0 0 0 0 0 0 0 l 6 0 0 0 0 0 0 0 e 7 0 0 0 0 0 0 0

Step 2: Fill in

"" s a m p l e s '' 0 1 2 3 4 5 6 7 e 1 1 2 3 4 5 5 6 x 2 2 2 3 4 5 6 6 a 3 3 2 3 4 5 6 7 m 4 4 3 2 3 4 5 6 p 5 5 4 3 2 3 4 5 l 6 6 5 4 3 2 3 4 e 7 6 6 5 4 3 2 3

The value in the bottom right is 3, so the final edit distance is 3. We can see that this is correct because, to convert “example” to “samples”, it can be done by adding “s”, deleting “e”, and changing “x” to “s” (three operations total).

Applications

When working with Levenshtein distance, the uses for which it could be suitable become immediately evident. Examples might include plagiarism detection, form validation, and comparing AI output to reference text. For this blog, we will focus on the last: “Comparing AI output to reference text” in one of our upcoming beta features, Suggested Fields.

Case Study: Comparing User Input to AI-Generated Suggested Fields

One of our upcoming features is Suggested Fields, in which Docusign takes a user-uploaded document and uses AI to “suggest” the fields that should go on the document and also automatically places them onto the document. So imagine a real estate form, or a healthcare form, or a tax form, that asks for “Name: ____________” and “Address: ________________”. Normally, preparing such a document would require the sender to manually create, label, then drag-and-drop each field to the right place. This takes a lot of time and effort, especially when there are many form fields within each page. By leveraging AI, all of these steps are automated! With just a few clicks, Suggested Fields can generate clearly labeled fields and auto-place them on the appropriate locations based on document contents.

However, as with all AI-generated applications, it is important for users to review the generated fields and correct the AI’s output if necessary. For example, the user might change “Name” to “names” or the AI might have incorrectly outputted “Addresses” as opposed to the singular “Address”. By using Levenshtein distance on the user’s inputted correction compared to the original AI-generated word, developers can get a fuzzy metric on how well our AI model is working: higher Levenshtein metrics indicating our model is not doing well, and lower numbers indicating that it is doing well. All this is done without even knowing the strings’ actual values—a wonderful example for privacy protection on behalf of users. I hope this knowledge about Levenshtein distance leaves you with inspiration to use it in your next AI-related application!

There are many libraries that implement Levenshtein distance for you, such as JS/TS fastest-levenshtein or python-Levenshtein. These libraries often have the fastest implementations, using dynamic programming or VP Trees.

But if you would like to implement a basic version of the algorithm; below are some common examples of the Levenshtein distance algorithm in various programming languages:

Code Samples

C#:

public static int Compute(string source, string target)
    {
        if (string.IsNullOrEmpty(source))
        {
            return string.IsNullOrEmpty(target) ? 0 : target.Length;
        }

        if (string.IsNullOrEmpty(target))
        {
            return source.Length;
        }

        int sourceLength = source.Length;
        int targetLength = target.Length;

        int[,] distance = new int[sourceLength + 1, targetLength + 1];

        // Initialize the first row and column
        for (int i = 0; i <= sourceLength; i++)
        {
            distance[i, 0] = i;
        }

        for (int j = 0; j <= targetLength; j++)
        {
            distance[0, j] = j;
        }

        // Compute the distances
        for (int i = 1; i <= sourceLength; i++)
        {
            for (int j = 1; j <= targetLength; j++)
            {
                int cost = (source[i - 1] == target[j - 1]) ? 0 : 1;

                distance[i, j] = Math.Min(
                    Math.Min(distance[i - 1, j] + 1, distance[i, j - 1] + 1),
                    distance[i - 1, j - 1] + cost);
            }
        }
        return distance[sourceLength, targetLength];
    }

Python:

def levenshtein_distance(source, target):
    if source == target:
        return 0
    if len(source) == 0:
        return len(target)
    if len(target) == 0:
        return len(source)

    source_length = len(source)
    target_length = len(target)

    # Initialize the distance matrix
    distance = [[0] * (target_length + 1) for _ in range(source_length + 1)]

    # Initialize the first row and column
    for i in range(source_length + 1):
        distance[i][0] = i
    for j in range(target_length + 1):
        distance[0][j] = j

    # Compute the distances
    for i in range(1, source_length + 1):
        for j in range(1, target_length + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            distance[i][j] = min(
                distance[i - 1][j] + 1,      # Deletion
                distance[i][j - 1] + 1,      # Insertion
                distance[i - 1][j - 1] + cost  # Substitution
            )

    return distance[source_length][target_length]

Go:

import (
    "fmt"
    "math"
)

func LevenshteinDistance(s1, s2 string) int {
    lenS1 := len(s1)
    lenS2 := len(s2)

    // Create a 2D slice to store distances
    dist := make([][]int, lenS1+1)
    for i := range dist {
        dist[i] = make([]int, lenS2+1)
    }

    // Initialize base cases
    for i := 0; i <= lenS1; i++ {
        dist[i][0] = i
    }
    for j := 0; j <= lenS2; j++ {
        dist[0][j] = j
    }

    // Compute distances
    for i := 1; i <= lenS1; i++ {
        for j := 1; j <= lenS2; j++ {
            cost := 0
            if s1[i-1] != s2[j-1] {
                cost = 1
            }
            dist[i][j] = int(math.Min(
                math.Min(float64(dist[i-1][j]+1), float64(dist[i][j-1]+1)),
                float64(dist[i-1][j-1]+cost),
            ))
        }
    }

    return dist[lenS1][lenS2]
}

Javascript:

function levenshteinDistance(source, target) {
    if (source === target) {
        return 0;
    }
    if (source.length === 0) {
        return target.length;
    }
    if (target.length === 0) {
        return source.length;
    }

    const sourceLength = source.length;
    const targetLength = target.length;

    // Initialize the distance matrix
    const distance = Array.from({ length: sourceLength + 1 }, () => Array(targetLength + 1).fill(0));

    // Initialize the first row and column
    for (let i = 0; i <= sourceLength; i++) {
        distance[i][0] = i;
    }
    for (let j = 0; j <= targetLength; j++) {
        distance[0][j] = j;
    }

    // Compute the distances
    for (let i = 1; i <= sourceLength; i++) {
        for (let j = 1; j <= targetLength; j++) {
            const cost = source[i - 1] === target[j - 1] ? 0 : 1;
            distance[i][j] = Math.min(
                distance[i - 1][j] + 1,      // Deletion
                distance[i][j - 1] + 1,      // Insertion
                distance[i - 1][j - 1] + cost  // Substitution
            );
        }
    }

    return distance[sourceLength][targetLength];
}

Citations:

https://en.wikipedia.org/wiki/Levenshtein_distance

https://medium.com/@ethannam/understanding-the-levenshtein-distance-equation-for-beginners-c4285a5604f0

Additional resources

Vincent PanSoftware Engineer

Vincent is an experienced software engineer who has been with Docusign since March 2023. Specializing in back-end REST APIs, he has also done front-end and most recently AI-related integrations. He is passionate about delivering high-quality, customer-facing experiences, and enjoys collaborating across teams and job functions. You can reach him on LinkedIn.