levenshtein

(PHP 4 >= 4.0.1, PHP 5)

levenshtein — Calculate Levenshtein distance between two strings

Description

int levenshtein ( string $str1 , string $str2 )

int levenshtein ( string $str1 , string $str2 , int $cost_ins , int $cost_rep , int $cost_del )

The Levenshtein distance is defined as the minimal number of characters you have to replace, insert or delete to transform str1 into str2. The complexity of the algorithm is O(m*n), where n and m are the lengths of str1 and str2 (rather good when compared to similar_text(), which is O(max(n,m)**3), but still expensive).

In its simplest form the function takes only the two strings as parameters and calculates just the number of insert, replace and delete operations needed to transform str1 into str2.

A second variant will take three additional parameters that define the cost of insert, replace and delete operations. This is more general and adaptive than variant one, but not as efficient.
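
As a rough illustration of the two forms described above, the following sketch calls the function first with the default costs and then with explicit ones. The cost values (5, 1, 1) are arbitrary and chosen only to make the effect of each parameter visible.

```php
<?php
// Default form: every insertion, replacement and deletion costs 1.
// kitten -> sitten -> sittin -> sitting: two replacements and one insertion.
$plain = levenshtein('kitten', 'sitting');          // 3

// Weighted form. Arguments after the strings are, in order:
// insertion cost, replacement cost, deletion cost (illustrative values).
$insertOnly = levenshtein('ab', 'abc', 5, 1, 1);    // 5: one insertion is needed
$deleteOnly = levenshtein('abc', 'ab', 5, 1, 1);    // 1: one deletion is needed

echo "$plain $insertOnly $deleteOnly\n";
```

Because the costs apply to transforming str1 into str2, swapping the two strings changes the result whenever the insertion and deletion costs differ.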

Parameters

str1: One of the strings being evaluated for Levenshtein distance.
str2: One of the strings being evaluated for Levenshtein distance.
cost_ins: Defines the cost of insertion.
cost_rep: Defines the cost of replacement.
cost_del: Defines the cost of deletion.

Return Values

This function returns the Levenshtein distance between the two argument strings, or -1 if one of the argument strings is longer than the limit of 255 characters.
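
Since -1 is easy to mistake for a very small distance, a caller may prefer to check the lengths up front. The wrapper below is only a sketch; levenshtein_or_null() is a hypothetical name, not part of PHP.

```php
<?php
// Hypothetical wrapper: returns null instead of calling levenshtein() when an
// argument exceeds the 255-character limit described above.
function levenshtein_or_null($str1, $str2)
{
    if (strlen($str1) > 255 || strlen($str2) > 255) {
        return null;
    }
    return levenshtein($str1, $str2);
}

var_dump(levenshtein_or_null('flaw', 'lawn'));              // int(2)
var_dump(levenshtein_or_null(str_repeat('a', 256), 'a'));   // NULL
```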

Examples

Example #1 levenshtein() example


<?php
// input misspelled word
$input = 'carrrot';

// array of words to check against
$words  = array('apple','pineapple','banana','orange',
                'radish','carrot','pea','bean','potato');

// no shortest distance found, yet
$shortest = -1;

// loop through words to find the closest
foreach ($words as $word) {

    // calculate the distance between the input word,
    // and the current word
    $lev = levenshtein($input, $word);

    // check for an exact match
    if ($lev == 0) {

        // closest word is this one (exact match)
        $closest = $word;
        $shortest = 0;

        // break out of the loop; we've found an exact match
        break;
    }

    // if this distance is no greater than the shortest distance found
    // so far, OR if no distance has been recorded yet
    if ($lev <= $shortest || $shortest < 0) {
        // set the closest match, and shortest distance
        $closest  = $word;
        $shortest = $lev;
    }
}

echo "Input word: $input\n";
if ($shortest == 0) {
    echo "Exact match found: $closest\n";
} else {
    echo "Did you mean: $closest?\n";
}

?>

The above example will output:

Input word: carrrot
Did you mean: carrot?

User Contributed Notes 22 notes

paulrowe at iname dot com

6 years ago


[EDITOR'S NOTE: original post and 2 corrections combined into 1 -- mgf]



Here is an implementation of the Levenshtein Distance calculation that only uses a one-dimensional array and doesn't have a limit to the string length. This implementation was inspired by maze generation algorithms that also use only one-dimensional arrays.



I have tested this function with two 532-character strings and it completed in 0.6-0.8 seconds. 



<?php

/*

* This function starts out with several checks in an attempt to save time.

*   1.  The shorter string is always used as the "right-hand" string (as the size of the array is based on its length).  

*   2.  If the left string is empty, the length of the right is returned.

*   3.  If the right string is empty, the length of the left is returned.

*   4.  If the strings are equal, a zero-distance is returned.

*   5.  If the left string is contained within the right string, the difference in length is returned.

*   6.  If the right string is contained within the left string, the difference in length is returned.

* If none of the above conditions were met, the Levenshtein algorithm is used.

*/

function LevenshteinDistance($s1, $s2)

{

  $sLeft = (strlen($s1) > strlen($s2)) ? $s1 : $s2;

  $sRight = (strlen($s1) > strlen($s2)) ? $s2 : $s1;

  $nLeftLength = strlen($sLeft);

  $nRightLength = strlen($sRight);

  if ($nLeftLength == 0)

    return $nRightLength;

  else if ($nRightLength == 0)

    return $nLeftLength;

  else if ($sLeft === $sRight)

    return 0;

  else if (($nLeftLength < $nRightLength) && (strpos($sRight, $sLeft) !== FALSE))

    return $nRightLength - $nLeftLength;

  else if (($nRightLength < $nLeftLength) && (strpos($sLeft, $sRight) !== FALSE))

    return $nLeftLength - $nRightLength;

  else {

    $nsDistance = range(1, $nRightLength + 1);

    for ($nLeftPos = 1; $nLeftPos <= $nLeftLength; ++$nLeftPos)

    {

      $cLeft = $sLeft[$nLeftPos - 1];

      $nDiagonal = $nLeftPos - 1;

      $nsDistance[0] = $nLeftPos;

      for ($nRightPos = 1; $nRightPos <= $nRightLength; ++$nRightPos)

      {

        $cRight = $sRight[$nRightPos - 1];

        $nCost = ($cRight == $cLeft) ? 0 : 1;

        $nNewDiagonal = $nsDistance[$nRightPos];

        $nsDistance[$nRightPos] = 

          min($nsDistance[$nRightPos] + 1, 

              $nsDistance[$nRightPos - 1] + 1, 

              $nDiagonal + $nCost);

        $nDiagonal = $nNewDiagonal;

      }

    }

    return $nsDistance[$nRightLength];

  }

}

?>

luciole75w at no dot spam dot gmail dot com

11 months ago


The levenshtein function processes each byte of the input string individually. For multibyte encodings such as UTF-8, it may therefore give misleading results.

Example with a French accented word:
- levenshtein('notre', 'votre') = 1
- levenshtein('notre', 'nôtre') = 2 (huh ?!)
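
The second result can be explained by looking at the bytes. In the sketch below the accented word is written with explicit UTF-8 escapes, so it does not depend on how the source file is saved:

```php
<?php
// "ô" is encoded in UTF-8 as the two bytes C3 B4.
$accented = "n\xC3\xB4tre";

echo strlen('notre'), "\n";                  // 5
echo strlen($accented), "\n";                // 6
echo bin2hex($accented), "\n";               // 6ec3b4747265

// Byte-wise, "o" is replaced by C3 and B4 is inserted: two operations.
echo levenshtein('notre', $accented), "\n";  // 2
```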

You can easily find a multibyte-compliant PHP implementation of the levenshtein function, but it will of course be much slower than the C implementation.
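
For reference, such a pure-PHP implementation might look like the sketch below. The name levenshtein_codepoints() is hypothetical, and the function assumes that both arguments are valid UTF-8.

```php
<?php
// Hypothetical helper (not part of PHP): Levenshtein distance counted in
// UTF-8 code points rather than bytes, using two rows of the distance table.
function levenshtein_codepoints($a, $b)
{
    // The /u modifier makes PCRE split between code points, not bytes.
    $x = preg_split('//u', $a, -1, PREG_SPLIT_NO_EMPTY);
    $y = preg_split('//u', $b, -1, PREG_SPLIT_NO_EMPTY);
    if ($x === false || $y === false) {
        // preg_split() fails on malformed UTF-8 when /u is used
        throw new InvalidArgumentException('Both strings must be valid UTF-8');
    }
    $n = count($y);

    // $prev[$j]: distance from the characters of $a seen so far (none yet)
    // to the first $j characters of $b.
    $prev = range(0, $n);
    foreach ($x as $i => $cx) {
        $curr = array($i + 1);
        for ($j = 1; $j <= $n; $j++) {
            $cost = ($cx === $y[$j - 1]) ? 0 : 1;
            $curr[$j] = min($prev[$j] + 1,          // deletion
                            $curr[$j - 1] + 1,      // insertion
                            $prev[$j - 1] + $cost); // replacement or match
        }
        $prev = $curr;
    }
    return $prev[$n];
}

echo levenshtein_codepoints('notre', "n\xC3\xB4tre"), "\n"; // 1
```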

Another option is to convert the strings to a single-byte (lossless) encoding so that they can feed the fast core levenshtein function.

Here is the conversion function I used with a search engine storing UTF-8 strings, and a quick benchmark. I hope it will help.

<?php
// Convert a UTF-8 encoded string to a single-byte string suitable for
// functions such as levenshtein.
// 
// The function simply uses (and updates) a tailored dynamic encoding
// (in/out map parameter) where non-ascii characters are remapped to
// the range [128-255] in order of appearance.
//
// Thus it supports up to 128 different multibyte code points max over
// the whole set of strings sharing this encoding.
//
function utf8_to_extended_ascii($str, &$map)
{
    // find all multibyte characters (cf. utf-8 encoding specs)
    $matches = array();
    if (!preg_match_all('/[\xC0-\xF7][\x80-\xBF]+/', $str, $matches))
        return $str; // plain ascii string
    
    // update the encoding map with the characters not already met
    foreach ($matches[0] as $mbc)
        if (!isset($map[$mbc]))
            $map[$mbc] = chr(128 + count($map));
    
    // finally remap non-ascii characters
    return strtr($str, $map);
}

// Didactic example showing the usage of the previous conversion function but,
// for better performance, in a real application with a single input string
// matched against many strings from a database, you will probably want to
// pre-encode the input only once.
//
function levenshtein_utf8($s1, $s2)
{
    $charMap = array();
    $s1 = utf8_to_extended_ascii($s1, $charMap);
    $s2 = utf8_to_extended_ascii($s2, $charMap);
    
    return levenshtein($s1, $s2);
}
?>

Results (for about 6000 calls)
- reference time core C function (single-byte) : 30 ms
- utf8 to ext-ascii conversion + core function : 90 ms
- full php implementation : 3000 ms

dschultz at protonic dot com

14 years ago


It's also useful if you want to make some sort of registration page and you want to make sure that people who register don't pick usernames that are very similar to their passwords.
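
A check along those lines could be sketched as follows. The function name and the threshold of 2 are assumptions made for illustration, not recommendations.

```php
<?php
// Hypothetical helper: flags a password whose edit distance to the username
// is at or below $maxDistance. The default of 2 is an arbitrary assumption.
function is_password_too_similar($username, $password, $maxDistance = 2)
{
    // Compare case-insensitively so "JohnSmith" and "johnsmith" count as equal.
    $distance = levenshtein(strtolower($username), strtolower($password));
    return $distance <= $maxDistance;
}

var_dump(is_password_too_similar('johnsmith', 'JohnSmith1'));   // bool(true)
var_dump(is_password_too_similar('johnsmith', 'Tr0ub4dor&3'));  // bool(false)
```

Note that where levenshtein() returns -1 for over-long input, this sketch treats the pair as too similar, which errs on the side of rejecting the password.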

"inerte" is my hotmail.com username

11 years ago


I am using this function to avoid duplicate information on my client's database.

After retrieving a series of rows and storing the values in an array, I loop over it with foreach, comparing each value with the user-supplied string using levenshtein().

It helps to avoid people re-registering "John Smith", "Jon Smith" or "Jon Smit".

Of course, I can't block the operation if the user really wants to, but a suggestion is displayed along the lines of: "There's a similar client with this name.", followed by the list of the similar strings.
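
The loop described in this note might be sketched like this. The helper name, the sample data and the threshold of 2 are all assumptions; in practice the names would come from the database query.

```php
<?php
// Hypothetical helper: returns the existing names within $maxDistance edits
// of the submitted name. The threshold of 2 is an assumption to tune.
function find_similar_names($input, array $existingNames, $maxDistance = 2)
{
    $similar = array();
    foreach ($existingNames as $name) {
        // Lower-case both sides so differences in case do not count as edits.
        if (levenshtein(strtolower($input), strtolower($name)) <= $maxDistance) {
            $similar[] = $name;
        }
    }
    return $similar;
}

$clients = array('John Smith', 'Jane Doe', 'Jon Smith');
print_r(find_similar_names('Jon Smit', $clients));
// lists "John Smith" and "Jon Smith"
```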

justin at visunet dot ie

9 years ago


<?php

    /*********************************************************************
    * The function below, btlfsa (better than levenshtein for spelling apps),
    * produces better results when comparing words like haert against
    * haart and heart.
    *
    * For example here is the output of levenshtein compared to btlfsa
    * when comparing 'haert' to 'herat, haart, heart, harte'
    *
    * btlfsa('haert','herat'); output is.. 3
    * btlfsa('haert','haart'); output is.. 3
    * btlfsa('haert','harte'); output is.. 3
    * btlfsa('haert','heart'); output is.. 2
    *
    * levenshtein('haert','herat'); output is.. 2
    * levenshtein('haert','haart'); output is.. 1
    * levenshtein('haert','harte'); output is.. 2
    * levenshtein('haert','heart'); output is.. 2
    *
    * In other words, if you used levenshtein, 'haart' would be the
    * closest match to 'haert', whereas btlfsa sees that it should be
    * 'heart'
    */

    function btlfsa($word1,$word2)
    {
        $score = 0;

        // For each char that is different add 2 to the score
        // as this is a BIG difference

        $remainder  = preg_replace("/[".preg_replace("/[^A-Za-z0-9\']/",' ',$word1)."]/i",'',$word2);
        $remainder .= preg_replace("/[".preg_replace("/[^A-Za-z0-9\']/",' ',$word2)."]/i",'',$word1);
        $score      = strlen($remainder)*2;

        // Take the difference in string length and add it to the score
        $w1_len  = strlen($word1);
        $w2_len  = strlen($word2);
        $score  += $w1_len > $w2_len ? $w1_len - $w2_len : $w2_len - $w1_len;

        // Calculate how many letters are in different locations
        // And add it to the score i.e.
        //
        // h e a r t
        // 1 2 3 4 5
        //
        // h a e r t     a e        = 2
        // 1 2 3 4 5   1 2 3 4 5
    &
