
    levenshtein

    (PHP 4 >= 4.0.1, PHP 5)

    levenshtein - Calculate Levenshtein distance between two strings

    Description

    int levenshtein ( string $str1 , string $str2 )
    int levenshtein ( string $str1 , string $str2 , int $cost_ins , int $cost_rep , int $cost_del )

    The Levenshtein distance is defined as the minimal number of characters you have to replace, insert or delete to transform str1 into str2. The complexity of the algorithm is O(m*n), where m and n are the lengths of str1 and str2 (rather good when compared to similar_text(), which is O(max(n,m)**3), but still expensive).

    In its simplest form the function takes only the two strings as parameters and counts the insert, replace and delete operations needed to transform str1 into str2.

    A second variant will take three additional parameters that define the cost of insert, replace and delete operations. This is more general and adaptive than variant one, but not as efficient.
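
    As a minimal sketch of the two call forms (the words and the cost values are illustrative assumptions, not taken from the manual), the first call counts every operation as 1, while the second charges 2 for a replacement so that it is no cheaper than a delete plus an insert:

```php
<?php
// Variant one: every insert, replace and delete costs 1.
$plain = levenshtein('kitten', 'sitting');

// Variant two: the extra arguments are cost_ins, cost_rep, cost_del, in that order.
// Illustrative costs only: replacement is charged 2, insert and delete 1.
$weighted = levenshtein('kitten', 'sitting', 1, 2, 1);

echo "plain: $plain, weighted: $weighted\n"; // plain: 3, weighted: 5
```

    With these costs the weighted result grows from 3 to 5 because the two replacements (k to s, e to i) now cost 2 each.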

    Parameters

    str1

    One of the strings being evaluated for Levenshtein distance.

    str2

    One of the strings being evaluated for Levenshtein distance.

    cost_ins

    Defines the cost of insertion.

    cost_rep

    Defines the cost of replacement.

    cost_del

    Defines the cost of deletion.

    Return Values

    This function returns the Levenshtein distance between the two argument strings, or -1 if one of the argument strings is longer than the limit of 255 characters.
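
    Because -1 is easy to mistake for a very small distance, callers may want to check lengths up front. The following is only a sketch: levenshtein_or_null() is a hypothetical helper name, not part of PHP, and it simply applies the 255-character limit described above before calling the built-in.

```php
<?php
// Hypothetical helper (not part of PHP): returns null instead of -1 when
// either argument exceeds the documented 255-character limit.
function levenshtein_or_null($str1, $str2)
{
    if (strlen($str1) > 255 || strlen($str2) > 255) {
        return null;
    }
    return levenshtein($str1, $str2);
}

var_dump(levenshtein_or_null('flaw', 'lawn'));            // int(2)
var_dump(levenshtein_or_null(str_repeat('a', 256), 'a')); // NULL
```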

    Examples

    Example #1 levenshtein() example

    <?php
    // input misspelled word
    $input = 'carrrot';

    // array of words to check against
    $words = array('apple', 'pineapple', 'banana', 'orange',
                   'radish', 'carrot', 'pea', 'bean', 'potato');

    // no shortest distance found, yet
    $shortest = -1;

    // loop through words to find the closest
    foreach ($words as $word) {

        // calculate the distance between the input word
        // and the current word
        $lev = levenshtein($input, $word);

        // check for an exact match
        if ($lev == 0) {

            // closest word is this one (exact match)
            $closest = $word;
            $shortest = 0;

            // break out of the loop; we've found an exact match
            break;
        }

        // if this distance is no greater than the shortest distance found
        // so far, OR if no shortest distance has been found yet
        if ($lev <= $shortest || $shortest < 0) {
            // set the closest match, and shortest distance
            $closest  = $word;
            $shortest = $lev;
        }
    }

    echo "Input word: $input\n";
    if ($shortest == 0) {
        echo "Exact match found: $closest\n";
    } else {
        echo "Did you mean: $closest?\n";
    }

    ?>

    The above example will output:

    Input word: carrrot
    Did you mean: carrot?
    

    See Also

    • soundex() - Calculate the soundex key of a string
    • similar_text() - Calculate the similarity between two strings
    • metaphone() - Calculate the metaphone key of a string


    User Contributed Notes 22 notes

    paulrowe at iname dot com
    6 years ago
    [EDITOR'S NOTE: original post and 2 corrections combined into 1 -- mgf]

    Here is an implementation of the Levenshtein Distance calculation that only uses a one-dimensional array and doesn't have a limit to the string length. This implementation was inspired by maze generation algorithms that also use only one-dimensional arrays.

    I have tested this function with two 532-character strings and it completed in 0.6-0.8 seconds.

    <?php
    /*
    * This function starts out with several checks in an attempt to save time.
    *   1.  The shorter string is always used as the "right-hand" string (as the size of the array is based on its length).
    *   2.  If the left string is empty, the length of the right is returned.
    *   3.  If the right string is empty, the length of the left is returned.
    *   4.  If the strings are equal, a zero-distance is returned.
    *   5.  If the left string is contained within the right string, the difference in length is returned.
    *   6.  If the right string is contained within the left string, the difference in length is returned.
    * If none of the above conditions were met, the Levenshtein algorithm is used.
    */
    function LevenshteinDistance($s1, $s2)
    {
      $sLeft = (strlen($s1) > strlen($s2)) ? $s1 : $s2;
      $sRight = (strlen($s1) > strlen($s2)) ? $s2 : $s1;
      $nLeftLength = strlen($sLeft);
      $nRightLength = strlen($sRight);
      if ($nLeftLength == 0)
        return $nRightLength;
      else if ($nRightLength == 0)
        return $nLeftLength;
      else if ($sLeft === $sRight)
        return 0;
      else if (($nLeftLength < $nRightLength) && (strpos($sRight, $sLeft) !== FALSE))
        return $nRightLength - $nLeftLength;
      else if (($nRightLength < $nLeftLength) && (strpos($sLeft, $sRight) !== FALSE))
        return $nLeftLength - $nRightLength;
      else {
        // Row 0 of the distance table: building the first j characters of the
        // right string from an empty left prefix costs j insertions.
        $nsDistance = range(0, $nRightLength);
        for ($nLeftPos = 1; $nLeftPos <= $nLeftLength; ++$nLeftPos)
        {
          $cLeft = $sLeft[$nLeftPos - 1];
          // $nDiagonal holds the previous row's value one column to the left.
          $nDiagonal = $nLeftPos - 1;
          $nsDistance[0] = $nLeftPos;
          for ($nRightPos = 1; $nRightPos <= $nRightLength; ++$nRightPos)
          {
            $cRight = $sRight[$nRightPos - 1];
            $nCost = ($cRight == $cLeft) ? 0 : 1;
            $nNewDiagonal = $nsDistance[$nRightPos];
            $nsDistance[$nRightPos] =
              min($nsDistance[$nRightPos] + 1,      // deletion
                  $nsDistance[$nRightPos - 1] + 1,  // insertion
                  $nDiagonal + $nCost);             // match or replacement
            $nDiagonal = $nNewDiagonal;
          }
        }
        return $nsDistance[$nRightLength];
      }
    }
    ?>
    luciole75w at no dot spam dot gmail dot com
    11 months ago
    The levenshtein function processes each byte of the input string individually, so for multibyte encodings such as UTF-8 it may give misleading results.

    Example with a French accented word:
    - levenshtein('notre', 'votre') = 1
    - levenshtein('notre', 'nôtre') = 2 (huh ?!)

    You can easily find a multibyte compliant PHP implementation of the levenshtein function but it will be of course much slower than the C implementation.
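
    For illustration, a pure-PHP version that compares characters instead of bytes might look like the sketch below. The name levenshtein_chars() is hypothetical, and the code assumes both inputs are valid UTF-8 (preg_split() with the /u modifier fails on invalid input). It is a plain two-row dynamic programming loop, not the algorithm used by the C implementation.

```php
<?php
// Hypothetical character-aware distance; assumes valid UTF-8 input.
function levenshtein_chars($s1, $s2)
{
    // Split into code points rather than bytes.
    $a = preg_split('//u', $s1, -1, PREG_SPLIT_NO_EMPTY);
    $b = preg_split('//u', $s2, -1, PREG_SPLIT_NO_EMPTY);
    $n = count($b);

    // Row 0: distance from an empty prefix of $s1 to each prefix of $s2.
    $prev = range(0, $n);
    foreach ($a as $i => $ca) {
        $curr = array($i + 1);
        for ($j = 1; $j <= $n; $j++) {
            $cost = ($ca === $b[$j - 1]) ? 0 : 1;
            $curr[$j] = min($prev[$j] + 1,          // deletion
                            $curr[$j - 1] + 1,      // insertion
                            $prev[$j - 1] + $cost); // match or replacement
        }
        $prev = $curr;
    }
    return $prev[$n];
}

// "\xC3\xB4" is the UTF-8 byte sequence for the letter o with circumflex.
echo levenshtein('notre', "n\xC3\xB4tre"), "\n";       // 2 (byte-wise)
echo levenshtein_chars('notre', "n\xC3\xB4tre"), "\n"; // 1 (character-wise)
```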

    Another option is to convert the strings to a single-byte (lossless) encoding so that they can feed the fast core levenshtein function.

    Here is the conversion function I used with a search engine storing UTF-8 strings, and a quick benchmark. I hope it will help.

    <?php
    // Convert an UTF-8 encoded string to a single-byte string suitable for
    // functions such as levenshtein.
    //
    // The function simply uses (and updates) a tailored dynamic encoding
    // (in/out map parameter) where non-ascii characters are remapped to
    // the range [128-255] in order of appearance.
    //
    // Thus it supports up to 128 different multibyte code points max over
    // the whole set of strings sharing this encoding.
    //
    function utf8_to_extended_ascii($str, &$map)
    {
        // find all multibyte characters (cf. utf-8 encoding specs)
        $matches = array();
        if (!preg_match_all('/[\xC0-\xF7][\x80-\xBF]+/', $str, $matches))
            return $str; // plain ascii string

        // update the encoding map with the characters not already met
        foreach ($matches[0] as $mbc)
            if (!isset($map[$mbc]))
                $map[$mbc] = chr(128 + count($map));

        // finally remap non-ascii characters
        return strtr($str, $map);
    }

    // Didactic example showing the usage of the previous conversion function but,
    // for better performance, in a real application with a single input string
    // matched against many strings from a database, you will probably want to
    // pre-encode the input only once.
    //
    function levenshtein_utf8($s1, $s2)
    {
        $charMap = array();
        $s1 = utf8_to_extended_ascii($s1, $charMap);
        $s2 = utf8_to_extended_ascii($s2, $charMap);

        return levenshtein($s1, $s2);
    }
    ?>

    Results (for about 6000 calls)
    - reference time core C function (single-byte) : 30 ms
    - utf8 to ext-ascii conversion + core function : 90 ms
    - full php implementation : 3000 ms
    dschultz at protonic dot com
    14 years ago
    It's also useful if you want to make some sort of registration page and you want to make sure that people who register don't pick usernames that are very similar to their passwords.
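
    A rough sketch of such a check follows. The function name password_too_similar() and the threshold of 3 edits are assumptions chosen for illustration, and the comparison is made case-insensitive on purpose.

```php
<?php
// Hypothetical check: true when fewer than $minDistance edits separate
// the password from the username (compared case-insensitively).
function password_too_similar($username, $password, $minDistance = 3)
{
    $distance = levenshtein(strtolower($username), strtolower($password));
    // A negative result means one string exceeded the length limit;
    // in that case the two are treated as not similar.
    return $distance >= 0 && $distance < $minDistance;
}

var_dump(password_too_similar('alice', 'Alice1'));                // bool(true)
var_dump(password_too_similar('alice', 'correct horse battery')); // bool(false)
```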
    "inerte" is my hotmail.com username
    11 years ago
    I am using this function to avoid duplicate information on my client's database.

    After retrieving a series of rows and assigning the results to an array of values, I loop over it with foreach, comparing each value with the user-supplied string using levenshtein().

    It helps to avoid people re-registering "John Smith", "Jon Smith" or "Jon Smit".

    Of course, I can't block the operation if the user really wants to, but a suggestion is displayed along the lines of: "There's a similar client with this name.", followed by the list of the similar strings.
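
    The approach described above could be sketched roughly as follows. The helper name find_similar_names(), the in-memory $clients array standing in for database rows, and the threshold of 2 are all assumptions made for the example.

```php
<?php
// Hypothetical helper: returns the stored names within $maxDistance edits of
// $input, closest first (names as keys, distances as values).
function find_similar_names($input, array $names, $maxDistance = 2)
{
    $similar = array();
    foreach ($names as $name) {
        $distance = levenshtein(strtolower($input), strtolower($name));
        // A negative result means levenshtein() rejected an over-long string.
        if ($distance >= 0 && $distance <= $maxDistance) {
            $similar[$name] = $distance;
        }
    }
    asort($similar);
    return $similar;
}

// Stand-in for rows fetched from the client table.
$clients = array('John Smith', 'Jane Doe', 'Jon Smit');
foreach (find_similar_names('Jon Smith', $clients) as $name => $distance) {
    echo "There's a similar client with this name: $name (distance $distance)\n";
}
```

    Here both "John Smith" and "Jon Smit" are reported at distance 1, while "Jane Doe" is too far away to be suggested.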
    justin at visunet dot ie
    9 years ago
    <?php

    /*********************************************************************
     * The below func, btlfsa, (better than levenshtein for spelling apps)
     * produces better results when comparing words like haert against
     * haart and heart.
     *
     * For example here is the output of levenshtein compared to btlfsa
     * when comparing 'haert' to 'herat, haart, heart, harte'
     *
     * btlfsa('haert','herat'); output is.. 3
     * btlfsa('haert','haart'); output is.. 3
     * btlfsa('haert','harte'); output is.. 3
     * btlfsa('haert','heart'); output is.. 2
     *
     * levenshtein('haert','herat'); output is.. 2
     * levenshtein('haert','haart'); output is.. 1
     * levenshtein('haert','harte'); output is.. 2
     * levenshtein('haert','heart'); output is.. 2
     *
     * In other words, if you used levenshtein, 'haart' would be the
     * closest match to 'haert', whereas btlfsa sees that it should be
     * 'heart'
     */

    function btlfsa($word1,$word2)
    {
        $score = 0;

        // For each char that is different add 2 to the score
        // as this is a BIG difference

        $remainder  = preg_replace("/[".preg_replace("/[^A-Za-z0-9\']/",' ',$word1)."]/i",'',$word2);
        $remainder .= preg_replace("/[".preg_replace("/[^A-Za-z0-9\']/",' ',$word2)."]/i",'',$word1);
        $score      = strlen($remainder)*2;

        // Take the difference in string length and add it to the score
        $w1_len  = strlen($word1);
        $w2_len  = strlen($word2);
        $score  += $w1_len > $w2_len ? $w1_len - $w2_len : $w2_len - $w1_len;

        // Calculate how many letters are in different locations
        // And add it to the score i.e.
        //
        // h e a r t
        // 1 2 3 4 5
        //
        // h a e r t     a e        = 2
        // 1 2 3 4 5   1 2 3 4 5