(PHP 4 >= 4.0.1, PHP 5)
levenshtein — Calculate Levenshtein distance between two strings
$str1
, string $str2
)$str1
, string $str2
, int $cost_ins
, int $cost_rep
, int $cost_del
)
The Levenshtein distance is defined as the minimal number of
characters you have to replace, insert or delete to transform
str1
into str2
.
The complexity of the algorithm is O(m*n),
where n and m are the
length of str1
and
str2
(rather good when compared to
similar_text(), which is O(max(n,m)**3),
but still expensive).
In its simplest form the function will take only the two
strings as parameter and will calculate just the number of
insert, replace and delete operations needed to transform
str1
into str2
.
A second variant will take three additional parameters that define the cost of insert, replace and delete operations. This is more general and adaptive than variant one, but not as efficient.
str1
One of the strings being evaluated for Levenshtein distance.
str2
One of the strings being evaluated for Levenshtein distance.
cost_ins
Defines the cost of insertion.
cost_rep
Defines the cost of replacement.
cost_del
Defines the cost of deletion.
This function returns the Levenshtein-Distance between the two argument strings or -1, if one of the argument strings is longer than the limit of 255 characters.
Example #1 levenshtein() example
<?php
// input misspelled word
$input = 'carrrot';
// array of words to check against
$words = array('apple','pineapple','banana','orange',
'radish','carrot','pea','bean','potato');
// no shortest distance found, yet
$shortest = -1;
// loop through words to find the closest
foreach ($words as $word) {
// calculate the distance between the input word,
// and the current word
$lev = levenshtein($input, $word);
// check for an exact match
if ($lev == 0) {
// closest word is this one (exact match)
$closest = $word;
$shortest = 0;
// break out of the loop; we've found an exact match
break;
}
// if this distance is less than the next found shortest
// distance, OR if a next shortest word has not yet been found
if ($lev <= $shortest || $shortest < 0) {
// set the closest match, and shortest distance
$closest = $word;
$shortest = $lev;
}
}
echo "Input word: $input\n";
if ($shortest == 0) {
echo "Exact match found: $closest\n";
} else {
echo "Did you mean: $closest?\n";
}
?>
The above example will output:
Input word: carrrot Did you mean: carrot?
[EDITOR'S NOTE: original post and 2 corrections combined into 1 -- mgf]
Here is an implementation of the Levenshtein Distance calculation that only uses a one-dimensional array and doesn't have a limit to the string length. This implementation was inspired by maze generation algorithms that also use only one-dimensional arrays.
I have tested this function with two 532-character strings and it completed in 0.6-0.8 seconds.
<?php
/*
* This function starts out with several checks in an attempt to save time.
* 1. The shorter string is always used as the "right-hand" string (as the size of the array is based on its length).
* 2. If the left string is empty, the length of the right is returned.
* 3. If the right string is empty, the length of the left is returned.
* 4. If the strings are equal, a zero-distance is returned.
* 5. If the left string is contained within the right string, the difference in length is returned.
* 6. If the right string is contained within the left string, the difference in length is returned.
* If none of the above conditions were met, the Levenshtein algorithm is used.
*/
function LevenshteinDistance($s1, $s2)
{
$sLeft = (strlen($s1) > strlen($s2)) ? $s1 : $s2;
$sRight = (strlen($s1) > strlen($s2)) ? $s2 : $s1;
$nLeftLength = strlen($sLeft);
$nRightLength = strlen($sRight);
if ($nLeftLength == 0)
return $nRightLength;
else if ($nRightLength == 0)
return $nLeftLength;
else if ($sLeft === $sRight)
return 0;
else if (($nLeftLength < $nRightLength) && (strpos($sRight, $sLeft) !== FALSE))
return $nRightLength - $nLeftLength;
else if (($nRightLength < $nLeftLength) && (strpos($sLeft, $sRight) !== FALSE))
return $nLeftLength - $nRightLength;
else {
$nsDistance = range(1, $nRightLength + 1);
for ($nLeftPos = 1; $nLeftPos <= $nLeftLength; ++$nLeftPos)
{
$cLeft = $sLeft[$nLeftPos - 1];
$nDiagonal = $nLeftPos - 1;
$nsDistance[0] = $nLeftPos;
for ($nRightPos = 1; $nRightPos <= $nRightLength; ++$nRightPos)
{
$cRight = $sRight[$nRightPos - 1];
$nCost = ($cRight == $cLeft) ? 0 : 1;
$nNewDiagonal = $nsDistance[$nRightPos];
$nsDistance[$nRightPos] =
min($nsDistance[$nRightPos] + 1,
$nsDistance[$nRightPos - 1] + 1,
$nDiagonal + $nCost);
$nDiagonal = $nNewDiagonal;
}
}
return $nsDistance[$nRightLength];
}
}
?>
The levenshtein function processes each byte of the input string individually. Then for multibyte encodings, such as UTF-8, it may give misleading results.
Example with a french accented word :
- levenshtein('notre', 'votre') = 1
- levenshtein('notre', 'nôtre') = 2 (huh ?!)
You can easily find a multibyte compliant PHP implementation of the levenshtein function but it will be of course much slower than the C implementation.
Another option is to convert the strings to a single-byte (lossless) encoding so that they can feed the fast core levenshtein function.
Here is the conversion function I used with a search engine storing UTF-8 strings, and a quick benchmark. I hope it will help.
<?php
// Convert an UTF-8 encoded string to a single-byte string suitable for
// functions such as levenshtein.
//
// The function simply uses (and updates) a tailored dynamic encoding
// (in/out map parameter) where non-ascii characters are remapped to
// the range [128-255] in order of appearance.
//
// Thus it supports up to 128 different multibyte code points max over
// the whole set of strings sharing this encoding.
//
function utf8_to_extended_ascii($str, &$map)
{
// find all multibyte characters (cf. utf-8 encoding specs)
$matches = array();
if (!preg_match_all('/[\xC0-\xF7][\x80-\xBF]+/', $str, $matches))
return $str; // plain ascii string
// update the encoding map with the characters not already met
foreach ($matches[0] as $mbc)
if (!isset($map[$mbc]))
$map[$mbc] = chr(128 + count($map));
// finally remap non-ascii characters
return strtr($str, $map);
}
// Didactic example showing the usage of the previous conversion function but,
// for better performance, in a real application with a single input string
// matched against many strings from a database, you will probably want to
// pre-encode the input only once.
//
function levenshtein_utf8($s1, $s2)
{
$charMap = array();
$s1 = utf8_to_extended_ascii($s1, $charMap);
$s2 = utf8_to_extended_ascii($s2, $charMap);
return levenshtein($s1, $s2);
}
?>
Results (for about 6000 calls)
- reference time core C function (single-byte) : 30 ms
- utf8 to ext-ascii conversion + core function : 90 ms
- full php implementation : 3000 ms
It's also useful if you want to make some sort of registration page and you want to make sure that people who register don't pick usernames that are very similar to their passwords.
I am using this function to avoid duplicate information on my client's database.
After retrieving a series of rows and assigning the results to an array values, I loop it with foreach comparing its levenshtein() with the user supplied string.
It helps to avoid people re-registering "John Smith", "Jon Smith" or "Jon Smit".
Of course, I can't block the operation if the user really wants to, but a suggestion is displayed along the lines of: "There's a similar client with this name.", followed by the list of the similar strings.
<?php
/*********************************************************************
* The below func, btlfsa, (better than levenstien for spelling apps)
* produces better results when comparing words like haert against
* haart and heart.
*
* For example here is the output of levenshtein compared to btlfsa
* when comparing 'haert' to 'herat, haart, heart, harte'
*
* btlfsa('haert','herat'); output is.. 3
* btlfsa('haert','haart'); output is.. 3
* btlfsa('haert','harte'); output is.. 3
* btlfsa('haert','heart'); output is.. 2
*
* levenshtein('haert','herat'); output is.. 2
* levenshtein('haert','haart'); output is.. 1
* levenshtein('haert','harte'); output is.. 2
* levenshtein('haert','heart'); output is.. 2
*
* In other words, if you used levenshtein, 'haart' would be the
* closest match to 'haert'. Where as, btlfsa sees that it should be
* 'heart'
*/
function btlfsa($word1,$word2)
{
$score = 0;
// For each char that is different add 2 to the score
// as this is a BIG difference
$remainder = preg_replace("/[".preg_replace("/[^A-Za-z0-9\']/",' ',$word1)."]/i",'',$word2);
$remainder .= preg_replace("/[".preg_replace("/[^A-Za-z0-9\']/",' ',$word2)."]/i",'',$word1);
$score = strlen($remainder)*2;
// Take the difference in string length and add it to the score
$w1_len = strlen($word1);
$w2_len = strlen($word2);
$score += $w1_len > $w2_len ? $w1_len - $w2_len : $w2_len - $w1_len;
// Calculate how many letters are in different locations
// And add it to the score i.e.
//
// h e a r t
// 1 2 3 4 5
//
// h a e r t a e = 2
// 1 2 3 4 5 1 2 3 4 5
&