Fuzzy Matching in PHP and Excel

This summer as part of a tutoring program in South Africa we needed to link manually entered names and test scores with our master spreadsheet. I created an Excel VBA function to match the names by combining exact search by full or first/last name and then using character count comparisons and Levenshtein distance to match misspelled names.

Kids taking pre-test
South African students taking the math pretest

The trip itself was amazing - our team of US college students got to partner with the The Mamelodi Initiative to tutor 8th-10th graders and in our free time we hung out with our African friends, discussed how the message of Jesus relates to serving others and had some awesome adventures!

Each year the tutoring program has better tracked student achievement through pre- and post-tests. Part of my role was to help manage the program's data which for various reasons is stored in a large Excel spreadsheet.

We had a volunteers help enter the math pretest names and scores in a separate spreadsheet (perhaps a mistake in retrospect). Misspellings between the score sheets and our master spreadsheet inevitably resulted as "Mahlemone" became "Molemone" and "Kgomotso" was entered as "Khomotso."

I remembered that I had written PHP code for string alignment and edit distance which I used in a vocabulary training site I wrote in college. The dynamic programming Needleman-Wunsch algorithm does the alignment and the resulting similarity metric is directly related to Levenshtein distance.

PHP alignment function excerpt (full source):

function getStringAlignment($str1, $str2) {  
  // Set up variables and calculate edit distance matrix $D
  ... 
  
  // Main backtracking loop
  while ($i > 0 && $j > 0) {
    $charSimilarity = char_similarity($str1[$i - 1], $str2[$j - 1]);
    if ($D[$i][$j] - $charSimilarity == $D[$i - 1][$j - 1]) {
      if ($charSimilarity > 0) {
        // A matched character
        $align = 'M';                     
      } else {
        // A changed character
        $align = 'C';
      }            
      $i--;
      $j--;
      $score += $charSimilarity;
    } else if ($D[$i][$j] - $gap_score == $D[$i][$j - 1]) {
      // Skipped in str2, i.e. str1 added it
      $align = 'A';
      $j--;
      $score += $gap_score;
    } else if ($D[$i][$j] - $gap_score == $D[$i - 1][$j]) {
      // Skipped in str1, i.e. str1 deleted it
      $align = 'D';
      $i--;            
      $score += $gap_score;   
    }
    $alignment = $align . $alignment;
  }
  
  // Record added or deleted letters for the rest of the unfinished string
  ...
  
  return array($alignment, $score);
}

I translated the algorithm to Excel VBA script to compare the names for our tutoring program pretest. It worked great, but it was painfully slow to compare all 200 names in our test scores list with the 1000 or so names in our master sheet because the alignment algorithm takes $$\operatorname{O}(length_{string1}length_{string2})$$ time while a simple string comparison takes $$\operatorname{O}(min(length_{string1}, length_{string2}))$$ time.

To speed up the matching overall, I had it try exact full name matches first and then filter for exact first or last name matches. Then I had it do a simple unordered number of characters comparison to see if a name match was way off, and only if the number of characters was similar would it do the slower to compute Levenshtein distance comparison. I also had it try reversing first and last names as that sometimes happened in our data set.

Here's an excerpt of the lookup loop, which calculates the edit distance only when needed (full source):

For i = firstRow To lastRow
    valInList = list.Cells(i, 1).Value
       
    'For performance sake check the length, space location and unordered similarity first
    'before running the edit distance algorithm
    If Abs(Len(valInList) - Len(searchStr)) / Len(searchStr) _
          < LEN_RATIO_NOMATCH_THRESHOLD Then
        If Abs(InStr(searchStr, " ") - InStr(valInList, " ")) / Len(searchStr) _
              < STR_LOC_NOMATCH_THRESHOLD Then
            If UnorderedSimilarity(searchStr, valInList) / Len(searchStr) _
                  > UNORDERED_SIM_NOMATCH_THRESHOLD Then
                sim = LevenshteinDistance(searchStr, valInList)
                If sim > maxSim Then
                    maxSim = sim
                    maxSimI = i
                End If
            End If
        End If
    End If
Next

After doing this, the majority of names matched and all I had to do manually was sanity check them and fix a few here and there. It was a lot more fun to do this as a programming exercise than to search for them one-by-one!

There's also a StackOverflow post with some other Levenshtein distance implementations for Excel with some speed tune-ups.