Last year marc wrote about our spell checking in code feature for Oxygene. This feature lets you continue using code completion after you made a typo in your code. The compiler will report an error as usual, pick the most likely choice from a list of possible identifiers and pretend that item was picked in the first place. The end result is that this code:
var s: Srting; for i: Integer :=0 to s.Length-1 do begin s[i]. end;
will show code completion for the elements of the string, despite the wrongly spelled srting, and the compiler does the same thing, it shows an error for srting as usual, but using s will make it act like it’s a string.
There are a number of algorithms we could have picked to find the best matching item for this, so after some trial and error we picked the Damerau–Levenshtein distance and modified it a bit. Given two strings, this algorithm returns an integer indicating how many changes there are. Removal, adding or changes are counted as a single change, as are swapping two characters that follow each other (unlike the plain Levenshtein distance which only does removal, adding and changing, so swapping would count as 2). After some optimizations we came up with:
method Compiler.DamerauLevenshteinDistance(string1: System.String; string2: System.String; aThres: Integer): System.Int32; begin if string1 = nil then exit length(string2); if string2 = nil then exit length(string1); var s1l := string1.Length; var s2l := string2.Length; if ((s1l - s2l) > aThres) or ((s2l - s1l) > aThres) then exit Int32.MaxValue; if s1l > s2l then begin var s := string2; string2 := string1; string1 := s; s2l := s1l; s1l := string1.Length; end; if aThres > s1l then aThres := s1l; var lDistArray1 := new Integer[s1l + 1]; var lDistArray2 := new Integer[s1l + 1]; var lPrev := lDistArray1; var lDist := lDistArray2; for i: Integer := 0 to s1l do begin if (i <= aThres) then lPrev[i] := i else lPrev[i] := Int32.MaxValue; lDist[i] := Int32.MaxValue; end; for j: Integer := 1 to s2l do begin var c := string2[j-1]; if (Integer(c) and $FF80) = 0 then c := Char(Integer(c) or 32); lDist := j; var min := j - aThres; if min <= 1 then min := 1; var max := j + aThres; if max >= s1l then max := s1l; if min > max then exit Int32.MaxValue; if min > 1 then lDist[min-1] := Int32.MaxValue; for i: Integer := min to max do begin var a := string1[i-1]; if (Integer(a) and $FF80) = 0 then a := Char(Integer(a) or 32); if a = c then lDist[i] := lPrev[i-1] else if (i < s1l) and (j < s2l) then begin var s1 := string1[i]; if (Integer(s1) and $FF80) = 0 then s1 := Char(Integer(s1) or 32); var s2 := string2[j]; if (Integer(s2) and $FF80) = 0 then s2 := Char(Integer(s2) or 32); if (s1 = c) and (a = s2) then lDist[i] := lPrev[i-1] else begin var n := lDist[i-1]; var n2 := lPrev[i]; if n2 < n then n := n2; n2 := lPrev[i - 1]; if n2 < n then n := n2; lDist[i] := 1 + n; end; end else begin var n := lDist[i-1]; var n2 := lPrev[i]; if n2 < n then n := n2; n2 := lPrev[i - 1]; if n2 < n then n := n2; lDist[i] := 1 + n; end; end; var lTmp := lDist; lDist := lPrev; lPrev := lTmp; end; exit lPrev[s1l]; end;
It’s not the exact Damerau–Levenshtein distance implementation as found on Wikipedia, the original is slower and takes a lot more memory, but it’s based on it, with some ideas gotten from StackOverflow posts. It might not give the same result 100% of the time, but it works really well for us. We run this against a list of possible identifiers in scope where normally an unknown identifier error would occur, and return the lowest item. The Fix-It dialog shows the 10 possible matches below a given threshold. Now this same code can be used for a lot of things. You can even make a regular spell checker with this method.
Subscribe to RemObjects Software Blog
Get the latest posts delivered right to your inbox