Pages

How to write your own spell checker ? C#.net

Hi There,
Today I wrote a readable & usable implementation of Norvig's Spell corrector.
Norvi's implementation of spell check is awesome! On his site he has link to many other implementation in various languages, including C#.

Most of the implementation are focused on minimum number of lines , so they are not readable. I have written my own implementation which you can use on your own risk.

Here is the source code (DOWNLOAD)

You can also check-out entire code using this svn command




            svn checkout https://jugad.googlecode.com/svn/trunk/C%23/SpellCheck SpellCheck




How Algorithm works?



1.Initially Algorithms builds a dictionary* data-structure of words, from a huge English text. (so algorithm does not uses dictionary# file as such). This dictionary* contains words-to-frequency-of-occurrence mapping. lets say it as nWords.



2.When you will fetch a word(lets say 'w') for spell correction,Then the algorithm builds a list of all possible words that can be formed from w by doing following operation

  1. Deletion (delete 1 character, for all possible combination)
  2. Transposition (flip 1 character, for all possible combination)
  3. Alteration (modify 1 character, for all possible combination)
  4. Insertion (insert 1 character, for all possible combination)

Lets say name of this list is 'edits' , note that all words in this list are just 1 step away from original word.
3. If edits has 1 or more words that exists in nWords, then it returns the word from nWords with maximum frequency, as a result.
4. If No words in 'edits' are present in nWords, then it creates another edits using edits to get second level word (word which are 2 step away from original input word)
5.Repeats from step-3 (or give up if your are already 2/3 steps apart from original word 'w')
Read more Here.

How To Use This code ?
To use this code instantiate class spell by passing huge-string of English text (from where it will build dictionary)
Then call function correct() to get the corrected word as result.

Note:This code is just to demonstrate the algorithm, modify constructor to build dictionary directly from file. You can also avoid calling sort operation in function 'correct'. Thoroughly understand the code before using it. Notify me if you find any critical problem.


*dictionary data-structure of programming
#real dictionary

10 comments:

  1. excellent translation of the code. I have a question, how this code works with Spanish?

    ReplyDelete
  2. @slietra I am just wondering if this comment is spam??? u really have a question??

    ReplyDelete
  3. Yes, i thinking in create a word games and i need check if a word existe really.
    Thanks :-)
    And sorry for my bad english

    ReplyDelete
  4. hello at function of source code have error,
    contains 2 if(nWords.ContainsKey(s)) ?

    foreach(string s in list)
    {
    if(nWords.ContainsKey(s))
    {
    if(nWords.ContainsKey(s))
    {
    candidates.Add(new wordweightpair(nWords[s], s));
    }
    }
    }

    ReplyDelete
  5. Hi Nicola,
    I found the error , and i will fix the same.
    just to confirm u are talking about the redundant check , right???

    ReplyDelete
  6. now I'm now trying the algorithm, my interested test to speed ,
    as it would be better to build the string for the dictionary, you have an example?

    ReplyDelete
  7. yes download the fresh code from here http://code.google.com/p/jugad/downloads/detail?name=SpellCheck.zip

    use bigparts.txt (file)

    ReplyDelete
  8. is there a way to spell-check other languages using this script? can it somehow do that via google translate API? thanks :-)

    ReplyDelete
  9. @slierta Hey I never replied to you sorry!
    Yes this will work for Spanish as well, basically this implementation never uses any dictionary, dictionary is built using some long text (of any language ) on the fly. You can keep improving the dictionary as well.

    @Kagz
    Yes, its simple just pass a huge text (which is correct), it will build that language dictionary, thats it!

    ReplyDelete
  10. I'm playing with this code but it is a very hyperactive spell check. For example, this returns na:

    string[] cities = {"clyde", "fremont", "new york", "na"};
    //string cities = "clyde fremont new york na";
    spell spell = new spell(cities);
    string result = spell.correct("c");
    Console.WriteLine(result);
    Console.ReadLine();

    ReplyDelete

Your comment will inspire me, Please leave your comment