Porter Stemmer 2: C# implementation

UPDATE: This code is now available at https://bitbucket.org/alski/englishstemmer

I’ve been busy at work on non-coding things for a couple of weeks. I also have a few processes I want to introduce for the next stage of development with my team. So I took the opportunity to write some code and trial the processes out myself.

First the results


As I mentioned before, there is an updated version of the Porter Stemmer, but there wasn’t a C# implementation of it.

The two files that implement the algorithm are highlighted. Everything else is there to enable unit testing. The tests were built up initially to help work through the logic of each method in the code, but they also include regression tests using all the examples given on the tartarus website.
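As a rough illustration of what such a regression test can look like, here is a minimal sketch. It is not the code from the repository: the `StemmerRegression` class, the `FindMismatches` method and the `stem` delegate are hypothetical names, and the delegate simply stands in for whatever stemmer is under test. It pairs each input word with the expected stem on the same line and reports every disagreement.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class StemmerRegression
{
    // Pairs each word with the expected stem at the same position and
    // returns the cases where the stemmer under test disagrees.
    // `stem` is a stand-in (assumption) for the real stemmer's entry point.
    public static List<(string Word, string Expected, string Actual)> FindMismatches(
        IEnumerable<string> words,
        IEnumerable<string> expectedStems,
        Func<string, string> stem)
    {
        return words
            .Zip(expectedStems, (word, expected) => (Word: word.Trim(), Expected: expected.Trim()))
            .Where(pair => pair.Word.Length > 0)   // skip blank lines
            .Select(pair => (Word: pair.Word, Expected: pair.Expected, Actual: stem(pair.Word)))
            .Where(result => result.Actual != result.Expected)
            .ToList();
    }
}
```

In a unit test the two sequences could come from `System.IO.File.ReadLines` on the downloaded word list and expected-output files (the paths are up to you), with the test asserting that the returned list is empty, or contains only the known exceptions.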

This implementation correctly stems the example files from the tartarus website with one exception: ‘fluently’ now stems to ‘fluent’ instead of ‘fluentli’, which matches the handling of all the other -ly words.

I’ve pretty much just followed the description on the snowball stemmers site for the ‘English’ stemmer, and referred to the snowball implementation of the algorithm where the description was ambiguous.
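To give a flavour of what following that description involves, here is a small sketch of one of its building blocks, the R1 and R2 regions. R1 is the part of the word after the first non-vowel that follows a vowel, and R2 is the same rule applied again inside R1. This is not lifted from my implementation: the `Regions` class and its method names are made up for illustration, and it deliberately leaves out the special cases in the description (the treatment of ‘y’ as a consonant in some positions, and the exceptional prefixes such as ‘gener’).

```csharp
using System;

public static class Regions
{
    // The English stemmer counts y as a vowel alongside a, e, i, o, u.
    private const string Vowels = "aeiouy";

    // Returns the index just past the first non-vowel that follows a vowel,
    // looking only at characters from `start` onwards. Returns word.Length
    // (an empty region at the end of the word) if there is no such non-vowel.
    public static int RegionStart(string word, int start)
    {
        for (int i = start + 1; i < word.Length; i++)
        {
            bool isNonVowel = Vowels.IndexOf(word[i]) < 0;
            bool followsVowel = Vowels.IndexOf(word[i - 1]) >= 0;
            if (isNonVowel && followsVowel)
            {
                return i + 1;
            }
        }
        return word.Length;
    }

    // R1 is found by scanning the whole word.
    public static int R1Start(string word) => RegionStart(word, 0);

    // R2 is found by applying the same rule within R1.
    public static int R2Start(string word) => RegionStart(word, R1Start(word));
}
```

For ‘beautiful’ this gives R1 = ‘iful’ and R2 = ‘ul’, which agrees with the worked example in the snowball documentation.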

Now, the development processes


This is the first completely test-driven piece of work I have completed. It’s also the one where I have paid attention to the coverage. I am quite pleased with the results.

The only parts that aren’t covered are extra code put in to check inputs are within range, and some of the Exception2() cases (shown).

My own tests give about 95% coverage; the remaining 2%, taking the total to 97%, comes entirely from Martin’s excellent vocabulary.txt and output.txt. (Strictly speaking, those two files alone give the full 97%, and my own tests just replicate 95% of it.)

What was interesting was that, for the first time in quite a while, I had a unit of work where I knew exactly when it was complete. I managed to do the simplest thing that worked and didn’t just add in another feature.