Streaming ForEach using a yield

I’ve been looking for an article on converting foreach loops into a less memory-intensive operation. I remember reading http://msdn2.microsoft.com/en-us/vcsharp/bb264519.aspx before; I just couldn’t find it again.

Basically, it uses the example of all the numbers in the New York phone book to develop a way of streaming loops with custom iterators and yield, instead of loading everything into memory and then looping through it all.
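As a rough sketch of the idea (the PhoneBook class and its method names are made up for illustration and are not taken from the MSDN article), here is an eager loader next to a streaming iterator built with yield:

```csharp
using System.Collections.Generic;
using System.IO;

public static class PhoneBook
{
    // Eager version: every line is held in memory before the caller sees one entry.
    public static List<string> LoadAll(TextReader reader)
    {
        var numbers = new List<string>();
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            numbers.Add(line);
        }
        return numbers;
    }

    // Streaming version: each line is handed to the caller as it is read,
    // so only the current line needs to be in memory.
    public static IEnumerable<string> Stream(TextReader reader)
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            yield return line;
        }
    }
}
```

A caller uses both the same way, e.g. foreach (string number in PhoneBook.Stream(reader)), but the streaming version reads nothing until the loop asks for the next item.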

Porter Stemmer 2: C# implementation

UPDATE: This code is now available via https://bitbucket.org/alski/englishstemmer

I’ve been busy at work on non-coding things for a couple of weeks. I also have a few processes I want to introduce for the next stage of development with my team. So I took the opportunity to write some code and trial the processes out myself.

First the results

image

As I mentioned before, there is an updated version of the Porter Stemmer, but there wasn’t a C# implementation of it.

The two files that implement the algorithm are highlighted. Everything else is there to enable unit testing. The tests were built up initially to assist with the logic of each method within the code, but they also include regression tests using all the examples given on the tartarus website.

This implementation correctly stems the example files from the tartarus website with one exception: ‘fluently’ now stems to ‘fluent’ instead of ‘fluentli’. This matches all the other -ly words.

I’ve pretty much just followed the description on the snowball stemmers site for the ‘English’ stemmer, and referenced the implementation of the algorithm in snowball where I had ambiguities.

Now, additionally developed processes

image2

This is the first 100% completely Test Driven piece of work I have completed. It’s also the one where I have paid attention to the Coverage. I am quite pleased with the results.

The only parts that aren’t covered are extra code put in to check inputs are within range, and some of the Exception2() cases (shown).

My own tests give about 95% coverage; the remaining 2% comes entirely from Martin’s excellent vocabulary.txt and output.txt (strictly, the full 97% comes from these files, and my tests just replicate 95% of it).

What was interesting was that for the first time in quite a while I have had a unit of work that I knew exactly when it was completed. I managed to do the simplest thing that worked and didn’t just add in another feature.

Download

https://bitbucket.org/alski/englishstemmer

Functionality v Framework

This is going to sound so simple, so bear with me, but at the moment it’s becoming my personal mantra.

There is a difference between the functionality of a system and the framework it runs in.

Let’s take a simple example: we have a process at work. Basically, it gets a list of files from a database, copies the files over, and updates a different database to point to the new files. It is currently hosted in a Windows service. Can you separate the functionality from the framework?

Functionality

It gets a list of files from a database, copies the files over, and updates a different database to point to the new files

Framework

Hosted as a service

Why does this matter?

Well, the problem is that I also need to run the functionality on an ad-hoc basis with some special parameters. So what do we do? Simply, we split the functionality from the framework: put the functionality in a separate assembly, then call it once from the service and again separately from a console application or a WinForms GUI.

But that’s only the first level

We can consider the same at a method level, where we have a for loop with a code body. It is impossible to test a single iteration of the logic in the for body without going through the framework.

            foreach (string name in _list.Keys)
            {
                progress.DoOne("Adding " + name);

                TableRow tr = resultsTable.AddRow();
                tr.AddCell().Value = name;
                Dictionary<DateTime, TestResult> data = _list[name];
                //now add cells
                foreach (DateTime dt in _dates)
                {
                    TableCell cell = tr.AddCell();
                    if (data.ContainsKey(dt))
                    {
                        switch (data[dt])
                        {
                            case TestResult.Fail:
                                cell.Value = "Fail";
                                cell.BGColor = Color.Red;
                                break;

                            case TestResult.Pass:
                                cell.Value = "Ok";
                                cell.BGColor = Color.Green;
                                break;

                            case TestResult.Untested:
                                cell.Value = "";
                                cell.BGColor = Color.Yellow;
                                break;

                        }
                    }
                }
            }

What is more, we can ONLY call this functionality with its framework. We can refactor the body of a for loop out to a method that takes parameters, or even to a Visitor pattern. This lets us call it separately; for example, we can now unit test the functionality.
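A minimal sketch of that refactoring, assuming the switch from the loop above is moved into a method. CellFormat and ResultFormatter are hypothetical names standing in for the TableCell being filled in, and the colour is a plain string so the sketch stays self-contained:

```csharp
using System;
using System.Collections.Generic;

public enum TestResult { Fail, Pass, Untested }

// Hypothetical stand-in for the text and colour written to a TableCell.
public sealed class CellFormat
{
    public string Text { get; }
    public string ColourName { get; }

    public CellFormat(string text, string colourName)
    {
        Text = text;
        ColourName = colourName;
    }
}

public static class ResultFormatter
{
    // The former loop body: decides what one cell shows for one date.
    // Returns null when there is no result for that date (the cell stays empty).
    public static CellFormat Format(IDictionary<DateTime, TestResult> data, DateTime date)
    {
        TestResult result;
        if (!data.TryGetValue(date, out result))
        {
            return null;
        }

        switch (result)
        {
            case TestResult.Fail: return new CellFormat("Fail", "Red");
            case TestResult.Pass: return new CellFormat("Ok", "Green");
            case TestResult.Untested: return new CellFormat("", "Yellow");
            default: throw new ArgumentOutOfRangeException(nameof(data));
        }
    }
}
```

The loops then only walk the names and dates and copy the returned values into each cell, while a unit test can call ResultFormatter.Format directly for a single date.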

At the class level we can also refactor so that we separate the multi-threading code out and leave the business logic. This lets us host the logic with or without the threading. So maybe, once again we can unit test the logic separately, or host in different applications as before.
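To make the class-level split concrete, here is a small sketch under invented names: CopyPlanner is the business logic and knows nothing about threads, while BackgroundHost is the framework and knows nothing about files.

```csharp
using System;
using System.IO;
using System.Threading;

// Functionality: works out where a copied file should live. No threading here.
public sealed class CopyPlanner
{
    private readonly string _targetFolder;

    public CopyPlanner(string targetFolder)
    {
        _targetFolder = targetFolder;
    }

    public string DestinationFor(string sourcePath)
    {
        return Path.Combine(_targetFolder, Path.GetFileName(sourcePath));
    }
}

// Framework: runs any piece of work on a background thread.
public static class BackgroundHost
{
    public static Thread Start(Action work)
    {
        var thread = new Thread(() => work()) { IsBackground = true };
        thread.Start();
        return thread;
    }
}
```

A unit test can call DestinationFor directly on the test thread, and the service can wrap the same call in BackgroundHost.Start when it needs the work done in the background.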

So what does this mean?

I think it might be time for Al’s First law (see Haacked’s law).

The greatest barrier to reuse and testing is a mix of functionality and framework in the same unit.

I wonder if there will ever be a 2nd?

Honest, I’m not copying off Jeff

After finally pressing post on my last post, about the Singleton design pattern, I was quite surprised to find Jeff Atwood talking about design patterns himself in Rethinking Design Patterns. I was, however, quite pleased to see that he hasn’t contradicted my conclusions.

Jeff proposes that Design Patterns is No Silver Bullet.

But I have two specific issues with the book:

  1. Design patterns are a form of complexity. As with all complexity, I’d rather see developers focus on simpler solutions before going straight to a complex recipe of design patterns.
  2. If you find yourself frequently writing a bunch of boilerplate design pattern code to deal with a “recurring design problem”, that’s not good engineering-it’s a sign that your language is fundamentally broken.

In fact, in the comments he links to a previous post, Head First Design Patterns, where he argues that this other book contradicts itself. The first part of this quote comes from the book.

First of all, when you design, solve things in the simplest way possible. Your goal should be simplicity, not “how can I apply a pattern to this problem.” Don’t feel like you aren’t a sophisticated developer if you don’t use a pattern to solve a problem. Other developers will appreciate and admire the simplicity of your design. That said, sometimes the best way to keep your design simple and flexible is to use a pattern.

Filling 593 pages with rah-rah pattern talk, and then tacking this critical guidance on at the end of the book is downright irresponsible. This advice should be in 72 point blinking Comic Sans on the very first page.

This is very much what I wanted to express. I think that people need to learn and learn and learn. They go through stages where their knowledge doesn’t have sufficient maturity to let them come to the right conclusion. They have to make mistakes first. One of those mistakes is trying to use patterns to solve every problem.

Where I think Jeff and I differ is that he believes that the books and other sources should come with warnings that they need to be used in moderation (strangely enough, he also comes to a conclusion regarding moderation in the following day’s post, The Technology Backlash). I believe that we need to make the mistakes so that we learn why they are mistakes.

I suppose the big question is,

Who can afford for us to learn on their time and make mistakes in their code base as we develop design maturity?

Design maturity of the singleton pattern

The more I am exposed to ‘professional developers’ the more I begin to realise that there are multiple phases in their design maturity. One of the best ways to demonstrate this is to look at how people tend to use the Singleton pattern.

Phases over time

  • Oblivious
    • The developer is unaware of Singleton. They don’t use it at all.
    • If they come across it in a piece of code they don’t realise it is a common, recurring piece of code.
  • Discovery
    • The developer comes across an example piece of code, an article or a book that contains singleton and thinks that is really nice.
    • They understand how it is used in this context.
  • Familiarity
    • The developer starts to use Singleton for the first time in one piece of work.
    • They develop an understanding of the problem it solves.
  • Abuse
    • The developer starts seeing the Singleton problem everywhere.
    • They create a lot of classes as Singletons.
  • Maturity
    • The developer realises that there are alternatives to the Singleton pattern.
    • They start to use Singleton as appropriate.

Now I currently am in a position where I have been through this cycle but I am now starting to see others doing the same thing. I think the important lesson to speed your progress from Abuse to Maturity is the common saying

Do the simplest thing that works.

Isn’t Singleton simple?

Singleton is not the simplest way to ensure that one and only one object is available at all times within the lifetime of a program. That is just a static instance.

static object _instance = new object();

Common usage of Singleton (including the Gang of Four example) combines lifetime availability with lazy instantiation.  The issue here is whether lazy instantiation is required. Would it not be simpler to just use a static constructor?
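The two options can be put side by side. This is only a sketch with invented class names (Settings and LazySettings), and the lazy version is the plain null-check form, which is not thread-safe as written:

```csharp
// Simplest option: a static instance, created by the runtime's type
// initialisation the first time the class is used.
public sealed class Settings
{
    public static readonly Settings Instance = new Settings();

    private Settings() { }
}

// Classic lazy Singleton: the instance is only created on first access.
public sealed class LazySettings
{
    private static LazySettings _instance;

    private LazySettings() { }

    public static LazySettings Instance
    {
        get
        {
            // Not thread-safe: two threads could both see null here.
            if (_instance == null)
            {
                _instance = new LazySettings();
            }
            return _instance;
        }
    }
}
```

Both give callers exactly one object through Instance. The second adds a branch and a threading concern, which only pays off if delaying construction actually matters.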

Now I can’t answer that for your project. What I can tell you is that today I have written my first Singleton in nearly 2 years, because previously, in all cases where I needed a long-lived object that there was only one of, I just used a static instance. Each time, these objects were wrapped inside a class, which itself was not static, and only internal members of that class needed access to the long-lived object.

So why use a Singleton

Today, however, I find myself attempting to share a cache between two dialogs. Both dialogs use the cache, and both use keyed indexers (i.e. cache[key]) to get and set the cached items. Unfortunately, C# does not allow static indexers, so the definition of an indexer requires an instance.

object this[string key]
{ 
    get { ... }
    set { ... }
}

Now I could have instead dumped the indexer and gone with static methods, but my aim here was not to re-write the underlying cache, just to provide a Dictionary<string, Assembly>. So, sticking with the Do the simplest thing that works idea, my cache Is-a Dictionary<>. This means that I don’t have to re-write all of the methods that I want to expose. In fact, I expose much more functionality than I currently use. This meant that all I had to write was the singleton instance.

/// <remarks>
/// Would just be a Dictionary&lt;String, Assembly&gt; but
/// I want to keep only a single static copy
/// </remarks>
class ChooserCache : Dictionary<string, Assembly>
{
    // The shared instance must be a ChooserCache so that Instance can return it.
    private static readonly ChooserCache _cached = new ChooserCache();

    private ChooserCache() { }

    public static ChooserCache Instance
    {
        get { return _cached; }
    }
}

The alternative was to go the Has-a route and implement each and every method as a static which referred to a static instance of the Dictionary. This also means that I would have to add new methods as I use more functionality from the Dictionary itself.
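For comparison, the Has-a route might look something like this sketch. StaticChooserCache and its member names are invented for illustration; the point is how each piece of Dictionary functionality needs its own hand-written wrapper:

```csharp
using System.Collections.Generic;
using System.Reflection;

public static class StaticChooserCache
{
    // The wrapped dictionary; callers never see it directly.
    private static readonly Dictionary<string, Assembly> _cached =
        new Dictionary<string, Assembly>();

    // Replaces cache[key] = value, since a static class cannot have an indexer.
    public static void Set(string key, Assembly value)
    {
        _cached[key] = value;
    }

    // Replaces reading cache[key].
    public static bool TryGet(string key, out Assembly value)
    {
        return _cached.TryGetValue(key, out value);
    }

    public static int Count
    {
        get { return _cached.Count; }
    }

    // Remove, Clear, Keys and so on would each need another wrapper here.
}
```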

Coding Horror: C# and the Compilation Tax

I’ve been watching this with interest

Coding Horror: C# and the Compilation Tax
Dennis Forbes – Pragmatic Software Development : Process, People, Programming
Coding Horror: Background Compilation and Background Spell Checking
Knowing.NET – Death and Taxes: Compilation, Type, and Test

My opinion is quite simple. Give people what they want.

If not they will implement it themselves.

I use a workaround. I have three build commands.

  1. I’ve just changed some code, and I am not 100% certain it will compile, so I want a quick answer. I use F6 redefined to Build.BuildSelection. This builds just the current project and its dependencies. It doesn’t require me to press No because it doesn’t ask me if I want to run the last good result. It just lets me see how many errors there are and start fixing them with Ctrl-Shift-F12. It’s the fastest way I know to compile just what I have changed.
  2. I’ve got an assembly building. Now I press either,
    Repeat Test Run from TestDriven.net (Jamie, Can we have a keyboard shortcut please.)
    or F5.
    Either of these will now build just the application I need to run/test, and will detect that I have already built some of the dependencies so it doesn’t spend time rebuilding what it already has. 
  3. Finally it’s Ctrl-Shift-B. This is the check-in build. Build everything. Run all your unit tests, and finally check in.

It gets me closer to an ideal. I am completely with Larry O’Brien on what I really want, but I am not going to get it until the next version of Studio, unless somebody comes up with a better IDE.

Not all developers like fixed width fonts

In particular, I don’t. Lots of times other developers come over to me and look confused as they see Tahoma, 8pt. They have all sorts of comments about it not being right; in fact, my absolute favourite was that it was a waste of memory. Sorry, last time I looked, Tahoma 8pt is the font used everywhere else on my desktop. Windows is probably saving memory by sharing it around.

However the problem is that some applications don’t let you make your own decisions. VS2003 only ever allowed you to get a bold variant of some Fonts e.g. Tahoma, but some other applications go one step further. They only let you access a fixed width font.

Sorry, but these applications are making a decision that should be mine, and they are choosing to affect my productivity. There is considerable research into what makes text faster and easier to read (i.e. you make fewer errors when you read it).

http://blogs.msdn.com/fontblog/archive/2005/12/13/503236.aspx
http://blogs.msdn.com/fontblog/archive/2005/10/28/486511.aspx
http://blogs.msdn.com/fontblog/archive/2005/11/16/493452.aspx – Quite fun

So given all this research (apologies that it’s all from one source), why do so many applications force you into fixed width fonts?

Generating namespaces

One common problem I find is that developers do not understand the difference between deployment units and namespaces. The most common confusion is that these can often be the same, so they keep creating new assemblies.

However, just fire up the object browser and take a look at the system assemblies. In particular, look at system.dll and mscorlib.dll. Notice how the System.Web namespace is spread over mscorlib.dll, system.dll and system.web.dll. Notice how System.Diagnostics is not a separate dll, just part of system.dll.

So instead of creating a new project, create a new folder. Projects create deployment complexity, whereas folders simplify projects by grouping things together. While you are still working inside a single project you can still move things around, and you haven’t needed to change all sorts of deployment scripts. Consider this diagram

If you think that the blue shapes relate to assemblies, consider instead that they might simply be namespaces.

Only create projects and separate assemblies once you have multiple applications and there is reuse. Having something reused in multiple applications is where multiple assemblies start, not before. This diagram again shows namespaces, where the reusable parts are shipped in a single assembly.
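In code terms, a folder per namespace inside one project might look like the sketch below. The Acme.Orders names are made up; both namespaces compile into the same assembly, so nothing about deployment changes when classes move between them.

```csharp
// Folder "Data" in the project.
namespace Acme.Orders.Data
{
    public class OrderStore
    {
        // Placeholder body; a real store would query something.
        public int CountOrders()
        {
            return 0;
        }
    }
}

// Folder "Reports" in the same project, and therefore the same assembly.
namespace Acme.Orders.Reports
{
    using Acme.Orders.Data;

    public class OrderSummary
    {
        public string Describe(OrderStore store)
        {
            return store.CountOrders() + " orders";
        }
    }
}
```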

Singleton is Evil

Well, I’ve read a bit of Steve Yegge before, but I hadn’t come across http://steve.yegge.googlepages.com/singleton-considered-stupid until now. He might have a point too.

I am currently re-writing an application I’ve written before, using lots of lessons I’ve recently learned.

Before I had quite a few singletons. Now so far I have a factory class.

Before I had a hierarchy of classes, each having a clever constructor that could process data and return a very specific instance of a base class. Now I have a simple class, and that previously mentioned factory handles the different data types.

Who knows, maybe I’m getting to be a better developer.
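The application itself isn’t shown here, so the following is only a guess at the shape of that change, with invented names: Measurement is the simple class, and MeasurementFactory is the one place that knows about the different kinds of incoming data.

```csharp
using System;
using System.Globalization;

// The simple class: it just holds a value and has no parsing logic.
public sealed class Measurement
{
    public double Value { get; }

    public Measurement(double value)
    {
        Value = value;
    }
}

// The factory handles the different data types.
public static class MeasurementFactory
{
    public static Measurement Create(object raw)
    {
        if (raw is double)
        {
            return new Measurement((double)raw);
        }
        if (raw is int)
        {
            return new Measurement((int)raw);
        }

        var text = raw as string;
        if (text != null)
        {
            return new Measurement(double.Parse(text, CultureInfo.InvariantCulture));
        }

        throw new ArgumentException("Unsupported data type", nameof(raw));
    }
}
```

Supporting a new data type then means adding one branch to the factory rather than another subclass with its own clever constructor.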