What’s better for creating distinct data structures: HashSet or LINQ’s Distinct()?

Anthony Pegram said it best: use the right tool for the job. I say this because Distinct() and HashSet aren’t that different when it comes to performance. Use a HashSet when the collection should always hold only distinct items; it also tells the next programmer that duplicates can’t be added to it. Use a normal List<T> and call .Distinct() on it when you need to add duplicates now and remove them later. The intention matters.
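
To make the difference in intent concrete, here is a minimal sketch. It assumes C# 9 or later (top-level statements), and the variable names are illustrative only. A HashSet<T> refuses a duplicate at insertion time, while a List<T> accepts it and leaves the filtering to a later Distinct() call.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// A HashSet<T> rejects duplicates at insertion time.
var set = new HashSet<int>();
bool addedFirst = set.Add(42);   // true: 42 was not present yet
bool addedAgain = set.Add(42);   // false: the duplicate is ignored

// A List<T> accepts duplicates; they are removed on demand.
var list = new List<int> { 42, 42, 7 };
List<int> unique = list.Distinct().ToList();

Console.WriteLine($"{addedFirst} {addedAgain} {set.Count}"); // True False 1
Console.WriteLine($"{list.Count} {unique.Count}");           // 3 2
```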

In general,

a) A HashSet may not do any good if you’re adding new objects from the db and you haven’t specified custom equality of your own (an Equals and GetHashCode override, or an IEqualityComparer<T>). Every object from the db can be a new instance for your HashSet (if you are just new-ing them up), and that will lead to duplicates in the collection. In that case use a normal List<T>.

b) If you do have an equality comparer defined for the HashSet, and your collection should always hold only distinct objects, use a HashSet.

c) If you do have an equality comparer defined for the HashSet, and you want only distinct objects from the db but the collection need not always hold only distinct objects (i.e. duplicates need to be added later), a faster approach is to load the items from the db into a HashSet and then return a regular list from that HashSet.

d) The best thing you can do is give the task of removing duplicates to the database. That’s the right tool for the job, and it’s first class at it!
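
Cases (a) to (c) can be sketched as follows. The Customer type and CustomerIdComparer are hypothetical, invented purely for illustration, and the in-memory list stands in for rows materialized from a database. The sketch assumes C# 9 or later.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

// Two separate instances holding the same data, as if each row had been new-ed up.
var rows = new List<Customer>
{
    new Customer { Id = 1, Name = "Ann" },
    new Customer { Id = 1, Name = "Ann" },
    new Customer { Id = 2, Name = "Bob" },
};

// Case (a): no custom equality, so the set compares references and keeps all three.
var byReference = new HashSet<Customer>(rows);

// Case (b): with a comparer, the set keeps one entry per Id.
var byId = new HashSet<Customer>(rows, new CustomerIdComparer());

// Case (c): deduplicate through the set, then hand back a regular list.
List<Customer> asList = byId.ToList();

Console.WriteLine(byReference.Count); // 3
Console.WriteLine(byId.Count);        // 2
Console.WriteLine(asList.Count);      // 2

// Hypothetical entity type, not from the original post.
class Customer
{
    public int Id { get; set; }
    public string Name { get; set; } = "";
}

// Hypothetical comparer that treats customers with the same Id as equal.
class CustomerIdComparer : IEqualityComparer<Customer>
{
    public bool Equals(Customer? x, Customer? y) =>
        ReferenceEquals(x, y) || (x is not null && y is not null && x.Id == y.Id);

    public int GetHashCode(Customer obj) => obj.Id.GetHashCode();
}
```

Overriding Equals and GetHashCode on the entity itself would have the same effect as passing the comparer; the comparer is used here only because it leaves the entity type untouched.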

As for performance differences, in my testing I always found HashSet to be faster, but only marginally. That’s to be expected, considering that with the List approach you have to first add everything and then run Distinct over it.

Test method: Starting with two general functions,

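// Runs the given action `iterations` times and shows the total elapsed time in milliseconds.
// Requires: using System; using System.Diagnostics;
public static void Benchmark(Action method, int iterations = 10000)
{
    Stopwatch sw = new Stopwatch();
    sw.Start();
    for (int i = 0; i < iterations; i++)
        method();

    sw.Stop();
    // MsgBox is not a framework type; presumably it is the author's own helper for displaying a message.
    MsgBox.ShowDialog(sw.Elapsed.TotalMilliseconds.ToString());
}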

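// Returns a list containing the source collection repeated `count` times.
// As an extension method, this must be declared inside a static class.
// Requires: using System; using System.Collections.Generic; using System.Linq;
public static List<T> Repeat<T>(this ICollection<T> lst, int count)
{
    if (count < 0)
        throw new ArgumentOutOfRangeException(nameof(count));

    var ret = Enumerable.Empty<T>();

    for (var i = 0; i < count; i++)
        ret = ret.Concat(lst);

    return ret.ToList();
}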
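
To see what the test data produced by Repeat looks like, here is a small self-contained sketch. It repeats the same logic as a local function (rather than an extension method) so that it runs on its own with C# 9 top-level statements; that restructuring is mine, not the author's.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Same logic as Repeat above, written as a local function so the sketch is self-contained.
static List<T> Repeat<T>(ICollection<T> lst, int count)
{
    if (count < 0)
        throw new ArgumentOutOfRangeException(nameof(count));

    var ret = Enumerable.Empty<T>();
    for (var i = 0; i < count; i++)
        ret = ret.Concat(lst);

    return ret.ToList();
}

// The values 1..100, repeated 100 times.
List<int> d = Repeat(Enumerable.Range(1, 100).ToList(), 100);

Console.WriteLine(d.Count);              // 10000
Console.WriteLine(d.Distinct().Count()); // 100
```

So each benchmark below works on 10,000 items of which only 100 are distinct.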

Implementation:

var d = Enumerable.Range(1, 100).ToList().Repeat(100);
HashSet<int> hash = new HashSet<int>();

Benchmark(() =>
{
    hash.Clear();
    foreach (var item in d)
    {
        hash.Add(item);
    }
});

~3300 ms

var d = Enumerable.Range(1, 100).ToList().Repeat(100);
List<int> list = new List<int>();

Benchmark(() =>
{
    list.Clear();
    foreach (var item in d)
    {
        list.Add(item);
    }

    list = list.Distinct().ToList();
});

~5800 ms

A difference of about 2.5 seconds is small for a list of 10,000 items processed 10,000 times over. In normal cases the difference will hardly be noticeable.

Possibly the best approach for you, given your current design:

var d = Enumerable.Range(1, 100).ToList().Repeat(100);
HashSet<int> hash = new HashSet<int>();
List<int> list = new List<int>();

Benchmark(() =>
{
    hash.Clear();
    foreach (var item in d)
    {
        hash.Add(item);
    }

    list = hash.ToList();
});

~3300 ms

As you can see, there isn’t any significant difference compared with the plain HashSet run.

Partly unrelated: after posting this answer, I was curious to know what the best approach is for removing duplicates from a normal list.

var d = Enumerable.Range(1, 100).ToList().Repeat(100);
HashSet<int> hash = new HashSet<int>();

Benchmark(() =>
{
    hash = new HashSet<int>(d);
});

~3900 ms

var d = Enumerable.Range(1, 100).ToList().Repeat(100);
List<int> list = new List<int>();

Benchmark(() =>
{
    list = d.Distinct().ToList();
});

~3200 ms

Here the right tool, Distinct, is faster than the hackish HashSet! Perhaps it’s the overhead of creating a hash set.
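
If you want to repeat this last comparison without the MsgBox helper, a console-based sketch could look like the following. The Time helper and the iteration count of 1000 are my own assumptions, not the original test harness, and the timings will vary by machine and runtime, so no numbers are claimed here. It assumes C# 9 or later.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

// 10,000 items containing 100 distinct values, like the test data above.
List<int> d = Enumerable.Repeat(Enumerable.Range(1, 100), 100)
                        .SelectMany(x => x)
                        .ToList();

// Hypothetical helper: runs an action repeatedly and returns the elapsed milliseconds.
static double Time(Action action, int iterations)
{
    var sw = Stopwatch.StartNew();
    for (int i = 0; i < iterations; i++)
        action();
    sw.Stop();
    return sw.Elapsed.TotalMilliseconds;
}

HashSet<int> hash = new HashSet<int>();
List<int> list = new List<int>();

double viaHashSet = Time(() => hash = new HashSet<int>(d), 1000);
double viaDistinct = Time(() => list = d.Distinct().ToList(), 1000);

Console.WriteLine($"HashSet constructor: {viaHashSet:F0} ms");
Console.WriteLine($"Distinct().ToList(): {viaDistinct:F0} ms");

// Both approaches end up with the same 100 values.
Console.WriteLine(hash.SetEquals(list)); // True
```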

I have tested various other combinations, such as reference types and original lists without duplicates. The results are consistent.
