Jonathan Boccara's blog

How to Remove Duplicates from an Associative Container in C++

Published September 25, 2018 - 10 Comments

remove duplicates unique mutlimap

Daily C++

For the third episode in our series about removing things from C++ containers, let’s tackle the tricky topic of removing duplicates from associative containers!

The articles of the series are:

What is a duplicate, exactly?

Removing duplicates only makes sense for the 4 associative containers that have “multi” in their name. The other don’t have duplicates, by definition.

For multimap and unordered_multimap, the concept of duplicate can have several meanings: that could be two elements having the same key, but it could also be two elements having both the same key and the same value.

However, since the elements having the same key are in no specified order in the container, we can’t remove (key, value) duplicates in O(n), because they may not be located next to each other. So we won’t look at this latter case here. We will only look at keys to determine whether two elements are duplicates.

For sets, there is no ambiguity since keys and values are one anyway.

Note that before C++11, we didn’t know which of the duplicates remain in the end. It would be the first one encountered during iteration but since they are in no specified order, this doesn’t say much. In C++11, insertion adds elements at the upper bound of the range containing equivalent keys.

Also, duplicate keys don’t mean the same thing between multimap and unordered_multimap: the former uses equivalence (with a “less than” semantics) and the latter uses equality (with an “equal to” semantics). And this difference is also true for multiset and unordered_multiset.

So two elements being “duplicates” can have several meanings. Let’s encapsulate this under a policy: DuplicatePolicy that takes two elements and returns a bool indicating whether they are duplicates.

In all cases, the idea is the same as the one we saw when removing elements according to a predicate: iterate over the collection and remove duplicates, by being careful not to invalidate iterators.

Let’s first implement the generic code using DuplicatePolicy, and then see how to implement this policy.

The traversal algorithm

Here is a possible implementation. The code is explained just afterwards:

template<typename AssociativeContainer, typename DuplicatePolicy>
void unique(AssociativeContainer& container, DuplicatePolicy areDuplicates)
{
    if (container.size() > 1)
    {
        auto it = begin(container);
        auto previousIt = it;
        ++it;
        while (it != end(container))
        {
            if (areDuplicates(*previousIt, *it))
            {
                it = container.erase(it);
            }
            else
            {
                previousIt = it;
                ++it;
            }
        }
    }
}

Here is how this code works:

if (container.size() > 1)

The algorithm is going to consider two consecutive iterators at the same time, to compare them. We can only do this if the container has at least one element. In fact if it doesn’t have at least two elements, there is no duplicate to remove anyway.

auto it = begin(container);
auto previousIt = it;
++it;

Here we make it point the second element of the container, and previousIt it to the first element.

while (it != end(container))

it is the leading iterator of the two, so we will continue until it reaches the end of the container.

if (areDuplicates(*previousIt, *it))
{
    it = container.erase(it);
}
else
{
    previousIt = it;
    ++it;
}

This structure is to avoid iterator invalidation, like when we removed according to a predicate. Note that when the element is not equivalent to the previous one, we move on the previous one to continue the traversal of the container.

How to implement the policy

We could stop here and let a client code call unique by passing a lambda that describes how to identify two duplicates. But this would present several issues:

  • it would burden every call site of unique with low-level and redundant information,
  • there would be a risk of getting the lambda wrong, especially if the container has a custom comparator.

To solve this we can provide default values for this policy, that would be correspond to the various cases.

std::multimap and std::multiset

Let’s start with the non-hash multi-containers, so std::multimap and std::multiset. They both provide a method called value_comp, that returns a function comparing the keys of two elements.

Indeed, contrary to what its name suggests, value_comp for maps does not compare values. It only compares keys. Actually, it makes a lot of sense since the container has no idea how to compare the values associated to the keys. The method is called value_comp because it accepts values, and compare their keys.

To eliminate the entries with duplicate keys in a std::multimap, the policy is:

[&container](std::pair<const Key, Value> const& element1,
             std::pair<const Key, Value> const& element2)
             {
                 return !container.value_comp()(element1, element2) &&
                        !container.value_comp()(element2, element1);
             }

Indeed, multimap and multiset use equivalence, and not equality. This means that value_comp returns a function that compares elements in the sense of being “lower than”, and not “equal to”. To check if two elements are duplicates, we see check that neither one is lower than the other.

So a unique function for std::multimap would be:

template<typename Key, typename Value, typename Comparator>
void unique(std::multimap<Key, Value, Comparator>& container)
{
    return unique(container, [&container](std::pair<const Key, Value> const& element1,
                                          std::pair<const Key, Value> const& element2)
                                          {
                                              return !container.value_comp()(element1, element2) &&
                                                     !container.value_comp()(element2, element1);
                                          });
}

The one for multisets follows the same logic:

template<typename Key, typename Comparator>
void unique(std::multiset<Key, Comparator>& container)
{
    return unique(container, [&container](Key const& element1,
                                          Key const& element2)
                                          {
                                              return !container.value_comp()(element1, element2) &&
                                                     !container.value_comp()(element2, element1);
                                          });
}

std::unordered_multimap and std::unordered_multiset

Let’s now turn to hash multi-containers: std::unordered_multimap and std::unordered_multiset.

Before getting further, let’s remember that to effectively remove duplicates from a container in one traversal those duplicates need to be next to each other. Indeed, our algorithm is in O(n). It doens’t perform a full search for every value across the container (which would be O(n2)).

But unordered_multimap and unordered_multisets are… unordered! So it’s not going to work, is it?

In fact it is, thanks to one property of those containers: the elements with the same keys are guaranteed to be consecutive in the iteration order. Phew.

Additionally, those containers follow a logic of equality for their keys. This means that their comparison function has the semantics of “equal to” and not “lower than”.

They offer a method to access their comparator: key_eq, that returns a function that compares keys. This method is the counterpart of key_comp in the non-hash containers.

However there is no equivalent of value_comp. There is no value_eq that would accept two elements and compare their keys. So we’ll have to make do with key_eq, and pass the keys to it ourselves. Here is the resulting code for std::unordered_multimap:

template<typename Key, typename Value, typename Comparator>
void unique(std::unordered_multimap<Key, Value, Comparator>& container)
{
    return unique(container, [&container](std::pair<const Key, Value> const& element1,
                                          std::pair<const Key, Value> const& element2)
                                          {
                                              return container.key_eq()(element1.first, element2.first);
                                          });
}

And the code for std::unordered_multiset follows the same logic:

template<typename Key, typename Comparator>
void unique(std::unordered_multiset<Key, Comparator>& container)
{
    return unique(container, [&container](Key const& element1,
                                          Key const& element2)
                                          {
                                              return container.key_eq()(element1, element2);
                                          });
}

Here is all this code put together, with the initial generic unique function in a technical namespace:

#include <set>
#include <map>
#include <unordered_map>
#include <unordered_set>

namespace details
{
    template<typename AssociativeContainer, typename DuplicatePolicy>
    void unique_associative(AssociativeContainer& container, DuplicatePolicy areDuplicates)
    {
        if (container.size() > 1)
        {
            auto it = begin(container);
            auto previousIt = it;
            ++it;
            while (it != end(container))
            {
                if (areDuplicates(*previousIt, *it))
                {
                    it = container.erase(it);
                }
                else
                {
                    previousIt = it;
                    ++it;
                }
            }
        }
    }
}
template<typename Key, typename Value, typename Comparator>
void unique(std::multimap<Key, Value, Comparator>& container)
{
    return details::unique_associative(container, [&container](std::pair<const Key, Value> const& element1,
                                                               std::pair<const Key, Value> const& element2)
                                                               {
                                                                   return !container.value_comp()(element1, element2) &&
                                                                          !container.value_comp()(element2, element1);
                                                               });
}

template<typename Key, typename Comparator>
void unique(std::multiset<Key, Comparator>& container)
{
    return details::unique_associative(container, [&container](Key const& element1,
                                                               Key const& element2)
                                                               {
                                                                   return !container.value_comp()(element1, element2) &&
                                                                          !container.value_comp()(element2, element1);
                                                               });
}

template<typename Key, typename Value, typename Comparator>
void unique(std::unordered_multimap<Key, Value, Comparator>& container)
{
    return details::unique_associative(container, [&container](std::pair<const Key, Value> const& element1,
                                                               std::pair<const Key, Value> const& element2)
                                                               {
                                                                   return container.key_eq()(element1.first, element2.first);
                                                               });
}

template<typename Key, typename Comparator>
void unique(std::unordered_multiset<Key, Comparator>& container)
{
    return details::unique_associative(container, [&container](Key const& element1,
                                                               Key const& element2)
                                                               {
                                                                   return container.key_eq()(element1, element2);
                                                               });
}

This closes off our series about removing stuff from containers in C++.

Removing elements, a simple topic? Oh no.

Removing elements, a good topic to better understand STL containers? Yes indeed.

Related articles:

Don't want to miss out ? Follow:   twitterlinkedinrss
Share this post!Facebooktwitterlinkedin

Comments are closed