set_aggregate, set_segregate: higher-level algorithms on sets

Published February 9, 2017 - 6 Comments

In the post describing algorithms on sets, we’ve been over what the STL offers to manipulate sets, that are sorted collection of elements – and not only std::sets. I’ve witnessed my code and the one of the people around me grow with these algorithms, for the better. Being rid of low-level for loops clears up the air to see higher-levels needs. In this post I want to present two higher-level algorithms on sets, based on the STL ones, that better target business needs: set_segregate and its little brother set_aggregate.

Motivation

I’ve come across two needs regarding sets manipulation, either in my code or by listening to my colleagues’ problems.

Updating to a new version of a set. You have one set of elements, that is being replaced by a new one. Some values have gone, some have appeared and some have stayed there. How to write expressive and efficient code that tells which values are gone, which ones have appeared and which ones have stayed in the collection?

Aggregating two sets. This need was met by my colleague Henri-Olivier Duche, the guy behind Coarse Grain Automatic Differentiation. You have two sets of elements, and you want to sort of add them together. More precisely, you want the values that were in one set but not on the other one to be kept in the resulting collection, and you want the common values to be somehow aggregated together – say by using a function object. Again how to write expressive and efficient code to achieve this?

Even though these two problems are different, you can probably feel that they have some things in common. Also, they kind of look like what std::set_difference and the like would take care of, but in a more sophisticated manner.

This post tries to isolate the underlying common need behind these two problems. The goal is to extract a more generic higher-level algorithm on sets. Spoiler alert: we will make it. Second spoiler alert: you’ll be asked your opinion on it.

Left, Right, or both

One thing in common between our two problems is that they have different treatment of the elements that are in both collections than those that are only in one of them. Said differently, we could use a partitioning of the union of the two sets Left and Right by breaking in down in 3 parts:

the elements present in Left only
the elements present both in Left and Right
the elements present in Right only

STL algorithms already fullfill each of this needs: std::set_difference can get you what is in Left only or Right only, std::set_intersection can get you what is in both, and std::set_symmetric_difference can even retrieve what is in Left only and what is in Right only, but puts them all together in the same output.

But there is no algorithm that does all this at the same time. Let’s create it. A name we can use is set_segreate, because it separates the various parts of two sets into the three above categories:

Its interface would be:

template<class LeftRange, class RightRange,
         class OutputItLeft, class OutputItBoth, class OutputItRight, class Compare>
void set_segregate(LeftRange const& leftRange, RightRange const& rightRange,
                   OutputItLeft leftOnly, OutputItBoth both, OutputItRight rightOnly,
                   Compare comp)

leftRange and rightRange are the input sorted collections

leftOnly, both and rightOnly are the output iterators filling outputs with the elements falling into each of the 3 above categories

compare is a comparison function. There would be another overload of set_segreate without this comparison function, that falls back on operator<.

By following the STL convention on algorithms on sets, for elements present in both sets, the version coming from the Left one is taken.

How to implement set_segregate? We can consider two approaches:

calling a combination of std::set_difference and std::set_intersection.

writing it manually, drawing inspiration from the implementation of std::set_difference and std::set_intersection.

The second approach has the advantage of doing a single pass over the two ranges, which lowers the constraints on iterators by requiring only input iterators (like stream iterators for example, or iterators on some adapted ranges). For this reason we go on with this approach.

You can have a look at how std::set_difference, std::set_intersection and std::set_symmetric_difference are implemented. Essentially, the two sorted ranges are traversed in parallel: while the elements of the first range keep being smaller than those of the second range, it means they are present only in the first range. If those of the second one are smaller then they are only present in the second one. And if they are neither bigger not smaller it means that they are present in both collections. This approach allows to achieve a linear complexity for all set algorithms, including those described in this post.

Here is a possible resulting implementation for set_segregate:

template<class LeftRange, class RightRange,
         class OutputItLeft, class OutputItBoth, class OutputItRight, class Compare>
void set_segregate(LeftRange const& leftRange, RightRange const& rightRange,
                   OutputItLeft leftOnly, OutputItBoth both, OutputItRight rightOnly,
                   Compare comp)
{
    auto itLeft = leftRange.begin();
    auto itRight = rightRange.begin();
    while (itLeft != leftRange.end())
    {
        if (itRight == rightRange.end())
        {
            std::copy(itLeft, leftRange.end(), leftOnly);
            return;
        }
 
        if (comp(*itLeft, *itRight))
        {
            *leftOnly++ = *itLeft++;
        }
        else
        {
            if (!comp(*itRight, *itLeft))
            {
                *both++ = *itLeft++;
                ++itRight;
            }
            else
            {
                *rightOnly++ = *itRight++;
            }
        }
    }
    std::copy(itRight, rightRange.end(), rightOnly);
}

This code can effectively retrieve for us the elements falling into the three categories:

std::vector<int> left = {1, 2, 3, 5, 7, 9};
std::vector<int> right = {3, 4, 5, 6, 7};

std::vector<int> leftOnly;
std::vector<int> both;
std::vector<int> rightOnly;

set_segregate(left, right, std::back_inserter(leftOnly), std::back_inserter(both), std::back_inserter(rightOnly));

// leftOnly contains {1, 2, 9};
// both contains {3, 5, 7};
// rightOnly contains {4, 6};

Refining the interface: retrieving Both from Left and from Right

The actual use case I encountered regarding updating an old set with a new one had another constraint: having both versions, old and new, of the elements that remained there during the update (so those in “both”). But with the above function only the old version of such elements is output, to follow the convention of STL algorithms on sets.

The need for keeping both the old and the new versions stemmed out of the fact that input collections were maps and that the comparison was really done on keys. So we needed the old and new values, since they can be different even if the elements are considered as present in both collections by the comparison function.

So we could change the interface, and expect as OutputItBoth to point to a collection of pairs. However the simpler above version that keeps the left version is useful too, so we want to keep it. So what to do? An overload with tag dispatching? A new name like set_segregate_pair? This would spoil our interface that made sense so far. Maybe stop for just a moment to think how you would have solved this problem. If your solution is different from what follows, please share it with everyone by posting a comment below.

What we would want ideally is to write only one set_segregate function, that sticks to the conventional behaviour of keeping the left version, unless we pass an iterator to a collection of pairs of elements of the types in the input ranges, in which case both versions should be filled through this iterator. Indeed passing such an iterator would let the caller express his intention to keep both versions of the common elements.

This means having a behaviour that depends on code written by the caller rather than runtime information. And this variation of behaviour would be determined by the types passed to the function. This screams for template metaprogramming, and we’ll answer that scream with template metaprogramming.

Just before we delve into it though, let’s see how code using set_segreate would then look like:

std::map<int, std::string> left = {{1, "a"}, {2, "b"}, {3, "c1"}, {5, "e1"}, {7, "g1"}, {9, "i"}};
std::map<int, std::string> right = {{3, "c2"}, {4, "d"}, {5, "e2"}, {6, "f"},  {7, "g2"}};

std::map<int, std::string> leftOnly;
std::map<int, std::string> rightOnly;

std::vector<
    std::pair<
        std::pair<int, std::string>, // left versions of common elements
        std::pair<int, std::string>  // right versions of common elements
    > 
> both;

set_segregate(left, right,
              std::inserter(leftOnly, leftOnly.end),
              std::back_inserter(both),
              std::inserter(rightOnly, rightOnly.end),
              compareFirst);

// leftOnly contains: {{1, "a"}, {2, "b"}, {9, "i"}}
// both contains: {{{3, "c1"}, {3, "c2"}},
                   {{5, "e1"}, {5, "e2"}},
                   {{7, "g1"}, {7, "g2"}}}
// rightOnly contains: {{4, "d"}, {6, "f"}}

If you’re interested about how to achieve this branching by using template metaprogramming, let me tell it briefly. Otherwise feel free to jump over to the last section solving the problem of aggregation of sets.

The idea is to translate into metaprogramming the following logic:

if (is_pair(underlying_value(OutputItBoth))
 && first(underlying_value(OutputItBoth)) == underlying_value(LeftRange)
 && second(underlying_value(OutputItBoth)) == underlying_value(RightRange))
{
    add pairs of values encountered in both sets
}
else
{
    just add the left versions of the values encountered in both sets
}

is_pair, first and second can be pretty basic template metaprogramming to implement, for instance:

template<typename T>
struct is_pair
{
    static const bool value = false;
};

template<typename T1, typename T2>
struct is_pair<std::pair<T1,T2>>
{
    static const bool value = true;
};

although this needs a bit more work to deal with const pairs and reference to pairs, and can be alleviated by using std::integral_constant if you’re familiar with it.

And the underlying_type construct relies on decltype of dereferencing iterators.

You can have a closer look by looking directly at template metaprogramming part of the code, which I grouped in the metaprog.hpp file. Even if I won’t delve into all the template details here (although they are quite exciting) because I want to keep the focus on the business needs of the higher-level algorithms on sets, writing expressive template metaprogramming code could be an interesting topic for a future post.

set_aggregate: aggregating two sets

Let’s get back to the second problem: combining two sets by aggregating the elements that are present in both.

We can build this over set_segreate, by keeping the elements from Left only and Right only, and by aggregating together the left and right versions of the elements present in both input collections. The caller would pass a function taking a left and a right version and returning the aggregated value of the two. Let’s call this function set_aggregate:

template<typename Range1, typename Range2, typename OutputIterator, typename Compare, typename Function>
OutputIterator set_aggregate(Range1 const& range1, Range2 const& range2,
                   OutputIterator output, Compare compare, Function aggregator)
{
    using value_type1 = std::remove_reference_t<decltype(*range1.begin())>;
    using value_type2 = std::remove_reference_t<decltype(*range2.begin())>;

    std::vector<std::pair<value_type1, value_type2>> elementsInBoth;
    set_segregate(range1, range2, output, std::back_inserter(elementsInBoth), output, compare);
    return std::transform(elementsInBoth.begin(), elementsInBoth.end(),
                          output,
                          [aggregator](std::pair<value_type1, value_type2> const& p){return aggregator(p.first, p.second);});
}

And a usage example:

std::map<int, std::string> left = {{1, "a"}, {2, "b"}, {3, "c1"}, {5, "e1"}, {7, "g1"}, {9, "i"}};
std::map<int, std::string> right = {{3, "c2"}, {4, "d"}, {5, "e2"}, {6, "f"},  {7, "g2"}};

std::vector<std::pair<int, std::string>> results;

set_aggregate(left, right, std::back_inserter(results),
              compareFirst, addSecond);

// results contains {{1, "a"}, {2, "b"}, {3, "c1c2"}, {4, "d"}, {5, "e1e2"}, {6, "f"}, {7, "g1g2"}, {9, "i"}} in unspecified order

with compareFirst taking two pairs and comparing them on their first elements, and addSecond taking to pairs p1 and p2 and returning a pair with p1.first as first and (p1.second + p2.second) as second, thus performing a natural aggregration.

One last thing to note about set_aggregate is that it requires sets (meaning, sorted collections) in input, but it does not output a set. Indeed the aggregation makes the relative order of the output elements unpredictable, and to keep the linear complexity coming from traversing the input collections in parallel exposed above, I haven’t found a better way than forgoing the sorted order of the output collection.

Conclusion

Being familiar with the STL shows how to think in terms of higher-levels constructs. What is your opinon on set_segreate and set_aggregate? How would you have approached the two problems described at the top of the post? Don’t hesitate to chip in via the comment section, feedback is much appreciated. If you want to have a look at the code, or fork it to play around with it yourself, knock yourself out with the dedicated GitHub repository.

Know your <algorithm>s: algos on sets

Don't want to miss out ? Follow:
Share this post!

About Jonathan Boccara