Implementing a Line Filter in C++

Published March 27, 2020

Filtering lines based on a certain pattern is a common task in the everyday life of a programmer. For example we saw in a recent post the technique taken from The Legacy Code Programmer’s Toolbox that consists in filtering code on control flow keywords in order to get an overview of its structure.

We’re going to write a C++ program that keeps the lines of a piece of text only if they contain some determined words (e.g. if, for, etc. in the example on control flow keywords). It will make us design interfaces as well as review some C++ techniques to write expressive code.

If you’d like to get some practice, you can try to write your own program that filters out the lines of a text, before reading on.

A interface based on functions

We will start by designing a classic interface that use functions. In a future post we will change this interface to make it use ranges.

The general structure of our algorithm is this:

auto const filteredText = join(filter(split(text)));

It contains 3 steps:

split takes a string a returns a collection a strings that represent its lines,
filter takes that collection of lines and only selects those that contain the desired words,
join put together the filtered lines back into one string.

At this point we can note that this algorithm evokes the simplified interpretation of functors: applying a function (filter) to the elements (the lines) inside of a “box” (the text). Here we’re not talking about functors as in function objects, but about the Functors of functional programming.

Is this the concept of a functor? I’m only a functional programming novice and I may well be wrong here. If you’re familiar with FP, please let me know in a comment what you think of the functor comparison.

Back to our C++ interface, we could make it more reusable by passing in what we split the string on and what we filter on:

auto const filteredText = join('\n', filter(contains(words), split('\n', text)));

words is a collection of strings like std::vector<std::string> (for the particular example of control flow keywords it would contain “if”, “for”, etc.) that the filter should use to keep a line if it contains one of them.

Note that passing those parameters first allow to show what step of the algorithm they correspond to. If we had passed them last, the call site would have looked like this:

auto const filteredLines = join(filter(split(lines, '\n'), contains(words)), '\n');

The beginning of this lines shows the pipeline more clearly (join(filter(split) but the end of the line contain the parameters haphazardly ('\n'), contains(words)), '\n')). It’s more difficult to see which parameters correspond to which function. For this reason I find the first interface clearer.

Let’s now implement the different steps of our algorithm, starting from the inside out.

`split`

There are several ways to split a string in C++, and the simplest one is probably to use Boost. To comply with our desired call site we wrap it in a function that returns the collection of lines:

#include <boost/algorithm/string.hpp>

std::vector<std::string> split(char delimiter, std::string const& text)
{
    auto chunks = std::vector<std::string>{};
    boost::split(chunks, text, [delimiter](char c){ return c == delimiter; });
    return chunks;
}

`filter`

To filter lines according to a certain criteria, we can use std::copy_if, which we also wrap in a function that complies with our call site:

template<typename T, typename Predicate>
std::vector<std::string> filter(Predicate pred, std::vector<T> const& input)
{
    auto results = std::vector<std::string>{};
    std::copy_if(begin(input), end(input), back_inserter(results), pred);
    return results;
}

Here is the predicate we used in our call site: contains(words). In the example of filtering on control flow keywords it would be contains(controlFlowKeywords). This expression is made of two parts: contains and controlFlowKeywords.

The point of using two parts instead of one predicate like containsControlFlowKeywords is to allow code reuse. If in the future we want to filter on something else than control flow keywords, like on lines containing domain words for example, then we can reuse the logic of contains by passing it another collection of strings: contains(domainWords).

contains is a function that takes a piece of data and returns a predicate based on that data. As we saw to make code expressive with lambdas, we can implement contains as a function that returns a lambda:

auto contains(std::vector<std::string> const& substrings)
{
    return [&substrings](std::string const& string)
           {
               return std::any_of(begin(substrings), end(substrings),
                           [string](std::string const& substring)
                           {
                               return string.find(substring) != std::string::npos;
                           });
           };
}

We use std::any_of, an STL predicates on ranges, to determine is at least one of the words is contained in the line. We use the find method of the std::string class to perform the check for each word. Indeed, it’s better to use container methods rather than algorithms when possible.

The above implementation of contains deals with substrings that come in as const&, and stores them in the returned lambda as a const& too, to avoid copying the substrings. If substrings refers to a temporary object and contains is used after this temporary object is destroyed, this could lead to undefined behaviour and the program crashing.

For that reason we can add an overload to contains that deals with the case of a temporary object by relying on C++14 generalized lambda capture:

bool contains(std::string const& string, std::vector<std::string> const& substrings)
{
    return std::any_of(begin(substrings), end(substrings),
              [string](std::string const& substring)
              {
                  return string.find(substring) != std::string::npos;
              });
}

auto contains(std::vector<std::string> const& substrings)
{
    return [&substrings](std::string const& string)
           {
               return contains(string, substrings);
           };
}

auto contains(std::vector<std::string> && substrings)
{
    return [substrings{std::move(substrings)}](std::string const& string)
           {
               return contains(string, substrings);
           };
}

`join`

join takes a collection of strings and mends them into one long string by interspersing the individual strings with a delimiter (in our case we pass it '\n').

Here is a possible implementation of join:

std::string join(char delimiter, std::vector<std::string> const& inputs)
{
    if (inputs.empty()) return "";
    
    result.insert(end(result), begin(inputs.front()), end(inputs.front()));
    for (auto input = std::next(begin(inputs)); input != end(inputs); ++input)
    {
        result.push_back(delimiter);
        result.insert(end(result), begin(*input), end(*input));
    }
    return result;
}

We make sure that the collection of lines is not empty by using a guard at the beginning of the function. This way we can insert the first element of the inputs (because we’re now sure there is at least one element in inputs). Then we alternate the insertions between the delimiter and the next input (that starts at std::next(begin(inputs)) like its name suggests!).

This allows to have one less delimiter than input, and not to have a trailing '\n' at the end.

Since we’re inserting repeatedly in the output string, it could make sense to reserve its capacity upfront to avoid reallocations:

int totalSize(std::vector<std::string> const& inputs)
{
    return std::accumulate(begin(inputs), end(inputs), 0,
           [](int currentSize, std::string const& input)
           {
               return currentSize + input.size();
           });
}

std::string join(char delimiter, std::vector<std::string> const& inputs)
{
    if (inputs.empty()) return "";
    
    auto result = std::string{};
    result.reserve(totalSize(inputs));
    
    result.insert(end(result), begin(inputs.front()), end(inputs.front()));
    for (auto input = std::next(begin(inputs)); input != end(inputs); ++input)
    {
        result.push_back(delimiter);
        result.insert(end(result), begin(*input), end(*input));
    }
    return result;
}

Implementation with the ranges library

The above code uses standard C++14 as well as a sprinkle of Boost for the splitting part.

In a future post we will see how to implement our pipeline of line filtering with the range-v3 library, that leads to simpler code and a very expressive call site. Stay tuned!