Jonathan Boccara's blog

How to Extract Words among Spaces in a C++ String

Published April 17, 2020 - 0 Comments

We’ve already seen how to split a string into words with a delimiter, but there is another use case that is pretty close, and that doesn’t have the same implementation: extracting words that are among spaces in a string.

For example, from the following string:

"word1    word2   word3  "

We’d like to extract 3 sub-strings: “word1”, “word2” and “word3”.

We will do it in two ways: the first one is to output a collection of std::strings, and the other one to output a collection of std::string_views.

This is an interesting exercise because it allows to think about how to write expressive code, in particular with good naming and to use STL algorithms. And before seeing a solution, you will have a chance to code it up yourself!

Extracting words as strings

Let’s design the extractWords function, that takes a strings and fishes out among spaces the words it contains.

The interface

What should the interface of the function look like? Its input is the string to traverse, and its output is a collection of words.

In general, we should strive for functions to output their results via their output types. But in this case, what is the return type? Should it be a std::vector<std::string>? That sounds like a reasonable choice. But what if we want to put the results into a std::set? The idea of creating an intermediary std::vector is not so seducing.

Or what if we want to send the output to a stream? Again, an intermediary, potentially large vector isn’t an appealing thought.

To solve this problem, we will build our function on the model of STL algorithm: by using an output iterator. This iterator is a template parameter, and it could be anything: the begin of a vector, a back_inserter, an stream_iterator, a smart output iterator

So here is what the interface will look like:

template <typename OutputIterator>
void extractWords(std::string const& s, OutputIterator out)

Note that some STL algorithms return an OutputIterator, to produce an interesting position in the output collection regarding the algorithm. For example, std::partition returns the partition point and std::rotate returns the new position of the element that used to be at the beginning of the collection.

But in our case, I’m not sure there is a particularly interesting position in this collection. If you see one, let me know and we’ll see if we can return it from the algorithm. But for the time being let’s stick to returning void.

Try it with tests

Could you think of a way to implement extractWords? It took me several iterations before getting to a solution here, and what helped the most was having a set of unit tests, to try different solutions and refine the function, with instantaneous feedback about whether it is correct.

It’s great to have a unit testing framework in your projects, such as Catch2 or Gtest for example, but if you want to try out some code in an online sandbox, don’t be stopped if you can’t use a testing framework. You can always hack together a function that tests your code and returns a boolean to indicate whether the tests passed or not. The point is to have some feedback on your modifications, and quickly.

Have a go at implementing the function! You can use this playground that contains some basic tests:


(As feedback for future articles, do you appreciate having a chance to write the code in an sandbox embedded on the page? How can we improve your user experience regarding this?)

Traversing the collection

Here is a possible solution.

To decide whether a character is a letter or a space, let’s use this following lambda:

static auto const isSpace = [](char letter){ return letter == ' '; };

Note that we could have defined it as a plain function, but the lambda allows it to be defined inside of extractWords. I find that this shows that it relates to our algorithm, reduces the span between definition and use, and doesn’t pollute the outside namespace.

Aso note that is_space only deals with one type of spacing (not tabs, line returns and so on), but it’s not difficult to deal with more types and parametrize our function with this lambda.

So let’s start by locating the first word. The sub-range where the first word sits starts at the first non-blank character and ends at the first blank character:

auto const beginWord = std::find_if_not(begin(s), end(s), isSpace);
auto const endWord = std::find_if(beginWord, end(s), isSpace);

beginWord and endWord are iterators. Note that we don’t call them it or it1 or it2, but we give them meaningful names to show what they represent inside the collection.

If beginWord and endWord are different, then we do have a word here. We need to send it out to the output iterator, that expects a std::string:

*out = std::string(beginWord, endWord);

And we have to increment that output iterator, to move on in the output collection:

++out;

So far, the code put together looks like this:

static auto const isSpace = [](char letter){ return letter == ' '; };

auto const beginWord = std::find_if_not(begin(s), end(s), isSpace);
auto const endWord = std::find_if(beginWord, end(s), isSpace);
if (beginWord != endWord)
{
    *out = std::string(beginWord, endWord);
    ++out;
}

This code allows to find the first word in the string. We now need to make it loop over all the words the string contains.

The loop

After a few iterations to put straighten up the loop, here is one possible solution for implementing extractWords:

template <typename OutputIterator>
void extractWords(std::string const& s, OutputIterator out)
{
    static auto const isSpace = [](char letter){ return letter == ' '; };
    
    auto lastExaminedPosition = begin(s);
    while (lastExaminedPosition != end(s))
    {
        auto const beginWord = std::find_if_not(lastExaminedPosition, end(s), isSpace);
        auto const endWord = std::find_if(beginWord, end(s), isSpace);
        if (beginWord != endWord)
        {
            *out = std::string(beginWord, endWord);
            ++out;
        }
        lastExaminedPosition = endWord;
    }
}

Again, not that we don’t have to call our iterators it. A name such as lastExaminedPosition is more explicit.

Another possibility is to get rid of the if and combine it with the loop’s condition:

template <typename OutputIterator>
void extractWords(std::string const& s, OutputIterator out)
{
    static auto const isSpace = [](char letter){ return letter == ' '; };
    
    auto beginWord = std::find_if_not(begin(s), end(s), isSpace);
    while (beginWord != end(s))
    {
        auto const endWord = std::find_if(beginWord, end(s), isSpace);
        *out = std::string(beginWord, endWord);
        ++out;
        beginWord = std::find_if_not(endWord, end(s), isSpace);
    }    
}

But I like the first solution better, because the second one duplicates some code (the call to find_if_not), and its flow is arguably harder to follow. What do you think?

Extracting words as std::string_views

If the string we pass to extractWords is not a temporary object, we could want to get a collection of C++17 std::string_views, to avoid creating new std::strings.

The algorithm itself doesn’t change. The part that changes is how we send the result to the output iterator:

template <typename OutputIterator>
void extractWordViews(std::string const& s, OutputIterator out)
{
    static auto const isSpace = [](char letter){ return letter == ' '; };
    
    auto lastExaminedPosition = begin(s);
    while (lastExaminedPosition != end(s))
    {
        auto const beginWord = std::find_if_not(lastExaminedPosition, end(s), isSpace);
        auto const endWord = std::find_if(beginWord, end(s), isSpace);
        if (beginWord != endWord)
        {
            *out = std::string_view(&*beginWord, std::distance(beginWord, endWord));
            ++out;
        }
        lastExaminedPosition = endWord;
    }
}

Note that having extractWords and extractWordViews offers flexibility, but it also brings a risk: if you use extractWords with a vector of std::string_view the code will compile:

std::vector<std::string_view> results;
extractWords(s, back_inserter(results));

But it leads to undefined behaviour, because the std::string_views output in the vector will refer to the temporary std::strings output by the algorithm on that line:

*out = std::string(beginWord, endWord);

and that temporary std::string is long gone when extractWords finishes its execution (it was destroyed at the end of the statement where it was created). If you see how we can prevent a call to extractWords from compiling when we connect it to a container of string_view by accident, please drop a comment in the comment section below!

Lumps of information

extractWords is an algoritm that traverses a collection, searching for blocks of special elements lumped together. But it’s far from being the only one. Another example is adjacent_merge, which we will examine in a future post.

If you have other examples of such algorithms, let me know! By analysing several of them, we may see some patterns and find nice generalizations and new abstractions, to make their code more expressive.

You may also like

Don't want to miss out ? Follow:   twitterlinkedinrss
Share this post!Facebooktwitterlinkedin