Jonathan Boccara's blog

How to split a string in C++

Published April 21, 2017

How to split a string in C++? That is to say, how to get a collection of substrings representing the words of a sentence, or the pieces of data contained in a CSV entry?

This is a simple question, but one which has multiple answers in C++.

We will see 3 solutions, each one having advantages and drawbacks. Pick the one corresponding best to your needs. The point of this post as an episode of the STL learning resource is also to show you how the iterator interface goes beyond the scope of simple containers. And this illustrates how powerful the design of the STL is.

Solution 1 uses standard components. Solution 2 is better but relies on boost. And Solution 3 is even better but uses ranges. So the one for you really depends on what you need and what you have access to.

Solution 1: Iterating on a stream

Stepping into the world of streams

A stream is an object that creates a connection with a source or with a destination of interest. A stream can obtain information from the source (std::istream) or provide information to the destination (std::ostream), or both (std::iostream).

The source and destination of interest can typically be the standard input (std::cin) or output (std::cout), a file or a string, but really anything can be connected to a stream, provided that the right machinery is put in place.

The main operations done on a stream are

  • for input streams: draw something from it with operator>>,
  • for output streams: push something into it with operator<<.

This is illustrated in the picture below:

[Figure: C++ streams]

The input stream that connects to a string, std::istringstream, has an interesting property: its operator>> skips any leading whitespace and then extracts a string that runs up to the next whitespace in the source string.
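
To make this property concrete, here is a minimal sketch. The helper name extractWords is made up for this illustration and is not part of the standard library; it simply calls operator>> repeatedly on an std::istringstream until nothing more can be extracted.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper, for illustration only: pulls words out of `text`
// one at a time with operator>>.
std::vector<std::string> extractWords(const std::string& text)
{
    std::istringstream iss(text);
    std::vector<std::string> words;
    std::string word;
    // Each extraction skips leading whitespace and stops at the next
    // whitespace character. The loop ends when an extraction fails.
    while (iss >> word)
    {
        words.push_back(word);
    }
    return words;
}
```

Note that consecutive spaces or tabs are all treated as a single separator, so this approach never produces empty words.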

istream_iterator

std::istream_iterator is an iterator that can connect with an input stream.

It presents the regular interface of an input iterator (++, dereferencing), but its operator++ actually draws from the input stream.

istream_iterator is templated on the type it draws from the stream. We will use istream_iterator<std::string>, which draws a string from the stream and provides a string when dereferenced.

When the stream has nothing more to extract from its source it signals it to the iterator, and the iterator is flagged as finished.
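
As a rough sketch of what driving such an iterator by hand could look like (collectWords is a hypothetical name chosen for this example): an iterator flagged as finished compares equal to a default-constructed istream_iterator, which is what makes the loop below terminate.

```cpp
#include <cassert>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper, for illustration only: walks an istream_iterator
// manually instead of handing it to a container or an algorithm.
std::vector<std::string> collectWords(const std::string& text)
{
    std::istringstream iss(text);
    std::istream_iterator<std::string> it(iss);    // attached to the stream
    const std::istream_iterator<std::string> end;  // default-constructed: "finished"
    std::vector<std::string> words;
    for (; it != end; ++it)      // ++ draws the next string from the stream
    {
        words.push_back(*it);    // dereferencing gives the string last drawn
    }
    return words;
}
```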

Solution 1.1

Now with the iterator interface we can use algorithms, and this really shows the flexibility of the design of the STL. To be able to use the STL (see Inserting several elements into an STL container efficiently), we need a begin and an end. The begin is the iterator on an untouched istringstream on the string to split: std::istream_iterator<std::string>(iss). For the end, by convention, a default-constructed istream_iterator is flagged as finished: std::istream_iterator<std::string>().

Here is the resulting code:

std::string text = "Let me split this into words";

std::istringstream iss(text);
std::vector<std::string> results((std::istream_iterator<std::string>(iss)),
                                 std::istream_iterator<std::string>());

The extra parentheses around the first argument prevent the statement from being parsed as a function declaration (see the “most vexing parse” in Item 6 of Scott Meyers’ Effective STL).

As pointed out by Chris in the comments section, in C++11 we can use uniform initialization using braces to work around that vexing phenomenon:

std::string text = "Let me split this into words";

std::istringstream iss(text);
std::vector<std::string> results(std::istream_iterator<std::string>{iss},
                                 std::istream_iterator<std::string>());

Advantages:

  • uses standard components only,
  • works on any stream, not just strings.

Drawbacks:

  • it can’t split on anything other than whitespace, which can be an issue, for example when parsing a CSV,
  • it can be improved in terms of performance (but until profiling has shown that this is your bottleneck, this is not a real issue),
  • arguably a lot of code for just splitting a string!

Solution 1.2: Pimp my operator>>

(Solution 1.2 is useful to read in order to understand the reasoning that leads to Solution 1.3, but Solution 1.3 is more practical in the end)

The causes of two of the above drawbacks lie in the same place: the operator>> called by the istream_iterator that draws a string from the stream. This operator>> turns out to do a lot of things: stopping at the next space (which is what we wanted initially but cannot be customized), doing some formatting, reading and setting some flags, constructing objects, etc. And most of this we don’t need here.

So we want to change the behaviour of the following function:

std::istream& operator>>(std::istream& is, std::string& output)
{
   // ...does lots of things...
}

We can’t actually change this because it is in the standard library. We can overload it with another type though, but this type still needs to be kind of like a string.

So what we need is a string disguised as another type. There are 2 solutions for this: inheriting from std::string, and wrapping a string with implicit conversion. Let’s choose inheritance here.

Say we want to split a string by commas:

class WordDelimitedByComma : public std::string
{};

Ok, I must admit that this point is controversial. Some would say: “std::string doesn’t have a virtual destructor, so you shouldn’t inherit from it!” and even, maybe, hypothetically, become a tiny little trifle emotional about this.

What I can say here is that the inheritance does not cause a problem in itself. Granted, a problem will occur if a pointer to WordDelimitedByComma is deleted through a pointer to std::string. Or with the slicing problem. But we’re not going to do this, as you’ll see when you read on. Now can we prevent someone from instantiating a WordDelimitedByComma and coldly shooting the program in the foot with it? No we can’t. But is the risk worth taking? Let’s see the benefit and you’ll judge for yourself.

Now operator>> can be overloaded with this, in order to perform only the operations we need: getting the characters until the next comma. This can be accomplished with the getline function:

std::istream& operator>>(std::istream& is, WordDelimitedByComma& output)
{
   std::getline(is, output, ',');
   return is;
}

(The return is; statement allows calls to operator>> to be chained.)

Now the initial code can be rewritten:

std::string text = "Let,me,split,this,into,words";

std::istringstream iss(text);
std::vector<std::string> results((std::istream_iterator<WordDelimitedByComma>(iss)),
                                 std::istream_iterator<WordDelimitedByComma>());

This can be generalized to any delimiter by templating the WordDelimitedByComma class:

template<char delimiter>
class WordDelimitedBy : public std::string
{};

Now to split with semicolon for instance:

std::string text = "Let;me;split;this;into;words";

std::istringstream iss(text);
std::vector<std::string> results((std::istream_iterator<WordDelimitedBy<';'>>(iss)),
                                 std::istream_iterator<WordDelimitedBy<';'>>());
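
The operator>> that goes with the templated class is not shown above. A minimal sketch of how it could be written, by analogy with the comma version, follows. The splitBy wrapper is a hypothetical convenience function added for this example only.

```cpp
#include <cassert>
#include <istream>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

template<char delimiter>
class WordDelimitedBy : public std::string
{};

// Same idea as the comma version, with the delimiter taken from the
// template parameter. std::getline consumes and discards the delimiter.
template<char delimiter>
std::istream& operator>>(std::istream& is, WordDelimitedBy<delimiter>& output)
{
    std::getline(is, output, delimiter);
    return is;
}

// Hypothetical wrapper (an assumption of this sketch, not from the post).
template<char delimiter>
std::vector<std::string> splitBy(const std::string& text)
{
    std::istringstream iss(text);
    return std::vector<std::string>(
        std::istream_iterator<WordDelimitedBy<delimiter>>(iss),
        std::istream_iterator<WordDelimitedBy<delimiter>>());
}
```

With this in place, splitBy<';'>("Let;me;split") gives the three words Let, me and split.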

Advantages:

  • allows any delimiter specified at compile time,
  • works on any stream, not just strings,
  • 20 to 30% faster than Solution 1.1

Drawbacks:

  • delimiter at compile-time
  • not standard, though easy to re-use,
  • still a lot of code for just splitting a string!
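
For readers who remain uneasy about inheriting from std::string, the other option mentioned earlier, wrapping a string with an implicit conversion, could be sketched as follows. The names DelimitedToken and splitWithWrapper are invented for this illustration, and this is only one possible way to write such a wrapper.

```cpp
#include <cassert>
#include <istream>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical wrapper type: holds a string instead of deriving from it.
template<char delimiter>
class DelimitedToken
{
public:
    // Implicit conversion, so that a token can be used to build a std::string.
    operator const std::string&() const { return value_; }

    // Found by argument-dependent lookup when istream_iterator extracts a token.
    friend std::istream& operator>>(std::istream& is, DelimitedToken& output)
    {
        std::getline(is, output.value_, delimiter);
        return is;
    }

private:
    std::string value_;
};

// Hypothetical convenience function built on the wrapper.
template<char delimiter>
std::vector<std::string> splitWithWrapper(const std::string& text)
{
    std::istringstream iss(text);
    return std::vector<std::string>(
        std::istream_iterator<DelimitedToken<delimiter>>(iss),
        std::istream_iterator<DelimitedToken<delimiter>>());
}
```

The trade-off is a little more code in exchange for not exposing a type that can be deleted or sliced through std::string.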

Solution 1.3: stepping away from the iterators

The main problem with solution 1.2 is that the delimiter has to be specified at compile-time. Indeed, we couldn’t pass the delimiter to std::getline through the iterators. So let’s refactor solution 1.2 to remove the layers of iterators:

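#include <sstream>
#include <string>
#include <vector>

std::vector<std::string> split(const std::string& s, char delimiter)
{
   std::vector<std::string> tokens;
   std::string token;
   std::istringstream tokenStream(s);
   // std::getline returns the stream, which tests false once nothing more can be read
   while (std::getline(tokenStream, token, delimiter))
   {
      tokens.push_back(token);
   }
   return tokens;
}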

Here we use another feature of std::getline: it returns the stream that was passed to it, and that stream is convertible to bool (or to void* before C++11). The resulting boolean is true if no error has occurred and false otherwise. Running out of input counts as an error here: when std::getline reaches the end of the stream without extracting any characters, it sets the stream’s failbit.

So the while loop will nicely stop when the end of the stream (and therefore of the string) has been reached.
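
Since the loop only relies on std::getline, the same idea can be applied to any input stream rather than just a string. Here is a sketch of such a variant. The name splitStream is hypothetical and the function is not part of the solution above.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical variant of split that reads from any input stream
// (a file, std::cin, a string stream...).
std::vector<std::string> splitStream(std::istream& input, char delimiter)
{
    std::vector<std::string> tokens;
    std::string token;
    // Stops when std::getline can no longer extract anything.
    while (std::getline(input, token, delimiter))
    {
        tokens.push_back(token);
    }
    return tokens;
}
```

Two edge cases are worth knowing with this loop: an empty field between two delimiters is kept as an empty token, whereas a delimiter at the very end of the input does not produce a final empty token. For instance, reading "a,,b," with ',' as the delimiter yields the three tokens "a", "" and "b".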

Advantages:

  • very clear interface
  • works on any delimiter
  • the delimiter can be specified at runtime

Drawbacks:

  • not standard, though easy to re-use

Solution 2: Using boost::split

This solution is superior to the previous ones (unless you need it to work on any stream):

#include <boost/algorithm/string.hpp>

std::string text = "Let me split this into words";
std::vector<std::string> results;

boost::split(results, text, [](char c){return c == ' ';});

The third argument passed to boost::split is a function (or a function object) that determines whether a character is a delimiter. For example here, we use a lambda taking a char and returning whether this char is a space.

The implementation of boost::split is fairly simple: it essentially performs repeated find_if calls on the string to locate the delimiters, until reaching the end. Note that contrary to the previous solution, boost::split will provide an empty string as a last element of results if the input string ends with a delimiter.

Advantages:

  • straightforward interface,
  • allows any delimiter, even several different ones
  • 60% faster than solution 1.1

Drawbacks:

  • needs access to boost
  • the interface does not output its results via its return type

Solution 3 (for the future): Using ranges

Even if they are not as widely available as standard or even boost components today, ranges are the future of the STL and should be widely available in a couple of years.

To get a glimpse of it, the range-v3 library of Eric Niebler offers a very nice interface for creating a split view of a string:

std::string text = "Let me split this into words";
auto splitText = text | view::split(' ') | ranges::to<std::vector<std::string>>();

And it comes with several interesting features like, amongst others, using a substring as delimiter. Ranges should be included in C++20, so we can hope to be able to use this feature easily in a couple of years.

So, how do I split my string?

If you have access to boost, then by all means use Solution 2. Or you can consider rolling out your own algorithm that, like boost, splits strings based on find_if.
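
A hand-rolled version based on find_if could look something like the sketch below. The name splitIf is hypothetical, and this is not Boost's actual implementation, only an illustration of the idea using the standard library.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper, for illustration only: splits `text` wherever
// `isDelimiter` returns true for a character.
template<typename Predicate>
std::vector<std::string> splitIf(const std::string& text, Predicate isDelimiter)
{
    std::vector<std::string> tokens;
    auto tokenBegin = text.begin();
    while (true)
    {
        // Locate the next delimiter, or the end of the string.
        auto tokenEnd = std::find_if(tokenBegin, text.end(), isDelimiter);
        tokens.emplace_back(tokenBegin, tokenEnd);
        if (tokenEnd == text.end())
        {
            break;
        }
        tokenBegin = tokenEnd + 1;  // step over the delimiter
    }
    return tokens;
}
```

Because the delimiter is a predicate, several different delimiters can be accepted at once, for example with a lambda returning c == ',' || c == ';'. In this sketch a trailing delimiter produces a final empty string, and an empty input produces a single empty token.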

If you don’t want to do this, you can use Solution 1.1, which is standard, unless you need a specific delimiter or profiling has shown that it is a bottleneck, in which case Solution 1.3 is for you.

And when you have access to ranges, Solution 3 should be the way to go.
