Jonathan Boccara's blog

How to split a string in C++

Published April 21, 2017

How to split a string in C++? That is to say, how to get a collection of substrings representing the words of a sentence, or the pieces of data contained in a CSV entry?

This is a simple question, but one which has multiple answers in C++.

We will see 3 solutions, each one having advantages and drawbacks. Pick the one corresponding best to your needs. The point of this post as an episode of the STL learning resource is also to show you how the iterator interface goes beyond the scope of simple containers. And this illustrates how powerful the design of the STL is.

Solution 1 uses standard components (although solution 1.2 tweaks them a bit). Solution 2 is better but uses boost. And Solution 3 is even better but uses ranges. So the one for you really depends on what you need and what you have access to.

Solution 1: Iterating on a stream

Stepping into the world of streams

A stream is an object that creates a connection with a source or with a destination of interest. A stream can obtain information from the source (std::istream) or provide information to the destination (std::ostream), or both (std::iostream).

The source and destination of interest can typically be the standard input (std::cin) or output (std::cout), a file or a string, but really anything can be connected to a stream, provided that the right machinery is put in place.

The main operations done on a stream are

  • for input streams: draw something from it with operator>>,
  • for output streams: push something into it with operator<<.

This is illustrated in the below picture:

The input stream that connects to a string, std::istringstream, has an interesting property: its operator>> extracts a string that runs up to the next whitespace in the source string.
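To make this property concrete, here is a minimal sketch. The helper name firstTwoWords is made up for illustration:

```cpp
#include <sstream>
#include <string>
#include <utility>

// Hypothetical helper: extracts the first two whitespace-delimited words
// of a string with operator>> on a std::istringstream.
std::pair<std::string, std::string> firstTwoWords(const std::string& text)
{
    std::istringstream iss(text);   // input stream connected to the string
    std::string first;
    std::string second;
    iss >> first;    // reads characters up to the next whitespace
    iss >> second;   // skips the leading whitespace, then reads the next word
    return {first, second};
}
```

Each call to operator>> picks up where the previous one stopped, which is what makes the stream usable as a source of successive words.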

istream_iterator

std::istream_iterator is an iterator that can connect with an input stream.

It presents the regular interface of an input iterator (++, dereferencing), but its operator++ actually draws from the input stream.

istream_iterator is templated on the type it draws from the stream. We will use istream_iterator<std::string>, that will draw a string from the stream and provide a string when dereferenced:

When the stream has nothing more to extract from its source it signals it to the iterator, and the iterator is flagged as finished.
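As a rough sketch of this mechanism, the loop below walks an istream_iterator<std::string> by hand until it compares equal to a default-constructed one. The helper name countWords is an assumption made for the example:

```cpp
#include <iterator>
#include <sstream>
#include <string>

// Hypothetical helper: counts the words of a string by advancing an
// istream_iterator until it is flagged as finished.
int countWords(const std::string& text)
{
    std::istringstream iss(text);
    std::istream_iterator<std::string> it(iss);    // connected to the stream
    const std::istream_iterator<std::string> end;  // default-constructed: "finished"
    int count = 0;
    while (it != end)
    {
        ++count;  // *it would give the current word as a std::string
        ++it;     // draws the next word from the stream
    }
    return count;
}
```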

Solution 1.1

Now with the iterator interface we can use algorithms, and this really shows the flexibility of the design of the STL. To be able to use the STL (see Inserting several elements into an STL container efficiently), we need a begin and an end. The begin would be the iterator on an untouched istringstream on the string to split: std::istream_iterator<std::string>(iss). For the end, by convention, a default constructed istream_iterator is flagged as finished: std::istream_iterator<std::string>().

Here is the resulting code:
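A minimal sketch of it, wrapped in a function so that it can be reused. The names splitOnWhitespace, text and results are illustrative assumptions:

```cpp
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Illustrative wrapper: splits a string on whitespace.
std::vector<std::string> splitOnWhitespace(const std::string& text)
{
    std::istringstream iss(text);
    // Range constructor of std::vector, fed with the [begin, end) pair
    // described above.
    std::vector<std::string> results((std::istream_iterator<std::string>(iss)),
                                     std::istream_iterator<std::string>());
    return results;
}
```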

(the extra parentheses around the first argument stop the line from being parsed as a function declaration: see the “most vexing parse” in Item 6 of Scott Meyers’ Effective STL)

Advantages:

  • uses standard components only,
  • works on any stream, not just strings.

Drawbacks:

  • it can’t split on anything other than whitespace, which can be an issue, like for parsing a CSV,
  • it can be improved in terms of performance (but until profiling has proved that this is your bottleneck, this is not a real issue),
  • arguably a lot of code for just splitting a string!

Solution 1.2: Pimp my operator>>

The causes of two of the above drawbacks lie in the same place: the operator>> called by the istream_iterator that draws a string from the stream. This operator>> turns out to do a lot of things: stopping at the next space (which is what we wanted initially but cannot be customized), doing some formatting, reading and setting some flags, constructing objects, etc. And most of this we don’t need here.

So we want to change the behaviour of the operator>> that extracts a std::string from a std::istream.

We can’t actually change this because it is in the standard library. We can overload it for another type though, but this type still needs to be kind of like a string.

So the need is to have a string disguised as another type. There are 2 solutions for this: inheriting from std::string, and wrapping a string with implicit conversion. Let’s choose inheritance here.

Say we want to split a string by commas:
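A minimal sketch of the disguised type. It adds nothing to std::string and exists only to give operator>> a distinct type to overload on:

```cpp
#include <string>

// A std::string under another name: same data and interface, different type.
class WordDelimitedByCommas : public std::string
{};
```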

Ok, I must admit that this point is controversial. Some would say: “std::string doesn’t have a virtual destructor, so you shouldn’t inherit from it!” and even, maybe, hypothetically, become a tiny little trifle emotional about this.

What I can say here is that the inheritance does not cause a problem in itself. Granted, a problem will occur if a pointer to WordDelimitedByCommas is deleted in the form of a pointer to std::string. Or with the slicing problem. But we’re not going to do this, as you’ll see when you read on. Now can we prevent someone from instantiating a WordDelimitedByCommas and coldly shooting the program in the foot with it? No we can’t. But is the risk worth taking? Let’s see the benefit and you’ll judge for yourself.

Now operator>> can be overloaded for this type, in order to perform only the operations we need: getting the characters until the next comma. This can be accomplished with the std::getline function:

(the "return is;" statement makes it possible to chain calls to operator>>.)

Now the initial code can be rewritten:
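A sketch of the rewritten version, self-contained so that it compiles on its own. The function name splitOnCommas is an illustrative assumption:

```cpp
#include <istream>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

class WordDelimitedByCommas : public std::string
{};

std::istream& operator>>(std::istream& is, WordDelimitedByCommas& output)
{
    std::getline(is, output, ',');
    return is;
}

// Same shape as Solution 1.1, except that the iterator now draws
// WordDelimitedByCommas objects, so the overload above is the one called.
std::vector<std::string> splitOnCommas(const std::string& text)
{
    std::istringstream iss(text);
    std::vector<std::string> results((std::istream_iterator<WordDelimitedByCommas>(iss)),
                                     std::istream_iterator<WordDelimitedByCommas>());
    return results;
}
```

Note that the results are stored as plain std::string: the derived type only lives for the duration of the extraction.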

This can be generalized to any delimiter by templating the WordDelimitedByCommas class on the delimiter character.

Now to split on semicolons, for instance:
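A sketch of the templated class together with its use on semicolons. The names WordDelimitedBy and splitOnSemicolons are assumptions made for this example:

```cpp
#include <istream>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// The delimiter becomes a compile-time template parameter.
template<char delimiter>
class WordDelimitedBy : public std::string
{};

template<char delimiter>
std::istream& operator>>(std::istream& is, WordDelimitedBy<delimiter>& output)
{
    std::getline(is, output, delimiter);
    return is;
}

// Illustrative use with ';' as the delimiter.
std::vector<std::string> splitOnSemicolons(const std::string& text)
{
    std::istringstream iss(text);
    std::vector<std::string> results((std::istream_iterator<WordDelimitedBy<';'>>(iss)),
                                     std::istream_iterator<WordDelimitedBy<';'>>());
    return results;
}
```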

Advantages:

  • allows any delimiter specified at compile time,
  • works on any stream, not just strings,
  • faster than Solution 1.1 (20 to 30% faster)

Drawbacks:

  • not standard, though very easy to re-use,
  • still a lot of code for just splitting a string!

Solution 2: Using boost::split

This solution is superior to the previous ones (unless you need it to work on any stream):

The third argument passed to boost::split is a function (or a function object) that determines whether a character is a delimiter. For example here, we use a lambda taking a char and returning whether this char is a space.

The implementation of boost::split is fairly simple: it essentially calls find_if repeatedly on the string to locate the next delimiter, until reaching the end.
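To illustrate the idea with standard components only, here is a hedged sketch. It is not Boost's actual code, and the name splitIf is hypothetical:

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical sketch (not Boost's implementation): cuts `text` at every
// character for which isDelimiter returns true.
template<typename Predicate>
std::vector<std::string> splitIf(const std::string& text, Predicate isDelimiter)
{
    std::vector<std::string> results;
    auto pieceBegin = text.begin();
    while (true)
    {
        // Locate the next delimiter, or the end of the string.
        auto pieceEnd = std::find_if(pieceBegin, text.end(), isDelimiter);
        results.emplace_back(pieceBegin, pieceEnd);
        if (pieceEnd == text.end())
        {
            break;
        }
        pieceBegin = pieceEnd + 1;  // step over the delimiter itself
    }
    return results;
}
```

In this sketch two adjacent delimiters produce an empty string between them, which is a design choice rather than a necessity.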

Advantages:

  • very straightforward interface,
  • allows any delimiter, even several different ones
  • fast: 60% faster than solution 1.1

Drawbacks:

  • not standard: needs access to boost

Solution 3 (for the future): Using ranges

Even if they are not as widely available as standard or even boost components today, ranges are the future of the STL and should be widely available in a couple of years.

To get a glimpse of it, the range-v3 library of Eric Niebler offers a very nice interface for creating a split view of a string:

And it comes with several interesting features like, amongst others, using a substring as delimiter. Ranges should be included in C++20, so we can hope to be able to use this feature easily in a couple of years.

So, how do I split my string?

If you have access to boost, then by all means do Solution 2. Or you can consider rolling out your own algorithm that, like boost, splits strings based on find_if.

If you don’t want to do this, you can do Solution 1.1 which is standard, unless you need a specific delimiter or profiling has shown that this is a bottleneck, in which case Solution 1.2 is for you.

And when you have access to ranges, Solution 3 should be the way to go.
