Jonathan Boccara's blog

C++ Regex 101

Published February 28, 2020


Since C++11, the C++ standard library contains the <regex> header, which allows us to compare strings against regular expressions (regexes). This greatly simplifies the code when we need to perform such operations.

The <regex> header comes with a lot of features, and it might not be easy to know where to start. The first time I used it, I ended up spending time digging around to understand how it works and its basic use cases, and I wasted time fixing stupid mistakes I made when using it.

This post presents what I learnt from that: how to write simple code that performs simple regex use cases, and a few basic mistakes you want to avoid.

How to write regexes

Before starting to implement your regexes in C++, you need to be able to compose them!

The best reference I know about regexes is Jeffrey Friedl’s book. I think that every software developer should read this book. It will give you an in-depth understanding of regular expressions, and make you discover regexes you didn’t know existed.

Searching for a regex in C++

Let’s assume now that you know how to craft the regex you need. The simplest use case you may need is probably to check whether a piece of text contains a match for your regex.

To do this, you can use std::regex_search. Its interface is quite simple:

#include <iostream>
#include <regex>
#include <string>

int main()
{
    auto const regex = std::regex("(my|your) regex");
    
    auto const myText = std::string("A piece of text that contains my regex.");
    bool const myTextContainsRegex = std::regex_search(myText, regex);

    auto const yourText = std::string("A piece of text that contains your regex.");
    bool const yourTextContainsRegex = std::regex_search(yourText, regex);

    auto const theirText = std::string("A piece of text that contains their regex.");
    bool const theirTextContainsRegex = std::regex_search(theirText, regex);
    
    std::cout << std::boolalpha
              << myTextContainsRegex << '\n'
              << yourTextContainsRegex << '\n'
              << theirTextContainsRegex << '\n';

}

std::regex represents a regex. It takes the regular expression as a string in its constructor. You can then pass it to std::regex_search along with the text to search in.

The above program then outputs:

true
true
false

Finding the position in the searched text

So far we only know whether or not the text contains the pattern described by the regex. But std::regex_search can also provide more information about how it matched the pattern, if you also pass it a std::smatch:

auto const regex = std::regex("(my|your) regex");
auto const myText = std::string("A piece of text that contains my regex.");

auto matchResults = std::smatch{};
bool const myTextContainsRegex = std::regex_search(myText, matchResults, regex);

The term “smatch” has nothing to do with Guy Ritchie. It stands for “std::string match”. Indeed, std::smatch is a specialization of std::match_results, a more generic class that also works with representations of strings other than std::string (for example const char*, std::wstring, and so on).

After the call to std::regex_search, matchResults is full of information about the pattern matching of our regex. Amongst other things, it can indicate which part of the searched text preceded the match, with the prefix() method:

auto const prefix = matchResults.prefix();

This prefix is a std::sub_match (more precisely a std::ssub_match when searching in a std::string). Amongst other things, it offers a length() method, which gives the length of the part preceding the match or, said differently, the position of the match.

To illustrate this with our previous example, consider the following program:

#include <iostream>
#include <regex>
#include <string>

int main()
{
    auto const regex = std::regex("(my|your) regex");

    auto const myText = std::string("A piece of text that contains my regex.");
    auto searchResults = std::smatch{};
    bool const myTextContainsRegex = std::regex_search(myText, searchResults, regex);

    // The match results are only meaningful when the search succeeded
    if (myTextContainsRegex)
    {
        std::cout << "position in searched text: " << searchResults.prefix().length() << '\n';
    }
}

Here is its output:

position in searched text: 30

Checking for an exact match

It is important to realize that there is another function besides std::regex_search that checks a regex against a string: std::regex_match.

The difference between std::regex_match and std::regex_search is that std::regex_match checks if the entire searched text matches the pattern of the regex, whereas std::regex_search checks if the searched text contains a sub-part that matches the pattern of the regex.

Put another way, you can use std::regex_match to validate that a string follows a pattern (a date, an email address, and so on) and std::regex_search to perform the equivalent of a grep in a piece of text.

Escaped and non-escaped characters

In the language of regexes, some characters have their literal meaning, such as b that means the character 'b', and some have a special meaning, such as '(' meaning the opening of a group of sub-patterns.

But when we escape these characters, they take a totally different meaning. For example \b means the beginning or the end of a word, and \( means the character '('.

To know exactly the meaning of all characters (escaped and non-escaped), you can consult the regular expression grammar you’re using. The one used by default in C++ regex is the ECMAScript one (actually it’s a slightly modified version, but the ECMAScript documentation is clearer for the main cases). Note that there are ways to use other grammars as well in C++ (with the basic, extended, awk, grep and egrep options).

What this changes in terms of code is that depending on your needs, you may have to escape some characters. When you write string literals in C++, the character '\' is itself a special character that needs to be escaped, with another '\'. Hence the pairs of backslashes \\:

auto const regex = std::regex("(\\bmy\\b|\\byour\\b) regex");

Contrary to our previous regex, this one only matches “my” and “your” as whole words. For example, a search in “Amy regex” no longer finds a match, whereas the previous regex matched the “my regex” part of it.

But this added a lot of backslashes to the expression. One way to reduce the number of backslashes is to use a C++11 raw string literal, which is one of the convenient ways to build strings in C++:

auto const regex = std::regex(R"((\bmy\b|\byour\b) regex)");

Note that a raw string literal wraps the expression in its own delimiters, R"( and )", which adds a pair of parentheses around it. So it is mostly worth it when there are several pairs of escaping backslashes in the expression.

Ignoring case

With the default grammar of std::regex, you can’t specify in the regular expression itself that case should be ignored during comparisons. If you want to ignore case, you need to pass this instruction separately to the regex engine.

In C++, you can pass it as an extra argument to the constructor of std::regex:

auto const regex = std::regex("(MY|your) regex", std::regex::icase);

Where to learn more

To learn more about regexes, I recommend Jeffrey Friedl’s great book Mastering Regular Expressions. It makes the core concepts of regular expressions clear and provides a wealth of practical examples.

For more about the specifics of the C++ functions, you can look them up on cppreference.com.

And for the future of regexes in C++, check out the work of Hana Dusíková on compile-time regular expressions, for example by watching her CppCon 2019 talk.
