Word Counting in C++: Extracting words from camelCase symbols

Published October 16, 2018 - 5 Comments

Counting words in code, what an exciting topic!

Ok, if you don’t see what exactly is exciting in counting words in code, maybe a little context will help. Word counts can reveal useful information about a piece of code, and with the right tooling, it takes very little time to perform.

Reading code is one of our principal activities as software developers, and being able to quickly make sense of a unknown piece of code in an invaluable skill. I believe that word counts can help doing that. If you’d like to see what sort of things they can reveal about code, you can check out the introductory post about word counts.

And speaking about the right tooling, this post, along with a few others before and after it, is about programming a word counter in C++, which happens to be an interesting task in itself, as it shows practical usages of the STL.

Now are you excited about word counts?

A word counter in camelCase

In the last episode we left off with a word counter that could make a list of the words in a piece of code, with their number of occurrences associated. We will take its implementation as a starting point. Now, we are going to extract the words inside of the symbols in camel case of the piece of code.

A word in camel case is a concatenation of several words all starting with a capital letter, except for the first one. For example, thisIsAWordInCamelCase. But we will also include the symbols that start with a capital letter, which is strict sensu called Pascal case. For example ThisIsAWordInPascalCase.

If the above two examples appeared in a piece of code, with our previous word counter they would have generated the following word count:

ThisIsAWordInCamelCase |         1
thisIsAWordInCamelCase |         1

With the word counter that we will implement now, they would generate the following word count:

A     |         2
Camel |         2
Case  |         2
In    |         2
Is    |         2
Word  |         2
This  |         1
this  |         1

Extracting the words

Let’s start by coding a function that takes a piece of code (represented by a std::string), and extracts all the individual words inside of all the camel (or Pascal) case symbols in it. We will use this function instead of the current code that extracts the words in code which, as a reminder, was this:

auto symbols = std::vector<std::string>{};
boost::split(symbols, code, isDelimiter);
symbols.erase(std::remove(begin(symbols), end(symbols), ""), end(symbols));

To start experimenting with a working word counter, we had used Boost Split even if it forced us to remove the empty words afterwards. Now we will replace these three lines of code with a call to our function extracting words from code in camel case. Here is its interface:

std::vector<std::string> getCamelCaseWordsFromCode(std::string const& code);

The algorithm

To extract a given word inside a piece of code, we need to work out two things: where the word starts, and where it ends. And since we need to do this for each word, there will probably be some kind of loop involved.

So to break down the implementation of the algorithm in small steps, we’ll proceed in two steps:

Step 1: start by writing code to extract the first word,
Step 2: adapt this code to loop over all the words.

Before that, let’s create the return value to output:

std::vector<std::string> getCamelCaseWordsFromCode(std::string const& code)
{
    auto words = std::vector<std::string>{};

Note that another option would have been to follow the conventions of the STL and use an output iterator. We keep this option in mind if we decide later to make our function more generic.

Step 1: locating the first word

To locate the first word, we can use two iterators: beginWord that points to the first letter of the word, and endWord that points to the first letter after the word (or the end of code). This way, we will be able to manipulate the word like a range (a sub-range of code) and use all the interfaces that the STL offers.

The first letter of the first word is not necessarily the first word of the piece of code. Indeed, the code may start with blanks or other characters that are not part of a symbol. The first letter of the word is the first one that is not a delimiter. We can locate it by using the STL algorithm std::find_if_not:

auto const beginWord = std::find_if_not(begin(code), end(code), isDelimiter);

We can use the isDelimiter function we had used in our previous implementation of a simple word counter:

bool isDelimiter(char c)
{
    auto const isAllowedInName = isalnum(c) || c == '_';
    return !isAllowedInName;
}

A delimiter is anything that is not in a name, and names in C++ are made of alphanumerical characters (a-z, A-Z, 0-9) and underscores (_).

Now we need to find the end of the first word. A word can end with two things:

either a delimiter,
or a capital letter, that marks the beginning of a new word inside of a symbol in camel case.

So we’re looking for the first character after beginWord that is either one of these two things. We can use the convenient std::next function to start looking after the first letter of the word:

auto const endWord = std::find_if(std::next(beginWord), end(code), [](char c){ return isDelimiter(c) || isupper(c); });

Indeed, if we had started searching for a capital letter from beginWord, and that beginWord happened to point to a capital letter itself, the search would not have gone past the first letter, which may not be the end of the word.

Also note that if we call std::next on the end of a container, using the returned value leads to undefined behaviour. We therefore have to check that we are not at the end of the piece of code before executing the above line of code.

Combining functions

I don’t know what you think, but I find the expression [](char c){ return isDelimiter(c) || isupper(c); } rather annoying to write and read, because it contains a lot of noise. It would have been nicer to write something like this:

auto const endWord = std::find_if(std::next(beginWord), end(code), isDelimiter || isupper);

But this is not legal C++. Boost Phoenix would have allowed to write something like this, after some declarations involving macros:

auto const endWord = std::find_if(std::next(beginWord), end(code), isDelimiter(arg1) || isupper(arg1));

There may be other ways to write this, but we’d risk straying from our exciting topic of word counting if we go further. We will explore the combinations of functions in another post. You’re welcome to share your suggestions about this topic in the comments section below.

Extracting the word

Now that we have located the word with beginWord and endWord, we need to send it to the output collection, words. To do this, we could use the constructor of std::string that takes two iterators to construct a std::string, and add it to the std::vector by using push_back.

But a more direct way is to use the emplace_back method of std::vector, that accepts constructors arguments to directly construct the new object in the memory space of the vector (using a placement new), thus avoiding a copy:

words.emplace_back(beginWord, endWord);

The compiler may have been able to optimize away the copy, but emplace_back leads to more direct code anyway. emplace_back has been added to the standard in C++11.

Step 2: looping over the words

After a series of trials and errors, I could come up with the following loop: find beginWord before the loop, then repeat the finding of endWord and the beginWord for the next word:

auto beginWord = std::find_if_not(begin(code), end(code), isDelimiter);
while (beginWord != end(code))
{
    auto endWord = std::find_if(std::next(beginWord), end(code), [](char c){ return isDelimiter(c) || isupper(c); });
    words.emplace_back(beginWord, endWord);
    beginWord = std::find_if_not(endWord, end(code), isDelimiter);
}

I don’t claim that it is the optimal solution, in particular because it duplicates the code performing the search of the beginning of a word, and I’d be glad to hear your suggestions to improve it, and potentially to simplify it by using STL algorithms.

We can now integrate this function with our previous word counter. This is done in this coliru, which you can use to play around and count the words in your code using camel and pascal case.

Next up: parametrization

We now have a word counter that counts the words inside of camel case symbols, but that no longer counts the whole words! This also was a valid way to count words.

The next step will be to allow our word counter to perform both types of counts. This will make us reflect on:

how to mutualize code,
how to design an expressive interface that allows to choose between types of treatments.

Stay tuned!

About Jonathan Boccara