Word Counting in C++: Parametrizing the Type of Case

Published October 19, 2018 - 9 Comments

In our first step implementing a word counter in C++, we wrote code that could extract the words inside of a piece of code. In the second step, we changed that code so that it extracted individual words inside of camelCaseSymbols (and also of PascalCaseSymbols), losing the previous feature of counting entire words.

Today, we are going to make the code able to do either type of extraction, entire words or words inside of camel case symbols. This will make us practice two aspects of writing expressive code:

avoiding code duplication,
designing a clear API to choose between various treatments (here, between entire words and camel case).

The reason why we’re building a word counter in the first place is that counting words can reveal useful information about a piece of code, and also because implementing it is an instructive project to improve our coding skills in C++.

Summary of the previous episodes

In the first version of the word counter, we went for the quickest solution to have a working prototype. For this, we used Boost Split to extract entire words, even though it wasn’t the most adapted tool for our purpose, as it needed a second pass to remove the empty words:

auto symbols = std::vector<std::string>{};
boost::split(symbols, code, isDelimiter);
symbols.erase(std::remove(begin(symbols), end(symbols), ""), end(symbols));

Where isDelimiter is a function that determines whether a given character is a delimiter, meaning that it cannot be part of a C++ name:

bool isDelimiter(char c)
{
    auto const isAllowedInName = isalnum(c) || c == '_';
    return !isAllowedInName;
}

C++ names are made of alphanumerical characters and underscores. Other characters (brackets, ., -, >, +, spaces, etc.) are delimiters. And in code with several delimiters in a row (like with ->), that leads to empty words (between - and >)

This solution, although quick to put in place, did not have the flexibility to extract words from symbols in camel or pascal case. So we had to implement our own extraction:

std::vector<std::string> getCamelCaseWordsFromCode(std::string const& code)
{
    auto words = std::vector<std::string>{};
    auto beginWord = std::find_if_not(begin(code), end(code), isDelimiter);
    while (beginWord != end(code))
    {
        auto const endWord = std::find_if(std::next(beginWord), end(code), [](char c){ return isDelimiter(c) || isupper(c); });
        words.emplace_back(beginWord, endWord);
        beginWord = std::find_if_not(endWord, end(code), isDelimiter);
    }
    return words;
}

If you’d like to have more details about how we came up with this function, you can check out the dedicated post. This function essentially locates the beginning and and of each word, and places them inside of the output vector.

From words in camel case to entire words

What is the difference between locating an entire word and locating a word inside of a symbol in camel case?

Both start by a character that is not a delimiter. Where they differ is with their end: words inside of a camel case symbol end when we encounter a capital letter (which is the beginning of the next word) or a delimiter (end of the whole camel case symbol). Entire words can only end with a delimiter.

There is one place in the above function where we check for the end of a word:

std::vector<std::string> getCamelCaseWordsFromCode(std::string const& code)
{
    auto words = std::vector<std::string>{};
    auto beginWord = std::find_if_not(begin(code), end(code), isDelimiter);
    while (beginWord != end(code))
    {
        auto const endWord = std::find_if(std::next(beginWord), end(code), [](char c){ return isDelimiter(c) || isupper(c); });
        words.emplace_back(beginWord, endWord);
        beginWord = std::find_if_not(endWord, end(code), isDelimiter);
    }
    return words;
}

To split on entire words, we therefore only need to change that predicate:

std::vector<std::string> getEntireWordsFromCode(std::string const& code)
{
    auto words = std::vector<std::string>{};
    auto beginWord = std::find_if_not(begin(code), end(code), isDelimiter);
    while (beginWord != end(code))
    {
        auto const endWord = std::find_if(std::next(beginWord), end(code), isDelimiter);
        words.emplace_back(beginWord, endWord);
        beginWord = std::find_if_not(endWord, end(code), isDelimiter);
    }
    return words;
}

From that point it becomes natural to make only one function that takes the predicate identifying the end of a word:

template<typename EndOfWordPredicate>
std::vector<std::string> getWordsFromCode(std::string const& code, EndOfWordPredicate isEndOfWord)
{
    auto words = std::vector<std::string>{};
    auto beginWord = std::find_if_not(begin(code), end(code), isDelimiter);
    while (beginWord != end(code))
    {
        auto const endWord = std::find_if(std::next(beginWord), end(code), isEndOfWord);
        words.emplace_back(beginWord, endWord);
        beginWord = std::find_if_not(endWord, end(code), isDelimiter);
    }
    return words;
}

The client interface

We want the user of our word counter to choose between entire words and words inside camel case. The interface as it is is too low in terms of levels of abstraction: we want the user to express his choice by writing something like EntireWords or WordsInCamelCase, and not by passing a predicate. Therefore we need an additional indirection to raise the level of abstraction.

This higher level of abstraction can consist in a function where the user passes its code, as well as an indication about EntireWords or WordsInCamelCase. The question now is, how to express that latter indication?

The purpose of our function is to take a piece of code and extract the words in it. Its sole natural input is the piece of code. The way we want it to perform that extract is a different form of input. It is more something that parametrize the function than a regular input. As if two types of extraction would really be two different functions.

To express this, I think we should pass the type of extraction to the function via a different channel than its normal input. We have at least two channels for this: template parameters, and currying.

Template parameters

Template parameters incur a constraint: they have to be specified at compile time.

Our template parameter should be able to take two values, one for entire words and one for words in camel case. To represent this, we can use an enum:

enum class HowToDelimitWords
{
    EntireWords,
    WordsInCamelCase
};

Then we use it as a template parameter, in the header file:

template<HowToDelimitWords howToDelimitWords>
std::vector<std::string> getWordsFromCode(std::string const& code);

Note that since we don’t use the template parameter inside of the declaration, we can omit its name, which was redundant:

template<HowToDelimitWords>
std::vector<std::string> getWordsFromCode(std::string const& code);

Also note that if we provide the implementations for both values of the enum class, we don’t have to write them in the header file. We can use a .cpp file and the linker will find them there:

template<HowToDelimitWords>
std::vector<std::string> getWordsFromCode(std::string const& code);

template<>
std::vector<std::string> getWordsFromCode<HowToDelimitWords::EntireWords>(std::string const& code)
{
    return getWordsFromCode(code, isDelimiter);
}

template<>
std::vector<std::string> getWordsFromCode<HowToDelimitWords::WordsInCamelCase>(std::string const& code)
{
    return getWordsFromCode(code, [](char c){ return isDelimiter(c) || isupper(c); });
}

You can find all the code put together in this coliru.

Currying

Currying means partial application of a function. Here we will use currying to choose the type of extraction at runtime.

To do this, we start by passing the type of extraction as a regular function parameter, then we will partially apply the function to fix the type of extraction.

If we pass the enum as a regular function parameter, our function becomes:

std::vector<std::string> getWordsFromCode(std::string const& code, HowToDelimitWords howToDelimitWords)
{
    if (howToDelimitWords == HowToDelimitWords::EntireWords)
    {
        return getWordsFromCode(code, isDelimiter);
    }
    else
    {
        return getWordsFromCode(code, [](char c){ return isDelimiter(c) || isupper(c); });
    }
}

And its declaration in the header file becomes:

std::vector<std::string> getWordsFromCode(std::string const& code, HowToDelimitWords howToDelimitWords);

Since we would like to make the function take only the code as parameter, we can resort to partially apply it with lambdas. Note that we can write the lambdas in the header file, with only the function declaration available:

std::vector<std::string> getWordsFromCode(std::string const& code, HowToDelimitWords howToDelimitWords);

auto const getEntireWordsFromCode = [](std::string const& code){ return getWordsFromCode(code, HowToDelimitWords::EntireWords); };
auto const getWordsInCamelCaseFromCode = [](std::string const& code){ return getWordsFromCode(code, HowToDelimitWords::WordsInCamelCase); };

We now have two functions, getEntireWordsFromCode and getWordsInCamelCaseFromCode, that both take only one parameter, code. And we avoided code duplication.

You can find all the code using currying put together in that coliru.

The option using lambda is perhaps less scalable than the one using templates, if we add other parameters. At this stage though, we don’t know if we will ever need extra parameters. And if we do, we will always be able to adapt the code, or use wrappers in the worst case.

Which option do you prefer?

Next steps

We have now allowed a user of our word counter to choose between counting entire words and counting individual words in camel case.

The next features we will implement include performing case-insensitive word counts as well as word counts on multiple files at the same time. This will let us practice other aspects of code design.

Stay tuned!