Jonathan Boccara's blog

Word Counting in C++: Implementing a Simple Word Counter

Published October 12, 2018 - 6 Comments

Word counts can reveal information about your code, or make an unknown piece of code more expressive to your eyes.

There are online tools to count words in generic text, but most of those I’ve come across are designed around counting words in text and SEO (Search Engine Optimization). Since analysing source code is not the same thing as analysing the text of a blog post, let’s design a tool fit for our needs of counting words in code. This way, we will be able to make it evolve when we discover new tasks to try with our word counter.

Another reason to write our own word counter is that it will let us practice interface design, and also STL algorithms, which are useful to master for coding in C++.

For this first version of our word counter, the objective will be to put together a working prototype. Then we will improve it over future posts, by adding features and by refining its implementation.

Code that counts words in code

To make a function that performs the word counting of a piece of code, let’s start by designing its interface.

The interface

One possible interface for a function counting words in code could be this:

std::map<std::string, size_t> getWordCount(std::string const& code);

The code is input as a std::string, and the output word count associates individual words to their number of occurrences. The individual words can be represented as std::string, and their number of occurrences is a positive number that can be represented by a size_t. It is therefore natural to use a std::map<std::string, size_t>.

However, this natural return type may not be exactly what we want: one of the points of a word count is to identify the frequent words, and a map is not designed for doing this. A more appropriate structure would be a std::vector<std::pair<std::string, size_t>>, because we can sort it by the number of occurrences (the second of its pairs).

Since we see that right from the beginning that defining the type representing the word count is not trivial, let’s not settle for a definitive type. Instead let’s give it a name, WordCount, and use an alias declaration of a std::vector. It will make it easier to change it later if necessary, and the advantage of using an alias over a fully-fledged type is that we benefit from all the interface of std::vector without writing any additional code:

using WordCount = std::vector<std::pair<std::string, size_t>>;
WordCount getWordCount(std::string const& code);

Now that we have an interface to begin with, let’s move on to the implementation. The basic structure of the function will be:

  • identifying all the symbols in the input piece of code,
  • counting the occurrences of each one of them,
  • sorting the results by decreasing order of occurrences.

Identifying all the symbols in the input piece of code

Each programming language defines a set of characters that can be used in symbols. In C++, valid symbols are constituted of alphanumerical characters (a to z, A to Z and 0 to 9), as well as underscores (_). A symbols is a succession of such characters, and it stops at any character that is not in this set. For examples, symbols in C++ code are separated by all sorts of white space (space, new lines, tabs) operators (., +, ->, etc.) and brackets ([], {}, ()).

So identifying the symbols in a piece of code represented by a string consists in splitting the string, by using as a delimiter any character that is not a-z, A-Z, 0-9 or an underscore.

The easiest way to split a string in C++ is to use Boost.Split:

auto symbols = std::vector<std::string>{};
boost::split(symbols, code, isDelimiter);

This outputs in symbols the collection of words in the string code, delimited by characters satisfying the predicate isDelimiter. Let’s now implement isDelimiter:

bool isDelimiter(char c)
{
    auto const isAllowedInName = isalnum(c) || c == '_';
    return !isAllowedInName;
}

A delimiter is a character that is not allowed in a name. And as said above, the characters allowed in names are the alphanumerical ones, identified by the standard function isalnum, and underscores.

We now have list of all symbols between delimiters. However this collection contains too many entries: when there are two consecutive delimiters, such as -> or || or ). for example, it generates an empty string corresponding to the (empty) word between those delimiters.

We therefore have to clear our results from empty strings. If you have wrapped the C++ erase-remove idiom in a function, you can write something like this:

erase(symbols, "");

Otherwise you have to write it out in full:

symbols.erase(std::remove(begin(symbols), end(symbols), ""), end(symbols));

This extra step suggests that Boost.Split may not be the right tool here, and that we will have to write our own function to delimit words at some point. We will do it in a future post, but for the moment let’s move on to have a working version, that we can start to use and unit test. We will come back to it afterwards.

Counting the occurrences of each symbol

At this point of the function, we have a std::vector<std::string> that contains all the symbols in the function, and we need to count the occurrences of each one of them. Let’s create a sub-function in charge of this operation:

std::map<std::string, size_t> countWords(std::vector<std::string> const& words)
{
    auto wordCount = std::map<std::string, size_t>{};
    for (auto const& word : words)
    {
        ++wordCount[word];
    }
    return wordCount;
}

This function iterates over the collection of symbols, and increments the number of occurrences of each symbol that we store in a map. Note that the expression wordCount[word] creates an entry in the map with a key equal to the word if it doesn’t exist in the map already.

Sorting the results by decreasing order of occurrences

Now that we have a map that associates symbols with their number of occurrences, we need to turn it into a WordCount sorted by decreasing number of occurrences.

Since WordCount is a vector of std::pairs, and that a std::map is also a container of std::pair, we can leverage on the range constructor of std::vector. To differentiate the word count that we will sort, let’s call it sortedWordCount (even though it is not sorted just yet):

auto const wordCount = countWords(words);
auto sortedWordCount = WordCount(begin(wordCount), end(wordCount));

We finish up the function by sorting the vector in decreasing order of the .second of its elements:

std::sort(begin(sortedWordCount), end(sortedWordCount), [](auto const& p1, auto const& p2){ return p1.second > p2.second; });

Putting it all together

Here is all the code contributing to the function getWordCount:

bool isDelimiter(char c)
{
    auto const isAllowedInName = isalnum(c) || c == '_';
    return !isAllowedInName;
}

std::map<std::string, size_t> countWords(std::vector<std::string> const& words)
{
    auto wordCount = std::map<std::string, size_t>{};
    for (auto const& word : words)
    {
        ++wordCount[word];
    }
    return wordCount;
}

WordCount getWordCount(std::string const& code)
{
    auto symbols = std::vector<std::string>{};
    boost::split(symbols, code, isDelimiter);
    symbols.erase(std::remove(begin(symbols), end(symbols), ""), end(symbols));

    auto const wordCount = countWords(symbols);
    
    auto sortedWordCount = WordCount(begin(wordCount), end(wordCount));
    std::sort(begin(sortedWordCount), end(sortedWordCount), [](auto const& p1, auto const& p2){ return p1.second > p2.second; });
    
    return sortedWordCount;
}

Constructing a raw string literal from code

If we have a piece of code to analyse with our word counter, how do we make it reach the getWordCount function? In later revisions of the program we will fetch code from a file, end even from multiple files, but for the moment let’s go for the simplest solution possible: putting the input right into code.

This is not the cleanest and definitive solution, but it has the advantage of being immediate and doable on the go, if you are not at home and only have access to online compilers such as coliru.

But copy-pasting a piece of code into a std::string is challenging, because if the code has quotes (") you need to escape them. Also, you have to deal with line returns if your code spreads over several lines (which it likely does).

Fortunately, C++11 raw string literals solve exactly that kind of problems. There are several ways to create a raw string literal, but the simplest one is to write an R before opening the quotes, and putting the string inside parentheses: R"(this is my text with "quotes")".

Here is the raw string literal corresponding to the code we have written so far:

    static constexpr auto code = R"(

bool isDelimiter(char c)
{
    auto const isAllowedInName = isalnum(c) || c == '_';
    return !isAllowedInName;
}

std::map<std::string, size_t> countWords(std::vector<std::string> const& words)
{
    auto wordCount = std::map<std::string, size_t>{};
    for (auto const& word : words)
    {
        ++wordCount[word];
    }
    return wordCount;
}

WordCount getWordCount(std::string const& code)
{
    auto symbols = std::vector<std::string>{};
    boost::split(symbols, code, isDelimiter);
    symbols.erase(std::remove(begin(symbols), end(symbols), ""), end(symbols));

    auto const wordCount = countWords(symbols);
    
    auto sortedWordCount = WordCount(begin(wordCount), end(wordCount));
    std::sort(begin(sortedWordCount), end(sortedWordCount), [](auto const& p1, auto const& p2){ return p1.second > p2.second; });
    
    return sortedWordCount;
}
})";

Printing a word count

To start exploiting the information provided by word counts, we will output them to the console. To do this, let’s write a function that prints a word count as a table of two columns, with the symbols on one side and their numbers of occurrences on the other side.

To do this by using standard C++ (before C++20, that could adopt the popular {fmt} library), we will rely on stream operations, which you can read about in The Complete Guide to Building Strings In C++:

void print(WordCount const& entries)
{
    for (auto const& entry : entries)
    {
        std::cout << std::setw(30) << std::left << entry.first << '|' << std::setw(10) << std::right << entry.second << '\n';
    }
}

This function fixes the sizes of the two columns to 30 and 10 characters respectively. Let’s improve it by adapting the size of the first column to the longest symbol size + 1. To do this we need to locate the longest symbol size. We use std::max_element, by giving it a predicate to compare the sizes of the firsts in the pairs in the vector:

auto const longestWord = *std::max_element(begin(entries), end(entries), [](auto const& p1, auto const& p2){ return p1.first.size() < p2.first.size(); });
auto const longestWordSize = longestWord.first.size();

In an empty collection, std::max_element returns the end of the collection. Since we cannot deference that, we need to deal with this case, for example by using a guard:

void print(WordCount const& entries)
{
    if (entries.empty()) return;
    auto const longestWord = *std::max_element(begin(entries), end(entries), [](auto const& p1, auto const& p2){ return p1.first.size() < p2.first.size(); });
    auto const longestWordSize = longestWord.first.size();
    
    for (auto const& entry : entries)
    {
        std::cout << std::setw(longestWordSize + 1) << std::left << entry.first << '|' << std::setw(10) << std::right << entry.second << '\n';
    }
}

Putting it all together

Here is a working example of word count, on the code of the word counter itself (also available in this coliru):

#include <boost/algorithm/string.hpp>
#include <cctype>
#include <iostream>
#include <iomanip>
#include <map>
#include <vector>

bool isDelimiter(char c)
{
    auto const isAllowedInName = isalnum(c) || c == '_';
    return !isAllowedInName;
}

std::map<std::string, size_t> countWords(std::vector<std::string> const& words)
{
    auto wordCount = std::map<std::string, size_t>{};
    for (auto const& word : words)
    {
        ++wordCount[word];
    }
    return wordCount;
}

using WordCount = std::vector<std::pair<std::string, size_t>>;

WordCount getWordCount(std::string const& code)
{
    auto symbols = std::vector<std::string>{};
    boost::split(symbols, code, isDelimiter);
    symbols.erase(std::remove(begin(symbols), end(symbols), ""), end(symbols));

    auto const wordCount = countWords(symbols);
    
    auto sortedWordCount = WordCount(begin(wordCount), end(wordCount));
    std::sort(begin(sortedWordCount), end(sortedWordCount), [](auto const& p1, auto const& p2){ return p1.second > p2.second; });
    
    return sortedWordCount;
}

void print(WordCount const& entries)
{
    if (entries.empty()) return;
    auto const longestWord = *std::max_element(begin(entries), end(entries), [](auto const& p1, auto const& p2){ return p1.first.size() < p2.first.size(); });
    auto const longestWordSize = longestWord.first.size();
    
    for (auto const& entry : entries)
    {
        std::cout << std::setw(longestWordSize + 1) << std::left << entry.first << '|' << std::setw(10) << std::right << entry.second << '\n';
    }
}

int main()
{
    static constexpr auto code = R"(

bool isDelimiter(char c)
{
    auto const isAllowedInName = isalnum(c) || c == '_';
    return !isAllowedInName;
}

std::map<std::string, size_t> countWords(std::vector<std::string> const& words)
{
    auto wordCount = std::map<std::string, size_t>{};
    for (auto const& word : words)
    {
        ++wordCount[word];
    }
    return wordCount;
}

using WordCount = std::vector<std::pair<std::string, size_t>>;

WordCount getWordCount(std::string const& code)
{
    auto symbols = std::vector<std::string>{};
    boost::split(symbols, code, isDelimiter);
    symbols.erase(std::remove(begin(symbols), end(symbols), ""), end(symbols));

    auto const wordCount = countWords(symbols);
    
    auto sortedWordCount = WordCount(begin(wordCount), end(wordCount));
    std::sort(begin(sortedWordCount), end(sortedWordCount), [](auto const& p1, auto const& p2){ return p1.second > p2.second; });
    
    return sortedWordCount;
}

void print(WordCount const& entries)
{
    if (entries.empty()) return;
    auto const longestWord = *std::max_element(begin(entries), end(entries), [](auto const& p1, auto const& p2){ return p1.first.size() < p2.first.size(); });
    auto const longestWordSize = longestWord.first.size();
    
    for (auto const& entry : entries)
    {
        std::cout << std::setw(longestWordSize + 1) << std::left << entry.first << '|' << std::setw(10) << std::right << entry.second << '\n';
    }
}

})";    
    print(getWordCount(code));
}

Here is the word count output by this program:

std             |        20
auto            |        13
const           |        13
symbols         |         7
return          |         6
wordCount       |         6
string          |         6
entries         |         5
end             |         5
p2              |         4
p1              |         4
first           |         4
sortedWordCount |         4
begin           |         4
WordCount       |         4
c               |         3
size_t          |         3
vector          |         3
entry           |         3
size            |         3
second          |         3
map             |         2
longestWord     |         2
longestWordSize |         2
setw            |         2
word            |         2
words           |         2
isDelimiter     |         2
isAllowedInName |         2
code            |         2
countWords      |         2
for             |         2
erase           |         1
10              |         1
_               |         1
bool            |         1
void            |         1
boost           |         1
using           |         1
char            |         1
split           |         1
cout            |         1
sort            |         1
empty           |         1
1               |         1
getWordCount    |         1
right           |         1
if              |         1
remove          |         1
print           |         1
pair            |         1
n               |         1
max_element     |         1
isalnum         |         1
left            |         1

The most frequent words is std, which reflects that we’ve used the standard library quite intensively. Among the frequent words not related to C++, we find symbols and wordCount, which is indeed what this code is about.

Next steps

Now that we have a working (as far as I know!) word counter, we can make it evolve.

One interesting feature for counting words in code is to extract individual words from camelCaseSymbols. To do this we will implement our own function to extract words from code, and at the same time use an implementation more adapted than Boost.Split.

Stay tuned, and if you see how to improve the word counter or have any other reaction, please leave your feedback below!

This post is sponsored by mesbinocles.com. A big thanks to them!

You will also like

Don't want to miss out ? Follow:   twitterlinkedinrss
Share this post!Facebooktwitterlinkedin

Comments are closed