3 Things That Counting Words Can Reveal on Your Code

Published October 9, 2018

Being able to read code and understand it quickly is an invaluable skill for a software developer. We spend way more time reading code than writing it, and being able to make a piece of code expressive to your eyes can make you much more efficient in your daily work.

There is a technique to analyse code that I’ve been very excited about these days: counting words in code. By counting words, I mean:

calculating the number of occurrences of each word in a given piece of code, for example in a function,
then seeing where the most frequent words are located,
and using this to deduce information about the function as a whole.
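
To make the first step concrete, here is a minimal sketch in C++. It is not the tool that future posts will build, only an illustration under assumptions of mine: a word is a run of letters, digits and underscores, and the names isWordChar, extractWords and countWords are made up for this example. Comments, string literals and numbers are counted like any other word.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

using WordCount = std::pair<std::string, std::size_t>;

// Assumption: a word character is a letter, a digit or an underscore.
bool isWordChar(char c)
{
    return std::isalnum(static_cast<unsigned char>(c)) || c == '_';
}

// Splits a piece of code into words, dropping operators, punctuation and whitespace.
std::vector<std::string> extractWords(const std::string& code)
{
    std::vector<std::string> words;
    auto current = code.begin();
    while (current != code.end())
    {
        const auto wordBegin = std::find_if(current, code.end(), isWordChar);
        const auto wordEnd = std::find_if_not(wordBegin, code.end(), isWordChar);
        if (wordBegin != wordEnd)
            words.emplace_back(wordBegin, wordEnd);
        current = wordEnd;
    }
    return words;
}

// Counts the occurrences of each word and returns them sorted, most frequent first.
std::vector<WordCount> countWords(const std::string& code)
{
    std::map<std::string, std::size_t> counts;
    for (const std::string& word : extractWords(code))
        ++counts[word];

    std::vector<WordCount> sorted(counts.begin(), counts.end());
    std::stable_sort(sorted.begin(), sorted.end(),
                     [](const WordCount& a, const WordCount& b) { return a.second > b.second; });
    return sorted;
}
```

Calling countWords on the text of a function and looking at the first few entries gives the most frequent words to start from.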

Counting words has helped me quite a few times to understand code I didn’t know. But the main reason I’m so excited about it is that I’m sure there is plenty left to discover, and I’d like to discuss it with you.

This post presents three things counting words can reveal about a piece of code, and I’d love to hear your reactions afterwards.

Locating the important objects

Let’s take the example of understanding a function. It is likely that the words that occur the most frequently across that function’s code have some importance in it.

To illustrate, let’s try a word count on a function, locate the most frequent words, and see what we can learn from it. We’ll use open-source code hosted on GitHub. For example, consider this function from a C++ repository called Classic-Shell.

You don’t have to read its code: our purpose is to start with a word count, to get a high-level view of the function.

bool CSetting::ReadValue( CRegKey &regKey, const wchar_t *valName )
{
	// bool, int, hotkey, color
	if (type==CSetting::TYPE_BOOL || (type==CSetting::TYPE_INT && this[1].type!=CSetting::TYPE_RADIO) || type==CSetting::TYPE_HOTKEY || type==CSetting::TYPE_HOTKEY_ANY || type==CSetting::TYPE_COLOR)
	{
		DWORD val;
		if (regKey.QueryDWORDValue(valName,val)==ERROR_SUCCESS)
		{
			if (type==CSetting::TYPE_BOOL)
				value=CComVariant(val?1:0);
			else
				value=CComVariant((int)val);
			return true;
		}
		return false;
	}

	// radio
	if (type==CSetting::TYPE_INT && this[1].type==CSetting::TYPE_RADIO)
	{
		ULONG len;
		DWORD val;
		if (regKey.QueryStringValue(valName,NULL,&len)==ERROR_SUCCESS)
		{
			CString text;
			regKey.QueryStringValue(valName,text.GetBuffer(len),&len);
			text.ReleaseBuffer(len);
			val=0;
			for (const CSetting *pRadio=this+1;pRadio->type==CSetting::TYPE_RADIO;pRadio++,val++)
			{
				if (_wcsicmp(text,pRadio->name)==0)
				{
					value=CComVariant((int)val);
					return true;
				}
			}
		}
		else if (regKey.QueryDWORDValue(valName,val)==ERROR_SUCCESS)
		{
			value=CComVariant((int)val);
			return true;
		}
		return false;
	}

	// string
	if (type>=CSetting::TYPE_STRING && type<CSetting::TYPE_MULTISTRING)
	{
		ULONG len;
		if (regKey.QueryStringValue(valName,NULL,&len)==ERROR_SUCCESS)
		{
			value.vt=VT_BSTR;
			value.bstrVal=SysAllocStringLen(NULL,len-1);
			regKey.QueryStringValue(valName,value.bstrVal,&len);
			return true;
		}
		return false;
	}

	// multistring
	if (type==CSetting::TYPE_MULTISTRING)
	{
		ULONG len;
		if (regKey.QueryMultiStringValue(valName,NULL,&len)==ERROR_SUCCESS)
		{
			value.vt=VT_BSTR;
			value.bstrVal=SysAllocStringLen(NULL,len-1);
			regKey.QueryMultiStringValue(valName,value.bstrVal,&len);
			for (int i=0;i<(int)len-1;i++)
				if (value.bstrVal[i]==0)
					value.bstrVal[i]='\n';
			return true;
		}
		else if (regKey.QueryStringValue(valName,NULL,&len)==ERROR_SUCCESS)
		{
			value.vt=VT_BSTR;
			value.bstrVal=SysAllocStringLen(NULL,len);
			regKey.QueryStringValue(valName,value.bstrVal,&len);
			if (len>0)
			{
				value.bstrVal[len-1]='\n';
				value.bstrVal[len]=0;
			}
			return true;
		}
		return false;
	}

	Assert(0);
	return false;
}

The function is called ReadValue. If you are not familiar with the project, it’s not easy to understand what value is read, or for what purpose.

Counting the words of this function shows that the word that occurs the most frequently is value. (You can do this approximately with a generic online tool for counting words in text, or with a tool specially designed for counting words in code, which we will explore in future posts.) Let’s highlight the occurrences of value in the function:

[Image: the ReadValue function with the occurrences of value highlighted]

The first thing we can note is that the occurrences of value are spread out across the whole function. This suggests that value is indeed a central object of the function. Note that if we had started by reading the code line by line, it would have taken much more time to figure out this piece of information.
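
One way to check this spread without a highlighter is to list the lines on which a word appears. The sketch below does that. The names containsWord and linesContaining are hypothetical, and I am assuming that a word is a run of letters, digits and underscores, so that looking for val does not match value or valName.

```cpp
#include <cctype>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Assumption: a word character is a letter, a digit or an underscore.
bool isWordChar(char c)
{
    return std::isalnum(static_cast<unsigned char>(c)) || c == '_';
}

// True if `word` occurs in `line` as a whole word, not as part of a longer identifier.
bool containsWord(const std::string& line, const std::string& word)
{
    if (word.empty()) return false;
    std::size_t pos = line.find(word);
    while (pos != std::string::npos)
    {
        const std::size_t end = pos + word.size();
        const bool startsWord = pos == 0 || !isWordChar(line[pos - 1]);
        const bool endsWord = end == line.size() || !isWordChar(line[end]);
        if (startsWord && endsWord) return true;
        pos = line.find(word, pos + 1);
    }
    return false;
}

// Returns the 1-based numbers of the lines of `code` that contain `word`.
std::vector<std::size_t> linesContaining(const std::string& code, const std::string& word)
{
    std::vector<std::size_t> lineNumbers;
    std::istringstream input(code);
    std::string line;
    for (std::size_t lineNumber = 1; std::getline(input, line); ++lineNumber)
        if (containsWord(line, word))
            lineNumbers.push_back(lineNumber);
    return lineNumbers;
}
```

If the returned line numbers range from the top to the bottom of the function, the word is used throughout it.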

We also note that the first time that value appears in the function is not via a declaration. This means that value is presumably a class member of the class containing the method ReadValue (in theory value could also be a global variable, but let’s be optimistic and assume it’s a class member).

Now if we take a closer look at those occurrences of value, we notice that most of them are assignments. We now have a good assumption about the purpose of the function ReadValue: to fill the class member value (and we also understand the function’s name now).

All these deductions are only based on assumptions, and to be 100% sure they are valid we would have to read the whole function. But having a likely explanation of what the function does is useful for two reasons:

often, we don’t have the time to read every line of each function we come across,
for the functions that we do read in detail, starting with a general idea of what the function does helps the detailed reading.

Understanding how inputs are used

A function takes inputs and produces outputs. So one way to understand what a function does is to examine what it does with its inputs. In a lot of the word counts I’ve run, the function’s inputs are among the most frequent words in its body.

The ReadValue function takes two inputs: regKey and valName. Let’s highlight the occurrences of those words in the function. regKey is in orange, valName in red:

[Image: the ReadValue function with regKey highlighted in orange and valName in red]

A pattern jumps out of this highlighting: regKey and valName are always used together. This suggests that, to understand them, we should consider them together. And indeed, by looking more closely at one of the lines where they are used, we see that regKey seems to be some sort of container, and valName a key to look up in it.
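
If we already have, for each of the two words, the sorted list of lines on which it appears, checking that they are always used together comes down to a symmetric difference. Here is a small sketch of that idea. linesWhereOnlyOneAppears is a hypothetical name, and the function assumes that both lists are sorted and contain no duplicates.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

// Returns the lines on which exactly one of the two words occurs.
// Assumption: both inputs are sorted in increasing order, without duplicates.
std::vector<std::size_t> linesWhereOnlyOneAppears(const std::vector<std::size_t>& linesOfFirst,
                                                   const std::vector<std::size_t>& linesOfSecond)
{
    std::vector<std::size_t> result;
    std::set_symmetric_difference(linesOfFirst.begin(), linesOfFirst.end(),
                                  linesOfSecond.begin(), linesOfSecond.end(),
                                  std::back_inserter(result));
    return result;
}
```

An empty result means that the two words never appear without each other, which is the pattern that regKey and valName show in ReadValue.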

Counting words in code can also provide ideas for refactoring tasks. Since those two objects are always used together in the function, perhaps it could be interesting to group them into one object. Another option would be to perform the lookup of valName in regKey before calling ReadValue, and make ReadValue take only the result of the lookup as an input parameter.
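
Here is a rough sketch of what grouping the two objects could look like. Everything in it is hypothetical: RegistryEntry and queryDword are names I made up, and a std::map stands in for the registry key so that the example compiles without ATL. The real CRegKey interface is different, so take this as an illustration of the shape of the refactoring only.

```cpp
#include <map>
#include <string>

// Stand-in for a registry key, for this sketch only (the original code uses CRegKey).
using FakeRegKey = std::map<std::wstring, unsigned long>;

// Hypothetical type grouping the two parameters that always travel together.
struct RegistryEntry
{
    const FakeRegKey& key;
    std::wstring valueName;

    // Performs the lookup in one place. Returns false if the value is missing.
    bool queryDword(unsigned long& result) const
    {
        const auto found = key.find(valueName);
        if (found == key.end()) return false;
        result = found->second;
        return true;
    }
};
```

With such a type, a function like ReadValue would take a single parameter, and the lookup logic would no longer be repeated at each call site.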

Sometimes the input parameters are not used extensively in the function though. For example, consider this other function taken from the same codebase:

[Image: another function from the same codebase, with its input parameters highlighted]

However, it is always interesting to see where a function uses its inputs.

Intensive uses of an object

Another pattern that comes up often and that teaches a lot about a piece of code is an intensive use of a word in a portion of the code, and very few usages outside of this portion. This can mean that this portion of code is focused on using a particular object, which clarifies the responsibilities of the portion of code.

Let’s illustrate it with another example:

int CSettingsParser::ParseTreeRec( const wchar_t *str, std::vector<TreeItem> &items, CString *names, int level )
{
	size_t start=items.size();
	while (*str)
	{
		wchar_t token[256];
		str=GetToken(str,token,_countof(token),L", \t");
		if (token[0])
		{
			// 
			bool bFound=false;
			for (int i=0;i<level;i++)
				if (_wcsicmp(token,names[i])==0)
				{
					bFound=true;
					break;
				}
			if (!bFound)
			{
				TreeItem item={token,-1};
				items.push_back(item);
			}
		}
	}
	size_t end=items.size();
	if (start==end) return -1;

	TreeItem item={L"",-1};
	items.push_back(item);

	if (level<MAX_TREE_LEVEL-1)
	{
		for (size_t i=start;i<end;i++)
		{
			wchar_t buf[266];
			Sprintf(buf,_countof(buf),L"%s.Items",items[i].name);
			const wchar_t *str2=FindSetting(buf);
			if (str2)
			{
				names[level]=items[i].name;
				// these two statements must be on separate lines. otherwise items[i] is evaluated before ParseTreeRec, but
				// the items vector can be reallocated inside ParseTreeRec, causing the address to be invalidated -> crash!
				int idx=ParseTreeRec(str2,items,names,level+1);
				items[i].children=idx;
			}
		}
	}
	return (int)start;
}

One of the terms that comes up frequently in the function is token. Let’s see where this term appears in the function’s code:

[Image: the ParseTreeRec function with the occurrences of token highlighted]

Since token appears many times in the while loop, it suggests that it has a central role in that loop. This is good to know if we need to understand what the loop does, and it also suggests a refactoring: why not put some of the body of the loop in a function that takes token as an input parameter?
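
To put a number on how concentrated a word is, one could compare the range of lines between its first and last occurrences with the length of the function. The sketch below assumes that the line numbers of the occurrences are already known. WordSpan and computeSpan are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct WordSpan
{
    std::size_t firstLine; // line of the first occurrence (0 if there is none)
    std::size_t lastLine;  // line of the last occurrence (0 if there is none)
    double coverage;       // share of the function's lines between first and last occurrence
};

// Computes the span of a word from the 1-based lines where it occurs.
WordSpan computeSpan(const std::vector<std::size_t>& occurrenceLines, std::size_t totalLines)
{
    if (occurrenceLines.empty() || totalLines == 0)
        return WordSpan{0, 0, 0.0};

    const auto bounds = std::minmax_element(occurrenceLines.begin(), occurrenceLines.end());
    const std::size_t first = *bounds.first;
    const std::size_t last = *bounds.second;
    return WordSpan{first, last, static_cast<double>(last - first + 1) / totalLines};
}
```

In the listing of ParseTreeRec above, token appears on lines 6, 7, 8, 13 and 20 of a 49-line function. computeSpan then gives a coverage of about 0.31: all the occurrences sit in less than a third of the function.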

There is plenty left to discover

The three techniques above help us understand code by quickly giving high-level information about it. This big picture of a piece of code also suggests some refactoring tasks to improve it.

But there is more to word counting. Based on the discussions I had with people around me, I’d like to go further by exploring these ideas:

counting the individual words inside of a camelCaseSymbol,
trying case-sensitive and case-insensitive word counts,
performing word counts at the level of a module, across multiple files.
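
As a first taste of the first two ideas, here is a small sketch that splits a symbol into its individual words and lowercases them. splitSymbol and toLower are hypothetical names, and the splitting rule is an assumption of mine: a new word starts at an uppercase letter that follows a non-uppercase character, and underscores separate words. A known limitation is that a run of capitals stays glued to what follows it, so QueryDWORDValue is split into Query and DWORDValue.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Splits a symbol such as "camelCaseSymbol" into {"camel", "Case", "Symbol"}.
// Underscores act as separators and are dropped.
std::vector<std::string> splitSymbol(const std::string& symbol)
{
    std::vector<std::string> words;
    std::string current;
    for (std::size_t i = 0; i < symbol.size(); ++i)
    {
        const unsigned char c = static_cast<unsigned char>(symbol[i]);
        const bool afterNonUpper = !current.empty() &&
            !std::isupper(static_cast<unsigned char>(current.back()));
        const bool boundary = c == '_' || (std::isupper(c) && afterNonUpper);
        if (boundary && !current.empty())
        {
            words.push_back(current);
            current.clear();
        }
        if (c != '_')
            current += static_cast<char>(c);
    }
    if (!current.empty())
        words.push_back(current);
    return words;
}

// Lowercases a word, so that "Value" and "value" can be counted as the same word.
std::string toLower(std::string word)
{
    for (char& c : word)
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    return word;
}
```

Feeding the lowercased parts into a word count would make ReadValue, value and bstrVal contribute to related entries, which may or may not be what we want. That is one of the things to find out.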

Also, in future posts we will build our own program designed to count words in code, which is not quite the same as counting words in just any text. We will use the STL algorithms to code up this program.

Do you think counting words can be useful to understand your codebase? How do you think we should improve the above techniques?

Please leave me your feedback below, so that we can discuss this exciting topic.

About Jonathan Boccara
