Jonathan Boccara's blog

RestMyCase: A C++ Library for Formatting String Cases

Published January 8, 2019 - 0 Comments

Today’s guest post is written by guest author JFTInterested in writing on Fluent C++ too? Submit your guest post!

In his 29 June 2018 blog post about C++ summer projects, Jonathan suggested seven more ways to get better at C++ during the summer of 2018. One of these was a project to implement ‘Title Case’ – with extensions to allow different case styles.

On the face of it, this seemed like a fairly easy project – but it isn’t! Once I delved into it, this turned into quite a complex undertaking. Obviously, there are probably as many ways to do this as there are C++ programmers – with probably no ‘best way’, just ‘different ways’.

My code is available here. Note that it cannot be compiled using the on-line compiler if C++17 execution policies are used as the on-line compiler doesn’t yet support these. But a pre-compile test is included for this so the provided code will run as sequential. We will discuss the (surprising!) performance aspects of parallel algorithms at the end of the article

Simple Case

If all that is required is just to have every word as lower case with the initial letter as upper case (ignoring punctuation etc) – then it would be fairly trivial and there are many implementations of this available on the Internet. One such implementation could be:

std::string simpleCase(const std::string& text)
{
      std::stringstream result;
      bool newwrd = true;

      for (const auto ch : text) {
            newwrd = newwrd || std::isspace(ch);
            if (std::isalpha(ch))
                  if (newwrd) {
                        result << static_cast<char>(std::toupper(ch));
                        newwrd = false;
                  } else
                        result << static_cast<char>(std::tolower(ch));
            else
                  result << ch;
      }

      return result.str();
}

Note that static_cast’s are required as the return type from std::toupper() is an int, not a char!

This is basically Step 1 of the challenge – a simple and fast implementation. However, for anything more involved then this algorithm is not really suitable.

Background

Title Case is but one of a number of case styles that are used in different circumstances for readability or effect (or both!). There are a number of well-known case styles used (eg Snake Case, Camel Case etc). However, when I started to dig deeper into the actual meanings of these different styles, it became apparent that there is no universal definition for many of them (eg Camel Case may or may not have the first word capitalised). Hence for the sake of clarity I have summarised my interpretation of the main eight different cases as follows:

Train Case

·         Words separated by _ char

·         No punctuation

·         Initial letter upper-case except first word

·         Example “now_Is_The_Time”

 

Snake Case

·         Words separated by _ char

·         No Punctuation

·         Initial letter lower-case

·         Example “now_is_the_time”

 

Camel Case (Pascal Case or Upper Case Camel)

·         Words are not separated

·         No punctuation

·         Initial letter upper-case (including first word)

·         Example “NowIsTheTime”

Camel Back (Lower Case Camel)

·         Words are not separated

·         No punctuation

·         Initial letter upper-case except first word

·         Example “nowIsTheTime”

Upper Case

·         Words separated by space

·         Punctuation allowed

·         Every letter upper-case

·         Example “NOW IS THE TIME”

Lower Case

·         Words separated by space

·         Punctuation allowed

·         Every letter lower case

·         Example “now is the time”

Start Case

·         Words separated by space

·         Punctuation allowed

·         Every word capitalised

·         Example “Now Is The Time”

 

Title Case

·         Words separated by space

·         Punctuation allowed

·         First word capitalised

·         Other words capitalised except for exemptions

·         Example “Now is the Time”

A Word about Punctuation

Some of the above case types allow punctuation in the result and some don’t. For those that do (eg Title Case), the punctuation has to kept but also the word itself has to be extracted so that it can be compared to lists of exemptions if required (an exemption is a word that should keep its initial capitalization, such as “STL” for example).

Also, there are different ways that words can be punctuated which are immediately obvious to anyone reading the text, but which are not so ‘obvious’ to a program trying to analysis text! Usually punctuation comes immediately after a letter – such as the full-stop at the end of this sentence. But sometimes there is a space(s) before the punctuation for effect (referred to as orphaned or marooned punctuation) – such as here  . But when displayed in a case style that permits punctuation, the punctuation belongs to the adjoining word – so for the previous example this would be displayed as ‘such as here.’

There is also the case where there are no spaces at all around the punctuation and words are separated just by punctuation.Such as here (known as embedded or imprisoned punctuation)! Again, when displayed in a case style that permits punctuation this would be displayed as ‘punctuation. Such as here’.

Then there is the situation when multiple spaces are used    between   words as   here. For the required conversion, these multiple spaces need to be converted into the appropriate separation character. For instance, for Snake Case, this would be displayed as ‘are_used_between_words_as_here’. Another issue is to find where sentences start, as some case styles (eg Title Case) deal differently with the initial letter of the first word in a sentence (for example require it capitalised).

Summary of Features

From the above, the features of such a program to implement not just Title Case but also different case styles, can be summarised as:

  • Capitalization of word and the option to keep it as lower case
  • Transformation of word into all upper case or all lower case (which is optional, see below)
  • Determination of word position (ie if first word in sentence or not)
  • Determination of sentence start
  • Maintenance of a list of words that are not to be capitalized (ie kept as lower case)
  • Maintenance of a list of words that are always to be all upper case
  • Output separator specification (eg _ for Snake Case, none for Camel Case)
  • Sentence termination specification (eg .!?)
  • Punctuation suppression (optional, see below)
  • Elimination of duplicate matching punctuation (eg .. becomes .) (optional, see below)
  • Disregard of upper case words (optional, see below)

Options

In order to accommodate the requirements of the different case styles discussed above, various options have been implemented. The different case styles are then simply a matter of choosing the required option combination to produce the desired result. The implemented options are:

Option Description
IgPun Ignore punctuation
NoDup Remove duplicate punctuation if IgPun not set
MkCaps Make all words upper case
IgCaps Ignore words which are all upper case – leave as upper case
ChkCaps Make word upper case if word in table
FstCaps First word always initial letter upper case (overrides MkLc)
InitCaps Initial letter of words (except first) upper case
MkLC Make lower case if word in table (overrides InitCaps)

So the different case styles can then be obtained by the following option combinations:

  IgPun NoDup MkCaps IgCaps ChkCaps FstCaps InitCaps MkLc Separ
Train Case

under
Snake Case

under
Camel Case

Camel Back

Upper Case

space
Lower case

space
Start Case

space
Title Case

space

 

The following pre-defined styles are available – although different styles could be produced by different combinations of the available options:

constexpr static Opts TitleCase = IgCaps | FstCaps | InitCaps | MkLc | ChkCaps | NoDup;
constexpr static Opts TrainCase = IgPun | InitCaps;
constexpr static Opts SnakeCase = IgPun;
constexpr static Opts CamelCase = FstCaps | InitCaps | IgPun;
constexpr static Opts CamelBack = InitCaps | IgPun;
constexpr static Opts UpperCase = MkCaps | NoDup;
constexpr static Opts LowerCase = NoDup;
constexpr static Opts StartCase = FstCaps | InitCaps | NoDup;

Compile Time vs Run Time

For the different title cases, there are really only two pieces of information needed – the option and the separator. As both of these are known at compile time for each of the various cases, these can be specified as compile-time template parameters rather than run-time function parameters.

As they are specified as template parameters, we can use the if constexpr within the functions to deal with the various different case options – again producing a performance benefit. The other compile-time ‘requirements’ which may need to be changed are the various character/word functions (to check if a word is entirely upper case, to make a word lower case, etc.).

Hence the design is that these specific functions are provided in a class (MyWords in this case) and this class is passed as another parameter to the class template. Hence if other versions of these functions are required, then another class can be provided and the template parameters adjusted accordingly. In other terms, we’re using policies. For an excellent reference on policy based design, check Andrei Alexandrescu’s Modern C++ Design.

Hence the templated functions have a template definition of:

template<Opts OPTS, uint8_t SEPAR, class WRDS>

Where:

  • OPTS is the required options
  • SEPAR is the separation character (note uint8_t and not char as by default char is signed)
  • WRDS the class for the word functions

This then provides for the pre-defined cases as:

// Predefined classes
// Note as TitleCase etc are defined as part of the class, they have to be referenced via a class instance
using DummyCase = RestMyCase<0, 0, MyWords>;  // For option reference via a dummy class

using MyTitleCase = RestMyCase<DummyCase::TitleCase, ' ', MyWords>;
using MyTrainCase = RestMyCase<DummyCase::TrainCase, '_', MyWords>;
using MySnakeCase = RestMyCase<DummyCase::SnakeCase, '_', MyWords>;
using MyCamelCase = RestMyCase<DummyCase::CamelCase,   0, MyWords>;
using MyCamelBack = RestMyCase<DummyCase::CamelBack,   0, MyWords>;
using MyUpperCase = RestMyCase<DummyCase::UpperCase, ' ', MyWords>;
using MyLowerCase = RestMyCase<DummyCase::LowerCase, ' ', MyWords>;
using MyStartCase = RestMyCase<DummyCase::StartCase, ' ', MyWords>;

Those could also be constants outside of the RestMyCase class. The point of putting them inside of the class definition is to show that they belong with RestMyCase.

Approach

Considering the feature requirements, and also that the conversion should be as fast as possible using parallel execution where practical, I decided that the approach to use would be to generate a vector of tokens parsed from the input. This would offer the following benefits:

  • Enable parallel processing of each token
  • Accommodate the punctuation issues discussed above
  • Easily enable different output formats to be produced
  • Split input parsing from output formation so that different methods for each could be used if required

As conversion should be as fast as possible, I also decided that rather than store each token as a std::string, only a pair of iterators would be stored that referenced the text to be converted.  Thus the conversion for each token would be ‘in place’ conversion of the text where each token would be converted independently of each other.

This is ok for parallel execution as simultaneous access to different elements is allowed without requiring locking. The downside of this, of course, is that if different case type conversions are required, the original text has to be specified for each conversion. As this is not expected to be the norm, I considered that the benefits outweigh the drawback.

Originally, I intended to use std::string_view to refer to the appropriate part of the text. However, std::string_view provides read-only access to the underlying data (the clue is in the name – “view”!). Hence I introduced a new class StrAmd that allows read/write access to the underlying data.

This class provides the required sub-set of the std::string_view features but with the ability to amend the underlying data. In this way, the text to be converted is copied only when it is stored in the class and when it is re-constituted for return to the caller – rather than making individual token copies.

This gives the structure of a token as:

// Structure for a word token
struct Token {
       StrAmd word;                // Word
       StrAmd prefix;              // Prefix punctuation
       StrAmd suffix;              // Suffix punctuation
       size_t pos = 0U;            // Word position in line so parallel processing knows which word
       const RestMyCase* const myThis = nullptr;   // This pointer (access member variables from static)

       Token() = delete;           // No default constructor as needs this pointer
       Token(StrAmd w, StrAmd p, StrAmd s, size_t po, const RestMyCase* const th) : word(w), prefix(p), suffix(s), pos(po), myThis(th) {}
};

When a member function is executed in parallel, it has to be defined as static (and also any class function subsequently called) which means that it can’t directly access non-static class member variables. That is why myThis element is used – to enable these static functions to access the non-static member variables.

Punctuation is also stored separately – as prefix and suffix punctuation. This means that the variable word references just the actual word and doesn’t include any punctuation. This makes it easy to allow/disallow punctuation and for quick lookup of the word in tables such as for when the word has to be kept all lowercase or all uppercase etc.

A text line is split into a std::vector of Tokens using the class member function split(). Currently, this analyses the line character by character and builds the appropriate Token for each element. This is where the punctuation issues discussed previously are handled. Ideally, this would be a parallel function, but that’s for another time!

For an input of ”  the,,the. . BOY ,, ???stOOd!! on tHe Burning deck  .  ” 

The resulting token vector would be

Prefix Word Suffix
  the ,,
  the .
    .
  BOY  
    ,,
??? stOOd !!
  on  
  tHe  
  Burning  
  deck  
    .

Which for Title Case gives a result of

The, the. BOY, ?Stood! On the Burning DECK.

On is capitalised as it is the start of a new sentence. DECK is all uppercase as this word is so specified and BOY is all uppercase as it was originally. Multiple punctuation symbols have been reduced to just one symbol with spaces removed and multiple spaces between words have been compacted to one.

Once the std::vector of Tokens has been created, then these are processed in parallel (process() ) according to the required case style. Then finally the required result string is produced using make() from the processed tokens.

As the main purpose of make() is to concatenate various pieces of text together (from the info provided by the Tokens vector) to produce the final string, this process needs to be as fast as possible. C++17 helpfully provides an overload for string += with std::string_view and casting a StdAmd class to std::string_view is trivial. This avoids the overhead of having to first convert to a temporary string before the concatenation.

In a previous version of this code, the make() function was taking a disproportionate amount of the overall time. I eventually traced it down to the string concatenation operation += which was being used with the class StrAmd. This class originally had a conversion to string:

operator std::string() const { return std::string(str, send); }

But no conversion to std::string_view. Hence += was first creating a temporary std::string object before this was used with the concatenation and then deleted. This construction/destruction of the temporary string object was what was causing the performance issue.

Providing a std::string_view conversion:

operator std::string_view() const noexcept {return std::string_view(reinterpret_cast<char*>(str), send - str); }

allowed the std::string concatenation to be performed without the creation of any temporary objects and hence improved the performance.

Using the library

The main class provided is RestMyCase. This provides the ability to specify and process the text as required. It also provides the means to allow certain words to be excluded from processing etc (depending upon options). If this ability is required then pointer(s) to a class instance that provides the necessary word processing functions are specified. In the provided code, these functions are provided in the MyWords class (along with the character processing functions such as isupper() ).

However it is not necessary that these two provisions (the character processing functions and the word processing functions) are part of the same class and can easily be separated into two classes if needed.

The provided MyWords class gives the functionality of adding/removing words from the list (singularly or from an initializer-list) and for saving/loading the list to/from a file. All words are stored as lowercase so that the comparison is fast to determine if a specified word is present or not. This can be done with a simple .count() for the unordered set – although there is a slight performance issue as the type of the unordered set key is std::string so the type of the parameter to .count() has to be the same – hence a cast is required here. There are ways of avoiding this if needed – such as reading the word file(s) into memory and only storing pointers in the unordered set rather than the word itself.

Here is a simple example of use:

MyWords ucwds {"deck"};
MyWords lcwds {"a", "an", "the", "at", "by", "for", "in", “is”, "of", "on", "to", "and", "as", "or"};

cout << MyTitleCase(&lcwds, &ucwds).myCase(“this is a title case example”) << endl;

The main processing is provided by the function myCase() which is simply:

std::string RestMyCase::myCase(std::string_view ln)
{
      line = ln;  // Words will be converted in-situ

      // Split line into words on white-space and ignore multi-white space chars
      auto tkns = split();

      // Process each word in parallel
      std::for_each(std::execution::par_unseq, tkns.begin(), tkns.end(), process);

      // Make required case string
      return make(tkns);
}

Where split() is the function that splits the line into a vector of tokens, process() is the function that processes each token in-situ according to the required case style and make() is the function that produces the required case style string from the tokens.

As some of the different cases treat differently those words that start a sentence – either because it is the first word of the word following an end-of-sentence punctuation, then it is necessary to specify what constitutes end-of-sentence punctuation. By default these characters are “! ? .”. If these are required to be changed, then .setTermPunc() can be used to specify the new end-of-line characters and .getTermPunc() to obtain the current end-of-line characters.

Test cases

For examples of the different case styles, consider:

const string text = "   tHe   BOY stOOd  On The deck  ..  .. the Deck waS buRniNg ! ! ";

cout << "Original text\n\"" << text << "\"" << endl;

cout << "\nAs Title case\n";
cout << MyTitleCase(&lcwds, &ucwds).myCase(text) << endl;

cout << "\nAs Start Case\n";
cout << MyStartCase(&lcwds, &ucwds).myCase(text) << endl;

cout << "\nAs Train Case\n";
cout << MyTrainCase(&lcwds, &ucwds).myCase(text) << endl;

cout << "\nAs Snake Case\n";
cout << MySnakeCase(&lcwds, &ucwds).myCase(text) << endl;

cout << "\nAs Camel Case\n";
cout << MyCamelCase(&lcwds, &ucwds).myCase(text) << endl;

cout << "\nAs Camel Back\n";
cout << MyCamelBack(&lcwds, &ucwds).myCase(text) << endl;

cout << "\nAs Upper Case\n";
cout << MyUpperCase(&lcwds, &ucwds).myCase(text) << endl;

cout << "\nAs Lower Case\n";
cout << MyLowerCase(&lcwds, &ucwds).myCase(text) << endl;

Which produces the output:

Original text
"   tHe   BOY stOOd  On The deck  ..  .. the Deck waS buRniNg ! ! "

As Title case
The BOY Stood on the DECK. The DECK Was Burning!

As Start Case
The Boy Stood On The Deck. The Deck Was Burning!

As Train Case
the_Boy_Stood_On_The_Deck_The_Deck_Was_Burning

As Snake Case
the_boy_stood_on_the_deck_the_deck_was_burning

As Camel Case
TheBoyStoodOnTheDeckTheDeckWasBurning

As Camel Back
theBoyStoodOnTheDeckTheDeckWasBurning

As Upper Case
THE BOY STOOD ON THE DECK. THE DECK WAS BURNING!

As Lower Case
the boy stood on the deck. the deck was burning!

Timing

In order to obtain timing info, I perform a conversion to TitleCase 300,000 times to get a reasonable measurable time.

const size_t numloop = 300'000;
string tc;

auto startt = std::chrono::high_resolution_clock::now();

MyTitleCase rm(&lcwds, &ucwds);

for (size_t i = 0; i < numloop; ++i)
      tc = rm.myCase(text);

auto diff = std::chrono::high_resolution_clock::now() - startt;
std::cout << std::chrono::duration<double, milli>(diff).count() << " ms" << std::endl;

cout << tc << endl;

The timings obtained are very interesting:

Code Variation Time (laptop) Time (coliru)
Code as provided (sequential,  MyWords::toupper() etc) 310 506
Using std::toupper() etc 409 635
Using std::execution::par_unseq ( MyWords::toupper() etc) 1,0213 N/A

Note that all timings are in ms. The laptop uses Windows 7 with MS VS 2017 15.8.5

This is very instructive. Using parallel execution is about 33 times slower than sequential execution – which might not have been expected, and I certainly didn’t when the program was designed to use parallel execution!

However, investigation shows that there are a large number of very small parallel loops – with each parallel loop potentially using a thread. Creating/deleting threads has an overhead. Not as much as creating a new process – but an overhead nevertheless. In the case of this code, the overhead of continually creating and destroying multiple threads is much greater than the time saved using parallel execution – hence using a parallel policy in this case makes the performance worse and not better!

The conclusion from this is clear: don’t always assume that just because some code can be parallelised then it should be. Always do performance testing to determine the best scenario.

The other performance related conclusion is that the implementations of std::toupper() etc are not the most efficient. Using the equivalent functions in the code gives about 25% (MSVS) and 20% (coliru) performance improvement over the standard CRT ones – although the provided functions don’t support locale etc.

The ones in the code are based simply upon a look-up table with one entry for each of the 256 possible characters in the ASCII character set (hence the requirement to use unsigned char [uint8_t] and not char (which is signed by default) which gives a value range of -128 to +127 rather than the required 0 – 256) – giving a true/false result for the character used as an index. Currently the data provides for ispunct(), isspace(), isupper(), islower() and isdigit() – although it is very easy to extend for others as required. The table is based around the struct isa:

struct isa {
      bool isapunct = false;
      bool isaspace = false;
      bool isaupper = false;
      bool isalower = false;
bool isadigit = false;
};

Where each element of the struct represents a required character trait. The table is then:

constexpr isa chars[std::numeric_limits<uint8_t>::max() + 1] {
{0, 0, 0, 0, 0},        //   0      0    NUL
{0, 0, 0, 0, 0},        //   1      1    SCH         CTRL A
{0, 0, 0, 0, 0},        //   2      2    STX         CTRL B//… etc
//...
{1, 0, 0, 0, 0},        //  46      2e    .
{1, 0, 0, 0, 0},        //  47      2f    /
{0, 0, 0, 0, 1},        //  48      30    0
//...
};

The look-up function is then trivial. For example:

constexpr static inline bool isspace(uint8_t ch) noexcept {return chars[ch].isaspace; }

Just add to struct isa as required and then provide the necessary 0’s and 1’s for the new entry(s) in the array chars – the code for the new lookup is then as easy as the above.

With all performance related issues, however, you need to determine first that you have a performance issue, then second to establish where are the performance bottleneck(s) through profiling and third to ascertain if the bottleneck(s) is caused by the algorithm or the code implementation.

Conclusion

This has been a very interesting summer project. The more I delved into it, the more complex it became. If nothing else comes of this, I hope it encourages thought and discussion. As always, any issues found with the code are attributable to A. N. Other to whom I will pass any such reported misguided comments! Adios summer 2018. Roll on summer 2019.

You will also like

Don't want to miss out ? Follow:   twitterlinkedinrss
Share this post!Facebooktwitterlinkedin