Jonathan Boccara's blog

An Extraterrestrial Guide to C++ Formatting

Published December 4, 2018

Today’s guest post is written by Victor Zverovich. Victor is a software engineer at Facebook working on the Thrift RPC framework and the author of the popular {fmt} library, a subset of which is proposed for C++20 as a new formatting facility. He is passionate about open-source software, designing good APIs, and science fiction, as you can guess from the current post. Victor gives us an overview of the {fmt} library, which brings expressive and concise text formatting to C++. You can find Victor online on Twitter, StackOverflow, and GitHub.

Consider the following use case: you are developing the Enteropia[2]-first Sepulka[3]-as-a-Service (SaaS) platform and have server code written in C++ that checks the value of a sepulka’s squishiness received over the wire and, if the value is invalid, logs it and returns an error to the client. Squishiness is passed as a single byte and you want to format it as a 2-digit hexadecimal integer, because that is, of course, the Ardrite[1] National Standards Institute (ANSI) standard representation of squishiness. Let’s implement the logging part using different formatting facilities provided by C++.

Here’s an iostreams version:

#include <cstdint>
#include <iomanip>
#include <ostream>

void log_error(std::ostream& log, std::uint_least8_t squishiness) {
    log << "Invalid squishiness: "
        << std::setfill('0') << std::setw(2) << std::hex
        << squishiness << "\n";
}

The code is a bit verbose, isn’t it? You also need to pull in an additional header, <iomanip>, to do even basic formatting. But that’s not a big deal.

However, when you try to test this code (inhabitants of Enteropia have an unusual tradition of testing their logging code) you find out that the code doesn’t do what you want. For example,

log_error(std::cout, 10);

prints

Invalid squishiness: 0

which is surprising for two reasons: first, it prints one character instead of two, and second, the printed value is wrong. After a bit of debugging, you figure out that iostreams treat the value as a character on your platform and that the extra newline in your log is not a coincidence. An even worse scenario is that it works on your system but not on that of your most beloved customer.

So you add a cast to fix this, which makes the code even more verbose:

log << "Invalid squishiness: "
    << std::setfill('0') << std::setw(2) << std::hex
    << static_cast<unsigned>(squishiness) << "\n";
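The pitfall can be reproduced in isolation. The following is a minimal sketch, assuming a platform where std::uint_least8_t is an alias for unsigned char (the case on mainstream platforms, but not guaranteed by the standard). The helper names format_raw and format_cast are made up for this illustration and write to a string instead of a log.

```cpp
#include <cstdint>
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: streams the value as is. If std::uint_least8_t is a
// character type, iostreams insert it as a character, not as a number.
std::string format_raw(std::uint_least8_t squishiness) {
    std::ostringstream os;
    os << std::setfill('0') << std::setw(2) << std::hex << squishiness;
    return os.str();
}

// Hypothetical helper: converts to unsigned first, so the value is always
// formatted as an integer.
std::string format_cast(std::uint_least8_t squishiness) {
    std::ostringstream os;
    os << std::setfill('0') << std::setw(2) << std::hex
       << static_cast<unsigned>(squishiness);
    return os.str();
}
```

Under that assumption, format_raw(10) yields the fill character 0 followed by a newline (character code 10), while format_cast(10) yields 0a.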

Can the Ardrites do better than that?

Yes, they can.

Format strings

Surprisingly, the answer comes from the ancient 1960s (Gregorian calendar) Earth technology, format strings (in a way, this is similar to the story of coroutines). C++ had this technology all along in the form of the printf family of functions and later rediscovered it in several places: std::put_time, std::chrono::format.

What makes format strings so useful is expressiveness. With a very simple mini-language you can easily express complex formatting requirements. To illustrate this let’s rewrite the example above using printf:

#include <cstdint>
#include <cstdio>

void log_error(std::FILE* log, std::uint_least8_t squishiness) {
    std::fprintf(log, "Invalid squishiness: %02x\n", squishiness);
}

Isn’t it beautiful in its simplicity? Even if you’ve somehow never seen printf in your life, you can learn the syntax in no time. In contrast, can you always remember which iostreams manipulator to use? Is it std::fill or std::setfill? Why std::setw and std::setprecision and not, say, std::setwidth or std::setp?
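To make the mini-language a bit more concrete, here is a small sketch that formats into a buffer with std::snprintf. The helper names format_hex_byte and format_ratio are hypothetical, and the second one goes slightly beyond the squishiness example to show a precision specifier.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: formats a value as a 2-digit, zero-padded hexadecimal
// number. %x expects an unsigned int, so the parameter has exactly that type.
std::string format_hex_byte(unsigned value) {
    char buffer[16];
    // '0' = pad with zeros, '2' = minimum width, 'x' = lowercase hexadecimal.
    std::snprintf(buffer, sizeof buffer, "%02x", value);
    return buffer;
}

// Hypothetical helper: the same mini-language handles floating point too.
// '.3' = three digits after the decimal point, 'f' = fixed notation.
std::string format_ratio(double value) {
    char buffer[32];
    std::snprintf(buffer, sizeof buffer, "%.3f", value);
    return buffer;
}
```

For instance, format_hex_byte(10) gives 0a and, in the default "C" locale, format_ratio(0.5) gives 0.500.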

A lesser-known advantage of printf is atomicity. A format string and arguments are passed to a formatting function in a single call which makes it easier to write them atomically without having interleaved output in the case of writing from multiple threads.

In contrast, with iostreams each argument and parts of the message are fed into formatting functions separately which makes synchronization harder. This problem was only addressed in C++20 with the introduction of an additional layer of std::basic_osyncstream.

However, the C printf comes with its set of problems which iostreams addressed:

  • Safety: C varargs are inherently unsafe and it is a user’s responsibility to make sure that the type information is carefully encoded in the format strings. Some compilers issue a warning if the format specification doesn’t match argument types, but only for literal strings. Without extra care, this ability is often lost when wrapping printf in another API layer such as logging. Compilers can also lie to you in these warnings.
  • Extensibility: you cannot format objects of user-defined types with printf.
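The safety point about wrappers can be illustrated with the format function attribute, a GCC and Clang extension that asks the compiler to check calls to a wrapper the same way it checks printf. This is only a sketch of the "extra care" mentioned above: the PRINTF_LIKE macro and the format_message function are hypothetical names, and other compilers simply get no checking.

```cpp
#include <cstdarg>
#include <cstdio>
#include <string>

#if defined(__GNUC__) || defined(__clang__)
// Tells the compiler which parameter is the format string and where the
// variadic arguments start (both 1-based).
#  define PRINTF_LIKE(format_index, first_arg) \
      __attribute__((format(printf, format_index, first_arg)))
#else
#  define PRINTF_LIKE(format_index, first_arg)
#endif

// Hypothetical logging-style wrapper around vsnprintf.
std::string format_message(const char* format, ...) PRINTF_LIKE(1, 2);

std::string format_message(const char* format, ...) {
    std::va_list args;
    va_start(args, format);

    // First pass: compute the required size using a copy of the arguments.
    std::va_list args_copy;
    va_copy(args_copy, args);
    int size = std::vsnprintf(nullptr, 0, format, args_copy);
    va_end(args_copy);

    std::string result;
    if (size > 0) {
        result.resize(static_cast<std::size_t>(size));
        // Second pass: write the text; the terminating null lands on the
        // string's own terminator.
        std::vsnprintf(&result[0], result.size() + 1, format, args);
    }
    va_end(args);
    return result;
}
```

With the attribute in place and format warnings enabled, a call such as format_message("%s", 42) should be diagnosed at compile time, whereas without it the mismatch would go unnoticed until run time.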

With the introduction of variadic templates and constexpr in C++11 it has become possible to combine the advantages of printf and iostreams. One attempt at this is {fmt}, a popular open-source formatting library.
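To give a feel for how variadic templates make this combination possible, here is a deliberately tiny toy, assuming C++17. It is not how {fmt} is implemented: toy_format and write_next are hypothetical names, only the bare {} placeholder is supported, and the actual conversion is delegated to iostreams.

```cpp
#include <sstream>
#include <string>
#include <string_view>

// Hypothetical helper: copies text up to the next "{}" and then writes the
// argument. The argument type is known statically, so no type needs to be
// encoded in the format string.
template <typename T>
void write_next(std::ostringstream& out, std::string_view& rest,
                const T& value) {
    std::size_t pos = rest.find("{}");
    if (pos == std::string_view::npos) {
        return;  // more arguments than placeholders: ignore the extra one
    }
    out << rest.substr(0, pos) << value;
    rest.remove_prefix(pos + 2);
}

// Hypothetical toy formatter: a format string plus type-safe arguments.
template <typename... Args>
std::string toy_format(std::string_view format_str, const Args&... args) {
    std::ostringstream out;
    (write_next(out, format_str, args), ...);  // one call per argument
    out << format_str;                         // remaining literal text
    return out.str();
}
```

A call like toy_format("Invalid squishiness: {}, ssn={}", 66u, 12345u) keeps the message readable as a whole, as with printf, while any type with an operator<< can be passed, as with iostreams.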

The {fmt} library

Let’s implement the same logging example using {fmt}:

#include <cstdint>
#include <fmt/ostream.h> // for std::ostream support

void log_error(std::ostream& log, std::uint_least8_t squishiness) {
    fmt::print(log, "Invalid squishiness: {:02x}\n", squishiness);
}

As you can see, the code is similar to that of printf, with the notable difference being {} used as delimiters instead of %. This allows us and the parser to easily find format specification boundaries and is particularly important for more sophisticated formatting (e.g. formatting of date and time).

Unlike standard printf, {fmt} supports positional and named arguments, i.e. referring to an argument by its index or name, with an argument id (index or name) separated from format specifiers by the : character:

fmt::print(log, "Invalid squishiness: {0:02x}\n", squishiness);
fmt::print(log, "Invalid squishiness: {squ:02x}\n",
           fmt::arg("squ", squishiness));

Both positional and named arguments allow using the same argument multiple times. Named arguments are particularly useful if your format string is stored elsewhere, e.g. in a translation database.

Otherwise, the format syntax of {fmt}, which is borrowed from Python, is very similar to printf’s. In this case the format specifications are identical (02x) and have the same semantics, namely format a 2-digit integer in hexadecimal with zero padding.

But because {fmt} is based on variadic templates instead of C varargs and is fully type-aware (and type-safe), it simplifies the syntax even further by getting rid of all the numerous printf specifiers that only exist to convey the type information. The printf example from earlier is in fact incorrect. Strictly speaking it should have been

std::fprintf(log, "Invalid squishiness: %02" PRIxLEAST8 "\n", squishiness);

which doesn’t look as appealing.

Here is a (possibly incomplete) list of specifiers made obsolete: hh, h, l, ll, L, z, j, t, I, I32, I64, q, as well as a zoo of 84 macros:

     intx_t    int_leastx_t    int_fastx_t    intmax_t    intptr_t
d    PRIdx     PRIdLEASTx      PRIdFASTx      PRIdMAX     PRIdPTR
i    PRIix     PRIiLEASTx      PRIiFASTx      PRIiMAX     PRIiPTR
u    PRIux     PRIuLEASTx      PRIuFASTx      PRIuMAX     PRIuPTR
o    PRIox     PRIoLEASTx      PRIoFASTx      PRIoMAX     PRIoPTR
x    PRIxx     PRIxLEASTx      PRIxFASTx      PRIxMAX     PRIxPTR
X    PRIXx     PRIXLEASTx      PRIXFASTx      PRIXMAX     PRIXPTR

where x = 8, 16, 32 or 64.
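For readers who have not met these in the wild, here is a short sketch of what such type-carrying specifiers look like with printf-style functions. The function describe_sizes and its parameters are hypothetical; the macros come from <cinttypes> and z is the standard length modifier for std::size_t.

```cpp
#include <cinttypes>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper: every argument needs a specifier that spells out its
// exact type, either as a length modifier (z) or as a macro (PRIu64, PRIdMAX).
std::string describe_sizes(std::uint64_t bytes, std::size_t count,
                           std::intmax_t delta) {
    char buffer[96];
    std::snprintf(buffer, sizeof buffer,
                  "bytes=%" PRIu64 ", count=%zu, delta=%" PRIdMAX,
                  bytes, count, delta);
    return buffer;
}
```

With a type-aware library, each of these three fields could be written as a plain {}.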

In fact even x in the {fmt} example is not an integer type specifier, but a hexadecimal format specifier, because the information that the argument is integer is preserved. This allows omitting all format specifiers altogether to get the default (decimal for integers) formatting:

fmt::print(log, "Invalid squishiness: {}\n", squishiness);

Following a popular trend in the Ardrite software development community, you decide to switch all your code from std::uint_least8_t to something stronger-typed and introduce the squishiness type:

enum class squishiness : std::uint_least8_t {};

Also you decide that you always want to use ANSI-standard formatting of squishiness which will hopefully allow you to hide all the ugliness in operator<<:

std::ostream& operator<<(std::ostream& os, squishiness s) {
    return os << std::setfill('0') << std::setw(2) << std::hex
              << static_cast<unsigned>(s);
}

Now your logging function looks much simpler:

void log_error(std::ostream& log, squishiness s) {
    log << "Invalid squishiness: " << s << "\n";
}

Mixing formats in the string

Then you decide to add another important piece of information, the sepulka security number (SSN), to the log, although you are afraid it might not pass the review because of privacy concerns:

void log_error(std::ostream& log, squishiness s, unsigned ssn) {
    log << "Invalid squishiness: " << s << ", ssn=" << ssn << "\n";
}

To your surprise, SSN values in the log are wrong, for example

log_error(std::cout, squishiness(0x42), 12345);

gives

Invalid squishiness: 42, ssn=3039

After another debugging session you realize that the std::hex flag is sticky and SSN ends up being formatted in hexadecimal. So you have to change your overloaded operator<< to

std::ostream& operator<<(std::ostream& os, squishiness s) {
    std::ios_base::fmtflags f(os.flags());
    os << std::setfill('0') << std::setw(2) << std::hex
       << static_cast<unsigned>(s);
    os.flags(f);
    return os;
}

A pretty complicated piece of code just to print out an SSN in decimal format.
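The save and restore dance is often packaged into a small RAII helper. The sketch below is one possible shape, not a standard facility: stream_state_guard and format_error are hypothetical names. It also restores the fill character, which is sticky as well and is not part of the flags returned by os.flags().

```cpp
#include <cstdint>
#include <iomanip>
#include <ios>
#include <ostream>
#include <sstream>
#include <string>

enum class squishiness : std::uint_least8_t {};

// Hypothetical RAII helper: remembers the flags and the fill character of a
// stream and restores both when it goes out of scope.
class stream_state_guard {
public:
    explicit stream_state_guard(std::ios& stream)
        : stream_(stream), flags_(stream.flags()), fill_(stream.fill()) {}
    ~stream_state_guard() {
        stream_.flags(flags_);
        stream_.fill(fill_);
    }
    stream_state_guard(const stream_state_guard&) = delete;
    stream_state_guard& operator=(const stream_state_guard&) = delete;

private:
    std::ios& stream_;
    std::ios_base::fmtflags flags_;
    char fill_;
};

std::ostream& operator<<(std::ostream& os, squishiness s) {
    stream_state_guard guard(os);  // state is restored after the output
    return os << std::setfill('0') << std::setw(2) << std::hex
              << static_cast<unsigned>(s);
}

// Hypothetical helper used to inspect the result as a string. The SSN is
// given a width of 6 to make any leftover fill character visible.
std::string format_error(squishiness s, unsigned ssn) {
    std::ostringstream log;
    log << "Invalid squishiness: " << s << ", ssn=" << std::setw(6) << ssn;
    return log.str();
}
```

Here format_error(squishiness(0x42), 12345) pads the SSN with a space rather than with a leftover 0, and prints it in decimal.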

{fmt} follows a more functional approach and doesn’t share the formatting state between the calls. This makes reasoning about formatting easier and brings performance benefits because you don’t need to save/check/restore state all the time.

To make squishiness objects formattable you just need to specialize the formatter template and you can reuse existing formatters:

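#include <fmt/format.h>
#include <fmt/ostream.h> // for fmt::print to std::ostream

template <>
struct fmt::formatter<squishiness> : fmt::formatter<unsigned> {
    // parse() is inherited; format() is const, as recent {fmt} versions expect.
    auto format(squishiness s, format_context& ctx) const {
        return fmt::format_to(ctx.out(), "{:02x}", static_cast<unsigned>(s));
    }
};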

void log_error(std::ostream& log, squishiness s, unsigned ssn) {
    fmt::print(log, "Invalid squishiness: {}, ssn={}\n", s, ssn);
}

You can read the message "Invalid squishiness: {}, ssn={}\n" as a whole, not interleaved with <<, which is more readable and less error-prone.

Now you decide that you don’t want to log everything to a stream but use your system’s logging API instead. All your servers run GNU/systemd, an operating system popular on Enteropia, where GNU stands for “GNU’s not Ubuntu”, so you implement logging via its journal API. Unfortunately the journal API is very user-unfriendly and unsafe. So you end up wrapping it in a type-safe layer and making it more generic:

#include <string_view>
#include <systemd/sd-journal.h>
#include <fmt/format.h> // no need for fmt/ostream.h anymore

void vlog_error(std::string_view format_str, fmt::format_args args) {
    sd_journal_send("MESSAGE=%s", fmt::vformat(format_str, args).c_str(),
                    "PRIORITY=%i", LOG_ERR, NULL);
}

template <typename... Args>
inline void log_error(std::string_view format_str, const Args&... args) {
    vlog_error(format_str, fmt::make_format_args(args...));
}

Now you can use log_error as any other formatting function and it will log to the system journal:

log_error("Invalid squishiness: {}, ssn={}\n", squishiness(0x42), 12345);

The reason why we don’t call sd_journal_send directly in log_error, but rather go through the intermediary vlog_error, is that vlog_error is not a template and therefore is not instantiated for every combination of argument types. This dramatically reduces binary code size. log_error is a template, but because it is inlined and doesn’t do anything other than capturing the arguments, it doesn’t add much to the code size either.

The fmt::vformat function performs the actual formatting and returns the result as a string which you then pass to sd_journal_send. You can avoid string construction with fmt::vformat_to but this code is not performance critical, so you decide to leave it as is.

Exploring {fmt}

In the process of developing your SaaS system you’ve learnt about the most fundamental features of {fmt}, namely format strings, positional and named arguments, extensibility for user-defined types as well as different output targets and statelessness, and how they compare to the earlier formatting facilities.

More advanced features include compile-time format string processing, user-defined format string syntax, control over the use of locales, and Unicode support, but you decide to explore those another time.

Glossary

[1] Ardrites – intelligent beings, polydiaphanohedral, nonbisymmetrical and pelissobrachial, belonging to the genus Siliconoidea, order Polytheria, class Luminifera.

[2] Enteropia – 6th planet of a double (red and blue) star in the Calf constellation

[3] Sepulka – pl: sepulki, a prominent element of the civilization of Ardrites from the planet of Enteropia; see “Sepulkaria”

[4] Sepulkaria – sing: sepulkarium, establishments used for sepuling; see “Sepuling”

[5] Sepuling – an activity of Ardrites from the planet of Enteropia; see “Sepulka”

The picture and references come from the book Star Diaries by Stanislaw Lem.
