Evaluating user-defined logical expressions

Published May 8, 2020

This is a guest post by Marin Peko. Marin is a Software Engineer working at Cellusys, a company providing telecommunication solutions, and follower of Fluent C++. You can find him on LinkedIn and Github.

Logical expressions are probably one of the most used concepts in computer science and certainly a large part of each code base. In essence, each statement that can either be true or false is a logical expression.

But… things can get a bit more complicated…

What if we want to let the users define a logical expression that will be used in our application?

One example of such an application of parsing strings is tcpdump, a powerful CLI network packet analyzer. It gives you the ability to specify a filter expression like src 127.0.0.1 which will filter out all the network packets with the source IP address 127.0.0.1. You can imagine this filter expression to be implemented in the source code like:

if (source_ip == "127.0.0.1") {
    // pass the packet
}

A new C++17 library, booleval, allows you to achieve exactly that, for boolean expressions as strings. It gives you the ability to specify a logical expression and get your objects evaluated according to it.

In this post I will show some rationale that I found instructive for the design of the library and for the evolution of its interfaces.

How does the tokenization work?

Once the end-user specifies the logical expression (through the command line interface or GUI) the first step is to tokenize that expression. Tokenization itself is performed in two steps:

1. extracting the tokens out of the logical expression
2. injecting a default EQUAL TO operators to where they belong to

That being said, expression (field_a foo and field_b 123) or field_b 456 would consist of the following tokens:

Token	Type
(	LEFT parentheses
field_a	FIELD
eq	EQUAL TO operator *
foo	FIELD
and	AND operator
field_b	FIELD
eq	EQUAL TO operator *
123	FIELD
)	RIGHT parentheses
or	OR operator
field_b	FIELD
eq	EQUAL TO operator *
456	FIELD

* EQUAL TO operator is an optional operator which means that you can but you don’t need to specify it in the logical expression. This means that the above expression could be also written like “(field_a eq foo and field_b eq 123) or field_b eq 456”

Interface for tokenizing the boolean expression

The utility function for splitting up the logical expression has the following signature:

[[nodiscard]] std::vector<std::string_view> split(std::string_view strv,
                                                  std::string_view delims,
                                                  split_options const options);

where split_options is an enumeration defined as:

enum class [[nodiscard]] split_options : uint8_t {
    off&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; = 0x00,
    include_whitespace = 0x01,
    include_delimiters = 0x02,
    exclude_delimiters = 0x04
};

Now, we can use the split function like:

auto raw_tokens = split(
    "(field_a foo and field_b 123) or field_b 456",
    "<>()",
    split_options::include_whitespace |
    split_options::include_delimiters
);

The above call would split the expression (field_a foo and field_b 123) or field_b 456 by whitespace character (because of the include_whitespace option) as well as by <, >, ( and ) characters, and it would result in the following std::vector of std::string_view:

{ "(", "field_a", "foo", "and", "field_b", "123", ")", "or", "field_b", "456" }

Please, notice that the result contains delimiters too. That’s because include_delimiters option was specified in the function call.

Implementing the tokenization

The initial implementation of the booleval library was using C++ regex library for tokenizing the logical expression. However, this turned out to be an overkill and was decreasing the performance significantly. On the other hand, simple splitting up the logical expression by whitespaces and single character symbols (like (, ), < and >) would have the same result and much better performance.

The following figure shows the performance difference between a regex-based split function and an ordinary split function:

The full implementation of the split functions looks like:

std::vector < std::string_view > split(std::string_view strv,
                                       std::string_view delims,
                                       split_options const options) {

    std::string delims_impl { delims };

    if (is_set(options, split_options::include_whitespace)) {
        delims_impl.append(1, ' ');
    }

    std::vector < std::string_view > tokens;
    auto first = std::begin(strv);

    while (first != std::end(strv)) {
        auto const second = std::find_first_of(
            first, std::cend(strv),
            std::cbegin(delims_impl), std::cend(delims_impl)
        );

        if (first != second) {
            tokens.emplace_back(
                strv.substr(
                    std::distance(std::begin(strv), first),
                    std::distance(first, second)
                )
            );
        }

        if (std::end(strv) == second) {
            break;
        }

        if (is_set(options, split_options::include_delimiters)) {
            std::string_view delim { second, 1 };
            if (!is_empty(delim)) {
                tokens.emplace_back(delim);
            }
        }
        first = std::next(second);
    }
    return tokens;
}

I got the inspiration for the above split function implementation from the following Bartek’s article here.

Now when the logical expression is being successfully tokenized, let’s see what is the next step…

Expression tree and node visitors

Once the expression is tokenized, the expression tree is being built by using the recursive descent parser, a top-down parser which consists of mutually recursive procedures. Since the recursive descent parser topic is pretty extensive itself, I won’t go any further with it in this article. At the end, the expression tree looks like the following:

Now when we have the expression tree, a visitor design pattern is used for calculating the end result of a logical expression.

template <typename T>
[[nodiscard]] constexpr bool result_visitor::visit(tree_node const& node, T const& obj) {
    if (nullptr == node.left || nullptr == node.right) {
        return false;
    }

    switch (node.token.type()) {
    case token::token_type::logical_and:
        return visit_logical(node, obj, std::logical_and<>());
    case token::token_type::logical_or:
        return visit_logical(node, obj, std::logical_or<>());
    case token::token_type::eq:
        return visit_relational(node, obj, std::equal_to<>());
    case token::token_type::neq:
        return visit_relational(node, obj, std::not_equal_to<>());
    case token::token_type::gt:
        return visit_relational(node, obj, std::greater<>());
    case token::token_type::lt:
        return visit_relational(node, obj, std::less<>());
    case token::token_type::geq:
        return visit_relational(node, obj, std::greater_equal<>());
    case token::token_type::leq:
        return visit_relational(node, obj, std::less_equal<>());
    default:
        return false;
    }
}

template <typename T, typename F>
[[nodiscard]] constexpr bool visit_logical(tree_node const& node, T const& obj, F&& func) 
    return func(visit(*node.left, obj), visit(*node.right, obj));
}

template <typename T, typename F>
[[nodiscard]] constexpr bool visit_relational(tree_node const& node, T const& obj, F&& func) {
    auto key = node.left->token;
    auto value = node.right->token;
    return func(fields_[key.value()].invoke(obj), value.value());
}

In the above code, fields_ is an std::map where keys are the names of the class members (like field_a and field_b) and values are pointers to class member functions. Furthermore, obj is the object to be evaluated and whose member functions will be called.

Since C++ logical operators && and || are used, short-circuiting is guaranteed so there should be no fear that some nodes are visited even though they shouldn’t be.

The evolution of an interface

In versions of my library prior to v1.2, user would need to do something like:

booleval::evaluator evaluator;
evaluator.expression("field_a foo and field_b 123");

evaluator.evaluate({
    { "field_a", obj_1.field_a() },
    { "field_b", obj_1.field_b() }
});

evaluator.evaluate({
    { "field_a", obj_2.field_a() },
    { "field_b", obj_2.field_b() }
});

// ...

evaluator.evaluate({
    { "field_a", obj_n.field_a() },
    { "field_b", obj_n.field_b() }
});

You can see that there is much repetitive work here since the user needs to create a key-value map each time he wants to evaluate a certain object. This isn’t that much pretty and is improved in the v1.2 of booleval library.

With the v1.2 and thanks to suggestions from Reddit post, user is allowed to specify member function pointers which will be used in the evaluation, like:

booleval::evaluator evaluator;
evaluator.expression("field_a foo and field_b 123");

evaluator.map({
    { "field_a", &obj_1::field_a },
    { "field_b", &obj_1::field_b }
});

evaluator.evaluate(obj_1);
evaluator.evaluate(obj_2);

// ...

evaluator.evaluate(obj_n);

This approach looks less error-prone and much prettier.

Now, how did I store member function pointers of different signatures into the container such as std::map? There is a class called any_mem_fn that looks like the following:

class any_mem_fn {
public:
    any_mem_fn() = default;
    any_mem_fn(any_mem_fn&& rhs) = default;
    any_mem_fn(any_mem_fn const& rhs) = default;

    template <typename Ret, typename C>
    any_mem_fn(Ret (C::*m)()) {
        fn_ = [m](std::any a) {
            return (std::any_cast<C>(a).*m)();
        };
    }

    template <typename Ret, typename C>
    any_mem_fn(Ret (C::*m)() const) {
        fn_ = [m](std::any a) {
            return (std::any_cast<C>(a).*m)();
        };
    }

    any_mem_fn& operator=(any_mem_fn&& rhs) = default;
    any_mem_fn& operator=(any_mem_fn const& rhs) = default;
    ~any_mem_fn() = default;

    template <typename T>
    any_value invoke(T obj) {
        return fn_(obj);
    }

private:
    std::function<any_value(std::any)> fn_;
};

Some might say that using std::function is too expensive but I could not figure out any better/less expensive way of doing the same. So, if anyone has a suggestion on how could I improve this part, please let me know 🙂

Way forward

As a next step, it would be interesting to compare the performance of my small booleval library with other libraries in the same field. First that comes up to my mind is Boost.Spirit library. Do you know of any other library that I can make part of my benchmark?

If you have any other improvement suggestions, please let me know!