Structured Validation Rules in C++

Posted Sep 8, 2024

By Matt Bolitho

Validation is a common problem when dealing with any data provided by a user. Even when developing a library, you likely provide custom data types for consuming developers, which are equally (if not more!) untrustworthy.

In a personal project, I was dealing with validating user-provided numerical optimization problems. For the sake of brevity in this post, let’s assume these problems look like this:

  
struct OptimizationProblem
{
  std::vector<double> XLb;
  std::vector<double> XUb;
  std::function<double(std::span<double const>)> Eval;
};

OptimizationProblem contains upper and lower bounds for the primal variables we are optimizing over. There is also an evaluation callback which takes a readonly span of the current point and returns the objective value. We assume we want to minimize this objective value.

OptimizationProblem is a simple aggregate type, which means any data can easily be passed when brace-initializing new instances. This gives great flexibility for consumers (which can be important in numerical compute domains), but we have no idea whether the data in the instance will be valid. There are two standard tools to reach for in these scenarios:

Constructors - Validate inputs to guarantee correctness by construction. This doesn’t always scale well for large objects with many parameters.
Builder Pattern - Provide a different type with a fluent API to build up objects that require complex initialization, potentially with validation on the fly.
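
To make the second option concrete, here is a minimal sketch of what such a builder might look like. The names OptimizationProblemBuilder, addVariable, objective and build are hypothetical and only for illustration. The struct is a simplified stand-in for OptimizationProblem that takes a std::vector rather than a std::span in the callback, so that the sketch compiles as plain C++17 with ordinary includes.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <utility>
#include <vector>

// Simplified stand-in for the post's OptimizationProblem (assumption: the
// callback takes a std::vector instead of a std::span to stay within C++17).
struct OptimizationProblem {
  std::vector<double> XLb;
  std::vector<double> XUb;
  std::function<double(std::vector<double> const&)> Eval;
};

// Hypothetical builder: every setter validates its own input on the fly.
class OptimizationProblemBuilder {
public:
  // Adds one variable, so the two bounds vectors can never differ in size.
  OptimizationProblemBuilder& addVariable(double Lb, double Ub) {
    if (Lb > Ub) {
      throw std::invalid_argument("Lower bound must not exceed upper bound.");
    }
    Problem_.XLb.push_back(Lb);
    Problem_.XUb.push_back(Ub);
    return *this;
  }

  // Sets the objective callback and rejects an empty std::function.
  OptimizationProblemBuilder& objective(
      std::function<double(std::vector<double> const&)> Eval) {
    if (!Eval) {
      throw std::invalid_argument("Evaluation callback must be set.");
    }
    Problem_.Eval = std::move(Eval);
    return *this;
  }

  // Returns a copy of the problem once an objective has been provided.
  OptimizationProblem build() const {
    if (!Problem_.Eval) {
      throw std::logic_error("objective() must be called before build().");
    }
    assert(Problem_.XLb.size() == Problem_.XUb.size());
    return Problem_;
  }

private:
  OptimizationProblem Problem_{};
};
```

With this shape, a call chain such as OptimizationProblemBuilder{}.addVariable(0.0, 1.0).objective(f).build() either yields a consistent problem or throws at the first offending call.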

In an ideal world, all objects would be correct by construction and we wouldn’t have to spend time validating them. In reality, human beings use code and all of them are capable of making mistakes when doing so. Consequently, we should not trust any user inputs – even those provided with the best of intentions!

For the aforementioned project, I ended up choosing the builder pattern because I like fluent API designs and it scales to meet the potentially complex definitions of optimization problems. However, having spent some time thinking about how to structure validation rules, and having had some interesting conversations about it, I thought I’d write up this blog post anyway! It might still prove useful for thinking about applying a series of validation rules, even if only as an implementation detail. Whilst this post focusses on C++, I think many of the concepts discussed herein are language agnostic.

Our Toy Problem

The rules I am going to use to validate problem instances are:

The size of the bounds vectors must be the same.
Every lower bound must be less than or equal to (within machine precision) the corresponding upper bound.
The evaluation function must be set to something.
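
The code samples in this post compare bounds with a plain operator and skip the "within machine precision" part of the second rule. For completeness, here is one possible sketch of a tolerant comparison. The helper name lessOrEqualWithin and the default tolerance of four machine epsilons are assumptions chosen for illustration, not a recommendation for any particular solver.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical helper: treats Lb as "less than or equal to" Ub when it exceeds
// Ub by no more than a relative tolerance scaled to the operands' magnitude.
inline bool lessOrEqualWithin(
    double Lb, double Ub,
    double RelTol = 4 * std::numeric_limits<double>::epsilon()) {
  assert(RelTol >= 0.0);
  if (Lb <= Ub) {
    return true;  // Exact comparison already holds (covers infinite bounds).
  }
  if (!std::isfinite(Lb) || !std::isfinite(Ub)) {
    return false;  // No tolerance for infinities; NaN is always rejected.
  }
  double const Scale = std::max({1.0, std::abs(Lb), std::abs(Ub)});
  return Lb - Ub <= RelTol * Scale;
}
```

For example, lessOrEqualWithin(0.1 + 0.2, 0.3) holds even though 0.1 + 0.2 is slightly greater than 0.3 in double precision.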

I’ll continue to use the example of validating our OptimizationProblem instances (as opposed to a correctness by construction approach), purely because I have already admitted that there are cleaner ways to achieve this and that the point of this post is to discuss the rules-based approach. With a constructor-based approach, we could check these values in the body or use a constructor try block with throwing factory functions creating instances of each field. Using a builder type, we would check each condition in the functions that set or overwrite the data in question. Putting all that aside, let’s take a look at how one might start to validate the data contained within our OptimizationProblem.

For reference, I am compiling this code with clang++-18 and libcxx with import std; support, so no explicit STL include directives will appear in code samples. I’ve also aliased some namespaces for brevity: namespace stdr = std::ranges; and, dependently, namespace stdv = stdr::views;.

Inline Validation

When starting with validation, a common approach is to run a series of conditional checks on an object, throwing an exception when one fails.

In particular, the C++ standard library provides the std::logic_error base exception type, which has further derived logical exception types such as std::invalid_argument and std::out_of_range.

  
if (Problem.XLb.size() != Problem.XUb.size()) {
  throw std::invalid_argument(
    "Variable upper and lower bounds must be the same size.");
}

// Ignoring intricacies of floating point comparisons here...
for (std::size_t i = 0; i < Problem.XLb.size(); ++i) {
  if (Problem.XLb[i] > Problem.XUb[i]) {
    throw std::invalid_argument(
      std::format("Variable bounds at index {} are invalid", i));
  }
}

if (Problem.Eval == nullptr) {
  throw std::invalid_argument("Evaluation callback must be set.");
}

There are some benefits to this simple approach. It is easy to understand and we can rely on previous conditional checks in subsequent ones.

However, in my opinion, there are a few problems with this.

The validation has explicit control flow, which hurts extensibility.
The conditions are not re-usable or composable.
We immediately throw upon any issue occurring. It would be ideal to present all issues to a consumer at once.

These issues can be solved by some kind of rule-based system. With this approach, we will apply one or more rules to instances of the data to validate. These rules may produce diagnostics, which can be aggregated into a single structure to return to our consumers. The structured rules can be re-used, composed, and shared easily.

For the remainder of this post, let’s assume we are trying to implement the following function signature which uses std::string for our diagnostics:

  
std::vector<std::string> getErrors(OptimizationProblem const& Problem);

There are certainly nicer interfaces to validation and much richer data structures that could be used for diagnostics. However, these are not difficult to reach from our simple signature, and we can focus more on the implementation of the rules for the sake of this post.

Eagerly Evaluated Rules

The simplest phrasing of a rule is to evaluate some boolean expression that detects a violation and associate a diagnostic with its true case.

  
struct ValidationRule { bool Result{}; std::string_view Message{}; };

We can instantiate a collection of these rules and eagerly evaluate them against instances of our problem. To find out if we encountered any errors, we can check the result values and append the associated error message.

  
std::vector<std::string> getErrors(OptimizationProblem const& Problem) {
  std::vector<ValidationRule> Rules {
    {
      Problem.XLb.size() != Problem.XUb.size(),
      "Variable lower and upper bounds must be the same size."
    },
    {
      stdr::any_of(
        stdv::zip(Problem.XLb, Problem.XUb),
        [](auto const& [lb, ub]) { return lb > ub; }),
      "Variable lower bounds must be less or equal to their respective upper bound."
    },
    {
      Problem.Eval == nullptr,
      "Evaluation callback must be set."
    },
  };

  std::vector<std::string> Diagnostics{};
  for (auto& Rule : Rules) {
    if (Rule.Result) {
      Diagnostics.push_back(std::string{Rule.Message});
    }
  }
  return Diagnostics;
}

This is an improvement: the implementation for validating OptimizationProblem instances is simple to write, understand, and extend.

If OptimizationProblem grows in future and we have new validation rules, they can simply be added to the list of rules. Furthermore, if the user encounters validation issues, they will get a list of every issue in the provided data. Whilst it might be considered a little ceremonial, the implementation is also pretty straightforward which is a bonus when new contributors read a piece of code. Another benefit is that most of this code is constexpr – in fact the full implementation might even be constexpr as long as the resulting vector of diagnostics is used and destructed in the same expression.

There are some drawbacks to this approach though. The main limitation is that the rules aren’t really sharable due to their eager evaluation. This limits their usefulness to tidying rule definitions within a single area of an implementation. We do also have one small logic issue here: the second rule, which checks the bound values, is still evaluated when the bounds vectors are different sizes. Because stdv::zip stops at the shorter range, it compares only the common prefix and may report a second error on top of the size mismatch. Whilst this is technically valid, it is probably not quite the reporting that we want.
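
One way to avoid that double report while keeping the eager style is to compute the size check once and short-circuit the dependent rule. The following is a minimal sketch under a few assumptions: it takes the two bounds vectors directly rather than an OptimizationProblem, uses an index loop instead of stdv::zip so that it compiles as C++17, and the names anyBoundInverted and getBoundErrors are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

struct ValidationRule { bool Result{}; std::string_view Message{}; };

// Hypothetical helper: true if any lower bound exceeds its upper bound.
// Precondition: both vectors have the same size.
inline bool anyBoundInverted(std::vector<double> const& Lb,
                             std::vector<double> const& Ub) {
  assert(Lb.size() == Ub.size());
  for (std::size_t i = 0; i < Lb.size(); ++i) {
    if (Lb[i] > Ub[i]) {
      return true;
    }
  }
  return false;
}

inline std::vector<std::string> getBoundErrors(std::vector<double> const& Lb,
                                               std::vector<double> const& Ub) {
  bool const SizeMismatch = Lb.size() != Ub.size();
  std::vector<ValidationRule> const Rules{
    {SizeMismatch, "Variable lower and upper bounds must be the same size."},
    // Short-circuit: the ordering check only runs when the sizes agree.
    {!SizeMismatch && anyBoundInverted(Lb, Ub),
     "Variable lower bounds must be less or equal to their respective upper bound."},
  };

  std::vector<std::string> Diagnostics{};
  for (auto const& Rule : Rules) {
    if (Rule.Result) {
      Diagnostics.emplace_back(Rule.Message);
    }
  }
  return Diagnostics;
}
```

This keeps the flat list of rules, at the cost of encoding the dependency between two rules by hand in the second boolean expression.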

Functional Rules

We can address some of the shortcomings of the eagerly evaluated rules by deferring the execution of the rule. If we store the validation logic within a std::function instance, then we can turn our validation rules into sharable instances of logic. Unfortunately, we will not be able to capture OptimizationProblem anymore, so we will need to pass this in to both our function fields to allow access to its data. This is also a good opportunity to genericize our rules over any type we may wish to validate.

Our new rule type is looking like this:

  
template <typename T>
struct ValidationRule {
    std::function<bool(T const&)> Rule{};
    std::function<std::string(T const&)> Message{};
};

but this seems inefficient…

If Rule(Problem) returns false, then we may end up needing to re-do work in Message(Problem) in order to generate a nice error message. This will lead to messy, repetitive code and negatively affect performance.

Take for example our check that all lower bounds are less than or equal to the corresponding upper bound. If we wanted to print a nice message informing the user at which indices the bounds are invalid, then we need to iterate all that data again. This might be OK for small problems, but if we have a lot of bounds to validate, this might be slow.

So ideally, we would be able to have a single function that returns something indicating whether or not a diagnostic is produced. This sounds like a job for std::optional. We can return a std::optional instance that is set to std::nullopt when there is no diagnostic, or a value containing the diagnostic if the rule fails. Whilst we’re making this change, we can also genericize over the diagnostic type (although we will continue to use std::string for our diagnostic purposes).

  
template <typename TData, typename TDiagnostic>
struct ValidationRule {
  std::function<std::optional<TDiagnostic>(TData const&)> Rule{};
};

There are still a few improvements that could be made to this. For starters, we could constrain the template parameters with concepts. We could also introduce a custom type in place of std::optional<TDiagnostic> that better captures the semantics of rule evaluation. Currently, we will be checking for the negative case of the optional instance to convey a positive outcome from the rule evaluation, which doesn’t aid code readability. Perhaps this could be addressed with a type that can be constructed with functions like Result::Success() or Result::Diagnostic("Don't do that!"). If our design entailed having some important return value from a rule, we could also consider using std::expected.
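
As a rough sketch of that idea, a thin wrapper around std::optional can give both outcomes a name. The type name RuleResult, its member functions passed and diagnostic, and the example rule checkSameSize are all hypothetical; only the Success and Diagnostic factory names come from the suggestion above.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <utility>

// Hypothetical wrapper that names both outcomes of a rule explicitly.
template <typename TDiagnostic>
class RuleResult {
public:
  static RuleResult Success() { return RuleResult(std::nullopt); }

  static RuleResult Diagnostic(TDiagnostic Value) {
    return RuleResult(std::optional<TDiagnostic>(std::move(Value)));
  }

  // True when the rule produced no diagnostic.
  bool passed() const { return !Diagnostic_.has_value(); }

  // Precondition: the rule failed, so a diagnostic is present.
  TDiagnostic const& diagnostic() const {
    assert(!passed());
    return *Diagnostic_;
  }

private:
  explicit RuleResult(std::optional<TDiagnostic> Value)
      : Diagnostic_(std::move(Value)) {}

  std::optional<TDiagnostic> Diagnostic_;
};

// Example rule written against the wrapper instead of a raw std::optional.
inline RuleResult<std::string> checkSameSize(std::size_t LbSize,
                                             std::size_t UbSize) {
  if (LbSize == UbSize) {
    return RuleResult<std::string>::Success();
  }
  return RuleResult<std::string>::Diagnostic(
      "Variable lower and upper bounds must be the same size.");
}
```

Call sites can then ask whether a result passed() rather than testing for an empty optional. The remaining examples in this post stay with plain std::optional.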

Let’s see how to implement each of our three validation rules under this new rule type. Starting with the requirement that the variable bounds size match, we can implement a simple lambda that outputs the sizes when they are not equal. We actually could have also done this earlier with our eagerly evaluated capturing rule approach too!

  
[](auto const& Problem) -> std::optional<std::string> {
  auto const lbSize = Problem.XLb.size();
  auto const ubSize = Problem.XUb.size();
  if (lbSize == ubSize) {
    return std::nullopt;
  }
  return std::format("Number of variable bounds must be the same size. "
                     "Lower bounds size: {}, Upper bounds size: {}",
                     lbSize, ubSize);
}

The requirement that the lower bounds are less than (or equal to) the upper bounds starts with a repeat of the length check. As previously noted, this is a limitation of our flat structure for rules. In future, we could always nest rules within each other to execute dependently if we wanted to. Continuing with the implementation, we want to iterate across all bounds and their indices and filter the ones that do not conform with this rule. This is straightforward with basic C++. If we have erroneous bounds, we can use C++23’s formatting ranges feature to create a nice error message containing the indices.

  
[](auto const& Problem) -> std::optional<std::string> {
  if (Problem.XLb.size() != Problem.XUb.size()) {
    return "Failed to validate bounds - bounds vectors are not the same size.";
  }
  std::vector<std::size_t> invalidBoundIndices{};
  for (std::size_t i = 0; i < Problem.XLb.size(); ++i) {
    if (Problem.XLb[i] > Problem.XUb[i]) {
      invalidBoundIndices.push_back(i);
    }
  }
  if (invalidBoundIndices.empty()) {
    return std::nullopt;
  }
  return std::format("Invalid variable bounds at indices: {}", invalidBoundIndices);
}

Finally, the requirement that the evaluation callback is set is as trivial as before:

  
[](auto const& Problem) -> std::optional<std::string> {
  if (Problem.Eval != nullptr) {
    return std::nullopt;
  }
  return "Evaluation callback must be set.";
}

We also need to modify the aggregation of diagnostics:

  
std::vector<std::string> getErrors(OptimizationProblem const& Problem) {
  std::vector<ValidationRule<OptimizationProblem, std::string>> Rules {
    /* as defined above */
  };

  std::vector<std::string> Diagnostics{};
  for (auto& Rule : Rules) {
    if (auto Diagnostic = Rule.Rule(Problem); Diagnostic.has_value()) {
      Diagnostics.push_back(std::move(Diagnostic.value()));
    }
  }
  return Diagnostics;
}

Now we have a composable and sharable rule type that we could use throughout a codebase. This would require further genericization, polishing, and better vocabulary types, but it’s a great start.
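
To illustrate what sharing might look like in practice, here is a minimal sketch of a generic aggregation function together with two small rule factories. The names collectDiagnostics, VectorRule, notEmpty and allFinite are hypothetical, and the rules validate a plain std::vector<double> rather than an OptimizationProblem to keep the example short. It uses ordinary includes and only C++17 features.

```cpp
#include <cassert>
#include <cmath>
#include <functional>
#include <optional>
#include <string>
#include <utility>
#include <vector>

template <typename TData, typename TDiagnostic>
struct ValidationRule {
  std::function<std::optional<TDiagnostic>(TData const&)> Rule{};
};

// Hypothetical generic driver: runs every rule and gathers the diagnostics.
template <typename TData, typename TDiagnostic>
std::vector<TDiagnostic> collectDiagnostics(
    TData const& Data,
    std::vector<ValidationRule<TData, TDiagnostic>> const& Rules) {
  std::vector<TDiagnostic> Diagnostics{};
  for (auto const& Rule : Rules) {
    assert(Rule.Rule && "every rule must wrap a callable");
    if (auto Diagnostic = Rule.Rule(Data)) {
      Diagnostics.push_back(std::move(*Diagnostic));
    }
  }
  return Diagnostics;
}

using VectorRule = ValidationRule<std::vector<double>, std::string>;

// Hypothetical reusable rule: the vector must hold at least one value.
inline VectorRule notEmpty(std::string Label) {
  return VectorRule{
      [Label = std::move(Label)](std::vector<double> const& Values)
          -> std::optional<std::string> {
        if (!Values.empty()) {
          return std::nullopt;
        }
        return Label + " must not be empty.";
      }};
}

// Hypothetical reusable rule: no value may be NaN or infinite.
inline VectorRule allFinite(std::string Label) {
  return VectorRule{
      [Label = std::move(Label)](std::vector<double> const& Values)
          -> std::optional<std::string> {
        for (double const Value : Values) {
          if (!std::isfinite(Value)) {
            return Label + " must contain only finite values.";
          }
        }
        return std::nullopt;
      }};
}
```

A list such as {notEmpty("XLb"), allFinite("XLb")} can then be defined once and passed to collectDiagnostics wherever a vector of values needs checking.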

We can also see some design trade-offs. The eagerly evaluated rules read much more simply, and for a single point of implementation detail they may well be the best bet. With the more functional system, the code has become a bit more complicated, but the power of that pattern across an entire codebase is much greater. As always with design considerations, a pragmatic decision based on your architecture and domain is required!

Conclusion

However you decide to implement data validation, using a system of rules is a great idea even if you only have a small number of conditions to enforce. This pattern is extensible, sharable, provides great UX, and prevents the need to maintain a flow of control as data in a codebase changes.

The implementation notes contained herein are by no means exhaustive. One can easily add more functionality and more coherent types to our toy example to improve the solution further. Furthermore, we’ve only considered a flat structure of (mostly) independent rules, but some rules are only valid if other preconditions are met. There are many facets to this topic, but hopefully this post gave you some practical thoughts that you can apply to your rules-based validation today!


This post is licensed under CC BY 4.0 by the author.