Here's how to validate experiments even better

Can you blindly rely on a single metric from an A/B test? And can an organization really afford to look past certain business metrics for once?

In general, organizations have a strong need for metrics. Without the right metrics, they are largely directionless. "If you don't measure, you're guessing," as the saying goes. This is certainly true for commercial companies. Determining the right key metrics is not always easy there, and reaching internal consensus on them is even harder. Key metrics may ultimately be "imposed" by and for an organization, but they may also be created to measure the specific performance of a team or product.

What about online experiments such as A/B tests? Which metrics play an important role in validating a hypothesis? And how can you, in turn, validate those metrics?

'The invisible metric'

Metrics bring structure, provide focus and help you make the right decisions. In doing so, they contribute to change and make a positive contribution to an organization's strategy and direction. Especially when your organization is data-driven and uses experiments to validate ideas, metrics are indispensable for demonstrating success or failure.

But even when the metrics of an A/B test do not convincingly turn green or red, and thus no significant effect can be demonstrated (if an effect is present at all), a metric is still something to hold on to. Especially if the hypothesis of the experiment was aimed at influencing that very metric.

But is that justified? Does such a metric give you the correct or complete picture at all times? 

Possibly not always.

When validating a new idea, something may have unintentionally crept into the experiment that you had not anticipated or could not have foreseen. For example, the variation may have negatively affected another metric (one still invisible to you) and, at the same time, your most important metric.

For example, the number of visitors with an order might stay the same while the average order value (AOV) rises at the same time. Effects like these we can often still explain (chance, outliers, etc.) or at least substantiate to some degree. But there are always test results that you want to investigate further because they are much harder to interpret.

A non-significant result, for example, does not necessarily mean that you had a bad idea. Bad ideas hardly exist anyway, because they can lead you to the right ideas. "Sometimes you win. Sometimes you learn."

The point I want to make here is that a negative or non-significant result can also have an underlying cause other than just your change's influence on the metric you were optimizing for.

What if the test page performs worse because of a longer load time? What if more complex code doesn't execute properly in a particular browser type or version? And what if there are more repeat visits in the test variant due to a poorly designed online campaign? These are just a few examples of factors that can impact results. So what do you do then?

'Slow down experiments'

Product teams are constantly developing new features for the website or app, often with the goal of positively influencing key metrics. The organization is pushing for more orders, for example, or perhaps more engagement. Either way, there is a risk of losing sight of the fact that new features can imperceptibly influence more than just the key metrics used to measure success, or can even backfire.

Introducing changes (a new feature, new code, etc.) does not automatically mean a better experience or better conversion. As optimizers, we know like no other that you have to validate everything first. We have also known for years that technology and speed (or rather, slowness) are conversion killers on all fronts. There are even powerful ways to test this, in the form of slow-down experiments.

Amazon once ran such an experiment and showed that a deliberately introduced delay of just 100 milliseconds resulted in a 1% drop in sales. At Amazon, that is a lot of money. A lot of money.

This is proof that even invisible factors (such as page load time) can have an impact on key metrics and thus on the entire business!

Goals & drivers and guardrails

If you already experiment a lot, you probably work extensively with goal and driver metrics. I list them again below to point out the differences between these metrics and how they relate to each other.

Goal metrics

Goal metrics are success metrics and deal with key objectives (goals). They are usually linked to the mission statement of an organization and are about things that people really care about. The best-known example of a goal metric (also called a key metric) is the order or transaction metric.

Driver metrics

Driver metrics indicate whether we are moving in the right direction to achieve our goals. They contribute directly to the goal metrics and are mostly about things like user engagement and user retention. Examples of driver metrics are the Net Promoter Score (NPS), the proportion of returning visitors, and new registrations on the website.

How can you ensure that the above metrics are reliable? Guardrail metrics can help you with this.

Guardrail metrics

The main purpose of guardrail metrics is to support you. They safeguard the reliability of the goal metric's result and warn you when something is not right. They do not, however, contribute to business value the way a goal metric does.

Because of their sensitive nature, guardrail metrics have a lower statistical variance, which makes it easier to reach significance on them. As a result, errors can be detected more quickly. Examples of guardrail metrics are a Sample Ratio Mismatch (SRM) check, or metrics that monitor latency or page load time.
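To make this concrete, here is a minimal sketch in Python (using scipy) of what an SRM check could look like: you compare the observed number of visitors per variant with the split you configured, using a chi-square test. The function name, visitor counts and threshold below are illustrative assumptions, not part of any specific testing tool.

```python
from scipy.stats import chisquare

def srm_check(observed_counts, expected_ratios, alpha=0.001):
    """Flag a possible Sample Ratio Mismatch with a chi-square test."""
    total = sum(observed_counts)
    expected = [ratio * total for ratio in expected_ratios]
    _, p_value = chisquare(f_obs=observed_counts, f_exp=expected)
    # A very low p-value means the observed split deviates more from the
    # configured split than chance alone can reasonably explain.
    return p_value, p_value < alpha

# Example: a 50/50 experiment that ended up with a suspicious split
p_value, srm_suspected = srm_check([50_421, 49_198], [0.5, 0.5])
print(f"p-value: {p_value:.5f}, SRM suspected: {srm_suspected}")
```

A strict threshold such as 0.001 is often used here, because a genuine mismatch usually points to an instrumentation or assignment problem rather than to user behavior, and the rest of the results cannot be trusted until it is explained.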

Guardrail metrics are crucial, but not yet always embraced by organizations that experiment, even though what they monitor directly impacts goal and driver metrics. Using guardrail metrics does, by the way, require a certain level of maturity within a CRO team or CRO program.

The big tech companies, which have elevated experimentation to an art, sometimes employ dozens to hundreds of guardrail metrics per online experiment. That is when it suddenly becomes apparent that, in specific cases, changes can have an unexpected impact on the business.

Overall Evaluation Criterion

How great would it be if we could capture everything we do in one metric? A single metric that makes all other metrics obsolete? Unfortunately, that is not possible. After all, the cockpit of an airplane or the dashboard of a car also needs multiple gauges. It would be irresponsible to fly, sail or drive on only one.

Still, there is a method you can apply to validate the success of an online experiment better than by looking only at a goal or key metric. You do this by combining several metrics into a single one.

In our field, this is called an OEC, or the Overall Evaluation Criterion of an experiment. Multiple metrics are weighed together in the result of an A/B test. The final judgment on that result is therefore better substantiated and gives additional assurance before implementation.

In an OEC, one or more key metrics are usually combined into a single KPI and supplemented with a few guardrail metrics. The big difference between an OEC and a guardrail is that an OEC does contribute to the business value (lifetime value) of an organization.
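To illustrate, here is a minimal sketch in Python of how an OEC could be computed: the relative changes of a few goal and driver metrics are combined into one weighted score, while a guardrail (page load time in this sketch) can block the verdict altogether. The metric names, weights and thresholds are made-up assumptions for the example, not the OEC of any real organization.

```python
# Illustrative weights for goal and driver metrics (assumptions, not a real OEC).
WEIGHTS = {
    "conversion_rate": 0.5,    # goal metric
    "aov": 0.3,                # goal metric (average order value)
    "repeat_visit_rate": 0.2,  # driver metric
}

# Guardrail: maximum allowed relative increase before the result is blocked.
GUARDRAILS = {
    "page_load_time": 0.05,  # +5%
}

def oec_score(control, variant):
    """Combine relative metric changes into one score, or return None
    when a guardrail is violated and the result should not be trusted."""
    for metric, max_increase in GUARDRAILS.items():
        relative_change = (variant[metric] - control[metric]) / control[metric]
        if relative_change > max_increase:
            return None  # guardrail violated: investigate before deciding

    score = 0.0
    for metric, weight in WEIGHTS.items():
        relative_change = (variant[metric] - control[metric]) / control[metric]
        score += weight * relative_change
    return score

control = {"conversion_rate": 0.042, "aov": 85.0, "repeat_visit_rate": 0.31, "page_load_time": 1.90}
variant = {"conversion_rate": 0.044, "aov": 83.5, "repeat_visit_rate": 0.32, "page_load_time": 1.95}

print(oec_score(control, variant))  # positive score: the variant wins on balance
```

In practice, choosing those weights and guardrail thresholds is exactly where the cross-department coordination comes in that Anouk describes below.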

Online Dialogue worked on an OEC for one of the largest e-commerce companies in the Netherlands. Among others, my colleague Anouk Erens was closely involved. 

Anouk: "For a large e-commerce player, we helped define the OEC. What metrics belong in this OEC and are they all important? In what way can you measure them reliably? And what relationship exists with the other key metrics? Setting up an OEC is a process in which a lot of coordination takes place between different departments. You capture, as it were, your entire business goal under one overarching metric, then this metric must be well thought out."

Establishing an OEC

It is really important to keep raising awareness of the most important metric(s) within your organization. That you can guarantee an extra level of reliability at the same time is a great bonus. Not everyone in an organization is familiar with this yet. To know is to measure! Yes, you read that right. Sometimes you really do have to turn the saying around...

In addition, this reflects well on your own work, and an organization can only be happy about that, right? So I think that, in the end, everyone can only welcome a well-defined OEC. Whether that is the COE or perhaps the CEO: everyone benefits!