Here's how to validate experiments even better

Can you blindly rely on a single metric from an A/B test? And can an organization really afford to look past certain business metrics for once?

In general, organizations have a strong need for metrics. Without the right metrics, they are largely directionless. "If you don't measure, you're guessing," as the saying goes. This is certainly true for commercial companies. Determining the right key metrics is not always easy there, and reaching internal consensus on them is even harder. Key metrics may ultimately be "imposed" by and for an organization, but they may also be created to measure the specific performance of a team or product.

What about online experiments such as A/B tests? Which metrics play an important role in validating a hypothesis? And how can you, in turn, validate those metrics?

'The invisible metric'

Metrics bring structure, provide focus and help you make the right decisions. In doing so, they contribute to change and make a positive contribution to an organization's strategy and direction. Especially when your organization is data-driven and uses experiments to validate ideas, metrics are indispensable for demonstrating success or failure.

But even when the metrics of an A/B test do not convincingly turn green or red, and thus no significant effect can be demonstrated (if an effect is present at all), a metric is still something to hold on to. Especially if the hypothesis of the experiment was aimed at influencing that very metric.

But is that justified? Does such a metric give you the correct or complete picture at all times? 

Possibly not always.

When validating a new idea, something may have unintentionally crept into the experiment that you had not anticipated or could not have foreseen. For example, the variation may have negatively affected another metric (one still invisible to you) and, at the same time, your most important metric.

For example, the number of visitors with an order might stay the same while the average order value (AOV) rises at the same time. Effects like these we can often still explain (chance, outliers, etc.) or at least substantiate to some degree. But there are always test results that you want to investigate further because they are much harder to interpret.

A non-significant result, for example, does not necessarily mean that you had a bad idea. Bad ideas hardly exist anyway, because they can lead you to the right ideas. "Sometimes you win. Sometimes you learn."

The point I want to make here is that a negative or non-significant result can also have an underlying cause other than just your change's influence on the metric you were optimizing for.

What if the test page performs worse because of a longer load time? What if more complex code doesn't execute properly in a particular browser type or version? And what if there are more repeat visits in the test variant due to a poorly designed online campaign? These are just a few examples of factors that can impact results. So what do you do then?

'Slow down experiments'

Product teams are constantly developing new features for the website or app, often with the goal of positively influencing key metrics. The organization is pushing for more orders, for example, or perhaps more engagement. Either way, there is a risk of losing sight of the fact that new features can imperceptibly influence more than just the key metrics used to measure success, or can even backfire.

Introducing changes (a new feature, new code, etc.) does not automatically mean a better experience or better conversion. As optimizers, we know like no other that you have to validate everything first. We have also known for years that technology and speed (or rather, slowness) are conversion killers on all fronts. There are even powerful ways to test this, in the form of slow-down experiments.

Amazon once ran such an experiment and showed that a deliberately introduced delay of just 100 milliseconds resulted in a 1% drop in sales. At Amazon, that is a lot of money. A lot of money.

This is proof that even invisible factors (such as page load time) can have an impact on key metrics and thus on the entire business!

Goals & drivers and guardrails

If you already experiment a lot, you probably work extensively with goal and driver metrics. I list them again below to point out the differences between these metrics and how they relate to each other.

Goal metrics

Goal metrics are success metrics and deal with key objectives (goals). They are usually linked to the mission statement of an organization and are about things that people really care about. The best-known example of a goal metric (also called a key metric) is the order or transaction metric.

Driver metrics

Driver metrics indicate whether we are moving in the right direction to achieve our goals. They contribute directly to the goal metrics and are mostly about things like user engagement and user retention. Examples of driver metrics are the Net Promoter Score (NPS), the proportion of returning visitors, and new registrations on the website.

How can you ensure that the above metrics are reliable? Guardrail metrics can help you with this.

Guardrail metrics

The main purpose of guardrail metrics is to support you. They safeguard the reliability of the goal metric's result and warn you when something is not right. They do not, however, contribute to business value the way a goal metric does.

Because of their sensitive nature, guardrail metrics have a lower statistical variance, which makes it easier to reach significance on them. As a result, errors can be detected more quickly. Examples of guardrail metrics are a Sample Ratio Mismatch (SRM) check, or metrics that monitor latency or page load time.
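To make this concrete, here is a minimal sketch in Python (using scipy) of what an SRM check could look like: you compare the observed number of visitors per variant with the split you configured, using a chi-square test. The function name, visitor counts and threshold below are illustrative assumptions, not part of any specific testing tool.

```python
from scipy.stats import chisquare

def srm_check(observed_counts, expected_ratios, alpha=0.001):
    """Flag a possible Sample Ratio Mismatch with a chi-square test."""
    total = sum(observed_counts)
    expected = [ratio * total for ratio in expected_ratios]
    _, p_value = chisquare(f_obs=observed_counts, f_exp=expected)
    # A very low p-value means the observed split deviates more from the
    # configured split than chance alone can reasonably explain.
    return p_value, p_value < alpha

# Example: a 50/50 experiment that ended up with a suspicious split
p_value, srm_suspected = srm_check([50_421, 49_198], [0.5, 0.5])
print(f"p-value: {p_value:.5f}, SRM suspected: {srm_suspected}")
```

A strict threshold such as 0.001 is often used here, because a genuine mismatch usually points to an instrumentation or assignment problem rather than to user behavior, and the rest of the results cannot be trusted until it is explained.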

Guardrail metrics are crucial, but not yet always embraced by organizations that experiment, even though what they monitor directly impacts goal and driver metrics. Using guardrail metrics does, by the way, require a certain level of maturity within a CRO team or CRO program.

The big tech companies, which have elevated experimentation to an art, sometimes employ dozens to hundreds of guardrail metrics per online experiment. That is when it suddenly becomes apparent that, in specific cases, changes can have an unexpected impact on the business.

Overall Evaluation Criterion

How great would it be if we could capture everything we do in one metric? A single metric that makes all other metrics obsolete? Unfortunately, that is not possible. After all, the cockpit of an airplane or the dashboard of a car also needs multiple gauges. It would be irresponsible to fly, sail or drive on only one.

Still, there is a method you can apply to validate the success of an online experiment better than by looking only at a goal or key metric. You do this by combining several metrics into a single one.

In our field, this is called an OEC, or the Overall Evaluation Criterion of an experiment. Multiple metrics are weighed together in the result of an A/B test. The final judgment on that result is therefore better substantiated and gives additional assurance before implementation.

In an OEC, one or more key metrics are usually combined into a single KPI and supplemented with a few guardrail metrics. The big difference between an OEC and a guardrail is that an OEC does contribute to the business value (lifetime value) of an organization.
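To illustrate, here is a minimal sketch in Python of how an OEC could be computed: the relative changes of a few goal and driver metrics are combined into one weighted score, while a guardrail (page load time in this sketch) can block the verdict altogether. The metric names, weights and thresholds are made-up assumptions for the example, not the OEC of any real organization.

```python
# Illustrative weights for goal and driver metrics (assumptions, not a real OEC).
WEIGHTS = {
    "conversion_rate": 0.5,    # goal metric
    "aov": 0.3,                # goal metric (average order value)
    "repeat_visit_rate": 0.2,  # driver metric
}

# Guardrail: maximum allowed relative increase before the result is blocked.
GUARDRAILS = {
    "page_load_time": 0.05,  # +5%
}

def oec_score(control, variant):
    """Combine relative metric changes into one score, or return None
    when a guardrail is violated and the result should not be trusted."""
    for metric, max_increase in GUARDRAILS.items():
        relative_change = (variant[metric] - control[metric]) / control[metric]
        if relative_change > max_increase:
            return None  # guardrail violated: investigate before deciding

    score = 0.0
    for metric, weight in WEIGHTS.items():
        relative_change = (variant[metric] - control[metric]) / control[metric]
        score += weight * relative_change
    return score

control = {"conversion_rate": 0.042, "aov": 85.0, "repeat_visit_rate": 0.31, "page_load_time": 1.90}
variant = {"conversion_rate": 0.044, "aov": 83.5, "repeat_visit_rate": 0.32, "page_load_time": 1.95}

print(oec_score(control, variant))  # positive score: the variant wins on balance
```

In practice, choosing those weights and guardrail thresholds is exactly where the cross-department coordination comes in that Anouk describes below.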

Online Dialogue worked on an OEC for one of the largest e-commerce companies in the Netherlands. Among others, my colleague Anouk Erens was closely involved. 

Anouk: "For a large e-commerce player, we helped define the OEC. What metrics belong in this OEC and are they all important? In what way can you measure them reliably? And what relationship exists with the other key metrics? Setting up an OEC is a process in which a lot of coordination takes place between different departments. You capture, as it were, your entire business goal under one overarching metric, then this metric must be well thought out."

Establishing an OEC

It is really important to keep raising awareness of the most important metric(s) within your organization. That you can guarantee an extra level of reliability at the same time is a great bonus. Not everyone in an organization is familiar with this yet. To know is to measure! Yes, you read that right. Sometimes you really do have to turn the saying around...

In addition, this reflects well on your own work, and an organization can only be happy about that, right? So I think that, in the end, everyone can only welcome a well-defined OEC. Whether that is the COE or perhaps the CEO: everyone benefits!