Experimentation is one of those areas of data work that looks straightforward from afar and becomes much more subtle once we try to do it carefully.
On the surface, this all sounds pretty simple: you split people into groups, show them different things, see what happens, and figure out if your change actually worked. But in practice, every single part of that sentence is hiding a bunch of tricky decisions. Like, who exactly counts as part of the group? What does “treatment” even mean here? How do you decide who gets what? What are you actually measuring to know if something changed? And underneath all of that, can you even trust that the comparison you’re making is truly cause and effect?
That mix is part of what makes experimentation so interesting to me. It sits at the intersection of statistical reasoning, product thinking, data modeling, and engineering discipline.
This project was a way to work on that intersection directly.
I used the UCI Online Retail dataset, a transactional dataset for a UK-based online retailer, to build an experimentation-style workflow in Databricks Free Edition using Spark SQL, Pandas, NumPy, and SciPy.
There is an important catch:
The dataset contains no real treatment.
No feature rollout, no campaign exposure flag, no policy change, no randomized assignment mechanism. So this is not an A/B test in the causal sense. It is a simulation of the workflow around one.
When I started the project, I chose the dataset thinking that a rich transactional table would be enough to practice experimentation and draw at least a rough analytical comparison from it. What became much clearer while building it is that experimentation depends much more heavily on the data-generating setup than I first appreciated. Without an observed intervention or assignment mechanism, I could still practice cohort construction, metric design, and statistical comparison, but not estimate a treatment effect in the causal sense.
That turned out to be one of the most useful outcomes of the project, so this article is really about both things: the workflow I built, and the boundary I learned to take more seriously.
Let us begin.
What problem does an A/B test solve?
At the core of experimentation is a causal question.
For a unit i such as a user or customer, we can imagine two potential outcomes:
Y_i(1), Y_i(0)
Where:
- Y_i(1) is the outcome if unit i receives treatment
- Y_i(0) is the outcome if unit i remains in control
If we want to know the average impact of treatment, one natural estimand is the Average Treatment Effect:
ATE=E[Y(1)−Y(0)]
This formulation is elegant, but it immediately runs into the fundamental problem of causal inference:
For any given unit, we only observe one of the two outcomes.
We never see both Y_i(1) and Y_i(0) for the same customer at the same moment. One is factual, the other is counterfactual.
That is why randomized assignment matters so much. If treatment assignment is random, then in expectation the treatment and control groups are comparable, and the observed difference in outcomes becomes a reasonable estimator of the causal effect:
Δ̂ = Ȳ_T − Ȳ_C
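As a toy illustration with synthetic data (not from the notebook), the estimator really is just a difference of sample means:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical outcomes: under randomized assignment the two groups are
# comparable in expectation, so the difference in sample means is an
# unbiased estimator of the ATE. Here the true effect is 0.5 by design.
y_treat = rng.normal(loc=10.5, scale=2.0, size=5000)  # draws of Y(1)
y_ctrl = rng.normal(loc=10.0, scale=2.0, size=5000)   # draws of Y(0)

ate_hat = y_treat.mean() - y_ctrl.mean()
print(f"estimated effect: {ate_hat:.2f}")  # close to 0.5
```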
In a real A/B test, the stats part is just what you see on the surface. The real work happens underneath. You need to nail down what exactly the treatment is, how people get assigned to it, what the randomization unit is, when you're measuring results, and, super important, a logging system that actually tracks everything the same way for everyone. Get any of those wrong, and the stats don't matter much.
Without those pieces, we may still have a useful comparison workflow. But we should be careful about calling the result a causal experiment.
Why this dataset is useful, and where it falls short
The Online Retail dataset is so useful because it gives you the kind of real-world transaction data you actually want. You can see who the customer is, what they bought in each invoice, how many items, what they paid, and exactly when it happened. Plus, you even get returns and cancellations.
From a data engineering and metric-construction point of view, that is excellent material.
But from a causal perspective, the missing piece is decisive:
There is no real intervention in the dataset, so there is no valid treatment effect to estimate.
That means this project cannot answer a question like:
"Did treatment increase conversion?"
because there is no real treatment to begin with.
What it can do is something more modest, but still useful:
- define an eligible population
- construct a cohorting mechanism
- build customer-level metrics
- compare groups with standard inferential tools
- and practice the workflow of experiment-table creation end to end
That is why I still think the project was worthwhile. It gave me a place to practice several important parts of experimentation work, while also making the missing causal layer impossible to ignore.
The dataset
The UCI Online Retail dataset contains historical transactional data from 2010 to 2011 for a UK-based online retailer.
So what does that mean in practice? A few things. First, we only get to see customers who actually bought something, no window shoppers. Second, we can see each invoice, each item, quantities, prices, timestamps, and who the customer was. And third, returns show up as negative quantities.
Even without a true treatment flag, this is still a useful dataset for practicing:
- customer-level aggregation from event data
- cohort definition
- metric design
- heavy-tailed spend analysis
Experimental Design - Source: Aaron Bacall
You can find the full project (Databricks notebook + README + setup notes) on GitHub: ig-perez/retail-ab-experiment.
The stack
I intentionally kept the stack close to what I would use in a practical experimentation workflow:
- Platform: Databricks Free Edition
- Compute / query layer: Spark SQL
- Python tooling: Pandas, NumPy, SciPy, Matplotlib
- Excel IO: openpyxl
- Output artifact: a reusable customer-level metric table
One thing I really liked about this setup is how cleanly it splits the work. Spark SQL is great for building cohorts and putting together metric tables. Then Python takes over for the rest: running inference helpers, poking around the data, and making plots. It just feels like each tool does what it's best at.
That separation mirrors a lot of real data work, where the experiment table and the statistical comparison often live in different layers of the stack.
The simulated holdout
Since there is no real treatment, the notebook simulates a holdout-style design:
- Define an eligible population from pre-period activity
- Choose a future test window
- Assign customers deterministically to treatment and control
- Aggregate customer-level outcomes during the test window
- Compare both groups using summary statistics and a significance test
The important point is not that this creates a real treatment effect estimate. It does not.
What it does produce is the shape of an experiment table: one row per unit, a variant label, outcome metrics, and a comparison framework.
I found that useful because experiment analysis depends heavily on how this table is constructed. Long before the t-test, there is already a modeling problem.
Assignment as a reproducibility problem
In a real experiment, assignment is a design object. We care about randomization, exposure, and the integrity of the assignment mechanism.
In this simulated setting, assignment becomes something slightly different: a reproducibility problem.
If we are going to simulate treatment and control groups, the partition should be stable across reruns, independent of execution order, and ideally stable across engine or session changes.
In Spark SQL, HASH() is not ideal for this because its behavior can vary across versions. So in the notebook I used XXHASH64 together with PMOD to build a deterministic bucket assignment:
CASE
  WHEN PMOD(XXHASH64(CAST(CustomerID AS STRING)), 100) < 50 THEN 'treatment'
  ELSE 'control'
END AS variant
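The same idea can be sketched in plain Python with the standard library. SHA-256 via hashlib stands in for XXHASH64 here (so the buckets differ from the Spark version, and the salt string is my own invention), but the property that matters, a stable, order-independent partition, is identical:

```python
import hashlib

def assign_variant(customer_id, treatment_pct=50, salt="exp_2011_holdout"):
    """Deterministic bucket assignment: the same input always yields the
    same variant, independent of run order or session state.

    Note: salt and function name are illustrative, not from the notebook.
    """
    key = f"{salt}:{customer_id}".encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 100
    return "treatment" if bucket < treatment_pct else "control"

# Reruns are stable by construction:
assert assign_variant("12345") == assign_variant("12345")
```

Salting the key also means a future experiment with a different salt gets an independent partition, which avoids reusing the same split across unrelated tests.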
This is a small detail, but it matters. If assignment is not reproducible, the downstream comparison becomes harder to debug and harder to trust.
More generally, this section reminded me of something important:
Even in a toy experiment, assignment is not just a line of code. It is part of the design.
Metrics: conversion and demand per customer
The notebook builds a customer-level metric table for the test month using two outcome variables:
- converted: 1 if the customer purchased in the window, else 0
- demand_per_customer: total demand computed as quantity * unit_price during the window
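A minimal pandas sketch of how such a table can be built from transaction rows (toy data; the column names follow the Online Retail schema, and in the full workflow customers with no purchase in the window would get converted = 0 via a left join against the eligible population):

```python
import pandas as pd

# Toy transactions shaped like the Online Retail data;
# a negative Quantity represents a return.
tx = pd.DataFrame({
    "CustomerID": [1, 1, 2, 3],
    "Quantity": [2, -1, 5, 3],
    "UnitPrice": [10.0, 10.0, 4.0, 2.5],
    "InvoiceDate": pd.to_datetime(
        ["2011-11-03", "2011-11-10", "2011-11-05", "2011-10-01"]),
})

# Restrict to the test window, compute line-level demand,
# then aggregate to one row per customer.
window = tx["InvoiceDate"].between("2011-11-01", "2011-11-30")
in_window = tx[window].assign(demand=lambda d: d["Quantity"] * d["UnitPrice"])

metrics = (in_window.groupby("CustomerID")["demand"].sum()
           .rename("demand_per_customer").reset_index())
metrics["converted"] = 1  # anyone in this table purchased in the window
```

Customer 3 only purchased in October, so they fall out of the window entirely; returns reduce customer 1's demand rather than being dropped.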
I like this pairing because the two metrics behave very differently.
Conversion is binary and usually much easier to estimate precisely. Demand, on the other hand, tends to be noisy, skewed, and heavy-tailed. A few large customers can dominate the mean, which immediately creates a more delicate inferential problem.
But this isn't just a stats footnote. There's a real product analytics lesson here. Some metrics behave nicely. Some are all over the place. And not every business outcome is equally easy to measure well, no matter how much you wish it were.
That difference is part of what makes experimentation interesting in practice.
A quick word on inference
For a first pass, I used Welch's t-test to compare group means.
The estimated difference in means is:
Δ̂ = Ȳ_T − Ȳ_C
and Welch's test is useful here because it does not assume equal variances between treatment and control.
That is a sensible default in many practical settings, especially when working with spend-related outcomes where variance asymmetry is common.
Still, the interpretation here must remain disciplined.
Because there is no real intervention, Δ̂ is not a treatment effect estimate in the causal sense. It is simply the observed difference between two deterministically assigned cohorts in this simulated setup.
That separation is one of the reasons I still value the project. It forced me to keep the statistical layer and the causal layer separate:
- statistical comparison is still possible
- causal interpretation is not justified
One practical detail from the notebook: some SciPy versions commonly available in managed runtimes do not support alternative=... in ttest_ind, so I used the default two-sided version for compatibility. I also added simple guardrails to fail gracefully when the sample size is too small or when the metric has zero variance.
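Putting those guardrails together, a helper along these lines (my own sketch, not the notebook's exact code) keeps the comparison from failing on degenerate inputs:

```python
import numpy as np
from scipy import stats

def welch_compare(treat, ctrl, min_n=30):
    """Two-sided Welch t-test with simple guardrails: bail out on tiny
    samples or on a metric that is constant in both groups.

    Names and thresholds here are illustrative assumptions.
    """
    treat = np.asarray(treat, dtype=float)
    ctrl = np.asarray(ctrl, dtype=float)
    if len(treat) < min_n or len(ctrl) < min_n:
        return {"status": "skipped", "reason": "sample too small"}
    if treat.std(ddof=1) == 0 and ctrl.std(ddof=1) == 0:
        # Both groups constant: the test statistic is undefined.
        return {"status": "skipped", "reason": "zero variance"}
    # equal_var=False gives Welch's test; default is two-sided,
    # which avoids the alternative=... compatibility issue.
    t, p = stats.ttest_ind(treat, ctrl, equal_var=False)
    return {"status": "ok", "diff": treat.mean() - ctrl.mean(),
            "t": float(t), "p": float(p)}
```

Returning a status instead of raising keeps a metric loop running even when one metric in the table happens to be degenerate.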
What I liked about working in Databricks
My favorite part of the project was how quickly I could go from raw transactional data to a clean experiment-style table.
A few things worked especially well:
- Pandas for quick inspection and lightweight transformations
- Spark SQL for cohort definition and metric-table construction
- Python again for inference and visualization
For this kind of work, that mix feels natural.
I also liked that Spark SQL made the table-building logic read almost like a recipe. Eligibility rules, windows, assignments, and aggregations are all easier to reason about when expressed declaratively.
For a constrained environment and a small project, Databricks Free Edition felt surprisingly capable.
If I revisited it
This notebook is intentionally a minimum viable experimentation workflow, but there are several obvious directions I would take it next.
The first group of improvements would be methodological:
- Bootstrap confidence intervals for heavy-tailed demand
- CUPED with pre-period covariates to reduce variance
- Difference-in-differences with explicit pre/post windows
- more explicit balance checks between simulated groups
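As one concrete direction from that list, a percentile bootstrap for the demand difference is only a few lines (a sketch, assuming two arrays of customer-level demand):

```python
import numpy as np

def bootstrap_diff_ci(treat, ctrl, n_boot=5000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the difference in mean demand.

    More defensible than a normal-theory interval when the metric is
    heavy-tailed; function name and defaults are illustrative.
    """
    rng = np.random.default_rng(seed)
    treat = np.asarray(treat, dtype=float)
    ctrl = np.asarray(ctrl, dtype=float)
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        # Resample each group with replacement at its original size.
        t_s = rng.choice(treat, size=len(treat), replace=True)
        c_s = rng.choice(ctrl, size=len(ctrl), replace=True)
        diffs[b] = t_s.mean() - c_s.mean()
    lo, hi = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return lo, hi
```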
The second group of improvements would be about data choice.
The biggest limitation of the project is not the tooling, nor the metric logic, nor the test statistic. It is the absence of a real intervention. If I were redoing the project, the strongest upgrade would be to start from a dataset where treatment is genuinely observed.
The strongest candidates would be public datasets that log a genuine intervention, such as a campaign exposure flag or a randomized promotion, so that assignment is observed rather than simulated.
That is probably the clearest lesson I took away from the project. I originally chose the dataset because it looked rich enough to support an experimentation-style analysis. What I understand better now is that richness in transactions is not the same thing as validity for experimentation. In this kind of work, the quality of the design often matters more than the sophistication of the final statistical test.
CUPED! - Source: Own/IA
Closing thoughts
If there's one thing this project really drove home for me, it's a distinction I hadn't fully appreciated before.
There's a real difference between setting up an experiment-style table, running a statistical test, and actually claiming you've found a causal effect. On paper, they sound like they naturally go together. But in practice, they can be surprisingly far apart.
They're related, sure. But they're not the same thing.
And that matters because, at least in my experience, it's easy to drift from one to the next without really noticing. You compute a difference in means, you see a small p-value, and before you know it, the language starts sounding more causal than the design probably supports.
This project pushed me in the opposite direction. It made that boundary a lot harder to miss.
And honestly, that boundary is part of why I find experimentation interesting. Not just as a statistical exercise, but as a discipline of thinking carefully about interventions, assignment, measurement, and what you can actually claim at the end.
It also left me with a question I keep coming back to: how much of what gets called experimentation in practice is real causal inference, and how much is post-hoc analysis dressed up in experiment clothes?
I don't mean that as a knock on anyone—myself included. If anything, I see it as a useful warning. Especially when a dataset makes comparison easy, but interpretation much harder.
A good experiment isn't just a test you run on a table. It's a design.