Cloud Robotics Testing Infrastructure — Parallel at Scale

Industry: Deep Tech / Robotics

Region: United States

Team: 5 developers

Stack: AKS, Kubernetes Jobs, .NET Core 3.1 / C#, Azure Container Registry, Azure Blob Storage

Engagement Type: Embedded Partner, 11 months

Completed: 2020

The Problem

The standard development loop for robotics software looks deceptively simple: change an algorithm, flash the firmware, run the robot, observe the result, repeat. In practice, that loop is bottlenecked by one variable that appears in no specification document: access to physical hardware.

The client — a US company building next-generation autonomous systems — had a testing environment that had grown organically alongside their product. Early in the company's life, three physical test rigs were sufficient. As the engineering team scaled and algorithm development became parallel across multiple specialisations — path planning, sensor fusion, obstacle avoidance, multi-robot coordination — the rig count didn't scale with it.

The queuing system started as a shared calendar. Engineers booked hour-long slots. When contention grew, the slots became shorter and the handoffs more disruptive. An engineer investigating a subtle edge case in their path-planning algorithm would finish their hour and hand the rig to the next team — regardless of whether the investigation was complete. Context-switching mid-enquiry became the norm. A bug found in hour one was a bug that might not be reproduced until tomorrow's slot.

Multi-robot scenario testing existed in a different problem category entirely. Running a coordinated swarm test on physical hardware required setting up multiple rigs simultaneously, networking them, and hoping no single rig failure corrupted the shared run. In practice, serious multi-robot integration testing almost never happened — the logistics overhead made it impractical at the team's size and facility constraints. Algorithms were tested individually and then observed in production formation. Production was, in effect, part of the test environment.

The engineering organisation's throughput was materially lower than the headcount suggested. The bottleneck was not the engineers' ability to write algorithms — it was the physical infrastructure available to validate them. Adding more rigs had been modelled: the economics didn't hold. Each rig required calibration, maintenance, floor space, and a technician. And adding physical rigs wouldn't solve the multi-robot scenario problem at all — quantity was not the constraint.

The Constraints

Simulation fidelity vs spin-up time. Higher-fidelity simulation is slower to initialise. Engineers needed sub-minute spin-up for iterative algorithm work; the physics engine had to be configurable, not fixed at max fidelity.
Multi-tenant isolation. One engineer's failing scenario — an infinite loop, a runaway resource consumer — must not affect any other running scenario. Namespace-level isolation was a hard requirement from day one.
Deterministic replay. Reproducing a failure must require only the original scenario YAML and the random seed. Any non-determinism in the simulation would make bug investigation probabilistic rather than reliable.
Per-team compute quotas. Cloud compute is not free. Each team needed a configurable budget expressed in concurrent scenarios and monthly core-hours, enforced at the orchestrator level — not reliant on engineer discipline.
No Kubernetes expertise required from engineers. Algorithm engineers are not infrastructure operators. The submission API had to abstract the cluster entirely — YAML scenario in, structured results out.
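
The per-team quota constraint above can be pictured as an admission check that runs before any cluster resource is created. The following is a minimal sketch, not the engagement's actual orchestrator code: the type names (TeamQuota, TeamUsage, QuotaGate) and the idea of passing an estimated core-hour cost per scenario are assumptions made for illustration.

```csharp
using System;

// Hypothetical budget for one team; property names are illustrative.
public sealed class TeamQuota
{
    public int MaxConcurrentScenarios { get; set; }
    public double MonthlyCoreHours { get; set; }
}

// Hypothetical snapshot of what the team is consuming right now.
public sealed class TeamUsage
{
    public int RunningScenarios { get; set; }
    public double CoreHoursUsedThisMonth { get; set; }
}

public enum QuotaDecision { Admit, RejectConcurrency, RejectCoreHours }

public static class QuotaGate
{
    // Decides admission before anything is scheduled, so the limit
    // does not rely on engineer discipline.
    public static QuotaDecision Check(TeamQuota quota, TeamUsage usage, double estimatedCoreHours)
    {
        if (usage.RunningScenarios >= quota.MaxConcurrentScenarios)
            return QuotaDecision.RejectConcurrency;

        if (usage.CoreHoursUsedThisMonth + estimatedCoreHours > quota.MonthlyCoreHours)
            return QuotaDecision.RejectCoreHours;

        return QuotaDecision.Admit;
    }
}
```

Returning a reason rather than a bare boolean lets the submission API tell an engineer which limit was hit, which matters when the remedy differs (wait for a running scenario to finish versus ask for more monthly budget).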

The Architecture

The system has three distinct layers: a submission surface that engineers and CI pipelines interact with, an orchestration layer that enforces quotas and schedules work onto the cluster, and an execution layer where the simulations themselves run.

Scenarios are defined as YAML files checked into the same repository as the algorithm code. A scenario specifies the environment (map, physics seed, obstacle configuration), the robots participating (model, start position, algorithm variant to test), and the checkpoints that define pass/fail (no collision at 30 seconds; all robots at goal by 60 seconds). The YAML is the contract between the engineer and the harness — and it is also the replay artefact. Given the same YAML and seed, the simulation produces the same result every time.
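
To make the pass/fail checkpoints concrete, here is a hedged sketch of how the two assertions named above might be evaluated against captured frames. The Frame type, the grid-cell position model, and the goals dictionary are all hypothetical: the scenario YAML in this case study does not show where goal positions come from, so the sketch simply takes them as an argument.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical telemetry frame: simulation time plus each robot's grid cell.
public sealed class Frame
{
    public double TimeS { get; set; }
    public Dictionary<string, (int X, int Y)> Positions { get; set; }
}

public static class CheckpointEvaluator
{
    // Passes if no two robots share a cell in any frame at or before timeS.
    public static bool NoCollision(IEnumerable<Frame> frames, double timeS)
    {
        return frames
            .Where(f => f.TimeS <= timeS)
            .All(f => f.Positions.Values.Distinct().Count() == f.Positions.Count);
    }

    // Passes if, in the latest frame at or before timeS, every robot sits on its goal cell.
    public static bool AllRobotsReachedGoal(
        IEnumerable<Frame> frames,
        double timeS,
        IReadOnlyDictionary<string, (int X, int Y)> goals)
    {
        var last = frames
            .Where(f => f.TimeS <= timeS)
            .OrderBy(f => f.TimeS)
            .LastOrDefault();

        if (last == null) return false;   // no data yet counts as a failure

        return goals.All(g =>
            last.Positions.TryGetValue(g.Key, out var p) && p == g.Value);
    }
}
```

Because both functions are pure and read only recorded frames, the same evaluator could run against a live scenario or against a downloaded telemetry stream during replay.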

Diagram: scenario lifecycle. Engineers and CI pipelines submit YAML scenario definitions to the Submission API; the Orchestrator (highlighted in the diagram) enforces per-team quotas and templates Kubernetes Jobs, one Job per robot, each network-isolated from other scenarios but connected to its peers through the scenario's shared namespace. All robot telemetry streams are captured by the Telemetry Aggregator, which stores time-indexed results alongside the originating scenario YAML for deterministic replay.

The Orchestrator is the central authority. It receives a validated scenario from the API, checks the submitting team's remaining quota for the billing period, provisions a Kubernetes namespace for the scenario (isolating it completely from all other concurrent scenarios), and creates one Kubernetes Job per robot. Each Job runs the same simulation container image — the robot model and algorithm variant are injected as environment variables. The seed is injected identically across all robots in the scenario, which is what makes replay deterministic: the same inputs, the same sequence.
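
The role of the injected seed can be illustrated with a small sketch. It assumes, purely for illustration, that the simulation draws its randomness from a seeded System.Random; the case study does not say which generator the physics engine uses, and SeededNoise is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;

public static class SeededNoise
{
    // Returns the pseudo-random draws a simulated sensor might consume.
    // The generator is constructed from the scenario seed only: no wall-clock
    // time, no unseeded Random, no other ambient source of entropy.
    public static List<double> Draw(int seed, int count)
    {
        var rng = new Random(seed);
        var values = new List<double>(count);
        for (var i = 0; i < count; i++)
            values.Add(rng.NextDouble());
        return values;
    }
}
```

Two calls with the same seed return the same sequence on the same runtime. The .NET documentation does not promise that the sequence stays identical across runtime versions, so a design like this would also want the runtime pinned in the simulation container image.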

Within the scenario namespace, robots communicate over a Kubernetes Service, which gives the simulation engine a stable TCP endpoint for exchanging position and state updates. This is the "scenario network" in the diagram. The namespace boundary, together with the NetworkPolicy resources applied to every scenario namespace, means that a runaway robot in scenario A cannot send traffic to the robots in scenario B.

Azure Kubernetes Service, Kubernetes Jobs, .NET Core 3.1 / C#, Azure Container Registry, Azure Blob Storage, YamlDotNet

Telemetry capture runs as a sidecar container in each robot's pod. Every position update, sensor reading, and decision event is written to a time-indexed stream in Azure Blob Storage, tagged with the scenario ID and robot ID. When a scenario fails at a checkpoint, an engineer can download the telemetry stream, point the replay tool at it alongside the original scenario YAML, and reproduce the failure frame-by-frame — without touching the cluster.
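
One way to picture the time-indexed stream is as a file of JSON lines, one event per line, stored under a path derived from the scenario ID and robot ID. This is a sketch under assumptions: the TelemetryEvent shape, the blob path layout, and the JSON-lines encoding are all hypothetical, and the actual upload to Azure Blob Storage is left out.

```csharp
using System;
using System.Text.Json;

// Hypothetical event shape; field names are illustrative.
public sealed class TelemetryEvent
{
    public string ScenarioId { get; set; }
    public string RobotId { get; set; }
    public long TickMs { get; set; }     // simulation time, not wall-clock time
    public string Kind { get; set; }     // e.g. "position", "sensor", "decision"
    public double[] Payload { get; set; }
}

public static class TelemetryWriter
{
    // Hypothetical layout: one stream per robot per scenario.
    public static string BlobPath(string scenarioId, string robotId)
        => $"telemetry/{scenarioId}/{robotId}.jsonl";

    // One JSON object per line keeps the stream appendable and readable in order.
    public static string ToLine(TelemetryEvent e) => JsonSerializer.Serialize(e);

    // The replay side reads a line back into the same shape.
    public static TelemetryEvent FromLine(string line)
        => JsonSerializer.Deserialize<TelemetryEvent>(line);
}
```

Indexing events by simulation tick rather than wall-clock time is what would let a replay tool line the stream up with a re-run of the same YAML and seed.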

Implementation Highlights

Scenario YAML Schema

The scenario definition format was designed for algorithm engineers, not infrastructure operators. It describes what to run, not how to run it — the orchestrator handles all Kubernetes concerns. A multi-robot swarm scenario with three robots, a shared warehouse environment, and two timed checkpoints fits in under 30 lines.

# scenario-obs-avoidance-swarm.yaml
scenario:
  id: obs-avoidance-swarm-3
  environment:
    map: warehouse-grid-40x40
    seed: 42                        # deterministic replay key
    physics_fidelity: standard      # standard | high | exact
  robots:
    - id: robot-a
      model: platform-v2
      start: [2, 3]
      algorithm: obstacle_avoidance_v7
    - id: robot-b
      model: platform-v2
      start: [8, 3]
      algorithm: obstacle_avoidance_v7
    - id: robot-c
      model: platform-v2
      start: [5, 10]
      algorithm: obstacle_avoidance_v7
  checkpoints:
    - time_s: 30
      assert: no_collision
    - time_s: 60
      assert: all_robots_reached_goal
  telemetry:
    capture: full
    interval_ms: 100

C# Scenario Loader

The loader deserialises the YAML into a typed ScenarioGraph that the orchestrator works with. Keeping the YAML schema separate from the internal domain model means the schema can evolve without changing orchestration logic, and vice versa.

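using System;
using YamlDotNet.Serialization;
using YamlDotNet.Serialization.NamingConventions;

public sealed class ScenarioLoader
{
    // Maps snake_case YAML keys (e.g. physics_fidelity) to PascalCase DTO properties.
    private static readonly IDeserializer _yaml =
        new DeserializerBuilder()
            .WithNamingConvention(UnderscoredNamingConvention.Instance)
            .Build();

    public ScenarioGraph Load(string yamlContent)
    {
        if (string.IsNullOrWhiteSpace(yamlContent))
            throw new ArgumentException("Scenario YAML is empty.", nameof(yamlContent));

        // Guard against a document that parses but lacks the top-level key.
        var def = _yaml.Deserialize<ScenarioDefinition>(yamlContent);
        if (def?.Scenario == null)
            throw new FormatException("Scenario YAML has no top-level 'scenario' key.");

        var env = def.Scenario.Environment;

        var graph = new ScenarioGraph
        {
            Id          = def.Scenario.Id,
            Seed        = env.Seed,
            // Throws on a missing or unknown tier rather than falling back silently.
            Fidelity    = Enum.Parse<PhysicsFidelity>(env.PhysicsFidelity, ignoreCase: true),
            Environment = new EnvironmentConfig(env.Map)
        };

        foreach (var r in def.Scenario.Robots)
            graph.Robots.Add(new RobotNode(r.Id, r.Model, r.Start, r.Algorithm));

        // The YAML key is time_s. The DTO's TimeSeconds property is assumed to map to it
        // through an alias, since the naming convention alone would expect time_seconds.
        foreach (var cp in def.Scenario.Checkpoints)
            graph.Checkpoints.Add(new Checkpoint(cp.TimeSeconds, cp.Assert));

        return graph;
    }
}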

Kubernetes Job Templating

For each robot in the scenario graph, the orchestrator generates a typed V1Job using the Kubernetes .NET client, injecting the robot's identity, algorithm, and seed as environment variables. Setting BackoffLimit = 0 is intentional — a retry here would mask an algorithm bug, not recover from a transient infrastructure failure.

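// Assumes: using System.Collections.Generic; using k8s.Models;
public V1Job BuildRobotJob(RobotNode robot, ScenarioGraph scenario, string imageTag)
{
    // Shared by the Job and its pod template: Kubernetes does not copy Job labels
    // onto the pods it creates, so label selectors need them on the template too.
    var labels = new Dictionary<string, string>
    {
        ["scenario"] = scenario.Id,
        ["robot"]    = robot.Id,
        ["team"]     = scenario.TeamId
    };

    return new V1Job
    {
        Metadata = new V1ObjectMeta
        {
            Name      = $"robot-{robot.Id}-{scenario.Id}",
            Namespace = scenario.KubeNamespace,
            Labels    = labels
        },
        Spec = new V1JobSpec
        {
            BackoffLimit = 0,   // fail fast: retries mask algorithm bugs
            Template = new V1PodTemplateSpec
            {
                Metadata = new V1ObjectMeta { Labels = labels },
                Spec = new V1PodSpec
                {
                    RestartPolicy = "Never",   // a crashed simulation is a result, not something to restart
                    Containers    = new List<V1Container>
                    {
                        new V1Container
                        {
                            Name  = "sim",
                            Image = $"acr.azurecr.io/robot-sim:{imageTag}",
                            Env   = BuildEnvVars(robot, scenario)
                        }
                    }
                }
            }
        }
    };
}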
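
BuildEnvVars is not shown in this case study. As a rough sketch of what such a helper has to produce, the function below builds the name/value pairs for one robot. The variable names (ROBOT_ID, SIM_SEED, and so on) are assumptions for illustration, not the harness's actual container contract, and the sketch returns a plain dictionary so that it depends only on the base class library.

```csharp
using System.Collections.Generic;
using System.Globalization;

public static class RobotEnv
{
    // Variable names are illustrative assumptions.
    public static IReadOnlyDictionary<string, string> Build(
        string robotId, string model, string algorithm, int seed, string fidelity)
    {
        return new Dictionary<string, string>
        {
            ["ROBOT_ID"]         = robotId,
            ["ROBOT_MODEL"]      = model,
            ["ALGORITHM"]        = algorithm,
            // Every robot in a scenario receives the same seed. Invariant formatting
            // keeps the string identical regardless of the host's culture settings.
            ["SIM_SEED"]         = seed.ToString(CultureInfo.InvariantCulture),
            ["PHYSICS_FIDELITY"] = fidelity
        };
    }
}
```

In the Kubernetes .NET client each pair would then become a V1EnvVar with its Name and Value set. Keeping the pair-building separate from the client types makes the seed-injection rule easy to unit test without a cluster.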

The Outcome

Test Parallelism: ∞
No hard ceiling. Concurrent scenarios scale with AKS node capacity, not rig count.

Scenario Spin-up: <2 min
From YAML submit to first telemetry tick, vs. 30–90 minutes of rig prep previously.

Multi-Robot Scenarios: First
The engineering team ran their first coordinated swarm test within two weeks of go-live.

The hardware queue disappeared within the first month. Engineers stopped booking calendar slots and started submitting scenarios from their local machines between commits. The CI pipeline was wired into the submission API in week three — every pull request now triggers a regression suite against a fixed set of canonical scenarios, with results posted back to the PR before review.

The operational effect was larger than the direct time saving. Engineering streams that had been serialised through hardware contention — path planning blocked until obstacle avoidance freed the rig — became truly parallel. Teams could run experiments simultaneously without coordination overhead. The multi-robot team, which had previously been unable to run integration tests at all, ran 140 distinct swarm scenarios in their first month of access. Several failure modes they'd theorised about appeared immediately in the first two weeks of testing.

What We'd Do Differently

We accepted a deliberate fidelity trade-off at launch: the standard physics tier was fast (under two minutes to first results) but omitted some sensor noise characteristics that were present in the physical environment. For the first six months, this was the right call — engineers needed iteration speed, not physical accuracy, for the class of algorithms they were working on.

A year in, the obstacle avoidance team hit a category of edge cases that only manifested under realistic sensor noise. The high fidelity tier existed but had seen little use, and when the team switched to it, they discovered that some of their test assertions had been written for standard-fidelity behaviour and gave false passes under higher fidelity. The fidelity tier was buried in the YAML, and nothing in the submission path forced teams to think about it before submitting.

If we were building this today, the fidelity tier would be a required field with no default — it would force an explicit choice at submission time, making the assumption visible. We'd also build the exact fidelity tier (full physical accuracy, slower) from day one, not as a retrofit, because retrofitting it required changes to how the simulation container was configured — changes that affected all tiers.
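
A required fidelity field with no default could be enforced with a validator along these lines. This is a sketch of the idea rather than code from the engagement: FidelityValidator is a hypothetical name, and the enum members are assumed from the three tiers listed in the scenario YAML comment.

```csharp
using System;

// Members assumed from the "standard | high | exact" comment in the scenario YAML.
public enum PhysicsFidelity { Standard, High, Exact }

public static class FidelityValidator
{
    // Rejects a missing or unknown tier instead of silently picking a default.
    public static PhysicsFidelity Require(string raw)
    {
        if (string.IsNullOrWhiteSpace(raw))
            throw new ArgumentException(
                "physics_fidelity is required: choose standard, high, or exact.");

        // Match against the declared names only, so numeric strings such as "1"
        // are not accepted as an accidental alias for a tier.
        foreach (var name in Enum.GetNames(typeof(PhysicsFidelity)))
        {
            if (string.Equals(name, raw.Trim(), StringComparison.OrdinalIgnoreCase))
                return (PhysicsFidelity)Enum.Parse(typeof(PhysicsFidelity), name);
        }

        throw new ArgumentException($"Unknown physics_fidelity '{raw}'.");
    }
}
```

Running this at submission time, before quota checks or scheduling, would surface the choice to the engineer at the moment it is cheapest to reconsider.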

The multi-tenant network isolation worked well but cost us more engineering time than anticipated. Kubernetes namespace isolation does not automatically isolate network traffic — you need NetworkPolicy resources applied to every scenario namespace, and the policy templating had to be tested carefully to avoid both under-isolation (scenarios seeing each other's traffic) and over-isolation (robots within the same scenario unable to communicate). We'd invest in a dedicated namespace provisioner service from day one rather than generating the NetworkPolicy in the same path as the Job template.
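
The shape of the policy described above can be sketched as a manifest generated per scenario namespace. The example below is an assumption-laden illustration, not the policy used in the engagement: it covers ingress only, the policy name scenario-isolation is hypothetical, and it emits JSON (which the Kubernetes API accepts) using System.Text.Json rather than the typed client.

```csharp
using System.Collections.Generic;
using System.Text.Json;

public static class ScenarioNetworkPolicy
{
    // Builds a NetworkPolicy that applies to every pod in the scenario namespace
    // and admits ingress only from pods in that same namespace.
    public static string BuildManifest(string scenarioNamespace)
    {
        // An empty podSelector matches all pods in the policy's own namespace.
        var allPodsInNamespace = new Dictionary<string, object>();

        var manifest = new Dictionary<string, object>
        {
            ["apiVersion"] = "networking.k8s.io/v1",
            ["kind"]       = "NetworkPolicy",
            ["metadata"]   = new Dictionary<string, object>
            {
                ["name"]      = "scenario-isolation",   // hypothetical name
                ["namespace"] = scenarioNamespace
            },
            ["spec"] = new Dictionary<string, object>
            {
                ["podSelector"] = allPodsInNamespace,
                ["policyTypes"] = new[] { "Ingress" },
                ["ingress"]     = new object[]
                {
                    new Dictionary<string, object>
                    {
                        // No namespaceSelector: peers are limited to this namespace.
                        ["from"] = new object[]
                        {
                            new Dictionary<string, object> { ["podSelector"] = allPodsInNamespace }
                        }
                    }
                }
            }
        };

        return JsonSerializer.Serialize(manifest, new JsonSerializerOptions { WriteIndented = true });
    }
}
```

Egress is deliberately left out of this sketch. Restricting it in the same way would also block cluster DNS and the telemetry upload path unless those are explicitly allowed, which is one form the over-isolation failure can take. A policy like this only has an effect on clusters whose network plugin enforces NetworkPolicy.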

If You're Solving This Today

In 2026 we'd look seriously at Azure Container Apps Jobs as a simpler alternative to raw AKS for the execution layer — the consumption-plan billing and built-in job lifecycle management remove a meaningful slice of the orchestration complexity we wrote ourselves. For the deterministic replay requirement, the current approach (seed + YAML → identical output) remains the right model, but we'd evaluate whether storing the full telemetry stream in Azure Data Explorer instead of Blob Storage would make post-run analysis meaningfully faster.

Common Questions

Questions about cloud robotics test infrastructure.

How does cloud-hosted robotics testing compare to hardware test beds?
Hardware test beds run one scenario at a time and require physical setup between runs. The cloud architecture in this case study runs scenarios in parallel up to the cluster's node capacity, with each test environment provisioned on demand in under two minutes. Hardware validation is still required for final sign-off, but the majority of regression testing moves to the cloud, reducing the hardware queue from days to minutes.

What cloud infrastructure is required for robotics simulation at scale?
This engagement uses Azure Kubernetes Service for container orchestration, Kubernetes Jobs for scenario execution, Azure Container Registry for the containerised simulation runtime, and Azure Blob Storage for telemetry. The harness provisions, runs, and tears down test environments automatically. Engineers submit scenario YAML and receive structured results without interacting with the infrastructure.

Can this architecture work for robotics platforms other than the one described?
Yes, with adaptation. The orchestration layer is simulation-agnostic: it dispatches scenarios and collects results without knowledge of the simulation runtime. Porting requires wrapping the target simulator in a container conforming to the harness's interface contract, which is documented as part of the standard engagement deliverable.

How long does it take to implement a cloud robotics test harness?
The engagement described here ran approximately 18 weeks from diagnostic through production deployment. Most of that time was spent containerising the simulation runtime and defining the scenario schema. The orchestration infrastructure itself was production-ready significantly earlier.

Testing Infrastructure That's Holding You Back?

The fastest way to know if the same pattern fits your bottleneck is a two-week Discovery Sprint — a fixed-price engineering diagnosis, no long-term commitment required.