Retail Catalog Automation Software — 50 Staff to 0

Industry Retail / E-Commerce

Region Europe

Team 3 developers

Stack Azure Service Bus, Azure Functions v3, .NET 5 / C#, Cosmos DB, Redis

Engagement Type Embedded Partner — 14 months

Completed 2021

The Problem

Large European retailers refresh their product catalogs multiple times per year — at season changeovers, on the back of new supplier contracts, and in response to regulatory requirements around pricing and labeling accuracy. For a retailer carrying hundreds of thousands of SKUs across multiple regions, each refresh is a project in its own right, not a routine operation.

Before this engagement, the client's catalog process worked like this: dozens of suppliers would send product data in whatever format each vendor had standardised on — CSV exports, XML files, proprietary flat files, occasionally half-populated spreadsheets. A central operations team would receive these feeds, manually review each one for obvious errors, then mobilise fifty operations staff to key the data into the product information management system. The cycle ran for three to four consecutive days, always timed to a weekend to minimise disruption to live trading.

Product information management platforms are the industry's standard answer to catalog complexity — but they manage structured data that is already clean and inside the system. They solve nothing upstream. The real cost sat in the gap between what vendors sent and what the PIM could ingest: schema variance across suppliers, missing mandatory fields, currency formatting inconsistencies, duplicate SKUs introduced when suppliers recycled identifiers across seasons, and locale-specific pricing that needed mapping to the retailer's internal taxonomy. Manual entry existed because the upstream data was untrustworthy, and trusting it required human judgement applied row by row.

That human judgement added errors of its own. A QA cycle ran after each entry batch — and QA sometimes uncovered problems that required a partial re-run, adding a full day to an already expensive process. The hidden cost was structural: fifty experienced operations staff — people with institutional knowledge of supplier relationships and the product hierarchy — were unavailable for anything else during the cycle, multiple times per year. Across the multi-day cycle, each refresh consumed roughly 3,000 person-hours. Run three times a year, that exceeded 9,000 staff-hours of lost capacity, conservatively measured, before accounting for the QA cycles and error rework.

The business had attempted automation twice before. Once with a bespoke ETL pipeline that broke every time a vendor changed their file format, and once with a middleware platform that proved too slow to clear a full refresh inside the weekend trading window. The appetite for a third attempt was cautious. The goal was a system resilient to vendor schema drift, fast enough to complete inside a weekend, and accurate enough to eliminate the QA cycle entirely.

The Constraints

Fixed weekend processing window. Peak-season refreshes had to complete before trading opened Monday morning — a hard deadline, not a target. Every architectural decision had to account for this constraint explicitly.
Customer-facing pricing accuracy. A price error that reached the live catalog created immediate consumer-facing and regulatory exposure. Pricing fields required explicit validation, not assumed correctness from the source feed.
No schema control over vendor feeds. Vendors set their own data formats and changed them without notice. The system had to absorb format variance gracefully and map it to a canonical internal schema, not demand compliance.
Downstream systems must receive consistent state. Search indexing, inventory management, and the sales platform all consume catalog data. They could not see partial updates or mid-refresh state — every event they received had to represent a complete, valid product record.
Full audit lineage required. Every SKU in every batch needed a traceable record from vendor feed to catalog record to downstream event, queryable after the fact for regulatory and supplier dispute purposes.

The Architecture

The pipeline follows a fan-in, process, fan-out pattern with two clearly defined processing boundaries: one at ingestion, one at publication to downstream systems.

On the inbound side, each vendor's feed arrives as a batch file dropped into Azure Blob Storage. A lightweight blob-triggered Function detects new arrivals and publishes individual product records as messages onto Azure Service Bus — one topic per feed type. This topic-per-feed-type design was a deliberate early decision: it confines the blast radius of any single vendor's malformed feed to exactly one topic. Its dead-letter queue fills, its consumers retry, and no other vendor's data is affected. Replaying one vendor's batch is a two-click operation; it never touches anyone else's.

Event-driven ingestion pipeline: vendor feeds enter via Azure Service Bus (one topic per feed type), are validated, deduplicated, and enriched in a Function App (highlighted), then written to the catalog store and published to downstream systems — search, inventory, and sales — as independent events. Invalid records are routed to a structured poison-message queue, not silently dropped.

The core processing runs in a Function App bound to each Service Bus topic. Before any business logic executes, the handler performs an idempotency check: each inbound message carries a composite dedup key derived from vendor_id + sku + revision. The key is checked against a Redis cache with a 24-hour TTL. Azure Service Bus guarantees at-least-once delivery — meaning the same message can arrive more than once under failure or retry conditions. The dedup check collapses those duplicates to a single write without needing distributed transaction semantics.

Validation runs as a pipeline of small, composable IValidationRule implementations registered in the DI container. Each rule receives the payload and returns either a pass or a structured error record. The first failure short-circuits the chain and routes the message to the dead-letter queue with the error reason attached as a message property — not as a logged exception. The operations team can query the dead-letter queue directly in Service Bus Explorer and see exactly which field failed, for which SKU, from which vendor, at what time.

After validation passes, an enrichment step decorates the canonical product record with data from internal reference sources: category hierarchy, regional VAT rates, locale-specific labels. The enriched record is written to Cosmos DB and an outbound event is published for downstream consumers. Three systems subscribe: search indexing, inventory management, and the pricing platform. Each is independently deployed with its own subscription and dead-letter queue — a failure in search indexing never holds up the inventory update.

Azure Service Bus Azure Functions v3 (in-process) .NET 5 / C# Azure Cosmos DB Azure Cache for Redis Azure Blob Storage

The entire pipeline runs on consumption-plan Azure Functions, so cost aligns directly with volume: near-zero during off-peak months, predictable burst cost during refresh cycles — with no pre-provisioned capacity sitting idle. The design we rejected was a traditional batch ETL job. The client's previous failed attempt used exactly that model, and its fundamental weakness was sequential processing: one large vendor feed would hold up all subsequent feeds behind it in the queue. An event-driven architecture eliminates head-of-line blocking entirely — every product record becomes an independent unit of work, processed in parallel up to the Function App's concurrency limit.

Implementation Highlights

Idempotent Message Handler

Azure Service Bus guarantees at-least-once delivery. Under normal conditions, each message arrives exactly once. Under failure — a Function host restarting mid-processing, a network partition during the acknowledgement — the same message can be redelivered. Without an idempotency guard, this produces duplicate catalog writes that corrupt the state downstream systems depend on. The guard uses a composite key checked against a Redis entry with a 24-hour TTL, which safely covers the maximum retry window for any single refresh cycle.

[FunctionName("CatalogIngestion")]
public async Task Run(
    [ServiceBusTrigger("catalog-feeds", "vendor-a",
        Connection = "ServiceBus:ConnectionString")] ServiceBusReceivedMessage message,
    ServiceBusMessageActions messageActions,
    ILogger log)
{
    var dedupeKey = message.ApplicationProperties
        .GetValueOrDefault("dedup_key")?.ToString();

    if (string.IsNullOrWhiteSpace(dedupeKey))
    {
        await messageActions.DeadLetterMessageAsync(message, "MissingDedupeKey",
            "No dedup_key property.");
        return;
    }

    if (await _cache.ExistsAsync(dedupeKey))
    {
        log.LogInformation("Duplicate discarded. {DedupeKey}", dedupeKey);
        await messageActions.CompleteMessageAsync(message);
        return;
    }

    var payload = message.Body.ToObjectFromJson<VendorProductPayload>();
    var result  = _validator.Validate(payload);

    if (!result.IsValid)
    {
        await messageActions.DeadLetterMessageAsync(message, result.ErrorCode,
            result.ErrorMessage);
        log.LogWarning("Rejected. {DedupeKey} {Error}", dedupeKey, result.ErrorCode);
        return;
    }

    await _catalog.UpsertAsync(payload.ToCanonical());
    await _cache.SetAsync(dedupeKey, TimeSpan.FromHours(24));
    await messageActions.CompleteMessageAsync(message);
    log.LogInformation("Ingested. {DedupeKey} {Sku}", dedupeKey, payload.Sku);
}

Composable Validation Chain

Rather than a monolithic validator, the pipeline registers a sequence of IValidationRule implementations in the DI container. Adding a new rule — say, when a new vendor introduces a field the others don't carry — requires one new class and one DI registration. No existing code changes. Cheap checks (required field presence) are registered first; expensive checks (image URL reachability) run last. The first failure short-circuits the chain immediately.

public interface IValidationRule
{
    ValidationResult Validate(VendorProductPayload payload);
}

// Registered in DI in order; first failure short-circuits.
public sealed class ValidationPipeline
{
    private readonly IEnumerable<IValidationRule> _rules;
    public ValidationPipeline(IEnumerable<IValidationRule> rules) => _rules = rules;

    public ValidationResult Validate(VendorProductPayload payload)
    {
        foreach (var rule in _rules)
        {
            var result = rule.Validate(payload);
            if (!result.IsValid) return result;
        }
        return ValidationResult.Pass();
    }
}

public sealed class RequiredFieldsRule : IValidationRule
{
    private static readonly string[] _fields =
        { "Sku", "Title", "CategoryCode", "PriceGbp" };

    public ValidationResult Validate(VendorProductPayload payload) =>
        _fields.Select(f => (Field: f, Value: payload.GetField(f)))
               .Where(x => string.IsNullOrWhiteSpace(x.Value))
               .Select(x => ValidationResult.Fail("MISSING_FIELD", x.Field))
               .FirstOrDefault() ?? ValidationResult.Pass();
}

public sealed class PriceRangeRule : IValidationRule
{
    public ValidationResult Validate(VendorProductPayload payload) =>
        payload.PriceGbp is >= 0.01m and <= 99_999.99m
            ? ValidationResult.Pass()
            : ValidationResult.Fail("PRICE_OUT_OF_RANGE",
                $"Price {payload.PriceGbp:C} outside accepted range.");
}

Dead-Letter as Structured Audit Trail

The conventional approach to validation failure is to log an exception and discard the message. We rejected this because it fragments the audit record: the error reason lives in the log stream, the original payload is gone, and correlating the two for an ops investigation requires cross-referencing timestamps across systems. Instead, rejected messages are dead-lettered with the full original payload and a structured error envelope attached as message properties. The dead-letter queue becomes a first-class data store. The operations team can review rejections in Service Bus Explorer, fix the upstream vendor data, and replay the message directly from the queue — without needing to regenerate the original feed file.

private async Task DeadLetterWithContextAsync(
    ServiceBusMessageActions messageActions,
    ServiceBusReceivedMessage message,
    ValidationResult failure,
    ILogger log)
{
    var envelope = new
    {
        ErrorCode    = failure.ErrorCode,
        ErrorMessage = failure.ErrorMessage,
        ReceivedAt   = DateTimeOffset.UtcNow,
        MessageId    = message.MessageId,
        DedupeKey    = message.ApplicationProperties.GetValueOrDefault("dedup_key"),
        OriginalBody = message.Body.ToString()
    };

    await messageActions.DeadLetterMessageAsync(
        message,
        deadLetterReason:           failure.ErrorCode,
        deadLetterErrorDescription: JsonSerializer.Serialize(envelope));

    log.LogWarning("Dead-lettered {MessageId} Reason={Reason}",
        message.MessageId, failure.ErrorCode);
}

The Outcome

97%

Time Reduction

3–4 days of manual effort compressed to under 3 hours of automated processing.

50 → 0

Manual Operators

Zero human touchpoints in the standard catalog refresh flow.

9,000+

Staff-Hours Reclaimed

Per year: ~3,000 hrs per refresh × 3 annual cycles — before QA cycle savings.

The first full production run completed in 2 hours 47 minutes — inside the weekend window with time to spare. The QA cycle that had previously added a day to every refresh did not run, because there was nothing to QA: every rejected item had a machine-readable reason in the dead-letter queue, and every accepted item had been validated by the pipeline before it reached the catalog.

The fifty staff who had previously been mobilised for the refresh cycle were not made redundant — they were reassigned. The three days of their calendar that had been blocked every quarter returned to regular work. The downstream effect was subtler but equally real: the fear around catalog refreshes disappeared. Refreshes stopped being events that the CEO's calendar blocked around. In the first full year after deployment, the client ran six catalog refreshes where they had previously run three. Supplier onboarding accelerated. New vendor relationships that had been avoided because the manual process couldn't absorb another feed source became feasible. The pipeline has been running without structural changes for more than four years.

What We'd Do Differently

The vendor validation rules are implemented as individual C# classes — one class per rule, one class per vendor quirk. This worked well for the first twelve months while the rule set was relatively stable. As new vendors were onboarded and existing vendors changed their feed schemas, the rule set grew. By the end of the first year, the repository had over forty validation rule classes, each requiring a code change and a deployment cycle to add or modify.

In retrospect, a small configuration-driven rule definition — something as simple as a JSON schema specifying field names, required/optional, type, and range constraints — would have paid back inside six months. The C# class-per-rule approach is transparent and unit-testable, properties we still value. A configuration-driven engine doesn't have to sacrifice those — it just needs the engine itself to be tested against its interpretation of the config. The operations team could have responded to a vendor schema change with a config deploy rather than a code deploy and a full release cycle.

The Service Bus topic-per-feed-type design has aged extremely well. Under peak load with five concurrent vendor feeds processing simultaneously, the isolation meant one feed's retry storm had no effect on the others. We'd make the same call again without hesitation.

The Cosmos DB partition key choice has not aged as well. We partitioned the catalog store by category code, which made sense when the largest vendor's catalog had a roughly even distribution across categories. Eighteen months in, one new vendor's catalog skewed heavily toward a single category — creating a hot partition that raised RU consumption by 30% and required a re-partitioning exercise. If designing this today, we'd partition by a compound key from the start, or take a harder look at Azure SQL with a regional read replica — which for this access pattern may be simpler and cheaper.

If You're Solving This Today

This was built on the Azure Functions v3 in-process model — the standard in 2021, using [FunctionName] attributes and DI wired directly through Startup.cs. In 2026 we'd start with the isolated worker model (.NET 8 default, [Function] attribute) — it gives cleaner dependency injection, proper middleware pipelines, and dramatically easier integration testing because the host no longer owns your DI container. We'd also evaluate Durable Functions for the orchestration layer rather than binding directly to Service Bus topics: the fan-out/fan-in pattern would have simplified the batch-completion signalling we implemented as a Redis counter. And we'd look at Azure Container Apps for the host — for workloads with predictable burst patterns, the per-replica pricing often beats the consumption plan once you factor in cold-start latency at scale.

Related Case Studies

Common Questions

Questions about retail catalog automation.

How long does a retail catalog automation project take to implement?: The first production deployment shipped in under 12 weeks from initial diagnosis; the embedded engagement then continued for 14 months as new vendors and validation rules were added. Timeline depends on the complexity of source data formats and the number of upstream systems requiring integration.
Can catalog automation integrate with an existing ERP or PIM system?: Yes. The pipeline in this engagement was designed as a layer on top of the existing ERP — receiving events from upstream systems and writing to the catalog without replacing any existing infrastructure. No rip-and-replace was required.
What Azure technologies are used for retail catalog automation?: Azure Service Bus for event coordination, Azure Functions for stateless processing, Redis Cache for deduplication, Cosmos DB for catalog state, and Azure Blob Storage for raw ingestion. The architecture is event-driven and idempotent — each step can safely re-run under at-least-once delivery without producing duplicates.
What ROI should we expect from catalog automation?: The client in this case study reclaimed over 9,000 staff-hours per year from manual catalog refresh cycles. For any operation involving more than ten people across multiple days, full automation typically pays for itself within the first cycle and enables more frequent refreshes — which this client used to double their annual refresh count.

Solving a Problem Like This?

The fastest way to know if the same approach fits your context is a two-week Discovery Sprint — a fixed-price backend diagnosis with no long-term commitment required.

Book a Discovery Sprint