Data Ops · 2024

Metadata automation, 60× faster

OneStream · Metadata · Caching

Quick Facts

Industry: AdTech / data-connectivity SaaS
Role: Senior Consultant — OneStream
Timeline: Custom build, then optimised in stages
Team: Solo build, architect-reviewed
Impact: Full metadata-governance cycle cut from ~2 hours to under 2 minutes by replacing full reloads with cache-backed selective updates, with a large drop in compute and a system that stays responsive during routine updates.

Overview

The client maintained large OneStream dimensions — on the order of 5,000–6,000 members each — and every metadata run rebuilt all of them from scratch. A fresh, full reload of every member on every change set is correct, but it is also slow and expensive: the cycle took around two hours and put heavy, repeated load on the system to re-process members that hadn't changed at all. We replaced that with a caching layer plus selective updates: keep the known member state in a backend table, diff each incoming change set against it in memory, and only touch the members whose relationships or attributes actually changed. The cycle dropped from ~2 hours to under 2 minutes, and the compute saving was just as valuable as the wall-clock one.

The Problem

Automating metadata governance is the easy part; making it fast is the hard part. The first working version did the whole job correctly but treated every run as a clean slate.

Full reload, every time. Each run re-read and rebuilt every member in the dimension, whether or not anything about that member had changed. Across dimensions of 5,000–6,000 members, that meant thousands of member operations per run even when the actual change set was a handful of members.
~2-hour cycle. The redundant work dominated the runtime. The cost scaled with the size of the dimension, not the size of the change — which is exactly backwards for a process that mostly applies small, incremental updates.
Heavy, repeated compute. Reprocessing unchanged members burned compute and kept the system busy and sluggish during what should have been routine updates.

The brief: keep the automation correct, but make its cost track the size of the change rather than the size of the dimension.

Process

Cache the known state in a backend table

The root inefficiency was statelessness: each run started with no memory of what the dimension looked like last time, so it had no choice but to rebuild everything. The first move was to give the process a memory: a backend table holding the current member state (relationships and the properties that matter for governance). At the start of a run that table is loaded into memory once, so the comparison work happens against an in-memory snapshot rather than through repeated round-trips to the table.
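A minimal sketch of the load step, in Python for illustration only (it stands in for whatever language the actual rule was written in). The table name member_state, its columns, and the MemberState fields are all assumptions made for the example, and SQLite stands in for the backend table.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class MemberState:
    parent: str        # relationship: the member's parent in the hierarchy
    description: str   # example governance-relevant property (assumed)
    account_type: str  # example governance-relevant property (assumed)


def load_state(conn: sqlite3.Connection) -> dict[str, MemberState]:
    """Read the whole state table in one query and index it by member name."""
    rows = conn.execute(
        "SELECT name, parent, description, account_type FROM member_state"
    )
    return {name: MemberState(parent, desc, acct) for name, parent, desc, acct in rows}


def demo_connection() -> sqlite3.Connection:
    """Build a tiny in-memory table so the sketch runs on its own."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE member_state ("
        "name TEXT PRIMARY KEY, parent TEXT, description TEXT, account_type TEXT)"
    )
    conn.executemany(
        "INSERT INTO member_state VALUES (?, ?, ?, ?)",
        [
            ("Revenue", "Root", "Total revenue", "Revenue"),
            ("AdSales", "Revenue", "Advertising sales", "Revenue"),
        ],
    )
    return conn


snapshot = load_state(demo_connection())
print(len(snapshot), snapshot["AdSales"].parent)
```

One query up front, then every later lookup is a dictionary access, which is the point of holding the snapshot in memory.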

Diff the incoming set, update only what changed

With the prior state in hand, each run now diffs the incoming change set against the cached table in memory and identifies exactly which members are new, which have a changed relationship (a moved parent, a re-pointed rollup), and which have updated attributes. Everything else — the overwhelming majority of the dimension on a typical run — is left untouched. The backend table is then updated to reflect the new state so the next run has an accurate baseline to diff against.
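The classification described above can be sketched as follows. This is an illustration under assumptions, not the project's code: the record shape (a parent plus a dictionary of attributes) and the names Diff and diff_members are hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical record shape: (parent, attributes), keyed by member name.
Member = tuple[str, dict[str, str]]


@dataclass
class Diff:
    new: list[str] = field(default_factory=list)      # not in the cached state
    moved: list[str] = field(default_factory=list)    # parent changed
    updated: list[str] = field(default_factory=list)  # attributes changed
    unchanged: int = 0                                 # skipped entirely


def diff_members(cached: dict[str, Member], incoming: dict[str, Member]) -> Diff:
    """Classify each incoming member against the cached snapshot.

    Members absent from the incoming set are ignored here; the sketch does
    not treat them as deletions.
    """
    result = Diff()
    for name, (parent, attrs) in incoming.items():
        prior = cached.get(name)
        if prior is None:
            result.new.append(name)
            continue
        prior_parent, prior_attrs = prior
        if parent != prior_parent:
            result.moved.append(name)
        if attrs != prior_attrs:
            result.updated.append(name)
        if parent == prior_parent and attrs == prior_attrs:
            result.unchanged += 1
    return result


cached = {
    "Revenue": ("Root", {"description": "Total revenue"}),
    "AdSales": ("Revenue", {"description": "Advertising sales"}),
    "Fees": ("Revenue", {"description": "Platform fees"}),
}
incoming = {
    "Revenue": ("Root", {"description": "Total revenue"}),
    "AdSales": ("Media", {"description": "Advertising sales"}),       # moved parent
    "Fees": ("Revenue", {"description": "Platform and service fees"}),  # new attribute
    "Media": ("Revenue", {"description": "Media"}),                    # new member
}
result = diff_members(cached, incoming)
print(result.new, result.moved, result.updated, result.unchanged)
```

The loop is a single pass over the incoming set with constant-time lookups into the snapshot, so the comparison itself stays cheap even for a full-size dimension.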

This is the whole unlock: the process stopped doing work proportional to the dimension and started doing work proportional to the change. The first cut of caching-plus-diff brought the cycle down to roughly ten minutes; tightening exactly what counts as a "changed" member — so the selective path skipped everything genuinely unchanged — brought it under two minutes.
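The write-up does not say how the definition of "changed" was tightened. One plausible approach, shown here purely as an assumption, is to compare only the governed fields and to normalise incidental differences such as stray whitespace before comparing. The field list and function names below are hypothetical.

```python
# Assumption: only these fields matter for governance. The real list is not
# given in the write-up.
GOVERNED_FIELDS = ("parent", "description", "account_type")


def fingerprint(record: dict[str, str]) -> tuple[str, ...]:
    """Reduce a record to its governed fields with whitespace collapsed."""
    return tuple(" ".join(record.get(name, "").split()) for name in GOVERNED_FIELDS)


def is_changed(cached: dict[str, str], incoming: dict[str, str]) -> bool:
    """A member counts as changed only if a governed field really differs."""
    return fingerprint(cached) != fingerprint(incoming)


cached = {
    "parent": "Revenue",
    "description": "Advertising sales",
    "account_type": "Revenue",
    "last_loaded": "2024-01-01",  # bookkeeping field, not governed
}
cosmetic = {
    "parent": "Revenue",
    "description": "Advertising  sales ",  # extra whitespace only
    "account_type": "Revenue",
    "last_loaded": "2024-02-01",
}
moved = dict(cached, parent="Media")

print(is_changed(cached, cosmetic), is_changed(cached, moved))
```

Under this rule a record that differs only in bookkeeping fields or spacing falls on the skip path, which is the kind of false positive that would otherwise keep the selective update busier than it needs to be.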

Make failures self-explaining

A faster process that fails opaquely isn't an improvement. Every run emits a detailed log of what it did — what it compared, what it decided to update, and what it skipped. If a run breaks, the log points to exactly where in the process it broke, so whoever picks it up can debug from evidence instead of guesswork. On failure the log is automatically attached and sent to the OneStream admin for review, so a broken run surfaces to the right person immediately rather than sitting silent until someone notices a stale hierarchy downstream.
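A rough sketch of the log-then-alert pattern, using Python's standard logging module. The function run_with_log and the notify_admin callback are hypothetical names; how the real process delivers the log to the admin is not described beyond "attached and sent", so the delivery step is left as a callback.

```python
import io
import logging
from typing import Callable


def run_with_log(job: Callable[[logging.Logger], None],
                 notify_admin: Callable[[str], None]) -> bool:
    """Run the job while capturing its log; on failure hand the log to notify_admin."""
    buffer = io.StringIO()
    handler = logging.StreamHandler(buffer)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    log = logging.getLogger("metadata_run")
    log.setLevel(logging.INFO)
    log.propagate = False
    log.addHandler(handler)
    try:
        job(log)
        log.info("run completed")
        return True
    except Exception:
        # log.exception records the traceback, which shows where the run broke.
        log.exception("run failed")
        notify_admin(buffer.getvalue())
        return False
    finally:
        log.removeHandler(handler)


def failing_job(log: logging.Logger) -> None:
    """Illustrative job that logs its decisions and then hits an error."""
    log.info("step=diff skipped=Revenue")
    log.info("step=apply member=AdSales action=move")
    raise ValueError("parent 'Media' does not exist")


sent: list[str] = []  # stands in for the message sent to the admin
ok = run_with_log(failing_job, sent.append)
print(ok, len(sent))
```

Because the decisions are logged before the failure, the message that reaches the admin already contains the last successful step and the traceback.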

Solution

Three pieces, all running inside OneStream with no external middleware:

1. Backend state table

A table that holds the current member state — relationships and governance-relevant properties — so the process always has a baseline to compare against. Loaded into memory at the start of each run for fast comparison.

2. In-memory diff + selective update

The core of the framework. Diffs the incoming change set against the cached state and applies updates only to members that are new or whose relationships/attributes changed. Unchanged members are skipped entirely, which is what turns a ~2-hour full reload into a sub-2-minute incremental update — and saves the matching compute.

3. Detailed logging + admin alerting

Every run produces a detailed, reviewable log. On failure, the log pinpoints where the process broke and is automatically sent to the OneStream admin, so issues are caught and debugged fast.
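How the pieces fit together can be sketched in a few lines. This is an illustration under assumptions: the state is a plain dictionary rather than a real table, and apply_update is a hypothetical stand-in for the expensive platform call that touches a member.

```python
from typing import Callable

# Hypothetical shape: member name -> (parent, description).
State = dict[str, tuple[str, str]]


def run_cycle(state: State, incoming: State,
              apply_update: Callable[[str, tuple[str, str]], None]) -> list[str]:
    """Apply only new or changed members, then refresh the baseline."""
    changed = [name for name, record in incoming.items() if state.get(name) != record]
    for name in changed:
        apply_update(name, incoming[name])  # the expensive call, made only when needed
        state[name] = incoming[name]        # recorded only after the call succeeds
    return changed


state: State = {
    "Revenue": ("Root", "Total revenue"),
    "AdSales": ("Revenue", "Advertising sales"),
}
incoming: State = {
    "Revenue": ("Root", "Total revenue"),     # unchanged, skipped
    "AdSales": ("Media", "Advertising sales"),  # moved
    "Media": ("Revenue", "Media"),            # new
}

calls: list[str] = []
first = run_cycle(state, incoming, lambda name, record: calls.append(name))
second = run_cycle(state, incoming, lambda name, record: calls.append(name))
print(first, second, calls)
```

Running the same change set twice makes no further calls the second time, because the baseline was refreshed after the first run. Updating the baseline only after each successful call also means a run that fails midway leaves the remaining members marked as pending for the next run.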

Results

Metric	Before	After	Change
Metadata cycle runtime	~2 hours	< 2 min	~60×
Members processed per run	All (~5–6k per dimension)	Only changed members	work scales with the change, not the dimension
Compute per run	High — full rebuild every time	Low — selective updates	large saving
System responsiveness during updates	Sluggish under full reload	Fast — minimal touch	routine updates stopped being disruptive
Failure visibility	—	Detailed log auto-sent to OneStream admin	issues surface immediately

Soft outcomes:

Updates stopped being a chore to schedule around. A sub-2-minute, low-compute cycle meant metadata changes could be applied whenever needed instead of being batched to avoid the load.
The system stayed responsive. Because routine updates no longer reprocessed the whole dimension, the platform wasn't tied up doing redundant work.
Breakages became debuggable. The detailed log plus the automatic admin alert turned an opaque failure into something with an obvious next step.

Learnings

What worked. The selective-update mental model — cache the prior state, diff against it, and only do work proportional to what actually changed — is the most portable idea from this project. It applies to any batch process that re-does expensive work it didn't need to repeat, and it's the first thing I now look for when something runs slower than the size of its input warrants.

What I'd do differently. Build the caching layer from the start rather than shipping the full-reload version first. The correct-but-slow version made the eventual optimisation feel risky — changing how an already-live process touches metadata is scarier than designing it that way up front.

Skill developed. Treating "is this work necessary?" as the first optimisation question, before "how do I make this work faster?" The biggest win here wasn't speeding anything up — it was not doing the work at all for the thousands of members that hadn't changed.