Inkky: The Data Production Platform Behind Welo Data’s Frontier AI Programs
Why most data pipelines fail frontier AI programs — and what Welo Data built to fix it.
When a frontier AI program produces unexpected model behavior, the instinct is to look at architecture, training configuration, or data volume. The question that gets asked last, if it gets asked at all, is: what was the data production operation that generated the training set, and was it built to catch errors before they shipped?
In practice, data quality problems almost always trace back further than clients expect. The breakdown is rarely one catastrophic failure. It is a series of small gaps: guidelines that covered the happy path but not the edge cases, quality controls that were not configured before the program launched, contributors who received unclear instructions and made reasonable guesses that happened to be wrong. By the time those decisions compound into something visible in a model, the cause is buried several steps back.
The most common version looks like this. A client provides guidelines built around a handful of representative examples. Contributors interpret the ambiguous cases in different ways because the guidelines did not anticipate them. The output looks fine in aggregate but carries inconsistency in exactly the cases the model will eventually struggle with.
“A lot of the time these guidelines are created on a handful of examples. The client thinks of a few happy path cases and a few bad cases, but never the areas in between. Once we have that data in the platform and can aggregate the scores and pull out the outliers, we can have the real conversation: your guidelines are not covering these cases in the way you thought they were.”
MK Blake, Welo Data
The task has changed. Most of the tooling has not.
The data annotation industry has a long history, and most of the infrastructure in it was built for work that looked very different from what frontier AI programs now require. Fifteen years ago, the canonical task was binary classification: is this a picture of a dog? Objective, scalable, easy to quality-check at volume.
That work still exists. But it sits at one end of a spectrum that has stretched enormously. The programs that matter most right now ask contributors to evaluate whether factual claims in a paragraph are accurate, assess whether a voice model responded correctly, or grade a prompt-and-response pair against a complex rubric. These tasks require real judgment. They are subjective in ways that no guideline can fully resolve.
“What even some of our clients forget is that the people doing these tasks are people. They have differences of opinion and see things in different ways. That is a really difficult thing to manage versus training someone to recognize a picture of a dog, which is relatively simple and straightforward.”
MK Blake, Welo Data
This shift changes what the entire data production operation needs to be. A platform designed to maximize throughput on simple classification tasks is not the right infrastructure for programs that require evaluative judgment, multi-stage review, and individualized contributor feedback. The tools shape the output. If the tools are not built for the complexity of the work, the output reflects that, regardless of how capable the contributors are.
What a properly configured pipeline actually does.
The difference between a data production pipeline that enforces quality and one that does not is not about which features appear on a spec sheet. It is about whether the quality controls are structural or optional, whether they are built into the workflow or depend on someone configuring them correctly under time pressure every time.
Consider what happens when a new program needs to launch quickly. If standing up quality controls requires an engineering ticket, and the client’s timeline does not accommodate that queue, the program launches without them. Not because anyone decided quality did not matter. Because the infrastructure made quality controls a bottleneck rather than a default.
“We receive a customer request today and if you do not have it ready today, you are delayed. Sometimes that forces a difficult decision: we launch it as it is. And that results in deliveries where someone followed the guidelines correctly and someone else missed them entirely.”
Daisy Alvarez, Welo Data
The benchmark gate system addresses this at the contributor level. Before anyone touches live production tasks, they complete a set of tasks that appear identical to real work but have known correct answers. If they do not reach the passing threshold, they are removed from the program before reaching the queue. The gate does not just protect one task. It protects the consistency of the entire batch.
“Those tasks are like test tasks that contributors have to complete, and we have the correct answer. If they do not reach the passing threshold, they are removed from the project. They never get into production.”
Daisy Alvarez, Welo Data
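Reduced to its essentials, the gate is a simple check. The sketch below is a minimal Python rendering of the idea, not Inkky's actual API; the names (BenchmarkTask, gate_contributor) and the PASS_THRESHOLD value are assumptions for illustration.

```python
from dataclasses import dataclass

# Illustrative threshold; a real program would tune this per task type.
PASS_THRESHOLD = 0.85

@dataclass
class BenchmarkTask:
    """A task that looks like live work but has a known correct answer."""
    task_id: str
    correct_answer: str

def gate_contributor(submissions: dict[str, str],
                     benchmarks: list[BenchmarkTask]) -> bool:
    """Return True if the contributor may enter the production queue.

    `submissions` maps task_id to the contributor's answer. Contributors
    who score below PASS_THRESHOLD never reach live production tasks.
    """
    correct = sum(
        1 for task in benchmarks
        if submissions.get(task.task_id) == task.correct_answer
    )
    return correct / len(benchmarks) >= PASS_THRESHOLD
```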
Multi-stage review adds another layer. Completed tasks move to a reviewer queue of more experienced contributors who can edit and advance a task or reject it back with specific comments the original contributor receives. The system is not just catching errors in the current batch. It is closing the loop so the same errors are less likely in the next one. As patterns emerge in where contributors are struggling, the task interface itself can be updated without an engineering ticket.
“The platform allows us to give contributors individualized feedback on what they are doing wrong and where they can improve. Especially as guidelines change from clients and we have to adapt quickly.”
MK Blake, Welo Data
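In outline, the review stage behaves like a small state machine: a completed task either advances or bounces back with comments attached to it. The following sketch is illustrative only; the Status states, Task fields, and review function are hypothetical names, not the platform's real data model.

```python
from dataclasses import dataclass, field
from enum import Enum, auto

class Status(Enum):
    SUBMITTED = auto()   # completed by the original contributor
    IN_REVIEW = auto()   # sitting in a reviewer queue
    ADVANCED = auto()    # edited or approved; moves to the next stage
    REJECTED = auto()    # sent back with comments for rework

@dataclass
class Task:
    task_id: str
    contributor_id: str
    status: Status = Status.SUBMITTED
    feedback: list[str] = field(default_factory=list)  # reviewer comments

def review(task: Task, approve: bool, comments: list[str]) -> Task:
    """One review decision: advance the task or reject it back.

    Rejected tasks carry specific comments, so the original contributor
    sees exactly what to fix: the individualized feedback loop.
    """
    task.feedback.extend(comments)
    task.status = Status.ADVANCED if approve else Status.REJECTED
    return task
```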
Introducing Inkky
Everything described above represents how data production should work. Inkky is the infrastructure built to make it work that way, consistently, at the scale and complexity of the programs frontier AI teams run today.
Inkky is Welo Data’s proprietary data production platform, built from the ground up and launched in 2026. It consolidates the full production lifecycle into a single configurable system: project setup, contributor assignment, quality controls, review layers, real-time visibility, and export. Ops teams own every stage through a purpose-built interface, without routing changes through engineering. The platform is fully owned by Welo Data. No third-party dependencies. No outside visibility into client data.
The capabilities are not features added to complete a spec sheet. They are the mechanisms that make the quality arguments in this piece possible to actually deliver on, program after program, at any volume:
Benchmark gates and linters, built into the workflow
Quality controls run automatically on every project. They are not optional and do not require a separate setup process. A program cannot launch without them in place.
Multi-stage review with contributor feedback loops
Review layers are configurable to the specific needs of each program. Feedback reaches contributors at the task level, closing the loop between output quality and individual performance.
Real-time production visibility across every active program
Batch status, task queues, throughput, and export readiness are visible in one place. Blockers are visible while there is still time to act on them.
Ops-owned pipeline and task UI configuration
Pipeline stages, task interfaces, and assignment logic are configured through a GUI by the Ops teams who run the programs. No engineering queue.
Multi-modal data production in a single system
Text, image, audio, and video workflows run together. Schema-driven task interfaces support the complex instruction sets that frontier AI programs require (see the sketch after this list).
Full IP ownership, no third-party platform in the pipeline
Welo Data owns Inkky outright. The roadmap answers to one set of requirements: the operational demands of the programs we run.
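To make "schema-driven" concrete: a task interface declared as data might look something like the following. Every key and value here is a hypothetical example, not Inkky's actual schema; the point is that a definition like this can be edited through a GUI by Ops rather than routed through an engineering queue.

```python
# A hypothetical schema for a prompt-and-response grading task.
# All field names, types, and options are assumptions for illustration.
grading_task_schema = {
    "task_type": "prompt_response_grading",
    "modalities": ["text"],          # could also include image, audio, video
    "fields": [
        {"name": "factual_accuracy", "type": "rating", "scale": [1, 5]},
        {"name": "guideline_violations", "type": "multi_select",
         "options": ["tone", "formatting", "safety", "other"]},
        {"name": "reviewer_notes", "type": "free_text", "required": False},
    ],
    "review_stages": 2,              # multi-stage review depth
    "benchmark_gate": True,          # gate runs before production by default
}
```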
The platform is also what makes the Welo Works contributor community reliable at scale. Welo Works is a managed, known group of contributors, not anonymous crowd labor. Inkky routes the right task to the right person, verifies they are ready before they reach production, surfaces their performance data to Ops in real time, and closes the feedback loop at the individual level. The platform and the community work together. Neither is sufficient without the other.
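One way to picture the routing step is a filter over the contributor pool followed by a ranking. This is a hedged sketch assuming illustrative fields (skills, passed_gate, quality_score, open_tasks); the real assignment logic is configurable per program.

```python
def route_task(task: dict, contributors: list[dict]) -> dict | None:
    """Pick a contributor for a task: qualified, gated, and available.

    `task` and `contributors` use illustrative keys; all are assumptions.
    """
    eligible = [
        c for c in contributors
        if task["required_skill"] in c["skills"]  # right person for the task
        and c["passed_gate"]                      # verified before production
        and c["open_tasks"] < 5                   # simple load limit
    ]
    if not eligible:
        return None  # surfaced to Ops as a staffing blocker
    # Prefer the strongest recent quality record.
    return max(eligible, key=lambda c: c["quality_score"])
```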
The question that should come before anything else.
Before pipeline configuration, before contributor assignment, before any quality control is set up, the most important question is the one that is asked least often: what is this data actually for?
Not the immediate deliverable. The downstream purpose. What behavior is the model being trained toward? What failure modes is the client trying to prevent? What does a correct response look like in the cases the guidelines have not covered yet?
The answer to that question should drive everything else: the task design, the review structure, the benchmark criteria, the way outliers get surfaced back to the client for guideline refinement. Without it, the program optimizes for delivery against a specification that may not reflect what the client actually needs.
“Having a good alignment at the very beginning does transfer into the quality of the data matching your expectations. Having the feedback loops in the middle, vendor to customer, is what actually improves the quality you get at the end.”
Daisy Alvarez, Welo Data
Most vendors scope the work and deliver against the brief. The feedback loop between what was delivered and what the model needed is something the client constructs on their own, after the fact, when the problem is already visible.
Inkky is built to make that feedback loop part of the production process itself. The outliers, the disagreement rates, the contributor performance patterns, the places where the guidelines did not cover the real distribution of cases: all of it is visible during the program, when it can still change the outcome. That is what we built it for.
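As a concrete example of the kind of signal involved, disagreement rates can be computed per item across the contributors who handled it, and the high-disagreement cases surfaced for guideline refinement. The sketch below is one minimal way to do that; the 0.4 threshold and the function names are assumptions, and a real program would tune and extend this.

```python
from collections import Counter

def disagreement_rate(labels: list[str]) -> float:
    """Share of labels that differ from the majority label for one item.

    0.0 means everyone agreed; high values suggest the guidelines do not
    cover this case cleanly.
    """
    majority_count = Counter(labels).most_common(1)[0][1]
    return 1.0 - majority_count / len(labels)

def surface_outliers(items: dict[str, list[str]],
                     threshold: float = 0.4) -> list[str]:
    """Return item IDs whose disagreement exceeds `threshold`.

    These are the cases to take back to the client for guideline
    refinement while the program is still running.
    """
    return [
        item_id for item_id, labels in items.items()
        if disagreement_rate(labels) > threshold
    ]
```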