AI Content Evaluation at Scale: Three Task Types, Zero Rework
Welo Data partnered with a leading global travel platform to validate AI-generated property descriptions, score room copy quality, and benchmark a proprietary Judge LLM, all simultaneously within a single three-week window.
- 870 evaluation tasks delivered
- 3 task types run simultaneously
- 100% QC coverage across all tasks
- 3 weeks from sourcing through final delivery
A leading global travel platform selected Welo Data to run a human evaluation pilot across three structurally distinct AI content quality tasks: factual accuracy labelling, multi-dimensional copy quality scoring, and Judge LLM validation, each with its own guidelines, scoring logic, and quality benchmarks.
The challenge wasn’t just executing three task types at once. It was doing so with a single cross-trained contributor pool, in a client-specified environment, while maintaining 100% QC coverage across all tasks and delivering a structured findings report that would inform both model improvement and the client’s ongoing AI content production roadmap.
The Challenge
The client’s Scaled Copy team produces large volumes of AI-generated property and room descriptions used to inform booking decisions across millions of listings. Three failure modes required human evaluation before the AI content pipeline could be trusted at scale.
First, generative models were hallucinating or misrepresenting property amenities and distances, producing descriptions that set guest expectations the property couldn’t meet. Second, room description copy was failing to prioritise high-value selling points within strict 15-word limits, reducing conversion effectiveness. Third, a proprietary Judge LLM was inconsistently flagging forbidden information across 12 policy categories, creating risk of prohibited content entering live production.
Each failure mode required a different evaluation approach. The urgency was real: AI-generated descriptions are customer-facing, directly influence booking decisions, and, when wrong, drive guest complaints, refunds, and policy violations. The pilot needed to validate baselines across all three problems simultaneously, not sequentially.
The Approach
Welo Data implemented a single cross-trained contributor pool operating across all three task types in parallel, avoiding the time and cost of building separate specialist teams for each evaluation format.
For Task 1 (Labelling), raters reviewed AI-generated property-level descriptions against a structured source-of-truth dataset, identifying and labelling accuracy issues across six categories: facility hallucination, wrong descriptor, geo bias, incorrect distances, guest opinion framing, and other.

For Task 2 (Quality Assessment), raters scored AI-generated room value proposition copy across a 6-pillar framework (Language Accuracy, Clarity & Readability, Brand, Use Case, Project Style Requirements, and Project Content Needs) on a 1-4 scale with issue flagging and commentary.

For Task 3 (Judge LLM Validation), raters evaluated a proprietary LLM’s ability to detect 12 categories of forbidden information in property descriptions, recording human annotations independently before comparing against Judge outputs and flagging misalignments.
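The Task 3 comparison step can be sketched as a simple per-category diff between the independent human annotation and the Judge LLM's output. This is an illustrative reconstruction only: the client's 12 policy categories are not named in this case study, so the category labels below are hypothetical.

```python
def compare_annotations(human: dict, judge: dict) -> list:
    """Return (category, human_flag, judge_flag) tuples for every
    policy category where the Judge LLM disagrees with the
    independently recorded human annotation."""
    return [
        (category, human[category], judge[category])
        for category in human
        if human[category] != judge[category]
    ]

# Hypothetical example: the human flagged a pricing claim that the
# Judge missed -- a false negative of the kind the pilot surfaced.
human = {"contact_info": False, "external_url": False, "pricing_claim": True}
judge = {"contact_info": False, "external_url": False, "pricing_claim": False}
print(compare_annotations(human, judge))
```

Recording the human annotation before revealing the Judge output, as the raters did, prevents the model's answer from anchoring the human judgment.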
All contributors completed 7-point identity verification, four threshold-gated application assessments (English Language & Proofreading, Reading & Comprehension, Entity Tagging, and Attention to Detail), task-specific embedded training videos produced by Welo’s internal quality team, and calibration testing against pre-validated golden responses before entering production. Annotators scoring below 75% accuracy on preliminary assessments were removed from the production pipeline immediately.
A three-layer QA model governed output throughout: rater production with assessment gating at entry; continuous QC audit and correction at 100% coverage; and internal expert validation by Welo’s Quality & Operations team at final delivery. An active real-time FAQ feedback loop resolved edge cases as they arose, with decisions communicated immediately to the full contributor pool to prevent compounding inconsistencies.
Key Project Components
- Task types: Task 1: Accuracy labelling (200 items); Task 2: 6-pillar QA scoring (540 items); Task 3: Judge LLM validation (200 items)
- Locale: English (UK), en-GB; native/near-native evaluators based in the Philippines
- Workforce: 13 production raters + 6 QCs; contributors cross-trained across all three task types
- Identity verification: 7-point identity verification per contributor; 4 threshold-gated assessments (90%+ required); 75% production gate
- Quality model: 3-layer QA; 100% QC coverage; calibration against golden responses; real-time FAQ edge case resolution
- Tooling: Google Sheets (per client specification); proprietary onboarding and identity verification platform
- Timeline: Full pilot (sourcing, onboarding, training, production, QC, and final reporting) delivered within a single March–April 2026 window
Outcomes and Impact
The pilot delivered 870 evaluation tasks across all three types, on schedule, with 100% QC coverage. Quality results at delivery reflected both the rigour of the contributor pipeline and the impact of calibrated label methodology: Task 2 achieved 100% alignment on Language Accuracy and Project Style Requirements, 95.93% on Brand, and 92.78% on Use Case. Task 3 reached 88.5% calibrated human label match, the highest inter-rater alignment across all tasks, and identified an 87% Judge LLM label accuracy rate.
The calibrated label approach proved particularly valuable. Raw alignment scores (Task 1: 59.5%, Task 3: 31%) reflected genuine guideline ambiguities rather than annotator error. Calibrated alignment scores (Task 1: 76.5%, Task 3: 88.5%) isolated true performance from instruction gaps, producing metrics that accurately represented contributor quality and giving the client a reliable foundation for model improvement decisions.
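The raw-versus-calibrated distinction can be made concrete with a small sketch. The case study does not specify the exact calibration mechanism, so this assumes one plausible form: items whose guideline wording was later judged ambiguous are excluded before recomputing alignment. The item IDs and label names below are invented for illustration.

```python
def alignment(human: dict, reference: dict, ambiguous=frozenset()) -> float:
    """Fraction of items where the human label matches the reference
    label. Passing `ambiguous` excludes items affected by guideline
    ambiguity, yielding a calibrated rather than raw score."""
    ids = [i for i in human if i not in ambiguous]
    if not ids:
        return 0.0
    return sum(human[i] == reference[i] for i in ids) / len(ids)

# Hypothetical four-item batch using Task 1's issue categories.
human = {1: "geo_bias", 2: "other", 3: "wrong_descriptor", 4: "other"}
ref = {1: "geo_bias", 2: "incorrect_distance", 3: "wrong_descriptor", 4: "guest_opinion"}

print(alignment(human, ref))                      # raw: 2 of 4 match -> 0.5
print(alignment(human, ref, ambiguous={2, 4}))    # calibrated: 2 of 2 -> 1.0
```

The point the numbers make: if the disagreements cluster on items the guidelines under-specified, the calibrated score isolates true annotator performance, and the excluded items become a concrete list of guidelines to fix.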
Beyond the numbers, the program surfaced three critical AI model failure modes with specific, actionable recommendations: systematic geo bias in property description generation (landmark distances over 2km consistently framed as ‘nearby’), routine over-generalisation of ‘additional amenities’ to all units, and Judge LLM false negatives on implicit schedule language. These findings, along with documented guideline improvement recommendations, were delivered in a structured report alongside the labelled data.
Key Results
- 870 evaluation tasks delivered across 3 task types, on schedule, within a 3-week window
- 100% QC coverage maintained across all tasks throughout production
- Task 2: 100% alignment on Language Accuracy and Project Style Requirements; 95.93% Brand; 92.78% Use Case
- Task 3: 88.5% calibrated human label alignment (highest inter-rater agreement across all tasks)
- Calibrated label methodology improved Task 1 alignment from 59.5% to 76.5%; Task 3 from 31% to 88.5%
- 3 critical AI model failure modes identified and documented with specific remediation recommendations
- Judge LLM false negative patterns identified with red-teaming recommendations for model improvement
- Structured delivery report produced, including alignment statistics, issue distribution analysis, edge case resolutions, and model improvement recommendations
Why It Matters
AI content that reaches customers at scale (property descriptions, room copy, booking information) carries direct business risk when it’s wrong. Factual inaccuracies drive complaint tickets and refunds. Prohibited content creates compliance exposure. Copy that fails to lead with the right selling points reduces conversion. Human evaluation isn’t a quality-check formality; it’s the mechanism that makes AI content trustworthy enough to deploy at scale.
What this pilot demonstrated is that rigorous multi-task human evaluation doesn’t require months of setup or siloed specialist teams for each task type. By designing a cross-trained contributor pipeline from the start, with threshold-gated onboarding, calibrated quality benchmarks, and real-time edge case resolution, Welo Data delivered validated baselines across three structurally different evaluation problems within a single pilot window.
The calibrated label approach also showed something important: raw alignment scores and true annotator performance are not the same thing when guidelines contain ambiguities. Separating genuine annotation errors from instruction gaps yields actionable metrics, along with a structured record of which guidelines need improvement before the next production phase.
For the client, this pilot establishes a validated evaluation capability that is ready to be extended across additional copy types, locales, and ongoing production batches. The infrastructure (the contributor pool, the QA model, the calibration framework) doesn’t need to be rebuilt. It scales.
Welo Data
The human layer behind enterprise AI evaluation.