Improving LLM Reasoning Through Expert-Level Research Prompts and Structured Evaluation
Discover how a top AI company used expert prompts and structured evaluation to quickly test model reasoning at a deeper level, delivering faster insights with fewer, high-value tasks.

A leading AI-powered answer engine set out to test a new research-driven capability within its large language model. The goal was to move beyond surface-level summaries and evaluate how well the model could navigate complex, multi-step questions that demand expert analysis and critical thinking.

To support this initiative, our team created a pilot program focused on deep research tasks. We developed structured prompts and detailed rubrics across multiple professional domains such as finance, healthcare, lifestyle (travel), and engineering.
This allowed the client to test the model’s ability to provide well-supported, logically sound responses that mirror human expertise. The project showed how clear inputs and thoughtful evaluation can guide model behavior without the need for extensive retraining or traditional fine-tuning.
The Client
The client is a leading AI research company developing advanced language models used across consumer and enterprise products. They were building a new model aimed at deeper research capabilities, with the goal of delivering responses that feel less like summaries and more like the work of a subject matter expert.
To validate this new capability, they needed a way to simulate real-world research tasks and a consistent method for evaluating how well the model performed.
The Challenge
While the model was strong at generating fluent responses, the client wanted to test whether it could handle questions that require two to three hours of human research. The difficulty was creating a system that could:
- Deliver realistic prompts that would challenge the model’s reasoning
- Provide a measurable, consistent way to evaluate complex outputs
- Do all of this without expensive fine-tuning or retraining
They needed a scalable and repeatable method for evaluating their new deep research capability, one that could be trusted across multiple professional domains.
The Solution
We designed and delivered a two-week pilot focused on high-effort task creation and evaluation. Key actions included:
1. Expert-Led Prompt Design
Contributors from fields such as consulting, analytics, gaming, healthcare, finance, business, writing & media, and engineering created prompts requiring advanced knowledge and structured problem-solving. Prompts were realistic, scoped tightly, and grounded in current industry needs.
2. Rubric-Based Evaluation
Each task included a custom rubric of 20 to 40 detailed criteria, with point values assigned based on relevance and depth. Every item included the following, illustrated in the sketch after this list:
- A clear, verifiable fact
- A public source link
- A supporting quote
- A justification for its inclusion
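As a rough illustration only, not the client's actual tooling, a rubric item of this kind maps naturally onto a small data structure. The field names, example values, and point scale below are assumptions made for the sketch:

```python
from dataclasses import dataclass

@dataclass
class RubricItem:
    """One criterion in a task rubric; all field names are illustrative."""
    fact: str           # the clear, verifiable fact the answer should reflect
    source_url: str     # link to a public source backing the fact
    quote: str          # supporting quote from that source
    justification: str  # why this item belongs in the rubric
    points: int         # weight reflecting relevance and depth

# Hypothetical rubric fragment for a finance research task.
rubric = [
    RubricItem(
        fact="The company generated positive free cash flow in its latest fiscal year.",
        source_url="https://example.com/annual-report",
        quote="Free cash flow for the year was positive.",
        justification="Directly supports the investment assessment requested in the prompt.",
        points=3,
    ),
    # ...a full rubric would contain 20 to 40 such items
]
```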
3. Review and Feedback
A review team ensured prompt clarity, verified sources, and aligned rubric items to expected answers. Feedback loops helped contributors quickly improve quality and consistency.
Results / Business Outcomes
The project showed clear operational and performance benefits:
- Task time reduced from 4 hours to 2.5 hours on average
- 50 expert-level tasks delivered, comprising:
  - Health: 7
  - Finance: 18
  - Lifestyle: 10
  - STEM: 15
- 20% of contributors met all requirements on their first attempt
- 80% of contributors hit quality standards within 3 to 5 iterations
These tasks are now being used internally to benchmark LLM output and refine deep research capabilities, offering a lightweight alternative to full retraining.
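For a sense of how such rubrics could feed a benchmark number, the sketch below rolls per-criterion judgments (did the response satisfy each item, and how many points was it worth) into a single score. The weighting and pass/fail scheme are assumptions for illustration, not the client's internal scoring method.

```python
def rubric_score(weighted_results):
    """Aggregate (points, passed) pairs for one model response into the
    fraction of available rubric points earned."""
    earned = sum(points for points, passed in weighted_results if passed)
    possible = sum(points for points, _ in weighted_results)
    return earned / possible if possible else 0.0

# Example: a response that satisfied two of three weighted criteria.
results = [(3, True), (2, True), (1, False)]
print(f"Rubric score: {rubric_score(results):.0%}")  # Rubric score: 83%
```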
Faster Evaluation of LLM Reasoning at Expert Level
Key Challenges
- Fluent but shallow responses from LLMs
- Lack of consistent evaluation for deep reasoning
- Difficulty simulating real-world research questions
Welo Data Solutions
- Expert-designed prompts grounded in professional domains
- Custom rubrics with detailed factual criteria
- Iterative review and contributor feedback
- Lightweight approach avoiding traditional fine-tuning
Why It Matters
This pilot gave the client a reliable and innovative way to test their model's reasoning, not just its language fluency.
Key Takeaways
- Improved evaluation workflows: The rubric structure offers a reusable template for future assessments
- Faster iteration cycles: Feedback and calibration drove rapid quality gains
- Reduced training needs: Tasks helped shape the model without extensive data or retraining
- Higher trust in model output: Deep research tasks simulate real-world complexity, showing how the model handles serious questions