Improving LLM Reasoning Through Expert-Level Research Prompts and Structured Evaluation
Discover how a top AI company used expert prompts and structured evaluation to quickly test model reasoning at a deeper level, delivering faster insights with fewer, high-value tasks.

A leading AI-powered answer engine set out to test a new research-driven capability within its large language model. The goal was to move beyond surface-level summaries and evaluate how well the model could navigate complex, multi-step questions that demand expert analysis and critical thinking.

To support this initiative, our team created a pilot program focused on deep research tasks. We developed structured prompts and detailed rubrics across multiple professional domains such as finance, healthcare, lifestyle (travel), and engineering.
This allowed the client to test the model’s ability to provide well-supported, logically sound responses that mirror human expertise. The project showed how clear inputs and thoughtful evaluation can guide model behavior without the need for extensive retraining or traditional fine-tuning.
The Client
The client is a leading AI research company developing advanced language models used across consumer and enterprise products. They were building a new model aimed at deeper research capabilities, with the goal of delivering responses that feel less like summaries and more like the work of a subject matter expert.
To validate this new capability, they needed a way to simulate real-world research tasks and a consistent method for evaluating how well the model performed.
The Challenge
While the model was strong at generating fluent responses, the client wanted to test whether it could handle questions that require two to three hours of human research. The difficulty was creating a system that could:
- Deliver realistic prompts that would challenge the model’s reasoning
- Provide a measurable, consistent way to evaluate complex outputs
- Do all of this without expensive fine-tuning or retraining
They needed a scalable and repeatable method for evaluating their new deep research capability, one that could be trusted across multiple professional domains.
The Solution
We designed and delivered a two-week pilot focused on high-effort task creation and evaluation. Key actions included:
1. Expert-Led Prompt Design
Contributors from fields such as consulting, analytics, gaming, healthcare, finance, business, writing & media, and engineering created prompts requiring advanced knowledge and structured problem-solving. Prompts were realistic, scoped tightly, and grounded in current industry needs.
2. Rubric-Based Evaluation
Each task included a custom rubric of 20 to 40 detailed criteria, with point values assigned based on relevance and depth. Every item included the following, illustrated in the sketch after this list:
- A clear, verifiable fact
- A public source link
- A supporting quote
- A justification for its inclusion
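As a rough illustration only, not the client's actual tooling, a rubric item of this kind maps naturally onto a small data structure. The field names, example values, and point scale below are assumptions made for the sketch:

```python
from dataclasses import dataclass

@dataclass
class RubricItem:
    """One criterion in a task rubric; all field names are illustrative."""
    fact: str           # the clear, verifiable fact the answer should reflect
    source_url: str     # link to a public source backing the fact
    quote: str          # supporting quote from that source
    justification: str  # why this item belongs in the rubric
    points: int         # weight reflecting relevance and depth

# Hypothetical rubric fragment for a finance research task.
rubric = [
    RubricItem(
        fact="The company generated positive free cash flow in its latest fiscal year.",
        source_url="https://example.com/annual-report",
        quote="Free cash flow for the year was positive.",
        justification="Directly supports the investment assessment requested in the prompt.",
        points=3,
    ),
    # ...a full rubric would contain 20 to 40 such items
]
```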
3. Review and Feedback
A review team ensured prompt clarity, verified sources, and aligned rubric items to expected answers. Feedback loops helped contributors quickly improve quality and consistency.
Results / Business Outcomes
The project showed clear operational and performance benefits:
- Task time reduced from 4 hours to 2.5 hours on average
- 50 expert-level tasks delivered, comprising:
  - Health: 7
  - Finance: 18
  - Lifestyle: 10
  - STEM: 15
- 20% of contributors met all requirements on their first attempt
- 80% of contributors hit quality standards within 3 to 5 iterations
These tasks are now being used internally to benchmark LLM output and refine deep research capabilities, offering a lightweight alternative to full retraining.
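For a sense of how such rubrics could feed a benchmark number, the sketch below rolls per-criterion judgments (did the response satisfy each item, and how many points was it worth) into a single score. The weighting and pass/fail scheme are assumptions for illustration, not the client's internal scoring method.

```python
def rubric_score(weighted_results):
    """Aggregate (points, passed) pairs for one model response into the
    fraction of available rubric points earned."""
    earned = sum(points for points, passed in weighted_results if passed)
    possible = sum(points for points, _ in weighted_results)
    return earned / possible if possible else 0.0

# Example: a response that satisfied two of three weighted criteria.
results = [(3, True), (2, True), (1, False)]
print(f"Rubric score: {rubric_score(results):.0%}")  # Rubric score: 83%
```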
Faster Evaluation of LLM Reasoning at Expert Level
Key Challenges
- Fluent but shallow responses from LLMs
- Lack of consistent evaluation for deep reasoning
- Difficulty simulating real-world research questions
Welo Data Solutions
- Expert-designed prompts grounded in professional domains
- Custom rubrics with detailed factual criteria
- Iterative review and contributor feedback
- Lightweight approach avoiding traditional fine-tuning
Why It Matters
This pilot gave the client a reliable and innovative way to test their model's reasoning, not just its language fluency.
Key Takeaways
- Improved evaluation workflows: The rubric structure offers a reusable template for future assessments
- Faster iteration cycles: Feedback and calibration drove rapid quality gains
- Reduced training needs: Tasks helped shape the model without extensive data or retraining
- Higher trust in model output: Deep research tasks simulate real-world complexity, showing how the model handles serious questions