Improving LLM Reasoning Through Expert-Level Research Prompts and Structured Evaluation

Discover how a top AI company used expert prompts and structured evaluation to quickly test model reasoning at a deeper level, delivering faster insights with fewer, high-value tasks.


A leading AI-powered answer engine set out to test a new research-driven capability within its large language model. The goal was to move beyond surface-level summaries and evaluate how well the model could navigate complex, multi-step questions that demand expert analysis and critical thinking.

To support this initiative, our team created a pilot program focused on deep research tasks. We developed structured prompts and detailed rubrics across multiple professional domains such as finance, healthcare, lifestyle (travel), and engineering. 

This allowed the client to test the model’s ability to provide well-supported, logically sound responses that mirror human expertise. The project showed how clear inputs and thoughtful evaluation can guide model behavior without the need for extensive retraining or traditional fine-tuning.

The client is a leading AI research company developing advanced language models used across consumer and enterprise products. They were building a new model aimed at deeper research capabilities, with the goal of delivering responses that feel less like summaries and more like the work of a subject matter expert.

To prove this model could work, they needed a way to simulate real-world research tasks and a consistent method to evaluate how well the model performed.

While the model was strong at generating fluent responses, the client wanted to test whether it could handle questions that require two to three hours of human research. The difficulty was creating a system that could simulate those real-world research questions and evaluate the depth of the model's reasoning consistently.

They needed a scalable and repeatable method for evaluating their new deep research capability, one that could be trusted across multiple professional domains.

We designed and delivered a two-week pilot focused on high-effort task creation and evaluation. Key actions included:

1. Expert-Led Prompt Design

Contributors from fields such as consulting, analytics, gaming, healthcare, finance, business, writing & media, and engineering created prompts requiring advanced knowledge and structured problem-solving. Prompts were realistic, scoped tightly, and grounded in current industry needs.

2. Rubric-Based Evaluation

Each task included a custom rubric of 20 to 40 detailed, factual criteria, each assigned a point value based on its relevance and depth.
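To make the mechanism concrete, here is a minimal sketch in Python of how a task's prompt and weighted rubric items might be represented and scored. This is a hypothetical illustration, not the client's actual tooling: the class names, criteria, and point values are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class RubricItem:
    """A single evaluation criterion with a point value reflecting its relevance and depth."""
    description: str
    points: int
    met: bool = False  # set by a reviewer after reading the model's response

@dataclass
class ResearchTask:
    """An expert-written prompt paired with its custom rubric (typically 20 to 40 items)."""
    domain: str
    prompt: str
    rubric: list[RubricItem] = field(default_factory=list)

    def score(self) -> float:
        """Return the fraction of available rubric points the response earned."""
        total = sum(item.points for item in self.rubric)
        earned = sum(item.points for item in self.rubric if item.met)
        return earned / total if total else 0.0

# Hypothetical usage: a reviewer marks which criteria the model's answer satisfied.
task = ResearchTask(
    domain="finance",
    prompt="Assess the three largest cost drivers for a mid-size retail bank and recommend mitigations.",
    rubric=[
        RubricItem("Identifies the correct top cost drivers with supporting figures", points=5, met=True),
        RubricItem("Explains the reasoning behind each recommendation", points=3, met=True),
        RubricItem("Cites current, verifiable industry sources", points=2, met=False),
    ],
)
print(f"Rubric score: {task.score():.0%}")  # -> Rubric score: 80%
```

Keeping every criterion small and independently verifiable is what makes scores comparable across reviewers and domains.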

3. Review and Feedback

A review team ensured prompt clarity, verified sources, and aligned rubric items to expected answers. Feedback loops helped contributors quickly improve quality and consistency.

The project delivered clear operational and performance benefits: deeper insight into model reasoning from a small set of high-value tasks, produced within a two-week pilot.

These tasks are now being used internally to benchmark LLM output and refine deep research capabilities, offering a lightweight alternative to full retraining.
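As a rough illustration of how that benchmarking could work, the snippet below averages per-task rubric scores for two model versions. It builds on the hypothetical sketch above rather than describing any real internal pipeline, and the version names and scores are invented for the example.

```python
from statistics import mean

# Hypothetical benchmark: per-task rubric scores (0.0-1.0) for each model version.
results = {
    "model-v1": [0.62, 0.55, 0.71],
    "model-v2": [0.78, 0.69, 0.83],
}

for version, scores in results.items():
    print(f"{version}: mean rubric score {mean(scores):.0%} over {len(scores)} tasks")
```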

Key Challenges

  • Fluent but shallow responses from LLMs
  • Lack of consistent evaluation for deep reasoning
  • Difficulty simulating real-world research questions

Welo Data Solutions

  • Expert-designed prompts grounded in professional domains
  • Custom rubrics with detailed factual criteria
  • Iterative review and contributor feedback
  • Lightweight approach avoiding traditional fine-tuning

This pilot gave the client a reliable, innovative way to test their model's reasoning, not just its language fluency.

Key Takeaways