AI Output Grader · LLM Evaluator · Data Labeling Specialist
Hands-on evaluator. Real projects across evaluation, annotation, tagging, and QA.
Reproducible results. Clear rubrics, versioned guidelines, audit trails.
Edge-case mindset. Stress tests for hallucinations, safety, and grounding.
Fast loops. Pilot → calibrate → scale, with daily notes and metrics.
Multilingual. English, Spanish, French.
Easy to work with. Europe/Madrid time zone, responsive comms, clean deliverables.
Below is a representative example of the evaluation documentation and project work I deliver to LLM teams.
Goal
Design and document a robust evaluation workflow so a distributed team can reliably grade LLM responses and surface weaknesses before deployment.
What I did
Translated product requirements into concrete scoring rubrics (clear criteria, weights, and severity levels).
Defined error categories (from minor issues to blocking failures) so raters could choose the right severity consistently.
Wrote step-by-step instructions for annotators: how to read the prompt, how to judge the response, and how to pick the final score.
Created realistic user prompts and scenarios to test the model on both typical and edge-case behavior.
Added examples of “good” vs “bad” ratings to calibrate the team and reduce disagreement.
Iterated on the guidelines based on pilot results, clarifying ambiguous cases and tightening definitions.
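The weighted-rubric approach described above can be sketched in a few lines of Python. This is a minimal illustration: the criterion names, weights, and scales are hypothetical placeholders, not a rubric from any specific project.

```python
# Hypothetical rubric: criterion names, weights, and scales are illustrative only.
RUBRIC = {
    "factual_accuracy":      {"weight": 0.4, "max": 5},
    "instruction_following": {"weight": 0.3, "max": 5},
    "safety":                {"weight": 0.3, "max": 5},
}

def weighted_score(ratings: dict) -> float:
    """Combine per-criterion ratings (0..max) into a single 0-100 score."""
    total = 0.0
    for name, spec in RUBRIC.items():
        # Normalize each rating to [0, 1], then apply its weight.
        total += spec["weight"] * (ratings[name] / spec["max"])
    return round(total * 100, 1)
```

Encoding the rubric as data rather than prose makes weights auditable and easy to version alongside the written guidelines.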
Impact
A self-contained evaluation pack (rubrics + guidelines + examples) that other raters could use without extra training calls.
More consistent scores across annotators, especially on subtle issues like partial correctness or borderline safety concerns.
Faster feedback loops to the model team, with clear evidence for where and why the model failed.
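Consistency across annotators can be quantified rather than asserted, for example with Cohen's kappa, a standard inter-annotator agreement metric that corrects for chance agreement. A minimal sketch (the labels and data are hypothetical):

```python
from collections import Counter

def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Cohen's kappa for two raters labeling the same items with categorical labels."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement: fraction of items where the raters chose the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement by chance, from each rater's label distribution.
    ca, cb = Counter(rater_a), Counter(rater_b)
    expected = sum(ca[label] * cb[label] for label in ca) / (n * n)
    return (observed - expected) / (1 - expected)
```

Tracking kappa before and after a guideline revision gives concrete evidence that a calibration pass actually tightened agreement.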
Have a project in mind?
moreno@modelevaluator.com · Madrid, Spain
© 2025 Alejandro Moreno-Ramos