---
canonical: "https://yuanhaochen.dev/work/nitro-judge"
path: "/work/nitro-judge"
section: "Work"
title: "nitro-ai-judge"
language: "en"
agentUse: "summary, retrieval, citation, hiring evaluation"
---

# nitro-ai-judge

A baseline-discipline project: reproducible reading-time prediction, local evaluation, acceptance rules, target audit, and visible experiment boundaries.

Why this article exists

This project is useful because it shows restraint under leaderboard pressure. The repository keeps the baseline, data contract, local estimates, failed experiments, and promotion rules visible so model progress is not confused with wishful scoring.

Problem

A competition pipeline can look successful while hiding whether a score came from local validation, official feedback, leakage, oracle ceilings, or a lucky upload. That makes model claims hard to inspect.
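
One way to keep those origins inspectable is to attach a source label to every reported number. The sketch below is illustrative only: `ScoreSource`, `ScoreRecord`, and `comparable` are hypothetical names, not taken from the repository, and the three labels are an assumed subset of the sources listed above.

```python
from dataclasses import dataclass
from enum import Enum


class ScoreSource(Enum):
    """Where a reported number came from (hypothetical labels)."""
    LOCAL_CV = "local_cv"            # cross-validation on the training data
    OFFICIAL = "official_feedback"   # score returned by the hidden leaderboard
    ORACLE = "oracle_ceiling"        # bound computed with information a real model lacks


@dataclass(frozen=True)
class ScoreRecord:
    experiment: str
    value: float
    source: ScoreSource
    note: str = ""

    def describe(self) -> str:
        # Print the source next to the number so it is never quoted alone.
        return f"{self.experiment}: {self.value:.4f} [{self.source.value}]"


def comparable(a: ScoreRecord, b: ScoreRecord) -> bool:
    """Two scores may be compared only when they share a source."""
    return a.source is b.source
```

With records like these, a local estimate and an oracle ceiling cannot be placed side by side without the mismatch being visible.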

What shipped

The repository ships a CSV data contract, `solution.py`, a generated submission, a cross-validation evaluator, acceptance criteria, baseline design docs, a target audit, and documented Transformer, BERT, and semantic experiments.
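
To make the idea of a cross-validation evaluator concrete, here is a minimal sketch using only the standard library. It is not the repository's `evaluate.py`. The median baseline, the mean-absolute-error metric, and the names `fit_median_baseline` and `cross_validate` are all assumptions chosen for illustration.

```python
import random
from statistics import mean, median
from typing import Callable, List, Sequence, Tuple

Row = Tuple[Sequence[float], float]  # (features, target such as reading time)
Fit = Callable[[List[Row]], Callable[[Sequence[float]], float]]


def fit_median_baseline(train: List[Row]) -> Callable[[Sequence[float]], float]:
    """Plain baseline: predict the median training target for every row."""
    center = median(y for _, y in train)
    return lambda features: center


def cross_validate(rows: List[Row], fit: Fit, k: int = 5, seed: int = 0) -> List[float]:
    """Return one mean-absolute-error value per fold (assumed metric)."""
    if not 2 <= k <= len(rows):
        raise ValueError("k must be between 2 and the number of rows")
    order = list(range(len(rows)))
    random.Random(seed).shuffle(order)  # fixed seed keeps the folds reproducible
    folds = [order[i::k] for i in range(k)]
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [rows[i] for i in order if i not in held]
        predict = fit(train)  # the model never sees the held-out rows
        errors = [abs(predict(rows[i][0]) - rows[i][1]) for i in held_out]
        scores.append(mean(errors))
    return scores
```

Returning per-fold errors rather than a single average keeps the spread visible, which matters when deciding whether a difference between two models is larger than fold-to-fold noise.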

Evidence

The README distinguishes local estimates from hidden Nitro leaderboard scores, documents rejected or unpromoted experiments, and states when an experiment is not allowed to replace the baseline.

Inspect path

Inspect `solution.py`, `evaluate.py`, `docs/submission_pipeline_design.md`, `ACCEPTANCE_CRITERIA.md`, experiment reports, and target-audit commands.

Boundary

Local validation is not the hidden leaderboard, and experimental models are not promoted unless local or official evidence supports them. This repo demonstrates evaluation discipline more than final model superiority.
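
A promotion rule of this kind can be written down as a small function. The sketch below is a hypothetical example, not the rule in `ACCEPTANCE_CRITERIA.md`: the name `should_promote`, the `min_gain` parameter, and the one-standard-error margin are all assumptions. It compares per-fold errors from the same folds and treats lower error as better.

```python
from statistics import mean, stdev
from typing import Sequence


def should_promote(baseline_errors: Sequence[float],
                   candidate_errors: Sequence[float],
                   min_gain: float = 0.0) -> bool:
    """Paired per-fold comparison; lower error is better.

    Hypothetical rule: promote only if the candidate's mean gain exceeds
    min_gain plus one standard error of the per-fold gains.
    """
    if len(baseline_errors) != len(candidate_errors) or len(baseline_errors) < 2:
        raise ValueError("need the same folds for both models, at least two")
    gains = [b - c for b, c in zip(baseline_errors, candidate_errors)]
    standard_error = stdev(gains) / len(gains) ** 0.5
    return mean(gains) > min_gain + standard_error
```

Under this rule a candidate that wins on some folds and loses on others by similar amounts stays unpromoted, and the baseline remains the reference point.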

What changed

The baseline discipline became clearer: a plain model is useful when it protects error analysis, promotion rules, and evidence quality from leaderboard noise.

Next question

Which failure class would justify a more complex model instead of cleaner data, features, validation, or target analysis?

Open public repository

https://github.com/89325516/nitro-ai-judge
