Scoring

Gradient’s scoring engine uses an LLM-as-judge approach, alongside deterministic checks and behavioral event analysis, to evaluate candidate submissions across multiple dimensions. Our scoring framework is informed by Anthropic’s AI Fluency Index, which identifies the core competencies that distinguish effective AI-augmented knowledge workers. The result is a detailed score breakdown with percentile rankings and suggested follow-up interview questions.

Scoring categories

The default rubric (v5) evaluates candidates across five categories. The overall score out of 100 is a weighted blend of the five, using these fixed weights:

Category	Weight	What it measures
Correctness	20%	Scored deterministically. Instruction following and factual accuracy: required elements present, facts correct, sound judgment on the task
Deliverable Quality	20%	How good the output is (polish, clarity, insight), measured as how far it improves on a first-pass AI draft
Reflection Quality	10%	Whether the candidate reads their own work honestly and takes something from it (scored only when the Reflection phase is enabled)
AI Fluency	25%	How well the candidate used AI, across four dimensions: AI Use, Steering, Critical Use, and Creative Framing
Prioritized Skills	25%	One sub-criterion per approved priority skill on the role, so scoring reflects what the job needs

Category weights are fixed by the rubric version. What you tune per assessment are the sub-criteria inside Correctness, Deliverable Quality, Reflection Quality, and Prioritized Skills (see Custom Rubrics). The Prioritized Skills sub-criteria are built from the role’s priority skills.
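
As a rough illustration of the weighted blend, the sketch below combines category scores into an overall score out of 100. The weights come from the table above. Everything else is an assumption made for the example: the `overall_score` name, category scores already being on a 0-100 scale, and the renormalization applied when Reflection Quality is absent (this page does not say how the engine handles a disabled Reflection phase).

```python
# Weights from the v5 rubric table (percent of the overall score).
WEIGHTS = {
    "correctness": 20,
    "deliverable_quality": 20,
    "reflection_quality": 10,
    "ai_fluency": 25,
    "prioritized_skills": 25,
}


def overall_score(category_scores: dict) -> float:
    """Blend 0-100 category scores into an overall 0-100 score.

    Assumption: a category missing from `category_scores` (for example
    Reflection Quality when that phase is off) is dropped and the remaining
    weights are renormalized. The real engine may handle this differently.
    """
    active = {name: w for name, w in WEIGHTS.items() if name in category_scores}
    if not active:
        raise ValueError("no recognized categories to score")
    total_weight = sum(active.values())
    blended = sum(category_scores[name] * w for name, w in active.items())
    return round(blended / total_weight, 2)
```

With all five categories present the weights sum to 100, so the result is the plain weighted average of the category scores.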

AI Fluency is managed centrally by Gradient. It is auto-graded from session behavior, always locked, and neither editable by admins nor calibratable by managers, so this category stays consistent across every organization and assessment.

How scoring works

Trigger

After a candidate submits, an admin triggers scoring via the dashboard or the Scoring API. Scoring can also be re-run on already-scored sessions.

Analysis

The scoring engine runs multiple analysis modules in parallel. Each module examines different aspects of the submission: the deliverable itself, the conversation transcript, event patterns, and the AI configuration snapshot.
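
One way to picture the parallel analysis step is a set of independent coroutines gathered together. This is only a sketch: the module functions, their outputs, and the shape of the `session` dictionary are hypothetical stand-ins, not Gradient’s actual modules.

```python
import asyncio


# Hypothetical stand-ins for the analysis modules. Real modules would
# presumably call models or services; these only inspect the input.
async def analyze_deliverable(session: dict) -> dict:
    return {"module": "deliverable", "word_count": len(session["deliverable"].split())}


async def analyze_transcript(session: dict) -> dict:
    return {"module": "transcript", "message_count": len(session["transcript"])}


async def analyze_events(session: dict) -> dict:
    return {"module": "events", "event_count": len(session["events"])}


async def analyze_ai_config(session: dict) -> dict:
    return {"module": "ai_config", "model": session["ai_config"].get("model")}


async def run_analysis(session: dict) -> dict:
    """Run every module concurrently and key the results by module name."""
    modules = (analyze_deliverable, analyze_transcript, analyze_events, analyze_ai_config)
    results = await asyncio.gather(*(module(session) for module in modules))
    return {result["module"]: result for result in results}
```

Because each module reads the session independently, a slow module does not block the others, and the results can be merged once all of them finish.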

Scoring

Each sub-criterion receives a score using one of these methods:

Deterministic: Ground-truth checks (are the required facts and elements present and right?). Used for Correctness
LLM Judge: An AI evaluator assesses quality against rubric criteria. Used for Deliverable Quality, Reflection Quality, and Prioritized Skills
Event Analysis: Automated analysis of behavioral patterns (for example, did the candidate verify a figure before using it?)
Hybrid: Combines the approaches above. AI Fluency is graded this way, centrally, by Gradient
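
The mapping from category to method described above can be written down as a small lookup, followed by a toy deterministic check. The identifiers here (`ScoringMethod`, `CATEGORY_METHOD`, `required_elements_present`) are hypothetical, and the substring check is a deliberately simple stand-in for whatever ground-truth checks the engine actually runs.

```python
from enum import Enum


class ScoringMethod(Enum):
    DETERMINISTIC = "deterministic"
    LLM_JUDGE = "llm_judge"
    EVENT_ANALYSIS = "event_analysis"
    HYBRID = "hybrid"


# Category-to-method mapping as stated in the list above.
CATEGORY_METHOD = {
    "correctness": ScoringMethod.DETERMINISTIC,
    "deliverable_quality": ScoringMethod.LLM_JUDGE,
    "reflection_quality": ScoringMethod.LLM_JUDGE,
    "prioritized_skills": ScoringMethod.LLM_JUDGE,
    "ai_fluency": ScoringMethod.HYBRID,
}


def required_elements_present(deliverable: str, required: list) -> float:
    """Toy deterministic check: fraction of required strings found in the text.

    Assumption: a case-insensitive substring match is enough for the example.
    """
    if not required:
        return 1.0
    text = deliverable.lower()
    found = sum(1 for element in required if element.lower() in text)
    return found / len(required)
```

A deterministic check like this returns the same result on every run, which is what separates it from an LLM Judge score.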

Percentiles

Scores are benchmarked against three baselines: other candidates on the same assessment, candidates in the same role category, and all candidates globally.
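
To make the three baselines concrete, here is a minimal sketch of a percentile calculation. It assumes a percentile means "the share of baseline scores at or below the candidate’s score"; the page does not state the exact definition Gradient uses, and the function and baseline names are hypothetical.

```python
from bisect import bisect_right
from typing import Optional


def percentile_rank(score: float, baseline: list) -> Optional[float]:
    """Percent of baseline scores that are at or below `score`.

    Returns None when the baseline is empty, since no ranking is possible.
    """
    if not baseline:
        return None
    ordered = sorted(baseline)
    return 100.0 * bisect_right(ordered, score) / len(ordered)


def benchmark(score: float, baselines: dict) -> dict:
    """Rank one score against each named baseline population."""
    return {name: percentile_rank(score, scores) for name, scores in baselines.items()}
```

The same score can land at very different percentiles depending on the baseline, which is why the three rankings are reported separately.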

Follow-up questions

The scoring engine also generates suggested follow-up questions for the hiring manager. These are specific, evidence-based questions tied to things the candidate did (or didn’t do) during the assessment. For example, if a candidate accepted incorrect data from the AI without verification, the engine might suggest:

“I noticed you included a revenue figure that doesn’t match the source document. Walk me through your verification process for AI-generated data.”

Each question includes:

The question itself
Rationale: Why this question matters (what the scoring engine observed)
What to look for: Strong vs. weak answers
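
The three parts of a follow-up question map naturally onto a small record. The sketch below is an assumed shape, not the engine’s actual output format: the `FollowUpQuestion` class and its field names are hypothetical, and the strong and weak answer descriptions are illustrative text written for this example.

```python
from dataclasses import dataclass, asdict


@dataclass
class FollowUpQuestion:
    question: str        # the question itself
    rationale: str       # what the scoring engine observed
    strong_answer: str   # "what to look for": signs of a strong answer
    weak_answer: str     # "what to look for": signs of a weak answer


example = FollowUpQuestion(
    question=(
        "I noticed you included a revenue figure that doesn't match the source "
        "document. Walk me through your verification process for AI-generated data."
    ),
    rationale="The candidate accepted incorrect data from the AI without verification.",
    # Illustrative descriptions only, not taken from Gradient's output.
    strong_answer="Describes a concrete habit of checking figures against the source.",
    weak_answer="Assumes AI output is correct unless something looks obviously wrong.",
)
```

Calling `asdict(example)` turns the record into a plain dictionary, which is convenient if the questions need to be serialized for an interview guide.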

Calibration

Gradient learns your standards over time. Reviewers grade candidates on each sub-criterion using its 0-4 anchored scale (the same ladder the AI grades against). When a reviewer’s grade disagrees with the AI, that disagreement feeds calibration, which sharpens future scoring for the assessment. Calibration routes each item to the right channel:

Durable and process items (Prioritized Skills, Reflection) refine the sub-criterion’s 0-4 anchor descriptors, so the ladder the judge reads captures what “good” means for this role. See anchored scales.
Deliverable Quality tunes a learned preference profile rather than anchor text, because that category is scored by comparison, not by a written ladder.

For anchor-based items, calibration works by sharpening a sub-criterion’s anchors, so an item with no anchors cannot be calibrated. AI Fluency is deliberately excluded from calibration: it is centrally managed and stays fixed. See the Scoring API for the calibration endpoints.
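
The routing rules above can be summarized in a few lines. This is a sketch under stated assumptions: the channel names and the `route_disagreement` function are hypothetical, and categories this page does not assign to a channel (such as Correctness) are treated here as producing no calibration signal.

```python
from typing import Optional

ANCHOR_CHANNEL = "anchor_descriptors"      # refines the 0-4 anchor text
PREFERENCE_CHANNEL = "preference_profile"  # tunes the learned preference profile

# Routing as described above. AI Fluency is intentionally absent: it is
# excluded from calibration.
ROUTES = {
    "prioritized_skills": ANCHOR_CHANNEL,
    "reflection_quality": ANCHOR_CHANNEL,
    "deliverable_quality": PREFERENCE_CHANNEL,
}


def route_disagreement(category: str, ai_grade: int, reviewer_grade: int) -> Optional[str]:
    """Return the calibration channel for a reviewer/AI disagreement, if any."""
    for grade in (ai_grade, reviewer_grade):
        if grade not in range(5):
            raise ValueError("grades must be integers from 0 to 4")
    if ai_grade == reviewer_grade:
        return None  # agreement produces no calibration signal
    return ROUTES.get(category)  # None for excluded or unspecified categories
```

Only a disagreement on a calibratable category yields a channel; matching grades and AI Fluency grades both return `None`.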

Admin review

After automated scoring, admins can:

Review the detailed score breakdown per category and sub-criterion
Replay the candidate’s session (every action, message, and edit)
Adjust individual category scores with the Override Score API
Add review notes
Release feedback to the candidate (immediately or after a configurable delay)
