Scoring
Gradient’s scoring engine uses an LLM-as-judge approach to evaluate candidate submissions across multiple dimensions. Our scoring framework is informed by Anthropic’s AI Fluency Index, which identifies the core competencies that distinguish effective AI-augmented knowledge workers. The result is a detailed score breakdown with percentile rankings and suggested follow-up interview questions.
Scoring categories
The default rubric evaluates candidates across five categories totaling 100 points:

| Category | Points | What it measures |
|---|---|---|
| Deliverable Quality | 40 | Content accuracy, analytical depth, structure, visual quality |
| Delegation | 10 | How effectively the candidate directed the AI |
| Description | 20 | Clarity of prompts and instructions given to the AI |
| Discernment | 20 | Ability to evaluate AI output, catch errors, and verify information |
| Diligence | 10 | Thoroughness in workspace setup, data source usage, and iteration |
These categories can be customized per assessment using the scoring rubric configuration. See the Custom Rubrics guide.
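As a rough illustration, the default rubric can be modelled as plain data and checked so that its category maximums sum to 100. This is a minimal sketch, not Gradient's actual rubric configuration format: the `Category` class and the `validate_rubric` and `total_score` helpers are hypothetical names introduced here, and the requirement that a rubric total exactly 100 points is an assumption carried over from the default rubric.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Category:
    name: str
    points: int  # maximum points available in this category


# Category names and maximums are taken from the default rubric table.
DEFAULT_RUBRIC = [
    Category("Deliverable Quality", 40),
    Category("Delegation", 10),
    Category("Description", 20),
    Category("Discernment", 20),
    Category("Diligence", 10),
]


def validate_rubric(rubric, expected_total=100):
    """Return the rubric total, raising ValueError if it differs from expected_total."""
    total = sum(c.points for c in rubric)
    if total != expected_total:
        raise ValueError(f"rubric totals {total}, expected {expected_total}")
    return total


def total_score(rubric, earned):
    """Sum earned points per category, rejecting values outside 0..maximum."""
    result = 0
    for c in rubric:
        pts = earned.get(c.name, 0)  # a missing category counts as zero
        if not 0 <= pts <= c.points:
            raise ValueError(f"{c.name}: {pts} is outside 0..{c.points}")
        result += pts
    return result
```

A customized rubric would replace `DEFAULT_RUBRIC` with its own list of categories and run the same validation before use.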
How scoring works
Trigger
After a candidate submits, an admin triggers scoring via the dashboard or the Scoring API. Scoring can also be re-run on already-scored sessions.
Analysis
The scoring engine runs multiple analysis modules in parallel. Each module examines different aspects of the submission: the deliverable itself, the conversation transcript, event patterns, and the AI configuration snapshot.
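The fan-out described above can be sketched with `asyncio.gather`, which runs several coroutines concurrently and collects their results. Everything in this sketch is assumed for illustration: the module names, the shape of the `submission` dictionary, and the trivial statistics each module returns stand in for the real analysis, which is not documented here.

```python
import asyncio


# Each hypothetical module inspects one aspect of the submission.
async def analyze_deliverable(submission):
    return {"module": "deliverable", "word_count": len(submission["deliverable"].split())}


async def analyze_transcript(submission):
    return {"module": "transcript", "turns": len(submission["transcript"])}


async def analyze_events(submission):
    return {"module": "events", "event_count": len(submission["events"])}


async def analyze_ai_config(submission):
    return {"module": "ai_config", "keys": sorted(submission["ai_config"])}


MODULES = [analyze_deliverable, analyze_transcript, analyze_events, analyze_ai_config]


async def run_analysis(submission):
    """Run every module concurrently and index the results by module name."""
    results = await asyncio.gather(*(module(submission) for module in MODULES))
    return {r["module"]: r for r in results}
```

Concurrency of this kind suits modules that spend most of their time waiting on an LLM or another network service. CPU-bound analysis would need processes or worker threads instead.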
Scoring
Each sub-criterion receives a score using one of three methods:
- LLM Judge: An AI evaluator assesses quality against rubric criteria
- Event Analysis: Automated analysis of behavioral patterns (e.g., did they connect data sources?)
- Hybrid: Combines both approaches
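One way to picture the three methods is as a dispatch over a sub-criterion's scoring method. The sketch below is an assumption-laden illustration, not the engine's implementation: it assumes each method yields a fraction between 0 and 1, that hybrid scoring is a weighted average, and that the weight defaults to 0.5. The `Method` enum and `score_sub_criterion` function are hypothetical names.

```python
from enum import Enum


class Method(Enum):
    LLM_JUDGE = "llm_judge"
    EVENT_ANALYSIS = "event_analysis"
    HYBRID = "hybrid"


def score_sub_criterion(method, max_points, judge_fraction=None,
                        event_fraction=None, judge_weight=0.5):
    """Convert 0..1 fractions from the judge and/or event analysis into points.

    The weighted average used for HYBRID is an assumption made for this sketch.
    """
    def check(name, value):
        if value is None or not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be a number between 0 and 1")
        return value

    if method is Method.LLM_JUDGE:
        fraction = check("judge_fraction", judge_fraction)
    elif method is Method.EVENT_ANALYSIS:
        fraction = check("event_fraction", event_fraction)
    elif method is Method.HYBRID:
        weight = check("judge_weight", judge_weight)
        fraction = (weight * check("judge_fraction", judge_fraction)
                    + (1.0 - weight) * check("event_fraction", event_fraction))
    else:
        raise ValueError(f"unknown method: {method!r}")
    return round(max_points * fraction, 2)
```

For example, a 10-point hybrid sub-criterion with a judge fraction of 0.8 and an event fraction of 0.4 would receive 6.0 points under the default equal weighting.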
Follow-up questions
The scoring engine also generates suggested follow-up questions for the hiring manager. These are specific, evidence-based questions tied to things the candidate did (or didn’t do) during the assessment. For example, if a candidate accepted incorrect data from the AI without verification, the engine might suggest:
“I noticed you included a revenue figure that doesn’t match the source document. Walk me through your verification process for AI-generated data.”
Each question includes:
- The question itself
- Rationale: Why this question matters (what the scoring engine observed)
- What to look for: Strong vs. weak answers
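The three parts listed above map naturally onto a small record type. The following is a sketch only: the `FollowUpQuestion` class, its field names, and the `render` helper are hypothetical and do not describe the actual response schema. The question text is the example quoted earlier, while the strong and weak answer descriptions are invented placeholders.

```python
from dataclasses import dataclass


@dataclass
class FollowUpQuestion:
    question: str
    rationale: str      # what the scoring engine observed
    strong_answer: str  # "what to look for": signs of a strong answer
    weak_answer: str    # "what to look for": signs of a weak answer


def render(q):
    """Format a question as four labelled lines for an interviewer."""
    return "\n".join([
        f"Question: {q.question}",
        f"Rationale: {q.rationale}",
        f"Strong answer: {q.strong_answer}",
        f"Weak answer: {q.weak_answer}",
    ])


example = FollowUpQuestion(
    question=("I noticed you included a revenue figure that doesn't match the "
              "source document. Walk me through your verification process for "
              "AI-generated data."),
    rationale="The candidate accepted incorrect data from the AI without verification.",
    # Placeholder guidance, invented for this sketch.
    strong_answer="Describes checking figures against the source document.",
    weak_answer="Says the AI output was assumed to be correct.",
)
```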
Admin review
After automated scoring, admins can:
- Review the detailed score breakdown per category and sub-criterion
- Replay the candidate’s session (every action, message, and edit)
- Adjust individual category scores with the Override Score API
- Add review notes
- Release feedback to the candidate (immediately or after a configurable delay)
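The override and delayed-release steps can be modelled locally as follows. This sketch does not call the Override Score API, whose request format is not described here. The function names, the audit record, and the idea that the release time is the scoring time plus a fixed delay are all assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def apply_override(scores, maximums, category, new_score, note=""):
    """Return (updated_scores, audit_record) without mutating the input scores."""
    if category not in scores:
        raise KeyError(f"unknown category: {category}")
    if not 0 <= new_score <= maximums[category]:
        raise ValueError(f"{new_score} is outside 0..{maximums[category]}")
    updated = dict(scores)
    updated[category] = new_score
    audit = {
        "category": category,
        "previous": scores[category],
        "new": new_score,
        "note": note,  # the admin's review note, if any
    }
    return updated, audit


def feedback_release_time(scored_at, delay=timedelta(0)):
    """Earliest time feedback becomes visible; a zero delay means immediately."""
    return scored_at + delay


# Usage: lower one category score and schedule feedback for 24 hours later.
maximums = {"Deliverable Quality": 40, "Discernment": 20}
scores = {"Deliverable Quality": 32, "Discernment": 18}
updated, audit = apply_override(
    scores, maximums, "Discernment", 12, note="Missed a data mismatch"
)
scored_at = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
release_at = feedback_release_time(scored_at, timedelta(hours=24))
```

Returning a new dictionary together with an audit record keeps the automated score available for comparison after an admin adjusts it.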