Prompt
Review
Who reviewed this prompt change?
Versioning
Which prompt version produced this response?
Experiments
Are prompt experiments affecting users unevenly?
Inference
Model Routing
Which model should handle each request?
Fallback Chain
What happens when model providers fail or degrade?
Model API
How consistent is behavior across model providers?
Serving
Hosting
Where should hosted or open-weight models run?
Runtime
Which runtime serves model tokens?
Hardware
Which hardware should back inference?
Workflows
Can predefined multi-step AI workflows run reliably?
Agent Harness
Which runtime coordinates model loops and tools?
Agent Platform
How are multiple agents run and coordinated?
Context
Relevance Evals
Is retrieved context relevant?
Completeness Evals
Is retrieved context complete enough?
Faithfulness Evals
Are answers citing unsupported facts?
Memory
Working Memory
Is the current conversation carrying the right working state?
Long-Term Memory
Should useful knowledge persist across conversations?
Summarization
Can long conversations be compressed without losing essentials?
Retrieval
When should remembered information be retrieved?
Conflicts
What happens when memories disagree?
Memory Evals
Is memory making future answers better?
Guardrails
Prompt Injection Defense
Can users override system instructions?
Indirect Injection Defense
Can retrieved content inject malicious instructions?
PII Redaction
Is sensitive data being sent to model providers?
Output Guardrails
Can sensitive data leak back to users?
Budgets and Rate Limits
Can abuse or runaway loops create cost and safety risk?
Actions
What should happen when a guardrail fires?
Tool Guardrails
Can compromised context trigger unsafe tool calls?
Agents
Tool Registry
Which tools exist, what do they accept, and when are they allowed?
Tool Selection
Which tools should be loaded for this step?
Approval
Which agent actions require a human decision?
Tool Storage
Can tool outputs be stored for later lookup?
Browser
Can the agent operate web apps through a browser?
Subagents
Can work be delegated to specialized agents?
Protocols
How do agents connect to tools, data, and other agents?
Sandbox
Can the agent run untrusted code?
Permissions
Do agent permissions exceed task scope?
Interface
Input
Can users provide the input format the task needs?
Streaming
Can users follow long answers while they are being generated?
Reasoning
Can the interface show useful reasoning or progress state?
Tool UI
Can tool calls render as usable product UI?
Citations
Can users inspect the sources behind an answer?
Files
Can users upload and manage files in the agent workflow?
Stop
Can users stop long-running answers or actions?
Branching
Can users fork from an earlier conversation state?
Async
Does the UI survive disconnects and background work?
Feedback
Annotation
Can domain experts review outputs without engineering help?
Ingestion
Does user feedback improve eval sets?
Quality
Regression
Did output quality regress?
CI
Will this model upgrade break existing workflows?
Datasets
Do test cases reflect real user behavior?
Workflow
Can production failures become reusable eval cases?
Scorers
Which scorer should judge the answer?
Eval Types
Should this eval be rule-based, vector-based, or model-based?
Matrix
Which model and prompt combination works best?
Platforms
Where should eval runs, datasets, and review live?
Cost
Attribution
Who owns token spend by user, team, or workflow?
Units
Which unit should cost be measured against?
Forecasting
How will model spend change as usage grows?
Alerting
Will the team notice sudden cost spikes?
Observability
Tracing
Why did this answer happen?
Diagnosis
Did the failure come from prompt, retrieval, tool, model, or policy?
Replay
Can this agent run be reconstructed later?
Drift
Is production behavior drifting over time?
Performance
Monitoring
Why are responses too slow for the product experience?
Caching
Which repeated requests are wasting tokens?
Segments
Which model, prompt, or user segment is slow?
Concurrency
How many users can stream at the same time?
Throughput
How quickly does each model produce tokens?
Governance
Audit Logging
Is there an audit trail for AI decisions?
Evidence Collection
Is compliance evidence scattered across tools?
Data Retention
Are vendor data-retention rules clear?