LLM Cards

A curated map of tools for LLM operations: gateways, evals, deployment, guardrails, agents, and UI.

Prompt

Review

Who reviewed this prompt change?

Versioning

Which prompt version produced this response?

Experiments

Are prompt experiments affecting users unevenly?

Inference

Model Routing

Which model should handle each request?

Fallback Chain

What happens when model providers fail or degrade?

Model API

How consistent is behavior across model providers?

Serving

Hosting

Where should hosted or open-weight models run?

Runtime

Which runtime serves model tokens?

Hardware

Which hardware should back inference?

Workflows

Can predefined multi-step AI work run reliably?

Agent Harness

Which runtime coordinates model loops and tools?

Agent Platform

How are multiple agents run and coordinated?

Context

Relevance Evals

Is retrieved context relevant?

Completeness Evals

Is retrieved context complete enough?

Faithfulness Evals

Are answers stating facts the retrieved context does not support?

Memory

Working Memory

Is the current conversation carrying the right working state?

Long-Term Memory

Should useful knowledge persist across conversations?

Summarization

Can long conversations be compressed without losing essentials?

Retrieval

When should remembered information be retrieved?

Conflicts

What happens when memories disagree?

Memory Evals

Is memory making future answers better?

Guardrails

Prompt Injection Defense

Can users override system instructions?

Indirect Injection Defense

Can retrieved content inject malicious instructions?

PII Redaction

Is sensitive data being sent to model providers?

Output Guardrails

Can sensitive data leak back to users?

Budgets and Rate Limits

Can abuse or runaway loops create cost and safety risk?

Actions

What should happen when a guardrail fires?

Tool Guardrails

Can attacker-controlled context trigger unsafe tool calls?

Agents

Tool Registry

Which tools exist, what do they accept, and when are they allowed?

Tool Selection

Which tools should be loaded for this step?

Approval

Which agent actions require a human decision?

Tool Storage

Can tool outputs be stored for later lookup?

Browser

Can the agent operate web apps through a browser?

Subagents

Can work be delegated to specialized agents?

Protocols

How do agents connect to tools, data, and other agents?

Sandbox

Can the agent run untrusted code?

Permissions

Do agent permissions exceed task scope?

Interface

Input

Can users provide the input format the task needs?

Streaming

Can users follow long answers while they are being generated?

Reasoning

Can the interface show useful reasoning or progress state?

Tool UI

Can tool calls render as usable product UI?

Citations

Can users inspect the sources behind an answer?

Files

Can users upload and manage files in the agent workflow?

Stop

Can users stop long-running answers or actions?

Branching

Can users fork from an earlier conversation state?

Async

Does the UI survive disconnects and background work?

Feedback

Annotation

Can domain experts review outputs without engineering help?

Ingestion

Does user feedback improve eval sets?

Quality

Regression

Did output quality regress?

CI

Will this model upgrade break existing workflows?

Datasets

Do test cases reflect real user behavior?

Workflow

Can production failures become reusable eval cases?

Scorers

Which scorer should judge the answer?

Eval Types

Should this eval be rule-based, vector-based, or model-based?

Matrix

Which model and prompt combination works best?

Platforms

Where should eval runs, datasets, and review live?

Cost

Attribution

Who owns token spend by user, team, or workflow?

Units

Which unit should cost be measured against?

Forecasting

How will model spend change as usage grows?

Alerting

Will the team notice sudden cost spikes?

Observability

Tracing

Why did this answer happen?

Diagnosis

Did the failure come from prompt, retrieval, tool, model, or policy?

Replay

Can this agent run be reconstructed later?

Drift

Is production behavior drifting over time?

Performance

Monitoring

Why are responses too slow for the product experience?

Caching

Which repeated requests are wasting tokens?

Segments

Which model, prompt, or user segment is slow?

Concurrency

How many users can stream at the same time?

Throughput

How quickly does each model produce tokens?

Governance

Audit Logging

Is there an audit trail for AI decisions?

Evidence Collection

Is compliance evidence scattered across tools?

Data Retention

Are vendor data-retention rules clear?