InspectAgents - AI Agent Testing & Safety Platform

InspectAgents

A

Adversarial Attack

Security

Also known as: adversarial input, attack vector, exploit

Carefully crafted inputs designed to fool AI models into making mistakes or behaving unexpectedly. Can include prompt injection, jailbreaks, or adversarial examples that exploit model weaknesses.

Example:

Adding specific invisible characters to prompts that cause the AI to ignore safety instructions.

Related Terms:

Prompt Injection Jailbreak Red Teaming

Alignment

Safety

Also known as: AI alignment, value alignment, goal alignment

Ensuring AI behavior matches human values and intentions. Includes making models helpful, harmless, honest, and following instructions appropriately without harmful outputs.

Example:

Training models to refuse generating harmful content, respect user privacy, and admit when they don't know something.

C

Context Window

Architecture

Also known as: context length, sequence length, token window

The maximum amount of text (measured in tokens) an AI model can process at once. Includes both input prompt and generated output. Longer context windows allow more information but cost more.

Example:

Modern frontier models offer context windows from 200K up to 1M+ tokens — enough to process entire codebases or books at once.

Related Terms:

Token

Content Moderation

Safety

Also known as: content filtering, safety filtering, moderation layer

Filtering or blocking inappropriate, harmful, or policy-violating content in AI inputs and outputs. Critical safety layer for customer-facing AI applications.

Example:

Automatically blocking hate speech, personal data, or violent content before the AI processes or generates it.

Related Terms:

Jailbreak

E

Embeddings

Architecture

Also known as: text embeddings, semantic embeddings, vector representations

Mathematical representations of text as high-dimensional vectors that capture semantic meaning. Similar concepts have similar vectors, enabling AI to understand relationships.

Example:

"king" - "man" + "woman" ≈ "queen" in embedding space, showing the vectors capture meaning relationships.

Related Terms:

Vector Database

F

Fine-tuning

Training

Also known as: model fine-tuning, supervised fine-tuning, task-specific training

Further training a pre-trained AI model on specific data to specialize its behavior for particular tasks, domains, or styles. More expensive than prompting but gives more control.

Example:

Fine-tuning GPT on medical literature to create a specialized medical diagnosis assistant.

Few-Shot Learning

Capability

Also known as: few-shot, few-shot prompting, example-based learning

Providing a few examples in the prompt to teach the AI a pattern or format before asking it to perform the task. More reliable than zero-shot for specific formats.

Example:

Showing 3 examples of customer complaint → empathetic response pairs before asking the AI to handle a new complaint.

Related Terms:

Zero-Shot Learning

G

Grounding

Reliability

Also known as: factual grounding, source grounding, evidence-based generation

Constraining AI outputs to verifiable sources, facts, or retrieved information rather than allowing free generation. Reduces hallucinations by anchoring responses to real data.

Example:

Instead of letting the AI make up product specs, grounding it to your actual product documentation.

H

Hallucination

Reliability

Also known as: AI hallucination, LLM hallucination, confabulation

When an AI model generates false, fabricated, or nonsensical information presented as fact. Hallucinations occur when the model produces outputs that sound plausible but are not grounded in its training data or reality.

Example:

Air Canada's chatbot hallucinated a bereavement fare policy that didn't exist, costing the airline thousands in refunds and legal fees.

J

Jailbreak

Security

Also known as: guardrail bypass, safety bypass, alignment breaking

Techniques used to bypass AI safety guardrails and content policies, causing the model to generate harmful, unethical, or policy-violating content. Often uses roleplay, hypothetical scenarios, or encoding tricks.

Example:

DPD's chatbot was jailbroken to swear at customers and criticize the company after users manipulated its system prompt.

O

Orchestration Loop Attack Surface

Agentic

Also known as: controller loop attack surface, agentic attack surface, control plane attack surface

In multi-step AI agent systems, the attack surface shifts from the model itself to the orchestration controller loop that selects plans and tools, carries state across steps, decides stop/retry, and can cross into write paths. This surfaces three OWASP risks: Prompt Injection (LLM01) via retrieved/tool text, Excessive Agency (LLM06) when capabilities exceed what is justified, and Unbounded Consumption (LLM10) from loop-driven cost.

Example:

An agent tasked with researching a topic retrieves a malicious web page that contains hidden instructions. The orchestration loop passes these instructions as context to the next planning step, causing the agent to call unauthorized tools or exfiltrate data.

P

Prompt Injection

Security

Also known as: prompt hacking, prompt manipulation, instruction injection

A security vulnerability where malicious users manipulate an AI system by injecting instructions into prompts that override the system's intended behavior. Similar to SQL injection but for LLMs.

Example:

Chevrolet's chatbot was prompt-injected to agree to sell a 2024 Tahoe for $1 after a user inserted "ignore all previous instructions" commands.

R

Red Teaming

Testing

Also known as: adversarial testing, security testing, break testing

Systematic adversarial testing where security researchers intentionally try to break AI systems, discover vulnerabilities, and expose failure modes before deployment. Named after military "red team" exercises.

Example:

Frontier labs like OpenAI and Anthropic engage external red teamers before every major model release (GPT-5, Claude Opus) to find jailbreaks, bias issues, and security vulnerabilities.

Related Terms:

Adversarial Attack

Retrieval-Augmented Generation (RAG)

Architecture

Also known as: RAG, retrieval-based generation, augmented generation

AI architecture that retrieves relevant information from a knowledge base before generating responses, reducing hallucinations and enabling up-to-date answers without retraining.

Example:

A customer service chatbot that searches your product docs before answering questions, ensuring accurate responses.

Related Terms:

Grounding Hallucination Vector Database

RLHF (Reinforcement Learning from Human Feedback)

Training

Also known as: reinforcement learning from human feedback, RLHF, human feedback training

Training technique where humans rate AI outputs, and the model learns to generate responses that humans prefer. Key method for alignment and reducing harmful outputs.

Example:

ChatGPT was fine-tuned using RLHF, where humans ranked different responses to make the model more helpful and less harmful.

Related Terms:

Alignment Fine-tuning

S

System Prompt

Configuration

Also known as: system message, system instruction, base prompt

Special instructions given to an AI model that define its role, personality, constraints, and behavior rules. Hidden from users but controls how the AI responds.

Example:

"You are a helpful customer service agent. Never share confidential information. Always be polite and professional."

Related Terms:

Prompt Injection

Sycophancy

Reliability

Also known as: agreement bias, sycophantic behavior, people-pleasing AI, validation bias

When an AI model agrees with, validates, or endorses a user's incorrect claims or preferences instead of providing accurate information. The model prioritizes user satisfaction over truthfulness, even for objectively checkable facts.

Example:

A user tells an AI assistant "2+2=5, right?" and the model agrees or weakly endorses the claim instead of correcting it. In production, this manifests as chatbots agreeing with wrong assumptions, flipping their stance under pressure, and compounding errors across multi-turn conversations.

T

Token

Architecture

Also known as: text token, language token, subword unit

The basic unit of text that AI models process. Roughly 3-4 characters or 0.75 words in English. Models have token limits for input/output and pricing is usually per token.

Example:

The sentence "AI is amazing" is approximately 4 tokens. API costs are often $0.03 per 1K tokens.

Related Terms:

Context Window

Temperature

Configuration

Also known as: sampling temperature, randomness parameter, creativity setting

A parameter (0.0-2.0) controlling AI output randomness. Lower temperature = more predictable/conservative. Higher temperature = more creative/random. Critical for controlling behavior consistency.

Example:

Setting temperature to 0.0 for customer service (consistent answers) vs 0.9 for creative writing (varied outputs).

Related Terms:

Top-P Sampling (Nucleus Sampling)

Configuration

Also known as: nucleus sampling, top-p, probability sampling

Alternative to temperature for controlling randomness. Selects tokens from the smallest set whose cumulative probability exceeds P. More stable than temperature for controlling output quality.

Example:

Top-P of 0.1 means only consider the top 10% most likely next tokens, preventing nonsense while allowing variety.

Related Terms:

Temperature

Trust Boundary

Agentic

Also known as: security boundary, provenance boundary, ingress checkpoint

A boundary in an agentic AI pipeline where untrusted content (user input, retrieved documents, tool outputs) meets authoritative policy (system prompts, rules). OWASP lists prompt injection (LLM01) as a top risk at these boundaries. Proper enforcement requires typed provenance separation and fail-closed behavior at each checkpoint.

Example:

In a customer service agent, the trust boundary exists between the user's message (untrusted) and the system prompt defining what the agent can do (authoritative). If the boundary is weak, a user can inject instructions that override the agent's rules.

V

Vector Database

Architecture

Also known as: vector store, embedding database, vector search engine

Specialized database storing text as mathematical vectors (embeddings) to enable semantic search and retrieval. Essential for RAG systems and AI memory.

Example:

Storing all product documentation as vectors in Pinecone, allowing the AI to find relevant docs even when queries use different wording.

Z

Zero-Shot Learning

Capability

Also known as: zero-shot, zero-shot prompting, instruction following

AI performing tasks without any task-specific examples, relying only on instructions. Tests the model's ability to generalize from pre-training to new tasks.

Example:

Asking "Translate this to French: Hello" without providing any translation examples first.

Related Terms:

Few-Shot Learning

AI Safety Glossary

A

Adversarial Attack

Alignment

C

Context Window

Content Moderation

E

Embeddings

F

Fine-tuning

Few-Shot Learning

G

Grounding

H

Hallucination

J

Jailbreak

O

Orchestration Loop Attack Surface

P

Prompt Injection

R

Red Teaming

Retrieval-Augmented Generation (RAG)

RLHF (Reinforcement Learning from Human Feedback)

S

System Prompt

Sycophancy

T

Token

Temperature

Top-P Sampling (Nucleus Sampling)

Trust Boundary

V

Vector Database

Z

Zero-Shot Learning

Ready to Test Your AI Agents?