Context Control
ra gives you full control over what the model sees and when. Built-in mechanisms handle the common cases automatically — compaction, caching, thinking, context discovery — and middleware hooks let you intercept everything else.
Smart context compaction
When conversations grow long, ra compacts automatically. It splits the history into three zones — pinned messages (system prompt, first user message), compactable middle, and recent turns — then drops the minimum messages from the back of the compactable zone needed to free space. This keeps [pinned, ...early_compactable] byte-identical to the cached prefix, so provider prompt caches (Anthropic, OpenAI, Google) get maximum reuse on the very next model call.
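The zone split can be sketched in a few lines of TypeScript. This is an illustrative sketch only, not ra's actual implementation: the function name, `Message` type, and explicit zone-size parameters are assumptions for clarity.

```typescript
type Message = { role: "system" | "user" | "assistant" | "tool"; content: string }

// Split history into pinned / compactable / recent zones, then drop
// `dropCount` messages from the *back* of the compactable zone.
function compactByTruncation(
  messages: Message[],
  pinnedCount: number,
  recentCount: number,
  dropCount: number
): Message[] {
  const pinned = messages.slice(0, pinnedCount)
  const middle = messages.slice(pinnedCount, messages.length - recentCount)
  const recent = messages.slice(messages.length - recentCount)
  // Keeping the *front* of the middle zone intact means [pinned, ...early
  // middle] stays byte-identical, so provider prefix caches still hit.
  const kept = middle.slice(0, Math.max(0, middle.length - dropCount))
  return [...pinned, ...kept, ...recent]
}
```

Note that dropping from the back of the middle zone (rather than the front) is what preserves the cached prefix: only messages at the old/recent boundary disappear.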
```yaml
agent:
  compaction:
    enabled: true
    threshold: 0.90     # trigger at 90% of context window
    strategy: truncate  # 'truncate' (default) or 'summarize'
```

Two strategies:

- `truncate` (default) — Drops messages from the back of the compactable zone (the transition between old and recent context). Free, instant, and maximally cache-friendly: the message prefix `[system, first_user, early_turns...]` stays byte-identical across compactions, giving maximum prefix cache hits on all providers.
- `summarize` — Calls a model to summarize the entire compactable zone, enriches the summary with programmatically extracted metadata (tools used, key files, pending work), and injects the result into the pinned user message. Costs an extra API call but preserves more context semantically.
```yaml
# opt into summarization if you prefer preserving context over cost
agent:
  compaction:
    strategy: summarize
    model: claude-haiku-4-5-20251001 # cheap model for summarization
```

Enriched summaries
When using the summarize strategy, ra doesn't just forward the LLM's summary verbatim. Before calling the model, it scans the compactable messages and extracts:
- Tool names — every tool called during the compacted portion
- File paths — code file references detected in message content (`.ts`, `.py`, `.rs`, `.go`, etc.)
- Pending work hints — messages containing keywords like "todo", "next step", "remaining", "pending"
The LLM is prompted to return structured XML tags (`<summary>`, `<pending_work>`, `<key_files>`) alongside its narrative summary. ra then merges the LLM output with the programmatic metadata, deduplicates file paths, and produces a single enriched summary that includes tools used, key files, and pending work sections.
When re-compacting an already-compacted session, the previous summary is preserved as a "Previously compacted context" section, so no context is silently lost across multiple compactions.
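The programmatic scan described above might look roughly like this. The regex and helper name are illustrative assumptions; the keyword list and extensions come from the text:

```typescript
// Sketch of the pre-summarization scan: collect file paths by extension
// and flag messages that hint at pending work. The regex is illustrative,
// not ra's real pattern.
const FILE_PATH = /[\w./-]+\.(?:ts|py|rs|go)\b/g
const PENDING_HINTS = ["todo", "next step", "remaining", "pending"]

function scanCompactable(texts: string[]) {
  const files = new Set<string>() // Set dedupes repeated paths
  const pendingWork: string[] = []
  for (const text of texts) {
    for (const match of text.match(FILE_PATH) ?? []) files.add(match)
    const lower = text.toLowerCase()
    if (PENDING_HINTS.some((hint) => lower.includes(hint))) pendingWork.push(text)
  }
  return { files: [...files], pendingWork }
}
```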
To fully control the summarization output, set a custom prompt — this bypasses the metadata formatting and uses the LLM response as-is:
```yaml
agent:
  compaction:
    strategy: summarize
    prompt: "Summarize this conversation in bullet points."
```

Key properties:
- Cache-friendly — Designed for provider prefix caching (Anthropic, OpenAI, Google). The truncate strategy keeps the message prefix as stable as possible across compactions — only the oldest messages change. The 0.90 threshold maximizes time between compactions.
- Token-aware — Uses real token counts from the provider when available, falls back to estimation.
- Pinned zones — System prompts and initial context never get compacted.
- Tool-call-aware — Boundaries never split an assistant message from its tool results.
- Provider-portable — Works the same across all providers.
- Dynamic context window learning — For unknown models (custom fine-tunes, local models, new releases), ra learns the real context window from provider errors. The first time a model hits a context limit, ra parses the actual size from the error message and caches it — all future compaction thresholds use the correct value automatically.
Context window resolution
ra resolves the context window in this order:
- Config override — `compaction.contextWindow` in your config
- Learned from errors — cached from a previous context length error
- Model registry — built-in lookup by model name prefix
If none of these match, ra skips proactive compaction and relies on the error-driven path. The first time the model rejects a request for exceeding its context limit, ra parses the real size from the error, caches it, and compacts. From that point on, proactive compaction works correctly.
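The three-step lookup can be sketched as follows. The registry entries and variable names here are illustrative placeholders, not ra's real table:

```typescript
// Sketch of the resolution order: config override, then learned-from-errors
// cache, then a prefix lookup in a built-in registry.
const MODEL_REGISTRY: Record<string, number> = {
  "example-large-": 200_000, // placeholder prefixes and sizes
  "example-small-": 32_000,
}
const learnedWindows = new Map<string, number>() // filled from provider errors

function resolveContextWindow(model: string, configOverride?: number): number | undefined {
  if (configOverride !== undefined) return configOverride // 1. config override
  const learned = learnedWindows.get(model)
  if (learned !== undefined) return learned // 2. learned from errors
  const prefix = Object.keys(MODEL_REGISTRY).find((p) => model.startsWith(p))
  if (prefix !== undefined) return MODEL_REGISTRY[prefix] // 3. model registry
  return undefined // unknown: fall back to error-driven compaction
}
```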
For best results with unknown models, set the context window explicitly:
```yaml
agent:
  compaction:
    contextWindow: 32000 # for a 32k model
```

Token tracking
ra tracks input and output tokens across every iteration of the loop. Your middleware can read cumulative usage via `ctx.loop.usage` and enforce budgets, log costs, or trigger compaction early.
```ts
// middleware/log-cost.ts
export default async (ctx) => {
  const { inputTokens, outputTokens } = ctx.loop.usage
  console.log(`Tokens used: ${inputTokens} in, ${outputTokens} out`)
}
```

Prompt caching
For Anthropic, ra automatically applies cache hints to system prompts and tool definitions. This reduces costs on multi-turn sessions without any configuration — cached tokens are billed at a reduced rate.
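On the wire, these hints are Anthropic `cache_control` breakpoints on the stable parts of the request. The sketch below shows roughly what ra constructs for you; the model name and tool are placeholders:

```typescript
// Roughly what automatic cache hints produce in an Anthropic request:
// `cache_control` markers on the system prompt and tool definitions, so
// those stable prefixes bill at the cached rate on later turns.
const request = {
  model: "claude-sonnet-example", // placeholder model name
  system: [
    { type: "text", text: "You are a coding agent.", cache_control: { type: "ephemeral" } },
  ],
  tools: [
    {
      name: "read_file",
      description: "Read a file from disk",
      input_schema: { type: "object", properties: { path: { type: "string" } } },
      cache_control: { type: "ephemeral" },
    },
  ],
  messages: [{ role: "user", content: "Refactor src/auth.ts" }],
}
```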
Extended thinking
Enable extended thinking for models that support it. Five modes control how the model reasons before responding.
| Mode | Behavior |
|---|---|
| `off` | Disabled (default) |
| `low` | Minimal reasoning budget |
| `medium` | Moderate reasoning budget |
| `high` | Maximum reasoning budget |
| `adaptive` | `high` for the first 10 iterations, then `low` — balances deep initial reasoning with faster follow-up turns |
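The adaptive schedule is simple enough to state as code (a sketch, assuming 1-indexed loop iterations; the 10-iteration cutoff is from the table above):

```typescript
// Adaptive thinking: deep reasoning on early iterations, faster turns later.
function adaptiveThinkingLevel(iteration: number): "high" | "low" {
  return iteration <= 10 ? "high" : "low"
}
```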
```bash
ra --thinking high "Design a database schema for a social network"
ra --thinking adaptive "Build a REST API"
```

```yaml
agent:
  thinking: adaptive
```

Optionally cap the thinking budget in tokens. The provider uses `min(levelBudget, cap)`:
```yaml
agent:
  thinking: high
  thinkingBudgetCap: 10000 # never exceed 10k thinking tokens
```

Thinking output streams to the terminal in the REPL, so you can watch the model reason in real time. In the HTTP API, thinking tokens are emitted as `{"type":"thinking","delta":"..."}` SSE events.
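The `min(levelBudget, cap)` rule can be sketched as follows. The per-level budget numbers are illustrative placeholders, not ra's real values:

```typescript
// Per-level thinking budgets combined with an optional thinkingBudgetCap.
const LEVEL_BUDGETS = { low: 2_000, medium: 8_000, high: 32_000 } as const

function effectiveThinkingBudget(level: keyof typeof LEVEL_BUDGETS, cap?: number): number {
  const budget = LEVEL_BUDGETS[level]
  return cap === undefined ? budget : Math.min(budget, cap)
}
```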
Context discovery
ra discovers and injects project context files into the conversation before your prompt. By default, ra looks for common convention files (CLAUDE.md, AGENTS.md, .cursorrules, .windsurfrules, .github/copilot-instructions.md). Configure which files to look for:
```yaml
agent:
  context:
    enabled: true
    patterns:
      - "CLAUDE.md"
      - "AGENTS.md"
      - "CONVENTIONS.md" # add your own patterns
```

ra walks the directory tree upward to the git root, finds matching files, and injects them as system context. This is useful for project conventions, coding standards, or any persistent instructions.
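The upward walk can be sketched as a pure path computation (a sketch with an assumed helper name; the real discovery also checks that each candidate file exists on disk):

```typescript
import { posix as path } from "node:path"

// Candidate context-file paths from the start directory up to the git root.
function candidateContextFiles(startDir: string, gitRoot: string, patterns: string[]): string[] {
  const candidates: string[] = []
  let dir = path.resolve(startDir)
  const root = path.resolve(gitRoot)
  while (true) {
    for (const pattern of patterns) candidates.push(path.join(dir, pattern))
    const parent = path.dirname(dir)
    if (dir === root || parent === dir) break // stop at git root (or filesystem root)
    dir = parent
  }
  return candidates
}
```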
Pattern resolution
Reference files and URLs inline in your prompts — ra resolves them before the model sees the message.
```bash
ra "explain what @src/auth.ts does"             # file contents injected
ra "review @src/utils/*.ts for consistency"     # glob expansion
ra "summarize url:https://example.com/api-docs" # fetched page content
```

Two built-in resolvers are enabled by default:
| Resolver | Syntax | Description |
|---|---|---|
| File | `@path` or `@glob` | Resolves file contents, supports glob patterns |
| URL | `url:https://...` | Fetches and inlines page content |
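A custom resolver is a module you point the config at. The `{ name, pattern, resolve }` shape below is a guess for illustration only; ra's actual resolver interface may differ:

```typescript
// Hypothetical resolver module (./resolvers/github-issues.ts). Shape assumed
// for illustration: a token pattern plus an async function that returns the
// text to inline in its place.
const githubIssuesResolver = {
  name: "issues",
  pattern: /issue:(\d+)/, // match tokens like issue:123 in the prompt
  async resolve(issueNumber: string): Promise<string> {
    // a real resolver would call the GitHub API here; stubbed out
    return `## Issue #${issueNumber}\n(issue body would be fetched here)`
  },
}
export default githubIssuesResolver
```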
Add custom resolvers for GitHub issues, database records, or anything else:
```yaml
agent:
  context:
    resolvers:
      - name: issues
        path: ./resolvers/github-issues.ts
```

Middleware hooks
For full programmatic control over context, use middleware. Every hook receives the full conversation history and can mutate it.
```yaml
agent:
  middleware:
    beforeModelCall:
      - "./middleware/enforce-budget.ts"
    afterToolExecution:
      - "./middleware/redact-secrets.ts"
```

```ts
// middleware/enforce-budget.ts — reject if context is too large
export default async (ctx) => {
  const totalChars = ctx.request.messages.reduce((n, m) => n + JSON.stringify(m).length, 0)
  if (totalChars > 500_000) ctx.stop()
}
```

See also
- Middleware — all hook types and context shapes
- Dynamic Prompts — advanced middleware patterns for context injection
- Configuration — compaction and context settings