
Context Control

ra gives you full control over what the model sees and when. Built-in mechanisms handle the common cases automatically — compaction, caching, thinking, context discovery — and middleware hooks let you intercept everything else.

Smart context compaction

When conversations grow long, ra compacts automatically. It splits the history into three zones — pinned messages (system prompt, first user message), compactable middle, and recent turns — then drops the minimum messages from the back of the compactable zone needed to free space. This keeps [pinned, ...early_compactable] byte-identical to the cached prefix, so provider prompt caches (Anthropic, OpenAI, Google) get maximum reuse on the very next model call.

```yaml
agent:
  compaction:
    enabled: true
    threshold: 0.90              # trigger at 90% of context window
    strategy: truncate           # 'truncate' (default) or 'summarize'
```

Two strategies:

  • truncate (default) — Drops messages from the back of the compactable zone (the transition between old and recent context). Free, instant, and maximally cache-friendly: the message prefix [system, first_user, early_turns...] stays byte-identical across compactions, giving maximum prefix cache hits on all providers.
  • summarize — Calls a model to summarize the entire compactable zone, enriches the summary with programmatically extracted metadata (tools used, key files, pending work), and injects the result into the pinned user message. Costs an extra API call but preserves more context semantically.

```yaml
# opt into summarization if you prefer preserving context over cost
agent:
  compaction:
    strategy: summarize
    model: claude-haiku-4-5-20251001  # cheap model for summarization
```
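The zone-splitting behind the truncate strategy is straightforward to sketch. Everything below is illustrative, not ra's internals: the function names, the two-message pinned zone, the fixed recent-turn count, and the rough 4-chars-per-token estimate are all assumptions.

```typescript
// Sketch of truncate-style compaction: split history into pinned /
// compactable / recent zones, then drop messages from the BACK of the
// compactable zone until the conversation fits the budget. Dropping from
// the back keeps [pinned, ...early_compactable] byte-identical, which is
// what preserves the provider's cached prefix.
type Msg = { role: string; content: string }

// Crude fallback estimate (~4 chars per token); real token counts win when available.
const estimate = (msgs: Msg[]) =>
  msgs.reduce((n, m) => n + Math.ceil(m.content.length / 4), 0)

function truncateCompact(history: Msg[], budget: number, recentCount = 4): Msg[] {
  const pinned = history.slice(0, 2)                   // system prompt + first user message
  const recent = history.slice(-recentCount)           // latest turns stay verbatim
  let middle = history.slice(2, history.length - recentCount)

  while (middle.length > 0 && estimate([...pinned, ...middle, ...recent]) > budget) {
    middle = middle.slice(0, -1)                       // drop from the back, not the front
  }
  return [...pinned, ...middle, ...recent]
}
```

Note the invariant: across repeated compactions the surviving prefix never changes, so each subsequent model call replays the same cached bytes.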

Enriched summaries

When using the summarize strategy, ra doesn't just forward the LLM's summary verbatim. Before calling the model, it scans the compactable messages and extracts:

  • Tool names — every tool called during the compacted portion
  • File paths — code file references detected in message content (.ts, .py, .rs, .go, etc.)
  • Pending work hints — messages containing keywords like "todo", "next step", "remaining", "pending"

The LLM is prompted to return structured XML tags (<summary>, <pending_work>, <key_files>) alongside its narrative summary. ra then merges the LLM output with the programmatic metadata, deduplicates file paths, and produces a single enriched summary that includes tools used, key files, and pending work sections.

When re-compacting an already-compacted session, the previous summary is preserved as a "Previously compacted context" section, so no context is silently lost across multiple compactions.
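The programmatic metadata pass described above could look roughly like this. The shapes, regexes, and function name here are illustrative assumptions, not ra's actual extraction code:

```typescript
// Hypothetical sketch of the pre-summarization scan: collect tool names,
// code file paths, and pending-work hints from the compactable messages.
type Msg = { role: string; content: string; toolName?: string }

const CODE_FILE = /\b[\w./-]+\.(ts|py|rs|go|js|java|c|cpp)\b/g
const PENDING = /\b(todo|next step|remaining|pending)\b/i

function extractMetadata(msgs: Msg[]) {
  const tools = new Set<string>()
  const files = new Set<string>()
  const pendingHints: string[] = []
  for (const m of msgs) {
    if (m.toolName) tools.add(m.toolName)                      // tools used
    for (const f of m.content.match(CODE_FILE) ?? []) files.add(f) // key files, deduped
    if (PENDING.test(m.content)) pendingHints.push(m.content)  // pending work
  }
  return { tools: [...tools], files: [...files], pendingHints }
}
```

The Set-based collection is what makes the later merge with the LLM's `<key_files>` output naturally deduplicate.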

To fully control the summarization output, set a custom prompt — this bypasses the metadata formatting and uses the LLM response as-is:

```yaml
agent:
  compaction:
    strategy: summarize
    prompt: "Summarize this conversation in bullet points."
```

Key properties:

  • Cache-friendly — Designed for provider prefix caching (Anthropic, OpenAI, Google). The truncate strategy keeps the message prefix as stable as possible across compactions — only the oldest messages change. The 0.90 threshold maximizes time between compactions.
  • Token-aware — Uses real token counts from the provider when available, falls back to estimation.
  • Pinned zones — System prompts and initial context never get compacted.
  • Tool-call-aware — Boundaries never split an assistant message from its tool results.
  • Provider-portable — Works the same across all providers.
  • Dynamic context window learning — For unknown models (custom fine-tunes, local models, new releases), ra learns the real context window from provider errors. The first time a model hits a context limit, ra parses the actual size from the error message and caches it — all future compaction thresholds use the correct value automatically.

Context window resolution

ra resolves the context window in this order:

  1. Config override — compaction.contextWindow in your config
  2. Learned from errors — cached from a previous context length error
  3. Model registry — built-in lookup by model name prefix

If none of these match, ra skips proactive compaction and relies on the error-driven path. The first time the model rejects a request for exceeding its context limit, ra parses the real size from the error, caches it, and compacts. From that point on, proactive compaction works correctly.
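That three-step order plus the error-driven fallback could be sketched as follows. The registry values, the learned-size cache, and the error message format are all illustrative assumptions:

```typescript
// Illustrative sketch of context window resolution. The prefix registry
// and error wording below are made up for the example.
const registry: Record<string, number> = { 'claude-': 200_000, 'gpt-4o': 128_000 }
const learned = new Map<string, number>()

function resolveContextWindow(model: string, configOverride?: number): number | undefined {
  if (configOverride) return configOverride            // 1. config override
  if (learned.has(model)) return learned.get(model)    // 2. learned from a past error
  const prefix = Object.keys(registry).find(p => model.startsWith(p))
  return prefix ? registry[prefix] : undefined         // 3. registry prefix match
}

// Error-driven path: parse the real size out of a context-length error and cache it,
// so proactive compaction works on every later call.
function learnFromError(model: string, errorMessage: string) {
  const m = errorMessage.match(/maximum context length is (\d+)/i)
  if (m) learned.set(model, Number(m[1]))
}
```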

For best results with unknown models, set the context window explicitly:

```yaml
agent:
  compaction:
    contextWindow: 32000   # for a 32k model
```

Token tracking

ra tracks input and output tokens across every iteration of the loop. Your middleware can read cumulative usage via ctx.loop.usage and enforce budgets, log costs, or trigger compaction early.

```ts
// middleware/log-cost.ts
export default async (ctx) => {
  const { inputTokens, outputTokens } = ctx.loop.usage
  console.log(`Tokens used: ${inputTokens} in, ${outputTokens} out`)
}
```

Prompt caching

For Anthropic, ra automatically applies cache hints to system prompts and tool definitions. This reduces costs on multi-turn sessions without any configuration — cached tokens are billed at a reduced rate.

Extended thinking

Enable extended thinking for models that support it. Five modes control how the model reasons before responding.

| Mode | Behavior |
| --- | --- |
| off | Disabled (default) |
| low | Minimal reasoning budget |
| medium | Moderate reasoning budget |
| high | Maximum reasoning budget |
| adaptive | high for the first 10 iterations, then low — balances deep initial reasoning with faster follow-up turns |
```bash
ra --thinking high "Design a database schema for a social network"
ra --thinking adaptive "Build a REST API"
```

```yaml
agent:
  thinking: adaptive
```

Optionally cap the thinking budget in tokens. The provider uses min(levelBudget, cap):

```yaml
agent:
  thinking: high
  thinkingBudgetCap: 10000   # never exceed 10k thinking tokens
```
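That min() rule is simple enough to sketch directly. The per-level budgets below are made up for illustration; only the min(levelBudget, cap) behavior comes from the docs:

```typescript
// Illustrative per-level thinking budgets (not ra's actual numbers).
const LEVEL_BUDGET: Record<string, number> = { low: 2_000, medium: 8_000, high: 32_000 }

// The effective budget is the level's budget, clamped by the optional cap.
const effectiveBudget = (level: string, cap?: number) =>
  cap ? Math.min(LEVEL_BUDGET[level], cap) : LEVEL_BUDGET[level]
```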

Thinking output streams to the terminal in the REPL, so you can watch the model reason in real time. In the HTTP API, thinking tokens are emitted as {"type":"thinking","delta":"..."} SSE events.
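On the client side of the HTTP API, handling those events might look like this. Only the event shape {"type":"thinking","delta":"..."} comes from the docs; the line-parsing helper is an illustrative sketch:

```typescript
// Minimal sketch of consuming thinking deltas from an SSE stream.
// Each SSE data line carries one JSON event; ignore comments and other event types.
function handleSSELine(line: string, onThinking: (delta: string) => void) {
  if (!line.startsWith('data: ')) return               // skip comments/keepalives
  const event = JSON.parse(line.slice('data: '.length))
  if (event.type === 'thinking') onThinking(event.delta)
}
```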

Context discovery

ra discovers and injects project context files into the conversation before your prompt. By default, ra looks for common convention files (CLAUDE.md, AGENTS.md, .cursorrules, .windsurfrules, .github/copilot-instructions.md). Configure which files to look for:

```yaml
agent:
  context:
    enabled: true
    patterns:
      - "CLAUDE.md"
      - "AGENTS.md"
      - "CONVENTIONS.md"    # add your own patterns
```

ra walks the directory tree upward to the git root, finds matching files, and injects them as system context. This is useful for project conventions, coding standards, or any persistent instructions.

Pattern resolution

Reference files and URLs inline in your prompts — ra resolves them before the model sees the message.

```bash
ra "explain what @src/auth.ts does"             # file contents injected
ra "review @src/utils/*.ts for consistency"     # glob expansion
ra "summarize url:https://example.com/api-docs" # fetched page content
```

Two built-in resolvers are enabled by default:

| Resolver | Syntax | Description |
| --- | --- | --- |
| File | @path or @glob | Resolves file contents, supports glob patterns |
| URL | url:https://... | Fetches and inlines page content |

Add custom resolvers for GitHub issues, database records, or anything else:

```yaml
agent:
  context:
    resolvers:
      - name: issues
        path: ./resolvers/github-issues.ts
```
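Conceptually, a resolver pairs a pattern with an async replacement. The `Resolver` shape and `resolvePrompt` driver below are assumptions for illustration, not ra's documented plugin API:

```typescript
// Hypothetical resolver shape: a global regex plus an async replace function.
type Resolver = {
  name: string
  pattern: RegExp // must carry the /g flag for matchAll
  resolve: (match: RegExpMatchArray) => Promise<string>
}

// Apply every resolver to the prompt before the model sees it.
async function resolvePrompt(prompt: string, resolvers: Resolver[]): Promise<string> {
  let out = prompt
  for (const r of resolvers) {
    for (const m of [...out.matchAll(r.pattern)]) {
      out = out.replace(m[0], await r.resolve(m))
    }
  }
  return out
}
```

A GitHub-issues resolver would then match something like `issue:123` and return the fetched issue body as the replacement text.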

Middleware hooks

For full programmatic control over context, use middleware. Every hook receives the full conversation history and can mutate it.

```yaml
agent:
  middleware:
    beforeModelCall:
      - "./middleware/enforce-budget.ts"
    afterToolExecution:
      - "./middleware/redact-secrets.ts"
```
```ts
// middleware/enforce-budget.ts — reject if context is too large
export default async (ctx) => {
  const totalChars = ctx.request.messages.reduce((n, m) => n + JSON.stringify(m).length, 0)
  if (totalChars > 500_000) ctx.stop()
}
```
