
Context Control

ra gives you full control over what the model sees and when. Built-in mechanisms handle the common cases automatically — compaction, caching, thinking, context discovery — and middleware hooks let you intercept everything else.

Smart context compaction

When conversations grow long, ra compacts automatically. It splits the history into three zones — pinned messages (system prompt, first user message), compactable middle, and recent turns — then drops the minimum messages from the back of the compactable zone needed to free space. This keeps [pinned, ...early_compactable] byte-identical to the cached prefix, so provider prompt caches (Anthropic, OpenAI, Google) get maximum reuse on the very next model call.

```yaml
agent:
  compaction:
    enabled: true
    threshold: 0.90              # trigger at 90% of context window
    strategy: truncate           # 'truncate' (default) or 'summarize'
```

Two strategies:

  • truncate (default) — Drops messages from the back of the compactable zone (the transition between old and recent context). Free, instant, and maximally cache-friendly: the message prefix [system, first_user, early_turns...] stays byte-identical across compactions, giving maximum prefix cache hits on all providers.
  • summarize — Calls a model to summarize the entire compactable zone, enriches the summary with programmatically extracted metadata (tools used, key files, pending work), and injects the result into the pinned user message. Costs an extra API call but preserves more context semantically.

```yaml
# opt into summarization if you prefer preserving context over cost
agent:
  compaction:
    strategy: summarize
    model: claude-haiku-4-5-20251001  # cheap model for summarization
```
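The zone-splitting behind the truncate strategy is straightforward to sketch. Everything below is illustrative, not ra's internals: the function names, the two-message pinned zone, the fixed recent-turn count, and the rough 4-chars-per-token estimate are all assumptions.

```typescript
// Sketch of truncate-style compaction: split history into pinned /
// compactable / recent zones, then drop messages from the BACK of the
// compactable zone until the conversation fits the budget. Dropping from
// the back keeps [pinned, ...early_compactable] byte-identical, which is
// what preserves the provider's cached prefix.
type Msg = { role: string; content: string }

// Crude fallback estimate (~4 chars per token); real token counts win when available.
const estimate = (msgs: Msg[]) =>
  msgs.reduce((n, m) => n + Math.ceil(m.content.length / 4), 0)

function truncateCompact(history: Msg[], budget: number, recentCount = 4): Msg[] {
  const pinned = history.slice(0, 2)                   // system prompt + first user message
  const recent = history.slice(-recentCount)           // latest turns stay verbatim
  let middle = history.slice(2, history.length - recentCount)

  while (middle.length > 0 && estimate([...pinned, ...middle, ...recent]) > budget) {
    middle = middle.slice(0, -1)                       // drop from the back, not the front
  }
  return [...pinned, ...middle, ...recent]
}
```

Note the invariant: across repeated compactions the surviving prefix never changes, so each subsequent model call replays the same cached bytes.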

Enriched summaries

When using the summarize strategy, ra doesn't just forward the LLM's summary verbatim. Before calling the model, it scans the compactable messages and extracts:

  • Tool names — every tool called during the compacted portion
  • File paths — code file references detected in message content (.ts, .py, .rs, .go, etc.)
  • Pending work hints — messages containing keywords like "todo", "next step", "remaining", "pending"

The LLM is prompted to return structured XML tags (<summary>, <pending_work>, <key_files>) alongside its narrative summary. ra then merges the LLM output with the programmatic metadata, deduplicates file paths, and produces a single enriched summary that includes tools used, key files, and pending work sections.

When re-compacting an already-compacted session, the previous summary is preserved as a "Previously compacted context" section, so no context is silently lost across multiple compactions.
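The programmatic metadata pass described above could look roughly like this. The shapes, regexes, and function name here are illustrative assumptions, not ra's actual extraction code:

```typescript
// Hypothetical sketch of the pre-summarization scan: collect tool names,
// code file paths, and pending-work hints from the compactable messages.
type Msg = { role: string; content: string; toolName?: string }

const CODE_FILE = /\b[\w./-]+\.(ts|py|rs|go|js|java|c|cpp)\b/g
const PENDING = /\b(todo|next step|remaining|pending)\b/i

function extractMetadata(msgs: Msg[]) {
  const tools = new Set<string>()
  const files = new Set<string>()
  const pendingHints: string[] = []
  for (const m of msgs) {
    if (m.toolName) tools.add(m.toolName)                      // tools used
    for (const f of m.content.match(CODE_FILE) ?? []) files.add(f) // key files, deduped
    if (PENDING.test(m.content)) pendingHints.push(m.content)  // pending work
  }
  return { tools: [...tools], files: [...files], pendingHints }
}
```

The Set-based collection is what makes the later merge with the LLM's `<key_files>` output naturally deduplicate.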

To fully control the summarization output, set a custom prompt — this bypasses the metadata formatting and uses the LLM response as-is:

```yaml
agent:
  compaction:
    strategy: summarize
    prompt: "Summarize this conversation in bullet points."
```

Key properties:

  • Cache-friendly — Designed for provider prefix caching (Anthropic, OpenAI, Google). The truncate strategy keeps the message prefix as stable as possible across compactions — only the oldest messages change. The 0.90 threshold maximizes time between compactions.
  • Token-aware — Uses real token counts from the provider when available, falls back to estimation.
  • Pinned zones — System prompts and initial context never get compacted.
  • Tool-call-aware — Boundaries never split an assistant message from its tool results.
  • Provider-portable — Works the same across all providers.
  • Dynamic context window learning — For unknown models (custom fine-tunes, local models, new releases), ra learns the real context window from provider errors. The first time a model hits a context limit, ra parses the actual size from the error message and caches it — all future compaction thresholds use the correct value automatically.

Context window resolution

ra resolves the context window in this order:

  1. Config override — compaction.contextWindow in your config
  2. Learned from errors — cached from a previous context length error
  3. Model registry — built-in lookup by model name prefix

If none of these match, ra skips proactive compaction and relies on the error-driven path. The first time the model rejects a request for exceeding its context limit, ra parses the real size from the error, caches it, and compacts. From that point on, proactive compaction works correctly.
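That three-step order plus the error-driven fallback could be sketched as follows. The registry values, the learned-size cache, and the error message format are all illustrative assumptions:

```typescript
// Illustrative sketch of context window resolution. The prefix registry
// and error wording below are made up for the example.
const registry: Record<string, number> = { 'claude-': 200_000, 'gpt-4o': 128_000 }
const learned = new Map<string, number>()

function resolveContextWindow(model: string, configOverride?: number): number | undefined {
  if (configOverride) return configOverride            // 1. config override
  if (learned.has(model)) return learned.get(model)    // 2. learned from a past error
  const prefix = Object.keys(registry).find(p => model.startsWith(p))
  return prefix ? registry[prefix] : undefined         // 3. registry prefix match
}

// Error-driven path: parse the real size out of a context-length error and cache it,
// so proactive compaction works on every later call.
function learnFromError(model: string, errorMessage: string) {
  const m = errorMessage.match(/maximum context length is (\d+)/i)
  if (m) learned.set(model, Number(m[1]))
}
```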

For best results with unknown models, set the context window explicitly:

```yaml
agent:
  compaction:
    contextWindow: 32000   # for a 32k model
```

Token tracking

ra tracks input and output tokens across every iteration of the loop. Your middleware can read cumulative usage via ctx.loop.usage and enforce budgets, log costs, or trigger compaction early.

```ts
// middleware/log-cost.ts
export default async (ctx) => {
  const { inputTokens, outputTokens } = ctx.loop.usage
  console.log(`Tokens used: ${inputTokens} in, ${outputTokens} out`)
}
```

Prompt caching

For Anthropic, ra automatically applies cache hints to system prompts and tool definitions. This reduces costs on multi-turn sessions without any configuration — cached tokens are billed at a reduced rate.

Extended thinking

Enable extended thinking for models that support it. Five modes control how the model reasons before responding.

| Mode | Behavior |
| --- | --- |
| off | Disabled (default) |
| low | Minimal reasoning budget |
| medium | Moderate reasoning budget |
| high | Maximum reasoning budget |
| adaptive | high for the first 10 iterations, then low — balances deep initial reasoning with faster follow-up turns |
```bash
ra --thinking high "Design a database schema for a social network"
ra --thinking adaptive "Build a REST API"
```

```yaml
agent:
  thinking: adaptive
```

Optionally cap the thinking budget in tokens. The provider uses min(levelBudget, cap):

```yaml
agent:
  thinking: high
  thinkingBudgetCap: 10000   # never exceed 10k thinking tokens
```
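That min() rule is simple enough to sketch directly. The per-level budgets below are made up for illustration; only the min(levelBudget, cap) behavior comes from the docs:

```typescript
// Illustrative per-level thinking budgets (not ra's actual numbers).
const LEVEL_BUDGET: Record<string, number> = { low: 2_000, medium: 8_000, high: 32_000 }

// The effective budget is the level's budget, clamped by the optional cap.
const effectiveBudget = (level: string, cap?: number) =>
  cap ? Math.min(LEVEL_BUDGET[level], cap) : LEVEL_BUDGET[level]
```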

Thinking output streams to the terminal in the REPL, so you can watch the model reason in real time. In the HTTP API, thinking tokens are emitted as {"type":"thinking","delta":"..."} SSE events.
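On the client side of the HTTP API, handling those events might look like this. Only the event shape {"type":"thinking","delta":"..."} comes from the docs; the line-parsing helper is an illustrative sketch:

```typescript
// Minimal sketch of consuming thinking deltas from an SSE stream.
// Each SSE data line carries one JSON event; ignore comments and other event types.
function handleSSELine(line: string, onThinking: (delta: string) => void) {
  if (!line.startsWith('data: ')) return               // skip comments/keepalives
  const event = JSON.parse(line.slice('data: '.length))
  if (event.type === 'thinking') onThinking(event.delta)
}
```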

Context discovery

ra discovers and injects project context files into the conversation before your prompt. By default, ra looks for common convention files (CLAUDE.md, AGENTS.md, .cursorrules, .windsurfrules, .github/copilot-instructions.md). Configure which files to look for:

```yaml
agent:
  context:
    enabled: true
    patterns:
      - "CLAUDE.md"
      - "AGENTS.md"
      - "CONVENTIONS.md"    # add your own patterns
```

ra walks the directory tree upward to the git root, finds matching files, and injects them as system context. This is useful for project conventions, coding standards, or any persistent instructions.

Pattern resolution

Reference files and URLs inline in your prompts — ra resolves them before the model sees the message.

```bash
ra "explain what @src/auth.ts does"             # file contents injected
ra "review @src/utils/*.ts for consistency"     # glob expansion
ra "summarize url:https://example.com/api-docs" # fetched page content
```

Two built-in resolvers are enabled by default:

| Resolver | Syntax | Description |
| --- | --- | --- |
| File | @path or @glob | Resolves file contents, supports glob patterns |
| URL | url:https://... | Fetches and inlines page content |

Add custom resolvers for GitHub issues, database records, or anything else:

```yaml
agent:
  context:
    resolvers:
      - name: issues
        path: ./resolvers/github-issues.ts
```
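Conceptually, a resolver pairs a pattern with an async replacement. The `Resolver` shape and `resolvePrompt` driver below are assumptions for illustration, not ra's documented plugin API:

```typescript
// Hypothetical resolver shape: a global regex plus an async replace function.
type Resolver = {
  name: string
  pattern: RegExp // must carry the /g flag for matchAll
  resolve: (match: RegExpMatchArray) => Promise<string>
}

// Apply every resolver to the prompt before the model sees it.
async function resolvePrompt(prompt: string, resolvers: Resolver[]): Promise<string> {
  let out = prompt
  for (const r of resolvers) {
    for (const m of [...out.matchAll(r.pattern)]) {
      out = out.replace(m[0], await r.resolve(m))
    }
  }
  return out
}
```

A GitHub-issues resolver would then match something like `issue:123` and return the fetched issue body as the replacement text.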

Middleware hooks

For full programmatic control over context, use middleware. Every hook receives the full conversation history and can mutate it.

```yaml
agent:
  middleware:
    beforeModelCall:
      - "./middleware/enforce-budget.ts"
    afterToolExecution:
      - "./middleware/redact-secrets.ts"
```
```ts
// middleware/enforce-budget.ts — reject if context is too large
export default async (ctx) => {
  const totalChars = ctx.request.messages.reduce((n, m) => n + JSON.stringify(m).length, 0)
  if (totalChars > 500_000) ctx.stop()
}
```
