Rate Limits and Budgets: Financial Safety Rails for AI Agents

Keystore Team·March 5, 2026·6 min read

In February 2026, a three-person startup received a Google Cloud bill for $82,314. Their usual monthly spend was $180. Someone had stolen their Gemini API key and run up the charges in 48 hours, roughly a 46,000% increase over a normal month. Google has not forgiven the charges. The startup is still fighting the bill.

A month earlier, a recursive agent loop ran for 11 days before anyone noticed. The agent kept calling itself, generating API requests around the clock with no human in the loop. Total damage: $47,000. The developer found out when the invoice arrived.

These are not edge cases. They are the predictable result of giving autonomous software unrestricted access to pay-per-use APIs. AI agents make decisions, call tools, retry failures, and spawn sub-tasks --- all of which cost money --- without asking permission for each action. When something goes wrong, it goes wrong at API speed.

The Real Cost Landscape

To see why budget controls matter, look at what the providers actually charge and how their rate limits work:

OpenAI Tier 1 allows 1,000 requests per minute and 500,000 tokens per minute. GPT-5.2 is priced at $1.75 per million input tokens and $14 per million output tokens. An agent generating long-form content can burn through $50-100/day without trying hard.

Anthropic allows roughly one-fifth as many requests as OpenAI at equivalent spend levels. Claude Opus 4.6 is priced at $5 per million input tokens and $25 per million output tokens, about 2.9x OpenAI's input price and about 1.8x its output price. An agent that defaults to Claude for every task will spend significantly more than one that routes appropriately.

Google Gemini offers 4 million tokens per minute with no tier system --- the highest immediately-available throughput of any major provider. This is also what makes it dangerous. An agent can consume tokens at an extraordinary rate because there is no graduated throttle slowing it down.
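To make the price gap concrete, here is a minimal sketch that turns the per-million-token prices quoted above into a daily figure. The workload (2,000 requests a day, 3,000 input and 1,500 output tokens each) is a made-up assumption, and the model keys and function names are illustrative, not part of any provider SDK.

```typescript
// Prices in USD per million tokens, as quoted in this article.
interface TokenPrice {
  inputPerMillion: number;
  outputPerMillion: number;
}

const PRICES: Record<string, TokenPrice> = {
  "gpt-5.2": { inputPerMillion: 1.75, outputPerMillion: 14 },
  "claude-opus-4.6": { inputPerMillion: 5, outputPerMillion: 25 },
};

// Cost of a single request, given its token counts.
function requestCostUsd(price: TokenPrice, inputTokens: number, outputTokens: number): number {
  return (
    (inputTokens / 1_000_000) * price.inputPerMillion +
    (outputTokens / 1_000_000) * price.outputPerMillion
  );
}

// Daily cost for a steady workload on one model (hypothetical helper).
function dailyCostUsd(
  model: string,
  requestsPerDay: number,
  inputTokens: number,
  outputTokens: number,
): number {
  const price = PRICES[model];
  if (price === undefined) {
    throw new Error(`No price configured for model: ${model}`);
  }
  return requestsPerDay * requestCostUsd(price, inputTokens, outputTokens);
}

const gptDaily = dailyCostUsd("gpt-5.2", 2000, 3000, 1500);
const claudeDaily = dailyCostUsd("claude-opus-4.6", 2000, 3000, 1500);
```

Under these assumptions the same workload comes to about $52.50 a day on GPT-5.2 and about $105 a day on Claude Opus 4.6, which is why routing decisions show up directly on the invoice.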

According to industry surveys, 73% of development teams lack real-time cost tracking for autonomous agents. They deploy an agent, it starts making API calls, and the first cost signal they receive is the monthly invoice. By then, the damage is done.

Provider Billing Alerts Are Not Enough

Every major provider offers billing alerts. OpenAI sends an email when you hit a spending threshold. Google Cloud can trigger notifications at budget percentages. These are useful, but they share a fundamental limitation: they notify, they do not enforce.

A billing alert tells you that your agent has spent $500. It does not stop the agent from spending $501. The notification arrives in your inbox (or a Slack channel, or a PagerDuty alert) and waits for a human to read it, assess the situation, and take action. During that response time --- which could be minutes during business hours or hours overnight --- the agent keeps spending.

In the $82,314 Gemini incident, the charges accumulated over 48 hours. Even if billing alerts had been configured at $500 and $1,000, the theft would have continued through every threshold while the team slept, ate, and worked on other things.

Keystore's budget enforcement operates differently. The proxy checks the token's accumulated spend before processing each request. When the budget is exhausted, the proxy returns HTTP 402 (Payment Required) and the agent stops. No human in the loop required. No notification delay. The budget is a hard ceiling, not an alert threshold.
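The difference between an alert and a hard ceiling fits in a few lines. The sketch below is an illustration only, not Keystore's proxy code: `BudgetGate`, `check`, and `record` are hypothetical names, and an in-memory map stands in for whatever store the real proxy uses.

```typescript
interface GateResult {
  status: 200 | 402;
  remainingUsd: number;
}

// Hypothetical in-memory stand-in for a per-token spend store.
class BudgetGate {
  private readonly budgetUsd: number;
  private readonly spentUsd = new Map<string, number>();

  constructor(budgetUsd: number) {
    this.budgetUsd = budgetUsd;
  }

  // Runs before a request is forwarded: refuse once the budget is used up.
  check(tokenName: string): GateResult {
    const spent = this.spentUsd.get(tokenName) ?? 0;
    const remainingUsd = Math.max(0, this.budgetUsd - spent);
    return { status: remainingUsd > 0 ? 200 : 402, remainingUsd };
  }

  // Runs after a request completes, with its estimated cost.
  record(tokenName: string, costUsd: number): void {
    const spent = this.spentUsd.get(tokenName) ?? 0;
    this.spentUsd.set(tokenName, spent + costUsd);
  }
}
```

Because the check happens before the request and the cost is recorded after it, the last allowed request can push spend slightly past the ceiling. The overshoot is bounded by the cost of one request, which is a very different failure mode from an email nobody reads until morning.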

The Retry Storm Problem

Beyond theft and runaway loops, there is a subtler failure mode: retry storms during provider outages.

When a provider returns 500 errors, well-intentioned retry logic kicks in. The agent retries with exponential backoff --- except many implementations cap the backoff too low, or the agent framework adds its own retry layer on top, or multiple agents all retry simultaneously. One real-world case saw a 1,700% cost spike from retry logic during a provider outage. The provider was intermittently returning successful responses between failures, so each retry had a chance of succeeding (and incurring cost) while the overall error rate stayed high.

This is where rate limits complement budgets. A rate limit of 60 requests per minute means that even in the worst retry storm, the agent cannot exceed 60 requests in any given minute. Combined with a daily budget, the maximum damage is bounded in both velocity (requests per minute) and total spend (dollars per day).
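On the client side, the retry behavior described in this section is easier to reason about when both the delay and the number of attempts are bounded. Here is a minimal sketch of capped exponential backoff with full jitter. The `RetryPolicy` shape and the function names are assumptions for illustration, not an API from any particular agent framework.

```typescript
interface RetryPolicy {
  maxAttempts: number; // total tries, including the first call
  baseDelayMs: number; // upper bound for the delay before the first retry
  maxDelayMs: number;  // ceiling for any single delay
}

// Delay before retry number `retry` (1-based). "Full jitter" picks a random
// value between 0 and the capped exponential delay, so that many agents
// retrying at once do not all fire at the same instant.
function backoffDelayMs(
  policy: RetryPolicy,
  retry: number,
  random: () => number = Math.random,
): number {
  const exponential = policy.baseDelayMs * 2 ** (retry - 1);
  const capped = Math.min(policy.maxDelayMs, exponential);
  return Math.floor(random() * capped);
}

// Calls `call` until it succeeds or the attempt limit is reached.
async function withRetries<T>(
  policy: RetryPolicy,
  call: () => Promise<T>,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= policy.maxAttempts; attempt++) {
    try {
      return await call();
    } catch (err) {
      lastError = err;
      // Only wait if another attempt is still allowed.
      if (attempt < policy.maxAttempts) {
        await sleep(backoffDelayMs(policy, attempt));
      }
    }
  }
  throw lastError;
}
```

The attempt cap is what bounds the damage here. If an agent framework adds its own retry layer on top, the effective number of attempts is the product of the two limits, which is worth checking before an outage does it for you.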

How Keystore Implements This

Rate Limits

Rate limits are set per agent token, enforced at the proxy:

bash

ks token create \
  --name production-agent \
  --providers openai,anthropic \
  --rate-limit 60/minute

The proxy uses sliding window counters backed by Upstash Redis. When the limit is hit, the proxy returns 429 Too Many Requests with a Retry-After header. The agent (or its framework) can respect the header and wait, rather than hammering the provider.
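For readers who want to see the mechanics, here is a simplified in-memory sketch. It is not the Redis-backed implementation: it uses a sliding-window log (one timestamp per request), a simpler relative of sliding-window counters, and the class and field names are hypothetical.

```typescript
interface RateDecision {
  status: 200 | 429;
  retryAfterSeconds: number; // 0 when the request is allowed
}

class SlidingWindowLimiter {
  private readonly limit: number;
  private readonly windowMs: number;
  // Token name -> timestamps (ms) of requests allowed inside the current window.
  private readonly hits = new Map<string, number[]>();

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  allow(tokenName: string, nowMs: number): RateDecision {
    const cutoff = nowMs - this.windowMs;
    // Drop timestamps that have slid out of the window.
    const recent = (this.hits.get(tokenName) ?? []).filter((t) => t > cutoff);

    if (recent.length >= this.limit) {
      this.hits.set(tokenName, recent);
      // A slot frees up when the oldest request leaves the window.
      const oldest = recent[0] ?? nowMs;
      const waitMs = oldest + this.windowMs - nowMs;
      return { status: 429, retryAfterSeconds: Math.ceil(waitMs / 1000) };
    }

    recent.push(nowMs);
    this.hits.set(tokenName, recent);
    return { status: 200, retryAfterSeconds: 0 };
  }
}
```

A limiter built as `new SlidingWindowLimiter(60, 60_000)` would correspond to the `60/minute` setting above. Rejected requests are not recorded in this sketch, so a client that keeps hammering does not extend its own lockout. Whether a production limiter should behave that way is a design choice.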

You can also set per-provider limits within a single token:

typescript

const token = await ks.tokens.create({
  name: "production-agent",
  providers: [
    { name: "openai", rateLimit: { requests: 100, window: "minute" } },
    { name: "anthropic", rateLimit: { requests: 30, window: "minute" } },
  ],
});

This reflects reality: Anthropic allows fewer requests at equivalent spend, so your rate limit should be proportionally lower.

Budgets

Budgets set a dollar ceiling per period:

bash

ks token create \
  --name production-agent \
  --providers openai,anthropic \
  --budget 50 \
  --budget-period daily

The proxy estimates cost per request based on provider pricing (token counts for LLMs, per-request costs for other APIs) and maintains a running total. When the total exceeds the budget, subsequent requests are rejected with a 402 until the budget period resets.
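A running total per period mostly comes down to choosing the right bucket key. The sketch below assumes daily budgets reset at midnight UTC and keeps totals in memory. `dailyWindow` and `SpendLedger` are hypothetical names, not part of Keystore.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

// The UTC day containing `now`, and the moment the next one starts.
function dailyWindow(now: Date): { periodKey: string; resetsAt: Date } {
  const startMs = Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate());
  return {
    periodKey: new Date(startMs).toISOString().slice(0, 10), // YYYY-MM-DD
    resetsAt: new Date(startMs + DAY_MS),
  };
}

// Hypothetical in-memory ledger: one running total per token per UTC day.
class SpendLedger {
  private readonly totals = new Map<string, number>();

  // Adds an estimated cost and returns the running total for that day.
  add(tokenName: string, costUsd: number, now: Date): number {
    const key = `${tokenName}:${dailyWindow(now).periodKey}`;
    const total = (this.totals.get(key) ?? 0) + costUsd;
    this.totals.set(key, total);
    return total;
  }
}
```

Keying on the period means a reset needs no scheduled job: the first request of a new day simply lands in a fresh bucket that starts at zero.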

Configure alerts at multiple thresholds to get advance warning:

bash

ks token update production-agent \
  --budget-alert-threshold 50 \
  --budget-alert-threshold 80 \
  --budget-alert-email ops@yourcompany.com
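One detail worth getting right with multiple thresholds is firing each alert once, when spend crosses it, rather than on every later request. Here is a minimal sketch, assuming the threshold values are percentages of the budget (the example above uses 50 and 80 against a $50 budget). The function name is illustrative.

```typescript
// Returns the thresholds (percent of budget) newly crossed when spend
// moves from `previousUsd` to `currentUsd`, in ascending order.
function crossedThresholds(
  thresholdsPct: number[],
  budgetUsd: number,
  previousUsd: number,
  currentUsd: number,
): number[] {
  return thresholdsPct
    .filter((pct) => {
      const markUsd = (pct / 100) * budgetUsd;
      // Crossed only if we were below the mark before and are at or above it now.
      return previousUsd < markUsd && currentUsd >= markUsd;
    })
    .sort((a, b) => a - b);
}
```

With a $50 budget the marks sit at $25 and $40. A single expensive request that takes spend from $20 to $45 crosses both at once, so the caller should be prepared to send more than one alert per request.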

Circuit Breaker Pattern

For teams that want more sophisticated protection, rate limits and budgets work together as a circuit breaker. Consider a contract analysis agent --- the kind that generated $1,410 in charges from 47,000 API calls in 6 hours:

typescript

const token = await ks.tokens.create({
  name: "contract-analyzer",
  providers: ["openai"],
  rateLimit: { requests: 30, window: "minute" },
  budget: { amount: 25, period: "daily" },
});

At 30 requests per minute, even if every request costs $0.05 (a high estimate for GPT-5.2 completions), the maximum hourly spend is $90. But the $25 daily budget means the agent stops after roughly 500 requests regardless of rate. The rate limit prevents bursts; the budget prevents sustained drain.
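The arithmetic in that paragraph is simple enough to script, which makes it easy to try other limits before committing to them. This is a back-of-the-envelope sketch with illustrative names. The $0.05 per request is the article's deliberately high estimate, not a measured cost.

```typescript
interface TokenLimits {
  requestsPerMinute: number;
  dailyBudgetUsd: number;
}

interface WorstCase {
  maxHourlySpendUsd: number; // what the rate limit alone would allow
  requestsUntilStop: number; // requests the daily budget covers
  minutesUntilStop: number;  // time to exhaust the budget at full rate
}

function worstCase(limits: TokenLimits, costPerRequestUsd: number): WorstCase {
  const maxHourlySpendUsd = limits.requestsPerMinute * 60 * costPerRequestUsd;
  // The small epsilon guards against floating-point results landing just below a whole number.
  const requestsUntilStop = Math.floor(limits.dailyBudgetUsd / costPerRequestUsd + 1e-9);
  const minutesUntilStop = requestsUntilStop / limits.requestsPerMinute;
  return { maxHourlySpendUsd, requestsUntilStop, minutesUntilStop };
}

const contractAnalyzer = worstCase({ requestsPerMinute: 30, dailyBudgetUsd: 25 }, 0.05);
```

For the contract analyzer settings this gives a $90 hourly ceiling from the rate limit alone and 500 requests from the budget. At the full 30 requests per minute, the budget trips after roughly 17 minutes.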

What This Looks Like in Practice

bash

ks token info production-agent

Name:         production-agent
Budget:       $50.00/daily
Used:         $18.40 (36.8%)
Remaining:    $31.60
Resets:       2026-03-06 00:00 UTC
Rate Limit:   60/minute
Current Rate: 22/minute (36.7%)

Real-time visibility, not end-of-month surprises. The $82,314 Gemini theft, the $47,000 recursive loop, the $1,410 contract analyzer: a daily budget, configured in a single command, would have capped each of them.

The average data breach costs $4.88 million (IBM/Ponemon, 2024). A daily budget costs one line of configuration. The math is not complicated.