Building with AI APIs: What Breaks After npm install

2026-03-14 · Nico Brandt

You picked an AI API. You read the comparison articles, evaluated the benchmarks, chose a provider, ran npm install. Now you’re staring at the SDK docs wondering why the “quick start” example covers exactly none of what you need to ship.

Every guide told you which API to pick. None showed you what happens when building with AI APIs meets production reality — the streaming quirks, the 2 AM rate limit errors, the surprise invoice that makes your CFO schedule a meeting. This is that missing guide.

What the SDKs Actually Feel Like

OpenAI’s SDK is the most mature of the three covered here. Best TypeScript types, largest community, handles auth and retries out of the box. Calling GPT-5 feels like calling any well-built REST client — openai.chat.completions.create(), pass your messages, get a response. You’ll be productive in minutes.

Anthropic’s SDK is cleaner than you’d expect. No role gymnastics for system prompts — they’re a separate parameter, not a message with role: "system" shoved into the array in the hope that the model respects it. The messages API feels more intentional: fewer configuration options, harder to misconfigure. If you’ve ever set a wrong parameter and gotten a silently degraded response, you’ll appreciate the guardrails.

Google’s Gemini SDK feels newer — rougher error messages, some documentation gaps you’ll notice on day one. But structured output support is surprisingly solid, and context windows up to 1M tokens change what’s architecturally possible. The free tier is generous enough to prototype without entering a credit card. That matters more than most comparisons acknowledge.

Honest take: all three work. I’ve shipped production features with each. The DX differences are real — Anthropic’s message format is cleaner, OpenAI’s ecosystem is deeper, Gemini’s pricing is friendlier — but none are dealbreakers. Pick based on your use case, not which SDK has prettier method names.

“Works for basic calls” and “works in production” are different conversations entirely. The first thing that’ll remind you? Streaming.

Streaming That Doesn’t Break in Production

Without streaming, your users stare at a blank screen for 3–8 seconds while the model generates a response. GPT-5.2 outputs tokens at around 187 tok/s. Claude Opus runs closer to 50 tok/s. Either way, the full response takes seconds — and without streaming, those seconds feel like minutes. Every serious AI integration streams.

The implementations differ more than you’d expect. OpenAI returns an async iterator — set stream: true and each chunk carries a choices[0].delta object with token-by-token text. Clean and predictable. Anthropic uses server-sent events with typed event names like content_block_delta and message_stop. More structured than OpenAI’s approach — you always know what each event type means and can handle them distinctly. Gemini streams via generateContentStream(), returning partial response objects rather than token deltas. Different mental model entirely.

The pattern that saves you: abstract all three behind a common async generator. A short wrapper function that yields text chunks regardless of provider. Your UI code never knows or cares which API it’s talking to. If you’ve worked with TypeScript interfaces before, this is the same principle — program against a contract, not an implementation.
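A minimal sketch of that wrapper. The chunk shapes below loosely mirror each provider's streaming payloads, but treat the exact field names as assumptions rather than verbatim SDK types:

```typescript
// OpenAI-style chunk: text lives in choices[0].delta.content
type OpenAIChunk = { choices: { delta: { content?: string } }[] };
// Anthropic-style SSE event: typed by name, text in delta.text
type AnthropicEvent =
  | { type: "content_block_delta"; delta: { text: string } }
  | { type: "message_stop" };
// Gemini-style partial response: text exposed via a text() method
type GeminiChunk = { text: () => string };

// One contract for the UI layer: an async stream of plain text chunks.
async function* fromOpenAI(stream: AsyncIterable<OpenAIChunk>): AsyncGenerator<string> {
  for await (const chunk of stream) {
    const text = chunk.choices[0]?.delta.content;
    if (text) yield text;
  }
}

async function* fromAnthropic(stream: AsyncIterable<AnthropicEvent>): AsyncGenerator<string> {
  for await (const event of stream) {
    if (event.type === "content_block_delta") yield event.delta.text;
  }
}

async function* fromGemini(stream: AsyncIterable<GeminiChunk>): AsyncGenerator<string> {
  for await (const chunk of stream) {
    const text = chunk.text();
    if (text) yield text;
  }
}
```

Your UI consumes any of these the same way: `for await (const text of chunks) render(text)`.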

Here’s the gotcha nobody mentions in tutorials. Token usage reporting arrives differently from each provider. Anthropic reports input tokens up front in the message_start event and output tokens through message_delta events. OpenAI tucks usage into the last chunk, and only if you opt in via stream_options. Gemini reports per-chunk. If you’re tracking costs in real time — and you should be — this inconsistency will bite you exactly once before you build the normalization layer.
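Whichever event each provider reports usage on, the fix is the same: adapters extract raw counts, everything downstream sees one normalized shape. A sketch, with illustrative prices:

```typescript
// Normalized usage, regardless of which event it was pulled from.
interface Usage { inputTokens: number; outputTokens: number }

// Prices per million tokens. Illustrative values; load from your own config.
interface Pricing { inputPerM: number; outputPerM: number }

class CostTracker {
  private totalUSD = 0;

  // Record one call's usage; returns that call's cost in USD.
  record(usage: Usage, pricing: Pricing): number {
    const cost =
      (usage.inputTokens / 1_000_000) * pricing.inputPerM +
      (usage.outputTokens / 1_000_000) * pricing.outputPerM;
    this.totalUSD += cost;
    return cost;
  }

  get total(): number {
    return this.totalUSD;
  }
}
```

The tracker never knows which provider a Usage object came from. That's the point of the normalization layer.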

Streaming gets your UI feeling responsive. Then your API returns a 429 at 2 AM and your feature silently dies. That’s where the real production work begins.

The Error Handling Nobody Warns You About

Rate limits aren’t theoretical. They will hit you, probably during a demo. OpenAI and Anthropic both use tiered RPM and TPM limits — your free-tier key has very different ceilings than your production key. Gemini offers the most generous free tier but imposes stricter paid-tier limits than you’d guess from the marketing page.

The retry pattern that works: exponential backoff with jitter. All three SDKs have built-in retry logic, but defaults vary and often aren’t aggressive enough. Configure max retries explicitly. Understand what gets retried — 429s yes, 400s no. A malformed prompt retried five times is still malformed. Now it’s five times more expensive.
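A sketch of that retry loop, using full jitter. The caller decides what counts as retryable, so a 400 surfaces immediately; the delay constants are illustrative:

```typescript
// Exponential backoff with full jitter: random delay in [0, base * 2^attempt].
async function withBackoff<T>(
  fn: () => Promise<T>,
  isRetryable: (err: unknown) => boolean,
  maxRetries = 5,
  baseMs = 500,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // Give up on non-retryable errors (400s) or when retries are exhausted.
      if (attempt >= maxRetries || !isRetryable(err)) throw err;
      const ceiling = Math.min(baseMs * 2 ** attempt, 30_000); // cap at 30s
      const delay = Math.random() * ceiling;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```

The jitter matters: without it, every client that hit the same rate limit retries at the same instant and hits it again.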

Set explicit timeouts. Default SDK timeouts are far too generous for user-facing features. A 60-second timeout on a chat response means your user already left, opened a competitor’s app, and forgot about you. For interactive features, 15–20 seconds is the hard ceiling. For background processing, be generous — but always set something.
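One way to enforce that ceiling regardless of SDK defaults: wrap the call with an AbortController and a deadline. A sketch, with the 15-second interactive budget as the default:

```typescript
// Race an API call against a hard deadline. The AbortSignal lets a
// cooperative SDK cancel the underlying request; the rejected promise
// guarantees the caller is unblocked even if the SDK ignores the signal.
async function withTimeout<T>(
  run: (signal: AbortSignal) => Promise<T>,
  ms = 15_000,
): Promise<T> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), ms);
  const deadline = new Promise<never>((_, reject) =>
    controller.signal.addEventListener("abort", () =>
      reject(new Error(`timed out after ${ms}ms`)),
    ),
  );
  try {
    return await Promise.race([run(controller.signal), deadline]);
  } finally {
    clearTimeout(timer);
  }
}
```

For background jobs, call it with a larger budget. The point is that *some* number is always set.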

Token counting will surprise you. You’re billed for input plus output tokens, but you can only estimate input tokens before the call. OpenAI’s tiktoken library handles client-side counting. Anthropic’s counting is server-side only — an extra API round trip. Gemini’s countTokens endpoint adds similar latency. Factor this into your cost-tracking architecture early, not after the first invoice that raises eyebrows.
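When an exact count costs a round trip, a crude pre-flight estimate is still useful for rejecting oversized prompts before you pay for the precise check. The 4-characters-per-token ratio below is a rough English-text heuristic, never a billing number:

```typescript
// Rough client-side estimate: ~4 characters per token for English prose.
// Use it to pre-screen prompts, then confirm with the provider's exact count.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Reject prompts that clearly won't fit, leaving headroom for the
// heuristic's error. The 10% safety margin is an assumption; tune it.
function fitsInBudget(prompt: string, maxInputTokens: number, safetyMargin = 0.9): boolean {
  return estimateTokens(prompt) <= maxInputTokens * safetyMargin;
}
```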

Model deprecation is faster than you think. Claude 3.x models were deprecated between October 2025 and January 2026 — roughly three months from announcement to shutdown. If your model string is hardcoded instead of pulled from config, you’re one deprecation notice away from a production outage. Treat model identifiers like you’d treat any external dependency — version them, make them configurable, have a migration plan before you need one.
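One shape that configuration can take: model identifiers keyed by task, with a single lookup function between the config and every call site. The model names and task buckets below are purely illustrative:

```typescript
type Task = "chat" | "extraction" | "summarization";

interface ModelEntry { primary: string; fallback: string }

// Hypothetical model strings. In production this lives in config or a
// feature-flag service, not in source.
const MODELS: Record<Task, ModelEntry> = {
  chat: { primary: "claude-opus-latest", fallback: "gpt-5" },
  extraction: { primary: "gemini-flash-latest", fallback: "gpt-4o-mini" },
  summarization: { primary: "gpt-4o-mini", fallback: "gemini-flash-latest" },
};

// When a model is deprecated, the migration is one edit to MODELS,
// not a grep across the codebase.
function modelFor(task: Task, useFallback = false): string {
  const { primary, fallback } = MODELS[task];
  return useFallback ? fallback : primary;
}
```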

And the classic cautionary tale: one misplaced loop sending GPT-5 requests without max_tokens set. Someone’s $2,400 lesson. Always set max_tokens. Always.

One provider’s quirks are manageable. But what happens when that one provider goes down entirely?

Why Production Teams Route Between Providers

The single-provider trap is straightforward: one outage, one rate limit spike, one pricing change — and your AI feature is offline. Every production team I’ve worked with learned this. The solution is multi-provider routing, and it’s simpler than it sounds.

Define a common interface. Implement per-provider adapters. Add a router that selects based on task type, cost, or availability. That’s the entire architecture. Fifteen minutes of abstraction saves you from vendor lock-in for the life of the project.
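The whole architecture fits in a few lines. The adapters below are stubs; in real code each one wraps its provider's SDK call:

```typescript
// The common interface every provider adapter implements.
interface Provider {
  name: string;
  complete(prompt: string): Promise<string>;
}

// Walk the fallback chain in order; return the first success.
async function routeWithFallback(providers: Provider[], prompt: string): Promise<string> {
  let lastError: unknown;
  for (const provider of providers) {
    try {
      return await provider.complete(prompt);
    } catch (err) {
      lastError = err; // 503, rate limit, timeout: fall through to the next
    }
  }
  throw lastError; // every provider failed; surface the final error
}
```

A production version layers in the retry and timeout wrappers from earlier, but the shape stays this simple.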

The cost savings make the architecture pay for itself immediately. Route 70% of simple tasks to cheap models — GPT-4o-mini and Gemini Flash both run at $0.15 per million input tokens. Route 20% to mid-tier models for tasks that need more capability. Reserve the remaining 10% for frontier models where quality genuinely matters. That 70/20/10 split can cut your AI spend by 70–90% compared to sending everything through a flagship model.
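The arithmetic behind that claim, using the article's $0.15/M figure for the cheap tier and assumed prices for the other two:

```typescript
type Tier = "cheap" | "mid" | "frontier";

// Price per million input tokens. Cheap tier matches the GPT-4o-mini /
// Gemini Flash figure above; mid and frontier prices are assumptions.
const PRICE_PER_M_INPUT: Record<Tier, number> = {
  cheap: 0.15,
  mid: 1.0,
  frontier: 5.0,
};

// The 70/20/10 traffic split.
const SPLIT: Record<Tier, number> = { cheap: 0.7, mid: 0.2, frontier: 0.1 };

// Blended price per million input tokens under the split.
const blended = (Object.keys(SPLIT) as Tier[]).reduce(
  (sum, tier) => sum + SPLIT[tier] * PRICE_PER_M_INPUT[tier],
  0,
);

// Savings versus sending everything to the flagship model.
const savings = 1 - blended / PRICE_PER_M_INPUT.frontier;
```

Under these assumed prices, the blended rate works out to about $0.80/M against $5.00/M for flagship-only, roughly an 84% reduction, inside the 70–90% range.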

Two features changed the cost game in 2026. Prompt caching — now standard across all three providers — reduces costs up to 90% on repeated system prompts and few-shot examples. If you’re sending the same 2,000-token system prompt with every request, enabling caching is the single highest-ROI optimization available today. Batch processing gives a flat 50% discount across all three providers for non-real-time workloads. Analytics, content generation, bulk classification — if it doesn’t need to be instant, batch it.
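Napkin math for that caching claim, assuming a 90% discount on the cached prefix and an illustrative $3/M input price; volume is made up:

```typescript
// Repeated 2,000-token system prompt, sent with every request.
const pricePerMInput = 3.0;       // assumed $/M input tokens
const systemPromptTokens = 2_000;
const requestsPerMonth = 100_000; // hypothetical traffic

// Monthly cost of the system prompt alone, uncached.
const uncachedUSD =
  (systemPromptTokens * requestsPerMonth / 1_000_000) * pricePerMInput;

// Same traffic with prompt caching (90% discount on the repeated prefix).
const cachedUSD = uncachedUSD * (1 - 0.9);

// Or, for non-real-time workloads, the flat 50% batch discount instead.
const batchUSD = uncachedUSD * 0.5;
```

Under these assumptions the system prompt alone drops from $600/month to $60/month with caching, which is why it's the first optimization to reach for.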

The production architecture ends up clean. A thin router, provider adapters behind a shared interface, a fallback chain. When Provider A returns a 503, traffic shifts to Provider B automatically. Your users never notice. If you’ve built resilient services before, the pattern is familiar: isolate dependencies, plan for failure, keep the blast radius small.

You have the patterns now. Time to ship.

Your Move

You started with npm install and a blank editor. Now you have streaming wrappers, retry strategies, cost tracking, and a multi-provider architecture that won’t collapse when one API has a bad day.

The honest recommendation: start with one provider. Pick based on your use case — OpenAI for the broadest ecosystem and fastest iteration, Claude for code-heavy work and the cleanest message API, Gemini for budget-conscious apps and million-token context windows. But write your integration behind an interface from day one. Adding a second provider later should be a one-file change, not a rewrite.

Building with AI APIs in 2026 isn’t about picking the “best” provider. It’s about building abstractions that survive the next model release, the next deprecation notice, and the next 2 AM incident. The patterns in this article work regardless of which provider you start with. That’s the point.

Go ship something. Your node_modules folder has been waiting long enough.