What nobody tells you about running LLMs in production

#AI #LLMs #engineering

Everyone has a demo that works. The hard part is what happens after.

I’ve spent the last year building LLM-powered tools at Aultman Health Foundation — a clinical note summarizer, an HL7 message triage system, a few internal Slack bots that didn’t survive contact with real users. Here’s what I wish I’d known before shipping any of them.

Latency is a product decision, not an engineering one

GPT-4o takes 3–8 seconds for a typical completion. That’s fine for an async workflow. It’s catastrophic for anything that feels synchronous to the user. Before you write a single line of code, decide: is this fire-and-forget or does the user wait?

If they wait, you need streaming. If you stream, you need to handle partial JSON gracefully. If you handle partial JSON, you’ve just signed up for a non-trivial amount of state management. Plan for it.
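To make that state management concrete, here’s a minimal sketch of a buffer that accumulates streamed text and only hands back JSON once a value is complete. It assumes the model emits top-level JSON objects or arrays (bare numbers can’t be detected as finished mid-stream). `StreamingJSONBuffer` is a hypothetical name, not part of any SDK; the only library call is the standard library’s `json.JSONDecoder.raw_decode`.

```python
import json


class StreamingJSONBuffer:
    """Accumulates streamed text and emits each JSON value once it is complete.

    Assumption: top-level values are objects or arrays.
    """

    def __init__(self):
        self._decoder = json.JSONDecoder()
        self._buffer = ""

    def feed(self, chunk):
        """Add a chunk; return a list of values that are now fully parsed."""
        self._buffer += chunk
        completed = []
        while True:
            text = self._buffer.lstrip()
            if not text:
                self._buffer = ""
                break
            try:
                value, end = self._decoder.raw_decode(text)
            except json.JSONDecodeError:
                # Incomplete (or invalid) JSON so far: keep it and wait for more.
                self._buffer = text
                break
            completed.append(value)
            self._buffer = text[end:]
        return completed


buf = StreamingJSONBuffer()
first = buf.feed('{"summary": "pat')    # nothing complete yet
second = buf.feed('ient stable"}')      # the object is now complete
```

Note what this sketch doesn’t do: it can’t tell "not finished yet" from "never going to be valid", so a real version would also want a size cap or timeout on the buffer.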

Prompts are code. Treat them like code.

The worst thing you can do is keep your prompts in a string literal buried in a function. When it breaks at 2am (and it will break at 2am), you want:

  1. Version history — what changed between the prompt that worked and the one that’s hallucinating?
  2. A test suite — at least a handful of golden input/output pairs you can regression-check against
  3. A staging environment — never iterate prompts directly against production traffic

I store prompts as .txt files in a prompts/ directory, versioned in git. It sounds boring. It has saved me multiple times.
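For illustration, a loader for that layout can be a few lines. This is a sketch, not my exact code: the `load_prompt` helper and the `$placeholder` convention (via the standard library’s `string.Template`) are assumptions.

```python
from pathlib import Path
from string import Template

PROMPTS_DIR = Path("prompts")  # assumed layout: prompts/<name>.txt, tracked in git


def load_prompt(name, variables, prompts_dir=PROMPTS_DIR):
    """Read prompts/<name>.txt and fill in its $placeholders.

    Template.substitute raises KeyError when a placeholder has no value,
    so a renamed variable fails loudly instead of shipping a broken prompt.
    A literal dollar sign in the prompt file must be written as $$.
    """
    text = (Path(prompts_dir) / f"{name}.txt").read_text(encoding="utf-8")
    return Template(text).substitute(variables)
```

Usage would look like `load_prompt("summarize", {"note": note_text})`, where `summarize.txt` is a hypothetical file containing a `$note` placeholder. Because the prompt lives in its own file, `git log prompts/summarize.txt` gives you the version history directly.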

Token cost compounds faster than you expect

A prompt that costs $0.002 per call sounds free. At 10,000 calls/day it’s $20/day, or about $600/month. At 100,000 calls/day it’s about $6,000/month: you’ve just hired an expensive contractor who does nothing but call OpenAI.

Build cost tracking in from day one. Log prompt_tokens and completion_tokens per request, aggregate by endpoint, alert when a day’s spend exceeds a threshold. This is not premature optimization — it’s the difference between noticing a runaway prompt loop in hours vs. weeks.
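Here’s a minimal in-memory sketch of that tracking. The `CostTracker` class and its per-1,000-token prices are assumptions for illustration; real prices vary by model and change over time, and a production version would persist the numbers and send the alert somewhere rather than return a boolean.

```python
from collections import defaultdict
from datetime import date


class CostTracker:
    """Aggregates token spend per endpoint per day and flags threshold breaches."""

    def __init__(self, prompt_price_per_1k, completion_price_per_1k, daily_limit_usd):
        # Prices are placeholders supplied by the caller, not real provider rates.
        self.prompt_price = prompt_price_per_1k
        self.completion_price = completion_price_per_1k
        self.daily_limit = daily_limit_usd
        self.spend = defaultdict(float)  # (day, endpoint) -> USD

    def record(self, endpoint, prompt_tokens, completion_tokens, day=None):
        """Log one request; return True if the day's total is now over the limit."""
        day = day or date.today()
        cost = (prompt_tokens * self.prompt_price
                + completion_tokens * self.completion_price) / 1000
        self.spend[(day, endpoint)] += cost
        return self.daily_total(day) > self.daily_limit

    def daily_total(self, day):
        """Total spend across all endpoints for one day."""
        return sum(usd for (d, _), usd in self.spend.items() if d == day)
```

Call `record()` once per request with the token counts from the API response, and page someone the first time it returns `True`.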

The model is not the product

The hardest lesson: users don’t care which model you use. They care whether the output is correct, consistent, and fast. A well-prompted GPT-3.5 often beats a poorly prompted GPT-4o. Spend more time on your evaluation harness than on model selection.

The corollary: when the model provider releases a new version, you need a way to benchmark it against your existing prompts before you upgrade. “It’s better on the benchmark” does not mean it’s better on your prompts with your data.
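A bare-bones version of that benchmark might look like the sketch below. Everything in it is hypothetical: `current` and `candidate` are any callables that take a prompt and return text (in practice, thin wrappers around your API client), `cases` is a list of (prompt, expected) golden pairs, and `check` is whatever pass/fail rule fits your task.

```python
def compare_models(current, candidate, cases, check):
    """Run both models over the same golden cases and report what changed.

    current, candidate: callables mapping a prompt string to an output string.
    cases: list of (prompt, expected) pairs.
    check: callable (output, expected) -> bool.
    """
    results = []
    for prompt, expected in cases:
        old_ok = bool(check(current(prompt), expected))
        new_ok = bool(check(candidate(prompt), expected))
        results.append((prompt, old_ok, new_ok))

    total = len(results)
    return {
        "current_pass_rate": sum(old for _, old, _ in results) / total,
        "candidate_pass_rate": sum(new for _, _, new in results) / total,
        # Cases the current model passes but the candidate fails.
        "regressions": [p for p, old, new in results if old and not new],
    }
```

The regressions list is the part worth reading. An overall pass rate can go up while the one case your users depend on quietly breaks.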


None of this is exotic. It’s just the boring infrastructure work that separates a demo from a system. Build the boring parts first.