You pasted the ChatGPT function into your PR on Tuesday afternoon. The unit tests passed. CI was green. You shipped it. Thursday morning your pricing page started charging customers the wrong amount, and nobody can explain how every test in the file passed.
AI writes code fast. The hard part is testing AI-generated code before it breaks production. Most guides on this are either abstract checklists or thinly disguised tool ads. This one isn’t. It’s three patterns, in the order you should apply them, with code you can copy.
By the end you’ll have a decision tree, three runnable examples, and a real bug that passed twelve unit tests anyway. Let’s start with the tests you’re already writing.
Pattern 1: Unit Tests for Logic Errors and Hallucinated APIs
What unit tests catch when AI wrote the code: off-by-one errors, swapped arguments, calls to library methods that don’t exist, edge cases the model didn’t think about. The obvious-in-hindsight, invisible-in-the-diff bugs.
The shape is boring on purpose. Import the function. Feed it boundaries — empty, null, max, negative, unicode, leading whitespace. Assert the output. If the import itself fails because the AI hallucinated a method on a real library, that failure is the test working.
Here’s a function ChatGPT might write for you:
def parse_discount(code: str) -> float:
    return float(code.split("-")[1]) / 100
The test that does the work isn’t the happy path:
import pytest

def test_parse_discount_boundaries():
    assert parse_discount("SAVE-20") == 0.20
    with pytest.raises(IndexError):
        parse_discount("")            # nothing after the split
    with pytest.raises(AttributeError):
        parse_discount(None)
    assert parse_discount("SAVE20") == 0.20       # FAILS
    assert parse_discount("NEW-YEAR-20") == 0.20  # FAILS
The last two fail. The AI wrote a function that handles its own example and breaks on codes without a hyphen and codes with more than one, both real shapes you’ll see in a coupon column.
The non-obvious move: write the boundary tests before you read the AI’s code carefully. They surface assumptions the model made silently. Once you’ve read the code, you tend to test what’s there. The point of these tests is to find what isn’t.
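A reusable shape for that blind pass is a parametrized table of boundary inputs. This is a sketch, not a prescription: the module name in the import and the input list are illustrative, and the allowed-exception set is whatever your codebase counts as an explicit rejection.

import pytest
from pricing import parse_discount  # module name is illustrative

BOUNDARY_INPUTS = ["", None, "SAVE-20", "save-20", " SAVE-20 ",
                   "SAVE20", "NEW-YEAR-20", "ÜBER-20", "A" * 10_000]

@pytest.mark.parametrize("raw", BOUNDARY_INPUTS)
def test_parse_discount_declares_its_boundaries(raw):
    # The exact value matters less than forcing the function to declare
    # what it does with inputs the prompt never mentioned.
    try:
        result = parse_discount(raw)
    except (ValueError, IndexError, AttributeError, TypeError):
        return  # an explicit, expected rejection is acceptable
    assert 0.0 <= result <= 1.0

Every input that lands in the except branch is an assumption the model made silently; decide per input whether a rejection is actually the behavior you want.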
What unit tests miss: anything that needs context outside the function. Database state. Other services. The business rule that lives in three other files and never made it into your prompt. That’s where the worst bugs in LLM-generated code actually hide.
Pattern 2: Integration Tests for Silent Failures
Integration tests catch the bugs that pass unit tests on the way to production. Dependency mismatches — the AI used a library signature from two major versions ago. Missing implicit business rules — it computed a discount but didn’t check tax-exempt status because that rule lives in the customer model, not in the prompt. The class of failure TestKube calls “logic drift”: locally correct, globally wrong. This is where most teams fail to verify AI code quality.
The shape: spin up a real dependency (in-memory SQLite is plenty), seed rows that encode the rules the AI didn’t know about, run the function end-to-end, and assert against the side effect — not just the return value.
import sqlite3

def test_apply_promo_with_real_order():
    db = sqlite3.connect(":memory:")
    seed_schema(db)
    db.execute("INSERT INTO customers VALUES (1, 'Acme', 1)")  # tax_exempt
    db.execute("INSERT INTO orders VALUES (101, 1, 200.00, 'pending')")

    apply_promo(db, order_id=101, promo_code="SAVE-20")

    rows = db.execute(
        "SELECT type, amount FROM order_lines WHERE order_id=101"
    ).fetchall()
    assert ("discount", -40.00) in rows
    assert ("tax", 0.00) in rows  # exempt; the AI's version wrote 18.00
The assertion that fails isn’t the discount math. It’s the tax line. The function returns the right total in isolation and writes the wrong row to the database when the customer is tax-exempt.
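To make that test runnable, here’s one possible shape for the two helpers it assumes. The schema, the 9% rate, and the reuse of parse_discount from Pattern 1 are all illustrative; the line that matters is the tax_exempt read, which is exactly what the AI’s version skipped.

def seed_schema(db):
    # Minimal schema for the test; a real project would use its migrations.
    db.executescript("""
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, tax_exempt INTEGER);
        CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL, status TEXT);
        CREATE TABLE order_lines (order_id INTEGER, type TEXT, amount REAL);
    """)

def apply_promo(db, order_id, promo_code):
    customer_id, total = db.execute(
        "SELECT customer_id, total FROM orders WHERE id = ?", (order_id,)
    ).fetchone()
    exempt = db.execute(
        "SELECT tax_exempt FROM customers WHERE id = ?", (customer_id,)
    ).fetchone()[0]
    discount = total * parse_discount(promo_code)
    # The AI's version taxed every customer; the flag read below is the fix.
    tax = 0.0 if exempt else round((total - discount) * 0.09, 2)
    db.execute("INSERT INTO order_lines VALUES (?, 'discount', ?)", (order_id, -discount))
    db.execute("INSERT INTO order_lines VALUES (?, 'tax', ?)", (order_id, tax))
    db.commit()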
The non-obvious move: assert against side effects, not return values. AI is unusually good at returning the correct number while writing the wrong row, emitting the wrong event, or logging the wrong audit trail. Return values are what the model was looking at when it wrote the function. Side effects are where reality lives. Any honest AI code review workflow has to check both. The same principle applies to safe refactoring patterns: any code change risks violating implicit business rules that live outside the function’s scope.
Why this catches what unit tests can’t: unit tests pass when the function does what it says. Integration tests fail when the function does what it says but the saying was wrong. Logic and integration are now covered. But half the AI code shipping in 2026 ends up on a screen — and these tests can’t see a pixel.
Pattern 3: Visual Regression Tests for Rendering Bugs
What visual regression catches: AI-generated UI that compiles, type-checks, passes every logic test, and renders wrong. Wrong spacing token. Deprecated color variable still exported as an alias. Responsive breakpoint that triggers 80px too early. Flex direction reversed because the AI confused two component variants in its training data.
The shape is render-snapshot-diff. Render the component to a known viewport. Snapshot the pixels. Compare to a baseline. Tool doesn’t matter — Playwright, Storybook + Chromatic, plain Puppeteer with pixelmatch.
test("checkout renders the way design says it should", async ({ page }) => {
await page.setViewportSize({ width: 1280, height: 800 });
await page.goto("/checkout");
await expect(page).toHaveScreenshot("checkout.png");
});
Three lines of test body, and it catches more bugs on a typical AI-generated component than most developers expect, because the bug type AI produces here is uniquely invisible to logic tests. The component renders. It just doesn’t render the way the design system says it should.
The classic example: AI imports tokens.spacing.md. That token was renamed to tokens.space.4 six months ago, and the old name still exists as an alias mapped to a slightly different value. Compiles. Type-checks. Renders four pixels off. Lighthouse is fine. Nobody notices for a week.
What visual tests miss: business logic, application state, accessibility. Pair them with the first two patterns; don’t replace them. If you don’t have a frontend testing strategy yet, this is a fine place to start.
Three patterns, three bug classes. Which one do you reach for first when the PR lands at 4pm Friday?
The Decision Tree: When to Reach for Which
The rule is shorter than you’d expect.
If the AI wrote a pure function — input in, output out, no I/O, no UI — write the boundary unit tests. Five minutes. Catches hallucinated APIs and the logic errors you’d otherwise miss because reading the code makes you trust it.
If the AI wrote anything that touches a database, an external API, or another service, write the integration test with real dependencies. Don’t mock the thing you’re worried the AI got wrong. That’s mocking your concern away.
If the AI wrote a component, a template, or anything that ends up on a screen, add a visual regression on top of the logic tests. Logic tests can’t see pixels.
The trap is defaulting to unit tests for everything because they’re easiest to write. Most production AI bugs aren’t pure-function bugs. They’re integration bugs where the function does exactly what it says and the saying was incomplete.
One-line rule for your AI code verification workflow: test at the layer where the AI is most likely to be wrong, not the layer that’s easiest to mock.
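If it helps to see the tree as code, here’s a toy encoding; the two booleans are the only questions that matter, and the returned strings map onto the three patterns above.

def tests_to_write(touches_state: bool, renders_ui: bool) -> list[str]:
    # Boundary unit tests every time; they cost five minutes.
    plan = ["boundary unit tests"]
    if touches_state:
        # Database, external API, another service: use a real dependency.
        plan.append("integration test asserting the side effect")
    if renders_ui:
        # Anything that ends up on a screen.
        plan.append("visual regression snapshot")
    return plan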
The decision tree sounds clean on a slide. Does it actually hold up against AI code that broke real production?
Real Example: The Pricing Function That Passed Every Unit Test
Here’s the failure that made me write this.
An apply_promo(order, codes) function, ChatGPT-generated, twelve unit tests, all green. Shipped on a Tuesday. Customer support flagged it Thursday: a customer with two valid promo codes had been charged less than half what they owed.
The function accepted a list of codes, looped, applied each. Logically correct. The tests confirmed each path: one code applied, two codes applied, expired code skipped, invalid code rejected. Twelve assertions, all true.
The bug wasn’t in any single test. It was in the rule that lived in three other files and never made it into the prompt: best single promo wins. Promos don’t stack. The AI wasn’t told that, stacked them, and the unit tests faithfully verified the wrong behavior.
The integration test that would have caught it is two assertions long:
def test_promos_dont_stack():
    seed_order_with_two_valid_promos(order_id=42, total=200.00)
    apply_promo(order_id=42, codes=["SAVE-20", "WELCOME-10"])
    assert final_discount(order_id=42) == 40.00  # 20% of 200, the larger
    assert final_discount(order_id=42) != 60.00  # not 20% + 10%
The first assertion encodes the real business rule; the second documents the wrong behavior you’re guarding against. On the AI’s version the test fails loudly, because the model defaulted to the most common promo pattern in its training data, and the company’s pattern is the less common one.
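For contrast, a minimal sketch of what the fix looks like, assuming hypothetical promo_rate and promo_is_valid helpers; the only behavioral change from the AI’s version is max() where it had a running total.

def best_single_discount(order_total: float, codes: list[str]) -> float:
    # Business rule the prompt never mentioned: best single promo wins.
    rates = [promo_rate(c) for c in codes if promo_is_valid(c)]  # hypothetical helpers
    return order_total * max(rates, default=0.0)
    # The AI's version did: order_total * sum(rates), which stacks promos.

On the seeded order above, that returns 40.00 for ["SAVE-20", "WELCOME-10"] instead of 60.00.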
The lesson: AI code is most dangerous when it’s locally correct and globally wrong. Integration tests live at the boundary where “locally correct” stops being enough.
A briefer one on the UI side: a Copilot-generated card component imported a deprecated spacing token. Compiled. Rendered. Sat four pixels too close to the page edge on mobile. A visual snapshot would have flagged it before merge.
One real bug per pattern is enough. But you still need to know which two to write first when you only have ten minutes.
The 80/20: If You Only Have Time for Two Tests
Most teams won’t write all three patterns for every PR. That’s fine. The goal is catching 80% of AI bugs with 20% of the testing effort.
Tier 1, every time, even at 4pm Friday. One boundary unit test on the new function with empty, null, max, and one weird input the prompt didn’t mention. Five minutes. Catches hallucinated APIs and the obvious logic errors.
Tier 2, whenever the code touches state. One integration test that runs the function against a real dependency and asserts the side effect. This is the test that prevents pricing-function-style failures. It also forces you to articulate the business rule the AI didn’t know — which is half the value.
Tier 3, once per UI component, then let it run forever. A visual snapshot. Cheap to write, cheap to maintain, catches a class of bugs nothing else sees.
Write only Tier 1 and Tier 2 and you’ll catch most of what ships broken. Tier 3 closes the UI gap nothing else can.
Now the question isn’t what to test. It’s whether you’ll actually do this every time.
Make It a Habit, Not a Heroic Effort
Back to Tuesday afternoon. Tests passed. CI was green. The bug wasn’t in the function — it was in everything the AI didn’t know about your codebase, and “tests pass” stopped meaning what it used to.
The three patterns aren’t more work. They’re a redistribution of where the work happens: less debugging at 7am, more thinking at the test layer where bugs are cheap and fixes don’t need a postmortem. Testing AI-generated code isn’t optional anymore. It’s the part of the job AI still can’t do for you.
Pick one PR this week. Run the decision tree. Write the two tests. The habit forms in about a month, and the rest takes care of itself.
If you want the wider picture on how AI is reshaping reviews too, the AI code review piece covers what changes when 82% of teams use AI but quality is still slipping.