Vibe Coding: Speed, Slop, and the 80% Problem

“Vibe coding” is the style of development where you iterate quickly with an AI assistant: you describe what you want, the model generates code, you run it and maybe fix a few things, and you ship. It’s fast and feels productive. The downside is “slop”: code that works in the narrow case you tried but is brittle, inconsistent, or wrong in structure. You get to 80% of the way in 20% of the time, but the last 20% (correctness, edge cases, structure) can take 80% of the effort, or never get done.

The 80% problem is that the model is optimised for “what looks right next,” not “what is right overall.” So you get duplicate logic, missing error paths, and design drift. Tests help, but only for what you think to test. The structural issues (a wrong state machine, flag conflicts, dead code) often don’t show up until production or a deep review. Vibe coding is great for prototypes and for learning; it’s risky for production unless you add discipline: review, structural checks, and clear specs.

Speed is real. The model can draft a whole feature in minutes. The trap is treating the draft as done. The fix is to treat vibe coding as a first pass: refactor afterwards, add tests, and check structure. Some teams use the model for implementation and keep specs and architecture human-owned. Others use the model only for boilerplate and keep business logic and control flow hand-written.

Progress in LLMs will improve the 80%: fewer obvious bugs, better adherence to patterns. But the gap between “looks right” and “is right” is fundamental. Design your process so that the last 20% is explicit: who reviews, what gets checked, and what the bar for “done” is.

Expect more tooling that helps close the gap: structural checks, spec-driven generation, and better integration of tests and review into the vibe-coding loop.

nJoy πŸ˜‰

Flag Conflicts, Stuck States, and Dead Branches: The AI Code Debt Catalog

Flag conflicts happen when two (or more) boolean flags are meant to be mutually exclusive but the code allows both to be true. For example, “is_pending” and “is_completed” might both be true after a buggy transition, or “lock_held” and “released” might get out of sync. The program ends up in an inconsistent state even though no single line of code “looks” wrong. Stuck states are states that have no valid transition out: you’re in “processing” but there’s no success, failure, or timeout path. Dead branches are code paths that are unreachable after some change: perhaps an earlier condition always takes another branch. All of these are structural defects: they’re about the shape of the state space, not a typo.

AI-generated code tends to introduce these because the model adds code incrementally. It adds a new flag for a new feature and doesn’t check that it’s exclusive with an existing one. It adds a new state and forgets to add the transition out. It adds a branch that’s never taken because another branch is always taken first. Tests that only cover happy paths and a few errors won’t catch them. You need either exhaustive testing (often impractical) or a structural view (states, transitions, flags) that you check explicitly.

A simple catalogue helps when reviewing: (1) For every flag pair that should be exclusive, is there a guard or an invariant? (2) For every state, is there at least one transition out (including error and timeout)? (3) For every branch, is it reachable under some input? You can do this manually or with tooling. The goal is to make the “AI code debt” (these structural issues) visible and then fix them.
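To make the checklist concrete, here is a minimal sketch of items (1) and (2) as executable checks. The flag names, state names, and terminal states are hypothetical examples, not a prescribed vocabulary:

```python
# A minimal sketch of the review catalogue as executable checks.
# Flag names, state names, and terminal states are hypothetical.

def exclusive_flag_violations(snapshot, exclusive_pairs):
    """Return the flag pairs that are both True in this state snapshot."""
    return [(a, b) for a, b in exclusive_pairs
            if snapshot.get(a) and snapshot.get(b)]

def stuck_states(transitions):
    """Return states with no outgoing transition (including error/timeout)."""
    states = set(transitions) | {dst for dsts in transitions.values() for dst in dsts}
    terminal = {"completed", "failed"}        # assumed legitimate end states
    return [s for s in states - terminal if not transitions.get(s)]

# Example: a buggy snapshot and a machine missing a way out of "processing".
snapshot = {"is_pending": True, "is_completed": True}
pairs = [("is_pending", "is_completed")]
transitions = {"pending": ["processing"], "processing": []}

print(exclusive_flag_violations(snapshot, pairs))  # [('is_pending', 'is_completed')]
print(stuck_states(transitions))                   # ['processing']
```

The same two functions work as a review aid (run them on a reconstructed model of the code) or as a design aid (run them on the spec before generating).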

Prevention is better than cleanup: if you have a spec (e.g. a state machine or a list of invariants), generate or write code against it and then verify the implementation matches. The model is good at filling in code; it’s bad at maintaining global consistency. So the catalogue is both a review checklist and a design checklist.

Expect more linters and checkers that target flag conflicts, stuck states, and dead branches in generated code.

nJoy πŸ˜‰

Formal Reasoning Meets LLMs: Why Logic Engines Still Matter

LLMs are probabilistic: they score continuations and sample. They don’t have a built-in notion of “therefore” or “for all”; they approximate logical consistency from training data. So they can contradict themselves, miss a case in a case analysis, or add a branch that breaks an invariant. Formal reasoning engines (theorem provers, logic engines, constraint solvers) are the opposite: they deduce from rules and facts, and they can exhaustively enumerate or check. They don’t “guess” the next step; they derive it. So there’s a natural division of labour: the LLM for “how do I implement this?” and the logic engine for “is this structure sound?” or “what’s missing?”

Combining them means the LLM produces a candidate (e.g. a state machine, a patch, or a set of facts), and the logic engine checks it: are all states reachable? Is there a deadlock? Is there a state with no error transition? The engine doesn’t need to understand the domain; it reasons over the shape. That’s why people experiment with LLM + Prolog, LLM + SMT solvers, or LLM + custom rule engines. The LLM does the creative, fuzzy part; the engine does the precise, exhaustive part.
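As a rough sketch of the division of labour, here is the “engine” side: exhaustive graph checks over a candidate state machine (hand-written here; in the pipeline it would come from the LLM). The state names and the choice of “failed” as the error state are assumptions for illustration:

```python
# Sketch: the "engine checks the shape" step. Given a candidate state
# machine, derive structural facts exhaustively. Names are hypothetical.

from collections import deque

def unreachable_states(transitions, start):
    """Breadth-first search from the start state; return what is never reached."""
    seen, queue = {start}, deque([start])
    while queue:
        for nxt in transitions.get(queue.popleft(), []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    all_states = set(transitions) | {d for ds in transitions.values() for d in ds}
    return sorted(all_states - seen)

def states_without_error_exit(transitions, error_states=frozenset({"failed"})):
    """Non-terminal states whose outgoing transitions never include an error state."""
    return sorted(s for s in transitions
                  if transitions[s] and not set(transitions[s]) & error_states)

machine = {
    "queued":   ["running"],
    "running":  ["done"],          # no path to "failed": flagged below
    "done":     [],
    "orphaned": ["done"],          # never reached from "queued"
}
print(unreachable_states(machine, "queued"))   # ['orphaned']
print(states_without_error_exit(machine))      # ['orphaned', 'queued', 'running']
```

The engine doesn’t know what “queued” means; it only reasons over the edges, which is exactly why it can be exhaustive.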

The challenge is translation: getting from code or natural language to a form the engine can reason about. That might be manual (you write the spec) or semi-automated (the LLM proposes a formalization and the engine checks it). Once you have a formal model, the engine can find the unknown unknowns that the LLM cannot see.

We’re not yet at “LLM writes the spec and the engine verifies the code” in one shot. But we’re at “use the LLM to draft, use the engine to check the draft or the structure.” That’s already valuable and will get more so as tooling improves.

Expect more research and products that pair LLMs with formal or logic-based back ends for verification and structural analysis.

nJoy πŸ˜‰

The Unknown Unknown: Structural Bugs That LLMs Cannot Find

Some bugs are “unknown unknowns”: you didn’t know to test for them because they’re structural, not in a single line. A state that has no way out. A branch that’s unreachable after a refactor. Two flags that can both be true. A resource that’s acquired but never released in one path. The code might run fine in the scenarios you thought of; the bug only appears when the right (wrong) combination of state and events happens. Traditional tests often miss these because they’re written for known behaviours and known paths.

LLMs are especially prone to introducing unknown unknowns. They add code that “looks right” (correct syntax, plausible logic), but they don’t have a global view of the system. They don’t know that the new branch they added never connects to the error handler, or that the flag they set is mutually exclusive with another flag used elsewhere. So they generate local correctness and global inconsistency. You only discover it when something breaks in production or when you do a structural review.

Finding unknown unknowns requires a different kind of check: not “does this test pass?” but “is the structure coherent?” That can mean: enumerate states and transitions and check every state has a path out; check that every branch is reachable; check that no two flags can be true together when they shouldn’t; check that every acquire has a release on all paths. Those are queries over the shape of the program, not over one execution.
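One of these shape queries, “no two flags can be true together when they shouldn’t,” can be sketched as a brute-force sweep of the whole flag space rather than a check of one execution. The flag names and invariants below are hypothetical:

```python
# Sketch: checking invariants over the whole flag space rather than one
# execution. Flag names and the invariants are hypothetical examples.

from itertools import product

FLAGS = ["is_pending", "is_completed", "lock_held", "released"]

def invariant(s):
    # Intended rules: pending/completed are exclusive; a released lock
    # cannot still be held.
    if s["is_pending"] and s["is_completed"]:
        return False
    if s["lock_held"] and s["released"]:
        return False
    return True

def reachable(s):
    # Stand-in for "can the code actually reach this combination?".
    # In a real check this would come from the transition logic; here we
    # assume every combination is reachable, the worst case.
    return True

bad = [s for values in product([False, True], repeat=len(FLAGS))
       for s in [dict(zip(FLAGS, values))]
       if reachable(s) and not invariant(s)]
print(len(bad))   # 7 of the 16 combinations violate the invariants
```

Sixteen combinations is trivial; with many flags the space explodes, which is where solvers and model checkers replace brute force.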

Tools that do this exist in various forms (static analysis, model checkers, custom oracles). The point is to run them after generation, not to assume the model got the structure right. The model is good at “what to write”; it’s bad at “what’s missing.”

Expect more integration of structural checks into dev and CI, and more patterns for “generate then verify shape.”

nJoy πŸ˜‰

Van der Aalst’s 43 Workflow Patterns and What They Mean for AI-Generated Code

Researchers in workflow and process mining (van der Aalst, Russell et al.) catalogued 43 control-flow patterns: ways that tasks can be sequenced, split, merged, cancelled, and looped. The basics are Sequence (A then B), Parallel Split (A then B and C in parallel), Synchronisation (wait for B and C then D), Exclusive Choice (A then B or C but not both), and Simple Merge (B or C then D). From there you get multi-choice, discriminator, deferred choice, multiple instances, cancellation (task, case, region), structured loops, and more. The full list is a reference for “what kind of flow am I building?”

For AI-generated code the relevance is this: the model often implements one or two of these patterns (e.g. a sequence or a simple branch) but misses the rest. It might add a parallel split and forget the synchronisation. It might add a retry loop but not the cancel path. It might create multiple instances without a join. So the generated code can look right locally but violate the pattern, and that’s when you get stuck states, lost work, or races. Knowing the 43 patterns gives you a checklist: after the model generates code, which pattern is it trying to implement? Is the full pattern there?
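As an illustration of “the full pattern or only part of it,” here is a sketch of Parallel Split followed by Synchronisation in plain Python. The task functions are hypothetical; the point is that the join (waiting for both results) is as much part of the pattern as the split:

```python
# Sketch of one complete pattern from the catalogue: Parallel Split
# followed by Synchronisation. Task functions are hypothetical.

from concurrent.futures import ThreadPoolExecutor

def task_b():
    return "b-done"

def task_c():
    return "c-done"

def run_pattern():
    with ThreadPoolExecutor() as pool:
        fb = pool.submit(task_b)   # Parallel Split: B and C start together
        fc = pool.submit(task_c)
        results = (fb.result(), fc.result())  # Synchronisation: wait for BOTH
    return results                 # only now may the next task proceed

print(run_pattern())  # ('b-done', 'c-done')
```

A generated version that submits the tasks but forgets one of the `.result()` calls is the “half a pattern” failure the review question is meant to catch.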

You don’t have to implement all 43. Many systems only need a few: sequence, choice, maybe a retry or a timeout. But having the vocabulary helps. When you prompt the model (“add a retry with exponential backoff”), you’re asking for a specific pattern. When you review, you can ask “did we get the full pattern or only part of it?”

Formal workflow languages (BPMN, etc.) encode these patterns explicitly. In code they’re implicit. The gap is where bugs hide. Making the pattern explicit (in a spec or a diagram) and then checking the code against it is one way to keep AI-generated code structurally sound.

Expect more tooling that maps code to these patterns and flags incomplete or inconsistent implementations.

nJoy πŸ˜‰

State Machines as Software DNA: The Hidden Architecture of Every System

Most non-trivial software has an implicit state machine: entities (orders, jobs, sessions) move through stages, and only certain transitions are valid. A payment can be pending, then completed or failed; a job can be queued, running, or done. We don’t always draw the machine (it’s buried in if/else and flags), but it’s there. That hidden structure is the “DNA” of the system: it determines what can happen, what can’t, and what we might forget (e.g. the path from “running” to “cancelled”).

Making the state machine explicit (states, transitions, guards) pays off. You can see dead ends, missing transitions, and inconsistent flags. You can generate tests that cover every transition or every state. You can document and review the behaviour in one place. Many bugs in production come from the code drifting away from the intended machine: a new state was added in one place but not another, or a transition was forgotten in an error path.

State machines don’t have to be fancy. A table (state Γ— event β†’ next state) or a small DSL is enough. The point is to have a single source of truth for “what states exist and what transitions are allowed.” Code then implements that; tests and tools can check the implementation against the spec. When an LLM generates code, it’s implementing (or extending) an implicit machine; if the machine were explicit, you could check the model’s output against it.
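A sketch of such a table, with hypothetical payment-style states and events:

```python
# A minimal sketch of "state x event -> next state" as a single source of
# truth. States and events are hypothetical payment-style examples.

TRANSITIONS = {
    ("pending", "authorise"): "completed",
    ("pending", "decline"):   "failed",
    ("pending", "cancel"):    "cancelled",
}

def step(state, event):
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"invalid transition: {state} + {event}") from None

print(step("pending", "authorise"))   # completed
# step("completed", "authorise") raises ValueError: the table, not
# scattered if/else, decides what is allowed.
```

Everything else (tests per transition, diagrams, checks on generated code) can be derived from the one dictionary.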

In legacy code the machine is often undocumented. You can reverse-engineer it (manual or with tooling) and then maintain it. Going forward, designing the state machine first and then writing or generating code to match it is a way to keep structure stable even when the model is additive.

Expect more tooling that extracts or checks state machines from code, and more patterns for “spec the machine, then implement.”

nJoy πŸ˜‰

The Additive Trap: Why LLMs Build Up but Rarely Clean Up

LLMs are great at adding: new features, new branches, new error messages. They’re bad at removing or simplifying. When you ask for a change, they tend to append code or add another condition rather than delete dead paths or consolidate duplicates. That’s partly training (most edits in the wild are additive) and partly the nature of autoregressive generation: you’re always “continuing” the text, not rewriting it. So the codebase drifts: more branches, more flags, more special cases, and the implicit state machine (what states exist, what transitions are valid) slowly diverges from what you thought you had.

The additive trap shows up in control flow: you add a new state or transition and forget to add the corresponding cleanup, timeout, or error path. Or you add a new “success” path but the old “failure” path now leads nowhere. The model doesn’t reason over the full graph; it fills in the local request. So you get stuck states, unreachable code, or two flags that can both be true when they shouldn’t be. Tests that only cover the happy path won’t catch these; you need a view of the structure.

What would help: tools or disciplines that force a “structure pass.” After the model suggests a change, something checks: are all states covered? Are there new branches with no error handling? Are there conflicting flags? That could be a linter, a custom checker, or a formal spec that you diff against. The key is to treat “shape” as a first-class concern, not just “does it run in one scenario.”
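A “structure pass” of this kind can be as simple as diffing the implemented transition graph against the spec after each model-suggested change. Both machines below are hypothetical:

```python
# Sketch of a "structure pass": diff the implemented transitions against
# a spec after each model-suggested change. Both machines are hypothetical.

def structure_diff(spec, impl):
    spec_edges = {(s, d) for s, ds in spec.items() for d in ds}
    impl_edges = {(s, d) for s, ds in impl.items() for d in ds}
    return {
        "missing": sorted(spec_edges - impl_edges),  # spec'd but not implemented
        "extra":   sorted(impl_edges - spec_edges),  # added without a spec change
    }

spec = {"queued": ["running"], "running": ["done", "failed"]}
impl = {"queued": ["running"], "running": ["done"], "done": ["archived"]}

print(structure_diff(spec, impl))
# {'missing': [('running', 'failed')], 'extra': [('done', 'archived')]}
```

“Extra” edges aren’t always wrong, but each one is exactly the kind of silent addition the trap produces, so each should force either a spec update or a revert.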

Until we have that, the best mitigation is to use the model for small, localized edits and to do structural review yourself. When you add a state or a branch, explicitly ask: what’s the reverse path? What cleans up? What happens on failure? The model won’t ask for you.

Expect more research and tooling on “structural correctness” of generated code and on ways to make the additive trap visible and fixable.

nJoy πŸ˜‰

Hallucination in Code: Why AI Writes Bugs It Cannot See

When an LLM writes code, it can produce something that looks right and even runs in a narrow test but is wrong in structure: wrong API, wrong assumption about state, or a path that never gets tested. The model doesn’t “see” the full codebase or the spec; it predicts the next token. So it can add a happy path and forget the error path, or introduce two flags that can both be true in a bad combination, or leave a resource open. Those are structural bugs (bugs in the shape of the program), not simple typos. Tests that only cover the happy path won’t catch them.

Why the model writes bugs it cannot see: it has no formal model of the system. It doesn’t know “every state must have an error transition” or “this lock must be released.” It only knows statistical patterns from training code. So it tends to add and rarely to delete or refactor. It fills in the obvious next step and often misses the edge case or the cleanup. That’s the additive trap in code form.

Mitigations: use the model for drafts and then review. Run static analysis, linters, and tests that cover failure paths. In critical areas, keep the model on a short leash: generate small patches, run tests after each, and require human sign-off for structural changes. Some teams use formal specs or state-machine descriptions and then ask the model to implement against them: the spec is the source of truth, and the model is the implementer.

Hallucination in code is a special case of “confident and wrong”: the code compiles, maybe even runs once, but the design is broken. The fix is the same as for other hallucinations: don’t trust the output without verification. For code, verification means tests, review, and a clear model of what “correct” means.

Expect more tooling that checks generated code against specs or structural rules, and more patterns for “model proposes, system verifies.”

nJoy πŸ˜‰

Grounding Strategies: RAG, Structured Outputs, and Tool Use

Grounding means tying the model’s output to something external: retrieved documents, tool results, or a strict schema. RAG (retrieval-augmented generation) is the most common: you have a corpus (docs, code, KB), you run a query (user question or embedding), you retrieve the top-k chunks, and you put them in the prompt. The model is then “grounded” in those chunks: it’s supposed to answer from them rather than from memory. It can still hallucinate (e.g. mix chunks or add detail), but the ceiling is lower.
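The retrieval step can be sketched with a toy bag-of-words scorer; real systems use embeddings and vector stores, but the shape (score the corpus, take the top-k, put the chunks in the prompt) is the same. The corpus, document ids, and query are made up:

```python
# A minimal RAG sketch with a toy bag-of-words retriever. Real systems
# use embeddings; the shape (retrieve top-k, then prompt with the chunks)
# is the same. Corpus and query are hypothetical.

CORPUS = {
    "doc-1": "invoices are due within 30 days of issue",
    "doc-2": "refunds are processed within 5 business days",
    "doc-3": "the office cat is named Turing",
}

def retrieve(query, k=2):
    q = set(query.lower().split())
    scored = sorted(CORPUS.items(),
                    key=lambda kv: len(q & set(kv[1].split())),
                    reverse=True)
    return scored[:k]

chunks = retrieve("when are refunds processed")
prompt = "Answer ONLY from these sources:\n" + "\n".join(
    f"[{doc_id}] {text}" for doc_id, text in chunks)
print(chunks[0][0])   # doc-2
```

The instruction to answer only from the listed sources is what lowers the ceiling; the retriever is what decides whether the right material is even in reach.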

Structured outputs force the model to fill a schema (e.g. JSON with fields like “answer”, “confidence”, “sources”). That doesn’t guarantee truth, but it makes parsing and downstream checks possible. You can require a “sources” array and then validate that each source exists. You can run the answer through a checker (e.g. a query against a DB) before showing it to the user.
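A sketch of such a downstream check: require a “sources” array and validate each entry against the corpus before trusting the answer. The schema fields and source ids are hypothetical:

```python
# Sketch of a downstream check on a structured output: require a "sources"
# array and validate each entry against the corpus before trusting the
# answer. The schema and source ids are hypothetical.

import json

KNOWN_SOURCES = {"doc-17", "doc-42"}   # ids actually present in the corpus

def validate_answer(raw):
    data = json.loads(raw)
    for field in ("answer", "sources"):
        if field not in data:
            return False, f"missing field: {field}"
    unknown = [s for s in data["sources"] if s not in KNOWN_SOURCES]
    if unknown:
        return False, f"unknown sources: {unknown}"   # likely hallucinated
    return True, "ok"

good = '{"answer": "42", "sources": ["doc-17"]}'
bad  = '{"answer": "43", "sources": ["doc-99"]}'
print(validate_answer(good))   # (True, 'ok')
print(validate_answer(bad))    # (False, "unknown sources: ['doc-99']")
```

The schema doesn’t make the answer true, but it makes this kind of mechanical rejection possible, which free-form text does not.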

Tool use is another form of grounding: instead of the model “remembering” or inventing a fact, it calls a tool (search, API, DB) and you inject the result. The model reasons over the result but doesn’t invent the result itself. So grounding strategies are: (1) put real data in context (RAG), (2) constrain the form of the answer (structured output), (3) get data via tools and let the model interpret it. Often you combine them.

The tradeoff is cost and latency: RAG and tools add retrieval and API calls; structured output can require more tokens or multiple turns. But for any application where correctness matters, grounding is the only reliable path. Unconstrained generation is for draft and exploration; grounding is for production.

Expect more tooling around RAG quality (better retrieval, chunking, and attribution) and tighter integration of tools and structured output in APIs.

nJoy πŸ˜‰

Confident and Wrong: The Anatomy of an LLM Hallucination

A hallucination is often confident: the model states something wrong with no hedging, in the same tone it uses for correct answers. That’s because the surface form (grammar, style, “authoritative” phrasing) is what the model is optimised for; it doesn’t have a separate channel for “I’m unsure.” So you get “The capital of Mars is Olympus City” or a fake study citation that looks real. The anatomy of such an error: the model chose a high-probability continuation that fits the prompt and prior tokens, and that continuation happened to be false.

Confidence and wrongness can combine in dangerous ways. In code, the model might invent an API that doesn’t exist or a parameter that sounds right but isn’t. In medicine or law, a confident wrong answer can be worse than “I don’t know.” The user often can’t tell the difference until they verify, and many users don’t verify. So the harm is in the pairing: wrong + confident.

Some models are being tuned to hedge or say “I’m not sure” when they’re uncertain, but that’s a band-aid: the model still doesn’t have access to ground truth. The better approach is to not rely on the model’s self-assessment. Use retrieval, tools, and human checks for anything that must be correct. Treat confident-sounding output as “draft” until verified.

In UX you can nudge users: “Always verify facts and code.” In system design you can add guardrails: require citations, or run generated code in a sandbox and check the result. The goal is to make the cost of trusting a hallucination visible and to make verification easy.
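The sandbox guardrail can be sketched as “run the snippet in a subprocess with a timeout and compare the result” before anything leaves draft status. The generated snippet and expected value here are stand-ins for whatever your pipeline produces and checks:

```python
# Sketch of a "run it before you trust it" guardrail: execute a generated
# snippet in a subprocess with a timeout and check the result, instead of
# trusting confident-sounding output. Snippet and expectation are stand-ins.

import subprocess
import sys

generated = "print(sum(range(10)))"   # pretend this came from the model
expected = "45"

proc = subprocess.run([sys.executable, "-c", generated],
                      capture_output=True, text=True, timeout=5)
verified = proc.returncode == 0 and proc.stdout.strip() == expected
print(verified)   # True: only now does the output leave "draft" status
```

A real sandbox would also restrict filesystem and network access; the timeout and exit-code check are just the minimum that makes “verified” mean something.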

Expect more work on uncertainty signaling and citation, but the core lesson remains: confidence and correctness are not the same. Design for that.

nJoy πŸ˜‰