2024-05-01 → Present

AI/LLM Engineering

From AI-assisted coding experiments to building a software engineering harness and shipping production LLM integrations.

Duration
2 years
Key Technologies
Claude API · Claude Code · MCP · Snowflake Cortex · Vega-Lite · Elixir · Phoenix · Blazor · Boundary
Key Impact
Built a software engineering harness that generates production apps in days, and shipped LLM-powered analytics at OpenForce

From Experiments to Production

This journey started with coding alongside LLMs and quickly evolved into something bigger: figuring out how to make AI-assisted development actually reliable. Along the way I built a software engineering harness, shipped LLM-powered analytics at OpenForce, and learned that the hard part isn't generation — it's verification.

The Generaite Labs Era

Started by experimenting with different languages, frameworks, and architectures for AI-assisted development. Built apps in Rust, C#, Python, and TypeScript. Created synthesized documentation frameworks and architectural templates. Discovered that C# was workable for code generation but becomes so heavy and verbose at scale that it erodes the benefits. Built AstralMCP (a .NET MCP-to-REST adapter) and PowerPort (multi-tenant repo management) as proving grounds.

The key insight from this phase: the more ambitious your goals, the higher the level of abstraction you need to operate at. Stop thinking about implementation and start thinking about the process of implementation. That insight became the foundation for everything after.

Harness Engineering: CodeMySpec

Nobody was building what I wanted — a tool that injects the full software engineering process into AI-assisted development. So I built it. CodeMySpec is a Claude Code extension that orchestrates requirements, architecture, BDD specs, code generation, and QA for Elixir/Phoenix/LiveView applications.

Why One Stack

Constrained to Elixir/Phoenix/LiveView because there's essentially one way to do things in the Elixir ecosystem. Environmental complexity stays inside the application: no Kubernetes, no Docker, no microservices. That keeps the problem tractable for the model, and the Boundary library enforces the architecture at compile time.

The Extension

Tried VS Code extension, TUI, CLI — none worked. Claude Code extension did. Ships as a Burrito-packaged binary with MCP server, hooks, agents, skills, and knowledge base. Walks you through a fixed workflow from stories to deployed code.

What I Learned Building With It

Fuellytics

Fleet fuel card fraud detection. Stripe Issuing/Treasury/Connect, Twilio SMS, Claude Vision OCR, 5-validator fraud pipeline. 55 commits, 5 active days, 22 bounded contexts, zero human-written code. In UAT pending Stripe production approval.
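The validator-pipeline shape is easy to sketch. A minimal Python sketch, with two hypothetical validators standing in for the real five (the production pipeline is Elixir):

```python
# Each validator inspects a transaction and returns a flag reason, or None.
def gallons_exceed_tank(txn):
    if txn["gallons"] > txn["tank_capacity_gallons"]:
        return "gallons_exceed_tank_capacity"

def amount_exceeds_threshold(txn):
    if txn["amount_cents"] > 50_000:  # illustrative threshold
        return "amount_exceeds_threshold"

VALIDATORS = [gallons_exceed_tank, amount_exceeds_threshold]

def run_fraud_pipeline(txn):
    """Run every validator; flag the transaction if any returns a reason."""
    flags = [flag for v in VALIDATORS if (flag := v(txn)) is not None]
    return {"flagged": bool(flags), "flags": flags}
```

Running all validators (rather than short-circuiting on the first hit) means a flagged transaction carries every reason, which matters when a human reviews it.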

MetricFlow

Cross-platform ad analytics correlating Google Ads, Facebook Ads, GA4 with QuickBooks revenue. Pearson + time-lagged cross-correlation engine. Claude-powered insights. 40 commits, 13 days, 12 contexts, 6 data source integrations.
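The core idea of the correlation engine, sketched in stdlib Python (the real engine's data handling is more involved; this only shows Pearson over lag-shifted series):

```python
from statistics import mean, stdev

def pearson(xs, ys):
    """Sample Pearson correlation coefficient between two equal-length series."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return cov / (stdev(xs) * stdev(ys))

def lagged_correlation(spend, revenue, max_lag=7):
    """Pearson r between daily spend and revenue delayed by 0..max_lag days.

    Returns (lag, r) for the lag with the strongest correlation.
    """
    best = None
    for lag in range(max_lag + 1):
        x = spend[:len(spend) - lag] if lag else spend
        r = pearson(x, revenue[lag:])
        if best is None or r > best[1]:
            best = (lag, r)
    return best
```

The time-lagged pass is what surfaces delayed effects: ad spend on Monday showing up as revenue on Thursday correlates poorly at lag 0 but strongly at lag 3.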

The Velocity Problem

The hard part of AI-generated code isn't generating it. Generation is trivial now. The hard part is managing the velocity. CodeMySpec can produce a full context — schema, repository, LiveView, tests, BDD specs — in minutes. Let it run unchecked and you get 100,000 lines of code that compiles, passes its own tests, and doesn't actually work.

The agents built Potemkin villages. They'd catch a FunctionClauseError, wrap it in a try/rescue, show a "success" flash message, and move on. The QA agent would see the flash and mark the scenario as passing: two agents collaborating to produce passing tests over broken functionality. The fix: QA must test outcomes, not UI elements.
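The shape of the anti-pattern, sketched in Python (the real code was Elixir/LiveView; names here are illustrative):

```python
def validate_photo(photo):
    if not photo.get("bytes"):
        raise ValueError("empty upload")

def clear_flag_potemkin(driver, photo):
    """The anti-pattern: swallow the failure, render success anyway."""
    try:
        validate_photo(photo)
        driver["flagged"] = False          # only runs on a valid photo
    except ValueError:
        pass                               # failure silently swallowed
    return "Photo submitted successfully"  # flash shown no matter what

# A QA agent that reads the flash message sees success. One that checks
# the outcome -- driver["flagged"] -- sees the bug.
```

UI-level checks pass on this code every time; only an assertion against the state it was supposed to change catches the broken path.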

Testing Layers for AI Code

Unit Tests

Catch implementation errors. Don't know what the user wanted.

BDD Specs

Catch requirement misunderstandings. Don't test the running app.

Story QA

Catches bugs in the real environment. Doesn't test cross-feature paths.

Journey QA

End-to-end flows across features. Catches seam bugs between contexts.

Skip any one layer and that entire category of bug ships to production. Fuellytics had a fraud vulnerability where flagged drivers could clear their flag without submitting photos — the BDD spec explicitly passed, but the QA agent caught it.

LLM Integration at OpenForce

Separate from CodeMySpec, I built production LLM integrations at OpenForce as a data engineer:

Sparqy

AI analytics platform. A natural-language question goes to Snowflake Cortex (with TPC-H semantic schemas), which returns SQL plus a Vega-Lite spec. Claude orchestrates dashboard layout via MCP tools, and Blazor components render dynamically from the LLM-generated JSON.
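A minimal sketch of that flow's shape in Python, with `text_to_sql` and `run_sql` as hypothetical stand-ins for the Cortex and Snowflake calls (the real system produces richer specs and routes layout decisions through Claude):

```python
def answer_question(question, text_to_sql, run_sql):
    """NL question -> SQL -> rows -> a renderable Vega-Lite spec."""
    sql = text_to_sql(question)      # Snowflake Cortex in the real system
    rows = run_sql(sql)              # list of dicts, one per result row
    # Naive encoding choice for the sketch: first column on x, second on y.
    x, y = list(rows[0].keys())[:2]
    spec = {
        "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
        "data": {"values": rows},
        "mark": "bar",
        "encoding": {
            "x": {"field": x, "type": "nominal"},
            "y": {"field": y, "type": "quantitative"},
        },
    }
    return {"sql": sql, "spec": spec}  # client renders the spec as JSON
```

Returning a declarative spec rather than an image is what makes the dynamic client-side rendering possible: the front end only needs a generic component that accepts LLM-generated JSON.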

LLM Query Validation

Validated 170+ report migrations from Tennacle to Snowflake. Ran each production query on the Bastion, saved the results to Postgres, ran the translated query against Snowflake, and used an LLM to iterate on the Snowflake query until the results matched.
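The validation loop's shape, sketched in Python with the legacy run, the Snowflake run, and the LLM translation abstracted as hypothetical callables:

```python
def migrate_query(legacy_sql, run_legacy, run_snowflake, translate, max_rounds=5):
    """Iterate on a Snowflake translation until its results match production."""
    expected = run_legacy(legacy_sql)          # baseline from the legacy system
    candidate = translate(legacy_sql, feedback=None)
    for _ in range(max_rounds):
        actual = run_snowflake(candidate)
        if actual == expected:
            return candidate                   # results match: migration verified
        # Feed a summary of the mismatch back to the LLM for the next attempt.
        diff = {"expected_rows": len(expected), "actual_rows": len(actual)}
        candidate = translate(legacy_sql, feedback=diff)
    raise RuntimeError(f"results never converged after {max_rounds} rounds")
```

The key property is that the LLM is never trusted on its own output: every candidate query is judged against real production results, not against the model's opinion of correctness.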

Where It's Going

CodeMySpec has 3 customers and ~$3,500 in committed revenue. Two production apps built with it. The core insight holds: constrain the stack, inject the software engineering process, verify relentlessly. The 90% that works is remarkable. The 10% that doesn't will always need a human clicking through it.

Harness engineering — the discipline of making AI development reliable through structured workflows, progressive disclosure, validation hooks, and stop-and-verify loops — is what I'm focused on now. The models will keep getting better. The harness is what makes that power usable.
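A stop-and-verify loop, reduced to a sketch in Python (names and structure are illustrative, not CodeMySpec internals):

```python
def run_with_verification(task, generate, validators, max_attempts=3):
    """Never accept generated output unvalidated: generate, check, retry or stop."""
    failures = []
    for attempt in range(1, max_attempts + 1):
        output = generate(task)
        failures = [msg for check in validators
                    if (msg := check(output)) is not None]
        if not failures:
            return output                        # verified: safe to proceed
        task = {**task, "feedback": failures}    # feed failures back to the model
    raise RuntimeError(f"unverified after {max_attempts} attempts: {failures}")
```

The stop is the point: when validation keeps failing, the harness halts and escalates rather than letting unverified output compound into the next stage.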

Key Learnings

Generation is trivial — managing the velocity is the hard part

No single testing layer catches everything in AI-generated code

Agents build Potemkin villages — QA must test outcomes, not UI elements

Constrain the problem space aggressively before trying to automate it

Harness engineering is the discipline of making AI development reliable