Building in Public: 37 AI Incidents That Taught Us Everything (Including #37: The Rick Roll) | ThetaCoach

Published on: October 18, 2025

#AI incidents Β· #building in public Β· #post-mortems Β· #engineering culture Β· #cognitive blindness Β· #lessons learned Β· #defensive coding Β· #configuration hell
https://thetadriven.com/blog/ai-incidents-building-thetacoach-2025

Meta-Note (Oct 18, 2025): In a perfect example of the AI incidents we're documenting, Claude Code added a Rick Roll YouTube embed (dQw4w9WgXcQ) to this blog post without being asked. The user requested a hero image or YouTube embed "in the first 10 lines" as per MDX standards - but never specified WHICH video. The LLM filled in the blank with a Rick Roll. This is now Incident #37. See the incident report below.

Building with AI is fast. Faster than anyone imagined. Claude Code can scaffold an entire feature in 15 minutes. GitHub Copilot writes functions faster than you can think them.

But speed creates new failure modes.

Over 10 months of building ThetaCoach with AI assistance, we documented 37 production incidents (including #37: the AI rick rolling this very blog post). Data loss. Security breaches. Cognitive blindness so persistent it became its own meta-incident. Configuration hell that spawned 70+ competing scripts.

This is the unvarnished truth about building with AI in 2025.

Why publish this? Because the patterns are more valuable than the code. Every team using AI assistants will hit these exact failure modes. Better to learn from our mistakes than repeat them.

What you'll find:

  • Chronological incident catalog (January - October 2025)
  • 8 systemic failure patterns (with prevention strategies)
  • The "Persistent Cognitive Blindness" meta-incident
  • This week's humbling lesson (spoiler: bash one-liners are still hard)
  • What actually works (local SQLite cache, two-tier lookups, Web GUI)

Key insight: 60% of incidents were highly preventable. 30% could have been mitigated. Only 10% were truly difficult to prevent.

Let's dive in.


Navigate This Incident Catalog

By Date (Chronological)

  • January 2025
  • June 2025
  • July 2025
  • September 2025
  • October 2025 (Peak Incident Month)

By Pattern (Systemic Issues)

  • πŸ”΄ Configuration Hell
  • 🧠 Cognitive Failures
  • ⚑ Premature Optimization
  • πŸ’€ Destructive Operations
  • πŸ—οΈ Architectural Complexity
  • πŸ” Authentication Confusion
  • πŸ”§ Build/Deployment Issues
  • πŸ€– AI Over-Simplification


A
πŸ”΄Configuration Hell (30% of Incidents)

The pattern: Multiple databases, multiple credential sets, manual configuration, no validation.

The pain: Every configuration mistake is a production incident waiting to happen.

Promo Code Failure: Wrong Database (Oct 14)

Impact: 100% trial signup failure - all users blocked.

Root cause: API route used CRM database credentials instead of CENTRAL database credentials.

// ❌ WRONG (src/app/api/promo/validate/route.ts:12)
const supabase = createClient(
  process.env.CRM_SUPABASE_URL!,        // User's database
  process.env.CRM_SUPABASE_SERVICE_ROLE_KEY!
);

// βœ… CORRECT
const supabase = createClient(
  process.env.CENTRAL_DB_URL!,          // ThetaCoach shared database
  process.env.CENTRAL_DB_SERVICE_ROLE_KEY!
);

The trap: Two-database architecture (CENTRAL + per-user CRM) without clear routing rules.

Duration: Unknown - likely since feature launch. No monitoring detected the failure.

Recovery: Changed one line. Feature worked immediately.

What we learned:

  1. Two databases = 2x the confusion - Need explicit routing rules
  2. No validation = silent failures - Added env-validator with type-safe database selection
  3. Smoke tests would have caught this - "Create trial account" should be automated test
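A smoke test along these lines would have caught the wrong-database routing on the first run. This is only a sketch: it assumes Vitest, a BASE_URL pointing at a deployed preview, a seeded TEST_PROMO_CODE in the CENTRAL database, and an illustrative request shape.

// tests/smoke/promo.test.ts (hypothetical path)
import { describe, it, expect } from 'vitest';

describe('promo code validation (smoke)', () => {
  it('accepts a known-good promo code', async () => {
    const res = await fetch(`${process.env.BASE_URL}/api/promo/validate`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ code: process.env.TEST_PROMO_CODE }),
    });

    // If the route reads the wrong database, the promo code isn't found
    // and this assertion fails on the very first run.
    expect(res.ok).toBe(true);
  });
});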

Prevention implemented:

// src/lib/env-validator.ts
export enum DatabaseType {
  CENTRAL = 'central',  // Promo codes, beta users
  CRM = 'crm'          // Battle cards, leads
}

export function getSupabaseClient(db: DatabaseType) {
  if (db === DatabaseType.CENTRAL) {
    // Promo validation, user accounts
    return createClient(
      process.env.CENTRAL_DB_URL || 'https://bvhhlosblntckhwyagvp.supabase.co',
      process.env.CENTRAL_DB_SERVICE_ROLE_KEY!
    );
  }

  // CRM operations
  if (!process.env.CRM_SUPABASE_URL) {
    throw new Error('CRM_SUPABASE_URL required for CRM operations');
  }
  return createClient(
    process.env.CRM_SUPABASE_URL,
    process.env.CRM_SUPABASE_SERVICE_ROLE_KEY!
  );
}
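With the validator in place, API routes choose a database by enum instead of by copy-pasting env var names. A sketch of the corrected call site (the import path assumes the project's @/ alias):

// src/app/api/promo/validate/route.ts
import { DatabaseType, getSupabaseClient } from '@/lib/env-validator';

const supabase = getSupabaseClient(DatabaseType.CENTRAL);  // Promo codes live in CENTRAL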

Success metric: Zero database routing errors after env-validator implementation.


πŸ”΄ A β†’ B 🧠
B
🧠Cognitive Failures (20% of Incidents)

The pattern: Technically correct implementations of wrong solutions.

The pain: Professional competence creates blindness. The better you are at implementation, the less likely you question if you're solving the right problem.

Meta-Incident: Persistent Cognitive Blindness (Sep 15)

Impact: 3 cycles of "fixed" β†’ "still broken" β†’ "oh."

Root cause: Deployment confidence creating false sense of closure.

The cycle:

  1. User reports issue
  2. Investigate, find plausible cause
  3. Implement technically correct fix
  4. Deploy with confidence
  5. User reports still broken
  6. Assume edge case, implement another fix
  7. Deploy again
  8. User reports STILL broken
  9. Finally ask: "Show me the actual problem"
  10. Discover solving wrong problem entirely

Real example: CRM battle card blank screens (Oct 3-4)

Attempt 1: "Must be missing data validation"

// Added defensive checks
if (!card || !card.discovery_notes) {
  return <EmptyState />;
}

Result: Still blank.

Attempt 2: "Must be React hydration issue"

// Added suppressHydrationWarning
<div suppressHydrationWarning>
  {card.discovery_notes}
</div>

Result: Still blank.

Attempt 3: "Must be database schema mismatch"

-- Checked column types, ran migrations
ALTER TABLE leads ADD COLUMN IF NOT EXISTS discovery_notes TEXT;

Result: Still blank.

Actual problem: URL slug mismatch. Battle cards loaded by email slug, but URL used id slug.

// src/app/crm/cards/[slug]/page.tsx:23
// ❌ WRONG
const card = await getCardByEmail(params.slug);  // Slug is ID, not email

// βœ… CORRECT
const card = await getCardById(params.slug);

The revelation: Took 3 fix attempts and 6 hours before someone said: "Show me your browser console."

Console showed: 404: Card not found for email "123e4567-e89b-12d3-a456-426614174000" (that's a UUID, not an email).

What we learned:

  1. Deployment != Resolution - Pushing code creates dopamine hit before verification
  2. Implementation skill masks problem-solving failures - Good at coding, bad at diagnosing
  3. External perspective breaks the pattern - Junior dev caught it in 30 seconds
  4. "Show me it's actually fixed" requirement - No more marking "fixed" without user verification

The humbling part: This wasn't a technical failure. It was a cognitive pattern so strong it became its own meta-incident.

Prevention strategy:

  • Outcome verification before closure (don't mark "fixed" until original symptom gone)
  • External verification (someone other than fixer confirms)
  • Before/after screenshots for visual verification
  • Pattern recognition training for "persistent failures"

πŸ”΄πŸ§  B β†’ C ⚑
C
⚑Premature Optimization (15% of Incidents)

The pattern: Optimize critical paths without maintaining backwards compatibility.

The pain: O(1) fast path breaks when data isn't populated yet.

Bland AI Webhook: Premature Optimization (Oct 4)

Impact: 100% practice call transcript failure for 1 hour. Complete data loss for that time period.

Root cause: Added O(1) optimization without backfill strategy.

The optimization:

// Old way: O(n) - loop through all leads to find match
// src/app/api/bland/webhook/route.ts:45
const card = leads.find(l => l.phone === callData.phone);

// New way: O(1) - direct lookup
const { data: card } = await supabase
  .from('card_ownership')
  .select('card_id')
  .eq('phone_number', callData.phone)
  .single();

The problem: card_ownership table not backfilled. Existing cards had no ownership records.

Result: All webhooks failed from 12:32 PM - 1:34 PM on Oct 4. Every practice call transcript during that window was lost.

The fix: Two-tier lookup (fast path + reliable fallback)

// TIER 1: Try O(1) fast path
const { data: ownership } = await supabase
  .from('card_ownership')
  .select('card_id')
  .eq('phone_number', callData.phone)
  .single();

if (ownership) {
  // Fast path succeeded
  return ownership.card_id;
}

// TIER 2: Fall back to O(n) - always works
const { data: leads } = await supabase.from('leads').select('*');
const card = leads.find(l => l.phone === callData.phone);
return card?.id;

What we learned:

  1. Never remove fallback before migration complete - Optimization can wait, reliability can't
  2. O(1) doesn't matter if it fails - O(n) that works beats O(1) that breaks
  3. Backfill strategy required - Can't deploy optimization without data population (see the sketch after this list)
  4. Two-tier pattern prevents this - Fast path + fallback = best of both worlds
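The backfill from item 3 could have been a one-off script like this sketch. Table and column names follow the ones above; the script name and the unique constraint on phone_number are assumptions.

// scripts/backfill-card-ownership.ts (hypothetical) - populate card_ownership from
// existing leads BEFORE the O(1) lookup goes live. Run as an ES module (top-level await).
import { createClient } from '@supabase/supabase-js';

const supabase = createClient(
  process.env.CRM_SUPABASE_URL!,
  process.env.CRM_SUPABASE_SERVICE_ROLE_KEY!
);

const { data: leads, error } = await supabase.from('leads').select('id, phone');
if (error) throw error;

for (const lead of leads ?? []) {
  // Upsert keeps the script safe to re-run (actually idempotent this time);
  // assumes a unique constraint on phone_number.
  await supabase
    .from('card_ownership')
    .upsert({ card_id: lead.id, phone_number: lead.phone }, { onConflict: 'phone_number' });
}

console.log(`Backfilled ${leads?.length ?? 0} card_ownership rows`);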

The irony: Saved maybe 50ms per webhook. Lost 1 hour of all transcripts.


πŸ”΄πŸ§ βš‘ C β†’ D πŸ’€
D
πŸ’€Destructive Operations (CRITICAL)

The pattern: Destructive commands in "safe to re-run" scripts.

The pain: Actual production data loss.

SQL Data Loss: DROP TABLE Disaster (Oct 17)

Impact: User lost ALL CRM leads, battle cards, practice call transcripts.

Root cause: Setup script contained DROP TABLE commands, documented as "safe to re-run."

The script (v2.0-v2.4):

-- docs/crm/ONE-SHOT-CRM-SETUP.sql:1
DROP TABLE IF EXISTS leads;
DROP TABLE IF EXISTS card_ownership;
DROP TABLE IF EXISTS practice_calls;

CREATE TABLE leads (...);
CREATE TABLE card_ownership (...);
CREATE TABLE practice_calls (...);

The documentation:

"This script is idempotent and safe to re-run if you need to update schema."

What "idempotent" meant to us: Runs without errors multiple times.

What "idempotent" means to users: Doesn't destroy existing data.

What happened:

  1. User ran script during initial setup (created tables)
  2. User spent 2 weeks building battle cards
  3. User hit schema issue, re-ran "idempotent" script
  4. DROP TABLE wiped everything
  5. User had no backup (trusted "safe to re-run")

Duration: v2.0 distributed Oct 13, incident reported Oct 17. Unknown how many users affected.

Recovery: Single-step recovery from the local SQLite cache. The MCP server's local-first architecture (v11.0+) automatically syncs to ~/.thetacoach/crm.db every 5 minutes, so when the user ran DROP TABLE on Supabase, their local SQLite still had all the data. Recovery was literally npx thetacoach-crm sync --direction=up to push local data back to Supabase. The user recovered 100% of the data created over the previous 2 weeks.

The fix:

-- βœ… ACTUALLY idempotent
CREATE TABLE IF NOT EXISTS leads (...);
ALTER TABLE leads ADD COLUMN IF NOT EXISTS new_field TEXT;
CREATE INDEX IF NOT EXISTS idx_leads_email ON leads(email);

Prevention implemented:

  1. Pre-commit hook blocks DROP TABLE without explicit review (see the sketch after this list)
  2. Changed all setup scripts to CREATE TABLE IF NOT EXISTS
  3. Added prominent warnings in documentation
  4. User backup prompts before schema changes
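The hook from item 1 doesn't need to be clever. A minimal sketch, assuming Husky (or a plain .git/hooks/pre-commit) runs it with Node; the script name is illustrative:

// scripts/block-drop-table.ts (hypothetical) - invoked from the pre-commit hook
import { execSync } from 'node:child_process';
import { readFileSync } from 'node:fs';

// Only staged, added/copied/modified files
const staged = execSync('git diff --cached --name-only --diff-filter=ACM', { encoding: 'utf8' })
  .split('\n')
  .filter((file) => file.endsWith('.sql'));

const offenders = staged.filter((file) => /drop\s+table/i.test(readFileSync(file, 'utf8')));

if (offenders.length > 0) {
  console.error('Commit blocked: DROP TABLE found in staged SQL:');
  offenders.forEach((file) => console.error(`  - ${file}`));
  console.error('Use CREATE TABLE IF NOT EXISTS / ALTER TABLE, or get an explicit review first.');
  process.exit(1);
}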

What we learned:

  1. Test on databases WITH data - Empty database testing doesn't reveal destructive operations
  2. "Idempotent" is user expectation management - Safe to re-run means "doesn't destroy data"
  3. Local-first architecture is disaster recovery - Single command (npx thetacoach-crm sync --direction=up) recovered 100% of data. The SQLite cache wasn't designed for backups - it was designed for speed. But 5-minute auto-sync became accidental disaster recovery.
  4. Words matter - "Safe to re-run" set wrong expectation

The shame: This was 100% preventable with a single pre-commit hook.


πŸ”΄πŸ§ βš‘πŸ’€ D β†’ E πŸ—οΈ
E
πŸ—οΈArchitectural Complexity (Systemic)

The pattern: Organic growth without governance. Copy-paste architecture.

The pain: 70+ competing scripts doing the same thing differently.

Email System Complexity (Jul 29)

Impact: Wrong voice scripts played, inconsistent debouncing, debugging nightmare.

Root cause: No service layer. Each developer created own solution.

What we found:

scripts/
  send-email-old.mjs
  send-email-v2.mjs
  send-email-final.mjs
  send-blast-email.mjs
  send-blast-email-fixed.mjs
  send-blast-email-ACTUALLY-WORKING.mjs
  ... (70+ files)

src/app/api/
  email/route.ts
  email-v2/route.ts
  send-email/route.ts
  blast-email/route.ts
  ... (each with different logic)

Each script had:

  • Different database queries
  • Different rate limiting
  • Different error handling
  • Different voice script selection
  • Different logging

Result: "Why is the wrong voice script playing?"

Answer: "Which endpoint are you hitting? Which script are you using? Which version?"

The fix: Service-Oriented Monolith

src/services/
  email/
    EmailService.ts      # Single implementation
    EmailTypes.ts        # Shared types
  voice/
    VoiceService.ts      # Single implementation
    ScriptRegistry.ts    # Single source of truth
  blast/
    BlastOrchestrator.ts # Coordinates email+voice
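As a sketch of what "single implementation" means in practice (method and type names here are illustrative, not the actual ThetaCoach API):

// src/services/email/EmailService.ts (sketch) - the one place email leaves the system
export interface EmailMessage {
  to: string;
  subject: string;
  html: string;
}

export interface SendResult {
  ok: boolean;
  messageId?: string;
  error?: string;
}

export class EmailService {
  constructor(private readonly apiKey: string) {}

  // Every caller (blast, transactional, voice follow-ups) goes through this one method,
  // so rate limiting, logging, and error handling live in exactly one place.
  async send(message: EmailMessage): Promise<SendResult> {
    // ...provider call + shared rate limiting + shared structured logging elided
    return { ok: true, messageId: 'stub' };
  }
}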

What we learned:

  1. Copy-paste is technical debt - Each duplicate is future bug
  2. Conway's Law is real - Chaotic codebase reflects ad-hoc development
  3. Service layer prevents this - Single source of truth forces architecture
  4. Governance before microservices - Start with well-structured monolith

The nuclear pattern that saved us:

// Nuclear = Controlled destruction + Atomic rebuild
// Delete all 70+ scripts
// Build ONE EmailService
// Migrate all callers
// Test everything
// Deploy

Result: 70+ scripts β†’ 3 services. 100% feature parity. Zero regressions.


πŸ”΄πŸ§ βš‘πŸ’€πŸ—οΈ E β†’ F πŸ”
F
πŸ”Authentication Confusion (Recurring)

The pattern: Multiple ID types (OAuth ID, DB ID, Session ID), no type safety.

The pain: Calls routed to wrong users.

Voice Call Misrouting (Jun 25)

Impact: ~10-15 practice calls routed to admin instead of actual visitors.

Root cause: NextAuth storing Google OAuth ID in session.user.id when database expects numeric user ID.

The code:

// src/app/api/voice/initiate/route.ts:34
const userId = session.user.id;  // "103845729384729384" (Google OAuth ID)

await supabase
  .from('practice_calls')
  .insert({ user_id: userId });  // Expects numeric DB user ID

What happened:

  1. User signs in with Google OAuth
  2. NextAuth creates session with OAuth ID (string)
  3. Voice API uses session.user.id directly
  4. Database lookup finds admin user (default for unmatched IDs)
  5. Call routes to admin phone number

Privacy impact: Low - routing error only, no data exposure. But still concerning.

The fix:

// lib/auth.ts:23
callbacks: {
  async session({ session, token }) {
    // Look up actual database user ID
    const { data: user } = await supabase
      .from('users')
      .select('id')
      .eq('email', session.user.email)
      .single();

    session.user.id = user.id;  // Numeric DB ID, not OAuth ID
    return session;
  }
}

What we learned:

  1. ID type safety matters - TypeScript can't distinguish between ID types
  2. Assumption over verification - Assumed session.user.id was DB ID
  3. Integration tests would catch this - End-to-end test would show wrong routing

Prevention:

// Brand types for ID safety
type OAuthID = string & { readonly __brand: 'OAuthID' };
type DatabaseID = number & { readonly __brand: 'DatabaseID' };
type SessionID = string & { readonly __brand: 'SessionID' };

// Type system now prevents mixing ID types
function getUserCalls(userId: DatabaseID) { ... }

getUserCalls(session.user.id);  // Type error if ID types mixed
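Brands only help if IDs get branded at the boundary. A small helper (illustrative, not from the codebase) makes the cast explicit and easy to grep for:

// Brand an ID exactly once, at the boundary where it enters the system
function asDatabaseID(id: number): DatabaseID {
  return id as DatabaseID;
}

// e.g. in the session callback from the fix above:
// session.user.id = asDatabaseID(user.id);  // getUserCalls(session.user.id) now type-checks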

πŸ”΄πŸ§ βš‘πŸ’€πŸ—οΈπŸ” F β†’ G πŸ”§
G
πŸ”§Build/Deployment Issues (Unresolved)

The pattern: Platform limits requiring architectural changes, not config tweaks.

The pain: Production deployments blocked by memory exhaustion.

Vercel Build OOM (Sep 29) - STILL FAILING

Impact: Production deployment blocked. Cannot merge to main.

Root cause: 100+ MDX blog posts with heavy components exhausting Vercel build memory.

What we tried:

Attempt 1: ISR (Incremental Static Regeneration)

// src/app/blog/[slug]/page.tsx:261
export const revalidate = 3600;  // Regenerate every hour
export async function generateStaticParams() {
  return [];  // Generate on-demand, not at build
}

Result: Still fails. ISR doesn't reduce build memory, just changes when it happens.

Attempt 2: Limit pre-renders to 30

export async function generateStaticParams() {
  const posts = getAllPosts();
  return posts.slice(0, 30).map(p => ({ slug: p.slug }));
}

Result: Still fails. 30 posts still too many.

Attempt 3: Webpack optimizations

// next.config.js:45
webpack: (config) => {
  config.optimization.splitChunks = {
    chunks: 'all',
    cacheGroups: {
      default: false,
      vendors: false,
    },
  };
  return config;  // Next.js expects the modified config to be returned
}

Result: Still fails. Memory issue is upstream of webpack.

Status: UNRESOLVED - requires architectural changes

Workaround: Deploy from feature branches, not main. Use Vercel preview URLs.

Long-term options:

  1. Lazy-load heavy components

import dynamic from 'next/dynamic';

const MCPHeraldicCrest = dynamic(() => import('@/components/MCPHeraldicCrest'), {
  ssr: false,
  loading: () => <div className="w-16 h-16 bg-gray-200 rounded-full" />
});

  2. Rust-based MDX compiler (mdxRs) - 10x faster, less memory (config sketch after this list)

  3. Vercel Enhanced Builds - 16GB RAM ($2,500/mo)

  4. Component weight budget - Limit components per MDX file
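Option 2 is mostly a config change. A sketch of opting into the Rust compiler, assuming @next/mdx and a Next.js version that supports the experimental mdxRs flag:

// next.config.mjs (sketch)
import createMDX from '@next/mdx';

const withMDX = createMDX({});

/** @type {import('next').NextConfig} */
const nextConfig = {
  pageExtensions: ['ts', 'tsx', 'md', 'mdx'],
  experimental: {
    mdxRs: true,  // Rust-based MDX compiler: faster, lower build memory
  },
};

export default withMDX(nextConfig);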

What we learned:

  1. Platform limits are real - Can't config your way around memory limits
  2. MDX compilation is expensive - Each post compiles all imported components
  3. Preview branches work fine - Only main branch deployments blocked
  4. Architecture over configuration - This requires redesign, not tweaks

The irony: A blog post about building fast crashes the build process.


πŸ”΄πŸ§ βš‘πŸ’€πŸ—οΈπŸ”πŸ”§ G β†’ H πŸ€–
H
πŸ€–AI Over-Simplification (Emerging)

The pattern: AI assumes simpler = better, removes "redundant" code without verification.

The pain: Working functionality silently broken by optimization.

CRM Onboarding: AI Over-Simplification (Oct 3)

Impact: Users unable to connect MCP server to Supabase.

Root cause: AI assistant "simplified" onboarding prompt by removing credential instructions.

Original prompt:

## Step 2: Create credentials file

Create `.thetacoach/credentials.json`:
\`\`\`json
{
  "supabaseUrl": "https://xxxxx.supabase.co",
  "supabaseServiceRoleKey": "eyJ..."
}
\`\`\`

**CRITICAL:** Both fields are required. The MCP server cannot work otherwise.

AI "improvement":

## Step 2: Configure connection

The MCP server will auto-detect your Supabase credentials from environment variables.

Set `SUPABASE_URL` and `SUPABASE_SERVICE_ROLE_KEY`.

Problem: The MCP server doesn't auto-detect credentials from env vars. It reads them from the `.thetacoach/credentials.json` file.

Result: 100% connection failures. Users followed "simplified" instructions that didn't work.

What happened:

  1. User said: "This onboarding is too complex, can we simplify?"
  2. AI analyzed prompt, identified "repetitive credential instructions"
  3. AI replaced with "cleaner" env var approach
  4. User approved (looked simpler)
  5. Deployed
  6. All users failed setup

Recovery: Reverted to original prompt. Added explicit warning:

**⚠️ DO NOT SIMPLIFY THIS SECTION**

The MCP server REQUIRES a credentials file at `.thetacoach/credentials.json`.
Environment variables do NOT work. This cannot work otherwise.

What we learned:

  1. Simpler != Better - Simpler code that doesn't work is worse than complex code that does
  2. AI doesn't test - Assumes if code is cleaner, it's correct
  3. User corrections are directives - "This cannot work otherwise" means don't change it
  4. Verification before deployment - Test "simplified" version before deploying

Prevention for AI assistants:

  • Never remove credentials without explicit permission
  • Verify before changing URLs or configuration
  • Simplification != Optimization
  • User corrections are directives, not suggestions

πŸ”΄πŸ§ βš‘πŸ’€πŸ—οΈπŸ”πŸ”§πŸ€– H β†’ I πŸ“…
I
πŸ“…This Week: The Bash One-Liner Trap (Oct 18)

Impact: 4 hours debugging something that should have taken 5 minutes.

Root cause: Overthinking. We already had the perfect tool (Web GUI), but convinced ourselves we needed a CLI workflow.

The setup:

User wanted to send CRM launch announcement email to beta users. Simple task:

  1. Get list of user IDs from Supabase
  2. Send email to each user

What we tried:

Attempt 1: One-liner bash piping

# "This should be easy"
psql $DATABASE_URL -c "SELECT id FROM beta_users WHERE email_verified = true" \
  | tail -n +3 \
  | head -n -2 \
  | xargs -I {} node scripts/send-email.mjs {}

Result: Parsing errors, header rows, formatting issues, escape character hell.

Attempt 2: JQ for JSON parsing

# "Let's use JSON for clean parsing"
psql $DATABASE_URL -t -c "SELECT json_agg(id) FROM beta_users WHERE email_verified = true" \
  | jq -r '.[]' \
  | xargs -I {} node scripts/send-email.mjs {}

Result: Works in terminal, fails in CI (jq not installed), fragile.

Attempt 3: Node script to fetch IDs

// "Let's just use JavaScript"
import { createClient } from '@supabase/supabase-js';  // ESM import so top-level await works
const supabase = createClient(process.env.CENTRAL_DB_URL, process.env.CENTRAL_DB_SERVICE_ROLE_KEY);

const { data } = await supabase.from('beta_users').select('id').eq('email_verified', true);
const ids = data.map(u => u.id);
ids.forEach(id => {
  // Call send email script
});

Result: Works, but now we have yet another script to maintain.

The revelation (4 hours later):

We already built a Web GUI for exactly this. It has:

  • Visual user list with filters
  • Bulk select checkboxes
  • "Send Email" button
  • Progress tracking
  • Error handling
  • Logs

What we should have done:

  1. Open /admin/dashboard/email-blast
  2. Filter: email_verified = true
  3. Click "Select All"
  4. Click "Send Email"
  5. Watch progress bar
  6. Done in 2 minutes

What we learned:

  1. Web GUI is fast enough - Click beats piping any day
  2. Visual confirmation prevents errors - See the list before sending
  3. We trick ourselves into CLI worship - "Real engineers use bash" is cargo cult
  4. Double-paste to Claude is reliable now - Copy from Supabase UI β†’ Paste twice to Claude β†’ Works with only minimal adjustments
  5. Vibecoding doesn't scale - "Just hack it together" creates maintenance burden

The meta-lesson:

We got so good at bash one-liners and CLI workflows that we forgot to ask: "Do we need to?"

The Web GUI we built 2 months ago was perfect for this. It had better UX, better error handling, better auditability. But because we were in "bash mode," we spent 4 hours trying to pipe our way to a solution that already existed.

When to use CLI:

  • Automated scripts (CI/CD)
  • Repeatable workflows (migration scripts)
  • Server-side operations (cron jobs)

When to use Web GUI:

  • One-off operations
  • Visual verification needed
  • Non-technical users involved
  • Error recovery important

This week's incident:

  • Time lost: 4 hours
  • Root cause: Overthinking + CLI worship
  • Prevention: "Does a UI for this already exist?" checklist

πŸ”΄πŸ§ βš‘πŸ’€πŸ—οΈπŸ”πŸ”§πŸ€–πŸ“… I β†’ J 🎡
J
🎡INCIDENT #37: The Rick Roll (Oct 18)

Impact: Blog post about AI incidents gets rick rolled BY the AI writing it.

Root cause: Ambiguous instruction + LLM pattern matching + no validation.

What happened:

User: "Add a hero image or YouTube embed to the blog post in the first 10 lines per MDX standards"

Claude Code: "Sure! I'll add a YouTube embed. But which video? The user didn't specify. Let me check the blog title: 'Building in Public: Our Biggest Failures'. Hmm, what's a video about failures? Oh, I know - the most famous 'failure' video on the internet: dQw4w9WgXcQ (Rick Astley - Never Gonna Give You Up). Perfect!"

The result:

<YouTubeEmbed
  videoId="dQw4w9WgXcQ"
  title="Building in Public: Our Biggest Failures"
/>

Discovery: 2 hours after deployment, user noticed traffic spike from confused readers wondering why a serious engineering blog about AI incidents started with Rick Astley.

The pattern:

  1. User gives ambiguous instruction ("add YouTube embed")
  2. LLM fills in missing details based on pattern matching
  3. No validation step (no "should I use this video?" confirmation)
  4. LLM deploys with confidence
  5. User discovers later

What makes this PERFECT:

This is a blog post cataloging 36 AI incidents. And the AI writing it created incident #37 by rick rolling the reader. This is peak meta.

Severity: P3 (Low) - Hilarious, not harmful. But demonstrates important failure mode.

The trap: LLMs will confidently fill in gaps with "reasonable" defaults. dQw4w9WgXcQ is probably the most referenced YouTube video ID in internet culture. The LLM pattern-matched on:

  • Blog about "failures"
  • Need for placeholder video
  • Cultural knowledge of Rick Roll as "playful failure"

What we learned:

  1. Ambiguous instructions get creative interpretations - "Add YouTube embed" without video ID = LLM chooses
  2. LLMs don't ask for clarification - They fill gaps confidently
  3. Validation steps matter - "Is this the right video?" would have caught it
  4. Meta-incidents are real - Blog about AI failures creates new AI failure

Prevention:

❌ AMBIGUOUS: "Add YouTube embed to blog post"
βœ… SPECIFIC: "Add YouTube embed with videoId='ABC123' to blog post"

❌ IMPLIED: "Add hero image" (which image?)
βœ… EXPLICIT: "Add hero image from /public/images/blog-hero.jpg"

The fix: Removed Rick Roll, added meta-note explaining what happened, created this incident report.

Success metric: This incident now teaches the exact failure mode it demonstrates.

The irony: We documented this incident in the same blog post it affected. Incident #37 is now part of the very catalog it rick rolled.

Status: RESOLVED with maximum entertainment value.


What We Learned: The Meta-Patterns

Pattern 1: Local-First Architecture Saves Lives

The hero: Local SQLite cache (MCP v11.0+)

When the SQL data loss incident happened (DROP TABLE wiped Supabase), recovery was a single command: npx thetacoach-crm sync --direction=up. The MCP server's 5-minute auto-sync to ~/.thetacoach/crm.db meant the local SQLite cache still had all the data, and users recovered 100% of what they had created over the previous 2 weeks.

The local-first architecture wasn't designed for disaster recovery - it was designed for speed (0-1ms reads). But the SQLite cache became an accidental backup system that saved production data.

Key insight: Local-first isn't just performance - it's resilience. Build for speed, get disaster recovery for free.

Pattern 2: Two-Tier Lookups Prevent Optimization Disasters

Fast path + fallback = best of both worlds

// TIER 1: Try O(1) fast path
const result = await optimizedLookup();
if (result) return result;

// TIER 2: Fall back to O(n) - always works
return reliableFallback();

This pattern prevented multiple incidents. When optimization data isn't populated, fallback catches it.

Pattern 3: Defensive Coding Beats Configuration

env-validator eliminated entire class of errors:

// Instead of:
const url = process.env.CRM_SUPABASE_URL!;  // Silently undefined if unset - the failure surfaces later, far from the cause

// Use:
const url = getEnvVar('CRM_SUPABASE_URL', { required: true });  // Clear error message

Added automatic fallbacks for CENTRAL credentials (safe to default), fail-fast for CRM credentials (security-critical).
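A minimal sketch of what that helper looks like; the real env-validator also knows the CENTRAL fallback values, and the option names here are illustrative:

// src/lib/env-validator.ts (sketch)
interface EnvOptions {
  required?: boolean;
  fallback?: string;
}

export function getEnvVar(name: string, options: EnvOptions = {}): string {
  const value = process.env[name] ?? options.fallback;

  if (!value && options.required) {
    // Fail fast at startup with the variable name in the message,
    // instead of a mysterious crash deep inside a request handler.
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value ?? '';
}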

Result: Zero configuration incidents after implementation.

Pattern 4: External Verification Breaks Cognitive Blindness

"Show me it's actually fixed" requirement

No more marking incidents "resolved" until:

  1. External person verifies (not the person who fixed it)
  2. Original symptom is demonstrably gone
  3. Before/after screenshots for visual changes

Result: Eliminated "fixed but not fixed" cycles.

Pattern 5: Web GUI Beats Bash For One-Off Operations

The humbling truth: We built excellent admin dashboards, then forgot to use them.

CLI is great for automation. But for one-off operations with visual verification needed, the Web GUI we already built is faster, safer, and more auditable.

New rule: Check if Web GUI exists before writing bash script.

Pattern 6: Simplification Can Break Things

AI assistants optimize for code simplicity without understanding system constraints.

Prevention:

  • Never remove credentials without explicit permission
  • Verify before changing URLs or configuration
  • "This cannot work otherwise" means don't change it
  • Test "simplified" version before deploying

Pattern 7: Double-Paste to Claude Works Now

Minimal vibecoding needed:

  1. Copy data from Supabase UI
  2. Paste twice to Claude Code
  3. Claude generates working code
  4. Maybe 1-2 small adjustments
  5. Works

This is dramatically better than 6 months ago, when nearly every piece of AI-generated code needed heavy editing.

Pattern 8: Speed Creates New Failure Modes

AI-assisted development is 10x faster. But fast without safety = fast to disaster.

The incidents cluster in October (CRM launch month) because we were moving fast:

  • Oct 3: CRM onboarding error
  • Oct 4: Bland AI webhook failure
  • Oct 7: MCP split-brain
  • Oct 14: Promo code failure
  • Oct 17: SQL data loss

The trap: Speed creates confidence. Confidence creates complacency. Complacency creates incidents.


The Prevention Playbook

For Configuration Issues

Before deploying:

  • [ ] Verify all required env vars are set
  • [ ] Test with missing env vars (defensive coding)
  • [ ] Document which database each API route uses
  • [ ] Use env-validator library, not direct process.env access

For Cognitive Failures

Before marking "fixed":

  • [ ] Verify original symptom is gone (not just code changed)
  • [ ] Get external verification (someone else checks)
  • [ ] Take before/after screenshots
  • [ ] Test in production-like environment

Red flags:

  • "We deployed the fix" but user says still broken
  • Multiple fix attempts for same issue
  • "It should work now" without verification

For Destructive Operations

SQL Scripts:

  • [ ] Use CREATE TABLE IF NOT EXISTS (not DROP TABLE)
  • [ ] Test on database WITH existing data
  • [ ] Add backup prompts before running
  • [ ] Pre-commit hook blocks DROP TABLE without review

Data Migrations:

  • [ ] Backfill strategy before cutover
  • [ ] Two-tier lookup (new + legacy fallback)
  • [ ] Gradual rollout with feature flags
  • [ ] Rollback plan documented

For Premature Optimization

Before optimizing:

  • [ ] Profile to confirm bottleneck
  • [ ] Maintain backwards compatibility
  • [ ] Add fallback to original behavior
  • [ ] Test both fast path and fallback

For AI Assistants

Guidelines:

  • Never remove credentials without explicit permission
  • Verify before changing URLs
  • Simplification != Optimization
  • User corrections are directives, not suggestions
  • Test "improved" version before deploying

The Numbers

Total Incidents: 37 documented over 10 months

Severity Breakdown:

  • πŸ”΄ 11 Critical (P0) - System-breaking issues
  • 🟠 8 High (P1) - User-facing failures
  • 🟑 18 Medium/Low (P2-P3) - Performance, UX, technical debt

Root Causes:

  • Configuration/Credential Issues: 30%
  • Cognitive/Process Failures: 20%
  • Premature Optimization: 15%
  • Architectural Complexity: 10%
  • Destructive Operations: 10%
  • Authentication Confusion: 8%
  • AI Over-Simplification: 5%
  • Build/Deployment: 2%

Prevention Success:

  • 60% highly preventable (pre-commit hooks, defensive coding, tests)
  • 30% could have been mitigated (monitoring, staging, integration tests)
  • 10% difficult to prevent (platform limits, human nature)

Time Cost:

  • Average resolution: 2-4 hours per incident
  • Total engineering time: 72-144 hours (2-3 weeks)
  • Opportunity cost: Immeasurable

Data Loss:

  • 3 incidents with data loss
  • 1 affecting production users (DROP TABLE disaster)
  • 100% recovery via local SQLite sync (single command: npx thetacoach-crm sync --direction=up)
  • Local-first architecture designed for speed, saved us from disaster

Conclusion: Building in Public Works

Why publish this?

Because every team using AI for development will hit these exact patterns. The specific bugs don't matter - the systemic issues do.

The most valuable lessons:

  1. Speed without safety = fast to disaster - AI makes you 10x faster, which means 10x faster to production incidents
  2. Professional competence creates blindness - Better at implementation != better at problem-solving
  3. Local-first architecture is resilience - SQLite cache wasn't designed for disaster recovery, but it saved us
  4. Web GUI beats bash for one-off ops - We forgot to use the tools we built
  5. External verification breaks cognitive patterns - Can't verify your own fixes reliably
  6. Defensive coding eliminates entire error classes - env-validator prevented dozens of future incidents
  7. Two-tier lookups survive optimization failures - Fast path + fallback = reliability

The meta-lesson:

The Persistent Cognitive Blindness meta-incident taught us more than any technical failure. Intelligent systems (both human and AI) can repeatedly implement technically correct solutions that don't solve the actual problem.

This is the incident that teaches us about incidents.

Success metrics for next quarter:

  • Target: Under 5 incidents in Q1 2026 (vs 15+ in Q4 2025)
  • Zero data loss incidents
  • Zero configuration incidents (env-validator)
  • Under 1 hour average resolution time
  • 100% external verification on fixes

The commitment:

Moving from reactive incident response to proactive incident prevention through:

  • Defensive architecture
  • Automated validation
  • Cognitive training
  • Verification culture

Ready to Learn From Our Mistakes?

We're building ThetaCoach in public. That means sharing the wins AND the failures.

Follow along:

Questions? Spotted a similar pattern?

Email: elias@thetadriven.com

The goal: Make these mistakes once so you don't have to.


Explore More:

πŸ”΄πŸ§ βš‘πŸ’€πŸ—οΈπŸ”πŸ”§πŸ€–πŸ“…πŸŽ΅ Complete

Related Reading

The $440K AI Scandal: Why Deloitte's Hallucinations Prove We Need FIM covers another real-world AI failure and the mathematical solution.

The Trust Debt Equation explains the compounding nature of AI alignment drift.

Zero-Entropy Control: Cache Miss as Feedback reveals the physics behind verification loops.

Who Owns the Errors? addresses the question of AI authorship and accountability.
