Olllo case studies

The Honest Post-Mortem

Wed, 15 Apr 2026 00:00:00 GMT

I Shipped a Real Problem and Nobody Showed Up

A post-mortem of olllo, a performance-review and accomplishment-tracking tool I built solo across four months and shut down in March 2026.

Opening

People told me it was a real problem. Every conversation. Performance reviews that felt arbitrary. 1:1s that surfaced nothing. The 9pm panic of trying to reconstruct six months of work the night before a self-review was due.

I built olllo to solve that. It works. People who tried it said it works. Even with the price gate removed, almost none of them showed up to use it consistently.

This is the part of the case study where I was supposed to have a clean answer for why. I don’t. What I have is a set of signals I read correctly, a set I explained away, and a decision made in March 2026 to stop building and start writing about what I learned.

The closest analogue is the fitness or weight-loss app. People articulate the goal cleanly. They will tell you, with care, that health is important. They will sign up. They will not, in numbers that build a business, show up to do the work. olllo lived in that category, and the signals were there in the data before I let myself read them.

The bet

The hypothesis was simple. If you can’t remember your work, you can’t advocate for it, and your career compounds slower than your effort deserves. The script I wrote for the first YouTube video argues this cleanly: it names recency bias, it cites Kahneman, it frames the problem as memory rather than motivation. I believed it when I wrote it. I still believe it.

The user bet was that low-friction daily capture, reinforced by weekly reflection and quarterly summaries, would compound into a case file that made reviews, 1:1s, and compensation conversations dramatically less stressful.

The business bet was narrower: that enough people felt this pain acutely enough to pay roughly $8/mo on an annual plan ($96/year before the beta discount) for a tool that replaced the ad-hoc notes app or Google Doc they were already mismanaging.

The first bet was right about the pain and wrong about the behavior. Daily capture compounds only if you show up daily, and the cohort A retention says you mostly don’t. The second bet rode on the first, and they fell together.

What I built

Four months of solo work. Full architecture details live in Solo Architecture; the short version:

A Next.js 16 web app with onboarding, reflections, goals, and accomplishment tracking
A React Native mobile app with voice capture that transcribes audio and extracts structured STAR-format entries
A PWA for the in-between surface
A multi-agent reflection flow using Claude Sonnet and Haiku across tiered calls
Infrastructure most solo products skip: rate limiting, feature flags, email authentication, consent management, referral loops, waitlist, free-forever grants (the full growth-stack story is in Growth Engineering)

38 numbered feature branches shipped (numbered 001 through 040, with two gaps where features were consolidated), each gated through a Speckit workflow that required a spec, clarifications, a plan, tasks, and an analysis pass before implementation. The discipline story lives in Culture as Code.

The thing worked. That’s not in dispute.

What 14 people said they wanted

The waitlist had 30 people. Fourteen filled out the survey, and the pattern in the responses was consistent enough that I should have read more meaning into it than I did.

The modal respondent was a senior individual contributor at an enterprise company — staff-level engineers and designers, plus a band of people in career transition. Two-thirds heard about olllo through LinkedIn or a direct referral; almost none arrived from a search or a content channel I had not personally seeded.

The pain they named was real and articulated cleanly: performance reviews, interviews and resume updates, weekly planning, and the recurring imposter-syndrome moment of “what have I actually been working on?” Six respondents cited reviews specifically. Seven cited interviews. Across role levels, the shape held.

What they said they wanted built was a clean record of wins they could search later, a promotion-ready career story, and weekly reflection prompts that surface patterns. Goal tracking and a thirty-second “capture a win” both rated highly. Across the four feature-interest scores, no single feature dominated — respondents wanted the whole shape, not one piece of it.

What they said would stop them is the section this post-mortem turns on. The top blockers, in order: too much effort to capture things, I don’t want another tool to maintain, I’m not sure what to write, I wouldn’t remember to come back. Privacy concerns scattered through. Price was not in the top blockers. Habit and inertia were. That should have been louder to me than it was.

The magic-wand answers, paraphrased: “Identify a career path and prepare for success.” “Keep me engaged with itself — too often the things I want to work on peter off.” “Push me to develop the habit.” “Help me organize my energy toward highest impact, for me, not my company.” Read as a set, those wishes describe a coaching product more than a tracking product. olllo was a tracking product.

The survey was a clear, useful read. The feature priorities were visible. So was the willingness-to-engage signal — and it was thinner than I let myself see at the time.

The signals I saw clearly

The numbers, as of March 2026:

Metric	Value
Waitlist signups	30
Survey completions	14 (47%)
Cohort A — invited (no payment, n=10): created → onboarded → active@1mo	9 / 5 / 3
Cohort B — invited (Stripe gate, n=20): created → onboarded → active@1mo	2 / 0 / 0
Paid conversions	0

The split between cohorts was the natural experiment, and the cohort B funnel — eighteen of twenty users walking away at the credit-card field, the two who entered a card never finishing onboarding — is the readable answer to “will people pay for this.” Growth Engineering covers the cohort design and instrumentation in detail. The post-mortem-relevant version: even with the price gate removed, half of cohort A onboarded and three of ten were still logging weekly a month later. That number is real product-fit signal in a tiny sample, and not a number any product gets to build a business on.

Read without flinching, that funnel says: strong top-of-funnel from a personal network, a hard wall at the credit-card gate, and — when the gate is removed — engagement so thin that seven of the ten free cohort had stopped logging by week four. Zero paid conversions on either path. The price was the visible failure. The habit was the deeper one.

The qualitative signal was consistent. I captured it in a note to myself: “People identify it as a pain point to keep track of accomplishments and prep for 1:1s and reviews, but they don’t seem willing to invest in it.” That sentence is the whole post-mortem in miniature. Acknowledgment without investment is the category-definition of a vitamin rather than a painkiller — the workout-app pattern, the weight-loss-app pattern, the nutrition-tracker pattern. People articulate the goal, sign up to solve it, and don’t reliably show up to do the work. I wrote the sentence and kept building.

The signals I explained away

This is the section that matters.

The metric I kept reassuring myself about was the survey itself. Fourteen people answered fourteen questions with care — these were not the throwaway responses of people who were casually interested. The depth and specificity felt like signal. The mistake was treating well-articulated pain as adjacent to willingness to show up. They are different things, and the survey already told me so. Under what would stop you from using olllo regularly, the top answers were habit blockers, not pricing blockers: too much effort, another tool to maintain, wouldn’t remember to come back. A tool that adds a daily capture ask is not solving those by being free; it is making them worse.

The pattern is visible in the git history. Every time the numbers were soft, I added a feature. Voice capture. Multi-agent reflection. Smarter summaries. Referral loops. Each one was defensible in isolation, and each one was a way of not confronting the base rate the survey had already drawn — that the people who articulated this pain most clearly were also articulating, in the same survey, why they would not show up daily to solve it.

There is a version of this product that would have worked. It probably does not include voice capture. It probably includes a conversation I did not have enough of: “Would you pay $X today, before I build a single screen?” That conversation costs nothing and it would have told me in a few weeks what it took me four months and a cohort experiment to confirm.

What I’d do differently

Test the habit before the tool. A cohort of ten people, with weekly fifteen-minute calls, asked to capture three accomplishments by the end of week one using nothing but a notes app. If the habit doesn’t form for ten people who agreed to a call, it will not form for ten thousand who never will. This is the experiment that costs nothing, takes two weeks, and tells you whether the rest is worth building.

Pre-sell before build. A landing page, a Stripe checkout, a promise to refund if I don’t ship in 60 days. If 30 people pay, there’s demand. If 3 people pay, I know before writing a line of code. I had all the infrastructure to run this test and ran it too late.

Narrower ICP. Not “knowledge workers who have performance reviews.” That is almost everyone. Something like “senior engineers at companies with formal promotion packets, actively prepping for a cycle in the next 90 days.” Urgency is the filter that separates painkiller from vitamin, and I spent too long recruiting beta users who had the problem in principle rather than the problem this quarter.

Import over habit. The daily-capture habit is the product’s largest ask and its largest conversion killer. A version that reads existing Slack, email, and calendar signal and pre-populates the case file would remove the cold start. The habit can come after the value is obvious, not before.

A willingness-to-pay experiment in month one, not month four. The hardest thing to un-know in solo building is the sunk cost of a working product. Ask the uncomfortable pricing question while the product is still cheap to kill.

A single activation event to optimize, not a funnel. Not “they used it in week one.” Something sharper, like “they walked into a real 1:1 with notes this tool generated.” Everything else is a proxy.

Measure with people first, instrumentation second. Fifteen-minute calls with the first ten or twenty users tell you more in a week than a drip system tells you in three months. The growth stack would have surfaced the demand thinness eventually; the calls would have surfaced it in week two. The full version of this lesson is in Growth Engineering.

What I’d still do the same

The problem framing. The memory-and-recall reframe holds up. The three video scripts still read as sharp product thinking and I’d use them to open any future pitch in the career-development space.

Shipping solo with team-grade discipline. Speckit, the constitution file, the merge gates. These kept me out of the classic solo trap of endless scope creep disguised as progress. Detailed in Culture as Code.

Tiered model strategy with latency-first selection. Haiku for fast classification, Sonnet for interactive reasoning, no single-tier hammer. The UX stayed snappy across thirty-eight features (and the monthly AI bill stayed within reason as a side effect, but that was not what picked the model). Detailed in AI Product Craft.

Betting on the shutdown. The decision to stop is the judgment call I’m most confident in. Pulling the plug on a product I still believe in, because the market doesn’t, is the skill I most want this portfolio to demonstrate.

What I’m taking with me

A codebase whose author I would hire. A set of product decisions I can defend in detail. A sharper read on the gap between “interesting problem” and “viable product.”

If you’re reading this as a hiring signal: the thing I want you to notice is not that olllo shipped. It’s that it stopped.

Culture as Code · Speckit + Enforcement

Wed, 15 Apr 2026 00:00:00 GMT

Culture as Code: Speckit and Enforcement

How I built the guardrails a team of ten would need, for a team of one, and why it turned out to be the highest-traction decision across four months of solo shipping.

Opening

The startup playbook is older than software. Move fast in the early days. Ship scrappy. Fix it later. AI hasn’t changed that advice; what AI has changed is that scrappy is no longer the price of fast.

You can now ship the codebase a senior systems architect would be proud of in roughly the time it used to take to ship the duct-taped MVP. The catch is that AI does not give you the discipline to do it. It only removes the excuse not to. You still have to know what great looks like, and you have to refuse to skip the parts you cannot see yet.

This is the case study for what that refusal looks like in practice. Four months of solo shipping, 38 numbered features, one constitution that grew the way every honest policy document grows: each principle added the day after I learned why it should have existed.

The problem

Solo shipping breaks in predictable ways, and AI-assisted coding multiplies all of them.

Past decisions go undocumented, so future-me cannot tell why the code is the way it is. Features drift: a spec starts as one thing and becomes another with no trace of the handoff. Every fix risks breaking something the original author already decided, and the original author is also me. AI assistants cheerfully rewrite a module in a style that contradicts three others it touches, and without rules of the road, the assistant is right that the rules do not exist.

Underneath all of this is the empty-review problem. Code review is where most engineering cultures enforce taste. Solo, there is no reviewer.

The team-of-ten answer is process: specs, ADRs, code standards, test gates, doc requirements, PR templates. The bet I made four months ago was that those artifacts are worth more to a solo dev than to a team, because a solo dev is the one who most needs to trust their own past self.

The workflow

The workflow is built on Speckit, an open-source templating layer for AI-assisted feature specification. Speckit gives you the six stages and the templates; the discipline is in running every feature through every stage, every time.

Every feature in olllo, without exception, passes through these six stages before a line of implementation code is written.

The six stages:

specify generates a spec.md from a plain-English feature description
clarify surfaces 3–5 targeted questions about ambiguous requirements and encodes the answers back into the spec
plan produces the technical design: data model, integration points, dependencies
tasks produces a dependency-ordered implementation list, each task with acceptance criteria and test-coverage requirements
analyze runs cross-artifact consistency across spec, plan, and tasks, catching contradictions before code
implement generates code task-by-task with local test verification at each step
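The ordering rule in that list (no stage runs until every earlier stage is done) can be written down in a few lines. This is a minimal sketch, not Speckit's own API: the stage names come from the list above, while the types and function names are hypothetical.

```typescript
// Hypothetical sketch: stage names are from the list above; the types and
// function names are illustrative and are not part of Speckit.
const STAGES = ["specify", "clarify", "plan", "tasks", "analyze", "implement"] as const;
type Stage = (typeof STAGES)[number];

// Returns the first stage that has not been completed yet, or null when
// all six stages are done.
function nextStage(completed: ReadonlySet<Stage>): Stage | null {
  for (const stage of STAGES) {
    if (!completed.has(stage)) return stage;
  }
  return null;
}

// A stage may run only if every earlier stage is already complete.
function canRun(stage: Stage, completed: ReadonlySet<Stage>): boolean {
  const index = STAGES.indexOf(stage);
  return STAGES.slice(0, index).every((earlier) => completed.has(earlier));
}
```

With specify, clarify, and plan complete, nextStage returns "tasks" and canRun("implement", ...) returns false, which is the "every stage, every time" rule expressed as a check rather than a habit.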

A real example. Feature 036-knowledge-user-context replaced three existing onboarding cards (work context, preferences, notifications) with a single conversational AI flow. The spec.md has a literal “Clarifications” section with five Q/A pairs from the clarify stage. Each one is a decision that would have shipped as an unstated assumption without the process:

Q: Should Knowledge capture reporting structure (direct reports count, manager relationship, 1:1 cadence)?

A: Yes, full: direct reports count, manager role/title (not name), 1:1 cadence. People names save to Contacts with encrypted realName, accessible via @-mention. User is informed of encryption and contact creation.

That single clarification triggered an encryption decision, a Contacts integration, and a new user-facing string. Without the clarify step, at least one of those three would have been missed or decided silently in implementation.

By the time implementation starts, the ambiguity budget is spent and implementation is execution rather than exploration.

The implement stage is the loudest in any AI-assisted workflow. It’s where the assistant does most of the visible typing. But on every feature in olllo, my time was spent in the earlier stages: reading research, refining the spec, choosing the harder long-term path over the easier short-term one the research suggested. The clarify and plan stages exist to make those judgment calls visible. The implement stage exists to make sure the visible decisions make it into the code. AI did most of the typing. None of the deciding.

The spec directory for 036 ends up with nine artifacts.

Every feature gets the same nine slots. Not all slots are always full; some are a single sentence. But the shape is the same, which is what makes two-month-old features readable.

The constitution

The constitution is a single file, versioned like code: .specify/memory/constitution.md, currently at version 1.7.0, last amended 2026-03-23. It declares seven non-negotiable principles.

Each principle is stated with its rule and its rationale. Principle 5 exists because I lost data once. Principle 6 exists because GDPR Article 17 does. Principle 7 exists because I shipped changelog entries inconsistently until I made it mandatory.

The file is amendment-tracked with semantic versioning. Every amendment is a scar.

The gates

Principles are nothing without enforcement. The gates are where the discipline compounds.

Pre-merge, via CI

Every PR runs unit tests against the database package, against the app, and against shared packages, via .github/workflows/ci.yml. The build fails if any test fails. No “I ran it locally” exception.

Pre-PR, via the workflow itself

The speckit implement step refuses to create a PR if affected tests have not been run and passed locally. That rule lives in the constitution (Principle 4) and in the speckit templates, so it’s enforced both in the assistant’s behavior and in my own muscle memory.

Documentation, via required artifacts

A feature is not complete until spec.md, plan.md, and tasks.md exist. Implementation tasks reference constitution principles by number. When a new user-facing feature ships without a changelog entry, the analyze stage flags it.

Security, via four dedicated workflows

SAST, DAST, secret scanning, and dependency auditing run on every push. The existence of security-secrets.yml alone means a hardcoded API key won’t land even if I miss it in self-review.

AI behavior, via evals

ai-eval.yml runs model-output tests in CI. Changing a prompt without verifying downstream output would be trivial to do by accident without this gate; with it, prompt changes are gated on measurable behavior.

Living context, via CLAUDE.md

The assistant reads CLAUDE.md on every session. When conventions evolve, the file is updated in the same PR as the convention change. The assistant’s suggestions stay aligned with current standards instead of drifting toward training-data defaults.

What the gates didn’t catch

Most case studies on engineering process tell the story of the time the gate caught a bug. This one tells a different story: the time the gate didn’t, and what came of it.

On 2026-01-13, I added Principle 6 to the constitution: every new user-data model must update the export and deletion services. The principle was added during the GDPR/CCPA work, when I had just built the services and discovered that “every model” was a longer list than I’d assumed.

Eighteen days later, on 2026-01-31, I pushed ea14b27 directly to master: fix(database): add quarterly review models to user data services. A feature I’d shipped between those two dates had added new user-data models without updating the services. The principle existed. The enforcement was still me. I missed it during the feature, caught it in self-review later, and patched it with a direct push.

That is a near-miss, and it is the kind of near-miss that the rest of the constitution exists to make rarer. Read the version history again with this in mind:

1.3.0 (2025-12-13) added Test Planning after I shipped tests as follow-up work and watched coverage drift
1.4.0 and 1.5.0 (2026-01-13) added Database Migrations and Data Portability the day I felt the absence of both
1.6.0 (2026-02-05) added Changelog Maintenance after I shipped features without changelog entries
1.7.0 (2026-03-23) added Local Test Verification during feature 038, after a CI failure that would have taken fifteen seconds to catch locally

Every amendment is a postmortem encoded as a rule. Read top to bottom, the version history is a curriculum: every lesson I’d want to teach the next person on day one, in the order I learned it the hard way.
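Principle 6 is also the kind of rule that could move from self-review into CI. The following is a minimal sketch of such a check, not something the repository is known to contain: the function, the model names, and the three input lists are all hypothetical, and in practice the lists would have to be derived from the schema and from the export and deletion services.

```typescript
// Hypothetical sketch: the function name, model names, and inputs are
// illustrative. Nothing here is taken from the olllo codebase.
function uncoveredModels(
  userDataModels: readonly string[],
  exportedModels: readonly string[],
  deletedModels: readonly string[],
): { missingFromExport: string[]; missingFromDeletion: string[] } {
  const exported = new Set(exportedModels);
  const deleted = new Set(deletedModels);
  return {
    // Models holding user data that the export service does not cover.
    missingFromExport: userDataModels.filter((m) => !exported.has(m)),
    // Models holding user data that the deletion service does not cover.
    missingFromDeletion: userDataModels.filter((m) => !deleted.has(m)),
  };
}

// Example: a new model was added to the schema but only wired into deletion.
const gaps = uncoveredModels(
  ["Accomplishment", "Reflection", "QuarterlyReview"],
  ["Accomplishment", "Reflection"],
  ["Accomplishment", "Reflection", "QuarterlyReview"],
);
```

A CI test that fails whenever either list is non-empty would turn the principle from something I had to remember into something the build remembers.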

The outcome

38 numbered feature branches shipped between December 2025 and March 2026. Branches 001 through 040, with two gaps where features were consolidated. Every one followed the same six-stage workflow. Every one produced the same nine artifacts in its spec directory. Every one was gated on tests and documentation before merge.

The measurable consistency: a feature from two months ago is readable in minutes, not hours. The spec tells me what we decided, the clarifications tell me why, the tasks tell me what was built, the constitution tells me what rules applied. There is no archaeology. There is only reading.

The trade was never about speed; it was about debt. Every startup I have watched at scale has hit the same wall: things start breaking at year two because the early observability was thin, the early tests were inadequate, the early decisions were unwritten. Retrofitting those things at scale costs more than building them at month one would have. Speckit and the constitution were the bet that I could pay that cost up front, every feature, and ship a codebase without a debt cliff to climb later.

What I’d port to a team

Universal, ship day one:

The constitution as code, versioned with amendments
The speckit clarify stage as a required step for any feature spec
Test tasks in the same task file as implementation tasks, not in a follow-up
CI enforcement of every non-negotiable rule, not convention

Solo-only, probably cut on a team:

The full nine-artifact spec directory. A team version would consolidate into three: spec.md, plan-plus-tasks.md, decisions.md.
The implement stage’s requirement that the assistant runs tests locally. On a team, CI is the gate.
Principle 7’s “when in doubt, ask” clarify step for changelog entries. A team would codify this in the PR template and skip the clarify roundtrip.

The biggest constraint, and the part I would reframe before bringing this approach to a team, is that Speckit was built for one author. The artifacts it produces (spec.md, plan.md, tasks.md, decisions.md) are excellent for me-to-future-me communication. They are hard to share for in-progress feedback. There is no good way for a teammate to comment on a plan that is still being written, no way for a designer to weigh in on a clarification before it is resolved, no way to fork a plan into alternatives and pick between them. A team version of this approach needs a collaborative layer: shared workspace for in-flight specs, async comment threads on clarifications, plan branching and review. That is the next thing I would build if I took this to a team.

Secondary gap: a lightweight ADR workflow outside of features. Cross-cutting decisions (“switch from Postgres full-text search to pgvector”) currently live in whatever feature spec happens to touch them, which is the wrong home. The specs/adr/ directory exists but is underused.

Why the discipline compounds

Process discipline reads as overhead until the moment it isn’t, and the moment it isn’t is usually six weeks after a decision you made is now haunting you. Solo devs do not have a senior engineer down the hall to ask. They have their past self, who will either have left notes or not.

AI Product Craft

Wed, 15 Apr 2026 00:00:00 GMT

AI Product Craft: When the User Is in the Moment

How models got matched to tasks, agents got scoped to outcomes, and structure beat prompting at every step. A walk through the four-tier model config, a multi-agent spec that replaced a broken one, and an eval harness that runs every night.

Opening

The user is in the moment. They are typing a sentence about a meeting that just ended, rating the week they just had, or asking the assistant to help them prep for a 1:1 in fifteen minutes. Every AI call had to fit inside that moment.

Three things picked the model that handled it: speed, correctness for the task, and cost. The first two are why this case study leads with latency budgets and tier-to-task fit. The third was a real factor in every decision, never the lead one. Cost is the reason the system does not run Opus on everything; it is not the reason any specific model got picked.

Pick the wrong tier and the user feels the lag. Pick the wrong agent boundary and the assistant talks past them. Pick the wrong scope and the assistant talks about itself instead of about them. This is the case study for how I picked, three decisions deep: the tier, the agent, and the line between what the model gets to decide and what the code does.

The constraint that picked the model

Model tier configuration lives in packages/ai/lib/config/model-tiers.ts. Four tiers, each with a primary model, a fallback, a timeout, a retry count, a temperature, a max-token cap, and a budget cap.

tier      primary                fallback           timeout  temp  max tokens  budget cap
micro     Gemini 2.0 Flash Lite  Haiku              3s       0.1   512         $0.005
fast      Claude Haiku 3.5       Gemini Flash Lite  5s       0.2   1024        $0.01
balanced  Claude Sonnet 4.5      Gemini 2.0 Flash   15s      0.5   4096        $0.10
advanced  Claude Sonnet 4.5      Claude Opus 4.5    30s      0.7   8192        $0.50

Read it left to right and the design intent is plain. Each tier is organized around a latency budget (3, 5, 15, 30 seconds across the four) because that is what the user feels first. Temperature climbs as the task gets more generative. Token caps grow with the task’s natural length. Cost grows with model size and with the chosen primary; the per-request budget caps exist so a runaway loop cannot bankrupt the system, not as the lever that decides which model runs.

PII detection runs on the micro tier because the user is mid-sentence and the model cannot afford to think for two seconds. Refinement chat sits at balanced because the assistant is in dialogue, the user is reading the words as they stream, and a half-second to first chunk is good enough. Reflection conversation runs at advanced because it is the deepest, most generative use of the system, and by the time it starts the user has already committed to the moment.
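As a rough picture of what a file like model-tiers.ts might contain, here is a minimal sketch. The numbers are transcribed from the tier table; the type names, field names, and task keys are assumptions, the model strings are display names rather than provider API identifiers, and retry counts are left out because the table does not give them per tier.

```typescript
// Hypothetical sketch: values come from the tier table in the text; type
// names, field names, and task keys are assumptions.
type TierName = "micro" | "fast" | "balanced" | "advanced";

interface TierConfig {
  primary: string;      // display name, not an API model id
  fallback: string;     // display name, not an API model id
  timeoutMs: number;    // latency budget for the tier
  temperature: number;
  maxTokens: number;
  budgetCapUsd: number; // per-request cap
}

const MODEL_TIERS: Record<TierName, TierConfig> = {
  // The table lists the micro fallback only as "Haiku" (no version given).
  micro: { primary: "Gemini 2.0 Flash Lite", fallback: "Haiku", timeoutMs: 3000, temperature: 0.1, maxTokens: 512, budgetCapUsd: 0.005 },
  fast: { primary: "Claude Haiku 3.5", fallback: "Gemini Flash Lite", timeoutMs: 5000, temperature: 0.2, maxTokens: 1024, budgetCapUsd: 0.01 },
  balanced: { primary: "Claude Sonnet 4.5", fallback: "Gemini 2.0 Flash", timeoutMs: 15000, temperature: 0.5, maxTokens: 4096, budgetCapUsd: 0.1 },
  advanced: { primary: "Claude Sonnet 4.5", fallback: "Claude Opus 4.5", timeoutMs: 30000, temperature: 0.7, maxTokens: 8192, budgetCapUsd: 0.5 },
};

// The three task-to-tier assignments named in the text.
type TaskName = "piiDetection" | "refinementChat" | "reflectionConversation";

const TASK_TIER: Record<TaskName, TierName> = {
  piiDetection: "micro",
  refinementChat: "balanced",
  reflectionConversation: "advanced",
};

// Look up the tier configuration for a task.
function tierFor(task: TaskName): TierConfig {
  return MODEL_TIERS[TASK_TIER[task]];
}
```

Keeping the task-to-tier map separate from the tier definitions means a task can be moved to a faster or slower tier without touching the model choices themselves.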

Every tier has a fallback model. In the micro, fast, and balanced tiers the fallback comes from a different provider than the primary; the advanced tier falls back from Sonnet to Opus. Every request routes through Vercel AI Gateway, which is set to enabled: true everywhere with no opt-out. The gateway gives unified logging, request-level cost tracking, and rate limiting. That is the observability layer that makes everything else in this case study possible. If you cannot see what your AI is doing, you cannot improve it.

Where single-agent got stuck

For the first version of weekly reflection, I built what every AI product builds: a single agent with a long, well-prompted system message, a list of themes to cover, and a hard cap on total questions. The prompt told the agent how to move through the themes, what each one should cover, when to summarize, when to wrap up. It did not work.

The diagnosis is in specs/012-reflection-multi-agent/spec.md. The production data captured there:

Goals theme received 14 questions (target: 2)

Well-being received 4 questions (target: 2)

Engagement received 3 questions (target: 2)

The agent sometimes declared completion while themes remained under-covered

The root cause, written in my own words in that same spec, is the entire lesson of this section:

A single LLM cannot reliably track multi-theme state and enforce hard limits through prompting alone. The agent “forgets” constraints as the conversation progresses.

Prompts are wishes. The longer the conversation, the more the wish decays. A reflection that should have run twelve questions across five themes in three minutes was instead spending fourteen questions on goals alone, abandoning users halfway through.

The mistake I see most often in AI product work is to respond to this kind of failure with better prompts. Longer prompts. Cleverer prompts. The mistake is mine too; the early iterations of the reflection prompt are still in git history, each one longer than the last, trying to encode the constraints more emphatically. None of them moved the needle past a certain point, because the model was never the right place to encode hard constraints.

The orchestrator-specialist pattern

The fix was structural, not prompted. The new architecture, shipped as feature 012, replaces the monolithic agent with three layers:

An orchestration agent that controls flow and enforces limits with hard state checks in code, not in prompts
Theme-specific agents (Engagement, Well-being, Career, Performance, Accomplishments) that each focus on a single domain for 2 to 4 questions
A response-length heuristic that decides whether to continue or transition: continue if the user’s response is under 20 words, transition after 2 questions otherwise

The orchestrator’s job is the part the model could not do reliably: state tracking. Has Engagement reached its minimum? Has any theme exceeded its maximum? Is total question count approaching the 15-question cap? Those are programmatic checks now, not prompt requests. The orchestrator hands the conversation to the next theme agent when its checks say so.

Each theme agent is what an engineering manager would call an expert: scoped to a focused outcome, expert in reaching it, blind to everything else. Engagement does not know about Career. Career does not know about Well-being. The orchestrator is the only piece that knows the whole. That separation is what kept the conversation moving.
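The orchestrator's checks are small enough to sketch. This is a minimal illustration and not the shipped implementation: the numbers (minimum 2 and maximum 4 questions per theme, a 15-question cap, the 20-word threshold) are from the text, while the types, names, and the order in which the checks run are assumptions.

```typescript
// Hypothetical sketch: limits are from the text; types, names, and the
// ordering of the checks are assumptions.
const THEMES = ["Engagement", "Well-being", "Career", "Performance", "Accomplishments"] as const;

const MIN_PER_THEME = 2;
const MAX_PER_THEME = 4;
const TOTAL_CAP = 15;
const SHORT_RESPONSE_WORDS = 20;

interface ReflectionState {
  themeIndex: number;   // index into THEMES of the active theme
  askedInTheme: number; // questions asked so far in the active theme
  askedTotal: number;   // questions asked so far in the whole session
}

type Decision = "continue" | "transition" | "wrap_up";

function countWords(text: string): number {
  return text.trim().split(/\s+/).filter(Boolean).length;
}

// Decide what happens after the user's latest response. Every branch is a
// plain comparison in code; none of it is left to the model.
function decide(state: ReflectionState, lastResponse: string): Decision {
  if (state.askedTotal >= TOTAL_CAP) return "wrap_up";
  if (state.askedInTheme < MIN_PER_THEME) return "continue";
  const atMax = state.askedInTheme >= MAX_PER_THEME;
  const shortAnswer = countWords(lastResponse) < SHORT_RESPONSE_WORDS;
  // A short answer earns a follow-up question until the theme maximum.
  if (!atMax && shortAnswer) return "continue";
  const isLastTheme = state.themeIndex >= THEMES.length - 1;
  return isLastTheme ? "wrap_up" : "transition";
}
```

The theme agents never see this state. Each one only receives the handoff, which is what keeps it blind to every other theme.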

[Figure: orchestrator-specialist architecture. The orchestrator hands the conversation to theme agents and receives completion signals. Below it, five theme agents in a row: Engagement (team & company, min 2 / max 4), Well-being (feelings live here, min 2 / max 4, FEELINGS SCOPE), Career (progress on goals, min 2 / max 4), Performance (outcomes, min 2 / max 4), Accomplishments (what shipped, min 2 / max 4). At the bottom, two cards comparing approaches: V0 monolith (one agent, long prompt, hard cap in words; state, scope, and flow all encoded as wishes the model forgets) versus V1 orchestrator-specialist (each agent expert in one theme, blind to the rest; orchestrator is the only piece that knows the whole).]

The same pattern reappeared in goal-setting, in accomplishment refinement, anywhere the original temptation was “one big prompt that does it all.”

What the model wanted to talk about

The most useful thing I learned about AI product craft is that the model has biases that prompting alone cannot suppress.

In the single-agent reflection, even with explicit prompts telling the agent to spend most of its questions on the user’s actual work and impact, the conversation kept drifting toward feelings. How do you feel about your week? How did that meeting make you feel? How do you feel about that? It was not what users wanted, it was not what the prompt asked for, and the system kept doing it anyway.

Telling the model to stop did not work. I tried. Several times.

The fix was not a sharper prompt. The fix was scope. The Well-being agent talks about feelings; that is its job. The other four agents are not allowed to. Performance asks about outcomes. Career asks about progress toward stated goals. Engagement asks about team and company connection. Accomplishments asks about what shipped. None of them have permission to ask “how do you feel about that,” because that question lives in a different agent’s scope and the orchestrator hands the conversation off when it is time.

The lesson that generalizes: when the model wants to do something the product does not want, give the unwanted behavior its own scope and gate it. Telling the model “do not do X” is a wish. Building the system so X is structurally unavailable outside the agent that handles X is law.
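A minimal sketch of what "structurally unavailable" could look like in code. Everything here is hypothetical: the type names, the `allowedTopics` field, the agent order, and the helper functions are illustrations, not olllo's implementation. The only details taken from the text are the five themes, the two-to-four question bounds, and the rule that feelings belong to Well-being alone.

```typescript
// Hypothetical sketch: names and shapes are illustrative, not olllo's actual code.
type Theme = "engagement" | "wellbeing" | "career" | "performance" | "accomplishments";
type Topic = "team" | "feelings" | "goals" | "outcomes" | "shipped";

interface ThemeAgent {
  theme: Theme;
  allowedTopics: readonly Topic[]; // the scope: anything not listed is unavailable
  minQuestions: number;
  maxQuestions: number;
}

// Agent order is an assumption made for the example.
const AGENTS: readonly ThemeAgent[] = [
  { theme: "engagement", allowedTopics: ["team"], minQuestions: 2, maxQuestions: 4 },
  { theme: "wellbeing", allowedTopics: ["feelings"], minQuestions: 2, maxQuestions: 4 },
  { theme: "career", allowedTopics: ["goals"], minQuestions: 2, maxQuestions: 4 },
  { theme: "performance", allowedTopics: ["outcomes"], minQuestions: 2, maxQuestions: 4 },
  { theme: "accomplishments", allowedTopics: ["shipped"], minQuestions: 2, maxQuestions: 4 },
];

// The gate: code, not the prompt, decides whether a topic is in scope.
function mayAsk(agent: ThemeAgent, topic: Topic): boolean {
  return agent.allowedTopics.includes(topic);
}

// The orchestrator owns flow. An agent's "done" signal counts only after its
// minimum, and the hand-off is forced once the maximum is reached.
function shouldHandOff(agent: ThemeAgent, asked: number, agentSaysDone: boolean): boolean {
  if (asked >= agent.maxQuestions) return true;
  return agentSaysDone && asked >= agent.minQuestions;
}

// Returns the next agent in line, or undefined when the conversation is over.
function nextAgent(current: Theme): ThemeAgent | undefined {
  const index = AGENTS.findIndex((a) => a.theme === current);
  return AGENTS[index + 1];
}
```

The point of the shape is that "how do you feel about that" never has to be forbidden in a prompt: for four of the five agents, `mayAsk(agent, "feelings")` is simply false.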

The reliability layer

The eval harness lives in .github/workflows/ai-eval.yml. It runs nightly at 2 AM UTC and on every PR that touches packages/ai/**, the AI services in the database package, or the eval workflow itself. The harness uses a versioned golden set, runs evaluations through pnpm --filter @repo/ai test:evals, and greps the output for the literal string REGRESSION DETECTED. Any regression fails the workflow and posts a comment on the PR. A nightly failure pages me through Slack.
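A minimal sketch of the comparison step such a harness could run, assuming each golden case produces a numeric score. The `EvalScore` shape, the tolerance, and both function names are assumptions; only the literal REGRESSION DETECTED marker comes from the description above.

```typescript
// Hypothetical sketch: the score shape and the tolerance value are assumptions.
interface EvalScore {
  caseId: string;
  score: number; // 0..1, higher is better
}

// Compares the current run against the stored baseline for the golden set.
function findRegressions(
  baseline: readonly EvalScore[],
  current: readonly EvalScore[],
  tolerance = 0.05,
): string[] {
  const currentById = new Map<string, number>();
  for (const c of current) currentById.set(c.caseId, c.score);

  const regressed: string[] = [];
  for (const b of baseline) {
    const now = currentById.get(b.caseId);
    // A golden case that disappeared from the run counts as a regression too.
    if (now === undefined || now < b.score - tolerance) regressed.push(b.caseId);
  }
  return regressed;
}

// One line of output that a CI step can search for.
function report(regressed: readonly string[]): string {
  return regressed.length === 0
    ? "OK: no golden-set regressions"
    : `REGRESSION DETECTED: ${regressed.join(", ")}`;
}
```

Keeping the marker as a plain string in the output is what lets the workflow stay a simple text search rather than a parser.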

A specific regression the harness has caught, in four months of operation: none. That is worth being honest about.

I cannot tell you whether that is because my prompt changes were never bad enough to trigger one, because the golden set is not broad enough to catch the subtle drift that would have shown up at scale, or because the discipline of having the harness shaped how I thought about prompt changes in the first place. All three are plausible.

What the harness did was let me ship prompt changes more confidently, because there was a runnable check between me and production. That confidence was worth building even when nothing was ever flagged. At sufficient scale, a regression will land. When it does, the gap between having an eval system already running and needing to build one is the difference between a five-minute fix and a three-week one.

The other reliability pieces are smaller but worth listing. Vercel AI Gateway in front of every request, for unified logging, observability, and a single rate-limit chokepoint. Upstash Redis for request-layer rate limiting beyond what the gateway handles. Retry policies declared per tier, two to three attempts depending on tier. Streaming timeouts that distinguish between total request time, time to first chunk, and chunk-to-chunk stalls. None of these are novel; all of them are the boring infrastructure that turns a demo into a product.
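The three-way timeout distinction is easier to see in code. The following is a sketch only: the option names, the error class, and the wrapper function are hypothetical, and it wraps a generic async iterable rather than any particular SDK's stream type.

```typescript
// Hypothetical sketch: option names and the error type are assumptions.
interface StreamTimeouts {
  totalMs: number; // budget for the whole request
  firstChunkMs: number; // budget until the first chunk arrives
  stallMs: number; // budget between consecutive chunks
}

type TimeoutKind = "total" | "first-chunk" | "stall";

class StreamTimeoutError extends Error {
  readonly kind: TimeoutKind;
  constructor(kind: TimeoutKind) {
    super(`stream timeout: ${kind}`);
    this.kind = kind;
  }
}

async function* withTimeouts<T>(
  source: AsyncIterable<T>,
  timeouts: StreamTimeouts,
): AsyncGenerator<T> {
  const it = source[Symbol.asyncIterator]();
  const deadline = Date.now() + timeouts.totalMs;
  let first = true;
  try {
    while (true) {
      const gapMs = first ? timeouts.firstChunkMs : timeouts.stallMs;
      const remainingMs = deadline - Date.now();
      // Whichever budget runs out first decides both the wait and the error kind.
      const waitMs = Math.max(0, Math.min(gapMs, remainingMs));
      const kind: TimeoutKind =
        remainingMs <= gapMs ? "total" : first ? "first-chunk" : "stall";

      let timer: ReturnType<typeof setTimeout> | undefined;
      const timeout = new Promise<never>((_, reject) => {
        timer = setTimeout(() => reject(new StreamTimeoutError(kind)), waitMs);
      });

      // Clear the timer as soon as the race settles, before yielding.
      const result = await Promise.race([it.next(), timeout]).finally(() =>
        clearTimeout(timer),
      );
      if (result.done) return;
      first = false;
      yield result.value;
    }
  } finally {
    // Best effort: ask the source to stop, without waiting on a stalled producer.
    it.return?.().catch(() => undefined);
  }
}
```

Reporting the kind separately matters for retries: a slow first chunk and a mid-stream stall usually call for different responses, and a single total timeout cannot tell them apart.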

What I’d take into another product

The tier abstraction. Picking a model per task is the wrong unit of decision; picking a tier and assigning tasks to it is the right one. The tier captures the latency budget, the temperature philosophy, the retry policy, and the safety caps in a single object that downstream code references by name. New AI feature, new tier assignment, zero fresh decisions about timeouts or fallbacks.
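As a sketch of what "a single object that downstream code references by name" might look like: the tier names, task names, model identifiers, and every number below are invented for illustration. The only detail taken from the text is that retry counts range from two to three attempts depending on tier.

```typescript
// Hypothetical sketch: tier names, task names, model ids, and numbers are made up.
interface ModelTier {
  model: string; // placeholder id, resolved by the gateway
  temperature: number;
  maxOutputTokens: number; // safety cap
  timeoutMs: number; // latency budget
  maxAttempts: number; // retry policy
}

type TierName = "fast" | "balanced" | "deep";

const TIERS: Record<TierName, ModelTier> = {
  fast: { model: "provider/small-model", temperature: 0.2, maxOutputTokens: 512, timeoutMs: 8000, maxAttempts: 2 },
  balanced: { model: "provider/medium-model", temperature: 0.5, maxOutputTokens: 1024, timeoutMs: 20000, maxAttempts: 3 },
  deep: { model: "provider/large-model", temperature: 0.7, maxOutputTokens: 4096, timeoutMs: 60000, maxAttempts: 3 },
};

type Task = "accomplishment-refinement" | "reflection-question" | "quarterly-summary";

// A new feature adds one line here and inherits everything else.
const TASK_TIER: Record<Task, TierName> = {
  "accomplishment-refinement": "fast",
  "reflection-question": "balanced",
  "quarterly-summary": "deep",
};

function settingsFor(task: Task): ModelTier {
  return TIERS[TASK_TIER[task]];
}

// The retry policy comes from the tier, not from the call site.
async function callWithRetries<T>(
  tier: ModelTier,
  attempt: (attemptNumber: number) => Promise<T>,
): Promise<T> {
  let lastError: unknown;
  for (let n = 1; n <= tier.maxAttempts; n++) {
    try {
      return await attempt(n);
    } catch (error) {
      lastError = error;
    }
  }
  throw lastError;
}
```

Swapping the model behind a tier is then a one-line change to `TIERS`, and no call site has to know it happened.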

The orchestrator-specialist split for any conversation that needs to move through structured phases. Reflection, goal-setting, performance prep, anything that has phases. The model is not good at state across long contexts. The code is. Let each one do what it is good at.

The eval harness as a non-negotiable. Building AI features without an eval harness is shipping a product whose quality you cannot measure. Even a small golden set with regression detection on PR is enough to start; the discipline of running it grows the set over time.

The boring observability. AI Gateway, request logs, cost tracking, rate limiting. None of this looks like AI product craft on a portfolio. All of it is what makes the AI product behave like a product instead of a demo.

The meta-point

AI product work has three layers that get conflated. There is the model layer, where the choice is which model, what temperature, what budget. There is the orchestration layer, where the choice is what the model decides and what the code decides. There is the scope layer, where the choice is what each agent is allowed to talk about.

The mistake I made early, and that I see often elsewhere, is to handle all three through prompting. Better prompts, longer prompts, cleverer prompts. None of it works past a certain point because the model was never the right place to encode hard constraints, define state machines, or scope domains.

The right division of labor: the model decides language, the code decides flow, the scope decides domain. Once those three layers are separate, the model’s biases stop being problems and start being characteristics you design around.

Architecture & Technology

Wed, 15 Apr 2026 00:00:00 GMT

Solo Architecture: Nine Apps, Twenty-One Packages

A walk through the olllo monorepo: what was built, what held up, and what I would reconsider with four months of hindsight.

Opening

Architecture for one is a different problem from architecture for many. The team-of-ten answer is to design for handoffs, parallelism, and reviewer load. The team-of-one answer has to design for someone else: future-you, who will inherit every decision in three months and not remember why.

The olllo monorepo is the version of that I shipped. Nine apps, twenty-one packages, thirty-eight numbered features over four months. Most of it I would build the same way again. Some of it I would not. This case study is the honest version of both.

The shape

The repo is a Turborepo workspace on top of pnpm. The split between apps and packages is the spine: apps are deployed surfaces, packages are libraries that two or more apps share.

The apps:

app: the authenticated product where users live, Next.js 16 on the App Router, 112 API routes
api: the webhook + cron + admin surface, deployed separately from app (more on this below)
web: the marketing site
mobile: the iOS/Android client (Expo SDK 54)
email: the React Email preview app
docs, storybook, studio: internal tooling
workflows: background jobs deployed as a separate service

The packages, grouped by what they do:

Foundation: typescript-config, next-config, database (Prisma), design-system (shadcn/ui-based, with the caveats in Cross-Platform Consistency)
Product domains: ai, chatbot, email, payments, notifications
Cross-cutting: auth (Clerk), feature-flags, internationalization, rate-limit, security, storage, webhooks (Svix-based outbound), observability (Sentry + Logtail)
Operational: analytics, seo, workflow-utils

This is more surfaces than a typical solo product. The shape is intentional: each app is a separate deployment with separate concerns, each package is a boundary I knew I would want to swap or scale independently.

Decisions that held up

Four architectural choices I would defend in a hostile interview.

Splitting `apps/api` from `apps/app`

apps/api is its own Next.js deployment on its own port (3002 in dev). It serves only inbound webhooks (Clerk auth, Stripe payments, Resend email), cron endpoints (keep-alive, drip emails), and a thin admin API. The user-facing routes live in apps/app, which has 112 API routes for product features.

The reasons are practical. Webhooks need signature verification and no Clerk session; product routes need exactly the opposite. BotID protection lives on api, not on app. A flaky Stripe handler should not be able to take down the user-facing app. Webhook load is bursty and external; product load is smooth and authenticated. No single one of those concerns is decisive; the combination is what made the split worth maintaining.

The cost is one extra Next.js app to deploy and a slightly more complex local-dev story (the Stripe CLI forwards to port 3002, the app runs on port 3000). The benefit is that I never had to think about “should this middleware run on the webhook” or “what if this cron starves the product.”
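To make the "signature verification and no session" posture concrete, here is a generic HMAC-SHA256 check using Node's built-in crypto module. It is a sketch under stated assumptions: the function names are hypothetical, and the scheme is deliberately generic. Each real provider defines its own header format and signing scheme, so production code should follow the provider's documented verification method rather than this one.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Hypothetical sketch of a generic HMAC-SHA256 check. Real providers each
// define their own headers and signing scheme; this is not any vendor's format.
function sign(rawBody: string, secret: string): string {
  return createHmac("sha256", secret).update(rawBody, "utf8").digest("hex");
}

// Verifies against the raw request body, before any JSON parsing.
function verifySignature(rawBody: string, signatureHex: string, secret: string): boolean {
  const expected = Buffer.from(sign(rawBody, secret), "hex");
  const received = Buffer.from(signatureHex, "hex");
  // timingSafeEqual throws on a length mismatch, so compare lengths first.
  if (received.length !== expected.length) return false;
  return timingSafeEqual(received, expected);
}
```

A route that runs this check needs the untouched request body and no user session, which is the opposite of what an authenticated product route wants from its middleware.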

The four reasons the split is worth maintaining, side by side. Auth posture and load shape diverge cleanly across the boundary; blast radius and BotID enforcement only work because the boundary exists.

Vercel AI Gateway as a universal chokepoint

Every AI request in the system routes through Vercel AI Gateway. The config in packages/ai/lib/config/model-tiers.ts declares enabled: true and there is no opt-out. AI Product Craft covers the AI design in detail; architecturally, the gateway is the single most consequential decision in the AI surface.

One observability surface. One rate-limit boundary. One cost-tracking layer. One place to swap providers underneath. When Anthropic releases a faster model, the swap is a config change. When I want to know what the system spent on AI yesterday, the answer is one screen. When I add a new AI feature, none of the observability work is fresh.

Single Postgres + single Prisma schema

The data layer is one Vercel Postgres instance with one Prisma schema covering everything: user data, marketing consent, subscriptions, accomplishments, reflections, goals, contacts, notification events. No separate marketing DB, no separate auth DB, no separate analytics DB.

The temptation to split was always there. Each domain feels like it deserves its own schema. The reason I did not split is that the data needs to join. Marketing consent correlates with billing status. Onboarding completion correlates with reflection cadence. Notification eligibility correlates with feature flag state. Splitting the database would have produced sync work for no real isolation gain: the same JOIN logic, just spread across services and probably implemented as a custom event bus that nobody asked for.

A team would eventually outgrow this default. As a solo product, olllo never reached the size where the limits showed.

The four-layer component hierarchy

Components live in one of four places, codified in the constitution:

packages/design-system/components/: UI primitives, no business logic (shadcn/ui base + form wrappers)
apps/app/components/{domain}/: used across two or more features
apps/app/app/(authenticated)/_components/: layout shell (sidebar, header, nav)
apps/app/app/(authenticated)/{feature}/_components/: feature-route components

The decision tree fits on one page of the constitution. New component, single question: where does this live? The answer is determined, not negotiated. There is a promotion path when a component graduates from feature-route to app-shared to design system.

The reason this held up: solo, the temptation is to treat every component as worthy of promotion to design system because you have seen it twice. The hierarchy makes you wait until the third use, which is when the abstraction is actually safe. Most components stay in _components folders, which is exactly where they should be.
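The decision tree can be sketched as a single function. This is one possible reading, not the constitution's actual wording: the input shape, the function name, the order of the checks, and the interpretation of "third use" as three feature routes are all assumptions. The four destination paths are the ones listed above.

```typescript
// Hypothetical sketch: the input shape and rule order are assumptions.
interface ComponentFacts {
  domain: string;
  usedByFeatures: readonly string[]; // feature routes that render the component
  hasBusinessLogic: boolean;
  isLayoutShell: boolean; // sidebar, header, nav
}

function placementFor(c: ComponentFacts): string {
  if (c.isLayoutShell) return "apps/app/app/(authenticated)/_components/";

  const uses = c.usedByFeatures.length;
  // Promotion to the design system waits for the third use and for the
  // component to be a pure UI primitive.
  if (uses >= 3 && !c.hasBusinessLogic) return "packages/design-system/components/";
  if (uses >= 2) return `apps/app/components/${c.domain}/`;

  // Default: stay next to the one feature that uses it.
  const feature = c.usedByFeatures[0] ?? "{feature}";
  return `apps/app/app/(authenticated)/${feature}/_components/`;
}
```

Most inputs fall through to the last branch, which matches the observation that most components stay in feature-route `_components` folders.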

Four layers, one decision tree. New component? Single question: where does this live? The answer is determined, not negotiated.

Decisions I’d reconsider

Expo for mobile

The mobile app is Expo SDK 54. I would not use Expo if I were starting today.

The reasoning is direct experience. Since shutting down olllo, I have spent time in pure native iOS development, and the experience of controlling the build, debugging across the simulator and a physical device, and shipping changes is materially better without the translation layer. Expo’s promise is “write JavaScript, ship to two platforms.” The promise is real. The cost is also real: when something breaks, the surface area of “is this a JS bug, an Expo SDK bug, a Metro bundler bug, an iOS-Expo-config issue, or a real native bug” is wide. With native, the surface is narrower, the tools are sharper, and the debugger does not lie to you about what is on the device.

For a solo product genuinely targeting both iOS and Android from day one, Expo is still defensible. For a solo product where iOS is the primary surface (which is what olllo’s mobile ended up being once usage data showed where the users actually were), I would write Swift.

The half-built analytics abstraction

packages/analytics is named for swappability. Call sites import analytics from @repo/analytics rather than from posthog-js, which is exactly what you would want if you ever needed to swap providers. The boundary is at the import path.

The methods underneath are not abstracted. analytics.capture(), analytics.identify(), analytics.flush() are PostHog method names. The package re-exports posthog-js directly with a renamed identifier. The server-side variant instantiates new PostHog(...) and the noop development shim mimics the PostHog interface.

Now that I am moving off PostHog (the UI is harder to use than I expected, the Slack integration is shallow, the views I want are easier to build myself), the incompleteness is visible. Two paths forward:

Build the homegrown analytics layer to match PostHog’s method API. The package swap stays at one file.
Update all the call sites to a new API. The package shape changes too.

Either works. The lesson is that a half-abstraction is worse than no abstraction in one specific way: it makes you think the swap is cheaper than it is. A package boundary named for swappability looks like an interface; it isn’t, until the methods underneath are wrapped too.

If I were doing this again, I would either go all the way (a generic capture/identify interface, PostHog wrapped inside it) or not at all (call sites import posthog-js directly, with a one-time migration cost when the swap eventually came). The middle path is the expensive one.
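A sketch of the "all the way" option, under stated assumptions. The `Analytics` interface, its method names, and the adapter functions are hypothetical. `VendorClient` is a stand-in for whatever the vendor SDK exposes; the text names capture(), identify(), and flush() as the methods in use, but the exact signatures shown here are assumed, not copied from the SDK.

```typescript
// Hypothetical sketch: interface and adapter names are illustrative.
type Properties = Record<string, string | number | boolean | null>;

// The product-facing interface. Call sites depend on this and nothing else.
interface Analytics {
  track(event: string, properties?: Properties): void;
  identifyUser(userId: string, traits?: Properties): void;
  flush(): Promise<void>;
}

// Stand-in for the vendor surface; the signatures are assumptions.
interface VendorClient {
  capture(event: string, properties?: Properties): void;
  identify(distinctId: string, properties?: Properties): void;
  flush?(): void | Promise<void>;
}

// The only file that knows the vendor's method names.
function createVendorAnalytics(client: VendorClient): Analytics {
  return {
    track: (event, properties) => client.capture(event, properties),
    identifyUser: (userId, traits) => client.identify(userId, traits),
    flush: async () => {
      await client.flush?.();
    },
  };
}

// Development shim that implements the product interface, not the vendor's.
function createNoopAnalytics(): Analytics {
  return {
    track: () => undefined,
    identifyUser: () => undefined,
    flush: async () => undefined,
  };
}
```

With this shape, replacing the vendor means writing one new adapter. The half-abstraction described above fails that test because call sites already speak the vendor's method names.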

Bought vs built

Things I bought:

Clerk (auth, sessions, social login, org management, webhooks)
Stripe (subscriptions, billing, customer portal)
Resend (transactional + marketing email)
Knock (multi-channel notification routing, user preferences)
Vercel (hosting, edge runtime, AI Gateway)
Anthropic via the gateway (Claude models)
Sentry (error tracking)
Logtail (structured logs)
Upstash Redis (rate limiting)
Sanity (marketing copy CMS)

Things I built:

The reflection multi-agent flow (see AI Product Craft)
The accomplishment refinement chat
The waitlist + invite + free-forever-grant system (see Growth Engineering)
The marketing email consent + tokenized unsubscribe (see Growth Engineering)
The voice capture pipeline (audio captured, transcribed, then extracted into a STAR-format entry)

The split is roughly: bought every commodity, built every product surface. The instinct was right in nearly every case. Auth, payments, and email delivery are all commodities. Reflection conversation is the product, and a custom build was the only way it could have worked.

Two vendor decisions look different in hindsight than I thought they would when I picked them.

PostHog, which I am now replacing with a homegrown analytics layer. The product was harder to use and integrate than I expected, especially around chart customization and Slack alerts. I have started building lightweight in-app analytics tailored to the metrics olllo actually needed, and it has been surprisingly cheap. There is complexity I might be missing that PostHog handles for free: funnel tools, retention cohort math, session replay. Whether the homegrown version stays simple as I add use cases is the open question.

Knock turned out to be narrower than I bought it for. I picked Knock for multi-channel notification routing across email, in-app, and push, with user preferences and a send-history API. In practice I used it only for schedule management. Keeping email styling consistent across olllo meant rendering templates inside my own @repo/email package and sending them through Resend, so the real flow is: Knock fires a scheduled webhook, apps/api listens, the email package renders, Resend delivers. The promise of Knock-as-multi-channel-sender did not survive the practical need for consistent email design. With hindsight, I would consider replacing the Knock dependency with a cron and my own scheduling, since the value I extracted was the schedule, not the multi-channel send.

The rest of the bought stack paid off cleanly. Clerk, Stripe, Resend, Vercel + AI Gateway, Sentry, Logtail, Upstash, and Sanity all behaved as advertised and saved meaningful build time on day one.

Why solo

The honest answer to “why solo” is not “I prefer working alone.”

I started olllo with a co-builder, and after about a month they became unavailable. I had two options at that point: pause and find another co-builder, or absorb the second seat and keep going. The architecture decisions documented above are mostly downstream of choosing to keep going.

A co-built version of olllo would probably look different. Some of the structural discipline I imposed on myself (the constitution, the spec gates, the four-layer hierarchy) exists because the team-of-one cannot rely on review to enforce taste, so it has to be codified. With a second engineer, more of the discipline could have been informal. With a third or fourth, more of the package boundaries would have been driven by ownership rather than coupling.

The architecture is what it is partly because of who built it. That is worth saying out loud rather than pretending the structure was always the plan.

What I’d take into another product

The apps + packages split, with a Turborepo backbone and a pnpm workspace. Universal default for any product more complex than a single Next.js app.

The webhook/cron deployment isolation. Anything that runs on someone else’s schedule (Stripe, Clerk, Resend, cron) belongs in its own deployment with its own auth posture. This will not feel necessary on day one. It will feel obvious by month three.

The four-layer component hierarchy, codified in a constitution. Cheap to enforce, expensive to retrofit, scales to a team without modification.

The AI Gateway pattern. Whatever provider you pick, route everything through one chokepoint, get observability for free, treat model swaps as config rather than refactors.

The single-database default. Split when you have a reason. Do not split because the domains feel different.

What I would not bring forward: Expo, the half-abstracted analytics layer, and the way I integrated Knock without checking that the multi-channel send promise would survive my own design constraints.

The meta-point

The point of architecture in a solo + AI build is not to look like a senior engineer. It is to make decisions that compound, draw boundaries that future-you will recognize, and skip the abstractions that are flattering on day one and expensive on day ninety.

Most of what I built was the right shape. The pieces I would swap are the ones where I drew a line and then did not finish enforcing it. A package called @repo/analytics that exposes PostHog’s method surface is dishonest in a small but consequential way. Future-you will believe the package name and discover the cost at the exact moment the swap was supposed to be cheap.

The architectural taste I want to carry forward is unsentimental about that. Either the boundary is the interface, or the boundary is just a folder. Both are fine. The thing to avoid is the boundary that pretends to be an interface and is not.

Growth Engineering Experiments

Wed, 15 Apr 2026 00:00:00 GMT

Growth Infrastructure Cannot Manufacture Habit

A case study of what I learned trying. Four months of growth engineering across waitlist, surveys, drips, consent, and free-forever grants. What each piece was supposed to do, what each piece actually did, and the lesson I would take into the next product.

Opening

This is the twin of the post-mortem. The post-mortem tells the story of behavior: people identified the pain, said it was real, and even with the price gate removed did not reliably show up. This case study is the part where I tried to fix that with infrastructure, and discovered that infrastructure is not the lever.

That is the thesis in one sentence: growth infrastructure cannot manufacture habit. If the habit is there, infrastructure amplifies it. If the habit is thin, infrastructure measures the thinness with very high precision. Pricing experiments are the visible failure inside that frame. Behavior is the deeper one.

I built a lot of infrastructure.

The shape of the growth stack

Five layers, each with a job.

Top of funnel — a waitlist on the marketing site, fed by LinkedIn posts, with a short survey to qualify intent. Admit users in waves.
Activation — drip tips, onboarding reminders, and progress nudges, sent to bring people back on day two, three, and seven.
Virality — referral codes with two-sided extensions, scaffolded into the billing path but never deployed to users.
Goodwill — free-forever grants for early users who stayed engaged through rough edges. Gratitude as an entitlement.
Compliance — a marketing email consent system with tokenized unsubscribe and per-topic preferences, in place before any marketing email got sent.

Underneath: a transactional email pipeline (queue → cron → delivery → telemetry) shared by all four user-facing layers, with the platform caveats covered in Solo Architecture.

That is more growth machinery than most pre-launch products carry. The shape is intentional in some places, premature in others. The case study is about which was which.

What each piece was actually for

The honest version of why I built each layer. None of it was generic “growth hacking.” Each piece had a real reason that made sense at the time.

Waitlist + survey

Two jobs at once: get the name olllo in front of people through LinkedIn marketing, and control the rate at which they entered the beta. I wanted to admit users in waves, because if there was a problem (a bug, a thin feature, an onboarding rough edge), I would rather hit it on twenty users than on two hundred. The waitlist was a safety mechanism as much as a marketing one.

The survey on the waitlist replaced an earlier Typeform setup. Bringing it in-app meant the responses landed in the same database as the rest of the user data, so I could correlate “what someone said on the survey” with “what they actually did once admitted.”

Free-forever grants

Not a pricing experiment. A thank-you to the early users who provided feedback, helped surface unknown technical issues, and stayed engaged through rough edges. The grant system was deliberate gratitude, encoded as an entitlement that would persist even after I shut the product down.

Drip tips + onboarding reminders

Celebration and continuation. Each tip was framed around progress the user had made, with a reason to come back and a reminder of why the next step mattered. Not aggressive re-engagement, not manufactured urgency. Closer to what I would have written by hand if I had been emailing each beta user individually.

Marketing email consent

Compliance from day one. Tokenized unsubscribe links signed with JWT, per-topic preferences, an audit trail of consent events. I built this before any marketing email got sent, because I never wanted to be in the position of retrofitting compliance after a campaign had already gone out.
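A minimal sketch of a JWT-signed unsubscribe token, assuming HS256 and Node's built-in crypto module. The claim names (`sub`, `topic`, `exp`) and both function names are assumptions, and the signing is hand-rolled only to keep the example self-contained; a production system would more likely use a maintained JWT library.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Hypothetical sketch: claim names and helper names are assumptions.
interface UnsubscribeClaims {
  sub: string; // subscriber id
  topic: string; // the per-topic preference this link turns off
  exp: number; // expiry, in seconds since the epoch
}

const b64url = (text: string): string => Buffer.from(text, "utf8").toString("base64url");

function hmac(data: string, secret: string): Buffer {
  return createHmac("sha256", secret).update(data).digest();
}

function signUnsubscribeToken(claims: UnsubscribeClaims, secret: string): string {
  const header = b64url(JSON.stringify({ alg: "HS256", typ: "JWT" }));
  const payload = b64url(JSON.stringify(claims));
  const signature = hmac(`${header}.${payload}`, secret).toString("base64url");
  return `${header}.${payload}.${signature}`;
}

// Returns the claims for a valid, unexpired token, otherwise null.
function verifyUnsubscribeToken(
  token: string,
  secret: string,
  nowSeconds: number,
): UnsubscribeClaims | null {
  const parts = token.split(".");
  if (parts.length !== 3) return null;
  const [header = "", payload = "", signature = ""] = parts;

  // Always recompute with HS256; the header's alg field is never trusted.
  const expected = hmac(`${header}.${payload}`, secret);
  const received = Buffer.from(signature, "base64url");
  if (received.length !== expected.length || !timingSafeEqual(received, expected)) {
    return null;
  }

  const claims = JSON.parse(Buffer.from(payload, "base64url").toString("utf8")) as UnsubscribeClaims;
  return claims.exp > nowSeconds ? claims : null;
}
```

Because the token carries the subscriber and the topic, the unsubscribe link can work without a login, and a tampered link fails verification instead of changing someone else's preferences.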

What the data actually showed

The infrastructure worked. The metrics did not.

Top of funnel filled the way LinkedIn-driven waitlists usually do, in ones and twos:

Top of funnel	Count
Waitlist signups	30
Survey completions	14

From the thirty on the waitlist, I split admissions into two cohorts to learn something about the price gate. Ten were granted accounts with no payment step. Twenty were given a Stripe subscription path that required a credit card up front, in exchange for the beta plan: a 60-day free trial then $48/year (50% off the $96/year retail), with two extra months free on the annual plan.

Activation	Cohort A — no payment (n=10)	Cohort B — Stripe gate (n=20)
Account created	9	2
Onboarding completed	5	0
Active at 1 month (4+ weekly logs)	3	0
Paid conversion	n/a	0

Two readings sit on top of each other.

The credit-card gate was a hard wall. Eighteen of twenty users never finished account setup, and the two who did never completed onboarding. The deal on offer was generous (60-day free trial, $48/year at 50% off retail, two more months free on annual), and it did not matter. Asking for a credit card before a user has done anything in the product is asking the user to commit before they have a reason to.

The free cohort surfaced the deeper signal. With the payment friction removed, the funnel still narrowed: only five of ten completed onboarding, and only three of ten were still logging weekly at week four. The daily-capture habit did not form for most of the cohort that articulated the pain most clearly. That is the same shape fitness apps and weight-loss apps live with: a population that names the goal correctly and does not reliably show up to do the work. It is the signal the post-mortem leans on.

Cohort A. With the payment friction removed, the funnel still narrowed: 9 of 10 created accounts, 5 of 10 onboarded, 3 of 10 still logging weekly at week four. Even free, the habit didn’t form for most of the cohort that named the pain most clearly.
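The arithmetic behind both readings, as a small sketch. The counts come straight from the activation table; the type and helper names are illustrative. With cohorts of ten and twenty, the percentages are descriptive only.

```typescript
// Counts are from the activation table; names are illustrative.
interface Cohort {
  label: string;
  admitted: number;
  created: number;
  onboarded: number;
  activeAtOneMonth: number;
}

const cohortA: Cohort = { label: "A (no payment)", admitted: 10, created: 9, onboarded: 5, activeAtOneMonth: 3 };
const cohortB: Cohort = { label: "B (Stripe gate)", admitted: 20, created: 2, onboarded: 0, activeAtOneMonth: 0 };

// Whole-number percentage, guarding against an empty denominator.
const pct = (part: number, whole: number): number =>
  whole === 0 ? 0 : Math.round((part / whole) * 100);

function funnel(c: Cohort) {
  return {
    createdPct: pct(c.created, c.admitted),
    onboardedPct: pct(c.onboarded, c.admitted),
    activePct: pct(c.activeAtOneMonth, c.admitted),
    // Step conversion: of those who created an account, how many onboarded.
    onboardedOfCreatedPct: pct(c.onboarded, c.created),
  };
}

console.log(cohortA.label, funnel(cohortA));
// → createdPct 90, onboardedPct 50, activePct 30, onboardedOfCreatedPct 56
console.log(cohortB.label, funnel(cohortB));
// → createdPct 10, onboardedPct 0, activePct 0, onboardedOfCreatedPct 0
```

The gap between 90% and 10% account creation is the price-gate reading; the slide from 90% to 30% inside cohort A is the habit reading.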

By layer, the same pattern shows up everywhere infrastructure was supposed to do the lifting:

Each layer worked. Each layer also failed to produce the outcome it was built for.

The instrumentation surfaces the conclusion cleanly, which is what good instrumentation is supposed to do. The growth stack did its job. Its job was to measure, and the measurement was not what I wanted it to be.

The order of operations was wrong

The chronology of what shipped is the part of this case study I would change if I were doing it again.

The features shipped in roughly this order across the back half of the project:

Late January: referrals scaffolded into the Stripe billing feature (never operationalized)
Late January: marketing email consent system
Early February: drip tip system + onboarding reminder emails
Late March: free-forever grant migration
Late March: in-app waitlist survey, replacing the earlier Typeform

Read that list against what each piece is supposed to do, and the inversion is visible. I built virality scaffolding (referrals) and compliance (consent) early, then activation (drips, reminders), and shipped the waitlist survey near the end — the piece that should have come first to tell me whether to build the rest came last.

The order I shipped reflects what was easy to build at each moment, not what would have validated demand fastest. Referrals were a natural extension of the Stripe billing work, so they came when billing did. Compliance was a clean self-contained project. Drips required content and scheduling infrastructure, which took longer to set up. The waitlist survey came late because the Typeform version had been good enough to delay the migration.

A more disciplined order would have started with activation. If new users were not coming back on day two, none of the other layers were going to help.

What shipped versus what I would ship next time. The next-time ordering puts activation first; virality only earns its slot after the curve bends without help.

What I would do differently

The next time I build a product like this, I would not invest in scale-up infrastructure until I had verified, hands-on, that the product was meeting users’ needs and that the early audience was generating organic word-of-mouth without my help.

Concretely, that means:

Running the first ten or twenty users through the product manually, in close one-on-one observation, before any drip system gets built
Driving week-over-week adoption through that hands-on attention until the curve bends without me touching it
Only then investing in the scale machinery (drips, waitlists, surveys, and virality after the curve bends) that would let me step back

Growth infrastructure is a multiplier. Applied to zero demand it produces zero growth; applied to a real signal it compounds. The mistake I made was reaching for the multiplier before the signal was there.

The cleanest version of this lesson, the one I would tell another founder who asked: measure with people first, instrumentation second. Fifteen-minute calls with the first cohort tell you more in a week than a drip system tells you in three months. The instrumentation has its place, but not before the calls.

What I would defend

The compliance-first instinct.

Most products build the marketing system, run a campaign, and then bolt on consent management once the legal requirement gets noticed or once a user complaint surfaces. I built the consent system before the campaign system, with tokenized unsubscribe and per-topic preferences in place from the first marketing email I ever sent. The audit trail of consent events meant I could prove, for any subscriber, when they opted in and to what.

That work was invisible to users (which is exactly what compliance work should be) and would have saved me a hard problem if olllo had ever scaled to a place where data subject requests started arriving. It was also cheap to do up front and expensive to retrofit. The instinct to build compliance before need is one I would carry into every product going forward.
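One way an audit trail of consent events can answer "when did they opt in, and to what" is to keep the log append-only and derive current state from it. The sketch below assumes that design; the event shape, field names, and function names are hypothetical, and timestamps are assumed to be ISO 8601 strings in UTC so that string order matches time order.

```typescript
// Hypothetical sketch: the event shape and names are assumptions.
type ConsentAction = "opt_in" | "opt_out";

interface ConsentEvent {
  subscriberId: string;
  topic: string;
  action: ConsentAction;
  at: string; // ISO 8601 UTC timestamp, so lexical order is time order
  source: string; // where the event came from, e.g. a form or an unsubscribe link
}

// Events are only ever appended. Current state is the latest event per topic.
function currentConsent(
  events: readonly ConsentEvent[],
  subscriberId: string,
): Map<string, ConsentEvent> {
  const ordered = events
    .filter((e) => e.subscriberId === subscriberId)
    .sort((a, b) => (a.at < b.at ? -1 : a.at > b.at ? 1 : 0));
  const latest = new Map<string, ConsentEvent>();
  for (const e of ordered) latest.set(e.topic, e);
  return latest;
}

function canSend(events: readonly ConsentEvent[], subscriberId: string, topic: string): boolean {
  return currentConsent(events, subscriberId).get(topic)?.action === "opt_in";
}

// The event that current consent rests on, or undefined if there is none.
function proofOfOptIn(
  events: readonly ConsentEvent[],
  subscriberId: string,
  topic: string,
): ConsentEvent | undefined {
  const latest = currentConsent(events, subscriberId).get(topic);
  return latest?.action === "opt_in" ? latest : undefined;
}
```

Because nothing is overwritten, the same log serves the send-time check and the after-the-fact proof, which is the property that is expensive to retrofit once preferences have been stored as a single mutable flag.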

The three compliance primitives that were in place before the first marketing email got sent. Each was cheap to build day-one and expensive to retrofit later.

The honest takeaway

Growth infrastructure measures and amplifies. It does not generate the underlying habit. The temptation in solo building is to mistake the act of building growth machinery for the act of growing, because the machinery is concrete and visible while the underlying habit is abstract and uncomfortable.

Growth = signal × infrastructure. Multiply five layers by zero and you still get zero, instrumented precisely.

I felt that temptation. Every drip email I shipped felt like progress. Every referral mechanic felt like traction. Every survey response felt like signal. None of it was wrong, and none of it was the bottleneck.

The bottleneck was upstream. The post-mortem covers what was upstream and what it taught me. This case study is the instrumentation that surfaced the conclusion. Without the growth stack, I would have taken longer to read the demand signal accurately. With the growth stack, the read was unambiguous, and the decision to stop was easier to make.

That is worth something. It is just not what I built the stack to be worth.

Design System Across Web + Native

Wed, 15 Apr 2026 00:00:00 GMT

Cross-Platform Consistency Is a Systems Problem Until It’s a Platform Problem

What I learned trying to keep three surfaces consistent without a design team. Why component libraries are no longer enough in the AI era. And the case for what I would build instead, the next time someone hands me a blank monorepo and a Claude API key.

Opening

Cross-platform consistency is a systems problem until it’s a platform problem. That is the short version of what I learned trying to keep three surfaces of olllo visually coherent.

The longer version is two layers deep, and the case study has to walk both. The first layer is the work itself: shadcn/ui as a primitives base, NativeWind to carry Tailwind syntax to the Expo mobile app, a four-layer component hierarchy enforced by the project constitution, a Next.js manifest making the web installable as a PWA. The second layer is the realization that what I had was not a design system. It was a component library wearing the words “design system” on its package.

That distinction was uncomfortable in a useful way. I had built four design systems before olllo (an Angular system, a web components system, a React system, and one on top of Chakra UI) and across all of them the lesson had been the same. The components are not the system. The patterns are. The instructions for how a button, a header, and a section interact when they appear together are. The tokens for color, spacing, type scale, and motion are. The accessibility contract is. The visualization layer that lets a non-engineer see what the system produces is. A component library is one ingredient. A design system is the recipe.

I knew this going in. I built a component library anyway, called it a design system, and moved on. Then AI started composing my components, and the gap between what I had and what I needed got loud.

What I actually built

packages/design-system looks like a design system from the outside. Inside, it is a well-organized component library:

packages/design-system/
├── components/
│   ├── ui/         # shadcn/ui primitives (50+ components)
│   ├── kibo-ui/    # chat & AI surface components
│   ├── forms/      # form field wrappers (FormInputField, FormSelectField)
│   └── pricing/    # pricing-related composites
├── hooks/
├── providers/
├── styles/
├── lib/
├── components.json # shadcn config: New York style, neutral base, CSS variables
└── postcss.config.mjs

Underneath it: shadcn/ui in the New York style, neutral base color, CSS variables for theming, lucide for icons, and a kibo-ui component family pulled in for chat and AI surfaces specifically. The mobile app uses the same Tailwind class vocabulary via NativeWind, with a separate tailwind.config.ts mirroring the web config where it can. The PWA is the Next.js app with a manifest declared at apps/app/app/manifest.ts, so “three surfaces” is honest, but the third surface is the web app in an installable wrapper.

Component placement is governed by the four-layer hierarchy from the project constitution: design-system primitives, app-shared components, layout-shell components, and feature-route components. That hierarchy was the closest thing in the project to actual pattern documentation, and it is documented and enforced through speckit (Culture as Code covers it).

The four-layer hierarchy from the project constitution. The boundary each layer enforces is the only formal pattern documentation the system has.

What is missing from this picture, viewed against any of the four design systems I had worked on before, is everything that turns a component library into a design system: documented composition patterns (when to use a Card versus an Item versus a Field, and what they should contain), motion tokens and motion guidelines, an opinionated accessibility contract beyond what shadcn ships, a visualization layer that demonstrates patterns rather than individual components, and a place where a non-engineer could review the system’s output without reading code.

I built none of that. I shipped on what was good enough for one engineer composing components by hand or by prompt, and the case study below is what that decision cost.

What shadcn/ui gave me, and what it didn’t

shadcn/ui was the right primitives layer for olllo at the moment olllo got built. The flexibility is real (every component lives in your codebase, customizable to the file), the breadth is meaningful (50+ ui primitives plus kibo-ui’s chat and AI components saved me weeks on the assistant surfaces specifically), and the integration with Tailwind and the AI tooling around it was unmatched in early 2026.

It is also not a design system. Nothing about shadcn/ui tells you when to use a Card versus an Item versus a Field for a list of accomplishments. Nothing about it constrains a button’s size to match the page-header pattern. Nothing about it documents composition. The library hands you components and gets out of the way, which is exactly what makes it useful and exactly what makes it insufficient as the only artifact in the design system slot.

Would I pick shadcn/ui again? No. The calculus has shifted in two ways since olllo started, and both push toward building something custom.

The first shift is in what I weight. Long-term stability matters more to me now than it did at the time. shadcn ships components into your codebase that you own, which is good, but the conventions around them keep moving. Tailwind has its own breaking version cycle. The NativeWind plus Tailwind combination on the mobile side adds another moving part. A custom system has a single stability surface, the code I wrote, with no version mismatches between layers and no upstream conventions evolving underneath me.

The second shift is AI capability. At the start of olllo, “build from scratch” was prohibitively slow because a custom design system is months of repetitive scaffolding work. AI assistance has improved enough that the same work moves significantly faster now. The build-custom path that was infeasible alongside the product work is feasible today, and it would leave me in a better long-term position than reaching for shadcn would.

Net: shadcn was right for where I was. The tradeoff has changed. The next system I build will be one I own end to end, with AI assistance accelerating the construction rather than a third-party library accelerating my dependence.

The AI composition problem

The clearest moment I have for this case study is small, specific, and recurring.

The standard SaaS page-header pattern is a title on the left, optional breadcrumbs above it, and a primary call-to-action button on the far right. Across olllo’s authenticated surface, that pattern appears on every list view and most detail views: Accomplishments, Goals, Reflections, Settings, every one. The button on the far right is the page’s primary action: New Accomplishment, Add Goal, Start Reflection. There is one canonical visual treatment for that button, and there should never be variation.

Across thirty-eight numbered features, the AI sometimes rendered that button as size="default" and sometimes as size="sm". Not because the prompt asked for variation. Not because I wanted variation. The model would pick a size, often the right one, sometimes a smaller one, with no reliable way to predict which.

I added checks and balances. Component conventions in CLAUDE.md. Examples in the closest spec file. A note in the constitution. Type-level constraints where I could push them down. The variation kept happening.

The variation users feel without naming. The component library allows it; the design system that should have prevented it does not exist.

The diagnosis is two parts.

The first part is a failure of the component library. shadcn’s Button component takes a size prop with default, sm, lg, and icon as values, and the component does not encode the page-header pattern. There is no Button variant called pageHeaderPrimary that is locked to the canonical size. The component library is correctly generic and incorrectly silent on the pattern.
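A minimal sketch of what that missing variant could look like, written without React so it runs standalone. The pageHeaderPrimary name comes from the paragraph above; ButtonProps is a simplified stand-in rather than shadcn's real type, and the choice of "default" as the canonical size is an assumption for illustration.

```typescript
// Simplified stand-in for a generic button's props; not shadcn's actual type.
type ButtonSize = "default" | "sm" | "lg" | "icon";

interface ButtonProps {
  label: string;
  size: ButtonSize;
  onClick?: () => void;
}

// The pattern's props deliberately omit `size`, so callers cannot vary it.
type PageHeaderPrimaryProps = Omit<ButtonProps, "size">;

// Assumed canonical size for the page-header primary action.
const PAGE_HEADER_PRIMARY_SIZE: ButtonSize = "default";

// Wraps the generic props and pins the size to the canonical value.
function pageHeaderPrimary(props: PageHeaderPrimaryProps): ButtonProps {
  return { ...props, size: PAGE_HEADER_PRIMARY_SIZE };
}

const newAccomplishment = pageHeaderPrimary({ label: "New Accomplishment" });
console.log(newAccomplishment.size);

// Passing `size` in the object literal fails type-checking, because it is
// an excess property on PageHeaderPrimaryProps:
// pageHeaderPrimary({ label: "Add Goal", size: "sm" });
```

The point of the wrapper is that the size stops being a decision. Whoever composes the page header, human or assistant, has no prop to get wrong.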

The second part is the AI part, and it is the new part. A solo developer composing components by hand, with a component library and no design system, will be reasonably consistent over time because their hands have a memory the file system doesn’t. A solo developer composing components with an AI assistant has none of that hand-memory advantage. The assistant has an opinion about button size every time it generates a page header, and the opinion drifts. Today’s prompt produces size="default". Next week’s prompt, with no relevant change in context, produces size="sm". The model is not wrong; the model is correctly inferring from a library that does not constrain the choice.

This is not a shadcn problem. It is a category problem. Component libraries assumed a developer was the constraint on consistency. With AI in the loop, the assistant is making the composition decisions, and a library that does not encode patterns will be composed inconsistently.

The AI era moves the design system requirement from useful to necessary. Without one, every prompt is a small bet on whether the model remembers what consistency looks like in your project. Some of those bets land. Enough of them land badly that a careful reader can feel the inconsistency even if they cannot name it.

That feeling is what users mean when they say a product feels off without being able to point to anything specific. It is the texture of an inconsistent system, and component libraries cannot prevent it on their own.

Why flexibility is the cost

The deeper read on this applies to any flexible component library used as the foundation for a consistent product, not just shadcn or Tailwind specifically.

The more flexible the library, the more variations an AI assistant can choose from on any given prompt. Every prop, every variant, every size, every spacing class is a degree of freedom for the model. A library with five button sizes generates more visual variation than a library with two. A library where Cards can contain anything generates more variation than one with a strict slot pattern. A library where margin can be any of twenty Tailwind classes generates more variation than one with three predefined spacing tokens.
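The arithmetic behind that claim can be sketched directly: the number of distinct outputs a composer can produce is the product of the choices available at each decision. The counts below are illustrative only. Four sizes and twenty margin classes come from the text; the six button variants are an assumption.

```typescript
// Each entry is the number of values a composer can choose for one decision.
type Freedoms = Record<string, number>;

// Total distinct combinations is the product of the per-decision counts.
function combinations(freedoms: Freedoms): number {
  return Object.keys(freedoms)
    .map((key) => freedoms[key])
    .reduce((total, count) => total * count, 1);
}

// Flexible library: four sizes, six variants (assumed), twenty margin classes.
const flexible: Freedoms = { size: 4, variant: 6, margin: 20 };

// Constrained pattern: one size, one variant, three spacing tokens.
const constrained: Freedoms = { size: 1, variant: 1, margin: 3 };

console.log(combinations(flexible));    // 480
console.log(combinations(constrained)); // 3
```

Three decisions on a single button already give 480 possible renderings in the flexible case against 3 in the constrained one, and a page contains many such decisions.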

AI composition compounds flexibility into inconsistency much faster than a human composer does. The crossover happens around shadcn/Tailwind’s degrees of freedom.

This is exactly why people love shadcn and Tailwind. The flexibility is the feature. Pre-AI, that flexibility let solo developers ship fast and tailor everything. In the AI tooling era, the same flexibility is what makes v0, Lovable, Bolt, and similar generators work at all: the model can satisfy almost any prompt because the underlying primitives can be assembled into almost any output.

The same property that makes a library good for AI tools that build is what makes it bad for AI tools that compose inside an existing product. When the goal is an opinionated UI driving consistent feel across forty-plus surfaces, flexibility is the enemy. The best design systems are the ones with the most constraints: one right way to render a page header, one right way to lay out a card, one right way to space a form. Constraints are how the system stays the system across hundreds of compositions.

Build mode and compose mode want opposite properties from the same primitives.

shadcn and Tailwind sit at exactly the wrong end of that spectrum for the consistency goal. That is not a critique of the libraries; it is a recognition that the same primitives used in two modes (build a thing fast, or compose inside an existing thing consistently) require opposite properties.

The platform problem reveals itself

Even if every component had been perfectly consistent across the codebase, cross-platform consistency would still have been the wrong goal in places.

NativeWind let me carry Tailwind class syntax into the Expo mobile app, which made styling cheap to author. What it did not carry was platform conventions. iOS users expect a sheet to slide up from the bottom with a specific easing curve, dismiss with a specific gesture, and use the system’s blur and depth conventions. Android users expect different defaults. A web user expects neither. Tailwind classes do not translate any of this; they translate visual properties.

The result was a mobile app that looked consistent with the web app at the pixel level and felt slightly off in the hand. Not broken. Not unusable. But the kind of subtle wrongness that native developers spot in a second and that translation-layer apps never quite shake.

Solo Architecture covers the broader Expo reconsideration in detail. The design system angle on it is specific: the goal of cross-platform consistency was, in retrospect, the wrong target for half the surface area. Native iOS users do not benefit from a button that looks identical to its web counterpart. They benefit from a button that uses iOS-native press behavior, haptic feedback, and platform-typical visual weight. The cross-platform consistency I was protecting was protecting nobody.

The right framing, with hindsight: there are surfaces where cross-platform consistency is a feature (brand, copy, identity color), and surfaces where it is a tax (interaction patterns, transitions, gesture vocabulary). A design system that does not distinguish between those surfaces will get both wrong.
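One way to make that distinction explicit is to split the theme into a part that is shared and a part that is resolved per platform. This is a sketch under assumptions: every name and value below is hypothetical, and the interaction values are placeholders rather than measured platform defaults.

```typescript
type Platform = "ios" | "android" | "web";

// Consistent everywhere: brand and identity. Values are placeholders.
const shared = {
  productName: "olllo",
  brandColor: "#000000",
} as const;

// Allowed to differ: interaction conventions, resolved per platform.
interface InteractionSpec {
  sheetPresentation: "bottom-sheet" | "modal-dialog";
  dismissGesture: "swipe-down" | "back-button" | "escape-key";
  hapticOnPress: boolean;
}

// Illustrative values only, not an audit of each platform's guidelines.
const interaction: Record<Platform, InteractionSpec> = {
  ios: { sheetPresentation: "bottom-sheet", dismissGesture: "swipe-down", hapticOnPress: true },
  android: { sheetPresentation: "bottom-sheet", dismissGesture: "back-button", hapticOnPress: true },
  web: { sheetPresentation: "modal-dialog", dismissGesture: "escape-key", hapticOnPress: false },
};

// Brand fields are identical for every platform; interaction fields are not.
function resolveTheme(platform: Platform) {
  return { ...shared, ...interaction[platform] };
}

console.log(resolveTheme("ios"));
console.log(resolveTheme("web"));
```

The structure forces the question for every new token: does it belong in the shared half or the per-platform half?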

The Storybook gap

Component libraries need a visualization layer. Storybook is the default answer in the React community, and Storybook is its own friction.

The version compatibility story is the worst part. Major version upgrades break stories, sometimes silently. Add-on ecosystems lag the core release schedule. CSF 2 to CSF 3 was not a free migration. A monorepo running Storybook against a Next.js 16 app and a separate Vite-based design system has at least three places where versions can disagree, and they sometimes do.

I shipped Storybook in apps/storybook because the alternative was no visualization layer at all. I did not maintain it as actively as the rest of the monorepo. Stories drifted from their components. Some were rewritten on every Storybook upgrade. By the end of the project, Storybook was a graveyard of partly-true documentation, which is worse than no documentation in one specific way: a reader trusts a partly-true Storybook the same way they trust a complete one, and gets misled.

The lesson is not that Storybook is bad. Storybook solves a real problem and there is no obvious better answer in early 2026. The lesson is that making the visualization layer a separate piece of infrastructure, with its own upgrade cycle, add-on catalog, and configuration, is a structural mistake the industry has not yet corrected.

A design system worthy of the name should not require its visualization layer to be a separate framework with separate breakages. Components, patterns, tokens, accessibility tests, and visual documentation should live in one system that upgrades together.

What I’d take into another product

Build the primitives layer myself, with AI assistance, rather than reaching for shadcn/ui. The build-custom path is feasible today in a way it was not when olllo started. Long-term stability (owning every component, every token, every pattern, with no version mismatches between layers) is worth more to me now than the day-one acceleration shadcn provided.

Treat the component library as one ingredient, not the whole system. Document composition patterns explicitly, in a place AI assistants will read on every session. CLAUDE.md is one such place; a richer version would be a patterns.md per package, with concrete examples of what good composition looks like and what to avoid.

Distinguish cross-platform consistency from cross-platform translation. Brand and identity should be consistent across surfaces. Interaction patterns should follow platform convention. Carry Tailwind syntax across surfaces if it helps, but stop pretending the result is the same product everywhere.

Skip Storybook unless and until something fundamental changes about how it is maintained. Use a smaller scoped solution (a single docs route in the design-system package, generated from real code, updated at build time) until the industry produces a unified visualization layer that does not break on its own.
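A rough sketch of the smaller scoped solution, assuming a registry of pattern entries that lives next to the components and a build step that renders it to text. PatternEntry, registry, and renderDocs are hypothetical names, not part of any existing tool.

```typescript
// One documented pattern, stored alongside the component it describes.
interface PatternEntry {
  name: string;
  description: string;
  // The same props object the app would pass to the real component.
  example: Record<string, unknown>;
}

// Hypothetical registry; in a real package each entry would be exported
// from the component's own file so the docs cannot drift from the code.
const registry: PatternEntry[] = [
  {
    name: "PageHeaderPrimary",
    description: "Primary action on the far right of a page header.",
    example: { label: "New Accomplishment", size: "default" },
  },
];

// Renders the registry to plain text for a single docs route.
function renderDocs(entries: PatternEntry[]): string {
  return entries
    .map((entry) =>
      [entry.name, entry.description, "Example props: " + JSON.stringify(entry.example)].join("\n")
    )
    .join("\n\n");
}

console.log(renderDocs(registry));
```

Because the docs are generated from the registry at build time, there is no separate set of stories to maintain or upgrade.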

The thing I would not bring forward at all is the unspoken belief that a component library plus tokens equals a design system. It does not, and the next product I build will be honest about that from day one.

The future I’d build toward

The future of design systems in the AI era is a single integrated system, not a patchwork of separate ones.

Today the responsible solo setup glues several pieces together: a primitives library, a token layer, separate accessibility testing, pattern documentation in CLAUDE.md or similar, Storybook for visual review, a motion library, and the developer’s hand-memory holding it all together. Each piece has its own upgrade cycle and its own way of being out of date. The cracks between them are where AI composes inconsistently.

Seven pieces, seven upgrade cycles. The cracks between them are where AI composes inconsistently.

The system I would build would unify these into a single source of truth that both humans and AI assistants can read and respect:

Components, tokens, and patterns in one package, versioned together
Composition patterns expressed as types, so the AI sees the constraint and the human sees the demonstration
Accessibility contracts encoded into component types, not retroactively tested
A built-in visualization layer generated from the same source as the components, with no separate Storybook to drift
A pattern enforcement layer that catches “wrong size for this context” the way TypeScript catches “wrong type for this argument”
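The last item can be sketched as a small checker that flags “wrong size for this context”. This is a minimal illustration under assumptions: the contexts, the rule table, and the ButtonUsage shape are all hypothetical, and a real implementation would extract usages from source code rather than take them as a literal array.

```typescript
type ButtonSize = "default" | "sm" | "lg" | "icon";
type Context = "page-header" | "table-row" | "form-footer";

// One observed button usage, as a hypothetical extraction step might report it.
interface ButtonUsage {
  file: string;
  context: Context;
  size: ButtonSize;
}

// Assumed rule table: exactly one allowed size per context.
const allowedSize: Record<Context, ButtonSize> = {
  "page-header": "default",
  "table-row": "sm",
  "form-footer": "default",
};

// Returns one message per usage whose size breaks the rule for its context.
function checkUsages(usages: ButtonUsage[]): string[] {
  return usages
    .filter((usage) => usage.size !== allowedSize[usage.context])
    .map(
      (usage) =>
        `${usage.file}: size "${usage.size}" is not allowed in ${usage.context}; ` +
        `expected "${allowedSize[usage.context]}"`
    );
}

const violations = checkUsages([
  { file: "goals/page.tsx", context: "page-header", size: "default" },
  { file: "reflections/page.tsx", context: "page-header", size: "sm" },
]);
console.log(violations);
```

Run in CI, a check like this turns the drift into a failing build instead of something a reader has to notice.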

The pieces exist in fragments today. Stitching them together is what the AI era is asking for. Someone will build it, because the cost of not having it compounds with every prompt that adds a small inconsistency to a product that is supposed to feel coherent.

The system the AI era needs. One source of truth, one version, one place where humans and assistants both go to learn what consistency looks like in this product.

Where this leaves us

A design system in the AI era is no longer optional infrastructure for products that want to feel coherent. The composition decisions are happening whether or not the system encodes them; the question is whether they happen with constraints or with drift.

Component libraries solved a real problem in the developer-as-composer era. That era has changed underneath us, and the libraries have not caught up. The interim discipline (explicit composition patterns in places AI will read, treating consistency as a contract instead of a hope, distinguishing the surfaces where consistency helps from the ones where it hurts) is the work of bridging the gap until the industry produces a system that closes it.

What I built for olllo was the best I could ship solo in the time I had. What I learned building it is the more interesting half of this case study, and the part I would carry into anything I build next.


Splitting `apps/api` from `apps/app`