<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Olllo case studies</title><description>Six case studies from a weekly-reflection product experiment that didn&apos;t find its audience.</description><link>https://hypoth.ai/</link><language>en</language><item><title>The Honest Post-Mortem</title><link>https://hypoth.ai/olllo/the-honest-post-mortem</link><guid isPermaLink="true">https://hypoth.ai/olllo/the-honest-post-mortem</guid><description>I shipped a product people validated as solving a real pain. Even with the price gate removed, almost none of them showed up to use it. Here&apos;s what I missed about wanting a thing versus doing it.</description><pubDate>Wed, 15 Apr 2026 00:00:00 GMT</pubDate><content:encoded>&lt;h2 id=&quot;i-shipped-a-real-problem-and-nobody-showed-up&quot;&gt;I Shipped a Real Problem and Nobody Showed Up&lt;/h2&gt;
&lt;p&gt;A post-mortem of olllo, a performance-review and accomplishment-tracking tool I built solo across four months and shut down in March 2026.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/olllo/the-honest-post-mortem/images/olllo-post-mortem-the-cliff.png&quot; alt=&quot;Cohort A retention curve. Week zero starts at 90%, drops to 50% by week one — a 40-point cliff annotated in coral — then 40% at week two, 30% at week three, and a flat tail at 30% (three of ten) through week four.&quot;&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id=&quot;opening&quot;&gt;Opening&lt;/h3&gt;
&lt;p&gt;People told me it was a real problem. Every conversation. Performance reviews that felt arbitrary. 1:1s that surfaced nothing. The 9pm panic of trying to reconstruct six months of work the night before a self-review was due.&lt;/p&gt;
&lt;p&gt;I built olllo to solve that. It works. People who tried it said it works. Even with the price gate removed, almost none of them showed up to use it consistently.&lt;/p&gt;
&lt;p&gt;This is the part of the case study where I was supposed to have a clean answer for why. I don’t. What I have is a set of signals I read correctly, a set I explained away, and a decision made in March 2026 to stop building and start writing about what I learned.&lt;/p&gt;
&lt;p&gt;The closest analogue is the fitness or weight-loss app. People articulate the goal cleanly. They will tell you, with care, that health is important. They will sign up. They will not, in numbers that build a business, show up to do the work. olllo lived in that category, and the signals were there in the data before I let myself read them.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id=&quot;the-bet&quot;&gt;The bet&lt;/h3&gt;
&lt;p&gt;The hypothesis was simple. If you can’t remember your work, you can’t advocate for it, and your career compounds slower than your effort deserves. The script I wrote for the first YouTube video argues this cleanly: it names recency bias, it cites Kahneman, it frames the problem as memory rather than motivation. I believed it when I wrote it. I still believe it.&lt;/p&gt;
&lt;p&gt;The user bet was that low-friction daily capture, reinforced by weekly reflection and quarterly summaries, would compound into a case file that made reviews, 1:1s, and compensation conversations dramatically less stressful.&lt;/p&gt;
&lt;p&gt;The business bet was narrower: that enough people felt this pain acutely enough to pay roughly $8/mo on an annual plan ($96/year before the beta discount) for a tool that replaced the ad-hoc notes app or Google Doc they were already mismanaging.&lt;/p&gt;
&lt;p&gt;The first bet was right about the pain and wrong about the behavior. Daily capture compounds only if you show up daily, and the cohort A retention says you mostly don’t. The second bet rode on the first, and they fell together.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id=&quot;what-i-built&quot;&gt;What I built&lt;/h3&gt;
&lt;p&gt;Four months of solo work. Full architecture details live in &lt;a href=&quot;/olllo/architecture-and-technology&quot;&gt;Solo Architecture&lt;/a&gt;; the short version:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A Next.js 16 web app with onboarding, reflections, goals, and accomplishment tracking&lt;/li&gt;
&lt;li&gt;A React Native mobile app with voice capture that transcribes audio and extracts structured STAR-format entries&lt;/li&gt;
&lt;li&gt;A PWA for the in-between surface&lt;/li&gt;
&lt;li&gt;A multi-agent reflection flow using Claude Sonnet and Haiku across tiered calls&lt;/li&gt;
&lt;li&gt;Infrastructure most solo products skip: rate limiting, feature flags, email authentication, consent management, referral loops, waitlist, free-forever grants (the full growth-stack story is in &lt;a href=&quot;/olllo/growth-engineering-experiments&quot;&gt;Growth Engineering&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;40+ numbered feature branches shipped, each gated through a speckit workflow that required a spec, clarifications, a plan, tasks, and an analysis pass before implementation. The discipline story lives in &lt;a href=&quot;/olllo/culture-as-code&quot;&gt;Culture as Code&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The thing worked. That’s not in dispute.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id=&quot;what-14-people-said-they-wanted&quot;&gt;What 14 people said they wanted&lt;/h3&gt;
&lt;p&gt;The waitlist had 30 people. Fourteen filled out the survey, and the pattern in the responses was consistent enough that I should have read more meaning into it than I did.&lt;/p&gt;
&lt;p&gt;The modal respondent was a senior individual contributor at an enterprise company — staff-level engineers and designers, plus a band of people in career transition. Two-thirds heard about olllo through LinkedIn or a direct referral; almost none arrived from a search or a content channel I had not personally seeded.&lt;/p&gt;
&lt;p&gt;The pain they named was real and articulated cleanly: performance reviews, interviews and resume updates, weekly planning, and the recurring imposter-syndrome moment of “what have I actually been working on?” Six respondents cited reviews specifically. Seven cited interviews. Across role levels, the shape held.&lt;/p&gt;
&lt;p&gt;What they said they wanted built was a clean record of wins they could search later, a promotion-ready career story, and weekly reflection prompts that surface patterns. Goal tracking and a thirty-second “capture a win” both rated highly. Across the four feature-interest scores, no single feature dominated — respondents wanted the whole shape, not one piece of it.&lt;/p&gt;
&lt;p&gt;What they said would stop them is the section this post-mortem turns on. The top blockers, in order: &lt;em&gt;too much effort to capture things&lt;/em&gt;, &lt;em&gt;I don’t want another tool to maintain&lt;/em&gt;, &lt;em&gt;I’m not sure what to write&lt;/em&gt;, &lt;em&gt;I wouldn’t remember to come back&lt;/em&gt;. Privacy concerns scattered through. &lt;strong&gt;Price was not in the top blockers. Habit and inertia were.&lt;/strong&gt; That should have been louder to me than it was.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/olllo/the-honest-post-mortem/images/olllo-post-mortem-what-stopped-users-survey.png&quot; alt=&quot;Horizontal bar chart of the seven survey blockers tagged HABIT, OTHER, or PRICE. Too much effort to capture things 10 of 14 (HABIT). I don&amp;#x27;t want another tool to maintain 8 of 14 (HABIT). I&amp;#x27;m not sure what to write 7 of 14 (HABIT). I wouldn&amp;#x27;t remember to come back 7 of 14 (HABIT). Privacy concerns 4 of 14 (OTHER). Not sure I&amp;#x27;d trust the AI 3 of 14 (OTHER). Price 2 of 14 (PRICE). Three callout cards underneath: habit blockers are four of the top four (none about cost), price ranked sixth of seven (removing the price gate did not change the shape of the curve), and the pattern reads as vitamin rather than painkiller — people articulated the pain but not the willingness to do the work.&quot;&gt;
The magic-wand answers, paraphrased: &lt;em&gt;“Identify a career path and prepare for success.” “Keep me engaged with itself — too often the things I want to work on peter off.” “Push me to develop the habit.” “Help me organize my energy toward highest impact, for me, not my company.”&lt;/em&gt; Read as a set, those wishes describe a coaching product more than a tracking product. olllo was a tracking product.&lt;/p&gt;
&lt;p&gt;The survey was a clear, useful read. The feature priorities were visible. So was the willingness-to-engage signal — and it was thinner than I let myself see at the time.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id=&quot;the-signals-i-saw-clearly&quot;&gt;The signals I saw clearly&lt;/h3&gt;
&lt;p&gt;The numbers, as of March 2026:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/olllo/the-honest-post-mortem/images/olllo-post-mortem-funnel.png&quot; alt=&quot;Combined-cohort funnel across six steps. Step one waitlist signups: 30 of 30 (top of funnel). Step two survey completions: 14 of 30 (47% kept). Step three account created: 11 of 30 (37% kept). Step four onboarded: 5 of 30 (45% kept). Step five active at week four: 3 of 30 (60% kept). Step six paid conversion: 0 of 30 (0% kept). Three callout cards underneath frame the funnel as top-of-funnel 30 to 14 (strong from a personal network), activation 11 to 5 (half of accounts never reached value), and habit 5 to 3 (seven of ten had stopped logging by week four).&quot;&gt;&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Metric&lt;/th&gt;&lt;th&gt;Value&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Waitlist signups&lt;/td&gt;&lt;td&gt;30&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Survey completions&lt;/td&gt;&lt;td&gt;14 (47%)&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Cohort A — invited (no payment, n=10): created → onboarded → active@1mo&lt;/td&gt;&lt;td&gt;9 / 5 / 3&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Cohort B — invited (Stripe gate, n=20): created → onboarded → active@1mo&lt;/td&gt;&lt;td&gt;2 / 0 / 0&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Paid conversions&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;&lt;img src=&quot;/olllo/the-honest-post-mortem/images/olllo-post-mortem-cohort-a-vs-b.png&quot; alt=&quot;Side-by-side cohort funnels. Cohort A (price gate removed, n=10) on the left: invited 10, account created 9, onboarded 5, active at one month 3, paid 0. Cohort B (Stripe gate, $48/yr beta, n=20) on the right: invited 20, account created 2, onboarded 0, active at one month 0, paid 0. The footer card frames the read: 90% drop at the credit-card field in cohort B, 70% drop by week four in cohort A. Price was the visible failure. Commitment was the deeper one.&quot;&gt;
The split between cohorts was the natural experiment, and the cohort B funnel — eighteen of twenty users walking away at the credit-card field, the two who entered a card never finishing onboarding — is the readable answer to “will people pay for this.” &lt;a href=&quot;/olllo/growth-engineering-experiments&quot;&gt;Growth Engineering&lt;/a&gt; covers the cohort design and instrumentation in detail. The post-mortem-relevant version: even with the price gate removed, half of cohort A onboarded and three of ten were still logging weekly a month later. That number is real product-fit signal in a tiny sample, and not a number any product gets to build a business on.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/olllo/the-honest-post-mortem/images/olllo-post-mortem-beta-offer.png&quot; alt=&quot;Two-column composition. On the left, the Starter beta pricing card: &amp;#x22;Everything you need to track and grow your career.&amp;#x22; $96/year crossed out, $48/year as the beta price. 60-day free trial, 50% off for 12 months, 2 months free with annual plan. Feature checklist: unlimited accomplishments, career goal tracking, weekly AI-powered reflections, STAR story generation, export your data anytime, priority email support. A coral pill underneath reads &amp;#x22;18 / 20 walked away here.&amp;#x22; CTA: Join Waitlist. On the right, three stacked panels: &amp;#x22;What it took to commit&amp;#x22; (60-day free trial before any charge, 50% off retail locked for a year, two extra months free on annual, cancel anytime), &amp;#x22;The reading&amp;#x22; in coral (the price was not the wall, the commitment was), and &amp;#x22;Stripe gate · Result&amp;#x22; (2 cards entered, 0 finished onboarding, 0 paid).&quot;&gt;
&lt;img src=&quot;/olllo/the-honest-post-mortem/images/olllo-post-mortem-retention-by-cohort.png&quot; alt=&quot;Retention curve by weekly signup cohort. Five lines — Week 1 of beta (Dec 16, 20%), Week 2 (Dec 23, 18%), Week 3 (Jan 13, 20%), Week 4 (Jan 27, 20%), Week 5 (Feb 17, 22%) — track active usage across subsequent weeks. All five lines start at 100% at week zero, separate at weeks one and two as onboarding changes shift the early number, then converge into a shaded &amp;#x22;convergence&amp;#x22; band at week three. By week four every cohort lands within a tight range around a 22% floor across all five cohorts.&quot;&gt;
Read without flinching, that funnel says: strong top-of-funnel from a personal network, a hard wall at the credit-card gate, and, when the gate is removed, engagement so thin that seven of the ten people in the free cohort had stopped logging by week four. Zero paid conversions on either path. The price was the visible failure. The habit was the deeper one.&lt;/p&gt;
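&lt;p&gt;The cohort arithmetic can be reproduced in a few lines. This is a minimal sketch that uses only the counts from the metrics table; the dictionary layout and the &lt;code&gt;share_of_invited&lt;/code&gt; helper are illustrative assumptions, not part of the olllo instrumentation.&lt;/p&gt;

```python
# Counts copied from the metrics table; the names here are illustrative only.
COHORTS = {
    "A (no payment)": {"invited": 10, "created": 9, "onboarded": 5, "active_1mo": 3},
    "B (Stripe gate)": {"invited": 20, "created": 2, "onboarded": 0, "active_1mo": 0},
}


def share_of_invited(counts):
    """Return each funnel step as a whole-number percentage of invited users."""
    invited = counts["invited"]
    return {step: round(100 * n / invited) for step, n in counts.items()}


for name, counts in COHORTS.items():
    print(name, share_of_invited(counts))
```

&lt;p&gt;Read against the invited base, cohort B keeps 10% of invitees past the card field (a 90% drop) and cohort A keeps 30% at one month (a 70% drop), which are the two figures the cohort comparison rests on.&lt;/p&gt;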
&lt;p&gt;The qualitative signal was consistent. I captured it in a note to myself: &lt;em&gt;“People identify it as a pain point to keep track of accomplishments and prep for 1:1s and reviews, but they don’t seem willing to invest in it.”&lt;/em&gt; That sentence is the whole post-mortem in miniature. Acknowledgment without investment is the category-definition of a vitamin rather than a painkiller — the workout-app pattern, the weight-loss-app pattern, the nutrition-tracker pattern. People articulate the goal, sign up to solve it, and don’t reliably show up to do the work. I wrote the sentence and kept building.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id=&quot;the-signals-i-explained-away&quot;&gt;The signals I explained away&lt;/h3&gt;
&lt;p&gt;This is the section that matters.&lt;/p&gt;
&lt;p&gt;The metric I kept reassuring myself about was the survey itself. Fourteen people answered fourteen questions with care — these were not the throwaway responses of people who were casually interested. The depth and specificity felt like signal. The mistake was treating well-articulated pain as adjacent to willingness to show up. They are different things, and the survey already told me so. Under &lt;em&gt;what would stop you from using olllo regularly&lt;/em&gt;, the top answers were habit blockers, not pricing blockers: too much effort, another tool to maintain, wouldn’t remember to come back. A tool that adds a daily capture ask is not solving those by being free; it is making them worse.&lt;/p&gt;
&lt;p&gt;The pattern is visible in the git history. Every time the numbers were soft, I added a feature. Voice capture. Multi-agent reflection. Smarter summaries. Referral loops. Each one was defensible in isolation, and each one was a way of not confronting the base rate the survey had already drawn — that the people who articulated this pain most clearly were also articulating, in the same survey, why they would not show up daily to solve it.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/olllo/the-honest-post-mortem/images/olllo-post-mortem-shutdown-gap.png&quot; alt=&quot;Horizontal timeline running December 2025 through March 2026, with a &amp;#x22;Feature ships per week&amp;#x22; bar chart above the axis and five numbered markers below. Marker 1 (W05, COULD HAVE CALLED, in coral): retention signal first readable — cohort A onboarding done, week-two drop visible. Marker 2 (W09, KEPT BUILDING): voice capture shipped. Marker 3 (W12, KEPT BUILDING): multi-agent reflection. Marker 4 (W14, KEPT BUILDING): referral loops. Marker 5 (W17, ACTUALLY CALLED): shutdown, March 2026. A coral-hatched band labeled &amp;#x22;THE GAP · 12 WEEKS&amp;#x22; stretches from marker 1 to marker 5. Three summary cards underneath frame Week 5 (signal was readable here), 12 weeks (voice capture, multi-agent reflection, referral loops — each defensible in isolation), and Week 17 (pulling the plug on a product I still believe in, because the market doesn&amp;#x27;t).&quot;&gt;
There is a version of this product that would have worked. It probably does not include voice capture. It probably includes a conversation I did not have enough of: &lt;em&gt;“Would you pay $X today, before I build a single screen?”&lt;/em&gt; That conversation costs nothing and it would have told me in a few weeks what it took me four months and a cohort experiment to confirm.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id=&quot;what-id-do-differently&quot;&gt;What I’d do differently&lt;/h3&gt;
&lt;p&gt;Test the habit before the tool. A cohort of ten people, with weekly fifteen-minute calls, asked to capture three accomplishments by the end of week one using nothing but a notes app. If the habit doesn’t form for ten people who agreed to a call, it will not form for ten thousand who never will. This is the experiment that costs nothing, takes two weeks, and tells you whether the rest is worth building.&lt;/p&gt;
&lt;p&gt;Pre-sell before build. A landing page, a Stripe checkout, a promise to refund if I don’t ship in 60 days. If 30 people pay, there’s demand. If 3 people pay, I know before writing a line of code. I had all the infrastructure to run this test and ran it too late.&lt;/p&gt;
&lt;p&gt;Narrower ICP. Not “knowledge workers who have performance reviews.” That is almost everyone. Something like “senior engineers at companies with formal promotion packets, actively prepping for a cycle in the next 90 days.” Urgency is the filter that separates painkiller from vitamin, and I spent too long recruiting beta users who had the problem in principle rather than the problem this quarter.&lt;/p&gt;
&lt;p&gt;Import over habit. The daily-capture habit is the product’s largest ask and its largest conversion killer. A version that reads existing Slack, email, and calendar signal and pre-populates the case file would remove the cold start. The habit can come after the value is obvious, not before.&lt;/p&gt;
&lt;p&gt;A willingness-to-pay experiment in month one, not month four. The hardest thing to un-know in solo building is the sunk cost of a working product. Ask the uncomfortable pricing question while the product is still cheap to kill.&lt;/p&gt;
&lt;p&gt;A single activation event to optimize, not a funnel. Not “they used it in week one.” Something sharper, like “they walked into a real 1:1 with notes this tool generated.” Everything else is a proxy.&lt;/p&gt;
&lt;p&gt;Measure with people first, instrumentation second. Fifteen-minute calls with the first ten or twenty users tell you more in a week than a drip system tells you in three months. The growth stack would have surfaced the demand thinness eventually; the calls would have surfaced it in week two. The full version of this lesson is in &lt;a href=&quot;/olllo/growth-engineering-experiments&quot;&gt;Growth Engineering&lt;/a&gt;.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id=&quot;what-id-still-do-the-same&quot;&gt;What I’d still do the same&lt;/h3&gt;
&lt;p&gt;The problem framing. The memory-and-recall reframe holds up. The three video scripts still read as sharp product thinking and I’d use them to open any future pitch in the career-development space.&lt;/p&gt;
&lt;p&gt;Shipping solo with team-grade discipline. Speckit, the constitution file, the merge gates. These kept me out of the classic solo trap of endless scope creep disguised as progress. Detailed in &lt;a href=&quot;/olllo/culture-as-code&quot;&gt;Culture as Code&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Tiered model strategy with latency-first selection. Haiku for fast classification, Sonnet for interactive reasoning, no single-tier hammer. The UX stayed snappy across thirty-eight features (and the monthly AI bill stayed within reason as a side effect, but that was not what picked the model). Detailed in &lt;a href=&quot;/olllo/ai-product-craft&quot;&gt;AI Product Craft&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Betting on the shutdown. The decision to stop is the judgment call I’m most confident in. Pulling the plug on a product I still believe in, because the market doesn’t, is the skill I most want this portfolio to demonstrate.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id=&quot;what-im-taking-with-me&quot;&gt;What I’m taking with me&lt;/h3&gt;
&lt;p&gt;A codebase whose author I’d hire. A set of product decisions I can defend in detail. A sharper read on the gap between “interesting problem” and “viable product.”&lt;/p&gt;
&lt;p&gt;If you’re reading this as a hiring signal: the thing I want you to notice is not that olllo shipped. It’s that it stopped.&lt;/p&gt;</content:encoded></item><item><title>Culture as Code · Speckit + Enforcement</title><link>https://hypoth.ai/olllo/culture-as-code</link><guid isPermaLink="true">https://hypoth.ai/olllo/culture-as-code</guid><description>I built the guardrails a team of 10 would need, for a team of 1, because shipping solo is when discipline compounds fastest.</description><pubDate>Wed, 15 Apr 2026 00:00:00 GMT</pubDate><content:encoded>&lt;h2 id=&quot;culture-as-code-speckit-and-enforcement&quot;&gt;Culture as Code: Speckit and Enforcement&lt;/h2&gt;
&lt;p&gt;How I built the guardrails a team of ten would need, for a team of one, and why it turned out to be the highest-leverage decision across four months of solo shipping.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/olllo/culture-as-code/images/olllo-culture-as-code-declare-and-enforce.png&quot; alt=&quot;Two-panel diagram. Top panel labelled .specify/memory/constitution.md (v1.7.0): &amp;#x22;The Constitution — seven non-negotiable principles&amp;#x22; as a grid of seven cards numbered 01 through 07 — Testing Strategy (1.0.0), Internationalization (1.1.0), Component Organization (1.2.0), Test Planning (1.3.0), Database Migrations (1.4.0), Data Portability &amp;#x26; Deletion (1.5.0, coral-accented), Changelog Maintenance (1.6.0). Bottom panel labelled speckit · six-stage flow: &amp;#x22;The Workflow — six stages, in order&amp;#x22; as a 2×3 grid of stage cards — specify (feature → spec.md), clarify (3–5 questions → spec.md), plan (→ plan.md), tasks (→ tasks.md), analyze (consistency check), implement (→ code · tests local).&quot;&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id=&quot;opening&quot;&gt;Opening&lt;/h3&gt;
&lt;p&gt;The startup playbook is older than software. Move fast in the early days. Ship scrappy. Fix it later. AI hasn’t changed that advice; what AI has changed is that scrappy is no longer the price of fast.&lt;/p&gt;
&lt;p&gt;You can now ship the codebase a senior systems architect would be proud of in roughly the time it used to take to ship the duct-taped MVP. The catch is that AI does not give you the discipline to do it. It only removes the excuse not to. You still have to know what great looks like, and you have to refuse to skip the parts you cannot see yet.&lt;/p&gt;
&lt;p&gt;This is the case study for what that refusal looks like in practice. Four months of solo shipping, 38 numbered features, one constitution that grew the way every honest policy document grows: each principle added the day after I learned why it should have existed.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id=&quot;the-problem&quot;&gt;The problem&lt;/h3&gt;
&lt;p&gt;Solo shipping breaks in predictable ways, and AI-assisted coding multiplies all of them.&lt;/p&gt;
&lt;p&gt;Past decisions go undocumented, so future-me cannot tell why the code is the way it is. Features drift: a spec starts as one thing and becomes another with no trace of the handoff. Every fix risks breaking something the original author already decided, and the original author is also me. AI assistants cheerfully rewrite a module in a style that contradicts three others it touches, and without rules of the road, the assistant is right that the rules do not exist.&lt;/p&gt;
&lt;p&gt;Underneath all of this is the empty-review problem. Code review is where most engineering cultures enforce taste. Solo, there is no reviewer.&lt;/p&gt;
&lt;p&gt;The team-of-ten answer is process: specs, ADRs, code standards, test gates, doc requirements, PR templates. The bet I made four months ago was that those artifacts are worth more to a solo dev than to a team, because a solo dev is the one who most needs to trust their own past self.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id=&quot;the-workflow&quot;&gt;The workflow&lt;/h3&gt;
&lt;p&gt;The workflow is built on &lt;a href=&quot;https://github.com/github/spec-kit&quot;&gt;Speckit&lt;/a&gt;, an open-source templating layer for AI-assisted feature specification. Speckit gives you the six stages and the templates; the discipline is in running every feature through every stage, every time.&lt;/p&gt;
&lt;p&gt;Every feature in olllo, without exception, passes through these six stages before a line of implementation code is written.&lt;/p&gt;
&lt;p&gt;The six stages:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;specify&lt;/code&gt; generates a &lt;code&gt;spec.md&lt;/code&gt; from a plain-English feature description&lt;/li&gt;
&lt;li&gt;&lt;code&gt;clarify&lt;/code&gt; surfaces 3–5 targeted questions about ambiguous requirements and encodes the answers back into the spec&lt;/li&gt;
&lt;li&gt;&lt;code&gt;plan&lt;/code&gt; produces the technical design: data model, integration points, dependencies&lt;/li&gt;
&lt;li&gt;&lt;code&gt;tasks&lt;/code&gt; produces a dependency-ordered implementation list, each task with acceptance criteria and test-coverage requirements&lt;/li&gt;
&lt;li&gt;&lt;code&gt;analyze&lt;/code&gt; runs cross-artifact consistency across spec, plan, and tasks, catching contradictions before code&lt;/li&gt;
&lt;li&gt;&lt;code&gt;implement&lt;/code&gt; generates code task-by-task with local test verification at each step&lt;/li&gt;
&lt;/ul&gt;
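&lt;p&gt;The “every stage, every time” rule can be expressed as a small ordering check. This is a sketch under assumptions: the stage names come from the list above, but &lt;code&gt;next_stage&lt;/code&gt; is a hypothetical helper and not something Speckit provides.&lt;/p&gt;

```python
# Stage names come from the list above; the gate itself is a hypothetical
# helper, not part of Speckit.
STAGES = ["specify", "clarify", "plan", "tasks", "analyze", "implement"]


def next_stage(completed):
    """Return the next stage to run, or None when all six are done.

    Raises ValueError if the completed stages skip ahead or run out of order.
    """
    expected = STAGES[: len(completed)]
    if list(completed) != expected:
        raise ValueError(f"stages out of order: {completed!r}, expected {expected!r}")
    if len(completed) == len(STAGES):
        return None
    return STAGES[len(completed)]
```

&lt;p&gt;For example, &lt;code&gt;next_stage([&quot;specify&quot;, &quot;clarify&quot;])&lt;/code&gt; returns &lt;code&gt;&quot;plan&quot;&lt;/code&gt;, while a history that jumps from specify straight to plan raises an error instead of quietly letting the clarify step go missing.&lt;/p&gt;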
&lt;p&gt;&lt;img src=&quot;/olllo/culture-as-code/images/olllo-culture-as-code-speckit-workflow.png&quot; alt=&quot;Six stage cards in a 2×3 grid, each labelled with input/output and a load tag. Stage 01 specify, LOW LOAD, plain-English description → spec.md. Stage 02 clarify, HIGH LOAD (coral), 3–5 ambiguity questions → spec.md +Q/A. Stage 03 plan, HIGH LOAD (coral), data model · integrations → plan.md. Stage 04 tasks, MED LOAD, dependency-ordered list → tasks.md. Stage 05 analyze, MED LOAD, consistency sweep → cross-artifact log. Stage 06 implement, LOW LOAD, task-by-task execution → code · tests. Underneath, a &amp;#x22;Judgement load · where the decisions happen&amp;#x22; line chart inverse to typing — load rises at clarify and plan, drops through implement.&quot;&gt;
A real example. Feature &lt;code&gt;036-knowledge-user-context&lt;/code&gt; replaced three existing onboarding cards (work context, preferences, notifications) with a single conversational AI flow. The &lt;code&gt;spec.md&lt;/code&gt; has a literal “Clarifications” section with five Q/A pairs from the clarify stage. Each one is a decision that would have shipped as an unstated assumption without the process:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Q:&lt;/strong&gt; Should Knowledge capture reporting structure (direct reports count, manager relationship, 1:1 cadence)?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;A:&lt;/strong&gt; Yes, full: direct reports count, manager role/title (not name), 1:1 cadence. People names save to Contacts with encrypted realName, accessible via @-mention. User is informed of encryption and contact creation.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That single clarification triggered an encryption decision, a Contacts integration, and a new user-facing string. Without the clarify step, at least one of those three would have been missed or decided silently in implementation.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/olllo/culture-as-code/images/olllo-culture-as-code-speckit-clarify.png&quot; alt=&quot;A rendered view of specs/036-knowledge-user-context/spec.md § Clarifications. Title: &amp;#x22;Five ambiguities surfaced and resolved, before a task file existed.&amp;#x22; Five Q/A blocks: Q01 should Knowledge capture reporting structure (direct reports count, manager relationship, 1:1 cadence)? — yes, full; encryption decision, contacts integration, and a new user-facing string tagged underneath. Q02 should the conversational flow be skippable, or required to complete onboarding? — skippable, three prompts before dismiss; onboarding-state model and settings entry point tagged. Q03 what does Knowledge do with the three onboarding cards being replaced? — hard-remove, migrate values, telemetry event; migration script and telemetry tagged. Q04 locale: how should Knowledge handle non-English contact names with diacritics? — store NFC-normalized, accent-insensitive search, display preserves diacritics; search index spec tagged. Q05 what is the retention policy for raw Knowledge transcripts vs. structured fields? — transcripts purged after 30 days, structured fields persist; data-portability hook and privacy copy tagged.&quot;&gt;
By the time implementation starts, the ambiguity budget is spent and implementation is execution rather than exploration.&lt;/p&gt;
&lt;p&gt;The implement stage is the loudest in any AI-assisted workflow. It’s where the assistant does most of the visible typing. But on every feature in olllo, my time was spent in the earlier stages: reading research, refining the spec, choosing the harder long-term path over the easier short-term one the research suggested. The clarify and plan stages exist to make those judgment calls visible. The implement stage exists to make sure the visible decisions make it into the code. AI did most of the typing. None of the deciding.&lt;/p&gt;
&lt;p&gt;The spec directory for &lt;code&gt;036&lt;/code&gt; ends up with nine artifacts:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/olllo/culture-as-code/images/olllo-culture-as-code-speckit-artifacts.png&quot; alt=&quot;Annotated file tree of specs/036-knowledge-user-context/. Three slots tagged ALWAYS in coral: spec.md (what we&amp;#x27;re building and why), plan.md (how we&amp;#x27;re building it), tasks.md (dependency-ordered steps), quickstart.md (how to run the feature locally). Two slots tagged OFTEN: research.md (external references consulted), decisions.md (non-obvious choices). Three slots tagged AS-NEEDED: data-model.md (schema changes), contracts/ (API contracts), checklists/ (feature-specific verification).&quot;&gt;
Every feature gets the same nine slots. Not all slots are always full; some are a single sentence. But the shape is the same, which is what makes two-month-old features readable.&lt;/p&gt;
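&lt;p&gt;A fixed shape is easy to check mechanically. The sketch below assumes that &lt;code&gt;spec.md&lt;/code&gt;, &lt;code&gt;plan.md&lt;/code&gt;, and &lt;code&gt;tasks.md&lt;/code&gt; are the slots that must always exist, and it leaves the other slots unchecked; &lt;code&gt;missing_slots&lt;/code&gt; is a hypothetical helper, not a Speckit command.&lt;/p&gt;

```python
from pathlib import Path

# Assumption: these three files are the always-required slots. The remaining
# slots (research, decisions, data model, contracts, checklists, quickstart)
# are not checked here.
REQUIRED = ["spec.md", "plan.md", "tasks.md"]


def missing_slots(feature_dir):
    """Return the required slot names that do not exist under feature_dir."""
    root = Path(feature_dir)
    return [name for name in REQUIRED if not (root / name).exists()]
```

&lt;p&gt;Pointed at a directory such as &lt;code&gt;specs/036-knowledge-user-context&lt;/code&gt;, an empty list means the required shape is intact; anything else names the files to write before implementation starts.&lt;/p&gt;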
&lt;hr&gt;
&lt;h3 id=&quot;the-constitution&quot;&gt;The constitution&lt;/h3&gt;
&lt;p&gt;The constitution is a single file, versioned like code: &lt;code&gt;.specify/memory/constitution.md&lt;/code&gt;, currently at version 1.7.0, last amended 2026-03-23. It declares seven non-negotiable principles.&lt;/p&gt;
&lt;p&gt;Each principle carries a declaration, a rule, and a rationale. Principle 5 exists because I lost data once. Principle 6 exists because GDPR Article 17 does. Principle 7 exists because I shipped changelog entries inconsistently until I made them mandatory.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/olllo/culture-as-code/images/olllo-culture-as-code-constitution-principles.png&quot; alt=&quot;Numbered list of seven non-negotiable constitution principles, each card showing rule, version, date, and the scar it came from. 01 Testing Strategy (1.0.0 · 2025-11-09): hybrid query approach, semantic queries for stable UI and data-testid for dynamic / i18n-sensitive content — scar: selectors that worked in en-US broke under Spanish copy. 02 Internationalization (1.1.0 · 2025-11-17): no hardcoded user-facing strings; everything routes through next-intl — scar: a toast string smuggled in as a literal during a hotfix. 03 Component Organization (1.2.0 · 2025-12-09): four-layer hierarchy primitives → shared → shell → feature-route _components — scar: two copies of the same composite drifted apart for three weeks. 04 Test Planning (1.3.0 · 2025-12-13): test tasks generated alongside implementation tasks; tests pass locally before PR — scar: coverage drifted when tests were always &amp;#x22;next ticket&amp;#x22;. 05 Database Migrations (1.4.0 · 2026-01-13): dotenv-wrapped commands only; production-only assumptions are explicit — scar: a migration run against the wrong env. 06 Data Portability &amp;#x26; Deletion (1.5.0 · 2026-01-13, highlighted coral on deep-teal): every new user-data model updates export + deletion services with audit logging — scar: GDPR Article 17 exists; quarterly review models almost didn&amp;#x27;t make it in. 07 Changelog Maintenance (1.6.0 · 2026-02-05): user-facing changes require an entry, internal changes don&amp;#x27;t, ambiguity resolved at clarify — scar: three features shipped with no changelog entry, no one to flag it.&quot;&gt;
&lt;/p&gt;
&lt;p&gt;The file is amendment-tracked with semantic versioning. Every amendment is a scar.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/olllo/culture-as-code/images/olllo-culture-as-code-constitution-ammendments.png&quot; alt=&quot;Vertical timeline of constitution amendments, each entry showing version, amendment, and date with a one-line italic note on the trigger. 1.0.0 (2025-11-09) Initial constitution — founding principle. 1.1.0 (2025-11-17) Added Internationalization — hardcoded strings discovered in the wild. 1.2.0 (2025-12-09) Added Component Organization — drift between near-duplicate components. 1.3.0 (2025-12-13) Added Test Planning — tests shipped as follow-up, coverage drifted. 1.3.1 (2026-01-02) Added Workaround Review governance — two workarounds outlived their reason. 1.4.0 (2026-01-13) Added Database Migrations — felt the absence in a production cutover. 1.5.0 (2026-01-13, amber-accented) Added Data Portability &amp;#x26; Deletion — GDPR/CCPA work surfaced the gap. 1.6.0 (2026-02-05) Added Changelog Maintenance — shipped without entries, made it mandatory. 1.7.0 (2026-03-23, coral-accented) Added Local Test Verification — a CI failure that would have taken 15s to catch locally.&quot;&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id=&quot;the-gates&quot;&gt;The gates&lt;/h3&gt;
&lt;p&gt;Principles are nothing without enforcement. The gates are where the discipline compounds.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/olllo/culture-as-code/images/olllo-culture-as-code-gates.png&quot; alt=&quot;A &amp;#x22;code change → production&amp;#x22; track with six tick marks across the top, and a 2×3 grid of gate cards below. Gate 01 Pre-merge · CI (.github/workflows/ci.yml): unit tests against the database package, the app, and shared packages — any failing test, no &amp;#x22;I ran it locally&amp;#x22; exception. Gate 02 Pre-PR · workflow (speckit · implement stage): affected tests must run and pass locally before a PR is created — enforced by template and muscle memory. Gate 03 Documentation · artifacts (specs/NNN-*/): spec.md, plan.md, tasks.md exist; tasks cite principles by number — missing changelog entries surface at the analyze stage. Gate 04 Security · four workflows (sast / dast / secrets / dep-audit): SAST, DAST, secret scanning, dependency auditing on every push — a hardcoded API key wouldn&amp;#x27;t land even if I missed it in self-review. Gate 05 AI behavior · evals (.github/workflows/ai-eval.yml): model-output tests in CI, prompt changes gated on measurable behavior — catches prompt changes that drift downstream output. Gate 06 Living context · CLAUDE.md (repo root): assistant reads on every session, conventions update in the same PR — suggestions stay aligned instead of drifting toward training-data defaults.&quot;&gt;&lt;/p&gt;
&lt;h4 id=&quot;pre-merge-via-ci&quot;&gt;Pre-merge, via CI&lt;/h4&gt;
&lt;p&gt;Every PR runs unit tests against the database package, against the app, and against shared packages, via &lt;code&gt;.github/workflows/ci.yml&lt;/code&gt;. The build fails if any test fails. No “I ran it locally” exception.&lt;/p&gt;
&lt;h4 id=&quot;pre-pr-via-the-workflow-itself&quot;&gt;Pre-PR, via the workflow itself&lt;/h4&gt;
&lt;p&gt;The speckit &lt;code&gt;implement&lt;/code&gt; step refuses to create a PR if affected tests have not been run and passed locally. That rule lives in the constitution (Principle 4) and in the speckit templates, so it’s enforced both in the assistant’s behavior and in my own muscle memory.&lt;/p&gt;
&lt;h4 id=&quot;documentation-via-required-artifacts&quot;&gt;Documentation, via required artifacts&lt;/h4&gt;
&lt;p&gt;A feature is not complete until &lt;code&gt;spec.md&lt;/code&gt;, &lt;code&gt;plan.md&lt;/code&gt;, and &lt;code&gt;tasks.md&lt;/code&gt; exist. Implementation tasks reference constitution principles by number. When a new user-facing feature ships without a changelog entry, the &lt;code&gt;analyze&lt;/code&gt; stage flags it.&lt;/p&gt;
&lt;h4 id=&quot;security-via-four-dedicated-workflows&quot;&gt;Security, via four dedicated workflows&lt;/h4&gt;
&lt;p&gt;SAST, DAST, secret scanning, and dependency auditing run on every push. The existence of &lt;code&gt;security-secrets.yml&lt;/code&gt; alone means a hardcoded API key won’t land even if I miss it in self-review.&lt;/p&gt;
&lt;h4 id=&quot;ai-behavior-via-evals&quot;&gt;AI behavior, via evals&lt;/h4&gt;
&lt;p&gt;&lt;code&gt;ai-eval.yml&lt;/code&gt; runs model-output tests in CI. Changing a prompt without verifying downstream output would be trivial to do by accident without this gate; with it, prompt changes are gated on measurable behavior.&lt;/p&gt;
&lt;h4 id=&quot;living-context-via-claudemd&quot;&gt;Living context, via CLAUDE.md&lt;/h4&gt;
&lt;p&gt;The assistant reads &lt;code&gt;CLAUDE.md&lt;/code&gt; on every session. When conventions evolve, the file is updated in the same PR as the convention change. The assistant’s suggestions stay aligned with current standards instead of drifting toward training-data defaults.&lt;/p&gt;
&lt;h4 id=&quot;what-the-gates-didnt-catch&quot;&gt;What the gates didn’t catch&lt;/h4&gt;
&lt;p&gt;Most case studies on engineering process tell the story of the time the gate caught a bug. This one tells a different story: the time the gate didn’t, and what came of it.&lt;/p&gt;
&lt;p&gt;On 2026-01-13, I added Principle 6 to the constitution: every new user-data model must update the export and deletion services. The principle was added during the GDPR/CCPA work, when I had just built the services and discovered that “every model” was a longer list than I’d assumed.&lt;/p&gt;
&lt;p&gt;Eighteen days later, on 2026-01-31, I pushed &lt;code&gt;ea14b27&lt;/code&gt; directly to master: &lt;code&gt;fix(database): add quarterly review models to user data services&lt;/code&gt;. A feature I’d shipped between those two dates had added new user-data models without updating the services. The principle existed. The enforcement was still me. I missed it during the feature, caught it in self-review later, and patched it with a direct push.&lt;/p&gt;
&lt;p&gt;That is a near-miss, and it is the kind of near-miss that the rest of the constitution exists to make rarer. Read the version history again with this in mind:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;1.3.0 (2025-12-13) added Test Planning after I shipped tests as follow-up work and watched coverage drift&lt;/li&gt;
&lt;li&gt;1.4.0 and 1.5.0 (2026-01-13) added Database Migrations and Data Portability the day I felt the absence of both&lt;/li&gt;
&lt;li&gt;1.6.0 (2026-02-05) added Changelog Maintenance after I shipped features without changelog entries&lt;/li&gt;
&lt;li&gt;1.7.0 (2026-03-23) added Local Test Verification during feature 038, after a CI failure that would have taken fifteen seconds to catch locally&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Every amendment is a postmortem encoded as a rule. Read top to bottom, the version history is a curriculum: every lesson I’d want to teach the next person on day one, in the order I learned it the hard way.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/olllo/culture-as-code/images/olllo-culture-as-code-near-miss.png&quot; alt=&quot;Two stacked panels showing a near-miss. Top left (&amp;#x22;The Principle&amp;#x22;): Principle 06 · 2026-01-13 — every new user-data model must update the export and deletion services. Top right (&amp;#x22;The Patch&amp;#x22;, deep-teal): ea14b27 → master · 2026-01-31 — fix(database): add quarterly review models to user data services. Below, an &amp;#x22;18 days · principle existed · enforcement was still me&amp;#x22; timeline: Constitution v1.5.0 (2026-01-13) added Principle 6; Feature shipped (2026-01-18) added quarterly review models, services not updated; Subsequent merges (2026-01-25) codebase continues to grow on top of the gap; ea14b27 → master (2026-01-31, coral-accented) direct push, fix(database): add quarterly review models to user data services.&quot;&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id=&quot;the-outcome&quot;&gt;The outcome&lt;/h3&gt;
&lt;p&gt;38 numbered feature branches shipped between December 2025 and March 2026. Branches &lt;code&gt;001&lt;/code&gt; through &lt;code&gt;040&lt;/code&gt;, with two gaps where features were consolidated. Every one followed the same six-stage workflow. Every one produced the same nine artifacts in its spec directory. Every one was gated on tests and documentation before merge.&lt;/p&gt;
&lt;p&gt;The measurable consistency: a feature from two months ago is readable in minutes, not hours. The spec tells me what I decided, the clarifications tell me why, the tasks tell me what was built, the constitution tells me what rules applied. There is no archaeology. There is only reading.&lt;/p&gt;
&lt;p&gt;The trade was never about speed; it was about debt. Every startup I have watched at scale has hit the same wall: things start breaking at year two because the early observability was thin, the early tests were inadequate, the early decisions were unwritten. Retrofitting those things at scale costs more than building them at month one would have. Speckit and the constitution were the bet that I could pay that cost up front, every feature, and ship a codebase without a debt cliff to climb later.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id=&quot;what-id-port-to-a-team&quot;&gt;What I’d port to a team&lt;/h3&gt;
&lt;p&gt;Universal, ship day one:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The constitution as code, versioned with amendments&lt;/li&gt;
&lt;li&gt;The speckit &lt;code&gt;clarify&lt;/code&gt; stage as a required step for any feature spec&lt;/li&gt;
&lt;li&gt;Test tasks in the same task file as implementation tasks, not in a follow-up&lt;/li&gt;
&lt;li&gt;CI enforcement of every non-negotiable rule, not convention&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Solo-only, probably cut on a team:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The full nine-artifact spec directory. A team version would consolidate into three: &lt;code&gt;spec.md&lt;/code&gt;, &lt;code&gt;plan-plus-tasks.md&lt;/code&gt;, &lt;code&gt;decisions.md&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;implement&lt;/code&gt; stage’s requirement that the assistant runs tests locally. On a team, CI is the gate.&lt;/li&gt;
&lt;li&gt;Principle 7’s “when in doubt, ask” clarify step for changelog entries. A team would codify this in the PR template and skip the clarify roundtrip.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The biggest constraint, and the part I would reframe before bringing this approach to a team, is that speckit was built for one author. The artifacts it produces (&lt;code&gt;spec.md&lt;/code&gt;, &lt;code&gt;plan.md&lt;/code&gt;, &lt;code&gt;tasks.md&lt;/code&gt;, &lt;code&gt;decisions.md&lt;/code&gt;) are excellent for me-to-future-me communication. They are hard to share for in-progress feedback. There is no good way for a teammate to comment on a plan that is still being written, no way for a designer to weigh in on a clarification before it is resolved, no way to fork a plan into alternatives and pick between them. A team version of this approach needs a collaborative layer: a shared workspace for in-flight specs, async comment threads on clarifications, plan branching and review. That is the next thing I would build if I took this to a team.&lt;/p&gt;
&lt;p&gt;Secondary gap: a lightweight ADR workflow outside of features. Cross-cutting decisions (“switch from Postgres full-text search to pgvector”) currently live in whatever feature spec happens to touch them, which is the wrong home. The &lt;code&gt;specs/adr/&lt;/code&gt; directory exists but is underused.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/olllo/culture-as-code/images/olllo-culture-as-code-closing.png&quot; alt=&quot;Two stacked panels splitting the practices into &amp;#x22;Universal · ship day one&amp;#x22; (4 practices, deep-teal accent) and &amp;#x22;Solo-only · cut on a team&amp;#x22; (3 practices, coral accent). Universal: 01 Constitution as code — versioned with amendments, every team needs this; 02 Clarify as a required step — any feature spec runs through 3–5 ambiguity questions; 03 Test tasks alongside impl — same task file, not a follow-up ticket; 04 CI enforcement of every rule — non-negotiables live in CI, not in convention. Solo-only: nine-artifact spec directory (→ 3 files) — team version consolidates to spec.md, plan-plus-tasks.md, decisions.md; implement-stage local tests (→ CI only) — on a team, CI is the gate, the local pre-PR check is a solo crutch; clarify-for-changelog (→ PR template) — on a team, codify in the PR template, skip the clarify roundtrip. Footer card: &amp;#x22;The gap to close first&amp;#x22; — speckit was built for one author; a team needs a collaborative layer: shared in-flight specs, comments on clarifications, plan branching.&quot;&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id=&quot;why-the-discipline-compounds&quot;&gt;Why the discipline compounds&lt;/h3&gt;
&lt;p&gt;Process discipline reads as overhead until the moment it isn’t, and the moment it isn’t is usually six weeks later, when a decision you made comes back to haunt you. Solo devs do not have a senior engineer down the hall to ask. They have their past self, who will either have left notes or not.&lt;/p&gt;</content:encoded></item><item><title>AI Product Craft</title><link>https://hypoth.ai/olllo/ai-product-craft</link><guid isPermaLink="true">https://hypoth.ai/olllo/ai-product-craft</guid><description>AI features only feel good when the model, the latency, the reliability, and the UX are designed together, not layered.</description><pubDate>Wed, 15 Apr 2026 00:00:00 GMT</pubDate><content:encoded>&lt;h2 id=&quot;ai-product-craft-when-the-user-is-in-the-moment&quot;&gt;AI Product Craft: When the User Is in the Moment&lt;/h2&gt;
&lt;p&gt;How models got matched to tasks, agents got scoped to outcomes, and structure beat prompting at every step. A walk through the four-tier model config, a multi-agent spec that replaced a broken one, and an eval harness that runs every night.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id=&quot;opening&quot;&gt;Opening&lt;/h3&gt;
&lt;p&gt;The user is in the moment. They are typing a sentence about a meeting that just ended, rating the week they just had, or asking the assistant to help them prep for a 1:1 in fifteen minutes. Every AI call had to fit inside that moment.&lt;/p&gt;
&lt;p&gt;Three things picked the model that handled it: speed, correctness for the task, and cost. The first two are why this case study leads with latency budgets and tier-to-task fit. The third was a real factor in every decision, never the lead one. Cost is the reason the system does not run Opus on everything; it is not the reason any specific model got picked.&lt;/p&gt;
&lt;p&gt;Pick the wrong tier and the user feels the lag. Pick the wrong agent boundary and the assistant talks past them. Pick the wrong scope and the assistant talks about itself instead of about them. This is the case study for how I picked, three decisions deep: the tier, the agent, and the line between what the model gets to decide and what the code does.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id=&quot;the-constraint-that-picked-the-model&quot;&gt;The constraint that picked the model&lt;/h3&gt;
&lt;p&gt;Model tier configuration lives in &lt;code&gt;packages/ai/lib/config/model-tiers.ts&lt;/code&gt;. Four tiers, each with a primary model, a fallback, a timeout, a retry count, a temperature, a max-token cap, and a budget cap.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;plaintext&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span&gt;micro     Gemini 2.0 Flash Lite   Haiku fallback     3s   t=0.1   512 tok    $0.005&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;fast      Claude Haiku 3.5        Gemini Flash Lite  5s   t=0.2   1024 tok   $0.01&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;balanced  Claude Sonnet 4.5       Gemini 2.0 Flash   15s  t=0.5   4096 tok   $0.10&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;advanced  Claude Sonnet 4.5       Claude Opus 4.5    30s  t=0.7   8192 tok   $0.50&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Read it left to right and the design intent is plain. Each tier is organized around a latency budget (3, 5, 15, 30 seconds across the four) because that is what the user feels first. Temperature climbs as the task gets more generative. Token caps grow with the task’s natural length. Cost grows with model size and with the chosen primary; the per-request budget caps exist so a runaway loop cannot bankrupt the system, not as the lever that decides which model runs.&lt;/p&gt;
&lt;p&gt;PII detection runs on the micro tier because the user is mid-sentence and the model cannot afford to think for two seconds. Refinement chat sits at balanced because the assistant is in dialogue, the user is reading the words as they stream, and a half-second to first chunk is good enough. Reflection conversation runs at advanced because it is the deepest, most generative use of the system, and by the time it starts the user has already committed to the moment.&lt;/p&gt;
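&lt;p&gt;As a rough sketch, the tier object and the task-to-tier assignment could look something like this in TypeScript. The field names, the task names, and the &lt;code&gt;tierFor&lt;/code&gt; helper are illustrative assumptions, not the contents of &lt;code&gt;model-tiers.ts&lt;/code&gt;; the values are copied from the table, and retry counts are left out because the table does not list them.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;typescript&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span&gt;// Illustrative shape: field names are assumptions, values come from the tier table.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;type TierName = &quot;micro&quot; | &quot;fast&quot; | &quot;balanced&quot; | &quot;advanced&quot;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;interface ModelTier {&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  primary: string;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  fallback: string;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  timeoutMs: number;   // latency budget for the whole request&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  temperature: number;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  maxTokens: number;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  budgetUsd: number;   // per-request spend cap&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;const MODEL_TIERS: { [K in TierName]: ModelTier } = {&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  micro:    { primary: &quot;Gemini 2.0 Flash Lite&quot;, fallback: &quot;Claude Haiku 3.5&quot;, timeoutMs: 3000, temperature: 0.1, maxTokens: 512, budgetUsd: 0.005 },&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  fast:     { primary: &quot;Claude Haiku 3.5&quot;, fallback: &quot;Gemini Flash Lite&quot;, timeoutMs: 5000, temperature: 0.2, maxTokens: 1024, budgetUsd: 0.01 },&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  balanced: { primary: &quot;Claude Sonnet 4.5&quot;, fallback: &quot;Gemini 2.0 Flash&quot;, timeoutMs: 15000, temperature: 0.5, maxTokens: 4096, budgetUsd: 0.1 },&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  advanced: { primary: &quot;Claude Sonnet 4.5&quot;, fallback: &quot;Claude Opus 4.5&quot;, timeoutMs: 30000, temperature: 0.7, maxTokens: 8192, budgetUsd: 0.5 },&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;};&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;// Hypothetical task names: a feature references a tier, never a model.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;const TASK_TIER = {&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  piiDetection: &quot;micro&quot;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  refinementChat: &quot;balanced&quot;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  reflectionConversation: &quot;advanced&quot;,&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;} as const;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;function tierFor(task: keyof typeof TASK_TIER): ModelTier {&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  return MODEL_TIERS[TASK_TIER[task]];&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The point of a shape like this is that a call site asks for a task and gets back a whole policy; it never names a model or a timeout directly.&lt;/p&gt;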
&lt;p&gt;Every tier has a fallback model. The micro, fast, and balanced tiers fall back to a different provider than their primary; the advanced tier falls back from Sonnet to Opus. Every request routes through Vercel AI Gateway, which is &lt;code&gt;enabled: true&lt;/code&gt; everywhere with no opt-out. The gateway gives unified logging, request-level cost tracking, and rate limiting. That is the observability layer that makes everything else in this case study possible. If you cannot see what your AI is doing, you cannot improve it.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/olllo/ai-product-craft/images/olllo-ai-tiers.png&quot; alt=&quot;A latency-budget timeline at the top runs from 0s (the moment) to 30s (committed), with marks at 3s, 5s, 15s, and 30s. Below it, four tiers stacked: Tier 01 micro (3s budget) — Gemini 2.0 Flash Lite primary, Claude Haiku 3.5 fallback, t=0.1, 512 tokens, $0.005 budget, 2× retry, owns PII detection while user is mid-sentence. Tier 02 fast (5s) — Claude Haiku 3.5 primary, Gemini Flash Lite fallback, t=0.2, 1,024 tokens, $0.010, 2× retry, owns inline assists (tag, classify, route). Tier 03 balanced (15s) — Claude Sonnet 4.5 primary, Claude Haiku 3.5 fallback, t=0.5, 4,096 tokens, $0.100, 2× retry, owns refinement chat (streaming dialogue). Tier 04 advanced (30s) — Claude Sonnet 4.5 primary, Claude Opus 4.5 fallback, t=0.7, 8,192 tokens, $0.500, 3× retry, owns reflection conversation (deepest, most generative). Three columns at the bottom: Speed (the latency budget is the lead), Correctness (temperature climbs as the task gets more generative; token caps grow with natural length), Cost (per-request budget caps so a runaway loop can&amp;#x27;t bankrupt the system; never the lever that picks the model).&quot;&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id=&quot;where-single-agent-got-stuck&quot;&gt;Where single-agent got stuck&lt;/h3&gt;
&lt;p&gt;For the first version of weekly reflection, I built what every AI product builds: a single agent with a long, well-prompted system message, a list of themes to cover, and a hard cap on total questions. The prompt told the agent how to move through the themes, what each one should cover, when to summarize, when to wrap up. It did not work.&lt;/p&gt;
&lt;p&gt;The diagnosis is in &lt;code&gt;specs/012-reflection-multi-agent/spec.md&lt;/code&gt;. The production data captured there:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;Goals theme received 14 questions (target: 2)&lt;/li&gt;
&lt;li&gt;Well-being received 4 questions (target: 2)&lt;/li&gt;
&lt;li&gt;Engagement received 3 questions (target: 2)&lt;/li&gt;
&lt;li&gt;The agent sometimes declared completion while themes remained under-covered&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;The root cause, written in my own words in that same spec, is the entire lesson of this section:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;A single LLM cannot reliably track multi-theme state and enforce hard limits through prompting alone. The agent “forgets” constraints as the conversation progresses.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Prompts are wishes. The longer the conversation, the more the wish decays. A reflection that should have run twelve questions across six themes in three minutes was running twenty-one questions, fourteen of them about goals, and abandoning users halfway through.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/olllo/ai-product-craft/images/olllo-ai-single-agent-failure.png&quot; alt=&quot;Theme-by-theme question count from a single reflection: Goals received 14 questions against a target of 2 (BLEW PAST), Well-being received 4 (EXCEEDED), Engagement received 3 (EXCEEDED), Career, Performance, and Accomplishments each received 0 (SKIPPED). Three summary boxes underneath: TARGET 12 questions (2 each across six themes), ACTUAL 21 questions (three themes only), HARD CAP 15 total (the agent ignored it). Root cause from spec 012: &amp;#x22;A single LLM cannot reliably track multi-theme state and enforce hard limits through prompting alone. The agent forgets constraints as the conversation progresses.&amp;#x22;&quot;&gt;
&lt;/p&gt;
&lt;p&gt;The mistake I see most often in AI product work is to respond to this kind of failure with better prompts. Longer prompts. Cleverer prompts. The mistake is mine too; the early iterations of the reflection prompt are still in git history, each one longer than the last, trying to encode the constraints more emphatically. None of them moved the needle past a certain point, because the model was never the right place to encode hard constraints.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id=&quot;the-orchestrator-specialist-pattern&quot;&gt;The orchestrator-specialist pattern&lt;/h3&gt;
&lt;p&gt;The fix was structural, not prompted. The new architecture, shipped as feature 012, replaces the monolithic agent with three parts:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;An orchestration agent that controls flow and enforces limits with hard state checks in code, not in prompts&lt;/li&gt;
&lt;li&gt;Theme-specific agents (Engagement, Well-being, Career, Performance, Accomplishments) that each focus on a single domain for 2 to 4 questions&lt;/li&gt;
&lt;li&gt;A response-length heuristic that decides whether to continue or transition: continue if the user’s response is under 20 words, transition after 2 questions otherwise&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The orchestrator’s job is the part the model could not do reliably: state tracking. Has Engagement reached its minimum? Has any theme exceeded its maximum? Is total question count approaching the 15-question cap? Those are programmatic checks now, not prompt requests. The orchestrator hands the conversation to the next theme agent when its checks say so.&lt;/p&gt;
&lt;p&gt;Each theme agent is what an engineering manager would call an expert: scoped to a focused outcome, expert in reaching it, blind to everything else. Engagement does not know about Career. Career does not know about Well-being. The orchestrator is the only piece that knows the whole. That separation is what kept the conversation moving.&lt;/p&gt;
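&lt;p&gt;A minimal sketch of what those state checks might look like once they live in code. The function and type names are assumptions, and the exact transition rules are my reading of the description above (minimum 2 and maximum 4 per theme, a 15-question cap, a 20-word threshold), not the shipped implementation.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;typescript&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span&gt;// Illustrative sketch: names and exact transition rules are assumptions.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;const THEMES = [&quot;engagement&quot;, &quot;well-being&quot;, &quot;career&quot;, &quot;performance&quot;, &quot;accomplishments&quot;] as const;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;const MIN_PER_THEME = 2;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;const MAX_PER_THEME = 4;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;const TOTAL_CAP = 15;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;const SHORT_ANSWER_WORDS = 20;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;interface ReflectionState {&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  themeIndex: number;   // index into THEMES&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  askedInTheme: number; // questions asked in the current theme&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  askedTotal: number;   // questions asked across all themes&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;type Decision = &quot;continue&quot; | &quot;next-theme&quot; | &quot;wrap-up&quot;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;function wordCount(text: string): number {&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  return text.trim().split(/\s+/).filter(Boolean).length;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;// Plain state checks: the model never sees or enforces these limits.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;function decide(state: ReflectionState, lastAnswer: string): Decision {&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  if (state.askedTotal &gt;= TOTAL_CAP) return &quot;wrap-up&quot;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  const moveOn: Decision = state.themeIndex &gt;= THEMES.length - 1 ? &quot;wrap-up&quot; : &quot;next-theme&quot;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  if (state.askedInTheme &gt;= MAX_PER_THEME) return moveOn;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  const minReached = state.askedInTheme &gt;= MIN_PER_THEME;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  const shortAnswer = SHORT_ANSWER_WORDS &gt; wordCount(lastAnswer);&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  // Below the minimum, or after a short answer, stay in the theme; otherwise hand off.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  return !minReached || shortAnswer ? &quot;continue&quot; : moveOn;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Nothing in a function like this depends on the model remembering anything. The theme agent only ever gets asked for the next question; whether there is a next question is decided before the model is called.&lt;/p&gt;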
&lt;p&gt;&lt;img src=&quot;/olllo/ai-product-craft/images/olllo-ai-orchestration.png&quot; alt=&quot;Top: an Orchestration Agent block titled &amp;#x22;State machine, not a prompt,&amp;#x22; with four state checks — if min reached for current theme, if max exceeded for any theme, if total approaching 15-question cap, if user response &gt; 20 words. The orchestrator hands the conversation to theme agents and receives completion signals. Below it, five theme agents in a row: Engagement (team &amp;#x26; company, min 2 / max 4), Well-being (feelings live here, min 2 / max 4, FEELINGS SCOPE), Career (progress on goals, min 2 / max 4), Performance (outcomes, min 2 / max 4), Accomplishments (what shipped, min 2 / max 4). At the bottom, two cards comparing approaches: V0 monolith (one agent, long prompt, hard cap in words — state, scope, and flow all encoded as wishes the model forgets) versus V1 orchestrator-specialist (each agent expert in one theme, blind to the rest; orchestrator is the only piece that knows the whole).&quot;&gt;
&lt;/p&gt;
&lt;p&gt;The same pattern reappeared in goal-setting, in accomplishment refinement, anywhere the original temptation was “one big prompt that does it all.”&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id=&quot;what-the-model-wanted-to-talk-about&quot;&gt;What the model wanted to talk about&lt;/h3&gt;
&lt;p&gt;The most useful thing I learned about AI product craft is that the model has biases that prompting alone cannot suppress.&lt;/p&gt;
&lt;p&gt;In the single-agent reflection, even with explicit prompts telling the agent to spend most of its questions on the user’s actual work and impact, the conversation kept drifting toward feelings. &lt;em&gt;How do you feel about your week? How did that meeting make you feel? How do you feel about that?&lt;/em&gt; It was not what users wanted, it was not what the prompt asked for, and the system kept doing it anyway.&lt;/p&gt;
&lt;p&gt;Telling the model to stop did not work. I tried. Several times.&lt;/p&gt;
&lt;p&gt;The fix was not a sharper prompt. The fix was scope. The Well-being agent talks about feelings; that is its job. The other four agents are not allowed to. Performance asks about outcomes. Career asks about progress toward stated goals. Engagement asks about team and company connection. Accomplishments asks about what shipped. None of them have permission to ask “how do you feel about that,” because that question lives in a different agent’s scope and the orchestrator hands the conversation off when it is time.&lt;/p&gt;
&lt;p&gt;The lesson that generalizes: when the model wants to do something the product does not want, give the unwanted behavior its own scope and gate it. Telling the model “do not do X” is a wish. Building the system so X is structurally unavailable outside the agent that handles X is law.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/olllo/ai-product-craft/images/olllo-ai-scope-as-law.png&quot; alt=&quot;Two side-by-side approaches. Approach A · Telling: one agent, prompted to stop, with four crossed-out instructions (&amp;#x22;do not ask about feelings&amp;#x22;, &amp;#x22;focus on work and impact&amp;#x22;, &amp;#x22;no &amp;#x27;how do you feel&amp;#x27;&amp;#x22;, &amp;#x22;stay on outcomes, not emotion&amp;#x22;); result: &amp;#x22;How do you feel about that?&amp;#x22; asked anyway, every theme. Approach B · Scoping: five agents, only Well-being owns feelings (How did that feel?), the others ask about team, goals, outcomes, what shipped; result: feelings only when Well-being is on stage, structurally unavailable elsewhere. The lesson card underneath: WISH — telling the model &amp;#x22;do not do X&amp;#x22; → LAW — building the system so X is unavailable outside the agent that handles X.&quot;&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id=&quot;the-reliability-layer&quot;&gt;The reliability layer&lt;/h3&gt;
&lt;p&gt;The eval harness lives in &lt;code&gt;.github/workflows/ai-eval.yml&lt;/code&gt;. It runs nightly at 2 AM UTC and on every PR that touches &lt;code&gt;packages/ai/**&lt;/code&gt;, the AI services in the database package, or the eval workflow itself. The harness uses a versioned golden set, runs evaluations through &lt;code&gt;pnpm --filter @repo/ai test:evals&lt;/code&gt;, and greps the output for the literal string &lt;code&gt;REGRESSION DETECTED&lt;/code&gt;. Any regression fails the workflow and posts a comment on the PR. A nightly failure pages me through Slack.&lt;/p&gt;
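&lt;p&gt;The marker check itself is small. Expressed as a TypeScript sketch instead of a shell &lt;code&gt;grep&lt;/code&gt;, it amounts to something like the following; the helper names are hypothetical and this mirrors the step described above, not the actual workflow script.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;typescript&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span&gt;// Hypothetical helpers mirroring the grep step; the real workflow script is not shown.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;const REGRESSION_MARKER = &quot;REGRESSION DETECTED&quot;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;// Every output line that carries the marker, useful for the PR comment body.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;function regressionLines(evalOutput: string): string[] {&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  return evalOutput.split(&quot;\n&quot;).filter((line) =&gt; line.includes(REGRESSION_MARKER));&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;// Exit code for the CI step: 1 fails the workflow, 0 lets it pass.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;function exitCodeFor(evalOutput: string): number {&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  return regressionLines(evalOutput).length &gt; 0 ? 1 : 0;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;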
&lt;p&gt;A specific regression the harness has caught, in four months of operation: none. That is worth being honest about.&lt;/p&gt;
&lt;p&gt;I cannot tell you whether that is because my prompt changes were never bad enough to trigger one, because the golden set is not broad enough to catch the subtle drift that would have shown up at scale, or because the discipline of having the harness shaped how I thought about prompt changes in the first place. All three are plausible.&lt;/p&gt;
&lt;p&gt;What the harness did was let me ship prompt changes more confidently, because there was a runnable check between me and production. That confidence was worth building even when nothing was ever flagged. At sufficient scale, a regression will land. When it does, the gap between having an eval system already running and needing to build one is the difference between a five-minute fix and a three-week one.&lt;/p&gt;
&lt;p&gt;The other reliability pieces are smaller but worth listing. Vercel AI Gateway in front of every request, for unified logging, observability, and a single rate-limit chokepoint. Upstash Redis for request-layer rate limiting beyond what the gateway handles. Retry policies declared per tier, two to three attempts depending on tier. Streaming timeouts that distinguish between total request time, time to first chunk, and chunk-to-chunk stalls. None of these are novel; all of them are the boring infrastructure that turns a demo into a product.&lt;/p&gt;
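&lt;p&gt;The three streaming timeouts are easier to keep apart when each is named. A sketch under assumptions: the budget fields and the &lt;code&gt;checkStream&lt;/code&gt; function are illustrative names, and the check is written as a pure function over chunk arrival times so it can be tested without a live stream.&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;typescript&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span&gt;// Illustrative sketch: budget names are assumptions, not the real config.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;interface StreamBudget {&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  totalMs: number;      // whole request, start to finish&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  firstChunkMs: number; // start until the first streamed chunk&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  stallMs: number;      // longest allowed gap between chunks&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;type TimeoutKind = &quot;total&quot; | &quot;first-chunk&quot; | &quot;stall&quot; | null;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;// chunkTimesMs: arrival time of each chunk, in ms since the request started.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;// Returns which budget has been exceeded at nowMs, or null if none has.&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;function checkStream(budget: StreamBudget, chunkTimesMs: number[], nowMs: number): TimeoutKind {&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  if (nowMs &gt; budget.totalMs) return &quot;total&quot;;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  if (chunkTimesMs.length === 0) return nowMs &gt; budget.firstChunkMs ? &quot;first-chunk&quot; : null;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  const lastChunkMs = chunkTimesMs[chunkTimesMs.length - 1];&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;  return nowMs - lastChunkMs &gt; budget.stallMs ? &quot;stall&quot; : null;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;}&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Separating the three matters because they call for different responses: a missed first chunk is a candidate for an immediate retry or fallback, while a mid-stream stall happens after the user has already started reading.&lt;/p&gt;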
&lt;p&gt;&lt;img src=&quot;/olllo/ai-product-craft/images/olllo-ai-eval-harness.png&quot; alt=&quot;Two triggers feed the eval harness: a cron at 2:00 UTC every night (scheduled, main branch) and a pull-request trigger (on push, paths-filter on packages/ai/** and ai-eval.yml). Both fire the runner — pnpm --filter @repo/ai test:evals (versioned) — which executes a Golden set regression check in three steps: (1) inputs are frozen prompts and expected shapes, (2) run does live model calls and gateway logs, (3) diff greps stdout for &amp;#x22;REGRESSION DETECTED&amp;#x22; — the runner exits 1 on any match. Two result cards: PASS · 99% of runs is silence (workflow green, no comment, no page; ships confidently with a runnable check before prod); REGRESSION DETECTED fires two channels (auto-comment on the PR, @here Slack page). Footer note: 0 regressions caught in three months — either the changes weren&amp;#x27;t bad enough, the golden set isn&amp;#x27;t broad enough, or the discipline of having the harness shaped how I made changes. All three are plausible.&quot;&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id=&quot;what-id-take-into-another-product&quot;&gt;What I’d take into another product&lt;/h3&gt;
&lt;p&gt;The tier abstraction. Picking a model per task is the wrong unit of decision; picking a tier and assigning tasks to it is the right one. The tier captures the latency budget, the temperature philosophy, the retry policy, and the safety caps in a single object that downstream code references by name. New AI feature, new tier assignment, zero fresh decisions about timeouts or fallbacks.&lt;/p&gt;
&lt;p&gt;The orchestrator-specialist split for any conversation that needs to move through structured phases. Reflection, goal-setting, performance prep, anything that has phases. The model is not good at state across long contexts. The code is. Let each one do what it is good at.&lt;/p&gt;
&lt;p&gt;The eval harness as a non-negotiable. Building AI features without an eval harness is shipping a product whose quality you cannot measure. Even a small golden set with regression detection on PR is enough to start; the discipline of running it grows the set over time.&lt;/p&gt;
&lt;p&gt;The boring observability. AI Gateway, request logs, cost tracking, rate limiting. None of this looks like AI product craft on a portfolio. All of it is what makes the AI product behave like a product instead of a demo.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id=&quot;the-meta-point&quot;&gt;The meta-point&lt;/h3&gt;
&lt;p&gt;AI product work has three layers that get conflated. There is the model layer, where the choice is which model, what temperature, what budget. There is the orchestration layer, where the choice is what the model decides and what the code decides. There is the scope layer, where the choice is what each agent is allowed to talk about.&lt;/p&gt;
&lt;p&gt;The mistake I made early, and that I see often elsewhere, is to handle all three through prompting. Better prompts, longer prompts, cleverer prompts. None of it works past a certain point because the model was never the right place to encode hard constraints, define state machines, or scope domains.&lt;/p&gt;
&lt;p&gt;The right division of labor: the model decides language, the code decides flow, the scope decides domain. Once those three layers are separate, the model’s biases stop being problems and start being characteristics you design around.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/olllo/ai-product-craft/images/olllo-ai-layers.png&quot; alt=&quot;A header contrasts THE MISTAKE (handle all three through prompting — better, longer, cleverer) with THE RIGHT DIVISION (three layers, three places, three different tools). Below, three layers stacked. Layer 01 — Model layer (decides Language): which model? what temperature? what budget? Wrong: pick by gut feel, swap when latency complaints arrive. Right: tier owns budget + temp + retry; task slots into tier by latency. Layer 02 — Orchestration layer (decides Flow): what does the model decide vs the code? Wrong: encode flow in a long prompt, hope the agent remembers. Right: state checks in code, hand-offs between focused agents. Layer 03 — Scope layer (decides Domain): what is each agent allowed to talk about? Wrong: tell the model what not to do, repeatedly. Right: make unwanted behavior structurally unavailable outside its scope. A summary band underneath: MODEL decides Language · CODE decides Flow · SCOPE decides Domain.&quot;&gt;&lt;/p&gt;</content:encoded></item><item><title>Architecture &amp; Technology</title><link>https://hypoth.ai/olllo/architecture-and-technology</link><guid isPermaLink="true">https://hypoth.ai/olllo/architecture-and-technology</guid><description>A solo-built Turborepo with 40+ shipped features across web, native mobile, and PWA runs on a handful of load-bearing decisions.</description><pubDate>Wed, 15 Apr 2026 00:00:00 GMT</pubDate><content:encoded>&lt;h2 id=&quot;solo-architecture-nine-apps-twenty-one-packages&quot;&gt;Solo Architecture: Nine Apps, Twenty-One Packages&lt;/h2&gt;
&lt;p&gt;A walk through the olllo monorepo: what was built, what held up, and what I would reconsider with four months of hindsight.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/olllo/architecture-and-technology/images/olllo-architecture-monorepo.png&quot; alt=&quot;The olllo monorepo at a glance: nine deployed surfaces (app, api, web, mobile, email, workflows, docs, storybook, studio) on top of twenty-one shared packages grouped into four families — foundation (typescript-config, next-config, database, design-system), product domains (ai, chatbot, email, payments, notifications), cross-cutting (auth, feature-flags, internationalization, rate-limit, security, storage, webhooks, observability), and operational (analytics, seo, workflow-utils, next-config-shared).&quot;&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id=&quot;opening&quot;&gt;Opening&lt;/h3&gt;
&lt;p&gt;Architecture for one is a different problem from architecture for many. The team-of-ten answer is to design for handoffs, parallelism, and reviewer load. The team-of-one answer has to design for someone else: future-you, who will inherit every decision in three months and not remember why.&lt;/p&gt;
&lt;p&gt;The olllo monorepo is the version of that I shipped. Nine apps, twenty-one packages, thirty-eight numbered features over four months. Most of it I would build the same way again. Some of it I would not. This case study is the honest version of both.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id=&quot;the-shape&quot;&gt;The shape&lt;/h3&gt;
&lt;p&gt;The repo is a Turborepo workspace on top of pnpm. The split between apps and packages is the spine: apps are deployed surfaces, packages are libraries that two or more apps share.&lt;/p&gt;
&lt;p&gt;The apps:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;app&lt;/code&gt;: the authenticated product where users live, Next.js 16 on the App Router, 112 API routes&lt;/li&gt;
&lt;li&gt;&lt;code&gt;api&lt;/code&gt;: the webhook + cron + admin surface, deployed separately from &lt;code&gt;app&lt;/code&gt; (more on this below)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;web&lt;/code&gt;: the marketing site&lt;/li&gt;
&lt;li&gt;&lt;code&gt;mobile&lt;/code&gt;: the iOS/Android client (Expo SDK 54)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;email&lt;/code&gt;: the React Email preview app&lt;/li&gt;
&lt;li&gt;&lt;code&gt;docs&lt;/code&gt;, &lt;code&gt;storybook&lt;/code&gt;, &lt;code&gt;studio&lt;/code&gt;: internal tooling&lt;/li&gt;
&lt;li&gt;&lt;code&gt;workflows&lt;/code&gt;: background jobs deployed as a separate service&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The packages, grouped by what they do:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Foundation: &lt;code&gt;typescript-config&lt;/code&gt;, &lt;code&gt;next-config&lt;/code&gt;, &lt;code&gt;database&lt;/code&gt; (Prisma), &lt;code&gt;design-system&lt;/code&gt; (shadcn/ui-based, with the caveats in &lt;a href=&quot;/olllo/design-system-across-web-and-native&quot;&gt;Cross-Platform Consistency&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Product domains: &lt;code&gt;ai&lt;/code&gt;, &lt;code&gt;chatbot&lt;/code&gt;, &lt;code&gt;email&lt;/code&gt;, &lt;code&gt;payments&lt;/code&gt;, &lt;code&gt;notifications&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Cross-cutting: &lt;code&gt;auth&lt;/code&gt; (Clerk), &lt;code&gt;feature-flags&lt;/code&gt;, &lt;code&gt;internationalization&lt;/code&gt;, &lt;code&gt;rate-limit&lt;/code&gt;, &lt;code&gt;security&lt;/code&gt;, &lt;code&gt;storage&lt;/code&gt;, &lt;code&gt;webhooks&lt;/code&gt; (Svix-based outbound), &lt;code&gt;observability&lt;/code&gt; (Sentry + Logtail)&lt;/li&gt;
&lt;li&gt;Operational: &lt;code&gt;analytics&lt;/code&gt;, &lt;code&gt;seo&lt;/code&gt;, &lt;code&gt;workflow-utils&lt;/code&gt;, &lt;code&gt;next-config-shared&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is more surfaces than a typical solo product. The shape is intentional: each app is a separate deployment with separate concerns, each package is a boundary I knew I would want to swap or scale independently.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/olllo/architecture-and-technology/images/olllo-architecture-dependency-graph.png&quot; alt=&quot;A dependency grid: each row is a package (typescript-config, next-config, design-system, database, auth, feature-flags, internationalization, ai, chatbot, email, payments, notifications, rate-limit, security, storage, webhooks, observability, analytics, seo, workflow-utils), each column is an app (app, api, web, mobile, email, workflows, docs, story, studio). Filled cells show which app uses which package. Footer bars show packages-used per app: app uses 18, api uses 14, web uses 8, mobile uses 9, email uses 4, workflows uses 7, docs uses 3, story uses 2, studio uses 1.&quot;&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id=&quot;decisions-that-held-up&quot;&gt;Decisions that held up&lt;/h3&gt;
&lt;p&gt;Four architectural choices I would defend in a hostile interview.&lt;/p&gt;
&lt;h4 id=&quot;splitting-appsapi-from-appsapp&quot;&gt;Splitting &lt;code&gt;apps/api&lt;/code&gt; from &lt;code&gt;apps/app&lt;/code&gt;&lt;/h4&gt;
&lt;p&gt;&lt;code&gt;apps/api&lt;/code&gt; is its own Next.js deployment on its own port (3002 in dev). It serves only inbound webhooks (Clerk auth, Stripe payments, Resend email, Knock schedules), cron endpoints (keep-alive, drip emails), and a thin admin API. The user-facing routes live in &lt;code&gt;apps/app&lt;/code&gt;, which has 112 API routes for product features.&lt;/p&gt;
&lt;p&gt;The reasons are practical. Webhooks need signature verification and no Clerk session; product routes need exactly the opposite. BotID protection lives on &lt;code&gt;api&lt;/code&gt;, not on &lt;code&gt;app&lt;/code&gt;. A flaky Stripe handler should not be able to take down the user-facing app. Webhook load is bursty and external; product load is smooth and authenticated. No single one of those concerns is decisive on its own; the combination is what made the split worth maintaining.&lt;/p&gt;
&lt;p&gt;The cost is one extra Next.js app to deploy and a slightly more complex local-dev story (the Stripe CLI forwards to port 3002, the app runs on port 3000). The benefit is that I never had to think about “should this middleware run on the webhook” or “what if this cron starves the product.”&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/olllo/architecture-and-technology/images/olllo-architecture-webhook.png&quot; alt=&quot;Two stacked panels. Top: apps/api on port 3002, with no Clerk session, BotID protection, and signature verify; inbound routes are webhooks (Stripe, Clerk, Resend, Knock — all signature-verified), crons (keep-alive every 5m, drip hourly), and a thin admin surface. Bottom: apps/app on port 3000, Clerk session required, authed product surface; inbound routes are the (authenticated)/dashboard, /reflections, /accomplishments, plus /api/* (×112), all Clerk + RLS gated. Underneath both: shared packages — @repo/database, @repo/auth, @repo/observability, @repo/email, @repo/payments, @repo/notifications, plus Vercel Postgres. Four pull-out points: (01) Auth posture — webhook signatures vs Clerk sessions are exact opposites. (02) Blast radius — a flaky Stripe handler can&amp;#x27;t take the product down. (03) Load shape — webhooks are bursty + external; product is smooth + authed. (04) BotID — lives on api, not on app. One place to enforce it.&quot;&gt;
&lt;em&gt;The four reasons the split is worth maintaining, side by side. Auth posture and load shape diverge cleanly across the boundary; blast radius and BotID enforcement only work because the boundary exists.&lt;/em&gt;&lt;/p&gt;
&lt;h4 id=&quot;vercel-ai-gateway-as-a-universal-chokepoint&quot;&gt;Vercel AI Gateway as a universal chokepoint&lt;/h4&gt;
&lt;p&gt;Every AI request in the system routes through Vercel AI Gateway. The config in &lt;code&gt;packages/ai/lib/config/model-tiers.ts&lt;/code&gt; declares &lt;code&gt;enabled: true&lt;/code&gt; and there is no opt-out. &lt;a href=&quot;/olllo/ai-product-craft&quot;&gt;AI Product Craft&lt;/a&gt; covers the AI design in detail; architecturally, the gateway is the single most consequential decision in the AI surface.&lt;/p&gt;
&lt;p&gt;One observability surface. One rate-limit boundary. One cost-tracking layer. One place to swap providers underneath. When Anthropic releases a faster model, the swap is a config change. When I want to know what the system spent on AI yesterday, the answer is one screen. When I add a new AI feature, none of the observability work is fresh.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/olllo/architecture-and-technology/images/olllo-architecture-ai-gateway.png&quot; alt=&quot;Six callers route through one gateway: Reflection multi-agent (apps/app, authed), Accomplishment refine (apps/app, chat), Voice → STAR extract (apps/app, pipeline), Onboarding chatbot (apps/app, @repo/chatbot), Background summaries (apps/workflows), Drip personalization (apps/api, cron) — all into @repo/ai&amp;#x27;s Vercel AI Gateway, which provides observability (one panel), rate-limit (one boundary), cost tracking (one layer), and provider swap (config change). Three providers behind it: Claude Haiku 4.5 (fast/cheap), Claude Sonnet 4.5 (default tier), Claude Opus 4 (deep reasoning).&quot;&gt;&lt;/p&gt;
&lt;h4 id=&quot;single-postgres--single-prisma-schema&quot;&gt;Single Postgres + single Prisma schema&lt;/h4&gt;
&lt;p&gt;The data layer is one Vercel Postgres instance with one Prisma schema covering everything: user data, marketing consent, subscriptions, accomplishments, reflections, goals, contacts, notification events. No separate marketing DB, no separate auth DB, no separate analytics DB.&lt;/p&gt;
&lt;p&gt;The temptation to split was always there. Each domain feels like it deserves its own schema. The reason I did not split is that the data needs to join. Marketing consent correlates with billing status. Onboarding completion correlates with reflection cadence. Notification eligibility depends on feature flag state. Splitting the database would have produced sync work for no real isolation gain: the same JOIN logic, just spread across services and probably implemented as a custom event bus that nobody asked for.&lt;/p&gt;
&lt;p&gt;A team would eventually outgrow this default. A solo product never reached the size where the limits showed.&lt;/p&gt;
&lt;h4 id=&quot;the-four-layer-component-hierarchy&quot;&gt;The four-layer component hierarchy&lt;/h4&gt;
&lt;p&gt;Components live in one of four places, codified in the constitution:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;code&gt;packages/design-system/components/&lt;/code&gt;: UI primitives, no business logic (shadcn/ui base + form wrappers)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;apps/app/components/{domain}/&lt;/code&gt;: used across two or more features&lt;/li&gt;
&lt;li&gt;&lt;code&gt;apps/app/app/(authenticated)/_components/&lt;/code&gt;: layout shell (sidebar, header, nav)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;apps/app/app/(authenticated)/{feature}/_components/&lt;/code&gt;: feature-route components&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The decision tree fits on one page of the constitution. New component, single question: where does this live? The answer is determined, not negotiated. There is a promotion path when a component graduates from feature-route to app-shared to design system.&lt;/p&gt;
&lt;p&gt;The reason this held up: solo, the temptation is to treat every component as worthy of promotion to design system because you have seen it twice. The hierarchy makes you wait until the third use, which is when the abstraction is actually safe. Most components stay in &lt;code&gt;_components&lt;/code&gt; folders, which is exactly where they should be.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/olllo/architecture-and-technology/images/olllo-architecture-layers.png&quot; alt=&quot;Four stacked layers of the component hierarchy: Layer 04 feature-route components (one-off composites scoped to a route, in apps/app/app/(features)/.../components/), Layer 03 layout-shell components (page chrome, navigation, route scaffolds, in apps/app/components/layout/), Layer 02 app-shared components (composed primitives reused across features, in apps/app/components/shared/), and Layer 01 design-system primitives (shadcn/ui, 50+ components owned in tree, in packages/design-system/components/ui/).&quot;&gt;
&lt;em&gt;Four layers, one decision tree. New component? Single question: where does this live? The answer is determined, not negotiated.&lt;/em&gt;&lt;/p&gt;
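&lt;p&gt;The decision tree can be read as a small function. This is my sketch of the rules as stated in this section, not a copy of the constitution: the input fields and the exact ordering of checks are assumptions, and the third-use threshold for the design system comes from the promotion rule described above.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;
```typescript
type Layer = "design-system" | "app-shared" | "layout-shell" | "feature-route";

// Hypothetical facts a developer would know about a new component.
interface ComponentFacts {
  isLayoutShell: boolean;    // sidebar, header, nav
  hasBusinessLogic: boolean; // primitives must have none
  featuresUsingIt: number;   // how many features currently use it
}

// Folder for each layer, as listed in the hierarchy above.
const LAYER_PATH: { [layer in Layer]: string } = {
  "design-system": "packages/design-system/components/",
  "app-shared": "apps/app/components/{domain}/",
  "layout-shell": "apps/app/app/(authenticated)/_components/",
  "feature-route": "apps/app/app/(authenticated)/{feature}/_components/",
};

// One question, one answer: where does this component live?
function placeComponent(c: ComponentFacts): Layer {
  if (c.isLayoutShell) return "layout-shell";
  if (c.hasBusinessLogic === false) {
    // Promotion to the design system waits for the third use.
    if (c.featuresUsingIt >= 3) return "design-system";
  }
  if (c.featuresUsingIt >= 2) return "app-shared";
  return "feature-route";
}
```
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The default branch is the feature-route folder, which matches the observation that most components never leave &lt;code&gt;_components&lt;/code&gt;.&lt;/p&gt;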
&lt;hr&gt;
&lt;h3 id=&quot;decisions-id-reconsider&quot;&gt;Decisions I’d reconsider&lt;/h3&gt;
&lt;h4 id=&quot;expo-for-mobile&quot;&gt;Expo for mobile&lt;/h4&gt;
&lt;p&gt;The mobile app is Expo SDK 54. I would not use Expo if I were starting today.&lt;/p&gt;
&lt;p&gt;The reasoning is direct experience. Since shutting down olllo, I have spent time in pure native iOS development, and the experience of controlling the build, debugging across the simulator and a physical device, and shipping changes is materially better without the translation layer. Expo’s promise is “write JavaScript, ship to two platforms.” The promise is real. The cost is also real: when something breaks, the surface area of “is this a JS bug, an Expo SDK bug, a Metro bundler bug, an iOS-Expo-config issue, or a real native bug” is wide. With native, the surface is narrower, the tools are sharper, and the debugger does not lie to you about what is on the device.&lt;/p&gt;
&lt;p&gt;For a solo product genuinely targeting both iOS and Android from day one, Expo is still defensible. For a solo product where iOS is the primary surface (which is what olllo’s mobile ended up being once usage data showed where the users actually were), I would write Swift.&lt;/p&gt;
&lt;h4 id=&quot;the-half-built-analytics-abstraction&quot;&gt;The half-built analytics abstraction&lt;/h4&gt;
&lt;p&gt;&lt;code&gt;packages/analytics&lt;/code&gt; is named for swappability. Call sites import &lt;code&gt;analytics&lt;/code&gt; from &lt;code&gt;@repo/analytics&lt;/code&gt; rather than from &lt;code&gt;posthog-js&lt;/code&gt;, which is exactly what you would want if you ever needed to swap providers. The boundary is at the import path.&lt;/p&gt;
&lt;p&gt;The methods underneath are not abstracted. &lt;code&gt;analytics.capture()&lt;/code&gt;, &lt;code&gt;analytics.identify()&lt;/code&gt;, &lt;code&gt;analytics.flush()&lt;/code&gt; are PostHog method names. The package re-exports posthog-js directly with a renamed identifier. The server-side variant instantiates &lt;code&gt;new PostHog(...)&lt;/code&gt; and the noop development shim mimics the PostHog interface.&lt;/p&gt;
&lt;p&gt;Now that I am moving off PostHog (the UI is harder to use than I expected, the Slack integration is shallow, the views I want are easier to build myself), the incompleteness is visible. Two paths forward:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Build the homegrown analytics layer to match PostHog’s method API. The package swap stays at one file.&lt;/li&gt;
&lt;li&gt;Update all the call sites to a new API. The package shape changes too.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Either works. The lesson is that a half-abstraction is worse than no abstraction in one specific way: it makes you think the swap is cheaper than it is. A package boundary named for swappability looks like an interface; it isn’t, until the methods underneath are wrapped too.&lt;/p&gt;
&lt;p&gt;If I were doing this again, I would either go all the way (a generic capture/identify interface, PostHog wrapped inside it) or not at all (call sites import &lt;code&gt;posthog-js&lt;/code&gt; directly, with a one-time migration cost when the swap eventually came). The middle path is the expensive one.&lt;/p&gt;
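&lt;p&gt;For contrast, a sketch of what going all the way might look like. The interface and function names are hypothetical, and &lt;code&gt;VendorClient&lt;/code&gt; is a stand-in declared here (shaped after the &lt;code&gt;capture&lt;/code&gt; and &lt;code&gt;identify&lt;/code&gt; methods named above) rather than an import from any SDK. Call sites would depend only on &lt;code&gt;Analytics&lt;/code&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;
```typescript
type Properties = { [key: string]: unknown };

// Provider-neutral surface. Call sites import this and nothing vendor-shaped.
interface Analytics {
  track(event: string, properties?: Properties): void;
  identifyUser(userId: string, traits?: Properties): void;
}

// Minimal stand-in for the vendor client this sketch needs (an assumption).
interface VendorClient {
  capture(event: string, properties?: Properties): void;
  identify(distinctId: string, properties?: Properties): void;
}

// The only place vendor method names appear.
function wrapVendor(client: VendorClient): Analytics {
  return {
    track: (event, properties) => client.capture(event, properties),
    identifyUser: (userId, traits) => client.identify(userId, traits),
  };
}

// A second implementation proves the boundary: a recorder for dev and tests.
function createRecordingAnalytics(log: string[]): Analytics {
  return {
    track: (event) => { log.push("track:" + event); },
    identifyUser: (userId) => { log.push("identify:" + userId); },
  };
}
```
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With this shape, replacing the vendor means writing one new function that returns &lt;code&gt;Analytics&lt;/code&gt;; no call site changes.&lt;/p&gt;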
&lt;hr&gt;
&lt;h3 id=&quot;bought-vs-built&quot;&gt;Bought vs built&lt;/h3&gt;
&lt;p&gt;Things I bought:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Clerk (auth, sessions, social login, org management, webhooks)&lt;/li&gt;
&lt;li&gt;Stripe (subscriptions, billing, customer portal)&lt;/li&gt;
&lt;li&gt;Resend (transactional + marketing email)&lt;/li&gt;
&lt;li&gt;Knock (multi-channel notification routing, user preferences)&lt;/li&gt;
&lt;li&gt;Vercel (hosting, edge runtime, AI Gateway)&lt;/li&gt;
&lt;li&gt;Anthropic via the gateway (Claude models)&lt;/li&gt;
&lt;li&gt;Sentry (error tracking)&lt;/li&gt;
&lt;li&gt;Logtail (structured logs)&lt;/li&gt;
&lt;li&gt;Upstash Redis (rate limiting)&lt;/li&gt;
&lt;li&gt;Sanity (marketing copy CMS)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Things I built:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The reflection multi-agent flow (see &lt;a href=&quot;/olllo/ai-product-craft&quot;&gt;AI Product Craft&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;The accomplishment refinement chat&lt;/li&gt;
&lt;li&gt;The waitlist + invite + free-forever-grant system (see &lt;a href=&quot;/olllo/growth-engineering-experiments&quot;&gt;Growth Engineering&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;The marketing email consent + tokenized unsubscribe (see &lt;a href=&quot;/olllo/growth-engineering-experiments&quot;&gt;Growth Engineering&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;The voice capture pipeline (audio captured, transcribed, then extracted into a STAR-format entry)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The split is roughly: bought every commodity, built every product surface. The instinct was right in nearly every case. Auth, payments, and email delivery are all commodities. Reflection conversation is the product, and a custom build was the only way it could have worked.&lt;/p&gt;
&lt;p&gt;Two vendor decisions look different in hindsight than I thought they would when I picked them.&lt;/p&gt;
&lt;p&gt;PostHog is the first: I am replacing it with a homegrown analytics layer. The product was harder to use and integrate than I expected, especially around chart customization and Slack alerts. I have started building lightweight in-app analytics tailored to the metrics olllo actually needed, and it has been surprisingly cheap. There is complexity I might be missing that PostHog handles for free: funnel tools, retention cohort math, session replay. Whether the homegrown version stays simple as I add use cases is the open question.&lt;/p&gt;
&lt;p&gt;Knock turned out to be narrower than I bought it for. I picked Knock for multi-channel notification routing across email, in-app, and push, with user preferences and a send-history API. In practice I used it only for schedule management. Keeping email styling consistent across olllo meant rendering templates inside my own &lt;code&gt;@repo/email&lt;/code&gt; package and sending them through Resend, so the real flow is: Knock fires a scheduled webhook, &lt;code&gt;apps/api&lt;/code&gt; listens, the email package renders, Resend delivers. The promise of Knock-as-multi-channel-sender did not survive the practical need for consistent email design. With hindsight, I would consider replacing the Knock dependency with a cron and my own scheduling, since the value I extracted was the schedule, not the multi-channel send.&lt;/p&gt;
&lt;p&gt;The rest of the bought stack paid off cleanly. Clerk, Stripe, Resend, Vercel + AI Gateway, Sentry, Logtail, Upstash, and Sanity all behaved as advertised and saved meaningful build time on day one.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/olllo/architecture-and-technology/images/olllo-architecture-bought-vs-built.png&quot; alt=&quot;Three sections. BOUGHT · COMMODITY (every piece I didn&amp;#x27;t need to be a craftsman about): Clerk (auth, sessions, org), Stripe (subscriptions, billing), Resend (email delivery), Vercel (hosting, edge, gateway), Anthropic (Claude via gateway), Sentry (error tracking), Logtail (structured logs), Upstash (Redis, rate limit), Sanity (marketing CMS) — all marked &amp;#x22;as advertised.&amp;#x22; BOUGHT · WOULD RECONSIDER (the two that didn&amp;#x27;t survive contact with the design): PostHog (analytics — usability, experience and effort was not ideal), Knock (multi-channel notifications — narrower scope used, scheduling only, which I could have built). BUILT · PRODUCT (the five surfaces a custom build was the only way): Reflection multi-agent flow (the conversational core), Accomplishment refinement chat (the editing surface), Waitlist + invite + grant system (controlled rollout), Marketing consent + tokenized unsub (compliance, day one), Voice → STAR extraction pipeline (audio capture to entry).&quot;&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id=&quot;why-solo&quot;&gt;Why solo&lt;/h3&gt;
&lt;p&gt;The honest answer to “why solo” is not “I prefer working alone.”&lt;/p&gt;
&lt;p&gt;I started olllo with a co-builder, and after about a month they became unavailable. I had two options at that point: pause and find another co-builder, or absorb the second seat and keep going. The architecture decisions documented above are mostly downstream of choosing to keep going.&lt;/p&gt;
&lt;p&gt;A co-built version of olllo would probably look different. Some of the structural discipline I imposed on myself (the constitution, the spec gates, the four-layer hierarchy) exists because the team-of-one cannot rely on review to enforce taste, so it has to be codified. With a second engineer, more of the discipline could have been informal. With a third or fourth, more of the package boundaries would have been driven by ownership rather than coupling.&lt;/p&gt;
&lt;p&gt;The architecture is what it is partly because of who built it. That is worth saying out loud rather than pretending the structure was always the plan.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id=&quot;what-id-take-into-another-product&quot;&gt;What I’d take into another product&lt;/h3&gt;
&lt;p&gt;The apps + packages split, with a Turborepo backbone and a pnpm workspace. Universal default for any product more complex than a single Next.js app.&lt;/p&gt;
&lt;p&gt;The webhook/cron deployment isolation. Anything that runs on someone else’s schedule (Stripe, Clerk, Resend, cron) belongs in its own deployment with its own auth posture. This will not feel necessary on day one. It will feel obvious by month three.&lt;/p&gt;
&lt;p&gt;The four-layer component hierarchy, codified in a constitution. Cheap to enforce, expensive to retrofit, scales to a team without modification.&lt;/p&gt;
&lt;p&gt;The AI Gateway pattern. Whatever provider you pick, route everything through one chokepoint, get observability for free, treat model swaps as config rather than refactors.&lt;/p&gt;
&lt;p&gt;The single-database default. Split when you have a reason. Do not split because the domains feel different.&lt;/p&gt;
&lt;p&gt;What I would not bring forward: Expo, the half-abstracted analytics layer, and the way I integrated Knock without checking that the multi-channel send promise would survive my own design constraints.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id=&quot;the-meta-point&quot;&gt;The meta-point&lt;/h3&gt;
&lt;p&gt;The point of architecture in a solo + AI build is not to look like a senior engineer. It is to make decisions that compound, draw boundaries that future-you will recognize, and skip the abstractions that are flattering on day one and expensive on day ninety.&lt;/p&gt;
&lt;p&gt;Most of what I built was the right shape. The pieces I would swap are the ones where I drew a line and then did not finish enforcing it. A package called &lt;code&gt;@repo/analytics&lt;/code&gt; that exposes PostHog’s method surface is dishonest in a small but consequential way. Future-you will believe the package name and discover the cost at the exact moment the swap was supposed to be cheap.&lt;/p&gt;
&lt;p&gt;The architectural taste I want to carry forward is unsentimental about that. Either the boundary is the interface, or the boundary is just a folder. Both are fine. The thing to avoid is the boundary that pretends to be an interface and is not.&lt;/p&gt;</content:encoded></item><item><title>Growth Engineering Experiments</title><link>https://hypoth.ai/olllo/growth-engineering-experiments</link><guid isPermaLink="true">https://hypoth.ai/olllo/growth-engineering-experiments</guid><description>I built the growth stack (waitlist, surveys, drip sequences, free-forever grants) and even the free cohort didn&apos;t show up. Here&apos;s the instrumentation story.</description><pubDate>Wed, 15 Apr 2026 00:00:00 GMT</pubDate><content:encoded>&lt;h2 id=&quot;growth-infrastructure-cannot-manufacture-habit&quot;&gt;Growth Infrastructure Cannot Manufacture Habit&lt;/h2&gt;
&lt;p&gt;A case study of what I learned by trying. Four months of growth engineering across waitlist, surveys, drips, consent, and free-forever grants. What each piece was supposed to do, what each piece actually did, and the lesson I would take into the next product.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id=&quot;opening&quot;&gt;Opening&lt;/h3&gt;
&lt;p&gt;This is the twin of the post-mortem. &lt;a href=&quot;/olllo/the-honest-post-mortem&quot;&gt;The post-mortem&lt;/a&gt; tells the story of behavior: people identified the pain, said it was real, and even with the price gate removed did not reliably show up. This case study is the part where I tried to fix that with infrastructure, and discovered that infrastructure is not the lever.&lt;/p&gt;
&lt;p&gt;That is the thesis in one sentence: growth infrastructure cannot manufacture habit. If the habit is there, infrastructure amplifies it. If the habit is thin, infrastructure measures the thinness with very high precision. Pricing experiments are the visible failure inside that frame. Behavior is the deeper one.&lt;/p&gt;
&lt;p&gt;I built a lot of infrastructure.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id=&quot;the-shape-of-the-growth-stack&quot;&gt;The shape of the growth stack&lt;/h3&gt;
&lt;p&gt;Five layers, each with a job.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Top of funnel&lt;/strong&gt; — a waitlist on the marketing site, fed by LinkedIn posts, with a short survey to qualify intent. Admit users in waves.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Activation&lt;/strong&gt; — drip tips, onboarding reminders, and progress nudges, sent to bring people back on day two, three, and seven.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Virality&lt;/strong&gt; — referral codes with two-sided extensions, scaffolded into the billing path but never deployed to users.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Goodwill&lt;/strong&gt; — free-forever grants for early users who stayed engaged through rough edges. Gratitude as an entitlement.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Compliance&lt;/strong&gt; — a marketing email consent system with tokenized unsubscribe and per-topic preferences, in place before any marketing email got sent.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Underneath: a transactional email pipeline (queue → cron → delivery → telemetry) shared by the four layers that reached users (the referral layer never shipped), with the platform caveats covered in &lt;a href=&quot;/olllo/architecture-and-technology&quot;&gt;Solo Architecture&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;That is more growth machinery than most pre-launch products carry. The shape is intentional in some places, premature in others. The case study is about which was which.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id=&quot;what-each-piece-was-actually-for&quot;&gt;What each piece was actually for&lt;/h3&gt;
&lt;p&gt;The honest version of why I built each layer. None of it was generic “growth hacking.” Each piece had a real reason that made sense at the time.&lt;/p&gt;
&lt;h4 id=&quot;waitlist--survey&quot;&gt;Waitlist + survey&lt;/h4&gt;
&lt;p&gt;Two jobs at once: get the name &lt;em&gt;olllo&lt;/em&gt; in front of people through LinkedIn marketing, and control the rate at which they entered the beta. I wanted to admit users in waves, because if there was a problem (a bug, a thin feature, an onboarding rough edge), I would rather hit it on twenty users than on two hundred. The waitlist was a safety mechanism as much as a marketing one.&lt;/p&gt;
&lt;p&gt;The survey on the waitlist replaced an earlier Typeform setup. Bringing it in-app meant the responses landed in the same database as the rest of the user data, so I could correlate “what someone said on the survey” with “what they actually did once admitted.”&lt;/p&gt;
&lt;h4 id=&quot;free-forever-grants&quot;&gt;Free-forever grants&lt;/h4&gt;
&lt;p&gt;Not a pricing experiment. A thank-you to the early users who provided feedback, helped surface unknown technical issues, and stayed engaged through rough edges. The grant system was deliberate gratitude, encoded as an entitlement that would persist even after I shut the product down.&lt;/p&gt;
&lt;h4 id=&quot;drip-tips--onboarding-reminders&quot;&gt;Drip tips + onboarding reminders&lt;/h4&gt;
&lt;p&gt;Celebration and continuation. Each tip was framed around progress the user had made, with a reason to come back and a reminder of why the next step mattered. Not aggressive re-engagement, not manufactured urgency. Closer to what I would have written by hand if I had been emailing each beta user individually.&lt;/p&gt;
&lt;h4 id=&quot;marketing-email-consent&quot;&gt;Marketing email consent&lt;/h4&gt;
&lt;p&gt;Compliance from day one. Tokenized unsubscribe links signed with JWT, per-topic preferences, an audit trail of consent events. I built this before any marketing email got sent, because I never wanted to be in the position of retrofitting compliance after a campaign had already gone out.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id=&quot;what-the-data-actually-showed&quot;&gt;What the data actually showed&lt;/h3&gt;
&lt;p&gt;The infrastructure worked. The metrics did not.&lt;/p&gt;
&lt;p&gt;Top of funnel filled the way LinkedIn-driven waitlists usually do, in ones and twos:&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Top of funnel&lt;/th&gt;&lt;th&gt;Count&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Waitlist signups&lt;/td&gt;&lt;td&gt;30&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Survey completions&lt;/td&gt;&lt;td&gt;14&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;From the thirty on the waitlist, I split admissions into two cohorts to learn something about the price gate. Ten were granted accounts with no payment step. Twenty were given a Stripe subscription path that required a credit card up front, in exchange for the beta plan: a 60-day free trial then $48/year (50% off the $96/year retail), with two extra months free on the annual plan.&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Activation&lt;/th&gt;&lt;th&gt;Cohort A — no payment (n=10)&lt;/th&gt;&lt;th&gt;Cohort B — Stripe gate (n=20)&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Account created&lt;/td&gt;&lt;td&gt;9&lt;/td&gt;&lt;td&gt;2&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Onboarding completed&lt;/td&gt;&lt;td&gt;5&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Active at 1 month (4+ weekly logs)&lt;/td&gt;&lt;td&gt;3&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Paid conversion&lt;/td&gt;&lt;td&gt;n/a&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
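&lt;p&gt;The counts in the table convert to stage-by-stage rates with simple arithmetic. The helper below is only a sketch for reading the numbers; the function and field names are mine, and the counts are the ones in the table.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;
```typescript
interface Stage {
  label: string;
  count: number;
}

interface StageRate {
  label: string;
  percentOfInvited: number;
}

// Express each stage as a whole-number percentage of the invited cohort.
function funnelRates(invited: number, stages: Stage[]): StageRate[] {
  return stages.map((stage) => ({
    label: stage.label,
    percentOfInvited: Math.round((stage.count * 100) / invited),
  }));
}

// Cohort A: no payment step, n = 10.
const cohortA = funnelRates(10, [
  { label: "Account created", count: 9 },
  { label: "Onboarding completed", count: 5 },
  { label: "Active at 1 month", count: 3 },
]);

// Cohort B: credit card required up front, n = 20.
const cohortB = funnelRates(20, [
  { label: "Account created", count: 2 },
  { label: "Onboarding completed", count: 0 },
  { label: "Active at 1 month", count: 0 },
]);

console.log(cohortA.map((s) => s.percentOfInvited)); // [ 90, 50, 30 ]
console.log(cohortB.map((s) => s.percentOfInvited)); // [ 10, 0, 0 ]
```
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With cohorts this small, one user is worth ten points in Cohort A and five in Cohort B, so the percentages are directional rather than statistically firm.&lt;/p&gt;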
&lt;p&gt;Two readings sit on top of each other.&lt;/p&gt;
&lt;p&gt;The credit-card gate was a hard wall. Eighteen of twenty invitees never created an account, and neither of the two who did completed onboarding. The deal on offer was generous (60-day free trial, $48/year at 50% off retail, two more months free on annual), and it did not matter. Asking for a credit card before a user has done anything in the product is asking them to commit before they have a reason to.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/olllo/growth-engineering-experiments/images/olllo-growth-cohort-b.png&quot; alt=&quot;Cohort B funnel: 20 invited from the waitlist, with a Stripe subscription requiring a credit card up front in exchange for the beta plan ($48/year at 50% off retail, 60-day free trial, two more months free on annual). 2 of 20 created accounts, 18 abandoned at the card wall. 0 completed onboarding. 0 active at one month.&quot;&gt;
&lt;/p&gt;
&lt;p&gt;The free cohort surfaced the deeper signal. With the payment friction removed, the funnel still narrowed: half had dropped by onboarding, and seven of ten had stopped logging by week four. The daily-capture habit did not form for most of the cohort that articulated the pain most clearly. That is the same shape fitness apps and weight-loss apps live with: a population that names the goal correctly and does not reliably show up to do the work. It is the signal &lt;a href=&quot;/olllo/the-honest-post-mortem&quot;&gt;the post-mortem&lt;/a&gt; leans on.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/olllo/growth-engineering-experiments/images/olllo-growth-cohort-a.png&quot; alt=&quot;Cohort A funnel: 10 invited from the waitlist with no card required. 9 of 10 created an account, 5 of 10 completed onboarding, 3 of 10 were active at one month with four repeat weeks of logging.&quot;&gt;
&lt;em&gt;Cohort A. With the payment friction removed, the funnel still narrowed: 9 of 10 created accounts, 5 of 10 onboarded, 3 of 10 still logging weekly at week four. Even free, the habit didn’t form for most of the cohort that named the pain most clearly.&lt;/em&gt;&lt;/p&gt;
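&lt;p&gt;The two funnels can also be read stage by stage with a few lines of arithmetic. The sketch below is illustrative only: the counts come from the activation table, while the &lt;code&gt;Stage&lt;/code&gt; type and the &lt;code&gt;stageConversion&lt;/code&gt; helper are hypothetical names, not part of olllo’s codebase.&lt;/p&gt;

```typescript
// Hypothetical helper: the names are illustrative, the counts are from the activation table.
type Stage = { name: string; count: number };

// Share of the previous stage that reached each stage, as a whole-number percentage.
// A stage whose predecessor is empty reports 0 rather than dividing by zero.
function stageConversion(invited: number, stages: Stage[]): { name: string; pct: number }[] {
  let previous = invited;
  return stages.map((stage) => {
    const pct = previous === 0 ? 0 : Math.round((stage.count / previous) * 100);
    previous = stage.count;
    return { name: stage.name, pct };
  });
}

// Cohort A: no payment step, 10 invited.
const cohortA = stageConversion(10, [
  { name: "Account created", count: 9 },
  { name: "Onboarding completed", count: 5 },
  { name: "Active at 1 month", count: 3 },
]);

// Cohort B: Stripe gate, 20 invited.
const cohortB = stageConversion(20, [
  { name: "Account created", count: 2 },
  { name: "Onboarding completed", count: 0 },
  { name: "Active at 1 month", count: 0 },
]);
```

&lt;p&gt;Read this way, Cohort A converts at 90%, 56%, and 60% from one stage to the next, so its largest single loss is at onboarding. Cohort B converts at 10% at the card wall and 0% afterwards.&lt;/p&gt;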
&lt;p&gt;By layer, the same pattern shows up everywhere infrastructure was supposed to do the lifting:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/olllo/growth-engineering-experiments/images/olllo-growth-layers.png&quot; alt=&quot;A per-layer summary table. Waitlist: signups arrived, queued, admitted in waves — but of those admitted, almost none became paid users. Survey: responses landed in the same DB as user data — but &amp;#x22;what they said&amp;#x22; rarely predicted &amp;#x22;what they did.&amp;#x22; Drip tips: open rates within normal range, clicks happened — but opens didn&amp;#x27;t convert to return visits. Referrals: codes generated, two-sided extensions credited — but few codes were ever sent and fewer redeemed. Grants: free-forever entitlements issued and persisted — but recipients didn&amp;#x27;t bring more recipients.&quot;&gt;
&lt;em&gt;Each layer worked. Each layer also failed to produce the outcome it was built for.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;The instrumentation surfaces the conclusion cleanly, which is what good instrumentation is supposed to do. The growth stack did its job. Its job was to measure, and the measurement was not what I wanted it to be.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id=&quot;the-order-of-operations-was-wrong&quot;&gt;The order of operations was wrong&lt;/h3&gt;
&lt;p&gt;The chronology of what shipped is the part of this case study I would change if I were doing it again.&lt;/p&gt;
&lt;p&gt;The features shipped in roughly this order across the back half of the project:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Late January: referrals scaffolded into the Stripe billing feature (never operationalized)&lt;/li&gt;
&lt;li&gt;Late January: marketing email consent system&lt;/li&gt;
&lt;li&gt;Early February: drip tip system + onboarding reminder emails&lt;/li&gt;
&lt;li&gt;Late March: free-forever grant migration&lt;/li&gt;
&lt;li&gt;Late March: in-app waitlist survey, replacing the earlier Typeform&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Read that list against what each piece is supposed to do, and the inversion is visible. I built virality scaffolding (referrals) and compliance (consent) early, then activation (drips, reminders), and shipped the waitlist survey near the end. The piece that should have told me early whether to build the rest came last.&lt;/p&gt;
&lt;p&gt;The order I shipped reflects what was easy to build at each moment, not what would have validated demand fastest. Referrals were a natural extension of the Stripe billing work, so they came when billing did. Compliance was a clean self-contained project. Drips required content and scheduling infrastructure, which took longer to set up. The waitlist survey came late because the Typeform version had been good enough to delay the migration.&lt;/p&gt;
&lt;p&gt;A more disciplined order would have started with activation. If new users were not coming back on day two, none of the other layers were going to help.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/olllo/growth-engineering-experiments/images/olllo-growth-order-of-operations.png&quot; alt=&quot;Two stacked timelines. &amp;#x22;What I shipped&amp;#x22; runs Late Jan virality (referrals, built into Stripe), Late Jan compliance (marketing consent, JWT unsubscribe), Early Feb activation (drip tips + onboarding nudges), Late Mar goodwill (free-forever for early users), Late Mar funnel (waitlist survey replaced Typeform). &amp;#x22;What I would ship next time&amp;#x22; runs Week 1 activation (drips + reminders), Week 2 top of funnel (waitlist + survey), Week 4 compliance (before any send), Week 6 virality (after the curve bends), Week 8 goodwill (for early users).&quot;&gt;
&lt;em&gt;What shipped versus what I would ship next time. The next-time ordering puts activation first; virality only earns its slot after the curve bends without help.&lt;/em&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id=&quot;what-i-would-do-differently&quot;&gt;What I would do differently&lt;/h3&gt;
&lt;p&gt;The next time I build a product like this, I would not invest in scale-up infrastructure until I had verified, hands-on, that the product was meeting users’ needs and that the early audience was generating organic word-of-mouth without my help.&lt;/p&gt;
&lt;p&gt;Concretely, that means:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Running the first ten or twenty users through the product manually, in close one-on-one observation, before any drip system gets built&lt;/li&gt;
&lt;li&gt;Driving week-over-week adoption through that hands-on attention until the curve bends without me touching it&lt;/li&gt;
&lt;li&gt;Only then investing in the scale machinery that would let me step back (drips, waitlists, surveys, and, last of all, virality)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Growth infrastructure is a multiplier. Applied to zero demand it produces zero growth; applied to a real signal it compounds. The mistake I made was reaching for the multiplier before the signal was there.&lt;/p&gt;
&lt;p&gt;The cleanest version of this lesson, the one I would tell another founder who asked: &lt;strong&gt;measure with people first, instrumentation second.&lt;/strong&gt; Fifteen-minute calls with the first cohort tell you more in a week than a drip system tells you in three months. The instrumentation has its place, but not before the calls.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id=&quot;what-i-would-defend&quot;&gt;What I would defend&lt;/h3&gt;
&lt;p&gt;The compliance-first instinct.&lt;/p&gt;
&lt;p&gt;Most products build the marketing system, run a campaign, and then bolt on consent management once the legal requirement gets noticed or once a user complaint surfaces. I built the consent system before the campaign system, with tokenized unsubscribe and per-topic preferences in place from the first marketing email I ever sent. The audit trail of consent events meant I could prove, for any subscriber, when they opted in and to what.&lt;/p&gt;
&lt;p&gt;That work was invisible to users (which is exactly what compliance work should be) and would have saved me a hard problem if olllo had ever scaled to a place where data subject requests started arriving. It was also cheap to do up front and expensive to retrofit. The instinct to build compliance before need is one I would carry into every product going forward.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/olllo/growth-engineering-experiments/images/olllo-growth-instincts-to-defend.png&quot; alt=&quot;Three pieces, in place from day one. JWT-signed unsubscribe links carrying a per-recipient, per-topic token in the URL — one click, one verifiable revocation, no login or plaintext IDs. Marketing Preference makes topics first-class so a user can opt in to product tips and out of newsletters in the same row. Marketing Consent Event records every change as an immutable, append-only, timestamped, attributed event.&quot;&gt;
&lt;em&gt;The three compliance primitives that were in place before the first marketing email got sent. Each was cheap to build day-one and expensive to retrofit later.&lt;/em&gt;&lt;/p&gt;
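&lt;p&gt;For readers who want to see the shape of these primitives, here is a minimal sketch of a signed unsubscribe link feeding an append-only consent log. It is not olllo’s implementation. It assumes an HS256-signed JWT built with Node’s built-in &lt;code&gt;crypto&lt;/code&gt; module, no token expiry, and an in-memory array standing in for the audit table. Every type and function name here is hypothetical.&lt;/p&gt;

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Hypothetical shapes; the real schema is not shown in this case study.
type UnsubscribeClaims = { sub: string; topic: string; iat: number };
type ConsentEvent = {
  subscriberId: string;
  topic: string;
  action: "opt_in" | "opt_out";
  at: string;
  source: string;
};

const b64url = (value: string): string => Buffer.from(value, "utf8").toString("base64url");

function sign(data: string, secret: string): string {
  return createHmac("sha256", secret).update(data).digest("base64url");
}

// Build a compact HS256 JWT carrying the subscriber and the topic.
function createUnsubscribeToken(claims: UnsubscribeClaims, secret: string): string {
  const header = b64url(JSON.stringify({ alg: "HS256", typ: "JWT" }));
  const payload = b64url(JSON.stringify(claims));
  const signingInput = `${header}.${payload}`;
  return `${signingInput}.${sign(signingInput, secret)}`;
}

// Return the claims if the signature checks out, otherwise null.
function verifyUnsubscribeToken(token: string, secret: string): UnsubscribeClaims | null {
  const parts = token.split(".");
  if (parts.length !== 3) return null;
  const expected = Buffer.from(sign(`${parts[0]}.${parts[1]}`, secret));
  const actual = Buffer.from(parts[2]);
  if (expected.length !== actual.length) return null;
  if (!timingSafeEqual(expected, actual)) return null;
  return JSON.parse(Buffer.from(parts[1], "base64url").toString("utf8")) as UnsubscribeClaims;
}

// Append-only log: events are added, never updated or deleted.
const consentLog: ConsentEvent[] = [];

function recordConsent(event: ConsentEvent): void {
  consentLog.push(Object.freeze({ ...event }));
}

// One-click unsubscribe: verify the token, then record the revocation.
function handleUnsubscribe(token: string, secret: string, now: Date): boolean {
  const claims = verifyUnsubscribeToken(token, secret);
  if (claims === null) return false;
  recordConsent({
    subscriberId: claims.sub,
    topic: claims.topic,
    action: "opt_out",
    at: now.toISOString(),
    source: "email_link",
  });
  return true;
}
```

&lt;p&gt;The design choice worth noting is that the link needs no login and exposes no plaintext lookup key beyond what the signed payload carries, and that a revocation is a new event rather than an edit to an existing row. That is what makes it possible to answer “when did this subscriber opt in, and to what?” later.&lt;/p&gt;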
&lt;hr&gt;
&lt;h3 id=&quot;the-honest-takeaway&quot;&gt;The honest takeaway&lt;/h3&gt;
&lt;p&gt;Growth infrastructure measures and amplifies. It does not generate the underlying habit. The temptation in solo building is to mistake the act of building growth machinery for the act of growing, because the machinery is concrete and visible while the underlying habit is abstract and uncomfortable.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/olllo/growth-engineering-experiments/images/olllo-growth-demand.png&quot; alt=&quot;Two side-by-side equations. Left, &amp;#x22;what I had&amp;#x22;: signal (nil, 0) × infrastructure (5) = growth (nil, 0), labelled &amp;#x22;thin demand × five layers of infrastructure = thin signal, instrumented precisely.&amp;#x22; Right, &amp;#x22;what infrastructure is for&amp;#x22;: signal (4) × infrastructure (5) = growth (20), labelled &amp;#x22;real demand × the same five layers = compounding. Drips bring people back. Referrals reach new ones. Goodwill keeps them.&amp;#x22; Underneath: &amp;#x22;Measure with people first, instrumentation second.&amp;#x22;&quot;&gt;
&lt;em&gt;Growth = signal × infrastructure. Multiply five layers by zero and you still get zero, instrumented precisely.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;I felt that temptation. Every drip email I shipped felt like progress. Every referral mechanic felt like traction. Every survey response felt like signal. None of it was wrong, and none of it was the bottleneck.&lt;/p&gt;
&lt;p&gt;The bottleneck was upstream. The post-mortem covers what was upstream and what it taught me. This case study is the instrumentation that surfaced the conclusion. Without the growth stack, I would have taken longer to read the demand signal accurately. With the growth stack, the read was unambiguous, and the decision to stop was easier to make.&lt;/p&gt;
&lt;p&gt;That is worth something. It is just not what I built the stack to be worth.&lt;/p&gt;</content:encoded></item><item><title>Design System Across Web + Native</title><link>https://hypoth.ai/olllo/design-system-across-web-and-native</link><guid isPermaLink="true">https://hypoth.ai/olllo/design-system-across-web-and-native</guid><description>Keeping a PWA, a Next.js web app, and an Expo native app visually consistent without a design team is a systems problem, not a components problem.</description><pubDate>Wed, 15 Apr 2026 00:00:00 GMT</pubDate><content:encoded>&lt;h2 id=&quot;cross-platform-consistency-is-a-systems-problem-until-its-a-platform-problem&quot;&gt;Cross-Platform Consistency Is a Systems Problem Until It’s a Platform Problem&lt;/h2&gt;
&lt;p&gt;What I learned trying to keep three surfaces consistent without a design team. Why component libraries are no longer enough in the AI era. And the case for what I would build instead, the next time someone hands me a blank monorepo and a Claude API key.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id=&quot;opening&quot;&gt;Opening&lt;/h3&gt;
&lt;p&gt;Cross-platform consistency is a systems problem until it’s a platform problem. That is the short version of what I learned trying to keep three surfaces of olllo visually coherent.&lt;/p&gt;
&lt;p&gt;The longer version is two layers deep, and the case study has to walk both. The first layer is the work itself: shadcn/ui as a primitives base, NativeWind to carry Tailwind syntax to the Expo mobile app, a four-layer component hierarchy enforced by the project constitution, and a Next.js manifest making the web app installable as a PWA. The second layer is the realization that what I had was not a design system. It was a component library wearing the words “design system” on its package.&lt;/p&gt;
&lt;p&gt;That distinction was uncomfortable in a useful way. I had built four design systems before olllo (an Angular system, a web components system, a React system, and one on top of Chakra UI) and across all of them the lesson had been the same. The components are not the system. The patterns are. The instructions for how a button, a header, and a section interact when they appear together are. The tokens for color, spacing, type scale, and motion are. The accessibility contract is. The visualization layer that lets a non-engineer see what the system produces is. A component library is one ingredient. A design system is the recipe.&lt;/p&gt;
&lt;p&gt;I knew this going in. I built a component library anyway, called it a design system, and moved on. Then AI started composing my components, and the gap between what I had and what I needed got loud.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id=&quot;what-i-actually-built&quot;&gt;What I actually built&lt;/h3&gt;
&lt;p&gt;&lt;code&gt;packages/design-system&lt;/code&gt; looks like a design system from the outside. Inside, it is a well-organized component library:&lt;/p&gt;
&lt;pre class=&quot;astro-code github-dark&quot; style=&quot;background-color:#24292e;color:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;plaintext&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span&gt;packages/design-system/&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;├── components/&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;│   ├── ui/         # shadcn/ui primitives (50+ components)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;│   ├── kibo-ui/    # chat &amp;#x26; AI surface components&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;│   ├── forms/      # form field wrappers (FormInputField, FormSelectField)&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;│   └── pricing/    # pricing-related composites&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;├── hooks/&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;├── providers/&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;├── styles/&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;├── lib/&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;├── components.json # shadcn config: New York style, neutral base, CSS variables&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;└── postcss.config.mjs&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Underneath it: shadcn/ui in the New York style, neutral base color, CSS variables for theming, lucide for icons, and a kibo-ui component family pulled in for chat and AI surfaces specifically. The mobile app uses the same Tailwind class vocabulary via NativeWind, with a separate &lt;code&gt;tailwind.config.ts&lt;/code&gt; mirroring the web config where it can. The PWA is the Next.js app with a manifest declared at &lt;code&gt;apps/app/app/manifest.ts&lt;/code&gt;, so “three surfaces” is honest but the third surface is a wrapped second surface.&lt;/p&gt;
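&lt;p&gt;For readers unfamiliar with the convention: in the Next.js App Router, a &lt;code&gt;manifest.ts&lt;/code&gt; file in the app directory exports a function that returns the web app manifest. The sketch below shows the general shape only. Every value is a placeholder rather than olllo’s real manifest, and the local &lt;code&gt;WebAppManifest&lt;/code&gt; type is a stand-in so the example is self-contained.&lt;/p&gt;

```typescript
// Sketch of a file like apps/app/app/manifest.ts. All values are placeholders.
// In a Next.js project the return type would normally be MetadataRoute.Manifest,
// imported from "next"; a local type is used here so the sketch stands alone.
type ManifestIcon = { src: string; sizes: string; type: string };
type WebAppManifest = {
  name: string;
  short_name: string;
  start_url: string;
  display: "standalone" | "browser" | "minimal-ui" | "fullscreen";
  background_color: string;
  theme_color: string;
  icons: ManifestIcon[];
};

// In the real file this function would be the default export.
function manifest(): WebAppManifest {
  return {
    name: "Example App",
    short_name: "Example",
    start_url: "/",
    display: "standalone", // opens without browser chrome once installed
    background_color: "#ffffff",
    theme_color: "#000000",
    icons: [{ src: "/icon-192.png", sizes: "192x192", type: "image/png" }],
  };
}
```

&lt;p&gt;Nothing in this file changes how the app renders. It only tells the browser how to install it, which is why the PWA counts as a wrapped second surface and not a third codebase.&lt;/p&gt;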
&lt;p&gt;Component placement is governed by the four-layer hierarchy from the project constitution: design-system primitives, app-shared components, layout-shell components, and feature-route components. That hierarchy was the closest thing in the project to actual pattern documentation, and it is documented and enforced through speckit (&lt;a href=&quot;/olllo/culture-as-code&quot;&gt;Culture as Code&lt;/a&gt; covers it).&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/olllo/design-system-across-web-and-native/images/olllo-layers.png&quot; alt=&quot;Four stacked layers of the component hierarchy: Layer 01 design-system primitives in packages/design-system/components/ui/, Layer 02 app-shared composed primitives, Layer 03 layout-shell page chrome, Layer 04 feature-route components scoped to a single route.&quot;&gt;
&lt;em&gt;The four-layer hierarchy from the project constitution. The boundary each layer enforces is the only formal pattern documentation the system has.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;What is missing from this picture, viewed against any of the four design systems I had worked on before, is everything that turns a component library into a design system: documented composition patterns (when to use a Card versus an Item versus a Field, and what they should contain), motion tokens and motion guidelines, an opinionated accessibility contract beyond what shadcn ships, a visualization layer that demonstrates patterns rather than individual components, and a place where a non-engineer could review the system’s output without reading code.&lt;/p&gt;
&lt;p&gt;I built none of that. I shipped on what was good enough for one engineer composing components by hand or by prompt, and the case study below is what that decision cost.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/olllo/design-system-across-web-and-native/images/olllo-component-library.png&quot; alt=&quot;A well-organized component library: 50+ shadcn/ui primitives boxed up, with dashed arrows pointing out toward unanswered questions about tokens, patterns, and the accessibility contract.&quot;&gt;
&lt;img src=&quot;/olllo/design-system-across-web-and-native/images/olllo-what-a-design-system-needs.png&quot; alt=&quot;Components plus the layers that constrain them: tokens, patterns, accessibility, and visualization linked around a central components node, with the components node marked as the only piece actually shipped.&quot;&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id=&quot;what-shadcnui-gave-me-and-what-it-didnt&quot;&gt;What shadcn/ui gave me, and what it didn’t&lt;/h3&gt;
&lt;p&gt;shadcn/ui was the right primitives layer for olllo at the moment olllo got built. The flexibility is real (every component lives in your codebase, customizable to the file), the breadth is meaningful (50+ ui primitives plus kibo-ui’s chat and AI components saved me weeks on the assistant surfaces specifically), and the integration with Tailwind and the AI tooling around it was unmatched in early 2026.&lt;/p&gt;
&lt;p&gt;It is also not a design system. Nothing about shadcn/ui tells you when to use a &lt;code&gt;Card&lt;/code&gt; versus an &lt;code&gt;Item&lt;/code&gt; versus a &lt;code&gt;Field&lt;/code&gt; for a list of accomplishments. Nothing about it constrains a button’s size to match the page-header pattern. Nothing about it documents composition. The library hands you components and gets out of the way, which is exactly what makes it useful and exactly what makes it insufficient as the only artifact in the design system slot.&lt;/p&gt;
&lt;p&gt;Would I pick shadcn/ui again? No. The calculus has shifted in two ways since olllo started, and both push toward building something custom.&lt;/p&gt;
&lt;p&gt;The first shift is in what I weight. Long-term stability matters more to me now than it did at the time. shadcn ships components into your codebase that you own, which is good, but the conventions around them keep moving. Tailwind has its own breaking version cycle. The NativeWind plus Tailwind combination on the mobile side adds another moving part. A custom system has a single stability surface, the code I wrote, with no version mismatches between layers and no upstream conventions evolving underneath me.&lt;/p&gt;
&lt;p&gt;The second shift is AI capability. The reason “build from scratch” was prohibitively slow at the start of olllo was that custom design systems are months of repetitive scaffolding work. AI assistance has improved enough that the same work moves significantly faster now. The build-custom path that was infeasible against the product work is feasible today, and would land me in a better long-term place than reaching for shadcn would.&lt;/p&gt;
&lt;p&gt;Net: shadcn was right for where I was. The tradeoff has changed. The next system I build will be one I own end to end, with AI assistance accelerating the construction rather than a third-party library accelerating my dependence.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id=&quot;the-ai-composition-problem&quot;&gt;The AI composition problem&lt;/h3&gt;
&lt;p&gt;The clearest moment I have for this case study is small, specific, and recurring.&lt;/p&gt;
&lt;p&gt;The standard SaaS page-header pattern is a title on the left, optional breadcrumbs above it, and a primary call-to-action button on the far right. Across olllo’s authenticated surface, that pattern appears on every list view and most detail views: Accomplishments, Goals, Reflections, Settings, every one. The button on the far right is the page’s primary action: &lt;em&gt;New Accomplishment&lt;/em&gt;, &lt;em&gt;Add Goal&lt;/em&gt;, &lt;em&gt;Start Reflection&lt;/em&gt;. There is one canonical visual treatment for that button, and there should never be variation.&lt;/p&gt;
&lt;p&gt;Across thirty-eight numbered features, the AI sometimes rendered that button as &lt;code&gt;size=&quot;default&quot;&lt;/code&gt; and sometimes as &lt;code&gt;size=&quot;sm&quot;&lt;/code&gt;. Not because the prompt asked for variation. Not because I wanted variation. The model would pick a size, often the right one, sometimes a smaller one, with no reliable way to predict which.&lt;/p&gt;
&lt;p&gt;I added checks and balances. Component conventions in CLAUDE.md. Examples in the closest spec file. A note in the constitution. Type-level constraints where I could push them down. The variation kept happening.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/olllo/design-system-across-web-and-native/images/olllo-button-drift.png&quot; alt=&quot;Side-by-side capture of two page headers from the olllo app: Accomplishments with size=default (36px) on the left, Goals with size=sm (28px) on the right, both rendered from the same canonical pattern.&quot;&gt;
&lt;em&gt;The variation users feel without naming. The component library allows it; the design system that should have prevented it does not exist.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;The diagnosis is two parts.&lt;/p&gt;
&lt;p&gt;The first part is a failure of the component library. shadcn’s &lt;code&gt;Button&lt;/code&gt; component takes a &lt;code&gt;size&lt;/code&gt; prop with &lt;code&gt;default&lt;/code&gt;, &lt;code&gt;sm&lt;/code&gt;, &lt;code&gt;lg&lt;/code&gt;, and &lt;code&gt;icon&lt;/code&gt; as values, and &lt;em&gt;the component does not encode the page-header pattern&lt;/em&gt;. There is no &lt;code&gt;Button&lt;/code&gt; variant called &lt;code&gt;pageHeaderPrimary&lt;/code&gt; that is locked to the canonical size. The component library is correctly generic and incorrectly silent on the pattern.&lt;/p&gt;
&lt;p&gt;The second part is the AI part, and it is the new part. A solo developer composing components by hand, with a component library and no design system, will be reasonably consistent over time because their hands have a memory the file system doesn’t. A solo developer composing components with an AI assistant has none of that hand-memory advantage. The assistant has an opinion about button size every time it generates a page header, and the opinion drifts. Today’s prompt produces &lt;code&gt;size=&quot;default&quot;&lt;/code&gt;. Next week’s prompt, with no relevant change in context, produces &lt;code&gt;size=&quot;sm&quot;&lt;/code&gt;. The model is not wrong; the model is correctly inferring from a library that does not constrain the choice.&lt;/p&gt;
&lt;p&gt;This is not a shadcn problem. It is a category problem. Component libraries assumed a developer was the constraint on consistency. With AI in the loop, the assistant is making the composition decisions, and a library that does not encode patterns will be composed inconsistently.&lt;/p&gt;
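&lt;p&gt;What “encoding the pattern” could look like, as a minimal sketch: a pattern-level helper that accepts a label and nothing else, so neither a human nor an assistant gets to choose the size. The types below are simplified stand-ins for the real component props, the variant list is a subset, and treating &lt;code&gt;default&lt;/code&gt; as the canonical size is an assumption for illustration.&lt;/p&gt;

```typescript
// The four sizes named above for the Button size prop.
type ButtonSize = "default" | "sm" | "lg" | "icon";

// Simplified stand-in for the real Button props; the variant list is a subset.
type ButtonProps = {
  label: string;
  size: ButtonSize;
  variant: "default" | "outline" | "ghost";
};

// Hypothetical pattern-level input: callers supply a label and nothing else.
type PageHeaderActionInput = { label: string };

// The one canonical treatment, decided once instead of once per prompt.
// Assumption: "default" is the canonical size for the page-header action.
const PAGE_HEADER_ACTION = { size: "default", variant: "default" } as const;

function pageHeaderPrimary(input: PageHeaderActionInput): ButtonProps {
  return { label: input.label, ...PAGE_HEADER_ACTION };
}

const newAccomplishment = pageHeaderPrimary({ label: "New Accomplishment" });
const addGoal = pageHeaderPrimary({ label: "Add Goal" });
```

&lt;p&gt;Because &lt;code&gt;PageHeaderActionInput&lt;/code&gt; has no &lt;code&gt;size&lt;/code&gt; field, passing one in an object literal is a compile-time error. The degree of freedom is removed from the call site instead of being documented and hoped for.&lt;/p&gt;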
&lt;p&gt;The AI era moves the design system requirement from &lt;em&gt;useful&lt;/em&gt; to &lt;em&gt;necessary&lt;/em&gt;. Without one, every prompt is a small bet on whether the model remembers what consistency looks like in your project. Some of those bets land. Enough of them land badly that a careful reader can feel the inconsistency even if they cannot name it.&lt;/p&gt;
&lt;p&gt;That feeling is what users mean when they say a product feels off without being able to point to anything specific. It is the texture of an inconsistent system, and component libraries cannot prevent it on their own.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id=&quot;why-flexibility-is-the-cost&quot;&gt;Why flexibility is the cost&lt;/h3&gt;
&lt;p&gt;The deeper read on this applies to any flexible component library used as the foundation for a consistent product, not just shadcn or Tailwind specifically.&lt;/p&gt;
&lt;p&gt;The more flexible the library, the more variations an AI assistant can choose from on any given prompt. Every prop, every variant, every size, every spacing class is a degree of freedom for the model. A library with five button sizes generates more visual variation than a library with two. A library where Cards can contain anything generates more variation than one with a strict slot pattern. A library where margin can be any of twenty Tailwind classes generates more variation than one with three predefined spacing tokens.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/olllo/design-system-across-web-and-native/images/olllo-flexibility-is-the-cost.png&quot; alt=&quot;A line chart plotting perceived inconsistency against library degrees of freedom. The human-composed line rises gently from low to mid; the AI-composed line tracks it at a tightly-constrained system but climbs steeply through shadcn/Tailwind territory and into &amp;#x22;high&amp;#x22; at fully flexible.&quot;&gt;
&lt;em&gt;AI composition compounds flexibility into inconsistency much faster than a human composer does. The two lines pull apart around shadcn/Tailwind’s degrees of freedom.&lt;/em&gt;&lt;/p&gt;
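&lt;p&gt;A rough way to put numbers on “degrees of freedom”: if each styling decision is treated as an independent choice, the number of distinct renderings available to a composer is the product of the option counts. This is a deliberately crude model, and the &lt;code&gt;Choice&lt;/code&gt; type and &lt;code&gt;possibleRenderings&lt;/code&gt; function are hypothetical. The option counts reuse the examples from the paragraph above.&lt;/p&gt;

```typescript
// Hypothetical model: each styling decision is an independent choice with some number of options.
type Choice = { name: string; options: number };

// Number of distinct renderings if every choice is left open: the product of the option counts.
function possibleRenderings(choices: Choice[]): number {
  return choices.reduce((total, choice) => total * choice.options, 1);
}

// A flexible library: five button sizes, twenty margin classes.
const flexible = possibleRenderings([
  { name: "button size", options: 5 },
  { name: "margin class", options: 20 },
]);

// A constrained system: two button sizes, three spacing tokens.
const constrained = possibleRenderings([
  { name: "button size", options: 2 },
  { name: "spacing token", options: 3 },
]);
```

&lt;p&gt;With just two decisions, the flexible library allows 100 combinations and the constrained one allows 6. Real components carry far more than two decisions, so the gap widens quickly.&lt;/p&gt;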
&lt;p&gt;This is exactly why people love shadcn and Tailwind. The flexibility is the feature. Pre-AI, that flexibility let solo developers ship fast and tailor everything. In the AI tooling era, the same flexibility is what makes v0, Lovable, Bolt, and similar generators work at all: the model can satisfy almost any prompt because the underlying primitives can be assembled into almost any output.&lt;/p&gt;
&lt;p&gt;The same property that makes a library good for AI tools that build is what makes it bad for AI tools that compose inside an existing product. When the goal is an opinionated UI driving consistent feel across forty-plus surfaces, flexibility is the enemy. The best design systems are the ones with the most constraints: one right way to render a page header, one right way to lay out a card, one right way to space a form. Constraints are how the system stays the system across hundreds of compositions.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/olllo/design-system-across-web-and-native/images/olllo-build-vs-compose.png&quot; alt=&quot;Two cards labeled BUILD MODE (v0, Lovable, Bolt — flexibility lets the model satisfy any prompt) and COMPOSE MODE (inside an opinionated product — flexibility is the source of drift), under the headline &amp;#x22;the same property that makes a library good for tools that build is what makes it bad for tools that compose inside an existing product.&amp;#x22;&quot;&gt;
&lt;em&gt;Build mode and compose mode want opposite properties from the same primitives.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;shadcn and Tailwind sit at exactly the wrong end of that spectrum for the consistency goal. That is not a critique of the libraries; it is a recognition that the same primitives used in two modes (build a thing fast, or compose inside an existing thing consistently) require opposite properties.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id=&quot;the-platform-problem-reveals-itself&quot;&gt;The platform problem reveals itself&lt;/h3&gt;
&lt;p&gt;Even if every component had been perfectly consistent across the codebase, cross-platform consistency would still have been the wrong goal in places.&lt;/p&gt;
&lt;p&gt;NativeWind let me carry Tailwind class syntax into the Expo mobile app, which made styling cheap to author. What it did not carry was platform conventions. iOS users expect a sheet to slide up from the bottom with a specific easing curve, dismiss with a specific gesture, and use the system’s blur and depth conventions. Android users expect different defaults. A web user expects neither. Tailwind classes do not translate any of this; they translate visual properties.&lt;/p&gt;
&lt;p&gt;The result was a mobile app that &lt;em&gt;looked&lt;/em&gt; consistent with the web app at the pixel level and &lt;em&gt;felt&lt;/em&gt; slightly off in the hand. Not broken. Not unusable. But the kind of subtle wrongness that native developers spot in a second and that translation-layer apps never quite shake.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;/olllo/architecture-and-technology&quot;&gt;Solo Architecture&lt;/a&gt; covers the broader Expo reconsideration in detail. The design system angle on it is specific: the goal of cross-platform consistency was, in retrospect, the wrong target for half the surface area. Native iOS users do not benefit from a button that looks identical to its web counterpart. They benefit from a button that uses iOS-native press behavior, haptic feedback, and platform-typical visual weight. The cross-platform consistency I was protecting was protecting nobody.&lt;/p&gt;
&lt;p&gt;The right framing, with hindsight: there are surfaces where cross-platform consistency is a feature (brand, copy, identity color), and surfaces where it is a tax (interaction patterns, transitions, gesture vocabulary). A design system that does not distinguish between those surfaces will get both wrong.&lt;/p&gt;
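&lt;p&gt;One way a token layer could make that distinction explicit is to keep brand values in a single shared object and let interaction values vary by platform. The sketch below is an assumption about structure, not something olllo shipped, and every name and value in it is a placeholder.&lt;/p&gt;

```typescript
type Platform = "web" | "ios" | "android";

// Consistent everywhere: brand and identity. Values are placeholders.
const brand = { primaryColor: "#4f46e5", productName: "olllo" } as const;

// Allowed to differ: interaction patterns follow the platform. Values are placeholders.
type InteractionTokens = { sheetPresentation: string; pressFeedback: string };

const interaction: { web: InteractionTokens; ios: InteractionTokens; android: InteractionTokens } = {
  web: { sheetPresentation: "centered-dialog", pressFeedback: "hover-and-focus-ring" },
  ios: { sheetPresentation: "bottom-sheet", pressFeedback: "haptic" },
  android: { sheetPresentation: "material-bottom-sheet", pressFeedback: "ripple" },
};

// Merge the shared brand tokens with the platform-specific interaction tokens.
function resolveTokens(platform: Platform) {
  return { ...brand, ...interaction[platform] };
}

const webTokens = resolveTokens("web");
const iosTokens = resolveTokens("ios");
```

&lt;p&gt;The point of the split is that there is no per-platform override for brand and no shared default for interaction, so the structure itself records which kind of surface each token belongs to.&lt;/p&gt;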
&lt;hr&gt;
&lt;h3 id=&quot;the-storybook-gap&quot;&gt;The Storybook gap&lt;/h3&gt;
&lt;p&gt;Component libraries need a visualization layer. Storybook is the default answer in the React community, and Storybook is its own friction.&lt;/p&gt;
&lt;p&gt;The version compatibility story is the worst part. Major version upgrades break stories, sometimes silently. Addon ecosystems lag the core release schedule. CSF 2 to CSF 3 was not a free migration. A monorepo running Storybook against a Next.js 16 app and a separate Vite-based design system has at least three places where versions can disagree, and they sometimes do.&lt;/p&gt;
&lt;p&gt;I shipped Storybook in &lt;code&gt;apps/storybook&lt;/code&gt; because the alternative was no visualization layer at all. I did not maintain it as actively as the rest of the monorepo. Stories drifted from their components. Some were rewritten on every Storybook upgrade. By the end of the project, Storybook was a graveyard of partly-true documentation, which is worse than no documentation in one specific way: a reader trusts a partly-true Storybook the same way they trust a complete one, and gets misled.&lt;/p&gt;
&lt;p&gt;The lesson is not that Storybook is bad. Storybook solves a real problem and there is no obvious better answer in early 2026. The lesson is that making the visualization layer a separate piece of infrastructure, with its own upgrade cycle, addon catalog, and configuration, is a structural mistake the industry has not yet corrected.&lt;/p&gt;
&lt;p&gt;A design system worthy of the name should not require its visualization layer to be a separate framework with separate breakages. Components, patterns, tokens, accessibility tests, and visual documentation should live in one system that upgrades together.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id=&quot;what-id-take-into-another-product&quot;&gt;What I’d take into another product&lt;/h3&gt;
&lt;p&gt;Build the primitives layer myself, with AI assistance, rather than reaching for shadcn/ui. The build-custom path is feasible today in a way it was not when olllo started. Long-term stability (owning every component, every token, every pattern, with no version mismatches between layers) is worth more to me now than the day-one acceleration shadcn provided.&lt;/p&gt;
&lt;p&gt;Treat the component library as one ingredient, not the whole system. Document composition patterns explicitly, in a place AI assistants will read on every session. CLAUDE.md is one such place; a richer version would be a &lt;code&gt;patterns.md&lt;/code&gt; per package, with concrete examples of what good composition looks like and what to avoid.&lt;/p&gt;
&lt;p&gt;Distinguish cross-platform consistency from cross-platform translation. Brand and identity should be consistent across surfaces. Interaction patterns should follow platform convention. Carry Tailwind syntax across surfaces if it helps, but stop pretending the result is the same product everywhere.&lt;/p&gt;
&lt;p&gt;Skip Storybook unless and until something fundamental changes about how it is maintained. Use a smaller, scoped solution (a single docs route in the design-system package, generated from real code, updated at build time) until the industry produces a unified visualization layer that does not break independently of the components it documents.&lt;/p&gt;
&lt;p&gt;The thing I would not bring forward at all is the unspoken belief that a component library plus tokens equals a design system. It does not, and the next product I build will be honest about that from day one.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id=&quot;the-future-id-build-toward&quot;&gt;The future I’d build toward&lt;/h3&gt;
&lt;p&gt;The future of design systems in the AI era is a single integrated system, not a patchwork of separate ones.&lt;/p&gt;
&lt;p&gt;Today the responsible solo setup glues several pieces together: a primitives library, a token layer, separate accessibility testing, pattern documentation in CLAUDE.md or similar, Storybook for visual review, a motion library, and the developer’s hand-memory holding it all together. Each piece has its own upgrade cycle and its own way of being out of date. The cracks between them are where AI composes inconsistently.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/olllo/design-system-across-web-and-native/images/olllo-system-ai-era-needs.png&quot; alt=&quot;A two-column list of the seven pieces that glue together to approximate a design system today: primitives library (shadcn/ui), token layer (Tailwind config), accessibility tests (separate), pattern docs (CLAUDE.md), visualization (Storybook), motion library (separate), hand-memory (the developer). Footer: 7 upgrade cycles · 7 ways to be out of date.&quot;&gt;
&lt;em&gt;Seven pieces, seven upgrade cycles. The cracks between them are where AI composes inconsistently.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;The system I would build would unify these into a single source of truth that both humans and AI assistants can read and respect:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Components, tokens, and patterns in one package, versioned together&lt;/li&gt;
&lt;li&gt;Composition patterns expressed as types, so the AI sees the constraint and the human sees the demonstration&lt;/li&gt;
&lt;li&gt;Accessibility contracts encoded into component types, not retroactively tested&lt;/li&gt;
&lt;li&gt;A built-in visualization layer generated from the same source as the components, with no separate Storybook to drift&lt;/li&gt;
&lt;li&gt;A pattern enforcement layer that catches “wrong size for this context” the way TypeScript catches “wrong type for this argument”&lt;/li&gt;
&lt;/ul&gt;
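&lt;p&gt;Parts of this list can be approximated today with plain TypeScript discriminated unions. The sketch below is an assumption-laden illustration, not an existing library: &lt;code&gt;ButtonSpec&lt;/code&gt;, &lt;code&gt;ButtonContent&lt;/code&gt;, &lt;code&gt;describeButton&lt;/code&gt;, the three contexts, and the class strings are all hypothetical. It shows a size rule per context and an accessibility contract for icon-only buttons, both checked by the compiler. The &lt;code&gt;@ts-expect-error&lt;/code&gt; directive is a real TypeScript feature that fails the build if the next line does not produce a type error.&lt;/p&gt;

```typescript
// All names here (ButtonSpec, ButtonContent, describeButton) are hypothetical.

// Composition pattern as a type: each context lists the only sizes it permits.
type ButtonSpec =
  | { context: "toolbar"; size: "sm" }
  | { context: "form"; size: "sm" | "md" }
  | { context: "hero"; size: "lg" };

// Accessibility contract as a type: icon-only content must carry an ariaLabel.
type ButtonContent =
  | { kind: "text"; label: string }
  | { kind: "icon"; icon: string; ariaLabel: string };

// Placeholder utility classes for each size.
const sizeClasses = {
  sm: "h-8 px-3 text-sm",
  md: "h-10 px-4 text-base",
  lg: "h-12 px-6 text-lg",
};

// Returns the attributes a renderer would need; no UI framework is assumed.
function describeButton(spec: ButtonSpec, content: ButtonContent) {
  const accessibleName = content.kind === "icon" ? content.ariaLabel : content.label;
  return {
    className: `btn-${spec.context} ${sizeClasses[spec.size]}`,
    accessibleName,
  };
}

// Accepted: "md" is a permitted size inside a form.
const save = describeButton(
  { context: "form", size: "md" },
  { kind: "text", label: "Save" },
);

// Wrong size for this context: without the directive, this line does not compile.
// @ts-expect-error "lg" is not a permitted size in the "toolbar" context
describeButton({ context: "toolbar", size: "lg" }, { kind: "text", label: "Undo" });

// Missing accessible name: without the directive, this line does not compile.
// @ts-expect-error icon-only content requires "ariaLabel"
describeButton({ context: "toolbar", size: "sm" }, { kind: "icon", icon: "trash" });
```

&lt;p&gt;The limits are worth stating. This only constrains call sites that go through the typed function, and it says nothing about visual documentation or versioning. It does show that the “wrong size for this context” check can be the same kind of error as a wrong argument type.&lt;/p&gt;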
&lt;p&gt;The pieces exist in fragments today. Stitching them together is what the AI era is asking for. Someone will build it, because the cost of not having it compounds with every prompt that adds a small inconsistency to a product that is supposed to feel coherent.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/olllo/design-system-across-web-and-native/images/olllo-unified-system-architecture.png&quot; alt=&quot;Concentric rings labeled, from the center out: Tokens (one source), Components, Motion, Accessibility, Patterns, Visualization, all enclosed by a dashed arc reading &amp;#x22;one version · ships together&amp;#x22; and a single version number underneath.&quot;&gt;
&lt;em&gt;The system the AI era needs. One source of truth, one version, one place where humans and assistants both go to learn what consistency looks like in this product.&lt;/em&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id=&quot;where-this-leaves-us&quot;&gt;Where this leaves us&lt;/h3&gt;
&lt;p&gt;A design system in the AI era is no longer optional infrastructure for products that want to feel coherent. The composition decisions are happening whether or not the system encodes them; the question is whether they happen with constraints or with drift.&lt;/p&gt;
&lt;p&gt;Component libraries solved a real problem in the developer-as-composer era. The ground has shifted underneath us, and the libraries have not caught up. The interim discipline (explicit composition patterns in places AI will read, treating consistency as a contract instead of a hope, distinguishing the surfaces where consistency helps from the ones where it hurts) is the work of bridging the gap until the industry produces a system that closes it.&lt;/p&gt;
&lt;p&gt;What I built for olllo was the best I could ship solo in the time I had. What I learned building it is the more interesting half of this case study, and the part I would carry into anything I build next.&lt;/p&gt;</content:encoded></item></channel></rss>