AI-Powered · Shipped to Prod · Concept to Code · Agent Native · Startups

SG Resale: Agent-Native Retail

Shop by vibe, not by database query.

Role: Web App · Full-Stack AI Engineer
Status: Shipped
Year: 2025–2026
Stack: Gemini · pgvector · SSE · Cloud SQL
01 · The Problem

Secondhand platforms force you to search like a database, not like a person.

Nobody shops by typing “blue jeans size 32.” They shop by vibe — “y2k denim under $80” or “quiet luxury with a worn-in feel.” The gap between how people talk about fashion and how resale platforms let you search for it is the entire product opportunity.

02 · The Thesis

The agent loop is the experience.

Google Gemini runs in an agent loop: each user message triggers a sequence of tool calls until the agent decides to respond. Hybrid model routing splits the work. Flash handles tool selection (fast, and 10× cheaper), Pro is reserved for synthesis (quality), and rate-limit quarantine forces a Flash fallback under pressure. The six decisions below shaped how the agent and the user share that loop.

Agent loop: Input (user message) → Tool selection (10× cheaper, fast) → Hybrid retrieval (19-dim filter + pgvector search) → Output (streamed response over SSE)
The agent decides what to do. The rules decide what it is allowed to do.
Operating principle
03 · Key Design Decisions

Six choices that shaped the agent.

Decision 01

Confidence evaluation is rule-based, not model-based

Deterministic logic evaluates whether the agent should act or ask for clarification — not a second model call assessing its own confidence.

Why it matters

A model evaluating its own confidence is circular and unauditable. Deterministic rules are debuggable, testable, and add no model-call latency.
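
A minimal sketch of what a rule-based act-or-clarify check could look like. The `ParsedQuery` shape, the `decideAction` function, and the one-filter minimum are hypothetical; the ambiguous terms are the examples named under "What Happened." Because the function is pure, every rule can be covered by an ordinary unit test.

```typescript
// Hypothetical shape of a parsed user query (not the project's actual types).
interface ParsedQuery {
  terms: string[]; // lowercase tokens from the user message
  filters: Record<string, string>; // filters extracted so far, e.g. { category: "jeans" }
}

type Decision = { kind: "act" } | { kind: "clarify"; reason: string };

// Terms with no single correct reading (examples taken from this case study).
const AMBIGUOUS_TERMS = new Set(["vintage", "designer", "affordable"]);

// Assumed rule: a query with no extracted filters is too vague to search.
const MIN_FILTERS = 1;

// Pure and deterministic: the same query always yields the same decision,
// so each rule can be tested and logged without a second model call.
function decideAction(query: ParsedQuery): Decision {
  const ambiguous = query.terms.filter((t) => AMBIGUOUS_TERMS.has(t));
  if (ambiguous.length > 0) {
    return { kind: "clarify", reason: `ambiguous term: ${ambiguous[0]}` };
  }
  if (Object.keys(query.filters).length < MIN_FILTERS) {
    return { kind: "clarify", reason: "no usable filters extracted" };
  }
  return { kind: "act" };
}

// decideAction({ terms: ["vintage", "jacket"], filters: { category: "jacket" } })
// returns { kind: "clarify", reason: "ambiguous term: vintage" }
```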

Decision 02

User Supremacy

The agent can clear filters it set, but cannot override filters the user set. Every filter tracks provenance (agent | user | collaborative) so this rule is enforceable at the system level.

Why it matters

An agent that overrides your choices is not helpful — it is adversarial. Filter provenance makes this a system guarantee, not a prompt instruction.
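
One way filter provenance could be enforced in code rather than in the prompt, sketched under assumptions: the `FilterState` class, one filter per dimension, and treating `collaborative` filters like user filters are all hypothetical choices, not the project's actual implementation.

```typescript
// Provenance values as named in the case study.
type Provenance = "agent" | "user" | "collaborative";

interface Filter {
  dimension: string; // e.g. "material"
  value: string; // e.g. "percale"
  provenance: Provenance;
}

// Hypothetical per-session filter store, one filter per dimension.
class FilterState {
  private filters = new Map<string, Filter>();

  set(filter: Filter): void {
    const existing = this.filters.get(filter.dimension);
    // The agent may not overwrite a filter it does not own.
    // Assumption: "collaborative" filters are protected like user filters.
    if (filter.provenance === "agent" && existing && existing.provenance !== "agent") {
      return;
    }
    this.filters.set(filter.dimension, filter);
  }

  // Agent-initiated clear: succeeds only for filters the agent set.
  clearByAgent(dimension: string): boolean {
    const existing = this.filters.get(dimension);
    if (!existing || existing.provenance !== "agent") {
      return false;
    }
    return this.filters.delete(dimension);
  }

  get(dimension: string): Filter | undefined {
    return this.filters.get(dimension);
  }
}
```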

Decision 03

$300 confirmation gate — with a compare-list bypass

Any cart addition over $300 requires explicit user confirmation. Items already in the compare list skip the gate — they’ve been reviewed in detail, so the friction would be redundant.

Why it matters

Friction belongs where review hasn’t happened, not as a blanket rule. The threshold balances convenience (most resale items are under $300) with risk; the bypass keeps the gate from punishing the user for thinking ahead.
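
The gate reduces to a small pure function. This sketch assumes prices in whole dollars, reads "over $300" as strictly greater than 300, and uses a hypothetical `checkCartGate` name; a `"confirm"` result would be surfaced to the user as a confirmation prompt.

```typescript
// Threshold from the case study; whole-dollar prices are an assumption.
const CONFIRMATION_THRESHOLD = 300;

interface CartRequest {
  productId: string;
  price: number;
}

type GateResult = "add" | "confirm";

// Hypothetical gate: may the agent add directly, or must it ask first?
function checkCartGate(req: CartRequest, compareList: ReadonlySet<string>): GateResult {
  // Items already reviewed in the compare list skip the gate.
  if (compareList.has(req.productId)) return "add";
  // "Over $300" is read as strictly greater than 300.
  return req.price > CONFIRMATION_THRESHOLD ? "confirm" : "add";
}
```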

Decision 04

Hybrid model routing (Flash + Pro)

Flash for tool selection (cheap, fast). Pro for synthesis and complex reasoning (quality). Automatic fallback with rate-limit quarantine prevents cascading failures.

Why it matters

Pro for everything costs 10× more and adds latency on tool calls where quality does not matter. Flash for everything degrades synthesis quality. The hybrid approach optimizes cost and quality per task type.
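
A sketch of routing with rate-limit quarantine, assuming a fixed quarantine window and a caller that reports rate-limit errors (for example HTTP 429) back to the router. `ModelRouter`, `reportRateLimit`, and the tier names are hypothetical, and no Gemini SDK calls are shown. Passing `now` explicitly keeps the router deterministic and easy to test.

```typescript
type ModelTier = "flash" | "pro";
type TaskType = "tool_selection" | "synthesis";

// Hypothetical router. Tier names are labels, not Gemini model IDs.
class ModelRouter {
  private readonly quarantineMs: number;
  private proQuarantinedUntil = 0; // epoch ms; 0 means Pro is available

  constructor(quarantineMs: number) {
    this.quarantineMs = quarantineMs;
  }

  // Flash always handles tool selection; synthesis goes to Pro
  // unless Pro is quarantined, in which case it falls back to Flash.
  route(task: TaskType, now: number): ModelTier {
    if (task === "tool_selection") return "flash";
    return now < this.proQuarantinedUntil ? "flash" : "pro";
  }

  // Invoked when a Pro request fails with a rate-limit error.
  // Quarantining Pro for a window stops every request from retrying it at once.
  reportRateLimit(now: number): void {
    this.proQuarantinedUntil = now + this.quarantineMs;
  }
}
```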

Decision 05

Hard iteration cap of 10 tool calls per query

No runaway loops possible. The agent must resolve within 10 tool calls or surface what it has.

Why it matters

Unbounded agent loops are unpredictable and expensive. A cap forces efficient tool use and puts a hard upper bound on response time.
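
A synchronous sketch of the capped loop. `ModelFn`, `ToolFn`, and the fallback message are hypothetical stand-ins for the real (asynchronous, streaming) model and tool calls; the point is that the cap is a loop bound in code, so no prompt can talk the agent past it.

```typescript
const MAX_TOOL_CALLS = 10;

// Hypothetical model step: request a tool call, or answer the user.
type Step =
  | { type: "tool_call"; name: string; args: Record<string, unknown> }
  | { type: "respond"; text: string };

type ModelFn = (history: string[]) => Step;
type ToolFn = (name: string, args: Record<string, unknown>) => string;

interface LoopResult {
  text: string;
  toolCalls: number;
  capped: boolean; // true if the loop hit MAX_TOOL_CALLS
}

function runAgentLoop(userMessage: string, model: ModelFn, runTool: ToolFn): LoopResult {
  const history: string[] = [`user: ${userMessage}`];
  let toolCalls = 0;
  while (toolCalls < MAX_TOOL_CALLS) {
    const step = model(history);
    if (step.type === "respond") {
      return { text: step.text, toolCalls, capped: false };
    }
    history.push(`tool ${step.name}: ${runTool(step.name, step.args)}`);
    toolCalls++;
  }
  // Cap reached: surface what was gathered instead of calling the model again.
  return { text: `Stopped after ${toolCalls} tool calls; showing results so far.`, toolCalls, capped: true };
}
```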

Decision 06

Emergence testing over unit testing

300+ emergence test cases across 25 suites validate behaviors that emerge from tool composition — not individual tool correctness. Tests like "a vague query should trigger clarification" verify system-level outcomes.

Why it matters

Unit tests verify tools work in isolation. Emergence tests verify the system produces useful outcomes when tools interact — the gap where real bugs hide in agent systems.
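
A sketch of how an emergence test case might be expressed: each case pairs a query with an assertion on the observable outcome of a whole turn, not on any one tool. `TurnOutcome`, `runSuite`, and the `search` tool name are hypothetical, and the agent is passed in as a function so the suite can run against a stub or the live system.

```typescript
// Observable outcome of one agent turn (hypothetical shape).
interface TurnOutcome {
  askedClarification: boolean;
  toolsCalled: string[];
}

interface EmergenceCase {
  name: string;
  query: string;
  // Assertion over the system-level outcome, not over a single tool.
  expect: (outcome: TurnOutcome) => boolean;
}

const cases: EmergenceCase[] = [
  {
    name: "vague query triggers clarification",
    query: "something nice",
    expect: (o) => o.askedClarification && o.toolsCalled.length === 0,
  },
  {
    name: "specific query searches without asking",
    query: "y2k denim under $80",
    expect: (o) => !o.askedClarification && o.toolsCalled.includes("search"),
  },
];

// Runs every case through the agent and returns the names of failing cases.
function runSuite(agent: (query: string) => TurnOutcome, suite: EmergenceCase[]): string[] {
  return suite.filter((c) => !c.expect(agent(c.query))).map((c) => c.name);
}
```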

04 · What Broke and What It Taught Us

Three production failures, three architectural rewrites.

These aren’t post-mortems for show. Each one rewrote part of the system below the prompt — the parts a model can’t fix on its own.

Failure 01

State Thrashing

What broke

A user asks for "cooling percale sheets." The agent sets filters — material: percale, feel: cooling — runs the search, returns results. The user removes the feel: cooling filter to browse all percale options. On the next message, the agent re-reads its original interpretation from session state and silently re-applies the removed filter. From the user's perspective: "I keep removing this filter and it keeps coming back." The agent was fighting the user for control of the interface.

How it was fixed

The fix was architectural, not a prompt change. Every filter now carries a provenance tag — agent or user — tracked per-filter, per-session. When a user removes an agent-inferred filter, that removal is recorded with user provenance. On the next turn, the agent's context includes the fact that the user explicitly rejected that filter. A skill-level instruction tells the agent: never re-apply a filter the user has removed in this session. The prompt governs the agent's judgment; the code enforces that user-provenance actions are immutable.
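
A sketch of the code-side half of the fix, assuming filters are keyed by dimension and value. `SessionFilters` and its methods are hypothetical: user removals go into a per-session rejected set, agent proposals are checked against that set before they are applied, and `rejectedSummary()` produces the list the agent would see in its context on the next turn.

```typescript
interface Filter {
  dimension: string;
  value: string;
  provenance: "agent" | "user";
}

type FilterKey = { dimension: string; value: string };

// Hypothetical per-session filter state.
class SessionFilters {
  active: Filter[] = [];
  // Filters the user explicitly removed this session, keyed "dimension=value".
  private rejected = new Set<string>();

  private static key(f: FilterKey): string {
    return `${f.dimension}=${f.value}`;
  }

  // User removes a filter: drop it and record the removal as a user action.
  removeByUser(dimension: string, value: string): void {
    this.active = this.active.filter((f) => !(f.dimension === dimension && f.value === value));
    this.rejected.add(SessionFilters.key({ dimension, value }));
  }

  // The agent re-derives filters each turn; user-rejected ones are dropped
  // in code, regardless of what the model proposes.
  applyAgentProposals(proposals: FilterKey[]): Filter[] {
    for (const p of proposals) {
      if (this.rejected.has(SessionFilters.key(p))) continue;
      const exists = this.active.some((f) => f.dimension === p.dimension && f.value === p.value);
      if (!exists) this.active.push({ dimension: p.dimension, value: p.value, provenance: "agent" });
    }
    return this.active;
  }

  // Passed into the agent's context so it can see what the user rejected.
  rejectedSummary(): string[] {
    return Array.from(this.rejected);
  }
}
```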

Failure 02

The Hallucination of Emptiness

What broke

On the first turn of a new session, the UI shows default products — bestsellers, trending items. The agent had no way to distinguish between "I searched and found these results" and "the user is looking at default catalog items." It would sometimes treat defaults as if the user had already searched, or claim "I didn't find any results" when results were displayed — because they came from the default catalog load, not the agent's own search tool. The agent was confidently narrating a version of reality that didn't match what the user was seeing.

How it was fixed

The SessionState object now tracks results provenance — whether displayed items came from an agent search, a default catalog load, or a user-initiated filter change. The agent's context includes this metadata so it knows the difference between "results I found for you" and "products that were already on screen." Tool results were also compressed from 30+ fields per product down to 5 (id, name, price, match_score, match_summary), with a separate get_product_details tool for deep dives. This mirrors how humans scan results: skim the list, then click into what looks promising.
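
A sketch of both parts of the fix. The `ResultsSource` values mirror the three origins described above, and the five `ProductSummary` fields are the ones listed; the abbreviated `Product` type, the `summarize` and `resultsContext` helpers, and the context wording are hypothetical.

```typescript
// Where the displayed results came from (the three origins described above).
type ResultsSource = "agent_search" | "default_catalog" | "user_filter_change";

// Abbreviated, hypothetical product record; the real one has 30+ fields.
interface Product {
  id: string;
  name: string;
  price: number;
  brand: string;
  description: string;
}

// The five fields the agent sees per result.
interface ProductSummary {
  id: string;
  name: string;
  price: number;
  match_score: number;
  match_summary: string;
}

function summarize(p: Product, matchScore: number, matchSummary: string): ProductSummary {
  return { id: p.id, name: p.name, price: p.price, match_score: matchScore, match_summary: matchSummary };
}

// Context line that tells the agent whose results are on screen.
function resultsContext(source: ResultsSource, items: ProductSummary[]): string {
  const origin =
    source === "agent_search"
      ? "Results from your own search"
      : source === "default_catalog"
        ? "Default catalog items already on screen; no search has run yet"
        : "Results after the user changed filters";
  return `${origin} (${items.length} shown)`;
}
```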

Failure 03

Context Fatigue Loops

What broke

Gemini 2.5 Pro generates ~730 thinking tokens per call, with no API control to disable them. Every agent call, even extracting "red dress under $100" into a tool schema, had a 6–17 second time-to-first-token. The agent loop makes at least two LLM calls, so users waited 12–34 seconds before seeing anything. Worse, as tool results accumulated in the context window, Pro lost track of its completion instructions: it would acknowledge in its reasoning that it had enough results, then immediately re-search instead of responding. In the worst cases it looped 3–4 times, and a query that should take 3–5 seconds took 30–60+ seconds.

How it was fixed

Pro was removed from tool-call iterations and reserved for synthesis only. Tool-call phases now run Gemini 3 Flash with thinking_level: MINIMAL (entity extraction, schema mapping); synthesis runs Pro at thinking_level: LOW, with rate-limit quarantine forcing Flash fallback under pressure. End-to-end latency dropped from 30–60s to 3.5–5s. The right model for an agent loop isn't the smartest model — it's the model whose reasoning overhead you can actually control.
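
The per-phase settings above, written as a config map. The model strings are placeholders rather than real Gemini model IDs, and `thinkingLevel` mirrors this page's `thinking_level` wording; how these values are passed to the Gemini SDK is not shown and should be checked against the API reference.

```typescript
type Phase = "tool_call" | "synthesis";
// Levels as named on this page; verify exact enum values in the Gemini API reference.
type ThinkingLevel = "MINIMAL" | "LOW";

interface PhaseConfig {
  model: string; // placeholder, not a real model ID
  thinkingLevel: ThinkingLevel;
}

// Minimal reasoning overhead for extraction and schema mapping;
// a larger reasoning budget only where the user reads the output.
const PHASE_CONFIG: Record<Phase, PhaseConfig> = {
  tool_call: { model: "flash-placeholder", thinkingLevel: "MINIMAL" },
  synthesis: { model: "pro-placeholder", thinkingLevel: "LOW" },
};

// Tool-call iterations use the Flash config; only the final synthesis step uses Pro.
function configForStep(isFinalSynthesis: boolean): PhaseConfig {
  return isFinalSynthesis ? PHASE_CONFIG.synthesis : PHASE_CONFIG.tool_call;
}
```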

05 · Glass Box UI

An agent that helps and a user who can override it.

Glass Box UI: the dresses filter chip shows its provenance ('inferred from casey afternoon work attire'), marking it as inferred rather than user-set, alongside the agent's natural-language response and three follow-up suggestions.

Every filter carries a provenance tag — agent or user — visually distinguished in the UI. Click an agent-set filter and ownership transfers to user provenance; the agent sees this and won't touch it again. Remove an agent filter twice in a session and similar suggestions get suppressed.
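
A sketch of the two Glass Box rules, under assumptions: `GlassBoxFilters` is hypothetical, and "similar suggestions" is interpreted here as suggestions on the same filter dimension, with removals counted per session.

```typescript
type Provenance = "agent" | "user";

interface Chip {
  dimension: string;
  value: string;
  provenance: Provenance;
}

// From the case study: removing an agent filter twice suppresses suggestions.
const SUPPRESS_AFTER_REMOVALS = 2;

// Hypothetical UI-side filter model.
class GlassBoxFilters {
  chips: Chip[] = [];
  // Agent-chip removals per dimension in this session (assumed notion of "similar").
  private agentRemovals = new Map<string, number>();

  // Clicking an agent chip transfers ownership to the user.
  click(dimension: string): void {
    const chip = this.chips.find((c) => c.dimension === dimension);
    if (chip && chip.provenance === "agent") chip.provenance = "user";
  }

  remove(dimension: string): void {
    const chip = this.chips.find((c) => c.dimension === dimension);
    if (!chip) return;
    if (chip.provenance === "agent") {
      this.agentRemovals.set(dimension, (this.agentRemovals.get(dimension) ?? 0) + 1);
    }
    this.chips = this.chips.filter((c) => c !== chip);
  }

  // Whether the agent may still suggest filters on this dimension.
  maySuggest(dimension: string): boolean {
    return (this.agentRemovals.get(dimension) ?? 0) < SUPPRESS_AFTER_REMOVALS;
  }
}
```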

The moment you add friction to overriding the agent, you've broken the trust contract.
Trust contract
06 · What Happened

Shipped and live.

The platform runs a 31-tool agent system, built from 25,295 lines of TypeScript backend code and 27 React components, and indexes 4,000+ resale products with zero manual curation.

The agent handles natural-language queries end-to-end: parsing intent, selecting tools via hybrid Flash/Pro routing, applying filters across 19 semantic dimensions, evaluating confidence through deterministic rules, and synthesizing responses with source attribution. Ambiguous terms like vintage, designer, and affordable trigger clarification rather than a guess.

The system includes durable checkpointing (state persists after every tool call), SSE streaming with heartbeat keepalive, and a Glass Box UI that shows users exactly which filters the agent inferred versus which they set themselves.
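
For the SSE side, a minimal sketch of event framing and a heartbeat frame, following the WHATWG server-sent events format: a multi-line payload becomes several `data:` lines, and a line starting with `:` is a comment that clients ignore but that keeps an idle connection from timing out. The helper names and the `token` event name are hypothetical; a server would write `sseHeartbeat()` on a timer between real events.

```typescript
// One SSE event: optional "event:" line, one "data:" line per payload line,
// terminated by a blank line.
function sseEvent(data: string, event?: string): string {
  const lines = data.split("\n").map((l) => `data: ${l}`);
  if (event) lines.unshift(`event: ${event}`);
  return lines.join("\n") + "\n\n";
}

// Comment frame: ignored by EventSource clients, but it keeps bytes flowing
// so proxies and load balancers do not close the stream as idle.
function sseHeartbeat(): string {
  return ": heartbeat\n\n";
}
```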

31-tool agent system
25,295 lines of TS backend
27 React components
4,000+ indexed products
19 filter dimensions
3.5–5s end-to-end latency

Building something complex? Let's talk.

Looking for my next role designing and building AI products.


Designed & Built by Drew Miller

© 2026. Version 3.3