
SG Resale: Agent-Native Retail

Shop by vibe, not by database query

Web App · Full-Stack AI Product Engineer
Shipped

The Problem

Secondhand platforms force you to search like a database, not like a person. Nobody shops by typing "blue jeans size 32." They shop by vibe: "y2k denim under $80" or "quiet luxury with a worn-in feel." The gap between how people talk about fashion and how resale platforms let you search for it is the entire product opportunity.

What It Does

1. Natural language shopping across 4,000 scraped Depop products
2. 41-tool agent system across 8 categories — search, filter, sort, compare, cart, inventory, style, and conversation
3. Hybrid pgvector semantic search + structured filter queries across 19 dimensions: era, condition, brand tier, fit, aesthetic, price, color, occasion, material, pattern, and more
4. Real-time SSE streaming with dual-path routing and heartbeat keepalive — the POST request is the stream (see the sketch after this list)
5. Filter provenance tracking: every active filter records whether it was set by the agent, the user, or negotiated collaboratively
6. Durable checkpointing — agent state persisted to Cloud SQL after every tool call with optimistic locking, so the agent resumes cleanly after instance rotation
7. Maintains full conversation context across multi-turn sessions, not just single queries
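
Item 4 describes the streaming path. Below is a minimal sketch of that pattern, assuming an Express server (the write-up does not name the framework): the POST handler itself holds the SSE stream open and writes a heartbeat comment on an interval so idle proxies do not close the connection.

```typescript
// Sketch only: Express, the route, and the event names are assumptions, not the project's code.
import express from "express";

const app = express();
app.use(express.json());

app.post("/api/chat", async (req, res) => {
  // The POST request is the stream: respond with SSE headers instead of JSON.
  res.writeHead(200, {
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-cache",
    Connection: "keep-alive",
  });

  const send = (event: string, data: unknown) =>
    res.write(`event: ${event}\ndata: ${JSON.stringify(data)}\n\n`);

  // Heartbeat keepalive: a comment frame every 15s so proxies don't drop the idle socket.
  const heartbeat = setInterval(() => res.write(": heartbeat\n\n"), 15_000);

  try {
    // Placeholder for the agent loop; each tool call or token chunk becomes an SSE event.
    for (const chunk of ["Searching…", "Found 12 matches", "Here are three picks"]) {
      send("message", { text: chunk });
      await new Promise((r) => setTimeout(r, 200));
    }
    send("done", {});
  } finally {
    clearInterval(heartbeat);
    res.end();
  }
});

app.listen(3000);
```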

What the AI Does

This is not AI-assisted search. The AI is the entire shopping experience. Google Gemini runs in an agent loop where each user message triggers a sequence of tool calls until the agent decides to respond.

The agent uses hybrid model routing: Flash handles tool selection (cheap, fast, roughly a tenth of Pro's cost), Pro handles synthesis and complex reasoning (quality). Automatic fallback with rate-limit quarantine ensures the experience degrades gracefully, not catastrophically.

Complex behaviors emerge from simple tools — "compare the cheapest and most expensive" triggers search, sort, add-to-compare, and get-comparison in sequence. The agent self-evaluates confidence before acting: low confidence triggers a clarifying question, not a guess.

Every filter tracks its source through a Glass Box UI: was it inferred by the agent, set by the user, or negotiated collaboratively? This is not cosmetic — it determines clearing behavior. User Supremacy means agent-cleared filters behave differently from user-cleared filters. The user can always see what the agent decided and why.

The agent decides what to do. The rules decide what it is allowed to do.

Key Design Decisions

Confidence evaluation is rule-based, not model-based

Deterministic logic evaluates whether the agent should act or ask for clarification — not a second model call assessing its own confidence.

A model evaluating its own confidence is circular and unauditable. Deterministic rules are debuggable, testable, and add zero latency.
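
A minimal sketch of what a deterministic confidence gate can look like. The signals and thresholds below are illustrative assumptions, not the shipped rule set.

```typescript
// Illustrative rule set: the signals and thresholds are assumptions, not the production rules.
interface InterpretedQuery {
  matchedFilters: number;           // how many filter dimensions the query mapped onto
  ambiguousTerms: string[];         // e.g. "vintage", "designer", "affordable"
  priceMentionedButUnparsed: boolean;
}

type Decision = { action: "act" } | { action: "clarify"; question: string };

function evaluateConfidence(q: InterpretedQuery): Decision {
  // Each rule is a plain predicate: debuggable, testable, zero extra latency.
  if (q.ambiguousTerms.length > 0) {
    return {
      action: "clarify",
      question: `When you say "${q.ambiguousTerms[0]}", what do you have in mind?`,
    };
  }
  if (q.priceMentionedButUnparsed) {
    return { action: "clarify", question: "What price range are you thinking?" };
  }
  if (q.matchedFilters === 0) {
    return { action: "clarify", question: "Can you tell me a bit more about the look you want?" };
  }
  return { action: "act" };
}
```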

User Supremacy

The agent can clear filters it set, but cannot override filters the user set. Every filter tracks provenance (agent | user | collaborative) so this rule is enforceable at the system level.

An agent that overrides your choices is not helpful — it is adversarial. Filter provenance makes this a system guarantee, not a prompt instruction.
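
A minimal sketch of how provenance turns User Supremacy into a code-level guarantee. The types and the clearFilter helper are illustrative, not the production filter model.

```typescript
// Illustrative types; the real filter model has 19 dimensions and richer metadata.
type Provenance = "agent" | "user" | "collaborative";

interface ActiveFilter {
  dimension: string;          // e.g. "era", "material", "priceMax"
  value: string | number;
  provenance: Provenance;
}

// Clearing is gated by provenance: the user can clear anything,
// the agent can only clear filters it set itself.
function clearFilter(
  filters: ActiveFilter[],
  dimension: string,
  actor: "agent" | "user"
): ActiveFilter[] {
  return filters.filter((f) => {
    if (f.dimension !== dimension) return true;   // different dimension: keep
    if (actor === "user") return false;           // users can clear anything
    return f.provenance !== "agent";              // the agent may only clear its own filters
  });
}
```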

$300 confirmation gate

Any cart addition over $300 requires explicit user confirmation. The agent cannot spend freely above this threshold.

High-value actions require human approval. The threshold balances convenience (most resale items are under $300) with risk. This is the line between autonomy and oversight.
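
A minimal sketch of the confirmation gate. The pending-confirmation shape is an assumption; only the $300 threshold comes from the write-up.

```typescript
// Sketch: the pending-confirmation shape and tool wiring are assumptions.
const CONFIRMATION_THRESHOLD_USD = 300;

interface Product { id: string; name: string; priceUsd: number; }

type CartAction =
  | { kind: "added"; product: Product }
  | { kind: "needs_confirmation"; product: Product; reason: string };

function addToCart(cart: Product[], product: Product, userConfirmed = false): CartAction {
  // The agent can call this tool freely below the threshold; above it,
  // the action is suspended until the user explicitly confirms.
  if (product.priceUsd > CONFIRMATION_THRESHOLD_USD && !userConfirmed) {
    return {
      kind: "needs_confirmation",
      product,
      reason: `This item is $${product.priceUsd}, above the $${CONFIRMATION_THRESHOLD_USD} limit.`,
    };
  }
  cart.push(product);
  return { kind: "added", product };
}
```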

Hybrid model routing (Flash + Pro)

Flash for tool selection (cheap, fast). Pro for synthesis and complex reasoning (quality). Automatic fallback with rate-limit quarantine prevents cascading failures.

Pro for everything costs 10x more and adds latency on tool calls where quality does not matter. Flash for everything degrades synthesis quality. The hybrid approach optimizes cost and quality per task type.
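
A minimal sketch of routing plus quarantine. The model identifiers, client interface, and 60-second quarantine window are assumptions.

```typescript
// Sketch: generateText, the model ids, and the quarantine window are assumptions.
type Phase = "tool_selection" | "synthesis";

interface ModelClient { generateText(model: string, prompt: string): Promise<string>; }

const quarantinedUntil = new Map<string, number>(); // model id -> epoch ms

function pickModel(phase: Phase): string {
  const preferred = phase === "tool_selection" ? "gemini-flash" : "gemini-pro";
  const fallback = phase === "tool_selection" ? "gemini-pro" : "gemini-flash";
  const quarantined = (quarantinedUntil.get(preferred) ?? 0) > Date.now();
  return quarantined ? fallback : preferred;
}

async function callModel(client: ModelClient, phase: Phase, prompt: string): Promise<string> {
  const model = pickModel(phase);
  try {
    return await client.generateText(model, prompt);
  } catch (err: any) {
    if (err?.status === 429) {
      // Rate-limited: quarantine this model briefly so retries don't cascade.
      quarantinedUntil.set(model, Date.now() + 60_000);
      const other = model === "gemini-flash" ? "gemini-pro" : "gemini-flash";
      return client.generateText(other, prompt);
    }
    throw err;
  }
}
```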

Hard iteration cap of 10 tool calls per query

No runaway loops possible. The agent must resolve within 10 tool calls or surface what it has.

Unbounded agent loops are unpredictable and expensive. A cap forces efficient tool use and guarantees response time.
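
A minimal sketch of the loop shape the cap implies. The step types and callbacks are illustrative, not the real agent runtime.

```typescript
// Sketch: AgentStep and the callbacks are illustrative stand-ins for the real loop.
const MAX_TOOL_CALLS = 10;

type AgentStep =
  | { type: "tool_call"; tool: string; args: Record<string, unknown> }
  | { type: "respond"; message: string };

async function runAgentLoop(
  nextStep: (history: AgentStep[]) => Promise<AgentStep>,
  executeTool: (tool: string, args: Record<string, unknown>) => Promise<unknown>
): Promise<string> {
  const history: AgentStep[] = [];
  for (let i = 0; i < MAX_TOOL_CALLS; i++) {
    const step = await nextStep(history);
    if (step.type === "respond") return step.message;
    await executeTool(step.tool, step.args); // checkpoint would be persisted here in production
    history.push(step);
  }
  // Cap reached: surface what we have instead of looping forever.
  return "Here's what I found so far. Want me to narrow it down differently?";
}
```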

Emergence testing over unit testing for agent behavior

25 tests validate behaviors that emerge from tool composition — not individual tool correctness. Tests like "a vague query should trigger clarification" verify system-level outcomes.

Unit tests verify tools work in isolation. Emergence tests verify the system produces useful outcomes when tools interact — the gap where real bugs hide in agent systems.
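
A minimal sketch of an emergence-style test, written with Vitest as an assumption; runQuery stands in for the real harness entry point. The assertions target system-level outcomes, not individual tools.

```typescript
import { describe, it, expect } from "vitest";
// Hypothetical harness hook: in the real project this would drive the deployed agent.
import { runQuery } from "./agent-test-harness";

describe("emergent behavior", () => {
  it("asks for clarification on a vague query instead of guessing", async () => {
    const result = await runQuery("something nice for going out");
    expect(result.kind).toBe("clarification");
    expect(Object.keys(result.activeFilters)).toHaveLength(0); // no guessed filters
  });

  it("composes search, sort, and compare for a comparative request", async () => {
    const result = await runQuery("compare the cheapest and most expensive y2k jeans");
    expect(result.kind).toBe("results");
    expect(result.toolCallCount).toBeGreaterThanOrEqual(3); // search, sort, compare at minimum
  });
});
```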

What Broke and What It Taught Us

State Thrashing

A user asks for "cooling percale sheets." The agent sets filters — material: percale, feel: cooling — runs the search, returns results. The user removes the feel: cooling filter to browse all percale options. On the next message, the agent re-reads its original interpretation from session state and silently re-applies the removed filter. From the user's perspective: "I keep removing this filter and it keeps coming back." The agent was fighting the user for control of the interface.

The fix was architectural, not a prompt change. Every filter now carries a provenance tag — agent or user — tracked per-filter, per-session. When a user removes an agent-inferred filter, that removal is recorded with user provenance. On the next turn, the agent's context includes the fact that the user explicitly rejected that filter. A skill-level instruction tells the agent: never re-apply a filter the user has removed in this session. The prompt governs the agent's judgment; the code enforces that user-provenance actions are immutable.
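
A minimal sketch of the code-level half of that fix. The session shape is an assumption; the point is that a user removal becomes a durable fact the agent checks before applying any inferred filter.

```typescript
// Sketch: the session shape is an assumption; user removals are recorded and consulted
// before the agent applies any inferred filter.
interface Session {
  rejectedDimensions: Set<string>;              // dimensions the user removed this session
  activeFilters: Map<string, string | number>;
}

function recordUserRemoval(session: Session, dimension: string): void {
  session.activeFilters.delete(dimension);
  session.rejectedDimensions.add(dimension);    // immutable fact for the rest of the session
}

function agentApplyFilter(session: Session, dimension: string, value: string | number): boolean {
  if (session.rejectedDimensions.has(dimension)) {
    return false; // the user explicitly rejected this; the agent must not re-apply it
  }
  session.activeFilters.set(dimension, value);
  return true;
}
```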

The Hallucination of Emptiness

On the first turn of a new session, the UI shows default products — bestsellers, trending items. The agent had no way to distinguish between "I searched and found these results" and "the user is looking at default catalog items." It would sometimes treat defaults as if the user had already searched, or claim "I didn't find any results" when results were displayed — because they came from the default catalog load, not the agent's own search tool. The agent was confidently narrating a version of reality that didn't match what the user was seeing.

The SessionState object now tracks results provenance — whether displayed items came from an agent search, a default catalog load, or a user-initiated filter change. The agent's context includes this metadata so it knows the difference between "results I found for you" and "products that were already on screen." Tool results were also compressed from 30+ fields per product down to 5 (id, name, price, match_score, match_summary), with a separate get_product_details tool for deep dives. This mirrors how humans scan results: skim the list, then click into what looks promising.
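
A minimal sketch of both changes. The ResultsSource values mirror the write-up; everything else is an assumption.

```typescript
// Sketch: ResultsSource values mirror the write-up; other fields are assumptions.
type ResultsSource = "agent_search" | "default_catalog" | "user_filter_change";

interface SessionState {
  resultsSource: ResultsSource;     // tells the agent whether it actually searched
  displayedProductIds: string[];
}

// Compressed tool result: 5 fields instead of 30+, enough to skim and decide what to open.
interface ProductSummary {
  id: string;
  name: string;
  price: number;
  match_score: number;
  match_summary: string;
}

function summarizeForAgent(full: Record<string, any>, score: number, why: string): ProductSummary {
  return {
    id: full.id,
    name: full.name,
    price: full.price,
    match_score: score,
    match_summary: why, // one line on why this matched; get_product_details covers the rest
  };
}
```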

Context Fatigue Loops

Gemini 2.5 Pro generates ~730 thinking tokens per call with no API control to disable them. Every agent call — even extracting "red dress under $100" into a tool schema — had a 6-17 second time-to-first-token. The agent loop makes two LLM calls minimum, so users waited 12-34 seconds before seeing anything. Worse, as tool results accumulated in the context window, Pro lost track of its completion instructions. It would acknowledge in its reasoning that it had enough results, then immediately re-search instead of responding. In the worst cases, it looped 3-4 times — a query that should take 3-5 seconds took 30-60+ seconds.

Pro was removed from the production loop entirely. The architecture migrated to Gemini 3 Flash with dynamic reasoning modulation — thinking_level: MINIMAL for tool-call phases (entity extraction, schema mapping) and thinking_level: LOW for synthesis. This dropped end-to-end latency from 30-60s to 3.5-5s. The key insight: the right model for an agent loop isn't the smartest model — it's the model whose reasoning overhead you can actually control.
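
A minimal sketch of the phase-based modulation. The config shape and model identifier are illustrative, not the exact SDK parameter names.

```typescript
// Sketch: the config shape and model id are illustrative, not exact SDK surface.
type LoopPhase = "tool_call" | "synthesis";

interface GenerationSettings {
  model: string;
  thinkingLevel: "minimal" | "low";
}

function settingsFor(phase: LoopPhase): GenerationSettings {
  // Tool-call phases are mechanical (entity extraction, schema mapping): minimal reasoning.
  // Synthesis gets slightly more headroom, still bounded so time-to-first-token stays low.
  return phase === "tool_call"
    ? { model: "gemini-3-flash", thinkingLevel: "minimal" }
    : { model: "gemini-3-flash", thinkingLevel: "low" };
}
```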

Designing Shared Control

This agent controls three UI surfaces simultaneously — a chat interface, a search results panel, and a generative filter sidebar. The user can interact with any of them at any time. How do you design the contract between an agent that's trying to help and a user who might override it mid-thought? Most agent systems solve this by either giving the agent full control (chatbot mode) or giving the user full control (tool mode). SG Resale had to do both simultaneously.

The answer turned out to be a concept borrowed from version control: provenance. Every filter carries a provenance tag — agent or user — visually distinguished in the UI (AI Aura styling for agent-set filters, standard solid styling for user-set). The key rule: user-explicit actions always override agent-inferred ones. When a user clicks on an agent-set filter — even just to confirm it — ownership transfers from agent to user provenance. The agent sees this and won't touch it. Removing an agent filter is logged as a "Non-Verbal Utterance" — a negative signal. Two removals of the same filter type in a session suppress similar suggestions. Three across sessions lower the confidence threshold for that filter type for that user.
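
A minimal sketch of the removal-signal bookkeeping. The storage shape is an assumption; the suppression rule comes from the write-up.

```typescript
// Sketch: storage shape and counters are assumptions; thresholds follow the rules above.
interface RemovalSignals {
  perSession: Map<string, number>;   // filter dimension -> removals this session
  crossSession: Map<string, number>; // filter dimension -> removals across sessions, per user
}

function recordNonVerbalRemoval(signals: RemovalSignals, dimension: string): void {
  signals.perSession.set(dimension, (signals.perSession.get(dimension) ?? 0) + 1);
  // The cross-session count feeds the per-user confidence adjustment for this filter type.
  signals.crossSession.set(dimension, (signals.crossSession.get(dimension) ?? 0) + 1);
}

// Two removals of the same filter type in one session: suppress similar suggestions.
function shouldSuppressSuggestion(signals: RemovalSignals, dimension: string): boolean {
  return (signals.perSession.get(dimension) ?? 0) >= 2;
}
```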

What this taught us wasn't about filters — it was about trust. Users develop a sense of control when they can see what the AI inferred and correct it without friction. The critical UX requirement is that removing an agent-set filter must feel identical to removing a user-set filter — one click, no confirmation dialog. The moment you add friction to overriding the agent, you've broken the trust contract.


Building an Agent That Tests the Agent

An agent system produces different outputs for the same input depending on context window state, model temperature, and conversation history. You can't test it the way you test a deterministic UI — clicking through flows and checking states. The combinatorial space is too large, the failures are too subtle (results that look right but miss the vibe), and regressions surface in specific multi-turn sequences that manual testing rarely reproduces consistently. So you build an agent that tests the agent.

The system has four layers, each catching what the one before it can't. Scored queries (exploratory, rated 1-5 against ideal outcomes) test general quality. Probes (regression, pass/fail) are generated from previous failures — a failed query in Run 1 becomes a probe in Run 2. Multi-turn sequences test context retention: Turn 1 sets a budget, Turn 2 adds a category, Turn 3 adds a material — the system checks that all three constraints survive. Verification passes act as an independent audit layer: after every test, the system reads the DOM directly to confirm visible results actually match the active filters, regardless of what the agent reported. Cross-model evaluation — Claude judging Gemini's responses — eliminates self-preference bias in quality scoring.

But the most important design choice wasn't any single layer — it was that the system never declares "done." Areas move through a maturity ladder: Uncharted → Proven (requires 2+ consecutive passes) → Known-bug (probe fails 3+ consecutive runs, auto-files a bug). A concrete example: BUG005 tracked a multi-turn context loss where adding a condition filter after a search dropped prior search context. The probe caught it, tracked it failing for 6 consecutive runs, and confirmed the fix when it finally landed. Another: BUG006, a "temporary error" regression that was fixed in one PR, then resurfaced two runs later in a different query set — the probe caught the regression that would have slipped through manual testing. These aren't edge cases; they're the kinds of subtle, intermittent failures that define whether an agent system actually works in production.
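
A minimal sketch of the maturity ladder's transitions. The record shape is an assumption; the pass and failure thresholds come from the write-up.

```typescript
// Sketch: AreaRecord is an assumption; the transition thresholds follow the ladder above.
type Maturity = "uncharted" | "proven" | "known_bug";

interface AreaRecord {
  maturity: Maturity;
  consecutivePasses: number;
  consecutiveFailures: number;
}

function recordRun(area: AreaRecord, passed: boolean): AreaRecord {
  const next: AreaRecord = {
    ...area,
    consecutivePasses: passed ? area.consecutivePasses + 1 : 0,
    consecutiveFailures: passed ? 0 : area.consecutiveFailures + 1,
  };
  if (next.consecutivePasses >= 2) next.maturity = "proven";            // 2+ consecutive passes
  else if (next.consecutiveFailures >= 3) next.maturity = "known_bug";  // 3+ consecutive fails: auto-file a bug
  return next;
}
```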

You can't manually test emergent behavior at scale. So you build an agent that tests the agent — and you design the system so it never believes its own passing grade.

What Happened

Shipped and live. The platform runs a 41-tool agent system across 25,295 lines of TypeScript backend code and 28 React components. It indexes 4,000+ resale products with zero manual curation.

The agent handles natural language queries end-to-end: parsing intent, selecting tools via hybrid Flash/Pro routing, applying filters across 19 semantic dimensions, evaluating confidence through deterministic rules, and synthesizing responses with source attribution. Ambiguous terms like "vintage," "designer," and "affordable" trigger clarification rather than a guess.
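
A minimal sketch of the hybrid query behind "applying filters across 19 semantic dimensions": pgvector similarity combined with structured predicates, using the pg client. Table and column names are assumptions.

```typescript
// Sketch: table/column names and the embedding source are assumptions, not the real schema.
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

async function hybridSearch(
  queryEmbedding: number[],          // produced by an embedding model elsewhere
  filters: { era?: string; material?: string; priceMax?: number },
  limit = 20
) {
  // pgvector's <=> operator is cosine distance; structured filters narrow the candidate set.
  const vectorLiteral = `[${queryEmbedding.join(",")}]`;
  const { rows } = await pool.query(
    `SELECT id, name, price,
            1 - (embedding <=> $1::vector) AS similarity
       FROM products
      WHERE ($2::text    IS NULL OR era = $2)
        AND ($3::text    IS NULL OR material = $3)
        AND ($4::numeric IS NULL OR price <= $4)
      ORDER BY embedding <=> $1::vector
      LIMIT $5`,
    [vectorLiteral, filters.era ?? null, filters.material ?? null, filters.priceMax ?? null, limit]
  );
  return rows;
}
```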

The system includes durable checkpointing (state persists after every tool call), SSE streaming with heartbeat keepalive, and a Glass Box UI that shows users exactly which filters the agent inferred versus which they set themselves.
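
A minimal sketch of checkpointing with optimistic locking against Postgres (Cloud SQL in the write-up). The table and columns are assumptions.

```typescript
// Sketch: the checkpoints table and its columns are assumptions, not the real schema.
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// Persist agent state after a tool call. The WHERE clause on `version` is the optimistic
// lock: if another instance already wrote a newer checkpoint, zero rows update and we bail
// instead of silently clobbering state.
async function saveCheckpoint(sessionId: string, state: unknown, expectedVersion: number): Promise<number> {
  const result = await pool.query(
    `UPDATE agent_checkpoints
        SET state = $2, version = version + 1, updated_at = now()
      WHERE session_id = $1 AND version = $3`,
    [sessionId, JSON.stringify(state), expectedVersion]
  );
  if (result.rowCount === 0) {
    throw new Error(`Stale checkpoint for session ${sessionId}: expected version ${expectedVersion}`);
  }
  return expectedVersion + 1;
}
```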

Measuring Success

What we track to know the agent is working.

Correction rate

How often users override agent-inferred filters. A declining rate means the agent interprets vibes more accurately over time.

Clarification rate

How often confidence drops low enough to ask versus act. Too high means the agent is timid; too low means it is guessing.

Filter accuracy by provenance

Breakdown of agent-set vs user-set vs collaborative filters. Reveals whether the agent is doing useful work or just adding noise.

Cost per query (Flash vs Pro split)

Hybrid routing effectiveness — what percentage of tool calls use Flash versus Pro, and whether synthesis quality holds.

Zero-result rate

Queries that return nothing reveal vocabulary gaps in the 19-dimension filter system.
