SG Resale: Agent-Native Retail

Shop by vibe, not by database query

Web App · Full-Stack AI Product Engineer
Shipped

The Problem

Secondhand platforms force you to search like a database, not like a person. Nobody shops by typing "blue jeans size 32." They shop by vibe: "y2k denim under $80" or "quiet luxury with a worn-in feel." The gap between how people talk about fashion and how resale platforms let you search for it is the entire product opportunity.

What It Does

  • Natural language shopping across 4,000 scraped Depop products
  • 41-tool agent system across 8 categories — search, filter, sort, compare, cart, inventory, style, and conversation
  • Hybrid pgvector semantic search + structured filter queries across 19 dimensions: era, condition, brand tier, fit, aesthetic, price, color, occasion, material, pattern, and more
  • Real-time SSE streaming with dual-path routing and heartbeat keepalive — the POST request is the stream
  • Filter provenance tracking: every active filter records whether it was set by the agent, the user, or negotiated collaboratively
  • Durable checkpointing — agent state persisted to Cloud SQL after every tool call with optimistic locking, so the agent resumes cleanly after instance rotation
  • Maintains full conversation context across multi-turn sessions, not just single queries
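The durable checkpointing above hinges on optimistic locking: a write only succeeds if the writer saw the latest version, so a rotated-out instance cannot clobber newer state. A minimal in-memory sketch of that pattern (the real system persists to Cloud SQL; `CheckpointStore` and its fields are illustrative, not the production schema):

```typescript
// In-memory stand-in for the Cloud SQL checkpoint table.
interface Checkpoint {
  version: number; // incremented on every successful write
  state: unknown;  // serialized agent state after a tool call
}

class CheckpointStore {
  private rows = new Map<string, Checkpoint>();

  load(sessionId: string): Checkpoint {
    return this.rows.get(sessionId) ?? { version: 0, state: null };
  }

  // Compare-and-swap: the write succeeds only if the caller saw the
  // latest version. A stale writer (e.g. a rotated-out instance) is
  // rejected instead of overwriting newer state.
  save(sessionId: string, expectedVersion: number, state: unknown): boolean {
    const current = this.load(sessionId);
    if (current.version !== expectedVersion) return false; // lost the race
    this.rows.set(sessionId, { version: expectedVersion + 1, state });
    return true;
  }
}
```

In SQL the same check is typically a `WHERE version = $expected` clause on the `UPDATE`, with zero affected rows signalling the stale write.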

What the AI Does

This is not AI-assisted search. The AI is the entire shopping experience. Google Gemini runs in an agent loop where each user message triggers a sequence of tool calls until the agent decides to respond.

The agent uses hybrid model routing: Flash handles tool selection (cheap, fast, roughly a tenth of Pro's cost), Pro handles synthesis and complex reasoning (quality). Automatic fallback with rate-limit quarantine ensures the experience degrades gracefully, not catastrophically.
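The routing-plus-quarantine idea can be sketched in a few lines. The task split matches the write-up; the model identifiers, cooldown window, and `ModelRouter` shape are assumptions for illustration:

```typescript
type Task = "tool-selection" | "synthesis";

class ModelRouter {
  private quarantinedUntil = new Map<string, number>();

  // Clock is injectable so quarantine behavior is testable.
  constructor(private now: () => number = Date.now) {}

  pick(task: Task): string {
    const preferred = task === "tool-selection" ? "gemini-flash" : "gemini-pro";
    const fallback = preferred === "gemini-flash" ? "gemini-pro" : "gemini-flash";
    return this.isQuarantined(preferred) ? fallback : preferred;
  }

  // On a rate-limit error, bench the model briefly so retries fall
  // through to the sibling instead of hammering the same endpoint.
  reportRateLimit(model: string, cooldownMs = 30_000): void {
    this.quarantinedUntil.set(model, this.now() + cooldownMs);
  }

  private isQuarantined(model: string): boolean {
    return (this.quarantinedUntil.get(model) ?? 0) > this.now();
  }
}
```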

Complex behaviors emerge from simple tools — "compare the cheapest and most expensive" triggers search, sort, add-to-compare, and get-comparison in sequence. The agent self-evaluates confidence before acting: low confidence triggers a clarifying question, not a guess.

Every filter tracks its source through a Glass Box UI: was it inferred by the agent, set by the user, or negotiated collaboratively? This is not cosmetic — it determines clearing behavior. User Supremacy means agent-cleared filters behave differently from user-cleared filters. The user can always see what the agent decided and why.

The agent decides what to do. The rules decide what it is allowed to do.

Key Design Decisions

Confidence evaluation is rule-based, not model-based

Deterministic logic evaluates whether the agent should act or ask for clarification — not a second model call assessing its own confidence.

A model evaluating its own confidence is circular and unauditable. Deterministic rules are debuggable, testable, and add zero latency.
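A minimal sketch of what deterministic confidence rules look like in practice. The field names, ambiguity list, and thresholds here are illustrative assumptions, not the production rule set:

```typescript
interface ParsedQuery {
  filters: Record<string, string>; // dimensions the agent could infer
  ambiguousTerms: string[];        // e.g. "vintage", "designer", "affordable"
}

type Decision = { action: "act" } | { action: "clarify"; reason: string };

// Pure function over the parsed query — no second model call, no latency,
// and every branch is unit-testable.
function evaluateConfidence(q: ParsedQuery): Decision {
  if (q.ambiguousTerms.length > 0) {
    return { action: "clarify", reason: `ambiguous: ${q.ambiguousTerms.join(", ")}` };
  }
  if (Object.keys(q.filters).length === 0) {
    return { action: "clarify", reason: "no filters could be inferred" };
  }
  return { action: "act" };
}
```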

User Supremacy

The agent can clear filters it set, but cannot override filters the user set. Every filter tracks provenance (agent | user | collaborative) so this rule is enforceable at the system level.

An agent that overrides your choices is not helpful — it is adversarial. Filter provenance makes this a system guarantee, not a prompt instruction.
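Because provenance lives on every filter, User Supremacy reduces to a one-line predicate rather than a prompt instruction. A sketch, with illustrative types:

```typescript
type Provenance = "agent" | "user" | "collaborative";

interface Filter {
  dimension: string;
  value: string;
  source: Provenance;
}

// The agent's clear operation can only remove filters the agent set.
// User-set (and collaborative) filters pass through untouched.
function agentClear(filters: Filter[], dimension: string): Filter[] {
  return filters.filter(
    (f) => !(f.dimension === dimension && f.source === "agent")
  );
}
```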

$300 confirmation gate

Any cart addition over $300 requires explicit user confirmation. The agent cannot spend freely above this threshold.

High-value actions require human approval. The threshold balances convenience (most resale items are under $300) with risk. This is the line between autonomy and oversight.
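The gate itself is a small guard in front of the cart tool. A sketch, assuming an illustrative return shape:

```typescript
const CONFIRMATION_THRESHOLD_USD = 300;

type CartResult =
  | { status: "added" }
  | { status: "needs-confirmation"; priceUsd: number };

// Additions above the threshold are parked until the user explicitly
// confirms; everything under it proceeds without friction.
function addToCart(priceUsd: number, userConfirmed = false): CartResult {
  if (priceUsd > CONFIRMATION_THRESHOLD_USD && !userConfirmed) {
    return { status: "needs-confirmation", priceUsd };
  }
  return { status: "added" };
}
```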

Hybrid model routing (Flash + Pro)

Flash for tool selection (cheap, fast). Pro for synthesis and complex reasoning (quality). Automatic fallback with rate-limit quarantine prevents cascading failures.

Pro for everything costs 10x more and adds latency on tool calls where quality does not matter. Flash for everything degrades synthesis quality. The hybrid approach optimizes cost and quality per task type.

Hard iteration cap of 10 tool calls per query

No runaway loops possible. The agent must resolve within 10 tool calls or surface what it has.

Unbounded agent loops are unpredictable and expensive. A cap forces efficient tool use and guarantees response time.
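The bounded loop can be sketched as below. `step` is a hypothetical stand-in for one round of tool selection and execution; the fallback message is illustrative:

```typescript
const MAX_TOOL_CALLS = 10;

type Step = { done: true; answer: string } | { done: false };

// Run at most MAX_TOOL_CALLS rounds; if the agent has not resolved by
// then, surface whatever it has instead of looping forever.
function runAgentLoop(step: (iteration: number) => Step): string {
  for (let i = 0; i < MAX_TOOL_CALLS; i++) {
    const result = step(i);
    if (result.done) return result.answer;
  }
  return "partial results (iteration cap reached)";
}
```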

Emergence testing over unit testing for agent behavior

25 tests validate behaviors that emerge from tool composition — not individual tool correctness. Tests like "a vague query should trigger clarification" verify system-level outcomes.

Unit tests verify tools work in isolation. Emergence tests verify the system produces useful outcomes when tools interact — the gap where real bugs hide in agent systems.
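The shape of an emergence test: it asserts a system-level outcome, not a tool's correctness. Here `handleQuery` is a toy stand-in for the real pipeline, so the assertion style, not the pipeline, is the point:

```typescript
// Toy pipeline: vague vocabulary yields a clarification, anything
// concrete yields results. The real pipeline is the full agent loop.
function handleQuery(query: string): { kind: "results" | "clarification" } {
  const ambiguous = ["vintage", "designer", "affordable"];
  const isVague = ambiguous.some((t) => query.toLowerCase().includes(t));
  return { kind: isVague ? "clarification" : "results" };
}

// Emergence test: "a vague query should trigger clarification" — an
// outcome of tool composition, not any single tool in isolation.
function testVagueQueryTriggersClarification(): void {
  const outcome = handleQuery("show me vintage stuff");
  if (outcome.kind !== "clarification") {
    throw new Error("expected a clarification for a vague query");
  }
}
```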

What Happened

Shipped and live. The platform runs a 41-tool agent system across 25,295 lines of TypeScript backend code and 28 React components. It indexes 4,000+ resale products with zero manual curation.

The agent handles natural language queries end-to-end: parsing intent, selecting tools via hybrid Flash/Pro routing, applying filters across 19 semantic dimensions, evaluating confidence through deterministic rules, and synthesizing responses with source attribution. Ambiguous terms like "vintage," "designer," and "affordable" trigger clarification rather than a guess.

The system includes durable checkpointing (state persists after every tool call), SSE streaming with heartbeat keepalive, and a Glass Box UI that shows users exactly which filters the agent inferred versus which they set themselves.

Measuring Success

What we track to know the agent is working.

Correction rate

How often users override agent-inferred filters. A declining rate means the agent interprets vibes more accurately over time.

Clarification rate

How often confidence drops low enough to ask versus act. Too high means the agent is timid; too low means it is guessing.

Filter accuracy by provenance

Breakdown of agent-set vs user-set vs collaborative filters. Reveals whether the agent is doing useful work or just adding noise.

Cost per query (Flash vs Pro split)

Hybrid routing effectiveness — what percentage of tool calls use Flash versus Pro, and whether synthesis quality holds.

Zero-result rate

Queries that return nothing reveal vocabulary gaps in the 19-dimension filter system.


Designed & Built by Drew Miller

© 2026. Version 3.2.0