Project 1 · Dec 2025 – Present

sanctuary: Cross-Border E-Commerce Support Automation

Technical Support Engineer @ Global-e · TypeScript, MCP, MSSQL, Snowflake, Coralogix, Playwright

The Problem

Global-e operates a cross-border e-commerce platform. Technical support requires investigating across 10+ disconnected systems: the admin portal, payment gateways (Adyen, Stripe, Klarna, PayPal, Worldpay), MSSQL/Snowflake databases, observability (Coralogix), JIRA/Confluence, Zendesk, email logs, and merchant Shopify stores.

A single ticket investigation took 2+ hours and relied on tribal knowledge that was hard to transfer.

The Solution

I built sanctuary: one MCP server that puts the whole investigation lifecycle behind a single interface. It began as a multi-service stack in Python and Go. I rebuilt it as a single TypeScript MCP server, which cut operational overhead and, to my surprise, widened what it could do.

Key Features

Unified Order Investigation

A single composite tool pulls the complete picture in under a minute: DB joins across 6 pricing fields, payment transaction history, delivery timeline, refund chain, merchant configuration, and commercial invoice (CI) rendering, all with automatic discrepancy detection.
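A minimal sketch of what automatic discrepancy detection across pricing fields might look like. The record shape, the source and field names, and the `findPriceDiscrepancies` helper are all hypothetical; they illustrate the idea, not sanctuary's actual schema or tool.

```typescript
// Hypothetical shape: one amount as reported by one source, in integer minor units.
interface AmountReading {
  source: string;      // illustrative, e.g. "orders_db", "psp", "commercial_invoice"
  field: string;       // illustrative, e.g. "total_charged"
  amountMinor: number; // integer minor units (cents), which avoids float rounding
  currency: string;
}

interface PriceDiscrepancy {
  field: string;
  currency: string;
  spreadMinor: number; // largest reading minus smallest reading
  readings: AmountReading[];
}

// Group readings by field and currency, then flag every group whose
// sources disagree by more than the tolerance.
function findPriceDiscrepancies(
  readings: AmountReading[],
  toleranceMinor = 0,
): PriceDiscrepancy[] {
  const groups = new Map<string, PriceDiscrepancy>();
  for (const r of readings) {
    const key = `${r.field}|${r.currency}`;
    let group = groups.get(key);
    if (!group) {
      group = { field: r.field, currency: r.currency, spreadMinor: 0, readings: [] };
      groups.set(key, group);
    }
    group.readings.push(r);
  }

  const found: PriceDiscrepancy[] = [];
  groups.forEach((group) => {
    const amounts = group.readings.map((r) => r.amountMinor);
    group.spreadMinor = Math.max(...amounts) - Math.min(...amounts);
    if (group.spreadMinor > toleranceMinor) found.push(group);
  });
  return found;
}
```

Comparing in minor units keeps the check exact; a non-zero tolerance would only make sense if some source is known to round.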

Multi-PSP Payment Diagnostics

Automated investigation across Adyen, Stripe, Klarna, PayPal, and Worldpay. The platform normalizes their different APIs, authentication methods, and data formats into one diagnostic view. For payment history beyond MSSQL retention, it also queries a 444M-row Snowflake archive.
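One way the normalization step could be shaped is a discriminated union of raw records mapped into a single view. The raw record shapes and the status mapping below are simplified, hypothetical stand-ins, not the PSPs' real payloads; only two adapters are shown, and the remaining PSPs would follow the same pattern.

```typescript
type PaymentStatus = "authorized" | "captured" | "refunded" | "failed" | "unknown";

interface NormalizedPayment {
  psp: string;
  reference: string;
  amountMinor: number; // integer minor units
  currency: string;    // upper-case currency code
  status: PaymentStatus;
}

// Hypothetical, simplified raw records. Real PSP payloads differ and carry many more fields.
interface RawAdyenLike {
  psp: "adyen";
  reference: string;
  amount: { value: number; currency: string };
  result: string;
}
interface RawStripeLike {
  psp: "stripe";
  id: string;
  amount: number;
  currency: string;
  status: string;
}
type RawPayment = RawAdyenLike | RawStripeLike;

// Illustrative status vocabulary, not an exhaustive or authoritative mapping.
const STATUS_BY_RAW = new Map<string, PaymentStatus>([
  ["authorised", "authorized"],
  ["succeeded", "captured"],
  ["refunded", "refunded"],
  ["refused", "failed"],
  ["failed", "failed"],
]);

function toStatus(raw: string): PaymentStatus {
  return STATUS_BY_RAW.get(raw.toLowerCase()) ?? "unknown";
}

// Each case reads the fields its PSP uses and emits the shared shape.
function normalizePayment(raw: RawPayment): NormalizedPayment {
  switch (raw.psp) {
    case "adyen":
      return {
        psp: raw.psp,
        reference: raw.reference,
        amountMinor: raw.amount.value,
        currency: raw.amount.currency.toUpperCase(),
        status: toStatus(raw.result),
      };
    case "stripe":
      return {
        psp: raw.psp,
        reference: raw.id,
        amountMinor: raw.amount,
        currency: raw.currency.toUpperCase(),
        status: toStatus(raw.status),
      };
  }
}
```

Unmapped statuses fall through to "unknown" rather than being guessed, so a new PSP status shows up as a visible gap instead of a wrong diagnosis.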

Measuring My Own Thoroughness

A 5-stage framework that measures my own investigation quality: it tracks false-positive rates, root-cause discovery frequency, and resolution-direction recall, with data-driven gates between stages. Stages 1–3 are deployed; Stage 4 is in an active measurement window.
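A data-driven gate between stages could be sketched roughly as follows. The record fields, the metric definitions, and every threshold are assumptions made for illustration; in particular, "resolution-direction recall" is read here as the share of cases where the proposed direction matched the eventual resolution.

```typescript
// Hypothetical per-investigation record.
interface InvestigationOutcome {
  flagged: number;           // issues the investigation flagged
  falsePositives: number;    // flagged issues later judged wrong
  rootCauseFound: boolean;
  directionMatched: boolean; // proposed direction matched the eventual resolution
}

interface ThoroughnessMetrics {
  sampleSize: number;
  falsePositiveRate: number;
  rootCauseRate: number;
  directionRecall: number;
}

// Hypothetical gate definition: thresholds a stage must meet before the next one starts.
interface StageGate {
  minSamples: number;
  maxFalsePositiveRate: number;
  minRootCauseRate: number;
  minDirectionRecall: number;
}

function ratio(numerator: number, denominator: number): number {
  return denominator === 0 ? 0 : numerator / denominator;
}

function computeMetrics(outcomes: InvestigationOutcome[]): ThoroughnessMetrics {
  const flagged = outcomes.reduce((sum, o) => sum + o.flagged, 0);
  const falsePositives = outcomes.reduce((sum, o) => sum + o.falsePositives, 0);
  return {
    sampleSize: outcomes.length,
    falsePositiveRate: ratio(falsePositives, flagged),
    rootCauseRate: ratio(outcomes.filter((o) => o.rootCauseFound).length, outcomes.length),
    directionRecall: ratio(outcomes.filter((o) => o.directionMatched).length, outcomes.length),
  };
}

// The gate only opens with enough samples and every metric inside its threshold.
function gatePasses(m: ThoroughnessMetrics, gate: StageGate): boolean {
  return (
    m.sampleSize >= gate.minSamples &&
    m.falsePositiveRate <= gate.maxFalsePositiveRate &&
    m.rootCauseRate >= gate.minRootCauseRate &&
    m.directionRecall >= gate.minDirectionRecall
  );
}
```

The minimum sample size matters as much as the rates: without it, a stage could pass on a handful of lucky tickets.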

Response Quality Assurance via Codified Rules

Before any customer-facing response goes out, a checker validates it against 216 documented learnings captured as feedback-memory files. 109 of them are codified into machine-enforced YAML gates that block the violation at the tool boundary. Coverage spans prohibited terminology, escalation-path correctness, factual accuracy, and tone calibration per recipient.

From Mistake to Rule to Hook

When a mistake recurs, the system codifies it: write a feedback memory → add a YAML gate → install a pre-tool-use hook → block the failure mode at the tool boundary. This brought 100+ recurring error categories down to near-zero recurrence in production.
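The last step of that chain, blocking at the tool boundary, might look something like this sketch. It assumes the YAML gates have already been parsed into plain objects; the gate fields, the tool names, and the example rule are hypothetical, not sanctuary's actual gate format.

```typescript
// Hypothetical in-memory form of one parsed YAML gate.
interface ResponseGate {
  id: string;
  tools: string[]; // tool names the gate applies to
  pattern: string; // regular-expression source, matched case-insensitively
  message: string; // explanation returned when the gate blocks
}

interface ToolCall {
  tool: string;
  text: string; // the outgoing text the tool is about to send
}

type HookDecision =
  | { allowed: true }
  | { allowed: false; gateId: string; message: string };

// Runs before the tool executes; the first matching gate blocks the call.
function preToolUseHook(call: ToolCall, gates: ResponseGate[]): HookDecision {
  for (const gate of gates) {
    if (!gate.tools.includes(call.tool)) continue;
    if (new RegExp(gate.pattern, "i").test(call.text)) {
      return { allowed: false, gateId: gate.id, message: gate.message };
    }
  }
  return { allowed: true };
}

// Hypothetical example gate: a prohibited-terminology rule for customer replies.
const exampleGates: ResponseGate[] = [
  {
    id: "no-delivery-guarantee",
    tools: ["send_customer_reply"],
    pattern: "\\bguaranteed?\\b",
    message: "Do not promise delivery outcomes in customer-facing text.",
  },
];
```

Returning the gate id with the refusal lets each blocked call be traced back to the learning that produced the rule.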

Local, No-Egress Inference

Some investigations touch sensitive customer data (PII, payment details) that must never leave the machine. For those, a local LLM utility (ollama) runs extraction and classification with zero network egress. The model handles only those two tasks; masking stays on deterministic regex.
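A minimal sketch of the deterministic masking half, assuming two illustrative rules (email addresses and card-like digit runs). The patterns and labels are assumptions for this example, not the project's actual rule set, and a production list would be broader.

```typescript
interface MaskRule {
  label: string;
  pattern: RegExp; // must carry the global flag so every occurrence is replaced
}

// Illustrative rules only.
const MASK_RULES: MaskRule[] = [
  { label: "EMAIL", pattern: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g },
  // 13 to 19 digits, optionally separated by single spaces or hyphens.
  { label: "CARD", pattern: /\b\d(?:[ -]?\d){12,18}\b/g },
];

// Applies every rule in order. The same input always yields the same output,
// which is the property that keeps masking out of the model's hands.
function maskSensitive(text: string): string {
  return MASK_RULES.reduce(
    (masked, rule) => masked.replace(rule.pattern, `[${rule.label}]`),
    text,
  );
}
```

The card rule is deliberately blunt: it masks any long digit run, so an unusually long order number would be masked as well, which is the safer direction to err in.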

Bulk Operations with Multi-Layer Safety

Order status changes, uninvoiced processing, and merchant data exports are all gated by (a) dry-run mode by default, (b) an approval-token workflow, (c) a status-ID block list for irreversible actions, (d) post-write DB verification, and (e) audit logging to a checkpoint database.
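The five layers could compose along these lines. This is a sketch with injected, hypothetical dependencies (`SafetyDeps`, `changeOrderStatus`, and the outcome shape are all made up for illustration); the real tools presumably talk to MSSQL and the checkpoint database rather than to callbacks.

```typescript
interface StatusChangeRequest {
  orderIds: number[];
  targetStatusId: number;
  dryRun?: boolean;       // treated as true unless explicitly set to false
  approvalToken?: string;
}

// Hypothetical dependencies, injected so the sketch stays self-contained.
interface SafetyDeps {
  blockedStatusIds: ReadonlySet<number>;
  isTokenValid(token: string): boolean;
  writeStatus(orderId: number, statusId: number): void;
  readStatus(orderId: number): number | undefined;
  audit(entry: string): void;
}

type StatusChangeOutcome =
  | { kind: "rejected"; reason: string }
  | { kind: "dry_run"; wouldChange: number[] }
  | { kind: "applied"; verified: number[]; unverified: number[] };

function changeOrderStatus(req: StatusChangeRequest, deps: SafetyDeps): StatusChangeOutcome {
  // (c) Block list: irreversible target statuses are refused outright.
  if (deps.blockedStatusIds.has(req.targetStatusId)) {
    deps.audit(`rejected: status ${req.targetStatusId} is block-listed`);
    return { kind: "rejected", reason: "blocked_status" };
  }
  // (a) Dry run by default: nothing is written unless dryRun is explicitly false.
  if (req.dryRun !== false) {
    deps.audit(`dry run: ${req.orderIds.length} orders`);
    return { kind: "dry_run", wouldChange: [...req.orderIds] };
  }
  // (b) Approval token: a real write needs a valid token.
  if (!req.approvalToken || !deps.isTokenValid(req.approvalToken)) {
    deps.audit("rejected: missing or invalid approval token");
    return { kind: "rejected", reason: "missing_or_invalid_token" };
  }
  const verified: number[] = [];
  const unverified: number[] = [];
  for (const id of req.orderIds) {
    deps.writeStatus(id, req.targetStatusId);
    // (d) Post-write verification: read the row back and compare.
    if (deps.readStatus(id) === req.targetStatusId) verified.push(id);
    else unverified.push(id);
  }
  // (e) Audit log: every path above leaves a record, including this one.
  deps.audit(`applied: ${verified.length} verified, ${unverified.length} unverified`);
  return { kind: "applied", verified, unverified };
}
```

Checking the block list before the dry-run branch means even a rehearsal of an irreversible change is refused, so a dry run never suggests the real run would succeed.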

Rebuilt from a Bare Machine

In mid-2026 a hardware failure during a motherboard swap wiped the machine. I rebuilt the whole system from a bare repository in three days. Backups came first: an encrypted archive plus a recovery rehearsal that proves the backup actually restores. Then the tools. I used the reset to redesign what I'd have kept out of habit: the routing layer, data-format conventions, and path handling. A golden-master harness now guards refactors by capturing each tool's output byte-for-byte, so a split or cleanup that changes behavior by even one byte fails the gate. That's how I split six oversized modules without shipping a single behavior change.
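A byte-for-byte golden-master check can be as small as the sketch below. The function names, the in-memory `Map` of golden outputs, and the tool names used with it are hypothetical; a real harness would more likely load the captured outputs from files.

```typescript
// Returns the offset of the first differing byte, or -1 when the outputs are identical.
// A pure length difference reports the offset where the shorter output ends.
function firstByteDiff(expected: Uint8Array, actual: Uint8Array): number {
  const shared = Math.min(expected.length, actual.length);
  for (let i = 0; i < shared; i++) {
    if (expected[i] !== actual[i]) return i;
  }
  return expected.length === actual.length ? -1 : shared;
}

interface GoldenResult {
  tool: string;
  ok: boolean;
  diffOffset: number; // -1 when ok
}

// Re-runs every tool that has a captured golden output and compares the bytes.
function checkAgainstGolden(
  golden: Map<string, Uint8Array>,
  runTool: (tool: string) => Uint8Array,
): GoldenResult[] {
  const results: GoldenResult[] = [];
  golden.forEach((expected, tool) => {
    const diffOffset = firstByteDiff(expected, runTool(tool));
    results.push({ tool, ok: diffOffset === -1, diffOffset });
  });
  return results;
}
```

A refactor gate would fail as soon as any result has `ok` set to false, and the reported offset points at where the behavior first diverged.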

Seven Specialized SQLite Databases

Rather than one monolithic store, sanctuary uses 7 specialized SQLite databases, each owning a single concern: investigation.db (per-ticket checkpoint chain), tickets.db (cached ticket bodies with full-text search), conversation_log.db (cross-session metadata), mistakes.db (structured failure log feeding the codification pipeline), metrics.db (self-instrumented tool/hook metrics), knowledge.db (reusable domain knowledge), and forter_learning.db (domain-specific ticket cohort for active learning).

Splitting by concern keeps the system inspectable. Because failures live in their own database, I can ask "what mistakes happened this week?" and get an answer in one query, instead of digging them out of a generic events table.

Results

60s · Investigation per case (was 2hr+)
379 · MCP tools
15,943 · Cached tickets
109 · YAML gates (from 216 learnings)

Person team

5 PSPs unified into a single diagnostic view
7 services across K8s + Docker → 1 unified TypeScript server
Investigation quality measurement: none → 5-stage framework with quantified gates

What I Learned

Consolidation beats orchestration. The earlier 7-service architecture was technically more impressive. The later one ships faster, is easier to debug, and costs less to operate.

Behavior matters more than tool count. Giving an agent 379 tools is half the work. The other half is codifying how it should behave: what to verify before concluding, and when to ask instead of guessing.

Measure your own engineering. The Investigation Thoroughness framework treats my investigation quality as something I can put a number on. There's a real gap between believing I'm thorough and having the data show it, and that's the gap I care about closing.

A forced rebuild is a brutal audit. When a hardware failure wiped the machine, standing it back up took three days, not the six months that preceded it. It came out better too, because I only rebuilt what had earned its place.

The Learning System Behind the Tools

Sanctuary's tools speed up the work. The learning system makes sure the work compounds.

Domain learning loop

For each support domain (payment fraud, shipping, customs, etc.), I run a 30-minute cycle two or three times a week: read the reference material → recall a real past ticket → simulate a customer/colleague exchange. The AI plays the other side of that exchange.

Two-repo portfolio

A private master in the engineering repo, a public Astro site. Polished essays are promoted from learning sessions through a slash command that enforces voice interviews and IP-masking gates.

Living retrospectives

Each essay is published in EN and KO together; JA follows selectively, only when the voice survives the language jump. The translation discipline forces clarity.

This system is why the portfolio exists at all. Without it, the work would vanish into closed tickets the moment they're marked solved; instead, it becomes something I can point to.
