Matching Engine Deep Dive

VantageDash's product matching engine identifies which competitor products correspond to your brand's products. It supports multiple matching strategies with configurable scoring, industry profiles, and vector similarity search.

Matching Methods

1. Fuzzy Matching (`product_matcher.py`)

Uses RapidFuzz for fast string similarity comparison. This is the original matcher and remains the default when use_new_matcher=false.

Algorithm: Weighted ensemble of three RapidFuzz metrics:

40% token_sort_ratio — order-independent word matching
35% token_set_ratio — handles extra/missing words
25% partial_ratio — substring matching for partial titles

Pipeline:

Load brand products from brand_products table
Load all competitor products from product_tracking table
Build a category index (O(n)) for pre-filtering — eliminates cross-category false positives
For each brand product:
- Extract category from title using keyword matching
- Find candidate competitors in the same category
- Calculate fuzzy score using the weighted ensemble
- Apply scoring adjustments (size, material, price, etc.)
- Products above active_threshold (default 65) → auto-approved match
- Products between pending_threshold (40) and active → pending review
Write matches to product_groups table with variant protection (prevents duplicate matches)

Scoring Dimensions (when using industry profiles):

Dimension	Match Bonus	Mismatch Penalty
Category	+15	-40
Size	+30	-100
Unit/Capacity	+10	-25
Quantity	+10	—
Material	+10	-30
Price proximity	+5	-50
Keyword overlap	+12	-10
Specialty brand	—	-100
Stale product (>30 days)	—	-5

2. AI Matching (`ai_matcher.py`)

Uses GPT-4o-mini for semantic understanding. More accurate than fuzzy matching but slower and costs ~$0.01-0.02 per brand product.

Pipeline:

Load brand products from brand_products
For each brand product:
- Candidate retrieval: Try pgvector similarity search first, fall back to ILIKE substring search
- Attribute extraction: GPT-4o-mini extracts structured attributes (category, size, dimensions, material, etc.) from product names
- Semantic matching: GPT-4o-mini compares brand product attributes against candidates, returns confidence 0-100 with reasoning
Products above ai_active_threshold (80) → auto-approved
Products between ai_pending_threshold (50) and active → pending review
Write matches to product_groups

AI Prompt Architecture:

Extraction prompt: Dynamic — reads category list from industry profile (or hardcoded defaults). Extracts: category, size, size_ml, dimensions, material, pack_quantity, color, finish_type
Matching prompt: Compares two products with rules injected from industry profile. Returns: match (bool), confidence (0-100), reason (string)
Retry with backoff: Failed API calls retry with exponential backoff

3. Hybrid Matching

Combines fuzzy and AI matching:

Run fuzzy matching first (fast, cheap)
Products with confidence 40-79% (uncertain zone) are sent to AI verification
AI upgrades or downgrades the match confidence
Best of both: fuzzy speed for clear matches, AI accuracy for edge cases

4. Vector Similarity (pgvector)

Used as the candidate retrieval step for AI matching, replacing slow ILIKE substring queries.

How it works:

Products get text embeddings via OpenAI text-embedding-3-small (1536 dimensions)
Embeddings stored in embedding vector(1536) columns on product_tracking and brand_products
HNSW indexes enable fast approximate nearest-neighbor search
match_products_by_embedding() RPC function returns top-N similar products
Results are diversified across competitors and filtered by price range

Cost: ~$0.0001 per product (one-time at scrape/sync)

5. Parallel Validation

A/B testing tool for comparing old (hardcoded) vs new (profile-driven) matchers:

POST /api/match/validate with configurable sample_size (1-200)
Runs both matchers on the same sample of brand products
Produces per-product verdicts:
- both_match_same — same match, same confidence range
- both_match_different — both matched but different confidence
- old_only — old matcher found a match, new didn't
- new_only — new matcher found a match, old didn't
- both_nomatch — neither found a match
Results stored as JSON in the session row (read-only, no product_groups writes)
Retrieve via GET /api/match/validate/{session_id}

Candidate Retrieval

The matcher needs to find potential competitor matches for each brand product. There are two strategies:

Vector Search (preferred when embeddings exist)

Brand product text → embedding → pgvector cosine similarity → top candidates

Uses match_products_by_embedding() RPC function
Threshold: 0.3 similarity minimum
Diversifies across competitors (max results per competitor)
Filters by price range (10x ratio guard)

ILIKE Fallback

Brand product title → keyword extraction → ILIKE '%keyword%' queries

Uses category keywords and search terms from industry profile
Slower and less accurate than vector search
Always available (no embedding generation needed)

Price Guards

A 10x price ratio limit prevents absurd matches:

If brand product is $5 and competitor is $100, they won't match
Configurable via scoring_config.price_ratio_limit
Price proximity within 20% gets a +5 bonus
Extreme price divergence (>5x) gets a -50 penalty

Stale Product Detection

Products scraped more than 30 days ago receive a -5 confidence penalty. This prevents matching against outdated pricing data.

Confidence Distribution Tracking

Every matching session tracks the distribution of confidence scores:

High: 80-100%
Medium: 50-79%
Low: below 50%

This helps measure matching quality across runs.

Feature Flag: `use_new_matcher`

The industry_profiles table has a use_new_matcher boolean (default false):

false: All matchers use hardcoded logic (original behavior)
true: Matchers read categories, scoring weights, and rules from the industry profile

This allows safe rollout — validate with parallel validation first, then flip the flag.

Session 33 Bugs Fixed (2026-03-18)

Three critical silent-failure bugs in ai_matcher.py were discovered during end-to-end testing with real packaging competitors:

Non-existent **image_url** column — find_candidates() selected image_url from product_tracking (doesn't exist). PostgREST returned 400 on every ILIKE query, silently caught by try/except → 0 candidates found.
Missing **tenant_id** in product_groups insert — inserts failed against NOT NULL FK constraint. Matches found but never saved.
Non-existent **status**/**price_comparison** columns — insert included fields that don't exist on product_groups. Fixed by removing status, renaming price_comparison → differences.
matching_logs missing tenant_id column — table lacked the column entirely, causing all log inserts to fail silently. Fixed via ALTER TABLE.

Result: 19 product matches at 80-92% confidence across 4 competitors from 6,897 scraped products.

Variant Quantity Extraction (extract_variant_quantity() in price_utils.py):
- Parses Shopify variant titles: "Pack of 100", "Case of 1000", "Single (1)", "24ct", "Bundle of 20"
- Falls back to product title parsing via extract_quantity() for competitor products
Brand PPU Computation (compute_brand_product_ppu()):
- Analyzes all variants of a brand product
- Picks the variant whose price matches the base price
- Computes PPU = price / pack_quantity
Closest Variant Selection (find_closest_variant_to_quantity()):
- When comparing against a competitor with a known pack size (e.g., 1000ct)
- Finds the brand variant with the closest pack quantity
- Used by AI matcher for accurate per-unit comparison

AI Matcher Integration

The AI matcher now computes PPU server-side instead of relying on GPT-4o-mini math:

Extracts competitor pack quantity from product name
Picks best brand variant for that pack size
Saves brand_pack_quantity, brand_ppu, matched_pack_quantity, matched_ppu to product_groups
comparison_method is always "ppu"

Frontend Display

Comparison page: Shows PPU when pack sizes differ (e.g., "$0.2499/unit vs $0.149/unit") with pack context ("100pk @ $24.99 vs 1000pk @ $149.00")
Products page: Pack Qty + PPU columns in table
Falls back to raw price comparison when pack sizes are equal

Scraper Integration

Both scraper and Shopify sync auto-compute pack_quantity + ppu on product insert:

Competitor products: extracted from product name ("1000 Count", "Pack of 50")
Brand products: extracted from Shopify variant titles
Products without quantity indicators default to pack_quantity=1