Matching Engine Deep Dive
Fuzzy, AI, hybrid, and pgvector matching pipelines
VantageDash's product matching engine identifies which competitor products correspond to your brand's products. It supports multiple matching strategies with configurable scoring, industry profiles, and vector similarity search.
Matching Methods
1. Fuzzy Matching (product_matcher.py)
Uses RapidFuzz for fast string similarity comparison. This is the original matcher and remains the default when use_new_matcher=false.
Algorithm: Weighted ensemble of three RapidFuzz metrics:
- 40%
token_sort_ratio— order-independent word matching - 35%
token_set_ratio— handles extra/missing words - 25%
partial_ratio— substring matching for partial titles
Pipeline:
- Load brand products from
brand_productstable - Load all competitor products from
product_trackingtable - Build a category index (O(n)) for pre-filtering — eliminates cross-category false positives
- For each brand product:
- Extract category from title using keyword matching
- Find candidate competitors in the same category
- Calculate fuzzy score using the weighted ensemble
- Apply scoring adjustments (size, material, price, etc.)
- Products above
active_threshold(default 65) → auto-approved match - Products between
pending_threshold(40) and active → pending review
- Write matches to
product_groupstable with variant protection (prevents duplicate matches)
Scoring Dimensions (when using industry profiles):
| Dimension | Match Bonus | Mismatch Penalty |
|---|---|---|
| Category | +15 | -40 |
| Size | +30 | -100 |
| Unit/Capacity | +10 | -25 |
| Quantity | +10 | — |
| Material | +10 | -30 |
| Price proximity | +5 | -50 |
| Keyword overlap | +12 | -10 |
| Specialty brand | — | -100 |
| Stale product (>30 days) | — | -5 |
2. AI Matching (ai_matcher.py)
Uses GPT-4o-mini for semantic understanding. More accurate than fuzzy matching but slower and costs ~$0.01-0.02 per brand product.
Pipeline:
- Load brand products from
brand_products - For each brand product:
- Candidate retrieval: Try pgvector similarity search first, fall back to ILIKE substring search
- Attribute extraction: GPT-4o-mini extracts structured attributes (category, size, dimensions, material, etc.) from product names
- Semantic matching: GPT-4o-mini compares brand product attributes against candidates, returns confidence 0-100 with reasoning
- Products above
ai_active_threshold(80) → auto-approved - Products between
ai_pending_threshold(50) and active → pending review - Write matches to
product_groups
AI Prompt Architecture:
- Extraction prompt: Dynamic — reads category list from industry profile (or hardcoded defaults). Extracts: category, size, size_ml, dimensions, material, pack_quantity, color, finish_type
- Matching prompt: Compares two products with rules injected from industry profile. Returns: match (bool), confidence (0-100), reason (string)
- Retry with backoff: Failed API calls retry with exponential backoff
3. Hybrid Matching
Combines fuzzy and AI matching:
- Run fuzzy matching first (fast, cheap)
- Products with confidence 40-79% (uncertain zone) are sent to AI verification
- AI upgrades or downgrades the match confidence
- Best of both: fuzzy speed for clear matches, AI accuracy for edge cases
4. Vector Similarity (pgvector)
Used as the candidate retrieval step for AI matching, replacing slow ILIKE substring queries.
How it works:
- Products get text embeddings via OpenAI
text-embedding-3-small(1536 dimensions) - Embeddings stored in
embedding vector(1536)columns onproduct_trackingandbrand_products - HNSW indexes enable fast approximate nearest-neighbor search
match_products_by_embedding()RPC function returns top-N similar products- Results are diversified across competitors and filtered by price range
Cost: ~$0.0001 per product (one-time at scrape/sync)
5. Parallel Validation
A/B testing tool for comparing old (hardcoded) vs new (profile-driven) matchers:
POST /api/match/validatewith configurablesample_size(1-200)- Runs both matchers on the same sample of brand products
- Produces per-product verdicts:
both_match_same— same match, same confidence rangeboth_match_different— both matched but different confidenceold_only— old matcher found a match, new didn'tnew_only— new matcher found a match, old didn'tboth_nomatch— neither found a match
- Results stored as JSON in the session row (read-only, no
product_groupswrites) - Retrieve via
GET /api/match/validate/{session_id}
Candidate Retrieval
The matcher needs to find potential competitor matches for each brand product. There are two strategies:
Vector Search (preferred when embeddings exist)
Brand product text → embedding → pgvector cosine similarity → top candidates- Uses
match_products_by_embedding()RPC function - Threshold: 0.3 similarity minimum
- Diversifies across competitors (max results per competitor)
- Filters by price range (10x ratio guard)
ILIKE Fallback
Brand product title → keyword extraction → ILIKE '%keyword%' queries- Uses category keywords and search terms from industry profile
- Slower and less accurate than vector search
- Always available (no embedding generation needed)
Price Guards
A 10x price ratio limit prevents absurd matches:
- If brand product is $5 and competitor is $100, they won't match
- Configurable via
scoring_config.price_ratio_limit - Price proximity within 20% gets a +5 bonus
- Extreme price divergence (>5x) gets a -50 penalty
Stale Product Detection
Products scraped more than 30 days ago receive a -5 confidence penalty. This prevents matching against outdated pricing data.
Confidence Distribution Tracking
Every matching session tracks the distribution of confidence scores:
- High: 80-100%
- Medium: 50-79%
- Low: below 50%
This helps measure matching quality across runs.
Feature Flag: use_new_matcher
The industry_profiles table has a use_new_matcher boolean (default false):
- false: All matchers use hardcoded logic (original behavior)
- true: Matchers read categories, scoring weights, and rules from the industry profile
This allows safe rollout — validate with parallel validation first, then flip the flag.
Session 33 Bugs Fixed (2026-03-18)
Three critical silent-failure bugs in ai_matcher.py were discovered during end-to-end testing with real packaging competitors:
- Non-existent
**image_url**column —find_candidates()selectedimage_urlfromproduct_tracking(doesn't exist). PostgREST returned 400 on every ILIKE query, silently caught by try/except → 0 candidates found. - Missing
**tenant_id**in product_groups insert — inserts failed against NOT NULL FK constraint. Matches found but never saved. - Non-existent
**status**/**price_comparison**columns — insert included fields that don't exist on product_groups. Fixed by removingstatus, renamingprice_comparison→differences. - matching_logs missing tenant_id column — table lacked the column entirely, causing all log inserts to fail silently. Fixed via ALTER TABLE.
Result: 19 product matches at 80-92% confidence across 4 competitors from 6,897 scraped products.
use_new_matcher Status
As of Session 33, use_new_matcher is true for the production tenant (cannabis_packaging profile).
PPU Normalization (Session 34)
Price-per-unit normalization enables fair comparison between products sold at different pack sizes (e.g., brand sells 100-packs, competitor sells 1000-count).
How It Works
- Variant Quantity Extraction (
extract_variant_quantity()inprice_utils.py):- Parses Shopify variant titles: "Pack of 100", "Case of 1000", "Single (1)", "24ct", "Bundle of 20"
- Falls back to product title parsing via
extract_quantity()for competitor products
- Brand PPU Computation (
compute_brand_product_ppu()):- Analyzes all variants of a brand product
- Picks the variant whose price matches the base price
- Computes PPU = price / pack_quantity
- Closest Variant Selection (
find_closest_variant_to_quantity()):- When comparing against a competitor with a known pack size (e.g., 1000ct)
- Finds the brand variant with the closest pack quantity
- Used by AI matcher for accurate per-unit comparison
AI Matcher Integration
The AI matcher now computes PPU server-side instead of relying on GPT-4o-mini math:
- Extracts competitor pack quantity from product name
- Picks best brand variant for that pack size
- Saves
brand_pack_quantity,brand_ppu,matched_pack_quantity,matched_pputo product_groups comparison_methodis always "ppu"
Frontend Display
- Comparison page: Shows PPU when pack sizes differ (e.g., "$0.2499/unit vs $0.149/unit") with pack context ("100pk @ $24.99 vs 1000pk @ $149.00")
- Products page: Pack Qty + PPU columns in table
- Falls back to raw price comparison when pack sizes are equal
Scraper Integration
Both scraper and Shopify sync auto-compute pack_quantity + ppu on product insert:
- Competitor products: extracted from product name ("1000 Count", "Pack of 50")
- Brand products: extracted from Shopify variant titles
- Products without quantity indicators default to pack_quantity=1