VantageDash

Matching Engine Deep Dive

Fuzzy, AI, hybrid, and pgvector matching pipelines

VantageDash's product matching engine identifies which competitor products correspond to your brand's products. It supports multiple matching strategies with configurable scoring, industry profiles, and vector similarity search.

Matching Methods

1. Fuzzy Matching (product_matcher.py)

Uses RapidFuzz for fast string similarity comparison. This is the original matcher and remains the default when use_new_matcher=false.

Algorithm: Weighted ensemble of three RapidFuzz metrics:

  • 40% token_sort_ratio — order-independent word matching
  • 35% token_set_ratio — handles extra/missing words
  • 25% partial_ratio — substring matching for partial titles
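The weighted ensemble above can be sketched as follows. The production matcher uses RapidFuzz's token_sort_ratio, token_set_ratio, and partial_ratio; to keep this sketch dependency-free, stdlib difflib stand-ins approximate those three metrics, so exact scores will differ from RapidFuzz output (the weights are the ones from the list above):

```python
# Sketch of the weighted fuzzy ensemble. difflib stand-ins approximate the
# three RapidFuzz metrics; only the 40/35/25 weighting is from the doc.
from difflib import SequenceMatcher

def _ratio(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio() * 100

def token_sort(a, b):  # order-independent: compare sorted-token strings
    return _ratio(" ".join(sorted(a.split())), " ".join(sorted(b.split())))

def token_set(a, b):   # extra/missing words: compare shared tokens to each side
    sa, sb = set(a.split()), set(b.split())
    inter = " ".join(sorted(sa & sb))
    return max(_ratio(inter, " ".join(sorted(sa))),
               _ratio(inter, " ".join(sorted(sb))))

def partial(a, b):     # substring-style match on the raw strings
    return _ratio(a, b)

def ensemble_score(a: str, b: str) -> float:
    a, b = a.lower(), b.lower()
    return 0.40 * token_sort(a, b) + 0.35 * token_set(a, b) + 0.25 * partial(a, b)
```

Identical titles score 100 regardless of word order; unrelated titles fall well below the pending threshold.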

Pipeline:

  1. Load brand products from brand_products table
  2. Load all competitor products from product_tracking table
  3. Build a category index (O(n)) for pre-filtering — eliminates cross-category false positives
  4. For each brand product:
    • Extract category from title using keyword matching
    • Find candidate competitors in the same category
    • Calculate fuzzy score using the weighted ensemble
    • Apply scoring adjustments (size, material, price, etc.)
    • Products above active_threshold (default 65) → auto-approved match
    • Products between pending_threshold (40) and active → pending review
  5. Write matches to product_groups table with variant protection (prevents duplicate matches)
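Step 3's category index can be illustrated with a minimal sketch. The keyword map and dict fields here are hypothetical, not the real profile data; the point is that competitors are bucketed once, so each brand product only scores against same-category candidates:

```python
# Sketch of the O(n) category pre-filter: extract a category per product via
# keyword matching, bucket competitors once, then look up candidates per
# brand product. CATEGORY_KEYWORDS is illustrative.
from collections import defaultdict

CATEGORY_KEYWORDS = {
    "jars": ["jar", "glass"],
    "bags": ["bag", "mylar", "pouch"],
}

def extract_category(title):
    t = title.lower()
    for cat, words in CATEGORY_KEYWORDS.items():
        if any(w in t for w in words):
            return cat
    return None  # uncategorized products land in their own bucket

def build_index(products):
    index = defaultdict(list)
    for p in products:
        index[extract_category(p["name"])].append(p)
    return index
```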

Scoring Dimensions (when using industry profiles):

| Dimension | Match Bonus | Mismatch Penalty |
| --- | --- | --- |
| Category | +15 | -40 |
| Size | +30 | -100 |
| Unit/Capacity | +10 | -25 |
| Quantity | +10 | |
| Material | +10 | -30 |
| Price proximity | +5 | -50 |
| Keyword overlap | +12 | -10 |
| Specialty brand | | -100 |
| Stale product (>30 days) | | -5 |
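A minimal sketch of how a few of these bonuses and penalties might stack on top of a base fuzzy score. The attribute dict shape and the 0-100 clamp are assumptions; the numeric values come from the table above:

```python
# Hedged sketch of profile-driven scoring adjustments layered on the base
# fuzzy score. Field names are illustrative, not the product_matcher.py schema.
def apply_adjustments(base: float, brand: dict, comp: dict) -> float:
    score = base
    if brand.get("category") and comp.get("category"):
        score += 15 if brand["category"] == comp["category"] else -40
    if brand.get("size") and comp.get("size"):
        score += 30 if brand["size"] == comp["size"] else -100
    if brand.get("material") and comp.get("material"):
        score += 10 if brand["material"] == comp["material"] else -30
    bp, cp = brand.get("price"), comp.get("price")
    if bp and cp:
        ratio = max(bp, cp) / min(bp, cp)
        if ratio <= 1.2:
            score += 5    # prices within 20%: proximity bonus
        elif ratio > 5:
            score -= 50   # extreme divergence penalty
    return max(0.0, min(100.0, score))
```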

2. AI Matching (ai_matcher.py)

Uses GPT-4o-mini for semantic understanding. It is more accurate than fuzzy matching, but slower, and it costs ~$0.01-0.02 per brand product.

Pipeline:

  1. Load brand products from brand_products
  2. For each brand product:
    • Candidate retrieval: Try pgvector similarity search first, fall back to ILIKE substring search
    • Attribute extraction: GPT-4o-mini extracts structured attributes (category, size, dimensions, material, etc.) from product names
    • Semantic matching: GPT-4o-mini compares brand product attributes against candidates, returns confidence 0-100 with reasoning
  3. Products above ai_active_threshold (80) → auto-approved
  4. Products between ai_pending_threshold (50) and active → pending review
  5. Write matches to product_groups

AI Prompt Architecture:

  • Extraction prompt: Dynamic — reads category list from industry profile (or hardcoded defaults). Extracts: category, size, size_ml, dimensions, material, pack_quantity, color, finish_type
  • Matching prompt: Compares two products with rules injected from industry profile. Returns: match (bool), confidence (0-100), reason (string)
  • Retry with backoff: Failed API calls retry with exponential backoff
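The retry-with-backoff behavior can be sketched as a small helper. The attempt count and base delay below are assumptions, not the actual ai_matcher.py values:

```python
# Minimal retry helper in the spirit of the ai_matcher.py retry logic:
# re-run a callable on failure, doubling the sleep each attempt.
import time

def with_backoff(fn, attempts: int = 3, base_delay: float = 1.0):
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise  # out of attempts: surface the error
            time.sleep(base_delay * (2 ** i))  # 1s, 2s, 4s, ...
```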

3. Hybrid Matching

Combines fuzzy and AI matching:

  1. Run fuzzy matching first (fast, cheap)
  2. Products with confidence 40-79% (uncertain zone) are sent to AI verification
  3. AI upgrades or downgrades the match confidence
  4. Best of both: fuzzy speed for clear matches, AI accuracy for edge cases
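The routing rule can be sketched as follows, assuming scores of 80+ from fuzzy matching are kept as-is and anything below 40 is dropped; `ai_verify` stands in for the AI verification call:

```python
# Sketch of hybrid routing: confident fuzzy scores skip the AI entirely,
# the 40-79 uncertain zone is escalated, low scores are rejected.
def route(fuzzy_score: float, ai_verify):
    if fuzzy_score >= 80:
        return ("fuzzy", fuzzy_score)          # confident: no AI call
    if 40 <= fuzzy_score < 80:
        return ("ai", ai_verify(fuzzy_score))  # uncertain: AI decides
    return ("rejected", fuzzy_score)           # too low to bother
```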

4. Vector Similarity (pgvector)

Used as the candidate retrieval step for AI matching, replacing slow ILIKE substring queries.

How it works:

  1. Products get text embeddings via OpenAI text-embedding-3-small (1536 dimensions)
  2. Embeddings stored in embedding vector(1536) columns on product_tracking and brand_products
  3. HNSW indexes enable fast approximate nearest-neighbor search
  4. match_products_by_embedding() RPC function returns top-N similar products
  5. Results are diversified across competitors and filtered by price range

Cost: ~$0.0001 per product (one-time at scrape/sync)
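The retrieval semantics can be illustrated with a brute-force sketch: cosine similarity against each stored embedding, a 0.3 similarity floor, and a per-competitor cap for diversification. The real query runs inside Postgres via the HNSW index, and the field names here are hypothetical:

```python
# Brute-force illustration of match_products_by_embedding() semantics;
# the 0.3 floor and per-competitor cap come from the text above.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_candidates(query, products, limit=5, min_sim=0.3, per_competitor=2):
    scored = sorted(((cosine(query, p["embedding"]), p) for p in products),
                    key=lambda t: -t[0])
    out, seen = [], {}
    for sim, p in scored:
        if sim < min_sim:
            break  # sorted descending: everything after is below the floor
        if seen.get(p["competitor"], 0) >= per_competitor:
            continue  # diversification: cap results per competitor
        seen[p["competitor"]] = seen.get(p["competitor"], 0) + 1
        out.append((p["name"], round(sim, 3)))
        if len(out) == limit:
            break
    return out
```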

5. Parallel Validation

A/B testing tool for comparing old (hardcoded) vs new (profile-driven) matchers:

  1. POST /api/match/validate with configurable sample_size (1-200)
  2. Runs both matchers on the same sample of brand products
  3. Produces per-product verdicts:
    • both_match_same — same match, same confidence range
    • both_match_different — both matched but different confidence
    • old_only — old matcher found a match, new didn't
    • new_only — new matcher found a match, old didn't
    • both_nomatch — neither found a match
  4. Results stored as JSON in the session row (read-only, no product_groups writes)
  5. Retrieve via GET /api/match/validate/{session_id}
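A hedged sketch of the per-product verdict classification; the 10-point tolerance used to decide "same confidence range" is an assumption, not the documented value:

```python
# Sketch of the parallel-validation verdict logic. A match is a dict with
# product_id and confidence; None means the matcher found nothing.
def verdict(old_match, new_match, tolerance=10):
    if old_match is None and new_match is None:
        return "both_nomatch"
    if old_match is None:
        return "new_only"
    if new_match is None:
        return "old_only"
    same_product = old_match["product_id"] == new_match["product_id"]
    close = abs(old_match["confidence"] - new_match["confidence"]) <= tolerance
    return "both_match_same" if (same_product and close) else "both_match_different"
```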

Candidate Retrieval

The matcher needs to find potential competitor matches for each brand product. There are two strategies:

Vector Search (preferred when embeddings exist)

Brand product text → embedding → pgvector cosine similarity → top candidates
  • Uses match_products_by_embedding() RPC function
  • Threshold: 0.3 similarity minimum
  • Diversifies across competitors (max results per competitor)
  • Filters by price range (10x ratio guard)

ILIKE Fallback

Brand product title → keyword extraction → ILIKE '%keyword%' queries
  • Uses category keywords and search terms from industry profile
  • Slower and less accurate than vector search
  • Always available (no embedding generation needed)
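The keyword-extraction step of the fallback might look like the sketch below; the stop-word list, minimum word length, and two-keyword cap are all illustrative, not the real profile-driven behavior:

```python
# Sketch of building ILIKE patterns from a brand title: drop stop words and
# very short tokens, keep the first few informative keywords.
STOP_WORDS = {"the", "a", "of", "with", "and", "for"}

def ilike_patterns(title: str, max_terms: int = 2):
    words = [w for w in title.lower().split()
             if w not in STOP_WORDS and len(w) > 2]
    return [f"%{w}%" for w in words[:max_terms]]
```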

Price Guards

A 10x price ratio limit prevents absurd matches:

  • If brand product is $5 and competitor is $100, they won't match
  • Configurable via scoring_config.price_ratio_limit
  • Price proximity within 20% gets a +5 bonus
  • Extreme price divergence (>5x) gets a -50 penalty
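The guard and proximity scoring above can be combined into one sketch; the 10x hard limit, 20% bonus band, and 5x penalty band are from the text, while the function shape is illustrative:

```python
# Sketch of the price guard: None means hard-reject (never match), otherwise
# return a score adjustment based on price proximity.
def price_guard(brand_price: float, comp_price: float, ratio_limit: float = 10.0):
    ratio = max(brand_price, comp_price) / min(brand_price, comp_price)
    if ratio > ratio_limit:
        return None   # e.g. $5 vs $100 (20x): never match
    if ratio <= 1.2:
        return +5     # within 20%: proximity bonus
    if ratio > 5:
        return -50    # extreme divergence penalty
    return 0
```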

Stale Product Detection

Products scraped more than 30 days ago receive a -5 confidence penalty. This prevents matching against outdated pricing data.

Confidence Distribution Tracking

Every matching session tracks the distribution of confidence scores:

  • High: 80-100%
  • Medium: 50-79%
  • Low: below 50%

This helps measure matching quality across runs.
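The bucketing above is simple enough to sketch directly:

```python
# Sketch of per-session confidence bucketing using the thresholds above.
from collections import Counter

def bucket(confidence: float) -> str:
    if confidence >= 80:
        return "high"
    if confidence >= 50:
        return "medium"
    return "low"

def distribution(scores):
    return Counter(bucket(s) for s in scores)
```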

Feature Flag: use_new_matcher

The industry_profiles table has a use_new_matcher boolean (default false):

  • false: All matchers use hardcoded logic (original behavior)
  • true: Matchers read categories, scoring weights, and rules from the industry profile

This allows safe rollout — validate with parallel validation first, then flip the flag.

Session 33 Bugs Fixed (2026-03-18)

Four critical silent-failure bugs (three in ai_matcher.py, plus a matching_logs schema gap) were discovered during end-to-end testing with real packaging competitors:

  1. Non-existent image_url column: find_candidates() selected image_url from product_tracking, where the column doesn't exist. PostgREST returned 400 on every ILIKE query, silently caught by try/except → 0 candidates found.
  2. Missing tenant_id in product_groups insert: inserts failed against the NOT NULL FK constraint, so matches were found but never saved.
  3. Non-existent status/price_comparison columns: the insert included fields that don't exist on product_groups. Fixed by removing status and renaming price_comparison to differences.
  4. matching_logs missing tenant_id column: the table lacked the column entirely, causing all log inserts to fail silently. Fixed via ALTER TABLE.

Result: 19 product matches at 80-92% confidence across 4 competitors from 6,897 scraped products.

use_new_matcher Status

As of Session 33, use_new_matcher is true for the production tenant (cannabis_packaging profile).

PPU Normalization (Session 34)

Price-per-unit normalization enables fair comparison between products sold at different pack sizes (e.g., brand sells 100-packs, competitor sells 1000-count).

How It Works

  1. Variant Quantity Extraction (extract_variant_quantity() in price_utils.py):
    • Parses Shopify variant titles: "Pack of 100", "Case of 1000", "Single (1)", "24ct", "Bundle of 20"
    • Falls back to product title parsing via extract_quantity() for competitor products
  2. Brand PPU Computation (compute_brand_product_ppu()):
    • Analyzes all variants of a brand product
    • Picks the variant whose price matches the base price
    • Computes PPU = price / pack_quantity
  3. Closest Variant Selection (find_closest_variant_to_quantity()):
    • When comparing against a competitor with a known pack size (e.g., 1000ct)
    • Finds the brand variant with the closest pack quantity
    • Used by AI matcher for accurate per-unit comparison
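The quantity parsing and PPU math can be sketched as below. The regexes cover the example titles from the text and are not the real price_utils.py patterns:

```python
# Hedged sketch of extract_variant_quantity()-style parsing plus the
# PPU = price / pack_quantity computation.
import re

_PATTERNS = [
    re.compile(r"(?:pack|case|bundle)\s+of\s+(\d+)", re.I),  # "Pack of 100"
    re.compile(r"(\d+)\s*(?:ct|count|pk|pack)\b", re.I),     # "24ct", "1000 Count"
    re.compile(r"single\s*\(\s*1\s*\)", re.I),               # "Single (1)"
]

def extract_variant_quantity(title: str) -> int:
    for pat in _PATTERNS:
        m = pat.search(title)
        if m:
            return int(m.group(1)) if m.groups() else 1
    return 1  # no quantity indicator: default pack_quantity=1

def ppu(price: float, pack_quantity: int) -> float:
    return round(price / max(pack_quantity, 1), 4)
```

With the example from the text, a $24.99 100-pack and a $149.00 1000-count normalize to per-unit prices that are directly comparable.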

AI Matcher Integration

The AI matcher now computes PPU server-side instead of relying on GPT-4o-mini math:

  • Extracts competitor pack quantity from product name
  • Picks best brand variant for that pack size
  • Saves brand_pack_quantity, brand_ppu, matched_pack_quantity, matched_ppu to product_groups
  • comparison_method is always "ppu"

Frontend Display

  • Comparison page: Shows PPU when pack sizes differ (e.g., "$0.2499/unit vs $0.149/unit") with pack context ("100pk @ $24.99 vs 1000pk @ $149.00")
  • Products page: Pack Qty + PPU columns in table
  • Falls back to raw price comparison when pack sizes are equal

Scraper Integration

Both scraper and Shopify sync auto-compute pack_quantity + ppu on product insert:

  • Competitor products: extracted from product name ("1000 Count", "Pack of 50")
  • Brand products: extracted from Shopify variant titles
  • Products without quantity indicators default to pack_quantity=1