Parts Data Aggregation Platform (50M+ Records)

Overview

Designed and built the backend and dashboard for a parts-catalog aggregator that ingests over 50 million product records from three high-volume supplier sites. The scraping pipeline handles each site's anti-bot protections and schema differences. Incremental updates feed a unified PostgreSQL catalog, and a FastAPI-powered monitoring dashboard tracks extraction health and data quality.

Outcome: Replaced fragmented manual extraction with a single 50M+ record catalog that the client can query, monitor, and refresh, with full visibility into per-source health and per-field data quality.
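As an illustration of what "per-field data quality" can mean in practice, the sketch below computes a fill rate per field: the fraction of records where the field is present and non-blank. This is a minimal example under stated assumptions, not the project's actual metric; the function name `field_fill_rates` and the field names in the usage note are hypothetical.

```python
from collections import Counter
from typing import Iterable, Mapping


def field_fill_rates(
    records: Iterable[Mapping[str, object]], fields: list[str]
) -> dict[str, float]:
    """Return, per field, the fraction of records with a non-empty value."""
    filled: Counter[str] = Counter()
    total = 0
    for record in records:
        total += 1
        for name in fields:
            value = record.get(name)
            # Treat None and blank strings as missing (an assumption of this sketch).
            if value is not None and str(value).strip() != "":
                filled[name] += 1
    if total == 0:
        # No records seen: report 0.0 rather than dividing by zero.
        return {name: 0.0 for name in fields}
    return {name: filled[name] / total for name in fields}
```

Called per source with something like `field_fill_rates(batch, ["sku", "price"])`, the resulting ratios could be charted over time so that a sudden drop in one field flags a broken selector on that site.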

Architecture & Pipeline

flowchart LR
    n0["Supplier Sites (×3)<br/>High-volume parts catalogs"]
    n1["Per-Site Scrapers<br/>Python · Selenium · Scrapy"]
    n2["Anti-Bot Handling<br/>Per-site protections handled"]
    n3["Schema Normalization<br/>Unify per-supplier data models"]
    n4["Incremental Updates<br/>Diff against last crawl"]
    n5["PostgreSQL Catalog<br/>50M+ unified records"]
    n6["FastAPI + Dashboard<br/>Health & data-quality monitoring"]
    n0 --> n1
    n1 --> n2
    n2 --> n3
    n3 --> n4
    n4 --> n5
    n5 --> n6
    classDef step0 fill:#f1f5f9,stroke:#64748b,color:#1e293b,stroke-width:2px,rx:10,ry:10;
    classDef step1 fill:#ecfeff,stroke:#06b6d4,color:#1e293b,stroke-width:2px,rx:10,ry:10;
    classDef step2 fill:#f0fdfa,stroke:#0d9488,color:#1e293b,stroke-width:2px,rx:10,ry:10;
    classDef step3 fill:#ecfdf5,stroke:#10b981,color:#1e293b,stroke-width:2px,rx:10,ry:10;
    classDef step4 fill:#fffbeb,stroke:#f59e0b,color:#1e293b,stroke-width:2px,rx:10,ry:10;
    class n0 step0;
    class n1 step1;
    class n2 step1;
    class n3 step2;
    class n4 step3;
    class n5 step3;
    class n6 step4;

End-to-end flow derived from this project's scope and tech stack.
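The schema normalization step in the flow can be sketched as one small mapping function per supplier, all producing the same record type. Everything specific here is an assumption for illustration: the `CatalogRecord` fields, the source labels `supplier_a` and `supplier_b`, and the raw field names are hypothetical, since the real supplier schemas are not described.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Callable, Mapping


@dataclass(frozen=True)
class CatalogRecord:
    """Unified shape every supplier record is mapped into (hypothetical)."""
    source: str
    sku: str
    name: str
    price: Decimal


def normalize_supplier_a(raw: Mapping[str, str]) -> CatalogRecord:
    # Hypothetical source A: plain numeric price string under "price_usd".
    return CatalogRecord(
        source="supplier_a",
        sku=raw["part_no"].strip().upper(),
        name=raw["title"].strip(),
        price=Decimal(raw["price_usd"]),
    )


def normalize_supplier_b(raw: Mapping[str, str]) -> CatalogRecord:
    # Hypothetical source B: display-formatted price such as "$1,234.50".
    cleaned = raw["price"].replace("$", "").replace(",", "")
    return CatalogRecord(
        source="supplier_b",
        sku=raw["SKU"].strip().upper(),
        name=raw["description"].strip(),
        price=Decimal(cleaned),
    )


# Registry keyed by source label; a new supplier adds one entry here.
NORMALIZERS: dict[str, Callable[[Mapping[str, str]], CatalogRecord]] = {
    "supplier_a": normalize_supplier_a,
    "supplier_b": normalize_supplier_b,
}


def normalize(source: str, raw: Mapping[str, str]) -> CatalogRecord:
    """Dispatch a raw scraped row to the normalizer for its source."""
    try:
        normalizer = NORMALIZERS[source]
    except KeyError:
        raise ValueError(f"no normalizer registered for source {source!r}") from None
    return normalizer(raw)
```

Keeping the per-supplier quirks inside their own functions means downstream stages only ever see `CatalogRecord`, which is one way to keep the pipeline open to additional sources.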

Key Features

  • Three-source scraping pipeline with per-site anti-bot handling
  • Incremental updates instead of full re-crawls (lower cost & latency)
  • Unified PostgreSQL schema reconciling differing supplier data models
  • Monitoring dashboard for extraction health and data quality
  • FastAPI backend exposing the catalog to downstream apps
  • Designed to scale to additional supplier sources
  • Tech Stack: Python, Selenium, Scrapy, PostgreSQL, FastAPI
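One common way to implement incremental updates of the kind listed above is to store a content hash per record and compare it against the hash from the previous crawl, so only new or changed rows are written. The sketch below assumes that approach; the project's actual diff strategy is not specified, and the names `content_hash` and `diff_crawl` are hypothetical.

```python
import hashlib
import json
from typing import Mapping


def content_hash(record: Mapping[str, object]) -> str:
    """Stable SHA-256 digest of a record, independent of key order."""
    payload = json.dumps(record, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def diff_crawl(
    previous_hashes: Mapping[str, str],
    current: Mapping[str, Mapping[str, object]],
) -> tuple[list[str], list[str], list[str]]:
    """Compare the current crawl against stored hashes.

    Returns sorted lists of keys that are (new, changed, removed).
    """
    new: list[str] = []
    changed: list[str] = []
    for key, record in current.items():
        digest = content_hash(record)
        if key not in previous_hashes:
            new.append(key)
        elif previous_hashes[key] != digest:
            changed.append(key)
    # Keys seen last time but absent now are candidates for removal.
    removed = [key for key in previous_hashes if key not in current]
    return sorted(new), sorted(changed), sorted(removed)
```

In a PostgreSQL-backed catalog, the `new` and `changed` keys would typically drive an upsert, while `removed` keys might be flagged rather than deleted so that a temporary scraping failure does not erase valid records.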