Parts Data Aggregation Platform (50M+ Records)
Overview
Designed and built the backend and dashboard for a parts-catalog aggregator that ingests more than 50 million product records from three high-volume supplier sites. The scraping pipeline handles each site's anti-bot protections and schema differences. Incremental updates feed a unified PostgreSQL catalog, and a FastAPI-powered dashboard monitors extraction health and data quality.
Outcome: Replaced fragmented manual extraction with a single 50M+ record catalog that the client can query, monitor, and refresh, with full visibility into per-source health and per-field data quality.
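One simple way to express per-field data quality is a fill rate: the share of records in which a field has a usable value. The sketch below is illustrative only and is not taken from the project's code; the function name `field_fill_rates` and the sample fields (`sku`, `price`, `brand`) are assumptions.

```python
from collections import Counter


def field_fill_rates(records, fields):
    """Return, per field, the fraction of records with a non-empty value."""
    filled = Counter()
    total = 0
    for record in records:
        total += 1
        for field in fields:
            value = record.get(field)
            # Treat None and blank strings as missing.
            if value is not None and str(value).strip() != "":
                filled[field] += 1
    if total == 0:
        return {field: 0.0 for field in fields}
    return {field: filled[field] / total for field in fields}


# Hypothetical sample of already-normalized catalog records.
sample = [
    {"sku": "A-100", "price": 12.5, "brand": "Acme"},
    {"sku": "A-101", "price": None, "brand": ""},
    {"sku": "B-200", "price": 7.0},
    {"sku": "B-201", "price": 3.25, "brand": "Bolt"},
]
rates = field_fill_rates(sample, ["sku", "price", "brand"])
print(rates)
```

Computed per source and per crawl, a metric like this makes it easy to spot the moment a supplier changes its page layout and a field quietly stops being extracted.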
Architecture & Pipeline
flowchart LR
n0["Supplier Sites (×3)<br/>High-volume parts catalogs"]
n1["Per-Site Scrapers<br/>Python · Selenium · Scrapy"]
n2["Anti-Bot Handling<br/>Per-site protections handled"]
n3["Schema Normalization<br/>Unify per-supplier data models"]
n4["Incremental Updates<br/>Diff against last crawl"]
n5["PostgreSQL Catalog<br/>50M+ unified records"]
n6["FastAPI + Dashboard<br/>Health & data-quality monitoring"]
n0 --> n1
n1 --> n2
n2 --> n3
n3 --> n4
n4 --> n5
n5 --> n6
classDef step0 fill:#f1f5f9,stroke:#64748b,color:#1e293b,stroke-width:2px,rx:10,ry:10;
classDef step1 fill:#ecfeff,stroke:#06b6d4,color:#1e293b,stroke-width:2px,rx:10,ry:10;
classDef step2 fill:#f0fdfa,stroke:#0d9488,color:#1e293b,stroke-width:2px,rx:10,ry:10;
classDef step3 fill:#ecfdf5,stroke:#10b981,color:#1e293b,stroke-width:2px,rx:10,ry:10;
classDef step4 fill:#fffbeb,stroke:#f59e0b,color:#1e293b,stroke-width:2px,rx:10,ry:10;
class n0 step0;
class n1 step1;
class n2 step1;
class n3 step2;
class n4 step3;
class n5 step3;
class n6 step4;
End-to-end flow derived from this project's scope and tech stack.
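The schema-normalization and incremental-update stages in the flow above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual implementation: the supplier names, the field maps in `FIELD_MAPS`, and the helpers `normalize`, `content_hash`, `snapshot`, and `diff_crawl` are all hypothetical. Each raw record is mapped onto a shared set of field names, hashed, and compared against the hashes stored from the previous crawl.

```python
import hashlib
import json

# Hypothetical per-supplier field maps: source field name -> unified field name.
FIELD_MAPS = {
    "supplier_a": {"partNo": "sku", "title": "name", "cost": "price"},
    "supplier_b": {"item_id": "sku", "description": "name", "unit_price": "price"},
}


def normalize(source, raw):
    """Map one raw supplier record onto the unified field names."""
    mapping = FIELD_MAPS[source]
    record = {unified: raw.get(src) for src, unified in mapping.items()}
    record["source"] = source
    return record


def content_hash(record):
    """Stable hash of a normalized record (keys sorted for determinism)."""
    payload = json.dumps(record, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def snapshot(source, raw_rows):
    """Build {(source, sku): hash} for one crawl of one supplier."""
    result = {}
    for raw in raw_rows:
        record = normalize(source, raw)
        result[(source, record["sku"])] = content_hash(record)
    return result


def diff_crawl(previous, current):
    """Compare two snapshots; return (new, changed, removed) key sets."""
    new = set(current.keys() - previous.keys())
    removed = set(previous.keys() - current.keys())
    changed = {
        key
        for key in current.keys() & previous.keys()
        if current[key] != previous[key]
    }
    return new, changed, removed


last_crawl = snapshot("supplier_a", [
    {"partNo": "X0", "title": "Clip", "cost": 0.5},
    {"partNo": "X1", "title": "Bolt", "cost": 1.0},
    {"partNo": "X2", "title": "Nut", "cost": 2.0},
])
this_crawl = snapshot("supplier_a", [
    {"partNo": "X1", "title": "Bolt", "cost": 1.0},   # unchanged
    {"partNo": "X2", "title": "Nut", "cost": 2.5},    # price changed
    {"partNo": "X3", "title": "Washer", "cost": 0.2}, # new
])
new, changed, removed = diff_crawl(last_crawl, this_crawl)
```

Only the keys in `new` and `changed` need to be written to the catalog, which is where the saving over a full re-crawl and reload would come from.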
Key Features
- Three-source scraping pipeline with per-site anti-bot handling
- Incremental updates instead of full re-crawls (lower cost & latency)
- Unified PostgreSQL schema reconciling differing supplier data models
- Monitoring dashboard for extraction health and data quality
- FastAPI backend exposing the catalog to downstream apps
- Designed to scale to additional supplier sources
Tech Stack: Python, Selenium, Scrapy, PostgreSQL, FastAPI
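To illustrate how incremental updates can land in a unified table without rewriting unchanged rows, here is a small upsert sketch. It uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL so that it runs without a database server; the `INSERT ... ON CONFLICT ... DO UPDATE ... WHERE` form shown is also valid PostgreSQL, though a PostgreSQL driver would use its own placeholder style. The `parts` table, its columns, and the `upsert_parts` helper are assumptions made for the example, not the project's real schema.

```python
import sqlite3

# Hypothetical unified table, keyed by (source, sku).
SCHEMA = """
CREATE TABLE parts (
    source       TEXT NOT NULL,
    sku          TEXT NOT NULL,
    name         TEXT,
    price        REAL,
    content_hash TEXT NOT NULL,
    revision     INTEGER NOT NULL DEFAULT 1,
    PRIMARY KEY (source, sku)
)
"""

# Insert new rows; update existing rows only when the content hash differs.
UPSERT = """
INSERT INTO parts (source, sku, name, price, content_hash)
VALUES (?, ?, ?, ?, ?)
ON CONFLICT (source, sku) DO UPDATE SET
    name = excluded.name,
    price = excluded.price,
    content_hash = excluded.content_hash,
    revision = parts.revision + 1
WHERE parts.content_hash <> excluded.content_hash
"""


def upsert_parts(conn, rows):
    """Apply one batch of normalized rows in a single transaction."""
    with conn:
        conn.executemany(UPSERT, rows)


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)

# First crawl: two parts.
upsert_parts(conn, [
    ("supplier_a", "X1", "Bolt", 1.0, "h1"),
    ("supplier_a", "X2", "Nut", 2.0, "h2"),
])
# Second crawl: X1 unchanged, X2 repriced, X3 new.
upsert_parts(conn, [
    ("supplier_a", "X1", "Bolt", 1.0, "h1"),
    ("supplier_a", "X2", "Nut", 2.5, "h2b"),
    ("supplier_a", "X3", "Washer", 0.2, "h3"),
])

rows = conn.execute(
    "SELECT sku, price, revision FROM parts ORDER BY sku"
).fetchall()
print(rows)
```

The `revision` column is there only to make the behaviour visible: it stays at 1 for the unchanged row and increments for the repriced one. Keying on `(source, sku)` also means a further supplier can be added without touching existing rows.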