Grocery Store Web Scraper (Spoonful Inc.)

Overview

A large-scale data extraction pipeline built for Spoonful Inc. to centralize product data from five major grocery retailers: Kroger, Walmart Food, Tesco, Tesco.ie, and Woolworths. The scrapers capture ingredients, allergens, nutrition facts, and product metadata while complying with each site's access structure.

Outcome: automated multi-region grocery data collection at 99% accuracy, replacing manual research and powering Spoonful's analytics and price-comparison features.

Architecture & Pipeline

flowchart LR
    n0["Source Sites<br/>Kroger · Walmart · Tesco · Tesco.ie · Woolworths"]
    n1["Custom Scrapers<br/>Scrapy · Selenium · BeautifulSoup"]
    n2["Anti-Bot Layer<br/>Rotating proxies · NordVPN · Anti-Captcha"]
    n3["Filter Non-Food<br/>Relevance check"]
    n4["Standardize Fields<br/>GTIN / UPC normalization"]
    n5["Quality Gates<br/>Retry · logging · rate-limit"]
    n6["Deliver Datasets<br/>CSV / Excel for Spoonful Inc."]

    n0 --> n1
    n1 --> n2
    n2 --> n3
    n3 --> n4
    n4 --> n5
    n5 --> n6

    classDef step0 fill:#f1f5f9,stroke:#64748b,color:#1e293b,stroke-width:2px,rx:10,ry:10;
    classDef step1 fill:#ecfeff,stroke:#06b6d4,color:#1e293b,stroke-width:2px,rx:10,ry:10;
    classDef step2 fill:#f0fdfa,stroke:#0d9488,color:#1e293b,stroke-width:2px,rx:10,ry:10;
    classDef step3 fill:#ecfdf5,stroke:#10b981,color:#1e293b,stroke-width:2px,rx:10,ry:10;
    classDef step4 fill:#fffbeb,stroke:#f59e0b,color:#1e293b,stroke-width:2px,rx:10,ry:10;

    class n0 step0;
    class n1 step1;
    class n2 step1;
    class n3 step2;
    class n4 step3;
    class n5 step3;
    class n6 step4;

End-to-end flow derived from this project's scope and tech stack.
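
As an illustration of the "Standardize Fields" step, here is a minimal sketch of GTIN/UPC normalization. It is not the project's actual code: the function names (`gs1_check_digit`, `normalize_gtin`) and the choice of a zero-padded 14-digit output are assumptions made for the example. It strips separators, accepts the standard GTIN lengths (8, 12, 13, 14), verifies the GS1 mod-10 check digit, and pads to GTIN-14 so that the same product matches across retailers.

```python
import re
from typing import Optional

# Standard GTIN lengths: GTIN-8, UPC-A (12), EAN-13, GTIN-14.
VALID_LENGTHS = {8, 12, 13, 14}


def gs1_check_digit(body: str) -> int:
    """Compute the GS1 mod-10 check digit for the digits that precede it."""
    total = 0
    # Weights alternate 3, 1, 3, ... starting from the rightmost body digit.
    for i, ch in enumerate(reversed(body)):
        total += int(ch) * (3 if i % 2 == 0 else 1)
    return (10 - total % 10) % 10


def normalize_gtin(raw: Optional[str]) -> Optional[str]:
    """Return a zero-padded GTIN-14 string, or None if the code is invalid.

    Hypothetical helper: the 14-digit target format is an assumption.
    """
    digits = re.sub(r"\D", "", raw or "")  # drop spaces, hyphens, labels
    if len(digits) not in VALID_LENGTHS:
        return None
    if gs1_check_digit(digits[:-1]) != int(digits[-1]):
        return None
    # Leading zeros do not change the check digit, so padding is safe.
    return digits.zfill(14)


print(normalize_gtin("0 36000 29145 2"))   # UPC-A with separators
print(normalize_gtin("4006381333931"))     # EAN-13
print(normalize_gtin("036000291453"))      # wrong check digit
```

Codes that a retailer publishes without a check digit or with leading zeros stripped would be rejected by this sketch; handling those cases would need per-source rules.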

Key Features

  • Custom scrapers for five major grocery retailers
  • Filters to exclude non-food items and keep data relevant
  • Anti-bot handling with rotating proxies, NordVPN, and Anti-Captcha
  • API reverse-engineering for higher speed and accuracy
  • Standardized GTIN/UPC fields across all sources
  • Clean CSV/Excel deliverables with retry, logging, and rate-limiting
  • Tech Stack: Python, Scrapy, BeautifulSoup, Selenium, Requests, Pandas
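
The retry, logging, and rate-limiting behavior listed above could look roughly like the following. This is a sketch under stated assumptions, not the project's implementation: `RateLimiter`, `fetch_with_retry`, and their parameters are hypothetical names, and `fetch` stands in for whatever callable performs the request (for example, a wrapper around `requests.get`). Delays double after each failed attempt (exponential backoff), and the limiter enforces a minimum gap between requests.

```python
import logging
import time
from typing import Any, Callable, Optional

logger = logging.getLogger("scraper")


class RateLimiter:
    """Enforce a minimum interval (in seconds) between consecutive calls."""

    def __init__(self, min_interval: float,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], Any] = time.sleep) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None

    def wait(self) -> None:
        """Block until at least min_interval has passed since the last call."""
        if self._last is not None:
            remaining = self.min_interval - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()


def fetch_with_retry(fetch: Callable[[str], Any], url: str,
                     limiter: RateLimiter, max_attempts: int = 3,
                     base_delay: float = 1.0,
                     sleep: Callable[[float], Any] = time.sleep) -> Any:
    """Call fetch(url), retrying with exponential backoff on any exception."""
    for attempt in range(1, max_attempts + 1):
        limiter.wait()  # respect the per-site request rate
        try:
            return fetch(url)
        except Exception as exc:
            logger.warning("attempt %d/%d failed for %s: %s",
                           attempt, max_attempts, url, exc)
            if attempt == max_attempts:
                raise  # give up and surface the last error
            sleep(base_delay * 2 ** (attempt - 1))  # 1x, 2x, 4x, ...
```

Passing `sleep` and `clock` as parameters keeps the helpers testable without real waiting; in production the defaults (`time.sleep`, `time.monotonic`) apply. A real scraper would likely retry only on specific errors (timeouts, HTTP 429/5xx) rather than on every exception.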