PDF Document Extraction Bot

Overview

A Python automation that searches a target website for companies, navigates to each company's document directory, and downloads every PDF it finds. Each file is hashed to detect duplicates, and the bot's IP location is rotated regularly through NordVPN to reduce the chance of bot detection.

Outcome: hands-off PDF collection at scale, with deduplication and IP rotation that keep the pipeline reliable over long runs.
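The hashing step can be illustrated with a minimal sketch. The project does not state which hash function it uses, so SHA-256 is an assumption here, and the names `file_digest` and `drop_duplicates` are hypothetical helpers introduced for illustration only.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        # Read fixed-size chunks so large PDFs are never loaded whole.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def drop_duplicates(paths, seen=None):
    """Keep the first file for each digest and delete later copies.

    `seen` is an optional set of digests from earlier runs (assumption:
    the caller decides how to persist it). Returns the paths that were kept.
    """
    seen = set() if seen is None else seen
    kept = []
    for path in paths:
        digest = file_digest(path)
        if digest in seen:
            path.unlink()  # same content already stored: remove this copy
        else:
            seen.add(digest)
            kept.append(path)
    return kept
```

Comparing content digests rather than file names means a PDF published under two different names, or by two different companies, is still stored only once.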

Architecture & Pipeline

flowchart LR
    n0["Scheduler
Recurring runs"]
    n1["Search Companies
Target site"]
    n2["Open Document Directory
Selenium"]
    n3["Download PDFs
Per-company batch"]
    n4["Hash & Dedupe
Drop duplicates"]
    n5["IP Rotation
NordVPN"]
    n6["Deliver Files
Dropbox · Mega"]
    n0 --> n1
    n1 --> n2
    n2 --> n3
    n3 --> n4
    n4 --> n5
    n5 --> n6
classDef step0 fill:#f1f5f9,stroke:#64748b,color:#1e293b,stroke-width:2px,rx:10,ry:10;
classDef step1 fill:#ecfeff,stroke:#06b6d4,color:#1e293b,stroke-width:2px,rx:10,ry:10;
classDef step2 fill:#f0fdfa,stroke:#0d9488,color:#1e293b,stroke-width:2px,rx:10,ry:10;
classDef step3 fill:#ecfdf5,stroke:#10b981,color:#1e293b,stroke-width:2px,rx:10,ry:10;
classDef step4 fill:#fffbeb,stroke:#f59e0b,color:#1e293b,stroke-width:2px,rx:10,ry:10;
    class n0 step0;
    class n1 step1;
    class n2 step1;
    class n3 step2;
    class n4 step3;
    class n5 step3;
    class n6 step4;

End-to-end flow derived from this project's scope and tech stack.
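The stage order in the diagram could be wired together roughly as follows. This is a sketch under assumptions, not the project's actual code: every stage is passed in as a callable, all parameter names are hypothetical, and rotating the IP once per company is a guess, since the source only says rotation happens regularly.

```python
from typing import Callable, Iterable, List


def run_pipeline(
    search_companies: Callable[[], Iterable[str]],
    list_pdf_urls: Callable[[str], Iterable[str]],
    download: Callable[[str], str],
    dedupe: Callable[[List[str]], List[str]],
    rotate_ip: Callable[[], None],
    deliver: Callable[[List[str]], None],
) -> int:
    """Run one scheduled pass and return the number of files delivered."""
    delivered = 0
    for company in search_companies():
        # Open the company's document directory and fetch its PDFs as a batch.
        files = [download(url) for url in list_pdf_urls(company)]
        unique = dedupe(files)   # hash and drop duplicates
        rotate_ip()              # assumption: one rotation per company
        deliver(unique)          # e.g. upload to cloud storage
        delivered += len(unique)
    return delivered
```

Passing the stages in as functions keeps the loop independent of Selenium, the VPN client, and the storage backend, so each stage can be replaced with a stub when testing.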

Key Features

Per-company directory navigation and full PDF download
Hash-based deduplication of downloaded files
Regular IP rotation via NordVPN
File delivery via Dropbox and Mega
Deployed on an Ubuntu remote desktop server

Tech Stack: Python, Selenium, Linux
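The IP rotation feature listed above might look like the sketch below. It assumes the NordVPN Linux command-line client is installed and logged in on the server, and it uses the client's `nordvpn connect <country>` command. The `make_rotator` helper, the `run` parameter, and the country list are hypothetical and only serve the illustration.

```python
import itertools
import subprocess
from typing import Callable, Iterable, Sequence


def make_rotator(
    countries: Iterable[str],
    run: Callable[[Sequence[str]], object] = lambda cmd: subprocess.run(cmd, check=True),
):
    """Return a function that connects to the next country on each call."""
    cycle = itertools.cycle(countries)  # loop over the list indefinitely

    def rotate() -> str:
        country = next(cycle)
        # Ask the NordVPN CLI to connect to a server in this country.
        run(["nordvpn", "connect", country])
        return country

    return rotate
```

A caller could build the rotator once, for example `rotate = make_rotator(["Germany", "Sweden", "Netherlands"])`, and call `rotate()` between batches. The injectable `run` argument lets the logic be exercised without a VPN client present.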
