Data Engineering Book Scraper

Automated web scraping pipeline built with Apache Airflow to extract and process book data from Bol.com

Apache Airflow · BeautifulSoup · Docker · PostgreSQL

Project Architecture

Data Engineering Book Scraper Architecture Flow
1. Extract

Web scraping Bol.com using BeautifulSoup to extract book titles, authors, and prices from search results

2. Transform

Data cleaning and processing; little cleaning was needed in this case, since all scraped records already met the expected format

3. Load

Storing processed data in PostgreSQL database with automated table creation and data insertion
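As a minimal sketch of this load step, the snippet below uses Airflow's PostgresHook to insert scraped rows into a books(title, author, price) table; the table, column, and connection names here are illustrative assumptions, not taken from the repository.

# Load step sketch: insert scraped rows with Airflow's PostgresHook.
from airflow.providers.postgres.hooks.postgres import PostgresHook

def insert_book_data(ti):
    # Rows produced by the scraping task, pulled from XCom as (title, author, price) tuples.
    rows = ti.xcom_pull(task_ids="fetch_book_data")
    hook = PostgresHook(postgres_conn_id="books_connection")  # connection id is an assumption
    hook.insert_rows(
        table="books",
        rows=rows,
        target_fields=["title", "author", "price"],
    )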

Technical Deep Dive

Airflow DAG Structure

# DAG Dependencies
create_table_task >> fetch_book_data_task >> insert_book_data_task

# Task Flow:
# 1. PostgresOperator - Create table
# 2. PythonOperator   - Web scraping
# 3. PythonOperator   - Data insertion

Scheduled Execution

Daily automated runs with retry logic and error handling
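Putting the pieces together, here is a minimal sketch of how such a DAG could be declared with a daily schedule and retry settings. The dag id, task ids, connection id, and table schema are assumptions, and fetch_book_data / insert_book_data refer to the callables sketched elsewhere on this page.

# Minimal DAG sketch: daily schedule, retries, and the three-task chain shown above.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.postgres.operators.postgres import PostgresOperator

default_args = {"retries": 2, "retry_delay": timedelta(minutes=5)}  # retry logic for every task

with DAG(
    dag_id="book_scraper",               # dag id is an assumption
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",          # daily automated runs
    default_args=default_args,
    catchup=False,
) as dag:
    create_table_task = PostgresOperator(
        task_id="create_table",
        postgres_conn_id="books_connection",
        sql="CREATE TABLE IF NOT EXISTS books (title TEXT, author TEXT, price NUMERIC);",
    )
    fetch_book_data_task = PythonOperator(
        task_id="fetch_book_data",
        python_callable=fetch_book_data,   # scraping callable (see the XCom sketch below)
    )
    insert_book_data_task = PythonOperator(
        task_id="insert_book_data",
        python_callable=insert_book_data,  # insertion callable (see the load sketch above)
    )

    create_table_task >> fetch_book_data_task >> insert_book_data_task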

XCom Integration

Data passing between tasks using Airflow's XCom system
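A short sketch of that hand-off, assuming the task ids used above: whatever the scraping callable returns is pushed to XCom automatically, and the downstream task pulls it by task id (as in the load sketch earlier).

# XCom hand-off sketch (task ids are assumptions, matching the DAG sketch above).
def fetch_book_data():
    rows = [("Example Title", "Example Author", 19.99)]  # placeholder for the scraped rows
    return rows  # the return value is stored as an XCom under the key "return_value"

# Downstream, the insert task retrieves it by task id:
#   rows = ti.xcom_pull(task_ids="fetch_book_data")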

Web Scraping Logic

# BeautifulSoup implementation
import requests
from bs4 import BeautifulSoup

page = requests.get(search_url)  # search_url: the Bol.com search results URL
soup = BeautifulSoup(page.content, "html.parser")
results = soup.find("main", {"id": "mainContent"})
books = results.find_all("li", {"class": "product-item--row"})

Dynamic Parsing

Handles multiple authors and complex price formatting automatically
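For illustration, a sketch of the per-item parsing under the selectors assumed below; Bol.com's markup changes over time, so the real class names used in the project may differ.

# Per-item parsing sketch (class names are illustrative assumptions).
def parse_book(item):
    title = item.find("a", {"class": "product-title"}).get_text(strip=True)

    # A book can list several authors; join them into a single string.
    author_tags = item.find_all("a", {"class": "product-creator"})
    authors = ", ".join(a.get_text(strip=True) for a in author_tags)

    # Prices use a comma as decimal separator (e.g. "24,99" or "24,-"); normalise to float.
    price_text = item.find("span", {"class": "promo-price"}).get_text(strip=True)
    price = float(price_text.replace(",", ".").replace("-", "0"))

    return title, authors, price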

Data Validation

Duplicate removal and data type conversion for consistency
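One possible implementation of this step with pandas (the project may equally do this in plain Python):

# Validation sketch: deduplicate scraped rows and enforce a numeric price.
import pandas as pd

def clean_books(rows):
    df = pd.DataFrame(rows, columns=["title", "author", "price"])
    df = df.drop_duplicates(subset=["title", "author"])        # remove duplicate listings
    df["price"] = pd.to_numeric(df["price"], errors="coerce")  # enforce numeric prices
    df = df.dropna(subset=["price"])                           # drop rows whose price failed to parse
    return list(df.itertuples(index=False, name=None))         # back to (title, author, price) tuples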

Learning Journey & Adaptability

🚀 New Technology Mastery

Apache Airflow: Learned DAG creation, task dependencies, and workflow orchestration from scratch
PostgreSQL Hooks: Implemented database connections and automated data insertion patterns
Docker Orchestration: Containerized the entire pipeline for consistent deployment

🔄 Problem-Solving Approach

Adapted scraping logic to handle dynamic website structure changes
Optimized data processing pipeline for better performance

⚡ Technical Challenges Overcome

Complex HTML Parsing: Navigated nested DOM structures and handled inconsistent data formats
Data Pipeline Integration: Connected multiple technologies seamlessly with proper error handling
Production Deployment: Configured Docker containers and database connections for scalability

📚 Continuous Learning Mindset

Researched best practices for data engineering workflows
Applied software engineering principles to data pipelines
Embraced infrastructure-as-code with Docker containers

Key Features & Capabilities

Automated Scraping

Daily scheduled extraction with intelligent parsing

Database Integration

PostgreSQL storage with automated table management

Docker Deployment

Containerized infrastructure for easy deployment

Project Impact & Professional Growth

100% Automation Success: fully automated daily data collection
5+ New Technologies: mastered in a single project
24/7 System Reliability: continuous monitoring and execution

This project demonstrates my ability to quickly adapt to new technologies and build production-ready solutions. By combining web scraping, workflow orchestration, and database management, I created a comprehensive data pipeline that showcases both technical depth and practical problem-solving skills.