Data Engineering Book Scraper

Automated web scraping pipeline built with Apache Airflow to extract and process book data from Bol.com

Apache Airflow · BeautifulSoup · Docker · PostgreSQL

Project Architecture

Data Engineering Book Scraper Architecture Flow
1. Extract

Web scraping Bol.com using BeautifulSoup to extract book titles, authors, and prices from search results

2. Transform

Data cleaning and processing; little cleaning was needed in this case, since all scraped records already met the expected format

3. Load

Storing processed data in PostgreSQL database with automated table creation and data insertion
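As a minimal sketch of this load step, the snippet below uses Airflow's PostgresHook to insert scraped rows into a books(title, author, price) table; the table, column, and connection names here are illustrative assumptions, not taken from the repository.

# Load step sketch: insert scraped rows with Airflow's PostgresHook.
from airflow.providers.postgres.hooks.postgres import PostgresHook

def insert_book_data(ti):
    # Rows produced by the scraping task, pulled from XCom as (title, author, price) tuples.
    rows = ti.xcom_pull(task_ids="fetch_book_data")
    hook = PostgresHook(postgres_conn_id="books_connection")  # connection id is an assumption
    hook.insert_rows(
        table="books",
        rows=rows,
        target_fields=["title", "author", "price"],
    )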

Technical Deep Dive

Airflow DAG Structure

# DAG Dependencies
create_table_task >> fetch_book_data_task >> insert_book_data_task

# Task Flow:
# 1. PostgresOperator - Create table
# 2. PythonOperator   - Web scraping
# 3. PythonOperator   - Data insertion

Scheduled Execution

Daily automated runs with retry logic and error handling
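Putting the pieces together, here is a minimal sketch of how such a DAG could be declared with a daily schedule and retry settings. The dag id, task ids, connection id, and table schema are assumptions, and fetch_book_data / insert_book_data refer to the callables sketched elsewhere on this page.

# Minimal DAG sketch: daily schedule, retries, and the three-task chain shown above.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.postgres.operators.postgres import PostgresOperator

default_args = {"retries": 2, "retry_delay": timedelta(minutes=5)}  # retry logic for every task

with DAG(
    dag_id="book_scraper",               # dag id is an assumption
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",          # daily automated runs
    default_args=default_args,
    catchup=False,
) as dag:
    create_table_task = PostgresOperator(
        task_id="create_table",
        postgres_conn_id="books_connection",
        sql="CREATE TABLE IF NOT EXISTS books (title TEXT, author TEXT, price NUMERIC);",
    )
    fetch_book_data_task = PythonOperator(
        task_id="fetch_book_data",
        python_callable=fetch_book_data,   # scraping callable (see the XCom sketch below)
    )
    insert_book_data_task = PythonOperator(
        task_id="insert_book_data",
        python_callable=insert_book_data,  # insertion callable (see the load sketch above)
    )

    create_table_task >> fetch_book_data_task >> insert_book_data_task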

XCom Integration

Data passing between tasks using Airflow's XCom system
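A short sketch of that hand-off, assuming the task ids used above: whatever the scraping callable returns is pushed to XCom automatically, and the downstream task pulls it by task id (as in the load sketch earlier).

# XCom hand-off sketch (task ids are assumptions, matching the DAG sketch above).
def fetch_book_data():
    rows = [("Example Title", "Example Author", 19.99)]  # placeholder for the scraped rows
    return rows  # the return value is stored as an XCom under the key "return_value"

# Downstream, the insert task retrieves it by task id:
#   rows = ti.xcom_pull(task_ids="fetch_book_data")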

Web Scraping Logic

# BeautifulSoup implementation
import requests
from bs4 import BeautifulSoup

page = requests.get(search_url)  # search_url: the Bol.com search results URL
soup = BeautifulSoup(page.content, "html.parser")
results = soup.find("main", {"id": "mainContent"})
books = results.find_all("li", {"class": "product-item--row"})

Dynamic Parsing

Handles multiple authors and complex price formatting automatically
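For illustration, a sketch of the per-item parsing under the selectors assumed below; Bol.com's markup changes over time, so the real class names used in the project may differ.

# Per-item parsing sketch (class names are illustrative assumptions).
def parse_book(item):
    title = item.find("a", {"class": "product-title"}).get_text(strip=True)

    # A book can list several authors; join them into a single string.
    author_tags = item.find_all("a", {"class": "product-creator"})
    authors = ", ".join(a.get_text(strip=True) for a in author_tags)

    # Prices use a comma as decimal separator (e.g. "24,99" or "24,-"); normalise to float.
    price_text = item.find("span", {"class": "promo-price"}).get_text(strip=True)
    price = float(price_text.replace(",", ".").replace("-", "0"))

    return title, authors, price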

Data Validation

Duplicate removal and data type conversion for consistency
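One possible implementation of this step with pandas (the project may equally do this in plain Python):

# Validation sketch: deduplicate scraped rows and enforce a numeric price.
import pandas as pd

def clean_books(rows):
    df = pd.DataFrame(rows, columns=["title", "author", "price"])
    df = df.drop_duplicates(subset=["title", "author"])        # remove duplicate listings
    df["price"] = pd.to_numeric(df["price"], errors="coerce")  # enforce numeric prices
    df = df.dropna(subset=["price"])                           # drop rows whose price failed to parse
    return list(df.itertuples(index=False, name=None))         # back to (title, author, price) tuples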

Learning Journey & Adaptability

🚀 New Technology Mastery

Apache Airflow: Learned DAG creation, task dependencies, and workflow orchestration from scratch
PostgreSQL Hooks: Implemented database connections and automated data insertion patterns
Docker Orchestration: Containerized the entire pipeline for consistent deployment

🔄 Problem-Solving Approach

Adapted scraping logic to handle dynamic website structure changes
Optimized data processing pipeline for better performance

⚡ Technical Challenges Overcome

Complex HTML Parsing: Navigated nested DOM structures and handled inconsistent data formats
Data Pipeline Integration: Connected multiple technologies seamlessly with proper error handling
Production Deployment: Configured Docker containers and database connections for scalability

📚 Continuous Learning Mindset

Researched best practices for data engineering workflows
Applied software engineering principles to data pipelines
Embraced infrastructure-as-code with Docker containers

Key Features & Capabilities

Automated Scraping

Daily scheduled extraction with intelligent parsing

Database Integration

PostgreSQL storage with automated table management

Docker Deployment

Containerized infrastructure for easy deployment

Project Impact & Professional Growth

100% Automation Success: fully automated daily data collection
5+ New Technologies: mastered in a single project
24/7 System Reliability: continuous monitoring and execution

This project demonstrates my ability to quickly adapt to new technologies and build production-ready solutions. By combining web scraping, workflow orchestration, and database management, I created a comprehensive data pipeline that showcases both technical depth and practical problem-solving skills.