Data Engineering Book Scraper
Automated web scraping pipeline built with Apache Airflow to extract and process book data from Bol.com
Project Architecture

Extract
Web scraping Bol.com using BeautifulSoup to extract book titles, authors, and prices from search results
Transform
Data cleaning and processing. Little cleaning was needed in this case, since every scraped book already fit the expected format; the step is limited to type conversion and duplicate removal
Load
Storing processed data in PostgreSQL database with automated table creation and data insertion
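A minimal sketch of what this load step could look like with Airflow's PostgresHook; the connection id, table name, and columns are assumptions based on the fields scraped above:

from airflow.providers.postgres.hooks.postgres import PostgresHook

def insert_book_data(rows):
    # "books_db" is a hypothetical Airflow connection id
    hook = PostgresHook(postgres_conn_id="books_db")
    hook.run("""
        CREATE TABLE IF NOT EXISTS books (
            title  TEXT,
            author TEXT,
            price  NUMERIC
        );
    """)
    hook.insert_rows(table="books", rows=rows,
                     target_fields=["title", "author", "price"])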
Technical Deep Dive
Airflow DAG Structure
# DAG dependencies
create_table_task >> fetch_book_data_task >> insert_book_data_task

# Task flow:
# 1. PostgresOperator - create table
# 2. PythonOperator   - web scraping
# 3. PythonOperator   - data insertion
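A condensed sketch of how such a DAG could be assembled; the operator arguments, ids, and stubbed callables are assumptions inferred from the task flow above:

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.postgres.operators.postgres import PostgresOperator

def fetch_book_data(**context): ...   # scraping logic, sketched below
def insert_book_data(**context): ...  # load logic, sketched above

with DAG(
    dag_id="book_scraper",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    create_table_task = PostgresOperator(
        task_id="create_table",
        postgres_conn_id="books_db",  # hypothetical connection id
        sql="CREATE TABLE IF NOT EXISTS books (title TEXT, author TEXT, price NUMERIC);",
    )
    fetch_book_data_task = PythonOperator(task_id="fetch_book_data",
                                          python_callable=fetch_book_data)
    insert_book_data_task = PythonOperator(task_id="insert_book_data",
                                           python_callable=insert_book_data)

    create_table_task >> fetch_book_data_task >> insert_book_data_task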
Scheduled Execution
Daily automated runs with retry logic and error handling
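In Airflow, retry behaviour is usually configured through default_args; a sketch with plausible values (the exact numbers are assumptions):

from datetime import timedelta

default_args = {
    "retries": 3,                         # re-run a failed task up to three times
    "retry_delay": timedelta(minutes=5),  # pause between attempts
}
# dag = DAG(..., default_args=default_args, schedule_interval="@daily")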
XCom Integration
Data passing between tasks using Airflow's XCom system
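A PythonOperator's return value is pushed to XCom automatically, so the downstream task only needs to pull it; a minimal sketch (task ids are assumptions):

def fetch_book_data():
    # The returned list is auto-pushed to XCom under the key "return_value"
    return [("Some Title", "Some Author", 19.99)]

def insert_book_data(ti):
    # Airflow injects the TaskInstance; pull the rows pushed by the upstream task
    rows = ti.xcom_pull(task_ids="fetch_book_data")
    print(f"Inserting {len(rows)} rows")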
Web Scraping Logic
# BeautifulSoup implementation
import requests
from bs4 import BeautifulSoup

page = requests.get(search_url)  # search_url: the Bol.com search-results URL
soup = BeautifulSoup(page.content, "html.parser")

# Narrow to the main content area, then collect one <li> per book
results = soup.find("main", {"id": "mainContent"})
books = results.find_all("li", {"class": "product-item--row"})
Dynamic Parsing
Handles multiple authors and complex price formatting automatically
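For illustration, a parser along these lines would join multiple creators into one author field and normalise Bol.com's comma-decimal prices; the CSS classes here are assumptions, not the production selectors:

def parse_book(item):
    title = item.find("a", {"class": "product-title"}).get_text(strip=True)

    # A book can list several creators; join them into a single field
    authors = [a.get_text(strip=True)
               for a in item.find_all("a", {"class": "product-creator"})]
    author = ", ".join(authors)

    # Dutch price formatting, e.g. "19,99" -> 19.99
    raw_price = item.find("span", {"class": "promo-price"}).get_text(strip=True)
    price = float(raw_price.replace(",", "."))

    return title, author, price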
Data Validation
Duplicate removal and data type conversion for consistency
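A sketch of that validation step using pandas (the use of pandas is an assumption; plain Python sets and float() would work equally well):

import pandas as pd

def clean_books(rows):
    df = pd.DataFrame(rows, columns=["title", "author", "price"])
    df = df.drop_duplicates()                                  # drop duplicate listings
    df["price"] = pd.to_numeric(df["price"], errors="coerce")  # enforce numeric prices
    return df.dropna(subset=["price"])                         # discard rows that failed conversion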
Learning Journey & Adaptability
🚀 New Technology Mastery
🔄 Problem-Solving Approach
⚡ Technical Challenges Overcome
📚 Continuous Learning Mindset
Key Features & Capabilities
Automated Scraping
Daily scheduled extraction with intelligent parsing
Database Integration
PostgreSQL storage with automated table management
Docker Deployment
Containerized infrastructure for easy deployment
Project Impact & Professional Growth
This project demonstrates my ability to quickly adapt to new technologies and build production-ready solutions. By combining web scraping, workflow orchestration, and database management, I created a comprehensive data pipeline that showcases both technical depth and practical problem-solving skills.