
Customer Analytics Pipeline

Built a chunked processing pipeline for 10M+ records with schema validation and robust error handling.

Tags: Python, Pandas, Pandera, Pytest, Data Quality

Project Overview

A high-performance data pipeline designed to process over 10 million purchase records under strict memory constraints, with rigorous data quality checks at every stage.

Key Implementations

  • Optimized Processing: Built a chunked data processing pipeline to analyze 10M+ purchase records with bounded memory usage, ensuring stability on constrained resources.
  • Schema Validation: Implemented schema validation and business rules using Pandera, systematically catching missing, malformed, and inconsistent records before aggregation.
  • Data Normalization: Performed country normalization (using PyCountry), cleaning, and complex aggregations to produce accurate customer-level and regional metrics.
  • Robustness: Added comprehensive unit tests (Pytest) and structured logging, ensuring deterministic results and enabling safe refactoring as data volume scales.
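
The chunked-processing approach above can be sketched with pandas' `chunksize` streaming; only per-customer running totals stay in memory, never the full dataset. Column names (`customer_id`, `amount`) and the chunk size are illustrative assumptions, not the project's actual schema:

```python
import io
import pandas as pd

def aggregate_purchases(csv_source, chunksize=100_000):
    """Stream purchase records in fixed-size chunks and accumulate
    running per-customer totals, keeping memory usage bounded.

    Column names here are hypothetical placeholders.
    """
    totals: dict[str, float] = {}
    for chunk in pd.read_csv(csv_source, chunksize=chunksize):
        # Aggregate within the chunk first, then fold into the running totals.
        grouped = chunk.groupby("customer_id")["amount"].sum()
        for cid, amount in grouped.items():
            totals[cid] = totals.get(cid, 0.0) + float(amount)
    return totals

# Usage: works the same on an in-memory buffer or a multi-gigabyte file.
sample = io.StringIO("customer_id,amount\nA,10\nB,5\nA,2.5\n")
result = aggregate_purchases(sample, chunksize=2)
```

Because each chunk is reduced before the next one is read, peak memory scales with `chunksize` and the number of distinct customers, not with total record count.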

This project demonstrates a focus on production-grade data engineering practices, emphasizing reliability, testability, and efficiency.