
DataOps_Culture_and_Tools

#dataops #git #testing #tdd #cicd

Day 1 Notes: DataOps Fundamentals

What is DataOps?

[!NOTE] DataOps is a discipline at the intersection of data engineering, DevOps, and agile analytics.

Focus: Creating robust, automated, production-grade data pipelines that deliver clean, governed, and ML-ready data at scale.

Core Practices:

  • Version Control (Git)
  • Testing (Unit/Integration/Data Quality)
  • Continuous Delivery (CI/CD)
  • Observability & Feedback Loops

Who is the DataOps Engineer?

A bridge between infrastructure, data science, and product teams.

  • Build: Transform raw, messy data into feature-rich, trustworthy assets.
  • Automate: Ensure pipelines are automated, versioned, tested, and observed.

How to Stand Out?

  1. Think Systems, Not Scripts: Architect solutions, employ modular design, and use orchestration (e.g., Airflow).
  2. Automate Everything: If it runs twice, script it. From validation to deployment.
  3. Build for Quality: Frequent testing, schema checks, and observability. Trust is currency.
  4. Collab like a SWE: Git workflows, code reviews, reproducible environments.
  5. Understand End Users: Data must be usable and trustworthy for the consumer (Dashboard/DS).
  6. Stay Curious: Read docs, contribute to open source, keep learning.

Soft Skills

  • Communication
  • Analytical thinking
  • Collaboration
  • Patience and rigor

Pillars of DataOps

  • Collaboration: Breaking silos between data, dev, and ops.
  • Automation: Reducing manual toil.
  • CI/CD: Continuous Integration and Continuous Deployment.
  • Metrics and Monitoring: Knowing health and performance.
  • Data Quality and Governance: Ensuring reliability and compliance.

Virtual environment

An isolated working copy of Python that lets you install packages and dependencies separately from the system Python or other projects (a Python sandbox, if you will). You get to:

  • control which packages are installed
  • avoid conflicts between projects
  • ensure your project behaves the same on another machine

Why

In real-life pipelines, virtual environments help you:

  • avoid "it works on my machine" problems
  • enable version control for dependencies
  • prevent bugs that only show up in production when ML pipelines or workflows are deployed without isolation
  • escape the dependency hell caused by global installations

When

always

How

  • a new folder is created, called venv or .venv
  • the folder contains a local Python interpreter and its own copy of pip
  • activating the environment tells python and pip to use that isolated setup

Creating and using a venv

Step 1: create the env

python -m venv .venv

Step 2: activate the venv

source .venv/bin/activate

Step 3: install dependencies

pip install [pandas](/brain/Python_for_Data_Engineering) flask

Step 4: deactivate

deactivate

What is reproducibility

Reproducibility means that anyone can run the code with the same dependencies and get the same results.


Tools for reproducibility

requirements.txt

Generate a text file that pins the version of each dependency:

pip freeze > requirements.txt

The requirements.txt freezes the exact versions; anyone can reproduce the environment by simply running:

pip install -r requirements.txt

pyproject.toml (more modern)

Used in modern packaging systems like Poetry; it defines:

  • project metadata
  • exact dependencies
  • build backend
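A minimal, purely illustrative sketch of what such a file can look like (the project name, version pins, and build backend here are hypothetical, not taken from a real project):

[project]
name = "dataops-pipeline"          # hypothetical project name
version = "0.1.0"
dependencies = [
    "pandas>=2.0",                 # example pins, not real requirements
    "flask>=3.0",
]

[build-system]
requires = ["setuptools>=68"]
build-backend = "setuptools.build_meta"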

Git Cheatsheet for DataOps

[!TIP] Treat your data pipelines like production code. Version control is non-negotiable.

Configuration

git config --global user.name "Your Name"
git config --global user.email "you@example.com"
git config --global core.editor "code --wait"
git config --global init.defaultBranch main

Basic Workflow

  1. Initialize: git init
  2. Status: git status (Check what's changed)
  3. Stage: git add . (Add all) or git add filename
  4. Commit: git commit -m "feat: add data validation step"
    • Use conventional commits: feat:, fix:, docs:, chore:.

Branching & Merging

  • Create Branch: git checkout -b feature/new-pipeline
  • Switch Branch: git checkout main
  • Merge: git merge feature/new-pipeline

Undoing Changes

  • Unstage file: git restore --staged filename
  • Discard local changes: git restore filename
  • Amend last commit: git commit --amend

Remote (GitHub)

  • Add Remote: git remote add origin <url>
  • Push: git push -u origin main
  • Pull: git pull origin main

.gitignore

Always ignore large data files, credentials, and virtual environments.

__pycache__/
*.csv
*.parquet
.env
.DS_Store
venv/

Foundations of Testing for Preprocessing Pipelines

Testing a data pipeline means writing automated checks to ensure:

  1. The input data is valid
  2. The transformations are applied correctly
  3. The output conforms to expected structure and values

There are several kinds of tests that apply to data workflows:

| Test type | Description |
| --- | --- |
| Unit Test | Validates one transformation step (e.g. scaling logic) |
| Integration Test | Checks a full preprocessing pipeline end-to-end |
| Regression Test | Compares new pipeline output with golden datasets |
| Contract Test | Asserts schema expectations (columns, dtypes, ranges) |
| Smoke Test | Verifies the pipeline runs successfully on a small sample |

Why?

In traditional software, a bug might crash a feature. But in data pipelines, a bug can:

  • Corrupt millions of rows
  • Break downstream ML models
  • Produce wrong dashboards or business decisions – silently

Testing ensures your pipelines are:

  • Correct – data transformations behave as expected
  • Stable – schema and values don't change without warning
  • Reproducible – same inputs -> same outputs
  • Scalable – they fail fast and gracefully with large or dirty data

In short: Testing = data trust + pipeline confidence

When?

  • While building transformations (test-driven data engineering)
  • Before pushing to production
  • During CI/CD deployment
  • After a schema change or a new data source
  • Periodically, as scheduled checks on live data

How? (Foundational Tests)

Start with schema checks

Goal: validate columns, dtypes, missing values

def test_column_schema(df):
	expected_columns = ['id', 'age', 'income']
	assert list(df.columns) == expected_columns
	assert df['age'].dtype == 'int64'
	assert df['income'].dtype == 'float64'

Unit Tests for transformations

Example: test age binning logic

def test_age_binning():
	df=pd.DataFrame({'age':[15, 25, 70]})
	result = bin_age(df)
	expected = ['child', 'adult', 'senior']
	assert result['age_group'].tolist()==expected
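For reference, a minimal sketch of a bin_age() that would satisfy this test (the exact bin edges are an assumption; only the labels come from the test):

import pandas as pd

def bin_age(df):
	# Assumed bins: 17 and under -> child, 18-64 -> adult, 65+ -> senior
	df = df.copy()
	df['age_group'] = pd.cut(
		df['age'],
		bins=[0, 17, 64, 200],
		labels=['child', 'adult', 'senior'],
	)
	return df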

Smoke tests on sample data

Run the full pipeline on a 10-row sample to ensure it doesn't crash

def test_pipeline_smoke():
	sample = load_sample_data()
	output = run_full_pipeline(sample)
	assert not output.empty

Contract tests for external inputs

Goal: Ensure inputs conform to expected schemas before processing

def test_input_contract(df):
	assert 'user_id' in df.columns
	assert pd.api.types.is_integer_dtype(df['user_id'])

Note: Testing a data pipeline is like setting up guards at every step. Before I process anything, I check that the input looks right, then I test each function that transforms my data. Finally, I make sure the whole pipeline runs on a small sample. This way, if something breaks – like a missing column or a bad type – I catch it early, not when I'm processing 10 million rows.

Test-Driven Development (TDD) for Data Pipelines

TDD for data pipelines means this loop:

  1. Write a test that fails (define what the transformation should do)
  2. Write just enough pipeline logic to make the test pass
  3. Refactor the code for clarity and performance
  4. Repeat

For example, before writing a normalize_income() function, you write a test:
def test_income_normalization():
	df = pd.DataFrame({'income': [0, 50, 100]})
	result = normalize_income(df)
	assert result['income_scaled'].tolist() == [0.0, 0.5, 1.0]

Only after that test exists do you implement normalize_income().
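A minimal sketch of an implementation that would pass this test, assuming simple min-max scaling (the scaling choice is an assumption, not stated in the note):

import pandas as pd

def normalize_income(df):
	# Assumed min-max scaling of the income column into [0, 1]
	df = df.copy()
	income = df['income']
	df['income_scaled'] = (income - income.min()) / (income.max() - income.min())
	return df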

Why use TDD for data pipelines?

In traditional software, TDD helps devs write only the code they need, guided by the tests. In data pipelines, TDD helps you:

  • Think clearly about the expected data transformations
  • Prevent broken logic, unhandled edge cases, and schema drift
  • Build modular, well-scoped, and testable pipelines
  • Catch bugs before they affect millions of records

In short: TDD flips the mindset – write the test first, then write the code to pass it

When to apply TDD?

  • While designing new transformations
  • When refactoring existing pipelines
  • During onboarding of new data sources
  • While fixing a bug – write a test that reproduces the bug first

You don't need TDD for every small helper function, but for critical logic (encoding, scaling, feature extraction...), it's invaluable.

How to practice TDD with data?

Treat each transformation as a testable unit

example: binning age -> test expected bins

def test_bin_age():
	df = pd.DataFrame({'age':[10,30,80]})
	result = bin_age(df)
	assert result['age_group'].tolist() == ['child', 'adult', 'senior']

Design your pipeline modularly

Break it down into small functions, so each one can be tested in isolation.

Use test fixtures or mock datasets

Keep small, hard-coded datasets in your tests. No need for CSVs or databases.
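A minimal pytest sketch of this idea (the fixture name and sample values are hypothetical):

import pandas as pd
import pytest

@pytest.fixture
def sample_df():
	# Small, hard-coded dataset – no CSVs or databases needed
	return pd.DataFrame({
		'id': [1, 2, 3],
		'age': [15, 25, 70],
		'income': [0.0, 50.0, 100.0],
	})

def test_column_schema(sample_df):
	assert list(sample_df.columns) == ['id', 'age', 'income']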

Write tests – even before any logic

Get used to asking "What should the output look like?" before coding.

NOTE: TDD means I write the tests first, before the actual pipeline logic. For example, if I'm going to normalize income, I first write a test that says, "I expect income_scaled to go from 0 to 1 if income is between 0 and 100". Then I write the code to make that test pass. This way, I always know what my functions are supposed to do before they exist.
