
DataOps_Culture_and_Tools

#dataops #git #testing #tdd #cicd

Day 1 Notes: DataOps Fundamentals

What is DataOps?

[!NOTE] DataOps is a discipline at the intersection of data engineering, DevOps, and agile analytics.

Focus: Creating robust, automated, production-grade data pipelines that deliver clean, governed, and ML-ready data at scale.

Core Practices:

  • Version Control (Git)
  • Testing (Unit/Integration/Data Quality)
  • Continuous Delivery (CI/CD)
  • Observability & Feedback Loops

Who is the DataOps Engineer?

A bridge between infrastructure, data science, and product teams.

  • Build: Transform raw, messy data into feature-rich, trustworthy assets.
  • Automate: Ensure pipelines are automated, versioned, tested, and observed.

How to Stand Out?

  1. Think Systems, Not Scripts: Architect solutions, employ modular design, and use orchestration (e.g., Airflow).
  2. Automate Everything: If it runs twice, script it. From validation to deployment.
  3. Build for Quality: Frequent testing, schema checks, and observability. Trust is currency.
  4. Collab like a SWE: Git workflows, code reviews, reproducible environments.
  5. Understand End Users: Data must be usable and trustworthy for the consumer (Dashboard/DS).
  6. Stay Curious: Read docs, contribute to open source, keep learning.

Soft Skills

  • Communication
  • Analytical thinking
  • Collaboration
  • Patience and rigor

Pillars of DataOps

  • Collaboration: Breaking silos between data, dev, and ops.
  • Automation: Reducing manual toil.
  • CI/CD: Continuous Integration and Continuous Deployment.
  • Metrics and Monitoring: Knowing health and performance.
  • Data Quality and Governance: Ensuring reliability and compliance.

Virtual environment

An isolated working copy of Python that lets you install packages and dependencies separately from the system Python or other projects (a Python sandbox, if you will). You get to:

  • control which packages are installed
  • avoid conflicts between projects
  • ensure your project behaves the same on another machine

Why

In real-life pipelines, virtual environments help you:

  • avoid "it works on my machine" problems
  • enable version control for dependencies
  • prevent bugs that only show up in production when ML pipelines or workflows are deployed without isolation
  • escape the dependency hell caused by global installations

When

always

How

  • a new folder is created, called venv or .venv
  • the folder contains a local Python interpreter and its own copy of pip
  • activating the environment tells python and pip to use that isolated setup

Creating and using a venv

Step 1: create the env

python -m venv .venv

Step 2: activate the venv

source .venv/bin/activate

Step 3: install dependencies

pip install [pandas](/brain/Python_for_Data_Engineering) flask

Step 4: deactivate

deactivate

What is reproducibility

Reproducibility means that anyone can run the code with the same dependencies and get the same results.


Tools for reproducibility

requirements.txt

Generate a text file that pins the version of each dependency:

pip freeze > requirements.txt

The requirements.txt freezes the exact versions; anyone can reproduce the environment by simply running:

pip install -r requirements.txt

pyproject.toml (more modern)

Used in modern packaging systems like Poetry; it defines:

  • project metadata
  • exact dependencies
  • build backend
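A minimal, purely illustrative sketch of what such a file can look like (the project name, version pins, and build backend here are hypothetical, not taken from a real project):

[project]
name = "dataops-pipeline"          # hypothetical project name
version = "0.1.0"
dependencies = [
    "pandas>=2.0",                 # example pins, not real requirements
    "flask>=3.0",
]

[build-system]
requires = ["setuptools>=68"]
build-backend = "setuptools.build_meta"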

Git Cheatsheet for DataOps

[!TIP] Treat your data pipelines like production code. Version control is non-negotiable.

Configuration

git config --global user.name "Your Name"
git config --global user.email "you@example.com"
git config --global core.editor "code --wait"
git config --global init.defaultBranch main

Basic Workflow

  1. Initialize: git init
  2. Status: git status (Check what's changed)
  3. Stage: git add . (Add all) or git add filename
  4. Commit: git commit -m "feat: add data validation step"
    • Use conventional commits: feat:, fix:, docs:, chore:.

Branching & Merging

  • Create Branch: git checkout -b feature/new-pipeline
  • Switch Branch: git checkout main
  • Merge: git merge feature/new-pipeline

Undoing Changes

  • Unstage file: git restore --staged filename
  • Discard local changes: git restore filename
  • Amend last commit: git commit --amend

Remote (GitHub)

  • Add Remote: git remote add origin <url>
  • Push: git push -u origin main
  • Pull: git pull origin main

.gitignore

Always ignore large data files, credentials, and virtual environments.

__pycache__/
*.csv
*.parquet
.env
.DS_Store
venv/

Foundations of Testing for Preprocessing Pipelines

Testing a data pipeline means writing automated checks to ensure:

  1. The input data is valid
  2. The transformations are applied correctly
  3. The output conforms to expected structure and values

There are several kinds of tests that apply to data workflows:

| Test type | Description |
| --- | --- |
| Unit Test | Validates one transformation step (e.g. scaling logic) |
| Integration Test | Checks a full preprocessing pipeline end-to-end |
| Regression Test | Compares new pipeline output with golden datasets |
| Contract Test | Asserts schema expectations (columns, dtypes, ranges) |
| Smoke Test | Verifies the pipeline runs successfully on a small sample |

Why?

In traditional software, a bug might crash a feature. But in data pipelines, a bug can:

  • Corrupt millions of rows
  • Break downstream ML models
  • Produce wrong dashboards or business decisions – silently

Testing ensures your pipelines are:

  • Correct – data transformations behave as expected
  • Stable – schema and values don't change without warning
  • Reproducible – same inputs -> same outputs
  • Scalable – they fail fast and gracefully with large or dirty data

In short: Testing = data trust + pipeline confidence

When?

  • While building transformations (test-driven data engineering)
  • Before pushing to production
  • During CI/CD deployment
  • After a schema change or a new data source
  • Periodically, as scheduled checks on live data

How? (Foundational Tests)

Start with schema checks

Goal: validate columns, dtypes, missing values

def test_column_schema(df):
	expected_columns = ['id', 'age', 'income']
	assert list(df.columns) == expected_columns
	assert df['age'].dtype == 'int64'
	assert df['income'].dtype == 'float64'

Unit Tests for transformations

Example: test age binning logic

def test_age_binning():
	df=pd.DataFrame({'age':[15, 25, 70]})
	result = bin_age(df)
	expected = ['child', 'adult', 'senior']
	assert result['age_group'].tolist()==expected
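For reference, a minimal sketch of a bin_age() that would satisfy this test (the exact bin edges are an assumption; only the labels come from the test):

import pandas as pd

def bin_age(df):
	# Assumed bins: 17 and under -> child, 18-64 -> adult, 65+ -> senior
	df = df.copy()
	df['age_group'] = pd.cut(
		df['age'],
		bins=[0, 17, 64, 200],
		labels=['child', 'adult', 'senior'],
	)
	return df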

Smoke tests on sample data

Run the full pipeline on a 10-row sample to ensure it doesn't crash

def test_pipeline_smoke():
	sample = load_sample_data()
	output = run_full_pipeline(sample)
	assert not output.empty

Contract tests for external inputs

Goal: Ensure inputs conform to expected schemas before processing

def test_input_contract(df):
	assert 'user_id' in df.columns
	assert pd.api.types.is_integer_dtype(df['user_id'])

Note: Testing a data pipeline is like setting up guards at every step. Before I process anything, I check that the input looks right, then I test each function that transforms my data. Finally, I make sure the whole pipeline runs on a small sample. This way, if something breaks – like a missing column or a bad type – I catch it early, not when I'm processing 10 million rows.

Test-Driven Development (TDD) for Data Pipelines

TDD for data pipelines means this loop:

  1. Write a test that fails (define what the transformation should do)
  2. Write just enough pipeline logic to make the test pass
  3. Refactor the code for clarity and performance
  4. Repeat

For example, before writing a normalize_income() function, you write a test:
def test_income_normalization():
	df = pd.DataFrame({'income': [0, 50, 100]})
	result = normalize_income(df)
	assert result['income_scaled'].tolist() == [0.0, 0.5, 1.0]

Only after that test exists do you implement normalize_income().
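A minimal sketch of an implementation that would pass this test, assuming simple min-max scaling (the scaling choice is an assumption, not stated in the note):

import pandas as pd

def normalize_income(df):
	# Assumed min-max scaling of the income column into [0, 1]
	df = df.copy()
	income = df['income']
	df['income_scaled'] = (income - income.min()) / (income.max() - income.min())
	return df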

Why use TDD for data pipelines?

In traditional software, TDD helps devs write only the code they need, guided by the tests. In data pipelines, TDD helps you:

  • Think clearly about the expected data transformations
  • Prevent broken logic, unhandled edge cases, and schema drift
  • Build modular, well-scoped, and testable pipelines
  • Catch bugs before they affect millions of records

In short: TDD flips the mindset – write the test first, then write the code to pass it

When to apply TDD?

  • While designing new transformations
  • When refactoring existing pipelines
  • During onboarding of new data sources
  • While fixing a bug – write a test that reproduces the bug first

You don't need TDD for every small helper function, but for critical logic (encoding, scaling, feature extraction...), it's invaluable.

How to practice TDD with data?

Treat each transformation as a testable unit

example: binning age -> test expected bins

def test_bin_age():
	df = pd.DataFrame({'age':[10,30,80]})
	result = bin_age(df)
	assert result['age_group'].tolist() == ['child', 'adult', 'senior']

Design your pipeline modularly

Break it down into small functions, so each one can be tested in isolation.

Use test fixtures or mock datasets

Keep small, hard-coded datasets in your tests. No need for CSVs or databases.
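A minimal pytest sketch of this idea (the fixture name and sample values are hypothetical):

import pandas as pd
import pytest

@pytest.fixture
def sample_df():
	# Small, hard-coded dataset – no CSVs or databases needed
	return pd.DataFrame({
		'id': [1, 2, 3],
		'age': [15, 25, 70],
		'income': [0.0, 50.0, 100.0],
	})

def test_column_schema(sample_df):
	assert list(sample_df.columns) == ['id', 'age', 'income']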

Write tests – even before any logic

Get used to asking "What should the output look like?" before coding.

NOTE: TDD means I write the tests first, before the actual pipeline logic. For example, if I'm going to normalize income, I first write a test that says, "I expect income_scaled to go from 0 to 1 if income is between 0 and 100". Then I write the code to make that test pass. This way, I always know what my functions are supposed to do before they exist.
