Web Scraping and APIs
REST APIs and Robust Authentication
Why?
In the age of modern data-driven applications, REST APIs are the highways through which information flows between systems. For ML pipelines this is crucial: external APIs let us enrich datasets with product details, behavioral logs, customer reviews and more – directly from e-commerce platforms, CRMs, IoT devices or public services. In DataOps, mastering API interaction is essential to:
- Ingest real-time or external data into pipelines.
- Ensure data freshness and variety for feature engineering.
- Build scalable and secure systems that interface with external data providers.
When is this relevant?
- When your ML use case requires external enrichment (e.g., product details, user activity).
- When data is not stored in a centralized data lake but needs to be pulled from a third-party service.
- When working with APIs from e-commerce, financial, logistics or public datasets.
- When integrating with SaaS tools like Stripe, Shopify, Salesforce....
What will we cover?
- RESTful architecture: principles, verbs, statelessness, endpoints
- Request anatomy: headers, query params, path params, body
- Authentication: API key, Bearer token, OAuth2 – when and how to use each
- Pagination: strategies such as offset/limit and cursor-based
- Rate limiting: avoiding throttling and respecting quotas
- Sync vs. async: when to use synchronous vs. asynchronous APIs
REST: What is it really?
REST stands for REpresentational State Transfer – a lightweight architecture that treats everything as a resource.
- You use HTTP verbs to act:
  - GET /products -> retrieve
  - POST /products -> create
  - PUT /products -> update
  - DELETE /products -> remove

Statelessness means the server doesn't remember who you are: each request must contain everything it needs.
Anatomy of an API request
GET /products?category=phones&page=2 HTTP/1.1
Host: api.shop.com
Authorization: Bearer <token>
Accept: application/json
- Headers: metadata like auth, content-type
- Query params: filters (?category=phones&page=2)
- Body: used in POST, PUT, PATCH to send data (usually JSON)
Authentication deep dive
| Method | Use Case | Example |
|---|---|---|
| API key | Simple services or internal APIs | ?api_key=123 |
| Bearer Token | More secure, time-limited access | Authorization: Bearer <token> |
| OAuth2 | Third-party auth (e.g., Google, Twitter) | Redirect to auth, get token |

Use Bearer tokens for production-grade ingestion pipelines.
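As a minimal sketch (the endpoint and token below are placeholders, not from a specific provider), a Bearer-token request with requests might look like this:

```python
import requests

# Hypothetical endpoint and token -- substitute your provider's values.
BASE_URL = "https://api.shop.com/products"
TOKEN = "YOUR_ACCESS_TOKEN"

headers = {
    "Authorization": f"Bearer {TOKEN}",   # Bearer token authentication
    "Accept": "application/json",
}

response = requests.get(BASE_URL, headers=headers,
                        params={"category": "phones"}, timeout=10)
response.raise_for_status()               # fail loudly on 4xx/5xx
products = response.json()
```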
Pagination & Rate Limiting
APIs don't give you everything at once; you need to page through results.
- Offset-based: ?page=2&limit=100
- Cursor-based: ?after=last_id

Rate limiting: APIs will cut you off after a certain number of requests. Check headers like:
- X-RateLimit-Limit: 500
- X-RateLimit-Remaining: 100
- Retry-After: 120

Always respect the API's quota to avoid bans.
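A sketch of offset-based pagination that also watches the rate-limit headers (endpoint and header names are assumptions; check your API's documentation):

```python
import time
import requests

url = "https://api.shop.com/products"          # hypothetical paginated endpoint
headers = {"Authorization": "Bearer YOUR_ACCESS_TOKEN"}

page, items = 1, []
while True:
    resp = requests.get(url, headers=headers,
                        params={"page": page, "limit": 100}, timeout=10)
    resp.raise_for_status()
    batch = resp.json()                        # assumed to be a list of records
    if not batch:                              # empty page -> no more results
        break
    items.extend(batch)

    # Respect the advertised quota; header names vary by provider.
    if int(resp.headers.get("X-RateLimit-Remaining", 1)) == 0:
        time.sleep(int(resp.headers.get("Retry-After", 60)))
    page += 1
```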
Sync vs. Async APIs
| Type | When to Use | Characteristics |
|---|---|---|
| Synchronous | For small, immediate responses | Fast, blocking |
| Asynchronous | When requests take time (e.g., large processing) | Return job ID, poll status |
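For the asynchronous pattern, a typical flow is: submit a job, receive a job ID, then poll until it finishes. The sketch below assumes a hypothetical export endpoint and response fields (job_id, state):

```python
import time
import requests

# Hypothetical async API: POST submits a job, GET polls its status.
submit = requests.post("https://api.example.com/exports",
                       json={"format": "parquet"}, timeout=10)
submit.raise_for_status()
job_id = submit.json()["job_id"]               # assumed response field

while True:
    status = requests.get(f"https://api.example.com/exports/{job_id}",
                          timeout=10).json()
    if status["state"] in ("completed", "failed"):   # assumed status values
        break
    time.sleep(5)                              # poll politely, not in a tight loop
```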
Data formats & Serialization
Why?
Data retrieved from APIs – especially in real-world systems – is typically nested, verbose, and unstructured. For machine learning pipelines to consume this data efficiently:
- It must be flattened, cleaned and stored in compact formats
- It must be transformed from raw API responses (JSON, XML) into formats optimized for storage, performance and ML processing (Avro, Parquet)

Without this transformation layer, even well-designed ingestion pipelines become a bottleneck due to:
- High memory consumption
- Slow I/O operations
- Poor integration with data processing engines (e.g., Spark, Dask)

Thus, format handling and serialization is a critical competency in production-grade DataOps.
When is this important?
- When you're ingesting large or deeply nested data from APIs
- When your data needs to be stored in an ML-friendly format (columnar, compressed, binary)
- When you're designing streaming ingestion pipelines or need interoperability across systems (e.g., Kafka -> Avro)
- When optimizing for cloud storage or distributed processing (e.g., S3 + Spark)
What will we be learning?
| Concept | Description |
|---|---|
| Serialization | Transforming Python/JSON/XML objects into storable, transferable formats |
| Deserialization | Converting files or byte streams back into Python objects |
| JSON vs XML | Structural differences, parsing techniques |
| Flattening | Converting nested dictionaries/lists into flat DataFrames |
| Parquet / Avro | Binary serialization formats (columnar and row-based) designed for efficient ML processing |
| Compression | Improving performance and reducing storage using codecs like Snappy, Brotli, gzip |
Serialization ≠ Just saving
Think of serialization as packing a suitcase:
- You can't travel with a full wardrobe (complex object) – you must flatten and organize it into something compact (JSON, binary, Parquet)
- When you arrive (load the file), you unpack it into something you can use again (Python dict, pandas DataFrame)
JSON vs XML: What's the difference?
| Feature | JSON | XML |
|---|---|---|
| Format | Lightweight, JS-friendly | Verbose, hierarchical |
| Data Types | Supports arrays, objects, numbers and strings | Everything is a string |
| Use Case | APIs, modern web services | Legacy systems, configs |
| Python libraries | json, pandas.json_normalize | xml.etree.ElementTree, lxml, BeautifulSoup |

Key issue: both are tree-like and often deeply nested, requiring flattening before ML use.
Flattening JSON for ML
JSON often looks like this:
{
"product":{
"name" : "Laptop",
"price" : 1499,
"specs" : {
"ram" : "16GB",
"cpu" : "i7"
}
}
}
We want:
| product_name | product_price | product_specs_ram | product_specs_cpu |
|---|---|---|---|
| Laptop | 1499 | 16GB | i7 |
Use:
pd.json_normalize(data, sep='_')
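A complete, runnable version of the flattening step for the example above:

```python
import pandas as pd

data = {
    "product": {
        "name": "Laptop",
        "price": 1499,
        "specs": {"ram": "16GB", "cpu": "i7"},
    }
}

# Flatten nested keys into underscore-joined column names.
df = pd.json_normalize(data, sep="_")
print(df.columns.tolist())
# ['product_name', 'product_price', 'product_specs_ram', 'product_specs_cpu']
```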
Parsing XML: Think DOM trees
XML:
<product>
<name>Laptop</name>
<price>1499</price>
<specs>
<ram>16GB</ram>
<cpu>i7</cpu>
</specs>
</product>
You must walk through the tree and extract elements manually or recursively. Use:
import xml.etree.ElementTree as ET
or for messy APIs:
from bs4 import BeautifulSoup
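A minimal ElementTree sketch for the product XML above, flattening it into a dict ready for a DataFrame:

```python
import xml.etree.ElementTree as ET

xml_string = """
<product>
  <name>Laptop</name>
  <price>1499</price>
  <specs><ram>16GB</ram><cpu>i7</cpu></specs>
</product>
"""

root = ET.fromstring(xml_string)
record = {
    "product_name": root.findtext("name"),
    "product_price": float(root.findtext("price")),
    "product_specs_ram": root.find("specs").findtext("ram"),
    "product_specs_cpu": root.find("specs").findtext("cpu"),
}
```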
Parquet & Avro: Optimized ML formats
| Format | Parquet | Avro |
|---|---|---|
| Type | Columnar | Row-based |
| Use Case | ML pipelines, analytics | Streaming, schema evolution |
| Compression | Snappy, gzip | Snappy, deflate |
| Python Libs | pyarrow, fastparquet | fastavro, avro-python3 |
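A quick sketch of writing and reading Parquet via pandas; the pyarrow engine with Snappy compression is one common combination:

```python
import pandas as pd

df = pd.DataFrame([{"product_name": "Laptop", "product_price": 1499}])

# Columnar + compressed: suited for analytics and ML feature pipelines.
df.to_parquet("products.parquet", engine="pyarrow", compression="snappy")

df_back = pd.read_parquet("products.parquet")
```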
Error handling and retry patterns
APIs are inherently unreliable: they're external systems that may fail for reasons beyond our control. Network latency, rate limits, downtime or even unexpected data formats can all break an ingestion pipeline if not properly managed. In DataOps, reliability is non-negotiable. A failed call shouldn't crash your workflow or silently skip important data. By learning to:
- Detect and classify errors
- Retry failed requests wisely
- Log and audit all behavior

...you build pipelines that are resilient, traceable and production-ready.
When is this important?
- When your ingestion is critical and must not drop data
- When dealing with unstable or third-party APIs
- When working with large scheduled jobs, where errors can accumulate
- When scaling ingestion across hundreds of endpoints
- When you need to debug why something failed – hours or days later
Learning material
| Topic | Description |
|---|---|
| HTTP Errors and Failures | Classifying 4xx and 5xx errors, timeouts, malformed responses |
| Retry Patterns | Exponential backoff, jitter and retry caps |
| Retry Tools | tenacity, retrying libraries |
| Logging and Auditing | How to trace API activity and failures |
| Robustness Mechanisms | Timeouts, control headers like ETag, If-Modified-Since |
Types of API Failures
- Client errors (4xx):
  - You did something wrong (bad token, wrong endpoint, bad params)
  - 401 Unauthorized, 403 Forbidden, 404 Not Found
- Server errors (5xx):
  - The server is down or misbehaving
  - 502 Bad Gateway, 503 Service Unavailable, 504 Gateway Timeout
- Timeouts / network failures:
  - Slow connections, DNS failures, etc.
  - Must be caught explicitly (e.g., requests.exceptions.Timeout)
- Data issues:
  - Malformed JSON, missing keys, wrong types
  - Not HTTP errors – these are semantic errors
Retry Strategies
Retrying a failed request can work – but not blindly. We should use:
| Concept | Description |
|---|---|
| Exponential Backoff | Wait longer after each failure (e.g. 1s -> 2s -> 4s -> 8s) |
| Jitter | Add randomness to avoid retrying at the same time as others |
| Retry Caps | Stop retrying after N attempts or max time |
| Only Retry Recoverable Errors | 5xx, timeouts – but not 401/403/404 |
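One way to implement this with the tenacity library (a sketch: it retries only timeouts and connection errors, with randomized exponential backoff capped at five attempts):

```python
import requests
from tenacity import (retry, retry_if_exception_type,
                      stop_after_attempt, wait_random_exponential)

@retry(
    # Retry only recoverable failures; 401/403/404 raise HTTPError and are not retried here.
    retry=retry_if_exception_type((requests.exceptions.Timeout,
                                   requests.exceptions.ConnectionError)),
    wait=wait_random_exponential(multiplier=1, max=30),   # exponential backoff + jitter
    stop=stop_after_attempt(5),                           # retry cap
)
def fetch(url):
    resp = requests.get(url, timeout=5)
    resp.raise_for_status()
    return resp.json()
```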
Timeouts: your first defense
Set timeouts on all requests – always!
requests.get(url, timeout=5)
Without timeouts, your pipeline could hang indefinitely on a single request
Control Headers: avoid redundant calls
Some APIs offer:
- ETag: a unique version identifier for the resource
- Last-Modified: timestamp of the last change

Use:
If-None-Match: "abc123"
If-Modified-Since: "Tue, 25 Jul 2023 07:28:00 GMT"

-> the server will respond 304 Not Modified if nothing changed = bandwidth saved
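A sketch of a conditional GET using ETag (the URL is a placeholder):

```python
import requests

url = "https://api.shop.com/products/42"       # hypothetical resource

first = requests.get(url, timeout=5)
etag = first.headers.get("ETag")

# On the next poll, send the ETag back; 304 means nothing changed.
second = requests.get(url, headers={"If-None-Match": etag} if etag else {}, timeout=5)
data = first.json() if second.status_code == 304 else second.json()
```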
Logging: If it fails and you don't log it... Did it really happen?
Use Python's logging module to:
- Record all failures (and their reasons)
- Track retries
- Alert or escalate errors
import logging
logging.basicConfig(level=logging.INFO)
Store logs in file or database for auditability in production.
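A minimal pattern combining the two (file-based logging plus a logged, re-raised failure) might look like this:

```python
import logging
import requests

logging.basicConfig(
    filename="ingestion.log",                  # persist logs for auditing
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)

def fetch(url):
    try:
        resp = requests.get(url, timeout=5)
        resp.raise_for_status()
        logging.info("Fetched %s (%d bytes)", url, len(resp.content))
        return resp.json()
    except requests.exceptions.RequestException as exc:
        logging.error("Request to %s failed: %s", url, exc)
        raise                                  # let the orchestrator decide what happens next
```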
Introduction to web scraping
Web scraping is the process of automatically extracting information from web pages. In the DataOps + ML context, it's a powerful way to augment datasets, especially when public APIs are unavailable or incomplete.
Key reasons:
- ML Dataset Enrichment:
  - Example: adding real product descriptions and user reviews to improve recommendation systems
  - Example: collecting competitor pricing to train a dynamic pricing model
- Market Intelligence:
  - Competitive research on products, features, customer sentiment
- Data Unification:
  - Consolidating fragmented data from multiple online sources
- Trend Analysis:
  - Tracking changes over time, such as price fluctuations, product availability or review trends
"We scrape when we need structured data that exists only as unstructured web content." If a machine readable API already provides this data in a compliant way – scraping is often not the best choice.
When is scraping appropriate?
Not all scraping is legal, ethical or even technically allowed. We scrape only when:
- Data is public (no auth required)
- The website's ToS allows automated access – or doesn't explicitly forbid it
- robots.txt rules are respected (guidelines for bots)
- We comply with data protection laws (GDPR, CCPA, etc.)
- The purpose is legitimate and non-abusive (e.g., enriching your own models, not stealing proprietary databases)
Important boundaries
- Never scrape personal data without explicit consent
- Avoid scraping private/protected content (requires login, paywalls)
- Limit request rates to avoid harming site performance
What is scraping?
Web scraping involves:
- Sending HTTP requests to a page (e.g., GET requests)
- Downloading the HTML (or JSON, XML in some cases)
- Parsing the content to extract specific elements (using HTML tags, CSS selectors, XPath)
- Structuring the data into a usable format (CSV, JSON, database)
Core concepts:
- HTML: the markup language used to structure web pages.
- DOM: the tree-like structure browsers create from HTML.
- Selectors:
  - CSS selectors: target elements by tag, class, id, etc.
  - XPath: XML-style navigation through the DOM
- HTTP: the protocol for web communication (GET, POST, headers, cookies)
Think of a webpage as a messy pantry. Scraping is like opening the door, finding the ingredients you need (selectors), and putting them in your own neatly labeled jars (structured data).
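As a sketch of those steps with requests + BeautifulSoup (the URL and CSS classes are hypothetical; inspect the real page to find the right selectors):

```python
import requests
from bs4 import BeautifulSoup

url = "https://example.com/products"           # hypothetical public listing page
html = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10).text

soup = BeautifulSoup(html, "html.parser")
rows = []
for card in soup.select("div.product"):        # CSS selector for each product block (assumed)
    rows.append({
        "name": card.select_one("h2.title").get_text(strip=True),
        "price": card.select_one("span.price").get_text(strip=True),
    })
```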
How scraping differs from APIs and crawling
- Scraping vs APIs
  - APIs return structured data (usually JSON or XML) via a documented interface.
  - Scraping extracts unstructured data from HTML, requiring parsing and cleanup.
  - APIs are more stable but may require authentication, have rate limits or restrict certain data fields.
  - Scraping is flexible but fragile – website layout changes can break the script.
- Scraping vs Crawling
  - Scraping: extracts data from specific pages
  - Crawling: automatically navigates and indexes many pages (like Googlebot)
  - Crawling often includes scraping, but at a larger scale
The Ethics and Legality of Web Scraping
Web scraping—the automated extraction of data from websites—has become a common technique in data-driven industries. However, its use raises important ethical and legal considerations that developers must be aware of.
⚖️ Legal Considerations
The legality of web scraping varies by jurisdiction and context. Key legal factors include:
- Terms of Service Violations: Many websites explicitly prohibit scraping in their Terms of Service (ToS). Violating these terms may lead to account bans or civil lawsuits, even if the scraped content is publicly accessible.
- Copyright and Database Rights: Web content may be protected by copyright laws or sui generis database rights (particularly in the EU). Copying and redistributing such data could infringe on the rights of content owners.
- Computer Fraud and Abuse Act (CFAA): In the U.S., scraping that involves bypassing authentication or access restrictions may violate the CFAA, potentially leading to criminal liability.
- Data Protection Laws: Scraping personal data—especially in regions governed by GDPR (EU) or CCPA (California)—can breach privacy regulations if proper consent and data handling practices are not followed.
🧭 Ethical Considerations
Even when legal, scraping can raise ethical issues:
- Server Load and Denial of Service: Aggressive scraping can overwhelm a website's servers, unintentionally causing denial of service.
- Respect for Robots.txt: Ethical scrapers respect the robots.txt file, which indicates which parts of a site are off-limits to crawlers.
- Attribution and Fair Use: When using scraped data, ethical practices include citing sources and avoiding the misrepresentation or commercial exploitation of others' work without permission.
- Intent and Impact: Consider whether scraping benefits the public (e.g., academic research, open data aggregation) or harms stakeholders (e.g., copying content for commercial gain).
✅ Best Practices
To navigate these concerns responsibly:
- Read and follow the website’s ToS and robots.txt.
- Avoid scraping sensitive or personal information.
- Use rate limiting and caching to minimize server load.
- Consider reaching out to site owners for permission or API access.
By balancing legal compliance and ethical responsibility, developers can ensure that their scraping practices are both respectful and sustainable.
Tools and techniques for scraping
Why?
In modern web scraping, the choice of tool determines:
- Speed (requests is lightweight, Scrapy is optimized for large crawls)
- Ease of parsing (BeautifulSoup for simplicity, lxml for speed)
- Handling complexity (Selenium for JS-heavy sites)
- Scalability (Scrapy pipelines for industrial-scale jobs)

Picking a scraping tool is like picking a vehicle: a bike for short distances (requests + bs4), a car for medium trips (Scrapy) and a tank for rough terrain (Selenium + Playwright).
When to use each tool?
| Tool | Best For | Avoid When |
|---|---|---|
| requests | Simple, fast HTTP requests for static HTML/JSON pages | JS-heavy sites (data missing in raw HTML) |
| BeautifulSoup | Easy HTML parsing and extraction | Very large HTML pages (slower than lxml) |
| lxml | Fast parsing with XPath support | Less beginner-friendly syntax |
| Selenium | Interacting with dynamic, JS-rendered pages (clicks, scrolls) | High-scale scraping (slow) |
| Scrapy | Large-scale, structured, concurrent scraping | Very small, one-off scripts (overkill) |
What each lib does
- requests
  - Makes HTTP GET/POST requests
  - Handles headers, cookies, query params
  - Supports session persistence
- BeautifulSoup
  - Parses HTML into a navigable object tree
  - Finds elements by tag, class, id, CSS selectors
- lxml
  - High-performance parsing engine
  - Supports XPath for precise selection (see the XPath sketch after this list)
- Selenium
  - Automates browsers (Chrome, Firefox)
  - Loads JS-heavy pages by actually running JS
  - Can scroll, click and fill forms
- Scrapy
  - Framework for large scraping projects
  - Built-in concurrency, pipelines, middlewares
  - Built-in support for pagination and following links
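For comparison, the same kind of extraction with lxml and XPath (the HTML fragment and class names are made up for the example):

```python
from lxml import html

page = """
<div class="product"><h2>Laptop</h2><span class="price">1499</span></div>
<div class="product"><h2>Phone</h2><span class="price">899</span></div>
"""

tree = html.fromstring(page)
# XPath: pick the <h2> text and price span inside every product block.
names = tree.xpath('//div[@class="product"]/h2/text()')
prices = tree.xpath('//div[@class="product"]/span[@class="price"]/text()')
products = list(zip(names, prices))            # [('Laptop', '1499'), ('Phone', '899')]
```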
How to handle real-world scraping challenges
Handling JS-heavy pages
- Some sites load data after the initial HTML via JS
- Solutions:
  - Inspect network requests in devtools -> often you can find JSON API endpoints without using Selenium
  - If there is no direct API:
    - Use Selenium or Playwright to render and extract content
    - Use requests_html (lightweight rendering)

Avoid full browser automation unless strictly necessary – it's slow and resource-intensive.
Rate-limiting, user-agents and headers
- Why? To avoid being blocked and to behave politely
- Rate-limiting: add delays between requests (time.sleep() or async rate limits)
- User-agent rotation: pretend to be a normal browser (fake_useragent lib)
- Headers: send Referer, Accept-Language, cookies when needed
headers = {"User-Agent":"Mozilla/5.0", "Accept-Language":"en-US"}
response = requests.get(url, headers=headers)
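Putting these together, a polite fetch helper might rotate user agents and pause between requests (the agent strings here are just examples; the fake_useragent library can supply fresher ones):

```python
import random
import time
import requests

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
]

def polite_get(url):
    headers = {"User-Agent": random.choice(USER_AGENTS),
               "Accept-Language": "en-US"}
    resp = requests.get(url, headers=headers, timeout=10)
    time.sleep(random.uniform(1, 3))           # rate limit: pause between requests
    return resp
```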
Pagination and lazy loading
- Pagination:
  - Common: ?page=2 or offset-based (start=20)
  - Strategy: loop through page numbers until no more results (see the sketch below)
- Lazy loading / infinite scroll:
  - Often requires JS execution
  - Sometimes JSON APIs feed the lazy load -> sniff network requests
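A sketch of the page-number loop (URL, query param and selector are assumptions for illustration):

```python
import requests
from bs4 import BeautifulSoup

base = "https://example.com/products"          # hypothetical listing URL
page, items = 1, []
while True:
    html = requests.get(base, params={"page": page}, timeout=10).text
    cards = BeautifulSoup(html, "html.parser").select("div.product")   # assumed selector
    if not cards:                              # empty page -> stop paging
        break
    items.extend(card.get_text(strip=True) for card in cards)
    page += 1
```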
