Web Scraping and APIs
REST APIs and Robust Authentication
Why?
In the age of modern data-driven applications, REST APIs are the highways through which information flows between systems. For ML pipelines this is crucial: external APIs let us enrich datasets with product details, behavioral logs, customer reviews and more – directly from e-commerce platforms, CRMs, IoT devices or public services. In DataOps, mastering API interaction is essential to:
- Ingest real-time or external data into pipelines.
- Ensure data freshness and variety for feature engineering.
- Build scalable and secure systems that interface with external data providers.
When is this relevant?
- When your ML use case requires external enrichment (e.g., product details, user activity).
- When data is not stored in a centralized data lake but needs to be pulled from a third-party service.
- When working with APIs from e-commerce, financial, logistics or public datasets.
- When integrating with SaaS tools like Stripe, Shopify, Salesforce....
What will we cover?
- RESTful architecture: principles, verbs, statelessness, endpoints
- Request anatomy: headers, query params, path params, body
- Authentication: API key, Bearer token, OAuth2 – when and how to use each
- Pagination: strategies such as offset/limit and cursor-based
- Rate limiting: avoiding throttling and respecting quotas
- Sync vs. async: when to use synchronous vs. asynchronous APIs
REST: What is it really?
REST stands for REpresentational State Transfer – a lightweight architecture that treats everything as a resource.
- You use HTTP verbs to act:
  - GET /products -> retrieve
  - POST /products -> create
  - PUT /products -> update
  - DELETE /products -> remove

Statelessness means the server doesn't remember who you are: each request must contain everything it needs.
Anatomy of an API request
GET /products?category=phones&page=2 HTTP/1.1
Host: api.shop.com
Authorization: Bearer <token>
Accept: application/json
- Headers: metadata like auth, content-type
- Query params: filters (?category=phones&page=2)
- Body: used in POST, PUT, PATCH to send data (usually JSON)
Authentication deep dive
| Method | Use Case | Example |
|---|---|---|
| API key | Simple services or internal APIs | ?api_key=123 |
| Bearer Token | More secure, time-limited access | Authorization: Bearer <token> |
| OAuth2 | Third-party auth (e.g., Google, Twitter) | Redirect to auth, get token |

Use Bearer tokens for production-grade ingestion pipelines.
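As a minimal sketch (the endpoint and token below are placeholders, not from a specific provider), a Bearer-token request with requests might look like this:

```python
import requests

# Hypothetical endpoint and token -- substitute your provider's values.
BASE_URL = "https://api.shop.com/products"
TOKEN = "YOUR_ACCESS_TOKEN"

headers = {
    "Authorization": f"Bearer {TOKEN}",   # Bearer token authentication
    "Accept": "application/json",
}

response = requests.get(BASE_URL, headers=headers,
                        params={"category": "phones"}, timeout=10)
response.raise_for_status()               # fail loudly on 4xx/5xx
products = response.json()
```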
Pagination & Rate Limiting
APIs don't give you everything at once; you need to page through results.
- Offset-based: ?page=2&limit=100
- Cursor-based: ?after=last_id

Rate limiting: APIs will cut you off after a certain number of requests. Check headers like:
- X-RateLimit-Limit: 500
- X-RateLimit-Remaining: 100
- Retry-After: 120

Always respect the API's quota to avoid bans.
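A sketch of offset-based pagination that also watches the rate-limit headers (endpoint and header names are assumptions; check your API's documentation):

```python
import time
import requests

url = "https://api.shop.com/products"          # hypothetical paginated endpoint
headers = {"Authorization": "Bearer YOUR_ACCESS_TOKEN"}

page, items = 1, []
while True:
    resp = requests.get(url, headers=headers,
                        params={"page": page, "limit": 100}, timeout=10)
    resp.raise_for_status()
    batch = resp.json()                        # assumed to be a list of records
    if not batch:                              # empty page -> no more results
        break
    items.extend(batch)

    # Respect the advertised quota; header names vary by provider.
    if int(resp.headers.get("X-RateLimit-Remaining", 1)) == 0:
        time.sleep(int(resp.headers.get("Retry-After", 60)))
    page += 1
```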
Sync vs. Async APIs
| Type | When to Use | Characteristics |
|---|---|---|
| Synchronous | For small, immediate responses | Fast, blocking |
| Asynchronous | When requests take time (e.g., large processing) | Return job ID, poll status |
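For the asynchronous pattern, a typical flow is: submit a job, receive a job ID, then poll until it finishes. The sketch below assumes a hypothetical export endpoint and response fields (job_id, state):

```python
import time
import requests

# Hypothetical async API: POST submits a job, GET polls its status.
submit = requests.post("https://api.example.com/exports",
                       json={"format": "parquet"}, timeout=10)
submit.raise_for_status()
job_id = submit.json()["job_id"]               # assumed response field

while True:
    status = requests.get(f"https://api.example.com/exports/{job_id}",
                          timeout=10).json()
    if status["state"] in ("completed", "failed"):   # assumed status values
        break
    time.sleep(5)                              # poll politely, not in a tight loop
```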
Data formats & Serialization
Why?
Data retrieved from APIs – especially in real-world systems – is typically nested, verbose, and unstructured. For machine learning pipelines to consume this data efficiently:
- It must be flattened, cleaned and stored in compact formats
- It must be transformed from raw API responses (JSON, XML) into formats optimized for storage, performance and ML processing (Avro, Parquet)

Without this transformation layer, even well-designed ingestion pipelines become a bottleneck due to:
- High memory consumption
- Slow I/O operations
- Poor integration with data processing engines (e.g., Spark, Dask)

Thus, format handling and serialization is a critical competency in production-grade DataOps.
When is this important?
- When you're ingesting large or deeply nested data from APIs
- When your data needs to be stored in an ML-friendly format (columnar, compressed, binary)
- When you're designing streaming ingestion pipelines or need interoperability across systems (e.g., Kafka -> Avro)
- When optimizing for cloud storage or distributed processing (e.g., S3 + Spark)
What will we be learning?
| Concept | Description |
|---|---|
| Serialization | Transforming Python/JSON/XML objects into storable, transferable formats |
| Deserialization | Converting files or byte streams back into Python objects |
| JSON vs XML | Structural differences, parsing techniques |
| Flattening | Converting nested dictionaries/lists into flat DataFrames |
| Parquet / Avro | Binary serialization formats (columnar and row-based) designed for efficient ML processing |
| Compression | Improving performance and reducing storage using codecs like Snappy, Brotli, gzip |
Serialization ≠ Just saving
Think of serialization as packing a suitcase:
- You can't travel with a full wardrobe (complex object) – you must flatten and organize it into something compact (JSON, binary, Parquet)
- When you arrive (load the file), you unpack it into something you can use again (Python dict, pandas DataFrame)
JSON vs XML: What's the difference?
| Feature | JSON | XML |
|---|---|---|
| Format | Lightweight, JS-friendly | Verbose, hierarchical |
| Data Types | Supports arrays, objects, numbers and strings | Everything is a string |
| Use Case | APIs, modern web services | Legacy systems, configs |
| Python libraries | json, pandas.json_normalize | xml.etree.ElementTree, lxml, BeautifulSoup |

Key issue: both are tree-like and often deeply nested, requiring flattening before ML use.
Flattening JSON for ML
JSON often looks like this:
{
"product":{
"name" : "Laptop",
"price" : 1499,
"specs" : {
"ram" : "16GB",
"cpu" : "i7"
}
}
}
We want:
| product_name | product_price | product_specs_ram | product_specs_cpu |
|---|---|---|---|
| Laptop | 1499 | 16GB | i7 |
Use:
pd.json_normalize(data, sep='_')
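A complete, runnable version of the flattening step for the example above:

```python
import pandas as pd

data = {
    "product": {
        "name": "Laptop",
        "price": 1499,
        "specs": {"ram": "16GB", "cpu": "i7"},
    }
}

# Flatten nested keys into underscore-joined column names.
df = pd.json_normalize(data, sep="_")
print(df.columns.tolist())
# ['product_name', 'product_price', 'product_specs_ram', 'product_specs_cpu']
```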
Parsing XML: Think DOM trees
XML:
<product>
<name>Laptop</name>
<price>1499</price>
<specs>
<ram>16GB</ram>
<cpu>i7</cpu>
</specs>
</product>
You must walk through the tree and extract elements manually or recursively. Use:
import xml.etree.ElementTree as ET
or for messy APIs:
from bs4 import BeautifulSoup
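A minimal ElementTree sketch for the product XML above, flattening it into a dict ready for a DataFrame:

```python
import xml.etree.ElementTree as ET

xml_string = """
<product>
  <name>Laptop</name>
  <price>1499</price>
  <specs><ram>16GB</ram><cpu>i7</cpu></specs>
</product>
"""

root = ET.fromstring(xml_string)
record = {
    "product_name": root.findtext("name"),
    "product_price": float(root.findtext("price")),
    "product_specs_ram": root.find("specs").findtext("ram"),
    "product_specs_cpu": root.find("specs").findtext("cpu"),
}
```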
Parquet & Avro: Optimized ML formats
| Format | Parquet | Avro |
|---|---|---|
| Type | Columnar | Row-based |
| Use Case | ML pipelines, analytics | Streaming, schema evolution |
| Compression | Snappy, gzip | Snappy, deflate |
| Python Libs | pyarrow, fastparquet | fastavro, avro-python3 |
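A quick sketch of writing and reading Parquet via pandas; the pyarrow engine with Snappy compression is one common combination:

```python
import pandas as pd

df = pd.DataFrame([{"product_name": "Laptop", "product_price": 1499}])

# Columnar + compressed: suited for analytics and ML feature pipelines.
df.to_parquet("products.parquet", engine="pyarrow", compression="snappy")

df_back = pd.read_parquet("products.parquet")
```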
Error handling and retry patterns
APIs are inherently unreliable: they're external systems that may fail for reasons beyond our control. Network latency, rate limits, downtime or even unexpected data formats can all break an ingestion pipeline if not properly managed. In DataOps, reliability is non-negotiable. A failed call shouldn't crash your workflow or silently skip important data. By learning to:
- Detect and classify errors
- Retry failed requests wisely
- Log and audit all behavior

...you build pipelines that are resilient, traceable and production-ready.
When is this important?
- When your ingestion is critical and must not drop data
- When dealing with unstable or third-party APIs
- When working with large scheduled jobs, where errors can accumulate
- When scaling ingestion across hundreds of endpoints
- When you need to debug why something failed – hours or days later
Learning material
| Topic | Description |
|---|---|
| HTTP Errors and Failures | Classifying 4xx and 5xx errors, timeouts, malformed responses |
| Retry Patterns | Exponential backoff, jitter and retry caps |
| Retry Tools | tenacity, retrying libraries |
| Logging and Auditing | How to trace API activity and failures |
| Robustness Mechanisms | Timeouts, control headers like ETag, If-Modified-Since |
Types of API Failures
- Client errors (4xx):
  - You did something wrong (bad token, wrong endpoint, bad params)
  - 401 Unauthorized, 403 Forbidden, 404 Not Found
- Server errors (5xx):
  - The server is down or misbehaving
  - 502 Bad Gateway, 503 Service Unavailable, 504 Gateway Timeout
- Timeouts / network failures:
  - Slow connections, DNS failures, etc.
  - Must be caught explicitly (e.g., requests.exceptions.Timeout)
- Data issues:
  - Malformed JSON, missing keys, wrong types
  - Not HTTP errors – these are semantic errors
Retry Strategies
Retrying a failed request can work – but not blindly. We should use:
| Concept | Description |
|---|---|
| Exponential Backoff | Wait longer after each failure (e.g. 1s -> 2s -> 4s -> 8s) |
| Jitter | Add randomness to avoid retrying at the same time as others |
| Retry Caps | Stop retrying after N attempts or max time |
| Only Retry Recoverable Errors | 5xx, timeouts – but not 401/403/404 |
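One way to implement this with the tenacity library (a sketch: it retries only timeouts and connection errors, with randomized exponential backoff capped at five attempts):

```python
import requests
from tenacity import (retry, retry_if_exception_type,
                      stop_after_attempt, wait_random_exponential)

@retry(
    # Retry only recoverable failures; 401/403/404 raise HTTPError and are not retried here.
    retry=retry_if_exception_type((requests.exceptions.Timeout,
                                   requests.exceptions.ConnectionError)),
    wait=wait_random_exponential(multiplier=1, max=30),   # exponential backoff + jitter
    stop=stop_after_attempt(5),                           # retry cap
)
def fetch(url):
    resp = requests.get(url, timeout=5)
    resp.raise_for_status()
    return resp.json()
```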
Timeouts: your first defense
Set timeouts on all requests – always!
requests.get(url, timeout=5)
Without timeouts, your pipeline could hang indefinitely on a single request
Control Headers: avoid redundant calls
Some APIs offer:
- ETag: a unique version identifier for the resource
- Last-Modified: timestamp of the last change

Use:
If-None-Match: "abc123"
If-Modified-Since: "Tue, 25 Jul 2023 07:28:00 GMT"

-> the server will respond 304 Not Modified if nothing changed = bandwidth saved
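A sketch of a conditional GET using ETag (the URL is a placeholder):

```python
import requests

url = "https://api.shop.com/products/42"       # hypothetical resource

first = requests.get(url, timeout=5)
etag = first.headers.get("ETag")

# On the next poll, send the ETag back; 304 means nothing changed.
second = requests.get(url, headers={"If-None-Match": etag} if etag else {}, timeout=5)
data = first.json() if second.status_code == 304 else second.json()
```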
Logging: If it fails and you don't log it... Did it really happen?
Use Python's logging module to:
- Record all failures (and their reasons)
- Track retries
- Alert or escalate errors
import logging
logging.basicConfig(level=logging.INFO)
Store logs in file or database for auditability in production.
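A minimal pattern combining the two (file-based logging plus a logged, re-raised failure) might look like this:

```python
import logging
import requests

logging.basicConfig(
    filename="ingestion.log",                  # persist logs for auditing
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)

def fetch(url):
    try:
        resp = requests.get(url, timeout=5)
        resp.raise_for_status()
        logging.info("Fetched %s (%d bytes)", url, len(resp.content))
        return resp.json()
    except requests.exceptions.RequestException as exc:
        logging.error("Request to %s failed: %s", url, exc)
        raise                                  # let the orchestrator decide what happens next
```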
Introduction to web scraping
Web scraping is the process of automatically extracting information from web pages. In the DataOps + ML context, it's a powerful way to augment datasets, especially when public APIs are unavailable or incomplete.
Key reasons:
- ML Dataset Enrichment:
  - Example: adding real product descriptions and user reviews to improve recommendation systems
  - Example: collecting competitor pricing to train a dynamic pricing model
- Market Intelligence:
  - Competitive research on products, features, customer sentiment
- Data Unification:
  - Consolidating fragmented data from multiple online sources
- Trend Analysis:
  - Tracking changes over time, such as price fluctuations, product availability or review trends
"We scrape when we need structured data that exists only as unstructured web content." If a machine readable API already provides this data in a compliant way – scraping is often not the best choice.
When is scraping appropriate?
Not all scraping is legal, ethical or even technically allowed. We scrape only when:
- Data is public (no auth required)
- The website's ToS allows automated access – or doesn't explicitly forbid it
- robots.txt rules are respected (guidelines for bots)
- We comply with data protection laws (GDPR, CCPA, etc.)
- The purpose is legitimate and non-abusive (e.g., enriching your own models, not stealing proprietary databases)
Important boundaries
- Never scrape personal data without explicit consent
- Avoid scraping private/protected content (requires login, paywalls)
- Limit request rates to avoid harming site performance
What is scraping?
Web scraping involves:
- Sending HTTP requests to a page (e.g., GET requests)
- Downloading the HTML (or JSON, XML in some cases)
- Parsing the content to extract specific elements (using HTML tags, CSS selectors, XPath)
- Structuring the data into a usable format (CSV, JSON, database)
Core concepts:
- HTML: the markup language used to structure web pages.
- DOM: the tree-like structure browsers create from HTML.
- Selectors:
  - CSS selectors: target elements by tag, class, id, etc.
  - XPath: XML-style navigation through the DOM
- HTTP: the protocol for web communication (GET, POST, headers, cookies)
Think of a webpage as a messy pantry. Scraping is like opening the door, finding the ingredients you need (selectors), and putting them in your own neatly labeled jars (structured data).
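As a sketch of those steps with requests + BeautifulSoup (the URL and CSS classes are hypothetical; inspect the real page to find the right selectors):

```python
import requests
from bs4 import BeautifulSoup

url = "https://example.com/products"           # hypothetical public listing page
html = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10).text

soup = BeautifulSoup(html, "html.parser")
rows = []
for card in soup.select("div.product"):        # CSS selector for each product block (assumed)
    rows.append({
        "name": card.select_one("h2.title").get_text(strip=True),
        "price": card.select_one("span.price").get_text(strip=True),
    })
```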
How scraping differs from APIs and crawling
- Scraping vs APIs
  - APIs return structured data (usually JSON or XML) via a documented interface.
  - Scraping extracts unstructured data from HTML, requiring parsing and cleanup.
  - APIs are more stable but may require authentication, have rate limits or restrict certain data fields.
  - Scraping is flexible but fragile – website layout changes can break the script.
- Scraping vs Crawling
  - Scraping: extracts data from specific pages
  - Crawling: automatically navigates and indexes many pages (like Googlebot)
  - Crawling often includes scraping, but at a larger scale
The Ethics and Legality of Web Scraping
Web scraping—the automated extraction of data from websites—has become a common technique in data-driven industries. However, its use raises important ethical and legal considerations that developers must be aware of.
⚖️ Legal Considerations
The legality of web scraping varies by jurisdiction and context. Key legal factors include:
- Terms of Service Violations: Many websites explicitly prohibit scraping in their Terms of Service (ToS). Violating these terms may lead to account bans or civil lawsuits, even if the scraped content is publicly accessible.
- Copyright and Database Rights: Web content may be protected by copyright laws or sui generis database rights (particularly in the EU). Copying and redistributing such data could infringe on the rights of content owners.
- Computer Fraud and Abuse Act (CFAA): In the U.S., scraping that involves bypassing authentication or access restrictions may violate the CFAA, potentially leading to criminal liability.
- Data Protection Laws: Scraping personal data—especially in regions governed by GDPR (EU) or CCPA (California)—can breach privacy regulations if proper consent and data handling practices are not followed.
🧭 Ethical Considerations
Even when legal, scraping can raise ethical issues:
- Server Load and Denial of Service: Aggressive scraping can overwhelm a website's servers, unintentionally causing denial of service.
- Respect for Robots.txt: Ethical scrapers respect the robots.txt file, which indicates which parts of a site are off-limits to crawlers.
- Attribution and Fair Use: When using scraped data, ethical practices include citing sources and avoiding the misrepresentation or commercial exploitation of others' work without permission.
- Intent and Impact: Consider whether scraping benefits the public (e.g., academic research, open data aggregation) or harms stakeholders (e.g., copying content for commercial gain).
✅ Best Practices
To navigate these concerns responsibly:
- Read and follow the website’s ToS and robots.txt.
- Avoid scraping sensitive or personal information.
- Use rate limiting and caching to minimize server load.
- Consider reaching out to site owners for permission or API access.
By balancing legal compliance and ethical responsibility, developers can ensure that their scraping practices are both respectful and sustainable.
Tools and techniques for scraping
Why?
In modern web scraping, the choice of tool determines:
- Speed (requests is lightweight, Scrapy is optimized for large crawls)
- Ease of parsing (BeautifulSoup for simplicity, lxml for speed)
- Handling complexity (Selenium for JS-heavy sites)
- Scalability (Scrapy pipelines for industrial-scale jobs)

Picking a scraping tool is like picking a vehicle: a bike for short distances (requests + bs4), a car for medium trips (Scrapy) and a tank for rough terrain (Selenium + Playwright).
When to use each tool?
| Tool | Best For | Avoid When |
|---|---|---|
| requests | Simple, fast HTTP requests for static HTML/JSON pages | JS-heavy sites (data missing in raw HTML) |
| BeautifulSoup | Easy HTML parsing and extraction | Very large HTML pages (slower than lxml) |
| lxml | Fast parsing with XPath support | Less beginner-friendly syntax |
| Selenium | Interacting with dynamic, JS-rendered pages (clicks, scrolls) | High-scale scraping (slow) |
| Scrapy | Large-scale, structured, concurrent scraping | Very small, one-off scripts (overkill) |
What each lib does
- requests
  - Makes HTTP GET/POST requests
  - Handles headers, cookies, query params
  - Supports session persistence
- BeautifulSoup
  - Parses HTML into a navigable object tree
  - Finds elements by tag, class, id, CSS selectors
- lxml
  - High-performance parsing engine
  - Supports XPath for precise selection (see the XPath sketch after this list)
- Selenium
  - Automates browsers (Chrome, Firefox)
  - Loads JS-heavy pages by actually running JS
  - Can scroll, click and fill forms
- Scrapy
  - Framework for large scraping projects
  - Built-in concurrency, pipelines, middlewares
  - Built-in support for pagination and following links
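For comparison, the same kind of extraction with lxml and XPath (the HTML fragment and class names are made up for the example):

```python
from lxml import html

page = """
<div class="product"><h2>Laptop</h2><span class="price">1499</span></div>
<div class="product"><h2>Phone</h2><span class="price">899</span></div>
"""

tree = html.fromstring(page)
# XPath: pick the <h2> text and price span inside every product block.
names = tree.xpath('//div[@class="product"]/h2/text()')
prices = tree.xpath('//div[@class="product"]/span[@class="price"]/text()')
products = list(zip(names, prices))            # [('Laptop', '1499'), ('Phone', '899')]
```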
How to handle real-world scraping challenges
Handling JS-heavy pages
- Some sites load data after the initial HTML via JS
- Solutions:
  - Inspect network requests in devtools -> often you can find JSON API endpoints without using Selenium
  - If there is no direct API:
    - Use Selenium or Playwright to render and extract content
    - Use requests_html (lightweight rendering)

Avoid full browser automation unless strictly necessary – it's slow and resource-intensive.
Rate-limiting, user-agents and headers
- Why? To avoid being blocked and to behave politely
- Rate-limiting: add delays between requests (time.sleep() or async rate limits)
- User-agent rotation: pretend to be a normal browser (fake_useragent lib)
- Headers: send Referer, Accept-Language, cookies when needed
headers = {"User-Agent":"Mozilla/5.0", "Accept-Language":"en-US"}
response = requests.get(url, headers=headers)
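Putting these together, a polite fetch helper might rotate user agents and pause between requests (the agent strings here are just examples; the fake_useragent library can supply fresher ones):

```python
import random
import time
import requests

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
]

def polite_get(url):
    headers = {"User-Agent": random.choice(USER_AGENTS),
               "Accept-Language": "en-US"}
    resp = requests.get(url, headers=headers, timeout=10)
    time.sleep(random.uniform(1, 3))           # rate limit: pause between requests
    return resp
```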
Pagination and lazy loading
- Pagination:
  - Common: ?page=2 or offset-based (start=20)
  - Strategy: loop through page numbers until no more results (see the sketch below)
- Lazy loading / infinite scroll:
  - Often requires JS execution
  - Sometimes JSON APIs feed the lazy load -> sniff network requests
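A sketch of the page-number loop (URL, query param and selector are assumptions for illustration):

```python
import requests
from bs4 import BeautifulSoup

base = "https://example.com/products"          # hypothetical listing URL
page, items = 1, []
while True:
    html = requests.get(base, params={"page": page}, timeout=10).text
    cards = BeautifulSoup(html, "html.parser").select("div.product")   # assumed selector
    if not cards:                              # empty page -> stop paging
        break
    items.extend(card.get_text(strip=True) for card in cards)
    page += 1
```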
