
Web_Scraping_and_APIs

#scraping #api #integration #python

REST APIs and Robust Authentication

Why?

In the age of modern data-driven applications, REST APIs are the highways through which information flows between systems. For ML pipelines, this is crucial: external APIs allow us to enrich datasets with product details, behavioral logs, customer reviews and more – directly from e-commerce platforms, CRMs, IoT devices or public services. In DataOps, mastering API interaction is essential to:

  • Ingest real-time or external data into pipelines.
  • Ensure data freshness and variety for feature engineering.
  • Build scalable and secure systems that interface with external data providers.

When is this relevant?

  • When your ML use case requires external enrichment (e.g., product details, user activity).
  • When data is not stored in a centralized data lake but needs to be pulled from a third-party service.
  • When working with APIs from e-commerce, financial, logistics or public datasets.
  • When integrating with SaaS tools like Stripe, Shopify or Salesforce.

What will we cover?

  • RESTful architecture: principles, verbs, statelessness, endpoints
  • Request anatomy: headers, query params, path params, body
  • Authentication: API key, Bearer token, OAuth2 – when and how to use each
  • Pagination: strategies such as offset/limit and cursor-based
  • Rate limiting: avoiding throttling and respecting quotas
  • Sync vs. async: when to use synchronous vs. asynchronous APIs

REST: What is it really?

REST stands for REpresentational State Transfer – a lightweight architecture that treats everything as a resource.

  • You use HTTP verbs to act:
    • GET /products -> retrieve
    • POST /products -> create
    • PUT /products -> update
    • DELETE /products -> remove

Statelessness means the server doesn't remember who you are; each request must contain everything it needs.

Anatomy of an API request

GET /products?category=phones&page=2 HTTP/1.1
Host: api.shop.com
Authorization: Bearer <token>
Accept: application/json

  • Headers: metadata like auth, content-type
  • Query params: filters (?category=phones&page=2)
  • Body: used in POST and PUT to send data (usually JSON)
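With the requests library you can assemble such a request and inspect its parts without sending it. The host below is the example from above and the token is a placeholder:

```python
import requests

# Build (but don't send) a request to see how its parts fit together.
req = requests.Request(
    "GET",
    "https://api.shop.com/products",
    params={"category": "phones", "page": 2},   # query params
    headers={
        "Authorization": "Bearer <token>",      # auth header (placeholder token)
        "Accept": "application/json",
    },
)
prepared = req.prepare()
print(prepared.url)  # https://api.shop.com/products?category=phones&page=2
```

Calling `requests.get(...)` with the same arguments would prepare and send this in one step; `prepare()` is just a handy way to see the final URL and headers.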

Authentication deep dive

| Method | Use Case | Example |
| --- | --- | --- |
| API key | Simple services or internal APIs | ?api_key=123 |
| Bearer token | More secure, time-limited access | Authorization: Bearer <token> |
| OAuth2 | Third-party auth (e.g., Google, Twitter) | Redirect to auth server, get token |

Use Bearer tokens for production-grade ingestion pipelines.
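A common pattern for Bearer auth in a pipeline is attaching the token once to a requests.Session, so every request carries it. Reading the token from an environment variable keeps credentials out of code; the variable name API_TOKEN is just an example:

```python
import os
import requests

# Read the token from the environment (name is illustrative).
token = os.environ.get("API_TOKEN", "<token>")

# Every request made through this session now sends the auth header.
session = requests.Session()
session.headers.update({"Authorization": f"Bearer {token}"})

# session.get("https://api.shop.com/products")  # would include the header
```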

Pagination & Rate Limiting

APIs don't give you everything at once; you need to page through results.

  • Offset-based: ?page=2&limit=100
  • Cursor-based: ?after=last_id

Rate limiting: APIs will cut you off after a certain number of requests. Check headers like:

X-RateLimit-Limit: 500
X-RateLimit-Remaining: 100
Retry-After: 120

Always respect the API's quota to avoid bans.
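An offset-based paging loop that also honors Retry-After might look like the sketch below. `fetch_page` is a hypothetical stand-in for whatever function wraps the actual HTTP call and reads the rate-limit headers:

```python
import time

def fetch_all(fetch_page, limit=100):
    """Offset-based pagination: request pages until one comes back short.

    fetch_page(page, limit) is any callable returning (items, retry_after),
    where retry_after is taken from the Retry-After header (None if the
    quota is fine)."""
    items, page = [], 1
    while True:
        batch, retry_after = fetch_page(page, limit)
        if retry_after:              # server asked us to slow down
            time.sleep(retry_after)  # wait, then retry the same page
            continue
        items.extend(batch)
        if len(batch) < limit:       # short page => no more results
            return items
        page += 1
```

A cursor-based variant would pass the last seen id instead of a page number, but the loop shape is the same.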

Sync vs. Async APIs

| Type | When to Use | Characteristics |
| --- | --- | --- |
| Synchronous | For small, immediate responses | Fast, blocking |
| Asynchronous | When requests take time (e.g., large processing) | Returns a job ID, poll for status |
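The asynchronous pattern (submit, get a job ID, poll until done) can be sketched as a generic polling loop. `get_status` stands in for a call to the real status endpoint; the response shape `{"state": ..., "result": ...}` is an assumption for the example:

```python
import time

def poll_job(get_status, interval=1.0, max_polls=50):
    """Poll an async job until it finishes or we give up.

    get_status() stands in for hitting the status endpoint with the
    job ID returned at submission time."""
    for _ in range(max_polls):
        status = get_status()
        if status["state"] == "done":
            return status["result"]
        time.sleep(interval)         # don't hammer the status endpoint
    raise TimeoutError("job did not finish in time")
```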

Data formats & Serialization

Why?

Data retrieved from APIs – especially in real-world systems – is typically nested, verbose, and unstructured. For machine learning pipelines to consume this data efficiently:

  • It must be flattened, cleaned and stored in compact formats.
  • It must be transformed from raw API responses (JSON, XML) into formats optimized for storage, performance and ML processing (Avro, Parquet).

Without this transformation layer, even well-designed ingestion pipelines become a bottleneck due to:

  • high memory consumption
  • slow I/O operations
  • poor integration with data processing engines (e.g., Spark, Dask)

Thus, format handling and serialization is a critical competency in production-grade DataOps.

When is this important?

  • When you're ingesting large or deeply nested data from APIs
  • When your data needs to be stored in an ML-friendly format (columnar, compressed, binary)
  • When you're designing streaming ingestion pipelines or need interoperability across systems (e.g., Kafka -> Avro)
  • When optimizing for cloud storage or distributed processing (e.g., S3 + Spark)

What will we be learning?

| Concept | Description |
| --- | --- |
| Serialization | Transforming Python/JSON/XML objects into storable, transferable formats |
| Deserialization | Converting files or byte streams back into Python objects |
| JSON vs XML | Structural differences, parsing techniques |
| Flattening | Converting nested dictionaries/lists into flat dataframes |
| Parquet / Avro | Binary formats designed for efficient ML processing (Parquet is columnar, Avro row-based) |
| Compression | Improving performance and reducing storage using codecs like Snappy, Brotli, gzip |

Serialization ≠ Just saving

Think of serialization as packing a suitcase:

  • You can't travel with a full wardrobe (complex object) – you must flatten and organize it into something compact (JSON, binary, Parquet).
  • When you arrive (load the file), you unpack it into something you can use again (Python dict, pandas DataFrame).
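In code, the suitcase metaphor is just a serialize/deserialize round trip, here with the standard json module:

```python
import json

# A nested Python object – the "full wardrobe".
product = {"name": "Laptop", "price": 1499, "specs": {"ram": "16GB", "cpu": "i7"}}

packed = json.dumps(product)    # serialize: pack into a compact string
unpacked = json.loads(packed)   # deserialize: unpack back into a dict

print(type(packed).__name__, type(unpacked).__name__)  # str dict
```

The same dumps/loads shape applies to binary formats like Parquet or Avro, just with different libraries and byte streams instead of strings.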

JSON vs XML: What's the difference?

| Feature | JSON | XML |
| --- | --- | --- |
| Format | Lightweight, JS-friendly | Verbose, hierarchical |
| Data types | Supports arrays, objects, numbers and strings | Everything is a string |
| Use case | APIs, modern web services | Legacy systems, configs |
| Python libraries | json, pandas.json_normalize | xml.etree.ElementTree, lxml, BeautifulSoup |

Key issue: both are tree-like and often deeply nested, requiring flattening before ML use.

Flattening JSON for ML

JSON often looks like this:

{
  "product": {
    "name": "Laptop",
    "price": 1499,
    "specs": {
      "ram": "16GB",
      "cpu": "i7"
    }
  }
}

We want:

| product_name | product_price | product_specs_ram | product_specs_cpu |
| --- | --- | --- | --- |
| Laptop | 1499 | 16GB | i7 |

Use:

import pandas as pd
df = pd.json_normalize(data, sep='_')
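Putting the nested example and json_normalize together (assumes pandas is installed):

```python
import pandas as pd

# The nested product JSON from above.
data = {
    "product": {
        "name": "Laptop",
        "price": 1499,
        "specs": {"ram": "16GB", "cpu": "i7"},
    }
}

# json_normalize flattens nested dicts, joining keys with sep.
df = pd.json_normalize(data, sep="_")
print(list(df.columns))
# ['product_name', 'product_price', 'product_specs_ram', 'product_specs_cpu']
```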

Parsing XML: Think DOM trees

XML:

<product>
	<name>Laptop</name>
	<price>1499</price>
	<specs>
		<ram>16GB</ram>
		<cpu>i7</cpu>
	</specs>
</product>

You must walk through the tree and extract elements manually or recursively. Use:

import xml.etree.ElementTree as ET

or for messy APIs:

from bs4 import BeautifulSoup
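A stdlib-only sketch of walking the `<product>` tree above with ElementTree:

```python
import xml.etree.ElementTree as ET

doc = """
<product>
  <name>Laptop</name>
  <price>1499</price>
  <specs><ram>16GB</ram><cpu>i7</cpu></specs>
</product>
"""

root = ET.fromstring(doc)
# findtext accepts simple paths, so nested elements are easy to reach.
record = {
    "name": root.findtext("name"),
    "price": int(root.findtext("price")),   # XML values arrive as strings
    "ram": root.findtext("specs/ram"),
    "cpu": root.findtext("specs/cpu"),
}
print(record)  # {'name': 'Laptop', 'price': 1499, 'ram': '16GB', 'cpu': 'i7'}
```

BeautifulSoup offers the same extraction with more tolerance for malformed markup, at the cost of an extra dependency.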

Parquet & Avro: Optimized ML formats

| Format | Parquet | Avro |
| --- | --- | --- |
| Type | Columnar | Row-based |
| Use case | ML pipelines, analytics | Streaming, schema evolution |
| Compression | Snappy, gzip | Snappy, deflate |
| Python libs | pyarrow, fastparquet | fastavro, avro-python3 |

Error handling and retry patterns

APIs are inherently unreliable: they're external systems that may fail for reasons beyond our control. Network latency, rate limits, downtime or even unexpected data formats can all break an ingestion pipeline if not properly managed. In DataOps, reliability is non-negotiable. A failed call shouldn't crash your workflow or silently skip important data. By learning to:

  • Detect and classify errors
  • Retry failed requests wisely
  • Log and audit all behavior

...you build pipelines that are resilient, traceable and production-ready.

When is this important?

  • When your ingestion is critical and must not drop data
  • When dealing with unstable or third-party APIs
  • When working with large scheduled jobs, where errors can accumulate
  • When scaling ingestion across hundreds of endpoints
  • When you need to debug why something failed – hours or days later

Learning material

| Topic | Description |
| --- | --- |
| HTTP errors and failures | Classify 4xx, 5xx, timeouts, malformed responses |
| Retry patterns | Exponential backoff, jitter and retry caps |
| Retry tools | tenacity, retrying libraries |
| Logging and auditing | How to trace API activity and failures |
| Robustness mechanisms | Timeouts, control headers like ETag and If-Modified-Since |

Types of API Failures

  1. Client errors (4xx):
    • You did something wrong (bad token, wrong endpoint, bad params)
    • 401 Unauthorized, 403 Forbidden, 404 Not Found
  2. Server errors (5xx):
    • The server is down or misbehaving
    • 502 Bad Gateway, 503 Service Unavailable, 504 Gateway Timeout
  3. Timeouts / network failures:
    • Slow connections, DNS failures, etc.
    • Must be caught explicitly (e.g., requests.exceptions.Timeout)
  4. Data issues:
    • Malformed JSON, missing keys, wrong types
    • Not HTTP errors – these are semantic errors

Retry Strategies

Retrying a failed request can work – but not blindly. We should use:

| Concept | Description |
| --- | --- |
| Exponential backoff | Wait longer after each failure (e.g., 1s -> 2s -> 4s -> 8s) |
| Jitter | Add randomness to avoid retrying at the same time as others |
| Retry caps | Stop retrying after N attempts or a maximum total time |
| Only retry recoverable errors | 5xx and timeouts – but not 401/403/404 |
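A minimal helper combining backoff, jitter and a retry cap (an illustrative sketch, not a drop-in replacement for tenacity; classifying recoverable vs. non-recoverable errors is left out for brevity):

```python
import random
import time

def retry(call, max_attempts=5, base_delay=1.0):
    """Run call() with exponential backoff + jitter, capped at max_attempts."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise                               # retry cap reached
            delay = base_delay * (2 ** attempt)     # 1s -> 2s -> 4s -> ...
            delay += random.uniform(0, delay)       # jitter: desynchronize clients
            time.sleep(delay)
```

In production you would catch only recoverable exceptions (timeouts, 5xx) and re-raise 401/403/404 immediately, as the table says.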

Timeouts: your first defense

Set timeouts on all requests – always!

requests.get(url, timeout=5)

Without timeouts, your pipeline could hang indefinitely on a single request.

Control Headers: avoid redundant calls

Some APIs offer:

  • ETag: a unique version identifier of a resource
  • Last-Modified: timestamp of the last change

Use:

If-None-Match: "abc123"
If-Modified-Since: "Tue, 25 Jul 2023 07:28:00 GMT"

-> the server will respond 304 Not Modified if nothing changed = saved bandwidth.
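As a small sketch, the validators cached from a previous response can be turned into these headers with a helper like this (the function name is made up):

```python
def conditional_headers(etag=None, last_modified=None):
    """Build If-None-Match / If-Modified-Since headers from cached values,
    so the server can answer 304 Not Modified instead of resending data."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers

# requests.get(url, headers=conditional_headers(etag='"abc123"'))
# -> check response.status_code == 304 before re-parsing the body
```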

Logging: If it fails and you don't log it... Did it really happen?

Use Python's logging module to:

  • Record all failures (and their reasons)
  • Track retries
  • Alert or escalate errors

import logging
logging.basicConfig(level=logging.INFO)

Store logs in file or database for auditability in production.
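A minimal sketch of this pattern: a wrapper that logs any failure with context and returns None instead of crashing the pipeline (the names are illustrative):

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("ingestion")

def safe_call(call, url):
    """Run call(); on failure, log the URL and reason instead of crashing."""
    try:
        return call()
    except Exception as exc:
        logger.error("request to %s failed: %s", url, exc)
        return None
```

In production you would point a FileHandler (or a log shipper) at this logger so every failure is auditable later.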


Introduction to web scraping

Web scraping is the process of automatically extracting information from web pages. In the DataOps + ML context, it's a powerful way to augment datasets, especially when public APIs are unavailable or incomplete.

Key reasons:

  • ML dataset enrichment:
    • Example: adding real product descriptions and user reviews to improve recommendation systems
    • Example: collecting competitor pricing to train a dynamic pricing model
  • Market Intelligence:
    • Competitive research on products, features, customer sentiment
  • Data Unification:
    • Consolidating fragmented data from multiple online sources
  • Trend Analysis:
    • Tracking changes over time, such as price fluctuations, product availability or review trends

"We scrape when we need structured data that exists only as unstructured web content." If a machine readable API already provides this data in a compliant way – scraping is often not the best choice.

When is scraping appropriate?

Not all scraping is legal, ethical or even technically allowed. We scrape only when:

  • Data is public (no auth required)
  • The website's ToS allows automated access – or doesn't explicitly forbid it
  • robots.txt rules are respected (guidelines for bots)
  • We comply with data protection laws (GDPR, CCPA, etc.)
  • The purpose is legitimate and non-abusive (e.g., enriching your own models, not stealing proprietary databases)

Important boundaries

  • Never scrape personal data without explicit consent
  • Avoid scraping private/protected content (behind logins or paywalls)
  • Limit request rate to avoid harming site performance

What is scraping?

Web scraping involves:

  1. Sending HTTP requests to a page (e.g., GET request)
  2. Downloading the HTML (or JSON, XML in some cases)
  3. Parsing the content to extract specific elements (using HTML tags, CSS selectors, XPath)
  4. Structuring the data into a usable format (CSV, JSON, database)

Core concepts:

  • HTML: the markup language used to structure web pages
  • DOM: the tree-like structure browsers create from HTML
  • Selectors:
    • CSS selectors: target elements by tag, class, id, etc.
    • XPath: XML-style navigation through the DOM
  • HTTP: the protocol for web communication (GET, POST, headers, cookies)

Think of a webpage as a messy pantry. Scraping is like opening the door, finding the ingredients you need (selectors), and putting them in your own neatly labeled jars (structured data).
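As a tiny end-to-end sketch of steps 2-4 (assuming BeautifulSoup is installed; the HTML snippet and class names are invented for the example):

```python
from bs4 import BeautifulSoup

# In a real scraper this HTML would come from requests.get(url).text.
html = """
<div class="product">
  <h2 class="name">Laptop</h2>
  <span class="price">1499</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# CSS selectors pick out the jars we want from the pantry.
item = {
    "name": soup.select_one(".product .name").get_text(strip=True),
    "price": int(soup.select_one(".product .price").get_text(strip=True)),
}
print(item)  # {'name': 'Laptop', 'price': 1499}
```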

How scraping differs from APIs and crawling

  • Scraping vs APIs:
    • APIs return structured data (usually JSON or XML) via a documented interface.
    • Scraping extracts unstructured data from HTML, requiring parsing and cleanup.
    • APIs are more stable but may require authentication, have rate limits or restrict certain data fields.
    • Scraping is flexible but fragile – website layout changes can break the script.
  • Scraping vs Crawling:
    • Scraping: extracts data from specific pages.
    • Crawling: automatically navigates and indexes many pages (like Googlebot).
    • Crawling often includes scraping, but at a larger scale.

The Ethics and Legality of Web Scraping

Web scraping—the automated extraction of data from websites—has become a common technique in data-driven industries. However, its use raises important ethical and legal considerations that developers must be aware of.

⚖️ Legal Considerations

The legality of web scraping varies by jurisdiction and context. Key legal factors include:

  • Terms of Service Violations: Many websites explicitly prohibit scraping in their Terms of Service (ToS). Violating these terms may lead to account bans or civil lawsuits, even if the scraped content is publicly accessible.

  • Copyright and Database Rights: Web content may be protected by copyright laws or sui generis database rights (particularly in the EU). Copying and redistributing such data could infringe on the rights of content owners.

  • Computer Fraud and Abuse Act (CFAA): In the U.S., scraping that involves bypassing authentication or access restrictions may violate the CFAA, potentially leading to criminal liability.

  • Data Protection Laws: Scraping personal data—especially in regions governed by GDPR (EU) or CCPA (California)—can breach privacy regulations if proper consent and data handling practices are not followed.

🧭 Ethical Considerations

Even when legal, scraping can raise ethical issues:

  • Server Load and Denial of Service: Aggressive scraping can overwhelm a website's servers, unintentionally causing denial of service.

  • Respect for Robots.txt: Ethical scrapers respect the robots.txt file, which indicates which parts of a site are off-limits to crawlers.

  • Attribution and Fair Use: When using scraped data, ethical practices include citing sources and avoiding the misrepresentation or commercial exploitation of others' work without permission.

  • Intent and Impact: Consider whether scraping benefits the public (e.g., academic research, open data aggregation) or harms stakeholders (e.g., copying content for commercial gain).

✅ Best Practices

To navigate these concerns responsibly:

  • Read and follow the website’s ToS and robots.txt.
  • Avoid scraping sensitive or personal information.
  • Use rate limiting and caching to minimize server load.
  • Consider reaching out to site owners for permission or API access.

By balancing legal compliance and ethical responsibility, developers can ensure that their scraping practices are both respectful and sustainable.

Tools and techniques for scraping

Why?

In modern web scraping, the choice of tool determines:

  • Speed (requests is lightweight, Scrapy is optimized for large crawls)
  • Ease of parsing (BeautifulSoup for simplicity, lxml for speed)
  • Handling complexity (Selenium for JS-heavy sites)
  • Scalability (Scrapy pipelines for industrial-scale jobs)

Picking a scraper tool is like picking a vehicle: a bike for short distances (requests + bs4), a car for medium trips (Scrapy) and a tank for rough terrain (Selenium + Playwright).

When to use each tool?

| Tool | Best For | Avoid When |
| --- | --- | --- |
| requests | Simple, fast HTTP requests for static HTML/JSON pages | JS-heavy sites (data missing in raw HTML) |
| BeautifulSoup | Easy HTML parsing and extraction | Very large HTML pages (slower than lxml) |
| lxml | Fast parsing with XPath support | Less beginner-friendly syntax |
| Selenium | Interacting with dynamic, JS-rendered pages (clicks, scrolls) | High-scale scraping (slow) |
| Scrapy | Large-scale, structured, concurrent scraping | Very small, one-off scripts (overkill) |

What each lib does

  1. requests
    • Makes HTTP GET/POST requests
    • Handles headers, cookies, query params
    • Supports session persistence
  2. BeautifulSoup
    • Parses HTML into a navigable object tree
    • Finds elements by tag, class, id, CSS selectors
  3. lxml
    • High-performance parsing engine
    • Supports XPath for precise selection
  4. Selenium
    • Automates browsers (Chrome, Firefox)
    • Loads JS-heavy pages by actually running JS
    • Can scroll, click and fill forms
  5. Scrapy
    • Framework for large scraping projects
    • Built-in concurrency, pipelines, middlewares
    • Built-in support for pagination and following links

How to handle real-world scraping challenges

Handling JS-heavy pages

  • Some sites load data after the initial HTML via JS
  • Solutions:
    • Inspect network requests in devtools -> often you can find JSON API endpoints without using Selenium
    • If there is no direct API:
      • Use Selenium or Playwright to render and extract content
      • Use requests_html (lightweight rendering)

Avoid full browser automation unless strictly necessary – it's slow and resource-intensive.

Rate-limiting, user-agents and headers

  • Why? To avoid being blocked and to behave politely
  • Rate-limiting: add delays between requests (time.sleep() or async rate limits)
  • User-agent rotation: pretend to be a normal browser (fake_useragent lib)
  • Headers: send referrer, accept-language, cookies when needed

headers = {"User-Agent": "Mozilla/5.0", "Accept-Language": "en-US"}
response = requests.get(url, headers=headers)

Pagination and lazy loading

  • Pagination:
    • Common: ?page=2 or offset-based (start=20)
    • Strategy: loop through page numbers until no more results
  • Lazy loading / infinite scroll:
    • Often requires JS execution
    • Sometimes JSON APIs feed the lazy load -> sniff network requests
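The page-number loop described above can be sketched generically. `get_page` is a hypothetical function that fetches and parses one listing page, returning an empty list when the pages run out:

```python
def scrape_all_pages(get_page):
    """Loop through ?page=N until an empty page comes back.

    get_page(page) stands in for a function that requests one listing
    page (with polite delays/headers) and returns a list of parsed items."""
    results, page = [], 1
    while True:
        items = get_page(page)
        if not items:          # empty page => we've reached the end
            return results
        results.extend(items)
        page += 1
```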
