Distributed_Systems_Architecture
Introduction
Distributed data systems are infrastructures in which data is stored and processed across multiple nodes working together. These nodes share the workload, and when the need for scaling arises, more machines can be added (scaling horizontally) instead of upgrading a single machine.
Why?
The rise of social networks, e-commerce platforms, and IoT devices has created a flood of data. A single machine cannot store or process petabytes of information efficiently. Distributed systems enable:
- Scalability: handle growth by adding more nodes
- Fault tolerance: if one machine fails, others continue to function
- Performance: tasks can be parallelized, significantly reducing computation time
When?
Distributed systems are used when:
- Data volume is too large for a single machine
- Applications require high availability
- Real-time or large-scale analytics is necessary
How?
They function by splitting data into smaller parts, replicating them across nodes, and using distributed computing frameworks like Hadoop or Spark to process them. Data locality is often prioritized – computation is moved to where the data resides, not the other way around.
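The split-and-process idea above can be sketched in a few lines of plain Python. This is a toy illustration of "move compute to the data", not any real framework's API: chunks stand in for HDFS blocks, and only small partial results cross the imaginary network.

```python
# Toy sketch of data splitting + data locality (illustrative, not Hadoop's API).

def split_into_chunks(records, chunk_size):
    """Split a dataset into fixed-size chunks, like HDFS splits files into blocks."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]

def run_with_data_locality(records, nodes, chunk_size, local_compute):
    chunks = split_into_chunks(records, chunk_size)
    # Assign each chunk to a node round-robin (a stand-in for HDFS placement).
    placement = {i: nodes[i % len(nodes)] for i in range(len(chunks))}
    # Each "node" computes over its own local chunk; only the small
    # partial results travel over the network, not the raw data.
    partials = [local_compute(chunks[i]) for i in placement]
    return sum(partials)

total = run_with_data_locality(list(range(100)), ["node1", "node2", "node3"],
                               chunk_size=25, local_compute=sum)
print(total)  # 4950, the sum of 0..99
```

The key property: each partial result is tiny compared to the chunk it summarizes, which is exactly why shipping computation beats shipping data.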
Centralized vs Distributed Systems
Centralized System
- Example: Hosting a Bitcoin trading service on a single home computer.
- Scaling method: Vertical Scaling (add CPU, RAM, etc.).
- Limitations:
- Physical limits to hardware upgrades.
- Single point of failure (power outage, crashes).
- High latency for global users.
Distributed System
- Scales horizontally (add more machines).
- Can support billions of users.
- Improves reliability (no single point of failure).
- Reduces latency by distributing workload across regions.
Key Characteristics
- Communication happens only through the network.
- Processes must maintain a shared state.
- If processes don’t know about each other → it’s just a collection of computers, not a distributed system.
- Goal: Design algorithms so distributed processes can collaborate effectively.
Hadoop Ecosystem
The Hadoop ecosystem is an open-source project, originally developed at Yahoo!, that addresses the challenges of storing, managing, and processing big data on commodity hardware. At its core, Hadoop addresses three fundamental dimensions of big data:
- Volume: the sheer amount of data to store
- Velocity: how fast data arrives and must be processed at scale
- Variety: handling different types of data (structured, semi-structured, and unstructured)
Four Pillars of Hadoop
(The fourth pillar, Hadoop Common, is the set of shared utility libraries the other components build on; the three main ones are covered below.)
HDFS (Hadoop Distributed File System)
- Distributed, scalable file system that stores large files by breaking them into blocks
- Optimized for throughput rather than low latency
- Provides replication for fault tolerance
MapReduce
Programming paradigm for processing large datasets in parallel.
- Map: Break the input into key-value pairs and process them independently
- Reduce: Aggregate the results of the map step
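The map/reduce steps above can be simulated in memory with the classic word-count example. This is a minimal sketch of the programming model, not the Hadoop API: the shuffle step, which the framework normally performs between the two phases, is written out explicitly.

```python
# Minimal in-memory sketch of the MapReduce model (not the Hadoop API).
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) key-value pair for every word, independently per record.
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # Shuffle: group all emitted values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the grouped values for each key.
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data big cluster", "big data"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
print(counts)  # {'big': 3, 'data': 2, 'cluster': 1}
```

Because each map call and each reduce call only sees its own slice of data, the framework is free to run them in parallel on different nodes.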
YARN (Yet Another Resource Negotiator)
- Cluster resource manager introduced in Hadoop 2.0
- Separates resources management from data processing
- Allows multiple processing frameworks to run on Hadoop simultaneously
Higher-Level Ecosystem Tools
Hadoop grew into an entire ecosystem of tools. Some important ones include:
- Hive: SQL-like query engine for big data (translates queries into MapReduce or Spark)
- Pig: scripting language for data transformations
- HBase: column-oriented NoSQL database for real-time access
- Oozie: Workflow scheduler for Hadoop jobs
- Zookeeper: coordination service for distributed systems
How the Ecosystem Works Together
- Raw data ingestion -> stored in HDFS
- Batch processing -> run a MapReduce or Spark job to compute features
- Querying results -> analysts use Hive to run SQL-like queries for insights
- Serving features -> processed data stored in HBase to be accessed in near real-time by the recommendation engine
- Orchestration -> pipelines scheduled with Oozie and coordinated with ZooKeeper
Strengths of Hadoop ecosystem
- Scalability: can scale to thousands of nodes
- Flexibility: handles all data types
- Cost efficiency: runs on commodity hardware
- Ecosystem integration: tools cover the full data lifecycle
HDFS: Throughput vs. Low Latency Systems
HDFS Characteristics
- Optimized for throughput → handles huge files and large-scale batch jobs efficiently.
- Not designed for low latency → small, random reads/writes are slow.
- Ideal for Big Data analytics where processing large datasets matters more than instant response.
Comparison: HDFS vs. Low Latency Systems
| Feature | HDFS (Hadoop Distributed File System) | Low-Latency Systems (e.g., HBase, Cassandra) |
|---|---|---|
| Primary Goal | High throughput for big data storage & processing | Low latency for fast query/transaction response |
| Data Access Pattern | Large, sequential reads/writes (streaming access) | Random, small reads/writes (real-time access) |
| Block Size | Large blocks (128MB, 256MB, etc.) | Small record-level access |
| Best For | Batch processing, analytics, ETL jobs | Real-time applications, OLTP workloads |
| Write Behavior | Replicates blocks to multiple nodes → higher latency but reliable | Writes are quick, designed for fast availability |
| Typical Use Case | Storing logs, images, videos, large datasets; powering Spark/MapReduce | Serving user profiles, session data, messaging apps |
| Trade-off | Reliability & throughput over response time | Response time over bulk throughput |
Summary
- HDFS = Best when you need to store and process petabytes of data with reliability.
- Low-latency systems = Best when you need to serve queries or updates in milliseconds.
Deep dive into HDFS
HDFS (Hadoop Distributed File System) is the backbone of the Hadoop ecosystem. It is designed to store extremely large files across a cluster of inexpensive machines.
Core principles of HDFS
- Write Once, Read Many Times (WORM)
- Data is generally written once and read many times.
- Optimized for analytics workloads rather than frequent updates
- Large Block Size
- Instead of 4KB blocks, HDFS uses 128MB or 256MB blocks
- This reduces metadata overhead and allows efficient sequential reads
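A quick back-of-the-envelope calculation shows why block size matters for metadata: the NameNode must track every block of every file. The figures below (a 1 TiB file; per-block metadata cost left implicit) are rough illustrations, not measured Hadoop numbers.

```python
# Why large blocks shrink NameNode metadata: count the blocks for a 1 TiB file.

def num_blocks(file_size, block_size):
    return -(-file_size // block_size)  # ceiling division

TiB = 1024 ** 4
small = num_blocks(1 * TiB, 4 * 1024)           # 4 KiB blocks (typical local FS)
large = num_blocks(1 * TiB, 128 * 1024 * 1024)  # 128 MiB blocks (HDFS default)

print(small)  # 268435456 blocks to track
print(large)  # 8192 blocks to track
```

That is roughly a 32,000x reduction in the number of block entries the NameNode must hold in memory, and it also means readers spend far more time streaming within a block than seeking between blocks.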
- Data Locality
- HDFS tries to run computations on the nodes where data resides
- This avoids network bottlenecks by moving compute to the data, not data to the compute
File Write Workflow in HDFS
Let's say a file is uploaded:
- Client contacts the NameNode -> the NameNode breaks the file into blocks
- NameNode decides which DataNodes will store each block
- Client writes block replicas directly to the assigned DataNodes
- Replication factor = 3 (by default)
  - Block A is stored on Node1, Node2, and Node3
  - Block B is stored on Node3, Node4, and Node5
- NameNode updates its metadata
Result: the file is distributed + fault tolerant
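The placement step in the write workflow can be sketched as follows. This is a toy stand-in for the NameNode's placement decision (real HDFS also considers racks and node load); the round-robin scheme and node names are invented for the example.

```python
# Toy NameNode placement: choose 3 DataNodes for each block's replicas.
import itertools

REPLICATION_FACTOR = 3

def place_blocks(blocks, datanodes, replication=REPLICATION_FACTOR):
    """Spread each block's replicas across the cluster round-robin."""
    placements = {}
    ring = itertools.cycle(datanodes)
    for block in blocks:
        placements[block] = [next(ring) for _ in range(replication)]
    return placements

nodes = ["node1", "node2", "node3", "node4", "node5"]
plan = place_blocks(["blockA", "blockB"], nodes)
print(plan)  # blockA -> node1, node2, node3; blockB -> node4, node5, node1
```

Once the plan exists, the client streams each block directly to its assigned DataNodes; the NameNode only ever handles metadata, never the data itself.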
File Read Workflow in HDFS
Let's say a file is being read:
- Client contacts the NameNode to request block locations
- NameNode returns a list of DataNodes for each block
- Client fetches blocks directly from the DataNodes
- Blocks are combined into the full file by the client
Efficiency trick: if a block replica exists on the same machine as the client, HDFS chooses that one first -> reduces network transfer
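The locality trick in the read workflow amounts to a simple preference rule over replicas. The sketch below is illustrative (real HDFS also ranks same-rack replicas ahead of remote ones); block names and the tie-break rule are assumptions for the demo.

```python
# Read-side locality: among a block's replicas, prefer the client's own machine.

def pick_replica(replica_nodes, client_node):
    # Local replica first; otherwise fall back to any remote one (first listed here).
    return client_node if client_node in replica_nodes else replica_nodes[0]

block_locations = {"blockA": ["node1", "node2", "node3"],
                   "blockB": ["node3", "node4", "node5"]}
reads = {block: pick_replica(nodes, client_node="node3")
         for block, nodes in block_locations.items()}
print(reads)  # both blocks are read from node3, the client's own machine
```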
HDFS Architecture Components
Master/worker architecture (traditionally called master-slave)
NameNode (master node)
- Stores metadata (file names, block locations, permissions)
- Does not store the actual data
- Critical single point of coordination -> if it fails, the whole cluster halts
- Modern setups use high availability with active + standby NameNodes
DataNodes (worker nodes)
- Store the actual blocks of data
- Handle read/write requests from clients
- Periodically send heartbeats to the NameNode to confirm availability
- If a DataNode fails, the system automatically re-replicates its blocks
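Heartbeat-based failure handling can be sketched in two steps: detect nodes that have gone silent, then restore the replication factor for any blocks they held. The timeout, timestamps, and node names below are invented for the demo; real HDFS uses longer, configurable intervals.

```python
# Toy heartbeat monitoring + re-replication (illustrative, not HDFS internals).

def find_dead_nodes(last_heartbeat, now, timeout):
    """A node is declared dead if it hasn't reported within the timeout."""
    return {node for node, t in last_heartbeat.items() if now - t > timeout}

def rereplicate(block_map, dead, live_nodes):
    """Return blocks that lost a replica, with a replacement node chosen."""
    fixes = {}
    for block, replicas in block_map.items():
        survivors = [n for n in replicas if n not in dead]
        if len(survivors) < len(replicas):
            candidates = [n for n in live_nodes if n not in survivors]
            fixes[block] = survivors + candidates[:len(replicas) - len(survivors)]
    return fixes

heartbeats = {"node1": 100, "node2": 100, "node3": 40}  # node3 went silent
dead = find_dead_nodes(heartbeats, now=100, timeout=30)
blocks = {"blockA": ["node1", "node2", "node3"]}
fixes = rereplicate(blocks, dead, live_nodes=["node1", "node2", "node4"])
print(fixes)  # blockA regains its third replica on node4
```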
Secondary NameNode
- Misleading name: it does not replace the NameNode on failure
- Its job is to periodically merge the NameNode's edit log with metadata snapshots (fsimage)
- Prevents the edit log from growing too large
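The checkpoint the Secondary NameNode performs boils down to: replay the pending edit log onto the last fsimage snapshot, producing a fresh snapshot and an empty log. The operation names and metadata fields below are invented for illustration; real fsimage/edit-log records are binary and far richer.

```python
# Toy checkpoint: fold the edit log into the fsimage snapshot.

def checkpoint(fsimage, edit_log):
    image = dict(fsimage)            # start from the last metadata snapshot
    for op, path, meta in edit_log:  # replay pending edits in order
        if op == "create":
            image[path] = meta
        elif op == "delete":
            image.pop(path, None)
    return image, []                 # new fsimage, truncated edit log

fsimage = {"/logs/a": {"blocks": 2}}
edits = [("create", "/logs/b", {"blocks": 1}), ("delete", "/logs/a", None)]
new_image, new_log = checkpoint(fsimage, edits)
print(new_image)  # {'/logs/b': {'blocks': 1}}
```

Without this periodic merge, a NameNode restart would have to replay an ever-growing edit log before the cluster could come back up.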
1. Data Partitioning
Objective: organize data into partitions to improve query performance and manageability.
Strategies:
- Partition by time: for time-series data, partitioning by date (e.g., year, month, day) allows efficient querying over specific time ranges.
- Partition by business logic: partitioning by fields like region, product_id, or customer_id can optimize queries that filter on these attributes.
Tools:
- Delta Lake: use Delta Lake's OPTIMIZE command to compact small files and improve read performance. ([delta.io][1])
- Parquet: implement partitioning manually by organizing data into directories corresponding to partition keys.
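Time-based partitioning usually shows up on disk as Hive-style `key=value` directories. The helper below sketches that layout; the bucket path and field names are made up for the example.

```python
# Sketch of time-based partition paths (Hive-style year=/month=/day= layout).
from datetime import date

def partition_path(base, event_date):
    """Build the directory a record lands in, given its event date."""
    return (f"{base}/year={event_date.year}"
            f"/month={event_date.month:02d}/day={event_date.day:02d}")

path = partition_path("s3://bucket/silver/events", date(2024, 3, 7))
print(path)  # s3://bucket/silver/events/year=2024/month=03/day=07
```

A query filtered to March 2024 then only needs to list and read `year=2024/month=03/*`, skipping every other directory entirely (partition pruning).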
2. Data Clustering
Objective: enhance query performance by physically co-locating related data.
Strategies:
- Z-Ordering: in Delta Lake, use Z-Ordering to cluster data based on frequently queried columns, reducing the amount of data scanned during queries. ([Denny Lee][2])
- Clustering columns: choose clustering columns that align with your most common query patterns to optimize data layout.
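The idea behind Z-Ordering can be shown with a tiny Morton-code sketch: interleaving the bits of two column values produces one sort key under which rows close in *both* columns end up close on disk. This is only the core bit trick; Delta Lake's actual implementation does considerably more.

```python
# Morton (Z-order) code: interleave the bits of two values into one sort key.

def morton_code(x, y, bits=16):
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)      # even bit positions come from x
        code |= ((y >> i) & 1) << (2 * i + 1)  # odd bit positions come from y
    return code

points = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 2)]
zsorted = sorted(points, key=lambda p: morton_code(*p))
print(zsorted)  # neighbours in (x, y) stay neighbours in the sorted layout
```

Sorting a table by such a key lets file-level min/max statistics prune data for filters on either column, which a plain single-column sort cannot do.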
3. Data Compression
Objective: reduce storage costs and improve I/O performance.
Strategies:
- Columnar compression: leverage Parquet's columnar storage format, which allows efficient compression and encoding.
- Compression codecs: use efficient compression codecs like Snappy or Zstd to balance compression ratio and decompression speed.
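The codec trade-off can be felt with Python's standard library alone. Here zlib stands in for a balanced codec and lzma for a slower, higher-ratio one (Snappy and Zstd themselves need third-party packages, so they are not shown); the CSV-like payload is invented, and its repetitiveness mimics why columnar data compresses so well.

```python
# Codec trade-off demo on repetitive, column-like data (stdlib codecs only).
import zlib
import lzma

payload = b"user_id,region,amount\n" + b"42,eu-west,19.99\n" * 10_000

z = zlib.compress(payload)   # fast, moderate ratio
x = lzma.compress(payload)   # slower, usually a higher ratio
print(len(payload), len(z), len(x))  # both shrink the payload dramatically
```

In a real lakehouse the same choice appears as a Parquet writer option (e.g., picking Snappy for hot data read constantly, Zstd for colder data where ratio matters more).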
4. Data Cleanup and Vacuuming
Objective: maintain optimal storage by removing obsolete data.
Strategies:
- Delta Lake VACUUM: regularly run the VACUUM command in Delta Lake to remove files that are no longer referenced, freeing up storage space. ([Xebia][3])
- Data retention policies: implement data retention policies to automatically delete outdated data, ensuring compliance and reducing storage costs.
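The selection rule behind such a cleanup can be sketched as: a file is deletable only if it is no longer referenced by the table *and* is older than the retention window. The 7-day window, file names, and timestamps below are illustrative, not Delta Lake's internals.

```python
# Toy retention sweep in the spirit of VACUUM (illustrative only).

RETENTION_DAYS = 7  # assumed retention window for the demo

def files_to_vacuum(all_files, referenced, now, retention_days=RETENTION_DAYS):
    """Return unreferenced files older than the retention cutoff."""
    cutoff = now - retention_days * 86_400  # retention window in seconds
    return sorted(f for f, mtime in all_files.items()
                  if f not in referenced and mtime < cutoff)

files = {"part-001.parquet": 0,        # old, orphaned by a rewrite
         "part-002.parquet": 950_000,  # recent
         "part-003.parquet": 0}        # old but still referenced
live = {"part-002.parquet", "part-003.parquet"}
doomed = files_to_vacuum(files, live, now=1_000_000)
print(doomed)  # ['part-001.parquet']
```

The retention window matters because readers (and time-travel queries) may still hold references to recently replaced files; deleting too eagerly can break them.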
5. Data Lifecycle Management
Objective: optimize storage costs by managing data based on its access patterns.
Strategies:
- Tiered storage: move infrequently accessed data to lower-cost storage tiers, such as Amazon S3 Glacier, while keeping frequently accessed data on faster storage. ([dataclassification.fortra.com][4])
- Archiving: archive older data that is rarely accessed but still valuable for historical analysis.
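A lifecycle policy like the one described often reduces to a rule mapping time-since-last-access to a tier. The thresholds, tier names, and dataset ages below are assumptions invented for the example, not any cloud provider's defaults.

```python
# Toy lifecycle policy: choose a storage tier from days since last access.

def choose_tier(days_since_access):
    if days_since_access <= 30:
        return "hot"      # fast, expensive storage for active data
    if days_since_access <= 180:
        return "cool"     # cheaper, slower storage
    return "archive"      # e.g., Glacier-class storage for rarely read data

datasets = {"events_2025": 3, "events_2024": 90, "events_2019": 400}
tiers = {name: choose_tier(age) for name, age in datasets.items()}
print(tiers)  # {'events_2025': 'hot', 'events_2024': 'cool', 'events_2019': 'archive'}
```

In practice such rules are usually expressed declaratively (e.g., as bucket lifecycle configuration) rather than in application code, but the decision logic is the same.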
By implementing these strategies, you can optimize the storage of a structured silver dataset, ensuring efficient data management and improved query performance.

[1]: https://delta.io/blog/delta-lake-vs-parquet-comparison/ "Delta Lake vs. Parquet Comparison"
[2]: https://dennyglee.com/2024/01/29/optimize-by-clustering-not-partitioning-data-with-delta-lake/ "Optimize by Clustering not Partitioning Data with Delta Lake"
[3]: https://xebia.com/blog/databricks-lakehouse-optimization-a-deep-dive-into-vacuum/ "Databricks Lakehouse Optimization: A Deep Dive Into ..."
[4]: https://dataclassification.fortra.com/blog/how-optimize-your-data-techniques-better-storage-and-analysis "How to Optimize Your Data: Techniques for Better Storage ..."
