Distributed_Systems_Architecture
Introduction
Distributed data systems are infrastructures in which data is stored and processed across multiple nodes working together. These nodes share the workload, and when the need for scaling arises, more machines can be added (scaling horizontally) instead of upgrading a single machine.
Why?
The rise of social networks, e-commerce platforms, and IoT devices has created a flood of data. A single machine cannot store or process petabytes of information efficiently. Distributed systems enable:
- Scalability: handle growth by adding more nodes
- Fault tolerance: if one machine fails, others continue to function
- Performance: tasks can be parallelized, significantly reducing computation time
When?
Distributed systems are used when:
- Data volume is too large for a single machine
- Applications require high availability
- Real-time or large-scale analytics is necessary
How?
They function by splitting data into smaller parts, replicating them across nodes, and using distributed computing frameworks like Hadoop or Spark to process them. Data locality is often prioritized – computation is moved to where the data resides, not the other way around.
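The split-and-process idea above can be sketched in a few lines of plain Python. This is a toy illustration of "move compute to the data", not any real framework's API: chunks stand in for HDFS blocks, and only small partial results cross the imaginary network.

```python
# Toy sketch of data splitting + data locality (illustrative, not Hadoop's API).

def split_into_chunks(records, chunk_size):
    """Split a dataset into fixed-size chunks, like HDFS splits files into blocks."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]

def run_with_data_locality(records, nodes, chunk_size, local_compute):
    chunks = split_into_chunks(records, chunk_size)
    # Assign each chunk to a node round-robin (a stand-in for HDFS placement).
    placement = {i: nodes[i % len(nodes)] for i in range(len(chunks))}
    # Each "node" computes over its own local chunk; only the small
    # partial results travel over the network, not the raw data.
    partials = [local_compute(chunks[i]) for i in placement]
    return sum(partials)

total = run_with_data_locality(list(range(100)), ["node1", "node2", "node3"],
                               chunk_size=25, local_compute=sum)
print(total)  # 4950, the sum of 0..99
```

The key property: each partial result is tiny compared to the chunk it summarizes, which is exactly why shipping computation beats shipping data.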
Centralized vs Distributed Systems
Centralized System
- Example: Hosting a Bitcoin trading service on a single home computer.
- Scaling method: Vertical Scaling (add CPU, RAM, etc.).
- Limitations:
- Physical limits to hardware upgrades.
- Single point of failure (power outage, crashes).
- High latency for global users.
Distributed System
- Scales horizontally (add more machines).
- Can support billions of users.
- Improves reliability (no single point of failure).
- Reduces latency by distributing workload across regions.
Key Characteristics
- Communication happens only through the network.
- Processes must maintain a shared state.
- If processes don’t know about each other → it’s just a collection of computers, not a distributed system.
- Goal: Design algorithms so distributed processes can collaborate effectively.
Hadoop Ecosystem
The Hadoop ecosystem is an open-source project, originally developed at Yahoo!, that addresses the challenges of storing, managing, and processing big data on commodity hardware. At its core, Hadoop addresses three fundamental dimensions of big data:
- Volume: the sheer amount of data to store
- Velocity: how fast data arrives and must be processed at scale
- Variety: handling different types of data (structured, semi-structured, and unstructured)
Four Pillars of Hadoop
(The fourth pillar, Hadoop Common, is the set of shared utility libraries the other components build on; the three main ones are covered below.)
HDFS (Hadoop Distributed File System)
- Distributed, scalable file system that stores large files by breaking them into blocks
- Optimized for throughput rather than low latency
- Provides replication for fault tolerance
MapReduce
Programming paradigm for processing large datasets in parallel.
- Map: Break the input into key-value pairs and process them independently
- Reduce: Aggregate the results of the map step
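The map/reduce steps above can be simulated in memory with the classic word-count example. This is a minimal sketch of the programming model, not the Hadoop API: the shuffle step, which the framework normally performs between the two phases, is written out explicitly.

```python
# Minimal in-memory sketch of the MapReduce model (not the Hadoop API).
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) key-value pair for every word, independently per record.
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # Shuffle: group all emitted values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the grouped values for each key.
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data big cluster", "big data"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
print(counts)  # {'big': 3, 'data': 2, 'cluster': 1}
```

Because each map call and each reduce call only sees its own slice of data, the framework is free to run them in parallel on different nodes.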
YARN (Yet Another Resource Negotiator)
- Cluster resource manager introduced in Hadoop 2.0
- Separates resources management from data processing
- Allows multiple processing frameworks to run on Hadoop simultaneously
Higher-Level Ecosystem Tools
Hadoop grew into an entire ecosystem of tools. Some important ones include:
- Hive: SQL-like query engine for big data (translates queries into MapReduce or Spark)
- Pig: scripting language for data transformations
- HBase: column-oriented NoSQL database for real-time access
- Oozie: Workflow scheduler for Hadoop jobs
- Zookeeper: coordination service for distributed systems
How the Ecosystem Works Together
- Raw data ingestion -> stored in HDFS
- Batch processing -> run a MapReduce or Spark job to compute features
- Querying results -> analysts use Hive to run SQL-like queries for insights
- Serving features -> processed data stored in HBase to be accessed in near real-time by the recommendation engine
- Orchestration -> pipelines scheduled with Oozie and coordinated with ZooKeeper
Strengths of Hadoop ecosystem
- Scalability: can scale to thousands of nodes
- Flexibility: handles all data types
- Cost efficiency: runs on commodity hardware
- Ecosystem integration: tools cover the full data lifecycle
HDFS: Throughput vs. Low Latency Systems
HDFS Characteristics
- Optimized for throughput → handles huge files and large-scale batch jobs efficiently.
- Not designed for low latency → small, random reads/writes are slow.
- Ideal for Big Data analytics where processing large datasets matters more than instant response.
Comparison: HDFS vs. Low Latency Systems
| Feature | HDFS (Hadoop Distributed File System) | Low-Latency Systems (e.g., HBase, Cassandra) |
|---|---|---|
| Primary Goal | High throughput for big data storage & processing | Low latency for fast query/transaction response |
| Data Access Pattern | Large, sequential reads/writes (streaming access) | Random, small reads/writes (real-time access) |
| Block Size | Large blocks (128MB, 256MB, etc.) | Small record-level access |
| Best For | Batch processing, analytics, ETL jobs | Real-time applications, OLTP workloads |
| Write Behavior | Replicates blocks to multiple nodes → higher latency but reliable | Writes are quick, designed for fast availability |
| Typical Use Case | Storing logs, images, videos, large datasets; powering Spark/MapReduce | Serving user profiles, session data, messaging apps |
| Trade-off | Reliability & throughput over response time | Response time over bulk throughput |
Summary
- HDFS = Best when you need to store and process petabytes of data with reliability.
- Low-latency systems = Best when you need to serve queries or updates in milliseconds.
Deep dive into HDFS
HDFS (Hadoop Distributed File System) is the backbone of the Hadoop ecosystem. It is designed to store extremely large files across a cluster of inexpensive machines.
Core principles of HDFS
- Write Once, Read Many Times (WORM)
- Data is generally written once and read many times.
- Optimized for analytics workloads rather than frequent updates
- Large Block Size
- Instead of 4KB blocks, HDFS uses 128MB or 256MB blocks
- This reduces metadata overhead and allows efficient sequential reads
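A quick back-of-the-envelope calculation shows why block size matters for metadata: the NameNode must track every block of every file. The figures below (a 1 TiB file; per-block metadata cost left implicit) are rough illustrations, not measured Hadoop numbers.

```python
# Why large blocks shrink NameNode metadata: count the blocks for a 1 TiB file.

def num_blocks(file_size, block_size):
    return -(-file_size // block_size)  # ceiling division

TiB = 1024 ** 4
small = num_blocks(1 * TiB, 4 * 1024)           # 4 KiB blocks (typical local FS)
large = num_blocks(1 * TiB, 128 * 1024 * 1024)  # 128 MiB blocks (HDFS default)

print(small)  # 268435456 blocks to track
print(large)  # 8192 blocks to track
```

That is roughly a 32,000x reduction in the number of block entries the NameNode must hold in memory, and it also means readers spend far more time streaming within a block than seeking between blocks.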
- Data Locality
- HDFS tries to run computations on the nodes where data resides
- This avoids network bottlenecks by moving compute to the data, not data to the compute
File Write Workflow in HDFS
Let's say a file is uploaded:
- Client contacts the NameNode -> the NameNode breaks the file into blocks
- NameNode decides which DataNodes will store each block
- Client writes block replicas directly to the assigned DataNodes
- Replication factor = 3 (by default)
  - Block A is stored on Node1, Node2, and Node3
  - Block B is stored on Node3, Node4, and Node5
- NameNode updates its metadata
Result: the file is distributed + fault tolerant
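The placement step in the write workflow can be sketched as follows. This is a toy stand-in for the NameNode's placement decision (real HDFS also considers racks and node load); the round-robin scheme and node names are invented for the example.

```python
# Toy NameNode placement: choose 3 DataNodes for each block's replicas.
import itertools

REPLICATION_FACTOR = 3

def place_blocks(blocks, datanodes, replication=REPLICATION_FACTOR):
    """Spread each block's replicas across the cluster round-robin."""
    placements = {}
    ring = itertools.cycle(datanodes)
    for block in blocks:
        placements[block] = [next(ring) for _ in range(replication)]
    return placements

nodes = ["node1", "node2", "node3", "node4", "node5"]
plan = place_blocks(["blockA", "blockB"], nodes)
print(plan)  # blockA -> node1, node2, node3; blockB -> node4, node5, node1
```

Once the plan exists, the client streams each block directly to its assigned DataNodes; the NameNode only ever handles metadata, never the data itself.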
File Read Workflow in HDFS
Let's say a file is being read:
- Client contacts the NameNode to request block locations
- NameNode returns a list of DataNodes for each block
- Client fetches blocks directly from the DataNodes
- Blocks are combined into the full file by the client
Efficiency trick: if a block replica exists on the same machine as the client, HDFS chooses that one first -> reduces network transfer
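The locality trick in the read workflow amounts to a simple preference rule over replicas. The sketch below is illustrative (real HDFS also ranks same-rack replicas ahead of remote ones); block names and the tie-break rule are assumptions for the demo.

```python
# Read-side locality: among a block's replicas, prefer the client's own machine.

def pick_replica(replica_nodes, client_node):
    # Local replica first; otherwise fall back to any remote one (first listed here).
    return client_node if client_node in replica_nodes else replica_nodes[0]

block_locations = {"blockA": ["node1", "node2", "node3"],
                   "blockB": ["node3", "node4", "node5"]}
reads = {block: pick_replica(nodes, client_node="node3")
         for block, nodes in block_locations.items()}
print(reads)  # both blocks are read from node3, the client's own machine
```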
HDFS Architecture Components
Master/worker architecture (traditionally called master-slave)
NameNode (master node)
- Stores metadata (file names, block locations, permissions)
- Does not store the actual data
- Critical single point of coordination -> if it fails, the whole cluster halts
- Modern setups use high availability with active + standby NameNodes
DataNodes (worker nodes)
- Store the actual blocks of data
- Handle read/write requests from clients
- Periodically send heartbeats to the NameNode to confirm availability
- If a DataNode fails, the system automatically re-replicates its blocks
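Heartbeat-based failure handling can be sketched in two steps: detect nodes that have gone silent, then restore the replication factor for any blocks they held. The timeout, timestamps, and node names below are invented for the demo; real HDFS uses longer, configurable intervals.

```python
# Toy heartbeat monitoring + re-replication (illustrative, not HDFS internals).

def find_dead_nodes(last_heartbeat, now, timeout):
    """A node is declared dead if it hasn't reported within the timeout."""
    return {node for node, t in last_heartbeat.items() if now - t > timeout}

def rereplicate(block_map, dead, live_nodes):
    """Return blocks that lost a replica, with a replacement node chosen."""
    fixes = {}
    for block, replicas in block_map.items():
        survivors = [n for n in replicas if n not in dead]
        if len(survivors) < len(replicas):
            candidates = [n for n in live_nodes if n not in survivors]
            fixes[block] = survivors + candidates[:len(replicas) - len(survivors)]
    return fixes

heartbeats = {"node1": 100, "node2": 100, "node3": 40}  # node3 went silent
dead = find_dead_nodes(heartbeats, now=100, timeout=30)
blocks = {"blockA": ["node1", "node2", "node3"]}
fixes = rereplicate(blocks, dead, live_nodes=["node1", "node2", "node4"])
print(fixes)  # blockA regains its third replica on node4
```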
Secondary NameNode
- Misleading name: it does not replace the NameNode on failure
- Its job is to periodically merge the NameNode's edit log with metadata snapshots (fsimage)
- Prevents the edit log from growing too large
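The checkpoint the Secondary NameNode performs boils down to: replay the pending edit log onto the last fsimage snapshot, producing a fresh snapshot and an empty log. The operation names and metadata fields below are invented for illustration; real fsimage/edit-log records are binary and far richer.

```python
# Toy checkpoint: fold the edit log into the fsimage snapshot.

def checkpoint(fsimage, edit_log):
    image = dict(fsimage)            # start from the last metadata snapshot
    for op, path, meta in edit_log:  # replay pending edits in order
        if op == "create":
            image[path] = meta
        elif op == "delete":
            image.pop(path, None)
    return image, []                 # new fsimage, truncated edit log

fsimage = {"/logs/a": {"blocks": 2}}
edits = [("create", "/logs/b", {"blocks": 1}), ("delete", "/logs/a", None)]
new_image, new_log = checkpoint(fsimage, edits)
print(new_image)  # {'/logs/b': {'blocks': 1}}
```

Without this periodic merge, a NameNode restart would have to replay an ever-growing edit log before the cluster could come back up.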
1. Data Partitioning
Objective: organize data into partitions to improve query performance and manageability.
Strategies:
- Partition by time: for time-series data, partitioning by date (e.g., year, month, day) allows efficient querying over specific time ranges.
- Partition by business logic: partitioning by fields like region, product_id, or customer_id can optimize queries that filter on these attributes.
Tools:
- Delta Lake: use Delta Lake's OPTIMIZE command to compact small files and improve read performance. ([delta.io][1])
- Parquet: implement partitioning manually by organizing data into directories corresponding to partition keys.
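Time-based partitioning usually shows up on disk as Hive-style `key=value` directories. The helper below sketches that layout; the bucket path and field names are made up for the example.

```python
# Sketch of time-based partition paths (Hive-style year=/month=/day= layout).
from datetime import date

def partition_path(base, event_date):
    """Build the directory a record lands in, given its event date."""
    return (f"{base}/year={event_date.year}"
            f"/month={event_date.month:02d}/day={event_date.day:02d}")

path = partition_path("s3://bucket/silver/events", date(2024, 3, 7))
print(path)  # s3://bucket/silver/events/year=2024/month=03/day=07
```

A query filtered to March 2024 then only needs to list and read `year=2024/month=03/*`, skipping every other directory entirely (partition pruning).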
2. Data Clustering
Objective: enhance query performance by physically co-locating related data.
Strategies:
- Z-Ordering: in Delta Lake, use Z-Ordering to cluster data based on frequently queried columns, reducing the amount of data scanned during queries. ([Denny Lee][2])
- Clustering columns: choose clustering columns that align with your most common query patterns to optimize data layout.
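The idea behind Z-Ordering can be shown with a tiny Morton-code sketch: interleaving the bits of two column values produces one sort key under which rows close in *both* columns end up close on disk. This is only the core bit trick; Delta Lake's actual implementation does considerably more.

```python
# Morton (Z-order) code: interleave the bits of two values into one sort key.

def morton_code(x, y, bits=16):
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)      # even bit positions come from x
        code |= ((y >> i) & 1) << (2 * i + 1)  # odd bit positions come from y
    return code

points = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 2)]
zsorted = sorted(points, key=lambda p: morton_code(*p))
print(zsorted)  # neighbours in (x, y) stay neighbours in the sorted layout
```

Sorting a table by such a key lets file-level min/max statistics prune data for filters on either column, which a plain single-column sort cannot do.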
3. Data Compression
Objective: reduce storage costs and improve I/O performance.
Strategies:
- Columnar compression: leverage Parquet's columnar storage format, which allows efficient compression and encoding.
- Compression codecs: use efficient compression codecs like Snappy or Zstd to balance compression ratio and decompression speed.
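The codec trade-off can be felt with Python's standard library alone. Here zlib stands in for a balanced codec and lzma for a slower, higher-ratio one (Snappy and Zstd themselves need third-party packages, so they are not shown); the CSV-like payload is invented, and its repetitiveness mimics why columnar data compresses so well.

```python
# Codec trade-off demo on repetitive, column-like data (stdlib codecs only).
import zlib
import lzma

payload = b"user_id,region,amount\n" + b"42,eu-west,19.99\n" * 10_000

z = zlib.compress(payload)   # fast, moderate ratio
x = lzma.compress(payload)   # slower, usually a higher ratio
print(len(payload), len(z), len(x))  # both shrink the payload dramatically
```

In a real lakehouse the same choice appears as a Parquet writer option (e.g., picking Snappy for hot data read constantly, Zstd for colder data where ratio matters more).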
4. Data Cleanup and Vacuuming
Objective: maintain optimal storage by removing obsolete data.
Strategies:
- Delta Lake VACUUM: regularly run the VACUUM command in Delta Lake to remove files that are no longer referenced, freeing up storage space. ([Xebia][3])
- Data retention policies: implement data retention policies to automatically delete outdated data, ensuring compliance and reducing storage costs.
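The selection rule behind such a cleanup can be sketched as: a file is deletable only if it is no longer referenced by the table *and* is older than the retention window. The 7-day window, file names, and timestamps below are illustrative, not Delta Lake's internals.

```python
# Toy retention sweep in the spirit of VACUUM (illustrative only).

RETENTION_DAYS = 7  # assumed retention window for the demo

def files_to_vacuum(all_files, referenced, now, retention_days=RETENTION_DAYS):
    """Return unreferenced files older than the retention cutoff."""
    cutoff = now - retention_days * 86_400  # retention window in seconds
    return sorted(f for f, mtime in all_files.items()
                  if f not in referenced and mtime < cutoff)

files = {"part-001.parquet": 0,        # old, orphaned by a rewrite
         "part-002.parquet": 950_000,  # recent
         "part-003.parquet": 0}        # old but still referenced
live = {"part-002.parquet", "part-003.parquet"}
doomed = files_to_vacuum(files, live, now=1_000_000)
print(doomed)  # ['part-001.parquet']
```

The retention window matters because readers (and time-travel queries) may still hold references to recently replaced files; deleting too eagerly can break them.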
5. Data Lifecycle Management
Objective: optimize storage costs by managing data based on its access patterns.
Strategies:
- Tiered storage: move infrequently accessed data to lower-cost storage tiers, such as Amazon S3 Glacier, while keeping frequently accessed data on faster storage. ([dataclassification.fortra.com][4])
- Archiving: archive older data that is rarely accessed but still valuable for historical analysis.
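A lifecycle policy like the one described often reduces to a rule mapping time-since-last-access to a tier. The thresholds, tier names, and dataset ages below are assumptions invented for the example, not any cloud provider's defaults.

```python
# Toy lifecycle policy: choose a storage tier from days since last access.

def choose_tier(days_since_access):
    if days_since_access <= 30:
        return "hot"      # fast, expensive storage for active data
    if days_since_access <= 180:
        return "cool"     # cheaper, slower storage
    return "archive"      # e.g., Glacier-class storage for rarely read data

datasets = {"events_2025": 3, "events_2024": 90, "events_2019": 400}
tiers = {name: choose_tier(age) for name, age in datasets.items()}
print(tiers)  # {'events_2025': 'hot', 'events_2024': 'cool', 'events_2019': 'archive'}
```

In practice such rules are usually expressed declaratively (e.g., as bucket lifecycle configuration) rather than in application code, but the decision logic is the same.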
By implementing these strategies, you can optimize the storage of a structured silver dataset, ensuring efficient data management and improved query performance.

[1]: https://delta.io/blog/delta-lake-vs-parquet-comparison/ "Delta Lake vs. Parquet Comparison"
[2]: https://dennyglee.com/2024/01/29/optimize-by-clustering-not-partitioning-data-with-delta-lake/ "Optimize by Clustering not Partitioning Data with Delta Lake"
[3]: https://xebia.com/blog/databricks-lakehouse-optimization-a-deep-dive-into-vacuum/ "Databricks Lakehouse Optimization: A Deep Dive Into ..."
[4]: https://dataclassification.fortra.com/blog/how-optimize-your-data-techniques-better-storage-and-analysis "How to Optimize Your Data: Techniques for Better Storage ..."
