
Distributed_Systems_Architecture

#distributed-systems #architecture #storage

Introduction

Distributed data systems are infrastructures in which data is stored and processed across multiple nodes working together. These nodes share the workload, and when the system needs to scale, more machines can be added instead of upgrading a single machine (scaling horizontally rather than vertically).

Why?

The rise of social networks, e-commerce platforms, and IoT devices has created a flood of data. A single machine cannot store or process petabytes of information efficiently. Distributed systems enable:

  • Scalability: handle growth by adding more nodes
  • Fault tolerance: if one machine fails, others continue to function
  • Performance: tasks can be parallelized, significantly reducing computation time

When?

Distributed systems are used when:

  • Data volume is too large for a single machine
  • Applications require high availability
  • Real-time or large-scale analytics is necessary

How?

They function by splitting data into smaller parts, replicating those parts across nodes, and using distributed computing frameworks such as Hadoop or Spark to process them. Data locality is often prioritized: computation is moved to where the data resides, not the other way around.

Centralized vs Distributed Systems

Centralized System

  • Example: Hosting a Bitcoin trading service on a single home computer.
  • Scaling method: Vertical Scaling (add CPU, RAM, etc.).
  • Limitations:
    • Physical limits to hardware upgrades.
    • Single point of failure (power outage, crashes).
    • High latency for global users.

Distributed System

  • Scales horizontally (add more machines).
  • Can support billions of users.
  • Improves reliability (no single point of failure).
  • Reduces latency by distributing workload across regions.

Key Characteristics

  • Communication happens only through the network.
  • Processes must maintain a shared state.
  • If processes don’t know about each other → it’s just a collection of computers, not a distributed system.
  • Goal: Design algorithms so distributed processes can collaborate effectively.

Hadoop Ecosystem

The Hadoop ecosystem is an open-source project, originally developed at Yahoo!, that addresses the challenges of storing, managing, and processing big data on commodity hardware. At its core, Hadoop addresses three fundamental dimensions of big data:

  • Volume: the sheer size of the data
  • Velocity: the speed at which data arrives and must be processed at scale
  • Variety: handling different types of data (structured, semi-structured, and unstructured)

Four Pillars of Hadoop

HDFS (Hadoop Distributed File System)

  • Distributed, scalable file system that stores large files by breaking them into blocks
  • Optimized for throughput rather than low latency
  • Provides replication for fault tolerance

MapReduce

Programming paradigm for processing large datasets in parallel.

  • Map: Break the input into key-value pairs and process them independently
  • Reduce: Aggregate the results of the map step
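
As an illustration of the paradigm (not the actual Hadoop Java API), a minimal word-count example in plain Python shows the two phases:

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit (word, 1) key-value pairs for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Reduce: aggregate the counts for each key (word)."""
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

lines = ["the quick brown fox", "the lazy dog"]
print(reduce_phase(map_phase(lines)))  # {'the': 2, 'quick': 1, ...}
```

In a real cluster the map tasks run in parallel on the nodes holding the input blocks, and a shuffle phase groups the pairs by key before the reducers run.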

YARN (Yet Another Resource Negotiator)

  • Cluster resource manager introduced in Hadoop 2.0
  • Separates resource management from data processing
  • Allows multiple processing frameworks to run on Hadoop simultaneously

Hadoop Common

  • Shared libraries and utilities that support the other three modules

Higher-Level Ecosystem Tools

Hadoop grew into an entire ecosystem of tools. Some important ones include:

  • Hive: SQL-like query engine for big data (translates queries into MapReduce or Spark)
  • Pig: Scripting language for data transformations
  • HBase: Column-oriented NoSQL database for real-time access
  • Oozie: Workflow scheduler for Hadoop jobs
  • ZooKeeper: Coordination service for distributed systems

How the Ecosystem Works Together

  1. Raw data ingestion -> stored in HDFS
  2. Batch processing -> run a MapReduce or Spark job to compute features
  3. Querying results -> analysts use Hive to run SQL-like queries for insights
  4. Serving features -> processed data is stored in HBase so the recommendation engine can access it in near real-time
  5. Orchestration -> pipelines scheduled with Oozie and coordinated with ZooKeeper
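
A hedged sketch of the batch-processing and querying steps with PySpark follows; the paths, schema, and table name are hypothetical, and writing to a Hive-compatible table assumes Hive support is enabled on the cluster.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("feature-batch")
         .enableHiveSupport()
         .getOrCreate())

# Steps 1-2: read raw events already ingested into HDFS (hypothetical path).
events = spark.read.json("hdfs:///data/raw/clickstream/2024-01-01/")

# Compute a simple per-user feature: number of clicks per product.
features = (events
            .groupBy("user_id", "product_id")
            .agg(F.count("*").alias("click_count")))

# Step 3: expose the results to analysts as a Hive-queryable table.
features.write.mode("overwrite").saveAsTable("analytics.user_product_clicks")

# Step 4: serving into HBase would typically go through a dedicated
# connector or bulk-load job, which is outside this sketch.
```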

Strengths of Hadoop ecosystem

  • Scalability: can scale to thousands of nodes
  • Flexibility: handles all data types
  • Cost efficiency: runs on commodity hardware
  • Ecosystem integration: tools cover the full data lifecycle

HDFS: Throughput vs. Low Latency Systems

HDFS Characteristics

  • Optimized for throughput → handles huge files and large-scale batch jobs efficiently.
  • Not designed for low latency → small, random reads/writes are slow.
  • Ideal for Big Data analytics where processing large datasets matters more than instant response.

Comparison: HDFS vs. Low Latency Systems

| Feature | HDFS (Hadoop Distributed File System) | Low-Latency Systems (e.g., HBase, Cassandra) |
| --- | --- | --- |
| Primary Goal | High throughput for big data storage & processing | Low latency for fast query/transaction response |
| Data Access Pattern | Large, sequential reads/writes (streaming access) | Random, small reads/writes (real-time access) |
| Block Size | Large blocks (128 MB, 256 MB, etc.) | Small record-level access |
| Best For | Batch processing, analytics, ETL jobs | Real-time applications, OLTP workloads |
| Write Behavior | Replicates blocks to multiple nodes → higher latency but reliable | Writes are quick, designed for fast availability |
| Typical Use Case | Storing logs, images, videos, large datasets; powering Spark/MapReduce | Serving user profiles, session data, messaging apps |
| Trade-off | Reliability & throughput over response time | Response time over bulk throughput |

Summary

  • HDFS = Best when you need to store and process petabytes of data with reliability.
  • Low-latency systems = Best when you need to serve queries or updates in milliseconds.

Deep dive into HDFS

HDFS (Hadoop Distributed File System) is the backbone of the Hadoop ecosystem. It is designed to store extremely large files across a cluster of inexpensive machines.

Core principles of HDFS

  1. Write Once, Read Many Times (WORM)
    • Data is generally written once and read many times.
    • Optimized for analytics workloads rather than frequent updates.
  2. Large Block Size
    • Instead of 4 KB blocks, HDFS uses 128 MB or 256 MB blocks.
    • This reduces metadata overhead and allows efficient sequential reads (see the quick calculation below).
  3. Data Locality
    • HDFS tries to run computations on the nodes where the data resides.
    • This avoids network bottlenecks by moving compute to data, not data to compute.
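
A rough back-of-the-envelope calculation shows why block size matters: the NameNode must keep an in-memory entry for every block, commonly estimated at around 150 bytes per block.

```python
file_size = 1 * 1024**4          # a 1 TiB file, in bytes

small_block = 4 * 1024           # 4 KiB, a typical local filesystem block
hdfs_block  = 128 * 1024**2      # 128 MiB, the HDFS default

blocks_small = file_size // small_block   # 268,435,456 blocks
blocks_hdfs  = file_size // hdfs_block    # 8,192 blocks

print(blocks_small, blocks_hdfs)
# At roughly 150 bytes of NameNode metadata per block, that is on the order
# of 40 GB of metadata for 4 KiB blocks vs. about 1.2 MB for 128 MiB blocks.
```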

File Write Workflow in HDFS

Suppose a file is uploaded:

  1. The client contacts the NameNode -> the NameNode breaks the file into blocks
  2. The NameNode decides which DataNodes will store each block
  3. The client writes block replicas directly to the assigned DataNodes
    • Replication factor = 3 (by default)
    • Block A is stored on Node1, Node2, and Node3
    • Block B is stored on Node3, Node4, and Node5
  4. The NameNode updates its metadata

Result: the file is distributed + fault tolerant. A minimal client-side sketch follows.
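
From the client's point of view all of this is hidden behind a simple call. A sketch using the third-party `hdfs` Python package (HdfsCLI) against a WebHDFS endpoint; the hostname, user, and paths are assumptions for illustration:

```python
from hdfs import InsecureClient

# Hypothetical WebHDFS endpoint; adjust host/port/user for your cluster.
client = InsecureClient("http://namenode.example.com:9870", user="analyst")

# Upload a local file. Behind the scenes the NameNode picks the DataNodes,
# the file is split into blocks, and each block is replicated (3x by default).
client.upload("/data/raw/events.csv", "events.csv", overwrite=True)
```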

File Read Workflow in HDFS

Suppose a file is being read:

  1. The client contacts the NameNode to request block locations
  2. The NameNode returns a list of DataNodes for each block
  3. The client fetches blocks directly from the DataNodes
  4. The client combines the blocks back into the full file

Efficiency trick: if a block replica exists on the same machine as the client, HDFS chooses that one first -> reduces network transfer. A matching read sketch follows.
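
Again, the client library hides the block bookkeeping. Continuing the hypothetical setup from the write sketch above:

```python
from hdfs import InsecureClient

client = InsecureClient("http://namenode.example.com:9870", user="analyst")

# Stream the file back; the client reads each block from a DataNode
# (preferring local or nearby replicas) and reassembles them in order.
with client.read("/data/raw/events.csv", encoding="utf-8") as reader:
    content = reader.read()

print(content[:200])  # first 200 characters of the reassembled file
```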

HDFS Architecture Components

HDFS follows a master-slave architecture (increasingly referred to as master/worker).

NameNode (master node)

  • Stores metadata (file names, directory structure, block locations)
  • Does not store actual data
  • Critical single point of coordination -> if it fails, the whole cluster halts
  • Modern setups use high availability with active + standby NameNodes

DataNodes (slave nodes)

  • Store the actual blocks of data
  • Handle read/write requests from clients
  • Periodically send heartbeats to the NameNode to confirm availability
  • If a DataNode fails, the system automatically re-replicates its blocks

Secondary NameNode

  • Misleading name: it does not replace the NameNode on failure
  • Its job is to periodically merge the NameNode's edit log with the metadata snapshot (fsimage)
  • This prevents the edit log from growing too large

🧩 1. Data Partitioning

Objective: Organize data into partitions to improve query performance and manageability.

Strategies:

  • Partition by Time: For time-series data, partitioning by date (e.g., year, month, day) allows for efficient querying over specific time ranges.
  • Partition by Business Logic: Partitioning by fields like region, product_id, or customer_id can optimize queries that filter on these attributes.

Tools:

  • Delta Lake: Utilize Delta Lake's OPTIMIZE command to compact small files and improve read performance. ([delta.io][1])
  • Parquet: Implement partitioning manually by organizing data into directories corresponding to partition keys.
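
A hedged PySpark sketch of a partitioned write with Delta Lake; the lake paths and the year/month/day columns are hypothetical, and it assumes the delta-spark package is configured on the session:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("silver-partitioning").getOrCreate()

df = spark.read.format("delta").load("/lake/bronze/orders")

# Partition by date fields so queries over a time range only scan the
# directories for the matching partitions.
(df.write
   .format("delta")
   .mode("overwrite")
   .partitionBy("year", "month", "day")
   .save("/lake/silver/orders"))
```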

🗂️ 2. Data Clustering

Objective: Enhance query performance by physically co-locating related data.

Strategies:

  • Z-Ordering: In Delta Lake, use Z-Ordering to cluster data based on frequently queried columns, reducing the amount of data scanned during queries. ([Denny Lee][2])
  • Clustering Columns: Choose clustering columns that align with your most common query patterns to optimize data layout.
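
With Delta Lake (OSS 2.0+ or Databricks) this is typically a single SQL statement. Continuing the Spark session from the partitioning sketch; the table path and columns are placeholders:

```python
# Compact small files and Z-order by the columns most often used in filters.
spark.sql("""
    OPTIMIZE delta.`/lake/silver/orders`
    ZORDER BY (customer_id, order_date)
""")
```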

🧪 3. Data Compression

Objective: Reduce storage costs and improve I/O performance.

Strategies:

  • Columnar Compression: Leverage Parquet's columnar storage format, which allows for efficient compression and encoding.
  • Compression Codecs: Use efficient compression codecs like Snappy or Zstd to balance compression ratio and decompression speed.
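
Choosing the codec is usually a one-line option on the write (Snappy is Spark's default for Parquet; Zstd generally compresses better at a modest CPU cost). Paths are again hypothetical:

```python
df = spark.read.format("delta").load("/lake/silver/orders")

# Write Parquet compressed with Zstandard instead of the default Snappy.
(df.write
   .option("compression", "zstd")
   .parquet("/lake/silver/orders_parquet"))
```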

🧹 4. Data Cleanup and Vacuuming

Objective: Maintain optimal storage by removing obsolete data.

Strategies:

  • Delta Lake VACUUM: Regularly run the VACUUM command in Delta Lake to remove files that are no longer referenced, freeing up storage space. ([Xebia][3])
  • Data Retention Policies: Implement data retention policies to automatically delete outdated data, ensuring compliance and reducing storage costs.
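
A minimal sketch with the delta-spark Python API, continuing the same session (168 hours is Delta's default retention window; shortening it requires an explicit safety override):

```python
from delta.tables import DeltaTable

table = DeltaTable.forPath(spark, "/lake/silver/orders")

# Remove data files that are no longer referenced by the table's
# transaction log and are older than the retention window (in hours).
table.vacuum(retentionHours=168)
```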

🧠 5. Data Lifecycle Management

Objective: Optimize storage costs by managing data based on its access patterns.

Strategies:

  • Tiered Storage: Move infrequently accessed data to lower-cost storage tiers, such as Amazon S3 Glacier, while keeping frequently accessed data on faster storage. ([dataclassification.fortra.com][4])
  • Archiving: Archive older data that is rarely accessed but still valuable for historical analysis.
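
If the data lives in S3, tiering can be automated with a lifecycle rule. A hedged boto3 sketch; the bucket name, prefix, and day thresholds are placeholders:

```python
import boto3

s3 = boto3.client("s3")

# Transition objects under the silver prefix to Glacier after 90 days
# and expire them after roughly five years.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "archive-old-silver-data",
            "Filter": {"Prefix": "silver/orders/"},
            "Status": "Enabled",
            "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
            "Expiration": {"Days": 1825},
        }]
    },
)
```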

By implementing these strategies, you can optimize the storage of your structured silver dataset, ensuring efficient data management and improved query performance.

[1]: https://delta.io/blog/delta-lake-vs-parquet-comparison/?utm_source=chatgpt.com "Delta Lake vs. Parquet Comparison"
[2]: https://dennyglee.com/2024/01/29/optimize-by-clustering-not-partitioning-data-with-delta-lake/?utm_source=chatgpt.com "Optimize by Clustering not Partitioning Data with Delta Lake"
[3]: https://xebia.com/blog/databricks-lakehouse-optimization-a-deep-dive-into-vacuum/?utm_source=chatgpt.com "Databricks Lakehouse Optimization: A Deep Dive Into ..."
[4]: https://dataclassification.fortra.com/blog/how-optimize-your-data-techniques-better-storage-and-analysis?utm_source=chatgpt.com "How to Optimize Your Data: Techniques for Better Storage ..."
