Big Data 5. Modern Data Stack

As the cost of cloud storage and compute continues to fall and internet speeds rise, data engineers are rethinking how data is stored, processed, and analyzed. This shift has led to what we now call the modern data stack: a flexible, scalable, cloud-native approach that replaces traditional monolithic systems.

In this post, we’ll explore key architectural changes that enabled the modern data stack, from the transition from SMP to MPP, to the decoupling of storage and compute, and the rise of columnar databases, ELT workflows, and the tooling ecosystem that glues it all together.

1. From SMP to MPP

1.1 What Is SMP?

Symmetric Multiprocessing (SMP) is a legacy database architecture where multiple CPU cores share a single memory space, storage, and operating system.

Shared resources: Memory, I/O devices, and disk are accessed through a common bus.
Single-machine design: All computation happens on one machine.
Scaling: Limited to vertical scaling—adding more resources to a single system.
Performance: Great for OLTP-style workloads but restricted by hardware limits.

1.2 Enter MPP

Massively Parallel Processing (MPP) systems scale horizontally by distributing data and computation across many machines.

Independent nodes: A master node coordinates multiple compute nodes.
Parallelism: Data is sharded across nodes; each node works in parallel on its portion.
Scaling: Add more nodes to scale out.
Storage models:
- Shared-nothing (most common): Each node has local storage.
- Shared-storage: All nodes access a common (often cloud-based) storage layer.

Note: In cloud-native architectures, “shared storage” often refers to distributed object storage like S3, accessed independently by stateless compute nodes.

1.3 SMP vs. MPP Comparison

Feature	SMP	MPP
Architecture	Single machine, shared memory	Master + multiple independent nodes
Scaling	Vertical only	Horizontal (scale-out)
Storage	Shared across cores	Local per node or shared via cloud storage
Parallelism	Within one machine	Across multiple machines
Use Case Today	Legacy on-prem systems	Modern cloud warehouses

MPP Architecture Diagram

Massively Parallel Processing (MPP) Architecture

2. Decoupling Storage and Compute

Traditionally, storage and compute were tightly linked. Cloud platforms changed this with decoupled architecture:

Storage: Durable, cheap, and separate (e.g., Amazon S3, GCS, Azure Blob).
Compute: Stateless and elastic (e.g., BigQuery slots, Snowflake virtual warehouses).

Benefits

Scale compute and storage independently.
Pay only for compute when in use.
Persistent storage allows compute clusters to spin down when idle.

3. Column-Oriented Databases

3.1 Row vs. Column Storage

Row-oriented databases are optimized for OLTP (e.g., inserting or updating entire records), while column-oriented databases are better for OLAP (analytical queries on large datasets).

Columnar storage allows:

Selective column reads (reducing I/O)
Efficient compression (similar data grouped together)
Vectorized computation

3.2 Comparison

Feature	Row-Oriented	Column-Oriented
Layout	Records stored together	Fields stored together
Use Case	OLTP (transactions)	OLAP (analytics, aggregations)
I/O Efficiency	Reads full rows	Reads only needed columns
Compression	Moderate	High
Examples	PostgreSQL, MySQL	Snowflake, Redshift, BigQuery

4. From ETL to ELT

The traditional ETL pattern:

Extract
Transform
Load

has largely been replaced by ELT:

Extract
Load raw data into the warehouse
Transform it inside the warehouse

This modern pattern simplifies pipelines and leverages cloud compute.

Tradeoff: While ELT increases flexibility and auditability, it also shifts responsibility for data quality downstream—often to analytics engineers.

5. Tools in the Modern Data Stack

Each layer is served by specialized tools that integrate via APIs, JDBC/ODBC, or event systems.

Layer	Tools	Role
Extract & Load	Fivetran, Airbyte, Stitch	Ingest data into warehouses from APIs, DBs, and flat files
Transform	DBT, Matillion, dataiku	Define SQL-based models, documentation, tests
Data Warehouse	Snowflake, BigQuery, Redshift	Scalable compute + storage for structured data
BI & Analytics	Looker, Tableau, Power BI, Metabase	Dashboards, visualization, exploration
Reverse-ETL	Census, Hightouch	Sync clean data from warehouse to CRMs, ads, tools

JDBC and ODBC are protocol standards used to connect external applications to databases.

JDBC: Java-specific

ODBC: Language-agnostic (used by Excel, Python, R, etc.)

Modern data warehouses expose these interfaces so apps can access/query data securely.

7. Metadata & Orchestration

7.1 Orchestration Tools

Managing dependencies, schedules, and retries across the data pipeline is critical. Popular orchestration tools:

Apache Airflow: DAG-based Python workflows
Dagster: Strong typing + observability
Prefect: Pythonic, cloud-native orchestration with minimal config

These tools manage the “when and how” of pipeline execution.

7.2 Metadata & Governance

Understanding what data exists, where it comes from, and who uses it is essential for modern data teams.

Popular metadata tools:

Tool	Role
DataHub	Metadata discovery & lineage
Amundsen	Data catalog with search and context
Atlan	Collaborative metadata, governance, access

These tools help with lineage, discoverability, data ownership, and compliance.

The modern data stack is:

Modular: Specialized tools connected via APIs
Scalable: Independent scaling of compute/storage
Efficient: Columnar, distributed, and cloud-optimized

Key enablers include:

MPP architectures
Decoupled storage/compute
Columnar formats
ELT processing
Orchestration and metadata tools

1. From SMP to MPP#

1.1 What Is SMP?#

1.2 Enter MPP#

1.3 SMP vs. MPP Comparison#

MPP Architecture Diagram#

2. Decoupling Storage and Compute#

Benefits#

3. Column-Oriented Databases#

3.1 Row vs. Column Storage#

3.2 Comparison#

4. From ETL to ELT#

5. Tools in the Modern Data Stack#

7. Metadata & Orchestration#

7.1 Orchestration Tools#

7.2 Metadata & Governance#