As the cost of cloud storage and compute continues to fall and internet speeds rise, data engineers are rethinking how data is stored, processed, and analyzed. This shift has led to what we now call the modern data stack: a flexible, scalable, cloud-native approach that replaces traditional monolithic systems.

In this post, we’ll explore key architectural changes that enabled the modern data stack, from the transition from SMP to MPP, to the decoupling of storage and compute, and the rise of columnar databases, ELT workflows, and the tooling ecosystem that glues it all together.


1. From SMP to MPP

1.1 What Is SMP?

Symmetric Multiprocessing (SMP) is a legacy database architecture where multiple CPU cores share a single memory space, storage, and operating system.

  • Shared resources: Memory, I/O devices, and disk are accessed through a common bus.
  • Single-machine design: All computation happens on one machine.
  • Scaling: Limited to vertical scaling—adding more resources to a single system.
  • Performance: Great for OLTP-style workloads but restricted by hardware limits.

1.2 Enter MPP

Massively Parallel Processing (MPP) systems scale horizontally by distributing data and computation across many machines.

  • Independent nodes: A master node coordinates multiple compute nodes.
  • Parallelism: Data is sharded across nodes; each node works in parallel on its portion.
  • Scaling: Add more nodes to scale out.
  • Storage models:
    • Shared-nothing (most common): Each node has local storage.
    • Shared-storage: All nodes access a common (often cloud-based) storage layer.

Note: In cloud-native architectures, “shared storage” often refers to distributed object storage like S3, accessed independently by stateless compute nodes.

1.3 SMP vs. MPP Comparison

Feature SMP MPP
Architecture Single machine, shared memory Master + multiple independent nodes
Scaling Vertical only Horizontal (scale-out)
Storage Shared across cores Local per node or shared via cloud storage
Parallelism Within one machine Across multiple machines
Use Case Today Legacy on-prem systems Modern cloud warehouses

MPP Architecture Diagram

Massively Parallel Processing (MPP) Architecture

2. Decoupling Storage and Compute

Traditionally, storage and compute were tightly linked. Cloud platforms changed this with decoupled architecture:

  • Storage: Durable, cheap, and separate (e.g., Amazon S3, GCS, Azure Blob).
  • Compute: Stateless and elastic (e.g., BigQuery slots, Snowflake virtual warehouses).

Benefits

  • Scale compute and storage independently.
  • Pay only for compute when in use.
  • Persistent storage allows compute clusters to spin down when idle.

3. Column-Oriented Databases

3.1 Row vs. Column Storage

Row-oriented databases are optimized for OLTP (e.g., inserting or updating entire records), while column-oriented databases are better for OLAP (analytical queries on large datasets).

Columnar storage allows:

  • Selective column reads (reducing I/O)
  • Efficient compression (similar data grouped together)
  • Vectorized computation

3.2 Comparison

Feature Row-Oriented Column-Oriented
Layout Records stored together Fields stored together
Use Case OLTP (transactions) OLAP (analytics, aggregations)
I/O Efficiency Reads full rows Reads only needed columns
Compression Moderate High
Examples PostgreSQL, MySQL Snowflake, Redshift, BigQuery

4. From ETL to ELT

The traditional ETL pattern:

  1. Extract
  2. Transform
  3. Load

has largely been replaced by ELT:

  1. Extract
  2. Load raw data into the warehouse
  3. Transform it inside the warehouse

This modern pattern simplifies pipelines and leverages cloud compute.

Tradeoff: While ELT increases flexibility and auditability, it also shifts responsibility for data quality downstream—often to analytics engineers.


5. Tools in the Modern Data Stack

Each layer is served by specialized tools that integrate via APIs, JDBC/ODBC, or event systems.

Layer Tools Role
Extract & Load Fivetran, Airbyte, Stitch Ingest data into warehouses from APIs, DBs, and flat files
Transform DBT, Matillion, dataiku Define SQL-based models, documentation, tests
Data Warehouse Snowflake, BigQuery, Redshift Scalable compute + storage for structured data
BI & Analytics Looker, Tableau, Power BI, Metabase Dashboards, visualization, exploration
Reverse-ETL Census, Hightouch Sync clean data from warehouse to CRMs, ads, tools

JDBC and ODBC are protocol standards used to connect external applications to databases.

  • JDBC: Java-specific
  • ODBC: Language-agnostic (used by Excel, Python, R, etc.)

Modern data warehouses expose these interfaces so apps can access/query data securely.


7. Metadata & Orchestration

7.1 Orchestration Tools

Managing dependencies, schedules, and retries across the data pipeline is critical. Popular orchestration tools:

  • Apache Airflow: DAG-based Python workflows
  • Dagster: Strong typing + observability
  • Prefect: Pythonic, cloud-native orchestration with minimal config

These tools manage the “when and how” of pipeline execution.

7.2 Metadata & Governance

Understanding what data exists, where it comes from, and who uses it is essential for modern data teams.

Popular metadata tools:

Tool Role
DataHub Metadata discovery & lineage
Amundsen Data catalog with search and context
Atlan Collaborative metadata, governance, access
These tools help with lineage, discoverability, data ownership, and compliance.


The modern data stack is:

  • Modular: Specialized tools connected via APIs
  • Scalable: Independent scaling of compute/storage
  • Efficient: Columnar, distributed, and cloud-optimized

Key enablers include:

  • MPP architectures
  • Decoupled storage/compute
  • Columnar formats
  • ELT processing
  • Orchestration and metadata tools