Why Data Platforms Need SOLID Principles

KYFEX · 10 min read

The engineering patterns that distinguish scalable data organizations from technical debt graveyards

Modern data platforms face a unique architectural challenge. Unlike traditional software systems where bad code might crash a service or slow down a feature, poorly architected data systems create cascading failures that corrupt every downstream decision, model, and insight. The complexity compounds when these platforms process billions of events daily, power thousands of machine learning models, and support critical business decisions worth millions of dollars.

Data contracts and SLOs are the spine.

At scale, each dataset (table, topic, feature set) is a product with a contract (typed schema, semantic definition, and compatibility policy), SLOs (freshness, completeness, and correctness), an explicit owner, and runbooks. CI/CD blocks incompatible changes. At runtime, SLOs are enforced with data-quality expectations for batch jobs and allowed-lateness windows for streams. When contracts break or SLOs slip, the blast radius includes models, KPIs, and decisions. SOLID is how we contain that blast radius.
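To make the "CI/CD blocks incompatible changes" step concrete, here is a minimal sketch of a backward-compatibility check between two contract versions. The `DataContract` and `Column` types, the field names, and the compatibility rules are illustrative assumptions, not a specific library's API; real platforms usually delegate this to a schema registry or a table format's evolution rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str
    dtype: str
    nullable: bool = True


@dataclass(frozen=True)
class DataContract:
    """Hypothetical contract record: schema plus an owner and a freshness SLO."""
    dataset: str
    owner: str
    freshness_slo_minutes: int
    columns: tuple = ()


def breaking_changes(old: DataContract, new: DataContract) -> list:
    """Return the reasons `new` would break readers written against `old`."""
    problems = []
    new_cols = {c.name: c for c in new.columns}
    for col in old.columns:
        candidate = new_cols.get(col.name)
        if candidate is None:
            problems.append(f"column removed: {col.name}")
        elif candidate.dtype != col.dtype:
            problems.append(f"type changed: {col.name} {col.dtype} -> {candidate.dtype}")
        elif candidate.nullable and not col.nullable:
            problems.append(f"column became nullable: {col.name}")
    return problems  # added columns are treated as compatible in this sketch


v1 = DataContract("metrics.engagement_v1", "growth-data", 30, (
    Column("user_id", "string", nullable=False),
    Column("sessions", "int"),
))
v2 = DataContract("metrics.engagement_v1", "growth-data", 30, (
    Column("user_id", "string", nullable=False),
    Column("sessions", "long"),    # type change: flagged
    Column("country", "string"),   # new column: allowed
))

print(breaking_changes(v1, v2))  # ['type changed: sessions int -> long']
```

A CI job could fail the build whenever `breaking_changes` returns a non-empty list, forcing the author to publish a new contract version instead of mutating the existing one.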

The fundamental issue plaguing data organizations isn’t the lack of powerful tools or talented engineers. It’s that data teams have inherited software engineering’s scale problems without adopting its architectural solutions. While backend engineering teams have spent decades refining principles like SOLID, data teams often treat these as foreign concepts irrelevant to pipeline development. This disconnect manifests as unmaintainable codebases, fragile pipelines, and exhausted engineers debugging production issues at all hours.

The Architectural Debt Crisis in Data Systems

Consider a typical scenario in a mature data organization. An executive dashboard displays incorrect metrics. The investigation reveals a web of dependencies spanning seven transformation layers, three team boundaries, and a monolithic SQL script exceeding 10,000 lines that evolved organically for two years. The fix requires understanding not just the immediate bug but the archaeological history of business logic embedded in the code. Four hours later, the issue is patched, but everyone knows the fix is temporary.

This isn’t just technical debt; it’s architectural failure. The solution isn’t migrating from one orchestrator to another or adopting the latest compute engine. The solution lies in applying proven architectural principles adapted to data systems so correctness and operability are enforced by design, not heroics.

SOLID Principles Reimagined for Data Architecture

The SOLID principles provide a framework for maintainable, evolvable systems. In data platforms, think in terms of tables, DAG nodes, contracts, and SLOs, not just classes and methods.

SOLID → Data‑Platform Translation

This translation matches how large-scale organizations designed their stacks: interactive SQL at web scale (e.g., Dremel/BigQuery), unified batch/stream semantics and watermarks (e.g., Dataflow/Beam), and open table formats to decouple compute from storage (e.g., Apache Iceberg). (VLDB, Google Research, incubator.apache.org)

Single Responsibility Principle: Decomposing Data Monoliths

The Single Responsibility Principle (SRP) says a module should have one reason to change. In data, a node/table should own one business concept and one contract.

Anti‑pattern: a monolithic job doing everything
# Anti-pattern: Monolithic pipeline with multiple responsibilities
def process_user_data():
    # Extract from multiple sources
    kafka_data = extract_from_kafka()
    database_data = extract_from_database()
    api_data = extract_from_api()

    # Complex joining and transformation logic
    merged_data = complex_merge(kafka_data, database_data, api_data)

    # Business logic for multiple metrics
    engagement_metrics = calculate_engagement(merged_data)
    revenue_metrics = calculate_revenue(merged_data)
    retention_metrics = calculate_retention(merged_data)

    # Quality checks for all metrics
    validate_engagement(engagement_metrics)
    validate_revenue(revenue_metrics)
    validate_retention(retention_metrics)

    # Write to multiple destinations
    write_to_warehouse(engagement_metrics, revenue_metrics, retention_metrics)
    write_to_cache(engagement_metrics)
    write_to_api(revenue_metrics)

Better: small, testable components with one output and one contract
class UserEventExtractor:
    """Responsible solely for extracting and normalizing user events"""
    def extract(self, source: DataSource) -> DataFrame:
        return source.read().normalize(self.schema)

class EngagementCalculator:
    """Owns the business logic for engagement metrics"""
    def calculate(self, events: DataFrame) -> EngagementMetrics:
        return self._apply_engagement_rules(events)

class MetricValidator:
    """Responsible for data quality validation"""
    def validate(self, metrics: Metrics) -> ValidationResult:
        return self._run_quality_checks(metrics)

class WarehouseWriter:
    """Manages writes to the data warehouse"""
    def write(self, data: DataFrame, table: str) -> WriteResult:
        return self.warehouse_client.write(data, table)

DAG/table‑level SRP in an engine‑portable style

// Beam-style SRP: each PTransform owns one contract and one output.
// Illustrative only: ParseAvro, ToEngagementFacts, ValidateFacts, and AssertNoSevere
// stand in for project-specific transforms.
PCollection<UserEventV2> events =
    p.apply("ReadEvents", KafkaIO.<byte[], byte[]>read(...))
     .apply("ParseAvroV2", ParseAvro.of(UserEventV2.SCHEMA)); // Contract: event schema, owner, SLO

PCollection<EngagementFactsV1> facts =
    events.apply("ToEngagementFactsV1", new ToEngagementFacts()); // Contract: columns + invariants

PCollection<ValidationResult> dq =
    facts.apply("ValidateFacts", new ValidateFacts()
        .requireNonNull("user_id")
        .boundRate("null_title", 0.0, 0.01)
        .expectFreshness(Duration.standardMinutes(30)));

dq.apply("GatePromotion", new AssertNoSevere()); // Fail fast; do not publish bad data

facts.apply("WriteIceberg", IcebergIO.write("metrics.engagement_v1"));

This shows SRP at the node boundary with contracts and promotion gates. The same logical pipeline can run on different engines (Dataflow/Flink/Spark) without changing business semantics. (Google Research)

Open/Closed Principle: Building Extensible Data Platforms

Open for extension, closed for modification. New sources, metrics, or sinks should integrate via registration and tests, not edits to core pipeline code.

Anti‑pattern: conditionals that grow with every new source
class DataPipeline:
    def process(self, source_type: str, config: dict):
        if source_type == "postgresql":
            connection = psycopg2.connect(**config)
            data = pd.read_sql(config["query"], connection)
        elif source_type == "mongodb":
            client = MongoClient(**config)
            data = pd.DataFrame(client[config["db"]][config["collection"]].find())
        elif source_type == "s3":
            data = pd.read_parquet(config["path"])
        elif source_type == "api":
            response = requests.get(config["url"], headers=config["headers"])
            data = pd.DataFrame(response.json())
        return self.transform(data)

Better: registry + promotion gates
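# Registry + promotion gates (tests) rather than conditionals.
# DataConnector is the connector interface (not shown): read() and close().
import pandas as pd
import psycopg2

class ConnectorRegistry:
    _registry: dict[str, type["DataConnector"]] = {}

    @classmethod
    def register(cls, name: str, impl: type["DataConnector"]) -> None:
        cls._registry[name] = impl

    @classmethod
    def create(cls, name: str, cfg: dict) -> "DataConnector":
        # Look up the class first so a KeyError raised inside the constructor
        # (e.g., a missing cfg key) is not misreported as an unknown connector.
        try:
            impl = cls._registry[name]
        except KeyError:
            raise ValueError(f"Unknown connector '{name}'") from None
        conn = impl(cfg)
        _run_contract_tests(conn)  # schema sample, rowcount bounds, perf smoke tests
        return conn

class PostgreSQLConnector:
    def __init__(self, cfg: dict):
        self._conn = psycopg2.connect(**cfg["connection"])
        self._query = cfg["query"]

    def read(self) -> pd.DataFrame:
        return pd.read_sql(self._query, self._conn)

    def close(self) -> None:
        self._conn.close()

# Adding a source is a registration, not an edit to the pipeline.
ConnectorRegistry.register("postgresql", PostgreSQLConnector)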

New integrations are added via registration and must pass contract tests before joining production DAGs.
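The registry calls `_run_contract_tests` without showing what such a gate checks. The sketch below is one possible shape, assuming the connector's sample has already been converted to a list of row dictionaries. The function name, the `ContractViolation` exception, and the bounds are hypothetical and stand in for whatever the platform's test harness provides.

```python
class ContractViolation(Exception):
    """Raised when a connector sample fails its contract checks."""


def run_contract_tests(rows, required, min_rows=1, max_rows=1_000_000):
    """Check row-count bounds, then column presence and type on every sampled row.

    rows:     list of dicts, one per sampled record
    required: mapping of column name -> expected Python type
    """
    if not (min_rows <= len(rows) <= max_rows):
        raise ContractViolation(
            f"row count {len(rows)} outside [{min_rows}, {max_rows}]")
    for i, row in enumerate(rows):
        for col, expected in required.items():
            if col not in row:
                raise ContractViolation(f"row {i}: missing column '{col}'")
            if not isinstance(row[col], expected):
                raise ContractViolation(
                    f"row {i}: column '{col}' is {type(row[col]).__name__}, "
                    f"expected {expected.__name__}")


sample = [
    {"user_id": "u1", "amount": 12.5},
    {"user_id": "u2", "amount": 3.0},
]
run_contract_tests(sample, {"user_id": str, "amount": float})  # passes silently

try:
    run_contract_tests([{"user_id": 42, "amount": 1.0}], {"user_id": str, "amount": float})
except ContractViolation as exc:
    print(exc)  # row 0: column 'user_id' is int, expected str
```

Raising instead of returning a report keeps the gate fail-fast: a connector that violates its contract never gets handed to a production DAG.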

Liskov Substitution Principle: Interchangeable Components at Scale

LSP ensures you can swap implementations without breaking contracts, which is critical for model rollouts, engine swaps, and storage abstractions.

from abc import ABC, abstractmethod

class ModelInterface(ABC):
    """Base interface ensuring all models are substitutable"""
    @abstractmethod
    def preprocess(self, features: DataFrame) -> DataFrame: ...
    @abstractmethod
    def predict(self, features: DataFrame) -> DataFrame: ...
    @abstractmethod
    def get_model_metadata(self) -> ModelMetadata: ...

class XGBoostModel(ModelInterface):
    def preprocess(self, features: DataFrame) -> DataFrame:
        return features.fillna(0).clip(lower=0)
    def predict(self, features: DataFrame) -> DataFrame:
        processed = self.preprocess(features)
        scores = self.model.predict(processed)
        return DataFrame({"score": scores, "model_version": self.version})
    def get_model_metadata(self) -> ModelMetadata: ...

class TransformerModel(ModelInterface):
    def preprocess(self, features: DataFrame) -> DataFrame:
        return self.tokenizer.encode(features)  # ensure contract-compatible schema
    def predict(self, features: DataFrame) -> DataFrame: ...
    def get_model_metadata(self) -> ModelMetadata: ...

class ModelServing:
    def __init__(self, model: ModelInterface):
        self.model = model
        self.metadata = model.get_model_metadata()
    def serve_prediction(self, request: PredictionRequest) -> PredictionResponse:
        features = self.extract_features(request)
        predictions = self.model.predict(features)
        return self.format_response(predictions)

Behavioral invariants make substitution safe: enforce a response envelope (score, model_version, optional explanations), latency SLOs, and any domain invariants (e.g., monotonicity). Validate with shadow traffic and diff tests before shifting live traffic:

def substitution_test(baseline: ModelInterface, candidate: ModelInterface, sample: DataFrame):
    a = baseline.predict(sample)["score"]
    b = candidate.predict(sample)["score"]
    assert a.notna().all() and b.notna().all()
    # Example invariant: the candidate must not raise the positive-prediction rate
    # by more than 1 percentage point (a label-free proxy for false-positive drift).
    lift = (b > 0.5).mean() - (a > 0.5).mean()
    assert lift <= 0.01, f"Positive-rate lift too high: {lift:.3%}"

Interface Segregation: Focused Interfaces for Diverse Consumers

Don’t force all consumers through one bloated API. Build persona‑specific surfaces with explicit latency and cost SLOs.

class DataScientistInterface:
    """For model training"""
    def get_training_data(self, feature_set: str, time_range: TimeRange) -> TrainingDataset:
        features = self.feature_store.get_features(feature_set, time_range)
        labels = self.label_store.get_labels(time_range)
        return TrainingDataset(features, labels, self._get_statistics(features))

class ProductAnalystInterface:
    """For business analysis"""
    def get_business_metrics(self, metric_names: list[str], dimensions: list[str], time_range: TimeRange):
        return self.metric_store.query(metrics=metric_names, group_by=dimensions,
                                       time_range=time_range, include_comparisons=True)

class ExecutiveInterface:
    """For executive KPIs"""
    def get_kpis(self, date: Date | None = None) -> KPISnapshot:
        d = date or Date.today()
        return KPISnapshot(daily=self._daily_kpis(d), trends=self._trends(d), alerts=self._alerts(d))

Each surface can tune caching, sampling, and materializations to its SLOs without burdening others.
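On the consumer side, segregation can be expressed by typing each caller against the narrowest interface it needs. The sketch below uses `typing.Protocol` for that. The protocol names, the toy `InMemoryPlatform` backend, and its sample data are assumptions made for illustration only.

```python
from typing import Protocol


class KpiReader(Protocol):
    def get_kpis(self, day: str) -> dict: ...


class TrainingDataReader(Protocol):
    def get_training_rows(self, feature_set: str) -> list: ...


class InMemoryPlatform:
    """Toy backend that happens to satisfy both protocols structurally."""

    def __init__(self) -> None:
        self._kpis = {"2024-01-01": {"dau": 1200.0, "revenue": 5300.5}}
        self._features = {"churn_v1": [{"user_id": "u1", "sessions": 4}]}

    def get_kpis(self, day: str) -> dict:
        return self._kpis.get(day, {})

    def get_training_rows(self, feature_set: str) -> list:
        return list(self._features.get(feature_set, []))


def render_exec_summary(reader: KpiReader, day: str) -> str:
    """Depends only on KpiReader, so changes to the training surface cannot break it."""
    kpis = reader.get_kpis(day)
    return ", ".join(f"{name}={value:g}" for name, value in sorted(kpis.items()))


print(render_exec_summary(InMemoryPlatform(), "2024-01-01"))  # dau=1200, revenue=5300.5
```

Because the executive path only knows about `KpiReader`, the training surface can change its signatures, caching, or backing store without requiring a redeploy of the KPI consumer.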

Dependency Inversion: Abstracting Infrastructure Complexity

High‑level logic should depend on abstractions (tables/features/metrics), not concrete engines or storage.

Anti‑pattern: business logic written directly against Spark API calls and tied to a specific warehouse table.
Better: invert both compute and table format dependencies.
from abc import ABC, abstractmethod

class ComputeEngine(ABC):
    @abstractmethod
    def read(self, source: "Table") -> "Relation": ...
    @abstractmethod
    def filter(self, rel: "Relation", condition: "Condition") -> "Relation": ...
    @abstractmethod
    def aggregate(self, rel: "Relation", grouping: list[str], aggs: dict[str, str]) -> "Relation": ...
    @abstractmethod
    def write(self, rel: "Relation", dest: "Table", mode: str = "append") -> None: ...

class Table(ABC):
    @abstractmethod
    def read(self, columns: list[str] | None = None) -> "Relation": ...
    @abstractmethod
    def write(self, rel: "Relation", mode: str = "append") -> None: ...
    @abstractmethod
    def schema(self) -> "Schema": ...

class IcebergTable(Table): ...
class DeltaTable(Table): ...
class BigQueryTable(Table): ...

class RevenueLogic:
    def __init__(self, engine: ComputeEngine, sink: Table):
        self.engine, self.sink = engine, sink
    def compute_monthly(self, source: Table):
        tx = source.read()
        completed = self.engine.filter(tx, Condition("status", "==", "completed"))
        monthly = self.engine.aggregate(completed, ["txn_month"], {"amount": "sum"})
        self.sink.write(monthly, mode="merge")  # business logic never touches engine/storage specifics

This decoupling enables engine swaps (Spark↔Flink↔Dataflow) and table‑format portability (Iceberg/Delta/Hudi) without touching business logic. At Netflix scale, Iceberg provides ACID, schema evolution, and multi‑engine reads/writes on object storage — exactly the kind of storage abstraction DIP demands. (Netflix Tech Blog, incubator.apache.org)
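One practical payoff of the inversion is that business logic can be unit-tested against an in-memory engine with no cluster involved. The sketch below is a simplified, self-contained variant of the `ComputeEngine` idea above: relations are plain lists of dictionaries, and `filter` takes a column and value instead of a `Condition` object. Those simplifications and the `InMemoryEngine` name are assumptions for illustration.

```python
from abc import ABC, abstractmethod
from collections import defaultdict


class ComputeEngine(ABC):
    """Simplified engine abstraction: a relation is a list of row dicts."""

    @abstractmethod
    def filter(self, rel, column, value): ...

    @abstractmethod
    def sum_by(self, rel, key, measure): ...


class InMemoryEngine(ComputeEngine):
    """Test double; a Spark- or Flink-backed engine would implement the same methods."""

    def filter(self, rel, column, value):
        return [row for row in rel if row[column] == value]

    def sum_by(self, rel, key, measure):
        totals = defaultdict(float)
        for row in rel:
            totals[row[key]] += row[measure]
        return [{key: k, measure: v} for k, v in sorted(totals.items())]


def monthly_revenue(engine: ComputeEngine, transactions):
    """Business rule: only completed transactions count toward revenue."""
    completed = engine.filter(transactions, "status", "completed")
    return engine.sum_by(completed, "txn_month", "amount")


transactions = [
    {"txn_month": "2024-01", "status": "completed", "amount": 10.0},
    {"txn_month": "2024-01", "status": "completed", "amount": 20.0},
    {"txn_month": "2024-01", "status": "refunded", "amount": 99.0},
    {"txn_month": "2024-02", "status": "completed", "amount": 5.0},
]

print(monthly_revenue(InMemoryEngine(), transactions))
# [{'txn_month': '2024-01', 'amount': 30.0}, {'txn_month': '2024-02', 'amount': 5.0}]
```

The same `monthly_revenue` function would run unchanged against any engine that implements the two methods, which is the property that makes engine migrations a matter of swapping an adapter.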

Design Patterns as Architectural Building Blocks

Beyond SOLID, certain patterns consistently pay off.

Factory Pattern for Dynamic Resource Allocation

Choose the right pipeline for the workload profile, keeping the selection logic in one place instead of scattering if‑else branches across callers:

class PipelineFactory:
    """Dynamically creates appropriate pipeline based on data characteristics"""
    @staticmethod
    def create_pipeline(config: PipelineConfig) -> Pipeline:
        data_profile = ProfileAnalyzer.analyze(config.source)
        if data_profile.size > 1_000_000_000:  # 1B records
            return DistributedPipeline(engine=SparkEngine(executor_instances=100),
                                       partitions=1000, checkpoint_enabled=True)
        elif data_profile.requires_gpu:
            return GPUPipeline(engine=RapidsEngine(gpu_count=8), batch_size=10000)
        elif data_profile.is_streaming:
            return StreamingPipeline(engine=FlinkEngine(parallelism=16), watermark_delay="10 seconds")
        else:
            return BatchPipeline(engine=DuckDBEngine(), memory_limit="8GB")

Observer Pattern for Comprehensive Monitoring

Observe pipelines without tangling monitoring logic into transforms:

class DataPipelineObservable:
    def __init__(self):
        self._observers: list[PipelineObserver] = []
        self._state = PipelineState()

    def attach(self, observer: PipelineObserver):
        self._observers.append(observer)

    def notify(self, event: PipelineEvent):
        for o in self._observers:
            o.update(event, self._state)

    def process(self, data: DataFrame):
        self.notify(PipelineEvent.STARTED)
        try:
            transformed = self.transform(data)
            self._state.records_processed = len(transformed)
            self.notify(PipelineEvent.TRANSFORMATION_COMPLETE)
            validated = self.validate(transformed)
            self._state.validation_results = validated
            self.notify(PipelineEvent.VALIDATION_COMPLETE)
            self.write(validated)
            self.notify(PipelineEvent.COMPLETED)
        except Exception as e:
            self._state.error = e
            self.notify(PipelineEvent.FAILED)
            raise

class DataQualityObserver(PipelineObserver):
    def update(self, event: PipelineEvent, state: PipelineState):
        if event == PipelineEvent.VALIDATION_COMPLETE and state.validation_results.has_anomalies():
            self.alert_team(state.validation_results.anomalies)

Operations at Scale: SLOs, Backfills, and Multi‑Tenancy

  • SLOs & error budgets: Track freshness (e.g., P95 lag), completeness (volume bounds), and correctness (reconciliation error rate). Check the remaining error budget before promoting new features or metrics, and pause promotions once it is exhausted.
  • Deterministic backfills: Resource‑isolated from prod; reconcile with row‑counts, checksums, and metric parity checks.
  • Streaming semantics: Treat watermarks as heuristics; configure allowed lateness and retractions for late/duplicate data. Unify batch/stream semantics to keep logic consistent across modes. (Google Research)
  • Multi‑tenancy & cost: Quotas, workload isolation, preemption, and per‑dataset cost attribution keep COGS visible and prevent noisy‑neighbor incidents.
  • Ownership & runbooks: Each contract lists owner, on‑call, escalation, and a rollback path (shadow writes, blue/green tables).
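The freshness and error-budget checks in the list above can be sketched in a few lines. The nearest-rank percentile, the 25% promotion threshold, and the function names are illustrative assumptions. Production systems typically compute these from monitoring data over a rolling window.

```python
def p95(values):
    """Nearest-rank 95th percentile, using integer arithmetic for the rank."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


def error_budget_remaining(slo_target, good, total):
    """Fraction of the error budget left in the window (negative when overspent)."""
    if total == 0:
        return 1.0
    allowed_bad = (1.0 - slo_target) * total
    actual_bad = total - good
    if allowed_bad == 0:
        return 1.0 if actual_bad == 0 else float("-inf")
    return 1.0 - actual_bad / allowed_bad


def can_promote(slo_target, good, total, min_budget=0.25):
    """Gate promotions on having at least `min_budget` of the budget left."""
    return error_budget_remaining(slo_target, good, total) >= min_budget


# Freshness: lag in minutes for recent partition loads (made-up sample values).
lags = [5, 7, 9, 12, 40]
print(p95(lags))  # 40

# 99% SLO over 1000 runs with 5 misses leaves roughly half the budget: promote.
print(can_promote(0.99, good=995, total=1000))  # True
# 15 misses overspends the budget: hold the promotion.
print(can_promote(0.99, good=985, total=1000))  # False
```

Gating on remaining budget rather than on the raw SLO keeps the decision proportional: a dataset that is technically within SLO but has nearly exhausted its budget is a poor place to land a risky change.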

Grounded Examples from Large-Scale Systems

  • Google’s Dremel introduced columnar processing over nested data, enabling aggregation queries over trillion‑row tables in seconds; it was later commercialized as BigQuery. (ACM Digital Library)
  • Google’s Dataflow model (inspired by FlumeJava & MillWheel) codified event time, watermarks, and triggers for correctness with out‑of‑order data. (Google Research)
  • Presto (Meta) powers large‑scale interactive analytics across heterogeneous sources. (Meta Research (PDF))
  • Scuba (Meta) provides in‑memory, real‑time analysis at high ingest rates for operational use cases. (VLDB)
  • Apache Iceberg, originally developed at Netflix and now widely adopted, enables ACID tables on object stores with multi‑engine reads/writes; it graduated from the Apache Incubator in 2020. (incubator.apache.org, Netflix Tech Blog)

The Compound Effect of Architectural Principles

Applied together, these principles create systems with emergent properties:

  1. Evolutionary architecture -> Extend via registration and contracts rather than risky edits; decompose and replace components incrementally.
  2. Team scalability -> Clear boundaries and SLOs support parallel work across many teams without constant coordination.
  3. Operational excellence -> Faults isolate to a node/table; contracts gate bad data; rollbacks are surgical.
  4. Innovation velocity -> Swap engines, storage formats, or models under stable interfaces; experiment safely behind contracts.

The Technical Debt Multiplier in Data Systems

Technical debt in data systems compounds faster than in services: bad data pollutes downstream analytics and models, destroys trust, and drives shadow pipelines. Architecture grounded in SOLID + contracts + SLOs maintains trust by being predictable, observable, and correctable. When issues occur, you can identify and fix them quickly; when requirements change, you adapt without destabilizing the platform; when scale increases, you grow without rewrites.

Wrapping Up

The difference between data platforms that scale and those that collapse isn’t the latest tool or the biggest team. It’s architectural discipline. Translating SOLID to contracts, SLOs, and stable interfaces gives you smaller blast radius, faster migrations, predictable on‑call, and sustained feature velocity.

What SOLID buys you in data platforms

  • Smaller blast radius -> faults isolate to a node/table; contracts gate bad data.
  • Faster migrations -> engines and table formats swap under stable interfaces.
  • Predictable on‑call -> clear owners, SLOs, and rollback paths.
  • Parallel velocity -> persona‑specific surfaces and registries let teams ship independently.

Data engineering is software engineering with specialized constraints. Treat datasets as products, encode their contracts, and apply SOLID at the DAG/table boundary. The result is infrastructure that functions as a competitive advantage rather than a liability.
