Archives

Building a Medallion architecture with ClickHouse

  • Building a Medallion architecture with ClickHouse

    Walkthrough of the "Medallion" architecture concept, which comprises three layers (or stages), each serving distinct purposes in the data pipeline:

    • Bronze layer - This layer acts as the landing area for raw, unprocessed data directly from the source system: simply put, a "staging area". This data is stored in its original structure with minimal transformations and additional metadata. This layer is optimized for fast ingestion and can provide a historical archive of source data that is always available for reprocessing or debugging. Whether the bronze layer should store all data is a point of contention, with some users preferring to filter the data and apply transformations, e.g., flattening JSON, renaming fields, or filtering out poorly formed data. We're not overly opinionated here but recommend optimizing the storage for consumption by the silver layer only - not other consumers.

    • Silver layer - Here, data is cleansed, deduplicated, and conformed to a unified schema, with raw data from the previous bronze layer being enriched and transformed to provide a more accurate and consistent view. The resulting data is consistent and usable for enterprise-wide use cases such as machine learning and analytics. The data model should emerge at this layer, with a focus on keeping primary and foreign keys consistent to simplify future joins. While not common, applications and downstream consumers can read from this layer. These are typically business-wide applications that need the entire cleansed dataset, e.g., ML workflows. Importantly, data quality will not improve after this stage, only the ease and efficiency with which the data can be queried.

    • Gold layer - This layer aims to have fully curated, business-ready, and project-specific datasets that make the data more accessible (and performant) to consumers. These datasets are often denormalized or pre-aggregated for optimal read performance and may be composed from multiple tables in the previous silver stage. The focus here is on applying final transformations and ensuring the highest data quality for consumption by end-users or applications, such as reporting and user-facing dashboards.

    This layered approach to data pipelines aims to efficiently address challenges like data quality, duplication and schema inconsistencies. By transforming raw data incrementally, the Medallion architecture aims to ensure a clear lineage and progressively refined datasets that are ready for analysis or operational use.
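
    The three stages can be illustrated with a minimal in-memory sketch in plain Python. This is not ClickHouse code and not the article's implementation: the function names, the event fields (`id`, `user`, `amount`), and the metadata columns (`_source`, `_ingested_at`) are all hypothetical, chosen only to show how each layer refines the output of the previous one.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Hypothetical raw events as they might arrive from a source system.
RAW_EVENTS = [
    {"id": 1, "user": "42", "amount": "10.5"},
    {"id": 1, "user": "42", "amount": "10.5"},  # duplicate delivery
    {"id": 2, "user": "42", "amount": "4.5"},
    {"id": 3, "user": "7", "amount": "oops"},   # poorly formed amount
    {"id": 4, "user": "7", "amount": "3"},
    {"user": "7", "amount": "1"},               # missing id
]


def ingest_bronze(raw_events, source):
    """Bronze: keep each record in its original structure, adding only metadata."""
    ingested_at = datetime.now(timezone.utc).isoformat()
    return [
        {"payload": event, "_source": source, "_ingested_at": ingested_at}
        for event in raw_events
    ]


def build_silver(bronze_rows):
    """Silver: drop malformed rows, conform to one schema, deduplicate on event_id."""
    by_id = {}
    for row in bronze_rows:
        payload = row["payload"]
        try:
            record = {
                "event_id": str(payload["id"]),
                "user_id": int(payload["user"]),
                "amount": float(payload["amount"]),
            }
        except (KeyError, TypeError, ValueError):
            continue  # malformed rows stay in bronze for debugging or reprocessing
        by_id.setdefault(record["event_id"], record)  # keep the first occurrence
    return list(by_id.values())


def build_gold(silver_rows):
    """Gold: pre-aggregate for one specific consumer (total amount per user)."""
    totals = defaultdict(float)
    for record in silver_rows:
        totals[record["user_id"]] += record["amount"]
    return dict(totals)


bronze = ingest_bronze(RAW_EVENTS, source="example-source")
silver = build_silver(bronze)
gold = build_gold(silver)
```

    Note that the bronze step discards nothing, so the malformed and duplicate rows remain available for reprocessing, while all quality work happens in the silver step and the gold step only reshapes already clean data for reads.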

    Tags: medallion-architecture data architecture pipelines clickhouse