Medallion Architecture Data Pipeline

Welcome to our project showcase! Here’s a quick overview of how we use a medallion architecture for managing data streams.

Pipeline Highlights

  • Data Ingestion with Kafka: streaming events are consumed continuously from Kafka topics and fed into the lakehouse (see the ingestion sketch below).
  • Bronze Layer: Stores raw data as-is for historical tracking.
  • Silver Layer: Data is cleansed and structured, ready for analytics.
  • Gold Layer: Optimized Delta tables and fact tables for reporting and insights.
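
The sketch below shows one way the Kafka-to-bronze step could look with PySpark Structured Streaming and Delta Lake; the broker address, topic name, and storage paths are placeholders rather than the production values.

```python
# Minimal bronze-ingestion sketch: Kafka -> Delta, payload stored as-is.
# Broker, topic, and paths are placeholders, not the real configuration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bronze-ingest").getOrCreate()

# Read the raw event stream from Kafka.
raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
    .option("subscribe", "events")                      # placeholder topic
    .load()
)

# Bronze layer: keep the payload untouched, plus ingestion metadata,
# so history can always be replayed downstream.
bronze = raw.selectExpr(
    "CAST(key AS STRING) AS key",
    "CAST(value AS STRING) AS value",
    "topic", "partition", "offset",
    "timestamp AS ingested_at",
)

(
    bronze.writeStream
    .format("delta")
    .option("checkpointLocation", "/lake/bronze/events/_checkpoint")  # placeholder
    .outputMode("append")
    .start("/lake/bronze/events")                                     # placeholder
)
```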

Gold Layer: KPI Classification Approach

In the gold layer, we implement a row-level KPI classification model. Instead of calculating KPIs for predefined hyper cubes or aggregated views, we:

  1. Classify Each Item:

    • Each row in the gold layer is checked against KPI definitions.
    • A boolean column is added to each row, indicating whether the item fulfills the KPI definition (true/false); steps 1 and 2 are illustrated in the first sketch after this list.
  2. Dimension Filtering and Aggregation:

    • The BI reporting tool allows users to select items dynamically using different dimensions (e.g., time, geography, or product type).
    • KPIs are calculated with aggregation functions such as AVG() over the boolean flags; averaging a 0/1 flag directly yields the share of items that fulfill the KPI.
  3. Master Data Integration:

    • Dimensions for BI analysis, such as Product, Region, and Time, are sourced from the company’s master data system.
    • These dimensions are integrated into the pipeline via a daily batch ETL process that follows SCD Type 2, so historical changes in dimension data are preserved (a merge pattern for this is sketched after the list).
  4. Star and Snowflake Schemas:

    • The overall structure of the data warehouse depends on the hierarchical complexity of dimensions:
      • Star Schema: If dimensions have a flat hierarchy, they directly connect to the fact table, forming a star-like structure.
      • Snowflake Schema: If dimensions contain multiple levels of hierarchy (e.g., Region → Country → City), they are further normalized into related dimension tables, forming a snowflake-like structure.
  5. Benefits of This Approach:

    • Flexible and scalable: Users can slice and dice data dynamically without predefined hyper cubes.
    • Simplified pipeline: KPI logic is encapsulated at the item level, reducing complexity in ETL processes.
    • BI-driven insights: KPIs are dynamically derived based on user queries and needs.
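
As a concrete illustration of steps 1 and 2, the sketch below flags each gold-layer row against a hypothetical "delivered on time" KPI and then aggregates the flag over user-chosen dimensions. The table paths, column names, and the KPI rule itself are illustrative assumptions, not the actual definitions.

```python
# Row-level KPI classification (step 1) and deferred aggregation (step 2).
# Paths, columns, and the "on time" rule are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("gold-kpi").getOrCreate()

silver = spark.read.format("delta").load("/lake/silver/orders")  # placeholder path

# Step 1: classify each row against the KPI definition and store a boolean flag.
gold = silver.withColumn(
    "kpi_on_time",
    F.col("delivered_at") <= F.col("promised_at"),  # hypothetical KPI rule
)
gold.write.format("delta").mode("overwrite").save("/lake/gold/orders")  # placeholder

# Step 2: the BI tool (or an ad-hoc query) aggregates the flag on whatever
# dimensions the user selects; AVG over a 0/1 flag yields the fulfillment rate.
kpi = (
    spark.read.format("delta").load("/lake/gold/orders")
    .groupBy("region", "product_type")  # hypothetical dimensions
    .agg(F.avg(F.col("kpi_on_time").cast("double")).alias("on_time_rate"))
)
kpi.show()
```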
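
For step 3, one common way to realize SCD Type 2 on Delta Lake is a two-step daily job: first expire the current dimension rows whose attributes changed, then append new current versions. The sketch below assumes a hypothetical dim_product table with is_current / effective_from / effective_to columns; it is not the exact production job.

```python
# SCD Type 2 sketch for the daily dimension load, using delta-spark.
# The dim_product layout and the staging path are assumptions.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dim-scd2").getOrCreate()

updates = spark.read.format("delta").load("/lake/staging/master_product")  # placeholder
dim = DeltaTable.forPath(spark, "/lake/gold/dim_product")                  # placeholder

# 1) Expire current rows whose attributes changed in today's master-data extract.
(
    dim.alias("d")
    .merge(updates.alias("u"), "d.product_id = u.product_id AND d.is_current = true")
    .whenMatchedUpdate(
        condition="d.product_name <> u.product_name OR d.category <> u.category",
        set={"is_current": "false", "effective_to": "current_date()"},
    )
    .execute()
)

# 2) Append a new current version for new or just-expired (changed) products.
current_keys = (
    spark.read.format("delta").load("/lake/gold/dim_product")
    .filter("is_current = true")
    .select("product_id")
)

new_versions = (
    updates.join(current_keys, "product_id", "left_anti")
    .select("product_id", "product_name", "category")
    .withColumn("effective_from", F.current_date())
    .withColumn("effective_to", F.lit(None).cast("date"))
    .withColumn("is_current", F.lit(True))
)
new_versions.write.format("delta").mode("append").save("/lake/gold/dim_product")
```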

Comparison: Hyper Cubes in Classical Data Warehouses

What Are Hyper Cubes?

In classical data warehouses, hyper cubes are multidimensional structures used for organizing and analyzing data. They consist of:

  • Dimensions: Perspectives for analysis, such as Time, Product, or Region.
  • Measures: Numeric data or facts like sales or profit.
  • Cells: Each cell contains a measure corresponding to a unique combination of dimensions.

For example:

| Product | Region | Time | Total Sales |
|---------|--------|------|-------------|
| Shoes   | Europe | 2023 | $100,000    |
| Jackets | Asia   | 2024 | $50,000     |

Hyper cubes are optimized for predefined aggregations and enable operations such as slicing, dicing, and drilling down.
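
For comparison, a classical pre-aggregated cube can be materialized in a single statement. The sketch below uses Spark's cube() grouping with illustrative column and path names; it pre-computes every combination of the listed dimensions up front, which is exactly the cost the row-level approach avoids.

```python
# Illustrative only: materializing a classical cube with Spark's cube().
# Column names and paths are assumptions. cube() over three dimensions
# pre-aggregates all 2^3 = 8 grouping sets in one pass.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("cube-demo").getOrCreate()

sales = spark.read.format("delta").load("/lake/silver/sales")  # placeholder path

sales_cube = (
    sales.cube("product", "region", "year")
    .agg(F.sum("amount").alias("total_sales"))
)

sales_cube.write.format("delta").mode("overwrite").save("/lake/gold/sales_cube")  # placeholder
```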

Challenges of Hyper Cubes

  1. Predefined Structure:
    • Dimensions and hierarchies must be predefined, limiting flexibility.
  2. Scalability Issues:
    • Adding new dimensions increases storage and computational requirements exponentially.
  3. Static Aggregations:
    • Pre-aggregated data cannot accommodate ad-hoc queries easily.
  4. ETL Complexity:
    • Preparing hyper cubes requires extensive ETL processes for every possible combination of dimensions.

Modern Approach: Medallion Architecture with Delta Lake

In contrast to classical hyper cubes, our pipeline leverages a medallion architecture with Delta Lake. Key differences:

  • Row-Level Processing: KPIs are not pre-aggregated but classified at the row level.
  • Dynamic KPI Computation: Aggregations are deferred to the BI layer, enabling flexible queries.
  • Scalable and Efficient: Delta Lake’s ACID transactions and scalable storage handle large data volumes without the limitations of hyper cubes.
  • Simplified ETL: Focuses on cleaning and structuring data rather than extensive pre-aggregation.

Why This Approach?

Our approach ensures:

  • Flexibility for diverse queries and user-defined KPIs.
  • Scalability to handle large-scale streaming data from Kafka.
  • Simplified ETL pipelines, reducing processing time and complexity.

By integrating these advanced methods, we enable real-time insights, scalable analytics, and user-driven reporting capabilities.