Methods for Retrieving and Utilizing Information Stored in Azure Data Lake
In the realm of enterprise data lakes, the raw data-delta pattern is an approach that offers a balance between flexible data ingestion and reliable updates. This method involves initially landing raw data with minimal transformation (raw layer) and then applying incremental changes (deltas) to update or refine this data progressively.
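To make the pattern concrete, here is a minimal PySpark sketch of the two steps, assuming a Delta-enabled Spark session; the container names, paths, dates, and the `order_id` business key are illustrative assumptions, not from the article:

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

# Assumes Delta Lake is on the classpath; on Databricks/Synapse this is preconfigured.
spark = (
    SparkSession.builder.appName("raw-delta-pattern")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

landing = "abfss://landing@<storageaccount>.dfs.core.windows.net/orders"  # placeholder
bronze = "abfss://bronze@<storageaccount>.dfs.core.windows.net/orders"    # placeholder

# 1. Land raw data with minimal transformation (the raw layer).
raw_df = spark.read.json(f"{landing}/2024-06-01/")
raw_df.write.format("delta").mode("append").save(bronze)

# 2. Apply incremental changes (deltas) as a single ACID MERGE transaction.
changes = spark.read.json(f"{landing}/changes/2024-06-02/")
target = DeltaTable.forPath(spark, bronze)
(target.alias("t")
 .merge(changes.alias("c"), "t.order_id = c.order_id")  # assumed business key
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())
```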
Advantages of the Raw Data-Delta Pattern
The raw data-delta pattern offers several benefits, including:
- Flexibility and Scalability: This pattern supports storing large volumes of diverse raw data with schema-on-read flexibility, accommodating structured and unstructured data types. It scales cost-effectively using low-cost storage typical of data lakes.
- Incremental Data Updates with ACID Guarantees: Using delta formats (e.g., Delta Lake), it enables ACID transactions that ensure consistency, allowing concurrent writes and reads and reducing data corruption or conflicts. This is critical for real-time or near-real-time analytics.
- Improved Data Reliability and Governance: By layering delta updates on raw data, it provides schema enforcement, versioning, and auditability (see the sketch after this list). This reduces the risk of data swamps common in raw-only lakes and supports regulatory compliance and confidence in data quality.
- Enables Real-time or Near-real-time Processing: Continuous delta ingestion allows fresh data availability, supporting business operations requiring timely insights.
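To make the reliability and versioning claims concrete, here is a short sketch of Delta Lake's transaction log, time travel, and schema enforcement, reusing the assumed bronze table path from the earlier sketch:

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()  # assumes a Delta-enabled session

bronze = "abfss://bronze@<storageaccount>.dfs.core.windows.net/orders"  # placeholder
orders = DeltaTable.forPath(spark, bronze)

# Auditability: every append or MERGE is a logged transaction that can be inspected.
orders.history().select("version", "timestamp", "operation").show()

# Versioning: time travel reads the table exactly as it was at an earlier version.
orders_v0 = spark.read.format("delta").option("versionAsOf", 0).load(bronze)

# Schema enforcement: a write with an incompatible schema fails instead of silently
# corrupting the table (unless mergeSchema/overwriteSchema is explicitly enabled).
```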
Disadvantages of the Raw Data-Delta Pattern
While the raw data-delta pattern offers numerous advantages, it also presents some challenges, such as:
- Complexity of Initial Setup and Management: Implementing this layered ingestion requires disciplined metadata management, ETL/ELT orchestration, and governance frameworks. This may require specialized technical expertise to avoid pitfalls.
- Higher Operational Overhead Compared to Pure Raw Lakes: Managing ACID transactions, concurrent operations, and delta updates involves infrastructure and computational costs higher than simple raw data landing, particularly when real-time capabilities are needed.
- Potential for Data Duplication and Pipeline Fragility: Without careful pipeline design, delta updates may introduce version sprawl or conflicts, while batch-based pipelines can break with upstream operational changes, requiring ongoing maintenance.
- Requires Strong Governance to Avoid a 'Data Swamp': Without schema enforcement and adherence to governance policies, the layering of raw and delta data can degrade into a poorly organized store, reducing usability.
Conclusion
The raw data-delta pattern is a powerful tool for enterprises seeking both scalability and data reliability. However, it comes at the cost of increased architectural complexity, governance demands, and operational overhead compared to simpler raw-lake-only ingestion approaches. It is particularly advantageous when concurrent writes, real-time updates, and data quality assurance are necessary for mission-critical analytics.
Delta Lake helps consumers query data easily, and Data Factory supports Delta as a sink, letting producers automatically land data in Delta format. Multiple zones in a data lake make it possible to ingest all raw data and to create multiple aggregations ready for consumption, as sketched below.
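A hedged sketch of the multi-zone idea, assuming a bronze (raw) zone and a gold (consumption-ready) zone with an illustrative orders schema; zone names, columns, and the aggregation are assumptions:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # assumes a Delta-enabled session

# Read raw data from the bronze zone...
bronze = spark.read.format("delta").load(
    "abfss://bronze@<storageaccount>.dfs.core.windows.net/orders")

# ...and publish a consumption-ready aggregation to the gold zone.
daily_revenue = (bronze
    .groupBy(F.to_date("order_ts").alias("order_date"))
    .agg(F.sum("amount").alias("revenue")))

daily_revenue.write.format("delta").mode("overwrite").save(
    "abfss://gold@<storageaccount>.dfs.core.windows.net/daily_revenue")
```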
The article concludes in Chapter 5 with a proof of concept that uses ADF, Delta Lake, and Spark to ingest data from SQLDB, query it from the delta lake, and copy it to a consumer's own environment if needed. The article also highlights the two main stakeholders of the enterprise data lake: the Data Producer, who looks for an easy way to ingest data, and the Data Consumer, who creates business value from the data.
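For illustration, here is the proof of concept's data flow sketched entirely in Spark (in the article the copy is orchestrated by ADF with a delta sink); the server, database, table, and credentials are placeholders, and in practice secrets would come from Key Vault or a managed identity:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes a Delta-enabled session

# Ingest a table from SQLDB over JDBC (placeholder connection details).
source = (spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>")
    .option("dbtable", "dbo.Orders")       # assumed source table
    .option("user", "<user>")
    .option("password", "<password>")      # use Key Vault / managed identity in practice
    .load())

# Land it in the data lake in Delta format.
lake_path = "abfss://bronze@<storageaccount>.dfs.core.windows.net/sqldb/orders"
source.write.format("delta").mode("overwrite").save(lake_path)

# Consumers can then query the delta lake directly.
spark.read.format("delta").load(lake_path).show(5)
```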
Copying data from the enterprise data lake to a consumer's own environment can take a long time, especially for large data sets. The data virtualization pattern instead lets consumers query the data lake directly without creating duplicates, keeping the lake a single source of truth.
For consumers with limited technological knowledge, the data virtualization pattern offers a low barrier to entry for creating reports. For teams with strict performance requirements or SLAs, however, this pattern may not be feasible; in such cases the copy data pattern is preferable, since it gives the consumer full control of the data. Both patterns are sketched below.
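To contrast the two consumption patterns, a minimal sketch under the same assumed paths; the consumer's storage account is hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes a Delta-enabled session

lake_path = "abfss://gold@<storageaccount>.dfs.core.windows.net/daily_revenue"

# Data virtualization pattern: query the lake in place, creating no duplicates,
# so the enterprise data lake stays the single source of truth.
spark.read.format("delta").load(lake_path).createOrReplaceTempView("daily_revenue")
spark.sql("SELECT order_date, revenue FROM daily_revenue ORDER BY order_date").show()

# Copy data pattern: materialize a copy in the consumer's own environment,
# trading duplication for full control over layout, compute, and performance.
consumer_path = "abfss://data@<consumeraccount>.dfs.core.windows.net/daily_revenue"
spark.read.format("delta").load(lake_path) \
    .write.format("delta").mode("overwrite").save(consumer_path)
```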
- The raw data-delta pattern, when combined with technology solutions like Delta Lake and Data Factory, enables easy querying of data for consumers and efficient data ingestion for producers, creating a single source of truth in a data lake.
- For consumers with limited technological knowledge, the data virtualization pattern is advantageous due to its ease of use in creating reports, while the copy data pattern is more suitable for teams with strict performance requirements or service level agreements (SLAs).