In the early 1990s, it became evident that relying solely on department-level and application-specific operational reporting couldn’t provide a comprehensive and nuanced view of enterprise data. This realization prompted the widespread adoption of data warehouses.
Soon after, the rise of e-commerce and the growing embrace of numerous enterprise solutions pushed the traditional data warehouse model to its scalability limits. Early internet companies, Google in particular, played a pivotal role in reshaping the understanding of how data could be stored and processed at an unprecedented scale.
This innovation paved the way for data lakes, which represented a significant departure from traditional data storage and processing methods. Various architectural approaches evolved as organizations transitioned to the cloud. A common approach was to leverage data lakes as landing areas and a cloud data warehouse for the curated layer. This enabled both conventional enterprise analytics and the ingestion of vast volumes of structured and semi-structured data.
While this shift towards data lakes delivered greater scalability, it introduced complexity through the coexistence of data lakes and data warehouses, which proved challenging and inefficient to manage. The industry yearned for a unified solution capable of supporting a broad spectrum of data-related tasks, from data engineering and analytics to machine learning and data-driven applications.
Apache Spark, a unified, high-performance analytics engine designed to handle batch and real-time data processing at large scale, was created at UC Berkeley; in 2013, its original authors founded Databricks to build on it commercially. Subsequently, in 2019, Databricks open-sourced Delta Lake, a storage layer engineered to enhance the reliability, security, and performance of existing data lakes. Together, the two offered crucial support for ACID transactions, scalable metadata management, and unified streaming and batch data processing.
With these advancements, the concept of the data lakehouse became a reality, further solidifying the integration of data storage and processing.
Stepping into the future: building the decentralized data architecture and governance framework
The concepts of data fabric and data mesh emerged as promising solutions to address complexities beyond technology in the adoption of enterprise-level data platforms. Data fabric, introduced in 2016, is an architectural and service-oriented approach that offers consistent capabilities across various endpoints, spanning hybrid multi-cloud environments. It is essentially a technical response to the challenge of managing access and data across diverse and dispersed data sources.
In contrast, data mesh, introduced in 2019, focuses on a sociotechnical approach, emphasizing four core principles: domain ownership, treating data as a product, creating self-serve data platforms, and implementing federated computational governance. These principles are shaping how enterprises plan their data strategy roadmaps.
Organizations soon recognized the potential for synergy between data fabric and data mesh. By implementing data fabric as a semantic overlay for accessing diverse sources, while embracing data mesh principles to govern distributed data creation, organizations can achieve unified access to both centrally managed and distributed data. This approach simplifies the transition to distributed governance and enables richer data sharing along with improved scalability and flexibility.
From a technical implementation perspective, data lakehouses, which already support a wide array of data consumption needs, are evolving into the preferred platforms for enabling data fabric and data mesh:
- Lakehouse catalogs allow for unified governance, security, standardization, and streamlined data discovery.
- Diverse sharing mechanisms facilitate authorized access to data beyond organizational boundaries.
- Lakehouse workflows empower the creation of self-service data pipelines for both batch and real-time operations.
- Furthermore, by continually enhancing their catalogs and their ability to execute federated queries across multiple external sources, lakehouses can effectively function as a de facto data fabric.
Future blogs will further detail how the combination of mesh and fabric, built upon a mature lakehouse infrastructure, forms the ideal foundation for growth and business agility, paving the way for a data-centric future where the power of unified data platforms drives innovation and success.
Alex is a well-rounded professional with more than 25 years of hands-on experience in technology consulting and advisory services, solution design, and enterprise architecture, with a focus on cloud transformation, data engineering, and AI. He has a passion for framing large transformative solutions and guiding teams towards achieving business and technology objectives.