The Enterprise Data Warehouse (DWH) is designed to be the single place to answer all business questions quickly and reliably. This ambitious, long-running goal is currently being challenged by many smaller-scoped initiatives, grouped under the Data Lake umbrella, that may one day make the DWH disappear. This article focuses on the reasons for the attack, on the vision that replaces the DWH, and on the steps to enable that vision.
Why are DWHs attacked?
While most IT costs decrease with time, the cost of a DWH does not. On the contrary, the longer a DWH lives, the more it costs, because it holds more and more data and because it is built as a monolithic structure. Evolutive modelling (e.g. Data Vault) and DWH automation are good approaches to making a DWH cheaper and easier to evolve, but they are not widespread enough to be the solution. Furthermore, these approaches now compete with the Data Lake approach described below.
DWHs inhibit business innovation
There are several reasons why a DWH cannot keep pace with the business:
- Many processes are in place to protect data quality in a DWH, mostly because many reports depend on that data. But these processes make the DWH slow to evolve
- A DWH is meant to answer all business needs from a single set of tables, and growing a structure to meet requirements it was not initially designed for is a technical challenge. It requires time and expertise from IT; if IT is not available, extensions are slow to implement
- A DWH is limited to structured data, implying that, in the end, it will hold only a few figures extracted from images, tweets or emails, without ever making the original files available for analysis
- When new use-cases for data emerge, their value must be validated in a cheap and fast process that is often not compatible with the pace of the DWH
Since a DWH cannot keep pace with the business, business users are used to extracting some data from the DWH, merging it with other sources in Excel or in a dataviz tool, and designing dashboards there. In turn, this usage of dataviz inhibits data sharing and governance.
DWHs cannot be real-time
With all the data preparation steps involved, DWHs are typically loaded once or twice a day. And it is difficult to add real-time data integration for a single table when all the reference data around it is still loaded only once a day.
DWHs inhibit organizational innovation
The DWH is owned and managed by IT, with strictly controlled access and evolutions. These characteristics prevent other departments from creating data-driven initiatives without the involvement of IT. Meanwhile, data-preparation tools are ready for non-IT users to curate their datasets and contribute their data back to data-driven innovation. Some companies have implemented light-governance areas within their DWH, but such Data Lab zones are much better supported in Data Lakes.
DWHs are not adapted to agile projects
Developing ETL and dashboards can follow an agile methodology such as Scrum, but the lengthy processes around the DWH are not suited to agility.
What benefits should we retain from the DWH era?
There are many characteristics that have to be maintained in the post-DWH era:
- Aggregate content from multiple sources around common dimensions, thus simplifying the creation of cross-functional datasets
- Ensure quality, with curated, agreed-upon referentials and facts
- Ensure security, where each user is granted access to specific data
- Ensure performance of the access by many users
- Let users drill from aggregate to detail data
In addition, some capabilities should be added:
- Integrate any type of data
- Load data in real-time
- Use data stored once for multiple usages and applications, including Data Science
- Create small-scoped data projects that together form a complete data architecture
What vision for BI without a DWH?
Below is a list of guiding principles for a new vision of BI, built without a DWH.
Limit data duplication
Every transformation of the data takes time to build, test, run, and later evolve. Thus the fewer copies of the data we have, the more agile the architecture we create. A Data Lake will have a small number of zones, with processes to transform the data between zones.
For example in a simplified file-based Data Lake, we could design these zones:
- Raw Zone to keep the raw data imported from the sources
- Refined Zone to maintain use-case-based datasets shared with many users
- Distilled Zone where data can be further worked on, with limited support from IT
In this simplified vision, the first two zones are sufficient to build a system that stores raw data and transforms it into datasets for end users to consume; the Distilled Zone is an extension for lighter-governed work. The goal is to avoid creating additional copies of the data just to meet the requirements of the BI or dataviz tools.
The three zones above have different governance levels: the Refined Zone keeps the same governance as the DWH, while the Distilled Zone has lighter governance.
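The zone layout and a single Raw-to-Refined promotion step can be sketched as follows. The directory names and the `refine` function are illustrative assumptions, not a standard; a real lake would use object storage and columnar formats rather than local JSON files.

```python
import csv
import json
from pathlib import Path

# Hypothetical zone layout; the names below are illustrative assumptions.
LAKE = Path("datalake")
RAW = LAKE / "raw"              # data exactly as imported from the sources
REFINED = LAKE / "refined"      # curated, use-case-based datasets
DISTILLED = LAKE / "distilled"  # lightly governed, user-workable area

def refine(source: str, keep_columns: list[str]) -> Path:
    """Promote a raw CSV dataset to the Refined Zone, keeping only the
    agreed-upon columns: one transformation, no extra intermediate copies."""
    REFINED.mkdir(parents=True, exist_ok=True)
    out = REFINED / f"{source}.json"
    with open(RAW / f"{source}.csv", newline="") as f:
        rows = [{c: r[c] for c in keep_columns} for r in csv.DictReader(f)]
    out.write_text(json.dumps(rows))
    return out
```

The point of the sketch is that each dataset exists once per zone, and the only copies are the ones the zone model itself requires.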
Limit ETL code duplication
To process inbound data and to transform data between zones, we now have the opportunity to create pipelines that handle both batch and real-time data. Such frameworks (e.g. Apache Beam) let us standardize the import processes and expand the scope they can be applied to. This is an opportunity to create architectures that handle data integration from multiple sources in real time.
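The "write the transformation once" idea can be sketched without any specific framework: express the transformation over an iterable, then feed it either a finite batch or an unbounded stream. The record fields below are hypothetical, and Beam's actual API (pipelines, PCollections) is richer than this plain-Python analogy.

```python
from typing import Iterable, Iterator

def clean_events(events: Iterable[dict]) -> Iterator[dict]:
    """One transformation, written once: drop incomplete records and
    normalize amounts. The same code serves batch and streaming inputs."""
    for e in events:
        if e.get("amount") is not None:
            yield {"id": e["id"], "amount": float(e["amount"])}

# Batch mode: a finite list, loaded once a day.
batch = [{"id": 1, "amount": "10"}, {"id": 2, "amount": None}]
daily = list(clean_events(batch))

# Streaming mode: the same function over an unbounded generator.
def live_feed():
    yield {"id": 3, "amount": "7.5"}  # in reality, this would block on a queue

stream = clean_events(live_feed())
```

This is exactly the duplication the unified-pipeline frameworks remove: without them, the batch and real-time versions of `clean_events` would be two separate codebases to keep in sync.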
Have IT and Business users catalog data as it is prepared
As the number of datasets grows, whether they are created by IT or contributed back by business users, it becomes necessary to document the content of each file. This is a demanding task, best done at the moment a dataset is published. Platforms exist that automatically scan the content of datasets and help catalog the files.
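The automatic scan can be sketched with the standard library: infer a draft schema from a CSV at publication time and produce a catalog entry for humans to review. The entry format and field names are assumptions for illustration, not those of any particular cataloging product.

```python
import csv
from pathlib import Path

def _is_number(v: str) -> bool:
    try:
        float(v)
        return True
    except ValueError:
        return False

def catalog_entry(path: Path, sample_size: int = 100) -> dict:
    """Scan a published CSV and draft a catalog entry: column names and a
    naive type guess, to be reviewed and enriched by IT or business users."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        sample = [row for _, row in zip(range(sample_size), reader)]
        types = {}
        for col in reader.fieldnames:
            values = [r[col] for r in sample]
            types[col] = "number" if all(_is_number(v) for v in values) else "text"
    return {"dataset": path.stem, "columns": types, "rows_sampled": len(sample)}
```

The machine does the tedious part (listing columns, guessing types); the human contribution, which no scanner can replace, is the business meaning of each column.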
Open catalogs and close datasets
As the number of datasets grows, another important activity is to keep dataset access limited to the users who should have it. In this proposed vision, datasets should be closed by default and opened only to users or groups requiring access to them. As the data catalog stores users’ requests and validations (what datasets, requested by, requested when, requested for how long, validated by, etc.), there is a clear audit trail and dataset-usage-tracking system. In addition, it becomes easy to see which datasets are used by a specific user, or not used at all.
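The closed-by-default catalog with an audit trail can be sketched as follows; the request/validation flow and the field names mirror the list above but are otherwise illustrative assumptions.

```python
from datetime import date

class Catalog:
    """Datasets are closed by default; every grant leaves an audit record."""

    def __init__(self):
        self.grants = {}  # dataset -> set of users with access
        self.audit = []   # one record per validated request

    def grant_access(self, dataset, user, validated_by, valid_until):
        """Record who requested what, when, for how long, and who validated."""
        self.audit.append({"dataset": dataset, "requested_by": user,
                           "requested_on": date.today().isoformat(),
                           "validated_by": validated_by,
                           "valid_until": valid_until})
        self.grants.setdefault(dataset, set()).add(user)

    def can_read(self, dataset, user):
        return user in self.grants.get(dataset, set())

    def unused_datasets(self, all_datasets):
        """Datasets nobody has ever requested access to."""
        return [d for d in all_datasets if d not in self.grants]
```

Because access only ever flows through `grant_access`, the audit trail and the usage report come for free rather than as a separate compliance project.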
Use BI tools directly on the data lake files
As we want to limit the number of copies of the data, we need to expose the information in the data lake directly to BI and data visualization applications. The “BI on Data Lake” platforms we integrate create a semantic layer, a security layer and a performance layer that let users connect from Excel or Tableau directly to the Data Lake files or tables. That single semantic/security layer provides all BI, dataviz and data science tools with the same data, the same business rules and the same security, without creating an additional copy of the data, inside or outside the Data Lake. From Excel, Tableau or any other dashboarding tool, each user has access to their own data, organized along dimensions, and can drill inside reports with cube-like performance.
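Conceptually, that shared layer is one mapping from business terms to physical columns plus one security rule, applied identically whichever tool connects. A minimal sketch, with entirely hypothetical names (real platforms also add caching and aggregation for the performance layer):

```python
# Hypothetical semantic layer: business names -> physical column names.
SEMANTIC = {"Revenue": "amt_eur", "Country": "ctry_code"}

def query(rows, user_countries, fields):
    """Serve Excel, Tableau or a notebook from the same rows, the same
    business names and the same row-level security rule."""
    visible = [r for r in rows if r["ctry_code"] in user_countries]
    return [{f: r[SEMANTIC[f]] for f in fields} for r in visible]
```

Every tool goes through `query`, so renaming a physical column or tightening the security rule happens once, in the layer, instead of once per dashboard.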
How to implement this vision today?
At Polarys, we help teams transition to the Data Lake, because the main impact of the Data Lake is organizational. Giving the business more capability around data is a change that requires governance, training and organizational changes.
At Polarys, we also help companies select components for their data architecture. We work with a small number of partners with whom we can build these solutions today:
- Cloud Data Lake providers, some with file-based storage only, some with both file-based and table-based storage depending on the data
- Data Governance software
- BI-on-Data-Lake platforms
- BI, Data Visualization, Data Preparation and Data Science platforms
Feel free to contact me to have more information on the partners we work with.
In this vision, does the DWH really disappear?
Yes, the physical DWH really disappears, along with its cost and its limitations. But most of its guiding principles remain, because these principles are important in the IT landscape, such as fast access to reliable data for one or many BI tools.
It is the physical DWH as a separate database that disappears, transforming into a zone of the Data Lake, with curated data and an interface to BI solutions.
Please note that Google BigQuery does not have the DWH drawbacks listed above: BigQuery is the storage area for all structured data in the Google architecture, it can be fed and queried in real time, and it can thus be considered part of a larger, logical Data Lake.
At Polarys, we believe the transition from a DWH-centric organization to a Data-Lake-centric organization will take time and energy, but the outcome is a more agile and adequate data-management architecture that will promote innovation and enable your users to easily build new solutions around your data.
Do you agree with this vision? Are you ready to shut down your DWH?