Data Warehouse, Data Lake & Data Vault

Data Lakes & Data Warehouses

Data Lakes and Data Warehouses both act as repositories, but they are designed for very different purposes. Data Warehouses work best for specific projects with set resources, while Data Lakes are optimized for managing all incoming Big Data.

The data stored in the warehouse is sourced from various Operational Data Sources (ODS). Because these sources can be heterogeneous systems, the data usually requires cleansing and other preparatory operations to ensure its quality before it is used in the Data Warehouse (DW) for reporting.
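As a rough illustration of that cleansing step, the sketch below maps rows from two operational sources into one consistent shape before they would be loaded into the warehouse. The source systems, field names, and cleansing rules are all assumptions made up for this example.

```python
from datetime import datetime

# Hypothetical rows from two operational systems that use different conventions.
crm_rows = [{"customer": " Alice ", "signup": "2023-01-15", "country": "us"}]
billing_rows = [{"cust_name": "BOB", "created": "15/02/2023", "country_code": "DE"}]


def clean_crm(row):
    """Map a CRM row to the shared warehouse shape."""
    return {
        "customer_name": row["customer"].strip().title(),
        "signup_date": datetime.strptime(row["signup"], "%Y-%m-%d").date().isoformat(),
        "country": row["country"].upper(),
        "source": "crm",
    }


def clean_billing(row):
    """Map a billing row to the same shape, converting its day/month/year date."""
    return {
        "customer_name": row["cust_name"].strip().title(),
        "signup_date": datetime.strptime(row["created"], "%d/%m/%Y").date().isoformat(),
        "country": row["country_code"].upper(),
        "source": "billing",
    }


def cleanse(crm, billing):
    """Return one list of rows that all follow the warehouse conventions."""
    return [clean_crm(r) for r in crm] + [clean_billing(r) for r in billing]


warehouse_rows = cleanse(crm_rows, billing_rows)
```

After cleansing, both rows use the same field names, date format, and country-code style, which is what lets reports treat them as one data set.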

Data Lakes save time, effort, and cost by creating a single repository for all Structured, Semi-Structured, and Unstructured Data, making it easy for Data Scientists to pull exactly what they need for analysis.

Data Vault

A Data Vault is a system made up of a model, a methodology, and an architecture that is explicitly designed to solve a complete business problem as requirements change. Data in the Data Vault is generally stored as raw data sets, so reconciling it to the source system is a recommended way to test it.

It serves to structure the data warehouse data as systems of permanent records, and to absorb structural changes without requiring alterations to the existing structures. Data Vault requires data to be loaded exactly as it exists in the source system: no edits, no changes, and no application of business rules (including data cleansing). This ensures that the Data Vault is 100% auditable.
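The "load as-is, then reconcile to source" idea can be sketched as follows. This is a simplified illustration, not the Data Vault standard itself: the function names, the audit fields (load_ts, record_source, row_hash), and the sample rows are assumptions chosen for the example.

```python
import hashlib
import json
from datetime import datetime, timezone


def row_hash(row):
    """Hash a row's content so source and vault copies can be compared."""
    payload = json.dumps(row, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def load_raw(source_rows, record_source):
    """Store each source row with its values unmodified, adding only audit metadata."""
    load_ts = datetime.now(timezone.utc).isoformat()
    return [
        {
            "payload": json.dumps(row, sort_keys=True),  # values kept as delivered
            "row_hash": row_hash(row),
            "load_ts": load_ts,
            "record_source": record_source,
        }
        for row in source_rows
    ]


def reconcile(source_rows, vault_rows):
    """Return True when the vault holds exactly the rows the source has."""
    source_hashes = sorted(row_hash(r) for r in source_rows)
    vault_hashes = sorted(v["row_hash"] for v in vault_rows)
    return source_hashes == vault_hashes


# Hypothetical source rows; note the untrimmed name is deliberately left alone.
source = [{"id": 1, "name": " alice "}, {"id": 2, "name": "BOB"}]
vault = load_raw(source, record_source="crm")
is_reconciled = reconcile(source, vault)
```

Because nothing is cleansed on the way in, any difference found by the reconciliation points to a loading problem rather than to a business rule.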

Data Mart and Data Cubes

Data Mart: A Data Mart is a type of data store often used to support the presentation layers of the data warehouse environment. It contains only the data that is specific to a particular group. For example, the marketing Data Mart may contain only data related to items, customers, and sales. Data Marts are confined to subjects. With a data mart, teams can access data and gain insights faster, because they don’t have to spend time searching within a more complex data warehouse or manually aggregating data from different sources.
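One simple way to picture a data mart is as a subject-specific view over the warehouse. The sketch below uses Python's built-in sqlite3 module with an in-memory database; the table names, columns, and the marketing_mart view are hypothetical and chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- A tiny stand-in for a warehouse holding several subjects.
    CREATE TABLE sales (sale_id INTEGER, customer TEXT, item TEXT, amount REAL);
    CREATE TABLE payroll (employee TEXT, salary REAL);
    INSERT INTO sales VALUES (1, 'Alice', 'Laptop', 1200.0), (2, 'Bob', 'Mouse', 25.0);
    INSERT INTO payroll VALUES ('Carol', 5000.0);

    -- Hypothetical marketing mart: exposes only item, customer, and sales data.
    CREATE VIEW marketing_mart AS
        SELECT customer, item, amount FROM sales;
""")

mart_rows = conn.execute(
    "SELECT customer, item, amount FROM marketing_mart ORDER BY customer"
).fetchall()
```

The marketing team queries marketing_mart and never has to look at payroll, which stays in the wider warehouse.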

Data Cube: A Data Cube represents data in multiple dimensions. It is defined by dimensions and facts: the dimensions are the entities with respect to which an enterprise keeps its records, and the facts are the numeric measures recorded for them. Data cubes are used to represent data that is too complex to be described by a table of columns and rows. Despite the name, data cubes can go far beyond 3-D to include many more dimensions.
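To make dimensions and facts concrete, here is a minimal sketch of a three-dimensional cube held as plain Python tuples, with a roll-up that sums the measure over whichever dimensions are dropped. The dimension names, the revenue figures, and the roll_up helper are assumptions for the example; real OLAP engines store and index cubes very differently.

```python
from collections import defaultdict

# Hypothetical fact rows: three dimensions (region, product, quarter)
# followed by one measure (revenue).
DIMENSIONS = ("region", "product", "quarter")
facts = [
    ("EU", "Laptop", "Q1", 100),
    ("EU", "Laptop", "Q2", 150),
    ("EU", "Mouse", "Q1", 20),
    ("US", "Laptop", "Q1", 200),
]


def roll_up(rows, keep):
    """Sum the measure over every dimension that is not listed in `keep`."""
    positions = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[i] for i in positions)
        totals[key] += row[-1]  # the measure is the last element of each row
    return dict(totals)


by_region = roll_up(facts, ["region"])
by_region_product = roll_up(facts, ["region", "product"])
```

Adding a fourth or fifth dimension only means adding another element to each tuple and another name to DIMENSIONS, which is why cubes are not limited to three dimensions.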

Data in an organization may include:

  • Structured Data: Data made up of clearly defined data types with patterns that make it easily searchable, such as Invoices, Receipts, Sensor Data, Online Forms, Spreadsheets, and CRM Profiles
  • Semi-Structured Data: CSV, Logs, XML, and JSON formats
  • Unstructured Data: Social Media Content, Emails, Podcasts, Security Footage, Transcripts, PDFs, Images, Audio and Video
