What is a Data Lakehouse?
A Data Lakehouse is a next-generation analytics platform architecture.
It combines the flexibility of a data lake with the governance and performance of a data warehouse (DWH).

Why the Data Lakehouse was born
The Data Lakehouse was born to overcome the drawbacks of the data lake and the data warehouse.
A data lake can store all raw data cheaply in object storage.
However, it is poor at transactional updates and deletes, and it cannot run fast queries.
A data warehouse delivers high query performance because its storage and queries are optimized.
However, its storage cost is high, and it struggles to handle unstructured data.
A Data Lakehouse stores raw data at low cost while also providing reliability and fast analysis.
Core elements of a Data Lakehouse
Open columnar format
An open columnar format is a layout that organizes data by column.
Because the data is grouped by column, a query can load only the columns it needs.
For example, consider the following table.
UserID | Age | Country
---|---|---
1 | 25 | JP
2 | 30 | US
3 | 28 | UK
In an open columnar format, the same data can be laid out as follows.
-- Column Chunk: UserID
[1, 2, 3]
-- Column Chunk: Age
[25, 30, 28]
-- Column Chunk: Country
["JP", "US", "UK"]
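The pivot from rows to column chunks can be sketched in a few lines of Python. This is a toy illustration of the idea, not how a real format such as Parquet encodes data on disk; the function name `to_column_chunks` is made up for this example.

```python
# Toy sketch: pivot row-oriented records into per-column chunks,
# so a query can read only the columns it needs.
rows = [
    {"UserID": 1, "Age": 25, "Country": "JP"},
    {"UserID": 2, "Age": 30, "Country": "US"},
    {"UserID": 3, "Age": 28, "Country": "UK"},
]

def to_column_chunks(rows):
    """Group the values of each column into its own chunk."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns

chunks = to_column_chunks(rows)
# Reading only the Age column touches a single chunk:
print(chunks["Age"])  # → [25, 30, 28]
```

A query like `SELECT avg(Age) FROM users` only needs the `Age` chunk, so the other columns are never read from storage.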
Transaction log layer
The transaction log layer is a structure that records all operations—additions, updates, and deletions—along with information about when, by whom, and where each action was performed.
This structure makes it possible to roll back even if a process fails midway,
and to avoid conflicts when updates run in parallel.
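The rollback behavior can be sketched with a toy append-only log. The `TransactionLog` class and its methods are hypothetical, invented for this illustration; real lakehouse formats (e.g. Delta Lake) keep the log as files in object storage, but the principle is the same: table state is rebuilt only from committed entries, so uncommitted work simply never becomes visible.

```python
# Toy transaction log (hypothetical API): every operation is appended
# with who did it and when; rollback discards uncommitted entries.
import datetime

class TransactionLog:
    def __init__(self):
        self.committed = []   # durable history
        self.pending = []     # entries of the in-flight transaction

    def record(self, op, key, value, user):
        self.pending.append({
            "op": op, "key": key, "value": value, "user": user,
            "at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        })

    def commit(self):
        self.committed.extend(self.pending)
        self.pending.clear()

    def rollback(self):
        # State is rebuilt only from committed entries, so dropping
        # the pending ones undoes the half-finished work.
        self.pending.clear()

    def replay(self):
        """Rebuild the current table state from the committed log."""
        state = {}
        for e in self.committed:
            if e["op"] == "delete":
                state.pop(e["key"], None)
            else:  # add / update
                state[e["key"]] = e["value"]
        return state

log = TransactionLog()
log.record("add", 1, {"Age": 25}, user="alice")
log.commit()
log.record("update", 1, {"Age": 26}, user="bob")
log.rollback()       # the update never becomes visible
print(log.replay())  # → {1: {'Age': 25}}
```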
Distributed analysis engine
A distributed analysis engine is a query engine that processes huge volumes of data in parallel across multiple machines.
It reduces the load on each individual machine and responds quickly even to many simultaneous analysis requests.
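The split-then-merge pattern behind such engines can be sketched on a single machine with Python's standard `concurrent.futures`. Real engines distribute the partitions across many machines; here threads stand in for workers, and the function names are made up for this example.

```python
# Single-machine sketch of a distributed aggregation: split the data
# into partitions, aggregate each partition in parallel, then merge
# the partial results. Real engines do the same across machines.
from concurrent.futures import ThreadPoolExecutor

def partial_sum(partition):
    """Aggregate one partition independently (the 'map' side)."""
    return sum(partition)

def parallel_sum(data, workers=4):
    # Split the data into roughly equal partitions.
    size = max(1, len(data) // workers)
    partitions = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(partial_sum, partitions)
    # Merge the per-partition results (the 'reduce' side).
    return sum(partials)

ages = [25, 30, 28, 41, 19, 33]
print(parallel_sum(ages))  # → 176
```

Each worker only ever sees its own partition, which is why adding machines lets the engine handle more data without any single node becoming a bottleneck.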
Metadata catalog
A metadata catalog is a structure for managing metadata such as table names, column names, partition information, and access permissions.
Using a metadata catalog makes data easy to find and enables access control for security.
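Both roles of the catalog, discovery and access control, can be sketched with a plain dictionary. The catalog structure and function names here are hypothetical; production catalogs (e.g. Hive Metastore or AWS Glue) store the same kinds of entries in a dedicated service.

```python
# Minimal metadata catalog sketch: one place that maps table names
# to their schema, partition keys, and allowed readers.
CATALOG = {
    "users": {
        "columns": ["UserID", "Age", "Country"],
        "partitioned_by": ["Country"],
        "allowed_users": {"alice", "bob"},
    },
}

def find_tables_with_column(column):
    """Data discovery: which tables contain this column?"""
    return [t for t, meta in CATALOG.items() if column in meta["columns"]]

def can_read(user, table):
    """Access control: is this user allowed to read the table?"""
    meta = CATALOG.get(table)
    return meta is not None and user in meta["allowed_users"]

print(find_tables_with_column("Country"))  # → ['users']
print(can_read("alice", "users"))          # → True
print(can_read("mallory", "users"))        # → False
```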