This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

update design_doc.md
yenjuw committed Feb 25, 2024
1 parent 0f16a38 commit e276686
Showing 2 changed files with 8 additions and 11 deletions.
Binary file modified doc/assets/system-architecture.png
19 changes: 8 additions & 11 deletions doc/design_doc.md
@@ -8,22 +8,20 @@
### Goal
The goal of this project is to design and implement a **Catalog Service** for an OLAP database system. The Catalog manages metadata and provides a centralized repository for information about the structure and organization of data within the OLAP database. The project aims to produce a functional catalog that adheres to [the Iceberg catalog specification](https://iceberg.apache.org/spec/) and is exposed through the [REST API](https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml).
## Architectural Design
We follow the logical model described below. Input to our service comes from the execution engine and the I/O service, and we provide metadata to the planner and the scheduler.
We follow the logical model described below. Input to our service comes from the execution engine and the I/O service, and we provide metadata to the planner and the scheduler. We will use pickleDB as the key-value store, persisting two kinds of (key, value) pairs in local db files: (namespace, tables) and (table_name, metadata). [TODO: server part]
![system architecture](./assets/system-architecture.png)
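The two kinds of (key, value) pairs can be illustrated with a small sketch. It uses a hypothetical `KVStore` class backed by JSON files as a stand-in; it is not the pickleDB API, and the file names, sample namespace, and metadata values are assumptions made only for illustration.

```python
import json
import os
import tempfile


class KVStore:
    """Hypothetical file-backed key-value store (a stand-in, NOT the pickleDB API)."""

    def __init__(self, path):
        self.path = path
        self.data = {}
        # Reload previously dumped pairs so they survive a restart.
        if os.path.exists(path):
            with open(path) as f:
                self.data = json.load(f)

    def set(self, key, value):
        self.data[key] = value
        self._dump()

    def get(self, key):
        return self.data.get(key)

    def _dump(self):
        # Rewrite the whole file on every update; fine for a sketch.
        with open(self.path, "w") as f:
            json.dump(self.data, f)


workdir = tempfile.mkdtemp()
namespaces = KVStore(os.path.join(workdir, "namespaces.db"))  # (namespace, tables)
tables = KVStore(os.path.join(workdir, "tables.db"))          # (table_name, metadata)

namespaces.set("sales", ["orders"])
tables.set("orders", {"table-uuid": "hypothetical-uuid", "location": "/tmp/orders"})

# Simulate a service restart by reopening the table store from its file.
reopened = KVStore(os.path.join(workdir, "tables.db"))
```

One consequence of this layout is that metadata is keyed by table name alone, so two namespaces could not hold tables with the same name unless the key also included the namespace.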
### Data Model
We adhere to the Iceberg data model, arranging tables based on namespaces, with each table uniquely identified by its name.

For every table in the catalog, there is an associated metadata file. This file contains a collection of manifests, each of which references the table's information at a different point in time. The manifest collection is held in memory rather than persisted, and is recreated from on-disk files when the service restarts. (If it is not updated frequently, we could dump it to disk on every update.)

To reduce startup and recovery times, we periodically save the in-memory index to disk, so that the service can restore its state from the dumped index data.
![Catalog Data Model](./assets/iceberg-metadata.png)
Every namespace in the database has an associated list of tables.
Every table in the catalog has associated metadata, including statistics, version, table-uuid, location, last-column-id, schema, and partition-spec.
[TODO: data model (struct of metadata)]
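As one possible shape for the metadata struct, the sketch below collects the listed fields in a Python dataclass. The field types and default values are assumptions, since the text only names the fields; `to_record` is a hypothetical helper that restores the hyphenated names used above.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List


@dataclass
class TableMetadata:
    # Field names follow the list in the text; types and defaults are assumed.
    table_uuid: str
    location: str
    version: int = 1
    last_column_id: int = 0
    schema: Dict[str, Any] = field(default_factory=dict)
    partition_spec: List[Dict[str, Any]] = field(default_factory=list)
    statistics: Dict[str, Any] = field(default_factory=dict)

    def to_record(self) -> Dict[str, Any]:
        """Convert to a plain dict with hyphenated keys such as 'table-uuid'."""
        return {key.replace("_", "-"): value for key, value in asdict(self).items()}


meta = TableMetadata(
    table_uuid="hypothetical-uuid",
    location="/tmp/orders",
    last_column_id=2,
    schema={"fields": [
        {"id": 1, "name": "id", "type": "long"},
        {"id": 2, "name": "amount", "type": "double"},
    ]},
)
record = meta.to_record()
```

A record in this form is a plain dictionary, so it could be stored directly as the value of a (table_name, metadata) pair.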

### Use Cases
#### Namespace
create/update/delete namespace.
create/delete/rename namespace
#### Table
create/update/delete table
#### Query Table’s Metadata
create/delete/rename table
#### Query Table’s Metadata (including statistics, version, table-uuid, location, last-column-id, schema, and partition-spec)
get metadata by {namespace}/{table}
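The use cases above can be sketched as an in-memory class. `Catalog` and its method names are hypothetical, and the error handling (rejecting duplicates, refusing to delete a non-empty namespace) is an assumption rather than behavior specified in this document.

```python
class Catalog:
    """Hypothetical in-memory catalog covering the listed use cases."""

    def __init__(self):
        # namespace -> {table name -> metadata dict}
        self.namespaces = {}

    def create_namespace(self, ns):
        if ns in self.namespaces:
            raise KeyError(f"namespace already exists: {ns}")
        self.namespaces[ns] = {}

    def delete_namespace(self, ns):
        # Assumed policy: only empty namespaces may be deleted.
        if self.namespaces[ns]:
            raise ValueError(f"namespace not empty: {ns}")
        del self.namespaces[ns]

    def rename_namespace(self, old, new):
        if new in self.namespaces:
            raise KeyError(f"namespace already exists: {new}")
        self.namespaces[new] = self.namespaces.pop(old)

    def create_table(self, ns, table, metadata):
        tables = self.namespaces[ns]
        if table in tables:
            raise KeyError(f"table already exists: {ns}/{table}")
        tables[table] = metadata

    def delete_table(self, ns, table):
        del self.namespaces[ns][table]

    def rename_table(self, ns, old, new):
        tables = self.namespaces[ns]
        if new in tables:
            raise KeyError(f"table already exists: {ns}/{new}")
        tables[new] = tables.pop(old)

    def get_metadata(self, ns, table):
        # Lookup by {namespace}/{table}; raises KeyError if either is missing.
        return self.namespaces[ns][table]


catalog = Catalog()
catalog.create_namespace("sales")
catalog.create_table("sales", "orders", {"version": 1})
catalog.rename_table("sales", "orders", "orders_v2")
catalog.rename_namespace("sales", "retail")
found = catalog.get_metadata("retail", "orders_v2")
```

Nesting tables under their namespace keeps the {namespace}/{table} lookup to two dictionary accesses and lets a namespace rename carry its tables along in one step.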

## Design Rationale
@@ -32,11 +30,10 @@ get metadata by {namespace}/{table}
  * Data durability mechanisms will be implemented to prevent data loss during restarts.
* Performance:
  * Optimized data retrieval and storage strategies minimize latency in metadata access.
  * Efficient indexing mechanisms, such as Bloom filters, enhance query performance.
  * Partitioning strategies facilitate data pruning and improve query execution performance.
* Engineering Complexity / Maintainability:
  * Centralized metadata management is achieved by separating data from metadata, which reduces complexity and facilitates consistent metadata handling.
  * Code modularity and clear interfaces facilitate easier updates and improvements.
  * We adopt an existing key-value store (pickleDB) and web server framework (Rocket) to reduce engineering complexity.
* Testing:
  * Comprehensive testing plans cover correctness through unit tests and performance through long-running regression tests. Unit tests focus on individual components of the catalog service, while regression tests evaluate system-wide performance and stability.
* Other Implementations: