Add IcebergDocument as one implementation of VirtualDocument #3147

bobbai00 · 2024-12-10T16:38:08Z

This PR introduces an implementation of result storage using Apache Iceberg.

How to enable the Iceberg result storage

In storage-config.yaml:

- Set result-storage-mode to iceberg.
- Configure the storage.iceberg.catalog.jdbc section:

iceberg:
  catalog:
    jdbc: # currently the catalog info can only be stored via JDBC, see https://iceberg.apache.org/docs/1.7.1/jdbc/
      url: "jdbc:mysql://localhost:3306/texera_iceberg?serverTimezone=UTC"
      username: ""
      password: ""

- Make sure the database is reachable over JDBC with the configured url, username, and password.
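One way to check the last step before starting the service is a small connectivity probe. The sketch below is not part of this PR. It assumes only the standard java.sql API plus a JDBC driver for the target database on the classpath, and the object and method names (JdbcProbe, canConnect) are made up for illustration.

```scala
import java.sql.DriverManager
import scala.util.{Try, Using}

object JdbcProbe {
  // Hypothetical helper: returns true only if a connection can be opened
  // and the database answers a validity check within 5 seconds.
  // Any failure (no driver, bad credentials, unreachable host) yields false.
  def canConnect(url: String, username: String, password: String): Boolean =
    Try {
      Using.resource(DriverManager.getConnection(url, username, password)) { conn =>
        conn.isValid(5)
      }
    }.getOrElse(false)
}
```

For the configuration shown above, the call would be JdbcProbe.canConnect("jdbc:mysql://localhost:3306/texera_iceberg?serverTimezone=UTC", "", ""), with the real username and password filled in.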

Major changes

Introduced IcebergDocument: a thread-safe implementation of VirtualDocument for storing and reading results in Iceberg tables.
Introduced IcebergTableWriter: an append-only writer for Iceberg tables with configurable buffer size.
Added support for new configuration properties under storage.iceberg to specify catalog and table settings.

Introduced Dependencies

The following packages are added to workflow-core:

- Iceberg-related packages.
- Hadoop Common. This dependency is needed only to pass compilation: in the source code of iceberg-parquet (line 160), a Hadoop Configuration() is still created as a placeholder even when the file is not of type HadoopOutputFile. At runtime, there is no dependency on Hadoop or HDFS.

Overview of the behavior of IcebergDocument and IcebergTableWriter

IcebergDocument:
- Handles reading and managing data in Iceberg tables.
- Initializes the table during construction, creating it if it does not exist or overriding it if specified.
- Supports iterator-based incremental read operations.
- Thread-safe for read and clear operations.
IcebergTableWriter:
- Writes data to Iceberg tables in an append-only manner.
- Creates new Parquet files for every buffer flush, ensuring immutability.
- Not thread-safe, so it should only be accessed by one thread at a time.
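The writer behavior described above can be sketched with a toy, in-memory model. This is not the PR's IcebergTableWriter and uses no Iceberg API: ToyTable and ToyBufferedWriter are hypothetical names, and a "file" here is just an immutable Vector standing in for a Parquet file.

```scala
import scala.collection.mutable.ArrayBuffer

// In-memory stand-in for a table: each element is one immutable "data file".
final class ToyTable[T] {
  private var files: Vector[Vector[T]] = Vector.empty
  def appendFile(rows: Vector[T]): Unit = synchronized { files = files :+ rows }
  def snapshot: Vector[Vector[T]] = synchronized { files }
}

// Append-only writer with a configurable buffer size. Not thread-safe:
// each writer instance is meant to be used by a single thread.
final class ToyBufferedWriter[T](table: ToyTable[T], bufferSize: Int) {
  require(bufferSize > 0, "bufferSize must be positive")
  private val buffer = ArrayBuffer.empty[T]

  def putOne(item: T): Unit = {
    buffer += item
    if (buffer.size >= bufferSize) flush()
  }

  // Every non-empty flush produces a new file; existing files are never modified.
  def flush(): Unit =
    if (buffer.nonEmpty) {
      table.appendFile(buffer.toVector)
      buffer.clear()
    }

  // Closing flushes whatever is still buffered.
  def close(): Unit = flush()
}
```

Because each flush only adds a new file, several such writers (one per thread) can append to the same table without ever touching each other's data.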

How the result will be stored via Iceberg tables

Given a storage key, a table named key will be created.
To append tuples to the table key, each worker appends immutable Parquet files to the table's data space using IcebergTableWriter. To avoid Parquet filename collisions, each worker prefixes the files it creates with ${workerIndex}_${fileIndex}, where workerIndex is the worker's index and fileIndex is a counter maintained by the writer that increases by 1 every time a new data file is created and flushed.
To read the tuples, the reader uses the iterator returned by IcebergDocument.get. This iterator can incrementally read new data while writers are appending tuples.
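The naming scheme and the incremental read can be illustrated with a small self-contained sketch. It makes no claim about how IcebergDocument.get is implemented: FileNamer and IncrementalReader are hypothetical names, and the list of files is supplied by a plain function instead of an Iceberg table scan.

```scala
// Hypothetical helper: generates per-worker file names of the form
// "<workerIndex>_<fileIndex>", so two workers never pick the same name.
final class FileNamer(workerIndex: Int) {
  private var fileIndex = 0
  def next(): String = {
    val name = s"${workerIndex}_${fileIndex}"
    fileIndex += 1
    name
  }
}

// Toy iterator that remembers how many files it has consumed and, when it runs
// out of rows, re-lists the files to pick up any that were appended meanwhile.
final class IncrementalReader[T](listFiles: () => Vector[Vector[T]]) extends Iterator[T] {
  private var filesSeen = 0
  private var current: Iterator[T] = Iterator.empty

  override def hasNext: Boolean = {
    while (!current.hasNext) {
      val files = listFiles()
      if (filesSeen >= files.size) return false
      current = files(filesSeen).iterator
      filesSeen += 1
    }
    true
  }

  override def next(): T = {
    if (!hasNext) throw new NoSuchElementException("no more rows")
    current.next()
  }
}
```

In this model a reader that has reached the end can simply be asked hasNext again later: if a writer has appended another file in the meantime, the same iterator continues from where it stopped.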

# Conflicts: # core/amber/src/main/scala/edu/uci/ics/texera/workflow/WorkflowCompiler.scala # core/workflow-core/src/main/scala/edu/uci/ics/amber/core/storage/result/OpResultStorage.scala # core/workflow-operator/src/main/scala/edu/uci/ics/amber/operator/sink/managed/ProgressiveSinkOpExec.scala

shengquan-ni · 2025-01-03T21:19:58Z

...flow-core/src/main/scala/edu/uci/ics/amber/core/storage/result/iceberg/IcebergDocument.scala

+  @transient lazy val catalog: Catalog = IcebergCatalogInstance.getInstance()
+
+  // During construction, create or override the table
+  synchronized {


This synchronized block is unnecessary, as it only locks this instance.
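As an illustration of this point (a sketch, not code from this PR): a synchronized block written inside a class locks the monitor of that one instance, so it does not exclude a thread that works with a different instance, even one that refers to the same table. The class, object, and method names below are hypothetical, and the shared lock is only one possible alternative when cross-instance exclusion is actually needed.

```scala
final class Doc {
  // Mirrors the pattern under review: a block locked on `this`.
  def whileLockedOnSelf[A](body: => A): A = this.synchronized(body)
}

object Doc {
  // Hypothetical alternative: one lock shared by every instance.
  private val sharedLock = new Object
  def whileLockedGlobally[A](body: => A): A = sharedLock.synchronized(body)
  def holdsGlobalLock: Boolean = Thread.holdsLock(sharedLock)
}
```

With two instances a and b, a.whileLockedOnSelf(Thread.holdsLock(a)) is true while a.whileLockedOnSelf(Thread.holdsLock(b)) is false: holding the lock of a says nothing about b.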

bobbai00 requested a review from shengquan-ni December 10, 2024 16:38

bobbai00 self-assigned this Dec 10, 2024

bobbai00 requested a review from Yicong-Huang December 10, 2024 16:39

shengquan-ni reviewed Dec 10, 2024

...orkflow-core/src/main/scala/edu/uci/ics/amber/core/storage/result/ItemizedFileDocument.scala (outdated, resolved)

shengquan-ni reviewed Dec 10, 2024

...orkflow-core/src/main/scala/edu/uci/ics/amber/core/storage/result/ItemizedFileDocument.scala (outdated, resolved)

bobbai00 force-pushed the jiadong-add-file-result-storage branch 2 times, most recently from 6522779 to a83d779 Compare December 14, 2024 00:14

bobbai00 force-pushed the jiadong-add-file-result-storage branch 2 times, most recently from 1edb551 to cef347b Compare December 21, 2024 02:56

bobbai00 changed the title ~~Add PartitionDocument and ItemizedFileDocument~~ Add IcebergDocument as one implementation of VirtualDocument that can be used to store operator results Dec 22, 2024

bobbai00 changed the title ~~Add IcebergDocument as one implementation of VirtualDocument that can be used to store operator results~~ Add IcebergDocument as one implementation of VirtualDocument Dec 22, 2024

bobbai00 added 19 commits December 22, 2024 16:10

add itemized file document and partition document

1fe9f17

add unit test for PartitionDocument

219b82d

add more to unit tests

e446e9c

make PartitionDocument return T

9627b25

fix partition document test

b85fd45

refining the documents

8e6fec3

add type R to PartitionedItemizedFileDocument

288aea4

do a rename

c3a1d00

adding the arrow file document, TODO: fix the test

97c601e

pass the compilation

e2c5515

finish arrow document

c17a54e

start to add some iceberg related

bc38cc4

finish initial iceberg writer

51dd7cf

finish initial version of iceberg

481c437

refactor test parts

0274f66

finish 1st viable version

4663fef

fix the append read

9607f98

finish append read

d2d0ed7

finish concurrent write test

f4ea0e3

bobbai00 added 7 commits December 30, 2024 09:39

clean up the iceberg document

1be10bf

clean up the iceberg writer

7adfda4

add more comments on the iceberg util

4617564

add more comments

13731cb

refactor local file IO

2baa661

merge master

e105913

bobbai00 requested a review from shengquan-ni December 30, 2024 18:37

bobbai00 added 21 commits December 30, 2024 12:41

Merge branch 'master' into jiadong-add-file-result-storage

4cf144b

# Conflicts: # core/workflow-core/src/main/scala/edu/uci/ics/amber/core/storage/result/OpResultStorage.scala

cleanup the config

9b69f59

Merge remote-tracking branch 'origin/jiadong-add-file-result-storage'…

60445e6

… into jiadong-add-file-result-storage

cleanup the clear logic

9a482b1

fmt

decab8d

refactor the test to use the test db

9cb2674

make the test harder

51d8a1e

make the test more clean

39b0448

Merge branch 'master' into jiadong-add-file-result-storage

2655dae

# Conflicts: # core/workflow-operator/src/main/scala/edu/uci/ics/amber/operator/SpecialPhysicalOpFactory.scala # core/workflow-operator/src/main/scala/edu/uci/ics/amber/operator/sink/ProgressiveSinkOpExec.scala

incorporate worker idx to sink

73106dd

add format version and row lineage to the iceberg table

a2e53b5

Merge branch 'master' into jiadong-add-file-result-storage

cffafe0

Revert "add format version and row lineage to the iceberg table"

f54e38c

This reverts commit a2e53b5.

fix iceberg util spec

7176864

try to add the record id

76dd31c

try debugging the test

31070be

half way to have a consistent order

1156db4

fix the get range

c712c1d

fix the get's refresh

a16ff80

add getAfter test

d2e710f

remove redundant dependency

a8bb3db

shengquan-ni reviewed Jan 3, 2025
