For the deletes, we need a broader discussion on where the responsibilities lie between iceberg-rust and the query engine.
On the read side, Tasks are passed to the query engine. I like this clean boundary between the engine and the library, and I would love to have a similar API for deletes. As on the read path, the library would come up with a set of tasks that are handed to the query engine, which writes out the files and returns the DataFiles with all the statistics and such.
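To make that boundary concrete, here is a minimal sketch of what a task-based delete API could look like. Everything below is hypothetical: `DeleteTask`, `plan_delete`, and `commit_delete` are illustrative names, not the actual iceberg-rust API.

```rust
// Hypothetical sketch only; none of these types exist in iceberg-rust today.

/// A unit of delete work the library hands to the query engine.
enum DeleteTask {
    /// The delete is fully covered by metadata; nothing for the engine to do.
    MetadataOnly { data_file_path: String },
    /// The engine must rewrite this file without the matching rows.
    RewriteFile { data_file_path: String, predicate: String },
}

/// What the engine reports back after executing a `RewriteFile` task.
struct DataFile {
    path: String,
    record_count: u64,
    // ...column statistics, partition values, etc.
}

/// The library plans: evaluate partition and file statistics to decide,
/// per file, whether it can be dropped from metadata or must be rewritten.
fn plan_delete(_predicate: &str) -> Vec<DeleteTask> {
    unimplemented!()
}

/// The library commits: drop the old manifest entries and add the
/// rewritten DataFiles in a single snapshot.
fn commit_delete(_removed: Vec<String>, _rewritten: Vec<DataFile>) {
    unimplemented!()
}
```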
The current focus of #700 is adding DataFiles, which is reasonable for engines to take control over. As a next step, we need to add delete operations. Here it gets more complicated: sometimes the delete can be performed purely on Iceberg metadata (e.g. dropping a partition), but sometimes certain Parquet files need to be rewritten. In that case, the old DataFile is dropped, and one or more DataFiles are added once the engine has rewritten the Parquet files, excluding the rows that need to be dropped.
When doing a delete, the following steps are taken:
First, based on the partition predicates, it is determined whether a whole partition can be dropped. If so, the whole manifest will be read and marked as deleted.
Second, the manifest will be opened, and based on the statistics of each manifest entry we can determine whether the whole file can be deleted; if so, it will be marked as deleted.
Third, we have to pass the file to the query engine to check whether it needs to be rewritten. The query engine can leverage the Parquet bloom filters to see if a rewrite is needed at all; if so, it can go over the row groups to check which of them are affected, and start rewriting the file. It is possible that the original file can be kept (because no rows are deleted), or we need to drop the old manifest entry and add a new one for the rewritten file that no longer contains the records we want to drop.
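Putting these steps together, the per-file decision reduces to a three-way classification. Below is a minimal illustrative sketch of that logic for an equality delete predicate, using only a column's lower/upper bounds and null count from the manifest entry; the names are mine, not the library's.

```rust
/// Illustrative only: what the planner decides for each data file
/// under a delete predicate such as `col = value`.
#[derive(Debug, PartialEq)]
enum FileAction {
    /// Every row matches the predicate: drop the file via metadata only.
    Drop,
    /// No row can match the predicate: keep the file untouched.
    Keep,
    /// Some rows may match: hand the file to the engine for a rewrite.
    Rewrite,
}

/// Decide using the manifest entry's statistics for the column.
fn classify(min: i64, max: i64, null_count: u64, value: i64) -> FileAction {
    if value < min || value > max {
        // Outside the value range: ROWS_CANNOT_MATCH, nothing to delete here.
        FileAction::Keep
    } else if min == max && min == value && null_count == 0 {
        // Every row equals the value: the whole file can be marked deleted.
        FileAction::Drop
    } else {
        // Ambiguous: bloom filters and row groups decide during the rewrite.
        FileAction::Rewrite
    }
}

fn main() {
    assert_eq!(classify(10, 20, 0, 5), FileAction::Keep);   // predicate misses the range
    assert_eq!(classify(5, 5, 0, 5), FileAction::Drop);     // all rows match
    assert_eq!(classify(1, 9, 0, 5), FileAction::Rewrite);  // needs the engine
}
```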
As you might notice, this is pretty similar to the read path, except that we need to invert the evaluators. For the read path, we check for `ROWS_MIGHT_MATCH` to include a file in the query plan. For the delete use case, we need to determine the opposite, namely `ROWS_CANNOT_MATCH`. Therefore we need to extend the evaluators:
- Strict projection needs to be added to the transforms (see the sketch after this list).
- A Strict Metrics Evaluator to determine if the predicate cannot match.
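To illustrate the difference between the two projections: an inclusive projection may over-select (any partition that might contain matching rows), while a strict projection must guarantee that every row in a selected partition matches. Below is a hedged sketch for two transforms; the logic mirrors the Iceberg spec's definitions, but the types and the modulo stand-in for the real murmur3 bucket hash are illustrative.

```rust
// Illustrative sketch, not iceberg-rust's API. `None` means the predicate
// has no strict projection for that transform.

/// Strict projection of `col = v`.
fn strict_project_eq(transform: &str, v: i64) -> Option<String> {
    match transform {
        // identity: the partition value *is* the row value, so `part = v`
        // guarantees every row in the partition equals v.
        "identity" => Some(format!("part = {v}")),
        // bucket: a partition whose bucket equals bucket(v) can still hold
        // other values, so equality cannot be strictly projected.
        "bucket" => None,
        _ => None,
    }
}

/// Strict projection of `col != v`.
fn strict_project_not_eq(transform: &str, v: i64, num_buckets: i64) -> Option<String> {
    // Stand-in for the real bucket hash (Iceberg uses murmur3).
    let bucket = |x: i64| x.rem_euclid(num_buckets);
    match transform {
        "identity" => Some(format!("part != {v}")),
        // bucket: if a partition's bucket differs from bucket(v), no row in
        // it can equal v, so every row satisfies `col != v`.
        "bucket" => Some(format!("part != {}", bucket(v))),
        _ => None,
    }
}

fn main() {
    assert_eq!(strict_project_eq("bucket", 5), None);
    assert_eq!(strict_project_not_eq("bucket", 5, 16), Some("part != 5".into()));
}
```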
Once this is ready, we can incorporate this into the write path, and also easily add the update operation (append + delete).
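Once that is in place, update could indeed be composed from the two primitives inside one transaction. A hypothetical sketch, with stand-in `Transaction` and `DataFile` types rather than the current iceberg-rust ones:

```rust
// Hypothetical composition of update = delete + append; the types here
// are stand-ins, not iceberg-rust's actual API.
struct DataFile {
    path: String,
}

#[derive(Default)]
struct Transaction {
    delete_predicates: Vec<String>,
    appended_files: Vec<DataFile>,
}

impl Transaction {
    fn delete(&mut self, predicate: &str) {
        // Would produce the delete tasks described above.
        self.delete_predicates.push(predicate.to_string());
    }
    fn append(&mut self, files: Vec<DataFile>) {
        // Would register the DataFiles the engine wrote out.
        self.appended_files.extend(files);
    }
    fn commit(self) {
        // Would produce one snapshot containing both the removed entries
        // and the newly added DataFiles, making the update atomic.
    }
}

/// Update = delete the old rows, append the rewritten files, one commit.
fn update(mut txn: Transaction, predicate: &str, rewritten: Vec<DataFile>) {
    txn.delete(predicate);
    txn.append(rewritten);
    txn.commit();
}

fn main() {
    let txn = Transaction::default();
    update(txn, "col = 5", vec![DataFile { path: "s3://bucket/new.parquet".into() }]);
}
```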