-
Notifications
You must be signed in to change notification settings - Fork 166
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: support position delete writer #704
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are two kinds of writers in iceberg:
- Plain position delete writer: https://github.com/apache/iceberg/blob/da2ad389fd9ba8222f6fb3f57922209c239a7045/core/src/main/java/org/apache/iceberg/deletes/PositionDeleteWriter.java#L49
- Sorting position delete writer:
https://github.com/apache/iceberg/blob/da2ad389fd9ba8222f6fb3f57922209c239a7045/core/src/main/java/org/apache/iceberg/deletes/PositionDeleteWriter.java#L49
It seems that this pr tries to implement 2, while there are some missing part there. I would suggest to implement 1 first as it's easier, what do you think?
crates/iceberg/src/arrow/schema.rs
Outdated
@@ -607,6 +607,19 @@ impl SchemaVisitor for ToArrowSchemaConverter { | |||
} | |||
} | |||
|
|||
/// Convert iceberg field to an arrow field. | |||
pub fn field_to_arrow_field(field: &crate::spec::NestedFieldRef) -> Result<FieldRef> { | |||
let mut converter = ToArrowSchemaConverter; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The implementation is a little hack to me. How about just create a one field schema, and convert it using arrow schema, then get the result?
fn write<'life0, 'async_trait>( | ||
&'life0 mut self, | ||
input: PositionDeleteInput<'a>, | ||
) -> ::core::pin::Pin< | ||
Box<dyn ::core::future::Future<Output = Result<()>> + ::core::marker::Send + 'async_trait>, | ||
> | ||
where | ||
'life0: 'async_trait, | ||
Self: 'async_trait, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove these auto generated lifetime markers and prefix of types
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For here we use a sync version so that seems we need to explicitly declare these auto-generated lifetime.
The reason we need the sync version is that the input takes the reference like: struct PositionDeleteInput<'a>
, we need to explicitly convert it into a record batch in the sync function part and then return a async future to write this record batch.
} | ||
|
||
/// The memory position delete writer. | ||
pub struct MemoryPositionDeleteWriter<B: FileWriterBuilder> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pub struct MemoryPositionDeleteWriter<B: FileWriterBuilder> { | |
pub struct PositionDeleteWriter<B: FileWriterBuilder> { |
I don't think we should add a Memory
prefix here since it make people feel that we are storing everything in memory, and it applies to all structs.
Is position delete must be sorted or it just be optional? From the iceberg spec, it looks like it must be sorted. https://iceberg.apache.org/spec/#position-delete-files:~:text=The%20rows%20in%20the%20delete%20file%20must%20be%20sorted%20by%20file_path%20then%20pos%20to%20optimize%20filtering%20rows%20while%20scanning. |
Yes, it's required in spec, but some compute engine could sort this before passing to writer, and writer doesn't need to handle sorting itself. |
Make sense. Let's implement 1 first |
Complete #340