
[Spark] Add extra metrics for insert overwrite operation #4047

Conversation

Alexvsalexvsalex
Contributor

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Description

When INSERT OVERWRITE is used, it deletes the existing data and writes new data. We have statistics about the newly written data, but it is unknown how much data was deleted.
This PR enriches the statistics with two extra metrics, numRemovedFiles and numRemovedBytes, to give a basic understanding of the deleted data.
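The idea behind the two metrics can be sketched as a simple aggregation over the files removed by the overwrite. This is an illustrative Python sketch only; the field names (`path`, `size`) and the helper `removal_metrics` are hypothetical and do not reflect Delta's actual action schema or implementation.

```python
# Hypothetical sketch: aggregating removed-file records into the two new
# metrics proposed in this PR. Field names are illustrative, not Delta's
# actual RemoveFile action schema.

def removal_metrics(removed_files):
    """Return (numRemovedFiles, numRemovedBytes) for a list of removals."""
    num_removed_files = len(removed_files)
    num_removed_bytes = sum(f["size"] for f in removed_files)
    return num_removed_files, num_removed_bytes

removed = [
    {"path": "part-000.parquet", "size": 1024},
    {"path": "part-001.parquet", "size": 2048},
]
print(removal_metrics(removed))  # (2, 3072)
```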

How was this patch tested?

Added new unit tests.

Does this PR introduce any user-facing changes?

The new metrics numRemovedFiles and numRemovedBytes will appear in the table's history for insert overwrite operations.
To restore the old behavior, disable spark.databricks.delta.insertOverwrite.removeMetrics.enabled.
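Since Delta exposes operation metrics in the table history as a string-to-string map, consuming the new metrics might look like the following. This is a hedged sketch: `history_row` is illustrative sample data, not real output from this PR.

```python
# Hypothetical sketch: reading the new metrics from a table-history row.
# Delta's operationMetrics is a string-to-string map; the values below are
# illustrative sample data.

history_row = {
    "operation": "WRITE",
    "operationMetrics": {
        "numFiles": "4",
        "numOutputBytes": "8192",
        "numRemovedFiles": "2",
        "numRemovedBytes": "3072",
    },
}

metrics = history_row["operationMetrics"]
# Fall back to "0" so older history entries without the metrics still parse.
removed_files = int(metrics.get("numRemovedFiles", "0"))
removed_bytes = int(metrics.get("numRemovedBytes", "0"))
print(removed_files, removed_bytes)  # 2 3072
```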

@Alexvsalexvsalex Alexvsalexvsalex force-pushed the add_remove_metrics_for_insert_overwrite branch from e190bb3 to 30e9df0 Compare January 14, 2025 14:02
Contributor

@larsk-db larsk-db left a comment


LGTM

@scottsand-db scottsand-db merged commit b1139d8 into delta-io:master Jan 14, 2025
16 of 19 checks passed