Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FLINK-37021][state/forst] Fix incorrect paths when reusing files for checkpoints. #26040

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

AlexYinHan
Copy link
Contributor

@AlexYinHan AlexYinHan commented Jan 21, 2025

What is the purpose of the change

This pull request addresses several bugs in ForSt that may prevent proper file reuse, leading to slower or failed snapshot and restoration operations.

Brief change log

  • Make DataTransferStrategyBuilder function properly.
    • Use URI to tell whether two FileSystem are the same.
    • Use 'Reuse' strategy for Restoration only in CLAIM mode
    • Use 'Reuse' strategy for Snapshot only when SharingFilesStrategy is 'FORWARD_BACKWARD'
  • Ensure we correctly distinguish between dbFilePath and realSourcePath.
  • During Snapshots: do not copy file if it is already owned by JM.

Verifying this change

This change is already covered by existing tests, such as (please describe tests).

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (no)
  • The serializers: (no)
  • The runtime per-record code paths (performance sensitive): (no)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (yes)
  • The S3 file system connector: (no)

Documentation

  • Does this pull request introduce a new feature? (no)

@AlexYinHan
Copy link
Contributor Author

@flinkbot run azure

@flinkbot
Copy link
Collaborator

flinkbot commented Jan 21, 2025

CI report:

Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run azure re-run the last Azure build

@Zakelly Zakelly self-requested a review January 22, 2025 04:02

// Try path-copying first. If failed, fallback to bytes-copying
StreamStateHandle targetStateHandle =
tryPathCopyingToCheckpoint(sourceStateHandle, checkpointStreamFactory, stateScope);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tryPathCopyingToCheckpoint can fail with an IOException as well as returning null. It would be better to always return null for a failure I think - otherwise for an IOException we will not try bytesCopyingToCheckpoint.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comments. I've modified the code as suggested. PTAL

filePath.getName());
LOG.trace("Write file to checkpoint: {}, {}", filePath, handleAndLocalPath.getHandle());
return handleAndLocalPath;
return HandleAndLocalPath.of(targetStateHandle, dbFilePath.getName());
}

/**
* Duplicate file to checkpoint storage by calling {@link CheckpointStreamFactory#duplicate} if
* possible.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: javadoc @params etc are missing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added params and returns in the javadoc

@Nullable ForStFlinkFileSystem forStFlinkFileSystem,
boolean isDbPathUnderCheckpointPathForSnapshot) {
DataTransferStrategy strategy;
if (forStFlinkFileSystem == null || isDbPathUnderCheckpointPathForSnapshot) {
if (sharingFilesStrategy != SnapshotType.SharingFilesStrategy.FORWARD_BACKWARD
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for ForStIncrementalSnapshotStrategy, the SnapshotType.SharingFilesStrategy.FORWARD is not supported since it breaks the precondition of file sharing between CP and DB.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed. I've added a check in ForStIncrementalSnapshotStrategy#asyncSnapshot(). It throws an IllegalArgumentException when encountering SharingFilesStrategy.FORWARD.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants