Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PID reconcile command #10567

Open
wants to merge 8 commits into
base: develop
Choose a base branch
from

Conversation

johannes-darms
Copy link
Contributor

@johannes-darms johannes-darms commented May 17, 2024

What this PR does / why we need it:

Please refer to #10501

Which issue(s) this PR closes:

Closes #10501

Special notes for your reviewer/ Suggestions on how to test this:**:
Configure two PID Provider via docker-compose-file
Alter this line:

  -Ddataverse.pid.providers=fake,perma

add those lines to the dataverse env:

      
        -Ddataverse.pid.perma.type=perma
        -Ddataverse.pid.perma.label=PermaProvider
        -Ddataverse.pid.perma.permalink.base-url=https://example.org/
        -Ddataverse.pid.perma.permalink.separator=-
        -Ddataverse.pid.perma.authority=identifier

Start up dataverse.
Create a new dataset with a datafile.
Then change the pidProvider of the dataset.

curl -X PUT --location "http://localhost:8080/api/v1/datasets/{id}/pidReconcile/perma" \
    -H "X-Dataverse-key: $KEY" -d 'perma'

Then reconcile the pid via API call:

curl -X PUT --location "http://localhost:8080/api/v1/datasets/{id}/pidReconcile" \
    -H "X-Dataverse-key: $KEY"

Verify that the PID changes in the UI and a notification apprears.

Does this PR introduce a user interface change? If mockups are available, please link/include them here:
No.
Is there a release notes update needed for this change?:
Yes, will be added later.

@coveralls
Copy link

coveralls commented May 17, 2024

Coverage Status

coverage: 20.69% (+0.1%) from 20.59%
when pulling 71f84ee on johannes-darms:feat/reconcilepid
into da3dd95 on IQSS:develop.

@johannes-darms johannes-darms changed the title WIP PID reconcile command PID reconcile command May 21, 2024
// remove dataset PID
try {
if (currentPidProvider.alreadyRegistered(getDataset())) { //if not registered with PIDProvider than there is no need to delete it...
currentPidProvider.deleteIdentifier(getDataset()); // delete it externally
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps it would be better to not delete this, so it remains reserved and can't be used again for new datasets? (Doesn't help for PermaLinks, but there (or for all providers) you could extend the local PID lookup to also check the alternate id field.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While this would resolve some of our problems we end up with a bunch of registered PIDs that will never be used. I'd rather accept the risk (I'm not sure if many are using this feature) and document that an administrator must complete the reconciliation offline, i.e. move the files and delete the alternativeIdentifier from the database.

you could extend the local PID lookup to also check the alternate id field.
Sure do, I believe this is needed anyhow.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still needs to be documented

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about adding a second call to a new AlternativePersistentIdentifier.findByProtocolIdentifierAuthority query in the DVObjectServiceBean.isGlobalIdLocallyUnique method (only if the first query would return true)? (Query could be modelled on DvObject.findIdByAlternativeGlobalId query but avoiding the join since we don't care which dvObject has that alt PID). That would assure that the old id is never reused. (Seems like a generally good idea even though your code may be the first automated way to end up with an alt PID from an active provider. It's possible that someone migrating in data from elsewhere could be reusing

// This keep the link the object store intact, without altering the files.
// IMHO: This command should also update the storage. First, maintenance of the storage becomes a mess if there is a counterintuitive layout.
// Second, it could occur that another object is minted with the old identifier and consequently we have a conflict in our storage system.
// We accept this risk for now since a super user can update the storage manually from old file path to the new one, and remove the AlternativeIdentifer form the database.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW: If changes above don't remove this risk (by making sure you still find the alternateId when checking to see if a newly generated one already exists), then a manual work-around would be to add the alternateID to the prohibited list for the PID provider.

@qqmyers
Copy link
Member

qqmyers commented May 21, 2024

@johannes-darms - I made several related comments about avoiding mixing the effective Pid generator and PidProvider for the existing PID concepts. If that's not clear, let's have a call or slack, etc.

@johannes-darms
Copy link
Contributor Author

I made several related comments about avoiding mixing the effective Pid generator and PidProvider for the existing PID concepts. If that's not clear, let's have a call or slack, etc.

Thanks for the explanation I wasn't aware of the distinction. I'll update the code accordingly.

@johannes-darms
Copy link
Contributor Author

@qqmyers : I've adapted the PR according to your feedback. Could you have another look?

@@ -153,7 +156,7 @@ protected void validateOrDie(DatasetVersion dsv, Boolean lenient) throws Command
* @param ctxt
* @throws CommandException
*/
protected void registerExternalIdentifier(Dataset theDataset, CommandContext ctxt, boolean retry) throws CommandException {
protected void registerExternalIdentifier(DvObject theDataset, CommandContext ctxt, boolean retry) throws CommandException {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FOr this change, the theDataset parameter name should probably change as well. That said, files don't currently (before #7334) get PIDs prior to publication, so file related code could be dropped for now.

Collections.singleton(Permission.EditDataset), getDataset());
}
// Datast must be unreleased! This means there is only one version!
if (getDataset().isReleased()) { //@TODO: Clarify whether this is the best check...
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems OK - covers deaccessioned as well

api.setAuthority(oldId.getAuthority());
api.setIdentifier(oldId.getIdentifier());
api.setDvObject(getDataset());
api.setStorageLocationDesignator(true);// cf. Dataset#getIdentifierForFileStorage()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rare case - if there's already an alt PID that is the storageLocationDesignator

// remove datafile PIDs
try {
for (DataFile df : getDataset().getFiles()) {
GlobalId oldPid=df.getGlobalId();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Files may not have PIDs and, pre-#7334, shouldn't have PIDs before publication. Minimally need to have a null check if you want to keep the code ready for when files get registered PIDs at upload.

api.setAuthority(oldPid.getAuthority());
api.setIdentifier(oldPid.getIdentifier());
api.setDvObject(df);
api.setStorageLocationDesignator(true); // cf. Dataset#getIdentifierForFileStorage()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same rare issue - if an altPID already exists that is the storage location designator

public boolean onSuccess(CommandContext ctxt, Object r) {
//update search index with the state
ctxt.index().asyncIndexDataset(getDataset(), true);
//invalidate all existing Export caches
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest removing - we have this code elsewhere for if/when we need it.

@@ -292,6 +293,8 @@ notification.typeDescription.WORKFLOW_FAILURE=External workflow run has failed
notification.typeDescription.STATUSUPDATED=Status of dataset has been updated
notification.typeDescription.DATASETCREATED=Dataset was created by user
notification.typeDescription.DATASETMENTIONED=Dataset was referenced in remote system
notification.typeDescription.PIDRECONCILED=The Persistent identifier of dataset has been updated

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is a notification.email.pidreconciled needed as well (somewhere around line 800)? I think sendNotification does try to send email unless this notification is muted.

ds.setIdentifierRegistered(true);
// set file
ds.setFiles(ds.getFiles().subList(0,1));
DataFile df = ds.getFiles().get(0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add a file w/o a pid to check the file pids off case - just to assure there's no NPE, no new PID.

It is possible to update the used PIDProvider of a dataset. This API not not modifies the PIDProvider, but
reconciles the PID with the current configured provider if it has not already been published.
This means that the identifier of a previous configured PIDProvider is removed and a new one is created using the current PIDProvider.
Note that this change does not affect the storage repository where the old identifier is still used.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(An administrator could move the files manually and set the storagelocationdesignator to false on the entry in the alternativepersistentidentifier table for the old identifier if this is a concern. This is not necessary for Dataverse to function correctly.)

Reconcile the PID ofa a dataset (If multiple PIDProvider are Enabled)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Dataverse can be configured to use multiple PID Providers (see the :ref:`pids-configuration` section for more information).
It is possible to update the used PIDProvider of a dataset. This API not not modifies the PIDProvider, but
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change: It is possible to update the PIDProvider used by a dataset and assign the dataset a new PID that uses the protocol/authority/separator/shoulder of the new PIDProvider. This API 'reconciles' a draft dataset's PID by creating a new PID supported by the PidProvider and assigning the original PID as an alternativePersistentIdentifier for the dataset. The API is restricted to datasets that have not already been published. (It does not make any changes to any PidProvider.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure how to handle file PIDs. If you leave the code in, we could add an "(and any PIDs currently assigned to the dataset's files)", ~ignoring that this can't happen in the current version, but would occur after #7334. Otherwise, maybe add a note over on the PR that it should be updated to include support for renconciling.

export SERVER_URL=https://demo.dataverse.org
export PERSISTENT_IDENTIFIER=doi:10.5072/FK2/YD5QDG

curl -X PUT -H "X-Dataverse-key:$API_TOKEN" "$SERVER_URL/api/datasets/$PERSISTENT_IDENTIFIER/pidReconcile/"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is correct - it should be ~ /api/datasets/:persistentId/pidReconcile?persistentId=$PERSISTENT_IDENTIFIER

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Feature Request/Idea: Move draft Datasets should update their PID configuration
3 participants