SPIKE: Improve how Dataverse labels shapefiles to prevent mislabelling of zip files that aren't shapefiles #8945

Open
jggautier opened this issue Jul 13, 2022 · 8 comments · May be fixed by #10627
Labels
Size: 10 A percentage of a sprint. 7 hours. Type: Bug a defect

Comments

@jggautier
Contributor

jggautier commented Jul 13, 2022

A depositor uploaded a double-zipped file to a dataset in the Harvard Dataverse Repository, and the file has been incorrectly labelled as a "Shapefile as ZIP Archive".

The file is in the published dataset at https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/HWVUER.

There are no shapefiles in the zip file, and the depositor wrote that it isn't a shapefile. The depositor also wrote that they uploaded the file through the UI in their Chrome browser, not through the Dataverse API. The email conversation with the depositor is at https://help.hmdc.harvard.edu/Ticket/Display.html?id=322790.

The file needs to be correctly labelled as a "ZIP Archive". Having it labelled as a "Shapefile as ZIP Archive" might be confusing to anyone looking to download the data.

@jggautier jggautier added the Type: Bug a defect label Jul 13, 2022
@jggautier
Contributor Author

jggautier commented Jul 13, 2022

I tried the redetect file type API endpoint. It reported that it worked, but the file is still labelled as a "Shapefile as ZIP Archive".

Lastly, I downloaded the zip file, double-zipped it again, and uploaded it to Demo Dataverse to see whether Demo would also label it as a "Shapefile as ZIP Archive". It did. (That dataset has since been deleted along with other datasets older than 30 days.)

@qqmyers
Member

qqmyers commented Jul 13, 2022

A .zip file gets labeled as a Shapefile if any of the included files has an extension in ["shp", "shx", "dbf", "prj"]. I can't see your example file. Does it have one of these? If so, we could/should tighten up the logic to test for all four, since all four are required and someone may have a .prj or another single extension for some other reason. If there are no files with these extensions, then something else is happening.
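The two rules described in this comment can be sketched roughly as follows. This is a minimal Python illustration and not the Dataverse implementation; the function names are hypothetical, and grouping entries by directory and base name is an assumption about what "test for all four" would mean in practice.

```python
import io
import zipfile
from collections import defaultdict
from pathlib import PurePosixPath

# Extensions named in the comment above.
SHAPEFILE_EXTENSIONS = {"shp", "shx", "dbf", "prj"}


def entry_extensions(zf):
    """Map (directory, base name) -> set of lowercase extensions found in the zip."""
    groups = defaultdict(set)
    for name in zf.namelist():
        if name.endswith("/"):  # skip directory entries
            continue
        path = PurePosixPath(name)
        groups[(str(path.parent), path.stem)].add(path.suffix.lstrip(".").lower())
    return groups


def looks_like_shapefile_any(zf):
    """Loose rule: a single entry with one of the extensions is enough."""
    return any(exts & SHAPEFILE_EXTENSIONS for exts in entry_extensions(zf).values())


def looks_like_shapefile_all(zf):
    """Stricter rule (assumed): one base name must carry all four extensions."""
    return any(SHAPEFILE_EXTENSIONS <= exts for exts in entry_extensions(zf).values())


def make_zip(names):
    """Build an in-memory zip of empty files, for demonstration only."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name in names:
            zf.writestr(name, b"")
    buf.seek(0)
    return zipfile.ZipFile(buf)
```

For an archive built with `make_zip(["code/analysis.R", "notes/layout.prj"])`, the loose rule returns True while the stricter rule returns False.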

@jggautier
Contributor Author

Hi @qqmyers. There are no shape files in the zip file.

@jggautier
Contributor Author

jggautier commented Jul 13, 2022

@pdurbin found files like “pointZ.dbf pointZ.prj pointZ.shp pointZ.shx” in a hidden directory inside of the zip file. "They seem to come from an R package called “maptools”. The path in the zip is replication/rpkgs/.checkpoint/2020-07-30/lib/x86_64-w64-mingw32/4.0.2/maptools/shapes."
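One way to locate such entries in an archive is to list every file with a shapefile-related extension and note whether it sits under a hidden directory. The sketch below is only an illustration of that kind of check; `find_shapefile_entries` is a hypothetical helper, and treating any path component that starts with a dot as "hidden" is an assumption.

```python
import zipfile
from pathlib import PurePosixPath

SHAPEFILE_EXTENSIONS = {".shp", ".shx", ".dbf", ".prj"}


def find_shapefile_entries(zf: zipfile.ZipFile) -> list[tuple[str, bool]]:
    """Return (entry name, is_under_hidden_dir) for entries with a shapefile extension."""
    found = []
    for name in zf.namelist():
        path = PurePosixPath(name)
        if name.endswith("/") or path.suffix.lower() not in SHAPEFILE_EXTENSIONS:
            continue
        # Assumption: a directory is "hidden" if its name starts with a dot.
        hidden = any(part.startswith(".") for part in path.parent.parts)
        found.append((name, hidden))
    return found
```

For the path reported above, the hidden flag would be True because of the `.checkpoint` component.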

The depositor wrote that "the zip file does not contain any shape files." I'm not sure if the depositor's scripts use the maptools package. I've asked the depositor:

  • Is the depositor using that R package? Or was it just imported and not used in the R code? If it isn't being used and can be removed, maybe the depositor can leave that R package out of the zip file, and then the Dataverse software won't label the zip file as a "Shapefile as ZIP Archive".
  • Should the use of that R package make the zip file a "Shapefile as ZIP Archive"?

@jggautier
Contributor Author

I haven't heard from the depositor yet, so I just sent a follow-up email. I also took a look at the R files in the zip file and didn't see a maptools package being imported, but I'm not very familiar with R, so I asked the depositor some clarifying questions about that too.

@jggautier
Contributor Author

jggautier commented Jul 21, 2022

The depositor let me know that they don't think they used maptools directly in the replication, but other packages may require it. Because it's tough to figure out which packages depend on which, they'd rather keep the current files as they are.

This sounds to me like the zip file as a whole is not a shapefile, and it shouldn't be labelled as one only because a library used by the code files includes hidden directories with shapefiles in them. Could that file detection feature in the Dataverse software be adjusted so that it doesn't label this and other zip files like it as "Shapefile as ZIP Archive"?
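As a rough illustration of what such an adjustment might look like, the sketch below ignores entries under hidden directories before checking for a complete set of extensions. It is a hypothetical rule written in Python for brevity, not the Dataverse code or a proposed patch; the names `is_hidden` and `is_zipped_shapefile` are made up for this example. Requiring all four extensions would not be enough on its own for this file, because the pointZ files in the maptools directory include all four.

```python
import zipfile
from collections import defaultdict
from pathlib import PurePosixPath

REQUIRED = {".shp", ".shx", ".dbf", ".prj"}


def is_hidden(path: PurePosixPath) -> bool:
    """True if any directory component starts with a dot (e.g. ".checkpoint")."""
    return any(part.startswith(".") for part in path.parent.parts)


def is_zipped_shapefile(zf: zipfile.ZipFile) -> bool:
    """Hypothetical rule: a visible directory holds one base name with all four extensions."""
    groups = defaultdict(set)
    for name in zf.namelist():
        path = PurePosixPath(name)
        if name.endswith("/") or is_hidden(path):
            continue  # skip directory entries and anything under a hidden directory
        groups[(path.parent, path.stem)].add(path.suffix.lower())
    return any(REQUIRED <= exts for exts in groups.values())
```

Other heuristics are possible, such as only considering entries at the top level of the archive; which one fits best would depend on how real zipped shapefiles are usually packaged.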

I remember hearing that by using the API we can also upload files and specify any file type. I thought that could be a workaround for this depositor, so I tested it on Demo Dataverse:

curl -H "X-Dataverse-key:$API_TOKEN" -X POST -F "file=@$FILENAME;type=application/zip" "$SERVER_URL/api/datasets/:persistentId/add?persistentId=$PERSISTENT_ID"

But it doesn't work for this zip file. The uploaded file is still labelled as "Shapefile as ZIP Archive", and the response in my terminal shows "contentType":"application/zipped-shapefile". (It does work for a PNG file I tried.)

@mreekie, in #8816 we wrote about planning to talk with others who know more about the preservation and use of shapefiles. I'm wondering if those folks can also weigh in on this.

@jggautier
Contributor Author

Moving this out of the Harvard Dataverse Repository GitHub repo and into the Dataverse software GitHub repo.

@jggautier jggautier transferred this issue from IQSS/dataverse.harvard.edu Aug 30, 2022
@jggautier jggautier changed the title A dataset's double zipped file is mislabelled as "Shapefile as ZIP Archive" Improve how Dataverse labels shapefiles to prevent mislabelling of zip files that aren't shapefiles Nov 8, 2022
@sbarbosadataverse sbarbosadataverse changed the title Improve how Dataverse labels shapefiles to prevent mislabelling of zip files that aren't shapefiles SPIKE: Improve how Dataverse labels shapefiles to prevent mislabelling of zip files that aren't shapefiles Apr 24, 2024
@pdurbin pdurbin added the Size: 10 A percentage of a sprint. 7 hours. label May 8, 2024
@pdurbin
Member

pdurbin commented May 8, 2024

The file is in the published dataset at https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/HWVUER.

I gave this a 10. Hopefully it's straightforward and we have a file to try to reproduce the problem. ^^

We most recently touched this code here:

@jp-tosca jp-tosca self-assigned this May 22, 2024
@jp-tosca jp-tosca removed their assignment Jun 5, 2024
@stevenwinship stevenwinship self-assigned this Jun 11, 2024
@stevenwinship stevenwinship removed their assignment Jun 12, 2024

5 participants