Request/Petition: Change the default behaviour of DROID, to scan the whole file #1082

Open
Dclipsham opened this issue Mar 19, 2024 · 8 comments


@Dclipsham

On first download, DROID's 'maximum bytes to scan' value is set to 65536 bytes, limiting DROID to scanning the first and last 64k of each file. This setting has been in place for a very long time and was probably chosen for performance reasons.

For DROID to give the most accurate results, the whole file should be scanned. Formats such as MP4 (for which identification seeks a 'moov' atom) and PDF/A versions (for which we seek the PDFconformance and PDFAid tags) will quite often not get identified accurately with default settings, as key elements don't always fall within the 64k boundaries.

This is a regularly reported issue (e.g. 'Why is this file not identifying?', or 'Why are Siegfried and DROID giving different results?') - it's virtually always because the user didn't know about the default behaviour.

Changing the default value to a negative number will solve this.
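To make the effect of the setting concrete, here is a minimal sketch of which byte ranges end up being examined. It is an illustration of the behaviour described above, not DROID's actual code; the function names `scan_windows` and `visible` are hypothetical, and the file size and marker offset are made up.

```python
def scan_windows(file_size: int, max_bytes: int) -> list[tuple[int, int]]:
    """Return the (start, end) byte ranges examined for a file of file_size bytes.

    Assumption: a negative max_bytes means "no limit", as described above.
    """
    if max_bytes < 0 or file_size <= 2 * max_bytes:
        # No limit applies, or the BOF and EOF windows cover the whole file.
        return [(0, file_size)]
    return [(0, max_bytes), (file_size - max_bytes, file_size)]


def visible(offset: int, length: int, windows: list[tuple[int, int]]) -> bool:
    """True if the byte run [offset, offset + length) lies wholly inside one window."""
    return any(start <= offset and offset + length <= end for start, end in windows)


# A hypothetical 10 MiB file with a 4-byte marker sitting 1 MiB in.
size = 10 * 1024 * 1024
marker_offset = 1024 * 1024

default_windows = scan_windows(size, 65536)
full_windows = scan_windows(size, -1)

print(default_windows)                              # [(0, 65536), (10420224, 10485760)]
print(visible(marker_offset, 4, default_windows))   # False
print(visible(marker_offset, 4, full_windows))      # True
```

With the 65536 default, anything between the two windows is never looked at, which is why a marker in the middle of a large file goes unseen.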

Since we've removed most of the purely variable signature sequences, this shouldn't significantly affect performance either. Large WAV files are probably the most impacted, because they all seek certain chunks that could be anywhere. MP4 won't identify without the moov chunk, so it's better to scan the whole thing and find it than to stop early and miss it. The option to limit the scan remains for people who want it.

Keen to hear others' thoughts.

@steve-daly

This sounds a reasonable request. @Dclipsham Would you suggest adjusting the Command Line default behaviour as well as the GUI first-run default?

We don't currently do any automated performance testing after PRONOM and/or DROID releases, but it's something I'd like to do. Do you think there are any remaining variable-only signatures that could be looked at? I remember adding a simple BOF to one last year, but don't know how many remain.

I'm reminded of this one too #773

@Dclipsham
Author

Dclipsham commented Mar 19, 2024

Thanks @steve-daly - by my count there are 86 Signatures that have at least one variable sequence - every single one of them has either a BOF or an EOF as an anchor. List attached (ordering is by signature ID).
Variable_sequences.txt

One further potential inefficiency may relate to ZIP-based container formats that still retain a binary signature:
fmt/161 - SIARD 1.0 - 504B0304*504B0304*786D6C6E733D22687474703A2F2F7777772E6261722E61646D696E2E63682F786D6C6E732F73696172642F312E302F6D657461646174612E78736422
x-fmt/412 - Java Archive (JAR) - 504B0304*4D4554412D494E462F4D414E49464553542E4D46

In the above cases the ZIP-based triggers will shunt them onto the Container sig track, but I wonder (I haven't got any means to properly profile this) whether any zip file failing to match anything in the Container Signature file will then go on to get a full file scan in search of these further binary signature elements. ZIPs can obviously be extremely large, so this might be one possible issue, but again I think it's an acceptable risk - generally zips get expanded anyway.
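For readers not used to the hex notation, the two binary signatures quoted above can be decoded with a short sketch like the one below. It only splits each pattern on the `*` wildcard and converts the fixed parts to bytes; `decode_signature` is a hypothetical helper, not part of DROID or PRONOM.

```python
def decode_signature(pattern: str) -> list[bytes]:
    """Split a hex pattern on the '*' wildcard and decode each fixed part."""
    return [bytes.fromhex(part) for part in pattern.split("*")]


# Patterns copied from the comment above.
JAR = "504B0304*4D4554412D494E462F4D414E49464553542E4D46"
SIARD = (
    "504B0304*504B0304*"
    "786D6C6E733D22687474703A2F2F7777772E6261722E61646D696E2E63682F"
    "786D6C6E732F73696172642F312E302F6D657461646174612E78736422"
)

for puid, pattern in (("x-fmt/412", JAR), ("fmt/161", SIARD)):
    print(puid, decode_signature(pattern))
```

Both start with `PK\x03\x04`, the ZIP local file header signature. The JAR pattern then looks for `META-INF/MANIFEST.MF` at any later position, and the SIARD pattern for a second `PK\x03\x04` followed by the `xmlns="http://www.bar.admin.ch/xmlns/siard/1.0/metadata.xsd"` string.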

I recall there was a discussion about tidying up the legacy binary signatures where container signatures exist, but I can't see an actual ticket.

@steve-daly

I think there's some code in DROID which removes Binary signatures at runtime when a Container signature is found for the same format. I think we had to replicate that when working on the API recently. I'll have a look around.

@Dclipsham
Author

In answer to your question about defaults for GUI & CLI - I favour behavioural consistency, so having both default to full scan would be my preference.

@nishihatapalmer
Contributor

It was certainly true that we used the default 65536 for performance reasons. We did extensive testing on a lot of files, and determined that almost all signatures would match within that, but we acknowledged that there are a few formats where matches can happen outside of this. Some signatures require a full-text scan, but most do not.

The performance differences can be significant on large corpuses, so I would be wary of changing this default for everyone without a big warning.

One thing that would be interesting to explore is identifying which signatures require longer identification, as there are very few of these. It could be possible to get the best of both worlds, by limiting full text scans just to those formats which require it.

@nishihatapalmer
Contributor

Another consideration is whether you should alter the default at all. It is entirely possible to specify full text scan, so I'm not clear that altering this as a default makes sense for most users. The possibility of disrupting a lot of existing workflows by massively slowing down scanning is significant.

I do think that identifying the formats that need full text scan would be more useful. From memory, these were usually multimedia formats like video, where signatures can appear a long way into a file. For most formats, like documents and XML, the magic bytes appear early (or close to the end).

@Dclipsham
Author

Dclipsham commented Apr 10, 2024

Thank you for your comments Matt,

The most frequent ones that seem to get queried are PDF/A or similar PDF variants (where the conformance tags are often found deep within the file), MOV (where in most cases we're seeking the 'moov' atom which is required for MOV but can appear anywhere), MP4 (ditto, although these are more commonly optimised than MOV). There used to be a particular issue with AVI but the signature was simplified so now only needs the first few bytes. One that also comes up but less frequently in general cases is BWAVE/BWF, which is seeking the 'BEXT' chunk which can be anywhere within a WAV, plus its 64bit equivalent, RF64.
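As a rough way to check where the 'moov' atom sits in a given file, the sketch below walks the top-level boxes of an ISO base media file (each box starts with a 4-byte big-endian size and a 4-byte type). It is a simplified illustration, not how DROID matches the signature; the helper names and the synthetic sample are made up for the example.

```python
import struct


def top_level_boxes(data: bytes):
    """Yield (offset, type, size) for each top-level box in data."""
    offset = 0
    while offset + 8 <= len(data):
        size, box_type = struct.unpack_from(">I4s", data, offset)
        header = 8
        if size == 1:
            # A 64-bit size follows the type field.
            if offset + 16 > len(data):
                break
            (size,) = struct.unpack_from(">Q", data, offset + 8)
            header = 16
        elif size == 0:
            # The box extends to the end of the file.
            size = len(data) - offset
        if size < header:
            break  # malformed box; stop rather than loop forever
        yield offset, box_type, size
        offset += size


def make_box(box_type: bytes, payload: bytes) -> bytes:
    """Build a box with a 32-bit size header (for the synthetic sample only)."""
    return struct.pack(">I4s", 8 + len(payload), box_type) + payload


# Synthetic, non-optimised layout: ftyp, a large mdat, then a large moov.
sample = (
    make_box(b"ftyp", b"isom" + bytes(12))
    + make_box(b"mdat", bytes(200_000))
    + make_box(b"moov", bytes(100_000))
)

moov_offset = next(off for off, typ, _ in top_level_boxes(sample) if typ == b"moov")
in_bof = moov_offset + 8 <= 65536
in_eof = moov_offset >= len(sample) - 65536

print(moov_offset)      # 200032
print(in_bof, in_eof)   # False False
```

In this made-up layout the start of the 'moov' box falls outside both the first and the last 64k, so a BOF/EOF-limited scan would not see it.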

There will be others but these are most common in my experience. In total there are 214 PUIDs that have at least one sequence either containing a full "*" wildcard or a Variable sequence as part of a Signature ID.

The problem I'm aiming to resolve, both previously in my role maintaining PRONOM and latterly in my role at Preservica, is frequent confusion among tool users between the outputs of DROID (which by default scans BOF/EOF 64k), Siegfried (which by default scans the whole file, although I gather it has additional cleverness regarding first-match and prioritisation), and the DP systems (e.g. Preservica and Libsafe, which use DROID at full scan; Archivematica, which uses Siegfried; Rosetta, which I believe can use either) - e.g. "Why when I scan this myself do I get different outputs in DROID than other tools?" The answer is almost always the default max bytes to scan for DROID.

In my experience working with digital archivists over the past years they've always tended toward accuracy over performance given the choice, so I don't think any perceived performance hit would be a major concern for the majority.

Further, I don't think any actual performance hit would be significant in most cases. Previous efforts between myself and other frequent contributors to PRONOM have eliminated all solely-variable signature sequences (i.e. if a signature ID has a Variable sequence, it also has at least one BOF or EOF sequence). Every signature pattern currently in use is therefore anchored in some way to a relative BOF or EOF position (the largest max offset in use is 136004), so unless at least one of these anchored patterns is found first, a given file shouldn't need to be scanned fully.
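The anchoring argument can be sketched as follows, using the JAR pattern quoted earlier in the thread. This is a deliberately simplified model (one anchor fixed at offset 0, one variable sequence), not DROID's matching engine; `match_anchored` and the sample data are hypothetical.

```python
def match_anchored(data: bytes, bof_anchor: bytes, variable_part: bytes) -> tuple[bool, int]:
    """Return (matched, bytes_examined) for a BOF anchor plus one variable sequence.

    Simplification: the anchor must sit exactly at offset 0, and the
    bytes_examined figure is a rough upper bound, not a measured count.
    """
    if not data.startswith(bof_anchor):
        # Anchor missing: the variable sequence is never searched for.
        return False, min(len(data), len(bof_anchor))
    found = data.find(variable_part, len(bof_anchor))
    if found == -1:
        # Anchor present but variable part absent: the whole file was searched.
        return False, len(data)
    return True, found + len(variable_part)


PK = bytes.fromhex("504B0304")
MANIFEST = bytes.fromhex("4D4554412D494E462F4D414E49464553542E4D46")

not_zip = b"%PDF-1.7" + bytes(1_000_000)
jar_like = PK + bytes(500_000) + MANIFEST + bytes(10)
plain_zip_like = PK + bytes(500_000)

print(match_anchored(not_zip, PK, MANIFEST))         # (False, 4)
print(match_anchored(jar_like, PK, MANIFEST))        # (True, 500024)
print(match_anchored(plain_zip_like, PK, MANIFEST))  # (False, 500004)
```

The first case is the cheap one: without the anchor, nothing beyond the first few bytes is examined. The third case is the ZIP concern raised earlier in the thread, where the anchor matches but the variable part is never found.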

David

@nishihatapalmer
Contributor

Hi David, thanks, very interesting. Certainly anchoring variable sequences will have a huge effect on limiting performance hits.

If accuracy is the most important factor, and other tools do this too, then it would seem appropriate for DROID to follow suit.

Just make sure any such change to this default is communicated prominently!


3 participants