Consider taking advantage of git commit-graph #1101

Open
michielbdejong opened this issue Sep 10, 2024 · 3 comments

Comments

@michielbdejong
Contributor

With a git commit-graph that includes changed-path Bloom filters, the query git log Musi on a repo like tosdr-snapshots becomes about 42 times faster:

crawler@ota-tosdr-ubuntu-20-04:~$ git clone https://github.com/tosdr/tosdr-snapshots
Cloning into 'tosdr-snapshots'...
remote: Enumerating objects: 325901, done.
remote: Counting objects: 100% (1/1), done.
remote: Total 325901 (delta 0), reused 0 (delta 0), pack-reused 325900 (from 1)
Receiving objects: 100% (325901/325901), 1014.11 MiB | 23.45 MiB/s, done.
Resolving deltas: 100% (177325/177325), done.
Updating files: 100% (15036/15036), done.
crawler@ota-tosdr-ubuntu-20-04:~$ cd tosdr-snapshots/
crawler@ota-tosdr-ubuntu-20-04:~/tosdr-snapshots$ git --version
git version 2.46.0
crawler@ota-tosdr-ubuntu-20-04:~/tosdr-snapshots$ time git log Musi > /dev/null

real	0m36.150s
user	0m35.476s
sys	0m0.542s
crawler@ota-tosdr-ubuntu-20-04:~/tosdr-snapshots$ time git log Musi/* > /dev/null

real	0m37.725s
user	0m37.163s
sys	0m0.557s
crawler@ota-tosdr-ubuntu-20-04:~/tosdr-snapshots$ git commit-graph write --reachable --changed-paths
Computing commit changed paths Bloom filters: 100% (82187/82187), done.
crawler@ota-tosdr-ubuntu-20-04:~/tosdr-snapshots$ time git log Musi > /dev/null

real	0m0.847s
user	0m0.773s
sys	0m0.073s
crawler@ota-tosdr-ubuntu-20-04:~/tosdr-snapshots$ time git log Musi/* > /dev/null

real	0m38.458s
user	0m38.002s
sys	0m0.453s
crawler@ota-tosdr-ubuntu-20-04:~/tosdr-snapshots$ 

But as you can also see there, I haven't yet found a way to speed up the query git log Musi/*. Unless there is a way to speed that up too, it might be worth trying to avoid wildcards in paths in git log commands. Will update here if I find out more.
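To make the comparison concrete, the "real" timings from the session above can be turned into ratios. This is only a small sketch: the helper names parseTimeOutput and speedup are made up for illustration, and the durations are copied from the transcript.

```javascript
// Convert a shell `time` duration such as "0m36.150s" into seconds.
function parseTimeOutput(value) {
  const match = /^(\d+)m(\d+(?:\.\d+)?)s$/.exec(value.trim());
  if (!match) {
    throw new Error(`Unrecognised duration: ${value}`);
  }
  return Number(match[1]) * 60 + Number(match[2]);
}

// Ratio of the elapsed time before and after writing the commit-graph.
function speedup(before, after) {
  return parseTimeOutput(before) / parseTimeOutput(after);
}

// "real" values copied from the transcript in this issue.
const plainPath = speedup('0m36.150s', '0m0.847s');     // git log Musi
const wildcardPath = speedup('0m37.725s', '0m38.458s'); // git log Musi/*

console.log(plainPath.toFixed(1));    // 42.7
console.log(wildcardPath.toFixed(2)); // 0.98
```

So the plain path query is roughly 42 to 43 times faster, while the wildcard query is essentially unchanged (within noise for a single run).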

@michielbdejong
Contributor Author

michielbdejong commented Sep 13, 2024

I'll have a stab at this!
It seems the wildcard is coming from generateFilePath when the mimeType is unknown.

For the versions repo we know the extension is always .md so that could be a quick win.
For the snapshots repo, taking the ToS;DR one as an example, it's 98% .html, 2% .pdf and 0.3% .txt/.docx/.json/.md/.* (sic).

One option would be to shortlist the allowed extensions, since there are only 7 different ones in use.
Another option might be to do a git log for the folder, and then filter for the requested terms type in JS.
A third option could be to ascertain the extension before calling generateFilePath.
I'll look into all three of these options.
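The first two options could be sketched roughly as follows. Everything here is an assumption made for illustration, not the project's actual code: the function names candidatePaths and commitsForTermsType are hypothetical, files are assumed to live at `<serviceId>/<termsType>.<extension>`, the extension list contains only the six extensions named in this thread, and the log text is assumed to come from `git log --name-only --format=commit:%H -- <serviceId>`, where the `commit:` prefix is a marker chosen for this sketch.

```javascript
// Extensions named in this thread. The real shortlist would need to be
// checked against the repository (hypothetical constant).
const KNOWN_EXTENSIONS = ['html', 'pdf', 'txt', 'docx', 'json', 'md'];

// Option 1: replace `<serviceId>/<termsType>.*` with literal candidate paths.
function candidatePaths(serviceId, termsType, extensions = KNOWN_EXTENSIONS) {
  return extensions.map(extension => `${serviceId}/${termsType}.${extension}`);
}

// Option 2: parse the log of the whole service folder and keep only the
// commits that touch the requested terms type, whatever the extension.
// Assumes git does not quote the paths (see the core.quotePath setting).
function commitsForTermsType(logOutput, serviceId, termsType) {
  const prefix = `${serviceId}/${termsType}.`;
  const commits = [];
  let currentCommit = null;

  for (const line of logOutput.split('\n')) {
    if (line.startsWith('commit:')) {
      currentCommit = line.slice('commit:'.length);
    } else if (currentCommit && line.startsWith(prefix)) {
      commits.push(currentCommit);
      currentCommit = null; // record each commit at most once
    }
  }
  return commits;
}

// Made-up sample data in the assumed log format.
const sampleLog = [
  'commit:c3',
  '',
  'Musi/Privacy Policy.html',
  'commit:b2',
  '',
  'Musi/Terms of Service.html',
  'Musi/Privacy Policy.html',
  'commit:a1',
  '',
  'Musi/Terms of Service.pdf',
].join('\n');

console.log(candidatePaths('Musi', 'Terms of Service', ['html', 'pdf']));
// [ 'Musi/Terms of Service.html', 'Musi/Terms of Service.pdf' ]
console.log(commitsForTermsType(sampleLog, 'Musi', 'Terms of Service'));
// [ 'b2', 'a1' ]
```

This thread does not establish whether several literal paths passed to a single git log call benefit from the commit-graph in the same way as one path, so with option 1 it may be safer to measure one call per candidate path before relying on it.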

@michielbdejong
Contributor Author

I'll record my working log for this in tosdr/edit.tosdr.org#1174 so you don't get a long conversation with minor details in this issue.

@michielbdejong
Contributor Author

With the --skipPreRun and --skipSnapshots options from #1104, the only remaining place where a wildcard is given to git is in git ls-files, and there it is harmless (I tested this).

So with that, OpenTermsArchive/docs#142 is now a fix for this issue, although it is still blocked on #1104.
