Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Faster builds: Decoupled mondo repo artefacts #619

Open
Tracked by #589
joeflack4 opened this issue Jul 24, 2024 · 9 comments
Open
Tracked by #589

Faster builds: Decoupled mondo repo artefacts #619

joeflack4 opened this issue Jul 24, 2024 · 9 comments
Assignees
Labels
build Mostly for build PRs: when changes only to data files post `build-mondo-ingest`; no code changes performance

Comments

@joeflack4
Copy link
Contributor

joeflack4 commented Jul 24, 2024

Overview

I've recently started to refresh my "mondo repo artefacts" once a week. This has led me to see that building these artefacts takes up a significant amount of build time, usually around 45 minutes (about ~20% of the current avg ~3.5hr build time).

These files are actually dependencies of the mondo-ingest pipeline. There are no changes anywhere in mondo-ingest that would have any effect on how these artefacts are built. It makes sense to decouple them from both a logical and efficiency standpoint.

The artefacts

These are the artefacts and their dependency structure. Nested items depend on the items 1 level above.

  • tmp/mondo/ - this isn't a release artefact; it's a cloned Mondo directory. It isn't used anywhere except to build the following artefacts.
    • mondos.ssom.tsv
    • mondo.owl
      • mondo.db

Possible implementation details

Where to store releases

  • a. Release these somehow in the mondo repo releases
  • b. Separate repo just for these artefacts
  • c. Add to mondo-ingest releases.

(a) is a bad idea, because this just makes a complicated release pipeline even more complex. (b) This is my favorite solution. (c) This is not a bad solution, but it would have similar problems to 'a'. We want to eventually move to doing mondo-ingest releases, and so now we would have to deal with two kinds of releases in mondo-ingest: these files, and the actual mondo-ingest build outputs.

When / how to trigger a new release of these artefacts

a. Whenever mondo has a new commit on main
b. Whenever any file that "the artefacts" depend on changes

  • Would basically store md5 hashes of each of these files, and whenever main changes, do a check to see if any hashes changed. It may be that mondo-edit.obo is the only one.

A GitHub action would run regularly, check on changes, and run the release.

Changes that would need to be made to mondo-ingest pipeline

For these goals, mondo.owl, etc, they would check the GitHub releases and see if a new release came out that is newer than what is cached locally (if anything). If so, then it would wget these artefacts.

Possible challenges

Memory requirements for GH action: I don't think that this is an issue yet, but I'm not 100% sure. I have to monitor these builds and see what the peak memory usage is and make sure that it is less than the maximum available for an (ideally free) GH action runner (though we might be able to switch to MacOS instead of ubuntu runners for more memory if needed). Currently, mondo-ingest builds are peaking for me at around 32GB of memory, whereas these artefacts peak at around 5.5GB, which so far is I think workable.

@joeflack4 joeflack4 added build Mostly for build PRs: when changes only to data files post `build-mondo-ingest`; no code changes performance labels Jul 24, 2024
@joeflack4 joeflack4 self-assigned this Jul 24, 2024
@matentzn
Copy link
Member

Please assign this to me, this should not be happening

@joeflack4 joeflack4 assigned matentzn and unassigned joeflack4 Jul 24, 2024
@joeflack4
Copy link
Contributor Author

joeflack4 commented Jul 24, 2024

@matentzn I marked it as a low priority; it's just an idea I came up with and the details kinda poured out of my head. We can discuss some other time. I am curious to what you mean by "this should not be happening". There is no bug here. This is an optimization. Unless perhaps you mean that it shouldn't take 45 minutes for these things to be built.

@matentzn
Copy link
Member

I thought we just spend a lot of resources fixing this: https://github.com/monarch-initiative/mondo-ingest/pull/594/files

@joeflack4
Copy link
Contributor Author

Ah, this idea improves upon that further.

What we have done already is making so that whenever you run commands that use mondo.owl etc, it will always check and rebuild them if they are out of date.

What this issue proposes is a decoupled process that uses a GitHub action to build these regularly and then upload them to get up releases. Then, when you run a goal which needs mondo.owl, it will download them if the local files are out of date.

@matentzn
Copy link
Member

I guess if the process consumes that much time this is a possibility, but given the fact that you may need to rerun from the repo from time to time anyways (for example if you need a change that was just pushed) it introduces a bit of complexity in the code. The PR I linked (please ensure it is seen thru as I dont have eyes on it) should make sure that the code is only run once for any state of the Mondo master branch. But you suggestion also makes some sense; basically you have actions run every night rebuilding mondo.owl and mondo.sssom.tsv, caches them and the ingest code just pulls them (not sure how to get reliable download URLs from GH actions, I remember this was problematic in the past)

@joeflack4
Copy link
Contributor Author

introduces a bit of complexity

True. If you are running a goal that needs mondo.owl etc, and it detects a change, it'll try to wget from the release, and if the build process is still running, then without any additional logic to account for this it will just download the old release.

The PR I linked (please ensure it is seen thru

Oh, Refactor git refresh system (small refactor) #594 didn't have an assignee. It looks like you got stuck on an issue during your build. I've assigned this to myself now, but it is a low priority.

should make sure that the code is only run once for any state of the Mondo master branch

No worries; that has already been completed in:

not sure how to get reliable download URLs from GH actions

I think you mean releases? I haven't had any such problems. We have ICD11 and OMIM pipelines set up using GitHub releases, and there's been no issue yet.

@matentzn
Copy link
Member

I think you mean releases? I haven't had any such problems. We have ICD11 and OMIM pipelines set up using GitHub releases, and there's been no issue yet.

Yes but you have specific repos for that. Are you suggesting to create a mondo.owl nightly built repo so you can create github releases? As if it was a "source"? I thought you meant to create a github action in the mondo-ingest repo that caches the built files in the GHA cache. In any case, maybe your idea of a separate repo isnt a bad one; Just make sure its raised it a call with Trish to discuss pros and cons.

Oh, #594 didn't have an assignee.

Dropped the ball on this and I wont be able to pick it up again. Maybe just try to review the diff carefully until you understand the point of all the changes, and if you dont understand something / or want to question a modelling decision, just let me know?

@joeflack4
Copy link
Contributor Author

I'll talk about #594 w/ Trish and prioritize.

For mondo.owl etc as releases, I proposed 3 alternatives in the OP:

Possible implementation details

Where to store releases

  • a. Release these somehow in the mondo repo releases
  • b. Separate repo just for these artefacts
  • c. Add to mondo-ingest releases.

I'd rather not do (b) and create a separate repo, but I think it is the easiest solution.

It would run more often than nightly. In TermHub, we run things every 20 minutes. It will check if there are any updates to main, and if not, it will do nothing.

@matentzn
Copy link
Member

I am neither opposed nor strongly in favour, I think the idea is generally worth considering as it could shave off 30 min of every build we do..

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
build Mostly for build PRs: when changes only to data files post `build-mondo-ingest`; no code changes performance
Projects
None yet
Development

No branches or pull requests

2 participants