feat: add log exporting to e2e tests #308
base: main
Conversation
@@ -156,6 +171,7 @@ jobs:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ secrets.AWS_REGION }}

Looks like there was one extra newline left here. Was that one intentional?
No
Can you remove it?
Yes when I'm back from PTO
Done
lgtm!
@@ -141,7 +141,22 @@ jobs:
          HF_TOKEN: ${{ secrets.HF_TOKEN }}
        run: |
          . venv/bin/activate
          # set preserve to true so we can retain the logs
          export PRESERVE=1
Where is this var used?
Inside of the e2e test script. It prevents the temp dir from being deleted.
Why not just use the flag?
@nathan-weinberg Could you please help me understand why we wouldn't want to use the env here? If we don't expect it to be used then we should just not have it in the script entirely.
The flag just sets the env: https://github.com/instructlab/instructlab/blob/main/scripts/e2e-ci.sh#L433
I'm saying instead of setting it in the workflow file, just pass -mp instead of -m
I see, that makes much more sense. I'll change it to use the flag then.
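For context, the flag-sets-env pattern being discussed can be sketched as follows. This is a minimal illustration, not the actual e2e-ci.sh: the -p option and the cleanup logic here are assumptions based on the thread, with names chosen for clarity.

```shell
#!/usr/bin/env bash
# Sketch of a script where a CLI flag sets PRESERVE, which in turn
# controls whether the temp dir is deleted on exit.
set -euo pipefail

PRESERVE=${PRESERVE:-0}

while getopts "mp" opt; do
  case "${opt}" in
    m) ;;            # placeholder for the medium test-suite flag
    p) PRESERVE=1 ;; # preserve the temp dir so logs survive the run
    *) exit 1 ;;
  esac
done

TMP_DIR=$(mktemp -d)

cleanup() {
  # Only delete the temp dir when preservation was not requested
  if [ "${PRESERVE}" -eq 0 ]; then
    rm -rf "${TMP_DIR}"
  fi
}
trap cleanup EXIT

echo "logs in: ${TMP_DIR}"
```

With this shape, `-mp` in the workflow is equivalent to `-m` plus `export PRESERVE=1`, which is why the reviewer prefers the flag.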
          log_file=$(find /tmp -name "training_params_and_metrics_global0.jsonl")
          mv "${log_file}" training-log.jsonl
      - name: Upload training logs
        uses: actions/upload-artifact@v4
Can we use a hardened action here per the org policy?
Yes of course.
Thanks! Can you include the version number in a comment next to it like we do elsewhere?
Absolutely!
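For reference, a hardened step usually pins the action to a full commit SHA with the corresponding tag in a trailing comment. A sketch of what that step could look like; the SHA below is a placeholder (the real value must come from the actions/upload-artifact release being pinned), and the artifact name/path are illustrative:

```yaml
- name: Upload training logs
  # Pin to a full commit SHA; the SHA here is a placeholder, not a real release.
  uses: actions/upload-artifact@0000000000000000000000000000000000000000 # v4
  with:
    name: loss-logs          # illustrative artifact name
    path: training-log.jsonl
```

Pinning by SHA protects the workflow from a tag being moved to malicious code, which is the point of the hardening policy.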
      - name: Download loss data
        id: download-logs
        uses: actions/download-artifact@v4
Same comment on hardening here
Added it 🫡
        run: |
          pip install -r requirements-dev.txt

      - name: Try to upload to s3
So this says "Try to upload to s3" - do we assume a chance of failure? What is the job behavior if this step succeeds versus if it fails?
Should we fail the end-to-end run if we fail to upload to S3? If so, we'd need to rerun the entire EC2 job. Otherwise, we could let the run succeed and surface a warning in the PR/summary that we could not upload to S3.
There are two ways I'd think about this:
- Testing for functional correctness is primary and loss correctness is secondary; a failure to upload the loss data is not a failure of the PR.
- Testing for loss and functional correctness are both primary; a failure to upload to S3 carries the same weight as the tests themselves failing.
Let me know your thoughts on how we should proceed.
Honestly I'm fine with either, up to you and the @instructlab/training-maintainers! My suggestion: we go for 1 but include something like this? https://stackoverflow.com/questions/74907704/is-there-a-github-actions-warning-state
Yes, the warning state is exactly what I was looking for. Let's go with that.
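The warning-state approach can be sketched like this: let the upload step fail without failing the job, then emit a workflow warning in a follow-up step. A hypothetical fragment; the step id, bucket variable, and file names are assumptions, not the PR's actual code:

```yaml
- name: Try to upload to s3
  id: s3-upload
  continue-on-error: true  # option 1: an upload failure does not fail the e2e run
  run: |
    aws s3 cp training-log.jsonl "s3://${{ vars.LOSS_BUCKET }}/training-log.jsonl"

- name: Warn on upload failure
  if: steps.s3-upload.outcome == 'failure'
  run: echo "::warning::Could not upload loss data to S3; the loss curve is missing from this run."
```

With `continue-on-error: true` the step's `outcome` is still `failure` even though the job proceeds, so the second step can key off it to raise the annotation.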
Currently, the training library runs through a series of end-to-end tests which ensure there are
no bugs in the code being tested. However, we do not perform any form of validation to ensure that
the training logic and quality have not diminished.
This presents an issue where we can be "correct" in the sense that no hard errors are hit,
but invisible bugs may be introduced which cause models to regress in training quality, or other
bugs that plague the models themselves may seep in.
This commit fixes that problem by introducing the ability to export the training loss data itself
from the test and render the loss curve using matplotlib.
When the results are output, they can be found under the "Summary" tab of a GitHub Actions run.
Resolves #179
Signed-off-by: Oleg S [email protected]