
feat: add log exporting to e2e tests #308

Open · wants to merge 1 commit into base: main

Conversation

@RobotSail (Member) commented on Oct 25, 2024:

Currently, the training library runs through a series of end-to-end tests which ensure there are
no bugs in the code being tested. However, we do not perform any form of validation to ensure that
the training logic and quality have not diminished.

This presents an issue where we can potentially be "correct" in the sense that no hard errors are hit,
while invisible bugs that cause models to regress in training quality, or other bugs that plague the
models themselves, can still seep in.

This commit fixes that problem by introducing the ability to export the training loss data itself
from the test and to render the loss curve using matplotlib.

When the results are output, they can be found under the "Summary" tab of a GitHub Actions run.
For example:

[Screenshot: rendered loss curve in the Actions run summary, 2024-10-25 6:18 PM]

Resolves #179

Signed-off-by: Oleg S [email protected]
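
For reference, a minimal sketch of what the render-and-publish step could look like (the step name, the create_loss_graph.py script and its flags, and the image URL are placeholders for illustration, not the exact contents of this PR):

    - name: Render loss curve and publish it to the job summary
      run: |
        # create_loss_graph.py is a hypothetical matplotlib script that reads the
        # JSONL training log and writes loss_curve.png.
        pip install matplotlib
        python create_loss_graph.py --log-file training-log.jsonl --output-file loss_curve.png

        # Markdown appended to GITHUB_STEP_SUMMARY is rendered under the
        # "Summary" tab of the Actions run; the image URL below is a placeholder.
        echo '![Training loss curve](https://example.com/loss_curve.png)' >> "$GITHUB_STEP_SUMMARY"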

@mergify mergify bot added CI/CD Affects CI/CD configuration ci-failure dependencies Pull requests that update a dependency file labels Oct 25, 2024
@mergify mergify bot added ci-failure and removed ci-failure labels Oct 25, 2024
@mergify mergify bot removed the ci-failure label Oct 25, 2024
.github/workflows/e2e-nvidia-l4-x1.yml (review thread resolved)
.github/workflows/e2e-nvidia-l4-x1.yml (outdated; review thread resolved)
@@ -156,6 +171,7 @@ jobs:
aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
aws-region: ${{ secrets.AWS_REGION }}

Member: Looks like there was one extra newline left here. Was that one intentional?

Member Author: No.

Member: Can you remove it?

Member Author: Yes, when I'm back from PTO.

Member Author: Done.

@JamesKunstle (Contributor) left a review: lgtm!

@mergify mergify bot added the one-approval label Oct 28, 2024
@@ -141,7 +141,22 @@ jobs:
HF_TOKEN: ${{ secrets.HF_TOKEN }}
run: |
. venv/bin/activate
# set preserve to true so we can retain the logs
export PRESERVE=1
Member: Where is this var used?

Member Author: Inside of the e2e test script. It prevents the temp dir from being deleted.

Member: Why not just use the flag?

Member Author: @nathan-weinberg Could you please help me understand why we wouldn't want to use the env here? If we don't expect it to be used, then we should just not have it in the script at all.

Member: The flag just sets the env: https://github.com/instructlab/instructlab/blob/main/scripts/e2e-ci.sh#L433
I'm saying instead of setting it in the workflow file, just pass -mp instead of -m.

Member Author: I see, that makes much more sense. I'll change it to use the flag then.
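
The resulting workflow step might look roughly like this (a sketch only; the step name and the exact path to e2e-ci.sh are assumptions based on the snippets above, not the final diff):

    - name: Run e2e test
      env:
        HF_TOKEN: ${{ secrets.HF_TOKEN }}
      run: |
        . venv/bin/activate
        # -p preserves the temp dir (the flag sets PRESERVE=1 inside the script),
        # so the training log survives for the upload step that follows.
        ./scripts/e2e-ci.sh -mp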

.github/workflows/e2e-nvidia-l4-x1.yml (outdated; review thread resolved)
log_file=$(find /tmp -name "training_params_and_metrics_global0.jsonl")
mv "${log_file}" training-log.jsonl
- name: Upload training logs
uses: actions/upload-artifact@v4
Member: Can we use a hardened action here per the org policy?

Member Author: Yes, of course.

Member: Thanks! Can you include the version number in a comment next to it like we do elsewhere?

Member Author: Absolutely!
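
A hardened pin with the version recorded in a comment typically looks like the following (the commit SHA below is a placeholder, not the actual pin used in this PR, and the artifact name/path are assumptions based on the rename above):

    - name: Upload training logs
      # Pin the action to a full commit SHA and note the corresponding tag.
      uses: actions/upload-artifact@0123456789abcdef0123456789abcdef01234567 # v4
      with:
        name: training-log.jsonl
        path: ./training-log.jsonl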


- name: Download loss data
id: download-logs
uses: actions/download-artifact@v4
Member: Same comment on hardening here.

Member Author: Added it 🫡

.github/workflows/e2e-nvidia-l4-x1.yml (review thread resolved)
run: |
pip install -r requirements-dev.txt

- name: Try to upload to s3
Member: So this says "Try to upload to s3" - do we assume a chance of failure? What is the job behavior if this step succeeds versus if it fails?

Member Author: Should we fail the end-to-end run if we failed to upload to S3? If so, then we'd need to rerun the entire EC2 job. Otherwise, we could succeed and provide a warning in the PR/summary that we could not upload to S3.

There are two ways I'd think about this:

  1. Testing for functional correctness is primary and loss correctness is secondary; in that case, a failure to upload the loss data is not treated as a failure of the PR.
  2. Testing for loss correctness plus functional correctness is primary; in that case, a failure to upload to S3 carries the same weight as the tests themselves failing.

Let me know what your thoughts are on how we should proceed.

Member: Honestly I'm fine with either, up to you and the @instructlab/training-maintainers! My suggestion: we go for 1 but include something like this? https://stackoverflow.com/questions/74907704/is-there-a-github-actions-warning-state

Member Author: Yes, the warning state is exactly what I was looking for. Let's go with that.
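
One way to get that behavior is sketched below (assumptions for illustration: the upload uses the AWS CLI, and the bucket/object path are placeholders; the real step may differ):

    - name: Try to upload to s3
      id: upload-s3
      continue-on-error: true
      run: |
        aws s3 cp loss_curve.png "s3://example-bucket/loss-curves/${GITHUB_SHA}.png"

    - name: Warn if the upload failed
      if: steps.upload-s3.outcome == 'failure'
      run: |
        # Emits a warning annotation on the run instead of failing the job.
        echo "::warning::Could not upload the loss curve to S3; the summary will not include the loss graph."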

.github/workflows/e2e-nvidia-l40s-x4.yml (outdated; review thread resolved)
.github/workflows/e2e-nvidia-l40s-x4.yml (review thread resolved)
.github/workflows/e2e-nvidia-l40s-x4.yml (review thread resolved)
Labels: CI/CD (affects CI/CD configuration), dependencies (pull requests that update a dependency file), one-approval
Linked issue: Include loss curve in E2E tests