
feat: add log exporting to e2e tests #308

Open · wants to merge 1 commit into base: main

Conversation

@RobotSail (Member) commented on Oct 25, 2024:

Currently, the training library runs through a series of end-to-end tests which ensure there are
no bugs in the code being tested. However, we do not perform any form of validation to ensure that
the training logic and quality have not diminished.

This presents an issue where we can potentially be "correct" in the sense that no hard errors are hit,
while invisible bugs that cause models to regress in training quality, or other bugs that plague the
models themselves, can still seep in.

This commit fixes that problem by introducing the ability to export the training loss data itself
from the test and to render the loss curve using matplotlib.

When the results are output, they can be found under the "Summary" tab of a GitHub Actions run.
For example:

[Screenshot: rendered loss curve in the Actions run summary, 2024-10-25 6:18 PM]

Resolves #179

Signed-off-by: Oleg S [email protected]
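
For reference, a minimal sketch of what the render-and-publish step could look like (the step name, the create_loss_graph.py script and its flags, and the image URL are placeholders for illustration, not the exact contents of this PR):

    - name: Render loss curve and publish it to the job summary
      run: |
        # create_loss_graph.py is a hypothetical matplotlib script that reads the
        # JSONL training log and writes loss_curve.png.
        pip install matplotlib
        python create_loss_graph.py --log-file training-log.jsonl --output-file loss_curve.png

        # Markdown appended to GITHUB_STEP_SUMMARY is rendered under the
        # "Summary" tab of the Actions run; the image URL below is a placeholder.
        echo '![Training loss curve](https://example.com/loss_curve.png)' >> "$GITHUB_STEP_SUMMARY"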

@mergify mergify bot added CI/CD Affects CI/CD configuration ci-failure dependencies Pull requests that update a dependency file labels Oct 25, 2024
@mergify mergify bot added ci-failure and removed ci-failure labels Oct 25, 2024
@mergify mergify bot removed the ci-failure label Oct 25, 2024
.github/workflows/e2e-nvidia-l4-x1.yml (review thread resolved)
.github/workflows/e2e-nvidia-l4-x1.yml (outdated; review thread resolved)
@@ -156,6 +171,7 @@ jobs:
aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
aws-region: ${{ secrets.AWS_REGION }}

Member: Looks like there was one extra newline left here. Was that one intentional?

Member Author: No.

Member: Can you remove it?

Member Author: Yes, when I'm back from PTO.

Member Author: Done.

@JamesKunstle (Contributor) left a review: lgtm!

@mergify mergify bot added the one-approval label Oct 28, 2024
@@ -141,7 +141,22 @@ jobs:
HF_TOKEN: ${{ secrets.HF_TOKEN }}
run: |
. venv/bin/activate
# set preserve to true so we can retain the logs
export PRESERVE=1
Member: Where is this var used?

Member Author: Inside of the e2e test script. It prevents the temp dir from being deleted.

Member: Why not just use the flag?

Member Author: @nathan-weinberg Could you please help me understand why we wouldn't want to use the env here? If we don't expect it to be used, then we should just not have it in the script at all.

Member: The flag just sets the env: https://github.com/instructlab/instructlab/blob/main/scripts/e2e-ci.sh#L433
I'm saying instead of setting it in the workflow file, just pass -mp instead of -m.

Member Author: I see, that makes much more sense. I'll change it to use the flag then.
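
The resulting workflow step might look roughly like this (a sketch only; the step name and the exact path to e2e-ci.sh are assumptions based on the snippets above, not the final diff):

    - name: Run e2e test
      env:
        HF_TOKEN: ${{ secrets.HF_TOKEN }}
      run: |
        . venv/bin/activate
        # -p preserves the temp dir (the flag sets PRESERVE=1 inside the script),
        # so the training log survives for the upload step that follows.
        ./scripts/e2e-ci.sh -mp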

.github/workflows/e2e-nvidia-l4-x1.yml (outdated; review thread resolved)
log_file=$(find /tmp -name "training_params_and_metrics_global0.jsonl")
mv "${log_file}" training-log.jsonl
- name: Upload training logs
uses: actions/upload-artifact@v4
Member: Can we use a hardened action here per the org policy?

Member Author: Yes, of course.

Member: Thanks! Can you include the version number in a comment next to it like we do elsewhere?

Member Author: Absolutely!
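
A hardened pin with the version recorded in a comment typically looks like the following (the commit SHA below is a placeholder, not the actual pin used in this PR, and the artifact name/path are assumptions based on the rename above):

    - name: Upload training logs
      # Pin the action to a full commit SHA and note the corresponding tag.
      uses: actions/upload-artifact@0123456789abcdef0123456789abcdef01234567 # v4
      with:
        name: training-log.jsonl
        path: ./training-log.jsonl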


- name: Download loss data
id: download-logs
uses: actions/download-artifact@v4
Member: Same comment on hardening here.

Member Author: Added it 🫡

.github/workflows/e2e-nvidia-l4-x1.yml (review thread resolved)
run: |
pip install -r requirements-dev.txt

- name: Try to upload to s3
Member: So this says "Try to upload to s3" - do we assume a chance of failure? What is the job behavior if this step succeeds versus if it fails?

Member Author: Should we fail the end-to-end run if we failed to upload to S3? If so, then we'd need to rerun the entire EC2 job. Otherwise, we could succeed and provide a warning in the PR/summary that we could not upload to S3.

There are two ways I'd think about this:

  1. Testing for functional correctness is primary and loss correctness is secondary; in that case, a failure to upload the loss data is not treated as a failure of the PR.
  2. Testing for loss correctness plus functional correctness is primary; in that case, a failure to upload to S3 carries the same weight as the tests themselves failing.

Let me know what your thoughts are on how we should proceed.

Member: Honestly I'm fine with either, up to you and the @instructlab/training-maintainers! My suggestion: we go for 1 but include something like this? https://stackoverflow.com/questions/74907704/is-there-a-github-actions-warning-state

Member Author: Yes, the warning state is exactly what I was looking for. Let's go with that.
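
One way to get that behavior is sketched below (assumptions for illustration: the upload uses the AWS CLI, and the bucket/object path are placeholders; the real step may differ):

    - name: Try to upload to s3
      id: upload-s3
      continue-on-error: true
      run: |
        aws s3 cp loss_curve.png "s3://example-bucket/loss-curves/${GITHUB_SHA}.png"

    - name: Warn if the upload failed
      if: steps.upload-s3.outcome == 'failure'
      run: |
        # Emits a warning annotation on the run instead of failing the job.
        echo "::warning::Could not upload the loss curve to S3; the summary will not include the loss graph."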

.github/workflows/e2e-nvidia-l40s-x4.yml (outdated; review thread resolved)
.github/workflows/e2e-nvidia-l40s-x4.yml (review thread resolved)
.github/workflows/e2e-nvidia-l40s-x4.yml (review thread resolved)
Labels: CI/CD (affects CI/CD configuration), dependencies (pull requests that update a dependency file), one-approval
Linked issue: Include loss curve in E2E tests