
Training seem to crash occasionally #3

Open
andreped opened this issue Aug 8, 2022 · 3 comments
Labels
bug Something isn't working

Comments

@andreped
Owner

andreped commented Aug 8, 2022

When training RL models using sapai-gym, various errors tend to occur.

I have tried to use try-except blocks, but the problem is that when one of these errors occurs, training with stable-baselines3 crashes and we have to start all over again.

We should therefore either: 1) fix what is bugged in sapai/sapai-gym, or 2) add a wrapper around the environment that catches these failures and tries to recover (if possible), roughly as sketched below.
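A minimal sketch of option 2, assuming the old 4-tuple gym step API that sapai-gym and sb3 used at the time; the wrapper name and the recovery behaviour (ending the episode with zero reward) are just illustrative:

```python
import gym


class CrashRecoveryWrapper(gym.Wrapper):
    """Catch exceptions raised by the wrapped env and end the episode early.

    Sketch only: instead of crashing the whole training run, a broken state
    simply terminates the episode so sb3 can keep going.
    """

    def step(self, action):
        try:
            return self.env.step(action)
        except Exception as exc:  # intentionally broad while the env bugs are unknown
            print(f"Environment crashed, ending episode early: {exc}")
            obs = self.env.reset()
            return obs, 0.0, True, {"env_crashed": True}
```

It would then just wrap the sapai-gym environment before it is passed to the model, e.g. `env = CrashRecoveryWrapper(SuperAutoPetsEnv(...))`.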

@andreped andreped changed the title Training seem to crash ocationally Training seem to crash occasionally Aug 8, 2022
@andreped andreped added the bug Something isn't working label Aug 18, 2022
@andreped
Owner Author

I've added a temporary fix for this, which essentially catches the crash and restarts training from the previous state, keeping all model history intact.

Need a proper fix for this in sapai/sapai-gym.
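For reference, the idea behind the temporary fix looks roughly like this (a sketch, not the actual train_agent.py code; the checkpoint path and policy are placeholders):

```python
from sb3_contrib import MaskablePPO


def train_with_restarts(env, total_timesteps, checkpoint_path="./latest_model"):
    """Resume from the last saved model whenever training crashes. Sketch only."""
    model = MaskablePPO("MlpPolicy", env, verbose=1)
    model.save(checkpoint_path)  # make sure there is always something to reload
    while model.num_timesteps < total_timesteps:
        try:
            # reset_num_timesteps=False keeps the step counter and logs intact
            model.learn(
                total_timesteps=total_timesteps - model.num_timesteps,
                reset_num_timesteps=False,
            )
            model.save(checkpoint_path)
        except Exception as exc:
            print(f"Training crashed after {model.num_timesteps} steps: {exc}")
            model = MaskablePPO.load(checkpoint_path, env=env)
    return model
```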

@andreped andreped self-assigned this Aug 20, 2022
@andreped
Owner Author

As I assumed all errors were coming from sapai-gym, I added a fix to catch all errors happening there:
andreped/sapai-gym@7443f36

However, to my surprise, when running a regular training run (now without the try/except loop in the main training script train_agent.py), I got an error from within sb3. This is more challenging to solve, and I'm not really sure what is causing it. See the error below, which appeared after about 250k steps:

Traceback (most recent call last):
  File ".\main.py", line 28, in <module>
    train_with_masks(ret)
  File "C:\Users\andrp\workspace\super-ml-pets\src\train_agent.py", line 60, in train_with_masks
    model.learn(total_timesteps=ret.nb_steps, callback=checkpoint_callback)
  File "C:\Users\andrp\workspace\super-ml-pets\venv38\lib\site-packages\sb3_contrib\ppo_mask\ppo_mask.py", line 579, in learn
    self.train()
  File "C:\Users\andrp\workspace\super-ml-pets\venv38\lib\site-packages\sb3_contrib\ppo_mask\ppo_mask.py", line 439, in train
    values, log_prob, entropy = self.policy.evaluate_actions(
  File "C:\Users\andrp\workspace\super-ml-pets\venv38\lib\site-packages\sb3_contrib\common\maskable\policies.py", line 280, in evaluate_actions
    distribution.apply_masking(action_masks)
  File "C:\Users\andrp\workspace\super-ml-pets\venv38\lib\site-packages\sb3_contrib\common\maskable\distributions.py", line 152, in apply_masking
    self.distribution.apply_masking(masks)
  File "C:\Users\andrp\workspace\super-ml-pets\venv38\lib\site-packages\sb3_contrib\common\maskable\distributions.py", line 62, in apply_masking
    super().__init__(logits=logits)
  File "C:\Users\andrp\workspace\super-ml-pets\venv38\lib\site-packages\torch\distributions\categorical.py", line 64, in __init__
    super(Categorical, self).__init__(batch_shape, validate_args=validate_args)
  File "C:\Users\andrp\workspace\super-ml-pets\venv38\lib\site-packages\torch\distributions\distribution.py", line 55, in __init__
    raise ValueError(
ValueError: Expected parameter probs (Tensor of shape (64, 213)) of distribution MaskableCategorical(probs: torch.Size([64, 213]), logits: torch.Size([64, 213])) to satisfy the constraint Simplex(), but found invalid values:
tensor([[4.9590e-11, 2.1976e-10, 6.1887e-01,  ..., 3.3524e-13, 4.5890e-12,
         5.3164e-14],
        [1.4266e-06, 8.7648e-10, 1.3233e-06,  ..., 1.5695e-07, 2.9451e-08,
         1.5212e-07],
        [2.2623e-06, 2.3994e-09, 5.3787e-07,  ..., 3.9735e-08, 2.8777e-09,
         2.6170e-08],
        ...,
        [1.6828e-12, 4.9032e-04, 9.5983e-13,  ..., 1.7402e-13, 1.9223e-13,
         5.6725e-14],
        [4.7819e-10, 7.7589e-03, 7.8509e-18,  ..., 6.4911e-11, 8.8994e-12,
         8.3013e-11],
        [3.6789e-08, 1.2760e-07, 4.7924e-16,  ..., 8.6682e-09, 8.6489e-10,
         3.7913e-08]], grad_fn=<SoftmaxBackward0>)
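For what it's worth, this Simplex() violation usually means the masked action probabilities are no longer a valid distribution, typically because the logits contain NaN/Inf (e.g. from bad observations or exploding gradients) or because an action mask rules out every action. A quick sanity check of the environment outputs could look like this (assuming the env exposes the action_masks() method that MaskablePPO relies on, flat numpy observations, and the old 4-tuple gym step API):

```python
import numpy as np


def validate_env_outputs(env, n_steps=10000):
    """Roll the env with random valid actions and flag outputs that would break masking."""
    obs = env.reset()
    for _ in range(n_steps):
        mask = np.asarray(env.action_masks(), dtype=bool)
        if not mask.any():
            raise RuntimeError("no valid action in the current state")
        if np.isnan(obs).any() or np.isinf(obs).any():
            raise RuntimeError("NaN/Inf in observation; this would poison the policy logits")
        action = np.random.choice(np.flatnonzero(mask))
        obs, reward, done, info = env.step(action)
        if done:
            obs = env.reset()
```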

@andreped
Owner Author

A random exception seems to happen after training for thousands of steps:

Exception: get_idx < pet-hedgehog 10-1 status-honey-bee 2-1 > not found

What is causing this?
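To track it down, one option could be to wrap env.step() and dump the full environment whenever this particular error fires, so the crashing state can be reproduced offline. A rough sketch (dump_dir is arbitrary, and pickling the env assumes sapai's objects are picklable):

```python
import os
import pickle
import time


def step_with_state_dump(env, action, dump_dir="./crash_dumps"):
    """Forward to env.step(), but pickle the whole env if the get_idx error occurs."""
    try:
        return env.step(action)
    except Exception as exc:
        if "get_idx" in str(exc):
            os.makedirs(dump_dir, exist_ok=True)
            path = os.path.join(dump_dir, f"env_{int(time.time())}.pkl")
            with open(path, "wb") as f:
                pickle.dump(env, f)
            print(f"Dumped crashing environment to {path}")
        raise
```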

@andreped andreped removed their assignment Apr 12, 2023
Labels
bug Something isn't working
Projects
Status: In Progress
Development

No branches or pull requests

1 participant