how to solve reward dropping after reaching super human level #97

Closed
amineoui opened this issue Sep 14, 2023 · 9 comments
Labels
discussion (Discussion of a typical issue or concept), environment (New or improved environment), help wanted (Extra attention is needed)

Comments

@amineoui

How can I solve the reward dropping after reaching a superhuman level, or how can I save the model at this peak level before it starts dropping?
[screenshot attached]

@puyuan1996
Collaborator

  • Hello, in order to provide a more precise solution to your question, we need more detailed information about your task. Could you please provide specifics about the environment, the algorithm used, and the config file you are currently working with?

  • If you're using the LightZero framework, the best model from the historical evaluations is saved at a path like this: zoo/classic_control/cartpole/config/data_mz_ctree/cartpole_muzero_seed0/ckpt/ckpt_best.pth.tar. You can load this model from the training process by specifying its path in the model_path field of your config file (a minimal config sketch follows this list).

  • Please note that the path provided above is merely an example, and your actual path may vary based on your project setup and configurations.
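As a minimal illustration of loading such a checkpoint (assuming the LightZero convention of a model_path field under policy; the path below is only an example and should point to your own run's ckpt_best.pth.tar):

    policy=dict(
        # Hypothetical example: load the best evaluated checkpoint before
        # continuing training or evaluation. Replace the path with your own.
        model_path='./zoo/classic_control/cartpole/config/data_mz_ctree/cartpole_muzero_seed0/ckpt/ckpt_best.pth.tar',
        ...
    ),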

@puyuan1996 puyuan1996 added the discussion Discussion of a typical issue or concept label Sep 15, 2023
@amineoui
Author

amineoui commented Sep 16, 2023

algo: sampled_efficientzero
env: I'm simulating market trading as my custom env using GAF features. It works when training on 1 month of data, but not on a 1-year dataset.
I'm also wondering how to feed direct or transformed data with a shape like (7, 9) using the mlp model_type.

This is my config:

image_channel = 7
shape = (7, 9, 9)
file_name = 'shape/shape_7_9_9_1month.npy'

collector_env_num = 16
n_episode = 16
evaluator_env_num = 4
continuous_action_space = False
K = 3  # num_of_sampled_actions
num_simulations = 10
update_per_collect = 10
batch_size = 256
max_env_step = int(1e9)
reanalyze_ratio = 0.9

data_sampled_efficientzero_config = dict(
    exp_name=f'result/stocks_sampled_efficientzero_ns{num_simulations}_upc{update_per_collect}_rr{reanalyze_ratio}_seed0',
    env=dict(
        env_name='my_custom_env',
        env_id='my_custom_env',
        env_file_name=file_name,
        obs_shape=shape,
        collector_env_num=collector_env_num,
        evaluator_env_num=evaluator_env_num,
        n_evaluator_episode=evaluator_env_num,
        manager=dict(shared_memory=False, ),
    ),
    policy=dict(
        model=dict(
            model_type='conv',  # 'mlp' or 'conv'
            observation_shape=shape,
            frame_stack_num=1,
            image_channel=image_channel,
            action_space_size=K,
            # downsample=True,
            lstm_hidden_size=512,
            latent_state_dim=512,
            continuous_action_space=continuous_action_space,
            num_of_sampled_actions=K,
            discrete_action_encoding_type='one_hot',
            norm_type='BN',
        ),
        cuda=True,
        env_type='not_board_games',
        game_segment_length=400,
        # use_augmentation=True,
        update_per_collect=update_per_collect,
        batch_size=batch_size,
        optim_type='Adam',
        lr_piecewise_constant_decay=False,
        learning_rate=0.001,
        num_simulations=num_simulations,
        reanalyze_ratio=reanalyze_ratio,
        policy_loss_type='cross_entropy',
        n_episode=n_episode,
        eval_freq=int(2e2),
        replay_buffer_size=int(1e9),  # the size/capacity of the replay buffer, in terms of transitions
        collector_env_num=collector_env_num,
        evaluator_env_num=evaluator_env_num,
    ),
)

@puyuan1996
Collaborator

Hello,

Here are some modification recommendations to your configuration file, mainly focusing on the following aspects:

  • Number of simulations: we have increased num_simulations so that more MCTS simulations are run per step. In the original configuration, this parameter may have been set too low, which could be a primary factor contributing to the suboptimal performance.
  • Updates after data collection: we have increased update_per_collect in the configuration, which allows more network updates after each round of data collection. With a larger update_per_collect, the network updates more frequently and can potentially adapt to the collected data more quickly. This can be beneficial when the data distribution is non-stationary or changes rapidly over time.
  • Replay buffer size: to optimize memory usage and improve performance, replay_buffer_size has been adjusted from 1e9 to 1e6. A replay_buffer_size that is too large can consume excessive memory and potentially result in suboptimal performance.
collector_env_num = 8
n_episode = 8
evaluator_env_num = 5
num_simulations = 50
update_per_collect = 200
replay_buffer_size=int(1e6), 
game_segment_length=400, # TODO: adjust according to your episode length
  • It seems that your environment takes a three-dimensional array of shape (7, 9, 9) as the observation input, where 9 is the number of stacked frames. If your environment requires vector inputs rather than images, you might want to consider using a multilayer perceptron (MLP) model. Specifically, you can flatten the original 3-dimensional observation into a vector and use it as the input observation (a small flattening sketch follows this list). Regarding the configuration of the model, you can refer to this config.
  • Also, given that your action space is discrete with 3 actions, I would recommend prioritizing the MuZero algorithm. Sampled EfficientZero is primarily designed for environments with a continuous action space.
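For reference, here is a minimal sketch of the flattening mentioned above (the 7 * 9 * 9 = 567 observation size and the config fragment are assumptions for illustration, not a tested setup):

    import numpy as np

    # Flatten a (7, 9, 9) observation into a 1D vector so that it can be fed
    # to an mlp-type model instead of a conv-type model.
    obs = np.zeros((7, 9, 9), dtype=np.float32)  # placeholder observation
    flat_obs = obs.reshape(-1)                   # shape (567,)

    # The model config would then use a 1D observation_shape, e.g.:
    # model=dict(model_type='mlp', observation_shape=7 * 9 * 9, ...)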

These optimization suggestions aim to enhance the model's performance while balancing efficiency and memory usage. I hope you find these recommendations helpful.

@amineoui
Author

(7, 9, 9) means 7 images of size 9x9. I also found a problem with that: I have to declare it as (7, 9, 9) but feed it to the model as (9, 9, 7); that is the only way I got it to work. I apply this code to change the shape without affecting the images:

    def restack(self, gaf_images):
        # Stack the 7 GAF images of shape (9, 9) along a new last axis,
        # producing an array of shape (9, 9, 7).
        images = []
        for i, gaf_image in enumerate(gaf_images):
            images.append(gaf_image)
        image_tensor = np.stack(images, axis=-1)
        return image_tensor

Is this correct, or did I make a mistake?

Also, what about the size of the neural network and its hidden layers? I think that is also important for handling more data, or am I wrong?
If so, what would you recommend changing, e.g. fc_policy_layers, fc_value_layers, ... in the model?

Thank you so much @puyuan1996

@puyuan1996
Collaborator

Hello,

  • Your method to reshape the image stack from (7, 9, 9) to (9, 9, 7) seems correct. The restack function you wrote is essentially moving the first axis (which has 7 elements) to the end. Here is the simplified version of your function using numpy's built-in transpose function:
def restack(self, gaf_images):
    """
    Restack the images along the last dimension.
    Args:
        gaf_images (np.array): array of images with shape (7, 9, 9).
    Returns:
        image_tensor (np.array): reshaped array of images with shape (9, 9, 7).
    """
    image_tensor = np.transpose(gaf_images, (1, 2, 0))
    return image_tensor

This function will transpose the tensor from shape (7, 9, 9) to (9, 9, 7). However, for our implementation of the MuZero algorithm, the input to a conv type model should indeed be in the form of images with a shape like (7,9,9). In this case, the first dimension represents the number of channels, while the following two dimensions correspond to the width and height of the image, respectively. You may refer to the existing Atari MuZero configuration as an example.
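If your pipeline currently produces channels-last arrays, here is a minimal sketch of converting back to the channels-first layout expected by the conv-type model (the variable names are illustrative):

    import numpy as np

    # Convert an HWC array of shape (9, 9, 7) back to the channels-first
    # CHW layout (7, 9, 9) expected by the conv-type model.
    obs_hwc = np.zeros((9, 9, 7), dtype=np.float32)
    obs_chw = np.transpose(obs_hwc, (2, 0, 1))  # shape (7, 9, 9)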

  • Based on our experimental experience, the default LightZero configuration should provide adequate network capacity for tasks whose complexity is on par with Atari games. The performance degradation observed in your experiments is likely due to the other factors described here.
  • We recommend that you adjust and optimize your configuration parameters following the guidance provided earlier and then rerun the experiments. We anticipate that these revisions will lead to improved results.

Best wishes for your experiments.

@amineoui
Author

Hello, Mr. @puyuan1996! I want to express my sincere gratitude for your kindness, and I must say that this repository is truly an astonishing work of AI art. Your effort and dedication shine brightly in this project, and it's genuinely commendable. Great job!

I'm trying to teach the AI to only observe and take no action until an expiration time, when it receives a reward; only then is it allowed to take another action.
I mean it keeps observing and learning with no action until it gets the reward, and then it is allowed to take another action.

Is this possible?

I was thinking about these parameters, but I'm not sure. Can you please guide me?

to_play=-1
action_mask = np.array([1., 1., 1.], dtype=np.float32)
obs = {'observation': to_ndarray(obs), 'action_mask': action_mask, 'to_play': to_play}

I tried:
to_play = -1
action_mask = [0., 1., 0.], but it gives me an error on child_visit_segment; it ends up like a [1] object array.

I also tried:
to_play = -1 for the AI and to_play = 1 for the other player, with
action_mask = np.array([1., 1., 1.], dtype=np.float32)

@puyuan1996
Collaborator

Hello,

  • First of all, thank you for your support and encouragement.

Regarding your question about the special environment's MDP:

  • I understand that your special environment requires the agent to generate actions based on the observations and rewards provided by the environment within a specific threshold time, and that only when the reward value meets certain conditions do the generated actions take effect on the environment. Is it the case that the environment's step method does not require the agent's action as input until the reward value meets those conditions?
  • I believe this setup can be implemented by modifying your Gym environment. However, you need to decide whether to collect the observation-reward pairs before the reward condition is met, and whether the agent should be trained during this period; different choices require different handling on both the environment and the algorithm side. A minimal sketch of this idea is shown below.
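The following is only a hypothetical sketch of that idea, not LightZero API: a gym-style step() that ignores the agent's action during the observation-only phase and applies it only once the expiration/reward condition is met. The class name, expiration_steps, and the placeholder reward logic are all illustrative assumptions.

    import numpy as np

    class ObserveThenActEnvSketch:
        """Hypothetical sketch: the agent's action only takes effect once the
        expiration/reward condition is met; before that, step() ignores it."""

        def __init__(self, expiration_steps=10):
            self.expiration_steps = expiration_steps
            self._step_count = 0

        def _get_obs(self):
            # Placeholder observation with the (7, 9, 9) shape discussed above.
            return np.zeros((7, 9, 9), dtype=np.float32)

        def step(self, action):
            self._step_count += 1
            reward = 0.0
            if self._step_count % self.expiration_steps == 0:
                # Expiration reached: the agent's action takes effect now.
                reward = float(action) - 1.0  # placeholder reward logic
            # Fixed 3-dimensional discrete action space -> all-ones action_mask.
            obs = {
                'observation': self._get_obs(),
                'action_mask': np.array([1., 1., 1.], dtype=np.float32),
                'to_play': -1,
            }
            return obs, reward, False, {}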

Regarding your question about action_mask and to_play:

  • to_play is an integer variable used in board game environments, indicating the index of the player who needs to take the next action. Its value range is {1, 2}. However, for single-player game environments such as Atari and single-player board games like 2048, to_play should be set to -1, indicating that this is a single-player environment.
  • action_mask is an A-dimensional numpy array representing the valid actions in environments where the action space can vary. For example, in the tic-tac-toe environment, where the original complete discrete action space is 9-dimensional, action_mask is a 9-dimensional numpy array: a value of 1 indicates that the corresponding action is valid, while a value of 0 indicates that the action is invalid. For environments like Atari with a fixed discrete action space, action_mask should be an all-ones numpy array. For continuous action space environments like MuJoCo, action_mask should be set to None, as the set of valid actions is infinite. (See the short sketch after this list.)
  • Please note that we have not yet conducted tests in multi-player (more than two players) adversarial game environments. If you plan to integrate the LightZero algorithm into a multi-player game (like Dou Di Zhu), you may need to adjust the code accordingly. For multi-agent cooperative environments, you could consider using the concepts from Multi-Agent Reinforcement Learning (MARL). One initial idea is to regard it as an independent learning process. For specific implementation, you can refer to this paper, and our example cases in pettingzoo and GoBigger environments, detailed in this PR.
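As a quick illustration of those conventions (the array sizes and values below are made up for illustration):

    import numpy as np

    # Fixed discrete action space (e.g. Atari): all actions are always valid.
    obs_atari = {'action_mask': np.ones(18, dtype=np.float32), 'to_play': -1}

    # Varying discrete action space (e.g. tic-tac-toe): 1 = valid, 0 = invalid.
    obs_tictactoe = {
        'action_mask': np.array([1, 0, 1, 1, 0, 1, 1, 1, 0], dtype=np.float32),
        'to_play': 1,
    }

    # Continuous action space (e.g. MuJoCo): no finite set of valid actions.
    obs_mujoco = {'action_mask': None, 'to_play': -1}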

Best Wishes.

@puyuan1996 puyuan1996 added help wanted Extra attention is needed environment New or improved environment labels Sep 27, 2023
@amineoui
Author

amineoui commented Sep 27, 2023

Hello, Mr. @puyuan1996, thank you so much for your help and kindness. I notice that ckpt_best.pth.tar is not saved on every new best evaluation during training. What factor is used to decide when to save ckpt_best.pth.tar? It seems to save only 1 to 3 times and no more, even after reaching many better scores; sometimes it saves, sometimes not. I don't clearly understand the factors or parameters that control it.

Also, I still sometimes get spikes on my GPU and run into memory limitations; for memory, as long as I don't feed high-resolution data it works.
Even with the spikes in GPU 3D utilization, it still works and trains, it just takes some time.

[screenshot attached]

I'm really wondering why ckpt_best.pth.tar is not saved; my last training saved it only the first time, even though learning kept improving.
Is it based on reward_std? Can I change it to something else?

I also get an error on eval after training finishes; the returns is a list of None: [None, None, None, ...]
[screenshot attached]

@puyuan1996
Collaborator

puyuan1996 commented Oct 7, 2023

Hello,

Regarding the storage frequency of model checkpoints (ckpt), LightZero's underlying implementation is based on DI-engine, which uses a hook mechanism to save the model's checkpoints. You can refer to the test file for more details. You can adjust the following settings under the policy field in the configuration file to change the storage frequency of the model checkpoints:

 policy=dict(
    ...
    learn=dict(
        learner=dict(
            hook=dict(
                save_ckpt_after_iter=200,
                save_ckpt_after_run=True,
                log_show_after_iter=100,
            ),
        ),
    ),
    ...
 ),

In this configuration:

  • The save_ckpt_after_iter parameter controls how often a model checkpoint is saved after a certain number of iterations.
  • The save_ckpt_after_run parameter indicates whether to save the model again after all specified training iterations have ended.
  • The log_show_after_iter parameter is used to set the frequency of displaying training statistics on the command line.

Regarding the return value error of eval_muzero, this is due to a change in the muzero_evaluator API. If you pull the latest code, this issue should no longer exist.

Good luck!
