
add batch environment #146

Draft · wants to merge 28 commits into master
Conversation

Sid-Bhatia-0 (Member)

  1. Create a separate module for a new batch environment SingleRoomUndirectedBatch.
  2. Redesign objects and actions. Try to keep things as simple as possible for now.

This is a work in progress. Several things need to be added:

  1. Parallelization using multi-threading / GPU.
  2. Interactive playability of batch environments in the REPL for testing.
  3. Tests.
  4. Update README.

codecov-commenter commented Jun 15, 2021

Codecov Report

Merging #146 (6d114c6) into master (6d87e18) will decrease coverage by 5.80%.
The diff coverage is 30.29%.


@@            Coverage Diff             @@
##           master     #146      +/-   ##
==========================================
- Coverage   78.21%   72.40%   -5.81%     
==========================================
  Files          21       24       +3     
  Lines        2226     2533     +307     
==========================================
+ Hits         1741     1834      +93     
- Misses        485      699     +214     
| Impacted Files | Coverage Δ |
| --- | --- |
| src/GridWorlds.jl | 100.00% <ø> (ø) |
| src/envs/envs.jl | 100.00% <ø> (ø) |
| src/envs/single_room_undirected.jl | 0.00% <0.00%> (ø) |
| src/play.jl | 0.00% <0.00%> (ø) |
| src/envs/single_room_undirected_batch.jl | 57.05% <57.05%> (ø) |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Sid-Bhatia-0 (Member, Author)

Added REPL-based playing.
Added tests.
Added a script for benchmarking various aspects of batch environments.

Here are some benchmarks for this environment with batch size 64 on a single thread without any parallelization. I'm sure the numbers can be improved. These are just preliminary benchmarks.

Right now I copy the state, reward, and done arrays each time RLBase.state, RLBase.reward, or RLBase.is_terminated is called. @findmyway maybe we could have an RLBase.state! method in the RLBase API that writes into pre-allocated memory for the state vector? Otherwise there is always the question of whether or not to return a copy of the state. If we directly return env.tile_map, there is a risk of unintended consequences; if we generate a new copy each time, that is inefficient.

Date: 2021_06_17_23_41_30

List of Environments

  1. SingleRoomUndirectedBatch

Benchmarks

SingleRoomUndirectedBatch

Run uniformly random policy, NUM_ENVS = 64, NUM_RESETS = 100, STEPS_PER_RESET = 100, TOTAL_STEPS = 10000

  memory estimate:  118.94 MiB
  allocs estimate:  3256401
  minimum time:     245.287 ms (1.38% GC)
  median time:      248.607 ms (2.08% GC)
  mean time:        250.885 ms (1.88% GC)
  maximum time:     277.825 ms (1.28% GC)
  samples:          20
  evals/sample:     1

GridWorlds.ModuleSingleRoomUndirectedBatch.SingleRoomUndirectedBatch()

  memory estimate:  2.23 MiB
  allocs estimate:  928
  minimum time:     1.481 ms (0.00% GC)
  median time:      1.559 ms (0.00% GC)
  mean time:        1.689 ms (3.84% GC)
  maximum time:     4.262 ms (30.46% GC)
  samples:          2952
  evals/sample:     1

RLBase.reset!(env)

  memory estimate:  7.00 KiB
  allocs estimate:  64
  minimum time:     6.581 μs (0.00% GC)
  median time:      7.057 μs (0.00% GC)
  mean time:        8.222 μs (1.48% GC)
  maximum time:     197.964 μs (93.81% GC)
  samples:          10000
  evals/sample:     5

RLBase.state(env)

  memory estimate:  1.66 KiB
  allocs estimate:  2
  minimum time:     196.419 ns (0.00% GC)
  median time:      259.854 ns (0.00% GC)
  mean time:        291.363 ns (9.01% GC)
  maximum time:     2.413 μs (56.48% GC)
  samples:          10000
  evals/sample:     676

RLBase.action_space(env)

  memory estimate:  0 bytes
  allocs estimate:  0
  minimum time:     2.533 ns (0.00% GC)
  median time:      2.797 ns (0.00% GC)
  mean time:        2.943 ns (0.00% GC)
  maximum time:     35.666 ns (0.00% GC)
  samples:          10000
  evals/sample:     1000

RLBase.is_terminated(env)

  memory estimate:  128 bytes
  allocs estimate:  2
  minimum time:     61.851 ns (0.00% GC)
  median time:      65.807 ns (0.00% GC)
  mean time:        76.281 ns (9.40% GC)
  maximum time:     3.087 μs (96.85% GC)
  samples:          10000
  evals/sample:     975

RLBase.reward(env)

  memory estimate:  336 bytes
  allocs estimate:  1
  minimum time:     56.887 ns (0.00% GC)
  median time:      62.510 ns (0.00% GC)
  mean time:        70.044 ns (5.47% GC)
  maximum time:     1.226 μs (92.14% GC)
  samples:          10000
  evals/sample:     976

env(1)

  memory estimate:  10.00 KiB
  allocs estimate:  320
  minimum time:     20.442 μs (0.00% GC)
  median time:      22.449 μs (0.00% GC)
  mean time:        25.575 μs (1.55% GC)
  maximum time:     2.016 ms (98.68% GC)
  samples:          10000
  evals/sample:     1

env(2)

  memory estimate:  10.00 KiB
  allocs estimate:  320
  minimum time:     20.467 μs (0.00% GC)
  median time:      22.522 μs (0.00% GC)
  mean time:        25.235 μs (1.59% GC)
  maximum time:     2.075 ms (98.64% GC)
  samples:          10000
  evals/sample:     1

env(4)

  memory estimate:  10.00 KiB
  allocs estimate:  320
  minimum time:     20.011 μs (0.00% GC)
  median time:      21.992 μs (0.00% GC)
  mean time:        24.767 μs (1.61% GC)
  maximum time:     2.046 ms (98.64% GC)
  samples:          10000
  evals/sample:     1

env(3)

  memory estimate:  10.00 KiB
  allocs estimate:  320
  minimum time:     20.017 μs (0.00% GC)
  median time:      22.013 μs (0.00% GC)
  mean time:        24.846 μs (1.60% GC)
  maximum time:     2.060 ms (98.60% GC)
  samples:          10000
  evals/sample:     1

zsunberg (Member) commented Jun 19, 2021

If we directly return env.tile_map, there is a risk of unintended consequences.

In my experience, you are right to worry about this. Returning a reference to the internal state of the environment will result in a lot of frustrating unexpected behavior. On the other hand, I don't have experience with "batch" environments, so I may be missing something important about how this will be used.

If we generate a new copy each time, that is inefficient.

I am not sure that it is that inefficient. Copying memory is not too bad; the only thing that is really expensive is allocating memory on the heap (which happens when you create a new variable-sized array).

If the compiler knows the size of the grid world and you use static arrays, I think it should be pretty efficient to return a copy.
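Something along these lines (a minimal sketch, assuming the grid size — here 8×8 — is known at compile time and that env.tile_map holds Bools; SMatrix is from StaticArrays.jl):

using StaticArrays

# Fixed-size copy: isbits, no heap allocation, safe to hand to the caller.
RLBase.state(env) = SMatrix{8,8}(env.tile_map)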

Sid-Bhatia-0 (Member, Author)

Thank you for your comments, @zsunberg.

On the other hand, I don't have experience with "batch" environments, so I may be missing something important about how this will be used.

From my understanding, this part shouldn't be too different for single vs. batch environments (I am imagining a simple case where all the state vectors are batched together and fed to a neural network as an input vs. a single state vector fed to the neural network as input).

Returning a reference to the internal state of the environment will result in a lot of frustrating unexpected behavior.

To avoid this safely, we need at least one memory-copy operation anyway. This holds true even for batch environments.

I am not sure that it is that inefficient. Copying memory is not too bad; the only thing that is really expensive is allocating memory on the heap (which happens when you create a new variable-sized array).

Yes. In the current implementation of all the single environments, fetching the state looks like this:

RLBase.state(env) = copy(env.tile_map)

As you pointed out, it is relatively expensive to (unnecessarily) allocate new memory on the heap for each step of the game loop.
We could do better by letting the algorithm allocate the desired memory for the state vector only once, and then do something like the following at each step:

RLBase.state!(state_cache, env) = copy!(state_cache, env.tile_map)

This requires only one memory allocation during initialization, and then one memory-copy operation at each time step (which we inevitably need in order to avoid unintended mutation of the environment's internal state).
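For example, the game loop could look like this (a usage sketch built on the hypothetical RLBase.state! above; the loop details are illustrative):

state_cache = similar(env.tile_map)          # single allocation, done once
for step in 1:num_steps
    RLBase.state!(state_cache, env)          # one in-place copy per step
    action = rand(RLBase.action_space(env))  # or whatever the algorithm picks
    env(action)
end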

If the compiler knows the size of the grid world and you use static arrays, I think it should be pretty efficient to return a copy.

Right. I think it might still be slightly better to pre-allocate memory for a mutable static array than to create a new static array at each step, though I'm not totally sure about this; the two might be pretty close for static arrays. More importantly, the number of elements in the state vector could go well above 100 even for small grid worlds, so static arrays might not be a scalable solution, especially for larger grid worlds. Maybe the maximum useful size would be higher for static BitArrays. I don't know if BitArrays are supported in static arrays though.

Another benefit of having something like RLBase.state! is consistency. As opposed to RLBase.state, RLBase.state! should hopefully clearly communicate that the array has been preallocated, and one need not worry about whether to return a copy of the state vector or not.

@findmyway Please let us know your thoughts as well.

zsunberg (Member) commented Jun 19, 2021

I don't know if BitArrays are supported in static arrays though.

Unfortunately no :( JuliaArrays/StaticArrays.jl#412

I did some benchmarks:

using StaticArrays, BenchmarkTools

World = falses(20, 20)  # a 20×20 BitMatrix

display(@benchmark SMatrix{20,20}($World)) # mean: 103 ns; max:  248 ns (0 alloc)
display(@benchmark MMatrix{20,20}($World)) # mean: 239 ns; max: 3384 ns (1 alloc)
display(@benchmark copy($World))           # mean:  69 ns; max: 2297 ns (2 alloc)

It looks like the savings of keeping it in terms of bits rather than Bools is quite significant - in these tests it actually seems to outweigh the cost of garbage collection/allocation on average.

It seems like state! would indeed be most efficient. However, I don't really like the idea of state! being part of the interface, because then environment implementers will have to implement it. (By the way, this is why I thought optional interface functions and provided would be useful in CommonRLInterface: algorithm writers can use state! if it is provided, and fall back to a slower state if it is not, and the AutomaticDefault submodule could make this automatic.)
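A rough sketch of that fallback pattern (assuming CommonRLInterface's provided; state! and fetch_state! here are hypothetical names, not existing interface functions):

using CommonRLInterface

# Hypothetical helper: use the optional in-place state! when the environment
# provides it, otherwise fall back to copying the freshly allocated state.
function fetch_state!(buffer, env)
    if provided(state!, env)
        state!(buffer, env)           # fast path: environment fills the buffer
    else
        copyto!(buffer, state(env))   # slow path: allocate, then copy over
    end
    return buffer
end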

findmyway (Member)

Thanks for @zsunberg's comments.

In summary, we are mainly discussing the following points here:

  1. Should we copy the state?
  2. If so, how should the state be represented?

Re #1: for this package (GridWorlds.jl) alone, I prefer to expose the inner state directly. We don't know what package users will do with the environment states, and making decisions for users is not good practice in my opinion. For example, one may simply transfer the state from CPU to GPU and do some heavy operations there, or one may want to resize the image; always copying the state is a waste in these cases. However, the main problem here is that this package is now tightly bound to RLBase.jl (which I warned @Sid-Bhatia-0 against in the early stages of this package... ). RLBase.state doesn't specify whether the state is mutable or not, so one sometimes needs to add a wrapper to copy the state. And some other interfaces may have different assumptions from RLBase (e.g. CommonRLInterface.jl or the interfaces used in AlphaZero.jl).

Re #2: based on @zsunberg's benchmark results, directly copying seems to be the fastest.

As for how to implement the "batch" environment: this is still experimental, the goal being to leverage as many resources as possible to run multiple replicas at the same time.

@Sid-Bhatia-0 could you make a summary of the benchmark results you listed above? I'm afraid it's too verbose, making it hard for others to understand what you want to express here.

Sid-Bhatia-0 (Member, Author)

Unfortunately no :( JuliaArrays/StaticArrays.jl#412

I see.

However, I don't really like the idea of state! being part of the interface, because then environment implementers will have to implement it. (By the way, this is why I thought optional interface functions and provided would be useful in CommonRLInterface: algorithm writers can use state! if it is provided, and fall back to a slower state if it is not, and the AutomaticDefault submodule could make this automatic.)

Right. This is an interface concern. I don't have an informed opinion on this yet. I haven't looked into CommonRLInterface much, but maybe the solution could be along the lines of "optional interfaces", as you mentioned.

We don't know what package users will do with the environment states, and making decisions for users is not good practice in my opinion. For example, one may simply transfer the state from CPU to GPU and do some heavy operations there, or one may want to resize the image; always copying the state is a waste in these cases.

Yes. You are right that not all users will want to go with the defaults; some may want to do custom things to get the environment state, and may need direct access to the internal state. But making that the default behavior might cause more problems than it solves (for most users). Moreover, isn't that what method overloading is for? If someone wants to retrieve the internals and do anything fancy with them, they can always access the env struct and overload the default RLBase.state method like this:

RLBase.state(env) = env.tile_map

However, the main problem here is that this package is now tightly bound to RLBase.jl (which I warned @Sid-Bhatia-0 against in the early stages of this package... )

Correct me if I am missing something, but here is what I understand... From a usability standpoint, using the RLBase interface for GridWorlds.jl seems like the obvious choice to me because of the ability to leverage the existing infrastructure (RLCore, RLZoo) around it. Sure, I could have created my own separate interface for this package, but I don't see the point in creating yet another general, flexible RL environment interface from scratch, especially since it would not provide any algorithms or examples of its own to go along with the environments.

And some other interfaces may have different assumptions from RLBase (e.g. CommonRLInterface.jl or the interfaces used in AlphaZero.jl).

Sure. And if one wants to use some other interface, they will have to write the conversion anyway. Compare the following:

MyCustomInterface.get_state(env) = RLBase.state(env)

and

MyCustomInterface.get_state(env) = HypotheticalGridWorldInterface.state(env)

Why would the first one be worse than the second one?

RLBase.state doesn't specify whether the state is mutable or not.

Wouldn't this cause confusion if left unspecified?

@Sid-Bhatia-0 could you make a summary of the benchmark results you listed above? I'm afraid it's too verbose, making it hard for others to understand what you want to express here.

The benchmarks I mentioned are just BenchmarkTools.@benchmark runs for all the common RL operations of an environment, like env(action), RLBase.state(env), RLBase.reset!(env), etc., along with instantiating an environment and an overall benchmark of running a random policy for 10000 steps. I can use the @btime macro instead of @benchmark to reduce verbosity, but beyond that, compressing it further will result in loss of relevant information.
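For reference, the condensed form might look like this (a sketch assuming BenchmarkTools.jl; the example output in the comment is taken from the RLBase.reset! numbers above):

using BenchmarkTools

@btime RLBase.reset!($env)  # prints, e.g., "6.581 μs (64 allocations: 7.00 KiB)"
@btime RLBase.state($env)
@btime RLBase.reward($env)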

findmyway (Member)

Moreover, isn't that what method overloading is for? If someone wants to retrieve the internals and do anything fancy with them, they can always access the env struct and overload the default RLBase.state method like this:

RLBase.state(env) = env.tile_map

Well, it's doable though...

Why would the first one be worse than the second one?

This is just a matter of taste in software design. Personally I prefer

  A
 / \
B   C

instead of

A
|
B
|
C

(that is, B and C each depending directly on a shared base A, rather than a chain where C depends on B, which in turn depends on A)

RLBase.state doesn't specify whether the state is mutable or not.

Wouldn't this cause confusion if left unspecified?

Yes, I think so. Let me add a warning in the docstring first.

I can use the @btime macro instead of @benchmark to reduce verbosity, but beyond that, compressing it further will result in loss of relevant information.

Yes, please do. And it would be great to make it into a table so that we can easily compare different implementations later.

findmyway added a commit to JuliaReinforcementLearning/ReinforcementLearning.jl that referenced this pull request Jun 21, 2021
Sid-Bhatia-0 (Member, Author)

This is just a matter of taste in software design. Personally I prefer

  A
 / \
B   C

instead of

A
|
B
|
C

I see. Thank you for the clarification. I'll think more on it later. Here's an issue to track it: #148

And it would be great to make it into a table so that we can easily compare different implementations later.

Right. A table would be very nice to have. I'll see if I can fit them on a page. I'll do this in a separate PR. Here's an issue to track it: #147

Sid-Bhatia-0 (Member, Author) commented Jun 24, 2021

Ok. So here are some benchmarks comparing optimized single-threaded CPU implementations of batch vs. non-batch for the SingleRoomUndirected environment.

For the random_policy row, NUM_RESETS = 100, STEPS_PER_EPISODE = 100.

SingleRoomUndirected (non-batch)

| operation | allocations | time |
| --- | --- | --- |
| random_policy | 0 bytes | 316.639 μs |
| reset | 0 bytes | 64.814 ns |
| state | 0 bytes | 1.875 ns |
| MOVE_UP | 0 bytes | 12.734 ns |
| MOVE_DOWN | 0 bytes | 14.151 ns |
| MOVE_LEFT | 0 bytes | 13.275 ns |
| MOVE_RIGHT | 0 bytes | 14.229 ns |
| action_space | 0 bytes | 2.185 ns |
| is_terminated | 0 bytes | 1.842 ns |
| reward | 0 bytes | 1.828 ns |

SingleRoomUndirectedBatch (NUM_ENVS = 1)

| operation | allocations | time |
| --- | --- | --- |
| random_policy | 96 bytes | 629.735 μs |
| reset | 0 bytes | 86.335 ns |
| state | 0 bytes | 2.914 ns |
| MOVE_UP | 0 bytes | 43.705 ns |
| MOVE_DOWN | 0 bytes | 43.630 ns |
| MOVE_LEFT | 0 bytes | 44.786 ns |
| MOVE_RIGHT | 0 bytes | 44.370 ns |
| action_space | 0 bytes | 2.426 ns |
| is_terminated | 0 bytes | 2.432 ns |
| reward | 0 bytes | 2.799 ns |

SingleRoomUndirectedBatch (NUM_ENVS = 1024)

| operation | allocations | time |
| --- | --- | --- |
| random_policy | 8.12 KiB | 383.722 ms |
| reset | 0 bytes | 111.034 μs |
| state | 0 bytes | 1.879 ns |
| MOVE_UP | 0 bytes | 15.101 μs |
| MOVE_DOWN | 0 bytes | 14.594 μs |
| MOVE_LEFT | 0 bytes | 14.129 μs |
| MOVE_RIGHT | 0 bytes | 14.021 μs |
| action_space | 0 bytes | 1.854 ns |
| is_terminated | 0 bytes | 2.200 ns |
| reward | 0 bytes | 1.887 ns |

The batch version with NUM_ENVS = 1 is worse than the non-batch version, which was expected.

The scaling from non-batch to batch is roughly linear (slightly worse than linear) in NUM_ENVS for large NUM_ENVS, which is reasonable. Both the batch and non-batch versions are fairly optimized, so we don't observe any gains on a single CPU thread. Right now we are only benchmarking the RLBase operations on these environments, without any other overhead; if we benchmark end-to-end with an actual algorithm, we should see some benefit from the batch version even on a single CPU thread, since it uses a struct-of-arrays layout.

Note: the allocation during random_policy for the batch version is a one-time allocation for storing the batch of actions generated at each step.

findmyway (Member)

Hi @Sid-Bhatia-0,

After thinking about it for a while, I wonder whether it is possible to implement the batch environments with one unified codebase.

First, let's assume the game state and game spec are separated. Basically, a game state may look like this:

struct SomeSpecificGameState
    board::AbstractArray{Bool}
    agent_position::CartesianIndex
    reward::Float32
end

Then, ideally we can use StructArrays.jl to create a vector of game states. As long as our game logic is implemented in pure Julia code, the same code should work well on both CPU and GPU.

The only thing I haven't figured out is how to update scalar values under the StructArray context (the agent_position and reward fields in the above example), because each element is now an immutable struct.

Based on the example in StructArrays.jl, the whole element is updated with the setindex! method. Maybe we can modify the syntax slightly, from dest[i].reward = src[i].reward to dest[:reward, i] = src[:reward, i], to overcome the immutability problem. You may give it a try, or propose some other approaches.
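To make the idea concrete, here is a sketch (assuming StructArrays.jl; GameState and its fields are illustrative, and it sidesteps the immutability problem by writing to the per-field component arrays instead of going through setindex! on whole elements):

using StructArrays

struct GameState
    agent_position::CartesianIndex{2}
    reward::Float32
end

# StructArrays stores each field as its own contiguous array (struct of arrays).
states = StructArray([GameState(CartesianIndex(1, 1), 0.0f0) for _ in 1:64])

# A scalar field of one element can be updated through the component arrays,
# without reconstructing the whole immutable struct:
states.reward[3] = 1.0f0
states.agent_position[3] = CartesianIndex(2, 2)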

Sid-Bhatia-0 (Member, Author)

Hmm... it will take me a while before I can think more deeply about this and share an informed opinion.

As discussed in our last call, let us put the batch environments on hold for now. I will be focusing on getting all the existing single versions working perfectly (resolving all non-batch issues) before JuliaCon.
