SAC: Soft Actor Critic
This baseline comes from the paper: Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor
Description
This module proposes an implementation of the SAC algorithm.
This is an old implementation that is probably not correct. It was included for backward compatibility with earlier versions (< 0.5.0) of this package.
An example of how to train this model is available in the Examples section of the train function.
Warning
This baseline recodes the entire RL training procedure. You can use it if you want to have a deeper look at the Deep Q Learning algorithm and at a possible (non optimized, slow, etc.) implementation.
For a much better implementation, you can reuse the code of the “PPO_RLLIB” or the “PPO_SB3” baseline.
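If you want to follow that advice, the sketch below shows what switching to the PPO_SB3 baseline could look like. It is a minimal, non-authoritative example: the exact keyword arguments accepted by the PPO_SB3 train function, as well as the agent name and paths, are assumptions here, so check the PPO_SB3 documentation before reusing it.

import grid2op
from l2rpn_baselines.PPO_SB3 import train  # stable-baselines3 based PPO baseline

# assumption: PPO_SB3's train accepts the usual l2rpn_baselines arguments
# (env, name, iterations, save_path, logs_dir); the paths below are placeholders
env = grid2op.make("l2rpn_case14_sandbox")
try:
    train(env,
          name="MyPPOAgent",
          iterations=10000,
          save_path="/WHERE/I/SAVE/THE/MODEL",
          logs_dir="/WHERE/I/SAVE/THE/LOGS")
finally:
    env.close()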
Exported class
You can use this class with:
from l2rpn_baselines.SACOld import train, evaluate, SACOld
Classes:

SACOld – Do not use this SACOld class that has lots of known (but forgotten) issues.
SACOld_NNParam – Do not use this SACOld class that has lots of known (but forgotten) issues.

Functions:

evaluate – How to evaluate the performance of the trained SAC agent (old implementation).
train – This function implements the “training” part of the baseline “SAC” (old buggy implementation).
- class l2rpn_baselines.SACOld.SACOld(action_space, nn_archi, name='DeepQAgent', store_action=True, istraining=False, filter_action_fun=None, verbose=False, observation_space=None, **kwargs_converters)[source]
Do not use this SACOld class that has lots of known (but forgotten) issues.
Warning
This baseline recodes the entire RL training procedure. You can use it if you want to have a deeper look at the Deep Q Learning algorithm and at a possible (non optimized, slow, etc.) implementation.
For a much better implementation, you can reuse the code of the “PPO_RLLIB” or the “PPO_SB3” baseline.
Warning
We plan to add SAC-based agents relying on external frameworks, such as stable baselines3 or ray / rllib.
We will not code any SAC agent “from scratch”.
- class l2rpn_baselines.SACOld.SACOld_NNParam(action_size, observation_size, sizes, activs, list_attr_obs, sizes_value, activs_value, sizes_policy, activs_policy)[source]
Do not use this SACOld class that has lots of known (but forgotten) issues.
Warning
This baseline recodes the entire RL training procedure. You can use it if you want to have a deeper look at the Deep Q Learning algorithm and at a possible (non optimized, slow, etc.) implementation.
For a much better implementation, you can reuse the code of the “PPO_RLLIB” or the “PPO_SB3” baseline.
Warning
We plan to add SAC based agents relying on external frameworks, such as stable baselines3 or ray / rllib.
We will not code any SAC agent “from scratch”.
- sizes_value
List of integers, each one representing the size of a hidden layer of the “value” neural network.
- Type:
list
- activs_value
List of str, one per hidden layer of the “value” neural network, indicating which activation function to use.
- Type:
list
- sizes_policy
List of integers, each one representing the size of a hidden layer of the “policy” network.
- Type:
list
- activs_policy
List of str: the activation functions (one for each layer) of the policy network.
- Type:
list
Classes:

nn_class – alias of SACOld_NN

- nn_class
alias of SACOld_NN
Methods:

construct_q_network() – This constructs all the networks needed for the SAC agent.
load_network(path[, name, ext]) – We load all the models using the keras "load_model" function.
predict_movement(data, epsilon[, ...]) – Predict the next movements in a vectorized fashion.
save_network(path[, name, ext]) – Saves all the models with unique names.
target_train() – This updates the target model.
train(s_batch, a_batch, r_batch, d_batch, ...) – Trains networks to fit given parameters.
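For illustration only (keeping in mind the warnings above), here is a minimal sketch of how such a parameters object could be filled in, directly following the constructor signature documented above. The sizes, activation names and observation attributes are arbitrary placeholders, not recommended values.

from l2rpn_baselines.SACOld.sacOld_NNParam import SACOld_NNParam

# arbitrary placeholder values, only meant to illustrate the attributes above:
# sizes / activs for the Q network, sizes_value / activs_value for the value
# network, sizes_policy / activs_policy for the policy network
nn_archi = SACOld_NNParam(action_size=100,
                          observation_size=400,
                          sizes=[300, 300],
                          activs=["relu", "relu"],
                          list_attr_obs=["rho", "prod_p", "load_p"],
                          sizes_value=[300, 300],
                          activs_value=["relu", "relu"],
                          sizes_policy=[300, 300],
                          activs_policy=["relu", "relu"])

In practice, the train function documented below builds this object for you from kwargs_archi (see its Examples section).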
- l2rpn_baselines.SACOld.evaluate(env, name='SACOld', load_path=None, logs_path='./logs-eval/do-nothing-baseline', nb_episode=1, nb_process=1, max_steps=-1, verbose=False, save_gif=False)[source]
How to evaluate the performance of the trained SAC agent (old implementation).
Please use the new implementation instead.
Warning
This baseline recodes the entire RL training procedure. You can use it if you want to have a deeper look at the Deep Q Learning algorithm and at a possible (non optimized, slow, etc.) implementation.
For a much better implementation, you can reuse the code of the “PPO_RLLIB” or the “PPO_SB3” baseline.
- Parameters:
  - env (grid2op.Environment) – The environment on which you evaluate your agent.
  - name (str) – The name of the trained baseline.
  - load_path (str) – Path where the agent has been stored.
  - logs_path (str) – Where to write the results of the assessment.
  - nb_episode (int) – How many episodes to run during the assessment of the performance.
  - nb_process (int) – On how many processes the assessment will be made. (Setting this > 1 can lead to some speed ups but can be unstable on some platforms.)
  - max_steps (int) – Maximum number of steps for which your agent will be assessed.
  - verbose (bool) – Currently unused.
  - save_gif (bool) – Whether or not you want to save, as a gif, the performance of your agent. It might cause memory issues (might take a lot of RAM) and drastically increase computation time.
- Returns:
  - agent (l2rpn_baselines.utils.DeepQAgent) – The loaded agent that has been evaluated thanks to the runner.
  - res (list) – The results of the Runner on which the agent was tested.
Examples
You can evaluate a SACOld agent this way:
from grid2op import make
from grid2op.Reward import L2RPNSandBoxScore, L2RPNReward
from l2rpn_baselines.SACOld import evaluate

# Create dataset env
env = make("l2rpn_case14_sandbox",
           reward_class=L2RPNSandBoxScore,
           other_rewards={"reward": L2RPNReward})

# Call evaluation interface
evaluate(env,
         name="MyAwesomeAgent",
         load_path="/WHERE/I/SAVED/THE/MODEL",
         logs_path=None,
         nb_episode=10,
         nb_process=1,
         max_steps=-1,
         verbose=False,
         save_gif=False)
- l2rpn_baselines.SACOld.train(env, name='SACOld', iterations=1, save_path=None, load_path=None, logs_dir=None, training_param=None, filter_action_fun=None, verbose=True, kwargs_converters={}, kwargs_archi={})[source]
This function implements the “training” part of the baseline “SAC” (old buggy implementation).
Warning
This baseline recodes the entire RL training procedure. You can use it if you want to have a deeper look at the Deep Q Learning algorithm and at a possible (non optimized, slow, etc.) implementation.
For a much better implementation, you can reuse the code of the “PPO_RLLIB” or the “PPO_SB3” baseline.
Warning
We plan to add SAC-based agents relying on external frameworks, such as stable baselines3 or ray / rllib.
We will not code any SAC agent “from scratch”.
- Parameters:
  - env (grid2op.Environment) – The environment on which you need to train your agent.
  - name (str) – The name of your agent.
  - iterations (int) – For how many iterations (steps) you want to train your agent. NB these are not episodes, these are steps.
  - save_path (str) – Where do you want to save your baseline.
  - load_path (str) – If you want to reload your baseline, specify the path where it is located. NB if a baseline is reloaded, some of the arguments provided to this function will not be used.
  - logs_dir (str) – Where to store the tensorboard generated logs during the training. None if you don't want to log them.
  - verbose (bool) – If you want something to be printed on the terminal (a better logging strategy will be put in place at some point).
  - training_param (l2rpn_baselines.utils.TrainingParam) – The parameters describing the way you will train your model.
  - filter_action_fun (function) – A function to filter the action space. See the IdToAct.filter_action documentation.
  - kwargs_converters (dict) – A dictionary containing the key-word arguments passed at the initialization of the grid2op.Converter.IdToAct converter that serves as “Base” for the Agent.
  - kwargs_archi (dict) – Key-word arguments used for making the SACOld_NNParam object that will be used to build the baseline.
- Returns:
  baseline – The trained baseline.
- Return type:
  SACOld
Examples
Here is an example on how to train a SACOld baseline. First define a python script, for example:

import grid2op
from grid2op.Reward import L2RPNReward
from l2rpn_baselines.utils import TrainingParam, NNParam
from l2rpn_baselines.SACOld import train

# define the environment
env = grid2op.make("l2rpn_case14_sandbox",
                   reward_class=L2RPNReward)

# use the default training parameters
tp = TrainingParam()

# this will be the list of what part of the observation I want to keep
# more information on https://grid2op.readthedocs.io/en/latest/observation.html#main-observation-attributes
li_attr_obs_X = ["day_of_week", "hour_of_day", "minute_of_hour", "prod_p", "prod_v",
                 "load_p", "load_q", "actual_dispatch", "target_dispatch", "topo_vect",
                 "time_before_cooldown_line", "time_before_cooldown_sub", "rho",
                 "timestep_overflow", "line_status"]

# neural network architecture
observation_size = NNParam.get_obs_size(env, li_attr_obs_X)
sizes_q = [800, 800, 800, 494, 494, 494]  # sizes of each hidden layer
sizes_v = [800, 800]  # sizes of each hidden layer
sizes_pol = [800, 800, 800, 494, 494, 494]  # sizes of each hidden layer
kwargs_archi = {'observation_size': observation_size,
                'sizes': sizes_q,
                'activs': ["relu" for _ in range(len(sizes_q))],
                "list_attr_obs": li_attr_obs_X,
                "sizes_value": sizes_v,
                "activs_value": ["relu" for _ in range(len(sizes_v))],
                "sizes_policy": sizes_pol,
                "activs_policy": ["relu" for _ in range(len(sizes_pol))]
                }

# select some part of the action
# more information at https://grid2op.readthedocs.io/en/latest/converter.html#grid2op.Converter.IdToAct.init_converter
kwargs_converters = {"all_actions": None,
                     "set_line_status": False,
                     "change_bus_vect": True,
                     "set_topo_vect": False
                     }

# define the name of the model
nm_ = "AnneOnymous"
try:
    train(env,
          name=nm_,
          iterations=10000,
          save_path="/WHERE/I/SAVED/THE/MODEL",
          load_path=None,
          logs_dir="/WHERE/I/SAVED/THE/LOGS",
          training_param=tp,
          kwargs_converters=kwargs_converters,
          kwargs_archi=kwargs_archi)
finally:
    env.close()
Other non exported classes
These classes are not exported by default. If you want to use them, you need to import them explicitly, for example with (non exhaustive list):
from l2rpn_baselines.SACOld.sacOld_NN import SACOld_NN
from l2rpn_baselines.SACOld.sacOld_NNParam import SACOld_NNParam
- class l2rpn_baselines.SACOld.sacOld_NN.SACOld_NN(nn_params, training_param=None, verbose=False)[source]
Constructs the desired soft actor critic network.
Warning
This baseline recodes the entire RL training procedure. You can use it if you want to have a deeper look at the Deep Q Learning algorithm and at a possible (non optimized, slow, etc.) implementation.
For a much better implementation, you can reuse the code of the “PPO_RLLIB” or the “PPO_SB3” baseline.
Compared to other baselines shown elsewhere (e.g. l2rpn_baselines.DeepQSimple), the implementation of SAC is a bit more tricky (and was most likely NOT done properly in this class). For a more correct implementation of SAC, please look at l2rpn_baselines.SAC.SAC instead. This class is only present for backward compatibility.

However, we demonstrate here that the use of l2rpn_baselines.utils.BaseDeepQ with a custom parameters class (in this case SACOld_NNParam) is flexible enough to meet our needs.

References
Original paper: https://arxiv.org/abs/1801.01290
modified for discrete action space: https://arxiv.org/abs/1910.07207
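As a complement, here is a hedged sketch of how this network wrapper could be instantiated from a parameters object, following the constructor signature documented above (SACOld_NN(nn_params, training_param=None, verbose=False)). The nn_params values are the same arbitrary placeholders used earlier on this page, not recommended settings.

from l2rpn_baselines.utils import TrainingParam
from l2rpn_baselines.SACOld.sacOld_NN import SACOld_NN
from l2rpn_baselines.SACOld.sacOld_NNParam import SACOld_NNParam

# placeholder parameters object, see the SACOld_NNParam sketch earlier
nn_params = SACOld_NNParam(action_size=100,
                           observation_size=400,
                           sizes=[300, 300],
                           activs=["relu", "relu"],
                           list_attr_obs=["rho", "prod_p", "load_p"],
                           sizes_value=[300, 300],
                           activs_value=["relu", "relu"],
                           sizes_policy=[300, 300],
                           activs_policy=["relu", "relu"])

# the construct_q_network method listed in the Methods table below is the one
# that actually builds the value, policy and Q keras models
nets = SACOld_NN(nn_params, training_param=TrainingParam())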
Methods:

construct_q_network() – This constructs all the networks needed for the SAC agent.
load_network(path[, name, ext]) – We load all the models using the keras "load_model" function.
predict_movement(data, epsilon[, ...]) – Predict the next movements in a vectorized fashion.
save_network(path[, name, ext]) – Saves all the models with unique names.
target_train() – This updates the target model.
train(s_batch, a_batch, r_batch, d_batch, ...) – Trains networks to fit given parameters.
- load_network(path, name=None, ext='h5')[source]
We load all the models using the keras “load_model” function.
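As a small usage sketch (under the same assumptions as the snippets above, with nets an already-built SACOld_NN instance), saving and then reloading the keras models could look like this; the directory and name are placeholders.

# save_network / load_network take a directory path, an optional model name
# and a file extension (default "h5"), as documented above
nets.save_network("/WHERE/I/SAVE/THE/MODEL", name="MyAwesomeAgent")
nets.load_network("/WHERE/I/SAVE/THE/MODEL", name="MyAwesomeAgent", ext="h5")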
- class l2rpn_baselines.SACOld.sacOld_NNParam.SACOld_NNParam(action_size, observation_size, sizes, activs, list_attr_obs, sizes_value, activs_value, sizes_policy, activs_policy)[source]
Do not use this SACOld class that has lots of known (but forgotten) issues.
Warning
This baseline recodes the entire RL training procedure. You can use it if you want to have a deeper look at the Deep Q Learning algorithm and at a possible (non optimized, slow, etc.) implementation.
For a much better implementation, you can reuse the code of the “PPO_RLLIB” or the “PPO_SB3” baseline.
Warning
We plan to add SAC-based agents relying on external frameworks, such as stable baselines3 or ray / rllib.
We will not code any SAC agent “from scratch”.
- sizes_value
List of integers, each one representing the size of a hidden layer of the “value” neural network.
- Type:
list
- activs_value
List of str, one per hidden layer of the “value” neural network, indicating which activation function to use.
- Type:
list
- sizes_policy
List of integers, each one representing the size of a hidden layer of the “policy” network.
- Type:
list
- activs_policy
List of str: the activation functions (one for each layer) of the policy network.
- Type:
list
Classes:

nn_class – alias of SACOld_NN

- nn_class
alias of SACOld_NN
Methods:

construct_q_network() – This constructs all the networks needed for the SAC agent.
load_network(path[, name, ext]) – We load all the models using the keras "load_model" function.
predict_movement(data, epsilon[, ...]) – Predict the next movements in a vectorized fashion.
save_network(path[, name, ext]) – Saves all the models with unique names.
target_train() – This updates the target model.
train(s_batch, a_batch, r_batch, d_batch, ...) – Trains networks to fit given parameters.