SAC: Soft Actor Critic

This baseline comes from the paper: Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor

Description

This module proposes an implementation of the SAC algorithm.

This is an old implementation that is probably not correct; it is kept only for backward compatibility with earlier versions (< 0.5.0) of this package.

An example of how to train this model is available in the Examples section of the train function.

Warning

This baseline recodes the entire RL training procedure. You can use it if you want to have a deeper look at the Deep Q Learning algorithm and at a possible (non optimized, slow, etc.) implementation.

For a much better implementation, you can reuse the code of “PPO_RLLIB” or the “PPO_SB3” baseline.

Exported class

You can use this class with:

from l2rpn_baselines.SACOld import train, evaluate, SACOld
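
A minimal end-to-end sketch of the evaluation side is given below. It assumes an agent named "MyAwesomeAgent" has already been trained and saved at the placeholder path, and that the two return values documented below for evaluate (the agent and the runner results) come back as a tuple:

from grid2op import make
from l2rpn_baselines.SACOld import evaluate

env = make("l2rpn_case14_sandbox")
try:
    # reload the trained agent and assess it with the grid2op runner
    agent, res = evaluate(env,
                          name="MyAwesomeAgent",                 # name used at training time
                          load_path="/WHERE/I/SAVED/THE/MODEL",  # placeholder path
                          logs_path=None,
                          nb_episode=2,
                          nb_process=1,
                          max_steps=-1,
                          verbose=False,
                          save_gif=False)
    # res is a list with one entry per evaluated episode
    print(f"{len(res)} episode(s) evaluated")
finally:
    env.close()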

Classes:

SACOld(action_space, nn_archi[, name, ...])

Do not use this SACOld class that has lots of known (but forgotten) issues.

SACOld_NNParam(action_size, ...)

Do not use this SACOld_NNParam class that has lots of known (but forgotten) issues.

Functions:

evaluate(env[, name, load_path, logs_path, ...])

Evaluate the performance of the trained SAC agent (old implementation).

train(env[, name, iterations, save_path, ...])

This function implements the "training" part of the "SAC" baseline (old buggy implementation).

class l2rpn_baselines.SACOld.SACOld(action_space, nn_archi, name='DeepQAgent', store_action=True, istraining=False, filter_action_fun=None, verbose=False, observation_space=None, **kwargs_converters)[source]

Do not use this SACOld class that has lots of known (but forgotten) issues.

Warning

This baseline recodes the entire RL training procedure. You can use it if you want to have a deeper look at the Deep Q Learning algorithm and at a possible (non optimized, slow, etc.) implementation.

For a much better implementation, you can reuse the code of “PPO_RLLIB” or the “PPO_SB3” baseline.

Warning

We plan to add SAC-based agents relying on external frameworks, such as stable baselines3 or ray / rllib.

We will not code any SAC agent “from scratch”.

class l2rpn_baselines.SACOld.SACOld_NNParam(action_size, observation_size, sizes, activs, list_attr_obs, sizes_value, activs_value, sizes_policy, activs_policy)[source]

Do not use this SACOld_NNParam class that has lots of known (but forgotten) issues.

Warning

This baseline recodes the entire RL training procedure. You can use it if you want to have a deeper look at the Deep Q Learning algorithm and at a possible (non optimized, slow, etc.) implementation.

For a much better implementation, you can reuse the code of “PPO_RLLIB” or the “PPO_SB3” baseline.

Warning

We plan to add SAC-based agents relying on external frameworks, such as stable baselines3 or ray / rllib.

We will not code any SAC agent “from scratch”.

sizes_value

List of integers, each one representing the size of a hidden layer of the “value” neural network.

Type:

list

activs_value

List of str, one per hidden layer of the “value” neural network, indicating which activation function to use.

Type:

list

sizes_policy

List of integers, each representing the size of a hidden layer of the “policy” network.

Type:

list

activs_policy

List of str, the activation functions (one per hidden layer) of the “policy” network.

Type:

list
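
The attributes above correspond directly to the last four constructor arguments. A minimal construction sketch is shown below; the layer sizes, the observation attributes and the action_size value are illustrative placeholders only (action_size must in practice match the number of discrete actions handled by the agent's converter):

import grid2op
from l2rpn_baselines.utils import NNParam
from l2rpn_baselines.SACOld import SACOld_NNParam

env = grid2op.make("l2rpn_case14_sandbox")

# observation attributes fed to the networks (illustrative subset)
li_attr_obs = ["rho", "line_status", "topo_vect"]
obs_size = NNParam.get_obs_size(env, li_attr_obs)

nn_archi = SACOld_NNParam(action_size=100,            # placeholder, see note above
                          observation_size=obs_size,
                          sizes=[200, 200],            # hidden layers passed as 'sizes' (named sizes_q in the train() example)
                          activs=["relu", "relu"],
                          list_attr_obs=li_attr_obs,
                          sizes_value=[200, 200],      # hidden layers of the "value" network
                          activs_value=["relu", "relu"],
                          sizes_policy=[200, 200],     # hidden layers of the "policy" network
                          activs_policy=["relu", "relu"])

# nn_archi can then be passed as the `nn_archi` argument of SACOld, or built
# indirectly by train() from the equivalent kwargs_archi dictionary (see below).
env.close()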

Classes:

nn_class

alias of SACOld_NN

Methods:

construct_q_network()

This constructs all the networks needed for the SAC agent.

load_network(path[, name, ext])

We load all the models using the keras "load_model" function.

predict_movement(data, epsilon[, ...])

predict the next movements in a vectorized fashion

save_network(path[, name, ext])

Saves all the models with unique names

target_train()

This updates the target model.

train(s_batch, a_batch, r_batch, d_batch, ...)

Trains networks to fit given parameters

l2rpn_baselines.SACOld.evaluate(env, name='SACOld', load_path=None, logs_path='./logs-eval/do-nothing-baseline', nb_episode=1, nb_process=1, max_steps=-1, verbose=False, save_gif=False)[source]

Evaluate the performance of the trained SAC agent (old implementation).

Please use the new implementation instead.

Warning

This baseline recodes the entire RL training procedure. You can use it if you want to have a deeper look at the Deep Q Learning algorithm and at a possible (non optimized, slow, etc.) implementation.

For a much better implementation, you can reuse the code of “PPO_RLLIB” or the “PPO_SB3” baseline.

Parameters:
  • env (grid2op.Environment) – The environment on which you evaluate your agent.

  • name (str) – The name of the trained baseline

  • load_path (str) – Path where the agent has been stored

  • logs_path (str) – Where to write the results of the assessment

  • nb_episode (int) – How many episodes to run during the assessment of the performances

  • nb_process (int) – On how many processes the assessment will be made. (setting this > 1 can lead to some speed ups but can be unstable on some platforms)

  • max_steps (int) – The maximum number of steps for which your agent will be assessed

  • verbose (bool) – Currently unused

  • save_gif (bool) – Whether or not you want to save, as a gif, the performance of your agent. It might cause memory issues (might take a lot of ram) and drastically increase computation time.

Returns:

  • agent (l2rpn_baselines.utils.DeepQAgent) – The loaded agent that has been evaluated thanks to the runner.

  • res (list) – The results of the Runner on which the agent was tested.

Examples

You can evaluate a trained SACOld agent this way:

from grid2op import make
from grid2op.Reward import L2RPNSandBoxScore, L2RPNReward
from l2rpn_baselines.SACOld import evaluate

# Create dataset env
env = make("l2rpn_case14_sandbox",
           reward_class=L2RPNSandBoxScore,
           other_rewards={
               "reward": L2RPNReward
           })

# Call evaluation interface
evaluate(env,
         name="MyAwesomeAgent",
         load_path="/WHERE/I/SAVED/THE/MODEL",
         logs_path=None,
         nb_episode=10,
         nb_process=1,
         max_steps=-1,
         verbose=False,
         save_gif=False)
l2rpn_baselines.SACOld.train(env, name='SACOld', iterations=1, save_path=None, load_path=None, logs_dir=None, training_param=None, filter_action_fun=None, verbose=True, kwargs_converters={}, kwargs_archi={})[source]

This function implements the “training” part of the “SAC” baseline (old buggy implementation).

Warning

This baseline recodes the entire RL training procedure. You can use it if you want to have a deeper look at the Deep Q Learning algorithm and at a possible (non optimized, slow, etc.) implementation.

For a much better implementation, you can reuse the code of “PPO_RLLIB” or the “PPO_SB3” baseline.

Warning

We plan to add SAC-based agents relying on external frameworks, such as stable baselines3 or ray / rllib.

We will not code any SAC agent “from scratch”.

Parameters:
  • env (grid2op.Environment) – The environment on which you need to train your agent.

  • name (str) – The name of your agent.

  • iterations (int) – For how many iterations (steps) you want to train your agent. NB these are not episodes, these are steps.

  • save_path (str) – Where you want to save your baseline.

  • load_path (str) – If you want to reload your baseline, specify the path where it is located. NB if a baseline is reloaded, some of the arguments provided to this function will not be used.

  • logs_dir (str) – Where to store the tensorboard generated logs during the training. None if you don’t want to log them.

  • verbose (bool) – Whether you want something to be printed on the terminal (a better logging strategy will be put in place at some point)

  • training_param (l2rpn_baselines.utils.TrainingParam) – The parameters describing the way you will train your model.

  • filter_action_fun (function) – A function to filter the action space. See IdToAct.filter_action documentation.

  • kwargs_converters (dict) – A dictionary containing the keyword arguments passed at the initialization of the grid2op.Converter.IdToAct converter that serves as “base” for the agent.

  • kwargs_archi (dict) – Keyword arguments used for making the neural network parameters object (here SACOld_NNParam) that will be used to build the baseline.

Returns:

baseline – The trained baseline.

Return type:

SACOld

Examples

Here is an example of how to train a SACOld baseline.

First, define a python script, for example:

import grid2op
from grid2op.Reward import L2RPNReward
from l2rpn_baselines.utils import TrainingParam, NNParam
from l2rpn_baselines.SACOld import train

# define the environment
env = grid2op.make("l2rpn_case14_sandbox",
                   reward_class=L2RPNReward)

# use the default training parameters
tp = TrainingParam()

# this will be the list of what part of the observation I want to keep
# more information on https://grid2op.readthedocs.io/en/latest/observation.html#main-observation-attributes
li_attr_obs_X = ["day_of_week", "hour_of_day", "minute_of_hour", "prod_p", "prod_v", "load_p", "load_q",
                 "actual_dispatch", "target_dispatch", "topo_vect", "time_before_cooldown_line",
                 "time_before_cooldown_sub", "rho", "timestep_overflow", "line_status"]

# neural network architecture
observation_size = NNParam.get_obs_size(env, li_attr_obs_X)
sizes_q = [800, 800, 800, 494, 494, 494]  # hidden layer sizes for the "Q" part ('sizes')
sizes_v = [800, 800]  # hidden layer sizes for the "value" network ('sizes_value')
sizes_pol = [800, 800, 800, 494, 494, 494]  # hidden layer sizes for the "policy" network ('sizes_policy')
kwargs_archi = {'observation_size': observation_size,
                'sizes': sizes_q,
                'activs': ["relu" for _ in range(len(sizes_q))],
                "list_attr_obs": li_attr_obs_X,
                "sizes_value": sizes_v,
                "activs_value": ["relu" for _ in range(len(sizes_v))],
                "sizes_policy": sizes_pol,
                "activs_policy": ["relu" for _ in range(len(sizes_pol))]
                }

# select some part of the action
# more information at https://grid2op.readthedocs.io/en/latest/converter.html#grid2op.Converter.IdToAct.init_converter
kwargs_converters = {"all_actions": None,
                     "set_line_status": False,
                     "change_bus_vect": True,
                     "set_topo_vect": False
                     }
# define the name of the model
nm_ = "AnneOnymous"
try:
    train(env,
          name=nm_,
          iterations=10000,
          save_path="/WHERE/I/SAVED/THE/MODEL",
          load_path=None,
          logs_dir="/WHERE/I/SAVED/THE/LOGS",
          training_param=tp,
          kwargs_converters=kwargs_converters,
          kwargs_archi=kwargs_archi)
finally:
    env.close()

Other non-exported classes

These classes are not exported by default. If you want to use them, you can import them with (non exhaustive list):

from l2rpn_baselines.SACOld.sacOld_NN import SACOld_NN
from l2rpn_baselines.SACOld.sacOld_NNParam import SACOld_NNParam
class l2rpn_baselines.SACOld.sacOld_NN.SACOld_NN(nn_params, training_param=None, verbose=False)[source]

Constructs the desired soft actor critic network.

Warning

This baseline recodes the entire RL training procedure. You can use it if you want to have a deeper look at the Deep Q Learning algorithm and at a possible (non optimized, slow, etc.) implementation.

For a much better implementation, you can reuse the code of “PPO_RLLIB” or the “PPO_SB3” baseline.

Compared to other baselines shown elsewhere (eg l2rpn_baselines.DeepQSimple) the implementation of SAC is a bit more tricky (and was most likely NOT done properly in this class). For a more correct implementation of SAC please look at l2rpn_baselines.SAC.SAC instead. This class is only present for backward compatibility.

However, it demonstrates that the use of l2rpn_baselines.utils.BaseDeepQ with a custom parameters class (in this case SACOld_NNParam) is flexible enough to meet our needs.

References

Original paper: https://arxiv.org/abs/1801.01290

modified for discrete action space: https://arxiv.org/abs/1910.07207

Methods:

construct_q_network()

This constructs all the networks needed for the SAC agent.

load_network(path[, name, ext])

We load all the models using the keras "load_model" function.

predict_movement(data, epsilon[, ...])

predict the next movements in a vectorized fashion

save_network(path[, name, ext])

Saves all the models with unique names

target_train()

This updates the target model.

train(s_batch, a_batch, r_batch, d_batch, ...)

Trains networks to fit given parameters

construct_q_network()[source]

This constructs all the networks needed for the SAC agent.

load_network(path, name=None, ext='h5')[source]

We load all the models using the keras “load_model” function.

predict_movement(data, epsilon, batch_size=None, training=False)[source]

predict the next movements in a vectorized fashion

save_network(path, name=None, ext='h5')[source]

Saves all the models with unique names

target_train()[source]

This updates the target model.

train(s_batch, a_batch, r_batch, d_batch, s2_batch, tf_writer=None, batch_size=None)[source]

Trains networks to fit given parameters
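
To make the method signatures above concrete, here is a hedged, self-contained sketch that builds the network from an SACOld_NNParam description, runs a forward prediction on dummy data and does a save/load round trip. It requires the (old) tensorflow / keras dependencies of this baseline to be installed; all sizes, the observation attributes and the name "sac_old_demo" are illustrative placeholders, and the exact content returned by predict_movement is not asserted:

import tempfile
import numpy as np

from l2rpn_baselines.utils import TrainingParam
from l2rpn_baselines.SACOld.sacOld_NN import SACOld_NN
from l2rpn_baselines.SACOld.sacOld_NNParam import SACOld_NNParam

obs_size = 59  # placeholder observation vector size
nn_params = SACOld_NNParam(action_size=100,           # placeholder discrete action count
                           observation_size=obs_size,
                           sizes=[200, 200],
                           activs=["relu", "relu"],
                           list_attr_obs=["rho", "line_status"],
                           sizes_value=[200, 200],
                           activs_value=["relu", "relu"],
                           sizes_policy=[200, 200],
                           activs_policy=["relu", "relu"])

net = SACOld_NN(nn_params, training_param=TrainingParam())
# construct_q_network builds every sub-network needed by the SAC agent; if the
# constructor already did so, this call simply rebuilds them with fresh weights.
net.construct_q_network()

# forward pass on a dummy observation batch of shape (batch_size, observation_size)
dummy_obs = np.zeros((1, obs_size), dtype=np.float32)
prediction = net.predict_movement(dummy_obs, epsilon=0.0)

# save / reload round trip, using the documented default "h5" (keras) format
save_dir = tempfile.mkdtemp()
net.save_network(save_dir, name="sac_old_demo", ext="h5")
net.load_network(save_dir, name="sac_old_demo", ext="h5")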

class l2rpn_baselines.SACOld.sacOld_NNParam.SACOld_NNParam(action_size, observation_size, sizes, activs, list_attr_obs, sizes_value, activs_value, sizes_policy, activs_policy)[source]

Do not use this SACOld_NNParam class that has lots of known (but forgotten) issues.

Warning

This baseline recodes the entire RL training procedure. You can use it if you want to have a deeper look at the Deep Q Learning algorithm and at a possible (non optimized, slow, etc.) implementation.

For a much better implementation, you can reuse the code of “PPO_RLLIB” or the “PPO_SB3” baseline.

Warning

We plan to add SAC-based agents relying on external frameworks, such as stable baselines3 or ray / rllib.

We will not code any SAC agent “from scratch”.

sizes_value

List of integers, each one representing the size of a hidden layer of the “value” neural network.

Type:

list

activs_value

List of str, one per hidden layer of the “value” neural network, indicating which activation function to use.

Type:

list

sizes_policy

List of integers, each representing the size of a hidden layer of the “policy” network.

Type:

list

activs_policy

List of str, the activation functions (one per hidden layer) of the “policy” network.

Type:

list

Classes:

nn_class

alias of SACOld_NN

Methods:

construct_q_network()

This constructs all the networks needed for the SAC agent.

load_network(path[, name, ext])

We load all the models using the keras "load_model" function.

predict_movement(data, epsilon[, ...])

predict the next movements in a vectorized fashion

save_network(path[, name, ext])

Saves all the models with unique names

target_train()

This update the target model.

train(s_batch, a_batch, r_batch, d_batch, ...)

Trains networks to fit given parameters
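
Since nn_class is documented above as an alias of SACOld_NN, the parameter class itself carries a reference to the network class it describes, so code that only holds the parameter class can still locate the network implementation. A small sketch of that relationship:

from l2rpn_baselines.SACOld.sacOld_NN import SACOld_NN
from l2rpn_baselines.SACOld.sacOld_NNParam import SACOld_NNParam

# nn_class is a class-level alias pointing at the network implementation
assert SACOld_NNParam.nn_class is SACOld_NN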