utils: Some utility functions and classes

Description

This module gathers a few utility scripts used by some other baselines, or that could be used in different contexts.

They have been put together in this “module” because they can be reused by different baselines (avoiding code duplication).

The main tools are:

  • BaseDeepQ is the root class for some baselines. It holds only the code of the neural network; its architecture can be customized through NNParam.

  • DeepQAgent creates an instance of BaseDeepQ and implements the agent interface (eg the train, load and save methods). The training procedure is unified (epsilon-greedy exploration, training for a certain number of steps, etc.) but can be customized with TrainingParam. Training can be stopped at any time and restarted from the last saved point almost seamlessly, since the neural network and the other parameters are saved frequently.

  • TrainingParam lets you customize how the agent is trained for some “common” procedures. More information is given in the Focus on the training parameters section. It is fully serializable / deserializable in json format.

  • NNParam is used to specify the architecture of your neural network. Just like TrainingParam, this class fully supports serialization / deserialization in json format. More details are given in the Focus on the architecture section.

Focus on the training parameters

The TrainingParam class gathers a number of attributes with different roles. The table below lists all the attributes, grouped by purpose.

Utility                       Attribute names
exploration                   initial_epsilon, step_for_final_epsilon, final_epsilon
neural network learning       minibatch_size, update_freq, min_observation
RL meta parameters            discount_factor, tau
limit duration of episode     step_increase_nb_iter *, min_iter, max_iter, update_nb_iter, max_iter_fun
start an episode at random    random_sample_datetime_start *
oversampling hard scenarios   oversampling_rate *
optimizer                     lr, lr_decay_steps, lr_decay_rate, max_global_norm_grad, max_value_grad, max_loss
saving / logging              update_tensorboard_freq, save_model_each

* when a “star” is present, setting this parameter to None deactivates the whole utility. For example, setting step_increase_nb_iter to None deactivates the “limit duration of episode” functionality
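
As an illustration, here is a minimal sketch (an assumption of typical usage, not taken from the library's examples) showing how a TrainingParam instance can be customized and serialized to json; the attribute names are the ones listed in the table above and the paths are illustrative.

import os
from l2rpn_baselines.utils import TrainingParam

# customize the "exploration" and "neural network learning" attributes
tp = TrainingParam(initial_epsilon=0.4,
                   final_epsilon=0.001,
                   step_for_final_epsilon=100_000,
                   minibatch_size=64,
                   update_freq=256)

# deactivate the "limit duration of episode" utility (see the "star" note above)
tp.step_increase_nb_iter = None

# TrainingParam is fully serializable / deserializable in json format
os.makedirs("./training_params", exist_ok=True)
tp.save_as_json("./training_params", name="training_params.json")
tp_reloaded = TrainingParam.from_json("./training_params/training_params.json")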

Focus on the architecture

TODO

Implementation Details

Classes:

BaseDeepQ(nn_params[, training_param, verbose])

This class aims at representing the Q value (or more in case of SAC) parametrization by a neural network.

DeepQAgent(action_space, nn_archi[, name, ...])

This class allows you to train and log the training of different Q-learning algorithms.

GymAgent(g2op_action_space, gym_act_space, ...)

This class wraps a neural network (trained using ray / rllib or stable baselines, for example) into a grid2op agent.

GymEnvWithHeuristics(env_init, *args[, ...])

This abstract class is used to perform some actions on a grid2op environment, independently of an RL agent.

GymEnvWithReco(env_init, *args[, reward_cumul])

This specific type of environment with "heuristics" / "expert rules" / "expert actions" is an example to illustrate how to perform an automatic powerline reconnection.

GymEnvWithRecoWithDN(env_init, *args[, ...])

This environment is slightly more complex than the other one.

NNParam(action_size, observation_size, ...)

This class provides an easy way to save and restore, as json, the shape of your neural networks (number of layers, non-linearities, size of each layer, etc.)

ReplayBuffer(buffer_size)

Constructs a buffer object that stores the past moves and samples a set of subsamples

TrainingParam([buffer_size, minibatch_size, ...])

A class to store the training parameters of the models.

Functions:

cli_eval()

some useful command line arguments (CLI) for the evaluation of the baseline.

cli_train()

some default command line arguments (cli) for training the baselines.

make_multi_env(env_init, nb_env)

This function creates a multi environment compatible with what is expected in the baselines.

save_log_gif(path_log, res[, gif_name])

Output a gif named (by default "episode.gif") that is the replay of the episode in a gif format, for each episode in the input.

str2bool(v)

INTERNAL DO NOT USE

train_generic(agent, env[, name, ...])

This function is a helper to train more easily some agent using their default "train" method.

class l2rpn_baselines.utils.BaseDeepQ(nn_params, training_param=None, verbose=False)[source]

This class aims at representing the Q value (or more in case of SAC) parametrization by a neural network.

Warning

This baseline recodes the entire RL training procedure. You can use it if you want a deeper look at the Deep Q-Learning algorithm and a possible (non-optimized, slow, etc.) implementation.

For a much better implementation, you can reuse the code of the “PPO_RLLIB” or the “PPO_SB3” baseline.

Prefer the GymAgent class and the GymEnvWithHeuristics classes to train agents that interact with grid2op and are fully compatible with the gym framework.

It is composed of 2 different networks:

  • model: which is the main model

  • target_model: which has the same architecture and same initial weights as “model” but is updated less frequently to stabilize training

It has basic methods to make predictions, to train the model, and train the target model.

This class is abstract and needs to be overridden in order to create objects from it. The only pure virtual function is BaseDeepQ.construct_q_network(), which creates the neural network from the nn_params (NNParam) provided as input

_action_size

Total number of actions

Type:

int

_observation_size

Size of the observation space considered

Type:

int

_nn_archi

The parameters of the neural networks that will be created

Type:

NNParam

_training_param

The meta parameters for the training scheme (used especially for learning rate or gradient clipping for example)

Type:

TrainingParam

_lr

The initial learning rate

Type:

float

_lr_decay_steps

The decay step of the learning rate

Type:

float

_lr_decay_rate

The rate at which the learning rate will decay

Type:

float

_model

Main neural network model, here a keras Model object.

_target_model

a copy of the main neural network that will be updated less frequently (also known as “target model” in RL community)

Methods:

construct_q_network()

Abstract method that needs to be overridden.

get_path_model(path[, name])

Get the location at which the neural networks will be saved.

load_network(path[, name, ext])

Load the neural networks.

make_optimiser()

helper function to create the proper optimizer (Adam) with the learning rates and its decay parameters.

predict_movement(data, epsilon[, ...])

Predict the movement of the game controller, with probability epsilon of a random move.

save_network(path[, name, ext])

save the neural networks.

save_tensorboard(current_step)

function used to save other information to tensorboard

target_train([tau])

update the target model with the parameters given in the BaseDeepQ._training_param.

train(s_batch, a_batch, r_batch, d_batch, ...)

Trains network to fit given parameters:

train_on_batch(model, optimizer_model, x, y_true)

train the model on a batch of examples.

abstractmethod construct_q_network()[source]

Abstract method that needs to be overridden.

It should create BaseDeepQ._model and BaseDeepQ._target_model
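
For illustration only, here is a minimal sketch of what an override could look like, assuming a TensorFlow / Keras backend. The class name MyDeepQ and the architecture are hypothetical; the only facts used are that construct_q_network() must create BaseDeepQ._model and BaseDeepQ._target_model from the NNParam stored in self._nn_archi.

import tensorflow as tf
from l2rpn_baselines.utils import BaseDeepQ

class MyDeepQ(BaseDeepQ):
    """hypothetical concrete implementation, for illustration only"""
    def construct_q_network(self):
        # build the main model from the architecture described in self._nn_archi
        inputs = tf.keras.layers.Input(shape=(self._observation_size,))
        x = inputs
        for size, activ in zip(self._nn_archi.sizes, self._nn_archi.activs):
            x = tf.keras.layers.Dense(size, activation=activ)(x)
        outputs = tf.keras.layers.Dense(self._action_size)(x)
        self._model = tf.keras.Model(inputs=inputs, outputs=outputs)

        # the target model has the same architecture and the same initial weights
        self._target_model = tf.keras.models.clone_model(self._model)
        self._target_model.set_weights(self._model.get_weights())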

staticmethod get_path_model(path, name=None)[source]

Get the location at which the neural networks will be saved.

Returns:

  • path_model (str) – The path at which the model will be saved (path include both path and name, it is the full path at which the neural networks are saved)

  • path_target_model (str) – The path at which the target model will be saved

load_network(path, name=None, ext='h5')[source]

Load the neural networks.

Parameters:
  • path (str) – The path at which the models are saved

  • name (str) – The name given to this model

  • ext (str) – The file extension (by default h5)

make_optimiser()[source]

helper function to create the proper optimizer (Adam) with the learning rates and its decay parameters.

predict_movement(data, epsilon, batch_size=None, training=False)[source]

Predict the movement of the game controller, with probability epsilon of a random move.

save_network(path, name=None, ext='h5')[source]

save the neural networks.

Parameters:
  • path (str) – The path at which the models need to be saved

  • name (str) – The name given to this model

  • ext (str) – The file extension (by default h5)

save_tensorboard(current_step)[source]

function used to save other information to tensorboard

target_train(tau=None)[source]

update the target model with the parameters given in the BaseDeepQ._training_param.

train(s_batch, a_batch, r_batch, d_batch, s2_batch, tf_writer=None, batch_size=None)[source]

Trains network to fit given parameters:

Parameters:
  • s_batch – the state vector (before the action is taken)

  • a_batch – the action taken

  • s2_batch – the state vector (after the action is taken)

  • d_batch – says whether or not the episode was over

  • r_batch – the reward obtained this step

train_on_batch(model, optimizer_model, x, y_true)[source]

train the model on a batch of examples. This can be overridden

class l2rpn_baselines.utils.DeepQAgent(action_space, nn_archi, name='DeepQAgent', store_action=True, istraining=False, filter_action_fun=None, verbose=False, observation_space=None, **kwargs_converters)[source]

This class allows you to train and log the training of different Q-learning algorithms.

Warning

This baseline recodes the entire RL training procedure. You can use it if you want a deeper look at the Deep Q-Learning algorithm and a possible (non-optimized, slow, etc.) implementation.

For a much better implementation, you can reuse the code of the “PPO_RLLIB” or the “PPO_SB3” baseline.

Prefer the GymAgent class and the GymEnvWithHeuristics classes to train agents that interact with grid2op and are fully compatible with the gym framework.

It is not meant to be a state-of-the-art implementation of some baseline. It is rather meant to be a set of useful functions that make it easy to get started with RL using grid2op.

It derives from grid2op.Agent.AgentWithConverter and as such implements the DeepQAgent.convert_obs() and DeepQAgent.my_act()

It is supposed to be a baseline, so it also implements the l2rpn-baselines interface (the train, load and save methods).

TODO description of the training scheme!

filter_action_fun

The function used to filter the actions of the action space. See the documentation of grid2op.Converter.IdToAct for more information.

Type:

callable

replay_buffer

The experience replay buffer

deep_q

The neural network, represented as a BaseDeepQ object.

Type:

BaseDeepQ

name

The name of the Agent

Type:

str

store_action

Whether or not you want to register which actions your agent took. Saving the actions slows the computation down a bit (less than 1%) but can help you understand what your agent is doing during its learning process.

Type:

bool

dict_action

The actions taken by the agent, represented as a dictionary. This can be useful to know which types of actions are taken by your agent. Only filled if DeepQAgent.store_action is True

Type:

str

istraining

Whether or not you are training this agent. Not really used anymore; mainly kept for backward compatibility.

Type:

bool

epsilon

The epsilon greedy exploration parameter.

Type:

float

nb_injection

Number of actions tagged as “injection”. See the official grid2op documentation for more information.

Type:

int

nb_voltage

Number of actions tagged as “voltage”. See the official grid2op documentation for more information.

Type:

int

nb_topology

Number of actions tagged as “topology”. See the official grid2op documentation for more information.

Type:

int

nb_redispatching

Number of actions tagged as “redispatching”. See the official grid2op documentation for more information.

Type:

int

nb_storage

Number of actions tagged as “storage”. See the official grid2op documentation for more information.

Type:

int

nb_curtail

Number of actions tagged as “curtailment”. See the official grid2op documentation for more information.

Type:

int

nb_do_nothing

Number of actions tagged as “do_nothing”, ie when an action does not modify the state of the grid. See the official grid2op documentation for more information.

Type:

int

verbose

Controls the logging of the training (outside of tensorboard). For now, verbose=True allows some printing on the command prompt, while verbose=False drastically reduces the amount of information printed during training.

Type:

bool

Methods:

convert_obs(observation)

Generic way to convert an observation.

get_action_size(action_space, filter_fun, ...)

This function allows you to get the size of the action space if we were to build a DeepQAgent with these parameters.

init_obs_extraction(observation_space)

This method should be called to initialize the observation (fed as a vector to the neural network) from its description as a list of its attribute names.

load(path)

Part of the l2rpn_baselines interface, this function allows to read back a trained model, to continue the training or to evaluate its performance for example.

my_act(transformed_observation, reward[, done])

This function will return the action (its id) selected by the underlying DeepQAgent.deep_q network.

save(path)

Part of the l2rpn_baselines interface, this allows to save a model.

train(env, iterations, save_path, logdir[, ...])

Part of the public l2rpn-baselines interface, this function allows to train the baseline.

convert_obs(observation)[source]

Generic way to convert an observation. This transforms it into a vector and then selects the attributes listed in l2rpn_baselines.utils.NNParam.list_attr_obs (whose indices have been extracted once and for all into the DeepQAgent._indx_obs vector).

Parameters:

observation (grid2op.Observation.BaseObservation) – The current observation sent by the environment

Returns:

_tmp_obs – The observation as a vector, with only the proper attributes selected (TODO scaling will be available in a future version)

Return type:

numpy.ndarray

staticmethod get_action_size(action_space, filter_fun, kwargs_converters)[source]

This function allows you to get the size of the action space if we were to build a DeepQAgent with these parameters.

Parameters:
  • action_space (grid2op.ActionSpace) – The grid2op action space used.

  • filter_fun (callable) – see DeepQAgent.filter_fun for more information

  • kwargs_converters (dict) –

    see the documentation of grid2op for more information

init_obs_extraction(observation_space)[source]

This method should be called to initialize the observation (fed as a vector to the neural network) from its description as a list of its attribute names.

load(path)[source]

Part of the l2rpn_baselines interface, this function allows to read back a trained model, to continue the training or to evaluate its performance for example.

NB To reload an agent, it must have exactly the same name and have been saved at the right location.

Parameters:

path (str) – The path where the agent has previously been saved.

my_act(transformed_observation, reward, done=False)[source]

This function will return the action (its id) selected by the underlying DeepQAgent.deep_q network.

Before being used, this method requires that DeepQAgent.deep_q is created. To that end, a call to DeepQAgent.init_deep_q() needs to have been performed (this is done automatically if you use the baselines we provide and their evaluate and train scripts).

Parameters:
  • transformed_observation (numpy.ndarray) – The observation, as transformed after DeepQAgent.convert_obs()

  • reward (float) – The reward of the last time step. Ignored by this method; present for backward compatibility with the OpenAI gym interface.

  • done (bool) – Whether the episode is over or not. This is not used and is only present to be compliant with the OpenAI gym interface

Returns:

res – The id of the action taken.

Return type:

int

save(path)[source]

Part of the l2rpn_baselines interface, this allows to save a model. Its name is used at saving time. The same name must be reused when loading it back.

Parameters:

path (str) – The path where to save the agent.

train(env, iterations, save_path, logdir, training_param=None)[source]

Part of the public l2rpn-baselines interface, this function allows to train the baseline.

If save_path is not None, the model is saved regularly, and also at the end of training.

TODO explain a bit more how you can train it.

Parameters:
  • env (grid2op.Environment.Environment or grid2op.Environment.MultiEnvironment) – The environment used to train your model.

  • iterations (int) – The number of training iterations. NB when reloading a model, this is NOT the number of training steps that will be used when re-training. Indeed, if iterations is 1000 and the model was already trained for 750 time steps, then when reloaded, the training will occur on 250 (= 1000 - 750) time steps only.

  • save_path (str) – Location at which to save the model

  • logdir (str) – Location at which tensorboard related information will be kept.

  • training_param (l2rpn_baselines.utils.TrainingParam) – The meta parameters for the training procedure. This is currently ignored if the model is reloaded (in that case the parameters used when first created will be used)
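
As a rough, hypothetical sketch of how these pieces fit together (the environment name, architecture, paths and the MyDeepQ class from the BaseDeepQ example above are all assumptions; the concrete baselines shipped with l2rpn_baselines wrap these steps in their own train functions):

import grid2op
from l2rpn_baselines.utils import DeepQAgent, NNParam, TrainingParam

class MyNNParam(NNParam):
    # assumption: point nn_class to a concrete BaseDeepQ implementation
    # (here MyDeepQ, the hypothetical subclass sketched in the BaseDeepQ section;
    # the plain NNParam uses the abstract BaseDeepQ, which cannot be trained as is)
    nn_class = MyDeepQ

env = grid2op.make("l2rpn_case14_sandbox")   # any grid2op environment

list_attr_obs = ["rho", "line_status"]       # observation attributes fed to the network
nn_archi = MyNNParam(
    # no action filtering, no extra converter kwargs (assumption)
    action_size=DeepQAgent.get_action_size(env.action_space, None, {}),
    observation_size=MyNNParam.get_obs_size(env, list_attr_obs),
    sizes=[100, 100],
    activs=["relu", "relu"],
    list_attr_obs=list_attr_obs)

agent = DeepQAgent(env.action_space, nn_archi, name="MyDeepQAgent", istraining=True)
agent.train(env,
            iterations=10_000,
            save_path="./saved_agents",
            logdir="./tb_logs",
            training_param=TrainingParam())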

class l2rpn_baselines.utils.GymAgent(g2op_action_space, gym_act_space, gym_obs_space, *, nn_path=None, nn_kwargs=None, gymenv=None, _check_both_set=True, _check_none_set=True)[source]

This class wraps a neural network (trained using ray / rllib or stable baselines, for example) into a grid2op agent.

It can then be used as a “regular” grid2op agent, in a runner, grid2viz, grid2game etc.

It is also compatible with the “l2rpn baselines” interface.

Use it only with a trained agent. It does not provide the “save” method and is not suitable for training.

Note

To load a previously saved agent the function GymAgent.load will be called and you must provide the nn_path keyword argument.

To build a new agent, the function GymAgent.build is called and you must provide the nn_kwargs keyword argument.

Examples

Some examples of such agents are provided with the “PPO_RLLIB” and “PPO_SB3” baselines.

Both can benefit from the features of this class, most notably the possibility to include “heuristics” (such as: “if a powerline can be reconnected, do it” or “do not act if the grid is not in danger”)

Notes

The main goal of this class is to be able to use “heuristics” (both for training and at inference time) quite simply, with out-of-the-box support for external libraries.

All top performers in all l2rpn competitions (as of writing) used some kind of heuristics in their agent (such as: “if a powerline can be reconnected, do it” or “do not act if the grid is not in danger”). This is why we made some effort to develop a generic class that allows to train agents directly using these “heuristics”.

This features is split in two parts:

  • At training time, the “heuristics” are part of the environment. The agent only sees observations that are relevant to it (and not the states handled by the heuristics).

  • At inference time, the “heuristics” of the environment used to train the agent are included in the “agent.act” function. If a heuristic has been used at training time, the agent will first “ask” the environment if a heuristic should be performed on the grid (in which case it performs it); otherwise it asks the underlying neural network what to do.

Some examples are provided in the “examples” directory of the repository (under “examples/ppo_stable_baselines”) and demonstrate the use of l2rpn_baselines.utils.GymEnvWithRecoWithDN.
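
A minimal sketch of what a concrete subclass could look like on top of stable-baselines3 (this is not the actual PPO_SB3 implementation; the attribute names used to access the constructor arguments, the choice of PPO and the storage of the model are assumptions):

from l2rpn_baselines.utils import GymAgent

class MySB3Agent(GymAgent):
    """hypothetical wrapper around a stable-baselines3 PPO model"""

    def build(self):
        # called when the agent is created with nn_kwargs (and nn_path=None);
        # assumption: the kwargs passed at construction are stored in self._nn_kwargs
        from stable_baselines3 import PPO
        self.nn_model = PPO(**self._nn_kwargs)

    def load(self):
        # called when the agent is created with nn_path (and nn_kwargs=None);
        # assumption: the path passed at construction is stored in self._nn_path
        from stable_baselines3 import PPO
        self.nn_model = PPO.load(self._nn_path)

    def get_act(self, gym_obs, reward, done):
        # retrieve the action from the NN model
        action, _ = self.nn_model.predict(gym_obs, deterministic=True)
        return action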

Methods:

act(observation, reward, done)

This function is called to "map" the grid2op world into a format usable by a neural network (for example, a format usable by stable baselines or ray/rllib)

build()

Build the NN model.

clean_heuristic_actions(observation, reward, ...)

This function allows you to clean up / filter the heuristic actions.

get_act(gym_obs, reward, done)

retrieve the action from the NN model

load()

Load the NN model

act(observation: BaseObservation, reward: float, done: bool) BaseAction[source]

This function is called to “map” the grid2op world into a format usable by a neural network (for example, a format usable by stable baselines or ray/rllib)

Parameters:
  • observation (BaseObservation) – The grid2op observation

  • reward (float) – The reward

  • done (bool) – the “done” flag returned by OpenAI gym.

Returns:

The action taken by the agent, in a form of a grid2op BaseAction.

Return type:

BaseAction

Notes

In case your “real agent” wants to implement some “non learned” heuristic, you can also put them here.

In this case the “gym agent” will only be used in particular settings.

abstractmethod build()[source]

Build the NN model.

Note

Only called if the agent has been built with nn_path=None and nn_kwargs not None

clean_heuristic_actions(observation: BaseObservation, reward: float, done: bool) None[source]

This function allows you to clean up / filter the heuristic actions.

It is called at each step, just after the heuristic actions are computed (but before they are selected).

It can be used, for example, to reorder self._action_list.

It is not used during training.

Parameters:
  • observation (BaseObservation) – The current observation

  • reward (float) – the current reward

  • done (bool) – the current flag “done”

abstractmethod get_act(gym_obs, reward, done)[source]

retrieve the action from the NN model

abstractmethod load()[source]

Load the NN model

Note

Only called if the agent has been built with nn_path not None and nn_kwargs=None

class l2rpn_baselines.utils.GymEnvWithHeuristics(env_init, *args, reward_cumul='last', **kwargs)[source]

This abstract class is used to perform some actions on a grid2op environment, independently of an RL agent.

It can be used, for example, to train an agent (e.g. a deep RL agent) when you want to use some heuristics at inference time (for example, reconnecting every powerline that you can).

The heuristic you want to implement should be implemented in GymEnvWithHeuristics.heuristic_actions().

Examples

Let’s imagine, for example, that you want to implement an RL agent that performs actions on the grid. But you noticed that your agent performs better if all the powerlines are reconnected (which is often the case, by the way).

To that end, you want to force the reconnection of powerlines whenever possible. When it’s not possible, you want to let the neural network do what is best for the environment.

Training an agent in such a setting might be difficult and require recoding some (deep) parts of the training framework (eg stable-baselines). Unless… you use a dedicated “environment”.

This environment (compatible with gym, inheriting from the base class gym.Env) will handle all the “heuristic” part and only show the agent the states where it should act.

Basically a “step” happens like this:

  1. the agent issues an action (gym format)

  2. the action (gym format) is decoded to a grid2op compatible action (thanks to the action_space)

  3. this grid2op action is implemented on the grid (thanks to the underlying grid2op environment) and the corresponding grid2op observation is generated

  4. this observation is processed by GymEnvWithHeuristics.apply_heuristics_actions(): grid2op_env.step is called until the NN agent is required to take a decision (or the flag done=True is set).

  5. the observation (corresponding to the last step above) is then converted to a gym observation (thanks to the observation_space), which is forwarded to the agent.

The agent then only “sees” what is not processed by the heuristic. It is trained only on the relevant “state”.
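
A minimal sketch of such an environment, implementing the powerline reconnection heuristic used as an example above (this mirrors what GymEnvWithReco provides; the attribute init_env, used here to access the underlying grid2op environment and its action space, is an assumption based on grid2op's GymEnv):

from l2rpn_baselines.utils import GymEnvWithHeuristics

class MyRecoEnv(GymEnvWithHeuristics):
    """hypothetical example: reconnect every powerline that can be reconnected"""

    def heuristic_actions(self, g2op_obs, reward, done, info):
        # powerlines that are disconnected and whose cooldown is over
        can_reco = (~g2op_obs.line_status) & (g2op_obs.time_before_cooldown_line == 0)
        res = []
        if can_reco.any():
            # one grid2op action per powerline to reconnect
            for l_id in can_reco.nonzero()[0]:
                act = self.init_env.action_space({"set_line_status": [(int(l_id), +1)]})
                res.append(act)
        # an empty list means "let the neural network agent act"
        return res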

Methods:

apply_heuristics_actions(g2op_obs, reward, ...)

This function implements the "logic" behind the heuristic part.

fix_action(grid2op_action, g2op_obs)

This function can be used to "fix" / "modify" / "cut" / "change" a grid2op action just before it will be applied to the underlying "env.step(...)"

heuristic_actions(g2op_obs, reward, done, info)

This function has the same signature as the "agent.act" function.

reset(*[, seed, return_info, options])

This function implements the "reset" function.

step(gym_action)

This function implements the special case of the "step" function (as seen by the "gym environment") that might call multiple times the "step" function of the underlying "grid2op environment" depending on the heuristic.

apply_heuristics_actions(g2op_obs: BaseObservation, reward: float, done: bool, info: Dict) Tuple[BaseObservation, float, bool, Dict][source]

This function implements the “logic” behind the heuristic part. Unless you have a particular reason to, you probably should not modify this function.

If you modify it, you should also modify the way the agent implements it (remember: this function is used at training time, the “GymAgent” part is used at inference time. Both behaviours should match for the best performance).

As long as there are “heuristics” / “expert rules” / etc. to apply, this function should perform steps in the underlying grid2op environment.

It is expected to return when:

  • either the flag done is True

  • or the neural network agent is asked to perform action on the grid

The neural network agent will receive the output of this function.

Parameters:
  • g2op_obs (BaseObservation) – The grid2op observation.

  • reward (float) – The reward

  • done (bool) – The flag that indicates whether the environment is over or not.

  • info (Dict) – Other information flags

Returns:

It should return obs, reward, done, info (the same as a single call to grid2op_env.step(grid2op_act))

Then, this will be transmitted to the neural network agent (but before the observation will be transformed to a gym observation thanks to the observation space.)

Return type:

Tuple[BaseObservation, float, bool, Dict]

fix_action(grid2op_action, g2op_obs)[source]

This function can be used to “fix” / “modify” / “cut” / “change” a grid2op action just before it will be applied to the underlying “env.step(…)”

This can be used, for example, to “limit the curtailment or storage” of the action in case it is too strong and would lead to a game over.

By default it does nothing.

Parameters:

grid2op_action – The grid2op action to (possibly) modify before it is passed to the underlying “env.step(…)”

abstractmethod heuristic_actions(g2op_obs: BaseObservation, reward: float, done: bool, info: Dict) List[BaseAction][source]

This function has the same signature as the “agent.act” function. It allows to implement a heuristic.

It can be called multiple times per “gymenv step” and is expected to return a list of grid2op actions (in the correct order) to be performed on the underlying grid2op environment.

An implementation of such a function (for example) can be found at GymEnvWithReco.heuristic_actions() or GymEnvWithRecoWithDN.heuristic_actions()

This function can return a list of actions that will be executed on the grid, in turn. It is only called again once each and every returned action has been executed.

Note

You MUST return “[do_nothing]” if your heuristic chooses to do nothing at a given step. Otherwise (if the returned list is empty “[]”), the agent is asked to perform an action.

Note

We remind that inside a “gym env” step, a lot of “grid2op env” steps might be happening.

As long as a heuristic action is selected (ie as long as this function does not return the empty list) this action is performed on the grid2op environment.

Parameters:
  • g2op_obs (BaseObservation) – The current grid2op observation

  • reward (float) – The last reward the agent (or the heuristic) had. This is the reward part of the last call to obs, reward, done, info = grid2op_env.step(grid2op_act)

  • done (bool) – Whether the environment is “done” or not. It should be “False” in most cases. This is the done part of the last call to obs, reward, done, info = grid2op_env.step(grid2op_act)

  • info (Dict) – info part of the last call to obs, reward, done, info = grid2op_env.step(grid2op_act)

Returns:

The ordered list of actions to implement, selected by the “heuristic” / “expert knowledge” / “automatic action”.

Return type:

List[BaseAction]

reset(*, seed=None, return_info=False, options=None)[source]

This function implements the “reset” function. It is called at the end of every episode and marks the beginning of a new one.

Again, before the agent sees any observations from the environment, they are processed by the “heuristics” / “expert rules”.

Note

The first observation seen by the agent is not necessarily the first observation of the grid2op environment.

Returns:

The first open ai gym observation received by the agent

Return type:

gym_obs

step(gym_action)[source]

This function implements the special case of the “step” function (as seen by the “gym environment”) that might call multiple times the “step” function of the underlying “grid2op environment” depending on the heuristic.

It takes a gym action and converts it to a grid2op action (thanks to the action space).

Then it processes the heuristics / expert rules / forced actions / etc. and returns the next gym observation that will be processed by the agent.

The number of “grid2op steps” can vary between different “gym environment” call to “step”.

It has the same signature as the gym.Env “step” function, of course.

Parameters:

gym_action – the action (represented as a gym one) that the agent wants to perform.

Returns:

  • gym_obs – The gym observation that will be processed by the agent

  • reward (float) – The reward of the agent (which might be computed over the grid2op steps performed by the heuristics, depending on reward_cumul)

  • done (bool) – Whether the episode is over or not

  • info (Dict) – Other types of information

class l2rpn_baselines.utils.GymEnvWithReco(env_init, *args, reward_cumul='last', **kwargs)[source]

This specific type of environment with “heuristics” / “expert rules” / “expert actions” is an example to illustrate how to perform an automatic powerline reconnection.

For this type of environment, the only heuristic implemented is the following: “each time I can reconnect a powerline, I don’t ask the agent; I reconnect it and send the agent the state after the powerline has been reconnected”.

With the proposed class, implementing it is fairly easy as shown in function GymEnvWithReco.heuristic_actions()

Methods:

heuristic_actions(g2op_obs, reward, done, info)

The heuristic is pretty simple: each time there is a disconnected powerline whose cooldown is at 0, the heuristic reconnects it.

heuristic_actions(g2op_obs, reward, done, info) List[BaseAction][source]

The heuristic is pretty simple: each time there is a disconnected powerline whose cooldown is at 0, the heuristic reconnects it.

Parameters:

See parameters of GymEnvWithHeuristics.heuristic_actions()

Return type:

See return values of GymEnvWithHeuristics.heuristic_actions()

class l2rpn_baselines.utils.GymEnvWithRecoWithDN(env_init, *args, reward_cumul='init', safe_max_rho=0.9, **kwargs)[source]

This environment is slightly more complex than the other one.

It consists of 2 things:

  1. reconnecting the powerlines if possible

  2. doing nothing if the state of the grid is “safe” (for this class, the notion of “safety” is pretty simple: if all flows are below 90% (by default) of the thermal limit, then it is safe)

If, for a given step, none of these things is applicable, the underlying trained agent is asked to perform an action

Warning

When using this environment, we highly recommend adapting the parameter safe_max_rho to suit your needs.

Sometimes, 90% of the thermal limit is too high, sometimes it is too low.

Methods:

heuristic_actions(g2op_obs, reward, done, info)

To match the description of the environment, this heuristic will:

heuristic_actions(g2op_obs, reward, done, info) List[BaseAction][source]

To match the description of the environment, this heuristic will:

  • return the list of all the powerlines that can be reconnected if any

  • return the list “[do nothing]” if the grid is safe

  • return the empty list (signaling the agent should take control over the heuristics) otherwise

Parameters:

See parameters of GymEnvWithHeuristics.heuristic_actions()

Return type:

See return values of GymEnvWithHeuristics.heuristic_actions()
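
A short sketch of how this environment could be used (the environment name and the threshold are illustrative):

import grid2op
from l2rpn_baselines.utils import GymEnvWithRecoWithDN

env = grid2op.make("l2rpn_case14_sandbox")
# powerlines are reconnected automatically and "do nothing" is played
# as long as all flows stay below 95% of the thermal limit
gym_env = GymEnvWithRecoWithDN(env, safe_max_rho=0.95)

# gym_env can then be used like any gym environment (e.g. with stable-baselines3);
# the agent only sees the steps on which the heuristics do not apply
gym_obs = gym_env.reset()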

class l2rpn_baselines.utils.NNParam(action_size, observation_size, sizes, activs, list_attr_obs)[source]

This class provides an easy way to save and restore, as json, the shape of your neural networks (number of layers, non-linearities, size of each layer, etc.)

It is recommended to overload this class for each specific model.

Warning

This baseline recodes entire the RL training procedure. You can use it if you want to have a deeper look at Deep Q Learning algorithm and a possible (non optimized, slow, etc. implementation ).

For a much better implementation, you can reuse the code of “PPO_RLLIB” or the “PPO_SB3” baseline.

Prefer to use the GymAgent class and the GymEnvWithHeuristics classes to train agent interacting with grid2op and fully compatible with gym framework.

nn_class

The neural network class that will be created with each call of l2rpn_baselines.make_nn()

Type:

l2rpn_baselines.BaseDeepQ

observation_size

The size of the observation space.

Type:

int

action_size

The size of the action space.

Type:

int

sizes

A list of integers; each represents the number of hidden units of a layer. The number of hidden layers is given by the length of this list.

Type:

list

activs

List of activation functions (given as strings). It should have the same length as NNParam.sizes. Each entry should be the name of a keras activation function.

Type:

list

list_attr_obs

List of the attributes that will be used from the observation and concatenated to be fed to the neural network.

Type:

list
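
For example, here is a sketch (the environment name, the observation attributes, the action size and the paths are illustrative) describing a network with two hidden layers and saving its description as json:

import os
import grid2op
from l2rpn_baselines.utils import NNParam

env = grid2op.make("l2rpn_case14_sandbox")
list_attr_obs = ["rho", "line_status", "topo_vect"]

nn_archi = NNParam(action_size=150,  # assumed size of the converted action space
                   observation_size=NNParam.get_obs_size(env, list_attr_obs),
                   sizes=[300, 300],          # two hidden layers of 300 units
                   activs=["relu", "relu"],   # one keras activation name per layer
                   list_attr_obs=list_attr_obs)

os.makedirs("./model_arch", exist_ok=True)
nn_archi.save_as_json("./model_arch", name="nn_archi.json")
nn_archi_reloaded = NNParam.from_json("./model_arch/nn_archi.json")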

Methods:

center_reduce(env)

currently not implemented for this class, "coming soon" as we might say

from_dict(tmp)

load from a dictionary

from_json(json_path)

load from a json file

get_obs_attr()

get the names of the observation attributes that will be extracted

get_obs_size(env, list_attr_name)

get the size of the flattened observation

get_path_model(path[, name])

get the path at which the model will be saved

make_nn(training_param)

build the appropriate BaseDeepQ

save_as_json(path[, name])

save as a json file

to_dict()

convert this instance to a dictionary

Classes:

nn_class

alias of BaseDeepQ

center_reduce(env)[source]

currently not implemented for this class, “coming soon” as we might say

classmethod from_dict(tmp)[source]

load from a dictionary

classmethod from_json(json_path)[source]

load from a json file

get_obs_attr()[source]

get the names of the observation attributes that will be extracted

staticmethod get_obs_size(env, list_attr_name)[source]

get the size of the flattened observation

classmethod get_path_model(path, name=None)[source]

get the path at which the model will be saved

make_nn(training_param)[source]

build the appropriate BaseDeepQ

nn_class

alias of BaseDeepQ

save_as_json(path, name=None)[source]

save as a json file

to_dict()[source]

convert this instance to a dictionary

class l2rpn_baselines.utils.ReplayBuffer(buffer_size)[source]

Constructs a buffer object that stores the past moves and samples a set of subsamples

Methods:

add(s, a, r, d, s2)

Add an experience to the buffer

sample(batch_size)

Samples a total of elements equal to batch_size from buffer if buffer contains enough elements.

add(s, a, r, d, s2)[source]

Add an experience to the buffer

sample(batch_size)[source]

Samples a total of elements equal to batch_size from buffer if buffer contains enough elements. Otherwise return all elements
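
A quick usage sketch (the transition contents are dummy values, and it is assumed that sample returns the batches in the same (s, a, r, d, s2) order as add and BaseDeepQ.train):

import numpy as np
from l2rpn_baselines.utils import ReplayBuffer

buffer = ReplayBuffer(40000)   # buffer_size

# store a transition: state, action, reward, done flag, next state
s = np.zeros(10, dtype=np.float32)
s2 = np.ones(10, dtype=np.float32)
buffer.add(s, 3, 1.0, False, s2)

# sample a minibatch (all elements are returned while the buffer is still small)
s_batch, a_batch, r_batch, d_batch, s2_batch = buffer.sample(batch_size=32)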

class l2rpn_baselines.utils.TrainingParam(buffer_size=40000, minibatch_size=64, step_for_final_epsilon=100000, min_observation=5000, final_epsilon=0.000496031746031746, initial_epsilon=0.4, lr=0.0001, lr_decay_steps=10000, lr_decay_rate=0.999, num_frames=1, discount_factor=0.99, tau=0.01, update_freq=256, min_iter=50, max_iter=8064, update_nb_iter=10, step_increase_nb_iter=0, update_tensorboard_freq=1000, save_model_each=10000, random_sample_datetime_start=None, oversampling_rate=None, max_global_norm_grad=None, max_value_grad=None, max_loss=None, min_observe=None, sample_one_random_action_begin=None)[source]

A class to store the training parameters of the models. It was hard coded in the getting_started/notebook 3 of grid2op and put in this repository instead.

Warning

This baseline recodes the entire RL training procedure. You can use it if you want a deeper look at the Deep Q-Learning algorithm and a possible (non-optimized, slow, etc.) implementation.

For a much better implementation, you can reuse the code of the “PPO_RLLIB” or the “PPO_SB3” baseline.

Prefer the GymAgent class and the GymEnvWithHeuristics classes to train agents that interact with grid2op and are fully compatible with the gym framework.

buffer_size

Size of the replay buffer

Type:

int

minibatch_size

Size of the training minibatch

Type:

int

update_freq

Frequency at which the model is trained. Model is trained once every update_freq steps using minibatch_size from an experience replay buffer.

Type:

int

final_epsilon

value for the final epsilon (for the e-greedy)

Type:

float

initial_epsilon

value for the initial epsilon (for the e-greedy)

Type:

float

step_for_final_epsilon

number of steps at which the final epsilon (for the epsilon-greedy exploration) will be reached

Type:

int

min_observation

number of observations before starting to train the neural nets. Before this number of iterations, the agent will simply interact with the environment.

Type:

int

lr

The initial learning rate

Type:

float

lr_decay_steps

The learning rate decay step

Type:

int

lr_decay_rate

The learning rate decay rate

Type:

float

num_frames

Currently not used

Type:

int

discount_factor

The discount factor (a high discount factor favors longer episodes, a small one does not really). This is often called “gamma” in RL papers. It is the gamma in: “RL aims to maximize the sum of discounted rewards, which is sum_{t >= t_0} gamma^{t - t_0} r_t”

Type:

float

tau

Coefficient used to update the target model. The target model is updated according to: target_model_weights[i] = tau * model_weights[i] + (1 - tau) * target_model_weights[i]

Type:

float
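
In code, this soft update amounts to something like the following sketch (plain numpy, purely illustrative):

import numpy as np

def soft_update(model_weights, target_weights, tau=0.01):
    """blend the main model weights into the target model weights"""
    return [tau * w + (1.0 - tau) * w_t
            for w, w_t in zip(model_weights, target_weights)]

# example with two dummy weight arrays
model_w = [np.ones((2, 2)), np.zeros(3)]
target_w = [np.zeros((2, 2)), np.ones(3)]
target_w = soft_update(model_w, target_w, tau=0.01)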

min_iter

It is possible in the training schedule to limit the number of time steps an episode can last. This is mainly useful at the beginning of training, to avoid reaching states where the grid has been modified so much that the agent will never get into a similar state ever again. Stopping the episode before this happens can help the learning.

Type:

int

max_iter

Just like “min_iter”, but instead of being the minimum number of iterations, it’s the maximum.

Type:

int

update_nb_iter

If max_iter_fun is the default one, this number gives the number of times we need to succeed at a scenario before increasing the maximum number of timesteps allowed.

Type:

int

step_increase_nb_iter

By how many timesteps we increase the maximum number of timesteps allowed per episode. Set it to None to deactivate this (see the note in the training parameters table).

Type:

int or None

max_iter_fun

A function that returns the maximum number of steps an episode can last for the current epoch. For example, it can be max_iter_fun = lambda epoch_num: np.sqrt(50 * epoch_num) [default: lambda x: x / self.update_nb_iter]

Type:

function

oversampling_rate

Set it to None to deactivate the oversampling of hard scenarios. Otherwise, this oversampling is done with something like proba = 1. / (time_step_lived**oversampling_rate + 1), where proba is the probability for a scenario to be selected at the next call to “reset” and time_step_lived is the number of time steps the agent survived on that scenario (so scenarios on which the agent survived longer are less likely to be selected again).

Type:

float or None

random_sample_datetime_start

If None, during training the chronics will always start at their original start datetime. Otherwise, the training scheme will skip a number of time steps, between 0 and random_sample_datetime_start, when loading the next chronics. This is particularly useful when you want your agent to learn to operate the grid regardless of the hour of the day or the day of the week.

Type:

int or None

update_tensorboard_freq

Frequency at which tensorboard is refreshed (tensorboard summaries are saved every update_tensorboard_freq steps)

Type:

int

save_model_each

Frequency at which the model is saved (it is saved every “save_model_each” steps)

Type:

int

max_global_norm_grad

Maximum gradient norm allowed (can make the training more stable). Set it to None to deactivate it. Not all baselines are compatible with it.

Type:

float

max_value_grad

Maximum value the gradient can take. Assign it to None to deactivate it. This can make the training more stable in some cases, but can slow down the training process too. Not all baselines are compatible.

Type:

float

max_loss

Clip the value of the loss function. Set it to None to deactivate it. Again, this can make the training more stable but possibly slower. Not all baselines are compatible.

Type:

float

Methods:

default_max_iter_fun(nb_success)

the default max iteration function used

do_train()

return whether or not the model should be trained at this time step

from_dict(tmp)

initialize this instance from a dictionary

from_json(json_path)

initialize this instance from a json

get_next_epsilon(current_step)

get the next epsilon for the e greedy exploration

save_as_json(path[, name])

save this instance as a json

tell_step(current_step)

tell this instance the number of training steps that have been made

to_dict()

serialize this instance to a dictionary.

default_max_iter_fun(nb_success)[source]

the default max iteration function used

do_train()[source]

return whether or not the model should be trained at this time step

staticmethod from_dict(tmp)[source]

initialize this instance from a dictionary

staticmethod from_json(json_path)[source]

initialize this instance from a json

get_next_epsilon(current_step)[source]

get the next epsilon for the e greedy exploration

save_as_json(path, name=None)[source]

save this instance as a json

tell_step(current_step)[source]

tell this instance the number of training steps that have been made

to_dict()[source]

serialize this instance to a dictionary.

l2rpn_baselines.utils.cli_eval()[source]

some useful command line arguments (CLI) for the evaluation of the baseline.

l2rpn_baselines.utils.cli_train()[source]

some default command line arguments (cli) for training the baselines. Can be reused in some baselines here.

l2rpn_baselines.utils.make_multi_env(env_init, nb_env)[source]

This function creates a multi environment compatible with what is expected in the baselines. In particular, it adds the observation_space, the action_space and the reward_range attribute.

The way this function works is explained in the getting_started of grid2op.

Parameters:
  • env_init (grid2op.Environment.Environment) – The environment to duplicate

  • nb_env (int) – The number of environments with which you want to interact at the same time

Returns:

res – A copy of the initial environment (if nb_env = 1) or a MultiEnvironment based on the initial environment if nb_env >= 2.

Return type:

grid2op.Environment.MultiEnvironment or grid2op.Environment.Environment
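
A sketch of typical usage (the environment name and the number of copies are illustrative):

import grid2op
from l2rpn_baselines.utils import make_multi_env

env = grid2op.make("l2rpn_case14_sandbox")
# interact with 4 copies of the environment at the same time
multi_env = make_multi_env(env, nb_env=4)
# with nb_env=1 a copy of the initial environment is returned instead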

l2rpn_baselines.utils.save_log_gif(path_log, res, gif_name=None)[source]

Output a gif named (by default “episode.gif”) that is the replay of the episode in a gif format, for each episode in the input.

Parameters:
  • path_log (str) – Path where the log of the agents are saved.

  • res (list) – List resulting from the call to runner.run

  • gif_name (str) – Name of the gif that will be used.
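
For instance, combined with the grid2op Runner (a sketch; the paths and episode count are illustrative, and producing the gif requires the optional grid2op plotting dependencies):

import grid2op
from grid2op.Runner import Runner
from l2rpn_baselines.utils import save_log_gif

env = grid2op.make("l2rpn_case14_sandbox")
path_log = "./agent_logs"

# run an agent (here the Runner's default do-nothing agent) and store the episode logs
runner = Runner(**env.get_params_for_runner())
res = runner.run(nb_episode=1, path_save=path_log)

# write one gif per episode inside path_log
save_log_gif(path_log, res)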

l2rpn_baselines.utils.str2bool(v)[source]

INTERNAL DO NOT USE

l2rpn_baselines.utils.train_generic(agent, env, name='Template', iterations=1, save_path=None, load_path=None, **kwargs_train)[source]

This function is a helper to train more easily some agent using their default “train” method.

Warning

This baseline recodes the entire RL training procedure. You can use it if you want a deeper look at the Deep Q-Learning algorithm and a possible (non-optimized, slow, etc.) implementation.

For a much better implementation, you can reuse the code of the “PPO_RLLIB” or the “PPO_SB3” baseline.

Prefer the GymAgent class and the GymEnvWithHeuristics classes to train agents that interact with grid2op and are fully compatible with the gym framework.

Parameters:
  • agent (grid2op.Agent) – A grid2op agent that must implement all the baseline attributes and the train method.

  • env (grid2op.Environment) – The environment on which to train your baseline. It must be compatible with the agent created.

  • name (str) – Here for compatibility with the baseline “train” method. Currently unused (define the name when you create your baseline)

  • iterations (int) – Number of iterations on which to train your agent.

  • save_path (str) – Where to save your results (put None to deactivate saving)

  • load_path (str) – Path to load the agent from.

  • kwargs_train (dict) – Other arguments that will be passed to agent.train(…)

Returns:

agent – The trained agent.

Return type:

grid2op.Agent
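
A usage sketch (my_agent is assumed to be a baseline instance created beforehand and implementing the train method, for example the DeepQAgent sketched earlier; passing logdir through kwargs_train is an assumption about the concrete agent's train signature):

import grid2op
from l2rpn_baselines.utils import train_generic

env = grid2op.make("l2rpn_case14_sandbox")
# my_agent: any agent implementing the baseline interface (train / load / save),
# assumed to have been created beforehand
trained_agent = train_generic(my_agent,
                              env,
                              iterations=10_000,
                              save_path="./saved_agents",
                              load_path=None,
                              logdir="./tb_logs")  # forwarded to agent.train via kwargs_train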