PPO: with stable-baselines3

Description

This “baseline” aims at providing a code example on how to use an agent from the Stable Baselines3 repository (see https://stable-baselines3.readthedocs.io/en/master/) with grid2op.

It also serves a second goal: showing how to train a PPO agent to perform continuous actions on the powergrid (e.g. adjusting the generator values, either by applying redispatching-type actions on controllable generators, by applying curtailment on generators using new renewable energy sources - solar and wind - or even by controlling the state of the storage units).
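In practice, the kind of continuous actions the agent is allowed to use is selected through the act_attr_to_keep argument (see the train function and the examples below). A minimal sketch of the three families of actions mentioned above, using the attribute names that are the defaults of the train function:

# grid2op action attributes corresponding to the continuous actions above
# (these are the default values of "act_attr_to_keep" in the train function)
act_attr_to_keep = [
    "redispatch",   # redispatching on controllable generators
    "curtail",      # curtailment on renewable (solar / wind) generators
    "set_storage",  # setpoint of the storage units
]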

Exported class

You can use this class with:

from l2rpn_baselines.PPO_SB3 import train, evaluate, PPO_SB3

Use a trained agent

You first need to train it:

import re
import grid2op
from grid2op.Reward import LinesCapacityReward  # or any other rewards
from lightsim2grid import LightSimBackend  # highly recommended !
from grid2op.Chronics import MultifolderWithCache  # highly recommended for training
from l2rpn_baselines.PPO_SB3 import train

env_name = "l2rpn_case14_sandbox"
env = grid2op.make(env_name,
                   reward_class=LinesCapacityReward,
                   backend=LightSimBackend(),
                   chronics_class=MultifolderWithCache)

env.chronics_handler.real_data.set_filter(lambda x: re.match(".*0$", x) is not None)
env.chronics_handler.real_data.reset()
# see https://grid2op.readthedocs.io/en/latest/environment.html#optimize-the-data-pipeline
# for more information !
train(env,
      iterations=1_000,
      logs_dir="./logs",
      save_path="./saved_model",
      name="test",
      net_arch=[200, 200, 200],
      save_every_xxx_steps=2000,
      )

Then you can load it:

import grid2op
from grid2op.Reward import LinesCapacityReward  # or any other rewards
from lightsim2grid import LightSimBackend  # highly recommended !
from grid2op.Runner import Runner  # used below to run the do nothing agent
from l2rpn_baselines.PPO_SB3 import evaluate

nb_episode = 7
nb_process = 1
verbose = True

env_name = "l2rpn_case14_sandbox"
env = grid2op.make(env_name,
                   reward_class=LinesCapacityReward,
                   backend=LightSimBackend()
                  )

try:
    trained_agent, res_eval = evaluate(
                env,
                nb_episode=nb_episode,
                load_path="./saved_model",
                name="test4",
                nb_process=1,
                verbose=verbose,
                )

    # you can also compare your agent with the do nothing agent relatively
    # easily
    runner_params = env.get_params_for_runner()
    runner = Runner(**runner_params)

    res = runner.run(nb_episode=nb_episode,
                    nb_process=nb_process
                    )

    # Print summary
    if verbose:
        print("Evaluation summary for DN:")
        for _, chron_name, cum_reward, nb_time_step, max_ts in res:
            msg_tmp = "chronics at: {}".format(chron_name)
            msg_tmp += "\ttotal score: {:.6f}".format(cum_reward)
            msg_tmp += "\ttime steps: {:.0f}/{:.0f}".format(nb_time_step, max_ts)
            print(msg_tmp)
finally:
    env.close()

Create an agent from scratch

For example, to create an agent from scratch, with some parameters:

import grid2op
from grid2op.gym_compat import GymEnv, BoxGymObsSpace, BoxGymActSpace
from lightsim2grid import LightSimBackend
from stable_baselines3.ppo import MlpPolicy
from l2rpn_baselines.PPO_SB3 import PPO_SB3

env_name = "l2rpn_case14_sandbox"  # or any other name

# customize the observation / action you want to keep
obs_attr_to_keep = ["day_of_week", "hour_of_day", "minute_of_hour", "prod_p", "prod_v", "load_p", "load_q",
                    "actual_dispatch", "target_dispatch", "topo_vect", "time_before_cooldown_line",
                    "time_before_cooldown_sub", "rho", "timestep_overflow", "line_status",
                    "storage_power", "storage_charge"]
act_attr_to_keep = ["redispatch"]

# create the grid2op environment
env = grid2op.make(env_name, backend=LightSimBackend())

# define the action space and observation space that your agent
# will be able to use
env_gym = GymEnv(env)
env_gym.observation_space.close()
env_gym.observation_space = BoxGymObsSpace(env.observation_space,
                                           attr_to_keep=obs_attr_to_keep)
env_gym.action_space.close()
env_gym.action_space = BoxGymActSpace(env.action_space,
                                      attr_to_keep=act_attr_to_keep)

# create the key word arguments used for the NN
nn_kwargs = {
    "policy": MlpPolicy,
    "env": env_gym,
    "verbose": 0,
    "learning_rate": 1e-3,
    "tensorboard_log": ...,
    "policy_kwargs": {
        "net_arch": [100, 100, 100]
    }
}

# create a grid2op agent based on that (this will build the NN from scratch, using nn_kwargs)
grid2op_agent = PPO_SB3(env.action_space,
                        env_gym.action_space,
                        env_gym.observation_space,
                        nn_kwargs=nn_kwargs
                       )

Note

The agent above is NOT trained. So it will basically output “random” actions.

You should probably train it beforehand (see the train function).
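If you simply want to check that the observation / action conversion works, you can still run this (untrained) agent for an episode. This is a minimal sketch, assuming the agent exposes the standard grid2op act(obs, reward, done) interface:

# minimal sketch: run the (untrained) agent for one episode on the grid2op env
obs = env.reset()
reward = env.reward_range[0]
done = False
while not done:
    grid2op_act = grid2op_agent.act(obs, reward, done)
    obs, reward, done, info = env.step(grid2op_act)
env.close()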

Caveats

Be careful: at the time of writing, all major RL packages are migrating from the legacy “open ai gym” API to the more recent “gymnasium” API.

This transition will happen soon in grid2op too. But for now, and in order not to break legacy code that might have worked in previous L2RPN competitions, the switch from gym to gymnasium has not yet been made.

This means that installing stable-baselines3 might cause some issues with grid2op and l2rpn-baselines due to the gym / gymnasium migration.

Detailed documentation

Classes:

PPO_SB3

alias of SB3Agent

Functions:

evaluate(env[, load_path, name, logs_path, ...])

This function will use stable baselines 3 to evaluate a previously trained PPO agent (with stable baselines 3) on a grid2op environment "env".

remove_non_usable_attr(grid2openv, ...)

This function modifies the attributes (of the actions) to remove the ones that are not usable with your gym environment.

save_used_attribute(save_path, name, ...)

Serialize, as json, the obs_attr_to_keep and act_attr_to_keep

train(env[, name, iterations, save_path, ...])

This function will use stable baselines 3 to train a PPO agent on a grid2op environment "env".

l2rpn_baselines.PPO_SB3.PPO_SB3

alias of SB3Agent

Methods:

build()

Create the underlying NN model from scratch.

get_act(gym_obs, reward, done)

Retrieve the gym action from the gym observation and the reward.

load()

Load the NN model.

l2rpn_baselines.PPO_SB3.evaluate(env, load_path='.', name='PPO_SB3', logs_path=None, nb_episode=1, nb_process=1, max_steps=-1, verbose=False, save_gif=False, gymenv_class=<class 'grid2op.gym_compat.gymenv.GymnasiumEnv'>, gymenv_kwargs=None, obs_space_kwargs=None, act_space_kwargs=None, iter_num=None, **kwargs)[source]

This function will use stable baselines 3 to evaluate a previously trained PPO agent (with stable baselines 3) on a grid2op environment “env”.

It will use the grid2op “gym_compat” module to convert the action space to a BoxActionSpace and the observation to a BoxObservationSpace.

It is suited for studying the impact of continuous actions:

  • on storage units

  • on dispatchable generators

  • on generators with renewable energy sources

Parameters:
  • env (grid2op.Environment) – The environment on which you need to evaluate your agent.

  • name (str) – The name of your agent.

  • load_path (str) – If you want to reload your baseline, specify the path where it is located. NB if a baseline is reloaded, some of the arguments provided to this function will not be used.

  • logs_path (str) – Where to store the tensorboard generated logs during the evaluation. None if you don’t want to log them.

  • nb_episode (int) – How many episodes to run during the assessment of the performances

  • nb_process (int) – On how many processes the assessment will be made. (setting this > 1 can lead to some speed ups but can be unstable on some platforms)

  • max_steps (int) – The maximum number of steps for which your agent will be assessed

  • verbose (bool) – Currently unused

  • save_gif (bool) – Whether or not you want to save, as a gif, the performance of your agent. It might cause memory issues (might take a lot of ram) and drastically increase computation time.

  • gymenv_class – The class to use as a gym environment. By default GymEnv (from module grid2op.gym_compat)

  • gymenv_kwargs (dict) – Extra keyword arguments used to build the gym environment.

  • iter_num – Which training iteration do you want to restore (by default: None means “the last one”)

  • kwargs – extra parameters passed to the PPO from stable baselines 3

Returns:

The loaded baseline as a stable baselines PPO element.

Return type:

baseline

Examples

Here is an example of how to evaluate a PPO agent (previously trained with stable baselines3):

import grid2op
from grid2op.Reward import LinesCapacityReward  # or any other rewards
from lightsim2grid import LightSimBackend  # highly recommended !
from grid2op.Runner import Runner  # used below to run the do nothing agent
from l2rpn_baselines.PPO_SB3 import evaluate

nb_episode = 7
nb_process = 1
verbose = True

env_name = "l2rpn_case14_sandbox"
env = grid2op.make(env_name,
                   reward_class=LinesCapacityReward,
                   backend=LightSimBackend()
                   )

try:
    evaluate(env,
            nb_episode=nb_episode,
            load_path="./saved_model",  # should be the same as what has been called in the train function !
            name="test",  # should be the same as what has been called in the train function !
            nb_process=1,
            verbose=verbose,
            )

    # you can also compare your agent with the do nothing agent relatively
    # easily
    runner_params = env.get_params_for_runner()
    runner = Runner(**runner_params)

    res = runner.run(nb_episode=nb_episode,
                    nb_process=nb_process
                    )

    # Print summary
    if verbose:
        print("Evaluation summary for DN:")
        for _, chron_name, cum_reward, nb_time_step, max_ts in res:
            msg_tmp = "chronics at: {}".format(chron_name)
            msg_tmp += "        total score: {:.6f}".format(cum_reward)
            msg_tmp += "        time steps: {:.0f}/{:.0f}".format(nb_time_step, max_ts)
            print(msg_tmp)
finally:
    env.close()
l2rpn_baselines.PPO_SB3.remove_non_usable_attr(grid2openv, act_attr_to_keep: List[str]) → List[str][source]

This function modifies the attributes (of the actions) to remove the ones that are not usable with your gym environment.

It only filters things if the default variables are used (see _default_act_attr_to_keep).

Parameters:
  • grid2openv (grid2op.Environment.Environment) – The used grid2op environment

  • act_attr_to_keep (List[str]) – The attributes of the actions to keep.

Returns:

The same as act_attr_to_keep if the user modified the default. Or the attributes usable by the environment from the default list.

Return type:

List[str]
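A minimal usage sketch, assuming a grid2op environment env has already been created (as in the examples above):

from l2rpn_baselines.PPO_SB3 import remove_non_usable_attr

# start from the default continuous action attributes and keep only the ones
# actually usable on this environment (for instance, storage related attributes
# would be filtered out on a grid without storage units)
act_attr_to_keep = ["redispatch", "curtail", "set_storage"]
act_attr_to_keep = remove_non_usable_attr(env, act_attr_to_keep)
print(act_attr_to_keep)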

l2rpn_baselines.PPO_SB3.save_used_attribute(save_path: str | None, name: str, obs_attr_to_keep: List[str], act_attr_to_keep: List[str]) → bool[source]

Serialize, as json, the obs_attr_to_keep and act_attr_to_keep

This is typically called in the train function.

Parameters:
  • save_path (Optional[str]) – where to save the used attributes (put None if you don’t want to save it)

  • name (str) – Name of the model

  • obs_attr_to_keep (List[str]) – List of observation attributes to keep

  • act_attr_to_keep (List[str]) – List of action attributes to keep

Returns:

whether the data have been saved or not

Return type:

bool
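A minimal usage sketch (this is normally done for you by the train function; the path, name and attribute lists below are only illustrative):

from l2rpn_baselines.PPO_SB3 import save_used_attribute

# serialize, as json, the attributes used to build the gym spaces,
# next to the model saved under save_path / name
obs_attr_to_keep = ["rho", "line_status", "actual_dispatch"]
act_attr_to_keep = ["redispatch"]
saved = save_used_attribute("./saved_model",   # save_path (illustrative)
                            "test",            # name of the model
                            obs_attr_to_keep,
                            act_attr_to_keep)
print("attributes saved:", saved)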

l2rpn_baselines.PPO_SB3.train(env, name='PPO_SB3', iterations=1, save_path=None, load_path=None, net_arch=None, logs_dir=None, learning_rate=0.0003, checkpoint_callback=None, save_every_xxx_steps=None, model_policy=<class 'l2rpn_baselines.PPO_SB3.train.MlpPolicy'>, obs_attr_to_keep=['day_of_week', 'hour_of_day', 'minute_of_hour', 'prod_p', 'prod_v', 'load_p', 'load_q', 'actual_dispatch', 'target_dispatch', 'topo_vect', 'time_before_cooldown_line', 'time_before_cooldown_sub', 'rho', 'timestep_overflow', 'line_status', 'storage_power', 'storage_charge'], obs_space_kwargs=None, act_attr_to_keep=['redispatch', 'curtail', 'set_storage'], act_space_kwargs=None, policy_kwargs=None, normalize_obs=False, normalize_act=False, gymenv_class=<class 'grid2op.gym_compat.gymenv.GymnasiumEnv'>, gymenv_kwargs=None, verbose=True, seed=None, eval_env=None, **kwargs)[source]

This function will use stable baselines 3 to train a PPO agent on a grid2op environment “env”.

It will use the grid2op “gym_compat” module to convert the action space to a BoxActionSpace and the observation to a BoxObservationSpace.

It is suited for studying the impact of continuous actions:

  • on storage units

  • on dispatchable generators

  • on generators with renewable energy sources

Parameters:
  • env (grid2op.Environment) – The environment on which you need to train your agent.

  • name (str) – The name of your agent.

  • iterations (int) – For how many iterations (steps) you want to train your agent. NB these are not episodes, these are steps.

  • save_path (str) – Where do you want to save your baseline.

  • load_path (str) – If you want to reload your baseline, specify the path where it is located. NB if a baseline is reloaded, some of the arguments provided to this function will not be used.

  • net_arch – The neural network architecture, used to create the neural network of the PPO (see https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html)

  • logs_dir (str) – Where to store the tensorboard generated logs during the training. None if you don’t want to log them.

  • learning_rate (float) – The learning rate, see https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html

  • save_every_xxx_steps (int) – If set (by default it’s None) the stable baselines3 model will be saved to the hard drive each save_every_xxx_steps steps performed in the environment.

  • model_policy – Type of neural network model trained in stable baselines. By default it’s MlpPolicy

  • obs_attr_to_keep (list of string) – Grid2op attribute to use to build the BoxObservationSpace. It is passed as the “attr_to_keep” value of the BoxObservation space (see https://grid2op.readthedocs.io/en/latest/gym.html#grid2op.gym_compat.BoxGymObsSpace)

  • obs_space_kwargs – Extra kwargs to build the BoxGymObsSpace (NOT saved then NOT restored)

  • act_attr_to_keep (list of string) – Grid2op attribute to use to build the BoxGymActSpace. It is passed as the “attr_to_keep” value of the BoxAction space (see https://grid2op.readthedocs.io/en/latest/gym.html#grid2op.gym_compat.BoxGymActSpace)

  • act_space_kwargs – Extra kwargs to build the BoxGymActSpace (NOT saved then NOT restored)

  • verbose (bool) – If you want something to be printed on the terminal (a better logging strategy will be put in place at some point)

  • normalize_obs (bool) – Attempt to normalize the observation space (so that gym-based stuff will only see numbers between 0 and 1)

  • normalize_act (bool) – Attempt to normalize the action space (so that gym-based stuff will only manipulate numbers between 0 and 1)

  • gymenv_class – The class to use as a gym environment. By default GymEnv (from module grid2op.gym_compat)

  • gymenv_kwargs (dict) – Extra keyword arguments used to build the gym environment (NOT saved / restored by this class)

  • policy_kwargs (dict) – extra parameters passed to the PPO “policy_kwargs” key word arguments (defaults to None)

  • kwargs – extra parameters passed to the PPO from stable baselines 3

Returns:

The trained baseline as a stable baselines PPO element.

Return type:

baseline

Examples

Here is an example of how to train a PPO agent with stable baselines 3.

First define a python script, for example:

import re
import grid2op
from grid2op.Reward import LinesCapacityReward  # or any other rewards
from grid2op.Chronics import MultifolderWithCache  # highly recommended
from lightsim2grid import LightSimBackend  # highly recommended for training !
from l2rpn_baselines.PPO_SB3 import train

env_name = "l2rpn_case14_sandbox"
env = grid2op.make(env_name,
                   reward_class=LinesCapacityReward,
                   backend=LightSimBackend(),
                   chronics_class=MultifolderWithCache)

env.chronics_handler.real_data.set_filter(lambda x: re.match(".*00$", x) is not None)
env.chronics_handler.real_data.reset()
# see https://grid2op.readthedocs.io/en/latest/environment.html#optimize-the-data-pipeline
# for more information !

try:
    trained_agent = train(
          env,
          iterations=10_000,  # any number of iterations you want
          logs_dir="./logs",  # where the tensorboard logs will be put
          save_path="./saved_model",  # where the NN weights will be saved
          name="test",  # name of the baseline
          net_arch=[100, 100, 100],  # architecture of the NN
          save_every_xxx_steps=2000,  # save the NN every 2k steps
          )
finally:
    env.close()
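
As a variation, the normalize_obs and normalize_act flags documented above can be used to attempt to rescale what the neural network sees and produces. This is only a sketch; it replaces the call to train in the script above (the name "test_normalized" is purely illustrative):

# same script as above, only the call to `train` changes
trained_agent = train(
      env,
      iterations=10_000,
      logs_dir="./logs",
      save_path="./saved_model",
      name="test_normalized",   # illustrative name
      net_arch=[100, 100, 100],
      normalize_obs=True,   # attempt to normalize the observation space
      normalize_act=True,   # attempt to normalize the action space
      )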