{"id":5647872,"date":"2023-05-14T00:00:20","date_gmt":"2023-05-14T04:00:20","guid":{"rendered":"https:\/\/lightning.ai\/pages\/?p=5647872"},"modified":"2023-05-16T06:45:38","modified_gmt":"2023-05-16T10:45:38","slug":"how-to-train-reinforcement-learning-model-to-play-game-using-proximal-policy-optimization-ppo-algorithm","status":"publish","type":"post","link":"https:\/\/lightning.ai\/pages\/community\/tutorial\/how-to-train-reinforcement-learning-model-to-play-game-using-proximal-policy-optimization-ppo-algorithm\/","title":{"rendered":"How To Train Reinforcement Learning Model To Play Game Using Proximal Policy Optimization (PPO) Algorithm"},"content":{"rendered":"<div class=\"takeaways card-glow p-4 my-4\"><h3 class=\"w-100 d-block\">Takeaways<\/h3> Learn how to build and train a Reinforcement Learning model with PyTorch and <a href=\"https:\/\/lightning.ai\/pages\/open-source\/fabric\/\">Lightning Fabric<\/a>. You will also create and train a Reinforcement Learning agent to play a game in a simulated environment using Proximal Policy Optimization (PPO) algorithm. Based on the contribution <a href=\"https:\/\/github.com\/Lightning-AI\/lightning\/tree\/master\/examples\/fabric\/reinforcement_learning\">here<\/a> <\/div>\n<h2>About the Author<\/h2>\n<p>Federico is currently working as Data Scientist at Orobix, a front-runner in the AI industry in Italy, where he not only solves complex, real-world problems but also effectively bridges the gap between theory and application. 
With 3.5 years of experience in Computer Vision, with an emphasis on Self-Supervised Learning, segmentation, and classification, as well as in Reinforcement Learning, his work consistently pushes the envelope of what AI can achieve in industry.<\/p>\n<h2>Introduction to Reinforcement Learning<\/h2>\n<p>Reinforcement Learning (RL) is a type of machine learning algorithm that trains intelligent agents to make decisions by interacting with an environment and adapting their behavior to maximize a certain goal over time. It is inspired by how humans and animals learn from their experiences and adjust their actions accordingly.<\/p>\n<p>Reinforcement learning has been extremely successful in various applications, including robotics, autonomous vehicles, recommendation systems, and game-playing. One of the most famous examples is <a href=\"https:\/\/www.deepmind.com\/research\/highlighted-research\/alphago\">AlphaGo<\/a>, an AI system developed by DeepMind. It combined reinforcement learning with deep neural networks to defeat the world champion <a href=\"https:\/\/en.wikipedia.org\/wiki\/Go_(game)\">Go<\/a> player. Go is a strategy board game for two players that was invented in China more than 2500 years ago. The game is played on a 19&#215;19 board, and there are about 2&#215;10<sup>170<\/sup> possible legal board positions. The aim is to surround more territory than the opponent.<\/p>\n<h3>Components of reinforcement learning<\/h3>\n<ul>\n<li><strong>Agent<\/strong>: The agent is the entity (e.g., an AI algorithm or a robot) that learns and makes decisions based on its interactions with the environment.<\/li>\n<li><strong>Environment<\/strong>: The environment represents the external context or the world in which the agent operates. It can be as simple as a 2D Tic-Tac-Toe grid or as complex as the real world.<\/li>\n<li><strong>States<\/strong>: A state is a snapshot of the environment at a given point in time and represents what the agent perceives. 
It provides the agent with the necessary information to make decisions.<\/li>\n<li><strong>Actions<\/strong>: Actions are the set of possible moves or choices the agent can make in a given state. The agent&#8217;s objective is to choose the most appropriate action based on its current understanding of the environment.<\/li>\n<li><strong>Rewards<\/strong>: Rewards are the feedback the agent receives from the environment after performing an action. They indicate how well the agent is doing in achieving its goal. The agent&#8217;s objective is to learn a strategy that maximizes the cumulative reward over time.<\/li>\n<\/ul>\n<p>The high-level representation is shown in the following figure:<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/i.imgur.com\/jsyheWQ.png\" alt=\"https:\/\/i.imgur.com\/jsyheWQ.png\" \/><\/p>\n<p>In the above figure, time is discretized and represented by <strong>t<\/strong>, and the agent interacts with the <strong>environment<\/strong>. The agent receives an <strong>observation<\/strong> from the environment, which represents the <strong>state<\/strong> of the environment at that point in time. The agent then performs an <strong>action<\/strong> based on that state and receives a <strong>reward<\/strong> in return. The reward is a scalar value that indicates how good or bad the action was with respect to the particular goal or task that the agent is trying to achieve.<\/p>\n<h2>Journey with PyTorch for Reinforcement Learning<\/h2>\n<p>Orobix, an AI company from Italy, worked with a video game company to develop an RL framework. The goal was to improve the racing performance of <a href=\"https:\/\/en.wikipedia.org\/wiki\/Non-player_character\">non-player characters<\/a> (NPCs) in the game. 
This collaboration aimed to create a more competitive and immersive experience.<\/p>\n<p>The framework was built from scratch to give us full flexibility over the training loop and the distributed training infrastructure.<\/p>\n<p>We needed manual control over distributed training, half-precision, and every part of the code to make it more flexible. <a href=\"https:\/\/lightning.ai\/docs\/fabric\/stable\/\">Fabric<\/a>, a new library launched by the Lightning AI (formerly PyTorch Lightning) team, came our way. It gave us full flexibility over the custom training loop while at the same time abstracting away multi-device, distributed, and half-precision training.<\/p>\n<h2>Fabric-accelerated Reinforcement Learning<\/h2>\n<p>Now we will build and train an RL agent to play in a <a href=\"https:\/\/gymnasium.farama.org\/environments\/classic_control\/cart_pole\/\">CartPole<\/a> environment, where a pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. This agent is based on the <a href=\"https:\/\/spinningup.openai.com\/en\/latest\/algorithms\/ppo.html#proximal-policy-optimization\">Proximal Policy Optimization (PPO)<\/a> algorithm. 
The objective is to balance the pole by applying forces in the left and right directions on the cart:<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/i.imgur.com\/8tcJPCz.gif\" alt=\"https:\/\/i.imgur.com\/8tcJPCz.gif\" \/><\/p>\n<h3>What&#8217;s needed<\/h3>\n<p>We need to install the following libraries:<\/p>\n<ul>\n<li><a href=\"https:\/\/gymnasium.farama.org\/\">Gymnasium<\/a>: a standard API for reinforcement learning containing a diverse collection of reference environments<\/li>\n<li><a href=\"https:\/\/lightning.ai\/docs\/fabric\/stable\/\">Fabric<\/a>: used to accelerate and distribute our training<\/li>\n<\/ul>\n<p>The complete list of requirements can be found <a href=\"https:\/\/github.com\/Lightning-AI\/lightning\/blob\/master\/examples\/fabric\/reinforcement_learning\/requirements.txt\">here<\/a>.<\/p>\n<h3>Environment coupled with the Agent<\/h3>\n<p>Let\u2019s first understand the setting in which the environment is coupled with the agent. The main idea is depicted in the following figure:<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/i.imgur.com\/nYXi8H6.png\" alt=\"https:\/\/i.imgur.com\/nYXi8H6.png\" \/><\/p>\n<p>where we will spawn <em>N+1 processes<\/em>, called <em>rank-0<\/em>, &#8230;, <em>rank-N<\/em>; every process contains both the environment (possibly multiple different copies, <em>M+1<\/em> in the above figure) and the agent: they are <em>coupled<\/em> together in the same process.<\/p>\n<p>We start by defining our <code>main(...)<\/code> function, where we initialize the distributed training settings using Fabric.<\/p>\n<pre class=\"code-shortcode dark-theme window- collapse-false \" style=\"--height:falsepx\"><code class=\"language-python\"><br \/>\nfrom lightning.fabric import Fabric\n\ndef main(args):<br \/>\n    # Initialize Fabric<br \/>\n    fabric = Fabric()<br \/>\n    rank = fabric.global_rank  # The rank of the current process<br \/>\n    world_size = fabric.world_size  # Number of processes spawned<br \/>\n    device = 
fabric.device<br \/>\n    fabric.seed_everything(42)  # We seed everything for reproducibility purposes<br \/>\n<\/code><div class=\"copy-button\"><button class=\"expand-button\">Expand<\/button><button class=\"copy\">Copy<\/button><\/div><\/pre>\n<p>Next, we create the environment using <a href=\"https:\/\/github.com\/openai\/gym\">gymnasium<\/a>. First, we define a helper function <code>make_env<\/code> that creates a single environment.<\/p>\n<pre class=\"code-shortcode dark-theme window- collapse-false \" style=\"--height:falsepx\"><code class=\"language-python\"><br \/>\nimport os<br \/>\nfrom typing import Optional\n\nimport gymnasium as gym\n\ndef make_env(env_id: str, seed: int, idx: int, capture_video: bool, run_name: Optional[str] = None, prefix: str = \"\"):<br \/>\n    def thunk():<br \/>\n        env = gym.make(env_id, render_mode=\"rgb_array\")<br \/>\n        env = gym.wrappers.RecordEpisodeStatistics(env)<br \/>\n        if capture_video:<br \/>\n            if idx == 0 and run_name is not None:<br \/>\n                env = gym.wrappers.RecordVideo(<br \/>\n                    env, os.path.join(run_name, prefix + \"_videos\" if prefix else \"videos\"), disable_logger=True<br \/>\n                )<br \/>\n        env.action_space.seed(seed)<br \/>\n        env.observation_space.seed(seed)<br \/>\n        return env\n\n    return thunk<br \/>\n<\/code><div class=\"copy-button\"><button class=\"expand-button\">Expand<\/button><button class=\"copy\">Copy<\/button><\/div><\/pre>\n<p>Now, we will create a pool of parallel synchronized environments through the <a href=\"https:\/\/gymnasium.farama.org\/api\/vector\/#sync-vector-env\">SyncVectorEnv<\/a> object using the <code>make_env<\/code> function we just created.<\/p>\n<pre class=\"code-shortcode dark-theme window- collapse-false \" style=\"--height:falsepx\"><code class=\"language-python\"><br \/>\nimport gymnasium as gym\n\n# given an initial seed of 42 and 4 environments per rank, then<br \/>\n# rank-0 will seed the environments with --&gt; 42, 43, 44, 45<br \/>\n# rank-1 
will seed the environments with --&gt; 46, 47, 48, 49<br \/>\n# and so on<br \/>\nenvs = gym.vector.SyncVectorEnv([<br \/>\n    make_env(<br \/>\n        args.env_id,<br \/>\n        args.seed + rank * args.num_envs + i,<br \/>\n        rank,<br \/>\n        args.capture_video,<br \/>\n        logger.log_dir,<br \/>\n        \"train\"<br \/>\n    )<br \/>\n    for i in range(args.num_envs)<br \/>\n])\n\n<\/code><div class=\"copy-button\"><button class=\"expand-button\">Expand<\/button><button class=\"copy\">Copy<\/button><\/div><\/pre>\n<p>In the last step, we create the agent and the optimizer and integrate them with Fabric for faster training.<\/p>\n<p>We have defined <a href=\"https:\/\/github.com\/Lightning-AI\/lightning\/blob\/master\/examples\/fabric\/reinforcement_learning\/rl\/agent.py#L97\">PPOLightningAgent<\/a>, a <a href=\"https:\/\/lightning.ai\/docs\/pytorch\/stable\/common\/lightning_module.html\">LightningModule<\/a>, which is an <strong>Actor-Critic agent<\/strong>. In an Actor-Critic agent, the actor proposes a set of possible actions in a given <em>state<\/em>, and the critic evaluates the <em>actions<\/em> taken by the actor.<\/p>\n<pre class=\"code-shortcode dark-theme window- collapse-false \" style=\"--height:falsepx\"><code class=\"language-python\"><br \/>\nagent = PPOLightningAgent(<br \/>\n    envs,<br \/>\n    act_fun=args.activation_function,<br \/>\n    vf_coef=args.vf_coef,<br \/>\n    ent_coef=args.ent_coef,<br \/>\n    clip_coef=args.clip_coef,<br \/>\n    clip_vloss=args.clip_vloss,<br \/>\n    ortho_init=args.ortho_init,<br \/>\n    normalize_advantages=args.normalize_advantages,<br \/>\n)<br \/>\noptimizer = agent.configure_optimizers(args.learning_rate)\n\n# accelerated training with Fabric<br \/>\nagent, optimizer = fabric.setup(agent, optimizer)<br \/>\n<\/code><div class=\"copy-button\"><button class=\"expand-button\">Expand<\/button><button class=\"copy\">Copy<\/button><\/div><\/pre>\n<p>Now we need to create the 
<em>&#8220;infinite&#8221;<\/em> loop in which:<\/p>\n<ol>\n<li>the agent collects experiences by interacting with the environment, where a single experience is composed of $$(\\text{observation}_t, \\text{reward}_t, \\text{action}_t, \\text{done}_t)$$, where $$\\text{done}_t$$ is a boolean flag indicating whether the game has finished or not. The agent collects experiences until the game terminates or a predefined number of steps has been played.<\/li>\n<li>given the collected experiences, train the agent to improve its behaviour<\/li>\n<li>repeat from <em>step 1<\/em> until convergence or a maximum number of interactions with the environment has been reached<\/li>\n<\/ol>\n<p>The experience-collecting loop is the following:<\/p>\n<pre class=\"code-shortcode dark-theme window- collapse-false \" style=\"--height:falsepx\"><code class=\"language-python\"><br \/>\nimport torch\n\nwith fabric.device:<br \/>\n    # `with fabric.device` is only supported in PyTorch 2.x+<br \/>\n    obs = torch.zeros((args.num_steps, args.num_envs) + envs.single_observation_space.shape)<br \/>\n    actions = torch.zeros((args.num_steps, args.num_envs) + envs.single_action_space.shape)<br \/>\n    rewards = torch.zeros((args.num_steps, args.num_envs))<br \/>\n    dones = torch.zeros((args.num_steps, args.num_envs))\n\n    # Log-probabilities of the action played are needed later on during the training phase<br \/>\n    logprobs = torch.zeros((args.num_steps, args.num_envs))\n\n    # The same happens for the critic values<br \/>\n    values = torch.zeros((args.num_steps, args.num_envs))\n\n# Global variables<br \/>\nglobal_step = 0<br \/>\nsingle_global_rollout = int(args.num_envs * args.num_steps * world_size)<br \/>\nnum_updates = args.total_timesteps \/\/ single_global_rollout\n\nwith fabric.device:<br \/>\n    # Get the first environment observation and start the optimization<br \/>\n    next_obs = torch.tensor(envs.reset(seed=args.seed)[0])<br \/>\n    next_done = 
torch.zeros(args.num_envs)\n\n# Collect `num_steps` experiences `num_updates` times<br \/>\nfor update in range(1, num_updates + 1):<br \/>\n    # Learning rate annealing<br \/>\n    if args.anneal_lr:<br \/>\n        linear_annealing(optimizer, update, num_updates, args.learning_rate)\n\n    for step in range(0, args.num_steps):<br \/>\n        global_step += args.num_envs * world_size<br \/>\n        obs[step] = next_obs<br \/>\n        dones[step] = next_done\n\n        # Sample an action given the observation received from the environment<br \/>\n        with torch.no_grad():<br \/>\n            action, logprob, _, value = agent.get_action_and_value(next_obs)<br \/>\n            values[step] = value.flatten()<br \/>\n        actions[step] = action<br \/>\n        logprobs[step] = logprob\n\n        # Single environment step<br \/>\n        next_obs, reward, done, truncated, info = envs.step(action.cpu().numpy())\n\n        # Check whether the game has finished or not<br \/>\n        done = torch.logical_or(torch.tensor(done), torch.tensor(truncated))\n\n        with fabric.device:<br \/>\n            rewards[step] = torch.tensor(reward).view(-1)<br \/>\n            next_obs, next_done = torch.tensor(next_obs), done<br \/>\n<\/code><div class=\"copy-button\"><button class=\"expand-button\">Expand<\/button><button class=\"copy\">Copy<\/button><\/div><\/pre>\n<p>To train both the actor and the critic we need to estimate <a href=\"https:\/\/spinningup.openai.com\/en\/latest\/spinningup\/rl_intro.html#reward-and-return\">returns<\/a> and <a href=\"https:\/\/spinningup.openai.com\/en\/latest\/spinningup\/rl_intro.html#advantage-functions\">advantages<\/a>:<\/p>\n<ul>\n<li>the advantage describes how much better it is to take a specific action $$a$$ in state $$s$$, compared to sampling an action from the actor&#8217;s current policy<\/li>\n<li>the return is the sum of discounted rewards received from the environment: $$G_t=\\sum_{t&#8217;=t}^{T}\\gamma^{t&#8217;-t}r_{t&#8217;}$$, where $$\\gamma \\in (0,1)$$ is the <span class=\"notion-enable-hover\" data-token-index=\"1\">discount factor<\/span>. Intuitively, discounting implies that rewards now are worth more than rewards later<\/li>\n<\/ul>\n<pre class=\"code-shortcode dark-theme window- collapse-false \" style=\"--height:falsepx\"><code class=\"language-python\"><br \/>\n# Estimate advantages and returns with GAE (Generalized Advantage Estimation)<br \/>\nreturns, advantages = agent.estimate_returns_and_advantages(<br \/>\n    rewards, values, dones, next_obs, next_done, args.num_steps, args.gamma, args.gae_lambda<br \/>\n)<br \/>\n<\/code><div class=\"copy-button\"><button class=\"expand-button\">Expand<\/button><button class=\"copy\">Copy<\/button><\/div><\/pre>\n<p>We are now finally able to train the agent:<\/p>\n<pre class=\"code-shortcode dark-theme window- collapse-false \" style=\"--height:falsepx\"><code class=\"language-python\"><br \/>\n# Flatten the batch<br \/>\nlocal_data = {<br \/>\n    \"obs\": obs.reshape((-1,) + envs.single_observation_space.shape),<br \/>\n    \"logprobs\": logprobs.reshape(-1),<br \/>\n    \"actions\": actions.reshape((-1,) + envs.single_action_space.shape),<br \/>\n    \"advantages\": advantages.reshape(-1),<br \/>\n    \"returns\": returns.reshape(-1),<br \/>\n    \"values\": values.reshape(-1),<br \/>\n}\n\n# Train the agent<br \/>\ntrain(fabric, agent, optimizer, local_data, global_step, args)<br \/>\n<\/code><div class=\"copy-button\"><button class=\"expand-button\">Expand<\/button><button class=\"copy\">Copy<\/button><\/div><\/pre>\n<pre class=\"code-shortcode dark-theme window- collapse-false \" style=\"--height:falsepx\"><code class=\"language-python\"><br \/>\nimport argparse<br \/>\nfrom typing import Dict\n\nimport torch<br \/>\nfrom torch import Tensor<br \/>\nfrom torch.utils.data import BatchSampler, RandomSampler\n\ndef train(<br \/>\n    fabric: Fabric,<br \/>\n    agent: PPOLightningAgent,<br \/>\n    optimizer: torch.optim.Optimizer,<br \/>\n  
  data: Dict[str, Tensor],<br \/>\n    global_step: int,<br \/>\n    args: argparse.Namespace,<br \/>\n):<br \/>\n    sampler = RandomSampler(list(range(data[\"obs\"].shape[0])))<br \/>\n    sampler = BatchSampler(sampler, batch_size=args.per_rank_batch_size, drop_last=False)\n\n    for _ in range(args.update_epochs):<br \/>\n        for batch_idxes in sampler:<br \/>\n            loss = agent.training_step({k: v[batch_idxes] for k, v in data.items()})<br \/>\n            optimizer.zero_grad(set_to_none=True)<br \/>\n            fabric.backward(loss)<br \/>\n            fabric.clip_gradients(agent, optimizer, max_norm=args.max_grad_norm)<br \/>\n            optimizer.step()<br \/>\n        agent.on_train_epoch_end(global_step)\n\n<\/code><div class=\"copy-button\"><button class=\"expand-button\">Expand<\/button><button class=\"copy\">Copy<\/button><\/div><\/pre>\n<p>For more detailed information on the complete training step of the agent, please refer to this <a href=\"https:\/\/github.com\/Lightning-AI\/lightning\/blob\/master\/examples\/fabric\/reinforcement_learning\/rl\/agent.py#L196\">link<\/a>.<\/p>\n<p>As we have seen, there is no boilerplate code required for distributed training; Fabric abstracts that process for us. 
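Under the hood, the policy part of the loss computed in the training step is the PPO clipped surrogate objective. As a minimal illustration (a hypothetical scalar helper, not the actual PPOLightningAgent implementation), the per-sample clipped loss can be sketched as:

```python
import math

def ppo_clip_objective(logprob, old_logprob, advantage, clip_coef=0.2):
    # Probability ratio between the new policy and the policy that collected the data
    ratio = math.exp(logprob - old_logprob)
    # Unclipped surrogate term
    unclipped = ratio * advantage
    # Surrogate term with the ratio clipped to [1 - clip_coef, 1 + clip_coef]
    clipped = max(min(ratio, 1 + clip_coef), 1 - clip_coef) * advantage
    # PPO maximizes the minimum of the two; negate it to obtain a loss to minimize
    return -min(unclipped, clipped)

# When the new policy matches the old one, the ratio is 1
# and the loss reduces to the negated advantage.
print(ppo_clip_objective(0.0, 0.0, 1.0))  # -1.0
```

Clipping the ratio keeps every update close to the policy that collected the experiences, which is what makes PPO stable without a trust-region solver.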
To train our agent in a distributed way, simply execute the following command:<\/p>\n<pre><code>lightning run model \\\r\n    --accelerator=gpu \\\r\n    --strategy=ddp \\\r\n    --devices=2 \\\r\n    train_fabric.py \\\r\n    --capture-video \\\r\n    --env-id CartPole-v1 \\\r\n    --total-timesteps 100000 \\\r\n    --num-envs 2 \\\r\n    --num-steps 512\r\n\r\n<\/code><\/pre>\n<p>The trained agent should then play the game as follows:<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/i.imgur.com\/Z0K1S1C.gif\" alt=\"https:\/\/i.imgur.com\/Z0K1S1C.gif\" \/><\/p>\n<h2>Conclusion<\/h2>\n<p>Reinforcement learning is a powerful machine learning technique that enables agents to learn from their experiences and improve their decision-making capabilities over time. It has the potential to revolutionize various industries and contribute to the development of more intelligent and adaptive AI systems.<\/p>\n<p>In this blog post, we briefly introduced the high-level concepts of Reinforcement Learning and showcased how to train an agent to play the CartPole game optimally. Thanks to Fabric, we were able to accelerate and distribute the training without any boilerplate code.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>About the Author Federico is currently working as Data Scientist at Orobix, a front-runner in the AI industry in Italy, where he not only solves complex, real-world problems but also effectively bridges the gap between theory and application. 
With 3.5 years of rich experience in the fields of Computer Vision, with an emphasis on Self-Supervised<a class=\"excerpt-read-more\" href=\"https:\/\/lightning.ai\/pages\/community\/tutorial\/how-to-train-reinforcement-learning-model-to-play-game-using-proximal-policy-optimization-ppo-algorithm\/\" title=\"ReadHow To Train Reinforcement Learning Model To Play Game Using Proximal Policy Optimization (PPO) Algorithm\">&#8230; Read more &raquo;<\/a><\/p>\n","protected":false},"author":16,"featured_media":5647898,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"inline_featured_image":false,"footnotes":"","_links_to":"","_links_to_target":""},"categories":[41],"tags":[179,178,190],"glossary":[],"acf":{"additional_authors":[{"author_name":"Federico Belotti","author_url":""}],"mathjax":true,"default_editor":true,"show_table_of_contents":false,"table_of_contents":"","hide_from_archive":false,"content_type":"Blog Post","sticky":false,"custom_styles":"body mjx-container[jax=\"CHTML\"][display=\"true\"] {\r\n    display: inline-block;\r\n}"},"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v24.5 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>How To Train Reinforcement Learning Model To Play Game Using Proximal Policy Optimization (PPO) Algorithm - Lightning AI<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/lightning.ai\/pages\/community\/tutorial\/how-to-train-reinforcement-learning-model-to-play-game-using-proximal-policy-optimization-ppo-algorithm\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"How To Train Reinforcement Learning Model To Play Game Using Proximal Policy Optimization (PPO) Algorithm - Lightning AI\" \/>\n<meta property=\"og:description\" content=\"About 
the Author Federico is currently working as Data Scientist at Orobix, a front-runner in the AI industry in Italy, where he not only solves complex, real-world problems but also effectively bridges the gap between theory and application. With 3.5 years of rich experience in the fields of Computer Vision, with an emphasis on Self-Supervised... Read more &raquo;\" \/>\n<meta property=\"og:url\" content=\"https:\/\/lightning.ai\/pages\/community\/tutorial\/how-to-train-reinforcement-learning-model-to-play-game-using-proximal-policy-optimization-ppo-algorithm\/\" \/>\n<meta property=\"og:site_name\" content=\"Lightning AI\" \/>\n<meta property=\"article:published_time\" content=\"2023-05-14T04:00:20+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2023-05-16T10:45:38+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/05\/carbon-6.png\" \/>\n\t<meta property=\"og:image:width\" content=\"1360\" \/>\n\t<meta property=\"og:image:height\" content=\"1008\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"JP Hennessy\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@LightningAI\" \/>\n<meta name=\"twitter:site\" content=\"@LightningAI\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"JP Hennessy\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"9 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/lightning.ai\/pages\/community\/tutorial\/how-to-train-reinforcement-learning-model-to-play-game-using-proximal-policy-optimization-ppo-algorithm\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/lightning.ai\/pages\/community\/tutorial\/how-to-train-reinforcement-learning-model-to-play-game-using-proximal-policy-optimization-ppo-algorithm\/\"},\"author\":{\"name\":\"JP Hennessy\",\"@id\":\"https:\/\/lightning.ai\/pages\/#\/schema\/person\/2518f4d5541f8e98016f6289169141a6\"},\"headline\":\"How To Train Reinforcement Learning Model To Play Game Using Proximal Policy Optimization (PPO) Algorithm\",\"datePublished\":\"2023-05-14T04:00:20+00:00\",\"dateModified\":\"2023-05-16T10:45:38+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/lightning.ai\/pages\/community\/tutorial\/how-to-train-reinforcement-learning-model-to-play-game-using-proximal-policy-optimization-ppo-algorithm\/\"},\"wordCount\":2022,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/lightning.ai\/pages\/#organization\"},\"image\":{\"@id\":\"https:\/\/lightning.ai\/pages\/community\/tutorial\/how-to-train-reinforcement-learning-model-to-play-game-using-proximal-policy-optimization-ppo-algorithm\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/05\/carbon-6.png\",\"keywords\":[\"fabric\",\"lightning fabric\",\"Reinforcement 
Learning\"],\"articleSection\":[\"Tutorials\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/lightning.ai\/pages\/community\/tutorial\/how-to-train-reinforcement-learning-model-to-play-game-using-proximal-policy-optimization-ppo-algorithm\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/lightning.ai\/pages\/community\/tutorial\/how-to-train-reinforcement-learning-model-to-play-game-using-proximal-policy-optimization-ppo-algorithm\/\",\"url\":\"https:\/\/lightning.ai\/pages\/community\/tutorial\/how-to-train-reinforcement-learning-model-to-play-game-using-proximal-policy-optimization-ppo-algorithm\/\",\"name\":\"How To Train Reinforcement Learning Model To Play Game Using Proximal Policy Optimization (PPO) Algorithm - Lightning AI\",\"isPartOf\":{\"@id\":\"https:\/\/lightning.ai\/pages\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/lightning.ai\/pages\/community\/tutorial\/how-to-train-reinforcement-learning-model-to-play-game-using-proximal-policy-optimization-ppo-algorithm\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/lightning.ai\/pages\/community\/tutorial\/how-to-train-reinforcement-learning-model-to-play-game-using-proximal-policy-optimization-ppo-algorithm\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/05\/carbon-6.png\",\"datePublished\":\"2023-05-14T04:00:20+00:00\",\"dateModified\":\"2023-05-16T10:45:38+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/lightning.ai\/pages\/community\/tutorial\/how-to-train-reinforcement-learning-model-to-play-game-using-proximal-policy-optimization-ppo-algorithm\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/lightning.ai\/pages\/community\/tutorial\/how-to-train-reinforcement-learning-model-to-play-game-using-proximal-policy-optimization-ppo-algorithm\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id
\":\"https:\/\/lightning.ai\/pages\/community\/tutorial\/how-to-train-reinforcement-learning-model-to-play-game-using-proximal-policy-optimization-ppo-algorithm\/#primaryimage\",\"url\":\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/05\/carbon-6.png\",\"contentUrl\":\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/05\/carbon-6.png\",\"width\":1360,\"height\":1008},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/lightning.ai\/pages\/community\/tutorial\/how-to-train-reinforcement-learning-model-to-play-game-using-proximal-policy-optimization-ppo-algorithm\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/lightning.ai\/pages\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"How To Train Reinforcement Learning Model To Play Game Using Proximal Policy Optimization (PPO) Algorithm\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/lightning.ai\/pages\/#website\",\"url\":\"https:\/\/lightning.ai\/pages\/\",\"name\":\"Lightning AI\",\"description\":\"The platform for teams to build AI.\",\"publisher\":{\"@id\":\"https:\/\/lightning.ai\/pages\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/lightning.ai\/pages\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/lightning.ai\/pages\/#organization\",\"name\":\"Lightning 
AI\",\"url\":\"https:\/\/lightning.ai\/pages\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/lightning.ai\/pages\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/02\/image-17.png\",\"contentUrl\":\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/02\/image-17.png\",\"width\":1744,\"height\":856,\"caption\":\"Lightning AI\"},\"image\":{\"@id\":\"https:\/\/lightning.ai\/pages\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/x.com\/LightningAI\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/lightning.ai\/pages\/#\/schema\/person\/2518f4d5541f8e98016f6289169141a6\",\"name\":\"JP Hennessy\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/lightning.ai\/pages\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/28ade268218ae45f723b0b62499f527a?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/28ade268218ae45f723b0b62499f527a?s=96&d=mm&r=g\",\"caption\":\"JP Hennessy\"},\"url\":\"https:\/\/lightning.ai\/pages\/author\/jplightning-ai\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","_links":{"self":[{"href":"https:\/\/lightning.ai\/pages\/wp-json\/wp\/v2\/posts\/5647872"}],"collection":[{"href":"https:\/\/lightning.ai\/pages\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/lightning.ai\/pages\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/lightning.ai\/pages\/wp-json\/wp\/v2\/users\/16"}],"replies":[{"embeddable":true,"href":"https:\/\/lightning.ai\/pages\/wp-json\/wp\/v2\/comments?post=5647872"}],"version-history":[{"count":0,"href":"https:\/\/lightning.ai\/pages\/wp-json\/wp\/v2\/posts\/5647872\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/lightning.ai\/pages\/wp-json\/wp\/v2\/media\/5647898"}],"wp:attachment":[{"href":"https:\/\/lightning.ai\/pages\/wp-json\/wp\/v2\/media?parent=5647872"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/lightning.ai\/pages\/wp-json\/wp\/v2\/categories?post=5647872"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/lightning.ai\/pages\/wp-json\/wp\/v2\/tags?post=5647872"},{"taxonomy":"glossary","embeddable":true,"href":"https:\/\/lightning.ai\/pages\/wp-json\/wp\/v2\/glossary?post=5647872"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}