The agent then executes an MCTS procedure and uses this to select the next action to take. The generated games are then used to update the network’s parameters, enabling the agent to learn.
2023 June, Daniel J. Mankowitz, Andrea Michi, Anton Zhernov, Marco Gelmi, Marco Selvi, “Faster sorting algorithms discovered using deep reinforcement learning”, in Nature, volume 618, number 7964, pages 257–263
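The search step mentioned in the quotation, running MCTS and using the result to pick the next action, can be illustrated with a minimal sketch. This is not the paper's implementation: it uses plain UCB1 with random rollouts in place of a learned network, it omits the parameter-update step entirely, and the toy game (`TARGET`, `ACTIONS`, `step`) and all function names are hypothetical choices made only for illustration.

```python
import math
import random

# Hypothetical toy game (an assumption, not from the cited paper):
# start from a running total, add 1 or 2 per move, win by hitting TARGET exactly.
TARGET = 4
ACTIONS = (1, 2)


def step(total, action):
    """Apply an action and return (new_total, done, reward)."""
    new_total = total + action
    done = new_total >= TARGET
    reward = 1.0 if new_total == TARGET else 0.0
    return new_total, done, reward


class Node:
    def __init__(self, total, done=False, reward=0.0):
        self.total = total
        self.done = done
        self.reward = reward   # reward received on entering this node
        self.children = {}     # action -> Node
        self.visits = 0
        self.value_sum = 0.0


def rollout(total, rng):
    """Play uniformly random actions to the end of the game."""
    done, reward = False, 0.0
    while not done:
        total, done, reward = step(total, rng.choice(ACTIONS))
    return reward


def ucb(parent, child, c=1.4):
    """UCB1 score: mean value plus an exploration bonus."""
    mean = child.value_sum / child.visits
    return mean + c * math.sqrt(math.log(parent.visits) / child.visits)


def simulate(root, rng):
    """One MCTS iteration: selection, expansion, evaluation, backup."""
    node, path = root, [root]
    # Selection: descend while the node is fully expanded and non-terminal.
    while not node.done and len(node.children) == len(ACTIONS):
        parent = node
        node = max(parent.children.values(), key=lambda ch: ucb(parent, ch))
        path.append(node)
    # Expansion: add one untried child.
    if not node.done:
        action = next(a for a in ACTIONS if a not in node.children)
        child = Node(*step(node.total, action))
        node.children[action] = child
        node = child
        path.append(node)
    # Evaluation: terminal reward, or a random rollout (stand-in for a network).
    value = node.reward if node.done else rollout(node.total, rng)
    # Backup: update statistics along the visited path.
    for visited in path:
        visited.visits += 1
        visited.value_sum += value


def select_action(total, num_simulations=200, seed=0):
    """Run MCTS from `total` and return the most-visited root action."""
    rng = random.Random(seed)
    root = Node(total)
    for _ in range(num_simulations):
        simulate(root, rng)
    return max(root.children, key=lambda a: root.children[a].visits)
```

Calling `select_action(3)` runs the search from a total of 3 and returns the action whose child accumulated the most visits. Choosing by visit count rather than by mean value is a common convention, assumed here rather than taken from the source.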