CPC G06N 3/08 (2013.01) | 20 Claims |
1. A reinforcement learning system comprising one or more computers configured to:
retrieve training data comprising a plurality of experiences generated as a result of an agent interacting with an environment to perform a task in an attempt to achieve a specified result, each experience comprising an observation characterizing a state of the environment, an action performed by the agent in response to the observation, and a reward received in response to the action; and
train a reinforcement learning neural network having one or more policy parameters to control the agent to perform the task by jointly training (i) the reinforcement learning neural network and (ii) an objective function that has one or more objective function parameters and that evaluates performance of the agent based on the actions performed by the agent, comprising:
updating the one or more policy parameters for the reinforcement learning neural network based on a first set of the experiences using the objective function;
updating the one or more objective function parameters of the objective function based on the one or more updated policy parameters and a second set of the experiences, wherein the one or more objective function parameters are updated via a gradient ascent or descent method using a meta-objective function differentiated with respect to the one or more objective function parameters, wherein the meta-objective function is dependent on the one or more policy parameters;
retrieving updated experiences generated as a result of the agent interacting with the environment to perform the task under the control of the reinforcement learning neural network using the one or more updated policy parameters and the one or more updated objective function parameters;
further updating the one or more policy parameters based on a first set of the updated experiences using the one or more updated objective function parameters; and
further updating the one or more objective function parameters based on the one or more further updated policy parameters and a second set of the updated experiences via the gradient ascent or descent method.
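The claimed loop can be illustrated with a deliberately simplified sketch. Everything below is a hypothetical toy, not the patented system: the "policy" is a single scalar parameter `theta`, the learned objective function has a single parameter `eta`, experiences are (action, reward) pairs from a stateless one-dimensional environment, and all names (`base_grad`, `collect_experiences`, the constants) are illustrative assumptions. It shows the claimed structure: policy parameters updated with the learned objective on a first set of experiences, then objective-function parameters updated by gradient ascent on a meta-objective, differentiated with respect to the objective parameters through the policy update, using a second set of experiences.

```python
# Hypothetical toy sketch of the claimed joint training loop; NOT the
# patented implementation. A scalar policy parameter `theta` is updated
# with a learned objective L(theta; eta), and the objective-function
# parameter `eta` is updated by ascending a meta-objective that depends
# on the updated policy parameter.
import random

THETA_STAR = 2.0   # action that maximizes the true reward (unknown to agent)
ALPHA = 0.1        # policy learning rate
BETA = 0.1         # objective-function (meta) learning rate
NOISE = 0.3        # exploration noise used when collecting experiences

def true_reward(action):
    # Environment reward: larger the closer the action is to THETA_STAR.
    return -(action - THETA_STAR) ** 2

def collect_experiences(theta, n=16):
    # The agent explores around its policy parameter; each experience is an
    # (action, reward) pair (observations are omitted in this stateless toy).
    actions = [theta + random.gauss(0.0, NOISE) for _ in range(n)]
    return [(a, true_reward(a)) for a in actions]

def base_grad(theta, batch):
    # d/d(theta) of the eta-independent part of the learned objective
    #   L(theta; eta) = eta * mean_i[ adv_i * (theta - a_i)^2 ],
    # with adv_i = r_i - mean(r): minimizing L pulls theta toward
    # above-average actions and away from below-average ones.
    r_bar = sum(r for _, r in batch) / len(batch)
    return sum(2.0 * (r - r_bar) * (theta - a) for a, r in batch) / len(batch)

def train(steps=300, seed=0):
    random.seed(seed)
    theta, eta = -1.0, 0.5
    for _ in range(steps):
        batch = collect_experiences(theta)
        first, second = batch[:8], batch[8:]

        # (i) Update the policy parameter using the learned objective on
        # the first set of experiences.
        g = base_grad(theta, first)
        theta_new = theta - ALPHA * eta * g

        # (ii) Update the objective-function parameter by gradient ascent
        # on the meta-objective J = return of the updated policy, using the
        # second set of experiences. theta_new is linear in eta, so
        # d(theta_new)/d(eta) = -ALPHA * g. Under the Gaussian exploration
        # above, E[base_grad] ~= 4 * NOISE^2 * (theta - THETA_STAR), so the
        # rescaling below estimates dJ/d(theta_new) from held-out data.
        j_prime = -base_grad(theta_new, second) / (2.0 * NOISE ** 2)
        eta += BETA * j_prime * (-ALPHA * g)
        eta = min(max(eta, 0.01), 10.0)  # keep the toy step-scale sane

        theta = theta_new
    return theta, eta
```

Run over a few hundred iterations, `theta` drifts toward the reward-maximizing action while `eta` grows from its initial value, i.e. the meta-gradient learns to scale the objective so the policy improves faster, mirroring the alternating policy/objective updates recited in the claim.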