CPC G06N 3/08 (2013.01) | 20 Claims |
1. A reinforcement learning system comprising one or more computers configured to:
retrieve training data comprising a plurality of experiences generated as a result of an agent interacting with an environment to perform a task in an attempt to achieve a specified result, each experience comprising an observation characterizing a state of the environment, an action performed by the agent in response to the observation, and a reward received in response to the action; and
train a reinforcement learning neural network having one or more policy parameters to control the agent to perform the task by jointly training (i) the reinforcement learning neural network and (ii) an objective function that has one or more objective function parameters and that evaluates performance of the agent based on the actions performed by the agent, comprising:
updating the one or more policy parameters for the reinforcement learning neural network based on a first set of the experiences using the objective function;
updating the one or more objective function parameters of the objective function based on the one or more updated policy parameters and a second set of the experiences, wherein the one or more objective function parameters are updated via a gradient ascent or descent method using a meta-objective function differentiated with respect to the one or more objective function parameters, wherein the meta-objective function is dependent on the one or more policy parameters;
retrieving updated experiences generated as a result of the agent interacting with the environment to perform the task under the control of the reinforcement learning neural network using the one or more updated policy parameters and the one or more updated objective function parameters;
further updating the one or more policy parameters based on a first set of the updated experiences using the one or more updated objective function parameters; and
further updating the one or more objective function parameters based on the one or more further updated policy parameters and a second set of the updated experiences via the gradient ascent or descent method.
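The claimed loop can be illustrated with a deliberately simplified sketch. Everything below is a hypothetical toy, not the patented system: the "policy" is a single scalar parameter `theta`, the learned objective function has a single parameter `eta`, experiences are (action, reward) pairs from a stateless one-dimensional environment, and all names (`base_grad`, `collect_experiences`, the constants) are illustrative assumptions. It shows the claimed structure: policy parameters updated with the learned objective on a first set of experiences, then objective-function parameters updated by gradient ascent on a meta-objective, differentiated with respect to the objective parameters through the policy update, using a second set of experiences.

```python
# Hypothetical toy sketch of the claimed joint training loop; NOT the
# patented implementation. A scalar policy parameter `theta` is updated
# with a learned objective L(theta; eta), and the objective-function
# parameter `eta` is updated by ascending a meta-objective that depends
# on the updated policy parameter.
import random

THETA_STAR = 2.0   # action that maximizes the true reward (unknown to agent)
ALPHA = 0.1        # policy learning rate
BETA = 0.1         # objective-function (meta) learning rate
NOISE = 0.3        # exploration noise used when collecting experiences

def true_reward(action):
    # Environment reward: larger the closer the action is to THETA_STAR.
    return -(action - THETA_STAR) ** 2

def collect_experiences(theta, n=16):
    # The agent explores around its policy parameter; each experience is an
    # (action, reward) pair (observations are omitted in this stateless toy).
    actions = [theta + random.gauss(0.0, NOISE) for _ in range(n)]
    return [(a, true_reward(a)) for a in actions]

def base_grad(theta, batch):
    # d/d(theta) of the eta-independent part of the learned objective
    #   L(theta; eta) = eta * mean_i[ adv_i * (theta - a_i)^2 ],
    # with adv_i = r_i - mean(r): minimizing L pulls theta toward
    # above-average actions and away from below-average ones.
    r_bar = sum(r for _, r in batch) / len(batch)
    return sum(2.0 * (r - r_bar) * (theta - a) for a, r in batch) / len(batch)

def train(steps=300, seed=0):
    random.seed(seed)
    theta, eta = -1.0, 0.5
    for _ in range(steps):
        batch = collect_experiences(theta)
        first, second = batch[:8], batch[8:]

        # (i) Update the policy parameter using the learned objective on
        # the first set of experiences.
        g = base_grad(theta, first)
        theta_new = theta - ALPHA * eta * g

        # (ii) Update the objective-function parameter by gradient ascent
        # on the meta-objective J = return of the updated policy, using the
        # second set of experiences. theta_new is linear in eta, so
        # d(theta_new)/d(eta) = -ALPHA * g. Under the Gaussian exploration
        # above, E[base_grad] ~= 4 * NOISE^2 * (theta - THETA_STAR), so the
        # rescaling below estimates dJ/d(theta_new) from held-out data.
        j_prime = -base_grad(theta_new, second) / (2.0 * NOISE ** 2)
        eta += BETA * j_prime * (-ALPHA * g)
        eta = min(max(eta, 0.01), 10.0)  # keep the toy step-scale sane

        theta = theta_new
    return theta, eta
```

Run over a few hundred iterations, `theta` drifts toward the reward-maximizing action while `eta` grows from its initial value, i.e. the meta-gradient learns to scale the objective so the policy improves faster, mirroring the alternating policy/objective updates recited in the claim.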