Normalized Advantage Functions

Action space: Continuous

References: [Continuous Deep Q-Learning with Model-based Acceleration](https://arxiv.org/abs/1603.00748) (Gu et al., 2016)

Network Structure
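The NAF network uses a shared representation with three output heads: a state value $V(s)$, an action mean $\mu(s)$, and the entries of a lower-triangular matrix $L(s)$ that defines $P(s) = L(s)L(s)^T$, so that $Q(s,a) = V(s) - \frac{1}{2}(a-\mu(s))^T P(s) (a-\mu(s))$. Below is a minimal PyTorch sketch of this structure; the class name, layer sizes, and the `tanh` squashing of the mean are illustrative assumptions, not the framework's actual implementation.

```python
import torch
import torch.nn as nn

class NAFHead(nn.Module):
    """Minimal NAF network sketch: shared body with V(s), mu(s) and L(s) heads."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 200):
        super().__init__()
        self.action_dim = action_dim
        self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)        # state value V(s)
        self.mu = nn.Linear(hidden, action_dim)  # action mean mu(s)
        # entries of a lower-triangular matrix L(s): diagonal + strict lower part
        self.l_entries = nn.Linear(hidden, action_dim * (action_dim + 1) // 2)

    def forward(self, state, action=None):
        h = self.body(state)
        v = self.value(h)
        mu = torch.tanh(self.mu(h))  # bounded mean (an assumption about the action space)
        if action is None:
            return v, mu
        a_dim, batch = self.action_dim, state.shape[0]
        raw = self.l_entries(h)
        diag = raw[:, :a_dim].exp()  # exponentiated diagonal keeps P = L L^T positive definite
        rows, cols = torch.tril_indices(a_dim, a_dim, offset=-1)
        strict = torch.zeros(batch, a_dim, a_dim)
        strict[:, rows, cols] = raw[:, a_dim:]
        L = strict + torch.diag_embed(diag)
        P = L @ L.transpose(1, 2)
        d = (action - mu).unsqueeze(-1)
        advantage = -0.5 * (d.transpose(1, 2) @ P @ d).squeeze(-1)  # A(s,a)
        return v + advantage, v, mu  # Q(s,a) = V(s) + A(s,a)
```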

Algorithm Description

Choosing an action

The current state is given as an input to the network, and the action mean $\mu(s)$ is extracted from the output head. The mean is then passed to the exploration policy, which adds noise to it in order to encourage exploration. A sketch of this step follows below.
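A minimal sketch of action selection, assuming the hypothetical `NAFHead` network sketched above and simple Gaussian exploration noise (the actual exploration policy is configurable and may be a different noise process):

```python
import torch

def choose_action(network: NAFHead, state: torch.Tensor, noise_std: float = 0.1):
    with torch.no_grad():
        _, mu = network(state.unsqueeze(0))   # forward pass; extract the action mean
    noise = noise_std * torch.randn_like(mu)  # exploration noise (Gaussian, as an example)
    return (mu + noise).squeeze(0)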

Training the network

The network is trained using the following targets:

$$ y_t = r_t + \gamma \cdot V(s_{t+1}) $$

Use the next states as the inputs to the target network and extract the state value $V(s_{t+1})$ from within the head, to get $y_t$. Then, update the online network using the current states and actions as inputs, and $y_t$ as the targets. After every training step, use a soft update in order to copy the weights from the online network to the target network.
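A minimal sketch of one training step under the same assumptions as the snippets above; the batch tensors are assumed to come from a replay buffer, and the `gamma` and `tau` values are illustrative defaults, not the framework's:

```python
import torch
import torch.nn.functional as F

def train_step(online: NAFHead, target: NAFHead, optimizer,
               states, actions, rewards, next_states,
               gamma: float = 0.99, tau: float = 0.001):
    with torch.no_grad():
        next_v, _ = target(next_states)             # V(s_{t+1}) from the target network
        y = rewards.unsqueeze(-1) + gamma * next_v  # y_t = r_t + gamma * V(s_{t+1})
    q, _, _ = online(states, actions)               # Q(s_t, a_t) from the online network
    loss = F.mse_loss(q, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Soft update: the target network slowly tracks the online network's weights
    for tp, op in zip(target.parameters(), online.parameters()):
        tp.data.mul_(1.0 - tau).add_(tau * op.data)
    return loss.item()
```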