Normalized Advantage Functions
Action space: Continuous
Choosing an action
The current state is used as an input to the network. The action mean $\mu(s)$ is extracted from the output head and passed to the exploration policy, which adds noise to encourage exploration.
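The action-selection step above can be sketched as follows. This is a minimal illustration, not the framework's actual API: `network_action_mean` is a hypothetical stand-in for the network's $\mu(s)$ output head, and the Gaussian noise with `noise_std=0.1` is one assumed choice of exploration policy.

```python
import numpy as np

rng = np.random.default_rng(0)

def network_action_mean(state, weights):
    # Hypothetical stand-in for the network's mu output head:
    # a simple linear mapping from state to action mean.
    return weights @ state

def choose_action(state, weights, noise_std=0.1):
    # Forward pass: extract the action mean from the output head.
    mu = network_action_mean(state, weights)
    # Exploration policy: add noise to the mean to encourage exploration.
    return mu + rng.normal(0.0, noise_std, size=mu.shape)

state = np.array([0.5, -0.2])
weights = np.eye(2)  # identity weights, so mu(s) == s
action = choose_action(state, weights)
```

At evaluation time the noise would simply be dropped and $\mu(s)$ used directly.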
Training the network
The network is trained using the following targets: $y_t = r(s_t, a_t) + \gamma \cdot V(s_{t+1})$. Use the next states as inputs to the target network and extract the value $V(s_{t+1})$ from within the head. Then, update the online network using the current states and actions as inputs, and $y_t$ as the targets. After every training step, use a soft update to copy the weights from the online network to the target network.
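The target computation and soft update described above can be sketched as follows. This is a simplified illustration under assumed hyperparameters (`gamma=0.99`, `tau=0.001`); the value estimates here are plain arrays standing in for the target network's value-head outputs.

```python
import numpy as np

def td_targets(rewards, next_state_values, gamma=0.99):
    # y_t = r(s_t, a_t) + gamma * V(s_{t+1}),
    # with V(s_{t+1}) taken from the target network's value head.
    return rewards + gamma * next_state_values

def soft_update(online_params, target_params, tau=0.001):
    # After every training step, move the target weights a small
    # step toward the online weights:
    # target <- tau * online + (1 - tau) * target
    return {name: tau * online_params[name] + (1 - tau) * target_params[name]
            for name in online_params}

rewards = np.array([1.0, 0.0])
v_next = np.array([2.0, 3.0])   # V(s_{t+1}) from the target network
targets = td_targets(rewards, v_next)
```

The small `tau` keeps the target network slowly trailing the online network, which stabilizes the regression targets $y_t$.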