N-Step Q Learning
Action space: Discrete
Training the network
The $N$-step Q learning algorithm works in a similar manner to DQN, except for the following changes:
No replay buffer is used. Instead of sampling random batches of transitions, the network is trained every $N$ steps using the latest $N$ steps played by the agent.
To stabilize learning, multiple workers update the network together. This has the same effect as decorrelating the samples used for training.
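Instead of using single-step Q targets for the network, the rewards from $N$ consecutive steps are accumulated to form the $N$-step Q targets, according to the following equation:
$$R(s_t, a_t) = \sum_{i=t}^{t+k-1} \gamma^{i-t} r_i + \gamma^{k} V(s_{t+k})$$
where $k$ is the number of steps remaining until the end of the $N$-step rollout for each state in the batch.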
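The accumulation of rewards into multi-step targets described above can be sketched in a few lines of Python. This is a minimal illustration, not the library's implementation: the function name `n_step_q_targets` and its parameters are hypothetical, and `bootstrap_value` is assumed to be the value estimate of the state reached after the last step of the rollout (for Q learning this would typically be the maximum Q value over actions, or 0 if the episode ended there).

```python
from typing import List


def n_step_q_targets(rewards: List[float], bootstrap_value: float,
                     gamma: float) -> List[float]:
    """Return the discounted multi-step target for every state in a rollout.

    rewards[j] is the reward received after acting in the j-th state of the
    rollout. bootstrap_value is an assumed stand-in for the value estimate of
    the state reached after the last step (0.0 if the episode terminated).
    """
    targets = [0.0] * len(rewards)
    running_return = bootstrap_value
    # Walk backwards so each target reuses the one after it:
    # R_j = r_j + gamma * R_{j+1}
    for j in reversed(range(len(rewards))):
        running_return = rewards[j] + gamma * running_return
        targets[j] = running_return
    return targets


# Example: a 3-step rollout with gamma = 0.5 and a bootstrap value of 8.0.
example_targets = n_step_q_targets([1.0, 2.0, 4.0], bootstrap_value=8.0, gamma=0.5)
print(example_targets)  # [4.0, 6.0, 8.0]
```

Note that earlier states in the rollout accumulate more reward terms than later ones: the first state here uses three rewards before bootstrapping, while the last uses only one. The backward pass computes all of these targets in a single sweep.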