A Note on Target Q-learning For Solving Finite MDPs with A Generative Oracle

03/22/2022
by Ziniu Li, et al.

Q-learning with function approximation can diverge in the off-policy setting, and the target network is a powerful technique for addressing this issue. In this manuscript, we examine the sample complexity of the associated target Q-learning algorithm in the tabular case with a generative oracle. We point out a misleading claim in [Lee and He, 2020] and establish a tight analysis. In particular, we demonstrate that the sample complexity of the target Q-learning algorithm in [Lee and He, 2020] is 𝒪(|𝒮|^2|𝒜|^2 (1-γ)^-5 ε^-2). Furthermore, we show that this sample complexity improves to 𝒪(|𝒮||𝒜| (1-γ)^-5 ε^-2) if we can sequentially update all state-action pairs, and to 𝒪(|𝒮||𝒜| (1-γ)^-4 ε^-2) if, in addition, γ ∈ (1/2, 1). Compared with vanilla Q-learning, our results show that introducing a periodically-frozen target Q-function does not sacrifice the sample complexity.
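For concreteness, below is a minimal sketch of tabular target Q-learning with a generative oracle in the style the abstract describes. It is not the authors' implementation: the function name, learning rate, and synchronization schedule are illustrative assumptions. The online table Q is updated from samples drawn at arbitrary state-action pairs, while the bootstrap target comes from a periodically-frozen copy Q_target.

```python
import numpy as np

# Minimal sketch (not the authors' implementation) of tabular target
# Q-learning with a generative oracle. All hyperparameters are illustrative.

def target_q_learning(P, R, gamma, num_outer_iters=50, inner_samples=1000, lr=0.1):
    """P: transition tensor of shape (S, A, S); R: reward matrix of shape (S, A)."""
    S, A = R.shape
    Q = np.zeros((S, A))       # online Q-function, updated every sample
    Q_target = Q.copy()        # target Q-function, frozen between syncs

    rng = np.random.default_rng(0)
    for _ in range(num_outer_iters):
        for _ in range(inner_samples):
            # Generative oracle: query any (s, a) pair, then draw s' ~ P(.|s, a).
            s = rng.integers(S)
            a = rng.integers(A)
            s_next = rng.choice(S, p=P[s, a])
            # Bootstrap against the frozen target Q-function.
            td_target = R[s, a] + gamma * Q_target[s_next].max()
            Q[s, a] += lr * (td_target - Q[s, a])
        # Periodic freeze: copy the online Q-function into the target.
        Q_target = Q.copy()
    return Q
```

In the sequential-update variant mentioned in the abstract, the inner loop would instead sweep over all (s, a) pairs once per iteration rather than sampling them at random, which is the setting under which the 𝒪(|𝒮||𝒜| (1-γ)^-5 ε^-2) bound is stated.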
