On the Sample Complexity of Batch Reinforcement Learning with Policy-Induced Data

06/18/2021
by Chenjun Xiao, et al.

We study the fundamental question of the sample complexity of learning a good policy in finite Markov decision processes (MDPs) when the data available for learning is obtained by following a logging policy that must be chosen without knowledge of the underlying MDP. Our main results show that the sample complexity, the minimum number of transitions necessary and sufficient to obtain a good policy, is an exponential function of the relevant quantities when the planning horizon H is finite. In particular, we prove that the sample complexity of obtaining ϵ-optimal policies is at least Ω(A^min(S-1, H+1)) for γ-discounted problems, where S is the number of states, A is the number of actions, and H is the effective horizon defined as H = ⌊ln(1/ϵ) / ln(1/γ)⌋; and it is at least Ω(A^min(S-1, H)/ϵ^2) for finite-horizon problems, where H is the planning horizon of the problem. This lower bound is essentially matched by an upper bound. For the average-reward setting we show that there is no algorithm finding ϵ-optimal policies with a finite amount of data.
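To make the quantities in these bounds concrete, the following is a minimal Python sketch, not taken from the paper, that evaluates the effective-horizon formula H = ⌊ln(1/ϵ) / ln(1/γ)⌋ and the two lower-bound expressions up to constant factors. The parameter values S, A, ϵ, γ and the planning horizon are illustrative assumptions chosen here for the example.

```python
import math

# Illustrative parameters (assumptions for this sketch, not values from the paper).
S = 10        # number of states
A = 4         # number of actions
eps = 0.1     # target sub-optimality epsilon
gamma = 0.95  # discount factor

# Effective horizon for the gamma-discounted setting: H = floor(ln(1/eps) / ln(1/gamma)).
H_eff = math.floor(math.log(1 / eps) / math.log(1 / gamma))

# Discounted-setting lower bound, Omega(A^min(S-1, H+1)), with constant factors omitted.
discounted_bound = A ** min(S - 1, H_eff + 1)

# Finite-horizon lower bound, Omega(A^min(S-1, H) / eps^2), for a planning horizon H.
H_plan = 20   # illustrative planning horizon
finite_horizon_bound = A ** min(S - 1, H_plan) / eps ** 2

print(f"effective horizon H = {H_eff}")
print(f"discounted lower bound ~ A^{min(S - 1, H_eff + 1)} = {discounted_bound:,}")
print(f"finite-horizon lower bound ~ {finite_horizon_bound:,.0f}")
```

With these example values (ϵ = 0.1, γ = 0.95), the effective horizon is 44, so the exponent min(S-1, H+1) is capped by S-1 = 9 and the discounted bound grows as A^9; the point of the result is that this dependence is exponential in min(S-1, H+1) rather than polynomial.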
