Nearly Horizon-Free Offline Reinforcement Learning

03/25/2021
by Tongzheng Ren, et al.

We revisit offline reinforcement learning on episodic, time-homogeneous tabular Markov Decision Processes with S states, A actions, and planning horizon H. Given a dataset of N collected episodes with minimum cumulative reaching probability d_m, we obtain the first set of nearly H-free sample complexity bounds for evaluation and planning with the empirical MDP:

1. For offline evaluation, we obtain an Õ(√(1/(N d_m))) error rate, which matches the lower bound and, unlike previous works <cit.>, carries no additional dependence on (S, A) in the higher-order term.

2. For offline policy optimization, we obtain an Õ(√(1/(N d_m)) + S/(N d_m)) error rate, improving upon the best known result of <cit.>, which has additional H and S factors in the main term. Moreover, this bound approaches the Ω(√(1/(N d_m))) lower bound up to logarithmic factors and a higher-order term.

To the best of our knowledge, these are the first nearly horizon-free bounds in offline reinforcement learning.
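To make the setup concrete, below is a minimal sketch (not the authors' code) of the two ingredients the abstract refers to: building the empirical MDP from logged episodes of a time-homogeneous tabular MDP, and evaluating a policy on it by backward induction over the horizon H. All names here (S, A, H, episodes, pi) are illustrative assumptions, not identifiers from the paper.

```python
import numpy as np

def empirical_mdp(episodes, S, A):
    """Estimate transitions P_hat[s, a, s'] and mean rewards r_hat[s, a]
    from logged episodes given as lists of (s, a, r, s') tuples."""
    counts = np.zeros((S, A, S))
    reward_sum = np.zeros((S, A))
    for episode in episodes:
        for s, a, r, s_next in episode:
            counts[s, a, s_next] += 1
            reward_sum[s, a] += r
    visits = counts.sum(axis=2)                       # N(s, a)
    # Unvisited (s, a) pairs get a uniform transition estimate here;
    # this is an arbitrary illustrative choice, not the paper's construction.
    P_hat = np.where(visits[..., None] > 0,
                     counts / np.maximum(visits[..., None], 1),
                     1.0 / S)
    r_hat = reward_sum / np.maximum(visits, 1)
    return P_hat, r_hat

def evaluate_policy(P_hat, r_hat, pi, H):
    """Offline evaluation of a deterministic policy pi (array of actions,
    one per state) on the empirical MDP via H steps of backward induction."""
    S = r_hat.shape[0]
    V = np.zeros(S)
    idx = np.arange(S)
    for _ in range(H):
        V = r_hat[idx, pi] + P_hat[idx, pi] @ V
    return V  # estimated value of pi at the initial step
```

Offline policy optimization would analogously run value iteration with a max over actions on the same empirical MDP; the paper's contribution is the analysis showing that the resulting errors scale as stated above, nearly independently of H.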
