Variance-Aware Confidence Set: Variance-Dependent Bound for Linear Bandits and Horizon-Free Bound for Linear Mixture MDP

01/29/2021
by Zihan Zhang, et al.

We show how to construct variance-aware confidence sets for linear bandits and linear mixture Markov Decision Processes (MDPs). Our method yields the following new regret bounds:

* For linear bandits, we obtain an O(poly(d)√(1 + ∑_{i=1}^K σ_i^2)) regret bound, where d is the feature dimension, K is the number of rounds, and σ_i^2 is the (unknown) variance of the reward at the i-th round. This is the first regret bound that scales only with the variance and the dimension, with no explicit polynomial dependency on K.
* For linear mixture MDPs, we obtain an O(poly(d, log H)√K) regret bound, where d is the number of base models, K is the number of episodes, and H is the planning horizon. This is the first regret bound that scales only logarithmically with H in the reinforcement learning (RL) with linear function approximation setting, thus exponentially improving existing results.

Our methods rely on three novel ideas that may be of independent interest: 1) applying layering techniques to both the norm of the input and the magnitude of the variance, 2) a recursion-based approach to estimating the variance, and 3) a convex potential lemma that in a sense generalizes the seminal elliptical potential lemma. Illustrative sketches of the underlying ideas follow.
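To make the first bullet concrete: the paper's actual construction layers observations by input norm and variance magnitude and estimates the variances recursively, but the basic mechanism behind a variance-aware confidence set can be illustrated with inverse-variance-weighted ridge regression. The sketch below is ours, not the paper's algorithm; `lam` and `beta` are assumed hyperparameters, and a full analysis would set `beta` from d, K, and the failure probability.

```python
# Minimal illustrative sketch (not the paper's exact construction):
# weighting each round by an estimated inverse variance makes the
# confidence ellipsoid shrink with sum_i sigma_i^2 rather than with K.
import numpy as np

def weighted_ridge_estimate(X, y, var, lam=1.0):
    """Variance-weighted ridge regression.

    X   : (K, d) array of feature vectors x_1..x_K
    y   : (K,)   array of observed rewards
    var : (K,)   array of per-round variance estimates sigma_i^2
    lam : ridge regularizer lambda (assumed hyperparameter)
    Returns (theta_hat, A), where A is the weighted Gram matrix.
    """
    w = 1.0 / np.maximum(var, 1e-8)               # inverse-variance weights
    A = lam * np.eye(X.shape[1]) + (X.T * w) @ X  # A = lam*I + sum_i w_i x_i x_i^T
    b = X.T @ (w * y)                             # b = sum_i w_i y_i x_i
    theta_hat = np.linalg.solve(A, b)
    return theta_hat, A

def in_confidence_set(theta, theta_hat, A, beta):
    """Membership test ||theta - theta_hat||_A^2 <= beta for the
    ellipsoidal confidence set centered at theta_hat."""
    diff = theta - theta_hat
    return diff @ A @ diff <= beta
```

With uniform variances this reduces to ordinary ridge regression, which recovers the standard (variance-oblivious) confidence sets whose radius grows polynomially with K.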

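For reference, the classical elliptical potential lemma (Abbasi-Yadkori et al., 2011) that the paper's convex potential lemma generalizes can be stated as follows; the generalized version itself is given in the full text.

```latex
% Classical elliptical potential lemma. For x_1,\dots,x_K \in \mathbb{R}^d
% with \|x_t\|_2 \le L and A_t = \lambda I + \sum_{s=1}^{t} x_s x_s^\top:
\[
  \sum_{t=1}^{K} \min\bigl(1, \|x_t\|_{A_{t-1}^{-1}}^2\bigr)
  \;\le\; 2\log\frac{\det A_K}{\det(\lambda I)}
  \;\le\; 2d\log\Bigl(1 + \frac{K L^2}{d\lambda}\Bigr).
\]
```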