Notation

diag of sf

Note that a set of states, \(\mathcal{S}\) and a set of actions, \(\mathcal{A}\), which might be infinite, i.e. \(S=|\mathcal{S}|\leq\infty\), \(A=|\mathcal{A}|\leq\infty\).

Task

Given a model of the environment, how many transitions do we need to observe for finding an ``near’’ optimal policy with high probability.

diag of sf

According to above figure, given a state-action pair \((s,a)\), we hope to estimate the transition probability to next state. That is, we need to fulfill this table with size \(SA\times S\). However, if there is infinitely many state action pair, then what can we do?

Introduction

Linear property To deal with large state action pair, traditional method is to apply linear property, such as Yang & Wang (2019), which still focus on finite state action pair. The idea is as following given a \(D\)-dimensional feature space, the feature map \(\phi:\mathcal{S}\times\mathcal{A}\rightarrow \mathbb{R}^D\). Exist \(\psi:\mathcal{S}\rightarrow\mathbb{R}^D\) such that

\[P(s'|s,a)=\phi(s,a)^\top \psi(s')=\phi_1(s,a)\psi_1(s')+\cdots+\phi_D(s,a)\psi_D(s')\,.\]

Since \(Q(s,a)=r(s,a)+\gamma \sum_{s'\in\mathcal{S}}P(s'\vert s,a)V(s')\), then exist a vector \(w\in\mathbb{R}^D\), such that

\[Q_w=r(s,a)+\gamma\phi(s,a)^\top w.\]

Moreover, we write \(\sum_{s'\in\mathcal{S}}P(s'\vert s,a)V(s'):=\gamma PV(s,a)\).

Kernel setting Now, we generalize the finite dimensional feature space to infinite dimensional space (reproducing kernel Hilbert space) via kernel method. Since reproducing kernel Hilbert space (RKHS) belongs to \(L^2(\mathcal{S}\times\mathcal{A})\) so that we can generalize to infinite state action pair.

diag of sf

The following is listed some algorithm and their sample complexity upper bound. Note that \(\tilde{O}\) stand for ignoring the \(log\) term.

  • Tabular setting, finite state action pair in Sidford et al. (2018). The sample complexity upper bound is \(\tilde{O}\left(\frac{SA}{(1-\gamma)^3\epsilon^2}\right)\). Actually, the theoretical lower bound of sample complexity is \(\tilde{\Omega}\left(\frac{SA}{(1-\gamma)^3\epsilon^2}\right)\).

  • Linear transition probability setting, finite (large) state action pair in Yang & Wang (2019). The sample complexity upper bound is \(\tilde{O}\left(\frac{D}{(1-\gamma)^3\epsilon^2}\right)\). The step of this method is choose the representative set first, and obtain the sample from the representative set to estimate the transition probability.

diag of sf

Methodology

I replace the finite dimensional feature space by RKHS in order to deal with infinitely many state action pairs. TBA, when I finish and publish the result.

Reference

  1. L.F. Yang & M. Wang, Sample-Optimal Parametric Q-Learning Using Linearly Additive Features., (2019).
  2. Z. Yang, C. Jin, Z. Wang, M. Wang & M.I. Jordan, On Function Approximation in Reinforcement Learning: Optimism in the Face of Large State Spaces., (2020).