Voici les éléments 1 - 8 sur 8
  • Publication
    Accès libre
    Inferential Induction: A Novel Framework for Bayesian Reinforcement Learning
    (2020-02-08T06:19:15Z)
    Hannes Eriksson
    ;
    Emilio Jorge
    ;
    ;
    Debabrota Basu
    ;
    Divya Grover
    Bayesian reinforcement learning (BRL) offers a decision-theoretic solution for reinforcement learning. While "model-based" BRL algorithms have focused either on maintaining a posterior distribution on models or value functions and combining this with approximate dynamic programming or tree search, previous Bayesian "model-free" value function distribution approaches implicitly make strong assumptions or approximations. We describe a novel Bayesian framework, Inferential Induction, for correctly inferring value function distributions from data, which leads to the development of a new class of BRL algorithms. We design an algorithm, Bayesian Backwards Induction, with this framework. We experimentally demonstrate that the proposed algorithm is competitive with respect to the state of the art.
  • Publication
    Accès libre
    Near-optimal Bayesian Solution For Unknown Discrete Markov Decision Process
    (2019-06-20T06:32:36Z)
    Aristide Tossou
    ;
    ;
    Debabrota Basu
    We tackle the problem of acting in an unknown finite and discrete Markov Decision Process (MDP) for which the expected shortest path from any state to any other state is bounded by a finite number $D$. An MDP consists of $S$ states and $A$ possible actions per state. Upon choosing an action $a_t$ at state $s_t$, one receives a real value reward $r_t$, then one transits to a next state $s_{t+1}$. The reward $r_t$ is generated from a fixed reward distribution depending only on $(s_t, a_t)$ and similarly, the next state $s_{t+1}$ is generated from a fixed transition distribution depending only on $(s_t, a_t)$. The objective is to maximize the accumulated rewards after $T$ interactions. In this paper, we consider the case where the reward distributions, the transitions, $T$ and $D$ are all unknown. We derive the first polynomial time Bayesian algorithm, BUCRL{} that achieves up to logarithm factors, a regret (i.e the difference between the accumulated rewards of the optimal policy and our algorithm) of the optimal order $\tilde{\mathcal{O}}(\sqrt{DSAT})$. Importantly, our result holds with high probability for the worst-case (frequentist) regret and not the weaker notion of Bayesian regret. We perform experiments in a variety of environments that demonstrate the superiority of our algorithm over previous techniques. Our work also illustrates several results that will be of independent interest. In particular, we derive a sharper upper bound for the KL-divergence of Bernoulli random variables. We also derive sharper upper and lower bounds for Beta and Binomial quantiles. All the bound are very simple and only use elementary functions.
  • Publication
    Accès libre
    Epistemic Risk-Sensitive Reinforcement Learning
    (2019-06-14T16:25:20Z)
    Hannes Eriksson
    ;
    We develop a framework for interacting with uncertain environments in reinforcement learning (RL) by leveraging preferences in the form of utility functions. We claim that there is value in considering different risk measures during learning. In this framework, the preference for risk can be tuned by variation of the parameter $\beta$ and the resulting behavior can be risk-averse, risk-neutral or risk-taking depending on the parameter choice. We evaluate our framework for learning problems with model uncertainty. We measure and control for \emph{epistemic} risk using dynamic programming (DP) and policy gradient-based algorithms. The risk-averse behavior is then compared with the behavior of the optimal risk-neutral policy in environments with epistemic risk.
  • Publication
    Accès libre
    Near-Optimal Online Egalitarian learning in General Sum Repeated Matrix Games
    (2019-06-04T17:43:08Z)
    Aristide Tossou
    ;
    ;
    Jaroslaw Rzepecki
    ;
    Katja Hofmann
    We study two-player general sum repeated finite games where the rewards of each player are generated from an unknown distribution. Our aim is to find the egalitarian bargaining solution (EBS) for the repeated game, which can lead to much higher rewards than the maximin value of both players. Our most important contribution is the derivation of an algorithm that achieves simultaneously, for both players, a high-probability regret bound of order O(lnT−−−√3⋅T2/3) after any T rounds of play. We demonstrate that our upper bound is nearly optimal by proving a lower bound of Ω(T2/3) for any algorithm.
  • Publication
    Accès libre
    Randomised Bayesian Least-Squares Policy Iteration
    (2019-04-06T21:50:24Z)
    Nikolaos Tziortziotis
    ;
    ;
    Michalis Vazirgiannis
    We introduce Bayesian least-squares policy iteration (BLSPI), an off-policy, model-free, policy iteration algorithm that uses the Bayesian least-squares temporal-difference (BLSTD) learning algorithm to evaluate policies. An online variant of BLSPI has been also proposed, called randomised BLSPI (RBLSPI), that improves its policy based on an incomplete policy evaluation step. In online setting, the exploration-exploitation dilemma should be addressed as we try to discover the optimal policy by using samples collected by ourselves. RBLSPI exploits the advantage of BLSTD to quantify our uncertainty about the value function. Inspired by Thompson sampling, RBLSPI first samples a value function from a posterior distribution over value functions, and then selects actions based on the sampled value function. The effectiveness and the exploration abilities of RBLSPI are demonstrated experimentally in several environments.
  • Publication
    Accès libre
    On The Differential Privacy of Thompson Sampling With Gaussian Prior
    (2018-06-24T18:37:09Z)
    Aristide C. Y. Tossou
    ;
    We show that Thompson Sampling with Gaussian Prior as detailed by Algorithm 2 in (Agrawal & Goyal, 2013) is already differentially private. Theorem 1 show that it enjoys a very competitive privacy loss of only O(ln2T) after T rounds. Finally, Theorem 2 show that one can control the privacy loss to any desirable ϵ level by appropriately increasing the variance of the samples from the Gaussian posterior. And this increases the regret only by a term of O(ln2Tϵ). This compares favorably to the previous result for Thompson Sampling in the literature ((Mishra & Thakurta, 2015)) which adds a term of O(Kln3Tϵ2) to the regret in order to achieve the same privacy level. Furthermore, our result use the basic Thompson Sampling with few modifications whereas the result of (Mishra & Thakurta, 2015) required sophisticated constructions.
  • Publication
    Accès libre
    Learning to Match
    (2017-07-30T21:50:50Z)
    Philip Ekman
    ;
    Sebastian Bellevik
    ;
    ;
    Aristide C. Y. Tossou
    Outsourcing tasks to previously unknown parties is becoming more common. One specific such problem involves matching a set of workers to a set of tasks. Even if the latter have precise requirements, the quality of individual workers is usually unknown. The problem is thus a version of matching under uncertainty. We believe that this type of problem is going to be increasingly important. When the problem involves only a single skill or type of job, it is essentially a type of bandit problem, and can be solved with standard algorithms. However, we develop an algorithm that can perform matching for workers with multiple skills hired for multiple jobs with multiple requirements. We perform an experimental evaluation in both single-task and multi-task problems, comparing with the bounded $\epsilon$-first algorithm, as well as an oracle that knows the true skills of workers. One of the algorithms we developed gives results approaching 85\% of oracle's performance. We invite the community to take a closer look at this problem and develop real-world benchmarks.
  • Publication
    Accès libre
    On the Differential Privacy of Bayesian Inference
    (2015-12-22T09:22:39Z)
    Zuhe Zhang
    ;
    Benjamin Rubinstein
    ;
    We study how to communicate findings of Bayesian inference to third parties, while preserving the strong guarantee of differential privacy. Our main contributions are four different algorithms for private Bayesian inference on proba-bilistic graphical models. These include two mechanisms for adding noise to the Bayesian updates, either directly to the posterior parameters, or to their Fourier transform so as to preserve update consistency. We also utilise a recently introduced posterior sampling mechanism, for which we prove bounds for the specific but general case of discrete Bayesian networks; and we introduce a maximum-a-posteriori private mechanism. Our analysis includes utility and privacy bounds, with a novel focus on the influence of graph structure on privacy. Worked examples and experiments with Bayesian na{\"i}ve Bayes and Bayesian linear regression illustrate the application of our mechanisms.