Dimitrakakis, Christos
Search results
Environment Design for Inverse Reinforcement Learning
2022, Thomas Kleine Buening, Christos Dimitrakakis
The task of learning a reward function from expert demonstrations suffers from high sample complexity as well as inherent limitations to what can be learned from demonstrations in a given environment. As the samples used for reward learning require human input, which is generally expensive, much effort has been dedicated towards designing more sample-efficient algorithms. Moreover, even with abundant data, current methods can still fail to learn insightful reward functions that are robust to minor changes in the environment dynamics. We approach these challenges differently than prior work by improving the sample-efficiency as well as the robustness of learned rewards through adaptively designing a sequence of demonstration environments for the expert to act in. We formalise a framework for this environment design process in which learner and expert repeatedly interact, and construct algorithms that actively seek information about the rewards by carefully curating environments for the human to demonstrate the task in.
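As a rough illustration of the interaction loop described above, the Python sketch below shows a learner that repeatedly chooses a demonstration environment, observes the expert acting in it, and updates its belief over reward functions. All function names (choose_environment, expert_demonstration, update_reward_posterior) are placeholders assumed for illustration, not the paper's actual algorithm or API.

def environment_design_irl(initial_belief, n_rounds,
                           choose_environment, expert_demonstration,
                           update_reward_posterior):
    # Hypothetical learner-expert loop: each round, pick the environment expected
    # to be most informative about the reward, observe the expert, update beliefs.
    belief = initial_belief
    for _ in range(n_rounds):
        env = choose_environment(belief)            # curate a demonstration environment
        trajectory = expert_demonstration(env)      # expert acts in the chosen environment
        belief = update_reward_posterior(belief, env, trajectory)
    return belief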
Bayesian Reinforcement Learning via Deep, Sparse Sampling
2020, Divya Grover, Debabrota Basu, Christos Dimitrakakis
We address the problem of Bayesian reinforcement learning using efficient model-based online planning. We propose an optimism-free Bayes-adaptive algorithm that induces deeper and sparser exploration, with a theoretical bound on its performance relative to the Bayes-optimal policy and a lower computational complexity. The main novelty is the use of a candidate policy generator to generate long-term options in the planning tree (over beliefs), which allows us to create much sparser and deeper trees. Experimental results on different environments show that, in comparison to the state of the art, our algorithm is computationally more efficient and obtains significantly higher reward in discrete environments.
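The following Python sketch conveys the general idea of sparse sampling over beliefs with a candidate policy generator supplying long-term options; the callables (generate_candidate_policies, simulate_option, update_belief) and the recursion structure are illustrative assumptions, not the paper's exact algorithm.

def sparse_plan(belief, depth, n_samples, generate_candidate_policies,
                simulate_option, update_belief, gamma=0.95):
    # Sparse-sampling-style planning over beliefs: branch only on a small set of
    # candidate policies (options) and a few sampled outcomes per option.
    if depth == 0:
        return 0.0
    best = 0.0
    for policy in generate_candidate_policies(belief):   # options, not primitive actions
        value = 0.0
        for _ in range(n_samples):                        # sparse sampling of outcomes
            reward, observation = simulate_option(belief, policy)
            next_belief = update_belief(belief, policy, observation)
            value += reward + gamma * sparse_plan(
                next_belief, depth - 1, n_samples,
                generate_candidate_policies, simulate_option, update_belief, gamma)
        best = max(best, value / n_samples)
    return best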
Achieving Privacy in the Adversarial Multi-Armed Bandit
2017, Aristide C. Y. Tossou, Christos Dimitrakakis
In this paper, we improve the previously best known regret bound for achieving $\epsilon$-differential privacy in oblivious adversarial bandits from $O(T^{2/3}/\epsilon)$ to $O(\sqrt{T}\ln T/\epsilon)$. This is achieved by combining a Laplace mechanism with EXP3. We show that although EXP3 is already differentially private, it leaks a linear amount of information in $T$. However, we can improve this privacy by relying on its intrinsic exponential mechanism for selecting actions. This allows us to reach $O(\sqrt{\ln T})$-DP, with a regret of $O(T^{2/3})$ that holds against an adaptive adversary, an improvement over the previous best of $O(T^{3/4})$. This is done by using an algorithm that runs EXP3 in a mini-batch loop. Finally, we run experiments that clearly demonstrate the validity of our theoretical analysis.
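A minimal Python sketch of the kind of Laplace-mechanism-plus-EXP3 combination described above: Laplace noise is added to the importance-weighted reward estimates before the weight update. The noise placement, the mixing rate gamma, and the reward_fn interface are illustrative assumptions, not the paper's exact construction.

import numpy as np

def exp3_with_laplace(K, T, epsilon, reward_fn, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    gamma = min(1.0, np.sqrt(K * np.log(K) / T))   # standard EXP3 mixing rate
    log_w = np.zeros(K)                            # log-weights for numerical stability
    total_reward = 0.0
    for t in range(T):
        w = np.exp(log_w - log_w.max())
        p = (1 - gamma) * w / w.sum() + gamma / K  # EXP3 sampling distribution
        a = rng.choice(K, p=p)
        r = reward_fn(t, a)                        # observed reward, assumed in [0, 1]
        total_reward += r
        # importance-weighted reward estimate, perturbed by Laplace noise for privacy
        r_hat = r / p[a] + rng.laplace(scale=1.0 / epsilon)
        log_w[a] += gamma * r_hat / K
    return total_reward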
Adaptive Belief Discretization for POMDP Planning
2021, Divya Grover, Christos Dimitrakakis
The Partially Observable Markov Decision Process (POMDP) is a widely used model for representing the interaction of an environment and an agent under state uncertainty. Since the agent does not observe the environment state, its uncertainty is typically represented through a probabilistic belief. While the set of possible beliefs is infinite, making exact planning intractable, the belief space's complexity (and hence planning complexity) is characterized by its covering number. Many POMDP solvers uniformly discretize the belief space and give the planning error in terms of the (typically unknown) covering number. We instead propose an adaptive belief discretization scheme and give its associated planning error. We furthermore characterize the covering number with respect to the POMDP parameters. This allows us to specify the exact memory requirements on the planner needed to bound the value function error. We then propose a novel, computationally efficient solver using this scheme. We demonstrate that our algorithm is highly competitive with the state of the art in a variety of scenarios.
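For reference, planning error for such discretization schemes is usually stated in terms of the covering number of the reachable belief set; the definition below is the standard one assumed here, not necessarily the paper's exact formulation.

% Standard covering-number definition (assumed, for background): for the set of
% reachable beliefs \mathcal{B} inside the simplex \Delta(S),
\[
N(\epsilon) \;=\; \min\Bigl\{\, N :\ \exists\, b_1, \dots, b_N \in \Delta(S)
\ \text{s.t.}\ \forall b \in \mathcal{B},\ \min_{i} \lVert b - b_i \rVert_1 \le \epsilon \,\Bigr\},
\]
% and planning error bounds are then expressed in terms of N(\epsilon).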
Near-optimal Optimistic Reinforcement Learning using Empirical Bernstein Inequalities
2019, Aristide Tossou, Debabrota Basu, Christos Dimitrakakis
We study model-based reinforcement learning in an unknown finite communicating Markov decision process. We propose a simple algorithm that leverages a variance-based confidence interval. We show that the proposed algorithm, UCRL-V, achieves the optimal regret $\tilde{O}(\sqrt{DSAT})$ up to logarithmic factors, and so our work closes a gap with the lower bound without additional assumptions on the MDP. We perform experiments in a variety of environments that validate the theoretical bounds and show UCRL-V to outperform state-of-the-art algorithms.
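For context, variance-based confidence intervals of this kind typically rest on an empirical Bernstein inequality; a standard statement is given below (constants vary across the literature, and this may not match the paper's exact version).

% Empirical Bernstein inequality (illustrative statement): for i.i.d.
% X_1, \dots, X_n taking values in [0, b], with empirical mean \hat{\mu}_n and
% empirical variance \hat{\sigma}_n^2, with probability at least 1 - \delta,
\[
\bigl|\hat{\mu}_n - \mathbb{E}[X_1]\bigr|
\;\le\;
\sqrt{\frac{2\,\hat{\sigma}_n^2 \ln(3/\delta)}{n}}
\;+\; \frac{3\, b \ln(3/\delta)}{n}.
\]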
Rollout sampling approximate policy iteration
2008, Christos Dimitrakakis, Michail G. Lagoudakis
Several researchers have recently investigated the connection between reinforcement learning and classification. We are motivated by proposals of approximate policy iteration schemes without value functions, which focus on policy representation using classifiers and address policy learning as a supervised learning problem. This paper proposes variants of an improved policy iteration scheme that addresses the core sampling problem of evaluating a policy through simulation as a multi-armed bandit problem. The resulting algorithm offers performance comparable to the previous algorithm, achieved, however, with significantly less computational effort. An order-of-magnitude improvement is demonstrated experimentally in two standard reinforcement learning domains: inverted pendulum and mountain-car.
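A hedged Python sketch of the bandit view of rollout allocation: simulation budget is spent on the states whose best action is least clearly identified, and the resulting state-action labels can then train a classifier-based policy. The selection rule and the rollout_return helper are illustrative assumptions, not the paper's exact variant.

import numpy as np

def rollout_sampling_step(states, actions, budget, rollout_return, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    counts = {s: np.zeros(len(actions)) for s in states}
    q = {s: np.zeros(len(actions)) for s in states}

    def gap(s):
        top = np.sort(q[s])[-2:]        # two highest empirical action values
        return top[-1] - top[0]

    for _ in range(budget):
        # bandit view: spend the next rollout on the state whose best action is
        # least clearly separated from the runner-up
        s = min(states, key=gap)
        a = int(rng.integers(len(actions)))
        v = rollout_return(s, actions[a])            # return of one simulated rollout
        counts[s][a] += 1
        q[s][a] += (v - q[s][a]) / counts[s][a]      # incremental mean update
    # label each state with its empirically best action; these labels form the
    # training set for the classifier that represents the improved policy
    return {s: actions[int(np.argmax(q[s]))] for s in states}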
On Meritocracy in Optimal Set Selection
2021, Thomas Kleine Buening, Meirav Segal, Debabrota Basu, Christos Dimitrakakis, Anne-Marie George
Typically, merit is defined with respect to some intrinsic measure of worth. We instead consider a setting where an individual's worth is \emph{relative}: when a Decision Maker (DM) selects a set of individuals from a population to maximise expected utility, it is natural to consider the \emph{Expected Marginal Contribution} (EMC) of each person to the utility. We show that this notion satisfies an axiomatic definition of fairness for this setting. We also show that for certain policy structures, this notion of fairness is aligned with maximising expected utility, while for linear utility functions it is identical to the Shapley value. However, for certain natural policies, such as those that select individuals with a specific set of attributes (e.g. high enough test scores for college admissions), there is a trade-off between meritocracy and utility maximisation. We analyse the effect of constraints on the policy on both utility and fairness in extensive experiments based on college admissions and outcomes in Norwegian universities.
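A hedged formalisation of the Expected Marginal Contribution, for a utility $u$ over sets and a selection policy $\pi$ inducing a distribution over selected sets $S$; the paper's precise definition may differ.

% Expected Marginal Contribution of individual i (illustrative definition):
\[
\mathrm{EMC}(i) \;=\; \mathbb{E}_{S \sim \pi}\bigl[\, u(S \cup \{i\}) - u(S \setminus \{i\}) \,\bigr],
\]
% i.e. the expected change in utility from including individual i in the selected
% set; for linear utilities this coincides with the Shapley value, as noted above.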
Thompson Sampling For Stochastic Bandits with Graph Feedback
2017, Aristide C. Y. Tossou, Christos Dimitrakakis, Devdatt Dubhashi
We present a novel extension of Thompson Sampling for stochastic sequential decision problems with graph feedback, even when the graph structure itself is unknown and/or changing. We provide theoretical guarantees on the Bayesian regret of the algorithm, linking its performance to the underlying properties of the graph. Thompson Sampling has the advantage of being applicable without the need to construct complicated upper confidence bounds for different problems. We illustrate its performance through extensive experimental results on real and simulated networks with graph feedback. More specifically, we tested our algorithms on power-law, planted-partition and Erdős-Rényi graphs, as well as on graphs derived from Facebook and Flixster data. These all show that our algorithms clearly outperform related methods that employ upper confidence bounds, even if the latter use more information about the graph.
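A minimal Python sketch of Thompson Sampling with graph (side-observation) feedback for Bernoulli rewards: playing an arm also reveals the rewards of its neighbours, and every revealed reward updates the corresponding Beta posterior. The feedback-graph representation and update rule are assumptions for illustration, not the paper's exact algorithm (which also handles unknown or changing graphs).

import numpy as np

def thompson_graph(T, adjacency, reward_fn, rng=None):
    # adjacency[a] lists the arms whose rewards are observed when arm a is played
    # (assumed to include a itself)
    rng = np.random.default_rng() if rng is None else rng
    K = len(adjacency)
    alpha = np.ones(K)                  # Beta posterior successes per arm
    beta = np.ones(K)                   # Beta posterior failures per arm
    for t in range(T):
        theta = rng.beta(alpha, beta)   # one posterior sample per arm
        a = int(np.argmax(theta))       # play the arm with the largest sample
        for j in adjacency[a]:          # side observations from the feedback graph
            r = reward_fn(t, j)         # Bernoulli reward of arm j at round t
            alpha[j] += r
            beta[j] += 1 - r
    return alpha, beta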