Showing items 1 - 10 of 95
  • Publication
    Open access
    Explainable Machine Learning: Approximating Shapley Values for Dependent Predictors
    Modern Machine Learning algorithms often outperform classical statistical methods in predictive accuracy, but at the expense of model interpretability. As businesses and institutions increasingly rely on Machine Learning to support and automate decision-making processes and to benefit from more accurate predictions, explaining model outputs becomes more important. A universally applicable approach to explaining such complex models is based on the Shapley value, a concept originating from game theory. However, its exact calculation is computationally intensive, so approximations have to be used. The state-of-the-art approach, Kernel SHAP, assumes independence of the predictors, which is unrealistic in practice. Recent research has developed improvements that incorporate dependencies between predictors. After a review of the theoretical underpinnings, the original Kernel SHAP method is compared with improved versions in realistic settings, using three real-world datasets. While the improved versions are found to have a smaller approximation error relative to exact Shapley values, they are also more computationally demanding. Further improvements are discussed and possible research directions are suggested. The thesis is structured as follows: After an introduction to explainable machine learning in chapter 1, the Shapley value and its applications to model explainability are explored in chapter 2. Chapter 3 presents methods to approximate Shapley values as well as recent improvements to these methods, which are tested on real datasets in chapter 4. Some possible directions for future research are pointed out in chapter 5, followed by a final conclusion in chapter 6. Code for the experiments of chapter 4 is found in the appendix.
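    To illustrate why exact Shapley values become expensive, the sketch below enumerates every coalition for a three-player toy game. It is not code from the thesis: the function `shapley_values`, the game `toy_value`, and the player names `x1`, `x2`, `x3` are hypothetical and chosen only for illustration. The weights follow the standard Shapley formula, and the number of coalitions grows as 2^n in the number of players.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values by enumerating all coalitions (cost grows as 2^n)."""
    n = len(players)
    phi = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for size in range(n):
            # Standard Shapley weight |S|! (n - |S| - 1)! / n! for |S| = size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                s = frozenset(coalition)
                # Marginal contribution of player i to coalition S
                total += weight * (value(s | {i}) - value(s))
        phi[i] = total
    return phi


def toy_value(coalition):
    """Hypothetical game: additive payoffs plus an x1-x2 interaction of 1."""
    base = {"x1": 1.0, "x2": 2.0, "x3": 3.0}
    v = sum(base[p] for p in coalition)
    if "x1" in coalition and "x2" in coalition:
        v += 1.0
    return v


players = ["x1", "x2", "x3"]
phi = shapley_values(players, toy_value)
for name in players:
    print(name, round(phi[name], 6))
```

    In this toy game the interaction term is split equally between `x1` and `x2`, and the three values sum to the payoff of the full coalition (the efficiency property). In model explanation, the value function would be built from model predictions, which is where the treatment of dependent predictors enters.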
  • Publication
    Open access
    Gender wage difference estimation at quantile levels using sample survey data
    (2023-09-19)
    Mihaela-Cătălina Anastasiade-Guinand
    This paper is motivated by the growing interest in estimating gender wage differences in official statistics. The wage of an employee is hypothetically a reflection of her or his characteristics, such as education level or work experience. It is possible that men and women with the same characteristics earn different wages. Our goal is to estimate the differences between wages at different quantiles, using sample survey data within a superpopulation framework. To do this, we use a parametric approach based on conditional distributions of the wages as a function of some auxiliary information, as well as a counterfactual distribution. We show in our simulation studies that using auxiliary information that is well correlated with the wages reduces the variance of the counterfactual quantile estimates compared to those of the competitors. Since wage distributions are, in general, heavy-tailed, it is of interest to model wages with heavy-tailed distributions such as the GB2 distribution. We illustrate the approach using this distribution and the wages for men and women, with simulated data and real data from the Swiss Federal Statistical Office.
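    The idea of comparing wages at quantile levels with a counterfactual distribution can be sketched as follows. This is a simplified, non-parametric illustration and not the paper's parametric GB2 approach: it uses weighted empirical quantiles, and it builds the counterfactual by reweighting women's survey weights so that their education totals match those of men. All data, the single `edu` covariate, and the helper names are hypothetical.

```python
from collections import defaultdict


def weighted_quantile(values, weights, u):
    """Smallest value whose weighted empirical CDF reaches level u."""
    pairs = sorted(zip(values, weights))
    total = sum(w for _, w in pairs)
    cum = 0.0
    for y, w in pairs:
        cum += w
        if cum >= u * total:
            return y
    return pairs[-1][0]


def group_totals(groups, weights):
    """Sum of survey weights within each category."""
    totals = defaultdict(float)
    for g, w in zip(groups, weights):
        totals[g] += w
    return totals


def reweight_to_match(groups, weights, target_totals):
    """Rescale weights so that category totals equal target_totals."""
    current = group_totals(groups, weights)
    return [w * target_totals[g] / current[g] for g, w in zip(groups, weights)]


# Hypothetical survey data: wage, education category, survey weight
men_wage = [40, 50, 70, 90]
men_edu = ["low", "low", "high", "high"]
men_w = [10.0, 10.0, 10.0, 10.0]

women_wage = [35, 45, 60, 80]
women_edu = ["low", "low", "high", "high"]
women_w = [15.0, 15.0, 5.0, 5.0]

# Counterfactual: women's wages, but with men's education composition
cf_w = reweight_to_match(women_edu, women_w, group_totals(men_edu, men_w))

u = 0.75
q_men = weighted_quantile(men_wage, men_w, u)
q_women = weighted_quantile(women_wage, women_w, u)
q_cf = weighted_quantile(women_wage, cf_w, u)

print("raw difference:", q_men - q_women)
print("composition part:", q_cf - q_women)
print("structure part:", q_men - q_cf)
```

    The raw difference at level `u` splits into a part attributable to different characteristics (counterfactual minus women) and a part that remains for equal characteristics (men minus counterfactual).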
  • Publication
    Open access
    Méthodes et systèmes de coordination d'échantillons [Sample coordination methods and systems]
    (2023-08-17)
    Paul Smith
    Sample coordination refers to methods that create a probabilistic dependence between the selections of random samples in order to optimise the size of their overlap. If the aim of a survey is to estimate changes over time, or to reduce the costs of recruiting a new sampling unit, the sample overlap should be maximised. Sometimes the aim is instead to control the risk that the same sampling unit is selected in different surveys, and thus to limit the response burden on that unit over a given period. In this case, the objective is to minimise the sample overlap. Several methods are used for sample coordination. Sample coordination systems are also used in national statistical offices, and they are built on sample coordination methods. The existing literature does not distinguish between ‘methods’ and ‘systems’. In our view, however, the two are different concepts. To distinguish them, we provide a definition of a sample coordination system and classify the measures currently used in such systems in official statistics.
  • Publication
    Open access
    Gender wage difference estimation at quantile levels using sample survey data and multiple imputation
    (2023-08-17)
    The presentation is motivated by the growing interest in estimating gender wage differences in official statistics. The wage of an employee is hypothetically a reflection of their characteristics, such as education level or work experience. It is possible that men and women with the same characteristics earn different wages. Gender wage differences are usually estimated at the mean level (see Blinder, 1973: J Hum Resour, 8, 436–455; Oaxaca, 1973: Int Econ Rev, 14, 693–709). Our goal is to estimate the differences between wages at different quantiles, using sample survey data within a superpopulation framework. To do this, we use a parametric approach based on conditional distributions of the wages as a function of some auxiliary information, as well as a counterfactual distribution (see Biewen and Jenkins, 2005: Empir Econ, 30, 331–358). The counterfactual distribution can be interpreted as an imputed distribution. It is constructed here using a reweighting approach (see Fortin et al., 2011: Handbook of Labor Economics, 4, 1–102) based on calibration (see Anastasiade and Tillé, 2017: Surv Methodol, 43, 211–235). We use methods inspired by multiple imputation to estimate the quantiles of the wage distributions, as well as those of the counterfactual distribution. We show in our simulation studies that using auxiliary information that is well correlated with the wages reduces the variance of the counterfactual quantile estimates compared to those of the competitors. Since wage distributions are, in general, heavy-tailed, it is of interest to model wages with heavy-tailed distributions such as the GB2 distribution. We illustrate the approach using this distribution and the wages for men and women, with simulated data and real data from the Swiss Federal Statistical Office.
  • Publication
    Open access
    Handling nonignorable nonresponse using generalized calibration with latent variables (accepted in Statistical Methods and Applications)
    (2022-05-03)
    Ranalli, Giovanna
    Neri, Andrea
    Sample surveys may suffer from nonignorable unit nonresponse. This happens when the decision of whether or not to participate in the survey is correlated with the variables of interest; in such a case, nonresponse produces biased estimates of parameters related to those variables, even after adjustments that account for auxiliary information. This paper presents a method to deal with nonignorable unit nonresponse that uses generalised calibration and latent variable modelling. Generalised calibration makes it possible to model unit nonresponse using a set of auxiliary variables (instrumental or model variables) that can differ from those used in the calibration constraints (calibration variables). We propose to use latent variables to estimate the probability of participating in the survey and to construct a reweighting system that incorporates such latent variables. The proposed methodology is illustrated, and its properties are discussed and tested in two simulation studies. Finally, it is applied to adjust estimates of the finite population mean wealth from the Italian Survey of Household Income and Wealth.
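    The distinction between model variables and calibration variables can be made concrete with a minimal sketch. It assumes the simplest possible case, which is not the paper's method: one calibration variable `x`, one model (instrumental) variable `z`, and a linear adjustment w_k = d_k (1 + lambda z_k), with lambda chosen so that the weighted total of `x` over respondents equals a known population total `t_x`. The latent variable step is not reproduced, and all names and numbers are hypothetical.

```python
def generalized_calibration_weights(d, x, z, t_x):
    """Linear generalised calibration with scalar x (calibration) and z (model).

    Weights have the form w_k = d_k * (1 + lam * z_k), where lam solves the
    calibration equation sum_k w_k * x_k = t_x.
    """
    weighted_total = sum(dk * xk for dk, xk in zip(d, x))
    denom = sum(dk * zk * xk for dk, zk, xk in zip(d, z, x))
    if denom == 0:
        raise ValueError("calibration equation has no unique solution")
    lam = (t_x - weighted_total) / denom
    return [dk * (1.0 + lam * zk) for dk, zk in zip(d, z)]


# Hypothetical respondents: design weights, calibration and model variables
d = [10.0, 10.0, 10.0, 10.0]
x = [1.0, 2.0, 3.0, 4.0]   # known population total: t_x
z = [2.0, 1.0, 1.0, 2.0]   # drives the nonresponse adjustment
t_x = 120.0

w = generalized_calibration_weights(d, x, z, t_x)
calibrated_total = sum(wk * xk for wk, xk in zip(w, x))
print([round(wk, 3) for wk in w], round(calibrated_total, 6))
```

    The adjustment factor depends on `z`, while the constraint is expressed in `x`. Setting `z = x` would give ordinary linear calibration in this scalar case.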
  • Publication
    Temporarily restricted
    Sample coordination methods and systems for establishment surveys
    (Hoboken: Wiley, 2022)
    Smith, Paul
    Sample coordination has been a topic of interest in the world of establishment surveys since well before the first International Conference on Establishment Surveys (ICES-I), where Ohlsson (1995) summarised the state of the art of methods using permanent random numbers (PRNs). A range of procedures has been proposed in the literature for sample coordination (divided into PRN and non-PRN methods), and these have given rise to several implementations in different countries. The national statistical offices of different countries currently use so-called ‘sample coordination systems’. ‘Sample coordination methods’ and ‘sample coordination systems’ represent, in our opinion, two different concepts. The existing literature does not distinguish between them. Moreover, a definition of a sample coordination system has not yet been provided, although the term is widely used. This chapter aims to review these two concepts and to underline the similarities and distinctions between them. First, we review the main existing methods for sample coordination and highlight their strengths and weaknesses. Next, we enumerate the components of a coordination system and review some of those currently in use in different countries. Finally, we distinguish between ‘sample coordination methods’ and ‘sample coordination systems’.
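    As an illustration of the PRN idea, the sketch below uses Poisson sampling with permanent random numbers, one of the simplest PRN schemes. It is not taken from the chapter: the population size, inclusion probabilities, seed, and the function name `poisson_prn_sample` are hypothetical. Each unit keeps the same uniform random number across surveys. Reusing the same starting point gives positive coordination (large overlap), and shifting the starting point gives negative coordination (small overlap).

```python
import random


def poisson_prn_sample(prn, incl_prob, start=0.0):
    """Select unit k when its PRN, shifted by `start`, falls in [0, pi_k)."""
    return {k for k, u in prn.items() if (u - start) % 1.0 < incl_prob[k]}


rng = random.Random(2022)
units = range(1000)

# Each unit receives one permanent random number, reused for every survey
prn = {k: rng.random() for k in units}

pi_first = {k: 0.2 for k in units}   # inclusion probabilities, survey 1
pi_second = {k: 0.3 for k in units}  # inclusion probabilities, survey 2

s1 = poisson_prn_sample(prn, pi_first)

# Positive coordination: same starting point, so overlap is maximised
s2_pos = poisson_prn_sample(prn, pi_second)

# Negative coordination: start survey 2 where survey 1's interval ends
s2_neg = poisson_prn_sample(prn, pi_second, start=0.2)

print(len(s1), len(s1 & s2_pos), len(s1 & s2_neg))
```

    With these inclusion probabilities, every unit in the first sample is also in the positively coordinated second sample, while the negatively coordinated second sample shares no unit with the first. A coordination system would add further components around such a method, for example rules for managing PRNs and starting points across many surveys.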