Showing items 1 - 10 of 18
Publication
Open access

Explainable Machine Learning: Approximating Shapley Values for Dependent Predictors

2024, Kasperek, Jan; Matei, Alina

Modern machine learning algorithms often outperform classical statistical methods in predictive accuracy, but this comes at the expense of model interpretability. As businesses and institutions increasingly rely on machine learning to support and automate decision-making processes and to reap the benefits of more accurate predictions, explaining these model outputs becomes more important. A universally applicable approach to explaining such complex models is based on the Shapley value, a concept originating from game theory. However, its exact calculation is computationally very intensive, so approximations have to be used. The state-of-the-art approach, Kernel SHAP, assumes independence of the predictors, which is unrealistic in practice. Recent research has developed improvements that incorporate dependencies between predictors. After a review of the theoretical underpinnings, the original Kernel SHAP method is compared with improved versions in realistic settings, using three real-world datasets. While the improved versions are found to have a smaller approximation error relative to exact Shapley values, they are also more computationally demanding. Further improvements are discussed and possible research directions are suggested. The thesis is structured as follows: after introducing explainable machine learning in chapter 1, the Shapley value and its applications to model explainability are explored in chapter 2. Chapter 3 presents methods to approximate Shapley values as well as recent improvements to these methods, which are tested on real datasets in chapter 4. Some possible directions for future research are pointed out in chapter 5, before a final conclusion is given in chapter 6. Code for the experiments of chapter 4 can be found in the appendix.
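To illustrate why exact Shapley values become expensive, the following minimal sketch computes them by enumerating all coalitions. It is not the thesis code: the function `exact_shapley`, the toy model `toy_value`, and the choice of setting absent features to a zero baseline are assumptions made purely for illustration. Replacing absent features by a fixed baseline ignores any dependence between predictors, which is the simplification the abstract criticises.

```python
from itertools import combinations
from math import factorial


def exact_shapley(value, n_players):
    """Exact Shapley values for a game `value` defined on frozensets of player indices."""
    players = range(n_players)
    phi = [0.0] * n_players
    for j in players:
        others = [i for i in players if i != j]
        for size in range(n_players):
            # Shapley weight |S|! (n - |S| - 1)! / n! for coalitions S of this size
            weight = (factorial(size) * factorial(n_players - size - 1)
                      / factorial(n_players))
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of player j when joining coalition S
                phi[j] += weight * (value(s | {j}) - value(s))
    return phi


def toy_value(coalition):
    """Hypothetical model f(x) = 2*x0 + x1 + 3*x0*x2, explained at x = (1, 1, 1).

    Features outside the coalition are set to the baseline 0 (an assumption).
    """
    x = [1.0 if i in coalition else 0.0 for i in range(3)]
    return 2 * x[0] + x[1] + 3 * x[0] * x[2]


phi = exact_shapley(toy_value, 3)
print(phi)
```

Each feature requires a sum over 2^(n-1) coalitions, so the cost grows exponentially with the number of predictors. In this toy game the interaction term is split equally between features 0 and 2, and the contributions add up to the difference between the full prediction and the baseline prediction.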

Publication
Open access

Gender wage difference estimation at quantile levels using sample survey data and multiple imputation

2023-08-17, Matei, Alina

The presentation is motivated by the growing interest in estimating gender wage differences in official statistics. The wage of an employee is hypothetically a reflection of their characteristics, such as education level or work experience. It is possible that men and women with the same characteristics earn different wages. Gender wage differences are usually estimated at the mean level (see Blinder, 1973: J Hum Resour, 8, 436–455; Oaxaca, 1973: Int Econ Rev, 14, 693–709). Our goal is to estimate the differences between wages at different quantiles, using sample survey data within a superpopulation framework. To do this, we use a parametric approach based on conditional distributions of the wages as a function of some auxiliary information, as well as a counterfactual distribution (see Biewen and Jenkins, 2005: Empir Econ, 30, 331-358). The counterfactual distribution can be interpreted as an imputed distribution. It is constructed here by using a reweighting approach (see Fortin et al., 2011: Handbook of Labor Economics, 4, 1-102) based on calibration (see Anastasiade and Tillé, 2017: Surv Methodol, 43, 211-235). We use methods inspired by multiple imputation to estimate the quantiles of the wage distributions, as well as those of the counterfactual distribution. We show in our simulation studies that the use of auxiliary information well correlated with the wages reduces the variance of the counterfactual quantile estimates compared to those of the competitors. Since wage distributions are, in general, heavy-tailed, it is of interest to model wages using heavy-tailed distributions such as the GB2 distribution. We illustrate the approach using this distribution and the wages of men and women, with simulated data and real data from the Swiss Federal Statistical Office.
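The idea of a counterfactual distribution obtained by reweighting can be sketched with a nonparametric toy example. This is not the parametric GB2 approach of the presentation: the functions `weighted_quantile` and `poststratify`, the data, and the use of post-stratification on a single categorical variable (the simplest special case of calibration) are all assumptions chosen for illustration.

```python
import numpy as np


def weighted_quantile(values, weights, q):
    """Smallest value whose cumulative weight share reaches q."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values)
    cum = np.cumsum(weights[order]) / weights.sum()
    idx = min(int(np.searchsorted(cum, q)), len(values) - 1)
    return values[order][idx]


def poststratify(weights, groups, target_totals):
    """Rescale weights so that weighted group totals match target_totals."""
    weights = np.asarray(weights, dtype=float)
    groups = np.asarray(groups)
    new = weights.copy()
    for g, total in target_totals.items():
        mask = groups == g
        new[mask] *= total / weights[mask].sum()
    return new


# Hypothetical data: men's wages, education groups and survey weights
men_wage = [30, 40, 50, 60, 80, 100]
men_edu = ["low", "low", "low", "high", "high", "high"]
men_w = [1.0] * 6

# Hypothetical weighted education totals observed among women
women_edu_totals = {"low": 4.0, "high": 2.0}

# Counterfactual weights: men's wage structure, women's characteristics
cf_w = poststratify(men_w, men_edu, women_edu_totals)

observed_q75 = weighted_quantile(men_wage, men_w, 0.75)
counterfactual_q75 = weighted_quantile(men_wage, cf_w, 0.75)
print(observed_q75, counterfactual_q75)
```

Comparing the observed and counterfactual quantiles isolates the part of a quantile gap attributable to differences in characteristics, here education only. A real application would calibrate on several auxiliary variables and would need variance estimation, which this sketch omits.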

Publication
Open access

Handling nonignorable nonresponse using generalized calibration with latent variables (accepted in Statistical Methods and Applications)

2022-05-03, Ranalli, Giovanna; Matei, Alina; Neri, Andrea

Sample surveys may suffer from nonignorable unit nonresponse. This happens when the decision of whether or not to participate in the survey is correlated with the variables of interest; in such a case, nonresponse produces biased estimates for parameters related to those variables, even after adjustments that account for auxiliary information. This paper presents a method to deal with nonignorable unit nonresponse that uses generalised calibration and latent variable modelling. Generalised calibration makes it possible to model unit nonresponse using a set of auxiliary variables (instrumental or model variables) that can be different from those used in the calibration constraints (calibration variables). We propose to use latent variables to estimate the probability of participating in the survey and to construct a reweighting system incorporating such latent variables. The proposed methodology is illustrated, and its properties are discussed and tested in two simulation studies. Finally, it is applied to adjust estimates of the finite population mean wealth from the Italian Survey of Household Income and Wealth.
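The distinction between model variables and calibration variables can be made concrete with a small sketch of generalised calibration. It assumes a linear adjustment function, w_k = d_k (1 + z_k' lambda), which gives a closed-form solution when there are as many model variables as calibration variables; other adjustment functions are common in practice. The function name, the data, and the totals are hypothetical, and the latent variable estimation step of the paper is not shown.

```python
import numpy as np


def generalized_calibration_linear(d, X, Z, t_x):
    """Weights w_k = d_k * (1 + z_k' lam) such that sum_k w_k x_k = t_x.

    d   : design weights of the respondents
    X   : calibration variables (known population totals t_x)
    Z   : model (instrumental) variables driving the nonresponse adjustment
    Assumes X and Z have the same number of columns (linear case only).
    """
    d = np.asarray(d, dtype=float)
    X = np.asarray(X, dtype=float)
    Z = np.asarray(Z, dtype=float)
    A = X.T @ (d[:, None] * Z)  # sum_k d_k x_k z_k'
    lam = np.linalg.solve(A, np.asarray(t_x, dtype=float) - X.T @ d)
    return d * (1.0 + Z @ lam)


# Hypothetical respondent data: intercept plus one variable in X and in Z
d = [10.0] * 5
X = [[1, 1], [1, 2], [1, 3], [1, 4], [1, 5]]
Z = [[1, 2], [1, 1], [1, 4], [1, 3], [1, 6]]
t_x = [60.0, 190.0]  # assumed known population totals of the X columns

w = generalized_calibration_linear(d, X, Z, t_x)
print(w)
```

The adjusted weights depend on the respondents' Z values but reproduce the known totals of the X variables. In an approach such as the one described above, estimated latent scores could in principle take the place of the second column of Z.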

Publication
Open access

Book review: Yves Tillé. Sampling and Estimation from Finite Populations. New York: Wiley, 2020. ISBN: 978-0-470-68205-0, 448 pages

2021-12-26, Matei, Alina

Publication
Open access

Gender wage difference estimation at quantile levels using sample survey data

2023-09-19, Anastasiade-Guinand, Mihaela-Cătălina; Matei, Alina; Tillé, Yves

This paper is motivated by the growing interest in estimating gender wage differences in official statistics. The wage of an employee is hypothetically a reflection of her or his characteristics, such as education level or work experience. It is possible that men and women with the same characteristics earn different wages. Our goal is to estimate the differences between wages at different quantiles, using sample survey data within a superpopulation framework. To do this, we use a parametric approach based on conditional distributions of the wages as a function of some auxiliary information, as well as a counterfactual distribution. We show in our simulation studies that the use of auxiliary information well correlated with the wages reduces the variance of the counterfactual quantile estimates compared to those of the competitors. Since wage distributions are, in general, heavy-tailed, it is of interest to model wages using heavy-tailed distributions such as the GB2 distribution. We illustrate the approach using this distribution and the wages of men and women, with simulated data and real data from the Swiss Federal Statistical Office.

Publication
Open access

Sample coordination methods and systems (invited)

2022-09-14, Matei, Alina; Smith, Paul

Publication
Open access

Book review: "Big Data Meets Survey Science. A Collection of Innovative Methods"

2022-01-15, Matei, Alina

Publication
Open access

Méthodes et systèmes de coordination d'échantillons

2023-08-17, Matei, Alina; Smith, Paul

Sample coordination refers to methods that create a probabilistic dependence between the selections of random samples in order to optimise the size of their overlap. If the goal of a survey is to estimate changes over time, or to reduce the costs associated with recruiting a new sampling unit, the size of the sample overlap should be maximised. Sometimes one wishes to control the risk that the same sampling unit is selected in different surveys, and thus to limit the response burden for that particular unit in a given period. In this case, the goal is to minimise the size of the sample overlap. Several methods are used for sample coordination. Sample coordination systems are also used in national statistical offices, and they are based on sample coordination methods. The existing literature does not distinguish between 'methods' and 'systems'. In our opinion, however, the two represent different concepts. To distinguish between them, we provide a definition of a sample coordination system, and classify the measures that are currently used in such systems in official statistics.

Publication
Open access

Generalised calibration with latent variables for the treatment of unit nonresponse in sample surveys

2022-06-09, Ranalli, Giovanna; Matei, Alina; Neri, Andrea

Publication
Open access

Sample coordination methods and systems for establishment surveys

2022, Matei, Alina; Smith, Paul

Sample coordination has been a topic of interest in the world of establishment surveys since well before the first International Conference on Establishment Surveys (ICES-I), where Ohlsson (1995) summarised the state of the art of methods using permanent random numbers (PRNs). A range of procedures have been proposed in the literature for sample coordination (divided into PRN and non-PRN methods), and these have given rise to several implementations in different countries. The national statistical offices of different countries currently use so-called ‘sample coordination systems’. ‘Sample coordination methods’ and ‘sample coordination systems’ represent, in our opinion, two different concepts. The existing literature does not distinguish between them. Moreover, a definition of a sample coordination system has not yet been provided, although the term is widely used. This chapter aims to review these two concepts and to underline the similarities and distinctions between them. First, we review the main existing methods for sample coordination, and highlight their strengths and weaknesses. Next, we enumerate the components of a coordination system and review some of those currently being used in different countries. Finally, we distinguish between ‘sample coordination methods’ and ‘sample coordination systems’.
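As an illustration of the PRN idea mentioned above, the following sketch coordinates two Poisson samples through permanent random numbers. It is a textbook-style toy, not a description of any national system: the function `poisson_prn_sample`, the fixed PRN values, the inclusion probabilities, and the shift of 0.3 are assumptions chosen so that the outcome is easy to follow.

```python
def poisson_prn_sample(prns, probs, shift=0.0):
    """Select unit k when its (shifted) permanent random number falls below probs[k]."""
    return {k for k, u in prns.items() if (u - shift) % 1.0 < probs[k]}


# Hypothetical permanent random numbers; in practice each unit would receive
# one uniform(0, 1) draw that is stored and reused across surveys.
prns = {"a": 0.05, "b": 0.25, "c": 0.45, "d": 0.65, "e": 0.85}

# First survey: inclusion probability 0.3 for every unit
survey_1 = poisson_prn_sample(prns, {k: 0.3 for k in prns})

# Positive coordination: reuse the same PRNs, so the overlap is maximised
survey_2_pos = poisson_prn_sample(prns, {k: 0.5 for k in prns})

# Negative coordination: shift the PRNs so the selection intervals do not overlap
survey_2_neg = poisson_prn_sample(prns, {k: 0.3 for k in prns}, shift=0.3)

print(sorted(survey_1), sorted(survey_2_pos), sorted(survey_2_neg))
```

With the same PRNs, every unit of the first sample reappears in the larger second sample, which suits the estimation of change over time. With shifted PRNs, the two samples are disjoint here, which spreads the response burden. Poisson sampling yields a random sample size, one reason other PRN and non-PRN methods exist.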