SYSTEMIC DISCRIMINATION AMONG LARGE U.S. EMPLOYERS

Link: https://eml.berkeley.edu//~crwalters/papers/randres.pdf

Graphic:

Abstract:

We study the results of a massive nationwide correspondence experiment sending more than 83,000 fictitious applications with randomized characteristics to geographically dispersed jobs posted by 108 of the largest U.S. employers. Distinctively Black names reduce the probability of employer contact by 2.1 percentage points relative to distinctively white names. The magnitude of this racial gap in contact rates differs substantially across firms, exhibiting a between-company standard deviation of 1.9 percentage points. Despite an insignificant average gap in contact rates between male and female applicants, we find a between-company standard deviation in gender contact gaps of 2.7 percentage points, revealing that some firms favor male applicants while others favor women. Company-specific racial contact gaps are temporally and spatially persistent, and negatively correlated with firm profitability, federal contractor status, and a measure of recruiting centralization. Discrimination exhibits little geographical dispersion, but two-digit industry explains roughly half of the cross-firm variation in both racial and gender contact gaps. Contact gaps are highly concentrated in particular companies, with firms in the top quintile of racial discrimination responsible for nearly half of lost contacts to Black applicants in the experiment. Controlling false discovery rates to the 5% level, 23 individual companies are found to discriminate against Black applicants. Our findings establish that systemic illegal discrimination is concentrated among a select set of large employers, many of which can be identified with high confidence using large-scale inference methods.
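
The final claim involves multiple-testing machinery: with 108 firms, flagging every firm whose estimated racial contact gap has p < 0.05 would yield many false positives. Below is a minimal sketch of the Benjamini-Hochberg procedure, one standard way to control the false discovery rate at 5%; it illustrates the general idea only (the paper uses its own large-scale inference methods), and the firm-level p-values are hypothetical.

```python
import numpy as np

def benjamini_hochberg(p_values, q=0.05):
    """Return a boolean array marking which hypotheses are rejected
    while controlling the false discovery rate at level q."""
    p = np.asarray(p_values)
    m = p.size
    order = np.argsort(p)                  # sort p-values ascending
    ranked = p[order]
    # BH rule: find the largest k with p_(k) <= (k/m) * q
    below = ranked <= (np.arange(1, m + 1) / m) * q
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.where(below)[0])     # index of largest qualifying p-value
        reject[order[: k + 1]] = True      # reject all hypotheses up to that rank
    return reject

# Hypothetical firm-level p-values for the test "racial contact gap = 0"
p_vals = [0.0004, 0.003, 0.012, 0.049, 0.21, 0.47, 0.80]
print(benjamini_hochberg(p_vals, q=0.05))
```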

Author(s): Patrick M. Kline, Evan K. Rose, and Christopher R. Walters

Publication Date: July 2021, Revised August 2021

Publication Site: NBER Working Papers, also Christopher R. Walters’s own webpages

Autocorrect errors in Excel still creating genomics headache

Link: https://www.nature.com/articles/d41586-021-02211-4

Graphic:

Excerpt:

In 2016, Mark Ziemann and his colleagues at the Baker IDI Heart and Diabetes Institute in Melbourne, Australia, quantified the problem. They found that one-fifth of papers in top genomics journals contained gene-name conversion errors in Excel spreadsheets published as supplementary data [2]. These data sets are frequently accessed and used by other geneticists, so errors can perpetuate and distort further analyses.

However, despite the issue being brought to the attention of researchers — and steps being taken to fix it — the problem is still rife, according to an updated and larger analysis led by Ziemann, now at Deakin University in Geelong, Australia [3]. His team found that almost one-third of more than 11,000 articles with supplementary Excel gene lists published between 2014 and 2020 contained gene-name errors (see ‘A growing problem’).

Simple checks can detect autocorrect errors, says Ziemann, who researches computational reproducibility in genetics. But without those checks, the errors can easily go unnoticed because of the volume of data in spreadsheets.
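
Ziemann's "simple checks" can be made concrete. Excel silently converts gene symbols such as SEPT2, MARCH1, or DEC1 into dates (and some accession-like identifiers into scientific notation); a minimal screening sketch for a supplementary gene list might look like the following. The file name, column position, and patterns are assumptions for illustration, not code from the study.

```python
import csv
import re

# Patterns that typically appear after Excel has converted a gene symbol
# (e.g. SEPT2, MARCH1, DEC1) into a date, or an identifier into scientific notation.
DATE_LIKE = re.compile(
    r"^\d{1,2}-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)$", re.IGNORECASE
)
SCI_NOTATION = re.compile(r"^\d+(\.\d+)?E[+-]?\d+$", re.IGNORECASE)

def flag_suspicious(values):
    """Return entries that look like Excel-mangled gene identifiers."""
    return [v for v in values
            if DATE_LIKE.match(v.strip()) or SCI_NOTATION.match(v.strip())]

# Hypothetical usage: scan the first column of a supplementary table.
with open("supplementary_gene_list.csv", newline="") as fh:
    genes = [row[0] for row in csv.reader(fh) if row]

for bad in flag_suspicious(genes):
    print("possible autocorrect error:", bad)
```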

Author(s): Dyani Lewis

Publication Date: 13 August 2021

Publication Site: Nature

Israeli data: How can efficacy vs. severe disease be strong when 60% of hospitalized are vaccinated?

Link: https://www.covid-datascience.com/post/israeli-data-how-can-efficacy-vs-severe-disease-be-strong-when-60-of-hospitalized-are-vaccinated

Graphic:

Excerpt:

These efficacies are quite high and suggest the vaccines are doing a very good job of preventing severe disease in both the older and younger cohorts. These levels of efficacy are much higher than the 67.5% efficacy estimate we get if the analysis is not stratified by age. How can there be such a discrepancy between the age-stratified and overall efficacy numbers?

This is an example of Simpson’s Paradox, a well-known phenomenon in which misleading results can sometimes be obtained from observational data in the presence of confounding factors.
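
The arithmetic behind the paradox is worth making explicit. In the sketch below the counts are purely illustrative (not the Israeli figures): efficacy against severe disease is high within each age stratum, yet because vaccination is concentrated in the older, higher-risk group, the pooled estimate is much lower.

```python
def efficacy(vacc_cases, vacc_n, unvacc_cases, unvacc_n):
    """Vaccine efficacy = 1 - (risk in vaccinated / risk in unvaccinated)."""
    return 1 - (vacc_cases / vacc_n) / (unvacc_cases / unvacc_n)

# Purely illustrative counts (NOT the Israeli data): severe cases / population.
groups = {
    # age group: (vacc_cases, vacc_n, unvacc_cases, unvacc_n)
    "60+":      (300, 1_000_000, 300,   100_000),   # most older people vaccinated
    "under 60": ( 10,   500_000, 200, 1_000_000),   # most younger people unvaccinated
}

for name, g in groups.items():
    print(f"{name}: efficacy = {efficacy(*g):.0%}")   # 90% within each stratum

pooled = [sum(x) for x in zip(*groups.values())]
print(f"pooled: efficacy = {efficacy(*pooled):.0%}")  # ~55% when age is ignored
```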

Author(s): Jeffrey Morris

Publication Date: 17 August 2021

Publication Site: Covid-19 Data Science

Machine Learning: The Mathematics of Support Vector Machines – Part 1

Link: https://www.yengmillerchang.com/post/svm-lin-sep-part-1/

Graphic:

Excerpt:

Introduction

The purpose of this post is to discuss the mathematics of support vector machines (SVMs) in detail, in the case of linear separability.

Background

SVMs are a tool for classification. The idea is that we want to find two lines (linear equations) so that a given set of points is linearly separable according to a binary classifier, coded as ±1, assuming such lines exist. These lines are shown as the black lines in the original post's figure.
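
As a numerical complement to the post's derivation (not code from the post), the sketch below fits a linear SVM to a small, linearly separable toy set with labels coded as ±1 and recovers the separating hyperplane w·x + b = 0 and the margin width; the very large C value is an assumption that approximates the hard-margin case.

```python
import numpy as np
from sklearn.svm import SVC

# Tiny linearly separable toy data, labels coded as ±1 as in the post.
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.5],
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.5]])
y = np.array([-1, -1, -1, 1, 1, 1])

# A very large C approximates the hard-margin SVM when the data are separable.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]                  # normal vector of the separating hyperplane
b = clf.intercept_[0]
margin = 2 / np.linalg.norm(w)    # distance between the two supporting lines

print("hyperplane: %.3f*x1 + %.3f*x2 + %.3f = 0" % (w[0], w[1], b))
print("margin width:", margin)
print("support vectors:", clf.support_vectors_)
```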

Author(s): Yeng Miller-Chang

Publication Date: 6 August 2021

Publication Site: Math, Music Occasionally, and Stats

Restrict Insurers’ Use Of External Consumer Data, Colorado Senate Bill 21-169

Link: https://leg.colorado.gov/sites/default/files/2021a_169_signed.pdf

Link: https://leg.colorado.gov/bills/sb21-169

Excerpt:

The general assembly therefore declares that in order to ensure that all Colorado residents have fair and equitable access to insurance products, it is necessary to:
(a) Prohibit:
(I) Unfair discrimination based on race, color, national or ethnic origin, religion, sex, sexual orientation, disability, gender identity, or gender expression in any insurance practice; and
(II) The use of external consumer data and information sources, as well as algorithms and predictive models using external consumer data and information sources, which use has the result of unfairly discriminating based on race, color, national or ethnic origin, religion, sex, sexual orientation, disability, gender identity, or gender expression; and
(b) After notice and rule-making by the commissioner of insurance, require insurers that use external consumer data and information sources, algorithms, and predictive models to control for, or otherwise demonstrate that such use does not result in, unfair discrimination.

Publication Date: 6 July 2021

Publication Site: Colorado Legislature

“Why Should I Trust You?” Explaining the Predictions of Any Classifier

Link: https://www.kdd.org/kdd2016/papers/files/rfp0573-ribeiroA.pdf

DOI: http://dx.doi.org/10.1145/2939672.2939778

Graphic:

Excerpt:

Despite widespread adoption, machine learning models remain mostly black boxes. Understanding the reasons behind predictions is, however, quite important in assessing trust, which is fundamental if one plans to take action based on a prediction, or when choosing whether to deploy a new model. Such understanding also provides insights into the model, which can be used to transform an untrustworthy model or prediction into a trustworthy one. In this work, we propose LIME, a novel explanation technique that explains the predictions of any classifier in an interpretable and faithful manner, by learning an interpretable model locally around the prediction. We also propose a method to explain models by presenting representative individual predictions and their explanations in a non-redundant way, framing the task as a submodular optimization problem. We demonstrate the flexibility of these methods by explaining different models for text (e.g. random forests) and image classification (e.g. neural networks). We show the utility of explanations via novel experiments, both simulated and with human subjects, on various scenarios that require trust: deciding if one should trust a prediction, choosing between models, improving an untrustworthy classifier, and identifying why a classifier should not be trusted.
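
To make "learning an interpretable model locally around the prediction" concrete, the following sketch implements the core LIME idea from scratch for a tabular classifier: perturb the instance, weight the perturbations by proximity, and fit a weighted linear surrogate whose coefficients serve as the explanation. It is a simplified illustration under those assumptions, not the authors' reference implementation (the open-source lime package).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Black-box model to be explained.
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
black_box = RandomForestClassifier(random_state=0).fit(X, y)

def lime_explain(x, predict_proba, n_samples=2000, kernel_width=1.0):
    """Fit a locally weighted linear surrogate around instance x."""
    # 1. Perturb the instance with Gaussian noise scaled to each feature.
    Z = x + rng.normal(scale=X.std(axis=0), size=(n_samples, x.size))
    # 2. Query the black box at the perturbed points.
    p = predict_proba(Z)[:, 1]
    # 3. Weight perturbations by proximity to x (RBF kernel).
    d = np.linalg.norm(Z - x, axis=1)
    w = np.exp(-(d ** 2) / (kernel_width ** 2))
    # 4. Fit an interpretable linear surrogate; its coefficients are the explanation.
    surrogate = Ridge(alpha=1.0).fit(Z, p, sample_weight=w)
    return surrogate.coef_

coefs = lime_explain(X[0], black_box.predict_proba)
print("local feature effects:", np.round(coefs, 3))
```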

Author(s): Marco Tulio Ribeiro, Sameer Singh, Carlos Guestrin

Publication Date: 2016

Publication Site: kdd, Association for Computing Machinery

A Unified Approach to Interpreting Model Predictions

Link: https://papers.nips.cc/paper/2017/file/8a20a8621978632d76c43dfd28b67767-Paper.pdf

Graphic:

Excerpt:

Understanding why a model makes a certain prediction can be as crucial as the prediction’s accuracy in many applications. However, the highest accuracy for large modern datasets is often achieved by complex models that even experts struggle to interpret, such as ensemble or deep learning models, creating a tension between accuracy and interpretability. In response, various methods have recently been proposed to help users interpret the predictions of complex models, but it is often unclear how these methods are related and when one method is preferable over another. To address this problem, we present a unified framework for interpreting predictions, SHAP (SHapley Additive exPlanations). SHAP assigns each feature an importance value for a particular prediction. Its novel components include: (1) the identification of a new class of additive feature importance measures, and (2) theoretical results showing there is a unique solution in this class with a set of desirable properties. The new class unifies six existing methods, notable because several recent methods in the class lack the proposed desirable properties. Based on insights from this unification, we present new methods that show improved computational performance and/or better consistency with human intuition than previous approaches.
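
The additive idea is easy to see in miniature. The brute-force sketch below computes exact Shapley values for one prediction of a small model by averaging each feature's marginal contribution over all coalitions, imputing "absent" features with a background mean; this illustrates the underlying game-theoretic definition, not the efficient approximations (Kernel SHAP, Tree SHAP) that the paper and the shap package provide.

```python
from itertools import combinations
from math import factorial
import numpy as np
from sklearn.linear_model import LinearRegression

# Small model and a background point used to impute "absent" features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2 * X[:, 0] - 1 * X[:, 1] + 0.5 * X[:, 2]
model = LinearRegression().fit(X, y)
background = X.mean(axis=0)

def value(x, subset):
    """Model output with features outside `subset` set to the background mean."""
    z = background.copy()
    z[list(subset)] = x[list(subset)]
    return model.predict(z.reshape(1, -1))[0]

def shapley_values(x):
    n = x.size
    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            for S in combinations(others, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi[i] += weight * (value(x, S + (i,)) - value(x, S))
    return phi

x = X[0]
phi = shapley_values(x)
print("shapley values:", np.round(phi, 3))
# Additivity: contributions sum to the prediction minus the baseline prediction.
print(np.isclose(phi.sum(), model.predict(x.reshape(1, -1))[0] - value(x, ())))
```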

Author(s): Scott M. Lundberg, Su-In Lee

Publication Date: 2017

Publication Site: Conference on Neural Information Processing Systems

Interpretable Machine Learning: A Guide for Making Black Box Models Explainable

Link: https://christophm.github.io/interpretable-ml-book/

Graphic:

Excerpt:

Machine learning has great potential for improving products, processes and research. But computers usually do not explain their predictions which is a barrier to the adoption of machine learning. This book is about making machine learning models and their decisions interpretable.

After exploring the concepts of interpretability, you will learn about simple, interpretable models such as decision trees, decision rules and linear regression. Later chapters focus on general model-agnostic methods for interpreting black box models like feature importance and accumulated local effects and explaining individual predictions with Shapley values and LIME.

All interpretation methods are explained in depth and discussed critically. How do they work under the hood? What are their strengths and weaknesses? How can their outputs be interpreted? This book will enable you to select and correctly apply the interpretation method that is most suitable for your machine learning project.
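
As an example of the model-agnostic methods the book covers, the sketch below computes permutation feature importance: shuffle one feature at a time on held-out data and record how much the model's score drops. It uses scikit-learn's utility on a standard toy dataset and is an illustration, not code from the book.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

# Shuffle each feature on held-out data and record the drop in R^2.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)

for name, mean_drop in sorted(zip(X.columns, result.importances_mean),
                              key=lambda t: -t[1]):
    print(f"{name:>6}: {mean_drop:.3f}")
```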

Author(s): Christoph Molnar

Publication Date: 2021-06-14

Publication Site: GitHub

Idea Behind LIME and SHAP

Link: https://towardsdatascience.com/idea-behind-lime-and-shap-b603d35d34eb

Graphic:

Excerpt:

In machine learning, there has been a trade-off between model complexity and model performance. Complex machine learning models, e.g. deep learning models (which perform better than interpretable models such as linear regression), have been treated as black boxes. The research paper by Ribeiro et al. (2016) titled “Why Should I Trust You” aptly encapsulates the issue with ML black boxes. Model interpretability is a growing field of research. Please read here for the importance of machine interpretability. This blog discusses the idea behind LIME and SHAP.

Author(s): Ashutosh Nayak

Publication Date: 22 December 2019

Publication Site: Towards Data Science

Sex Bias in Graduate Admissions: Data from Berkeley

Link: https://science.sciencemag.org/content/187/4175/398

Graphic:

Excerpt:

Examination of aggregate data on graduate admissions to the University of California, Berkeley, for fall 1973 shows a clear but misleading pattern of bias against female applicants. Examination of the disaggregated data reveals few decision-making units that show statistically significant departures from expected frequencies of female admissions, and about as many units appear to favor women as to favor men. If the data are properly pooled, taking into account the autonomy of departmental decision making, thus correcting for the tendency of women to apply to graduate departments that are more difficult for applicants of either sex to enter, there is a small but statistically significant bias in favor of women. The graduate departments that are easier to enter tend to be those that require more mathematics in the undergraduate preparatory curriculum. The bias in the aggregated data stems not from any pattern of discrimination on the part of admissions committees, which seem quite fair on the whole, but apparently from prior screening at earlier levels of the educational system. Women are shunted by their socialization and education toward fields of graduate study that are generally more crowded, less productive of completed degrees, and less well funded, and that frequently offer poorer professional employment prospects.

Citation: Science, 7 February 1975, Vol. 187, Issue 4175, pp. 398-404. DOI: 10.1126/science.187.4175.398

Author(s): P. J. Bickel, E. A. Hammel, J. W. O’Connell

Publication Date: 7 February 1975

Publication Site: Science

Opinion: Amid a Pandemic, a Health Care Algorithm Shows Promise and Peril

Excerpt:

In the midst of the uncertainty, Epic, a private electronic health record giant and a key purveyor of American health data, accelerated the deployment of a clinical prediction tool called the Deterioration Index. Built with a type of artificial intelligence called machine learning and in use at some hospitals prior to the pandemic, the index is designed to help physicians decide when to move a patient into or out of intensive care, and is influenced by factors like breathing rate and blood potassium level. Epic had been tinkering with the index for years but expanded its use during the pandemic. At hundreds of hospitals, including those in which we both work, a Deterioration Index score is prominently displayed on the chart of every patient admitted to the hospital.

The Deterioration Index is poised to upend a key cultural practice in medicine: triage. Loosely speaking, triage is an act of determining how sick a patient is at any given moment to prioritize treatment and limited resources. In the past, physicians have performed this task by rapidly interpreting a patient’s vital signs, physical exam findings, test results, and other data points, using heuristics learned through years of on-the-job medical training.

Ostensibly, the core assumption of the Deterioration Index is that traditional triage can be augmented, or perhaps replaced entirely, by machine learning and big data. Indeed, a study of 392 Covid-19 patients admitted to Michigan Medicine found that the index was moderately successful at discriminating between low-risk patients and those who were at high risk of being transferred to an ICU, getting placed on a ventilator, or dying while admitted to the hospital. But last year’s hurried rollout of the Deterioration Index also sets a worrisome precedent, and it illustrates the potential for such decision-support tools to propagate biases in medicine and change the ways in which doctors think about their patients.

Author(s): Vishal Khetpal, Nishant Shah

Publication Date: 27 May 2021

Publication Site: Undark Magazine

Towards Explainability of Machine Learning Models in Insurance Pricing

Link: https://arxiv.org/abs/2003.10674

Paper: https://arxiv.org/pdf/2003.10674.pdf

Citation: arXiv:2003.10674 [q-fin.RM]

Graphic:

Abstract:

Machine learning methods have garnered increasing interest among actuaries in recent years. However, their adoption by practitioners has been limited, partly due to the lack of transparency of these methods, as compared to generalized linear models. In this paper, we discuss the need for model interpretability in property & casualty insurance ratemaking, propose a framework for explaining models, and present a case study to illustrate the framework.

Author(s): Kevin Kuo, Daniel Lupton

Publication Date: 24 March 2020

Publication Site: arXiv