How Costly is Noise? Data and Disparities in Consumer Credit

Link: https://arxiv.org/abs/2105.07554

Cite:

arXiv:2105.07554 [econ.GN]

Abstract:

We show that lenders face more uncertainty when assessing default risk of historically under-served groups in US credit markets and that this information disparity is a quantitatively important driver of inefficient and unequal credit market outcomes. We first document that widely used credit scores are statistically noisier indicators of default risk for historically under-served groups. This noise emerges primarily through the explanatory power of the underlying credit report data (e.g., thin credit files), not through issues with model fit (e.g., the inability to include protected class in the scoring model). Estimating a structural model of lending with heterogeneity in information, we quantify the gains from addressing these information disparities for the US mortgage market. We find that equalizing the precision of credit scores can reduce disparities in approval rates and in credit misallocation for disadvantaged groups by approximately half.

Author(s): Laura Blattner, Scott Nelson

Publication Date: 17 May 2021

Publication Site: arXiv

Bias isn’t the only problem with credit scores—and no, AI can’t help

Link: https://www.technologyreview.com/2021/06/17/1026519/racial-bias-noisy-data-credit-scores-mortgage-loans-fairness-machine-learning/

Excerpt:

But in the biggest-ever study of real-world mortgage data, economists Laura Blattner at Stanford University and Scott Nelson at the University of Chicago show that differences in mortgage approval between minority and majority groups are not just down to bias, but to the fact that minority and low-income groups have less data in their credit histories.

This means that when this data is used to calculate a credit score, and that credit score is used to make a prediction on loan default, the prediction will be less precise. It is this lack of precision that leads to inequality, not just bias.

…..

But Blattner and Nelson show that adjusting for bias had no effect. They found that a minority applicant’s score of 620 was indeed a poor proxy for her creditworthiness but that this was because the error could go both ways: a 620 might be 625, or it might be 615.
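A toy simulation makes the distinction concrete. The sketch below is mine, not Blattner and Nelson's model: the noise is mean-zero, so the score is unbiased for both groups, yet decisions taken at a fixed cutoff misclassify more applicants in the group with the noisier score.

import numpy as np

rng = np.random.default_rng(42)
n = 100_000
true_score = rng.normal(620, 50, n)   # latent creditworthiness (illustrative units)
cutoff = 620                          # hypothetical approval cutoff

for label, noise_sd in [("thick-file group", 10), ("thin-file group", 30)]:
    observed = true_score + rng.normal(0, noise_sd, n)  # mean-zero noise: no bias
    bias = observed.mean() - true_score.mean()
    wrong = np.mean((observed >= cutoff) != (true_score >= cutoff))
    print(f"{label}: bias = {bias:+.2f}, misclassified at cutoff = {wrong:.1%}")

Both groups show bias near zero, but the thin-file group has roughly three times as many applicants approved or denied in error, which is exactly the "620 might be 615 or 625" problem.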

Author(s): Will Douglas Heaven

Publication Date: 17 June 2021

Publication Site: MIT Tech Review

Python for Actuaries

Link: https://www.pathlms.com/cas/courses/15577/webinars/7402

Slides: https://cdn.fs.pathlms.com/p3Z78DJJRFWoqdziCQyf?_ga=2.2405433.801394078.1623949999-2118863750.1623949999#/

Description:

Explaining why actuaries may want to use the Python language in their work, and providing a demo. A free recorded webcast from the CAS.
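As a flavor of the kind of task such a demo covers (this sketch is not from the webinar slides, and the figures are made up), here is a classic actuarial calculation in a few lines of pandas: age-to-age loss development factors from a cumulative loss triangle.

import pandas as pd

# Cumulative paid losses: accident years (rows) by development age in months (columns)
triangle = pd.DataFrame(
    {12: [1000, 1100, 1200], 24: [1500, 1650, None], 36: [1800, None, None]},
    index=[2018, 2019, 2020],
)

# Age-to-age factors: each age's losses divided into the next age's losses,
# computed only where both ages are observed
ata = triangle.shift(-1, axis=1) / triangle
print(ata.mean())  # simple-average development factor at each age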

Author(s): Brian Fannin, John Bogaardt

Publication Date: 6 February 2020

Publication Site: CAS Online Learning

Rebekah Jones’s Lies about Florida COVID Data Keep Piling Up

Link: https://www.nationalreview.com/2021/06/rebekah-joness-lies-about-florida-covid-data-keep-piling-up/

Excerpt:

One of the most persistent falsehoods of the COVID pandemic has been the claim that Florida has been “hiding” data. This idea has been advanced primarily by Rebekah Jones, a former Florida Department of Health employee, who, having at first expressed only some modest political disagreements with the way in which Florida responded to COVID, has over time become a fountain of misinformation.

…..

To understand what is happening here, one needs to go back to the beginning. Over the past 15 months, Florida has published a truly remarkable amount of COVID-related data. At the heart of this trove has been a well-maintained list of literally every documented case of COVID — listed by county, age, and gender, and replete with information about whether the patient had recently traveled, had visited the ER, had been hospitalized, and had had any known contact with other Floridians. To my knowledge, Florida has been the only state in the union that has published this kind of data.

…..

To this day, you can download Florida’s case-line data and see 21 cases of COVID that, despite having been identified between March 2020 and December 2020, feature a December 2019 “Event Date.” To anyone who understands data, these results are clearly the product of the system having assigned a non-null default value when no data has been entered. To the Miami Herald, however, these results hinted at scandal. Even now, when its reporters know beyond any doubt that their initial instincts were wrong, the Herald continues to tell its readers that these entries serve as “evidence of community spread potentially months earlier than previously reported.” This is not true.
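The check Shapiro describes is easy to run yourself. A small sketch (with a hypothetical data layout, since the actual column names vary): flag rows whose "event date" falls implausibly far before the date the case was identified, the signature of a default value standing in for a blank field.

import pandas as pd

cases = pd.DataFrame({
    "case_id": [1, 2, 3],
    "event_date": pd.to_datetime(["2019-12-31", "2020-06-10", "2020-07-02"]),
    "identified": pd.to_datetime(["2020-08-15", "2020-06-12", "2020-07-03"]),
})

# An event date months before identification is almost certainly a system
# default filled in for a blank field, not evidence of earlier spread.
suspect = cases[cases["identified"] - cases["event_date"] > pd.Timedelta(days=60)]
print(suspect)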

Author(s): Matt Shapiro

Publication Date: 8 June 2021

Publication Site: National Review

Rebekah Jones, the COVID Whistleblower Who Wasn’t

Link: https://www.nationalreview.com/2021/05/rebekah-jones-the-covid-whistleblower-who-wasnt/

Excerpt:

There is an extremely good reason that nobody in the Florida Department of Health has sided with Jones. It’s the same reason that there has been no devastating New York Times exposé about Florida’s “real” numbers. That reason? There is simply no story here. By all accounts, Rebekah Jones is a talented developer of GIS dashboards. But that’s all she is. She’s not a data scientist. She’s not an epidemiologist. She’s not a doctor. She didn’t “build” the “data system,” as she now claims, nor is she a “data manager.” Her role at the FDOH was to serve as one of the people who export other people’s work—from sets over which she had no control—and to present it nicely on the state’s dashboard. To understand just how far removed Jones really is from the actual data, consider that even now—even as she rakes in cash from the gullible to support her own independent dashboard—she is using precisely the same FDOH data used by everyone else in the world. Yes, you read that right: Jones’s “rebel” dashboard is hooked up directly to the same FDOH that she pretends daily is engaged in a conspiracy. As Jones herself confirmed on Twitter: “I use DOH’s data. If you access the data from both sources, you’ll see that it is identical.” She just displays them differently.

Or, to put it more bluntly, she displays them badly. When you get past all of the nonsense, what Jones is ultimately saying is that the State of Florida—and, by extension, the Centers for Disease Control and Prevention—has not processed its data in the same way that she would if she were in charge. But, frankly, why would it? Again, Jones isn’t an epidemiologist, and her objections, while compelling to the sort of low-information political obsessive she is so good at attracting, betray a considerable ignorance of the material issues. In order to increase the numbers in Florida’s case count, Jones counts positive antibody tests as cases. But that’s unsound, given that (a) those positives include people who have already had COVID-19 or who have had the vaccine, and (b) Jones is unable to avoid double-counting people who have taken both an antibody test and a COVID test that came back positive, because the state correctly refuses to publish the names of the people who have taken those tests. Likewise, Jones claims that Florida is hiding deaths because it does not include nonresidents in its headline numbers. But Florida does report nonresident deaths; it just reports them separately, as every state does, and as the CDC’s guidelines demand. Jones’s most recent claim is that Florida’s “excess death” number is suspicious. But that, too, has been rigorously debunked by pretty much everyone who understands what “excess deaths” means in an epidemiological context—including by the CDC; by Daniel Weinberger, an epidemiologist at the Yale School of Public Health; by Lauren Rossen, a statistician at the CDC’s National Center for Health Statistics; and, most notably, by Jason Salemi, an epidemiologist at the University of South Florida, who, having gone to the trouble of making a video explaining calmly why the talking point was false, was then bullied off Twitter by Jones and her followers.

Author(s): Charles C. W. Cooke

Publication Date: 13 May 2021

Publication Site: National Review

COMIC: How I Cope With Pandemic Numbness

Link: https://www.npr.org/sections/goatsandsoda/2021/04/25/987208356/comic-how-i-cope-with-pandemic-numbness

Excerpt:

Each week I check the latest deaths from COVID-19 for NPR. After a while, I didn’t feel any sorrow at the numbers. I just felt numb. I wanted to understand why — and how to overcome that numbness.

Author(s): Connie Hanzhang Jin

Publication Date: 25 April 2021

Publication Site: Goats and Soda at NPR

An Alternative to the Correlation Coefficient That Works For Numeric and Categorical Variables

Link: https://rviews.rstudio.com/2021/04/15/an-alternative-to-the-correlation-coefficient-that-works-for-numeric-and-categorical-variables/

Excerpt:

Using an insight from Information Theory, we devised a new metric – the x2y metric – that quantifies the strength of the association between pairs of variables.

The x2y metric has several advantages:

It works for all types of variable pairs (continuous-continuous, continuous-categorical, categorical-continuous and categorical-categorical)

It captures linear and non-linear relationships

Perhaps best of all, it is easy to understand and use.

I hope you give it a try in your work.
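Ramakrishnan's post implements the metric in R; the sketch below is my Python reconstruction of the core idea for the numeric-numeric case: measure the percent reduction in prediction error when a simple decision tree predicts y from x, versus naively predicting the mean.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def x2y_numeric(x, y):
    """Percent reduction in mean absolute error from knowing x (0 = none, 100 = perfect)."""
    baseline = np.mean(np.abs(y - np.mean(y)))          # error of always predicting the mean
    tree = DecisionTreeRegressor(min_samples_leaf=10).fit(x.reshape(-1, 1), y)
    model_error = np.mean(np.abs(y - tree.predict(x.reshape(-1, 1))))
    return 100 * (1 - model_error / baseline)

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 1000)
y = x**2 + rng.normal(0, 0.5, 1000)   # strong non-linear relationship, near-zero correlation
print(round(x2y_numeric(x, y), 1))    # high x2y despite a Pearson correlation near zero

The categorical cases follow the same recipe, swapping in a classification tree and a most-frequent-class baseline.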

Author(s): Rama Ramakrishnan

Publication Date: 15 April 2021

Publication Site: R Views

Error-riddled data sets are warping our sense of how good AI really is

Link: https://www.technologyreview.com/2021/04/01/1021619/ai-data-errors-warp-machine-learning-progress/

Paper link: https://arxiv.org/pdf/2103.14749.pdf

Excerpt:

Yes, but: In recent years, studies have found that these data sets can contain serious flaws. ImageNet, for example, contains racist and sexist labels as well as photos of people’s faces obtained without consent. The latest study now looks at another problem: many of the labels are just flat-out wrong. A mushroom is labeled a spoon, a frog is labeled a cat, and a high note from Ariana Grande is labeled a whistle. The ImageNet test set has an estimated label error rate of 5.8%. Meanwhile, the test set for QuickDraw, a compilation of hand drawings, has an estimated error rate of 10.1%.
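The paper behind the story (Northcutt et al.) finds these errors at scale with a technique called confident learning. The sketch below is a deliberately crude, self-contained stand-in for the idea, not the paper's method: flag examples where a model, predicting out of sample, confidently disagrees with the given label.

import numpy as np

def flag_suspect_labels(pred_probs, labels, threshold=0.9):
    """pred_probs: (n, k) out-of-sample class probabilities; labels: (n,) given labels."""
    top = pred_probs.argmax(axis=1)
    confident = pred_probs.max(axis=1) >= threshold
    return np.where(confident & (top != labels))[0]   # indices of likely mislabels

# Toy example: three items, three classes; item 2 is labeled 0 but the model
# assigns 95% probability to class 2, so it gets flagged for review.
probs = np.array([[0.80, 0.10, 0.10],
                  [0.20, 0.70, 0.10],
                  [0.02, 0.03, 0.95]])
print(flag_suspect_labels(probs, np.array([0, 1, 0])))  # -> [2]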

Author(s): Karen Hao

Publication Date: 1 April 2021

Publication Site: MIT Tech Review

How a Software Error Made Spain’s Child COVID-19 Mortality Rate Skyrocket

Link: https://slate.com/technology/2021/03/excel-error-spain-child-covid-death-rate.html

Excerpt:

“Even though I didn’t know what the problem was, I knew it wasn’t the right data,” Soler realized once he got his hands on the Lancet paper. “Our data is not worse than other countries. I would say it is even better,” he says. Pediatricians across the nation contacted Spain’s main research institutes, as well as hospitals and regional governments. Eventually, they discovered that the national government somehow misreported the data. It’s hard to pinpoint exactly what went wrong, but Soler says the main issue is that patient deaths for those over 100 were recorded as children. He believes that the system couldn’t record three-digit numbers, and so instead registered them as one-digit. For example, a 102-year-old was registered as a 2-year-old in the system. Soler notes that not all centenarian deaths were misreported as children, but at least 47 were. This inflated the child mortality rate so much, Soler explains, because the number of children who had died was so small. Any tiny mistake causes a huge change in the data.
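One plausible mechanism for the bug (my guess; the article does not identify the exact code path) is a storage field too narrow for three-digit ages, which silently keeps only the trailing digits:

def store_age(age: int) -> int:
    return int(str(age)[-2:])   # lossy: a two-character field drops the leading "1" from 102

print(store_age(102))  # -> 2: a 102-year-old becomes a 2-year-old
print(store_age(47))   # -> 47: two-digit ages pass through untouched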

Author(s): Elena Debré

Publication Date: 25 March 2021

Publication Site: Slate

America’s Coronavirus Catastrophe Began With Data

Link: https://www.defenseone.com/ideas/2021/03/americas-coronavirus-catastrophe-began-data/172686/

Excerpt:

The consequences of this testing shortage, we realized, could be cataclysmic. A few days later, we founded the COVID Tracking Project at The Atlantic with Erin Kissane, an editor, and Jeff Hammerbacher, a data scientist. Every day last spring, the project’s volunteers collected coronavirus data for every U.S. state and territory. We assumed that the government had these data, and we hoped a small amount of reporting might prod it into publishing them.

Not until early May, when the CDC published its own deeply inadequate data dashboard, did we realize the depth of its ignorance. And when the White House reproduced one of our charts, it confirmed our fears: The government was using our data. For months, the American government had no idea how many people were sick with COVID-19, how many were lying in hospitals, or how many had died. And the COVID Tracking Project at The Atlantic, started as a temporary volunteer effort, had become a de facto source of pandemic data for the United States.

Author(s): Robinson Meyer and Alexis C. Madrigal, The Atlantic

Publication Date: 15 March 2021

Publication Site: Defense One

Finding ‘Anomalies’ Illustrates 2020 Census Quality Checks Are Working

Link: https://www.census.gov/newsroom/blogs/random-samplings/2021/03/finding_anomalies.html?utm_campaign=20210309msc20s1ccpuprs&utm_medium=email&utm_source=govdelivery

Excerpt:

So far in 2020 Census processing, 27 of the 33 anomalies we’ve found are of this type. Let me give a couple of examples.

Miscalculating age for missing birthdays. We found that our system was miscalculating ages for people who included their year of birth but left their birthday and month blank. We fixed this with a simple code correction. Making sure ages calculate correctly helps us with other data processing steps for matching and removing duplicate responses.

Incorrectly sorting out self-responses from group quarters residents. The 2020 Census allowed people to respond online or by phone without using the pre-assigned Census ID that links their response to their address. As a result, some people who live in group quarters facilities, such as nursing homes, were able to respond on their own even though they were also counted through the separate Group Quarters Enumeration operation. This also makes their address show up as a duplicate — as both a group quarters facility and a housing unit. Our business rules sort out these duplicate responses and addresses by accepting the response coming from the group quarters operation and removing the response and address appearing as a housing unit. We found an error in how this rule was being carried out. The code was correctly removing the duplicate address but wasn’t removing the duplicate response. We fixed this with another code correction, which enables us to avoid overcounting these residents. 
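The first fix is easy to picture. Here is a hypothetical sketch (the Census Bureau has not published its code) of an age calculation that degrades gracefully when the day and month are blank instead of miscalculating:

from datetime import date

def age_on(census_day: date, birth_year: int, month=None, day=None) -> int:
    if month is None or day is None:
        # Without a day and month we can't tell whether the birthday has
        # passed, so fall back to the year difference (at most one year high).
        return census_day.year - birth_year
    had_birthday = (census_day.month, census_day.day) >= (month, day)
    return census_day.year - birth_year - (0 if had_birthday else 1)

print(age_on(date(2020, 4, 1), 1980))         # 40 (day and month unknown)
print(age_on(date(2020, 4, 1), 1980, 6, 15))  # 39 (birthday not yet reached)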

Author(s): Michael Thieme, Assistant Director for Decennial Census Programs, Systems and Contracts

Publication Date: 9 March 2021

Publication Site: U.S. Census Bureau