How (not) to deal with missing data: An economist’s take on a controversial study

Link: https://retractionwatch.com/2024/02/21/how-not-to-deal-with-missing-data-an-economists-take-on-a-controversial-study/

Excerpt:

I was reminded of this student’s clever ploy when Frederik Joelving, a journalist with Retraction Watch, recently contacted me about a published paper written by two prominent economists, Almas Heshmati and Mike Tsionas, on green innovations in 27 countries during the years 1990 through 2018. Joelving had been contacted by a PhD student who had been working with the same data used by Heshmati and Tsionas. The student knew the data in the article had large gaps and was “dumbstruck” by the paper’s assertion these data came from a “balanced panel.” Panel data are cross-sectional data for, say, individuals, businesses, or countries at different points in time. A “balanced panel” has complete cross-section data at every point in time; an unbalanced panel has missing observations. This student knew firsthand there were lots of missing observations in these data.
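To make the balanced/unbalanced distinction concrete, here is a minimal sketch of how one might check a panel for gaps. The function names and the toy observations are hypothetical and are not taken from the study's data.

```python
from itertools import product


def missing_cells(observations, units, periods):
    """Return the (unit, period) pairs that have no observed value.

    `observations` is an iterable of (unit, period, value) tuples;
    a value of None counts as missing.
    """
    observed = {(u, t) for (u, t, value) in observations if value is not None}
    return [(u, t) for u, t in product(units, periods) if (u, t) not in observed]


def is_balanced(observations, units, periods):
    """A panel is balanced when every unit is observed in every period."""
    return not missing_cells(observations, units, periods)


# Hypothetical toy panel: two countries, three years, one gap.
units = ["A", "B"]
periods = [1990, 1991, 1992]
observations = [
    ("A", 1990, 1.0), ("A", 1991, 1.2), ("A", 1992, 1.1),
    ("B", 1990, None), ("B", 1991, 2.3), ("B", 1992, 2.4),
]

print(is_balanced(observations, units, periods))
print(missing_cells(observations, units, periods))
```

A check like this, run once per variable, is the kind of count that would back up or contradict a claim that a data set is a balanced panel.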

The student contacted Heshmati and eventually obtained spreadsheets of the data he had used in the paper. Heshmati acknowledged that, although he and his coauthor had not mentioned this fact in the paper, the data had gaps. He revealed in an email that these gaps had been filled by using Excel’s autofill function: “We used (forward and) backward trend imputations to replace the few missing unit values….using 2, 3, or 4 observed units before or after the missing units.”  

That statement is striking for two reasons. First, far from being a “few” missing values, nearly 2,000 observations for the 19 variables that appear in their paper are missing (13% of the data set). Second, the flexibility of using two, three, or four adjacent values is concerning. Joelving played around with Excel’s autofill function and found that changing the number of adjacent units had a large effect on the estimates of missing values.

Joelving also found that Excel’s autofill function sometimes generated negative values, which were, in theory, impossible for some data. For example, Korea is missing R&Dinv (green R&D investments) data for 1990-1998. Heshmati and Tsionas used Excel’s autofill with three years of data (1999, 2000, and 2001) to create data for the nine missing years. The imputed values for 1990-1996 were negative, so the authors set these equal to the positive 1997 value.
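The two problems described above (sensitivity to the number of adjacent values, and impossible negative values) can be reproduced with a small sketch. It assumes that the autofill step behaves like a least-squares linear trend fitted to the nearest observed years and extended backward; the observed values are hypothetical, not the Korean R&Dinv figures, and `replace_negatives` is only an illustration of the patch described in the excerpt, not the authors' actual procedure.

```python
def linear_trend(xs, ys):
    """Ordinary least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def backcast(observed, window, target_years):
    """Fit a trend to the earliest `window` observed years and extend it backward."""
    years = sorted(observed)[:window]
    slope, intercept = linear_trend(years, [observed[y] for y in years])
    return {year: intercept + slope * year for year in target_years}


def replace_negatives(imputed):
    """Overwrite negative imputed values with the earliest positive imputed value."""
    first_positive_year = min(y for y, v in imputed.items() if v > 0)
    fill = imputed[first_positive_year]
    return {y: (fill if v < 0 else v) for y, v in imputed.items()}


# Hypothetical series observed only from 1999 onward.
observed = {1999: 10.0, 2000: 14.0, 2001: 18.0, 2002: 19.0}
missing_years = range(1990, 1999)

from_three = backcast(observed, 3, missing_years)  # trend from 1999-2001
from_four = backcast(observed, 4, missing_years)   # trend from 1999-2002

for year in missing_years:
    print(year, round(from_three[year], 2), round(from_four[year], 2))

patched = replace_negatives(from_three)
```

With these made-up inputs, the three-year trend goes negative for the earliest years while the four-year trend gives different values for every missing year, which is the kind of analyst-chosen flexibility the excerpt flags.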

Author(s): Gary Smith

Publication Date: 21 Feb 2024

Publication Site: Retraction Watch

Exclusive: Elsevier to retract paper by economist who failed to disclose data tinkering

Link: https://retractionwatch.com/2024/02/22/exclusive-elsevier-to-retract-paper-by-economist-who-failed-to-disclose-data-tinkering/

Excerpt:

A paper on green innovation that drew sharp rebuke for using questionable and undisclosed methods to replace missing data will be retracted, its publisher told Retraction Watch.

Previous work by one of the authors, a professor of economics in Sweden, is also facing scrutiny, according to another publisher. 

As we reported earlier this month, Almas Heshmati of Jönköping University mended a dataset full of gaps by liberally applying Excel’s autofill function and copying data between countries – operations other experts described as “horrendous” and “beyond concern.”

Heshmati and his coauthor, Mike Tsionas, a professor of economics at Lancaster University in the UK who died recently, made no mention of missing data or how they dealt with them in their 2023 article, “Green innovations and patents in OECD countries.” Instead, the paper gave the impression of a complete dataset. One economist argued in a guest post on our site that there was “no justification” for such lack of disclosure.

Elsevier, in whose Journal of Cleaner Production the study appeared, moved quickly on the new information. A spokesperson for the publisher told us yesterday: “We have investigated the paper and can confirm that it will be retracted.”

Author(s): Frederik Joelving

Publication Date: 22 Feb 2024

Publication Site: Retraction Watch

Using First Name Information to Improve Race and Ethnicity Classification

Link: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2763826

Abstract:

This paper uses a recent first name list to improve on a previous Bayesian classifier, the Bayesian Improved Surname Geocoding (BISG) method, which combines surname and geography information to impute missing race and ethnicity. The proposed approach, Bayesian Improved First Name Surname Geocoding (BIFSG), is validated using a large mortgage lending dataset for which race and ethnicity are reported. The new approach results in improvements in accuracy and in coverage over BISG for all major ethno-racial categories. The largest improvements occur for non-Hispanic Blacks, a group for which the BISG performance is weakest. Additionally, when estimating disparities in mortgage pricing and underwriting among ethno-racial groups with regression models, the disparity estimates based on either BIFSG or BISG proxies are remarkably close to those based on actual race and ethnicity. Following evaluation, I demonstrate the application of BIFSG to the imputation of missing race and ethnicity in the Home Mortgage Disclosure Act (HMDA) data, and in the process, offer novel evidence that race and ethnicity are somewhat correlated with the incidence of missing race/ethnicity information.
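As a rough illustration of the Bayesian update behind these classifiers, the sketch below combines a surname-based probability with a geography likelihood (the BISG-style step) and, optionally, a first-name likelihood (the step added by BIFSG, Bayesian Improved First Name Surname Geocoding). It assumes the usual naive-Bayes simplification that surname, geography, and first name are conditionally independent given race/ethnicity. The group labels and all probabilities are hypothetical placeholders, not values from the paper or from any real name list.

```python
def proxy_posterior(p_group_given_surname, p_geo_given_group, p_first_given_group=None):
    """Posterior group probabilities from surname, geography, and optional first name.

    posterior(g) is proportional to
        P(g | surname) * P(geography | g) * P(first name | g)
    where the last factor is skipped when no first-name table is supplied.
    """
    scores = {}
    for group, prior in p_group_given_surname.items():
        score = prior * p_geo_given_group[group]
        if p_first_given_group is not None:
            score *= p_first_given_group[group]
        scores[group] = score
    total = sum(scores.values())
    return {group: score / total for group, score in scores.items()}


# Hypothetical inputs for one individual.
p_group_given_surname = {"group_a": 0.6, "group_b": 0.4}
p_geo_given_group = {"group_a": 0.001, "group_b": 0.003}
p_first_given_group = {"group_a": 0.002, "group_b": 0.0005}

surname_geo_only = proxy_posterior(p_group_given_surname, p_geo_given_group)
with_first_name = proxy_posterior(
    p_group_given_surname, p_geo_given_group, p_first_given_group
)

print(surname_geo_only)
print(with_first_name)
```

In this toy case the first name shifts the most likely group, which is how an additional name signal can change a classification that surname and geography alone would get wrong.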

Author(s):

Ioan Voicu
Office of the Comptroller of the Currency (OCC)

Publication Date: February 22, 2016

Publication Site: SSRN

Suggested Citation:

Voicu, Ioan, Using First Name Information to Improve Race and Ethnicity Classification (February 22, 2016). Available at SSRN: https://ssrn.com/abstract=2763826 or http://dx.doi.org/10.2139/ssrn.2763826