First Advisor

Merchant, William R.

First Committee Member

Larkins, Randy J.

Second Committee Member

Khaledi, Bahedin

Third Committee Member

Apawu, Aaron K.

Degree Name

Doctor of Philosophy

Document Type

Dissertation

Date Created

8-2025

Department

Applied Statistics and Research Methods, College of Education and Behavioral Sciences

Embargo Date

8-2027

Abstract

Missing data poses significant challenges in data-driven decision-making, often leading to biased estimates and reduced statistical power when handled improperly. Although K-nearest neighbors (KNN) imputation is widely adopted for its simplicity and effectiveness, the impact of distance metric selection on its performance under varying data conditions, such as sample size, dimensionality, and proportion of missingness, remains underexplored. This study evaluates the performance of five distance metrics (Euclidean, Mahalanobis, Manhattan, Minkowski, and Chebyshev) in KNN imputation under those conditions. Specifically, it addresses three research questions: (1) How does the choice of distance metric affect KNN imputation accuracy, measured by mean squared error (MSE), mean absolute error (MAE), and root mean squared error (RMSE), as the proportion of missing data increases? (2) How does the choice of distance metric affect accuracy as data dimensionality increases? (3) How does the choice of distance metric affect accuracy as sample size increases? A simulation study was conducted with a fixed neighborhood size of k = 5 to isolate the metric's effect, generating datasets with controlled sample sizes (100, 500, 1000), dimensionalities (50, 200, 500), and proportions of missingness (2%, 15%, 30%) under a missing-at-random mechanism. The performance of each distance metric was evaluated using MSE, RMSE, and MAE, and the Iris dataset was used for empirical validation. Findings indicate that Mahalanobis distance excels in low-missingness scenarios (< 15%) and high-dimensional data, although its performance follows a U-shaped curve with dimensionality and comes at a higher computational cost. Euclidean distance is robust for high missingness (> 20%) and low-dimensional data. Manhattan distance proved consistently less effective. These findings emphasize that metric choice must be data-driven and deliver practical, actionable guidance for optimizing KNN imputation in practice.
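
To make the technique under study concrete, the sketch below shows KNN imputation with a selectable distance metric in Python. It is a minimal sketch, not the dissertation's actual code: the `knn_impute` function, its donor-pool simplification (only fully observed rows serve as neighbors), and the mean-of-donors fill rule are illustrative assumptions, and Mahalanobis distance is omitted here because it additionally requires an estimated covariance matrix.

```python
# Minimal sketch of KNN imputation with a configurable distance metric.
# Assumption: rows with no missing values form the donor pool, and each
# missing entry is filled with the mean of its k nearest donors.
import numpy as np

def knn_impute(X, metric="euclidean", k=5, p=3):
    """Impute NaNs in each row of X using the k nearest complete rows."""
    X = np.asarray(X, dtype=float)
    out = X.copy()
    donors = X[~np.isnan(X).any(axis=1)]  # fully observed rows only

    def dist(row, pool, observed):
        # Distances computed over the observed coordinates of the target row.
        d = pool[:, observed] - row[observed]
        if metric == "euclidean":
            return np.sqrt((d ** 2).sum(axis=1))
        if metric == "manhattan":
            return np.abs(d).sum(axis=1)
        if metric == "chebyshev":
            return np.abs(d).max(axis=1)
        if metric == "minkowski":
            return (np.abs(d) ** p).sum(axis=1) ** (1.0 / p)
        raise ValueError(f"unsupported metric: {metric}")

    for i, row in enumerate(X):
        missing = np.isnan(row)
        if not missing.any():
            continue
        nearest = np.argsort(dist(row, donors, ~missing))[:k]
        out[i, missing] = donors[nearest][:, missing].mean(axis=0)
    return out
```

Swapping `metric="euclidean"` for `"manhattan"`, `"chebyshev"`, or `"minkowski"` changes only the distance computation, which mirrors the study's design of holding k fixed at 5 while varying the metric.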

Abstract Format

html

Places

Greeley, Colorado

Extent

150 pages

Local Identifiers

EsselKaitoo_unco_0161D_11377.pdf

Rights Statement

Copyright is held by the author.

Digital Origin

Born digital
