First Advisor
Merchant, William R.
First Committee Member
Larkins, Randy J.
Second Committee Member
Khaledi, Bahedin
Third Committee Member
Apawu, Aaron K.
Degree Name
Doctor of Philosophy
Document Type
Dissertation
Date Created
8-2025
Department
College of Education and Behavioral Sciences, Applied Statistics and Research Methods, ASRM Student Work
Embargo Date
8-2027
Abstract
Missing data poses significant challenges in data-driven decision-making, often leading to biased estimates and reduced statistical power when improperly handled. Although K-nearest neighbors (KNN) imputation is widely adopted for its simplicity and effectiveness, the impact of distance metric selection on its performance under varying data conditions, such as sample size, dimensionality, and proportion of missingness, remains underexplored. This study evaluates the performance of five distance metrics (Euclidean, Mahalanobis, Manhattan, Minkowski, and Chebyshev) in KNN imputation under these conditions. Specifically, it addresses three research questions: (1) How does the choice of distance metric affect KNN imputation accuracy, measured by mean squared error (MSE), mean absolute error (MAE), and root mean squared error (RMSE), as the proportion of missing data increases? (2) How does it affect accuracy as data dimensionality increases? (3) How does it affect accuracy as sample size increases? A simulation study was conducted with a fixed neighborhood size of k = 5 to isolate the metric's effect, generating datasets with controlled sample sizes (100, 500, 1000), dimensionalities (50, 200, 500), and proportions of missingness (2%, 15%, 30%) under a missing-at-random mechanism. The performance of each distance metric was evaluated using MSE, RMSE, and MAE. Additionally, the Iris dataset was used for empirical validation. Findings indicate that Mahalanobis distance excels in low-missingness scenarios (< 15%) and high-dimensional data, though its performance shows a U-shaped curve with dimensionality and comes at a higher computational cost. Euclidean distance is robust for high missingness (> 20%) and low-dimensional data. Manhattan distance, however, proved consistently less effective. These findings emphasize that the choice of metric must be data-driven and offer practical, actionable guidance for optimizing KNN imputation in practice.
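The imputation procedure the abstract describes can be sketched as follows. This is an illustrative NumPy sketch, not the dissertation's implementation: it assumes donors are drawn from complete-case rows, fills each missing entry with the mean of the k nearest donors, and shows only three of the five metrics studied (Euclidean, Manhattan, Chebyshev) for brevity.

```python
import numpy as np

def knn_impute(X, k=5, metric="euclidean"):
    """Impute NaN entries of X using the k nearest complete rows.

    Illustrative sketch only: distances are computed over each
    incomplete row's observed columns, and missing entries are
    filled with the mean of the k closest complete-case donors.
    """
    X = np.asarray(X, dtype=float)
    out = X.copy()
    complete = ~np.isnan(X).any(axis=1)   # rows with no missing values
    donors = X[complete]
    for i in np.where(~complete)[0]:
        row = X[i]
        obs = ~np.isnan(row)              # observed columns of this row
        diff = donors[:, obs] - row[obs]
        if metric == "euclidean":
            d = np.sqrt((diff ** 2).sum(axis=1))
        elif metric == "manhattan":
            d = np.abs(diff).sum(axis=1)
        elif metric == "chebyshev":
            d = np.abs(diff).max(axis=1)
        else:
            raise ValueError(f"unsupported metric: {metric}")
        nearest = donors[np.argsort(d)[:k]]          # k closest donors
        out[i, ~obs] = nearest[:, ~obs].mean(axis=0) # mean-fill gaps
    return out
```

Swapping in Mahalanobis or Minkowski distance only changes the distance computation; the neighbor search and mean-fill steps are unchanged, which is what lets the simulation isolate the metric's effect at a fixed k.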
Abstract Format
html
Places
Greeley, Colorado
Extent
150 pages
Local Identifiers
EsselKaitoo_unco_0161D_11377.pdf
Rights Statement
Copyright is held by the author.
Digital Origin
Born digital
Recommended Citation
Essel-Kaitoo, Ernest, "Assessing K-Nearest Neighbors Imputation Methods for Missing Data: A Simulation Study" (2025). Dissertations. 1206.
https://digscholarship.unco.edu/dissertations/1206