Use of the K-Nearest Neighbors Method for Imputing Right-Censored Data in Small Cell Lung Cancer Patients

Caecilia A Rahman; Abdul Kudus

doi:10.29313/bcss.v2i2.4615

Caecilia A Rahman, Statistics, Faculty of Mathematics and Natural Sciences
Abdul Kudus

Keywords: Survival Analysis, Right-Censored Data, K-Nearest Neighbors

Abstract

Abstract. Studies usually need complete data for accurate parameter estimation, but survival analysis often encounters incomplete observations, called censored data, which can arise from limited study duration, among other causes. Completing censored data requires imputation, and one method for doing so is K-Nearest Neighbors (KNN). KNN imputation finds, for each censored observation, its K nearest neighbors among the complete observations and fills in the censored value with the value most similar to those neighbors. If the target variable (or attribute) is categorical, the imputed value is the majority category among the neighbors; if it is numeric, the imputed value is the average of the nearest neighbors. This study used data from 121 small cell lung cancer patients from the North Central Cancer Treatment Group in the United States. KNN imputation was used to impute patients' right-censored survival times as the average survival time of the K nearest complete observations. The cens variable serves as the censoring indicator, while the age and arm variables are used to measure the distance between complete and censored observations. The smaller the distance, the closer the neighbor, because the observations have similar characteristics. The average of the K complete observations becomes the imputed value for the censored observation. This study imputed 23 censored observations based on their 46 nearest neighbors (K = 46).

Abstrak. A study usually requires complete data for accurate parameter estimation, but survival analysis frequently encounters incomplete data, called censored data, which can occur because of limited study time and other reasons. Completing such incomplete data requires imputation, and one available method is K-Nearest Neighbors (KNN). KNN imputation is designed to search all instances in the data for the K nearest neighbors of an incomplete observation and then fill in the missing value with the value most similar to its neighbors. If the target variable (or attribute) is categorical, the imputation takes the majority among the neighbors; if the variable is numeric, the imputation uses the average of the nearest neighbors. This study uses data from 121 small cell lung cancer patients from the North Central Cancer Treatment Group in the United States. KNN imputation is used to impute patients' right-censored survival times based on the average survival time of the K nearest complete observations. The cens variable is used as the censoring indicator, while the age and Arm (treatment type) variables are used to measure the distance between complete and censored observations; the smaller the distance, the closer the neighbor, because the observations have similar characteristics. The average of the K complete observations becomes the imputed value for the censored observation. This study imputed 23 censored observations based on their 46 nearest neighbors (K = 46).
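
The imputation procedure summarized in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not name a distance metric, so Euclidean distance on the raw age and arm values is assumed; cens == 1 is assumed to mark a complete (event observed) record and cens == 0 a right-censored one; the function name knn_impute_censored, the record layout, and the toy data are hypothetical.

```python
import math


def knn_impute_censored(records, k):
    """Return new records in which each censored time is replaced by the
    mean time of its k nearest complete records.

    Assumptions (not from the source): cens == 1 means the event was
    observed, cens == 0 means right-censored; distance is Euclidean on
    the unscaled (age, arm) pair.
    """
    complete = [r for r in records if r["cens"] == 1]
    if k < 1 or k > len(complete):
        raise ValueError("k must be between 1 and the number of complete records")

    result = []
    for r in records:
        if r["cens"] == 1:
            # Complete records are copied unchanged.
            result.append({**r, "imputed": False})
            continue
        # Rank complete records by distance to the censored record.
        nearest = sorted(
            complete,
            key=lambda c: math.dist((r["age"], r["arm"]), (c["age"], c["arm"])),
        )[:k]
        # Numeric target: impute with the average of the k neighbors.
        mean_time = sum(c["time"] for c in nearest) / k
        result.append({**r, "time": mean_time, "imputed": True})
    return result


# Hypothetical toy data, not taken from the study.
patients = [
    {"time": 10.0, "cens": 1, "age": 60, "arm": 0},
    {"time": 20.0, "cens": 1, "age": 62, "arm": 0},
    {"time": 40.0, "cens": 1, "age": 75, "arm": 1},
    {"time": 5.0, "cens": 0, "age": 61, "arm": 0},
]
result = knn_impute_censored(patients, k=2)
```

In this toy example the censored patient's two nearest complete neighbors are the first two records, so the imputed time is their mean. The study itself reports K = 46 on 121 patients; the value k=2 here is only for readability.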