Quality in Primary Care Open Access

  • ISSN: 1479-1064
  • Journal h-index: 27
  • Journal CiteScore: 6.64
  • Journal Impact Factor: 4.22
  • Average acceptance to publication time (5-7 days)
  • Average article processing time (30-45 days):
    Less than 5 volumes: 30 days
    8 - 9 volumes: 40 days
    10 and more volumes: 45 days

Abstract

Using Machine Learning to Identify the Risk Factors of Pancreatic Cancer from the NCI PLCO Dataset

Ananya Dutta*

Background: Pancreatic cancer (PC) is a disease with a poor prognosis and a low survival rate, so there is a pressing need to identify its risk factors. The purpose of this study is to identify a subset of factors (also known as features) that predict PC from the Prostate, Lung, Colorectal and Ovarian (PLCO) cancer dataset, which consists of responses from 154,897 participants to 65 questions about demographics, cancer and health history, medication usage, and smoking habits.

Method: Selecting the subset of features that predicts PC with the highest probability poses two challenges: the problem is computationally intractable, and the PLCO dataset is highly imbalanced. We use an innovative method that draws on the dataset in a balanced way without up-sampling or down-sampling. We then apply nine feature selection methods to select the optimal subset of features from the preprocessed and balanced dataset.
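To make the intractability claim concrete, the sketch below counts how many candidate feature subsets an exhaustive search would have to score. It is only an illustration of the combinatorics, not the study's method; the function names are hypothetical, and the feature counts (32 preprocessed risk factors, 65 questionnaire items) are taken from this abstract.

```python
from math import comb


def count_nonempty_subsets(n_features: int) -> int:
    """Number of non-empty feature subsets an exhaustive search would score."""
    return 2 ** n_features - 1


def count_subsets_of_size(n_features: int, k: int) -> int:
    """Number of distinct subsets containing exactly k of the n features."""
    return comb(n_features, k)


# Counts from the abstract: 32 risk factors after preprocessing,
# 65 questionnaire items before preprocessing.
subsets_32 = count_nonempty_subsets(32)
subsets_65 = count_nonempty_subsets(65)
five_of_32 = count_subsets_of_size(32, 5)

print(f"Non-empty subsets of 32 features: {subsets_32:,}")
print(f"Non-empty subsets of 65 features: {subsets_65:,}")
print(f"Subsets of exactly 5 out of 32 features: {five_of_32:,}")
```

Even with 32 features there are more than four billion candidate subsets, each of which would need a model fit and evaluation, which is why heuristic feature selection methods are typically used instead of exhaustive search.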

Results: Our preprocessed dataset consists of 32 risk factors (8 demographic, 5 cancer history, 13 health history, 2 medication usage, and 4 smoking habit factors). Risk factors belonging to cancer and health history, followed by smoking habits, were consistently chosen by the feature selection methods. We also discuss findings in the medical literature that corroborate our results.
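One simple way to read "consistently chosen" is as a vote across methods: a feature counts as consistent if more than a given fraction of the methods select it. The sketch below assumes that interpretation; the abstract does not state how agreement was measured, and the method names, feature names, and selections shown are made-up placeholders rather than results from the study.

```python
from collections import Counter


def tally_votes(selections: dict[str, list[str]]) -> Counter:
    """Count how many selection methods picked each feature."""
    votes = Counter()
    for chosen in selections.values():
        votes.update(set(chosen))  # set() so a method votes at most once per feature
    return votes


def consistently_chosen(selections: dict[str, list[str]],
                        min_fraction: float = 0.5) -> list[str]:
    """Features picked by strictly more than min_fraction of the methods."""
    votes = tally_votes(selections)
    threshold = min_fraction * len(selections)
    return sorted(f for f, count in votes.items() if count > threshold)


# Placeholder data only: hypothetical methods and features, not study results.
toy_selections = {
    "method_a": ["feature_1", "feature_2", "feature_3"],
    "method_b": ["feature_1", "feature_3"],
    "method_c": ["feature_1", "feature_4"],
}

print(tally_votes(toy_selections))
print(consistently_chosen(toy_selections))
```

With nine methods, as in the study, the default threshold would require a feature to be selected by at least five of them.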

Conclusions: The study found that risk factors belonging to cancer and health history are the most prominent ones for PC. In particular, a previous diagnosis of PC was chosen as the most prominent risk factor by the majority of methods. While most of our findings are consistent with the literature, some shed light on novel factors that may not have received due attention from the research community.

Published Date: 2023-08-29; Received Date: 2023-08-01