Quality in Primary Care Open Access

  • ISSN: 1479-1064

Research - (2023) Volume 31, Issue 4

Using Machine Learning to Identify the Risk Factors of Pancreatic Cancer from the NCI PLCO Dataset
Ananya Dutta*
 
Department of Electrical and Communications Engineering, Gauhati University, India
 
*Correspondence: Ananya Dutta, Department of Electrical and Communications Engineering, Gauhati University, India

Received: 01-Aug-2023, Manuscript No. IPQPC-23-17722; Editor assigned: 03-Aug-2023, Pre QC No. IPQPC-23-17722 (PQ); Reviewed: 17-Aug-2023, QC No. IPQPC-23-17722; Revised: 22-Aug-2023, Manuscript No. IPQPC-23-17722 (R); Published: 29-Aug-2023, DOI: 10.36648/1479-1064.23.31.34

Abstract

Background: Pancreatic cancer (PC) is a disease with a poor prognosis and a low survival rate. There is a pressing need to identify the risk factors of this disease. The purpose of this study is to identify a subset of factors (a.k.a. features) as predictors of PC from the Prostate, Lung, Colorectal and Ovarian (PLCO) cancer dataset, which consists of responses to 65 questions about demographics, cancer and health history, medication usage, and smoking habits from 154,897 participants.

Method: There are two challenges to selecting the subset of features that predict PC with highest probability: The problem is computationally intractable, and the PLCO dataset is highly imbalanced. We use an innovative method to use the dataset in a balanced way, without involving up or down-sampling. We use nine feature selection methods to select the optimal subset of features from the preprocessed and balanced dataset.

Results: Our preprocessed dataset consists of 32 risk factors (8 demographics, 5 cancer history, 13 health histories, 2 medication usage, 4 smoking habits). Risk factors belonging to cancer and health history, followed by smoking habits, were consistently chosen by the feature selection methods. We also discuss findings in the medical sciences literature that corroborate our findings.

Conclusions: The study found that risk factors belonging to cancer and health history are the most prominent ones for PC. In particular, a previous diagnosis of PC is chosen as the most prominent risk factor by the majority of methods. While most of our findings are consistent with the literature, some shed light on novel factors that may not have received due attention from the research community.

Keywords

Pancreatic cancer; NCI PLCO dataset; Feature selection; Classification

Introduction

Pancreatic Cancer (PC) is a disease with a poor prognosis and a low survival rate. About 95% of people who contract PC do not survive to the 5-year mark [1]. The pancreas is an internal organ of the human body, surrounded by the duodenum and the small intestine; hence early symptoms are hard to detect [2]. Malignant cells in the pancreas are typically detected at a very advanced stage, when it is impossible to save the patient. There is a pressing need for a prediction model that can lead to early detection of this disease.

Biomarkers for early diagnosis of PC have been investigated (see for example, [3-8]). However, evidence for identified biomarkers has not been very conclusive. Image analysis and machine learning algorithms have been used for distinguishing between benign and malignant tissues in endoscopic ultrasound and computed tomography images (see for example, [9-12]). However, these models can detect PC only at an advanced stage and hence are not very useful.

The purpose of this study is to identify a set of factors as predictors of PC. We use a cancer dataset collected from 154,897 participants, each responding to 65 questions (or factors) about demographics, cancer and health history, medication usage, and smoking habits. There are two challenges to selecting the subset of the 65 factors that predicts PC with the highest probability: the problem is computationally intractable, and the dataset is highly imbalanced. Our approach consists of balancing and preprocessing the dataset, and ranking the risk factors based on their ability to predict PC.

Our study found that risk factors belonging to cancer and health history are the most prominent ones for PC. In particular, a previous diagnosis of PC is chosen as the most prominent risk factor by the majority of methods.

We also discuss findings in the medical sciences literature that corroborate our findings. Some of our findings shed light on novel factors that may not have received their due attention by the research community.

Materials and Methods

Problem Statement

Our problem is to predict whether a subject is diagnosed with PC or not, given information about the subject's demographic characteristics, health history, medication usage, smoking habits, and personal and family cancer diagnosis history. This information is encoded as a vector of predictor variables where each predictor represents a risk factor (a.k.a. feature). The predictors are discrete and finite random variables.

Formally, given a set of data points $X=[x_1, \ldots, x_N] \in \mathbb{N}^{d \times N}$, where $\mathbb{N}=\{0, 1, 2, \ldots\}$, and a set of labels {True, False}, the task is to map each data point $x_i \in \mathbb{N}^d$ to one of the labels, where d is the dimension of each data point and N is the number of data points in the dataset. This is a binary classification problem. Our goal is to select a subset of predictors such that classification using the subset is at least as accurate as that using the entire set. It has been shown that accuracy does not always improve as the number of variables increases [13]; hence choosing the optimal subset of predictors is imperative for accurate prediction of PC.

Materials

The Prostate, Lung, Colorectal and Ovarian (PLCO) cancer dataset [14] was collected by the National Cancer Institute from 154,897 participants. Among them, 76,678 or 49.5% were males, and 132,572 or 85.6% were non-Hispanic White. The participants, randomly selected based on a set of criteria from different parts of the United States, were between 55 and 74 years of age with no history of prostate, lung, colorectal or ovarian cancer. Each participant filled out three questionnaires, thereby responding to 65 questions about demographics, cancer and health history, medication usage, and smoking habits. Therefore, N=154,897 and d=65 for our problem. The dataset is highly imbalanced; only 749 or 0.48% of the participants were diagnosed with pancreatic cancer. Visualizations of the PLCO dataset in two-dimensional (2D) space are shown in Figure 1. Table 1 lists the 65 risk factors.


Figure 1: PLCO dataset visualized in 2D using (a) ADASYN algorithm [15] and (b) t-SNE algorithm [16]. Data points corresponding to PC=True and PC=False are shown in black and gray respectively.

Table 1: The risk factors considered in the PLCO dataset. The ones marked “removed” are not considered in our analysis as there are not enough responses from the participants on these questions

Risk factor category | No. | Risk factor (values) | Male risk factors (total 47, removed 15) | Female risk factors (total 52, removed 20)
--- | --- | --- | --- | ---
Cancer history | 59 | Has relative with cancer (yes, no) | ✓ | ✓
 | 60 | Has relative with PC (yes, no) | ✓ | ✓
 | 61 | No. of relatives with PC (0, 1, 2, 3, ...) | ✓ | ✓
 | 62 | Diagnosed with any cancer (yes, no) | ✓ | ✓
 | 63 | Diagnosed with PC (yes, no) | ✓ | ✓
Demographics | 64 | Gender (male, female) | ✓ | ✓
 | 38 | Race (White, Black, Asian, Pacific Islander, American Indian/Alaskan Native) | ✓ | ✓
 | 39 | Hispanic origin (yes, no) | ✓ | ✓
 | 1 | Education level completed (<8 yrs, 8-11 yrs, 12 yrs, 12 yrs + some college, college grad, post grad) | ✓ | ✓
 | 2 | Marital status (married, widowed, divorced, separated, never married) | ✓ | ✓
 | 3 | Occupation (homemaker, working, unemployed, retired, extended sick leave, disabled, other) | ✓ | ✓
 | 6 | No. of sisters (0, 1, 2, 3, 4, 5, 6, ≥ 7) | ✓ | ✓
 | 7 | No. of brothers (0, 1, 2, 3, 4, 5, 6, ≥ 7) | ✓ | ✓
Medication usage | 8 | Used aspirin regularly (yes, no) | ✓ | ✓
 | 9 | Used ibuprofen regularly (yes, no) | ✓ | ✓
 | 52 | Taken birth control pills (yes, no) | | ✓ (removed)
 | 20 | Age started taking birth control pills (<30 yrs, 30-39 yrs, 40-49 yrs, 50-59 yrs, ≥ 60 yrs) | | ✓ (removed)
 | 21 | Currently taking female hormones (yes, no) | | ✓ (removed)
 | 22 | No. of years taking female hormones (≤ 1, 2-3, 4-5, 6-9, ≥ 10) | | ✓ (removed)
 | 53 | Taken female hormones (yes, no, don’t know) | | ✓ (removed)
Health history | 27 | Had high blood pressure (yes, no) | ✓ | ✓
 | 28 | Had heart attack (yes, no) | ✓ | ✓
 | 29 | Had stroke (yes, no) | ✓ | ✓
 | 30 | Had emphysema (yes, no) | ✓ | ✓
 | 31 | Had bronchitis (yes, no) | ✓ | ✓
 | 32 | Had diabetes (yes, no) | ✓ | ✓
 | 33 | Had colorectal polyps (yes, no) | ✓ | ✓
 | 34 | Had arthritis (yes, no) | ✓ | ✓
 | 35 | Had osteoporosis (yes, no) | ✓ | ✓
 | 36 | Had diverticulitis (yes, no) | ✓ | ✓
 | 37 | Had gall bladder inflammation (yes, no) | ✓ | ✓
 | 57 | Had colon comorbidity (yes, no) | ✓ | ✓
 | 58 | Had liver comorbidity (yes, no) | ✓ | ✓
 | 40 | Had biopsy of prostate (yes, no) | ✓ (removed) |
 | 41 | Had transurethral resection of prostate (yes, no) | ✓ (removed) |
 | 42 | Had prostatectomy for benign disease (yes, no) | ✓ (removed) |
 | 43 | Had prostate surgery (yes, no) | ✓ (removed) |
 | 47 | Had enlarged prostate (yes, no) | ✓ (removed) |
 | 48 | Had inflamed prostate (yes, no) | ✓ (removed) |
 | 49 | Had prostate problem (yes, no) | ✓ (removed) |
 | 50 | No. of times wakes up to urinate at night (0, 1, 2, 3, >3) | ✓ (removed) |
 | 23 | Age started to urinate more than once at night (<30 yrs, 30-39 yrs, 40-49 yrs, 50-59 yrs, 60-69 yrs, ≥ 70 yrs) | ✓ (removed) |
 | 24 | Age when told had enlarged prostate (<30 yrs, 30-39 yrs, 40-49 yrs, 50-59 yrs, 60-69 yrs, ≥ 70 yrs) | ✓ (removed) |
 | 25 | Age when told had inflamed prostate (<30 yrs, 30-39 yrs, 40-49 yrs, 50-59 yrs, 60-69 yrs, ≥ 70 yrs) | ✓ (removed) |
 | 26 | Age at vasectomy (<25 yrs, 25-34 yrs, 35-44 yrs, ≥ 45 yrs) | ✓ (removed) |
 | 51 | Had vasectomy (yes, no) | ✓ (removed) |
 | 44 | Been pregnant (yes, no, don’t know) | | ✓ (removed)
 | 45 | Had hysterectomy (yes, no) | | ✓ (removed)
 | 46 | Had ovaries removed (yes, no) | | ✓ (removed)
 | 10 | No. of tubal pregnancies (0, 1, ≥ 2) | | ✓ (removed)
 | 11 | Had tubal ligation (yes, no, don’t know) | | ✓ (removed)
 | 12 | Had benign ovarian tumor (yes, no) | | ✓ (removed)
 | 13 | Had benign breast disease (yes, no) | | ✓ (removed)
 | 14 | Had endometriosis (yes, no) | | ✓ (removed)
 | 15 | Had uterine fibroid tumors (yes, no) | | ✓ (removed)
 | 16 | Tried to become pregnant without success (yes, no) | | ✓ (removed)
 | 17 | No. of pregnancies (0, 1, 2, 3, 4-9, ≥ 10) | | ✓ (removed)
 | 18 | No. of stillbirth pregnancies (0, 1, ≥ 2) | | ✓ (removed)
 | 19 | Age at hysterectomy (<40 yrs, 40-44 yrs, 45-49 yrs, 50-54 yrs, ≥ 55 yrs) | | ✓ (removed)
Smoking habits | 4 | Smoked pipe (never, currently, formerly) | ✓ | ✓
 | 5 | Smoked cigar (never, currently, formerly) | ✓ | ✓
 | 54 | Smoked cigarettes regularly (yes, no) | ✓ | ✓
 | 55 | Smoke regularly now (yes, no) | ✓ (removed) | ✓ (removed)
 | 56 | Usually filtered or not filtered (filter more often, non-filter more often, both about equally) | ✓ (removed) | ✓ (removed)
 | 65 | No. of cigarettes smoked daily (0, 1-10, 11-20, 21-30, 31-40, 41-60, 61-80, >80) | ✓ | ✓

Dataset Balancing

A balanced dataset contains an equal number of data points in all classes. Usually, an imbalanced dataset is balanced either by downsampling the majority subset, using methods such as fixed-rate downsampling or clustering, or by upsampling the minority subset, using methods such as the SMOTE algorithm [15-17]. Both approaches have drawbacks unless the true distribution generating the data is known. The true distribution is unknown for the current problem.

We use a balancing method, similar to that proposed in [18], whereby the majority subset is iteratively and randomly subsampled such that in each iteration, the sampled subset is balanced. This method refrains from eliminating any data point from or introducing any new data point into the given dataset. A feature selection method is applied independently on each subset. The final result is obtained by computing the mean over all the subsets.
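The balancing idea can be illustrated with a minimal Python sketch (the study itself used MATLAB). It assumes the majority class is shuffled once and split into disjoint chunks, each paired with the full minority class; the function names `balanced_subsets` and `mean_scores` are hypothetical and not taken from [18].

```python
import random

def balanced_subsets(majority, minority, seed=0):
    """Split the majority class into disjoint random chunks and pair each
    chunk with the full minority class, so every subset is roughly balanced.
    Assumption: one shuffle followed by an even partition."""
    rng = random.Random(seed)
    shuffled = list(majority)
    rng.shuffle(shuffled)
    k = max(1, len(shuffled) // len(minority))  # number of balanced subsets
    base, extra = divmod(len(shuffled), k)
    subsets, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        subsets.append(shuffled[start:start + size] + list(minority))
        start += size
    return subsets

def mean_scores(score_fn, subsets):
    """Apply a feature-scoring function to each subset and average the
    scores feature by feature."""
    per_subset = [score_fn(s) for s in subsets]
    d = len(per_subset[0])
    return [sum(scores[j] for scores in per_subset) / len(per_subset)
            for j in range(d)]
```

No data point is discarded or synthesized in this sketch: every majority point appears in exactly one subset, and the minority points are reused in all of them.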

Data Preprocessing

The PLCO dataset has a number of missing values. We employ two steps iteratively to obtain a less incomplete dataset. First, we eliminate factors that are missing responses from more than 10% of the participants, or for which all participants gave the same response. Next, we eliminate participants who did not respond to more than 10% of the remaining factors. The two steps are then applied again to the resulting dataset, and this continues until the dataset does not change between two consecutive iterations.
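The two elimination steps can be sketched as a fixed-point loop. This is an illustrative Python sketch, not the authors' MATLAB code; it assumes the data is a list of equal-length rows with `None` marking a missing response, and the name `prune_missing` is hypothetical.

```python
def prune_missing(rows, max_missing=0.10):
    """Alternately drop sparse or constant columns and sparse rows until
    nothing changes. Returns (kept_rows, kept_column_indices)."""
    cols = list(range(len(rows[0]))) if rows else []
    while True:
        n_before, d_before = len(rows), len(cols)
        # Step 1: drop features missing for more than 10% of participants,
        # or whose observed responses are all identical.
        keep = []
        for j in cols:
            values = [r[j] for r in rows]
            present = [v for v in values if v is not None]
            missing_frac = 1 - len(present) / len(values) if values else 1.0
            if missing_frac <= max_missing and len(set(present)) > 1:
                keep.append(j)
        cols = keep
        # Step 2: drop participants missing more than 10% of remaining features.
        rows = [r for r in rows
                if cols and sum(r[j] is None for j in cols) / len(cols) <= max_missing]
        # Stop when an iteration leaves the dataset unchanged.
        if len(rows) == n_before and len(cols) == d_before:
            return rows, cols
```

Rows keep their original width; the returned column indices say which features survive, so downstream code can select them explicitly.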

Each feature is standardized by subtracting its mean and dividing by its standard deviation. The missing values in the resulting dataset are filled in. The jth element of the ith data point, if missing, is filled by:

Equation 1:

Equation

|.| denotes the absolute value, “m not missing” refers to the mth element of a data point that is not missing, and dist is the absolute value of the cosine similarity (or normalized dot product) of two data points. Therefore, 0 ≤ dist ≤ 1; as two data points get closer, their dist increases. In Eq. 1, a missing element of a given data point is computed as the weighted mean of that element over all data points in which the values of all elements are present, with weights proportional to the absolute cosine similarity. After filling in all missing values, each feature is standardized again.
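One way to read the description of Eq. 1 is the following Python sketch. It rests on assumptions the text does not fully settle: the cosine similarity is computed only over the elements that are present in the incomplete data point, at least one complete data point exists, and a plain mean is used as a fallback when all weights are zero. The names `abs_cosine` and `impute` are hypothetical.

```python
import math

def abs_cosine(a, b):
    """Absolute cosine similarity over the elements of `a` that are not
    missing (None). `b` is assumed to be a complete data point."""
    idx = [m for m, v in enumerate(a) if v is not None]
    dot = sum(a[m] * b[m] for m in idx)
    na = math.sqrt(sum(a[m] ** 2 for m in idx))
    nb = math.sqrt(sum(b[m] ** 2 for m in idx))
    return abs(dot) / (na * nb) if na > 0 and nb > 0 else 0.0

def impute(rows):
    """Fill each missing entry with a similarity-weighted mean of that entry
    over the complete rows (assumes at least one complete row exists)."""
    complete = [r for r in rows if all(v is not None for v in r)]
    filled = []
    for r in rows:
        if all(v is not None for v in r):
            filled.append(list(r))
            continue
        weights = [abs_cosine(r, c) for c in complete]
        total = sum(weights)
        new = list(r)
        for j, v in enumerate(r):
            if v is None:
                if total > 0:
                    new[j] = sum(w * c[j] for w, c in zip(weights, complete)) / total
                else:
                    # Assumed fallback: unweighted mean when all weights vanish.
                    new[j] = sum(c[j] for c in complete) / len(complete)
        filled.append(new)
    return filled
```

Because the weights are absolute similarities, a complete point that is orthogonal to the incomplete one contributes nothing, while a parallel or anti-parallel point contributes fully.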

Variable or Feature Selection

Our problem of selecting the optimal subset of features is intractable, as a total of $O(2^d)$ subsets are possible. Evaluating $O(2^d)$ subsets to determine the optimal one is impractical for the PLCO dataset with d=65. Hence we resort to variable or feature selection methods [19,20]. We used several feature selection algorithms suitable for categorical and continuous features and for classification tasks [21-31], implemented in MATLAB, to rank the features:

• ranking features using chi-square tests (‘fscchi2’ in MATLAB),

• ranking features for classification using the minimum redundancy maximum relevance (MRMR) algorithm (‘fscmrmr’ in MATLAB),

• estimating predictor importance for classification using an ensemble of decision trees (‘fitcensemble’ in MATLAB),

• estimating predictor importance for classification using a binary decision tree (‘fitctree’ in MATLAB),

• estimating predictor importance for classification with an ensemble of bagged decision trees (e.g., random forest), which assigns positive and negative scores to the predictors (‘fitcensemble’ with method ‘bag’ and ‘oobPermutedPredictorImportance’ in MATLAB),

• ranking key features by class separability criteria (‘rankfeatures’ with criteria ‘ttest,’ ‘entropy,’ ‘bhattacharyya,’ ‘roc,’ and ‘wilcoxon’ in MATLAB), and

• Pearson correlation between each feature/predictor variable and the response variable (‘corrcoef’ in MATLAB), with the correlation set to zero if not significant (i.e., p > 0.01).

Figure 2 shows the ranking of the features by each of these algorithms for males and females respectively. A brief description of each of these algorithms is presented in the Appendix.
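To give a sense of scale, d=65 yields 2^65 ≈ 3.7 × 10^19 candidate subsets, which is why univariate ranking is attractive. The Python sketch below illustrates the idea behind chi-square ranking for categorical features. It scores each feature by the raw Pearson chi-square statistic and is not a reproduction of MATLAB's ‘fscchi2’; the function names and the toy feature names are hypothetical.

```python
from collections import Counter

def chi2_statistic(feature, labels):
    """Pearson chi-square statistic of independence between a categorical
    feature and a class label (larger means stronger association)."""
    n = len(feature)
    joint = Counter(zip(feature, labels))
    f_tot = Counter(feature)
    l_tot = Counter(labels)
    stat = 0.0
    for f, nf in f_tot.items():
        for l, nl in l_tot.items():
            expected = nf * nl / n  # count expected under independence
            stat += (joint[(f, l)] - expected) ** 2 / expected
    return stat

def rank_features(columns, labels):
    """Return (feature names sorted from most to least associated, scores)."""
    scores = {name: chi2_statistic(col, labels) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True), scores
```

Under the balancing scheme described earlier, such a scoring function would be applied to each balanced subset separately and the scores averaged.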

 

Results

Our analysis is done on the entire dataset as well as separately on the male and female participants. After preprocessing (see Data Preprocessing), the PLCO dataset containing 65 features and 154,897 points (76,682 male, 749 True, 430 male True) reduces to 32 features and 148,315 points (73,162 male, 706 True, 405 male True). For balancing (see Dataset Balancing), we randomly sample ⌊148315/706⌋=210 non-overlapping subsets for PC=False, each containing 706 or 707 data points. Thus after balancing, each subset contains a total of 1412 or 1413 points. Similarly, for the male-only analysis, we obtain ⌊73162/405⌋=180 balanced subsets, each containing 810 or 811 points. For the female-only analysis, we obtain ⌊75153/301⌋=249 balanced subsets, each containing 602 or 603 points.
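The subset counts quoted above can be checked with a few lines of Python, assuming the bracketed quotients are rounded down and the points are divided as evenly as possible. The helper `split_sizes` is hypothetical.

```python
def split_sizes(n_points, n_subsets):
    """Chunk sizes obtained when n_points are divided as evenly as possible."""
    base, extra = divmod(n_points, n_subsets)
    return [base + 1] * extra + [base] * (n_subsets - extra)

# Counts quoted in the text (after preprocessing).
n_subsets_all = 148315 // 706      # entire dataset
n_subsets_male = 73162 // 405      # male-only analysis
n_subsets_female = 75153 // 301    # female-only analysis

# Chunk sizes for the entire dataset, before adding the 706 PC=True points.
sizes_all = split_sizes(148315, n_subsets_all)
```

Adding the 706 PC=True points to a chunk of 706 or 707 PC=False points gives the 1412 or 1413 points per balanced subset stated in the text.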

Classification

Some machine learning algorithms, briefly described in Section 6.4, were used, and their statistical parameters are reported.

Using classification ensemble: In this ensemble algorithm, the weights or costs can be modified to correctly train the algorithm to predict PC. The weights are normalized to add up to unity, representing the prior probabilities. Suppose $\epsilon_{ij}$ ($i, j \in \{1, \ldots, c\}$, $\epsilon_{ii}=0$) is the cost of misclassifying an example of the ith class as the jth class, where c is the number of classes. Then, the weight assigned to the ith class after rescaling is given as [32]:

Equation 2:

Equation

where n is the number of training samples,

Equation

It uses the algorithms described in [32-34]. For example, we can specify that predicting no PC for subjects with PC (false negative) is 1000 times more serious than predicting PC for subjects with no PC (false positive). Accordingly, we can change the weights to obtain a confusion matrix that suits our needs.

Feature selection: We used several feature selection algorithms [21-31], implemented in MATLAB, to rank the features. Figure 2 shows the ranking of the features by each of these algorithms for males and females respectively.


Figure 2: Weights assigned by 9 feature selection algorithms (columns 1-9) to risk factors in the PLCO dataset.

Finding the probability of feature combinations using a Bayesian network: Russell and Norvig, in their book Artificial Intelligence: A Modern Approach [35], illustrate Bayes' theorem and joint probability. Consider that the symptoms E1, E2 are conditionally independent. Then their co-occurrence is as follows:

Using the above equation,

Equation 3:

Equation

Equation 4:

Equation

Equation 5:

As any individual will either have PC or not have PC with the given symptoms, considering a universal set, P(E1 ∧ E2) can be resolved using normalization as follows:

Equation

Equation 6:

P(¬C) is the probability of non-occurrence of PC. Hence,

Equation

Equation 7:

Substituting equation 3 in equation 6

Equation

Equation 8:

Substituting equation 4 in equation 7,

Equation

Equation 9:

Substituting equation 3 in equation 8,

Equation
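The computation described in this derivation (Bayes' theorem, conditional independence of the symptoms given the class, and normalization over PC and no PC) can be sketched in Python as follows. This is a generic naive-Bayes posterior under those stated assumptions, not a transcription of the paper's numbered equations; the function name and argument names are hypothetical.

```python
def posterior_pc(prior_c, lik_c, lik_not_c):
    """P(C | E1, ..., Ek), assuming the E's are conditionally independent
    given C and given not-C.

    prior_c:    P(C)
    lik_c:      [P(E1 | C), ..., P(Ek | C)]
    lik_not_c:  [P(E1 | not C), ..., P(Ek | not C)]
    """
    joint_c = prior_c
    for p in lik_c:
        joint_c *= p                      # P(E1, ..., Ek, C)
    joint_not_c = 1.0 - prior_c
    for p in lik_not_c:
        joint_not_c *= p                  # P(E1, ..., Ek, not C)
    evidence = joint_c + joint_not_c      # P(E1, ..., Ek) by normalization
    return joint_c / evidence
```

For two symptoms, `posterior_pc(p_c, [p_e1_c, p_e2_c], [p_e1_nc, p_e2_nc])` gives the probability of PC for that feature combination; the normalization step removes the need to estimate P(E1 ∧ E2) directly.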

Discussion

A number of risk factors of PC have been identified [36-38], such as, smoking, obesity, exposure to certain chemicals (e.g., pesticides, benzene, certain dyes, petrochemicals), age (older than 55 years), gender (male), race/ethnicity (Blacks, Ashkenazi Jewish heritage), family history (two or more first-degree relatives with PC), inherited genetic syndromes, diabetes, pancreatic cysts and chronic pancreatitis. Several different genes are associated with increased risk of PC. However, genetic risk factors are beyond the scope of this work as our dataset does not contain genetic information. Table 2 shows the ratio in which each symptom was distributed in the HPT (High Probability Table) chosen by selecting top percentage values of probability for that feature combination and in the LPT (Low Probability Table) chosen by selecting bottom percentage values of probability for that feature combination.

Table 2: Features from the PLCO dataset that are plausible indicators of PC risk

Symptom | Results | Conclusion
--- | --- | ---
Occupation | All subjects in HPT were retired (category 4), and all subjects in LPT were on extended sick leave (category 5) | Older people have a greater risk of PC
Smoked pipe | Subjects in HPT were in ratio 0.22 (never smoked): 0.5 (current smoker): 0.28 (past smoker), whereas subjects in LPT were in ratio 0.37 (never smoked): 0.27 (current smoker): 0.36 (past smoker) | Subjects who never smoked have a lower risk than past smokers, and the risk for current smokers was doubled
Heart attack | Subjects in HPT were in ratio 0.23 (never had heart attack): 0.77 (had heart attack), whereas subjects in LPT were in ratio 0.8 (never had heart attack): 0.2 (had heart attack) | Subjects who had a heart attack at least once have a greater risk of PC
Hypertension | Subjects in HPT were in ratio 0.36 (not diagnosed with hypertension): 0.63 (diagnosed with hypertension), whereas subjects in LPT were in ratio 0.68 (not diagnosed with hypertension): 0.32 (diagnosed with hypertension) | Stress (or hypertension) is directly proportional to the risk of PC
Taken female hormones | Subjects in HPT were in ratio 0.7 (never taken): 0.3 (taken), whereas subjects in LPT were in ratio 0.23 (never taken): 0.77 (taken) | Female hormone use appears to reduce the risk of PC
Race | Subjects in HPT were mostly Asian (0.38) and only 0.3 were Pacific Islander, whereas subjects in LPT were mostly American Indian (0.85) | Asians are at a higher risk of PC, while Pacific Islanders and American Indians are at lower risk
Diabetes | Subjects in HPT were in ratio 0.17 (never had diabetes): 0.83 (had diabetes), whereas subjects in LPT were in ratio 0.75 (never had diabetes): 0.25 (had diabetes) | Diabetes is a clear risk factor for PC
Bronchitis | Subjects in HPT were in ratio 0.27 (never had): 0.73 (had), whereas subjects in LPT were in ratio 0.68 (never had): 0.32 (had) | Bronchitis is a risk factor for PC
Liver comorbidities | Subjects in HPT were in ratio 0.39 (never had): 0.61 (had), whereas subjects in LPT were in ratio 0.62 (never had): 0.38 (had) | Liver comorbidities are a risk factor for PC
Colorectal polyps | Subjects in HPT were in ratio 0.36 (never had): 0.64 (had), whereas subjects in LPT were in ratio 0.62 (never had): 0.38 (had) | Colorectal polyps are a risk factor for PC
Gender | Subjects in HPT were in ratio 0.53 (male): 0.47 (female), whereas subjects in LPT were in ratio 0.35 (male): 0.65 (female) | Males were at higher risk of PC than females
No. of relatives with pancreatic cancer | Subjects in HPT were in ratio 0.02 (no relative): 0.1 (1 relative): 0.88 (2 relatives), whereas subjects in LPT were in ratio 0.71 (no relative): 0.29 (1 relative) | Risk of PC increases as the incidence of PC among family members increases
Ever take birth control pills? | Subjects in HPT were in ratio 0.76 (no history): 0.24 (has history), whereas subjects in LPT were in ratio 0.17 (no history): 0.83 (has history) | Birth control pills may lower the risk of PC
Smoke regularly now? | Subjects in HPT were in ratio 0.12 (no history): 0.88 (has history), whereas subjects in LPT were in ratio 0.95 (no history): 0.05 (has history) | Current smokers have a higher risk of PC
Ever smoke regularly more than 6 months? | Subjects in HPT were in ratio 0.22 (no history): 0.78 (has history), whereas subjects in LPT were in ratio 0.85 (no history): 0.15 (has history) | Smoking for more than 6 months also poses a higher risk of PC

Factors with unclear effect on risk include nature of diet, lack of physical activity, coffee and alcohol consumption, and certain infections (see for example, [38]).

Smoking: Several studies have shown that smoking has a significant relationship with PC (see for example [39-44]). Yadav et al. found that smoking cessation can significantly reduce risk of PC [43]. Raimondi et al. argue that smoking is the most common risk factor and accounts for 20-25% of all pancreatic tumors [44].

Diabetes: Diabetes also has a positive correlation with PC [45]. Huxley et al. show that individuals who have had type-II diabetes for less than four years were at a 50% higher risk of contracting PC than individuals who have had type-II diabetes for more than four years [3]. Everhart et al. have concluded that subjects with long-standing diabetes have a higher relative risk of PC [4]. Ben et al. have also found a similar relationship between diabetes and PC [46]. Liao et al. show that subjects in Taiwan who have had diabetes for less than 2 years are at elevated risk of PC [5]; long-standing diabetes did not pose a strong risk. Also, the concurrent occurrence of diabetes and chronic pancreatitis puts subjects at a higher risk.

Reproductive history in women: Lo et al. have shown that women with 7 or more live births had a lower risk of PC. Lactation period also had a significant effect on the possibility of PC [47]. This study shows that women who lactated for 144 months or more had one-fifth the risk of PC of women who lactated for only 89 months or less. Kreiger et al. have shown that PC is an estrogen-dependent disease and that aspects of reproductive history and hormone replacement are associated with a greater risk of this disease. Reduced risks were observed with 3 or more pregnancies and with the use of oral contraceptives [48].

Marriage: Baine et al. show that marriage improves the survival rate and longevity of patients with PC [31]. This paper also shows, using Kaplan-Meier analysis, that patients who were married had a median survival of 4 months, compared with 3 months for unmarried patients. Aizer et al. have shown that marriage has a beneficial effect on any cancer with regard to detection, treatment and survival [49]. This improvement was observed more in males than in females, highlighting the socio-economic elevation that a married person could have. The paper concluded that “married people were less likely to present metastatic disease, more likely to receive definitive therapy, and less likely to die as a result of their cancer after adjusting for demographics, stage, and treatment than unmarried patients.” Multivariate logistic and Cox regression were used to analyze the patients.

Occupation: Logan et al. show how specific types of occupation pose a higher risk of exposure to carcinogenic substances [50]. In 1961 and 1971, for men, the occupation categories of clothing; food, drink and tobacco; and armed forces had higher Standardised Mortality Rates (SMR) and Relative Standardised Mortality Rates (RSMR), whereas people in the clerical and leather industries saw low SMR and RSMR. Men in the occupation categories of mining; labourers; and service, sport and recreation saw elevated SMR but reduced RSMR. For men in administrative and managerial, and professional and technical disciplines, the trend was reduced SMR and elevated RSMR. In the case of married women, if husbands worked in engineering, leather, wood, sales, clothing or construction work, both SMR and RSMR were high. For wives of husbands working in farm; gas, coke and chemicals; glass and ceramics; and warehouse industries, both SMR and RSMR were low. In 1961, wives of husbands in food, drink and tobacco had high SMR and RSMR, and values were low for husbands in the painting and decoration industry; the trend was reversed in 1971.

Family composition: Gharidian et al. have reported an interesting case in which the disease occurred in two brothers and one sister, all in the seventh decade of their lives [51]. This study was based in Montreal, and there was no history of pancreatitis among the patients or their relatives.

Use of certain medications: Tan et al. have shown that aspirin use decreases the risk of developing PC [52]. Aspirin use for 1 day/month or more was associated with a lower risk of PC than aspirin use for less than 1 day/month. According to this study, there is no relationship between non-aspirin Non-Steroidal Anti-Inflammatory Drugs (NSAID) and PC. Larsson et al. have provided doubtful evidence that regular use of aspirin over a longer duration increases the risk of PC [53]. No relationship was found between frequent use of aspirin (7 tablets or more/week) or prolonged use of aspirin (more than 20 years) and an increase or decrease in PC. Harris et al. have found a relationship between aspirin, ibuprofen and other NSAID and cancer prevention [54]. However, results varied for different types of cancer.

Table 3: Two-feature combinations from the PLCO dataset that produce the highest risk of PC for males

Symptom 1 | Symptom 2 | Probability
--- | --- | ---
Age when told had inflamed prostate=70+ | No. of cigarettes smoked daily=80+ | 0.032
Age when told had inflamed prostate=70+ | Prior history of any cancer?=Yes | 0.03
Prior history of any cancer?=Yes | No. of cigarettes smoked daily=80+ | 0.026
Age when told had inflamed prostate=70+ | Age when told had enlarged prostate=70+ | 0.026
Age when told had inflamed prostate=70+ | Family history of PC?=Yes | 0.026
Age when told had inflamed prostate=70+ | No. of relatives with PC=1 | 0.026
Age when told had enlarged prostate=70+ | No. of cigarettes smoked daily=80+ | 0.024
Family history of PC?=Yes | No. of cigarettes smoked daily=80+ | 0.024
No. of relatives with PC=1 | No. of cigarettes smoked daily=80+ | 0.024
Age when told had inflamed prostate=70+ | Bronchitis history?=Yes | 0.022
Age when told had enlarged prostate=70+ | Prior history of any cancer?=Yes | 0.022
Prior history of any cancer?=Yes | Family history of PC?=Yes | 0.022
Prior history of any cancer?=Yes | No. of relatives with PC=1 | 0.021
Age when told had inflamed prostate=70+ | Gall bladder stone or inflammation=Yes | 0.021
Age when told had inflamed prostate=70+ | Smoke regularly now?=Yes | 0.021
No. of cigarettes smoked daily=80+ | Bronchitis history?=Yes | 0.021
Age when told had inflamed prostate=70+ | During past year, how many times wake up in the night to urinate?=Thrice | 0.021
Age when told had inflamed prostate=70+ | Smoked pipe=current smoker | 0.021
Age when told had inflamed prostate=70+ | Diabetes history=Yes | 0.02
Age when told had inflamed prostate=70+ | No. of brothers=7+ | 0.02


Figure 3: Weights assigned by 9 feature selection algorithms (columns 1-9) to risk factors in the PLCO dataset.

Surgical history: Rosenberg et al. have reported a positive association between vasectomy and PC, with the risk of PC increased by 1.8% [55].

Inherited genetic syndrome: Certain rare genetic conditions cause almost 10% of all PCs. In our investigation, this is reflected in family history of PC, which was chosen by two of the feature-selection algorithms, viz., Relieff and Lasso, in Table 3. Also, the number of relatives with PC was chosen by four of the feature selection algorithms, viz., ECFS, UDFS, LLCFS and CFS. From the graphs in Figure 3, it can be seen that if a subject has a family history of PC or any form of cancer, the probability of PC increases. Further, the increase is almost exponential as the number of relatives with PC increases, which strongly suggests that genetics plays an important role in determining the possibility of PC. Such rare genetic conditions include [36]:

• Hereditary breast and ovarian cancer syndrome, caused by mutations in the BRCA1 or BRCA2 genes,

• Hereditary breast cancer, caused by mutations in the PALB2 gene,

• Familial atypical multiple mole melanoma (FAMMM) syndrome, caused by mutations in the p16/CDKN2A gene and associated with skin and eye melanomas,

• Familial pancreatitis, usually caused by mutations in the PRSS1 gene,

• Lynch syndrome, also known as hereditary non-polyposis colorectal cancer (HNPCC), most often caused by a defect in the MLH1 or MSH2 genes,

• Peutz-Jeghers syndrome, caused by defects in the STK11 gene. This syndrome is also linked with polyps in the digestive tract and several other cancers.

Race: Race has been a predominant factor in determining the risk of PC [36-38]. According to the literature, Black or African American people have a higher risk of contracting PC. This could be attributed to their dietary habits or smoking history. Race was chosen as one of the features by three of our feature-selection algorithms, viz., Laplacian, FSASL and LLCFS in 2, and Asian race was chosen as one of the features with the highest probability of PC in Table 3.

Gender: The literature has shown that men are more likely to develop PC than women [36-38]. This could be because men are more likely to smoke than women, and smoking has a significant effect on PC. Gender was chosen as a feature by three of our feature-selection algorithms, viz. Laplacian, CFS and ECFS, in Table 3, and the male gender appears in the highest-probability case for PC in Table 3.

Female hormones: Experimental findings on the effect of female hormone use suggest that female hormones have a protective role against the incidence of PC [56].

Bronchitis: Although there is no direct evidence linking bronchitis to PC risk, a study conducted on male smokers in Finland suggests that bronchial asthma predicts the subsequent risk of developing PC in male smokers and that greater physical activity may decrease the risk [57]. Bronchial asthma can also increase the chances of developing bronchitis.

Heart attack: Many references suggest an increased association of heart attack and stroke with any type of cancer (not necessarily PC), showing an increased risk of heart attack and stroke in the months leading up to a cancer diagnosis. Other articles [58,59] report that recent epidemiological analyses suggest that cancer incidence is more common among subjects with a history of heart failure than among subjects with no such history.

Hypertension: Some references suggest, for example, that hypertension at baseline was associated with an increased risk of PC incidence [60].

Although the above factors (inherited genetic syndrome, race, diabetes history and gender) have a strong relationship with PC, they were not among the most frequently selected features in our algorithms, probably because other features have a stronger dependence when considered in unison.
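As an illustration of how agreement across feature-selection methods can be summarised, the following minimal sketch tallies how many methods chose each feature. It uses only the selections stated in the text above; the function names `vote_counts` and `rank_by_votes` are hypothetical, and the complete selected set of each method is not reproduced here.

```python
# Selections as stated in the discussion above (feature -> methods that chose it).
# This is a partial, illustrative list, not the full output of any method.
selections = {
    "Family history of PC": ["Relieff", "Lasso"],
    "No. of relatives with PC": ["ECFS", "UDFS", "LLCFS", "CFS"],
    "Race": ["Laplacian", "FSASL", "LLCFS"],
    "Gender": ["Laplacian", "CFS", "ECFS"],
}


def vote_counts(selections):
    """Return the number of distinct methods that selected each feature."""
    return {feature: len(set(methods)) for feature, methods in selections.items()}


def rank_by_votes(selections):
    """Sort features by descending vote count, breaking ties alphabetically."""
    counts = vote_counts(selections)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


ranking = rank_by_votes(selections)
for feature, votes in ranking:
    print(f"{feature}: chosen by {votes} method(s)")
```

A simple vote count of this kind treats every method equally; it does not capture dependence among features considered in unison, which is the effect discussed above.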

Most of the remaining features in Table 2 do not yet have strong evidence for their dependence on PC; however, they can guide biologists and researchers in investigating possible correlations between these symptoms (Tables 4 and 5).

Table 4: Two-feature combinations from the PLCO dataset that produce the highest risk of PC for females

| Symptom 1 | Symptom 2 | Probability |
|---|---|---|
| No. of cigarettes smoked daily=61-80 | No. of relatives with PC=2+ | 0.156 |
| No. of tubal/ectopic pregnancies=2+ | No. of relatives with PC=2+ | 0.137 |
| Usually filtered or not filtered?=Both | No. of relatives with PC=2+ | 0.115 |
| No. of cigarettes smoked daily=61-80 | No. of relatives with PC=2+ | 0.095 |
| No. of tubal/ectopic pregnancies=1 | No. of relatives with PC=2+ | 0.084 |
| Heart attack history?=Yes | No. of relatives with PC=2+ | 0.08 |
| No. of cigarettes smoked daily=21-30 | No. of relatives with PC=2+ | 0.077 |
| No. of relatives with PC=2+ | Race=Asian | 0.076 |
| No. of relatives with PC=2+ | No. of still births=1 | 0.074 |
| No. of relatives with PC=2+ | Diabetes history?=Yes | 0.0737 |
| No. of relatives with PC=2+ | Race=American Indian | 0.0737 |
| No. of relatives with PC=2+ | Emphysema history?=Yes | 0.0737 |
| No. of relatives with PC=2+ | No. of cigarettes smoked daily=31-40 | 0.0708 |
| No. of relatives with PC=2+ | Colorectal polyps history?=Yes | 0.0708 |
| No. of relatives with PC=2+ | Stroke history?=Yes | 0.0704 |
| No. of relatives with PC=2+ | Age at hysterectomy=40-44 | 0.0686 |
| No. of relatives with PC=2+ | No. of brothers=7+ | 0.0645 |
| No. of relatives with PC=2+ | Bronchitis history?=2+ | 0.064 |
| No. of relatives with PC=2+ | Liver comorbidities history?=Yes | 0.063 |
| No. of relatives with PC=2+ | No. of cigarettes smoked daily=11-20 | 0.063 |
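The text does not state how the probabilities for two-feature combinations were computed. One plausible reading, sketched below purely as an assumption, is an empirical conditional frequency: the fraction of participants with PC among those sharing a given pair of feature values. The function `pair_risk`, the field names and the toy records are hypothetical and are not taken from the PLCO data.

```python
from collections import defaultdict
from itertools import combinations


def pair_risk(records, outcome="pc", min_count=1):
    """Estimate P(outcome=1 | f1=v1, f2=v2) for every observed pair of feature values.

    records: list of dicts mapping feature name -> value, plus a 0/1 outcome field.
    Returns (pair, probability) tuples sorted by descending probability.
    """
    totals = defaultdict(int)  # participants observed with each value pair
    cases = defaultdict(int)   # of those, how many had the outcome
    for rec in records:
        # Order features by name so each pair has a single canonical key.
        items = sorted(
            ((k, v) for k, v in rec.items() if k != outcome),
            key=lambda kv: kv[0],
        )
        for pair in combinations(items, 2):
            totals[pair] += 1
            cases[pair] += rec[outcome]
    # min_count guards against pairs supported by too few participants.
    risks = {pair: cases[pair] / n for pair, n in totals.items() if n >= min_count}
    return sorted(risks.items(), key=lambda kv: -kv[1])


# Toy records for illustration only (hypothetical values).
records = [
    {"relatives_with_pc": "2+", "cigs_per_day": "61-80", "pc": 1},
    {"relatives_with_pc": "2+", "cigs_per_day": "61-80", "pc": 0},
    {"relatives_with_pc": "0", "cigs_per_day": "61-80", "pc": 0},
    {"relatives_with_pc": "0", "cigs_per_day": "0", "pc": 0},
]

ranked = pair_risk(records)
top_pair, top_prob = ranked[0]
print(top_pair, top_prob)
```

On an imbalanced dataset, a threshold such as `min_count` matters: a pair observed in only a handful of participants can produce a large but unreliable estimate.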

Table 5: Three-feature combinations from the PLCO dataset that produce the highest risk of PC for males and females

| Sex | Symptom 1 conditional probability | Symptom 2 conditional probability | Symptom 3 conditional probability | Total probability |
|---|---|---|---|---|
| Male | No. of cigarettes smoked daily is 61-80=0.005 | Age when told had enlarged prostate is 70+=0.0175 | Prior history of cancer is Yes=0.05 | 0.00521 |
| Male | No. of cigarettes smoked daily is 61-80=0.005 | Age when told had enlarged prostate is 70+=0.0175 | Family history of PC is Yes=0.533 | 0.002368 |
| Male | No. of cigarettes smoked daily is 61-80=0.005 | Age when told had enlarged prostate is 70+=0.0175 | No. of relatives with PC is 1=0.04 | 0.004593 |
| Male | Prior history of cancer is Yes=0.05 | Age when told had enlarged prostate is 70+=0.0175 | Family history of PC is Yes=0.533 | 0.002153 |
| Male | Prior history of cancer is Yes=0.05 | Age when told had enlarged prostate is 70+=0.0175 | No. of relatives with PC is 1=0.04 | 0.004177 |
| Male | Age when told had enlarged prostate is 70+=0.0175 | Family history of PC is Yes=0.533 | No. of relatives with PC is 1=0.04 | 0.00189 |
| Male | No. of cigarettes smoked daily is 61-80=0.005 | Prior history of cancer is Yes=0.05 | Family history of PC is Yes=0.533 | 0.02742 |
| Male | No. of cigarettes smoked daily is 61-80=0.005 | Prior history of cancer is Yes=0.05 | No. of relatives with PC is 1=0.04 | 0.05197055 |
| Male | No. of cigarettes smoked daily is 61-80=0.005 | Family history of PC is Yes=0.533 | No. of relatives with PC is 1=0.04 | 0.024218 |
| Male | Prior history of cancer is Yes=0.05 | Family history of PC is Yes=0.533 | No. of relatives with PC is 1=0.04 | 0.02206 |
| Female | No. of tubal/ectopic pregnancies is 1=0.003 | No. of relatives with PC is 2+=0.011 | No. of cigarettes smoked is 61-80=0.007 | 0.3578 |

Conclusion

We used widely adopted algorithms for PC prediction. Because the exact relationship between the features and the cause of PC cannot be ascertained with certainty (for example, factors such as education, marital status and several others could have an indirect causal relationship with the disease), these factors were not excluded from our prediction study. After running all the above algorithms, we observed that k-means clustering and the SMOTE method of oversampling are among the superior algorithms for PC prediction. The artificial-intelligence-based Bayesian network prediction model can indicate which individuals are at an elevated risk of PC.

Until now, very limited work has been done on PC prediction, so the accuracy obtained in our research is significant. The lack of datasets for PC available online has limited the work that can be done in this field. Still, the PLCO dataset from the NIH has been a very valuable resource. Future improvements can be made by taking into account other features that further research and additional datasets identify as possible precursors to PC.

Acknowledgement

The author would like to thank Dr Bonny Banerjee and Dr Chrysanthe Preza of the University of Memphis, Tennessee, and Dr Subhash Chauhan and Dr Sheema Khan of the University of Texas Rio Grande Valley, Texas, for their support and guidance in writing this paper.

Conflict Of Interest

No conflict of interest associated with this work.

References

Citation: Dutta A (2023) Using Machine Learning to Identify the Risk Factors of Pancreatic Cancer from the NCI PLCO Dataset. Qual Prim Care. 31:34.

Copyright: © 2023 Dutta A. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.