Only about 14% of susceptible EU citizens participate in colorectal cancer (CRC) screening programs, even though CRC is the third most common type of cancer worldwide. Predictive models can provide personalized CRC predictions, which can be embedded in decision-support tools that facilitate screening and treatment recommendations. This paper, published in Computer Methods and Programs in Biomedicine, develops a predictive model that helps characterize risk groups and assess the influence of a variety of risk factors on the population.
Find the paper at the following link
- `main.py`: Contains the pipeline of the project, which can be summarized in the following steps:
  - Data Preprocessing: It reads a CSV file named `df_2012.csv` from the `data` directory and applies some preprocessing to the data using a function from the `preprocessing` module.
  - Structure Learning: It uses the `HillClimbSearch` and `BDsScore` classes from the `pgmpy` library to estimate the structure of a Bayesian Network from the data. The structure learning process can be influenced by a target variable, a blacklist of edges, and a list of fixed edges, all of which are specified in the `config` module.
  - Model Visualization: It visualizes the learned Bayesian Network structure using the `pyAgrum` library and saves the visualizations as PNG images in the `images` directory. It creates two visualizations: one for the prior network (before learning) and one for the posterior network (after learning).
  - Parameter Estimation: It estimates the parameters of the Bayesian Network using a function from the `parameter_estimation` module. This process involves updating the prior parameters of the network based on the data.
  - Model Statistics: It calculates and saves some statistics of interest about the model, such as the mean and variance of the counts per year. It also calculates the 90% posterior predictive interval for these counts.
  - Risk Mapping: It creates and saves a heatmap of the risk associated with different variables in the model. This is done using a function from the `risk_mapping` module. If specified in the `config` module, it also calculates an approximation of the posterior predictive intervals by sampling.
  - Influential Variables: It identifies the variables in the model that have the most influence on a target variable. This is done using a function from the `influential_variables` module.
  - Model Evaluation: Finally, it evaluates the performance of the model in classifying a target variable in a separate dataset (`df_2016.csv`). This is done using a function from the `evaluation_classification` module.
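
The Model Statistics step reports a mean, a variance, and a 90% posterior predictive interval for counts. The repository's exact computation is not described here, so the following is only a minimal sketch of the idea for a single binary outcome, assuming a Beta posterior and approximating the predictive distribution by sampling. The function name `posterior_predictive_summary` and the numbers in the example are hypothetical.

```python
import random
import statistics


def posterior_predictive_summary(alpha, beta, n_new, n_samples=2000, level=0.90, seed=0):
    """Sample counts from a Beta-Binomial posterior predictive distribution.

    alpha, beta: Beta posterior parameters (pseudo counts plus observed counts).
    n_new: number of future individuals for which the count is predicted.
    All names here are illustrative and not taken from the repository.
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(n_samples):
        # Draw a plausible probability, then simulate a count under it.
        p = rng.betavariate(alpha, beta)
        draws.append(sum(1 for _ in range(n_new) if rng.random() < p))
    draws.sort()
    tail = (1.0 - level) / 2.0
    lower = draws[int(tail * n_samples)]
    upper = draws[int((1.0 - tail) * n_samples) - 1]
    return statistics.mean(draws), statistics.variance(draws), (lower, upper)


# Hypothetical posterior: 3 pseudo-positive and 97 pseudo-negative cases.
mean, variance, (lower, upper) = posterior_predictive_summary(alpha=3.0, beta=97.0, n_new=200)
print(f"mean={mean:.2f} variance={variance:.2f} 90% interval=({lower}, {upper})")
```

Fixing the seed makes the sampled interval reproducible between runs, which is convenient when the interval is saved alongside other model statistics.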
- `config.py`: This file contains several Python dictionaries and lists that are used to configure the behavior of a Bayesian Network model:
  - `inputs`: This dictionary specifies the target variable for the model ("CRC"), whether to calculate intervals (False), and the number of random trials to perform (10).
  - `structure`: This dictionary contains two lists:
    - `black_list`: A list of variable pairs that should not be connected in the Bayesian Network.
    - `fixed_edges`: A list of variable pairs that should always be connected in the Bayesian Network.
  - `node_color`: This dictionary assigns a weight to each variable, which could be used for visual representation or importance ranking. The weights range from 0.1 to 0.4.
  - `pointwise_risk_mapping`: This dictionary specifies the column variable ("Age") and the row variable ("BMI") for the pointwise risk mapping.
  - `interval_risk_mapping`: This dictionary specifies the column variable ("Age") and the row variable ("BMI") for the interval risk mapping.
  - `interval_path`: This dictionary specifies the path ("prueba22nov/") where the interval risk mapping results will be saved.
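
As a rough illustration of the configuration described above, a `config.py` along these lines would match the listed values. The dictionary names and the values "CRC", False, 10, "Age", "BMI", and "prueba22nov/" come from the description; every key name, every edge pair, and the individual node weights are assumptions made for this sketch.

```python
# Sketch of config.py. Key names such as "target" or "col_var" are hypothetical.
inputs = {
    "target": "CRC",              # target variable of the model
    "calculate_interval": False,  # whether to calculate intervals
    "n_random_trials": 10,        # number of random trials
}

structure = {
    # Hypothetical pairs: edges that must never appear in the network.
    "black_list": [("CRC", "Age"), ("CRC", "Sex")],
    # Hypothetical pair: edge that must always appear in the network.
    "fixed_edges": [("Age", "CRC")],
}

# Hypothetical weights within the documented 0.1 to 0.4 range.
node_color = {"Age": 0.1, "Sex": 0.1, "BMI": 0.2, "CRC": 0.4}

pointwise_risk_mapping = {"col_var": "Age", "row_var": "BMI"}
interval_risk_mapping = {"col_var": "Age", "row_var": "BMI"}
interval_path = {"path": "prueba22nov/"}

# Sanity check: an edge cannot be both forbidden and required.
conflicts = set(structure["black_list"]) & set(structure["fixed_edges"])
if conflicts:
    raise ValueError(f"Edges both blacklisted and fixed: {sorted(conflicts)}")
```

The final check is not part of the described file; it is a small safeguard that may be worth adding because contradictory constraints can otherwise surface only during structure learning.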
- `preprocessing.py`: Add necessary preprocessing steps.
- `parameter_estimation.py`:
  - `create_pscount_dict_from_model(model_bn, card_dict, prior_weight, size_prior_dataset)`: This function generates a dictionary of pseudo counts from a Bayesian Network model. It reshapes the conditional probability distributions (CPDs) of each variable in the model and scales them by the size of the prior dataset and a specified factor.
  - `prior_update_iteration(model_bn, card_dict, pscount_dict, size_prior_dataset)`: This function performs a prior update iteration on the Bayesian Network model using data from different years. It reads data, preprocesses it, fits the model using a Bayesian Estimator with a Dirichlet prior, updates the pseudo counts dictionary, and stores the count tables for each year. It returns the updated model and a dictionary of counts per year.
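
The two functions above follow the usual Dirichlet pseudo-count pattern: a prior CPD is scaled into pseudo counts, observed counts are added, and the result is renormalized. The sketch below shows that arithmetic for a single CPD stored as nested lists, without `pgmpy`. The helper names `pseudo_counts_from_cpd` and `dirichlet_update` are hypothetical, and scaling every parent configuration by the same factor is a simplifying assumption, not necessarily what the repository does.

```python
def pseudo_counts_from_cpd(cpd, prior_weight, size_prior_dataset):
    """Scale a CPD (rows: states, columns: parent configurations) into pseudo counts."""
    scale = prior_weight * size_prior_dataset
    return [[p * scale for p in row] for row in cpd]


def dirichlet_update(pseudo_counts, data_counts):
    """Add observed counts to pseudo counts and renormalize each column."""
    totals = [
        [a + n for a, n in zip(prior_row, data_row)]
        for prior_row, data_row in zip(pseudo_counts, data_counts)
    ]
    n_cols = len(totals[0])
    col_sums = [sum(row[j] for row in totals) for j in range(n_cols)]
    posterior = [[row[j] / col_sums[j] for j in range(n_cols)] for row in totals]
    return totals, posterior


# Hypothetical prior CPD for a binary variable with one binary parent.
prior_cpd = [[0.9, 0.8],
             [0.1, 0.2]]
pseudo = pseudo_counts_from_cpd(prior_cpd, prior_weight=0.5, size_prior_dataset=100)

# Hypothetical counts observed in one year of data.
year_counts = [[90, 70],
               [10, 30]]
totals, posterior_cpd = dirichlet_update(pseudo, year_counts)
print(posterior_cpd)
```

Keeping `totals` unnormalized allows them to be reused as the pseudo counts for the next year of data, which is the essence of an iterative prior update.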
- `risk_mapping.py`:
  - `pointwise_risk_mapping(model_bn, var1, var2)`: This function calculates the pointwise risk mapping for two variables (`var1` and `var2`) in a Bayesian Network model. It queries the model for the probability of "CRC" given different combinations of the two variables and "Sex", and stores the results in two dataframes. The risk is calculated as the logarithm of the difference between the probability of "CRC" given the evidence and the marginal probability of "CRC". The results are rounded to three decimal places.
  - `heatmap_plot_and_save(...)`: This function generates a heatmap based on the provided data and visual parameters. If the `save` flag is set to `True`, it also saves the generated heatmap as a PNG image in a specified directory. The filename of the image is based on the provided title.
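
To make the shape of a risk map concrete, the sketch below builds a grid of conditional probabilities of "CRC" with one row per BMI level and one column per Age level, rounded to three decimal places. It deliberately replaces the Bayesian Network queries with empirical frequencies from raw records and omits the logarithmic transformation, so it only illustrates the layout of the table that a heatmap would display. The function name `conditional_risk_grid` and all category labels are hypothetical.

```python
from collections import defaultdict


def conditional_risk_grid(records, row_var, col_var, target="CRC"):
    """Return {row_level: {col_level: P(target = 1 | row_level, col_level)}}.

    Probabilities are empirical frequencies; cells without data are None.
    """
    totals = defaultdict(int)
    positives = defaultdict(int)
    for rec in records:
        key = (rec[row_var], rec[col_var])
        totals[key] += 1
        if rec[target] == 1:
            positives[key] += 1

    rows = sorted({key[0] for key in totals})
    cols = sorted({key[1] for key in totals})
    grid = {}
    for r in rows:
        grid[r] = {}
        for c in cols:
            n = totals.get((r, c), 0)
            grid[r][c] = round(positives.get((r, c), 0) / n, 3) if n else None
    return grid


# Hypothetical records with made-up category labels.
records = [
    {"Age": "50-59", "BMI": "normal", "CRC": 0},
    {"Age": "50-59", "BMI": "normal", "CRC": 0},
    {"Age": "50-59", "BMI": "normal", "CRC": 1},
    {"Age": "60-69", "BMI": "normal", "CRC": 1},
    {"Age": "60-69", "BMI": "obese", "CRC": 1},
    {"Age": "60-69", "BMI": "obese", "CRC": 0},
]
grid = conditional_risk_grid(records, row_var="BMI", col_var="Age")
print(grid)
```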
- `influential_variables.py`:
  - `influential_variables(data, target, model_bn, n_random_trials = 50)`: Calculates the influence of different variables on a target variable in a Bayesian Network model. It performs multiple random trials, shuffling the variables, identifying non-ancestors of the target, and calculating difference vectors for each row in the dataframe.
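
One ingredient of this procedure, identifying the non-ancestors of the target, can be illustrated on a plain edge list. This is a minimal sketch that does not use `pgmpy`; the helper names, the example edges, and the "Symptom" node are hypothetical, and excluding the target itself from the result is an assumption.

```python
def ancestors(edges, target):
    """Return every node with a directed path to `target` in a DAG edge list."""
    parents = {}
    for parent, child in edges:
        parents.setdefault(child, set()).add(parent)

    found = set()
    stack = [target]
    while stack:
        node = stack.pop()
        for parent in parents.get(node, ()):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def non_ancestors(edges, target):
    """Return nodes that are neither the target nor one of its ancestors."""
    nodes = {node for edge in edges for node in edge}
    return nodes - ancestors(edges, target) - {target}


# Hypothetical network: Age -> BMI -> CRC <- Sex, CRC -> Symptom.
edges = [("Age", "BMI"), ("BMI", "CRC"), ("Sex", "CRC"), ("CRC", "Symptom")]
print(sorted(ancestors(edges, "CRC")))
print(sorted(non_ancestors(edges, "CRC")))
```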
- `evaluation_classification.py`:
  - `evaluation_classification(df_test, model_bn, test_var = "CRC")`: This function evaluates the classification performance of a Bayesian Network model on a test dataset. It initializes a Variable Elimination object with the model, then iterates over the rows of the test dataframe. For each row, it drops the test variable, converts the row to a dictionary, and queries the model for the probability of the test variable given the evidence in the row. It stores the predicted probabilities in a list `y_prob_pred`. Finally, it calculates the false positive rate, true positive rate, and thresholds for the Receiver Operating Characteristic (ROC) curve using the true labels and the predicted probabilities.
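
The final ROC computation can be sketched without any dependencies. The version below is an illustrative stand-in written for this README, not the repository's code, which may rely on a library routine instead. It takes true labels and predicted probabilities, like the `y_prob_pred` list described above, and returns the false positive rates, true positive rates, and thresholds. The function names and the example data are hypothetical.

```python
def roc_points(y_true, y_prob):
    """Return (fpr, tpr, thresholds) for binary labels and predicted probabilities."""
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("Both classes must be present to build a ROC curve.")

    # Start at (0, 0) with an infinite threshold: nothing is predicted positive.
    thresholds = [float("inf")] + sorted(set(y_prob), reverse=True)
    fpr, tpr = [0.0], [0.0]
    for t in thresholds[1:]:
        tp = sum(1 for y, p in zip(y_true, y_prob) if p >= t and y == 1)
        fp = sum(1 for y, p in zip(y_true, y_prob) if p >= t and y == 0)
        tpr.append(tp / n_pos)
        fpr.append(fp / n_neg)
    return fpr, tpr, thresholds


def area_under_curve(fpr, tpr):
    """Trapezoidal area under the ROC curve."""
    return sum(
        (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2.0
        for i in range(1, len(fpr))
    )


# Hypothetical labels and predicted probabilities of CRC.
y_true = [0, 0, 1, 1]
y_prob_pred = [0.1, 0.4, 0.35, 0.8]
fpr, tpr, thresholds = roc_points(y_true, y_prob_pred)
auc = area_under_curve(fpr, tpr)
print(fpr, tpr, auc)
```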
The files `BayesianEstimator.py` and `BayesianNetwork.py` are modified versions of the corresponding files from `pgmpy`; they need to replace the originals in the installed library for the main code to run properly. The modification saves the unnormalized tables of counts, which are then used to calculate the mean and variance of the empirical distributions.
For any further consultation please contact [email protected] or [email protected]