
StatApp Project ENSAE 2023 - In partnership with AP-HP and INSERM.


Tristan-Amadei/TextMining_Parcours_de_soin

 
 


Stat'App Project - ENSAE 2023

Text mining and care pathway: what are the causes of mortality in heart failure patients?

Supervisors: Dr. Anne-Isabelle Tropeano, Juliette Murris

AP-HP & INSERM

The objective of the study is to clarify the causes of mortality in heart failure patients, who are increasingly older. Knowing the main causes of mortality in these patients and their most frequent care pathways will have a major public health impact.
To answer this question, we first characterize the care pathways of patients through the study of sequential patterns: using the GHM codes (Groupes Homogènes de Malades, the French diagnosis-related groups) that define hospitalizations, it is possible to find similarities in the care pathways associated with a diagnosis.
Once these pathways are identified, a survival analysis predicts the survival trajectory after the first hospitalization.

Clustering

To obtain results that are easier to interpret, and that better satisfy the assumptions of the models used later on, we decided to split the pool of patients into clusters.
We used K-Medoids with a custom metric: based on the information contained in the GHM codes, we created our own distance metric to measure how far apart two care pathways are.
How and why we constructed this metric is explained in the project paper.
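As a rough illustration of K-Medoids on care pathways, here is a minimal sketch in plain Python. The project's actual metric is described in the project paper and is not reproduced here: `jaccard_distance` is a stand-in chosen only for illustration, and the function names, the deterministic initialisation and every GHM code other than '05M09' are assumptions.

```python
def jaccard_distance(path_a, path_b):
    """Placeholder metric (NOT the project's): Jaccard distance on sets of GHM codes."""
    a, b = set(path_a), set(path_b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def k_medoids(paths, k, distance=jaccard_distance, max_iter=100):
    """Cluster care pathways around k medoids; returns (medoid indices, labels)."""
    n = len(paths)
    # Pairwise distances between all pathways, computed once.
    dist = [[distance(paths[i], paths[j]) for j in range(n)] for i in range(n)]

    # Deterministic start (an assumption): the most central pathway first,
    # then the pathways farthest from the medoids already chosen.
    medoids = [min(range(n), key=lambda i: sum(dist[i]))]
    while len(medoids) < k:
        medoids.append(max(range(n), key=lambda i: min(dist[i][m] for m in medoids)))

    def assign(current):
        # Each pathway joins the cluster of its nearest medoid.
        return [min(range(k), key=lambda c: dist[i][current[c]]) for i in range(n)]

    for _ in range(max_iter):
        labels = assign(medoids)
        updated = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:
                # Empty cluster: keep the previous medoid.
                updated.append(medoids[c])
                continue
            # New medoid: the member with the smallest total distance to the others.
            updated.append(min(members, key=lambda m: sum(dist[m][j] for j in members)))
        if updated == medoids:
            break
        medoids = updated
    return medoids, assign(medoids)


# Toy pathways with made-up GHM codes, for illustration only.
toy_paths = [
    ["05M09", "05M08"],
    ["05M09", "05M08", "04M05"],
    ["06C04", "06M03"],
    ["06C04", "06M03", "06M12"],
]
medoids, labels = k_medoids(toy_paths, k=2)
```

Because a medoid is always an actual patient, each cluster can be summarised by a real care pathway, which is what makes the similarity check on medoids possible.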

clusters_population

This allows us to stratify our models by cluster; it also allowed us to detect and set apart the outliers in cluster 3.
We can also check that each cluster groups patients who are similar with respect to our metric.

cluster_similarity

For each cluster:

  • we select 50 patients randomly
  • for each selected patient
    • for each GHM code in the patient's care pathway, we check whether this code appears in the hospitalization course of the cluster's medoid patient
    • if so, the code is displayed in green in the figure
    • otherwise, it is displayed in grey

In a nutshell, the more green there is, the better. Leaving cluster 3 aside, our clusters seem to do a good job.
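The check described above can be sketched as follows. This is a hedged illustration rather than the project's plotting code: the function names are hypothetical, the fixed seed exists only to make the sampling reproducible, and the sketch returns flags instead of drawing the figure.

```python
import random


def code_overlap(patient_path, medoid_path):
    """Flag each GHM code of a patient: True (green) if it also appears
    in the medoid's pathway, False (grey) otherwise."""
    medoid_codes = set(medoid_path)
    return [(code, code in medoid_codes) for code in patient_path]


def green_share(patient_path, medoid_path):
    """Fraction of the patient's codes that would be displayed in green."""
    flags = code_overlap(patient_path, medoid_path)
    if not flags:
        return 0.0
    return sum(is_green for _, is_green in flags) / len(flags)


def sample_patients(cluster_paths, n=50, seed=0):
    """Draw up to n pathways at random from one cluster."""
    rng = random.Random(seed)
    return rng.sample(cluster_paths, min(n, len(cluster_paths)))
```

Averaging `green_share` over the sampled patients of a cluster gives a single number that summarises how much green that cluster shows.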

Pattern Mining

The aim of this section is to use pattern mining to extract frequent care pathway patterns, including risky or even potentially lethal trajectories.
Identifying them should improve future predictions and help us better understand the causes of death in heart failure patients.

Sequential pattern mining is a data mining technique used to discover frequently occurring sequential patterns, or subsequences, in a sequence database or time-series data, while taking into account the order of occurrence.

With this technique, we obtained the most frequent hospitalization patterns for different temporal stamp lengths.
We notably found that hospitalization for heart failure ('05M09') is the last hospitalization before death in 33% of care paths.
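To make the idea concrete, here is a small sketch that counts pattern support and the share of pathways ending on a given code. It is a simplification under stated assumptions: it only considers contiguous patterns of a fixed length, whereas sequential pattern mining algorithms generally allow gaps, and the algorithm actually used in the project is not specified here. The function names and the toy data are hypothetical.

```python
from collections import Counter


def frequent_patterns(paths, length, min_support):
    """Return {pattern: support} for contiguous patterns of the given length.
    Support is the fraction of pathways containing the pattern at least once."""
    counts = Counter()
    for path in paths:
        # A set, so that each pattern is counted at most once per pathway.
        seen = {tuple(path[i:i + length]) for i in range(len(path) - length + 1)}
        counts.update(seen)
    total = len(paths)
    return {p: c / total for p, c in counts.items() if c / total >= min_support}


def last_code_share(paths, code):
    """Share of non-empty pathways whose last hospitalization has this GHM code."""
    non_empty = [p for p in paths if p]
    if not non_empty:
        return 0.0
    return sum(p[-1] == code for p in non_empty) / len(non_empty)


# Toy pathways (made-up data); each list is ordered in time.
toy_paths = [
    ["04M05", "05M09"],
    ["05M09", "05K10", "05M09"],
    ["06C04", "05M08"],
]
patterns = frequent_patterns(toy_paths, length=1, min_support=0.5)
share = last_code_share(toy_paths, "05M09")
```

Applied to the pathways of deceased patients only, `last_code_share` is the kind of computation behind a statement such as the 33% figure above.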

frequent_ghm

We then used these findings to build Sankey diagrams showing the patterns that are upstream of "Décès" (death).

sankey

Survival Analysis

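Finally, we try to build models that predict patients' life expectancy.

A model is a survival model when it handles censored data, i.e. patients for whom the event of interest (here, death) has not been observed by the end of follow-up.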
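To show how censored observations enter a survival estimate, here is a minimal sketch of the Kaplan-Meier estimator in plain Python. It is given only as the standard textbook illustration of censoring; nothing here states that the project used it, and the function name and toy data are assumptions.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival probability)] at each time with an observed death.
    events[i] is True if death was observed, False if the patient was censored."""
    data = sorted(zip(durations, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients sharing the same time t.
        while i < len(data) and data[i][0] == t:
            deaths += 1 if data[i][1] else 0
            leaving += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Censored patients leave the risk set without counting as deaths.
        at_risk -= leaving
    return curve


# Toy follow-up times (made-up data); False marks a censored patient.
curve = kaplan_meier([1, 2, 2, 3, 4], [True, False, True, True, False])
```

A censored patient still counts in the risk set up to the censoring time, which is why discarding such patients, or treating them as deaths, would bias the estimate.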

Cox Model

The Cox model is defined by its hazard function
$$h(t) = h_0(t) \cdot \exp\left(\sum_{i=1}^p b_i x_i \right)$$
where $h_0(t)$ is the baseline hazard, the $x_i$ are the patient's covariates and the $b_i$ their coefficients. It can be interpreted as the instantaneous risk of dying at time t for a patient still alive at that time.
Stratifying the Cox model on the clusters defined earlier allowed the model's assumptions to be respected.
We were then able to fit a penalised Cox model, stratified by cluster, and obtain life expectancy curves.
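The hazard formula can be evaluated directly, as in the short sketch below. The function names and numeric values are made up for illustration; the sketch does not fit a model, it only shows how coefficients and covariates combine. In a stratified model, each cluster would have its own baseline hazard while sharing the coefficients.

```python
import math


def cox_hazard(baseline_hazard, coefficients, covariates):
    """h(t) = h_0(t) * exp(sum_i b_i * x_i), for a given value of h_0(t)."""
    linear_predictor = sum(b * x for b, x in zip(coefficients, covariates))
    return baseline_hazard * math.exp(linear_predictor)


def hazard_ratio(coefficients, covariates_a, covariates_b):
    """Ratio of the hazards of patients a and b; the baseline hazard cancels out."""
    diff = sum(b * (xa - xb) for b, xa, xb in zip(coefficients, covariates_a, covariates_b))
    return math.exp(diff)


# Hypothetical example: one binary covariate with coefficient log(2),
# so a patient with x = 1 has twice the hazard of a patient with x = 0.
ratio = hazard_ratio([math.log(2)], [1], [0])
```

Since the baseline hazard cancels in the ratio, the coefficients can be read as log hazard ratios, which is a large part of the model's explainability.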

cox_model_life_exp_cluster5

Survival Random Forest

We also wanted to compare the predictions of the Cox model with a non-parametric approach: the Survival Random Forest.
After a cross-validation phase to tune the hyperparameters, we plotted, for each cluster, the most and least optimistic trajectories predicted by the fitted model.

trajectories_survival_rf

However, the Survival Random Forest, being non-parametric, does not seem to bring more predictive power and loses explainability; a Cox model therefore seems more appropriate for this problem.
