Skip to content

survival analysis on cirrhosis data from mayo clinic study: kaplan-meier estimator/curve, log rank test, cox proportional hazards model

Notifications You must be signed in to change notification settings

n8tlmps/cirrhosis-survival-analysis

Repository files navigation

Modeling Survival of Patients with Cirrhosis

Cirrhosis is severe scarring of the liver. This condition can be caused by many forms of liver diseases and conditions, such as hepatitis or chronic alcohol consumption. Each time your liver is injured, scar tissue forms in the healing process. The accumulation of scar tissue replaces more of the healthy liver tissue as cirrhosis worsens, impairing the liver’s functionality. Advanced cirrhosis can be life-threatening. This project aims to understand the probability of cirrhosis survival and to analyze the underlying factors influencing survival time. A survival analysis was conducted on this dataset using the Kaplan-Meier estimator and curve, Log-Rank tests, and the Cox proportional hazards model.

Dataset Information

The dataset was obtained from the UCI Machine Learning Repository, a collection of databases, domain theories, and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms. It contained 418 observations and 20 variables that featured:

  • ID
  • Number of days between registration and the earlier of death, transplantation or study analysis time
  • Status of the patient
  • Type of drug: D-penicillamine or Placebo
  • Age
  • Sex: Male or Female
  • Presence of Ascities
  • Presence of Hepatomegaly
  • Presence of Spiders
  • Presence of Edema
  • Bilirubin
  • Cholesterol
  • Albumin
  • Copper
  • Alkaline Phosphatase
  • SGOT
  • Tryglicerides
  • Platelets per cubic
  • Prothrombin time
  • Histologic Stage of Disease

After removing missing values, there were 276 observations remaining.

During 1974 to 1984, 424 primary biliary cirrhosis (PBC) patients referred to a Mayo Clinic study were qualified for the randomized placebo-controlled trial testing on the drug D-penicillamine. Of these, 312 patients were followed through the trial and have mostly comprehensive data. The remaining 112 patients did not join the clinical trail, but agreed to record basic metrics and undergo survival tracking. By the end of the study, data was collected from 106 of these individuals along with the initial 312 who were participating in the trial. Penicillamine is a chelating agent that binds to excess copper and removes it from the blood stream. It is often prescribed to treat rheumatoid arthiritis, Wilson’s disease, and kidney stones. If the liver is unable to process substances like iron and copper, they can build up to dangerously high levels and cause cirrhosis as they damage your liver tissues over time. Mutation in particular genes is also suggested to contribute to cirrhosis.

Kaplan-Meier Estimator and Curve

After cleaning, the data consists of the survival times or times to event for 276 individuals from the study. The specificity of the data is that they may include censored observations. An observation is censored if it is known that the person survived (or hasn't experienced the event) up to a certain time but nothing is known afterwards. It happens when the individual drops out of the study.


Denote by $n_i, i = 1, ..., k$, the number of individuals still alive (or those who have not experienced the event) shortly before time $t_i$ (they are called at-risk at time $t_i$), and let $e_i$ be the number of individuals who experienced the event at time $t_i$

The Kaplan-Meier (KM) product-limit estimator of the survival function is:

The plot of the KM estimator against time is called the Kaplan-Meier survival curve. It is a step function with vertical lines corresponding to the event times. Times when censoring occurs are marked by some symbol. When censoring coincides with an event time, the convention is to put the symbol at the bottom of the step. The KM estimator was computed and the KM curve was plotted using R.

time n.risk n.event survival
41 276 1 0.996
51 275 1 0.993
71 274 1 0.989
77 273 1 0.986
110 272 1 0.982
131 271 1 0.978
140 270 1 0.975
179 269 1 0.971
186 268 1 0.967
191 267 1 0.964
198 266 1 0.960
216 265 1 0.957
223 264 1 0.953
264 263 1 0.949
304 262 1 0.946
321 261 1 0.942
326 260 1 0.938
334 259 1 0.935
348 258 1 0.931
388 257 1 0.928
400 256 1 0.924
460 255 1 0.920
515 254 1 0.917
549 252 1 0.913
552 251 1 0.909
597 250 1 0.906
611 249 1 0.902
673 248 1 0.898
694 247 1 0.895
733 245 1 0.891
750 243 1 0.888
762 242 1 0.884
769 241 1 0.880
786 240 1 0.877
790 238 1 0.873
797 237 1 0.869
799 236 1 0.865
850 233 1 0.862
853 232 1 0.858
859 231 1 0.854
890 229 1 0.851
904 227 1 0.847
930 226 1 0.843
943 224 1 0.839
974 223 1 0.836
980 222 1 0.832
999 220 1 0.828
1012 219 1 0.824
1077 216 1 0.820
1080 215 1 0.817
1083 214 1 0.813
1152 211 1 0.809
1165 209 1 0.805
1170 208 1 0.801
1191 207 2 0.793
1212 205 1 0.790
1235 200 1 0.786
1297 195 1 0.782
1356 187 1 0.777
1360 186 1 0.773
1413 180 1 0.769
1427 177 1 0.765
1434 175 1 0.760
1444 172 1 0.756
1487 168 1 0.751
1536 165 1 0.747
1576 160 1 0.742
1657 154 1 0.737
1682 152 1 0.732
1690 151 2 0.723
1741 146 1 0.718
1786 139 1 0.713
1827 136 1 0.707
1847 133 1 0.702
1925 129 1 0.697
2055 119 1 0.691
2090 118 1 0.685
2105 117 1 0.679
2224 110 1 0.673
2256 106 1 0.667
2288 104 1 0.660
2297 102 1 0.654
2386 93 1 0.647
2400 92 1 0.640
2419 91 1 0.633
2466 86 1 0.625
2540 81 1 0.617
2583 76 1 0.609
2598 75 1 0.601
2689 70 1 0.593
2769 67 1 0.584
2796 65 1 0.575
2847 62 1 0.566
3086 52 1 0.555
3090 51 1 0.544
3170 45 1 0.532
3244 44 1 0.520
3282 42 1 0.507
3358 39 1 0.494
3395 37 1 0.481
3428 35 1 0.467
3445 34 1 0.453
3574 32 1 0.439
3584 29 1 0.424
3762 25 1 0.407
3839 22 1 0.389
3853 21 1 0.370
4079 14 1 0.344
4191 10 1 0.309

Log-Rank Test

To compare two survival curves, we can use the log-rank test. The hypotheses are: $H_0: S_1(t) = S_2(t)$ for all values of $t$ vs $H_1: S_1(t) \neq S_2(t)$ for at least one value of $t$. Under the $H_0$, $e_{1i}$ has a hypergeometric distriubtion with parameters $n_i$ (the population size), $n_{1i}$ (the size of the group of interest in the population), and $e_i$) (the sample size). The mean and the variance of $e_{1i}$ are $E(e_{1i})=\frac{n_{1i}e_i}{n_i}$ and $\text{Var}(e_{1i})=\frac{n_{1i}n_{2i}(n_i-e_i)e_i}{n_i^2(n_i-1)}$.


The log-rank test statistic is the standardized sum of $e_{1i}$'s over all tables, that is,

Under $H_0$, this test statistic has approximately a $\mathcal{N}(0, 1)$ distribution. Equivalently, $z^2$ may be chosen as the test statistic. It has approximately a chi-square distribution with one degree of freedom.

The log-rank test was performed on different patient groups using R at the 5% level:

  • D-penicillamine vs. Placebo
  • Male vs. Female
  • Presence of Hepatomegaly vs. No Presence
  • Presence of Ascites vs. No Ascites

  • D-penicillamine vs. Placebo

    Chisq on 1 degrees of freedom P-Value
    0.4 0.5

    The two-sided P-value is $2P(Z > 0.6325) = 0.5$, hence we fail to reject $H_0$ and conclude that the two survival curves for the D-penicillamine and Placebo groups are not significantly different.

    Male vs. Female

    Chisq on 1 degrees of freedom P-Value
    4.5 0.03

    The two-sided P-value is $2P(Z > 2.1213) = 0.03$, hence we reject $H_0$ and conclude that the two survival curves for male and female patients are marginally different. The proportion of females to males must also be taken into consideration when comparing probability of survival between the two groups as there were 242 females and 34 males registered in the study.

    Hepatomegaly vs. No Hepatomegaly

    Chisq on 1 degrees of freedom P-Value
    27.1 2e-07

    The two-sided P-value is $2P(Z > 5.2058) = 2e^{-7}$, hence we reject $H_0$ and conclude that the two survival curves for patients with hepatomegaly and those with no hepatomegaly are significantly different. The curve for patients with hepatomegaly lies below that for patients with no hepatomegaly, indicating that presence of hepatomegaly is a risk factor for shorter life span.

    Ascites vs. No Ascites

    Chisq on 1 degrees of freedom P-Value
    110 <2e-16

    The two-sided P-value is $2P(Z &gt; 10.4880) = &lt;2e^{-16}$, hence we reject $H_0$ and conclude that the two survival curves for patients with ascites and those with no ascites are significantly different. The curve for patients with ascites lies drastically below that for patients with no ascites, indicating that presence of ascites is a risk factor for shorter life span.

    Cox Proportional Hazards Model

    Suppose that besides the event time and an indicator of censoring, data contain measurements of a set of predictors $x_1, ..., x_m$ that do not vary with time. Denote the event time by $T$ and assume that it is a random variable with the hazard function $h_T(t)$. The Cox proportional hazards model assumes that the hazard function has the form:

    Note that in this model, the hazard function depends on time only through the baseline hazard function $h_0(t)$, and therefore, the ratio of hazards of two individuals does not depend on time, which means that the hazards are proportional over time.

    The unknowns of this model are the baseline hazard function $h_0(t)$, we introduce another formulation of the Cox PH model, in terms of the survival function. We write:

    where $S_0(t) = \exp\left\{- \int_{0}^{t} h_0(u) \, du\right\}$ is the baseline survival function, and $r = exp(\beta_1 x_1 + ... + \beta_m x_m)$ is the relative risk of an individual.


    The Cox PH model was fit to the data using R, setting drug, age, sex, ascites, hepatomegaly, spiders, edema, bilirubin, cholesterol, albumin, copper, alkaline phosphatase, SGOT, tryglicerides, platelets, prothrombin, and stage as predictors.

    Coeff Pr(>z)
    drug.relD-penicallamine 1.765e-01 0.42003
    Age 7.910e-05 0.01304 *
    sex.relM 3.757e-01 0.22720
    ascites.relY 5.899e-04 0.99881
    hepatomegaly.relY 5.648e-02 0.82575
    spiders.relY 7.011e-02 0.77839
    edema.relS 2.436e-01 0.46764
    edema.relY 1.146e+00 0.00595 **
    Bilirubin 8.014e-02 0.00229 **
    Cholesterol 4.731e-04 0.29675
    Albumin -7.496e-01 0.01572 *
    Copper 2.442e-03 0.03693 *
    Alk_Phos 5.228e-07 0.98974
    SGOT 3.710e-03 0.06247
    Tryglicerides -5.177e-04 0.71846
    Platelets 8.410e-04 0.47929
    Prothrombin 2.763e-01 0.01793 *
    stage.rel2 1.406e+00 0.19288
    stage.rel3 1.683e+00 0.10968
    stage.rel4 2.119e+00 0.04711 *
    Time Risk Event Sbar
    41 276 1 1.000
    51 275 1 1.000
    71 274 1 1.000
    77 273 1 1.000
    110 272 1 0.999
    131 271 1 0.999
    140 270 1 0.999
    179 269 1 0.999
    186 268 1 0.999
    191 267 1 0.999
    198 266 1 0.998
    216 265 1 0.998
    223 264 1 0.998
    264 263 1 0.998
    304 262 1 0.998
    321 261 1 0.997
    326 260 1 0.997
    334 259 1 0.997
    348 258 1 0.997
    388 257 1 0.996
    400 256 1 0.996
    460 255 1 0.996
    515 254 1 0.996
    549 252 1 0.995
    552 251 1 0.995
    597 250 1 0.995
    611 249 1 0.994
    673 248 1 0.994
    694 247 1 0.994
    733 245 1 0.993
    750 243 1 0.993
    762 242 1 0.993
    769 241 1 0.992
    786 240 1 0.992
    790 238 1 0.992
    797 237 1 0.991
    799 236 1 0.991
    850 233 1 0.990
    853 232 1 0.990
    859 231 1 0.990
    890 229 1 0.989
    904 227 1 0.989
    930 226 1 0.989
    943 224 1 0.988
    974 223 1 0.988
    980 222 1 0.987
    999 220 1 0.987
    1012 219 1 0.986
    1077 216 1 0.986
    1080 215 1 0.985
    1083 214 1 0.985
    1152 211 1 0.984
    1165 209 1 0.984
    1170 208 1 0.983
    1191 207 2 0.982
    1212 205 1 0.982
    1235 200 1 0.981
    1297 195 1 0.981
    1356 187 1 0.980
    1360 186 1 0.979
    1413 180 1 0.978
    1427 177 1 0.977
    1434 175 1 0.977
    1444 172 1 0.976
    1487 168 1 0.975
    1536 165 1 0.974
    1576 160 1 0.973
    1657 154 1 0.972
    1682 152 1 0.971
    1690 151 2 0.969
    1741 146 1 0.968
    1786 139 1 0.967
    1827 136 1 0.966
    1847 133 1 0.965
    1925 129 1 0.963
    2055 119 1 0.962
    2090 118 1 0.960
    2105 117 1 0.959
    2224 110 1 0.957
    2256 106 1 0.955
    2288 104 1 0.954
    2297 102 1 0.952
    2386 93 1 0.950
    2400 92 1 0.948
    2419 91 1 0.946
    2466 86 1 0.944
    2540 81 1 0.942
    2583 76 1 0.940
    2598 75 1 0.938
    2689 70 1 0.935
    2769 67 1 0.932
    2796 65 1 0.930
    2847 62 1 0.927
    3086 52 1 0.924
    3090 51 1 0.921
    3170 45 1 0.917
    3244 44 1 0.913
    3282 42 1 0.909
    3358 39 1 0.905
    3395 37 1 0.900
    3428 35 1 0.895
    3445 34 1 0.890
    3574 32 1 0.884
    3584 29 1 0.877
    3762 25 1 0.869
    3839 22 1 0.861
    3853 21 1 0.852
    4079 14 1 0.839
    4191 10 1 0.824

    Fitted Model Cox PH Model

    The fitted Cox proportional hazards model can be written as:

    Interpretation of Significant Predictors

    At the 5% level, the following predictors are significant:

  • age: As age increases by one day, the estimated hazard of dying from cirrhosis increases by 1.05%.
  • presence of edema despite diuretic therapy: The estimated hazard for patients with edema is 314.56% of that for patients with no edema.
  • bilirubin: As bilirubin level increases by 1 mg/dl, the estimated hazard increases by 8.34%.
  • albumin: As albumin level increases by 1 gm/dl, the estimated hazard changes by -52.75%, or decreases by 52.74%.
  • copper: As copper level increases by 1 ug/day, the estimated hazard increases by 0.244%.
  • prothrombin time: As prothrombin time increases by 1 second, the estimated hazard increases by 31.82%.
  • Stage VI: The estimated hazard of dying from cirrhosis for patients at Stage IV is 832.28% of that for patients at Stage I.
  • About

    survival analysis on cirrhosis data from mayo clinic study: kaplan-meier estimator/curve, log rank test, cox proportional hazards model

    Topics

    Resources

    Stars

    Watchers

    Forks

    Releases

    No releases published

    Packages

    No packages published

    Languages