Cirrhosis is severe scarring of the liver. This condition can be caused by many forms of liver diseases and conditions, such as hepatitis or chronic alcohol consumption. Each time your liver is injured, scar tissue forms in the healing process. The accumulation of scar tissue replaces more of the healthy liver tissue as cirrhosis worsens, impairing the liver’s functionality. Advanced cirrhosis can be life-threatening. This project aims to understand the probability of cirrhosis survival and to analyze the underlying factors influencing survival time. A survival analysis was conducted on this dataset using the Kaplan-Meier estimator and curve, Log-Rank tests, and the Cox proportional hazards model.
The dataset was obtained from the UCI Machine Learning Repository, a collection of databases, domain theories, and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms. It contained 418 observations and 20 variables that featured:
- ID
- Number of days between registration and the earlier of death, transplantation or study analysis time
- Status of the patient
- Type of drug: D-penicillamine or Placebo
- Age
- Sex: Male or Female
- Presence of Ascities
- Presence of Hepatomegaly
- Presence of Spiders
- Presence of Edema
- Bilirubin
- Cholesterol
- Albumin
- Copper
- Alkaline Phosphatase
- SGOT
- Tryglicerides
- Platelets per cubic
- Prothrombin time
- Histologic Stage of Disease
After removing missing values, there were 276 observations remaining.
During 1974 to 1984, 424 primary biliary cirrhosis (PBC) patients referred to a Mayo Clinic study were qualified for the randomized placebo-controlled trial testing on the drug D-penicillamine. Of these, 312 patients were followed through the trial and have mostly comprehensive data. The remaining 112 patients did not join the clinical trail, but agreed to record basic metrics and undergo survival tracking. By the end of the study, data was collected from 106 of these individuals along with the initial 312 who were participating in the trial. Penicillamine is a chelating agent that binds to excess copper and removes it from the blood stream. It is often prescribed to treat rheumatoid arthiritis, Wilson’s disease, and kidney stones. If the liver is unable to process substances like iron and copper, they can build up to dangerously high levels and cause cirrhosis as they damage your liver tissues over time. Mutation in particular genes is also suggested to contribute to cirrhosis.
After cleaning, the data consists of the survival times or times to event for 276 individuals from the study. The specificity of the data is that they may include censored observations. An observation is censored if it is known that the person survived (or hasn't experienced the event) up to a certain time but nothing is known afterwards. It happens when the individual drops out of the study.
Denote by
The Kaplan-Meier (KM) product-limit estimator of the survival function is:
The plot of the KM estimator against time is called the Kaplan-Meier survival curve. It is a step function with vertical lines corresponding to the event times. Times when censoring occurs are marked by some symbol. When censoring coincides with an event time, the convention is to put the symbol at the bottom of the step. The KM estimator was computed and the KM curve was plotted using R.
time | n.risk | n.event | survival |
---|---|---|---|
41 | 276 | 1 | 0.996 |
51 | 275 | 1 | 0.993 |
71 | 274 | 1 | 0.989 |
77 | 273 | 1 | 0.986 |
110 | 272 | 1 | 0.982 |
131 | 271 | 1 | 0.978 |
140 | 270 | 1 | 0.975 |
179 | 269 | 1 | 0.971 |
186 | 268 | 1 | 0.967 |
191 | 267 | 1 | 0.964 |
198 | 266 | 1 | 0.960 |
216 | 265 | 1 | 0.957 |
223 | 264 | 1 | 0.953 |
264 | 263 | 1 | 0.949 |
304 | 262 | 1 | 0.946 |
321 | 261 | 1 | 0.942 |
326 | 260 | 1 | 0.938 |
334 | 259 | 1 | 0.935 |
348 | 258 | 1 | 0.931 |
388 | 257 | 1 | 0.928 |
400 | 256 | 1 | 0.924 |
460 | 255 | 1 | 0.920 |
515 | 254 | 1 | 0.917 |
549 | 252 | 1 | 0.913 |
552 | 251 | 1 | 0.909 |
597 | 250 | 1 | 0.906 |
611 | 249 | 1 | 0.902 |
673 | 248 | 1 | 0.898 |
694 | 247 | 1 | 0.895 |
733 | 245 | 1 | 0.891 |
750 | 243 | 1 | 0.888 |
762 | 242 | 1 | 0.884 |
769 | 241 | 1 | 0.880 |
786 | 240 | 1 | 0.877 |
790 | 238 | 1 | 0.873 |
797 | 237 | 1 | 0.869 |
799 | 236 | 1 | 0.865 |
850 | 233 | 1 | 0.862 |
853 | 232 | 1 | 0.858 |
859 | 231 | 1 | 0.854 |
890 | 229 | 1 | 0.851 |
904 | 227 | 1 | 0.847 |
930 | 226 | 1 | 0.843 |
943 | 224 | 1 | 0.839 |
974 | 223 | 1 | 0.836 |
980 | 222 | 1 | 0.832 |
999 | 220 | 1 | 0.828 |
1012 | 219 | 1 | 0.824 |
1077 | 216 | 1 | 0.820 |
1080 | 215 | 1 | 0.817 |
1083 | 214 | 1 | 0.813 |
1152 | 211 | 1 | 0.809 |
1165 | 209 | 1 | 0.805 |
1170 | 208 | 1 | 0.801 |
1191 | 207 | 2 | 0.793 |
1212 | 205 | 1 | 0.790 |
1235 | 200 | 1 | 0.786 |
1297 | 195 | 1 | 0.782 |
1356 | 187 | 1 | 0.777 |
1360 | 186 | 1 | 0.773 |
1413 | 180 | 1 | 0.769 |
1427 | 177 | 1 | 0.765 |
1434 | 175 | 1 | 0.760 |
1444 | 172 | 1 | 0.756 |
1487 | 168 | 1 | 0.751 |
1536 | 165 | 1 | 0.747 |
1576 | 160 | 1 | 0.742 |
1657 | 154 | 1 | 0.737 |
1682 | 152 | 1 | 0.732 |
1690 | 151 | 2 | 0.723 |
1741 | 146 | 1 | 0.718 |
1786 | 139 | 1 | 0.713 |
1827 | 136 | 1 | 0.707 |
1847 | 133 | 1 | 0.702 |
1925 | 129 | 1 | 0.697 |
2055 | 119 | 1 | 0.691 |
2090 | 118 | 1 | 0.685 |
2105 | 117 | 1 | 0.679 |
2224 | 110 | 1 | 0.673 |
2256 | 106 | 1 | 0.667 |
2288 | 104 | 1 | 0.660 |
2297 | 102 | 1 | 0.654 |
2386 | 93 | 1 | 0.647 |
2400 | 92 | 1 | 0.640 |
2419 | 91 | 1 | 0.633 |
2466 | 86 | 1 | 0.625 |
2540 | 81 | 1 | 0.617 |
2583 | 76 | 1 | 0.609 |
2598 | 75 | 1 | 0.601 |
2689 | 70 | 1 | 0.593 |
2769 | 67 | 1 | 0.584 |
2796 | 65 | 1 | 0.575 |
2847 | 62 | 1 | 0.566 |
3086 | 52 | 1 | 0.555 |
3090 | 51 | 1 | 0.544 |
3170 | 45 | 1 | 0.532 |
3244 | 44 | 1 | 0.520 |
3282 | 42 | 1 | 0.507 |
3358 | 39 | 1 | 0.494 |
3395 | 37 | 1 | 0.481 |
3428 | 35 | 1 | 0.467 |
3445 | 34 | 1 | 0.453 |
3574 | 32 | 1 | 0.439 |
3584 | 29 | 1 | 0.424 |
3762 | 25 | 1 | 0.407 |
3839 | 22 | 1 | 0.389 |
3853 | 21 | 1 | 0.370 |
4079 | 14 | 1 | 0.344 |
4191 | 10 | 1 | 0.309 |
To compare two survival curves, we can use the log-rank test. The hypotheses are:
The log-rank test statistic is the standardized sum of
Under
The log-rank test was performed on different patient groups using R at the 5% level:
Chisq on 1 degrees of freedom | P-Value |
---|---|
0.4 | 0.5 |
The two-sided P-value is
Chisq on 1 degrees of freedom | P-Value |
---|---|
4.5 | 0.03 |
The two-sided P-value is
Chisq on 1 degrees of freedom | P-Value |
---|---|
27.1 | 2e-07 |
The two-sided P-value is
Chisq on 1 degrees of freedom | P-Value |
---|---|
110 | <2e-16 |
The two-sided P-value is
Suppose that besides the event time and an indicator of censoring, data contain measurements of a set of predictors
Note that in this model, the hazard function depends on time only through the baseline hazard function
The unknowns of this model are the baseline hazard function
where
The Cox PH model was fit to the data using R, setting drug, age, sex, ascites, hepatomegaly, spiders, edema, bilirubin, cholesterol, albumin, copper, alkaline phosphatase, SGOT, tryglicerides, platelets, prothrombin, and stage as predictors.
Coeff | Pr(>z) | |
---|---|---|
drug.relD-penicallamine | 1.765e-01 | 0.42003 |
Age | 7.910e-05 | 0.01304 * |
sex.relM | 3.757e-01 | 0.22720 |
ascites.relY | 5.899e-04 | 0.99881 |
hepatomegaly.relY | 5.648e-02 | 0.82575 |
spiders.relY | 7.011e-02 | 0.77839 |
edema.relS | 2.436e-01 | 0.46764 |
edema.relY | 1.146e+00 | 0.00595 ** |
Bilirubin | 8.014e-02 | 0.00229 ** |
Cholesterol | 4.731e-04 | 0.29675 |
Albumin | -7.496e-01 | 0.01572 * |
Copper | 2.442e-03 | 0.03693 * |
Alk_Phos | 5.228e-07 | 0.98974 |
SGOT | 3.710e-03 | 0.06247 |
Tryglicerides | -5.177e-04 | 0.71846 |
Platelets | 8.410e-04 | 0.47929 |
Prothrombin | 2.763e-01 | 0.01793 * |
stage.rel2 | 1.406e+00 | 0.19288 |
stage.rel3 | 1.683e+00 | 0.10968 |
stage.rel4 | 2.119e+00 | 0.04711 * |
Time | Risk | Event | Sbar |
---|---|---|---|
41 | 276 | 1 | 1.000 |
51 | 275 | 1 | 1.000 |
71 | 274 | 1 | 1.000 |
77 | 273 | 1 | 1.000 |
110 | 272 | 1 | 0.999 |
131 | 271 | 1 | 0.999 |
140 | 270 | 1 | 0.999 |
179 | 269 | 1 | 0.999 |
186 | 268 | 1 | 0.999 |
191 | 267 | 1 | 0.999 |
198 | 266 | 1 | 0.998 |
216 | 265 | 1 | 0.998 |
223 | 264 | 1 | 0.998 |
264 | 263 | 1 | 0.998 |
304 | 262 | 1 | 0.998 |
321 | 261 | 1 | 0.997 |
326 | 260 | 1 | 0.997 |
334 | 259 | 1 | 0.997 |
348 | 258 | 1 | 0.997 |
388 | 257 | 1 | 0.996 |
400 | 256 | 1 | 0.996 |
460 | 255 | 1 | 0.996 |
515 | 254 | 1 | 0.996 |
549 | 252 | 1 | 0.995 |
552 | 251 | 1 | 0.995 |
597 | 250 | 1 | 0.995 |
611 | 249 | 1 | 0.994 |
673 | 248 | 1 | 0.994 |
694 | 247 | 1 | 0.994 |
733 | 245 | 1 | 0.993 |
750 | 243 | 1 | 0.993 |
762 | 242 | 1 | 0.993 |
769 | 241 | 1 | 0.992 |
786 | 240 | 1 | 0.992 |
790 | 238 | 1 | 0.992 |
797 | 237 | 1 | 0.991 |
799 | 236 | 1 | 0.991 |
850 | 233 | 1 | 0.990 |
853 | 232 | 1 | 0.990 |
859 | 231 | 1 | 0.990 |
890 | 229 | 1 | 0.989 |
904 | 227 | 1 | 0.989 |
930 | 226 | 1 | 0.989 |
943 | 224 | 1 | 0.988 |
974 | 223 | 1 | 0.988 |
980 | 222 | 1 | 0.987 |
999 | 220 | 1 | 0.987 |
1012 | 219 | 1 | 0.986 |
1077 | 216 | 1 | 0.986 |
1080 | 215 | 1 | 0.985 |
1083 | 214 | 1 | 0.985 |
1152 | 211 | 1 | 0.984 |
1165 | 209 | 1 | 0.984 |
1170 | 208 | 1 | 0.983 |
1191 | 207 | 2 | 0.982 |
1212 | 205 | 1 | 0.982 |
1235 | 200 | 1 | 0.981 |
1297 | 195 | 1 | 0.981 |
1356 | 187 | 1 | 0.980 |
1360 | 186 | 1 | 0.979 |
1413 | 180 | 1 | 0.978 |
1427 | 177 | 1 | 0.977 |
1434 | 175 | 1 | 0.977 |
1444 | 172 | 1 | 0.976 |
1487 | 168 | 1 | 0.975 |
1536 | 165 | 1 | 0.974 |
1576 | 160 | 1 | 0.973 |
1657 | 154 | 1 | 0.972 |
1682 | 152 | 1 | 0.971 |
1690 | 151 | 2 | 0.969 |
1741 | 146 | 1 | 0.968 |
1786 | 139 | 1 | 0.967 |
1827 | 136 | 1 | 0.966 |
1847 | 133 | 1 | 0.965 |
1925 | 129 | 1 | 0.963 |
2055 | 119 | 1 | 0.962 |
2090 | 118 | 1 | 0.960 |
2105 | 117 | 1 | 0.959 |
2224 | 110 | 1 | 0.957 |
2256 | 106 | 1 | 0.955 |
2288 | 104 | 1 | 0.954 |
2297 | 102 | 1 | 0.952 |
2386 | 93 | 1 | 0.950 |
2400 | 92 | 1 | 0.948 |
2419 | 91 | 1 | 0.946 |
2466 | 86 | 1 | 0.944 |
2540 | 81 | 1 | 0.942 |
2583 | 76 | 1 | 0.940 |
2598 | 75 | 1 | 0.938 |
2689 | 70 | 1 | 0.935 |
2769 | 67 | 1 | 0.932 |
2796 | 65 | 1 | 0.930 |
2847 | 62 | 1 | 0.927 |
3086 | 52 | 1 | 0.924 |
3090 | 51 | 1 | 0.921 |
3170 | 45 | 1 | 0.917 |
3244 | 44 | 1 | 0.913 |
3282 | 42 | 1 | 0.909 |
3358 | 39 | 1 | 0.905 |
3395 | 37 | 1 | 0.900 |
3428 | 35 | 1 | 0.895 |
3445 | 34 | 1 | 0.890 |
3574 | 32 | 1 | 0.884 |
3584 | 29 | 1 | 0.877 |
3762 | 25 | 1 | 0.869 |
3839 | 22 | 1 | 0.861 |
3853 | 21 | 1 | 0.852 |
4079 | 14 | 1 | 0.839 |
4191 | 10 | 1 | 0.824 |
The fitted Cox proportional hazards model can be written as:
At the 5% level, the following predictors are significant: