lifelines¶
lifelines is an implementation of survival analysis in Python. What benefits does lifelines offer over other survival analysis implementations?
- built on top of Pandas
- internal plotting methods
- simple and intuitive API
- only focus is survival analysis
- handles right, left and interval censored data
Contents:¶
Quickstart¶
Kaplan-Meier, Nelson-Aalen, and parametric models¶
Note
For readers looking for an introduction to survival analysis, it’s recommended to start at Introduction to survival analysis
Let’s start by importing some data. We need the durations that individuals are observed for, and whether they “died” or not.
from lifelines.datasets import load_waltons
df = load_waltons() # returns a Pandas DataFrame
print(df.head())
"""
T E group
0 6 1 miR137
1 13 1 miR137
2 13 1 miR137
3 13 1 miR137
4 19 1 miR137
"""
T = df['T']
E = df['E']
T is an array of durations; E is either a boolean or binary array representing whether the “death” was observed or not (alternatively, an individual can be censored). We will fit a Kaplan-Meier model to this, implemented as KaplanMeierFitter:
from lifelines import KaplanMeierFitter
kmf = KaplanMeierFitter()
kmf.fit(T, event_observed=E) # or, more succinctly, kmf.fit(T, E)
After calling the fit() method, we have access to new properties like survival_function_ and methods like plot(). The latter is a wrapper around Pandas' internal plotting library.
kmf.survival_function_
kmf.cumulative_density_
kmf.median_
kmf.plot_survival_function() # or just kmf.plot()
Alternatively, you can plot the cumulative density function:
kmf.plot_cumulative_density()
By specifying the timeline
keyword argument in fit()
, we can change how the above models are indexed:
kmf.fit(T, E, timeline=range(0, 100, 2))
kmf.survival_function_ # index is now the same as range(0, 100, 2)
kmf.confidence_interval_ # index is now the same as range(0, 100, 2)
Instead of the Kaplan-Meier estimator, you may be interested in a parametric model. lifelines has built-in parametric models: for example, Weibull, log-normal, log-logistic, and more.
import matplotlib.pyplot as plt
from lifelines import *

fig, axes = plt.subplots(3, 3, figsize=(10, 7.5))

kmf = KaplanMeierFitter().fit(T, E, label='KaplanMeierFitter')
wbf = WeibullFitter().fit(T, E, label='WeibullFitter')
exf = ExponentialFitter().fit(T, E, label='ExponentialFitter')
lnf = LogNormalFitter().fit(T, E, label='LogNormalFitter')
llf = LogLogisticFitter().fit(T, E, label='LogLogisticFitter')
pwf = PiecewiseExponentialFitter([40, 60]).fit(T, E, label='PiecewiseExponentialFitter')
ggf = GeneralizedGammaFitter().fit(T, E, label='GeneralizedGammaFitter')

wbf.plot_survival_function(ax=axes[0][0])
exf.plot_survival_function(ax=axes[0][1])
lnf.plot_survival_function(ax=axes[0][2])
kmf.plot_survival_function(ax=axes[1][0])
llf.plot_survival_function(ax=axes[1][1])
pwf.plot_survival_function(ax=axes[1][2])
ggf.plot_survival_function(ax=axes[2][0])
Multiple groups¶
groups = df['group']
ix = (groups == 'miR137')
kmf.fit(T[~ix], E[~ix], label='control')
ax = kmf.plot()
kmf.fit(T[ix], E[ix], label='miR137')
ax = kmf.plot(ax=ax)
Alternatively, for many more groups, and in a more “pandas-esque” style:
ax = plt.subplot(111)
kmf = KaplanMeierFitter()
for name, grouped_df in df.groupby('group'):
    kmf.fit(grouped_df["T"], grouped_df["E"], label=name)
    kmf.plot(ax=ax)
Similar functionality exists for the NelsonAalenFitter
:
from lifelines import NelsonAalenFitter
naf = NelsonAalenFitter()
naf.fit(T, event_observed=E)
but instead of a survival_function_
being exposed, a cumulative_hazard_
is.
Note
Similar to scikit-learn, all statistically estimated quantities append an underscore to the property name.
Note
More detailed docs about estimating the survival function and cumulative hazard are available in Survival analysis with lifelines.
Getting data in the right format¶
Often you’ll have data that looks like:
*start_time1*, *end_time1*
*start_time2*, *end_time2*
*start_time3*, None
*start_time4*, *end_time4*
lifelines has some utility functions to transform this dataset into duration and censoring vectors. The most common one is lifelines.utils.datetimes_to_durations()
.
from lifelines.utils import datetimes_to_durations
# start_times is a vector or list of datetime objects or datetime strings
# end_times is a vector or list of (possibly missing) datetime objects or datetime strings
T, E = datetimes_to_durations(start_times, end_times, freq='h')
Perhaps you are interested in viewing the survival table given some durations and censoring vectors. The function lifelines.utils.survival_table_from_events()
will help with that:
from lifelines.utils import survival_table_from_events
table = survival_table_from_events(T, E)
print(table.head())
"""
removed observed censored entrance at_risk
event_at
0 0 0 0 163 163
6 1 1 0 0 163
7 2 1 1 0 162
9 3 3 0 0 160
13 3 3 0 0 157
"""
Survival regression¶
While the above KaplanMeierFitter
model is useful, it only gives us an “average” view of the population. Often we have specific data at the individual level that we would like to use. For this, we turn to survival regression.
Note
More detailed documentation and tutorials are available in Survival Regression.
from lifelines.datasets import load_regression_dataset
regression_dataset = load_regression_dataset()
regression_dataset.head()
The input to the fit method in a regression model is different. All the data, including durations, censoring indicators and covariates, must be contained in a Pandas DataFrame. The duration column and event-occurred column are specified in the call to fit. Below we model our regression dataset using the Cox proportional hazard model, full docs here.
from lifelines import CoxPHFitter
# Using Cox Proportional Hazards model
cph = CoxPHFitter()
cph.fit(regression_dataset, 'T', event_col='E')
cph.print_summary()
"""
<lifelines.CoxPHFitter: fitted with 200 observations, 11 censored>
duration col = 'T'
event col = 'E'
number of subjects = 200
number of events = 189
partial loglikelihood = 807.62
time fit was run = 20190731 10:22:07 UTC

coef exp(coef) se(coef) coef lower 95% coef upper 95% exp(coef) lower 95% exp(coef) upper 95%
var1 0.22 1.25 0.07 0.08 0.37 1.08 1.44
var2 0.05 1.05 0.08 0.11 0.21 0.89 1.24
var3 0.22 1.24 0.08 0.07 0.37 1.07 1.44
z p log2(p)
var1 2.99 <0.005 8.49
var2 0.61 0.54 0.89
var3 2.88 <0.005 7.97

Concordance = 0.58
Loglikelihood ratio test = 15.54 on 3 df, log2(p)=9.47
"""
cph.plot()
The same dataset, but with a Weibull accelerated failure time model. This model has two parameters (see docs here), and we can choose to model both using our covariates or just one. Below we model just the scale parameter, lambda_.
from lifelines import WeibullAFTFitter
wft = WeibullAFTFitter()
wft.fit(regression_dataset, 'T', event_col='E')
wft.print_summary()
"""
<lifelines.WeibullAFTFitter: fitted with 200 observations, 11 censored>
event col = 'E'
number of subjects = 200
number of events = 189
loglikelihood = 504.48
time fit was run = 20190731 10:19:07 UTC

coef exp(coef) se(coef) coef lower 95% coef upper 95% exp(coef) lower 95% exp(coef) upper 95%
lambda_ var1 0.08 0.92 0.02 0.13 0.04 0.88 0.97
var2 0.02 0.98 0.03 0.07 0.04 0.93 1.04
var3 0.08 0.92 0.02 0.13 0.03 0.88 0.97
_intercept 2.53 12.57 0.05 2.43 2.63 11.41 13.85
rho_ _intercept 1.09 2.98 0.05 0.99 1.20 2.68 3.32
z p log2(p)
lambda_ var1 3.45 <0.005 10.78
var2 0.56 0.57 0.80
var3 3.33 <0.005 10.15
_intercept 51.12 <0.005 inf
rho_ _intercept 20.12 <0.005 296.66

Concordance = 0.58
Loglikelihood ratio test = 19.73 on 3 df, log2(p)=12.34
"""
wft.plot()
Other AFT models are available as well, see here. An alternative regression model is Aalen's Additive model, which has time-varying hazards:
# Using Aalen's Additive model
from lifelines import AalenAdditiveFitter
aaf = AalenAdditiveFitter(fit_intercept=False)
aaf.fit(regression_dataset, 'T', event_col='E')
Along with CoxPHFitter
and WeibullAFTFitter
, after fitting you’ll have access to properties like cumulative_hazards_
and methods like plot
, predict_cumulative_hazards
, and predict_survival_function
. The latter two methods require an additional argument of individual covariates:
X = regression_dataset.drop(['E', 'T'], axis=1)
aaf.predict_survival_function(X.iloc[10:12]).plot() # get the unique survival functions of two subjects
Like the above estimators, there is also a built-in plotting method:
aaf.plot()
Note
More detailed documentation and tutorials are available in Survival Regression.
Introduction to survival analysis¶
Applications¶
Traditionally, survival analysis was developed to measure lifespans of individuals. An actuary or health professional would ask questions like “how long does this population live for?”, and answer it using survival analysis. For example, the population may be a nation's population (for actuaries), or a population stricken by a disease (in the medical professional's case). Traditionally, it was something of a morbid subject.
The analysis can be further applied to not just traditional births and deaths, but any duration. Medical professionals might be interested in the time between childbirths, where a birth in this case is the event of having a child, and a death is becoming pregnant again! (obviously, we are loose with our definitions of birth and death) Another example is users subscribing to a service: a birth is a user who joins the service, and a death is when the user leaves the service.
Censoring¶
At the time you want to make inferences about durations, it is possible, likely true, that not all the death events have occurred yet. For example, a medical professional will not wait 50 years for each individual in the study to pass away before investigating – he or she is interested in the effectiveness of improving lifetimes after only a few years, or months possibly.
The individuals in a population who have not been subject to the death event are labeled as right-censored, i.e., we did not (or cannot) view the rest of their life history due to some external circumstances. All the information we have on these individuals is their current lifetime duration (which is naturally less than their actual lifetime).
Note
There is also left-censoring and interval censoring, which are expanded on later.
A common mistake data analysts make is choosing to ignore the right-censored individuals. We will see next why this is a mistake.
Consider a case where the population is actually made up of two subpopulations, \(A\) and \(B\). Population \(A\) has a very small lifespan, say 2 months on average, and population \(B\) enjoys a much larger lifespan, say 12 months on average. We may not know this distinction beforehand. At \(t=10\), we wish to investigate the average lifespan for everyone.
In the figure below, the red lines denote the lifespan of individuals where the death event has been observed, and the blue lines denote the lifespan of the right-censored individuals (deaths have not been observed). If we are asked to estimate the average lifetime of our population, and we naively decided not to include the right-censored individuals, it is clear that we would be severely underestimating the true average lifespan.
from lifelines.plotting import plot_lifetimes
import numpy as np
from numpy.random import uniform, exponential
N = 25
CURRENT_TIME = 10
actual_lifetimes = np.array([
    exponential(12) if (uniform() < 0.5) else exponential(2) for i in range(N)
])
observed_lifetimes = np.minimum(actual_lifetimes, CURRENT_TIME)
death_observed = actual_lifetimes < CURRENT_TIME
ax = plot_lifetimes(observed_lifetimes, event_observed=death_observed)
ax.set_xlim(0, 25)
ax.vlines(10, 0, 30, lw=2, linestyles='--')
ax.set_xlabel("time")
ax.set_title("Births and deaths of our population, at $t=10$")
print("Observed lifetimes at time %d:\n" % (CURRENT_TIME), observed_lifetimes)
Observed lifetimes at time 10:
[10. 1.1 8. 10. 3.43 0.63 6.28 1.03 2.37 6.17 10.
0.21 2.71 1.25 10. 3.4 0.62 1.94 0.22 7.43 6.16 10.
9.41 10. 10.]
Furthermore, if we instead simply took the mean of all observed lifespans, including the current lifespans of the right-censored instances, we would still be underestimating the true average lifespan. Below we plot the actual lifetimes of all instances (recall we do not see this information at \(t=10\)).
ax = plot_lifetimes(actual_lifetimes, event_observed=death_observed)
ax.vlines(10, 0, 30, lw=2, linestyles='--')
ax.set_xlim(0, 25)
Survival analysis was originally developed to solve this type of problem, that is, to deal with estimation when our data is right-censored. Even in the case where all events have been observed, i.e. no censoring, survival analysis is still a very useful tool for understanding durations.
The observations need not always start at zero, either. This was done only for understanding in the above example. Consider the example where a customer entering a store is a birth: a customer can enter at any time, and not necessarily at time zero. In survival analysis, durations are relative: individuals may start at different times. (We actually only need the duration of the observation, and not necessarily the start and end time.)
We next introduce the three fundamental objects in survival analysis, the survival function, hazard function and the cumulative hazard function.
Survival function¶
Let \(T\) be a (possibly infinite, but always non-negative) random lifetime taken from the population under study. For example, the amount of time a couple is married, or the time it takes a user to enter a webpage (an infinite time if they never do). The survival function \(S(t)\) of a population is defined as
\(S(t) = \Pr(T > t)\)
In plain English: the survival function defines the probability the death event has not occurred yet at time \(t\), or equivalently, the probability of surviving past time \(t\). Note the following properties of the survival function:
- \(0 \le S(t) \le 1\)
- \(F_T(t) = 1 - S(t)\), where \(F_T(t)\) is the CDF of \(T\), which implies
- \(S(t)\) is a non-increasing function of \(t\).
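These properties are easy to verify numerically. Below is a small sketch (illustrative code, not part of lifelines) using an Exponential(1) lifetime, for which \(S(t) = e^{-t}\):

```python
import numpy as np

t = np.linspace(0, 5, 100)
S = np.exp(-t)   # survival function of an Exponential(1) lifetime
F = 1 - S        # the CDF of T

assert np.all((0 <= S) & (S <= 1))   # bounded in [0, 1]
assert np.all(np.diff(S) <= 0)       # non-increasing in t
assert np.allclose(F, 1 - S)         # F_T(t) = 1 - S(t)
```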
Here’s an example of a survival function:
Hazard function¶
We are also interested in the probability of the death event occurring at time \(t\), given that the death event has not occurred until time \(t\). Mathematically, that is:
\(\Pr(t \le T \le t + \delta t \mid T > t)\)
This quantity goes to 0 as \(\delta t\) shrinks, so we divide this by the interval \(\delta t\) (like we might do in calculus). This defines the hazard function at time \(t\), \(h(t)\):
\(h(t) = \lim_{\delta t \rightarrow 0} \dfrac{\Pr(t \le T \le t + \delta t \mid T > t)}{\delta t}\)
It can be shown that this is equal to:
\(h(t) = \dfrac{-S'(t)}{S(t)}\)
and solving this differential equation (cool, it is a differential equation!), we get:
\(S(t) = \exp\left(-\int_0^t h(z)\, \mathrm{d}z\right)\)
The integral has a more common name: the cumulative hazard function, denoted \(H(t)\). We can rewrite the above as:
\(S(t) = \exp\left(-H(t)\right)\)
What I love about the above equation is that it defines all survival functions. Notice that we can now speak either about the survival function, \(S(t)\), or the cumulative hazard function, \(H(t)\), and we can convert back and forth quite easily.
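The conversion is mechanical: \(S(t) = \exp(-H(t))\) and \(H(t) = -\log S(t)\). A minimal numpy sketch (toy values, not lifelines output):

```python
import numpy as np

t = np.linspace(0.0, 10.0, 50)
H = 0.5 * t                    # a toy cumulative hazard: constant hazard of 0.5

S = np.exp(-H)                 # survival function from the cumulative hazard
H_back = -np.log(S)            # ... and back again

assert np.allclose(H, H_back)  # the round trip is exact (up to float error)
```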
The two figures below represent the hazard and the cumulative hazard of the survival function in the figure above.
Next steps¶
Of course, we do not observe the true survival function of a population. We must use the observed data to estimate it. There are many ways to estimate the survival function and the hazard functions, which brings us to estimation using lifelines.
Estimating univariate models¶
In the previous section, we introduced the use of survival analysis, the need, and the mathematical objects on which it relies. In this article, we will work with real data and the lifelines library to estimate these mathematical objects.
Estimating the survival function using KaplanMeier¶
For this example, we will be investigating the lifetimes of political leaders around the world. A political leader, in this case, is defined by a single individual's time in office who controls the ruling regime. This political leader could be an elected president, unelected dictator, monarch, etc. The birth event is the start of the individual's tenure, and the death event is the retirement of the individual. Censoring can occur if they are a) still in office at the time of dataset compilation (2008), or b) die while in power (this includes assassinations).
For example, the Bush regime began in 2000 and officially ended in 2008 upon his retirement, thus this regime's lifespan was eight years, and there was a “death” event observed. On the other hand, the JFK regime lasted 2 years, from 1961 to 1963, and the regime's official death event was not observed – JFK died before his official retirement.
(This is an example that has gladly redefined the birth and death events, and in fact completely flips the idea upside down by using deaths as the censoring event. This is also an example where the current time is not the only cause of censoring; there are alternative events, e.g., death in office, that can be the cause of censoring.)
To estimate the survival function, we first will use the Kaplan-Meier estimate, defined:
\(\hat{S}(t) = \prod_{t_i \lt t} \dfrac{n_i - d_i}{n_i}\)
where \(d_i\) are the number of death events at time \(t_i\) and \(n_i\) is the number of subjects at risk of death just prior to time \(t_i\).
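To make the product-limit formula concrete, here is a from-scratch computation on a tiny made-up dataset (illustration only; in practice use the KaplanMeierFitter below):

```python
import numpy as np

# toy data: durations and whether the death event was observed
T = np.array([1, 2, 2, 4, 5, 5, 7])
E = np.array([1, 1, 0, 1, 1, 0, 1])

S, km = 1.0, {}
for t in np.unique(T[E == 1]):         # distinct observed death times
    d_i = ((T == t) & (E == 1)).sum()  # deaths at time t
    n_i = (T >= t).sum()               # subjects at risk just prior to t
    S *= (n_i - d_i) / n_i             # product-limit update
    km[t] = S

print(km)  # km[1] == 6/7; the estimate reaches 0 at t=7, the last observed death
```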
Let’s bring in our dataset.
from lifelines.datasets import load_dd
data = load_dd()
data.head()
democracy  regime  start_year  duration  observed  ctryname  cowcode2  politycode  un_region_name  un_continent_name  ehead  leaderspellreg 

Nondemocracy  Monarchy  1946  7  1  Afghanistan  700  700  Southern Asia  Asia  Mohammad Zahir Shah  Mohammad Zahir Shah.Afghanistan.1946.1952.Monarchy 
Nondemocracy  Civilian Dict  1953  10  1  Afghanistan  700  700  Southern Asia  Asia  Sardar Mohammad Daoud  Sardar Mohammad Daoud.Afghanistan.1953.1962.Civilian Dict 
Nondemocracy  Monarchy  1963  10  1  Afghanistan  700  700  Southern Asia  Asia  Mohammad Zahir Shah  Mohammad Zahir Shah.Afghanistan.1963.1972.Monarchy 
Nondemocracy  Civilian Dict  1973  5  0  Afghanistan  700  700  Southern Asia  Asia  Sardar Mohammad Daoud  Sardar Mohammad Daoud.Afghanistan.1973.1977.Civilian Dict 
Nondemocracy  Civilian Dict  1978  1  0  Afghanistan  700  700  Southern Asia  Asia  Nur Mohammad Taraki  Nur Mohammad Taraki.Afghanistan.1978.1978.Civilian Dict 
From the lifelines library, we’ll need the
KaplanMeierFitter
for this exercise:
from lifelines import KaplanMeierFitter
kmf = KaplanMeierFitter()
Note
Other ways to estimate the survival function in lifelines are discussed below.
For this estimation, we need the duration each leader was/has been in office, and whether or not they were observed to have left office (leaders who died in office, or who were still in office in 2008, the latest date this data was recorded, do not have observed death events).
We next use the KaplanMeierFitter method fit() to fit the model to the data. (This is similar to, and inspired by, scikit-learn's fit/predict API.) Below we fit our data with the KaplanMeierFitter:
T = data["duration"]
E = data["observed"]
kmf.fit(T, event_observed=E)
After calling the fit() method, the KaplanMeierFitter has a property called survival_function_ (again, we follow the styling of scikit-learn, and append an underscore to all properties that were computationally estimated). The property is a Pandas DataFrame, so we can call plot() on it:
kmf.survival_function_.plot()
plt.title('Survival function of political regimes');
How do we interpret this? The y-axis represents the probability a leader is still around after \(t\) years, where \(t\) years is on the x-axis. We see that very few leaders make it past 20 years in office. Of course, like all good stats, we need to report how uncertain we are about these point estimates, i.e., we need confidence intervals. They are computed in the call to fit(), and located under the confidence_interval_ property. (The method uses the exponential Greenwood confidence interval. The mathematics are found in these notes.)
Alternatively, we can call plot()
on the KaplanMeierFitter
itself to plot both the KM estimate and its confidence intervals:
kmf.plot()
The median time in office, which defines the point in time where half of the population has expired, is a property:
kmf.median_
# 4.0
Interesting that it is only four years. That means, around the world, elected leaders have a 50% chance of cessation in four years or less!
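The median is simply the first time the survival function drops to 0.5 or below, so it can be read off any survival function DataFrame by hand. The sketch below uses made-up values standing in for kmf.survival_function_:

```python
import pandas as pd

# a toy survival function indexed by time (stand-in for a fitted estimate)
sf = pd.Series([1.0, 0.8, 0.55, 0.45, 0.30],
               index=[0, 1, 2, 4, 6], name="KM_estimate")

median = sf.index[sf <= 0.5][0]  # first time the curve falls to 0.5 or below
print(median)                    # 4
```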
Let's segment on democratic regimes vs non-democratic regimes. Calling plot on either the estimate itself or the fitter object will return an axis object that can be used for plotting further estimates:
ax = plt.subplot(111)
dem = (data["democracy"] == "Democracy")
kmf.fit(T[dem], event_observed=E[dem], label="Democratic Regimes")
kmf.plot(ax=ax)
kmf.fit(T[~dem], event_observed=E[~dem], label="Non-democratic Regimes")
kmf.plot(ax=ax)
plt.ylim(0, 1);
plt.title("Lifespans of different global regimes");
We might be interested in estimating the probabilities in between some
points. We can do that with the timeline
argument. We specify the
times we are interested in and are returned a DataFrame with the
probabilities of survival at those points:
ax = plt.subplot(111)
t = np.linspace(0, 50, 51)
kmf.fit(T[dem], event_observed=E[dem], timeline=t, label="Democratic Regimes")
ax = kmf.plot(ax=ax)
print("Median survival time of democratic:", kmf.median_)
kmf.fit(T[~dem], event_observed=E[~dem], timeline=t, label="Non-democratic Regimes")
ax = kmf.plot(ax=ax)
print("Median survival time of non-democratic:", kmf.median_)
plt.ylim(0, 1)
plt.title("Lifespans of different global regimes");
"""
Median survival time of democratic: 3
Median survival time of non-democratic: 6
"""
It is incredible how much longer these non-democratic regimes exist for. A democratic regime does have a natural bias towards death though: both via elections and natural limits (the US imposes a strict eight-year limit). The median of a non-democratic regime is only about twice as large as that of a democratic regime, but the difference is apparent in the tails: if you're a non-democratic leader, and you've made it past the 10 year mark, you probably have a long life ahead. Meanwhile, a democratic leader rarely makes it past ten years, and those who do have a very short lifetime past that.
Here the difference between survival functions is very obvious, and
performing a statistical test seems pedantic. If the curves are more
similar, or we possess less data, we may be interested in performing a
statistical test. In this case, lifelines contains routines in
lifelines.statistics
to compare two survival functions. Below we
demonstrate this routine. The function lifelines.statistics.logrank_test()
is a common
statistical test in survival analysis that compares two event series’
generators. If the value returned exceeds some pre-specified value, then
we rule that the series have different generators.
from lifelines.statistics import logrank_test
results = logrank_test(T[dem], T[~dem], E[dem], E[~dem], alpha=.99)
results.print_summary()
"""
<lifelines.StatisticalResult>
t_0 = 1
null_distribution = chi squared
degrees_of_freedom = 1
alpha = 0.99

test_statistic p log2(p)
260.47 <0.005 192.23
""""
Let's compare the different types of regimes present in the dataset:
regime_types = data['regime'].unique()
for i, regime_type in enumerate(regime_types):
    ax = plt.subplot(2, 3, i + 1)
    ix = data['regime'] == regime_type
    kmf.fit(T[ix], E[ix], label=regime_type)
    kmf.plot(ax=ax, legend=False)
    plt.title(regime_type)
    plt.xlim(0, 50)
    if i == 0:
        plt.ylabel('Frac. in power after $n$ years')
plt.tight_layout()
There are alternative (and sometimes better) tests of survival functions, and we explain more here: Statistically compare two populations
Getting data into the right format¶
lifelines' data format is consistent across all estimator classes and functions: an array of individual durations, and the individuals' event observations (if any). These are often denoted T and E respectively. For example:
T = [0,3,3,2,1,2]
E = [1,1,0,0,1,1]
kmf.fit(T, event_observed=E)
The raw data is not always available in this format – lifelines
includes some helper functions to transform data formats to lifelines
format. These are located in the lifelines.utils
sublibrary. For
example, the function datetimes_to_durations()
accepts an array or
Pandas object of start times/dates, and an array or Pandas objects of
end times/dates (or None
if not observed):
from lifelines.utils import datetimes_to_durations
start_date = ['2013-10-10 0:00:00', '2013-10-09', '2013-10-10']
end_date = ['2013-10-13', '2013-10-10', None]
T, E = datetimes_to_durations(start_date, end_date, fill_date='2013-10-15')
print('T (durations): ', T)
print('E (event_observed): ', E)
T (durations): [ 3. 1. 5.]
E (event_observed): [ True True False]
The function datetimes_to_durations()
is very flexible, and has many
keywords to tinker with.
Estimating hazard rates using NelsonAalen¶
The survival function is a great way to summarize and visualize the survival dataset; however, it is not the only way. If we are curious about the hazard function \(h(t)\) of a population, we unfortunately cannot transform the Kaplan-Meier estimate – statistics doesn't work quite that well. Fortunately, there is a proper non-parametric estimator of the cumulative hazard function:
\(H(t) = \int_0^t h(z)\, \mathrm{d}z\)
The estimator for this quantity is called the Nelson-Aalen estimator:
\(\hat{H}(t) = \sum_{t_i \le t} \dfrac{d_i}{n_i}\)
where \(d_i\) is the number of deaths at time \(t_i\) and \(n_i\) is the number of susceptible individuals.
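To make the sum concrete, here is a from-scratch computation of the estimator on a tiny made-up dataset (illustration only; the NelsonAalenFitter below is what you would use in practice):

```python
import numpy as np

T = np.array([1, 2, 2, 4, 5, 5, 7])   # durations
E = np.array([1, 1, 0, 1, 1, 0, 1])   # observed deaths

H, na = 0.0, {}
for t in np.unique(T[E == 1]):         # distinct observed death times
    d_i = ((T == t) & (E == 1)).sum()  # deaths at this event time
    n_i = (T >= t).sum()               # individuals still at risk
    H += d_i / n_i                     # Nelson-Aalen increment
    na[t] = H

print(na)  # na[1] == 1/7; the estimate only ever increases
```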
In lifelines, this estimator is available as the NelsonAalenFitter
. Let’s use the regime dataset from above:
T = data["duration"]
E = data["observed"]
from lifelines import NelsonAalenFitter
naf = NelsonAalenFitter()
naf.fit(T,event_observed=E)
After fitting, the class exposes the property cumulative_hazard_ as a DataFrame:
print(naf.cumulative_hazard_.head())
naf.plot()
   NA_estimate
0     0.000000
1     0.325912
2     0.507356
3     0.671251
4     0.869867

[5 rows x 1 columns]
The cumulative hazard is less intuitive to interpret than the survival function, but the hazard function is the basis of more advanced techniques in survival analysis. Recall that we are estimating cumulative hazard functions, \(H(t)\). (Why? The sum of estimates is much more stable than the point-wise estimates.) Thus we know the rate of change of this curve is an estimate of the hazard function.
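A crude way to see this: difference the cumulative hazard values to estimate the slope, i.e. the hazard. The sketch below uses made-up cumulative hazard values (the kernel-smoothed estimate discussed later is the better tool):

```python
import numpy as np

timeline = np.arange(0, 6)                          # observation times
H = np.array([0.0, 0.33, 0.51, 0.67, 0.87, 1.02])   # toy cumulative hazard values

crude_hazard = np.diff(H) / np.diff(timeline)       # slope between successive times
print(crude_hazard)  # largest at the start: the hazard begins high and falls
```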
Looking at figure above, it looks like the hazard starts off high and gets smaller (as seen by the decreasing rate of change). Let’s break the regimes down between democratic and nondemocratic, during the first 20 years:
Note
We are using the loc
argument in the call to plot
here: it accepts a slice
and plots only points within that slice.
naf.fit(T[dem], event_observed=E[dem], label="Democratic Regimes")
ax = naf.plot(loc=slice(0, 20))
naf.fit(T[~dem], event_observed=E[~dem], label="Nondemocratic Regimes")
naf.plot(ax=ax, loc=slice(0, 20))
plt.title("Cumulative hazard function of different global regimes");
Looking at the rates of change, I would say that both political philosophies have a constant hazard, albeit democratic regimes have a much higher constant hazard.
Smoothing the hazard function¶
Interpretation of the cumulative hazard function can be difficult – it is not how we usually interpret functions. On the other hand, most survival analysis is done using the cumulative hazard function, so understanding it is recommended.
Alternatively, we can derive the moreinterpretable hazard function, but
there is a catch. The derivation involves a kernel smoother (to smooth
out the differences of the cumulative hazard function), and this requires
us to specify a bandwidth parameter that controls the amount of
smoothing. This functionality is in the smoothed_hazard_()
and smoothed_hazard_confidence_intervals_()
methods. Why methods?
They require an argument representing the bandwidth.
There is also a plot_hazard()
function (that also requires a
bandwidth
keyword) that will plot the estimate plus the confidence
intervals, similar to the traditional plot()
functionality.
bandwidth = 3.
naf.fit(T[dem], event_observed=E[dem], label="Democratic Regimes")
ax = naf.plot_hazard(bandwidth=bandwidth)
naf.fit(T[~dem], event_observed=E[~dem], label="Nondemocratic Regimes")
naf.plot_hazard(ax=ax, bandwidth=bandwidth)
plt.title("Hazard function of different global regimes | bandwidth=%.1f" % bandwidth);
plt.ylim(0, 0.4)
plt.xlim(0, 25);
It is more clear here which group has the higher hazard, and non-democratic regimes appear to have a constant hazard.
There is no obvious way to choose a bandwidth, and different bandwidths produce different inferences, so it’s best to be very careful here. My advice: stick with the cumulative hazard function.
bandwidth = 8.0
naf.fit(T[dem], event_observed=E[dem], label="Democratic Regimes")
ax = naf.plot_hazard(bandwidth=bandwidth)
naf.fit(T[~dem], event_observed=E[~dem], label="Nondemocratic Regimes")
naf.plot_hazard(ax=ax, bandwidth=bandwidth)
plt.title("Hazard function of different global regimes | bandwidth=%.1f" % bandwidth);
Estimating cumulative hazards using parametric models¶
Fitting to a Weibull model¶
Note
The parameterization of the Weibull and Exponential model changed in lifelines 0.19.0, released in Feb. 2019.
Another very popular model for survival data is the Weibull model. In contrast to the Nelson-Aalen estimator, this model is a parametric model, meaning it has a functional form with parameters that we are fitting the data to. (The Nelson-Aalen estimator has no parameters to fit.) The survival function looks like:
\(S(t) = \exp\left(-\left(\frac{t}{\lambda}\right)^\rho\right), \quad \lambda > 0, \rho > 0\)
A priori, we do not know what \(\lambda\) and \(\rho\) are, but we use the data on hand to estimate these parameters. We model and estimate the cumulative hazard rate instead of the survival function (this is different from the Kaplan-Meier estimator):
\(H(t) = \left(\frac{t}{\lambda}\right)^\rho\)
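Under this parameterization, \(H(t) = (t/\lambda)^\rho\) and \(S(t) = \exp(-(t/\lambda)^\rho)\). A quick numeric sketch with made-up (not fitted) parameter values:

```python
import numpy as np

lambda_, rho_ = 50.0, 2.0        # toy scale and shape parameters (not fitted values)
t = np.linspace(0, 100, 200)

H = (t / lambda_) ** rho_        # Weibull cumulative hazard
S = np.exp(-H)                   # the corresponding survival function

assert np.isclose(S[0], 1.0)     # everyone is alive at t = 0
assert np.all(np.diff(H) >= 0)   # the cumulative hazard never decreases
```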
In lifelines, estimation is available using the WeibullFitter
class. The plot()
method will plot the cumulative hazard.
from lifelines import WeibullFitter
from lifelines.datasets import load_waltons
data = load_waltons()
T = data['T']
E = data['E']
wf = WeibullFitter().fit(T, E)
wf.print_summary()
wf.plot()
"""
<lifelines.WeibullFitter: fitted with 163 observations, 7 censored>
number of subjects = 163
number of events = 156
loglikelihood = 672.062
hypothesis = lambda != 1, rho != 1

coef se(coef) lower 0.95 upper 0.95 p log2(p)
lambda_ 0.02 0.00 0.02 0.02 <0.005 inf
rho_ 3.45 0.24 2.97 3.93 <0.005 76.83
"""
Other parametric models: Exponential, LogLogistic & LogNormal¶
Similarly, there are other parametric models in lifelines. Generally, which parametric model to choose is determined by either knowledge of the distribution of durations, or some sort of model goodness-of-fit. Below are the built-in parametric models, and the Nelson-Aalen non-parametric model, fit to the same data.
import matplotlib.pyplot as plt
from lifelines import (WeibullFitter, ExponentialFitter,
                       LogNormalFitter, LogLogisticFitter, NelsonAalenFitter,
                       PiecewiseExponentialFitter, GeneralizedGammaFitter)
from lifelines.datasets import load_waltons

data = load_waltons()
fig, axes = plt.subplots(3, 3, figsize=(10, 7.5))
T = data['T']
E = data['E']
wbf = WeibullFitter().fit(T, E, label='WeibullFitter')
exf = ExponentialFitter().fit(T, E, label='ExponentialFitter')
lnf = LogNormalFitter().fit(T, E, label='LogNormalFitter')
naf = NelsonAalenFitter().fit(T, E, label='NelsonAalenFitter')
llf = LogLogisticFitter().fit(T, E, label='LogLogisticFitter')
pwf = PiecewiseExponentialFitter([40, 60]).fit(T, E, label='PiecewiseExponentialFitter')
gg = GeneralizedGammaFitter().fit(T, E, label='GeneralizedGammaFitter')
wbf.plot_cumulative_hazard(ax=axes[0][0])
exf.plot_cumulative_hazard(ax=axes[0][1])
lnf.plot_cumulative_hazard(ax=axes[0][2])
naf.plot_cumulative_hazard(ax=axes[1][0])
llf.plot_cumulative_hazard(ax=axes[1][1])
pwf.plot_cumulative_hazard(ax=axes[1][2])
gg.plot_cumulative_hazard(ax=axes[2][0])
lifelines can also be used to define your own parametric model. There is a tutorial on this available, see Piecewise Exponential Models and Creating Custom Models.
Parametric models can also be used to create and plot the survival function. Below we compare the parametric models to the non-parametric Kaplan-Meier estimate:
from lifelines import KaplanMeierFitter
fig, axes = plt.subplots(3, 3, figsize=(10, 7.5))
T = data['T']
E = data['E']
kmf = KaplanMeierFitter().fit(T, E, label='KaplanMeierFitter')
wbf = WeibullFitter().fit(T, E, label='WeibullFitter')
exf = ExponentialFitter().fit(T, E, label='ExponentialFitter')
lnf = LogNormalFitter().fit(T, E, label='LogNormalFitter')
llf = LogLogisticFitter().fit(T, E, label='LogLogisticFitter')
pwf = PiecewiseExponentialFitter([40, 60]).fit(T, E, label='PiecewiseExponentialFitter')
gg = GeneralizedGammaFitter().fit(T, E, label='GeneralizedGammaFitter')
wbf.plot_survival_function(ax=axes[0][0])
exf.plot_survival_function(ax=axes[0][1])
lnf.plot_survival_function(ax=axes[0][2])
kmf.plot_survival_function(ax=axes[1][0])
llf.plot_survival_function(ax=axes[1][1])
pwf.plot_survival_function(ax=axes[1][2])
gg.plot_survival_function(ax=axes[2][0])
With parametric models, we have a functional form that allows us to extend the survival function (or hazard or cumulative hazard) past our maximum observed duration. This is called extrapolation. We can do this in a few ways.
import numpy as np

timeline = np.linspace(0, 100, 200)

# directly compute the survival function; these return a pandas Series
wbf = WeibullFitter().fit(T, E)
wbf.survival_function_at_times(timeline)
wbf.hazard_at_times(timeline)
wbf.cumulative_hazard_at_times(timeline)
# use the `timeline` kwarg in `fit`
# by default, all functions and properties will use
# these values provided
wbf = WeibullFitter().fit(T, E, timeline=timeline)
wbf.plot_survival_function()
To aid model selection, lifelines provides QQ plots; see Selecting a parametric model using QQ plots.
Other types of censoring¶
Left censored data and nondetection¶
We’ve mainly been focusing on right-censoring, which describes cases where we do not observe the death event. This situation is the most common one. Alternatively, there are situations where we do not observe the birth event occurring. Consider the case where a doctor sees a delayed onset of symptoms of an underlying disease. The doctor is unsure when the disease was contracted (birth), but knows it was before the discovery.
Another situation where we have left-censored data is when measurements have only an upper bound; that is, the measurement instruments could only detect that the value was less than some upper bound. This bound is often called the limit of detection (LOD). In practice, there could be more than one LOD. One very important statistical lesson: don’t “fill in” this value naively. It’s tempting to use something like one-half the LOD, but this will cause lots of bias in downstream analysis. An example dataset is below:
Note
The recommended API for modeling left-censored data using parametric models changed in version 0.21.0. Below is the recommended API.
from lifelines.datasets import load_nh4
df = load_nh4()[['NH4.Orig.mg.per.L', 'NH4.mg.per.L', 'Censored']]
print(df.head())
"""
NH4.Orig.mg.per.L NH4.mg.per.L Censored
1 <0.006 0.006 True
2 <0.006 0.006 True
3 0.006 0.006 False
4 0.016 0.016 False
5 <0.006 0.006 True
"""
lifelines has support for left-censored datasets in most univariate models, including the KaplanMeierFitter class, via the lifelines.fitters.kaplan_meier_fitter.KaplanMeierFitter.fit_left_censoring() method.
T, E = df['NH4.mg.per.L'], ~df['Censored']
kmf = KaplanMeierFitter()
kmf.fit_left_censoring(T, E)
Instead of producing a survival function, left-censored data analysis is more interested in the cumulative density function. This is available as the lifelines.fitters.kaplan_meier_fitter.KaplanMeierFitter.cumulative_density_ property after fitting the data.
print(kmf.cumulative_density_.head())
kmf.plot()  # will plot the CDF
plt.xlabel("Concentration of NH_4")
"""
KM_estimate
timeline
0.000 0.379897
0.006 0.401002
0.007 0.464319
0.008 0.478828
0.009 0.536868
"""
Alternatively, you can use a parametric model to model the data. This allows you to “peer” below the LOD; however, using a parametric model means you need to correctly specify the distribution. You can use plots like QQ plots to help invalidate some distributions; see Selecting a parametric model using QQ plots.
from lifelines import *
from lifelines.plotting import qq_plot
fig, axes = plt.subplots(3, 2, figsize=(9, 9))
timeline = np.linspace(0, 0.25, 100)
wf = WeibullFitter().fit_left_censoring(T, E, label="Weibull", timeline=timeline)
lnf = LogNormalFitter().fit_left_censoring(T, E, label="Log Normal", timeline=timeline)
lgf = LogLogisticFitter().fit_left_censoring(T, E, label="Log Logistic", timeline=timeline)
# plot what we just fit, along with the KMF estimate
kmf.plot_cumulative_density(ax=axes[0][0], ci_show=False)
wf.plot_cumulative_density(ax=axes[0][0], ci_show=False)
qq_plot(wf, ax=axes[0][1])
kmf.plot_cumulative_density(ax=axes[1][0], ci_show=False)
lnf.plot_cumulative_density(ax=axes[1][0], ci_show=False)
qq_plot(lnf, ax=axes[1][1])
kmf.plot_cumulative_density(ax=axes[2][0], ci_show=False)
lgf.plot_cumulative_density(ax=axes[2][0], ci_show=False)
qq_plot(lgf, ax=axes[2][1])
Based on the above, the log-normal distribution seems to fit well, and the Weibull not very well at all.
Interval censored data¶
Data can also be interval censored. An example of this is periodically recording the population of micro-organisms as they die off. Their deaths are interval censored because we know a subject died between two observation periods. New to lifelines in version 0.21.0, all parametric models have support for interval censored data.
Note
The API for fit_interval_censoring is different than for right- and left-censored data.
from lifelines.datasets import load_diabetes
df = load_diabetes()
wf = WeibullFitter().fit_interval_censoring(lower_bound=df['left'], upper_bound=df['right'])
Another example of using lifelines for interval censored data is located here.
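Under the hood, interval censoring changes each subject's likelihood contribution: a death known only to lie in \((l, r]\) contributes \(S(l) - S(r)\). A minimal NumPy sketch of this log-likelihood for a Weibull model (my own illustration of the idea, not the lifelines internals):

```python
import numpy as np

def weibull_sf(t, lam, rho):
    # Weibull survival function, S(t) = exp(-(t / lam) ** rho); S(0) = 1
    return np.exp(-(np.asarray(t, dtype=float) / lam) ** rho)

def interval_censored_loglik(lower, upper, lam, rho):
    """Each subject contributes log(S(lower) - S(upper)).

    A right-censored subject has upper = inf, so it contributes log S(lower).
    """
    contributions = weibull_sf(lower, lam, rho) - weibull_sf(upper, lam, rho)
    return np.log(contributions).sum()

# e.g. three subjects: died in (2, 5], died in (1, 3], still alive at t=6
lower = np.array([2.0, 1.0, 6.0])
upper = np.array([5.0, 3.0, np.inf])
ll = interval_censored_loglik(lower, upper, lam=4.0, rho=1.5)
```

Maximizing a likelihood of this shape over the parameters is what fit_interval_censoring does for you.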
Left truncated (late entry) data¶
Another form of bias that can be introduced into a dataset is called left-truncation (or late entry). Left-truncation can occur in many situations. One situation is when individuals may have the opportunity to die before entering the study. For example, if you are measuring time to death of prisoners in prison, the prisoners will enter the study at different ages. So it’s possible there are some counterfactual individuals who would have entered your study (that is, gone to prison), but instead died early.
All univariate fitters, like KaplanMeierFitter and any parametric models, have an optional argument for entry, which is an array of equal size to the duration array. It describes the time between actual “birth” (or “exposure”) and entering the study.
Note
Nothing changes in the duration array: it still measures time from “birth” to time exited study (either by death or censoring). That is, durations refer to the absolute death time rather than a duration relative to the study entry.
Another situation with left-truncation occurs when subjects are exposed before entry into the study. For example, a study of time to all-cause mortality of AIDS patients that recruited individuals previously diagnosed with AIDS, possibly years before. In our example below we will use a dataset like this, called the Multicenter AIDS Cohort Study. In the figure below, we plot the lifetimes of subjects. A solid line is when the subject was under our observation, and a dashed line represents the unobserved period between diagnosis and study entry. A solid dot at the end of the line represents death.
from lifelines.datasets import load_multicenter_aids_cohort_study
from lifelines.plotting import plot_lifetimes
df = load_multicenter_aids_cohort_study()
plot_lifetimes(
    df["T"] - df["W"],
    event_observed=df["D"],
    entry=df["W"],
    event_observed_color="#383838",
    event_censored_color="#383838",
    left_truncated=True,
)
plt.ylabel("Patient Number")
plt.xlabel("Years from AIDS diagnosis")
So subject #77, the subject at the top, was diagnosed with AIDS 7.5 years ago, but wasn’t in our study for the first 4.5 years. From this point of view, why can’t we “fill in” the dashed lines and say, for example, “subject #77 lived for 7.5 years”? If we did this, we would severely underestimate the chance of dying early on after diagnosis. Why? It’s possible that there were individuals who were diagnosed and then died shortly after, and never had a chance to enter our study. If we had managed to observe them, however, they would have depressed the survival function early on. Thus, “filling in” the dashed lines makes us overconfident about what occurs in the early period after diagnosis. We can see this below when we model the survival function with and without taking into account late entries.
from lifelines import KaplanMeierFitter
kmf = KaplanMeierFitter()
kmf.fit(df["T"], event_observed=df["D"], entry=df["W"], label='modeling late entries')
ax = kmf.plot()
kmf.fit(df["T"], event_observed=df["D"], label='ignoring late entries')
kmf.plot(ax=ax)
Piecewise exponential models and creating custom models¶
This section will be easier if we recall our three mathematical “creatures” and the relationships between them. First is the survival function, \(S(t)\), that represents the probability of living past some time, \(t\). Next is the always non-negative and non-decreasing cumulative hazard function, \(H(t)\). Its relation to \(S(t)\) is:

\[S(t) = \exp\left(-H(t)\right)\]

Finally, the hazard function, \(h(t)\), is the derivative of the cumulative hazard:

\[h(t) = \frac{dH(t)}{dt}\]

which has the immediate relation to the survival function:

\[S(t) = \exp\left(-\int_0^t h(s) \, ds\right)\]

Notice that any of the three absolutely defines the other two. Some situations make it easier to define one versus the others. For example, in the Cox model, it’s easiest to work with the hazard, \(h(t)\). In this section on parametric univariate models, it’ll be easiest to work with the cumulative hazard. This is because of an asymmetry in math: derivatives are much easier to compute than integrals. So, if we define the cumulative hazard, both the hazard and survival function are much easier to reason about, versus if we define the hazard and ask questions about the other two.
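These relationships are easy to check numerically. The sketch below uses a Weibull-shaped cumulative hazard (chosen arbitrarily for illustration) and plain NumPy, independent of lifelines:

```python
import numpy as np

# an arbitrary, known cumulative hazard: H(t) = (t / lam) ** rho
lam, rho = 2.0, 1.5
t = np.linspace(0.01, 10, 2000)

H = (t / lam) ** rho       # cumulative hazard
S = np.exp(-H)             # survival function, S(t) = exp(-H(t))
h = np.gradient(H, t)      # hazard, h(t) = dH/dt, via finite differences

# the analytic hazard for this H, to compare against
h_exact = (rho / lam) * (t / lam) ** (rho - 1)
```

Differentiating `H` numerically recovers the analytic hazard, and integrating the hazard back up recovers `H` (and hence `S`), illustrating that any one of the three determines the other two.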
First, let’s revisit some simpler parametric models.
The Exponential model¶
Recall that the Exponential model has a constant hazard, that is:

\[h(t) = \frac{1}{\lambda}\]

which implies that the cumulative hazard, \(H(t)\), has a pretty simple form: \(H(t) = \frac{t}{\lambda}\). Below we fit this model to some survival data.
[1]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from lifelines.datasets import load_waltons
waltons = load_waltons()
T, E = waltons['T'], waltons['E']
[2]:
from lifelines import ExponentialFitter
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))
epf = ExponentialFitter().fit(T, E)
epf.plot_hazard(ax=ax[0])
epf.plot_cumulative_hazard(ax=ax[1])
ax[0].set_title("hazard"); ax[1].set_title("cumulative_hazard")
epf.print_summary(3)
<lifelines.ExponentialFitter: fitted with 163 observations, 7 censored>
number of subjects = 163
number of events = 156
log-likelihood = -771.913
hypothesis = lambda_ != 1

---
          coef  se(coef)  lower 0.95  upper 0.95        p  -log2(p)
lambda_ 51.831     4.149      43.699      59.963  <0.0005   112.218
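As a side note, the exponential model's MLE has a closed form worth knowing: with right-censored data and the \(H(t) = t/\lambda\) parameterization above, \(\hat{\lambda}\) is the total observed time divided by the number of events. A quick simulation check (synthetic data, not the Waltons dataset):

```python
import numpy as np

rng = np.random.default_rng(0)

true_lam = 50.0
n = 5_000
death_times = rng.exponential(true_lam, size=n)
censor_times = rng.exponential(100.0, size=n)

T = np.minimum(death_times, censor_times)   # observed durations
E = death_times <= censor_times             # event observed indicator

# closed-form MLE under H(t) = t / lambda
lam_hat = T.sum() / E.sum()
```

With this much data, `lam_hat` lands close to the true value of 50.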
This model does a poor job of fitting our data. If I fit a non-parametric model, like the Nelson-Aalen model, to this data, the Exponential’s lack of fit is very obvious.
[3]:
from lifelines import NelsonAalenFitter
ax = epf.plot(figsize=(8,5))
naf = NelsonAalenFitter().fit(T, E)
ax = naf.plot(ax=ax)
plt.legend()
[3]:
<matplotlib.legend.Legend at 0x11b00cfd0>
It should be clear that the single-parameter model is just averaging the hazards over the entire time period. In reality though, the true hazard rate exhibits some complex non-linear behaviour.
Piecewise Exponential models¶
What if we could break our model into different time periods, and fit an exponential model to each of those? For example, we define the hazard as:

\[h(t) = \begin{cases} \lambda_0^{-1}, & t \le \tau_0 \\ \lambda_1^{-1}, & \tau_0 < t \le \tau_1 \\ \lambda_2^{-1}, & \tau_1 < t \end{cases}\]
This model should be flexible enough to fit better to our dataset.
The cumulative hazard is only slightly more complicated, but not too much, and can still be defined in Python. In lifelines, univariate models are constructed such that one only needs to define the cumulative hazard model with the parameters of interest, and all the hard work of fitting, creating confidence intervals, plotting, etc. is taken care of.
For example, lifelines has implemented the PiecewiseExponentialFitter model. Internally, the code is a single function that defines the cumulative hazard. The user specifies where they believe the “breaks” are, and lifelines estimates the best \(\lambda_i\).
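Concretely, the cumulative hazard such a model works with can be sketched in plain NumPy: the hazard is a constant \(1/\lambda_i\) on each interval, so the cumulative hazard is a continuous, piecewise-linear function (the breakpoints and \(\lambda\) values below are made up for illustration, not the fitted ones):

```python
import numpy as np

def piecewise_exp_cumulative_hazard(t, breakpoints, lambdas):
    """Cumulative hazard of a piecewise exponential model.

    The hazard is 1/lambdas[i] on the i-th interval defined by breakpoints,
    so len(lambdas) must be len(breakpoints) + 1.
    """
    t = np.asarray(t, dtype=float)
    edges = np.concatenate([[0.0], breakpoints, [np.inf]])
    H = np.zeros_like(t)
    for lo, hi, lam in zip(edges[:-1], edges[1:], lambdas):
        # add the time spent in this interval, clipped to [0, hi - lo]
        H += np.clip(t - lo, 0.0, hi - lo) / lam
    return H

H = piecewise_exp_cumulative_hazard([10, 50, 70], breakpoints=[40, 60], lambdas=[160., 31., 4.7])
```

Because each interval contributes the time spent in it divided by its \(\lambda\), the function is automatically non-decreasing and continuous at the breakpoints.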
[4]:
from lifelines import PiecewiseExponentialFitter
# looking at the above plot, I think there may be breaks at t=40 and t=60.
pf = PiecewiseExponentialFitter(breakpoints=[40, 60]).fit(T, E)
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))
ax = pf.plot(ax=axs[1])
pf.plot_hazard(ax=axs[0])
ax = naf.plot(ax=ax, ci_show=False)
axs[0].set_title("hazard"); axs[1].set_title("cumulative_hazard")
pf.print_summary(3)
<lifelines.PiecewiseExponentialFitter: fitted with 163 observations, 7 censored>
number of subjects = 163
number of events = 156
log-likelihood = -647.118
hypothesis = lambda_0_ != 1, lambda_1_ != 1, lambda_2_ != 1

---
              coef  se(coef)  lower 0.95  upper 0.95        p  -log2(p)
lambda_0_  162.989    27.144     109.787     216.191  <0.0005    28.630
lambda_1_   31.366     4.043      23.442      39.290  <0.0005    43.957
lambda_2_    4.686     0.624       3.462       5.910  <0.0005    28.055
We can see a much better fit with this model. A quantitative measure of fit is to compare the log-likelihood between the exponential model and the piecewise exponential model (higher is better). The log-likelihood went from -772 to -647, respectively. We could keep going and add more and more breakpoints, but that would end up overfitting to the data.
Univariate models in lifelines¶
I mentioned that the PiecewiseExponentialFitter was implemented using only its cumulative hazard function. This is not a lie. lifelines has very general semantics for univariate fitters. For example, this is how the entire ExponentialFitter is implemented:
class ExponentialFitter(ParametericUnivariateFitter):

    _fitted_parameter_names = ["lambda_"]

    def _cumulative_hazard(self, params, times):
        lambda_ = params[0]
        return times / lambda_
We only need to specify the cumulative hazard function because of the 1:1:1 relationship between the cumulative hazard function, the survival function, and the hazard rate. From there, lifelines handles the rest.
Defining our own survival models¶
To show off the flexibility of lifelines univariate models, we’ll create a brand new, never-before-seen, survival model. Looking at the Nelson-Aalen fit, the cumulative hazard looks like there might be an asymptote at \(t=80\). This may correspond to an absolute upper limit on subjects’ lives. Let’s start with that functional form:

\[H_1(t; \alpha) = \frac{\alpha}{80 - t}\]

We subscript \(1\) because we’ll investigate other models. In a lifelines univariate model, this is defined in the following code.
Important: in order to compute derivatives, you must use the numpy imported from the autograd library. This is a thin wrapper around the original numpy. Note the import autograd.numpy as np below.
[5]:
from lifelines.fitters import ParametericUnivariateFitter

import autograd.numpy as np

class InverseTimeHazardFitter(ParametericUnivariateFitter):

    # we tell the model what we want the names of the unknown parameters to be
    _fitted_parameter_names = ['alpha_']

    # this is the only function we need to define. It always takes two arguments:
    #   params: an iterable that unpacks the parameters you'll need in the order of _fitted_parameter_names
    #   times: a vector of times that will be passed in.
    def _cumulative_hazard(self, params, times):
        alpha = params[0]
        return alpha / (80 - times)
[6]:
itf = InverseTimeHazardFitter()
itf.fit(T, E)
itf.print_summary()
ax = itf.plot(figsize=(8,5))
ax = naf.plot(ax=ax, ci_show=False)
plt.legend()
<lifelines.InverseTimeHazardFitter: fitted with 163 observations, 7 censored>
number of subjects = 163
number of events = 156
log-likelihood = -697.84
hypothesis = alpha_ != 1

---
         coef  se(coef)  lower 0.95  upper 0.95       p  -log2(p)
alpha_  21.51      1.72       18.13       24.88  <0.005    106.22
[6]:
<matplotlib.legend.Legend at 0x11b392400>
The best fit of the model to the data is:

\[\hat{H}_1(t) = \frac{21.51}{80 - t}\]

Our choice of 80 as an asymptote was maybe mistaken, so let’s allow the asymptote to be another parameter:

\[H_2(t; \alpha, \beta) = \frac{\alpha}{\beta - t}\]

If we define the model this way, we need to add a bound on the values that \(\beta\) can take: obviously it can’t be smaller than or equal to the maximum observed duration. Generally, the cumulative hazard must be positive and non-decreasing; otherwise the model fit will hit convergence problems.
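One quick sanity check before handing a custom cumulative hazard to the optimizer (this helper is my own sketch, not part of lifelines) is to evaluate it on a grid and confirm it is non-negative and non-decreasing for plausible parameter values:

```python
import numpy as np

def is_valid_cumulative_hazard(H, times):
    """Evaluate a candidate cumulative hazard on a time grid and check that
    it is non-negative and non-decreasing there (necessary, not sufficient)."""
    vals = H(np.asarray(times, dtype=float))
    return bool(np.all(vals >= 0) and np.all(np.diff(vals) >= 0))

grid = np.linspace(0.1, 74.9, 500)

# valid while the pole at t = beta lies beyond every observed time...
ok = is_valid_cumulative_hazard(lambda t: 16.5 / (76.0 - t), grid)

# ...but not when beta falls inside the observed range: past the pole
# the function changes sign, so the check fails
bad = is_valid_cumulative_hazard(lambda t: 16.5 / (50.0 - t), grid)
```

This is exactly the kind of violation the `_bounds` attribute below is meant to rule out.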
[7]:
class TwoParamInverseTimeHazardFitter(ParametericUnivariateFitter):

    _fitted_parameter_names = ['alpha_', 'beta_']

    # Sequence of (min, max) pairs for each element in x. None is used to specify no bound
    _bounds = [(0, None), (75.0001, None)]

    def _cumulative_hazard(self, params, times):
        alpha, beta = params
        return alpha / (beta - times)
[8]:
two_f = TwoParamInverseTimeHazardFitter()
two_f.fit(T, E)
two_f.print_summary()
ax = itf.plot(ci_show=False, figsize=(8,5))
ax = naf.plot(ax=ax, ci_show=False)
two_f.plot(ax=ax)
plt.legend()
<lifelines.TwoParamInverseTimeHazardFitter: fitted with 163 observations, 7 censored>
number of subjects = 163
number of events = 156
log-likelihood = -685.57
hypothesis = alpha_ != 1, beta_ != 76.0001

---
        coef  se(coef)  lower 0.95  upper 0.95       p  -log2(p)
alpha_ 16.50      1.51       13.55       19.46  <0.005     79.98
beta_  76.55      0.38       75.80       77.30    0.15      2.73
[8]:
<matplotlib.legend.Legend at 0x11b6b0c50>
From the output, we see that 76.55 is the suggested asymptote, that is:

\[\hat{H}_2(t) = \frac{16.50}{76.55 - t}\]

The curve also appears to track the Nelson-Aalen model better, too. Let’s try one additional parameter, \(\gamma\), some sort of measure of decay:

\[H_3(t; \alpha, \beta, \gamma) = \frac{\alpha}{(\beta - t)^\gamma}\]
[9]:
from lifelines.fitters import ParametericUnivariateFitter
class ThreeParamInverseTimeHazardFitter(ParametericUnivariateFitter):

    _fitted_parameter_names = ['alpha_', 'beta_', 'gamma_']
    _bounds = [(0, None), (75.0001, None), (0, None)]

    # this is the only function we need to define. It always takes two arguments:
    #   params: an iterable that unpacks the parameters you'll need in the order of _fitted_parameter_names
    #   times: a numpy vector of times that will be passed in by the optimizer
    def _cumulative_hazard(self, params, times):
        a, b, c = params
        return a / (b - times) ** c
[10]:
three_f = ThreeParamInverseTimeHazardFitter()
three_f.fit(T, E)
three_f.print_summary()
ax = itf.plot(ci_show=False, figsize=(8,5))
ax = naf.plot(ax=ax, ci_show=False)
ax = two_f.plot(ax=ax, ci_show=False)
ax = three_f.plot(ax=ax)
plt.legend()
<lifelines.ThreeParamInverseTimeHazardFitter: fitted with 163 observations, 7 censored>
number of subjects = 163
number of events = 156
log-likelihood = -649.38
hypothesis = alpha_ != 1, beta_ != 76.0001, gamma_ != 1

---
               coef    se(coef)   lower 0.95  upper 0.95       p  -log2(p)
alpha_   1588776.28  3775137.44  -5810357.13  8987909.70    0.67      0.57
beta_        100.88        5.88        89.35      112.41  <0.005     15.38
gamma_         3.83        0.50         2.85        4.81  <0.005     25.82
[10]:
<matplotlib.legend.Legend at 0x11a875198>
Our new asymptote is at \(t\approx 100, \text{c.i.}=(89, 112)\). The model appears to fit the early times better than the previous models as well; however, our \(\alpha\) parameter has much more uncertainty now. Continuing to add parameters isn’t advisable, as we would overfit the data.
Why fit parametric models anyway? Taking a step back, we are fitting parametric models and comparing them to the non-parametric Nelson-Aalen. Why not just always use the Nelson-Aalen model?
 Sometimes we have scientific motivations to use a parametric model. That is, using domain knowledge, we may know the system has a parametric form and we wish to fit to that model.
 In a parametric model, we are borrowing information from all observations to determine the best parameters. To make this more clear, imagine taking a single observation and changing its value wildly. The fitted parameters would change as well. On the other hand, imagine doing the same for a non-parametric model. In this case, only the local survival function or hazard function would change. Because parametric models can borrow information from all observations, and there are far fewer unknowns than in a non-parametric model, parametric models are said to be more statistically efficient.
 Extrapolation: non-parametric models are not easily extended to values outside the observed data. On the other hand, parametric models have no problem with this. However, extrapolation outside observed values is a very dangerous activity.
[11]:
fig, axs = plt.subplots(3, figsize=(7, 8), sharex=True)
new_timeline = np.arange(0, 85)
three_f = ThreeParamInverseTimeHazardFitter().fit(T, E, timeline=new_timeline)
three_f.plot_hazard(label='hazard', ax=axs[0]).legend()
three_f.plot_cumulative_hazard(label='cumulative hazard', ax=axs[1]).legend()
three_f.plot_survival_function(label='survival function', ax=axs[2]).legend()
fig.subplots_adjust(hspace=0)
# Hide x labels and tick labels for all but bottom plot.
for ax in axs:
    ax.label_outer()
3-parameter Weibull distribution¶
We can easily extend the built-in Weibull model (lifelines.WeibullFitter) to include a new location parameter:

\[H(t; \lambda, \rho, \theta) = \left(\frac{t - \theta}{\lambda}\right)^\rho\]

(When \(\theta = 0\), this is just the 2-parameter case again.) In lifelines custom models, this looks like:
[12]:
import autograd.numpy as np
from autograd.scipy.stats import norm
# I'm shifting this to exaggerate the effect
T = T + 10
class ThreeParameterWeibullFitter(ParametericUnivariateFitter):

    _fitted_parameter_names = ["lambda_", "rho_", "theta_"]
    _bounds = [(0, None), (0, None), (0, T.min() - 0.001)]

    def _cumulative_hazard(self, params, times):
        lambda_, rho_, theta_ = params
        return ((times - theta_) / lambda_) ** rho_
[13]:
tpw = ThreeParameterWeibullFitter()
tpw.fit(T, E)
tpw.print_summary()
ax = tpw.plot_cumulative_hazard(figsize=(8,5))
ax = NelsonAalenFitter().fit(T, E).plot(ax=ax, ci_show=False)
<lifelines.ThreeParameterWeibullFitter: fitted with 163 observations, 7 censored>
number of subjects = 163
number of events = 156
log-likelihood = -666.71
hypothesis = lambda_ != 1, rho_ != 1, theta_ != 7.9995

---
          coef  se(coef)  lower 0.95  upper 0.95       p  -log2(p)
lambda_  63.92      5.38       53.38       74.47  <0.005    102.58
rho_      4.20      0.56        3.11        5.29  <0.005     26.67
theta_    2.55      5.05       -7.35       12.45    0.28      1.83
Inverse Gaussian distribution¶
The inverse Gaussian distribution is another popular model for survival analysis. Unlike other models, its hazard does not asymptotically converge to 0, allowing for a long tail of survival. Let’s model this, using the same parameterization as Wikipedia.
[14]:
from autograd.scipy.stats import norm
class InverseGaussianFitter(ParametericUnivariateFitter):

    _fitted_parameter_names = ['lambda_', 'mu_']

    def _cumulative_density(self, params, times):
        lambda_, mu_ = params
        v = norm.cdf(np.sqrt(lambda_ / times) * (times / mu_ - 1), loc=0, scale=1) + \
            np.exp(2 * lambda_ / mu_) * norm.cdf(-np.sqrt(lambda_ / times) * (times / mu_ + 1), loc=0, scale=1)
        return v

    def _cumulative_hazard(self, params, times):
        return -np.log(1 - self._cumulative_density(params, times))
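As a sanity check on the parameterization, the CDF formula can be evaluated at a point with a known value: for \(\mu = \lambda = 1\), the inverse Gaussian CDF at \(t = 1\) is ≈ 0.668 (a tabulated value of the distribution). A pure-stdlib check, independent of lifelines and autograd:

```python
from math import erf, exp, sqrt

def norm_cdf(x):
    # standard normal CDF via the error function (stdlib only)
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def inverse_gaussian_cdf(t, mu, lam):
    # same formula as the Wikipedia parameterization used above
    return (norm_cdf(sqrt(lam / t) * (t / mu - 1))
            + exp(2 * lam / mu) * norm_cdf(-sqrt(lam / t) * (t / mu + 1)))

value = inverse_gaussian_cdf(1.0, mu=1.0, lam=1.0)
```

The function also behaves like a proper CDF: it is near 0 for small \(t\) and tends to 1 as \(t \to \infty\).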
[15]:
from lifelines.datasets import load_rossi
rossi = load_rossi()
igf = InverseGaussianFitter()
igf.fit(rossi['week'], rossi['arrest'], timeline=np.arange(1, 500))
igf.print_summary()
igf.plot_hazard()
<lifelines.InverseGaussianFitter: fitted with 432 observations, 318 censored>
number of subjects = 432
number of events = 114
log-likelihood = -729.80
hypothesis = lambda_ != 1, mu_ != 1

---
             coef  se(coef)  lower 0.95  upper 0.95       p  -log2(p)
lambda_   7441.43   9296.67   -10779.69    25662.56    0.42      1.24
mu_         47.86      3.31       41.38       54.35  <0.005    148.83
[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x11c4715c0>
Bounded lifetimes using the beta distribution¶
Maybe your data is bounded between 0 and some (unknown) upper bound M? That is, lifetimes can’t be more than M. Maybe you know M, maybe you don’t.
[16]:
n = 100
T = 5 * np.random.random(n)**2
T_censor = 10 * np.random.random(n)**2
E = T < T_censor
T_obs = np.minimum(T, T_censor)
[17]:
from autograd_gamma import betainc, betaincln
class BetaFitter(ParametericUnivariateFitter):

    _fitted_parameter_names = ['alpha_', 'beta_', "m_"]
    _bounds = [(0, None), (0, None), (T.max(), None)]
    _scipy_fit_method = "SLSQP"

    def _cumulative_hazard(self, params, times):
        alpha_, beta_, m_ = params
        return -betaincln(beta_, alpha_, 1 - times / m_)
[23]:
beta_fitter = BetaFitter().fit(T_obs, E)
beta_fitter.plot()
beta_fitter.print_summary()
/Users/camerondavidsonpilon/code/lifelines/lifelines/fitters/__init__.py:950: StatisticalWarning: The diagonal of the variance_matrix_ has negative values. This could be a problem with BetaFitter's fit to the data.
It's advisable to not trust the variances reported, and to be suspicious of the
fitted parameters too. Perform plots of the cumulative hazard to help understand
the latter's bias.
To fix this, try specifying an `initial_point` kwarg in `fit`.
warnings.warn(warning_text, utils.StatisticalWarning)
/Users/camerondavidsonpilon/code/lifelines/lifelines/fitters/__init__.py:462: RuntimeWarning: invalid value encountered in sqrt
np.einsum("nj,jk,nk>n", gradient_at_times.T, self.variance_matrix_, gradient_at_times.T)
<lifelines.BetaFitter: fitted with 100 observations, 34 censored>
number of subjects = 100
number of events = 66
log-likelihood = -94.50
hypothesis = alpha_ != 1, beta_ != 1, m_ != 5.95501

---
         coef  se(coef)  lower 0.95  upper 0.95       p  -log2(p)
alpha_   0.51      0.06        0.38        0.63  <0.005     46.27
beta_    0.87      0.13        0.60        1.13    0.32      1.64
m_       4.96       nan         nan         nan     nan       nan
Gompertz¶
[24]:
class GompertzFitter(ParametericUnivariateFitter):

    # this parameterization is slightly different than Wikipedia's.
    _fitted_parameter_names = ['nu_', 'b_']

    def _cumulative_hazard(self, params, times):
        nu_, b_ = params
        return nu_ * np.expm1(times / b_)
[25]:
ggf = GompertzFitter()
ggf.fit(rossi['week'], rossi['arrest'], timeline=np.arange(1, 120))
ggf.print_summary()
ggf.plot_survival_function()
<lifelines.GompertzFitter: fitted with 432 observations, 318 censored>
number of subjects = 432
number of events = 114
log-likelihood = -697.81
hypothesis = nu_ != 1, b_ != 1

---
       coef  se(coef)  lower 0.95  upper 0.95       p  -log2(p)
nu_    0.20      0.11        0.01        0.40  <0.005     45.19
b_    55.19     19.26       17.45       92.93  <0.005      7.68
[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x11cf415c0>
Discrete survival models¶
So far we have only been investigating continuous-time survival models, where times can take on any positive value. If we want to consider discrete survival times (for example, over the positive integers), we need to make a small adjustment. With discrete survival models, there is a slightly more complicated relationship between the hazard and cumulative hazard, because there are two ways to define the cumulative hazard.
We also no longer have the relationship that \(h(t) = \frac{dH(t)}{dt}\), since \(t\) is no longer continuous. Instead, depending on which version of the cumulative hazard you choose to use (inference will be the same), we have to redefine the hazard function in lifelines.
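To make the two definitions concrete, here is a small NumPy sketch with a made-up discrete survival function. One cumulative hazard is the running sum of the discrete hazards; the other is \(-\log S(t)\). They agree only approximately (and only when the hazards are small), which is why the hazard must be redefined depending on the choice:

```python
import numpy as np

# an arbitrary discrete survival function over t = 0, 1, 2, 3, 4
S = np.array([1.0, 0.8, 0.55, 0.35, 0.2])

# discrete hazard: probability of dying at step t, given alive just before it
h = 1 - S[1:] / S[:-1]

H_sum = np.cumsum(h)     # definition 1: running sum of the hazards
H_log = -np.log(S[1:])   # definition 2: -log of the survival function
```

Both versions are non-decreasing, but `H_log` always sits above `H_sum` (since \(-\log(1-h) \ge h\)), and the gap widens as the hazards grow.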
Here is an example of a discrete survival model, which may not look like a survival model at first, where we use a redefined _hazard function.
Looking for more examples of what you can build? See other unique survival models in the docs on time-lagged survival.
Time-lagged conversion rates and cure models¶
Suppose in our population we have a subpopulation that will never experience the event of interest. Or, for some subjects, the event will occur so far in the future that it’s essentially at time infinity. In that case, the survival function should not asymptotically approach zero, but some positive value. Models that describe this are sometimes called cure models or time-lagged conversion models.
There’s a serious fault in using parametric models for these types of problems that non-parametric models don’t have. The most common parametric models, like Weibull, LogNormal, etc., all have strictly increasing cumulative hazard functions, which means the corresponding survival function will always converge to 0.
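A mixture cure model, sketched here only for contrast (it is not what the fitters above produce), makes the fault concrete: if a fraction \(p\) of the population is "cured", the population survival function is \(S(t) = p + (1-p)S_0(t)\), which plateaus at \(p\) instead of 0.

```python
import numpy as np

p = 0.4      # "cured" fraction that never experiences the event (made up)
lam = 10.0   # mean lifetime of the susceptible subpopulation (made up)

t = np.linspace(0, 200, 1000)
S0 = np.exp(-t / lam)      # survival of the susceptible subpopulation
S = p + (1 - p) * S0       # population survival: plateaus at p, not 0

# the implied cumulative hazard flattens out at -log(p), whereas a
# Weibull or log-normal cumulative hazard would grow without bound
H = -np.log(S)
```

A parametric model whose cumulative hazard grows without bound can never reproduce this plateau, no matter its parameters.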
Let’s look at an example of this problem. Below I generated some data that has individuals who will not experience the event, no matter how long we wait.
[1]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
from matplotlib import pyplot as plt
import autograd.numpy as np
from autograd.scipy.special import expit, logit
import pandas as pd
plt.style.use('bmh')
[2]:
N = 200
U = np.random.rand(N)
T = -(logit(-np.log(U) / 0.5) - np.random.exponential(2, N) - 6.00) / 0.50
E = ~np.isnan(T)
T[np.isnan(T)] = 50
[3]:
from lifelines import KaplanMeierFitter
kmf = KaplanMeierFitter().fit(T, E)
kmf.plot(figsize=(8,4))
plt.ylim(0, 1);
It should be clear that there is an asymptote at around 0.6. The non-parametric model will always show this. If this is true, then the cumulative hazard function should have a horizontal asymptote as well. Let’s use the Nelson-Aalen model to see this.
[4]:
from lifelines import NelsonAalenFitter
naf = NelsonAalenFitter().fit(T, E)
naf.plot(figsize=(8,4))
[4]:
<matplotlib.axes._subplots.AxesSubplot at 0x1211ac7b8>
However, when we try a parametric model, we will see that it doesn’t extrapolate very well. Let’s use the flexible two-parameter LogLogisticFitter model.
[5]:
from lifelines import LogLogisticFitter
fig, ax = plt.subplots(nrows=2, ncols=2, figsize=(10, 6))
t = np.linspace(0, 40)
llf = LogLogisticFitter().fit(T, E, timeline=t)
llf.plot_survival_function(ax=ax[0][0])
kmf.plot(ax=ax[0][0])
llf.plot_cumulative_hazard(ax=ax[0][1])
naf.plot(ax=ax[0][1])
t = np.linspace(0, 100)
llf = LogLogisticFitter().fit(T, E, timeline=t)
llf.plot_survival_function(ax=ax[1][0])
kmf.plot(ax=ax[1][0])
llf.plot_cumulative_hazard(ax=ax[1][1])
naf.plot(ax=ax[1][1])
[5]:
<matplotlib.axes._subplots.AxesSubplot at 0x12199d518>
The LogLogistic model does quite a terrible job of capturing not only the asymptotes, but also the intermediate values. If we extended the survival function out further, we would see that it eventually nears 0.
Custom parametric models to handle asymptotes¶
Focusing on modeling the cumulative hazard function, what we would like is a function that increases up to a limit and then tapers off to an asymptote. We can think long and hard about these (I did), but there’s a family of functions that have this property that we are very familiar with: cumulative distribution functions! By their nature, they will asymptotically approach 1. And they are readily available in the SciPy and autograd libraries. So our new model of the cumulative hazard function is:

\[H(t; c, \theta) = c \cdot F(t; \theta)\]

where \(c\) is the (unknown) horizontal asymptote, and \(\theta\) is a vector of (unknown) parameters for the CDF, \(F\).
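Numerically this works out as intended: with the normal CDF for \(F\), \(H(t)\) is capped at \(c\) and the survival function is bounded below by \(\exp(-c)\). A stdlib sketch (values made up; no autograd needed here since we are not fitting):

```python
import numpy as np
from math import erf, sqrt

def norm_cdf(x):
    # standard normal CDF via the error function (stdlib only)
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

# illustrative parameter values, not fitted ones
c, mu, sigma = 0.5, 15.0, 5.0
t = np.linspace(0, 100, 500)

H = c * np.array([norm_cdf(x) for x in (t - mu) / sigma])  # capped at c
S = np.exp(-H)   # therefore bounded below by exp(-c)
```

This is the key design choice: the asymptote of the survival function becomes an explicit, estimable parameter via \(\exp(-c)\).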
We can create a custom cumulative hazard model using ParametericUnivariateFitter (for a tutorial on how to create custom models, see this here). Let’s choose the Normal CDF for our model.
Remember we must use the imports from autograd for this, i.e. from autograd.scipy.stats import norm.
[6]:
from autograd.scipy.stats import norm
from lifelines.fitters import ParametericUnivariateFitter
class UpperAsymptoteFitter(ParametericUnivariateFitter):

    _fitted_parameter_names = ["c_", "mu_", "sigma_"]
    _bounds = ((0, None), (None, None), (0, None))

    def _cumulative_hazard(self, params, times):
        c, mu, sigma = params
        return c * norm.cdf((times - mu) / sigma, loc=0, scale=1)
[7]:
uaf = UpperAsymptoteFitter().fit(T, E)
uaf.print_summary(3)
uaf.plot(figsize=(8,4))
<lifelines.UpperAsymptoteFitter: fitted with 200 observations, 119 censored>
number of subjects = 200
number of events = 81
log-likelihood = -386.053
hypothesis = c_ != 1, mu_ != 0, sigma_ != 1

---
          coef  se(coef)  lower 0.95  upper 0.95        p  -log2(p)
c_       0.521     0.059       0.406       0.636  <0.0005    51.753
mu_     17.514     0.608      16.322      18.705  <0.0005   603.894
sigma_   5.398     0.425       4.564       6.232  <0.0005    80.823
[7]:
<matplotlib.axes._subplots.AxesSubplot at 0x122376f28>
We get a lovely asymptotic cumulative hazard. The summary table suggests that the best value of \(c\) is 0.521. This can be translated into the survival function asymptote by \(\exp(-0.521) \approx 0.59\).
Let’s compare this fit to the nonparametric models.
[8]:
fig, ax = plt.subplots(nrows=2, ncols=2, figsize=(10, 6))
t = np.linspace(0, 40)
uaf = UpperAsymptoteFitter().fit(T, E, timeline=t)
uaf.plot_survival_function(ax=ax[0][0])
kmf.plot(ax=ax[0][0])
uaf.plot_cumulative_hazard(ax=ax[0][1])
naf.plot(ax=ax[0][1])
t = np.linspace(0, 100)
uaf = UpperAsymptoteFitter().fit(T, E, timeline=t)
uaf.plot_survival_function(ax=ax[1][0])
kmf.survival_function_.plot(ax=ax[1][0])
uaf.plot_cumulative_hazard(ax=ax[1][1])
naf.plot(ax=ax[1][1])
[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x121109438>
I wasn’t expecting this good of a fit. But there it is. This was some artificial data, but let’s try this technique on some real-life data.
[9]:
from lifelines.datasets import load_leukemia, load_kidney_transplant
T, E = load_leukemia()['t'], load_leukemia()['status']
uaf.fit(T, E)
ax = uaf.plot_survival_function(figsize=(8,4))
uaf.print_summary()
kmf.fit(T, E).plot(ax=ax, ci_show=False)
print("")
print("Estimated lower bound: {:.2f} ({:.2f}, {:.2f})".format(
        np.exp(-uaf.summary.loc['c_', 'coef']),
        np.exp(-uaf.summary.loc['c_', 'upper 0.95']),
        np.exp(-uaf.summary.loc['c_', 'lower 0.95']),
    )
)
<lifelines.UpperAsymptoteFitter: fitted with 42 observations, 12 censored>
number of subjects = 42
number of events = 30
log-likelihood = -118.60
hypothesis = c_ != 1, mu_ != 0, sigma_ != 1

          coef  se(coef)  lower 0.95  upper 0.95       p  -log2(p)
c_        1.63      0.36        0.94        2.33    0.07      3.75
mu_      13.44      1.73       10.06       16.82  <0.005     47.07
sigma_    7.03      1.07        4.94        9.12  <0.005     25.91

Estimated lower bound: 0.20 (0.10, 0.39)
So we might expect that about 20% will not have the event in this population (but make note of the large CI bounds, too!)
[10]:
# Another, less obvious, dataset.
T, E = load_kidney_transplant()['time'], load_kidney_transplant()['death']
uaf.fit(T, E)
ax = uaf.plot_survival_function(figsize=(8,4))
uaf.print_summary()
kmf.fit(T, E).plot(ax=ax)
print("")
print("Estimated lower bound: {:.2f} ({:.2f}, {:.2f})".format(
        np.exp(-uaf.summary.loc['c_', 'coef']),
        np.exp(-uaf.summary.loc['c_', 'upper 0.95']),
        np.exp(-uaf.summary.loc['c_', 'lower 0.95']),
    )
)
)
<lifelines.UpperAsymptoteFitter: fitted with 863 observations, 723 censored>
number of subjects = 863
number of events = 140
log-likelihood = -1458.88
hypothesis = c_ != 1, mu_ != 0, sigma_ != 1

           coef  se(coef)  lower 0.95  upper 0.95       p  -log2(p)
c_         0.29      0.03        0.24        0.35  <0.005    433.78
mu_     1139.66    101.52      940.68     1338.63  <0.005     94.73
sigma_   872.26     66.24      742.44     1002.08  <0.005    128.86

Estimated lower bound: 0.75 (0.70, 0.79)
Using alternative functional forms¶
An even simpler model might look like \(c \left(1 - \frac{1}{\lambda t + 1}\right)\), however this model cannot handle any “inflection points” like our artificial dataset has above. However, it works well for this dataset.
With all cure models, one important feature is the ability to extrapolate to unforeseen times.
[11]:
from lifelines.fitters import ParametricUnivariateFitter

class SimpleUpperAsymptoteFitter(ParametricUnivariateFitter):

    _fitted_parameter_names = ["c_", "lambda_"]

    _bounds = ((0, None), (0, None))

    def _cumulative_hazard(self, params, times):
        c, lambda_ = params
        return c * (1 - 1 / (lambda_ * times + 1))
[12]:
# Another, less obvious, dataset.
saf = SimpleUpperAsymptoteFitter().fit(T, E, timeline=np.arange(1, 10000))
ax = saf.plot_survival_function(figsize=(8,4))
saf.print_summary(4)
kmf.fit(T, E).plot(ax=ax)
print("")
print("Estimated lower bound: {:.2f} ({:.2f}, {:.2f})".format(
        np.exp(-saf.summary.loc['c_', 'coef']),
        np.exp(-saf.summary.loc['c_', 'upper 0.95']),
        np.exp(-saf.summary.loc['c_', 'lower 0.95']),
    )
)
)
<lifelines.SimpleUpperAsymptoteFitter: fitted with 863 observations, 723 censored>
number of subjects = 863
number of events = 140
log-likelihood = -1392.1610
hypothesis = c_ != 1, lambda_ != 1

           coef  se(coef)  lower 0.95  upper 0.95       p  -log2(p)
c_       0.4252    0.0717      0.2847      0.5658  <5e-05   49.6941
lambda_  0.0006    0.0002      0.0003      0.0009  <5e-05       inf

Estimated lower bound: 0.65 (0.57, 0.75)
Survival regression¶
Often we have additional data aside from the duration that we want to use. The technique is called survival regression – the name implies we regress covariates (e.g., age, country, etc.) against another variable – in this case durations. Similar to the logic in the first part of this tutorial, we cannot use traditional methods like linear regression because of censoring.
There are a few popular models in survival regression: Cox’s model, accelerated failure models, and Aalen’s additive model. All models attempt to represent the hazard rate \(h(t \mid x)\) as a function of \(t\) and some covariates \(x\). We explore these models next.
The dataset for regression¶
The dataset required for survival regression must be in the format of a Pandas DataFrame. Each row of the DataFrame should be an observation. There should be a column denoting the durations of the observations. There may be a column denoting the event status of each observation (1 if event occurred, 0 if censored). There are also the additional covariates you wish to regress against. Optionally, there could be columns in the DataFrame that are used for stratification, weights, and clusters which will be discussed later in this tutorial.
An example dataset we will use is the Rossi recidivism dataset, available in lifelines as load_rossi()
.
from lifelines.datasets import load_rossi
rossi = load_rossi()
"""
week arrest fin age race wexp mar paro prio
0 20 1 0 27 1 0 0 1 3
1 17 1 0 18 1 0 0 1 8
2 25 1 0 19 0 1 0 1 13
3 52 0 1 23 1 1 1 1 1
"""
The dataframe rossi
contains 432 observations. The week
column is the duration, the arrest
column is the event occurred, and the other columns represent variables we wish to regress against.
If you need to first clean or transform your dataset (encode categorical variables, add interaction terms, etc.), that should happen before using lifelines. Libraries like Pandas and Patsy help with that.
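For instance, a categorical column can be one-hot encoded with Pandas before handing the DataFrame to lifelines. This is a minimal sketch; the group column and its values are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "T": [5, 3, 9, 8],
    "E": [1, 1, 0, 1],
    "group": ["a", "b", "a", "c"],  # hypothetical categorical covariate
})

# One-hot encode; drop_first avoids a redundant column that would act
# like an intercept and hurt convergence (see the Note below).
df_encoded = pd.get_dummies(df, columns=["group"], drop_first=True)
print(df_encoded.columns.tolist())
```

The resulting DataFrame, with columns T, E, group_b, group_c, can then be passed directly to a lifelines regression fitter.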
Cox’s proportional hazard model¶
The idea behind Cox’s proportional hazard model is that the log-hazard of an individual is a linear function of their static covariates plus a population-level baseline hazard that changes over time. Mathematically:
\(h(t \mid x) = b_0(t) \exp\left(\sum_{i=1}^n b_i x_i\right)\)
Note a few facts about this model: the only time component is in the baseline hazard, \(b_0(t)\). In the above product, the partial hazard is a time-invariant scalar factor that only scales the baseline hazard up or down. Thus a change in covariates can only increase or decrease the baseline hazard.
Note
In other regression models, a column of 1s might be added that represents the intercept or baseline. This is not necessary in the Cox model. In fact, there is no intercept in the Cox model; the baseline hazard represents this. lifelines will throw warnings and may experience convergence errors if a column of 1s is present in your dataset.
Running the regression¶
The implementation of the Cox model in lifelines is under CoxPHFitter
. Like R, it has a print_summary()
function that prints a tabular view of coefficients and related stats.
from lifelines import CoxPHFitter
from lifelines.datasets import load_rossi
rossi_dataset = load_rossi()
cph = CoxPHFitter()
cph.fit(rossi_dataset, duration_col='week', event_col='arrest', show_progress=True)
cph.print_summary() # access the results using cph.summary
"""
<lifelines.CoxPHFitter: fitted with 432 observations, 318 censored>
duration col = 'week'
event col = 'arrest'
number of subjects = 432
number of events = 114
log-likelihood = -658.75
time fit was run = 20190127 23:10:15 UTC

       coef  exp(coef)  se(coef)      z       p  -log2(p)  lower 0.95  upper 0.95
fin   -0.38       0.68      0.19  -1.98    0.05      4.40       -0.75        0.00
age   -0.06       0.94      0.02  -2.61    0.01      6.79       -0.10       -0.01
race   0.31       1.37      0.31   1.02    0.31      1.70       -0.29        0.92
wexp  -0.15       0.86      0.21  -0.71    0.48      1.06       -0.57        0.27
mar   -0.43       0.65      0.38  -1.14    0.26      1.97       -1.18        0.31
paro  -0.08       0.92      0.20  -0.43    0.66      0.59       -0.47        0.30
prio   0.09       1.10      0.03   3.19  <0.005      9.48        0.04        0.15
---
Concordance = 0.64
Likelihood ratio test = 33.27 on 7 df, -log2(p)=15.37
"""
To access the coefficients and the baseline hazard directly, you can use params_
and baseline_hazard_
respectively. Taking a look at these coefficients for a moment, prio
(the number of prior arrests) has a coefficient of about 0.09. Thus, a one unit increase in prio
means the baseline hazard will increase by a factor of \(\exp{(0.09)} = 1.10\), about a 10% increase. Recall that in the Cox proportional hazard model, a higher hazard means more at risk of the event occurring. The value \(\exp{(0.09)}\) is called the hazard ratio, a name that will become clear with another example.
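As a quick sketch, the entire exp(coef) column can be recovered by exponentiating the coefficient Series. The values below are stand-ins taken from the summary above, not live lifelines output:

```python
import numpy as np
import pandas as pd

# Stand-in for cph.params_ after fitting (values from the summary above).
params_ = pd.Series({"prio": 0.09, "mar": -0.43})

# Exponentiating the coefficients gives the hazard ratios,
# matching the exp(coef) column of print_summary().
hazard_ratios = np.exp(params_)
print(hazard_ratios.round(2))
```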
Consider the coefficient of mar
(whether the subject is married or not). The values in the column are binary: 0 or 1, representing either not married or married. The value of the coefficient associated with mar
, \(\exp{(-0.43)}\), is the ratio of hazards associated with being married, that is:
\(\exp(-0.43) = \frac{h(t \mid \text{married})}{h(t \mid \text{unmarried})}\)
Note that the left-hand side is a constant (specifically, it’s independent of time, \(t\)), but the right-hand side has two factors that may vary with time. The proportional hazards assumption is that this relationship is true in reality. That is, hazards can change over time, but their ratio between levels remains constant. Later we will deal with checking this assumption.
Convergence¶
Fitting the Cox model to the data involves iterative methods. lifelines takes extra effort to help with convergence, so please be attentive to any warnings that appear. Fixing any warnings will generally help convergence and decrease the number of iterative steps required. If you wish to see the fitting, there is a show_progress
parameter in the fit()
function. For further help, see Problems with convergence in the Cox proportional hazard model.
After fitting, the value of the maximum log-likelihood is available under log_likelihood_
. The variance matrix of the coefficients is available under variance_matrix_
.
Goodness of fit¶
After fitting, you may want to know how “good” of a fit your model was to the data. A few methods the author has found useful are to:
 look at the concordance-index (see the section below on Model selection in survival regression), available as score_
or in the print_summary()
output, as a measure of predictive accuracy.
 look at the log-likelihood ratio test result in the print_summary()
output.
 check the proportional hazards assumption with the check_assumptions()
method. See the section later on this page for more details.
Prediction¶
After fitting, you can use use the suite of prediction methods: predict_partial_hazard()
, predict_survival_function()
, etc.
X = rossi_dataset
cph.predict_partial_hazard(X)
cph.predict_survival_function(X, times=[5., 25., 50.])
cph.predict_median(X)
A common use case is to predict the event time of censored subjects. This is easy to do, but we first have to calculate an important conditional probability. Let \(T\) be the (random) event time for some subject, and \(S(t) := P(T > t)\) be their survival function. We are interested in answering the following: what is a subject’s new survival function, given we know they have lived past time \(s\), where \(s < t\)? Mathematically:
\(P(T > t \mid T > s) = \frac{P(T > t)}{P(T > s)} = \frac{S(t)}{S(s)}\)
Thus we scale the original survival function by the survival function at time \(s\) (everything prior to \(s\) should be mapped to 1.0 as well, since we are working with probabilities and we know that the subject was alive before \(s\)).
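A minimal sketch of this scaling, using a made-up survival function rather than lifelines output:

```python
import pandas as pd

# A toy survival function S(t), indexed by time.
S = pd.Series([1.0, 0.8, 0.5, 0.2], index=[0, 10, 20, 30])

s = 10  # the subject is known to be alive at time s

# Scale by S(s); times at or before s are clipped back up to 1.0,
# since we know the subject was alive then.
conditional_S = (S / S.loc[s]).clip(upper=1.0)
print(conditional_S)
```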
Back to our original problem of predicting the event time of censored individuals, lifelines has all this math and logic built in when using the conditional_after kwarg.
censored_subjects = rossi.loc[~rossi['arrest'].astype(bool)]
censored_subjects_last_obs = censored_subjects['week']
cph.predict_partial_hazard(censored_subjects, conditional_after=censored_subjects_last_obs)
cph.predict_survival_function(censored_subjects, times=[5., 25., 50.], conditional_after=censored_subjects_last_obs)
cph.predict_median(censored_subjects, conditional_after=censored_subjects_last_obs)
Plotting the coefficients¶
With a fitted model, an alternative way to view the coefficients and their ranges is to use the plot
method.
from lifelines.datasets import load_rossi
from lifelines import CoxPHFitter
rossi_dataset = load_rossi()
cph = CoxPHFitter()
cph.fit(rossi_dataset, duration_col='week', event_col='arrest', show_progress=True)
cph.plot()
Plotting the effect of varying a covariate¶
After fitting, we can plot what the survival curves look like as we vary a single covariate while
holding everything else equal. This is useful to understand the impact of a covariate, given the model. To do this, we use the plot_covariate_groups()
method and give it the covariate of interest, and the values to display.
from lifelines.datasets import load_rossi
from lifelines import CoxPHFitter
rossi_dataset = load_rossi()
cph = CoxPHFitter()
cph.fit(rossi_dataset, duration_col='week', event_col='arrest', show_progress=True)
cph.plot_covariate_groups('prio', [0, 2, 4, 6, 8, 10], cmap='coolwarm')
The plot_covariate_groups()
method can accept multiple covariates as well. This is useful for two purposes:
 There are derivative features in your dataset. For example, suppose you have included
year
andyear**2
in your dataset. It doesn’t make sense to just varyyear
and leaveyear**2
fixed. You’ll need to manually specify the values the covariates take on in an N-d array or list (where N is the number of covariates being varied).
cph.plot_covariate_groups(
['year', 'year**2'],
[
[0, 0],
[1, 1],
[2, 4],
[3, 9],
[8, 64],
],
cmap='coolwarm')
 This feature is also useful for analyzing categorical variables. In your regression, you may have dummy variables (also called onehotencoded variables) in your DataFrame that represent some categorical variable. To simultaneously plot the survival curves of each category, all else being equal, we can use:
cph.plot_covariate_groups(
    ['d1', 'd2', 'd3', 'd4', 'd5'],
    np.eye(5),
    cmap='coolwarm')
The reason why we use np.eye
is because we want each row of the matrix to “turn on” one category and “turn off” the others.
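To see this concretely, each row of the identity matrix sets exactly one dummy variable to 1 and the rest to 0:

```python
import numpy as np

values = np.eye(3)
print(values)
# row 0 -> [1, 0, 0]: first category "on", others "off"
# row 1 -> [0, 1, 0], and so on
```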
Checking the proportional hazards assumption¶
CoxPHFitter
has a check_assumptions()
method that will output violations of the proportional hazard assumption. For a tutorial on how to fix violations, see Testing the Proportional Hazard Assumptions.
Non-proportional hazards is a case of model misspecification. Suggestions are to look for ways to stratify a column (see docs below), or use a time-varying model.
Stratification¶
Sometimes one or more covariates may not obey the proportional hazard assumption. In this case, we can allow the covariate(s) to still be included in the model without estimating their effect. This is called stratification. At a high level, think of it as splitting the dataset into N smaller datasets, defined by the unique values of the stratifying covariate(s). Each dataset has its own baseline hazard (the non-parametric part of the model), but they all share the regression parameters (the parametric part of the model). Since the covariates stratified on are constant within each dataset, there are no regression parameters for them, hence they will not show up in the output. However there will be N baseline hazards under baseline_cumulative_hazard_
.
To specify variables to be used in stratification, we define them in the call to fit()
:
from lifelines.datasets import load_rossi
from lifelines import CoxPHFitter
rossi_dataset = load_rossi()
cph = CoxPHFitter()
cph.fit(rossi_dataset, 'week', event_col='arrest', strata=['race'], show_progress=True)
cph.print_summary() # access the results using cph.summary
"""
<lifelines.CoxPHFitter: fitted with 432 observations, 318 censored>
duration col = 'week'
event col = 'arrest'
strata = ['race']
number of subjects = 432
number of events = 114
log-likelihood = -620.56
time fit was run = 20190127 23:08:35 UTC

       coef  exp(coef)  se(coef)      z       p  -log2(p)  lower 0.95  upper 0.95
fin   -0.38       0.68      0.19  -1.98    0.05      4.39       -0.75        0.00
age   -0.06       0.94      0.02  -2.62    0.01      6.83       -0.10       -0.01
wexp  -0.14       0.87      0.21  -0.67    0.50      0.99       -0.56        0.27
mar   -0.44       0.64      0.38  -1.15    0.25      2.00       -1.19        0.31
paro  -0.09       0.92      0.20  -0.44    0.66      0.60       -0.47        0.30
prio   0.09       1.10      0.03   3.21  <0.005      9.56        0.04        0.15
---
Concordance = 0.64
Likelihood ratio test = 109.63 on 6 df, -log2(p)=68.48
"""
cph.baseline_cumulative_hazard_.shape
# (49, 2)
Weights & robust errors¶
Observations can come with weights, as well. These weights may be integer values representing some commonly occurring observation, or they may be float values representing some sampling weights (ex: inverse probability weights). In the fit()
method, a kwarg is present for specifying which column in the DataFrame should be used as weights, ex: CoxPHFitter().fit(df, 'T', 'E', weights_col='weights')
.
When using sampling weights, it’s correct to also change the standard error calculations. That is done by turning on the robust
flag in fit()
. Internally, CoxPHFitter
will use the sandwich estimator to compute the errors.
from lifelines import CoxPHFitter
df = pd.DataFrame({
'T': [5, 3, 9, 8, 7, 4, 4, 3, 2, 5, 6, 7],
'E': [1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0],
'weights': [1.1, 0.5, 2.0, 1.6, 1.2, 4.3, 1.4, 4.5, 3.0, 3.2, 0.4, 6.2],
'month': [10, 3, 9, 8, 7, 4, 4, 3, 2, 5, 6, 7],
'age': [4, 3, 9, 8, 7, 4, 4, 3, 2, 5, 6, 7],
})
cph = CoxPHFitter()
cph.fit(df, 'T', 'E', weights_col='weights', robust=True)
cph.print_summary()
See more examples in Adding weights to observations in a Cox model.
Clusters & correlations¶
Another property your dataset may have is groups of related subjects. This could be caused by:
 a single individual having multiple occurrences, and hence showing up in the dataset more than once.
 subjects that share some common property, like members of the same family or being matched on propensity scores.
We call these grouped subjects “clusters”, and assume they are designated by some column in the DataFrame (example below). When using clusters, the point estimates of the model don’t change, but the standard errors will increase. An intuitive argument for this is that 100 observations on 100 individuals provide more information than 100 observations on 10 individuals (or clusters).
from lifelines import CoxPHFitter
df = pd.DataFrame({
'T': [5, 3, 9, 8, 7, 4, 4, 3, 2, 5, 6, 7],
'E': [1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0],
'month': [10, 3, 9, 8, 7, 4, 4, 3, 2, 5, 6, 7],
'age': [4, 3, 9, 8, 7, 4, 4, 3, 2, 5, 6, 7],
'id': [1, 1, 1, 1, 2, 3, 3, 4, 4, 5, 6, 7]
})
cph = CoxPHFitter()
cph.fit(df, 'T', 'E', cluster_col='id')
cph.print_summary()
For more examples, see Correlations between subjects in a Cox model.
Residuals¶
After fitting a Cox model, we can look back and compute important model residuals. These residuals can tell us about nonlinearities not captured, violations of proportional hazards, and help us answer other useful modeling questions. See Assessing Cox model fit using residuals.
Accelerated failure time models¶
Suppose we have two populations, A and B, with different survival functions, \(S_A(t)\) and \(S_B(t)\), and they are related by some accelerated failure rate, \(\lambda\):
\(S_A(t) = S_B(\lambda t)\)
This can be interpreted as slowing down or speeding up movement along the survival function. A classic example of this is that dogs age at 7 times the rate of humans, i.e. \(\lambda = \frac{1}{7}\). This model has some other nice properties: the average survival time of population B is \({\lambda}\) times the average survival time of population A. Likewise for the median survival time.
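A quick numeric check of this relationship, assuming (purely for illustration) exponential survival curves with mean lifespans of 70 years for humans and 10 for dogs:

```python
import numpy as np

lam = 1 / 7                  # dogs age at 7x the human rate
t = np.linspace(0, 100, 11)

def S_human(u):              # S_A: exponential survival, mean 70 years
    return np.exp(-u / 70)

def S_dog(u):                # S_B: exponential survival, mean 10 years
    return np.exp(-u / 10)

# the AFT relationship: S_A(t) = S_B(lambda * t), for all t
assert np.allclose(S_human(t), S_dog(lam * t))
```

Note the means agree with the claim above: the dogs' mean (10) is \(\lambda\) times the humans' mean (70).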
More generally, we can model \(\lambda\) as a function of the covariates available, that is:
\(\lambda(x) = \exp\left(b_0 + \sum_{i=1}^n b_i x_i\right)\)
This model can accelerate or decelerate failure times depending on subjects’ covariates. Another nice feature of this is the ease of interpretation of the coefficients: a unit increase in \(x_i\) means the average/median survival time changes by a factor of \(\exp(b_i)\).
Note
An important note on interpretation: suppose \(b_i\) was positive, then the factor \(\exp(b_i)\) is greater than 1, which will decelerate the event time (since we divide time by the factor), increasing the mean/median survival. Hence, it will be a protective effect. Likewise, a negative \(b_i\) will hasten the event time and reduce the mean/median survival time. This interpretation is opposite of how the sign influences event times in the Cox model! This is standard survival analysis convention.
Next, we pick a parametric form for the survival function, \(S(t)\). The most common is the Weibull form. So if we assume the relationship above and a Weibull form, the cumulative hazard is quite easy to write down:
\(H(t; x) = \left(\frac{t}{\lambda(x)}\right)^\rho\)
We call these accelerated failure time models, often shortened to just AFT models. Using lifelines, we can fit this model (and the unknown \(\rho\) parameter too).
The Weibull AFT model¶
The Weibull AFT model is implemented under WeibullAFTFitter
. The API for the class is similar to the other regression models in lifelines. After fitting, the coefficients can be accessed using params_
or summary
, or alternatively printed using print_summary()
.
from lifelines import WeibullAFTFitter
from lifelines.datasets import load_rossi
rossi_dataset = load_rossi()
aft = WeibullAFTFitter()
aft.fit(rossi_dataset, duration_col='week', event_col='arrest')
aft.print_summary(3) # access the results using aft.summary
"""
<lifelines.WeibullAFTFitter: fitted with 432 observations, 318 censored>
duration col = 'week'
event col = 'arrest'
number of subjects = 432
number of events = 114
log-likelihood = -679.917
time fit was run = 20190220 17:47:19 UTC

                     coef  exp(coef)  se(coef)      z        p  -log2(p)  lower 0.95  upper 0.95
lambda_ fin         0.272      1.313     0.138  1.973    0.049     4.365       0.002       0.543
        age         0.041      1.042     0.016  2.544    0.011     6.512       0.009       0.072
        race       -0.225      0.799     0.220 -1.021    0.307     1.703      -0.656       0.207
        wexp        0.107      1.112     0.152  0.703    0.482     1.053      -0.190       0.404
        mar         0.311      1.365     0.273  1.139    0.255     1.973      -0.224       0.847
        paro        0.059      1.061     0.140  0.421    0.674     0.570      -0.215       0.333
        prio       -0.066      0.936     0.021 -3.143    0.002     9.224      -0.107      -0.025
        _intercept  3.990     54.062     0.419  9.521  <0.0005    68.979       3.169       4.812
rho_    _intercept  0.339      1.404     0.089  3.809  <0.0005    12.808       0.165       0.514
---
Concordance = 0.640
Log-likelihood ratio test = 33.416 on 7 df, -log2(p)=15.462
"""
From above, we can see that prio
, which is the number of previous incarcerations, has a large negative coefficient. This means that each addition incarcerations changes a subject’s mean/median survival time by \(\exp(0.066) = 0.936\), approximately a 7% decrease in mean/median survival time. What is the mean/median survival time?
print(aft.median_survival_time_)
print(aft.mean_survival_time_)
# 100.325
# 118.67
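To make the coefficient interpretation above concrete, the percentage change is pure arithmetic, independent of lifelines:

```python
import numpy as np

coef_prio = -0.066            # the prio coefficient from the summary above
factor = np.exp(coef_prio)    # multiplicative change in mean/median survival time
pct_change = (factor - 1) * 100

print(f"factor = {factor:.3f}, i.e. about a {abs(pct_change):.0f}% decrease")
```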
What does the rho_ _intercept
row mean in the above table? Internally, we model the log of the rho_
parameter, so the value of \(\rho\) is the exponential of the reported value; in the case above it’s \(\hat{\rho} = \exp(0.339) = 1.404\). This brings us to the next point - modeling \(\rho\) with covariates as well:
Modeling ancillary parameters¶
In the above model, we left the parameter \(\rho\) as a single unknown. We can also choose to model this parameter as well. Why might we want to do this? It can help in survival prediction to allow heterogeneity in the \(\rho\) parameter. The model is no longer an AFT model, but we can still recover and understand the influence of changing a covariate by looking at its outcome plot (see section below). To model \(\rho\), we use the ancillary_df
keyword argument in the call to fit()
. There are four valid options:
 False
or None
: explicitly do not model the rho_
parameter (except for its intercept).
 a Pandas DataFrame: this option will use the columns in the DataFrame as the covariates in the regression for rho_
. This DataFrame could be equal to, or a subset of, the original dataset used for modeling lambda_
, or it could be a totally different dataset.
 True
: passing in True
will internally reuse the dataset that is being used to model lambda_
.
aft = WeibullAFTFitter()
aft.fit(rossi, duration_col='week', event_col='arrest', ancillary_df=False)
# identical to aft.fit(rossi, duration_col='week', event_col='arrest', ancillary_df=None)
aft.fit(rossi, duration_col='week', event_col='arrest', ancillary_df=some_df)
aft.fit(rossi, duration_col='week', event_col='arrest', ancillary_df=True)
# identical to aft.fit(rossi, duration_col='week', event_col='arrest', ancillary_df=rossi)
aft.print_summary()
"""
<lifelines.WeibullAFTFitter: fitted with 432 observations, 318 censored>
duration col = 'week'
event col = 'arrest'
number of subjects = 432
number of events = 114
log-likelihood = -669.40
time fit was run = 20190220 17:42:55 UTC

                     coef  exp(coef)  se(coef)      z       p  -log2(p)  lower 0.95  upper 0.95
lambda_ fin          0.24       1.28      0.15   1.60    0.11      3.18       -0.06        0.55
        age          0.10       1.10      0.03   3.43  <0.005     10.69        0.04        0.16
        race         0.07       1.07      0.19   0.36    0.72      0.48       -0.30        0.44
        wexp        -0.34       0.71      0.15  -2.22    0.03      5.26       -0.64       -0.04
        mar          0.26       1.30      0.30   0.86    0.39      1.35       -0.33        0.85
        paro         0.09       1.10      0.15   0.61    0.54      0.88       -0.21        0.39
        prio        -0.08       0.92      0.02  -4.24  <0.005     15.46       -0.12       -0.04
        _intercept   2.68      14.65      0.60   4.50  <0.005     17.14        1.51        3.85
rho_    fin         -0.01       0.99      0.15  -0.09    0.92      0.11       -0.31        0.29
        age         -0.05       0.95      0.02  -3.10  <0.005      9.01       -0.08       -0.02
        race        -0.46       0.63      0.25  -1.79    0.07      3.77       -0.95        0.04
        wexp         0.56       1.74      0.17   3.32  <0.005     10.13        0.23        0.88
        mar          0.10       1.10      0.27   0.36    0.72      0.47       -0.44        0.63
        paro         0.02       1.02      0.16   0.12    0.90      0.15       -0.29        0.33
        prio         0.03       1.03      0.02   1.44    0.15      2.73       -0.01        0.08
        _intercept   1.48       4.41      0.41   3.60  <0.005     11.62        0.68        2.29

Concordance = 0.63
Log-likelihood ratio test = 54.45 on 14 df, -log2(p)=19.83
"""
Plotting¶
The plotting API is the same as in CoxPHFitter
. We can view all covariates in a forest plot:
wft = WeibullAFTFitter().fit(rossi, 'week', 'arrest', ancillary_df=True)
wft.plot()
We can observe the influence a variable in the model by plotting the outcome (i.e. survival) of changing the variable. This is done using plot_covariate_groups()
, and this is also a nice time to observe the effects of modeling rho_
vs keeping it fixed. Below we fit the Weibull model to the same dataset twice, but in the first model we model rho_
and in the second model we don’t. We then vary the prio
(which is the number of prior arrests) and observe how the survival changes.
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))
times = np.arange(0, 100)
wft_model_rho = WeibullAFTFitter().fit(rossi, 'week', 'arrest', ancillary_df=True, timeline=times)
wft_model_rho.plot_covariate_groups('prio', range(0, 16, 3), cmap='coolwarm', ax=ax[0])
ax[0].set_title("Modelling rho_")
wft_not_model_rho = WeibullAFTFitter().fit(rossi, 'week', 'arrest', ancillary_df=False, timeline=times)
wft_not_model_rho.plot_covariate_groups('prio', range(0, 16, 3), cmap='coolwarm', ax=ax[1])
ax[1].set_title("Not modelling rho_");
Comparing a few of these survival functions side by side:
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(7, 4))
wft_model_rho.plot_covariate_groups('prio', range(0, 16, 5), cmap='coolwarm', ax=ax, lw=2, plot_baseline=False)
wft_not_model_rho.plot_covariate_groups('prio', range(0, 16, 5), cmap='coolwarm', ax=ax, ls='--', lw=2, plot_baseline=False)
ax.get_legend().remove()
You can read more about, and see other examples of, the extensions to plot_covariate_groups() in the documentation.
Prediction¶
Given a new subject, we ask questions about their future survival. When are they likely to experience the event? What does their survival function look like? The WeibullAFTFitter
is able to answer these. If we have modeled the ancillary covariates, we are required to include those as well:
X = rossi.loc[:10]
aft.predict_cumulative_hazard(X, ancillary_df=X)
aft.predict_survival_function(X, ancillary_df=X)
aft.predict_median(X, ancillary_df=X)
aft.predict_percentile(X, p=0.9, ancillary_df=X)
aft.predict_expectation(X, ancillary_df=X)
When predicting time remaining for censored individuals, you can use the conditional_after kwarg:
censored_X = rossi.loc[~rossi['arrest'].astype(bool)]
censored_subjects_last_obs = censored_X['week']
aft.predict_cumulative_hazard(censored_X, ancillary_df=censored_X, conditional_after=censored_subjects_last_obs)
aft.predict_survival_function(censored_X, ancillary_df=censored_X, conditional_after=censored_subjects_last_obs)
aft.predict_median(censored_X, ancillary_df=censored_X, conditional_after=censored_subjects_last_obs)
aft.predict_percentile(X, p=0.9, ancillary_df=censored_X, conditional_after=censored_subjects_last_obs)
There are two hyper-parameters that can be used to achieve a better test score. These are penalizer
and l1_ratio
in the call to WeibullAFTFitter
. The penalizer is similar to scikitlearn’s ElasticNet
model, see their docs.
aft_with_elastic_penalty = WeibullAFTFitter(penalizer=4.0, l1_ratio=1.0)
aft_with_elastic_penalty.fit(rossi, 'week', 'arrest')
aft_with_elastic_penalty.predict_median(rossi)
aft_with_elastic_penalty.print_summary()
"""
<lifelines.WeibullAFTFitter: fitted with 432 observations, 318 censored>
duration col = 'week'
event col = 'arrest'
penalizer = 4.0
l1_ratio = 1.0
number of subjects = 432
number of events = 114
log-likelihood = -2710.95
time fit was run = 20190220 19:53:29 UTC

                     coef  exp(coef)  se(coef)      z      p  -log2(p)  lower 0.95  upper 0.95
lambda_ fin          0.00       1.00      0.08   0.00   1.00      0.00       -0.15        0.15
        age          0.13       1.14      0.01  12.27 <0.005    112.47        0.11        0.15
        race         0.55       1.73      0.09   5.80 <0.005     27.16        0.36        0.73
        wexp         0.00       1.00      0.09   0.00   1.00      0.00       -0.17        0.17
        mar          0.00       1.00      0.14   0.01   0.99      0.01       -0.27        0.28
        paro         0.00       1.00      0.08   0.01   0.99      0.01       -0.16        0.16
        prio         0.00       1.00      0.01   0.00   1.00      0.00       -0.03        0.03
        _intercept   0.00       1.00      0.19   0.00   1.00      0.00       -0.38        0.38
rho_    _intercept   0.00       1.00       nan    nan    nan       nan         nan         nan

Concordance = 0.60
Log-likelihood ratio test = -4028.65 on 7 df, -log2(p)=0.00
"""
The Log-Normal and Log-Logistic AFT models¶
There are also the LogNormalAFTFitter
and LogLogisticAFTFitter
models, which instead assume that the survival time distribution is Log-Normal or Log-Logistic, respectively, rather than Weibull. They have identical APIs to the WeibullAFTFitter
, but the parameter names are different.
from lifelines import LogLogisticAFTFitter
from lifelines import LogNormalAFTFitter
llf = LogLogisticAFTFitter().fit(rossi, 'week', 'arrest')
lnf = LogNormalAFTFitter().fit(rossi, 'week', 'arrest')
Model selection for AFT models¶
Often, you don’t know a priori which AFT model to use. Each model has some assumptions built-in (not implemented yet in lifelines), but a quick and effective method is to compare the log-likelihoods of the fitted models. (Technically, we are comparing the AIC, but the number of parameters for each model is the same, so we can simplify and just look at the log-likelihood.) Generally, given the same dataset and number of parameters, a better-fitting model has a larger log-likelihood. We can look at the log-likelihood for each fitted model and select the largest one.
from lifelines import LogLogisticAFTFitter, WeibullAFTFitter, LogNormalAFTFitter
from lifelines.datasets import load_rossi
rossi = load_rossi()
llf = LogLogisticAFTFitter().fit(rossi, 'week', 'arrest')
lnf = LogNormalAFTFitter().fit(rossi, 'week', 'arrest')
wf = WeibullAFTFitter().fit(rossi, 'week', 'arrest')
print(llf.log_likelihood_) # -679.938
print(lnf.log_likelihood_) # -683.234
print(wf.log_likelihood_)  # -679.916, slightly the best model.
# with some heterogeneity in the ancillary parameters
ancillary_df = rossi[['prio']]
llf = LogLogisticAFTFitter().fit(rossi, 'week', 'arrest', ancillary_df=ancillary_df)
lnf = LogNormalAFTFitter().fit(rossi, 'week', 'arrest', ancillary_df=ancillary_df)
wf = WeibullAFTFitter().fit(rossi, 'week', 'arrest', ancillary_df=ancillary_df)
print(llf.log_likelihood_) # -678.94, slightly the best model.
print(lnf.log_likelihood_) # -680.39
print(wf.log_likelihood_)  # -679.60
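If you do compare models with different numbers of parameters, compare the AIC instead. A hedged sketch using the log-likelihoods printed above; the parameter count k = 9 is an assumption (8 lambda_ parameters plus the rho_/sigma_/etc. intercept):

```python
def aic(log_likelihood, k):
    """Akaike information criterion: lower is better."""
    return -2 * log_likelihood + 2 * k

# Log-likelihoods from the three fits above.
models = {"LogLogistic": -679.938, "LogNormal": -683.234, "Weibull": -679.916}
aics = {name: aic(ll, k=9) for name, ll in models.items()}

best = min(aics, key=aics.get)
print(best, round(aics[best], 3))
```

Since k is the same for all three models here, the ranking by AIC matches the ranking by log-likelihood.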
Left, right and interval censored data¶
The AFT models have APIs that handle left and interval censored data, too. The API for them is different from the API for fitting right censored data. Here’s an example with interval censored data.
from lifelines.datasets import load_diabetes
df = load_diabetes()
df['gender'] = df['gender'] == 'male'
print(df.head())
"""
left right gender
1 24 27 True
2 22 22 False
3 37 39 True
4 20 20 True
5 1 16 True
"""
wf = WeibullAFTFitter().fit_interval_censoring(df, lower_bound_col='left', upper_bound_col='right')
wf.print_summary()
"""
<lifelines.WeibullAFTFitter: fitted with 731 observations, 136 censored>
         event col = 'E'
number of subjects = 731
  number of events = 595
    log-likelihood = -2027.20
  time fit was run = 2019-04-11 19:39:42 UTC

                    coef  exp(coef)  se(coef)       z       p  -log2(p)  lower 0.95  upper 0.95
lambda_ gender      0.05       1.05      0.03    1.66    0.10      3.38       -0.01        0.10
        _intercept  2.91      18.32      0.02  130.15  <0.005       inf        2.86        2.95
rho_    _intercept  1.04       2.83      0.03   36.91  <0.005    988.46        0.98        1.09

Log-likelihood ratio test = 2.74 on 1 df, -log2(p)=3.35
"""
Another example of using lifelines for interval censored data is located here.
Aalen’s additive model¶
Warning
This implementation is still experimental.
Aalen’s Additive model is another regression model we can use. Like the Cox model, it defines the hazard rate, but instead of the linear model being multiplicative like the Cox model, Aalen’s model is additive. Specifically:

\(h(t|x) = b_0(t) + b_1(t) x_1 + \cdots + b_n(t) x_n\)
Inference typically does not estimate the individual \(b_i(t)\) but instead estimates \(\int_0^t b_i(s) \; ds\) (similar to the estimate of the hazard rate using NelsonAalenFitter). This is important to keep in mind when interpreting the plots produced.
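To make the estimand concrete, here is a toy sketch of how \(\int_0^t b_i(s) \; ds\) can be estimated: at each event time, least-squares regress the “died now” indicator of the at-risk subjects on their covariates, and accumulate the increments. This is only an illustration of the idea on synthetic data, not lifelines’ actual implementation (the function and variable names are ours):

```python
import numpy as np

def aalen_cumulative_coefs(durations, events, X):
    """Toy estimate of the cumulative coefficients B_i(t), i.e. the
    running integral of b_i(s) ds, with an intercept column for the
    baseline. Illustration only, not lifelines' implementation.
    """
    n = len(durations)
    Xd = np.column_stack([np.ones(n), X])  # prepend a baseline column
    event_times = np.sort(np.unique(durations[events == 1]))
    B = np.zeros((len(event_times), Xd.shape[1]))
    running = np.zeros(Xd.shape[1])
    for k, t in enumerate(event_times):
        at_risk = durations >= t
        died_now = ((durations == t) & (events == 1)).astype(float)
        # least-squares increment dB(t) from the at-risk set
        increment, *_ = np.linalg.lstsq(Xd[at_risk], died_now[at_risk], rcond=None)
        running += increment
        B[k] = running
    return event_times, B

# tiny synthetic example with two covariates
rng = np.random.default_rng(0)
T = rng.exponential(10, size=50).round(1)
E = np.ones(50, dtype=int)
X = rng.normal(size=(50, 2))
times, B = aalen_cumulative_coefs(T, E, X)
```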
For this exercise, we will use the regime dataset and include the categorical variables un_continent_name (e.g., Asia, North America, …), the regime type (e.g., monarchy, civilian, …) and the year the regime started in, start_year. The estimator of the unknown coefficients in Aalen’s additive model is located under AalenAdditiveFitter.
from lifelines import AalenAdditiveFitter
from lifelines.datasets import load_dd
data = load_dd()
data.head()
ctryname  cowcode2  politycode  un_region_name  un_continent_name  ehead  leaderspellreg  democracy  regime  start_year  duration  observed 

Afghanistan  700  700  Southern Asia  Asia  Mohammad Zahir Shah  Mohammad Zahir Shah.Afghanistan.1946.1952.Monarchy  Nondemocracy  Monarchy  1946  7  1 
Afghanistan  700  700  Southern Asia  Asia  Sardar Mohammad Daoud  Sardar Mohammad Daoud.Afghanistan.1953.1962.Civilian Dict  Nondemocracy  Civilian Dict  1953  10  1 
Afghanistan  700  700  Southern Asia  Asia  Mohammad Zahir Shah  Mohammad Zahir Shah.Afghanistan.1963.1972.Monarchy  Nondemocracy  Monarchy  1963  10  1 
Afghanistan  700  700  Southern Asia  Asia  Sardar Mohammad Daoud  Sardar Mohammad Daoud.Afghanistan.1973.1977.Civilian Dict  Nondemocracy  Civilian Dict  1973  5  0 
Afghanistan  700  700  Southern Asia  Asia  Nur Mohammad Taraki  Nur Mohammad Taraki.Afghanistan.1978.1978.Civilian Dict  Nondemocracy  Civilian Dict  1978  1  0 
I’m using the lovely library Patsy here to create a design matrix from my original dataframe.
import patsy
X = patsy.dmatrix('un_continent_name + regime + start_year', data, return_type='dataframe')
X = X.rename(columns={'Intercept': 'baseline'})
print(X.columns.tolist())
['baseline',
'un_continent_name[T.Americas]',
'un_continent_name[T.Asia]',
'un_continent_name[T.Europe]',
'un_continent_name[T.Oceania]',
'regime[T.Military Dict]',
'regime[T.Mixed Dem]',
'regime[T.Monarchy]',
'regime[T.Parliamentary Dem]',
'regime[T.Presidential Dem]',
'start_year']
We have also included the coef_penalizer option. During estimation, a linear regression is computed at each step. Often the regression can be unstable (due to high collinearity or small sample sizes), so adding a penalizer term controls the stability. I recommend always starting with a small penalizer term; if the estimates still appear to be too unstable, try increasing it.
aaf = AalenAdditiveFitter(coef_penalizer=1.0, fit_intercept=False)
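The stabilizing effect of a penalizer can be seen in an ordinary linear regression with nearly collinear columns, which is the situation the per-step regressions above can face. A minimal sketch (plain numpy, not lifelines internals): the unpenalized least-squares coefficients blow up to fit noise, while the ridge-penalized ones stay near the sensible solution of splitting the signal between the two columns.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 30
x1 = rng.normal(size=n)
x2 = x1 + 1e-6 * rng.normal(size=n)     # nearly collinear with x1
X = np.column_stack([x1, x2])
y = x1 + rng.normal(scale=0.1, size=n)  # true signal lives on x1 (== x2)

# Unpenalized least squares: the near-singular design lets huge
# opposite-signed coefficients fit the noise.
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# Ridge-penalized least squares: (X'X + penalizer * I)^{-1} X'y
penalizer = 1.0
beta_ridge = np.linalg.solve(X.T @ X + penalizer * np.eye(2), X.T @ y)

print(np.abs(beta_ols).max())  # typically enormous
print(beta_ridge)              # roughly [0.5, 0.5]: the signal is shared
```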
An instance of AalenAdditiveFitter includes a fit() method that performs the inference on the coefficients. This method accepts a pandas DataFrame: each row is an individual, and the columns are the covariates plus two special columns: a duration column and a boolean event-occurred column (where “event occurred” refers to the event of interest, expulsion from government in this case).
X['T'] = data['duration']
X['E'] = data['observed']
aaf.fit(X, 'T', event_col='E')
After fitting, the instance exposes a cumulative_hazards_ DataFrame containing the estimates of \(\int_0^t b_i(s) \; ds\):
aaf.cumulative_hazards_.head()
baseline  un_continent_name[T.Americas]  un_continent_name[T.Asia]  un_continent_name[T.Europe]  un_continent_name[T.Oceania]  regime[T.Military Dict]  regime[T.Mixed Dem]  regime[T.Monarchy]  regime[T.Parliamentary Dem]  regime[T.Presidential Dem]  start_year 

0.03447  0.03173  0.06216  0.2058  0.009559  0.07611  0.08729  0.1362  0.04885  0.1285  0.000092 
0.14278  0.02496  0.11122  0.2083  0.079042  0.11704  0.36254  0.2293  0.17103  0.1238  0.000044 
0.30153  0.07212  0.10929  0.1614  0.063030  0.16553  0.68693  0.2738  0.33300  0.1499  0.000004 
0.37969  0.06853  0.15162  0.2609  0.185569  0.22695  0.95016  0.2961  0.37351  0.4311  0.000032 
0.36749  0.20201  0.21252  0.2429  0.188740  0.25127  1.15132  0.3926  0.54952  0.7593  0.000000 
AalenAdditiveFitter also has built-in plotting:
aaf.plot(columns=['regime[T.Presidential Dem]', 'baseline', 'un_continent_name[T.Europe]'], iloc=slice(1,15))
Regression is most interesting if we use it on data we have not yet seen, i.e., prediction! We can use what we have learned to predict individual hazard rates, survival functions, and median survival time. The dataset we are using is available up until 2008, so let’s use this data to predict the duration of former Canadian Prime Minister Stephen Harper.
ix = (data['ctryname'] == 'Canada') & (data['start_year'] == 2006)
harper = X.loc[ix]
print("Harper's unique data point:")
print(harper)
Harper's unique data point:
baseline un_continent_name[T.Americas] un_continent_name[T.Asia] ... start_year T E
268 1.0 1.0 0.0 ... 2006.0 3 0
from matplotlib import pyplot as plt

ax = plt.subplot(2,1,1)
aaf.predict_cumulative_hazard(harper).plot(ax=ax)
ax = plt.subplot(2,1,2)
aaf.predict_survival_function(harper).plot(ax=ax);
Note
Because of the nature of the model, estimated survival functions of individuals can increase. This is an expected artifact of Aalen’s additive model.
Custom Parametric Regression Models¶
lifelines has a very general syntax for creating your own parametric regression models. If you are looking to create your own custom models, see docs Custom Regression Models.
Model selection in survival regression¶
Parametric vs Semiparametric models¶
Above, we’ve displayed two semiparametric models (Cox model and Aalen’s model), and a family of parametric AFT models. Which should you choose? What are the advantages and disadvantages of either? I suggest reading the two following StackExchange answers to get a better idea of what experts think:
Model selection based on residuals¶
The sections Testing the Proportional Hazard Assumptions and Assessing Cox model fit using residuals may be useful for modeling your data better.
Note
Work is being done to extend residual methods to AFT models. Stay tuned.
Model selection based on predictive power¶
If censoring is present, it’s not appropriate to use a loss function like mean-squared-error or mean-absolute-loss. Instead, one measure is the concordance index, also known as the c-index. This measure evaluates the accuracy of the ranking of predicted times. It is in fact a generalization of AUC, another common loss function, and is interpreted similarly:
 0.5 is the expected result from random predictions,
 1.0 is perfect concordance and,
 0.0 is perfect anti-concordance (multiply predictions by -1 to get 1.0)
Fitted survival models typically have a concordance index between 0.55 and 0.75 (this may seem low, but even a perfect model has a lot of noise that can make a high score impossible). In lifelines, a fitted model’s concordance index is present in the output of print_summary(), but also available under the score_ property. Generally, the measure is implemented in lifelines under lifelines.utils.concordance_index() and accepts the actual times (along with any censored subjects) and the predicted times.
from lifelines import CoxPHFitter
from lifelines.datasets import load_rossi
rossi = load_rossi()
cph = CoxPHFitter()
cph.fit(rossi, duration_col="week", event_col="arrest")
# Three ways to view the c-index:
# method one
cph.print_summary()
# method two
print(cph.score_)
# method three
from lifelines.utils import concordance_index
print(concordance_index(rossi['week'], cph.predict_partial_hazard(rossi), rossi['arrest']))
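To see what the concordance index measures, here is a naive O(n²) sketch of it. lifelines.utils.concordance_index() is the real (and much faster) implementation, and it handles tied times somewhat differently; this version is purely didactic:

```python
from itertools import combinations

def naive_concordance_index(event_times, predicted_scores, event_observed):
    """Fraction of comparable pairs whose predicted ordering matches the
    observed ordering. A pair is comparable only if the earlier time is
    an observed event (otherwise we cannot rank the pair). Higher score
    is taken to mean longer predicted survival; tied predictions count
    as half a concordant pair. Tied times are skipped in this sketch.
    """
    num, den = 0.0, 0
    for i, j in combinations(range(len(event_times)), 2):
        # order so that subject `a` has the earlier time
        a, b = (i, j) if event_times[i] < event_times[j] else (j, i)
        if event_times[a] == event_times[b]:
            continue  # tied times: skipped here
        if not event_observed[a]:
            continue  # earlier time censored: pair is not comparable
        den += 1
        if predicted_scores[a] < predicted_scores[b]:
            num += 1.0  # concordant: earlier death, lower predicted score
        elif predicted_scores[a] == predicted_scores[b]:
            num += 0.5
    return num / den

# a perfect ranking on fully-observed data scores 1.0
print(naive_concordance_index([1, 3, 5, 7], [1, 3, 5, 7], [1, 1, 1, 1]))  # 1.0
```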
Note
Remember, the concordance score evaluates the relative rankings of subjects’ event times. Thus, it is scale and shift invariant (i.e. you can multiply by a positive constant, or add a constant, and the rankings won’t change). A model maximized for the concordance index does not necessarily give good predicted times, but it will give good predicted rankings.
However, there are other, arguably better, methods to measure the fit of a model. Included in print_summary is the log-likelihood, which can be used in an AIC calculation, and the log-likelihood ratio statistic. Personally, I loved this article by Frank Harrell, “Statistically Efficient Ways to Quantify Added Predictive Value of New Measurements”.
lifelines has an implementation of k-fold cross validation under lifelines.utils.k_fold_cross_validation(). This function accepts an instance of a regression fitter (either CoxPHFitter or AalenAdditiveFitter), a dataset, plus k (the number of folds to perform, default 5). On each fold, it splits the data into a training set and a testing set, fits itself on the training set, and evaluates itself on the testing set (using the concordance measure by default).
import numpy as np

from lifelines import CoxPHFitter
from lifelines.datasets import load_regression_dataset
from lifelines.utils import k_fold_cross_validation
regression_dataset = load_regression_dataset()
cph = CoxPHFitter()
scores = k_fold_cross_validation(cph, regression_dataset, 'T', event_col='E', k=3)
print(scores)
print(np.mean(scores))
print(np.std(scores))
#[ 0.5896 0.5358 0.5028]
# 0.542
# 0.035
Also, lifelines has wrappers for compatibility with scikit-learn, making cross-validation and grid search even easier.
Custom regression models¶
Like for univariate models, it is possible to create your own custom parametric survival models. Why might you want to do this?
 Create new / extend AFT models using known probability distributions
 Create a piecewise model using domain knowledge about subjects
 Iterate and fit a more accurate parametric model
lifelines has a very simple API to create custom parametric regression models. The author only needs to define the cumulative hazard function. For example, the cumulative hazard for the Exponential regression model looks like:

\(H(t; x) = \frac{t}{\lambda(x)}, \quad \lambda(x) = \exp(x^T \beta)\)
Below are some example custom models.
from lifelines.fitters import ParametricRegressionFitter
from autograd import numpy as np
from lifelines.datasets import load_rossi
class ExponentialAFTFitter(ParametricRegressionFitter):

    # this is necessary, and should always be a non-empty list of strings.
    _fitted_parameter_names = ['lambda_']

    def _cumulative_hazard(self, params, T, Xs):
        # params is a dictionary that maps unknown parameters to a numpy vector.
        # Xs is a dictionary that maps unknown parameters to a numpy 2d array
        lambda_ = np.exp(np.dot(Xs['lambda_'], params['lambda_']))
        return T / lambda_
rossi = load_rossi()
rossi['intercept'] = 1.0

# the below variable maps dataframe columns to parameters
regressors = {
    'lambda_': rossi.columns
}
eaf = ExponentialAFTFitter().fit(rossi, 'week', 'arrest', regressors=regressors)
eaf.print_summary()
<lifelines.ExponentialAFTFitter: fitted with 432 observations, 318 censored>
         event col = 'arrest'
number of subjects = 432
  number of events = 114
    log-likelihood = -686.37
  time fit was run = 2019-07-03 01:56:13 UTC

                   coef  exp(coef)  se(coef)      z      p  -log2(p)  lower 0.95  upper 0.95
lambda_ fin        0.37       1.44      0.19   1.92   0.06      4.18       -0.01        0.74
        age        0.06       1.06      0.02   2.55   0.01      6.52        0.01        0.10
        race      -0.30       0.74      0.31  -0.99   0.32      1.63       -0.91        0.30
        wexp       0.15       1.16      0.21   0.69   0.49      1.03       -0.27        0.56
        mar        0.43       1.53      0.38   1.12   0.26      1.93       -0.32        1.17
        paro       0.08       1.09      0.20   0.42   0.67      0.57       -0.30        0.47
        prio      -0.09       0.92      0.03  -3.03 <0.005      8.65       -0.14       -0.03
        intercept  4.05      57.44      0.59   6.91 <0.005     37.61        2.90        5.20

Log-likelihood ratio test = 31.22 on 6 df, -log2(p)=15.41
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
from matplotlib import pyplot as plt
from autograd import numpy as np

from lifelines.fitters import ParametricRegressionFitter
from lifelines.datasets import load_rossi
class PolynomialCumulativeHazard(ParametricRegressionFitter):

    _fitted_parameter_names = ['lambda1_', 'lambda2_', 'lambda3_']

    def _cumulative_hazard(self, params, T, Xs):
        lambda1_ = np.exp(np.dot(Xs['lambda1_'], params['lambda1_']))
        lambda2_ = np.exp(np.dot(Xs['lambda2_'], params['lambda2_']))
        lambda3_ = np.exp(np.dot(Xs['lambda3_'], params['lambda3_']))
        return (T / lambda1_) + (T / lambda2_) ** 2 + (T / lambda3_) ** 3

    def _add_penalty(self, params, neg_ll):
        # authors can create their own non-traditional penalty functions too.
        # This penalty is an "information-pooling" penalty, see more about it here:
        # https://dataorigami.net/blogs/napkinfolding/churn
        params_stacked = np.stack(params.values())
        coef_penalty = 0
        if self.penalizer > 0:
            for i in range(params_stacked.shape[1] - 1):  # assuming the intercept col is the last column
                coef_penalty = coef_penalty + (params_stacked[:, i]).var()
        return neg_ll + self.penalizer * coef_penalty
rossi = load_rossi()
rossi['intercept'] = 1.0

# the below variable maps dataframe columns to parameters
regressors = {
    'lambda1_': rossi.columns,
    'lambda2_': rossi.columns,
    'lambda3_': rossi.columns
}
pf = PolynomialCumulativeHazard(penalizer=.5).fit(rossi, 'week', 'arrest', regressors=regressors)
pf.print_summary()
ax = plt.subplot()
pf.plot(columns=['fin'], ax=ax)
pf = PolynomialCumulativeHazard(penalizer=5.).fit(rossi, 'week', 'arrest', regressors=regressors)
pf.plot(columns=['fin'], ax=ax, c="r")
/Users/camerondavidsonpilon/code/lifelines/lifelines/fitters/__init__.py:1510: RuntimeWarning: invalid value encountered in sqrt
se = np.sqrt(self.variance_matrix_.diagonal())
<lifelines.PolynomialCumulativeHazard: fitted with 432 observations, 318 censored>
         event col = 'arrest'
         penalizer = 0.5
number of subjects = 432
  number of events = 114
    log-likelihood = -680.15
  time fit was run = 2019-07-03 01:56:13 UTC

                    coef  exp(coef)  se(coef)     z      p  -log2(p)  lower 0.95  upper 0.95
lambda1_ fin        0.25       1.28      0.18  1.40   0.16      2.64       -0.10        0.60
         age        0.04       1.05      0.02  2.29   0.02      5.49        0.01        0.08
         race      -0.30       0.74      0.28 -1.04   0.30      1.75       -0.85        0.26
         wexp       0.15       1.16      0.19  0.77   0.44      1.18       -0.23        0.52
         mar        0.30       1.35      0.31  0.98   0.33      1.61       -0.30        0.91
         paro       0.06       1.06      0.18  0.31   0.76      0.40       -0.30        0.42
         prio      -0.06       0.94      0.03 -2.15   0.03      4.99       -0.12       -0.01
         intercept  4.70     110.30      0.59  7.98 <0.005     49.23        3.55        5.86
lambda2_ fin        0.22       1.25      0.24  0.93   0.35      1.50       -0.24        0.68
         age        0.05       1.05      0.02  2.08   0.04      4.75        0.00        0.10
         race      -0.15       0.86      0.36 -0.41   0.68      0.56       -0.85        0.55
         wexp       0.00       1.00      0.25  0.01   0.99      0.01       -0.48        0.48
         mar        0.26       1.29      0.40  0.65   0.52      0.95       -0.52        1.04
         paro       0.07       1.07      0.24  0.27   0.78      0.35       -0.41        0.54
         prio      -0.06       0.94      0.04 -1.46   0.14      2.79       -0.14        0.02
         intercept 58.15   1.80e+25       nan   nan    nan       nan         nan         nan
lambda3_ fin        0.19       1.21      0.14  1.33   0.18      2.44       -0.09        0.47
         age        0.06       1.06      0.02  2.75   0.01      7.40        0.02        0.10
         race       0.00       1.00      0.19  0.01   0.99      0.01       -0.37        0.37
         wexp      -0.14       0.87      0.15 -0.93   0.35      1.52       -0.44        0.16
         mar        0.22       1.24      0.30  0.71   0.48      1.06       -0.38        0.81
         paro       0.08       1.08      0.14  0.55   0.58      0.78       -0.19        0.34
         prio      -0.05       0.95      0.02 -2.24   0.03      5.31       -0.10       -0.01
         intercept  3.49      32.75      0.47  7.42 <0.005     42.96        2.57        4.41

Log-likelihood ratio test = 134.41 on 22 df, -log2(p)=57.81
/Users/camerondavidsonpilon/code/lifelines/lifelines/fitters/__init__.py:1510: RuntimeWarning: invalid value encountered in sqrt
se = np.sqrt(self.variance_matrix_.diagonal())
Cure models¶
Suppose in our population we have a subpopulation that will never experience the event of interest. Or, for some subjects the event will occur so far in the future that it’s essentially at time infinity. In this case, the survival function for an individual should not asymptotically approach zero, but some positive value. Models that describe this are sometimes called cure models (i.e. the subject is “cured” of death and hence no longer susceptible) or time-lagged conversion models.
It would be nice to be able to use common survival models and have some “cure” component. Let’s suppose that for individuals that will experience the event of interest, their survival distribution is a Weibull, denoted \(S_W(t)\). For a randomly selected individual in the population, their survival curve, \(S(t)\), is:

\(S(t) = P(T > t) = (1 - c) + c S_W(t)\)

where \(c\) is the probability of being susceptible to the event. Even though it’s in an unconventional form, we can still determine the cumulative hazard (which is the negative logarithm of the survival function):

\(H(t) = -\log\left((1 - c) + c S_W(t)\right)\)
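A quick numeric check of these formulas, with a Weibull \(S_W(t) = \exp(-(t/\lambda)^\rho)\) and hypothetical parameter values: the survival function plateaus at \(1 - c\) rather than falling to zero, so the cumulative hazard plateaus at \(-\log(1-c)\) instead of diverging.

```python
import math

def cure_model_survival(t, c, lam, rho):
    # S(t) = (1 - c) + c * S_W(t), with Weibull S_W(t) = exp(-(t/lam)**rho)
    return (1 - c) + c * math.exp(-((t / lam) ** rho))

def cure_model_cumulative_hazard(t, c, lam, rho):
    # H(t) = -log S(t)
    return -math.log(cure_model_survival(t, c, lam, rho))

c, lam, rho = 0.7, 10.0, 1.5  # hypothetical values: 30% of the population is "cured"
# As t grows, S(t) plateaus at 1 - c = 0.3, not 0, and
# H(t) plateaus at -log(0.3) instead of diverging.
print(cure_model_survival(1000.0, c, lam, rho))           # ~ 0.3
print(cure_model_cumulative_hazard(1000.0, c, lam, rho))  # ~ -log(0.3)
```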
from autograd.scipy.special import expit

class CureModel(ParametricRegressionFitter):

    _fitted_parameter_names = ["lambda_", "beta_", "rho_"]

    def _cumulative_hazard(self, params, T, Xs):
        c = expit(np.dot(Xs["beta_"], params["beta_"]))
        lambda_ = np.exp(np.dot(Xs["lambda_"], params["lambda_"]))
        rho_ = np.exp(np.dot(Xs["rho_"], params["rho_"]))
        sf = np.exp(-(T / lambda_) ** rho_)  # Weibull survival function
        return -np.log((1 - c) + c * sf)
swf = CureModel(penalizer=.1)
rossi = load_rossi()
rossi["intercept"] = 1.0
covariates = {"lambda_": rossi.columns, "rho_": ["intercept", "prio"], "beta_": rossi.columns}
swf.fit(rossi, "week", event_col="arrest", regressors=covariates, timeline=np.arange(250))
swf.print_summary(2)
<lifelines.CureModel: fitted with 432 observations, 318 censored>
         event col = 'arrest'
         penalizer = 0.1
number of subjects = 432
  number of events = 114
    log-likelihood = -733.43
  time fit was run = 2019-07-03 02:01:56 UTC

                   coef  exp(coef)  se(coef)      z      p  -log2(p)  lower 0.95  upper 0.95
lambda_ fin        0.35       1.42      0.15   2.40   0.02      5.92        0.06        0.64
        age        0.11       1.12      0.01  11.67 <0.005    102.18        0.09        0.13
        race       0.69       1.99      0.19   3.63 <0.005     11.79        0.32        1.06
        wexp       0.45       1.57      0.15   3.06 <0.005      8.84        0.16        0.74
        mar        0.39       1.48      0.23   1.69   0.09      3.45       -0.06        0.85
        paro       0.21       1.23      0.15   1.35   0.18      2.49       -0.09        0.51
        prio       0.02       1.02      0.02   0.72   0.47      1.08       -0.03        0.06
        intercept  0.27       1.31      0.10   2.60   0.01      6.73        0.07        0.48
rho_    prio       0.03       1.03      0.02   1.48   0.14      2.85       -0.01        0.06
        intercept  0.10       1.11      0.08   1.33   0.18      2.44       -0.05        0.25
beta_   fin       -0.07       0.93      0.18  -0.37   0.71      0.50       -0.43        0.29
        age       -0.01       0.99      0.01  -0.74   0.46      1.12       -0.03        0.01
        race       0.15       1.17      0.24   0.63   0.53      0.92       -0.33        0.63
        wexp       0.07       1.08      0.19   0.40   0.69      0.53       -0.29        0.44
        mar        0.04       1.04      0.30   0.12   0.91      0.14       -0.56        0.63
        paro      -0.08       0.92      0.18  -0.46   0.65      0.63       -0.45        0.28
        prio       0.04       1.04      0.03   1.24   0.21      2.22       -0.02        0.09
        intercept -0.04       0.96      0.10  -0.39   0.70      0.52       -0.25        0.17

Log-likelihood ratio test = 516.07 on 16 df, -log2(p)=328.45
swf.predict_survival_function(rossi.loc[::75]).plot(figsize=(12,6))
# what's the effect on the survival curve if I vary "age"
fig, ax = plt.subplots(figsize=(12, 6))
swf.plot_covariate_groups(['age'], values=np.arange(20, 50, 5), cmap='coolwarm', ax=ax)
Compatibility with scikitlearn¶
New to lifelines in version 0.21.3 is a wrapper that allows you to use lifelines’ regression models with scikit-learn’s APIs.
Note
the API and functionality are still experimental. Please report any bugs or feature requests on our GitHub issue list.
from lifelines.utils.sklearn_adapter import sklearn_adapter
from lifelines import CoxPHFitter
from lifelines.datasets import load_rossi
X = load_rossi().drop('week', axis=1) # keep as a dataframe
Y = load_rossi().pop('week')
CoxRegression = sklearn_adapter(CoxPHFitter, event_col='arrest')
# CoxRegression is a class like the `LinearRegression` class or `SVC` class in scikitlearn
sk_cph = CoxRegression(penalizer=1.0)
sk_cph.fit(X, Y)
print(sk_cph)
"""
SkLearnCoxPHFitter(alpha=0.05, penalizer=1.0, strata=None, tie_method='Efron')
"""
sk_cph.predict(X)
sk_cph.score(X, Y)
Note
The X variable still needs to be a DataFrame, and should contain the event-occurred column (event_col) if it exists.
If needed, the original lifelines instance is available as the lifelines_model attribute.
sk_cph.lifelines_model.print_summary()
The wrapped classes can even be used in more complex scikit-learn functions (e.g. cross_val_score) and classes (e.g. GridSearchCV):
from lifelines import WeibullAFTFitter
from sklearn.model_selection import cross_val_score
base_class = sklearn_adapter(WeibullAFTFitter, event_col='arrest')
wf = base_class()
scores = cross_val_score(wf, X, Y, cv=5)
print(scores)
"""
[0.59037328 0.503427 0.55454545 0.59689534 0.62311068]
"""
import numpy as np
from sklearn.model_selection import GridSearchCV

clf = GridSearchCV(wf, {
    "penalizer": 10.0 ** np.arange(-2, 3),
    "l1_ratio": [0, 1/3, 2/3],
    "model_ancillary": [True, False],
}, cv=4)
clf.fit(X, Y)
print(clf.best_estimator_)
"""
SkLearnWeibullAFTFitter(alpha=0.05, fit_intercept=True,
l1_ratio=0.66666, model_ancillary=True,
penalizer=0.01)
"""
Note
The lifelines.utils.sklearn_adapter() is currently only designed to work with right-censored data.
Time varying survival regression¶
Cox’s time varying proportional hazard model¶
Often an individual will have a covariate change over time. An example of this is hospital patients who enter the study and, at some future time, may receive a heart transplant. We would like to know the effect of the transplant, but we cannot naively condition on whether they received the transplant. Consider that if patients needed to wait at least 1 year before getting a transplant, then everyone who died before that year would be counted as a non-transplant patient, and hence this would overestimate the hazard of not receiving a transplant.
We can incorporate changes over time into our survival analysis by using a modification of the Cox model above. The general mathematical description is:

\(h(t | x(t)) = b_0(t) \exp\left(\sum_i \beta_i x_i(t)\right)\)

Note the time-varying \(x_i(t)\), denoting that covariates can change over time. This model is implemented in lifelines as CoxTimeVaryingFitter. The dataset schema required is different from the previous models’, so we will spend some time describing it.
Dataset creation for timevarying regression¶
lifelines requires that the dataset be in what is called the long format. This looks like one row per state change, including an ID, the left (exclusive) time point, and right (inclusive) time point. For example, the following dataset tracks three unique subjects.
id  start  stop  group  z  event 

1  0  8  1  0  False 
2  0  5  0  0  False 
2  5  8  0  1  True 
3  0  3  1  0  False 
3  3  12  1  1  True 
In the above dataset, start and stop denote the boundaries, id is the unique identifier per subject, and event denotes if the subject died at the end of that period. For example, subject ID 2 had variable z=0 up to and including the end of time period 5 (we can think of measurements as happening at the end of the time period), after which it was set to 1. Since event is 1 in that row, we conclude that the subject died at time 8.
This desired dataset can be built up from smaller datasets. To do this we can use some helper functions provided in lifelines. Typically, data will be in a format that looks like it comes out of a relational database. You may have a “base” table with ids, durations alive, and a censored flag, and possibly static covariates. Ex:
id  duration  event  var1 

1  10  True  0.1 
2  12  False  0.5 
We will perform a light transform to this dataset to modify it into the “long” format.
from lifelines.utils import to_long_format
base_df = to_long_format(base_df, duration_col="duration")
The new dataset looks like:
id  start  stop  var1  event 

1  0  10  0.1  True 
2  0  12  0.5  False 
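The transform above is mechanical: each (id, duration, event) row becomes a single (id, start=0, stop=duration, event) row, with static covariates carried along. A sketch with plain dictionaries (the function name is ours; lifelines.utils.to_long_format operates on DataFrames):

```python
def to_long_format_sketch(rows):
    """Convert (id, duration, event, ...static covariates) rows into
    the long format: one row per period with start/stop bounds.
    A sketch of the idea behind lifelines.utils.to_long_format.
    """
    long_rows = []
    for row in rows:
        new = dict(row)          # leave the input untouched
        new["start"] = 0
        new["stop"] = new.pop("duration")
        long_rows.append(new)
    return long_rows

base = [
    {"id": 1, "duration": 10, "event": True, "var1": 0.1},
    {"id": 2, "duration": 12, "event": False, "var1": 0.5},
]
print(to_long_format_sketch(base))
```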
You’ll also have a secondary dataset that references future measurements. This could come in two “types”. The first is when you have a variable that changes over time (e.g. administering varying medication over time, or taking a temperature over time). The second type is an event-based dataset: an event happens at some time in the future (e.g. an organ transplant occurs, or an intervention). We will address the second type later. The first type of dataset may look something like:
Example:
id  time  var2 

1  0  1.4 
1  4  1.2 
1  8  1.5 
2  0  1.6 
where time is the duration from the entry event. Here we see subject 1 had a change in their var2 covariate at the end of time 4 and at the end of time 8. We can use lifelines.utils.add_covariate_to_timeline() to fold the covariate dataset into the original dataset.
from lifelines.utils import add_covariate_to_timeline
df = add_covariate_to_timeline(base_df, cv, duration_col="time", id_col="id", event_col="event")
id  start  stop  var1  var2  event 

1  0  4  0.1  1.4  False 
1  4  8  0.1  1.2  False 
1  8  10  0.1  1.5  True 
2  0  12  0.5  1.6  False 
From the above output, we can see that subject 1 changed state twice over the observation period, finally expiring at the end of time 10. Subject 2 was a censored case, and we lost track of them after time 12.
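The splitting logic behind this join can be sketched directly: cut each subject’s (start, stop] window at every covariate change point, carry the last-seen value forward, and let the event flag survive only on the final piece. This toy function (our own, operating on plain dictionaries rather than DataFrames) reproduces the table above for subject 1:

```python
def add_covariate_sketch(base_row, measurements):
    """Split one subject's (start, stop] interval at each covariate
    measurement time, carrying the covariate value forward.

    base_row: dict with "start", "stop", "event" (plus static covariates).
    measurements: time-sorted list of (time, value) pairs for the covariate.
    A sketch of the idea behind lifelines.utils.add_covariate_to_timeline.
    """
    stop = base_row["stop"]
    # change points strictly inside the observation window
    cuts = [t for t, _ in measurements if 0 < t < stop]
    bounds = [base_row["start"]] + cuts + [stop]
    rows = []
    for left, right in zip(bounds, bounds[1:]):
        # value in force on (left, right]: last measurement at or before `left`
        value = None
        for t, v in measurements:
            if t <= left:
                value = v
        row = dict(base_row, start=left, stop=right, var2=value)
        # the event can only occur at the end of the final period
        row["event"] = base_row["event"] and right == stop
        rows.append(row)
    return rows

base = {"id": 1, "start": 0, "stop": 10, "event": True, "var1": 0.1}
cv = [(0, 1.4), (4, 1.2), (8, 1.5)]
for r in add_covariate_sketch(base, cv):
    print(r["start"], r["stop"], r["var2"], r["event"])
# matches the table above: (0,4] 1.4 False, (4,8] 1.2 False, (8,10] 1.5 True
```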
You may have multiple covariates you wish to add, so the above could be streamlined like so:
from lifelines.utils import add_covariate_to_timeline
df = base_df.pipe(add_covariate_to_timeline, cv1, duration_col="time", id_col="id", event_col="event")\
.pipe(add_covariate_to_timeline, cv2, duration_col="time", id_col="id", event_col="event")\
.pipe(add_covariate_to_timeline, cv3, duration_col="time", id_col="id", event_col="event")
If your dataset is of the second type, that is, event-based, it may look something like the following, where values in the matrix denote times since the subject’s birth, and None or NaN represent the event not happening (subjects can be excluded if the event never occurred as well):
print(event_df)
id E1
0 1 1.0
1 2 NaN
2 3 3.0
...
Initially, this can’t be added to our baseline DataFrame. However, using lifelines.utils.covariates_from_event_matrix() we can convert a DataFrame like this into one that can be easily added.
from lifelines.utils import covariates_from_event_matrix
cv = covariates_from_event_matrix(event_df, id_col="id")
print(cv)
event id duration E1
0 1 1.0 1
1 3 3.0 1
...
base_df = add_covariate_to_timeline(base_df, cv, duration_col="time", id_col="id", event_col="E")
For an example of pulling datasets like this from a SQLstore, and other helper functions, see Example SQL queries and transformations to get time varying data.
Cumulative sums¶
One additional flag on add_covariate_to_timeline() that is of interest is the cumulative_sum flag. By default it is False, but setting it to True will perform a cumulative sum on the covariate before joining. This is useful if the covariates describe an incremental change, rather than a state update. For example, we may have measurements of drugs administered to a patient, and we want the covariate to reflect how much has been administered since the start. Event columns make sense to cumulatively sum as well. In contrast, a covariate that measures the temperature of the patient is a state update, and should not be summed. See Example cumulative sums over time-varying covariates to see an example of this.
Delaying timevarying covariates¶
add_covariate_to_timeline() also has an option for delaying, or shifting, a covariate so it changes later than originally observed. One may ask: why should one delay a time-varying covariate? Here’s an example. Consider investigating the impact of smoking on mortality, where available to us are time-varying observations of how many cigarettes are consumed each month. Unbeknownst to us, when a subject reaches critical illness levels, they are admitted to the hospital and their cigarette consumption drops to zero. Some expire while in hospital. If we used this dataset naively, we would see that not smoking leads to sudden death, and conversely, that smoking improves your health! This is a case of reverse causation: the upcoming death event actually influences the covariates.
To handle this, you can delay the observations by a number of time periods:

from lifelines.utils import add_covariate_to_timeline

base_df = add_covariate_to_timeline(base_df, cv, duration_col="time", id_col="id", event_col="E", delay=14)
Fitting the model¶
Once your dataset is in the correct orientation, we can use CoxTimeVaryingFitter to fit the model to your data. The method is similar to CoxPHFitter, except we need to tell fit() about the additional time columns.
Fitting the Cox model to the data involves using gradient descent. lifelines takes extra effort to help with convergence, so please be attentive to any warnings that appear. Fixing any warnings will generally help convergence. For further help, see Problems with convergence in the Cox proportional hazard model.
from lifelines import CoxTimeVaryingFitter
ctv = CoxTimeVaryingFitter()
ctv.fit(df, id_col="id", event_col="event", start_col="start", stop_col="stop", show_progress=True)
ctv.print_summary()
ctv.plot()
Short note on prediction¶
Unlike the other regression models, prediction in a time-varying setting is not trivial. To predict, we would need to know the covariate values beyond the observed times, but if we knew those, we would also know if the subject was still alive or not! However, it is still possible to compute the hazard values of subjects at known observations, the baseline cumulative hazard rate, and the baseline survival function. So while CoxTimeVaryingFitter exposes prediction methods, there are logical limitations to what these predictions mean.
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
from matplotlib import pyplot as plt
from lifelines import CoxPHFitter
import numpy as np
import pandas as pd
Testing the proportional hazard assumptions¶
This Jupyter notebook is a small tutorial on how to test and fix proportional hazard problems.
The proportional hazard assumption is that all individuals have the same hazard function, but a unique scaling factor in front: \(h_i(t) = a_i h(t)\). So the shape of the hazard function is the same for all individuals, and only a scalar in front changes. At the core of the assumption is that \(a_i\) is not time-varying, that is, \(a_i(t) = a_i\). Furthermore, if we take the ratio of the hazards of two subjects \(i\) and \(j\) (called the hazard ratio), \(h_i(t)/h_j(t) = a_i/a_j\), it is constant for all \(t\). In this tutorial we will test this non-time-varying assumption, and look at ways to handle violations.
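A quick numeric illustration of the assumption: under proportional hazards, each subject’s hazard is a scalar times a shared baseline, so the hazard ratio of two subjects is the same at every time point. The baseline below is an arbitrary made-up function:

```python
def hazard(t, a):
    # proportional hazards: a subject-specific scalar `a` times a shared
    # baseline hazard b0(t) (a hypothetical increasing baseline here)
    b0 = 0.01 * t ** 2 + 0.1
    return a * b0

a_i, a_j = 2.0, 0.5
ratios = [hazard(t, a_i) / hazard(t, a_j) for t in (0.5, 1.0, 5.0, 20.0)]
print(ratios)  # the ratio is a_i / a_j = 4.0 at every t
```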
from lifelines.datasets import load_rossi
rossi = load_rossi()
cph = CoxPHFitter()
cph.fit(rossi, 'week', 'arrest')
cph.print_summary(model="untransformed variables", decimals=3)
<lifelines.CoxPHFitter: fitted with 432 observations, 318 censored>
      duration col = 'week'
         event col = 'arrest'
number of subjects = 432
  number of events = 114
    log-likelihood = -658.748
  time fit was run = 2019-04-03 02:39:31 UTC
             model = untransformed variables

       coef  exp(coef)  se(coef)       z      p  -log2(p)  lower 0.95  upper 0.95
fin  -0.379      0.684     0.191  -1.983  0.047     4.398      -0.755      -0.004
age  -0.057      0.944     0.022  -2.611  0.009     6.791      -0.101      -0.014
race  0.314      1.369     0.308   1.019  0.308     1.698      -0.290       0.918
wexp -0.150      0.861     0.212  -0.706  0.480     1.058      -0.566       0.266
mar  -0.434      0.648     0.382  -1.136  0.256     1.965      -1.182       0.315
paro -0.085      0.919     0.196  -0.434  0.665     0.589      -0.469       0.299
prio  0.091      1.096     0.029   3.194  0.001     9.476       0.035       0.148

Concordance = 0.640
Log-likelihood ratio test = 33.266 on 7 df, -log2(p)=15.370
Checking assumptions with check_assumptions¶
New to lifelines 0.16.0 is the CoxPHFitter.check_assumptions method. This method will compute statistics that check the proportional hazard assumption, produce plots to check assumptions, and more. Also included is an option to display advice to the console. Here’s a breakdown of the information displayed:
 Presented first are the results of a statistical test for any time-varying coefficients. A time-varying coefficient implies that a covariate’s influence relative to the baseline changes over time, which is a violation of the proportional hazard assumption. For each variable, we transform time four times (these are common transformations of time to perform). If lifelines rejects the null (that is, lifelines rejects that the coefficient is not time-varying), we report this to the user.
 Some advice is presented on how to correct the proportional hazard violation based on some summary statistics of the variable.
 As a complement to the above statistical test, for each variable that violates the PH assumption, visual plots of the scaled Schoenfeld residuals are presented against the four time transformations. A fitted lowess is also presented, along with 10 bootstrapped lowess lines (as an approximation to the confidence interval of the original lowess line). Ideally, this lowess line is constant (flat). Deviations away from the constant line are violations of the PH assumption.
Why the scaled Schoenfeld residuals?¶
This section can be skipped on first read. Let \(s_{t,j}\) denote the scaled Schoenfeld residual of variable \(j\) at time \(t\), \(\hat{\beta_j}\) denote the maximum-likelihood estimate of the \(j\)th variable, and \(\beta_j(t)\) a time-varying coefficient in a (fictional) alternative model that allows for time-varying coefficients. Therneau and Grambsch showed that
\[E[s_{t,j}] + \hat{\beta_j} \approx \beta_j(t)\]
The proportional hazard assumption implies that \(\hat{\beta_j} = \beta_j(t)\), hence \(E[s_{t,j}] = 0\). This is what the above proportional hazard test is testing. Visually, plotting \(s_{t,j}\) over time (or some transform of time) is a good way to see violations of \(E[s_{t,j}] = 0\), along with the statistical test.
[4]:
cph.check_assumptions(rossi, p_value_threshold=0.05, show_plots=True)
The ``p_value_threshold`` is set at 0.05. Even under the null hypothesis of no violations, some
covariates will be below the threshold by chance. This is compounded when there are many covariates.
Similarly, when there are lots of observations, even minor deviances from the proportional hazard
assumption will be flagged.
With that in mind, it's best to use a combination of statistical tests and visual tests to determine
the most serious violations. Produce visual plots using ``check_assumptions(..., show_plots=True)``
and looking for non-constant lines. See link [A] below for a full example.
<lifelines.StatisticalResult>
test_name = proportional_hazard_test
null_distribution = chi squared
degrees_of_freedom = 1

test_statistic p -log2(p)
age km 11.03 <0.005 10.12
rank 11.09 <0.005 10.17
fin km 0.02 0.89 0.17
rank 0.02 0.90 0.16
mar km 0.60 0.44 1.19
rank 0.67 0.41 1.27
paro km 0.12 0.73 0.45
rank 0.14 0.71 0.49
prio km 0.02 0.88 0.18
rank 0.02 0.88 0.18
race km 1.44 0.23 2.12
rank 1.46 0.23 2.14
wexp km 7.48 0.01 7.32
rank 7.18 0.01 7.08
1. Variable 'age' failed the non-proportional test: p-value is 0.0009.
Advice 1: the functional form of the variable 'age' might be incorrect. That is, there may be
non-linear terms missing. The proportional hazard test used is very sensitive to incorrect
functional forms. See documentation in link [D] below on how to specify a functional form.
Advice 2: try binning the variable 'age' using pd.cut, and then specify it in `strata=['age',
...]` in the call in `.fit`. See documentation in link [B] below.
Advice 3: try adding an interaction term with your time variable. See documentation in link [C]
below.
2. Variable 'wexp' failed the non-proportional test: p-value is 0.0063.
Advice: with so few unique values (only 2), you can include `strata=['wexp', ...]` in the call in
`.fit`. See documentation in link [E] below.

[A] https://lifelines.readthedocs.io/en/latest/jupyter_notebooks/Proportional%20hazard%20assumption.html
[B] https://lifelines.readthedocs.io/en/latest/jupyter_notebooks/Proportional%20hazard%20assumption.html#Binvariableandstratifyonit
[C] https://lifelines.readthedocs.io/en/latest/jupyter_notebooks/Proportional%20hazard%20assumption.html#Introducetimevaryingcovariates
[D] https://lifelines.readthedocs.io/en/latest/jupyter_notebooks/Proportional%20hazard%20assumption.html#Modifythefunctionalform
[E] https://lifelines.readthedocs.io/en/latest/jupyter_notebooks/Proportional%20hazard%20assumption.html#Stratification
Alternatively, you can use the proportional hazard test outside of check_assumptions:
[5]:
from lifelines.statistics import proportional_hazard_test
results = proportional_hazard_test(cph, rossi, time_transform='rank')
results.print_summary(decimals=3, model="untransformed variables")
<lifelines.StatisticalResult>
test_name = proportional_hazard_test
time_transform = rank
null_distribution = chi squared
degrees_of_freedom = 1
model = untransformed variables

test_statistic p -log2(p)
age 11.094 0.001 10.173
fin 0.017 0.896 0.158
mar 0.666 0.414 1.271
paro 0.138 0.711 0.493
prio 0.023 0.881 0.183
race 1.462 0.227 2.141
wexp 7.180 0.007 7.084
Stratification¶
In the advice above, we can see that wexp has small cardinality, so we can easily fix that by specifying it in the strata. What does strata do? Let’s go back to the proportional hazard assumption.
In the introduction, we said that the proportional hazard assumption was that
\[h(t | x) = b_0(t) \exp\left(\sum_{i} \beta_i x_i\right)\]
In a simple case, it may be that there are two subgroups that have very different baseline hazards. That is, we can split the dataset into subsamples based on some variable (we call this the stratifying variable), run the Cox model on all subsamples, and compare their baseline hazards. If these baseline hazards are very different, then clearly the formula above is wrong: the \(h(t)\) is some weighted average of the subgroups’ baseline hazards. This ill-fitting average baseline can cause the covariates to appear to have time-dependent influence. A better model might be:
\[h(t | x, G) = b_G(t) \exp\left(\sum_{i} \beta_i x_i\right)\]
where now we have a unique baseline hazard per subgroup \(G\). Because of the way the Cox model is designed, inference of the coefficients is identical (except now there are more baseline hazards, and no variation of the stratifying variable within a subgroup \(G\)).
[6]:
cph.fit(rossi, 'week', 'arrest', strata=['wexp'])
cph.print_summary(model="wexp in strata")
<lifelines.CoxPHFitter: fitted with 432 observations, 318 censored>
duration col = 'week'
event col = 'arrest'
strata = ['wexp']
number of subjects = 432
number of events = 114
log-likelihood = -580.89
time fit was run = 2019-04-03 02:39:34 UTC
model = wexp in strata

      coef  exp(coef)  se(coef)     z      p  -log2(p)  lower 0.95  upper 0.95
fin  -0.38       0.68      0.19 -1.99   0.05      4.42       -0.76       -0.01
age  -0.06       0.94      0.02 -2.64   0.01      6.91       -0.10       -0.01
race  0.31       1.36      0.31  1.00   0.32      1.65       -0.30        0.91
mar  -0.45       0.64      0.38 -1.19   0.23      2.09       -1.20        0.29
paro -0.08       0.92      0.20 -0.42   0.67      0.57       -0.47        0.30
prio  0.09       1.09      0.03  3.16 <0.005      9.33        0.03        0.15

Concordance = 0.61
Log-likelihood ratio test = 172.71 on 6 df, -log2(p)=112.69
[7]:
cph.check_assumptions(rossi, show_plots=True)
The ``p_value_threshold`` is set at 0.01. Even under the null hypothesis of no violations, some
covariates will be below the threshold by chance. This is compounded when there are many covariates.
Similarly, when there are lots of observations, even minor deviances from the proportional hazard
assumption will be flagged.
With that in mind, it's best to use a combination of statistical tests and visual tests to determine
the most serious violations. Produce visual plots using ``check_assumptions(..., show_plots=True)``
and looking for non-constant lines. See link [A] below for a full example.
<lifelines.StatisticalResult>
test_name = proportional_hazard_test
null_distribution = chi squared
degrees_of_freedom = 1

test_statistic p -log2(p)
age km 11.29 <0.005 10.32
rank 4.62 0.03 4.99
fin km 0.02 0.90 0.16
rank 0.05 0.83 0.28
mar km 0.53 0.47 1.10
rank 1.31 0.25 1.99
paro km 0.09 0.76 0.40
rank 0.00 0.97 0.05
prio km 0.02 0.89 0.16
rank 0.02 0.90 0.16
race km 1.47 0.23 2.15
rank 0.64 0.42 1.23
1. Variable 'age' failed the non-proportional test: p-value is 0.0008.
Advice 1: the functional form of the variable 'age' might be incorrect. That is, there may be
non-linear terms missing. The proportional hazard test used is very sensitive to incorrect
functional forms. See documentation in link [D] below on how to specify a functional form.
Advice 2: try binning the variable 'age' using pd.cut, and then specify it in `strata=['age',
...]` in the call in `.fit`. See documentation in link [B] below.
Advice 3: try adding an interaction term with your time variable. See documentation in link [C]
below.

[A] https://lifelines.readthedocs.io/en/latest/jupyter_notebooks/Proportional%20hazard%20assumption.html
[B] https://lifelines.readthedocs.io/en/latest/jupyter_notebooks/Proportional%20hazard%20assumption.html#Binvariableandstratifyonit
[C] https://lifelines.readthedocs.io/en/latest/jupyter_notebooks/Proportional%20hazard%20assumption.html#Introducetimevaryingcovariates
[D] https://lifelines.readthedocs.io/en/latest/jupyter_notebooks/Proportional%20hazard%20assumption.html#Modifythefunctionalform
[E] https://lifelines.readthedocs.io/en/latest/jupyter_notebooks/Proportional%20hazard%20assumption.html#Stratification
Since age is still violating the proportional hazard assumption, we need to model it better. From the residual plots above, we can see the effect of age starts to become negative over time. This will be relevant later. Below, we present three options to handle age.
Modify the functional form¶
The proportional hazard test is very sensitive (i.e. produces lots of false positives) when the functional form of a variable is incorrect. For example, if the association between a covariate and the log-hazard is non-linear, but the model has only a linear term included, then the proportional hazard test can raise a false positive.
The modeller can choose to add quadratic or cubic terms, i.e.:
rossi['age**2'] = (rossi['age'] - rossi['age'].mean())**2
rossi['age**3'] = (rossi['age'] - rossi['age'].mean())**3
but I think a more correct way to include nonlinear terms is to use splines. Both Patsy and zEpid provide functionality for splines (tutorial incoming), but let’s stick with the form above.
[8]:
rossi_higher_order_age = rossi.copy()
rossi_higher_order_age['age'] = rossi_higher_order_age['age'] - rossi_higher_order_age['age'].mean()
rossi_higher_order_age['age**2'] = rossi_higher_order_age['age'] ** 2  # age is already centered
rossi_higher_order_age['age**3'] = rossi_higher_order_age['age'] ** 3
cph.fit(rossi_higher_order_age, 'week', 'arrest', strata=['wexp'])
cph.print_summary(model="quad and cubic age terms"); print()
cph.check_assumptions(rossi_higher_order_age, show_plots=True, p_value_threshold=0.05)
<lifelines.CoxPHFitter: fitted with 432 observations, 318 censored>
duration col = 'week'
event col = 'arrest'
strata = ['wexp']
number of subjects = 432
number of events = 114
log-likelihood = -579.37
time fit was run = 2019-04-03 02:39:36 UTC
model = quad and cubic age terms

        coef  exp(coef)  se(coef)     z      p  -log2(p)  lower 0.95  upper 0.95
fin    -0.37       0.69      0.19 -1.93   0.05      4.24       -0.75        0.00
age    -0.06       0.94      0.03 -1.85   0.06      3.95       -0.13        0.00
race    0.35       1.42      0.31  1.13   0.26      1.95       -0.26        0.95
mar    -0.39       0.68      0.38 -1.02   0.31      1.70       -1.15        0.36
paro   -0.10       0.90      0.20 -0.52   0.60      0.74       -0.49        0.28
prio    0.09       1.10      0.03  3.22 <0.005      9.59        0.04        0.15
age**2  0.01       1.01      0.00  1.57   0.12      3.09       -0.00        0.02
age**3  0.00       1.00      0.00  0.89   0.37      1.42       -0.00        0.00

Concordance = 0.62
Log-likelihood ratio test = 175.75 on 8 df, -log2(p)=109.94
The ``p_value_threshold`` is set at 0.05. Even under the null hypothesis of no violations, some
covariates will be below the threshold by chance. This is compounded when there are many covariates.
Similarly, when there are lots of observations, even minor deviances from the proportional hazard
assumption will be flagged.
With that in mind, it's best to use a combination of statistical tests and visual tests to determine
the most serious violations. Produce visual plots using ``check_assumptions(..., show_plots=True)``
and looking for non-constant lines. See link [A] below for a full example.
<lifelines.StatisticalResult>
test_name = proportional_hazard_test
null_distribution = chi squared
degrees_of_freedom = 1

test_statistic p -log2(p)
age km 0.96 0.33 1.62
rank 4.09 0.04 4.54
age**2 km 1.81 0.18 2.48
rank 0.79 0.37 1.42
age**3 km 2.33 0.13 2.98
rank 0.03 0.87 0.19
fin km 0.03 0.87 0.20
rank 0.02 0.90 0.15
mar km 0.53 0.47 1.10
rank 0.94 0.33 1.59
paro km 0.20 0.66 0.60
rank 0.01 0.93 0.10
prio km 0.02 0.88 0.19
rank 0.01 0.90 0.15
race km 1.28 0.26 1.96
rank 0.47 0.49 1.02
1. Variable 'age' failed the non-proportional test: p-value is 0.0431.
Advice 1: the functional form of the variable 'age' might be incorrect. That is, there may be
non-linear terms missing. The proportional hazard test used is very sensitive to incorrect
functional forms. See documentation in link [D] below on how to specify a functional form.
Advice 2: try binning the variable 'age' using pd.cut, and then specify it in `strata=['age',
...]` in the call in `.fit`. See documentation in link [B] below.
Advice 3: try adding an interaction term with your time variable. See documentation in link [C]
below.

[A] https://lifelines.readthedocs.io/en/latest/jupyter_notebooks/Proportional%20hazard%20assumption.html
[B] https://lifelines.readthedocs.io/en/latest/jupyter_notebooks/Proportional%20hazard%20assumption.html#Binvariableandstratifyonit
[C] https://lifelines.readthedocs.io/en/latest/jupyter_notebooks/Proportional%20hazard%20assumption.html#Introducetimevaryingcovariates
[D] https://lifelines.readthedocs.io/en/latest/jupyter_notebooks/Proportional%20hazard%20assumption.html#Modifythefunctionalform
[E] https://lifelines.readthedocs.io/en/latest/jupyter_notebooks/Proportional%20hazard%20assumption.html#Stratification
We see we still have potentially some violation, but it’s a heck of a lot less. Also, interestingly, when we include these non-linear terms for age, the wexp proportionality violation disappears. It is not uncommon to see that changing the functional form of one variable affects the proportionality tests of others, usually positively. So, we could remove strata=['wexp'] if we wished.
Bin variable and stratify on it¶
The second option proposed is to bin the variable into equal-sized bins, and stratify on it like we did with wexp. There is a trade-off here between estimation and information loss. If we have large bins, we will lose information (since different values are now binned together), but we need to estimate fewer new baseline hazards. On the other hand, with tiny bins, we allow the age data the most “wiggle room”, but must compute many baseline hazards, each of which has a smaller sample size. Like most things, the optimal value is somewhere in between.
[9]:
rossi_strata_age = rossi.copy()
rossi_strata_age['age_strata'] = pd.cut(rossi_strata_age['age'], np.arange(0, 80, 3))
rossi_strata_age[['age', 'age_strata']].head()
[9]:
age  age_strata  

0  27  (24, 27] 
1  18  (15, 18] 
2  19  (18, 21] 
3  23  (21, 24] 
4  19  (18, 21] 
[10]:
# drop the original, redundant, age column
rossi_strata_age = rossi_strata_age.drop('age', axis=1)
cph.fit(rossi_strata_age, 'week', 'arrest', strata=['age_strata', 'wexp'])
[10]:
<lifelines.CoxPHFitter: fitted with 432 observations, 318 censored>
[11]:
cph.print_summary(3, model="stratified age and wexp")
cph.plot()
<lifelines.CoxPHFitter: fitted with 432 observations, 318 censored>
duration col = 'week'
event col = 'arrest'
strata = ['age_strata', 'wexp']
number of subjects = 432
number of events = 114
log-likelihood = -392.443
time fit was run = 2019-04-03 02:39:37 UTC
model = stratified age and wexp

       coef  exp(coef)  se(coef)      z      p  -log2(p)  lower 0.95  upper 0.95
fin  -0.395      0.674     0.197 -2.004  0.045     4.472      -0.781      -0.009
race  0.280      1.324     0.313  0.895  0.371     1.431      -0.334       0.895
mar  -0.194      0.824     0.392 -0.494  0.621     0.687      -0.961       0.574
paro -0.163      0.849     0.200 -0.818  0.413     1.275      -0.555       0.228
prio  0.080      1.084     0.028  2.854  0.004     7.857       0.025       0.135

Concordance = 0.582
Log-likelihood ratio test = 532.244 on 5 df, -log2(p)=372.252
[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x120e05828>
[12]:
cph.check_assumptions(rossi_strata_age)
Proportional hazard assumption looks okay.
Introduce time-varying covariates¶
Our second option to correct variables that violate the proportional hazard assumption is to model the time-varying component directly. This is done in two steps. The first is to transform your dataset into episodic format. This means that we split a subject from a single row into \(n\) new rows, each of which represents some time period for the subject. It’s okay that the variables are static over these new time periods; we’ll introduce some time-varying covariates later.
See below for how to do this in lifelines:
[13]:
from lifelines.utils import to_episodic_format
# the time_gaps parameter specifies how large or small you want the periods to be.
rossi_long = to_episodic_format(rossi, duration_col='week', event_col='arrest', time_gaps=1.)
rossi_long.head(25)
[13]:
stop  start  arrest  age  fin  id  mar  paro  prio  race  wexp  

0  1.0  0.0  0  27  0  0  0  1  3  1  0 
1  2.0  1.0  0  27  0  0  0  1  3  1  0 
2  3.0  2.0  0  27  0  0  0  1  3  1  0 
3  4.0  3.0  0  27  0  0  0  1  3  1  0 
4  5.0  4.0  0  27  0  0  0  1  3  1  0 
5  6.0  5.0  0  27  0  0  0  1  3  1  0 
6  7.0  6.0  0  27  0  0  0  1  3  1  0 
7  8.0  7.0  0  27  0  0  0  1  3  1  0 
8  9.0  8.0  0  27  0  0  0  1  3  1  0 
9  10.0  9.0  0  27  0  0  0  1  3  1  0 
10  11.0  10.0  0  27  0  0  0  1  3  1  0 
11  12.0  11.0  0  27  0  0  0  1  3  1  0 
12  13.0  12.0  0  27  0  0  0  1  3  1  0 
13  14.0  13.0  0  27  0  0  0  1  3  1  0 
14  15.0  14.0  0  27  0  0  0  1  3  1  0 
15  16.0  15.0  0  27  0  0  0  1  3  1  0 
16  17.0  16.0  0  27  0  0  0  1  3  1  0 
17  18.0  17.0  0  27  0  0  0  1  3  1  0 
18  19.0  18.0  0  27  0  0  0  1  3  1  0 
19  20.0  19.0  1  27  0  0  0  1  3  1  0 
20  1.0  0.0  0  18  0  1  0  1  8  1  0 
21  2.0  1.0  0  18  0  1  0  1  8  1  0 
22  3.0  2.0  0  18  0  1  0  1  8  1  0 
23  4.0  3.0  0  18  0  1  0  1  8  1  0 
24  5.0  4.0  0  18  0  1  0  1  8  1  0 
Each subject is given a new id (an existing id column can also be used if already present in the DataFrame). This id is used to track subjects over time. Notice the arrest col is 0 for all periods prior to their (possible) event as well.
Above I mentioned there were two steps to correct age. The first was to convert to an episodic format. The second is to create an interaction term between age and stop. This is a time-varying variable.
Instead of CoxPHFitter, we must use CoxTimeVaryingFitter, since we are working with an episodic dataset.
[14]:
rossi_long['time*age'] = rossi_long['age'] * rossi_long['stop']
[15]:
from lifelines import CoxTimeVaryingFitter
ctv = CoxTimeVaryingFitter()
ctv.fit(rossi_long,
        id_col='id',
        event_col='arrest',
        start_col='start',
        stop_col='stop',
        strata=['wexp'])
[15]:
<lifelines.CoxTimeVaryingFitter: fitted with 19809 periods, 432 subjects, 114 events>
[16]:
ctv.print_summary(3, model="age * time interaction")
<lifelines.CoxTimeVaryingFitter: fitted with 19809 periods, 432 subjects, 114 events>
event col = 'arrest'
strata = ['wexp']
number of subjects = 432
number of periods = 19809
number of events = 114
log-likelihood = -575.080
time fit was run = 2019-04-03 02:39:41 UTC
model = age * time interaction

           coef  exp(coef)  se(coef)      z      p  -log2(p)  lower 0.95  upper 0.95
age       0.073      1.075     0.040  1.830  0.067     3.893      -0.005       0.151
fin      -0.386      0.680     0.191 -2.018  0.044     4.520      -0.760      -0.011
mar      -0.397      0.672     0.382 -1.039  0.299     1.743      -1.147       0.352
paro     -0.098      0.907     0.196 -0.501  0.616     0.698      -0.481       0.285
prio      0.090      1.094     0.029  3.152  0.002     9.267       0.034       0.146
race      0.295      1.343     0.308  0.955  0.340     1.558      -0.310       0.899
time*age -0.005      0.995     0.002 -3.337  0.001    10.203      -0.008      -0.002

Log-likelihood ratio test = 1150.160 on 7 df, -log2(p)=0.000
[17]:
ctv.plot()
[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x120d96c88>
In the above scaled Schoenfeld residual plots for age, we can see there is a slight negative effect for higher time values. This is confirmed in the output of the CoxTimeVaryingFitter: we see that the coefficient for time*age is -0.005.
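As a back-of-envelope check using the point estimates above, the log hazard ratio per year of age at week \(t\) is approximately \(0.073 - 0.005 t\), which turns negative around week 15:

```python
import numpy as np

beta_age, beta_interaction = 0.073, -0.005  # point estimates from the fit above
weeks = np.array([0, 15, 52])
hazard_ratio_per_year_of_age = np.exp(beta_age + beta_interaction * weeks)
print(hazard_ratio_per_year_of_age)  # above 1 early on, below 1 later
```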
Conclusion¶
The point estimates and the standard errors are very close to each other using either option, so we can feel confident that either approach is okay to proceed with.
More examples and recipes¶
This section goes through some examples and recipes to help you use lifelines.
Worked Examples¶
If you are looking for some full examples of lifelines, there are full Jupyter notebooks and scripts here and examples and ideas on the development blog.
Statistically compare two populations¶
Often researchers want to compare survivalness between different populations. Here are some techniques to do that:
Logrank test¶
Note
The logrank test has maximum power when the assumption of proportional hazards is true. As a consequence, if the survival functions cross, the logrank test will give an inaccurate assessment of differences.
The lifelines.statistics.logrank_test() function compares whether the “death” generation processes of the two populations are equal:
from lifelines.statistics import logrank_test
results = logrank_test(T1, T2, event_observed_A=E1, event_observed_B=E2)
results.print_summary()
"""
t_0 = 1
alpha = 0.95
null_distribution = chi squared
df = 1
use_bonferroni = True

test_statistic p
3.528 0.00034 **
"""
print(results.p_value) # 0.46759
print(results.test_statistic) # 0.528
If you have more than two populations, you can use pairwise_logrank_test() (which compares each pair in the same manner as above), or multivariate_logrank_test() (which tests the hypothesis that all the populations have the same “death” generation process).
from lifelines.statistics import multivariate_logrank_test
df = pd.DataFrame({
'durations': [5, 3, 9, 8, 7, 4, 4, 3, 2, 5, 6, 7],
'groups': [0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2], # could be strings too
'events': [1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0],
})
results = multivariate_logrank_test(df['durations'], df['groups'], df['events'])
results.print_summary()
"""
t_0 = -1
alpha = 0.95
null_distribution = chi squared
df = 2

test_statistic p
1.0800 0.5827

"""
Survival differences at a point in time¶
Often analysts want to compare the survivalness of groups at specific times, rather than comparing the entire survival curves against each other. For example, analysts may be interested in 5-year survival. Statistically comparing the naive Kaplan-Meier points at a specific time actually has reduced power. By transforming the Kaplan-Meier curve, we can recover more power. The function lifelines.statistics.survival_difference_at_fixed_point_in_time_test() uses the log(-log) transformation implicitly and compares the survivalness of populations at a specific point in time.
from lifelines.statistics import survival_difference_at_fixed_point_in_time_test
results = survival_difference_at_fixed_point_in_time_test(point_in_time, T1, T2, event_observed_A=E1, event_observed_B=E2)
results.print_summary()
Subtraction and division between survival functions¶
If you are interested in taking the difference between two survival functions, simply trying to subtract the survival_function_ will likely fail if the DataFrames’ indexes are not equal. Fortunately, the lifelines.fitters.kaplan_meier_fitter.KaplanMeierFitter and lifelines.fitters.nelson_aalen_fitter.NelsonAalenFitter have a built-in subtract method:
kmf1.subtract(kmf2)
will produce the difference at every relevant time point. A similar function exists for division: divide
. However, for rigorous testing of differences, lifelines comes with a statistics library. See below.
Restricted mean survival times (RMST)¶
lifelines has a function to accurately compute the restricted mean survival time, defined as
\[\text{RMST}(t) = \int_0^t S(\tau) d\tau\]
This is a good metric for comparing two survival curves, as their difference represents the area between the curves (see figure below). The upper limit is often finite because the tail of the estimated survival curve has high variance and can strongly influence the integral.
import numpy as np
import matplotlib.pyplot as plt

from lifelines import KaplanMeierFitter
from lifelines.utils import restricted_mean_survival_time
from lifelines.datasets import load_waltons
df = load_waltons()
ix = df['group'] == 'miR137'
T, E = df['T'], df['E']
kmf_exp = KaplanMeierFitter().fit(T[ix], E[ix], label='exp')
kmf_con = KaplanMeierFitter().fit(T[~ix], E[~ix], label='control')
limit = 50
rmst_exp = restricted_mean_survival_time(kmf_exp, t=limit)
rmst_con = restricted_mean_survival_time(kmf_con, t=limit)
ax = plt.subplot(311)
kmf_exp.plot(ax=ax, c="#0C7BDC", ci_show=False)
sf_exp_at_limit = kmf_exp.predict(np.append(kmf_exp.timeline, limit)).sort_index().loc[:limit]
ax.fill_between(sf_exp_at_limit.index, sf_exp_at_limit.values, step='post', color="#0C7BDC", alpha=0.20)
ax.axvline(limit, ls='--', c='k')
ax.text(10, 0.3, "%.3f" % rmst_exp)
ax.set_xlim(0, 65)
ax = plt.subplot(312)
kmf_con.plot(ax=ax, c="#FFC20A", ci_show=False)
sf_con_at_limit = kmf_con.predict(np.append(kmf_con.timeline, limit)).sort_index().loc[:limit]
ax.fill_between(sf_con_at_limit.index, sf_con_at_limit.values, step='post', color="#FFC20A", alpha=0.20)
ax.axvline(limit, ls='--', c='k')
ax.text(10, 0.4, "%.3f" % rmst_con)
ax.set_xlim(0, 65)
ax = plt.subplot(313)
kmf_con.plot(ax=ax, c="#FFC20A", ci_show=False)
kmf_exp.plot(ax=ax, c="#0C7BDC", ci_show=False)
timeline = np.unique(T.tolist() + [limit])
ax.axvline(limit, ls='--', c='k')
ax.fill_between(timeline[timeline<=limit], kmf_con.predict(timeline).loc[:limit], kmf_exp.predict(timeline).loc[:limit], step="post", color='k', alpha=0.10)
ax.text(34, 0.4, "%.3f" % (rmst_con - rmst_exp))
ax.set_xlim(0, 65)
Model selection using lifelines¶
If using lifelines for prediction work, it’s ideal that you perform some type of cross-validation scheme. Cross-validation allows you to be confident that your out-of-sample predictions will work well in practice. It also allows you to choose between multiple models.
lifelines has a built-in k-fold cross-validation function. For example:
import numpy as np

from lifelines import AalenAdditiveFitter, CoxPHFitter
from lifelines.datasets import load_regression_dataset
from lifelines.utils import k_fold_cross_validation
df = load_regression_dataset()
# create the three models we'd like to compare.
aaf_1 = AalenAdditiveFitter(coef_penalizer=0.5)
aaf_2 = AalenAdditiveFitter(coef_penalizer=10)
cph = CoxPHFitter()
print(np.mean(k_fold_cross_validation(cph, df, duration_col='T', event_col='E')))
print(np.mean(k_fold_cross_validation(aaf_1, df, duration_col='T', event_col='E')))
print(np.mean(k_fold_cross_validation(aaf_2, df, duration_col='T', event_col='E')))
From these results, Aalen’s Additive model with a penalizer of 10 is the best model for predicting future survival times.
lifelines also has wrappers to use scikitlearn’s cross validation and grid search tools. See how to use lifelines with scikit learn.
Selecting a parametric model using QQ plots¶
QQ plots are normally constructed by sorting the values. However, this isn’t appropriate when there is censored data. In lifelines, there are routines to still create QQ plots with censored data. These are available under lifelines.plotting.qq_plot(), which accepts a fitted parametric lifelines model.
import numpy as np
import matplotlib.pyplot as plt

from lifelines import *
from lifelines.plotting import qq_plot
# generate some fake lognormal data
N = 1000
T_actual = np.exp(np.random.randn(N))
C = np.exp(np.random.randn(N))
E = T_actual < C
T = np.minimum(T_actual, C)
fig, axes = plt.subplots(2, 2, figsize=(8, 6))
axes = axes.reshape(4,)
for i, model in enumerate([WeibullFitter(), LogNormalFitter(), LogLogisticFitter(), ExponentialFitter()]):
    model.fit(T, E)
    qq_plot(model, ax=axes[i])
This graphical test can be used to invalidate models. For example, in the above figure, we can see that only the lognormal parametric model is appropriate (we expect deviance in the tails, but not too much). Another use case is choosing the correct parametric AFT model.
qq_plot() also works with left-censored data.
Plotting multiple figures on a plot¶
When .plot is called, an axis object is returned which can be passed into future calls of .plot:
kmf.fit(data1)
ax = kmf.plot()
kmf.fit(data2)
ax = kmf.plot(ax=ax)
If you have a pandas DataFrame with columns “group”, “T”, and “E”, then something like the following would work:
from lifelines import KaplanMeierFitter
from matplotlib import pyplot as plt
ax = plt.subplot(111)
kmf = KaplanMeierFitter()
for name, grouped_df in df.groupby('group'):
    kmf.fit(grouped_df["T"], grouped_df["E"], label=name)
    kmf.plot(ax=ax)
Plotting options and styles¶
Let’s load some data
from lifelines.datasets import load_waltons
waltons = load_waltons()
T = waltons['T']
E = waltons['E']
Show censors and edit markers¶
kmf.fit(T, E, label="kmf.plot(show_censors=True, \ncensor_styles={'ms': 6, 'marker': 's'})")
kmf.plot(show_censors=True, censor_styles={'ms': 6, 'marker': 's'})
Invert axis¶
kmf.fit(T, E, label="kmf.plot(invert_y_axis=True)")
kmf.plot(invert_y_axis=True)
Note
This is deprecated; we suggest using kmf.plot_cumulative_density() instead.
Displaying atrisk counts below plots¶
kmf.fit(T, E, label="label name")
kmf.plot(at_risk_counts=True)
Displaying multiple atrisk counts below plots¶
The function add_at_risk_counts in lifelines.plotting allows you to add at-risk counts at the bottom of your figures. For example:
from matplotlib import pyplot as plt

from lifelines import KaplanMeierFitter
from lifelines.datasets import load_waltons
waltons = load_waltons()
ix = waltons['group'] == 'control'
ax = plt.subplot(111)
kmf_control = KaplanMeierFitter()
ax = kmf_control.fit(waltons.loc[ix]['T'], waltons.loc[ix]['E'], label='control').plot(ax=ax)
kmf_exp = KaplanMeierFitter()
ax = kmf_exp.fit(waltons.loc[~ix]['T'], waltons.loc[~ix]['E'], label='exp').plot(ax=ax)
from lifelines.plotting import add_at_risk_counts
add_at_risk_counts(kmf_exp, kmf_control, ax=ax)
will display
Transforming survivaltable data into lifelines format¶
Some lifelines classes are designed for lists or arrays that represent one individual per row. If you instead have data in a survival table format, there exists a utility method to get it into lifelines format.
Example: Suppose you have a CSV file with data that looks like this:
time  observed deaths  censored 

0  7  0 
1  1  1 
2  2  0 
3  1  2 
4  5  2 
…  …  … 
import pandas as pd
from lifelines.utils import survival_events_from_table
df = pd.read_csv('file.csv', usecols=['time', 'observed deaths', 'censored'])
df = df.set_index('time')
T, E, W = survival_events_from_table(df, observed_deaths_col='observed deaths', censored_col='censored')
# weights, W, is the number of occurrences of each observation; this helps with data compression.
kmf = KaplanMeierFitter().fit(T, E, weights=W)
Transforming observational data into survivaltable format¶
Perhaps you are interested in viewing the survival table given some durations and censoring vectors.
from lifelines.utils import survival_table_from_events
table = survival_table_from_events(T, E)
print(table.head())
"""
removed observed censored entrance at_risk
event_at
0 0 0 0 60 60
2 2 1 1 0 60
3 3 1 2 0 58
4 5 3 2 0 55
5 12 6 6 0 50
"""
Set the index/timeline of an estimate¶
Suppose your dataset has lifetimes grouped near time 60, thus after fitting lifelines.fitters.kaplan_meier_fitter.KaplanMeierFitter, your survival function might look something like:
print(kmf.survival_function_)
KMestimate
0 1.00
47 0.99
49 0.97
50 0.96
51 0.95
52 0.91
53 0.86
54 0.84
55 0.79
56 0.74
57 0.71
58 0.67
59 0.58
60 0.49
61 0.41
62 0.31
63 0.24
64 0.19
65 0.14
66 0.10
68 0.07
69 0.04
70 0.02
71 0.01
74 0.00
What you would like is to have a predictable and full index from 40 to 75. (Notice that in the above index, the last two time points are not adjacent; the cause is that no lifetimes were observed at times 72 or 73.) This is especially useful for comparing multiple survival functions at specific time points. To do this, all fitter methods accept a timeline argument:
kmf.fit(T, timeline=range(40,75))
print(kmf.survival_function_)
KMestimate
40 1.00
41 1.00
42 1.00
43 1.00
44 1.00
45 1.00
46 1.00
47 0.99
48 0.99
49 0.97
50 0.96
51 0.95
52 0.91
53 0.86
54 0.84
55 0.79
56 0.74
57 0.71
58 0.67
59 0.58
60 0.49
61 0.41
62 0.31
63 0.24
64 0.19
65 0.14
66 0.10
67 0.10
68 0.07
69 0.04
70 0.02
71 0.01
72 0.01
73 0.01
74 0.00
lifelines will intelligently forwardfill the estimates to unseen time points.
Example SQL query to get survival data from a table¶
Below is a way to get an example dataset from a relational database (this may vary depending on your database):
SELECT
id,
DATEDIFF('dd', started_at, COALESCE(ended_at, CURRENT_DATE)) AS "T",
(ended_at IS NOT NULL) AS "E"
FROM table
Explanation¶
Each row is an id, a duration, and a boolean indicating whether the event occurred or not. Recall that we denote “True” if the event did occur, that is, ended_at is filled in (we observed the ended_at). Ex:
id   T   E
10   40  True
11   42  False
12   42  False
13   36  True
14   33  True
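The same T/E construction can be sketched in pandas, for readers pulling raw timestamps instead of using SQL. The column names follow the query above; the dates here are made up:

```python
import pandas as pd

# The same T/E construction, done in pandas instead of SQL.
today = pd.Timestamp("2020-01-01")
df = pd.DataFrame({
    "id": [10, 11, 12],
    "started_at": pd.to_datetime(["2019-11-22", "2019-11-20", "2019-11-20"]),
    "ended_at": pd.to_datetime(["2020-01-01", None, None]),
})

# Censored subjects (NULL ended_at) are observed up to `today`.
df["T"] = (df["ended_at"].fillna(today) - df["started_at"]).dt.days
df["E"] = df["ended_at"].notna()
print(df[["id", "T", "E"]])
```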
Example SQL queries and transformations to get time-varying data¶
For Cox time-varying models, we discussed what the dataset should look like in Dataset creation for time-varying regression. Typically we have a base dataset, and then we fold in the covariate datasets. Below are some SQL queries and Python transformations, from end to end.
Base dataset: base_df
¶
SELECT
id,
group,
DATEDIFF('dd', dt.started_at, COALESCE(dt.ended_at, CURRENT_DATE)) AS "T",
(ended_at IS NOT NULL) AS "E"
FROM dimension_table dt
Time-varying variables: cv
¶
-- this could produce more than 1 row per subject
SELECT
id,
DATEDIFF('dd', dt.started_at, ft.event_at) AS "time",
ft.var1
FROM fact_table ft
JOIN dimension_table dt
USING(id)
from lifelines.utils import to_long_format
from lifelines.utils import add_covariate_to_timeline
base_df = to_long_format(base_df, duration_col="T")
df = add_covariate_to_timeline(base_df, cv, duration_col="time", id_col="id", event_col="E")
Event variables: event_df
¶
Another very common operation is to add event data to our time-varying dataset. For example, suppose a dataset/SQL table contains information about the dates of an event (and NULLs if the event didn’t occur). An example SQL query may look like:
SELECT
id,
DATEDIFF('dd', dt.started_at, ft.event1_at) AS "E1",
DATEDIFF('dd', dt.started_at, ft.event2_at) AS "E2",
DATEDIFF('dd', dt.started_at, ft.event3_at) AS "E3"
...
FROM dimension_table dt
In Pandas, this may look like:
id E1 E2 E3
0 1 1.0 NaN 2.0
1 2 NaN 5.0 NaN
2 3 3.0 5.0 7.0
...
Initially, this can’t be added to our baseline time-varying dataset. Using lifelines.utils.covariates_from_event_matrix()
we can convert a DataFrame like this into one that can be easily added.
from lifelines.utils import covariates_from_event_matrix
cv = covariates_from_event_matrix(event_df, id_col='id')
print(cv)
id duration E1 E2 E3
0 1 1.0 1 0 0
1 1 2.0 0 1 0
2 2 5.0 0 1 0
3 3 3.0 1 0 0
4 3 5.0 0 1 0
5 3 7.0 0 0 1
base_df = add_covariate_to_timeline(base_df, cv, duration_col="time", id_col="id", event_col="E")
Example cumulative sums over time-varying covariates¶
Often we have either transactional covariate datasets or state covariate datasets. In a transactional dataset, it may make sense to sum up the covariates to represent the administration of a treatment over time. For example, in the risky world of startups, we may want to sum up the funding received up to a certain time, and we may also be interested in the amount of the most recent round of funding. Below is an example that does just that:
Suppose we have an initial DataFrame of startups like:
seed_df = pd.DataFrame([
{'id': 'FB', 'E': True, 'T': 12, 'funding': 0},
{'id': 'SU', 'E': True, 'T': 10, 'funding': 0},
])
And a covariate DataFrame representing funding rounds like:
cv = pd.DataFrame([
{'id': 'FB', 'funding': 30, 't': 5},
{'id': 'FB', 'funding': 15, 't': 10},
{'id': 'FB', 'funding': 50, 't': 15},
{'id': 'SU', 'funding': 10, 't': 6},
{'id': 'SU', 'funding': 9, 't': 10},
])
We can do the following to get both the cumulative funding received and the latest round of funding:
from lifelines.utils import to_long_format
from lifelines.utils import add_covariate_to_timeline
df = seed_df.pipe(to_long_format, 'T')\
.pipe(add_covariate_to_timeline, cv, 'id', 't', 'E', cumulative_sum=True)\
.pipe(add_covariate_to_timeline, cv, 'id', 't', 'E', cumulative_sum=False)
"""
start cumsum_funding funding stop id E
0 0 0.0 0.0 5.0 FB False
1 5 30.0 30.0 10.0 FB False
2 10 45.0 15.0 12.0 FB True
3 0 0.0 0.0 6.0 SU False
4 6 10.0 10.0 10.0 SU False
5 10 19.0 9.0 10.0 SU True
"""
Sample size determination under a CoxPH model¶
Suppose you wish to measure the hazard ratio between two populations under the CoxPH model. That is, we want to evaluate the hypothesis
H0: relative hazard ratio = 1 vs H1: relative hazard ratio != 1, where the relative hazard ratio is \(\exp{\left(\beta\right)}\) for the experiment group vs the control group. A priori, we are interested in the sample sizes of the two groups necessary to achieve a certain statistical power. To do this in lifelines, there is the lifelines.statistics.sample_size_necessary_under_cph()
function. For example:
from lifelines.statistics import sample_size_necessary_under_cph
desired_power = 0.8
ratio_of_participants = 1.
p_exp = 0.25
p_con = 0.35
postulated_hazard_ratio = 0.7
n_exp, n_con = sample_size_necessary_under_cph(desired_power, ratio_of_participants, p_exp, p_con, postulated_hazard_ratio)
# (421, 421)
This assumes you have estimates of the probability of event occurring for both the experiment and control group. This could be determined from previous experiments.
Power determination under a CoxPH model¶
Suppose you wish to measure the hazard ratio between two populations under the CoxPH model. To determine the statistical power of a hazard ratio hypothesis test, under the CoxPH model, we can use lifelines.statistics.power_under_cph()
. That is, suppose we want to know the probability that we reject the null hypothesis that the relative hazard ratio is 1, assuming the relative hazard ratio is truly different from 1. This function will give you that probability.
from lifelines.statistics import power_under_cph
n_exp = 50
n_con = 100
p_exp = 0.25
p_con = 0.35
postulated_hazard_ratio = 0.5
power = power_under_cph(n_exp, n_con, p_exp, p_con, postulated_hazard_ratio)
# 0.4957
Problems with convergence in the Cox proportional hazard model¶
Since the estimation of the coefficients in the Cox proportional hazard model is done using the Newton-Raphson algorithm, there are sometimes problems with convergence. Here are some common symptoms and resolutions:
- First check: look for ConvergenceWarning in the output. Most often, problems in convergence are the result of problems in the dataset. lifelines runs checks against the dataset before fitting, and warnings are outputted to the user.
- delta contains nan value(s): first try adding show_progress=True in the fit function. If the values in delta grow unbounded, it’s possible the step_size is too large. Try setting it to a small value (0.1-0.5).
- Convergence halted due to matrix inversion problems: this means that there is high collinearity in your dataset; that is, a column is equal to a linear combination of one or more other columns. A common cause of this error is dummying categorical variables but not dropping a column, or some hierarchical structure in your dataset. Try to find the relationship by:
  - adding a penalizer to the model, ex: CoxPHFitter(penalizer=0.1).fit(…), until the model converges. In print_summary(), the coefficients with high collinearity will have large (absolute) magnitudes in the coef column;
  - using the variance inflation factor (VIF) to find redundant variables;
  - looking at the correlation matrix of your dataset.
- Some coefficients are many orders of magnitude larger than others, the standard errors of the coefficients are also large, or there are nans in the results. This can be seen using the print_summary method on a fitted CoxPHFitter object. Possible causes:
  - The dataset may contain a constant column, which provides no information for the regression (the Cox model doesn’t have a traditional “intercept” term like other regression models). Look for a ConvergenceWarning about variances being too small.
  - The data is completely separable, which means that there exists a covariate that completely determines whether an event occurred or not. For example, for all “death” events in the dataset, there exists a covariate that is constant amongst all of them. Look for a ConvergenceWarning after the fit call. See https://stats.stackexchange.com/questions/11109/how-to-deal-with-perfect-separation-in-logistic-regression
  - Related to the above, the relationship between a covariate and the duration may be completely determined. For example, if the rank correlation between a covariate and the duration is very close to 1 or -1, then the log-likelihood can be increased arbitrarily using just that covariate. Look for a ConvergenceWarning after the fit call.
  - Another problem may be a collinear relationship in your dataset; see the matrix inversion point above.
- If adding a very small penalizer significantly changes the results (CoxPHFitter(penalizer=0.0001)), then this probably means that the step size in the iterative algorithm is too large. Try decreasing it (.fit(..., step_size=0.50) or smaller), and returning the penalizer term to 0.
- If using the strata argument, make sure your stratification group sizes are not too small. Try df.groupby(strata).size().
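As a sketch of the collinearity checks above (synthetic data; the check itself needs only numpy and pandas, not lifelines):

```python
import numpy as np
import pandas as pd

# Synthetic covariates: `x3` is an exact linear combination of `x1` and `x2`,
# the kind of column that triggers "matrix inversion problems".
rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=100), "x2": rng.normal(size=100)})
df["x3"] = 2 * df["x1"] - df["x2"]

# Pairwise correlations can miss multi-column collinearity, so also check
# the rank of the covariate matrix: rank < number of columns means some
# column is a linear combination of the others.
rank = np.linalg.matrix_rank(df.values)
print(rank, df.shape[1])  # rank 2 < 3 columns: collinear
```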
Adding weights to observations in a Cox model¶
There are two common uses for weights in a model. The first is as a data size reduction technique (known as case weights). If the dataset has more than one subject with identical attributes, including duration and event, then their likelihood contributions are identical as well. Thus, instead of computing the log-likelihood for each individual, we can compute it once and multiply it by the count of subjects with identical attributes. In practice, this involves first grouping subjects by their covariates and counting. For example, using the Rossi dataset, we will use Pandas to group by the attributes (but other data processing tools, like Spark, could do this as well):
from lifelines.datasets import load_rossi
rossi = load_rossi()
rossi_weights = rossi.copy()
rossi_weights['weights'] = 1.
rossi_weights = rossi_weights.groupby(rossi.columns.tolist())['weights'].sum()\
.reset_index()
The original dataset has 432 rows, while the grouped dataset has 387 rows plus an additional weights
column. CoxPHFitter
has an additional parameter to specify which column is the weight column.
from lifelines import CoxPHFitter
cph = CoxPHFitter()
cph.fit(rossi_weights, 'week', 'arrest', weights_col='weights')
The fitting should be faster, and the results identical to the unweighted dataset. This option is also available in the CoxTimeVaryingFitter
.
The second use of weights is sampling weights. These are typically positive, non-integer weights that represent some artificial under/over-sampling of observations (ex: inverse probability of treatment weights). It is recommended to set robust=True
in the call to fit
, as the usual standard errors are incorrect for sampling weights. The robust
flag will use the sandwich estimator for the standard errors.
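As a sketch of where such sampling weights might come from, here is a toy inverse-probability-of-treatment weighting (IPTW) computation in pandas. The column names and the crude within-stratum propensity estimate are hypothetical; in practice the propensity score usually comes from a fitted classifier:

```python
import pandas as pd

# Hypothetical data: a treatment flag and a confounding stratum.
df = pd.DataFrame({
    "treated": [1, 1, 0, 0, 0, 1],
    "stratum": ["a", "a", "a", "b", "b", "b"],
})

# Crude propensity estimate: P(treated) within each stratum.
propensity = df.groupby("stratum")["treated"].transform("mean")

# Weight treated subjects by 1/p and controls by 1/(1-p).
df["weights"] = df["treated"] / propensity + (1 - df["treated"]) / (1 - propensity)
print(df)
```

The resulting non-integer weights column would then be passed via weights_col, with robust=True, as described above.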
Warning
The implementation of the sandwich estimator does not handle ties correctly (under the Efron handling of ties), and will give slightly or significantly different results from other software depending on the frequency of ties.
Correlations between subjects in a Cox model¶
There are cases when your dataset contains correlated subjects, which breaks the independent-and-identically-distributed assumption. What are some cases when this may happen?
- If a subject appears more than once in the dataset (common when subjects can have the event more than once)
- If using a matching technique, like propensity-score matching, there is a correlation between pairs.
In both cases, the reported standard errors from an unadjusted Cox model will be wrong. In order to adjust for these correlations, there is a cluster_col
keyword in fit()
that allows you to specify the column in the DataFrame that contains designations for correlated subjects. For example, if subjects in rows 1 & 2 are correlated, but no other subjects are, then the cluster_col
column should have the same value for rows 1 & 2, and unique values everywhere else. Another example: for matched pairs, both subjects in a pair should share the same value.
from lifelines.datasets import load_rossi
from lifelines import CoxPHFitter
rossi = load_rossi()
# this may come from a database, or other libraries that specialize in matching
matched_pairs = [
(156, 230),
(275, 228),
(61, 252),
(364, 201),
(54, 340),
(130, 33),
(183, 145),
(268, 140),
(332, 259),
(314, 413),
(330, 211),
(372, 255),
# ...
]
rossi['id'] = None # we will populate this column
for i, pair in enumerate(matched_pairs):
subjectA, subjectB = pair
rossi.loc[subjectA, 'id'] = i
rossi.loc[subjectB, 'id'] = i
rossi = rossi.dropna(subset=['id'])
cph = CoxPHFitter()
cph.fit(rossi, 'week', 'arrest', cluster_col='id')
Specifying cluster_col
will handle correlations, and invoke the robust sandwich estimator for standard errors (the same as setting robust=True
).
Serialize a lifelines model to disk¶
When you want to save (and later load) a lifelines model to disk, the suggested tool is dill
(available by pip install dill
). dill
works a lot like pickle
and joblib
:
from dill import loads, dumps
s_cph = dumps(cph)
cph_new = loads(s_cph)
cph_new.summary
s_kmf = dumps(kmf)
kmf_new = loads(s_kmf)
kmf_new.summary
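The snippet above round-trips a model in memory; to actually write it to disk, use the file-based dump/load interface (dill mirrors pickle here). A sketch with a stand-in object, since no fitted model is constructed in this snippet:

```python
import os
import tempfile

try:
    import dill as serializer  # preferred for lifelines models
except ImportError:
    import pickle as serializer  # dill mirrors pickle's dump/load interface

# Stand-in for a fitted model (e.g. your `cph` object above).
model = {"name": "cph", "coef": [0.5, -1.2]}

path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(path, "wb") as f:
    serializer.dump(model, f)       # write to disk

with open(path, "rb") as f:
    model_new = serializer.load(f)  # read it back

print(model_new == model)  # True
```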
API for lifelines¶
Univariate Models¶
AalenJohansenFitter¶

class
lifelines.fitters.aalen_johansen_fitter.
AalenJohansenFitter
(jitter_level=0.0001, seed=None, alpha=0.05, calculate_variance=True)¶ Bases:
lifelines.fitters.UnivariateFitter
Class for fitting the Aalen-Johansen estimate for the cumulative incidence function in a competing risks framework. Treating competing risks as censoring can result in over-estimated cumulative density functions. Using the Kaplan-Meier estimator with competing risks treated as censored is akin to estimating the cumulative density if all competing risks had been prevented.
Aalen-Johansen cannot deal with tied times. We can get around this by randomly jittering the event times slightly. This will be done automatically and generates a warning.
Parameters:  alpha (float, option (default=0.05)) – The alpha value associated with the confidence intervals.
 jitter_level (float, option (default=0.00001)) – If tied event times are detected, event times are randomly changed by this factor.
 seed (int, option (default=None)) – To produce replicate results with tied event times, the numpy.random.seed can be specified in the function.
 calculate_variance (bool, option (default=True)) – By default, AalenJohansenFitter calculates the variance and corresponding confidence intervals. Due to how the variance is calculated, the variance must be calculated for each event time individually. This is computationally intensive. For some procedures, like bootstrapping, the variance is not necessary. To reduce computation time during these procedures, calculate_variance can be set to False to skip the variance calculation.
Example
>>> from lifelines import AalenJohansenFitter
>>> from lifelines.datasets import load_waltons
>>> T, E = load_waltons()['T'], load_waltons()['E']
>>> ajf = AalenJohansenFitter(calculate_variance=True)
>>> ajf.fit(T, E, event_of_interest=1)
>>> ajf.cumulative_density_
>>> ajf.plot()
References
If you are interested in learning more, we recommend the following open-access paper: Edwards JK, Hester LL, Gokhale M, Lesko CR. Methodologic Issues When Estimating Risks in Pharmacoepidemiology. Curr Epidemiol Rep. 2016;3(4):285-296.

conditional_time_to_event_
¶ Return a DataFrame, with index equal to
survival_function_
’s index, that estimates the median duration remaining until the death event, given survival up until time t. For example, if an individual exists until age 1, their expected life remaining given they lived to time 1 might be 9 years.Returns: conditional_time_to_ Return type: DataFrame

cumulative_density_at_times
(times, label=None)¶

cumulative_hazard_at_times
(times, label=None)¶

divide
(other)¶ Divide the {0} of two {1} objects.
Parameters: other (same object as self) Returns: Return type: DataFrame

fit
(durations, event_observed, event_of_interest, timeline=None, entry=None, label='AJ_estimate', alpha=None, ci_labels=None, weights=None)¶ Parameters:  durations (an array or pd.Series of length n – duration of subject was observed for)
 event_observed (an array, or pd.Series, of length n. Integer indicator of distinct events. Must be) – only positive integers, where 0 indicates censoring.
 event_of_interest (integer – indicator for event of interest. All other integers are considered competing events) – Ex) event_observed contains 0, 1, 2 where 0:censored, 1:lung cancer, and 2:death. If event_of_interest=1, then death (2) is considered a competing event. The returned cumulative incidence function corresponds to risk of lung cancer
 timeline (return the best estimate at the values in timelines (positively increasing))
 entry (an array, or pd.Series, of length n – relative time when a subject entered the study. This is) – useful for left-truncated (not left-censored) observations. If None, all members of the population were born at time 0.
 label (a string to name the column of the estimate.)
 alpha (the alpha value in the confidence intervals. Overrides the initializing) – alpha for this call to fit only.
 ci_labels (add custom column names to the generated confidence intervals) – as a length2 list: [<lowerbound name>, <upperbound name>]. Default: <label>_lower_<1alpha/2>
 weights (an array, or pd.Series, of length n, if providing a weighted dataset. For example, instead) – of providing every subject as a single element of durations and event_observed, one could weigh subjects differently.
Returns: self – self, with new properties like
cumulative_incidence_
.Return type:

fit_right_censoring
(*args, **kwargs)¶ Alias for
fit
See also
fit

hazard_at_times
(times, label=None)¶

median_
¶ Deprecated, use .median_survival_time_
Return the unique time point, t, such that S(t) = 0.5. This is the “halflife” of the population, and a robust summary statistic for the population, if it exists.

median_survival_time_
¶ Return the unique time point, t, such that S(t) = 0.5. This is the “halflife” of the population, and a robust summary statistic for the population, if it exists.

percentile
(p)¶ Return the unique time point, t, such that S(t) = p.
For known parametric models, this should be overwritten by something more accurate.

plot
(**kwargs)¶ Plots a pretty figure of the model
Matplotlib plot arguments can be passed in inside the kwargs, plus
Parameters: show_censors (bool) – place markers at censorship events. Default: False
censor_styles (dict) – If show_censors, this dictionary will be passed into the plot call.
ci_alpha (float) – the transparency level of the confidence interval. Default: 0.3
ci_force_lines (bool) – force the confidence intervals to be line plots (versus default shaded areas). Default: False
ci_show (bool) – show confidence intervals. Default: True
ci_legend (bool) – if ci_force_lines is True, this is a boolean flag to add the lines’ labels to the legend. Default: False
at_risk_counts (bool) – show group sizes at time points. See function
add_at_risk_counts
for details. Default: Falseloc (slice) – specify a timebased subsection of the curves to plot, ex:
>>> model.plot(loc=slice(0.,10.))
will plot the time values between t=0. and t=10.
iloc (slice) – specify a locationbased subsection of the curves to plot, ex:
>>> model.plot(iloc=slice(0,10))
will plot the first 10 time points.
invert_y_axis (bool) – boolean to invert the y-axis, useful to show cumulative graphs instead of survival graphs. (Deprecated, use
plot_cumulative_density()
)
Returns: a pyplot axis object
Return type: ax

plot_cumulative_density
(**kwargs)¶

plot_cumulative_hazard
(**kwargs)¶

plot_hazard
(**kwargs)¶

plot_survival_function
(**kwargs)¶

predict
(times, interpolate=False)¶ Predict the {0} at certain point in time. Uses a linear interpolation if points in time are not in the index.
Parameters:  times (scalar, or array) – a scalar or an array of times to predict the value of {0} at.
 interpolate (boolean, optional (default=False)) – for methods that produce a stepwise solution (Kaplan-Meier, Nelson-Aalen, etc.), setting this to True will use a linear interpolation method to provide a smoother answer.
Returns: predictions
Return type: a scalar if times is a scalar, a numpy array if times is an array.

subtract
(other)¶ Subtract the {0} of two {1} objects.
Parameters: other (same object as self) Returns: Return type: DataFrame

survival_function_at_times
(times, label=None)¶
BreslowFlemingHarringtonFitter¶

class
lifelines.fitters.breslow_fleming_harrington_fitter.
BreslowFlemingHarringtonFitter
(alpha=0.05)¶ Bases:
lifelines.fitters.UnivariateFitter
Class for fitting the Breslow-Fleming-Harrington estimate for the survival function. This estimator is a biased estimator of the survival function, but is more stable when the population is small and there are too few early truncation times: it may happen that the number of patients at risk and the number of deaths are the same.
Mathematically, the NAF estimator is the negative logarithm of the BFH estimator.
BreslowFlemingHarringtonFitter(alpha=0.05)
Parameters: alpha (float, optional (default=0.05)) – The alpha value associated with the confidence intervals. 
conditional_time_to_event_
¶ Return a DataFrame, with index equal to
survival_function_
’s index, that estimates the median duration remaining until the death event, given survival up until time t. For example, if an individual exists until age 1, their expected life remaining given they lived to time 1 might be 9 years.Returns: conditional_time_to_ Return type: DataFrame

cumulative_density_at_times
(times, label=None)¶

cumulative_hazard_at_times
(times, label=None)¶

divide
(other)¶ Divide the {0} of two {1} objects.
Parameters: other (same object as self) Returns: Return type: DataFrame

fit
(durations, event_observed=None, timeline=None, entry=None, label='BFH_estimate', alpha=None, ci_labels=None, weights=None)¶ Parameters:  durations (an array, or pd.Series, of length n) – duration subject was observed for
 timeline – return the best estimate at the values in timelines (positively increasing)
 event_observed (an array, or pd.Series, of length n) – True if the death was observed, False if the event was lost (right-censored). Defaults to all True if event_observed==None
 entry (an array, or pd.Series, of length n) – relative time when a subject entered the study. This is useful for left-truncated observations, i.e. the birth event was not observed. If None, defaults to all 0 (all birth events observed.)
 label (string) – a string to name the column of the estimate.
 alpha (float, optional (default=0.05)) – the alpha value in the confidence intervals. Overrides the initializing alpha for this call to fit only.
 ci_labels (iterable) – add custom column names to the generated confidence intervals as a length2 list: [<lowerbound name>, <upperbound name>]. Default: <label>_lower_<alpha>
Returns: Return type: self, with new properties like
survival_function_
.

fit_right_censoring
(*args, **kwargs)¶ Alias for
fit
See also
fit

hazard_at_times
(times, label=None)¶

median_
¶ Deprecated, use .median_survival_time_
Return the unique time point, t, such that S(t) = 0.5. This is the “halflife” of the population, and a robust summary statistic for the population, if it exists.

median_survival_time_
¶ Return the unique time point, t, such that S(t) = 0.5. This is the “halflife” of the population, and a robust summary statistic for the population, if it exists.

percentile
(p)¶ Return the unique time point, t, such that S(t) = p.
For known parametric models, this should be overwritten by something more accurate.

plot
(**kwargs)¶ Plots a pretty figure of the model
Matplotlib plot arguments can be passed in inside the kwargs, plus
Parameters: show_censors (bool) – place markers at censorship events. Default: False
censor_styles (dict) – If show_censors, this dictionary will be passed into the plot call.
ci_alpha (float) – the transparency level of the confidence interval. Default: 0.3
ci_force_lines (bool) – force the confidence intervals to be line plots (versus default shaded areas). Default: False
ci_show (bool) – show confidence intervals. Default: True
ci_legend (bool) – if ci_force_lines is True, this is a boolean flag to add the lines’ labels to the legend. Default: False
at_risk_counts (bool) – show group sizes at time points. See function
add_at_risk_counts
for details. Default: Falseloc (slice) – specify a timebased subsection of the curves to plot, ex:
>>> model.plot(loc=slice(0.,10.))
will plot the time values between t=0. and t=10.
iloc (slice) – specify a locationbased subsection of the curves to plot, ex:
>>> model.plot(iloc=slice(0,10))
will plot the first 10 time points.
invert_y_axis (bool) – boolean to invert the y-axis, useful to show cumulative graphs instead of survival graphs. (Deprecated, use
plot_cumulative_density()
)
Returns: a pyplot axis object
Return type: ax

plot_cumulative_density
(**kwargs)¶

plot_cumulative_hazard
(**kwargs)¶

plot_hazard
(**kwargs)¶

plot_survival_function
(**kwargs)¶

predict
(times, interpolate=False)¶ Predict the {0} at certain point in time. Uses a linear interpolation if points in time are not in the index.
Parameters:  times (scalar, or array) – a scalar or an array of times to predict the value of {0} at.
 interpolate (boolean, optional (default=False)) – for methods that produce a stepwise solution (Kaplan-Meier, Nelson-Aalen, etc.), setting this to True will use a linear interpolation method to provide a smoother answer.
Returns: predictions
Return type: a scalar if times is a scalar, a numpy array if times is an array.

subtract
(other)¶ Subtract the {0} of two {1} objects.
Parameters: other (same object as self) Returns: Return type: DataFrame

survival_function_at_times
(times, label=None)¶ Return a Pandas series of the predicted survival value at specific times
Parameters: times (iterable or float) Returns: Return type: pd.Series

ExponentialFitter¶

class
lifelines.fitters.exponential_fitter.
ExponentialFitter
(*args, **kwargs)¶ Bases:
lifelines.fitters.KnownModelParametericUnivariateFitter
This class implements an Exponential model for univariate data. The model has parameterized form:
\[S(t) = \exp\left(-\frac{t}{\lambda}\right), \lambda > 0\]which implies the cumulative hazard rate is
\[H(t) = \frac{t}{\lambda}\]and the hazard rate is:
\[h(t) = \frac{1}{\lambda}\]After calling the .fit method, you have access to properties like:
survival_function_
,lambda_
,cumulative_hazard_
A summary of the fit is available with the methodprint_summary()
Parameters: alpha (float, optional (default=0.05)) – the level in the confidence intervals. Important
The parameterization of this model changed in lifelines 0.19.0. Previously, the cumulative hazard looked like \(\lambda t\). The parameterization is now the reciprocal of \(\lambda\).
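As a sanity check on this parameterization: with right censoring, the maximum-likelihood estimate of \(\lambda\) has the closed form sum(T)/sum(E) (total time at risk divided by the number of observed events), since the log-likelihood is \(\sum E_i \log(1/\lambda) - \sum T_i / \lambda\). A quick numeric check with made-up data (this sketch does not use lifelines itself):

```python
import numpy as np

# Made-up durations and event indicators (1 = event observed, 0 = censored).
T = np.array([5.0, 8.0, 12.0, 3.0, 9.0])
E = np.array([1, 1, 0, 1, 0])

# Closed-form MLE under this parameterization.
lam_hat = T.sum() / E.sum()

# Compare against a grid search over the log-likelihood.
grid = np.linspace(1.0, 40.0, 100000)
loglik = E.sum() * np.log(1.0 / grid) - T.sum() / grid
print(lam_hat, grid[np.argmax(loglik)])  # both near 37/3 ≈ 12.33
```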

cumulative_hazard_
¶ The estimated cumulative hazard (with custom timeline if provided)
Type: DataFrame

confidence_interval_cumulative_hazard_
¶ The lower and upper confidence intervals for the cumulative hazard
Type: DataFrame

hazard_
¶ The estimated hazard (with custom timeline if provided)
Type: DataFrame

confidence_interval_hazard_
¶ The lower and upper confidence intervals for the hazard
Type: DataFrame

survival_function_
¶ The estimated survival function (with custom timeline if provided)
Type: DataFrame

confidence_interval_survival_function_
¶ The lower and upper confidence intervals for the survival function
Type: DataFrame

variance_matrix_
¶ The variance matrix of the coefficients
Type: numpy array

median_
¶ The median time to event
Type: float

lambda_
¶ The fitted parameter in the model
Type: float

durations
¶ The durations provided
Type: array

event_observed
¶ The event_observed variable provided
Type: array

timeline
¶ The time line to use for plotting and indexing
Type: array

entry
¶ The entry array provided, or None
Type: array or None

cumulative_density_
¶ The estimated cumulative density function (with custom timeline if provided)
Type: DataFrame

confidence_interval_cumulative_density_
¶ The lower and upper confidence intervals for the cumulative density
Type: DataFrame

conditional_time_to_event_
¶ Return a DataFrame, with index equal to
survival_function_
’s index, that estimates the median duration remaining until the death event, given survival up until time t. For example, if an individual exists until age 1, their expected life remaining given they lived to time 1 might be 9 years.Returns: conditional_time_to_ Return type: DataFrame

confidence_interval_
¶ The confidence interval of the cumulative hazard. This is an alias for
confidence_interval_cumulative_hazard_
.

confidence_interval_cumulative_density_
The lower and upper confidence intervals for the cumulative density

confidence_interval_cumulative_hazard_
The confidence interval of the cumulative hazard. This is an alias for
confidence_interval_
.

confidence_interval_hazard_
The confidence interval of the hazard.

confidence_interval_survival_function_
The lower and upper confidence intervals for the survival function

cumulative_density_at_times
(times, label=None)¶ Return a Pandas series of the predicted cumulative density function (1survival function) at specific times.
Parameters:  times (iterable or float) – values to return the survival function at.
 label (string, optional) – Rename the series returned. Useful for plotting.
Returns: Return type: pd.Series

cumulative_hazard_at_times
(times, label=None)¶ Return a Pandas series of the predicted cumulative hazard value at specific times.
Parameters:  times (iterable or float) – values to return the cumulative hazard at.
 label (string, optional) – Rename the series returned. Useful for plotting.
Returns: Return type: pd.Series

divide
(other)¶ Divide the {0} of two {1} objects.
Parameters: other (same object as self) Returns: Return type: DataFrame

event_table
¶

fit
(durations, event_observed=None, timeline=None, label=None, alpha=None, ci_labels=None, show_progress=False, entry=None, weights=None, left_censorship=False, initial_point=None)¶ Parameters:  durations (an array, or pd.Series) – length n, duration subject was observed for
 event_observed (numpy array or pd.Series, optional) – length n, True if the death was observed, False if the event was lost (right-censored). Defaults to all True if event_observed==None
 timeline (list, optional) – return the estimate at the values in timeline (positively increasing)
 label (string, optional) – a string to name the column of the estimate.
 alpha (float, optional) – the alpha value in the confidence intervals. Overrides the initializing alpha for this call to fit only.
 ci_labels (list, optional) – add custom column names to the generated confidence intervals as a length2 list: [<lowerbound name>, <upperbound name>]. Default: <label>_lower_<alpha>
 show_progress (boolean, optional) – since this is an iterative fitting algorithm, switching this to True will display some iteration details.
 entry (an array, or pd.Series, of length n) – relative time when a subject entered the study. This is useful for left-truncated (not left-censored) observations. If None, all members of the population entered the study when they were “born”: time zero.
 weights (an array, or pd.Series, of length n) – integer weights per observation
 initial_point ((d,) numpy array, optional) – initialize the starting point of the iterative algorithm. Default is the zero vector.
Returns: self with new properties like
cumulative_hazard_
,survival_function_
Return type: self

fit_interval_censoring
(lower_bound, upper_bound, event_observed=None, timeline=None, label=None, alpha=None, ci_labels=None, show_progress=False, entry=None, weights=None, initial_point=None)¶ Fit the model to an interval censored dataset.
Parameters:  lower_bound (an array, or pd.Series) – length n, the start of the period the subject experienced the event in.
 upper_bound (an array, or pd.Series) – length n, the end of the period the subject experienced the event in. If the value is equal to the corresponding value in lower_bound, then the individual’s event was observed (not censored).
 event_observed (numpy array or pd.Series, optional) – length n, if left optional, infer from
lower_bound
and upper_bound
(if lower_bound==upper_bound then event observed, if lower_bound < upper_bound, then event censored)  timeline (list, optional) – return the estimate at the values in timeline (positively increasing)
 label (string, optional) – a string to name the column of the estimate.
 alpha (float, optional) – the alpha value in the confidence intervals. Overrides the initializing alpha for this call to fit only.
 ci_labels (list, optional) – add custom column names to the generated confidence intervals as a length-2 list: [<lower-bound name>, <upper-bound name>]. Default: <label>_lower_<alpha>
 show_progress (boolean, optional) – since this is an iterative fitting algorithm, switching this to True will display some iteration details.
 entry (an array, or pd.Series, of length n) – relative time when a subject entered the study. This is useful for left-truncated (not left-censored) observations. If None, all members of the population entered the study when they were “born”: time zero.
 weights (an array, or pd.Series, of length n) – integer weights per observation
 initial_point ((d,) numpy array, optional) – initialize the starting point of the iterative algorithm. Default is the zero vector.
Returns: self with new properties like
cumulative_hazard_
,survival_function_
Return type: self

fit_left_censoring
(durations, event_observed=None, timeline=None, label=None, alpha=None, ci_labels=None, show_progress=False, entry=None, weights=None, initial_point=None)¶ Fit the model to a left-censored dataset
Parameters:  durations (an array, or pd.Series) – length n, duration subject was observed for
 event_observed (numpy array or pd.Series, optional) – length n, True if the death was observed, False if the event was lost (right-censored). Defaults to all True if event_observed==None
 timeline (list, optional) – return the estimate at the values in timeline (positively increasing)
 label (string, optional) – a string to name the column of the estimate.
 alpha (float, optional) – the alpha value in the confidence intervals. Overrides the initializing alpha for this call to fit only.
 ci_labels (list, optional) – add custom column names to the generated confidence intervals as a length-2 list: [<lower-bound name>, <upper-bound name>]. Default: <label>_lower_<alpha>
 show_progress (boolean, optional) – since this is an iterative fitting algorithm, switching this to True will display some iteration details.
 entry (an array, or pd.Series, of length n) – relative time when a subject entered the study. This is useful for left-truncated (not left-censored) observations. If None, all members of the population entered the study when they were “born”: time zero.
 weights (an array, or pd.Series, of length n) – integer weights per observation
 initial_point ((d,) numpy array, optional) – initialize the starting point of the iterative algorithm. Default is the zero vector.
Returns: self with new properties like
cumulative_hazard_
,survival_function_
Return type: self

fit_right_censoring
(*args, **kwargs)¶ Alias for
fit
See also
fit

hazard_at_times
(times, label=None)¶ Return a Pandas series of the predicted hazard at specific times.
Parameters:  times (iterable or float) – values to return the hazard at.
 label (string, optional) – Rename the series returned. Useful for plotting.
Returns: Return type: pd.Series

median_
Deprecated, use .median_survival_time_
Return the unique time point, t, such that S(t) = 0.5. This is the “half-life” of the population, and a robust summary statistic for the population, if it exists.

median_survival_time_
¶ Return the unique time point, t, such that S(t) = 0.5. This is the “half-life” of the population, and a robust summary statistic for the population, if it exists.

percentile
(p)¶ Return the unique time point, t, such that S(t) = p.
For known parametric models, this should be overwritten by something more accurate.

plot
(**kwargs)¶ Produce a pretty plot of the estimate.

plot_cumulative_density
(**kwargs)¶

plot_cumulative_hazard
(**kwargs)¶

plot_hazard
(**kwargs)¶

plot_survival_function
(**kwargs)¶

predict
(times, interpolate=False)¶ Predict the estimate at certain points in time. Uses a linear interpolation if the points in time are not in the index.
Parameters:  times (scalar, or array) – a scalar or an array of times to predict the value of the estimate at.
 interpolate (boolean, optional (default=False)) – for methods that produce a stepwise solution (Kaplan-Meier, Nelson-Aalen, etc.), setting this to True will use a linear interpolation method to provide a more “smooth” answer.
Returns: predictions
Return type: a scalar if times is a scalar, a numpy array if times is an array.

print_summary
(decimals=2, **kwargs)¶ Print summary statistics describing the fit, the coefficients, and the error bounds.
Parameters:  decimals (int, optional (default=2)) – specify the number of decimal places to show
 kwargs – print additional metadata in the output (useful to provide model names, dataset names, etc.) when comparing multiple outputs.

subtract
(other)¶ Subtract the estimates of two fitter objects.
Parameters: other (same object as self) Returns: Return type: DataFrame

summary
¶ Summary statistics describing the fit.
Returns: df – Contains columns coef, exp(coef), se(coef), z, p, lower, upper Return type: pd.DataFrame See also
print_summary

survival_function_at_times
(times, label=None)¶ Return a Pandas series of the predicted survival value at specific times.
Parameters:  times (iterable or float) – values to return the survival function at.
 label (string, optional) – Rename the series returned. Useful for plotting.
Returns: Return type: pd.Series

KaplanMeierFitter¶

class
lifelines.fitters.kaplan_meier_fitter.
KaplanMeierFitter
(alpha=0.05)¶ Bases:
lifelines.fitters.UnivariateFitter
Class for fitting the Kaplan-Meier estimate for the survival function.
Parameters: alpha (float, optional (default=0.05)) – The alpha value associated with the confidence intervals. Examples
>>> from lifelines import KaplanMeierFitter
>>> from lifelines.datasets import load_waltons
>>> waltons = load_waltons()
>>> kmf = KaplanMeierFitter()
>>> kmf.fit(waltons['T'], waltons['E'])
>>> kmf.plot()

survival_function_
¶ The estimated survival function (with custom timeline if provided)
Type: DataFrame

median_
¶ The estimated median time to event. np.inf if doesn’t exist.
Type: float

confidence_interval_
¶ The lower and upper confidence intervals for the survival function. An alias of
confidence_interval_survival_function_
Type: DataFrame

confidence_interval_survival_function_
¶ The lower and upper confidence intervals for the survival function. An alias of
confidence_interval_
Type: DataFrame

cumulative_density_
¶ The estimated cumulative density function (with custom timeline if provided)
Type: DataFrame

confidence_interval_cumulative_density_
¶ The lower and upper confidence intervals for the cumulative density
Type: DataFrame

durations
¶ The durations provided
Type: array

event_observed
¶ The event_observed variable provided
Type: array

timeline
¶ The time line to use for plotting and indexing
Type: array

entry
¶ The entry array provided, or None
Type: array or None

event_table
¶ A summary of the life table
Type: DataFrame

conditional_time_to_event_
¶ Return a DataFrame, with index equal to
survival_function_
’s index, that estimates the median duration remaining until the death event, given survival up until time t. For example, if an individual exists until age 1, their expected life remaining given they lived to time 1 might be 9 years.Returns: conditional_time_to_ Return type: DataFrame

cumulative_density_at_times
(times, label=None)¶ Return a Pandas series of the predicted cumulative density at specific times
Parameters: times (iterable or float) Returns: Return type: pd.Series

cumulative_hazard_at_times
(times, label=None)¶

divide
(other)¶ Divide the estimates of two fitter objects.
Parameters: other (same object as self) Returns: Return type: DataFrame

fit
(durations, event_observed=None, timeline=None, entry=None, label='KM_estimate', left_censorship=False, alpha=None, ci_labels=None, weights=None)¶ Fit the model to a right-censored dataset
Parameters:  durations (an array, list, pd.DataFrame or pd.Series) – length n – duration subject was observed for
 event_observed (an array, list, pd.DataFrame, or pd.Series, optional) – True if the death was observed, False if the event was lost (right-censored). Defaults to all True if event_observed==None
 timeline (an array, list, pd.DataFrame, or pd.Series, optional) – return the best estimate at the values in timeline (positively increasing)
 entry (an array, list, pd.DataFrame, or pd.Series, optional) – relative time when a subject entered the study. This is useful for left-truncated (not left-censored) observations. If None, all members of the population entered the study when they were “born”.
 label (string, optional) – a string to name the column of the estimate.
 alpha (float, optional) – the alpha value in the confidence intervals. Overrides the initializing alpha for this call to fit only.
 left_censorship (bool, optional (default=False)) – Deprecated, use
fit_left_censoring
 ci_labels (tuple, optional) – add custom column names to the generated confidence intervals as a length-2 list: [<lower-bound name>, <upper-bound name>]. Default: <label>_lower_<1-alpha/2>
 weights (an array, list, pd.DataFrame, or pd.Series, optional) – if providing a weighted dataset. For example, instead of providing every subject as a single element of durations and event_observed, one could weight subjects differently.
Returns: self – self with new properties like
survival_function_
,plot()
,median
Return type:

fit_left_censoring
(durations, event_observed=None, timeline=None, entry=None, label='KM_estimate', alpha=None, ci_labels=None, weights=None)¶ Fit the model to a left-censored dataset
Parameters:  durations (an array, list, pd.DataFrame or pd.Series) – length n – duration subject was observed for
 event_observed (an array, list, pd.DataFrame, or pd.Series, optional) – True if the death was observed, False if the event was lost (right-censored). Defaults to all True if event_observed==None
 timeline (an array, list, pd.DataFrame, or pd.Series, optional) – return the best estimate at the values in timeline (positively increasing)
 entry (an array, list, pd.DataFrame, or pd.Series, optional) – relative time when a subject entered the study. This is useful for left-truncated (not left-censored) observations. If None, all members of the population entered the study when they were “born”.
 label (string, optional) – a string to name the column of the estimate.
 alpha (float, optional) – the alpha value in the confidence intervals. Overrides the initializing alpha for this call to fit only.
 left_censorship (bool, optional (default=False)) – Deprecated, use
fit_left_censoring
 ci_labels (tuple, optional) – add custom column names to the generated confidence intervals as a length-2 list: [<lower-bound name>, <upper-bound name>]. Default: <label>_lower_<1-alpha/2>
 weights (an array, list, pd.DataFrame, or pd.Series, optional) – if providing a weighted dataset. For example, instead of providing every subject as a single element of durations and event_observed, one could weight subjects differently.
Returns: self – self with new properties like
survival_function_
,plot()
,median
Return type:

fit_right_censoring
(*args, **kwargs)¶ Alias for
fit
See also
fit

hazard_at_times
(times, label=None)¶

median_
Deprecated, use .median_survival_time_
Return the unique time point, t, such that S(t) = 0.5. This is the “half-life” of the population, and a robust summary statistic for the population, if it exists.

median_survival_time_
¶ Return the unique time point, t, such that S(t) = 0.5. This is the “half-life” of the population, and a robust summary statistic for the population, if it exists.

percentile
(p)¶ Return the unique time point, t, such that S(t) = p.
For known parametric models, this should be overwritten by something more accurate.

plot
(**kwargs)¶ Plots a pretty figure of the model
Matplotlib plot arguments can be passed in inside the kwargs, plus
Parameters: show_censors (bool) – place markers at censorship events. Default: False
censor_styles (dict) – If show_censors, this dictionary will be passed into the plot call.
ci_alpha (float) – the transparency level of the confidence interval. Default: 0.3
ci_force_lines (bool) – force the confidence intervals to be line plots (versus default shaded areas). Default: False
ci_show (bool) – show confidence intervals. Default: True
ci_legend (bool) – if ci_force_lines is True, this is a boolean flag to add the lines’ labels to the legend. Default: False
at_risk_counts (bool) – show group sizes at time points. See function
add_at_risk_counts
for details. Default: False
loc (slice) – specify a time-based subsection of the curves to plot, ex:
>>> model.plot(loc=slice(0., 10.))
will plot the time values between t=0. and t=10.
iloc (slice) – specify a location-based subsection of the curves to plot, ex:
>>> model.plot(iloc=slice(0, 10))
will plot the first 10 time points.
invert_y_axis (bool) – boolean to invert the y-axis, useful to show cumulative graphs instead of survival graphs. (Deprecated, use
plot_cumulative_density()
)
Returns: a pyplot axis object
Return type: ax

plot_cumulative_density
(**kwargs)¶ Plots a pretty figure of the cumulative density estimate.
Matplotlib plot arguments can be passed in inside the kwargs, plus
Parameters: show_censors (bool) – place markers at censorship events. Default: False
censor_styles (dict) – If show_censors, this dictionary will be passed into the plot call.
ci_alpha (float) – the transparency level of the confidence interval. Default: 0.3
ci_force_lines (bool) – force the confidence intervals to be line plots (versus default shaded areas). Default: False
ci_show (bool) – show confidence intervals. Default: True
ci_legend (bool) – if ci_force_lines is True, this is a boolean flag to add the lines’ labels to the legend. Default: False
at_risk_counts (bool) – show group sizes at time points. See function
add_at_risk_counts
for details. Default: False
loc (slice) – specify a time-based subsection of the curves to plot, ex:
>>> model.plot(loc=slice(0., 10.))
will plot the time values between t=0. and t=10.
iloc (slice) – specify a location-based subsection of the curves to plot, ex:
>>> model.plot(iloc=slice(0, 10))
will plot the first 10 time points.
invert_y_axis (bool) – boolean to invert the y-axis, useful to show cumulative graphs instead of survival graphs. (Deprecated, use
plot_cumulative_density()
)
Returns: a pyplot axis object
Return type: ax

plot_cumulative_hazard
(**kwargs)¶

plot_hazard
(**kwargs)¶

plot_loglogs
(*args, **kwargs)¶ Plot \(\log(-\log(S(t)))\) against \(\log(t)\)

plot_survival_function
(**kwargs)¶ Alias of
plot

predict
(times, interpolate=False)¶ Predict the estimate at certain points in time. Uses a linear interpolation if the points in time are not in the index.
Parameters:  times (scalar, or array) – a scalar or an array of times to predict the value of the estimate at.
 interpolate (boolean, optional (default=False)) – for methods that produce a stepwise solution (Kaplan-Meier, Nelson-Aalen, etc.), setting this to True will use a linear interpolation method to provide a more “smooth” answer.
Returns: predictions
Return type: a scalar if times is a scalar, a numpy array if times is an array.

subtract
(other)¶ Subtract the estimates of two fitter objects.
Parameters: other (same object as self) Returns: Return type: DataFrame

survival_function_at_times
(times, label=None)¶ Return a Pandas series of the predicted survival value at specific times
Parameters: times (iterable or float) Returns: Return type: pd.Series

LogLogisticFitter¶

class
lifelines.fitters.log_logistic_fitter.
LogLogisticFitter
(*args, **kwargs)¶ Bases:
lifelines.fitters.KnownModelParametericUnivariateFitter
This class implements a Log-Logistic model for univariate data. The model has parameterized form:
\[S(t) = \left(1 + \left(\frac{t}{\alpha}\right)^{\beta}\right)^{-1}, \alpha > 0, \beta > 0,\]and the hazard rate is:
\[h(t) = \frac{\left(\frac{\beta}{\alpha}\right)\left(\frac{t}{\alpha}\right) ^ {\beta-1}}{\left(1 + \left(\frac{t}{\alpha}\right)^{\beta}\right)}\]and the cumulative hazard is:
\[H(t) = \log\left(\left(\frac{t}{\alpha}\right) ^ {\beta} + 1\right)\]After calling the .fit method, you have access to properties like:
cumulative_hazard_
,plot
,survival_function_
,alpha_
andbeta_
. A summary of the fit is available with the method ‘print_summary()’Parameters: alpha (float, optional (default=0.05)) – the level in the confidence intervals. Examples
>>> from lifelines import LogLogisticFitter
>>> from lifelines.datasets import load_waltons
>>> waltons = load_waltons()
>>> llf = LogLogisticFitter()
>>> llf.fit(waltons['T'], waltons['E'])
>>> llf.plot()
>>> print(llf.alpha_)

cumulative_hazard_
¶ The estimated cumulative hazard (with custom timeline if provided)
Type: DataFrame

hazard_
¶ The estimated hazard (with custom timeline if provided)
Type: DataFrame

survival_function_
¶ The estimated survival function (with custom timeline if provided)
Type: DataFrame

cumulative_density_
¶ The estimated cumulative density function (with custom timeline if provided)
Type: DataFrame

variance_matrix_
¶ The variance matrix of the coefficients
Type: numpy array

median_
¶ The median time to event
Type: float

alpha_
¶ The fitted parameter in the model
Type: float

beta_
¶ The fitted parameter in the model
Type: float

durations
¶ The durations provided
Type: array

event_observed
¶ The event_observed variable provided
Type: array

timeline
¶ The time line to use for plotting and indexing
Type: array

entry
¶ The entry array provided, or None
Type: array or None

conditional_time_to_event_
¶ Return a DataFrame, with index equal to
survival_function_
’s index, that estimates the median duration remaining until the death event, given survival up until time t. For example, if an individual exists until age 1, their expected life remaining given they lived to time 1 might be 9 years.Returns: conditional_time_to_ Return type: DataFrame

confidence_interval_
¶ The confidence interval of the cumulative hazard. This is an alias for
confidence_interval_cumulative_hazard_
.

confidence_interval_cumulative_density_
¶ The lower and upper confidence intervals for the cumulative density

confidence_interval_cumulative_hazard_
¶ The confidence interval of the cumulative hazard. This is an alias for
confidence_interval_
.

confidence_interval_hazard_
¶ The confidence interval of the hazard.

confidence_interval_survival_function_
¶ The lower and upper confidence intervals for the survival function

cumulative_density_at_times
(times, label=None)¶ Return a Pandas series of the predicted cumulative density function (1survival function) at specific times.
Parameters:  times (iterable or float) – values to return the survival function at.
 label (string, optional) – Rename the series returned. Useful for plotting.
Returns: Return type: pd.Series

cumulative_hazard_at_times
(times, label=None)¶ Return a Pandas series of the predicted cumulative hazard value at specific times.
Parameters:  times (iterable or float) – values to return the cumulative hazard at.
 label (string, optional) – Rename the series returned. Useful for plotting.
Returns: Return type: pd.Series

divide
(other)¶ Divide the estimates of two fitter objects.
Parameters: other (same object as self) Returns: Return type: DataFrame

event_table
¶

fit
(durations, event_observed=None, timeline=None, label=None, alpha=None, ci_labels=None, show_progress=False, entry=None, weights=None, left_censorship=False, initial_point=None)¶ Parameters:  durations (an array, or pd.Series) – length n, duration subject was observed for
 event_observed (numpy array or pd.Series, optional) – length n, True if the death was observed, False if the event was lost (right-censored). Defaults to all True if event_observed==None
 timeline (list, optional) – return the estimate at the values in timeline (positively increasing)
 label (string, optional) – a string to name the column of the estimate.
 alpha (float, optional) – the alpha value in the confidence intervals. Overrides the initializing alpha for this call to fit only.
 ci_labels (list, optional) – add custom column names to the generated confidence intervals as a length-2 list: [<lower-bound name>, <upper-bound name>]. Default: <label>_lower_<alpha>
 show_progress (boolean, optional) – since this is an iterative fitting algorithm, switching this to True will display some iteration details.
 entry (an array, or pd.Series, of length n) – relative time when a subject entered the study. This is useful for left-truncated (not left-censored) observations. If None, all members of the population entered the study when they were “born”: time zero.
 weights (an array, or pd.Series, of length n) – integer weights per observation
 initial_point ((d,) numpy array, optional) – initialize the starting point of the iterative algorithm. Default is the zero vector.
Returns: self with new properties like
cumulative_hazard_
,survival_function_
Return type: self

fit_interval_censoring
(lower_bound, upper_bound, event_observed=None, timeline=None, label=None, alpha=None, ci_labels=None, show_progress=False, entry=None, weights=None, initial_point=None)¶ Fit the model to an interval-censored dataset.
Parameters:  lower_bound (an array, or pd.Series) – length n, the start of the period the subject experienced the event in.
 upper_bound (an array, or pd.Series) – length n, the end of the period the subject experienced the event in. If the value is equal to the corresponding value in lower_bound, then the individual’s event was observed (not censored).
 event_observed (numpy array or pd.Series, optional) – length n; if left unspecified, inferred from lower_bound and upper_bound (if lower_bound==upper_bound then the event was observed; if lower_bound < upper_bound, then the event was censored)
 timeline (list, optional) – return the estimate at the values in timeline (positively increasing)
 label (string, optional) – a string to name the column of the estimate.
 alpha (float, optional) – the alpha value in the confidence intervals. Overrides the initializing alpha for this call to fit only.
 ci_labels (list, optional) – add custom column names to the generated confidence intervals as a length-2 list: [<lower-bound name>, <upper-bound name>]. Default: <label>_lower_<alpha>
 show_progress (boolean, optional) – since this is an iterative fitting algorithm, switching this to True will display some iteration details.
 entry (an array, or pd.Series, of length n) – relative time when a subject entered the study. This is useful for left-truncated (not left-censored) observations. If None, all members of the population entered the study when they were “born”: time zero.
 weights (an array, or pd.Series, of length n) – integer weights per observation
 initial_point ((d,) numpy array, optional) – initialize the starting point of the iterative algorithm. Default is the zero vector.
Returns: self with new properties like
cumulative_hazard_
,survival_function_
Return type: self

fit_left_censoring
(durations, event_observed=None, timeline=None, label=None, alpha=None, ci_labels=None, show_progress=False, entry=None, weights=None, initial_point=None)¶ Fit the model to a left-censored dataset
Parameters:  durations (an array, or pd.Series) – length n, duration subject was observed for
 event_observed (numpy array or pd.Series, optional) – length n, True if the death was observed, False if the event was lost (right-censored). Defaults to all True if event_observed==None
 timeline (list, optional) – return the estimate at the values in timeline (positively increasing)
 label (string, optional) – a string to name the column of the estimate.
 alpha (float, optional) – the alpha value in the confidence intervals. Overrides the initializing alpha for this call to fit only.
 ci_labels (list, optional) – add custom column names to the generated confidence intervals as a length-2 list: [<lower-bound name>, <upper-bound name>]. Default: <label>_lower_<alpha>
 show_progress (boolean, optional) – since this is an iterative fitting algorithm, switching this to True will display some iteration details.
 entry (an array, or pd.Series, of length n) – relative time when a subject entered the study. This is useful for left-truncated (not left-censored) observations. If None, all members of the population entered the study when they were “born”: time zero.
 weights (an array, or pd.Series, of length n) – integer weights per observation
 initial_point ((d,) numpy array, optional) – initialize the starting point of the iterative algorithm. Default is the zero vector.
Returns: self with new properties like
cumulative_hazard_
,survival_function_
Return type: self

fit_right_censoring
(*args, **kwargs)¶ Alias for
fit
See also
fit

hazard_at_times
(times, label=None)¶ Return a Pandas series of the predicted hazard at specific times.
Parameters:  times (iterable or float) – values to return the hazard at.
 label (string, optional) – Rename the series returned. Useful for plotting.
Returns: Return type: pd.Series

median_
Deprecated, use .median_survival_time_
Return the unique time point, t, such that S(t) = 0.5. This is the “half-life” of the population, and a robust summary statistic for the population, if it exists.

median_survival_time_
¶ Return the unique time point, t, such that S(t) = 0.5. This is the “half-life” of the population, and a robust summary statistic for the population, if it exists.

percentile
(p)¶ Return the unique time point, t, such that S(t) = p.
For known parametric models, this should be overwritten by something more accurate.

plot
(**kwargs)¶ Produce a pretty plot of the estimate.

plot_cumulative_density
(**kwargs)¶

plot_cumulative_hazard
(**kwargs)¶

plot_hazard
(**kwargs)¶

plot_survival_function
(**kwargs)¶

predict
(times, interpolate=False)¶ Predict the estimate at certain points in time. Uses a linear interpolation if the points in time are not in the index.
Parameters:  times (scalar, or array) – a scalar or an array of times to predict the value of the estimate at.
 interpolate (boolean, optional (default=False)) – for methods that produce a stepwise solution (Kaplan-Meier, Nelson-Aalen, etc.), setting this to True will use a linear interpolation method to provide a more “smooth” answer.
Returns: predictions
Return type: a scalar if times is a scalar, a numpy array if times is an array.

print_summary
(decimals=2, **kwargs)¶ Print summary statistics describing the fit, the coefficients, and the error bounds.
Parameters:  decimals (int, optional (default=2)) – specify the number of decimal places to show
 kwargs – print additional metadata in the output (useful to provide model names, dataset names, etc.) when comparing multiple outputs.

subtract
(other)¶ Subtract the estimates of two fitter objects.
Parameters: other (same object as self) Returns: Return type: DataFrame

summary
¶ Summary statistics describing the fit.
Returns: df – Contains columns coef, exp(coef), se(coef), z, p, lower, upper Return type: pd.DataFrame See also
print_summary

survival_function_at_times
(times, label=None)¶ Return a Pandas series of the predicted survival value at specific times.
Parameters:  times (iterable or float) – values to return the survival function at.
 label (string, optional) – Rename the series returned. Useful for plotting.
Returns: Return type: pd.Series

LogNormalFitter¶

class
lifelines.fitters.log_normal_fitter.
LogNormalFitter
(*args, **kwargs)¶ Bases:
lifelines.fitters.KnownModelParametericUnivariateFitter
This class implements a Log-Normal model for univariate data. The model has parameterized form:
\[S(t) = 1 - \Phi\left(\frac{\log(t) - \mu}{\sigma}\right), \;\; \sigma > 0\]where \(\Phi\) is the CDF of a standard normal random variable. This implies the cumulative hazard rate is
\[H(t) = -\log\left(1 - \Phi\left(\frac{\log(t) - \mu}{\sigma}\right)\right)\]After calling the .fit method, you have access to properties like:
survival_function_
,mu_
,sigma_
. A summary of the fit is available with the methodprint_summary()
Parameters: alpha (float, optional (default=0.05)) – the level in the confidence intervals. 
cumulative_hazard_
¶ The estimated cumulative hazard (with custom timeline if provided)
Type: DataFrame

hazard_
¶ The estimated hazard (with custom timeline if provided)
Type: DataFrame

survival_function_
¶ The estimated survival function (with custom timeline if provided)
Type: DataFrame

cumulative_density_
¶ The estimated cumulative density function (with custom timeline if provided)
Type: DataFrame

variance_matrix_
¶ The variance matrix of the coefficients
Type: numpy array

median_
¶ The median time to event
Type: float

mu_
¶ The fitted parameter in the model
Type: float

sigma_
¶ The fitted parameter in the model
Type: float

durations
¶ The durations provided
Type: array

event_observed
¶ The event_observed variable provided
Type: array

timeline
¶ The time line to use for plotting and indexing
Type: array

entry
¶ The entry array provided, or None
Type: array or None

conditional_time_to_event_
¶ Return a DataFrame, with index equal to
survival_function_
’s index, that estimates the median duration remaining until the death event, given survival up until time t. For example, if an individual exists until age 1, their expected life remaining given they lived to time 1 might be 9 years.Returns: conditional_time_to_ Return type: DataFrame

confidence_interval_
¶ The confidence interval of the cumulative hazard. This is an alias for
confidence_interval_cumulative_hazard_
.

confidence_interval_cumulative_density_
¶ The lower and upper confidence intervals for the cumulative density

confidence_interval_cumulative_hazard_
¶ The confidence interval of the cumulative hazard. This is an alias for
confidence_interval_
.

confidence_interval_hazard_
¶ The confidence interval of the hazard.

confidence_interval_survival_function_
¶ The lower and upper confidence intervals for the survival function

cumulative_density_at_times
(times, label=None)¶ Return a Pandas series of the predicted cumulative density function (1survival function) at specific times.
Parameters:  times (iterable or float) – values to return the survival function at.
 label (string, optional) – Rename the series returned. Useful for plotting.
Returns: Return type: pd.Series

cumulative_hazard_at_times
(times, label=None)¶ Return a Pandas series of the predicted cumulative hazard value at specific times.
Parameters:  times (iterable or float) – values to return the cumulative hazard at.
 label (string, optional) – Rename the series returned. Useful for plotting.
Returns: Return type: pd.Series

divide
(other)¶ Divide the estimates of two fitter objects.
Parameters: other (same object as self) Returns: Return type: DataFrame

event_table
¶

fit
(durations, event_observed=None, timeline=None, label=None, alpha=None, ci_labels=None, show_progress=False, entry=None, weights=None, left_censorship=False, initial_point=None)¶ Parameters:  durations (an array, or pd.Series) – length n, duration subject was observed for
 event_observed (numpy array or pd.Series, optional) – length n, True if the death was observed, False if the event was lost (right-censored). Defaults to all True if event_observed==None
 timeline (list, optional) – return the estimate at the values in timeline (positively increasing)
 label (string, optional) – a string to name the column of the estimate.
 alpha (float, optional) – the alpha value in the confidence intervals. Overrides the initializing alpha for this call to fit only.
 ci_labels (list, optional) – add custom column names to the generated confidence intervals as a length-2 list: [<lower-bound name>, <upper-bound name>]. Default: <label>_lower_<alpha>
 show_progress (boolean, optional) – since this is an iterative fitting algorithm, switching this to True will display some iteration details.
 entry (an array, or pd.Series, of length n) – relative time when a subject entered the study. This is useful for left-truncated (not left-censored) observations. If None, all members of the population entered the study when they were “born”: time zero.
 weights (an array, or pd.Series, of length n) – integer weights per observation
 initial_point ((d,) numpy array, optional) – initialize the starting point of the iterative algorithm. Default is the zero vector.
Returns: self with new properties like
cumulative_hazard_
,survival_function_
Return type: self

fit_interval_censoring
(lower_bound, upper_bound, event_observed=None, timeline=None, label=None, alpha=None, ci_labels=None, show_progress=False, entry=None, weights=None, initial_point=None)¶ Fit the model to an interval censored dataset.
Parameters:  lower_bound (an array, or pd.Series) – length n, the start of the period the subject experienced the event in.
 upper_bound (an array, or pd.Series) – length n, the end of the period the subject experienced the event in. If the value is equal to the corresponding value in lower_bound, then the individual’s event was observed (not censored).
 event_observed (numpy array or pd.Series, optional) – length n; if left unspecified, inferred from lower_bound and upper_bound (if lower_bound == upper_bound then the event was observed; if lower_bound < upper_bound, then the event is censored)
 timeline (list, optional) – return the estimate at the values in timeline (positively increasing)
 label (string, optional) – a string to name the column of the estimate.
 alpha (float, optional) – the alpha value in the confidence intervals. Overrides the initializing alpha for this call to fit only.
 ci_labels (list, optional) – add custom column names to the generated confidence intervals as a length-2 list: [<lower-bound name>, <upper-bound name>]. Default: <label>_lower_<alpha>
 show_progress (boolean, optional) – since this is an iterative fitting algorithm, switching this to True will display some iteration details.
 entry (an array, or pd.Series, of length n) – relative time when a subject entered the study. This is useful for left-truncated (not left-censored) observations. If None, all members of the population entered the study when they were “born”: time zero.
 weights (an array, or pd.Series, of length n) – integer weights per observation
 initial_point ((d,) numpy array, optional) – initialize the starting point of the iterative algorithm. Default is the zero vector.
Returns: self with new properties like
cumulative_hazard_
,survival_function_
Return type: self

fit_left_censoring
(durations, event_observed=None, timeline=None, label=None, alpha=None, ci_labels=None, show_progress=False, entry=None, weights=None, initial_point=None)¶ Fit the model to a left-censored dataset.
Parameters:  durations (an array, or pd.Series) – length n, duration subject was observed for
 event_observed (numpy array or pd.Series, optional) – length n, True if the death was observed, False if the event was lost (right-censored). Defaults to all True if event_observed is None
 timeline (list, optional) – return the estimate at the values in timeline (positively increasing)
 label (string, optional) – a string to name the column of the estimate.
 alpha (float, optional) – the alpha value in the confidence intervals. Overrides the initializing alpha for this call to fit only.
 ci_labels (list, optional) – add custom column names to the generated confidence intervals as a length-2 list: [<lower-bound name>, <upper-bound name>]. Default: <label>_lower_<alpha>
 show_progress (boolean, optional) – since this is an iterative fitting algorithm, switching this to True will display some iteration details.
 entry (an array, or pd.Series, of length n) – relative time when a subject entered the study. This is useful for left-truncated (not left-censored) observations. If None, all members of the population entered the study when they were “born”: time zero.
 weights (an array, or pd.Series, of length n) – integer weights per observation
 initial_point ((d,) numpy array, optional) – initialize the starting point of the iterative algorithm. Default is the zero vector.
Returns: self with new properties like
cumulative_hazard_
,survival_function_
Return type: self

fit_right_censoring
(*args, **kwargs)¶ Alias for
fit
See also
fit

hazard_at_times
(times, label=None)¶ Return a Pandas series of the predicted hazard at specific times.
Parameters:  times (iterable or float) – values to return the hazard at.
 label (string, optional) – Rename the series returned. Useful for plotting.
Returns: Return type: pd.Series

median_
¶ Deprecated, use .median_survival_time_
Return the unique time point, t, such that S(t) = 0.5. This is the “half-life” of the population, and a robust summary statistic for the population, if it exists.

median_survival_time_
¶ Return the unique time point, t, such that S(t) = 0.5. This is the “half-life” of the population, and a robust summary statistic for the population, if it exists.

percentile
(p)¶ Return the unique time point, t, such that S(t) = p.
For known parametric models, this should be overwritten by something more accurate.

plot
(**kwargs)¶ Produce a pretty-plot of the estimate.

plot_cumulative_density
(**kwargs)¶

plot_cumulative_hazard
(**kwargs)¶

plot_hazard
(**kwargs)¶

plot_survival_function
(**kwargs)¶

predict
(times, interpolate=False)¶ Predict the estimate at certain points in time. Uses a linear interpolation if the points in time are not in the index.
Parameters:  times (scalar, or array) – a scalar or an array of times at which to predict the value of the estimate.
 interpolate (boolean, optional (default=False)) – for methods that produce a stepwise solution (Kaplan-Meier, Nelson-Aalen, etc.), setting this to True will use a linear interpolation to provide a smoother answer.
Returns: predictions
Return type: a scalar if times is a scalar, a numpy array if times is an array.

print_summary
(decimals=2, **kwargs)¶ Print summary statistics describing the fit, the coefficients, and the error bounds.
Parameters:  decimals (int, optional (default=2)) – specify the number of decimal places to show
 kwargs – print additional metadata in the output (useful to provide model names, dataset names, etc.) when comparing multiple outputs.

subtract
(other)¶ Subtract the estimates of two fitter objects of the same class.
Parameters: other (same object as self) Returns: Return type: DataFrame

summary
¶ Summary statistics describing the fit.
Returns: df – Contains columns coef, exp(coef), se(coef), z, p, lower, upper Return type: pd.DataFrame See also
print_summary

survival_function_at_times
(times, label=None)¶ Return a Pandas series of the predicted survival value at specific times.
Parameters:  times (iterable or float) – values to return the survival function at.
 label (string, optional) – Rename the series returned. Useful for plotting.
Returns: Return type: pd.Series

NelsonAalenFitter¶

class
lifelines.fitters.nelson_aalen_fitter.
NelsonAalenFitter
(alpha=0.05, nelson_aalen_smoothing=True)¶ Bases:
lifelines.fitters.UnivariateFitter
Class for fitting the Nelson-Aalen estimate of the cumulative hazard.
NelsonAalenFitter(alpha=0.05, nelson_aalen_smoothing=True)
Parameters:  alpha (float, optional (default=0.05)) – The alpha value associated with the confidence intervals.
 nelson_aalen_smoothing (bool, optional) – If the event times are naturally discrete (like discrete years, minutes, etc.) then it is advisable to set this parameter to False. See [1], pg. 84.
Notes
[1] Aalen, O., Borgan, O., Gjessing, H., 2008. Survival and Event History Analysis

cumulative_hazard_
¶ The estimated cumulative hazard (with custom timeline if provided)
Type: DataFrame

confidence_interval_
¶ The lower and upper confidence intervals for the cumulative hazard
Type: DataFrame

durations
¶ The durations provided
Type: array

event_observed
¶ The event_observed variable provided
Type: array

timeline
¶ The time line to use for plotting and indexing
Type: array

entry
¶ The entry array provided, or None
Type: array or None

event_table
¶ A summary of the life table
Type: DataFrame

conditional_time_to_event_
¶ Return a DataFrame, with index equal to
survival_function_
’s index, that estimates the median duration remaining until the death event, given survival up until time t. For example, if an individual survives until age 1, their expected remaining lifetime given they lived to time 1 might be 9 years.
Returns: conditional_time_to_ Return type: DataFrame

cumulative_density_at_times
(times, label=None)¶

cumulative_hazard_at_times
(times, label=None)¶

divide
(other)¶ Divide the estimates of two fitter objects of the same class.
Parameters: other (same object as self) Returns: Return type: DataFrame

fit
(durations, event_observed=None, timeline=None, entry=None, label='NA_estimate', alpha=None, ci_labels=None, weights=None)¶ Parameters:  durations (an array, or pd.Series, of length n) – duration subject was observed for
 timeline (iterable) – return the best estimate at the values in timelines (positively increasing)
 event_observed (an array, or pd.Series, of length n) – True if the death was observed, False if the event was lost (right-censored). Defaults to all True if event_observed is None
 entry (an array, or pd.Series, of length n) – relative time when a subject entered the study. This is useful for left-truncated observations, i.e. the birth event was not observed. If None, defaults to all 0 (all birth events observed.)
 label (string) – a string to name the column of the estimate.
 alpha (float) – the alpha value in the confidence intervals. Overrides the initializing alpha for this call to fit only.
 ci_labels (iterable) – add custom column names to the generated confidence intervals as a length-2 list: [<lower-bound name>, <upper-bound name>]. Default: <label>_lower_<1-alpha/2>
 weights (an array, or pd.Series, of length n) – if providing a weighted dataset. For example, instead of providing every subject as a single element of durations and event_observed, one could weight subjects differently.
Returns: Return type: self, with new properties like
cumulative_hazard_
.

fit_right_censoring
(*args, **kwargs)¶ Alias for
fit
See also
fit

hazard_at_times
(times, label=None)¶

median_
¶ Deprecated, use .median_survival_time_
Return the unique time point, t, such that S(t) = 0.5. This is the “half-life” of the population, and a robust summary statistic for the population, if it exists.

median_survival_time_
¶ Return the unique time point, t, such that S(t) = 0.5. This is the “half-life” of the population, and a robust summary statistic for the population, if it exists.

percentile
(p)¶ Return the unique time point, t, such that S(t) = p.
For known parametric models, this should be overwritten by something more accurate.

plot
(**kwargs)¶ Plots a pretty figure of the model
Matplotlib plot arguments can be passed in inside the kwargs, plus
Parameters: show_censors (bool) – place markers at censorship events. Default: False
censor_styles (dict) – If show_censors, this dictionary will be passed into the plot call.
ci_alpha (float) – the transparency level of the confidence interval. Default: 0.3
ci_force_lines (bool) – force the confidence intervals to be line plots (versus default shaded areas). Default: False
ci_show (bool) – show confidence intervals. Default: True
ci_legend (bool) – if ci_force_lines is True, this is a boolean flag to add the lines’ labels to the legend. Default: False
at_risk_counts (bool) – show group sizes at time points. See function
add_at_risk_counts
for details. Default: False
loc (slice) – specify a time-based subsection of the curves to plot, ex:
>>> model.plot(loc=slice(0.,10.))
will plot the time values between t=0. and t=10.
iloc (slice) – specify a location-based subsection of the curves to plot, ex:
>>> model.plot(iloc=slice(0,10))
will plot the first 10 time points.
invert_y_axis (bool) – boolean to invert the y-axis, useful to show cumulative graphs instead of survival graphs. (Deprecated, use
plot_cumulative_density()
)
Returns: a pyplot axis object
Return type: ax

plot_cumulative_density
(**kwargs)¶

plot_cumulative_hazard
(**kwargs)¶

plot_hazard
(**kwargs)¶

plot_survival_function
(**kwargs)¶

predict
(times, interpolate=False)¶ Predict the estimate at certain points in time. Uses a linear interpolation if the points in time are not in the index.
Parameters:  times (scalar, or array) – a scalar or an array of times at which to predict the value of the estimate.
 interpolate (boolean, optional (default=False)) – for methods that produce a stepwise solution (Kaplan-Meier, Nelson-Aalen, etc.), setting this to True will use a linear interpolation to provide a smoother answer.
Returns: predictions
Return type: a scalar if times is a scalar, a numpy array if times is an array.

smoothed_hazard_
(bandwidth)¶ Parameters: bandwidth (float) – the bandwidth used in the Epanechnikov kernel. Returns: a DataFrame of the smoothed hazard Return type: DataFrame

smoothed_hazard_confidence_intervals_
(bandwidth, hazard_=None)¶ Parameters:  bandwidth (float) – the bandwidth to use in the Epanechnikov kernel. > 0
 hazard_ (numpy array) – a computed (n,) numpy array of estimated hazard rates. If None, uses
smoothed_hazard_

subtract
(other)¶ Subtract the estimates of two fitter objects of the same class.
Parameters: other (same object as self) Returns: Return type: DataFrame

survival_function_at_times
(times, label=None)¶
PiecewiseExponentialFitter¶

class
lifelines.fitters.piecewise_exponential_fitter.
PiecewiseExponentialFitter
(breakpoints, *args, **kwargs)¶ Bases:
lifelines.fitters.KnownModelParametericUnivariateFitter
This class implements a piecewise exponential model for univariate data. The model has the parameterized hazard rate:
\[\begin{split}h(t) = \begin{cases} 1/\lambda_0, & \text{if $t \le \tau_0$} \\ 1/\lambda_1 & \text{if $\tau_0 < t \le \tau_1$} \\ 1/\lambda_2 & \text{if $\tau_1 < t \le \tau_2$} \\ ... \end{cases}\end{split}\]You specify the breakpoints, \(\tau_i\), and lifelines will find the optimal values for the parameters.
After calling the .fit method, you have access to properties like:
survival_function_
,plot
,cumulative_hazard_
A summary of the fit is available with the method print_summary()
Parameters:  breakpoints (list) – a list of times when a new exponential model is constructed.
 alpha (float, optional (default=0.05)) – the level in the confidence intervals.
Important
The parameterization of this model changed in lifelines 0.19.1. Previously, the cumulative hazard looked like \(\lambda_i t\). The parameterization is now the reciprocal of \(\lambda_i\).

cumulative_hazard_
¶ The estimated cumulative hazard (with custom timeline if provided)
Type: DataFrame

hazard_
¶ The estimated hazard (with custom timeline if provided)
Type: DataFrame

survival_function_
¶ The estimated survival function (with custom timeline if provided)
Type: DataFrame

cumulative_density_
¶ The estimated cumulative density function (with custom timeline if provided)
Type: DataFrame

variance_matrix_
¶ The variance matrix of the coefficients
Type: numpy array

median_
¶ The median time to event
Type: float

lambda_i_
¶ The fitted parameters in the model, for i = 0, 1, …, n-1 breakpoints
Type: float

durations
¶ The durations provided
Type: array

event_observed
¶ The event_observed variable provided
Type: array

timeline
¶ The time line to use for plotting and indexing
Type: array

entry
¶ The entry array provided, or None
Type: array or None

breakpoints
¶ The provided breakpoints
Type: array

conditional_time_to_event_
¶ Return a DataFrame, with index equal to
survival_function_
’s index, that estimates the median duration remaining until the death event, given survival up until time t. For example, if an individual survives until age 1, their expected remaining lifetime given they lived to time 1 might be 9 years.
Returns: conditional_time_to_ Return type: DataFrame

confidence_interval_
¶ The confidence interval of the cumulative hazard. This is an alias for
confidence_interval_cumulative_hazard_
.

confidence_interval_cumulative_density_
¶ The lower and upper confidence intervals for the cumulative density

confidence_interval_cumulative_hazard_
¶ The confidence interval of the cumulative hazard. This is an alias for
confidence_interval_
.

confidence_interval_hazard_
¶ The confidence interval of the hazard.

confidence_interval_survival_function_
¶ The lower and upper confidence intervals for the survival function

cumulative_density_at_times
(times, label=None)¶ Return a Pandas series of the predicted cumulative density function (1survival function) at specific times.
Parameters:  times (iterable or float) – values to return the survival function at.
 label (string, optional) – Rename the series returned. Useful for plotting.
Returns: Return type: pd.Series

cumulative_hazard_at_times
(times, label=None)¶ Return a Pandas series of the predicted cumulative hazard value at specific times.
Parameters:  times (iterable or float) – values to return the cumulative hazard at.
 label (string, optional) – Rename the series returned. Useful for plotting.
Returns: Return type: pd.Series

divide
(other)¶ Divide the estimates of two fitter objects of the same class.
Parameters: other (same object as self) Returns: Return type: DataFrame

event_table
¶

fit
(durations, event_observed=None, timeline=None, label=None, alpha=None, ci_labels=None, show_progress=False, entry=None, weights=None, left_censorship=False, initial_point=None)¶ Parameters:  durations (an array, or pd.Series) – length n, duration subject was observed for
 event_observed (numpy array or pd.Series, optional) – length n, True if the death was observed, False if the event was lost (right-censored). Defaults to all True if event_observed is None
 timeline (list, optional) – return the estimate at the values in timeline (positively increasing)
 label (string, optional) – a string to name the column of the estimate.
 alpha (float, optional) – the alpha value in the confidence intervals. Overrides the initializing alpha for this call to fit only.
 ci_labels (list, optional) – add custom column names to the generated confidence intervals as a length-2 list: [<lower-bound name>, <upper-bound name>]. Default: <label>_lower_<alpha>
 show_progress (boolean, optional) – since this is an iterative fitting algorithm, switching this to True will display some iteration details.
 entry (an array, or pd.Series, of length n) – relative time when a subject entered the study. This is useful for left-truncated (not left-censored) observations. If None, all members of the population entered the study when they were “born”: time zero.
 weights (an array, or pd.Series, of length n) – integer weights per observation
 initial_point ((d,) numpy array, optional) – initialize the starting point of the iterative algorithm. Default is the zero vector.
Returns: self with new properties like
cumulative_hazard_
,survival_function_
Return type: self

fit_interval_censoring
(lower_bound, upper_bound, event_observed=None, timeline=None, label=None, alpha=None, ci_labels=None, show_progress=False, entry=None, weights=None, initial_point=None)¶ Fit the model to an interval censored dataset.
Parameters:  lower_bound (an array, or pd.Series) – length n, the start of the period the subject experienced the event in.
 upper_bound (an array, or pd.Series) – length n, the end of the period the subject experienced the event in. If the value is equal to the corresponding value in lower_bound, then the individual’s event was observed (not censored).
 event_observed (numpy array or pd.Series, optional) – length n; if left unspecified, inferred from lower_bound and upper_bound (if lower_bound == upper_bound then the event was observed; if lower_bound < upper_bound, then the event is censored)
 timeline (list, optional) – return the estimate at the values in timeline (positively increasing)
 label (string, optional) – a string to name the column of the estimate.
 alpha (float, optional) – the alpha value in the confidence intervals. Overrides the initializing alpha for this call to fit only.
 ci_labels (list, optional) – add custom column names to the generated confidence intervals as a length-2 list: [<lower-bound name>, <upper-bound name>]. Default: <label>_lower_<alpha>
 show_progress (boolean, optional) – since this is an iterative fitting algorithm, switching this to True will display some iteration details.
 entry (an array, or pd.Series, of length n) – relative time when a subject entered the study. This is useful for left-truncated (not left-censored) observations. If None, all members of the population entered the study when they were “born”: time zero.
 weights (an array, or pd.Series, of length n) – integer weights per observation
 initial_point ((d,) numpy array, optional) – initialize the starting point of the iterative algorithm. Default is the zero vector.
Returns: self with new properties like
cumulative_hazard_
,survival_function_
Return type: self

fit_left_censoring
(durations, event_observed=None, timeline=None, label=None, alpha=None, ci_labels=None, show_progress=False, entry=None, weights=None, initial_point=None)¶ Fit the model to a left-censored dataset.
Parameters:  durations (an array, or pd.Series) – length n, duration subject was observed for
 event_observed (numpy array or pd.Series, optional) – length n, True if the death was observed, False if the event was lost (right-censored). Defaults to all True if event_observed is None
 timeline (list, optional) – return the estimate at the values in timeline (positively increasing)
 label (string, optional) – a string to name the column of the estimate.
 alpha (float, optional) – the alpha value in the confidence intervals. Overrides the initializing alpha for this call to fit only.
 ci_labels (list, optional) – add custom column names to the generated confidence intervals as a length-2 list: [<lower-bound name>, <upper-bound name>]. Default: <label>_lower_<alpha>
 show_progress (boolean, optional) – since this is an iterative fitting algorithm, switching this to True will display some iteration details.
 entry (an array, or pd.Series, of length n) – relative time when a subject entered the study. This is useful for left-truncated (not left-censored) observations. If None, all members of the population entered the study when they were “born”: time zero.
 weights (an array, or pd.Series, of length n) – integer weights per observation
 initial_point ((d,) numpy array, optional) – initialize the starting point of the iterative algorithm. Default is the zero vector.
Returns: self with new properties like
cumulative_hazard_
,survival_function_
Return type: self

fit_right_censoring
(*args, **kwargs)¶ Alias for
fit
See also
fit

hazard_at_times
(times, label=None)¶ Return a Pandas series of the predicted hazard at specific times.
Parameters:  times (iterable or float) – values to return the hazard at.
 label (string, optional) – Rename the series returned. Useful for plotting.
Returns: Return type: pd.Series

median_
¶ Deprecated, use .median_survival_time_
Return the unique time point, t, such that S(t) = 0.5. This is the “half-life” of the population, and a robust summary statistic for the population, if it exists.

median_survival_time_
¶ Return the unique time point, t, such that S(t) = 0.5. This is the “half-life” of the population, and a robust summary statistic for the population, if it exists.

percentile
(p)¶ Return the unique time point, t, such that S(t) = p.
For known parametric models, this should be overwritten by something more accurate.

plot
(**kwargs)¶ Produce a pretty-plot of the estimate.

plot_cumulative_density
(**kwargs)¶

plot_cumulative_hazard
(**kwargs)¶

plot_hazard
(**kwargs)¶

plot_survival_function
(**kwargs)¶

predict
(times, interpolate=False)¶ Predict the estimate at certain points in time. Uses a linear interpolation if the points in time are not in the index.
Parameters:  times (scalar, or array) – a scalar or an array of times at which to predict the value of the estimate.
 interpolate (boolean, optional (default=False)) – for methods that produce a stepwise solution (Kaplan-Meier, Nelson-Aalen, etc.), setting this to True will use a linear interpolation to provide a smoother answer.
Returns: predictions
Return type: a scalar if times is a scalar, a numpy array if times is an array.

print_summary
(decimals=2, **kwargs)¶ Print summary statistics describing the fit, the coefficients, and the error bounds.
Parameters:  decimals (int, optional (default=2)) – specify the number of decimal places to show
 kwargs – print additional metadata in the output (useful to provide model names, dataset names, etc.) when comparing multiple outputs.

subtract
(other)¶ Subtract the estimates of two fitter objects of the same class.
Parameters: other (same object as self) Returns: Return type: DataFrame

summary
¶ Summary statistics describing the fit.
Returns: df – Contains columns coef, exp(coef), se(coef), z, p, lower, upper Return type: pd.DataFrame See also
print_summary

survival_function_at_times
(times, label=None)¶ Return a Pandas series of the predicted survival value at specific times.
Parameters:  times (iterable or float) – values to return the survival function at.
 label (string, optional) – Rename the series returned. Useful for plotting.
Returns: Return type: pd.Series
WeibullFitter¶

class
lifelines.fitters.weibull_fitter.
WeibullFitter
(*args, **kwargs)¶ Bases:
lifelines.fitters.KnownModelParametericUnivariateFitter
This class implements a Weibull model for univariate data. The model has parameterized form:
\[S(t) = \exp\left(-\left(\frac{t}{\lambda}\right)^\rho\right), \lambda > 0, \rho > 0,\]which implies the cumulative hazard rate is
\[H(t) = \left(\frac{t}{\lambda}\right)^\rho,\]and the hazard rate is:
\[h(t) = \frac{\rho}{\lambda}\left(\frac{t}{\lambda}\right)^{\rho - 1}\]After calling the .fit method, you have access to properties like:
cumulative_hazard_
,survival_function_
,lambda_
and rho_
. A summary of the fit is available with the method print_summary().
Parameters: alpha (float, optional (default=0.05)) – the level in the confidence intervals.
Important
The parameterization of this model changed in lifelines 0.19.0. Previously, the cumulative hazard looked like \((\lambda t)^\rho\). The parameterization is now the reciprocal of \(\lambda\).
Examples
>>> from lifelines import WeibullFitter
>>> from lifelines.datasets import load_waltons
>>> waltons = load_waltons()
>>> wbf = WeibullFitter()
>>> wbf.fit(waltons['T'], waltons['E'])
>>> wbf.plot()
>>> print(wbf.lambda_)

cumulative_hazard_
¶ The estimated cumulative hazard (with custom timeline if provided)
Type: DataFrame

hazard_
¶ The estimated hazard (with custom timeline if provided)
Type: DataFrame

survival_function_
¶ The estimated survival function (with custom timeline if provided)
Type: DataFrame

cumulative_density_
¶ The estimated cumulative density function (with custom timeline if provided)
Type: DataFrame

variance_matrix_
¶ The variance matrix of the coefficients
Type: numpy array

median_
¶ The median time to event
Type: float

lambda_
¶ The fitted parameter in the model
Type: float

rho_
¶ The fitted parameter in the model
Type: float

durations
¶ The durations provided
Type: array

event_observed
¶ The event_observed variable provided
Type: array

timeline
¶ The time line to use for plotting and indexing
Type: array

entry
¶ The entry array provided, or None
Type: array or None

conditional_time_to_event_
¶ Return a DataFrame, with index equal to
survival_function_
’s index, that estimates the median duration remaining until the death event, given survival up until time t. For example, if an individual survives until age 1, their expected remaining lifetime given they lived to time 1 might be 9 years.
Returns: conditional_time_to_ Return type: DataFrame

confidence_interval_
¶ The confidence interval of the cumulative hazard. This is an alias for
confidence_interval_cumulative_hazard_
.

confidence_interval_cumulative_density_
¶ The lower and upper confidence intervals for the cumulative density

confidence_interval_cumulative_hazard_
¶ The confidence interval of the cumulative hazard. This is an alias for
confidence_interval_
.

confidence_interval_hazard_
¶ The confidence interval of the hazard.

confidence_interval_survival_function_
¶ The lower and upper confidence intervals for the survival function

cumulative_density_at_times
(times, label=None)¶ Return a Pandas series of the predicted cumulative density function (1survival function) at specific times.
Parameters:  times (iterable or float) – values to return the survival function at.
 label (string, optional) – Rename the series returned. Useful for plotting.
Returns: Return type: pd.Series

cumulative_hazard_at_times
(times, label=None)¶ Return a Pandas series of the predicted cumulative hazard value at specific times.
Parameters:  times (iterable or float) – values to return the cumulative hazard at.
 label (string, optional) – Rename the series returned. Useful for plotting.
Returns: Return type: pd.Series

divide
(other)¶ Divide the estimates of two fitter objects of the same class.
Parameters: other (same object as self) Returns: Return type: DataFrame

event_table
¶

fit
(durations, event_observed=None, timeline=None, label=None, alpha=None, ci_labels=None, show_progress=False, entry=None, weights=None, left_censorship=False, initial_point=None)¶ Parameters:  durations (an array, or pd.Series) – length n, duration subject was observed for
 event_observed (numpy array or pd.Series, optional) – length n, True if the death was observed, False if the event was lost (right-censored). Defaults to all True if event_observed is None
 timeline (list, optional) – return the estimate at the values in timeline (positively increasing)
 label (string, optional) – a string to name the column of the estimate.
 alpha (float, optional) – the alpha value in the confidence intervals. Overrides the initializing alpha for this call to fit only.
 ci_labels (list, optional) – add custom column names to the generated confidence intervals as a length-2 list: [<lower-bound name>, <upper-bound name>]. Default: <label>_lower_<alpha>
 show_progress (boolean, optional) – since this is an iterative fitting algorithm, switching this to True will display some iteration details.
 entry (an array, or pd.Series, of length n) – relative time when a subject entered the study. This is useful for left-truncated (not left-censored) observations. If None, all members of the population entered the study when they were “born”: time zero.
 weights (an array, or pd.Series, of length n) – integer weights per observation
 initial_point ((d,) numpy array, optional) – initialize the starting point of the iterative algorithm. Default is the zero vector.
Returns: self with new properties like
cumulative_hazard_
,survival_function_
Return type: self

fit_interval_censoring
(lower_bound, upper_bound, event_observed=None, timeline=None, label=None, alpha=None, ci_labels=None, show_progress=False, entry=None, weights=None, initial_point=None)¶ Fit the model to an interval censored dataset.
Parameters:  lower_bound (an array, or pd.Series) – length n, the start of the period the subject experienced the event in.
 upper_bound (an array, or pd.Series) – length n, the end of the period the subject experienced the event in. If the value is equal to the corresponding value in lower_bound, then the individual’s event was observed (not censored).
 event_observed (numpy array or pd.Series, optional) – length n; if left unspecified, inferred from lower_bound and upper_bound (if lower_bound == upper_bound then the event was observed; if lower_bound < upper_bound, then the event is censored)
 timeline (list, optional) – return the estimate at the values in timeline (positively increasing)
 label (string, optional) – a string to name the column of the estimate.
 alpha (float, optional) – the alpha value in the confidence intervals. Overrides the initializing alpha for this call to fit only.
 ci_labels (list, optional) – add custom column names to the generated confidence intervals as a length-2 list: [<lower-bound name>, <upper-bound name>]. Default: <label>_lower_<alpha>
 show_progress (boolean, optional) – since this is an iterative fitting algorithm, switching this to True will display some iteration details.
 entry (an array, or pd.Series, of length n) – relative time when a subject entered the study. This is useful for left-truncated (not left-censored) observations. If None, all members of the population entered the study when they were “born”: time zero.
 weights (an array, or pd.Series, of length n) – integer weights per observation
 initial_point ((d,) numpy array, optional) – initialize the starting point of the iterative algorithm. Default is the zero vector.
Returns: self with new properties like
cumulative_hazard_
,survival_function_
Return type: self
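The bound-inference rule above (event observed when lower_bound equals upper_bound, censored when it is strictly smaller) can be sketched in plain Python; infer_event_observed is a hypothetical helper, not part of lifelines:

```python
def infer_event_observed(lower_bound, upper_bound):
    """Infer event indicators from interval bounds, mirroring the rule above:
    lower == upper means the event time was observed exactly; lower < upper
    means the event is only known to lie inside the interval (censored)."""
    return [lo == hi for lo, hi in zip(lower_bound, upper_bound)]

# Subjects 1 and 3 have exact event times; subject 2 is interval censored.
lower = [4.0, 5.0, 7.0]
upper = [4.0, 8.0, 7.0]
print(infer_event_observed(lower, upper))  # [True, False, True]
```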

fit_left_censoring
(durations, event_observed=None, timeline=None, label=None, alpha=None, ci_labels=None, show_progress=False, entry=None, weights=None, initial_point=None)¶ Fit the model to a left-censored dataset
Parameters:  durations (an array, or pd.Series) – length n, duration subject was observed for
 event_observed (numpy array or pd.Series, optional) – length n, True if the death was observed, False if the event was lost (right-censored). Defaults all True if event_observed==None
 timeline (list, optional) – return the estimate at the values in timeline (positively increasing)
 label (string, optional) – a string to name the column of the estimate.
 alpha (float, optional) – the alpha value in the confidence intervals. Overrides the initializing alpha for this call to fit only.
 ci_labels (list, optional) – add custom column names to the generated confidence intervals as a length-2 list: [<lower-bound name>, <upper-bound name>]. Default: <label>_lower_<alpha>
 show_progress (boolean, optional) – since this is an iterative fitting algorithm, switching this to True will display some iteration details.
 entry (an array, or pd.Series, of length n) – relative time when a subject entered the study. This is useful for left-truncated (not left-censored) observations. If None, all members of the population entered study when they were “born”: time zero.
 weights (an array, or pd.Series, of length n) – integer weights per observation
 initial_point ((d,) numpy array, optional) – initialize the starting point of the iterative algorithm. Default is the zero vector.
Returns: self with new properties like
cumulative_hazard_
,survival_function_
Return type: self

fit_right_censoring
(*args, **kwargs)¶ Alias for
fit
See also
fit

hazard_at_times
(times, label=None)¶ Return a Pandas series of the predicted hazard at specific times.
Parameters:  times (iterable or float) – values to return the hazard at.
 label (string, optional) – Rename the series returned. Useful for plotting.
Returns: Return type: pd.Series

median_
Deprecated, use .median_survival_time_
Return the unique time point, t, such that S(t) = 0.5. This is the “half-life” of the population, and a robust summary statistic for the population, if it exists.

median_survival_time_
¶ Return the unique time point, t, such that S(t) = 0.5. This is the “half-life” of the population, and a robust summary statistic for the population, if it exists.

percentile
(p)¶ Return the unique time point, t, such that S(t) = p.
For known parametric models, this should be overwritten by something more accurate.
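For a parametric model the percentile can indeed be computed exactly by inverting the survival function. A minimal sketch for a Weibull survival function S(t) = exp(-(t / lambda)^rho) (lifelines' Weibull parameterization); the helper name is illustrative:

```python
import math

def weibull_percentile(p, lambda_, rho_):
    """Invert S(t) = exp(-(t / lambda_) ** rho_) to find t with S(t) = p."""
    return lambda_ * (-math.log(p)) ** (1.0 / rho_)

# The median (p = 0.5) of a Weibull with lambda_ = 10, rho_ = 1 (an Exponential):
t = weibull_percentile(0.5, 10.0, 1.0)
# the survival function evaluated at the returned time should be 0.5
assert abs(math.exp(-(t / 10.0) ** 1.0) - 0.5) < 1e-12
```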

plot
(**kwargs)¶ Produce a pretty-plot of the estimate.

plot_cumulative_density
(**kwargs)¶

plot_cumulative_hazard
(**kwargs)¶

plot_hazard
(**kwargs)¶

plot_survival_function
(**kwargs)¶

predict
(times, interpolate=False)¶ Predict the {0} at certain points in time. Uses a linear interpolation if points in time are not in the index.
Parameters:  times (scalar, or array) – a scalar or an array of times to predict the value of {0} at.
 interpolate (boolean, optional (default=False)) – for methods that produce a stepwise solution (KaplanMeier, NelsonAalen, etc), turning this to True will use a linear interpolation method to provide a more “smooth” answer.
Returns: predictions
Return type: a scalar if time is a scalar, a numpy array if time in an array.
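The two behaviours (step function vs. linear interpolation) can be sketched for an estimate stored as sorted (time, value) pairs; predict_at is a hypothetical helper, not lifelines' implementation:

```python
def predict_at(curve, t, interpolate=False):
    """Evaluate a stepwise estimate at time t.  Without interpolation the
    value is carried forward from the last knot at or before t; with
    interpolation, values are linearly interpolated between knots."""
    times = [x for x, _ in curve]
    vals = [y for _, y in curve]
    if t <= times[0]:
        return vals[0]
    for i in range(1, len(times)):
        if t < times[i]:
            if not interpolate:
                return vals[i - 1]  # step function: carry the last value forward
            w = (t - times[i - 1]) / (times[i] - times[i - 1])
            return vals[i - 1] + w * (vals[i] - vals[i - 1])
    return vals[-1]

km = [(0, 1.0), (2, 0.8), (4, 0.5)]
print(predict_at(km, 3))                    # 0.8 (step)
print(predict_at(km, 3, interpolate=True))  # 0.65 (halfway between 0.8 and 0.5)
```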

print_summary
(decimals=2, **kwargs)¶ Print summary statistics describing the fit, the coefficients, and the error bounds.
Parameters:  decimals (int, optional (default=2)) – specify the number of decimal places to show
 kwargs – print additional metadata in the output (useful to provide model names, dataset names, etc.) when comparing multiple outputs.

subtract
(other)¶ Subtract the {0} of two {1} objects.
Parameters: other (same object as self) Returns: Return type: DataFrame

summary
¶ Summary statistics describing the fit.
Returns: df – Contains columns coef, exp(coef), se(coef), z, p, lower, upper Return type: pd.DataFrame See also
print_summary

survival_function_at_times
(times, label=None)¶ Return a Pandas series of the predicted survival value at specific times.
Parameters:  times (iterable or float) – values to return the survival function at.
 label (string, optional) – Rename the series returned. Useful for plotting.
Returns: Return type: pd.Series

GeneralizedGammaFitter¶

class
lifelines.fitters.generalized_gamma_fitter.
GeneralizedGammaFitter
(*args, **kwargs)¶ Bases:
lifelines.fitters.KnownModelParametericUnivariateFitter
This class implements a Generalized Gamma model for univariate data. The model has parameterized form:
The survival function is:
\[\begin{split}S(t)=\left\{ \begin{array}{} 1-{{\Gamma}_{RL}}\left( \tfrac{1}{{{\lambda }^{2}}};\tfrac{{{e}^{\lambda \left( \tfrac{\text{ln}(t)-\mu }{\sigma } \right)}}}{{{\lambda }^{2}}} \right)\text{ if }\lambda >0 \\ {{\Gamma}_{RL}}\left( \tfrac{1}{{{\lambda }^{2}}};\tfrac{{{e}^{\lambda \left( \tfrac{\text{ln}(t)-\mu }{\sigma } \right)}}}{{{\lambda }^{2}}} \right)\text{ if }\lambda < 0 \\ \end{array} \right.\,\!\end{split}\]where \(\Gamma_{RL}\) is the regularized lower incomplete Gamma function.
This model has the Exponential, Weibull, Gamma and LogNormal as submodels, and thus can be used as a way to test which model to use:
 When \(\lambda = 1\) and \(\sigma = 1\), then the data is Exponential.
 When \(\lambda = 1\) then the data is Weibull.
 When \(\sigma = \lambda\) then the data is Gamma.
 When \(\lambda = 0\) then the data is LogNormal.
 When \(\lambda = -1\) then the data is InverseWeibull.
 When \(\sigma = -\lambda\) then the data is InverseGamma.
After calling the .fit method, you have access to properties like:
cumulative_hazard_
,survival_function_
, A summary of the fit is available with the methodprint_summary()
.Important
The parameterization implemented has \(\log\sigma\), thus there is a ln_sigma_ in the output. Exponentiate this parameter to recover \(\sigma\).
Important
This model is experimental. Its API may change in the future. Also, its convergence is not very stable.
Parameters: alpha (float, optional (default=0.05)) – the level in the confidence intervals. Examples
>>> from lifelines import GeneralizedGammaFitter
>>> from lifelines.datasets import load_waltons
>>> waltons = load_waltons()
>>> ggf = GeneralizedGammaFitter()
>>> ggf.fit(waltons['T'], waltons['E'])
>>> ggf.plot()
>>> ggf.summary

cumulative_hazard_
¶ The estimated cumulative hazard (with custom timeline if provided)
Type: DataFrame

hazard_
¶ The estimated hazard (with custom timeline if provided)
Type: DataFrame

survival_function_
¶ The estimated survival function (with custom timeline if provided)
Type: DataFrame

cumulative_density_
¶ The estimated cumulative density function (with custom timeline if provided)
Type: DataFrame

variance_matrix_
¶ The variance matrix of the coefficients
Type: numpy array

median_
¶ The median time to event
Type: float

lambda_
¶ The fitted parameter in the model
Type: float

rho_
¶ The fitted parameter in the model
Type: float

alpha_
¶ The fitted parameter in the model
Type: float

durations
¶ The durations provided
Type: array

event_observed
¶ The event_observed variable provided
Type: array

timeline
¶ The time line to use for plotting and indexing
Type: array

entry
¶ The entry array provided, or None
Type: array or None

conditional_time_to_event_
¶ Return a DataFrame, with index equal to
survival_function_
’s index, that estimates the median duration remaining until the death event, given survival up until time t. For example, if an individual exists until age 1, their expected life remaining given they lived to time 1 might be 9 years.Returns: conditional_time_to_ Return type: DataFrame
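The quantity can be sketched directly from a discrete survival curve: for each time t, find the smallest additional duration s with S(t + s) / S(t) <= 0.5. A toy version, assuming the survival function is given as a sorted list of (time, probability) pairs; the helper is hypothetical:

```python
def median_time_remaining(surv, t):
    """Smallest s such that S(t + s) / S(t) <= 0.5, on a discrete curve.
    surv is a sorted list of (time, survival_probability) pairs."""
    # survival probability at the last knot at or before t
    s_t = next(p for time, p in reversed(surv) if time <= t)
    for time, p in surv:
        if time >= t and p / s_t <= 0.5:
            return time - t
    return float("inf")  # median remaining life not reached on this timeline

curve = [(0, 1.0), (1, 0.9), (2, 0.6), (3, 0.4), (4, 0.1)]
print(median_time_remaining(curve, 1))  # 2, since S(3)/S(1) = 0.4/0.9 <= 0.5
```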

confidence_interval_
¶ The confidence interval of the cumulative hazard. This is an alias for
confidence_interval_cumulative_hazard_
.

confidence_interval_cumulative_density_
¶ The lower and upper confidence intervals for the cumulative density

confidence_interval_cumulative_hazard_
¶ The confidence interval of the cumulative hazard. This is an alias for
confidence_interval_
.

confidence_interval_hazard_
¶ The confidence interval of the hazard.

confidence_interval_survival_function_
¶ The lower and upper confidence intervals for the survival function

cumulative_density_at_times
(times, label=None)¶ Return a Pandas series of the predicted cumulative density function (1 - survival function) at specific times.
Parameters:  times (iterable or float) – values to return the survival function at.
 label (string, optional) – Rename the series returned. Useful for plotting.
Returns: Return type: pd.Series

cumulative_hazard_at_times
(times, label=None)¶ Return a Pandas series of the predicted cumulative hazard value at specific times.
Parameters:  times (iterable or float) – values to return the cumulative hazard at.
 label (string, optional) – Rename the series returned. Useful for plotting.
Returns: Return type: pd.Series

divide
(other)¶ Divide the {0} of two {1} objects.
Parameters: other (same object as self) Returns: Return type: DataFrame

event_table
¶

fit
(durations, event_observed=None, timeline=None, label=None, alpha=None, ci_labels=None, show_progress=False, entry=None, weights=None, left_censorship=False, initial_point=None)¶ Parameters:  durations (an array, or pd.Series) – length n, duration subject was observed for
 event_observed (numpy array or pd.Series, optional) – length n, True if the death was observed, False if the event was lost (right-censored). Defaults all True if event_observed==None
 timeline (list, optional) – return the estimate at the values in timeline (positively increasing)
 label (string, optional) – a string to name the column of the estimate.
 alpha (float, optional) – the alpha value in the confidence intervals. Overrides the initializing alpha for this call to fit only.
 ci_labels (list, optional) – add custom column names to the generated confidence intervals as a length-2 list: [<lower-bound name>, <upper-bound name>]. Default: <label>_lower_<alpha>
 show_progress (boolean, optional) – since this is an iterative fitting algorithm, switching this to True will display some iteration details.
 entry (an array, or pd.Series, of length n) – relative time when a subject entered the study. This is useful for left-truncated (not left-censored) observations. If None, all members of the population entered study when they were “born”: time zero.
 weights (an array, or pd.Series, of length n) – integer weights per observation
 initial_point ((d,) numpy array, optional) – initialize the starting point of the iterative algorithm. Default is the zero vector.
Returns: self with new properties like
cumulative_hazard_
,survival_function_
Return type: self

fit_interval_censoring
(lower_bound, upper_bound, event_observed=None, timeline=None, label=None, alpha=None, ci_labels=None, show_progress=False, entry=None, weights=None, initial_point=None)¶ Fit the model to an interval censored dataset.
Parameters:  lower_bound (an array, or pd.Series) – length n, the start of the period the subject experienced the event in.
 upper_bound (an array, or pd.Series) – length n, the end of the period the subject experienced the event in. If the value is equal to the corresponding value in lower_bound, then the individual’s event was observed (not censored).
 event_observed (numpy array or pd.Series, optional) – length n, if left optional, infer from
lower_bound
and upper_bound
(if lower_bound==upper_bound then event observed, if lower_bound < upper_bound, then event censored)  timeline (list, optional) – return the estimate at the values in timeline (positively increasing)
 label (string, optional) – a string to name the column of the estimate.
 alpha (float, optional) – the alpha value in the confidence intervals. Overrides the initializing alpha for this call to fit only.
 ci_labels (list, optional) – add custom column names to the generated confidence intervals as a length-2 list: [<lower-bound name>, <upper-bound name>]. Default: <label>_lower_<alpha>
 show_progress (boolean, optional) – since this is an iterative fitting algorithm, switching this to True will display some iteration details.
 entry (an array, or pd.Series, of length n) – relative time when a subject entered the study. This is useful for left-truncated (not left-censored) observations. If None, all members of the population entered study when they were “born”: time zero.
 weights (an array, or pd.Series, of length n) – integer weights per observation
 initial_point ((d,) numpy array, optional) – initialize the starting point of the iterative algorithm. Default is the zero vector.
Returns: self with new properties like
cumulative_hazard_
,survival_function_
Return type: self

fit_left_censoring
(durations, event_observed=None, timeline=None, label=None, alpha=None, ci_labels=None, show_progress=False, entry=None, weights=None, initial_point=None)¶ Fit the model to a left-censored dataset
Parameters:  durations (an array, or pd.Series) – length n, duration subject was observed for
 event_observed (numpy array or pd.Series, optional) – length n, True if the death was observed, False if the event was lost (right-censored). Defaults all True if event_observed==None
 timeline (list, optional) – return the estimate at the values in timeline (positively increasing)
 label (string, optional) – a string to name the column of the estimate.
 alpha (float, optional) – the alpha value in the confidence intervals. Overrides the initializing alpha for this call to fit only.
 ci_labels (list, optional) – add custom column names to the generated confidence intervals as a length-2 list: [<lower-bound name>, <upper-bound name>]. Default: <label>_lower_<alpha>
 show_progress (boolean, optional) – since this is an iterative fitting algorithm, switching this to True will display some iteration details.
 entry (an array, or pd.Series, of length n) – relative time when a subject entered the study. This is useful for left-truncated (not left-censored) observations. If None, all members of the population entered study when they were “born”: time zero.
 weights (an array, or pd.Series, of length n) – integer weights per observation
 initial_point ((d,) numpy array, optional) – initialize the starting point of the iterative algorithm. Default is the zero vector.
Returns: self with new properties like
cumulative_hazard_
,survival_function_
Return type: self

fit_right_censoring
(*args, **kwargs)¶ Alias for
fit
See also
fit

hazard_at_times
(times, label=None)¶ Return a Pandas series of the predicted hazard at specific times.
Parameters:  times (iterable or float) – values to return the hazard at.
 label (string, optional) – Rename the series returned. Useful for plotting.
Returns: Return type: pd.Series

median_
Deprecated, use .median_survival_time_
Return the unique time point, t, such that S(t) = 0.5. This is the “half-life” of the population, and a robust summary statistic for the population, if it exists.

median_survival_time_
¶ Return the unique time point, t, such that S(t) = 0.5. This is the “half-life” of the population, and a robust summary statistic for the population, if it exists.

percentile
(p)¶ Return the unique time point, t, such that S(t) = p.
For known parametric models, this should be overwritten by something more accurate.

plot
(**kwargs)¶ Produce a pretty-plot of the estimate.

plot_cumulative_density
(**kwargs)¶

plot_cumulative_hazard
(**kwargs)¶

plot_hazard
(**kwargs)¶

plot_survival_function
(**kwargs)¶

predict
(times, interpolate=False)¶ Predict the {0} at certain points in time. Uses a linear interpolation if points in time are not in the index.
Parameters:  times (scalar, or array) – a scalar or an array of times to predict the value of {0} at.
 interpolate (boolean, optional (default=False)) – for methods that produce a stepwise solution (KaplanMeier, NelsonAalen, etc), turning this to True will use a linear interpolation method to provide a more “smooth” answer.
Returns: predictions
Return type: a scalar if time is a scalar, a numpy array if time in an array.

print_summary
(decimals=2, **kwargs)¶ Print summary statistics describing the fit, the coefficients, and the error bounds.
Parameters:  decimals (int, optional (default=2)) – specify the number of decimal places to show
 kwargs – print additional metadata in the output (useful to provide model names, dataset names, etc.) when comparing multiple outputs.

subtract
(other)¶ Subtract the {0} of two {1} objects.
Parameters: other (same object as self) Returns: Return type: DataFrame

summary
¶ Summary statistics describing the fit.
Returns: df – Contains columns coef, exp(coef), se(coef), z, p, lower, upper Return type: pd.DataFrame See also
print_summary

survival_function_at_times
(times, label=None)¶ Return a Pandas series of the predicted survival value at specific times.
Parameters:  times (iterable or float) – values to return the survival function at.
 label (string, optional) – Rename the series returned. Useful for plotting.
Returns: Return type: pd.Series
Regression Models¶
AalenAdditiveFitter¶

class
lifelines.fitters.aalen_additive_fitter.
AalenAdditiveFitter
(fit_intercept=True, alpha=0.05, coef_penalizer=0.0, smoothing_penalizer=0.0)¶ Bases:
lifelines.fitters.BaseFitter
This class fits the regression model:
\[h(t|x) = b_0(t) + b_1(t) x_1 + ... + b_N(t) x_N\]that is, the hazard rate is a linear function of the covariates with time-varying coefficients. This implementation assumes non-time-varying covariates, see
TODO: name
Note
This class was rewritten in lifelines 0.17.0 to focus solely on static datasets. There is no guarantee of backwards compatibility.
Parameters:  fit_intercept (bool, optional (default: True)) – If False, do not attach an intercept (column of ones) to the covariate matrix. The intercept, \(b_0(t)\) acts as a baseline hazard.
 alpha (float, optional (default=0.05)) – the level in the confidence intervals.
 coef_penalizer (float, optional (default: 0)) – Attach an L2 penalizer to the size of the coefficients during regression. This improves stability of the estimates and controls for high correlation between covariates. For example, this shrinks the absolute value of \(c_{i,t}\).
 smoothing_penalizer (float, optional (default: 0)) – Attach an L2 penalizer to the difference between adjacent (over time) coefficients. For example, this shrinks the absolute value of \(c_{i,t} - c_{i,t+1}\).
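The two penalties can be written down directly as functions of the time-varying coefficient matrix (rows indexed by time, columns by covariate). A toy computation with hypothetical helpers, not the internal objective:

```python
def coef_penalty(coefs, penalizer):
    """L2 penalty on coefficient magnitudes: shrinks each c_{i,t} toward 0."""
    return penalizer * sum(c ** 2 for row in coefs for c in row)

def smoothing_penalty(coefs, penalizer):
    """L2 penalty on adjacent-in-time differences: shrinks c_{i,t} - c_{i,t+1}."""
    return penalizer * sum(
        (coefs[t][i] - coefs[t + 1][i]) ** 2
        for t in range(len(coefs) - 1)
        for i in range(len(coefs[0]))
    )

# Coefficients for two covariates over three time points.
b = [[1.0, 0.0], [1.5, 0.2], [1.4, 0.1]]
print(coef_penalty(b, 0.1))       # penalizes large coefficients
print(smoothing_penalty(b, 0.1))  # penalizes jumps between adjacent rows
```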

cumulative_hazards_
¶ The estimated cumulative hazard
Type: DataFrame

hazards_
¶ The estimated hazards
Type: DataFrame

confidence_intervals_
¶ The lower and upper confidence intervals for the cumulative hazard
Type: DataFrame

durations
¶ The durations provided
Type: array

event_observed
¶ The event_observed variable provided
Type: array

weights
¶ The weights variable provided
Type: array

fit
(df, duration_col, event_col=None, weights_col=None, show_progress=False)¶ Fit the Aalen Additive model to a dataset.
Parameters:  df (DataFrame) – a Pandas DataFrame with necessary columns duration_col and event_col (see below), covariates columns, and special columns (weights). duration_col refers to the lifetimes of the subjects. event_col refers to whether the ‘death’ events was observed: 1 if observed, 0 else (censored).
 duration_col (string) – the name of the column in DataFrame that contains the subjects’ lifetimes.
 event_col (string, optional) – the name of the column in DataFrame that contains the subjects’ death observation. If left as None, assume all individuals are uncensored.
 weights_col (string, optional) – an optional column in the DataFrame, df, that denotes the weight per subject. This column is excluded from the covariates and used only as a weight in the final regression. Default weight is 1. A weight of 2, for example, means there were two subjects with identical observations. This can be used for case weights or sampling weights.
 show_progress (boolean, optional (default=False)) – Since the fitter is iterative, show iteration number.
Returns: self – self with additional new properties:
cumulative_hazards_
, etc.Return type: self
Examples
>>> from lifelines import AalenAdditiveFitter
>>>
>>> df = pd.DataFrame({
>>>     'T': [5, 3, 9, 8, 7, 4, 4, 3, 2, 5, 6, 7],
>>>     'E': [1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0],
>>>     'var': [0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2],
>>>     'age': [4, 3, 9, 8, 7, 4, 4, 3, 2, 5, 6, 7],
>>> })
>>>
>>> aaf = AalenAdditiveFitter()
>>> aaf.fit(df, 'T', 'E')
>>> aaf.predict_median(df)
>>> aaf.print_summary()

fit_right_censoring
(*args, **kwargs)¶ Alias for
fit
See also
fit

plot
(columns=None, loc=None, iloc=None, **kwargs)¶ A wrapper around plotting. Matplotlib plot arguments can be passed in, plus:
Parameters: columns (string or list-like, optional) – If not empty, plot a subset of columns from the
cumulative_hazards_
. Default all.
loc, iloc (slice, optional) –
 specify a location-based subsection of the curves to plot, ex:
.plot(iloc=slice(0,10))
will plot the first 10 time points.

predict_cumulative_hazard
(X)¶ Returns the cumulative hazard for the individuals
Parameters: X (a (n,d) covariate numpy array or DataFrame) – If a DataFrame, columns can be in any order. If a numpy array, columns must be in the same order as the training data.

predict_expectation
(X)¶ Compute the expected lifetime, E[T], using covariates X.
Parameters:  X (a (n,d) covariate numpy array or DataFrame) – If a DataFrame, columns can be in any order. If a numpy array, columns must be in the same order as the training data.
 Returns the expected lifetimes for the individuals

predict_median
(X)¶ Parameters:  X (a (n,d) covariate numpy array or DataFrame) – If a DataFrame, columns can be in any order. If a numpy array, columns must be in the same order as the training data.
 Returns the median lifetimes for the individuals

predict_percentile
(X, p=0.5)¶ Returns the lifetimes at percentile p for the individuals; p=0.5 gives the median. http://stats.stackexchange.com/questions/102986/percentile-loss-functions
Parameters:  X (a (n,d) covariate numpy array or DataFrame) – If a DataFrame, columns can be in any order. If a numpy array, columns must be in the same order as the training data.
 p (float) – default: 0.5

predict_survival_function
(X, times=None)¶ Returns the survival functions for the individuals
Parameters:  X (a (n,d) covariate numpy array or DataFrame) – If a DataFrame, columns can be in any order. If a numpy array, columns must be in the same order as the training data.
 times – Not implemented yet

print_summary
(decimals=2, **kwargs)¶ Print summary statistics describing the fit, the coefficients, and the error bounds.
Parameters:  decimals (int, optional (default=2)) – specify the number of decimal places to show
 kwargs – print additional meta data in the output (useful to provide model names, dataset names, etc.) when comparing multiple outputs.

score_
¶ The concordance score (also known as the c-index) of the fit. The c-index is a generalization of the ROC AUC to survival data, and accounts for censoring.
For this purpose, the
score_
is a measure of the predictive accuracy of the fitted model onto the training dataset. It’s analogous to the R^2 in linear models.
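A naive O(n²) sketch of the c-index over comparable pairs (lifelines' own implementation is much faster; the function name and score convention here are illustrative):

```python
def concordance_index(event_times, predicted_scores, event_observed):
    """Naive c-index: among comparable pairs (the subject with the earlier
    time had an observed event), count pairs where a higher predicted score
    goes with the longer survival; ties in score count one half."""
    concordant = ties = comparable = 0
    n = len(event_times)
    for i in range(n):
        for j in range(n):
            if event_times[i] < event_times[j] and event_observed[i]:
                comparable += 1
                if predicted_scores[i] < predicted_scores[j]:
                    concordant += 1
                elif predicted_scores[i] == predicted_scores[j]:
                    ties += 1
    return (concordant + 0.5 * ties) / comparable

# Predictions perfectly ordered with the outcomes give a c-index of 1.0.
print(concordance_index([1, 2, 3], [1.0, 2.0, 3.0], [1, 1, 1]))  # 1.0
```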

smoothed_hazards_
(bandwidth=1)¶ Smooth the hazard function using the Epanechnikov kernel, with the given sigma/bandwidth
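Kernel smoothing of discrete hazard increments can be sketched as follows; the function names and the exact normalization are illustrative, not lifelines' internals:

```python
def epanechnikov(u):
    """Epanechnikov kernel: K(u) = 0.75 * (1 - u^2) on [-1, 1], else 0."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0

def smoothed_hazard(times, hazard_increments, t, bandwidth=1.0):
    """Kernel-smoothed hazard at time t: weight each discrete hazard
    increment by the kernel, scaled by the bandwidth."""
    return sum(
        epanechnikov((t - ti) / bandwidth) * dh
        for ti, dh in zip(times, hazard_increments)
    ) / bandwidth

# Smooth four discrete hazard increments at t = 2.5 with bandwidth 2.
times = [1, 2, 3, 4]
dh = [0.1, 0.2, 0.15, 0.3]
print(smoothed_hazard(times, dh, 2.5, bandwidth=2.0))
```

A larger bandwidth averages over more increments, trading variance for bias.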

summary
¶ Summary statistics describing the fit.
Returns: df Return type: DataFrame
CoxTimeVaryingFitter¶

class
lifelines.fitters.cox_time_varying_fitter.
CoxTimeVaryingFitter
(alpha=0.05, penalizer=0.0, strata=None)¶ Bases:
lifelines.fitters.BaseFitter
This class implements fitting Cox’s timevarying proportional hazard model:
\[h(t|x(t)) = h_0(t)\exp(x(t)'\beta)\]Parameters:  alpha (float, optional (default=0.05)) – the level in the confidence intervals.
 penalizer (float, optional) – the coefficient of an L2 penalizer in the regression

params_
¶ The estimated coefficients. Changed in version 0.22.0: used to be
.hazards_
Type: Series

hazard_ratios_
¶ The exp(coefficients)
Type: Series

confidence_intervals_
¶ The lower and upper confidence intervals for the hazard coefficients
Type: DataFrame

event_observed
¶ The event_observed variable provided
Type: Series

weights
¶ The weights variable provided
Type: Series

variance_matrix_
¶ The variance matrix of the coefficients
Type: numpy array

strata
¶ the strata provided
Type: list

standard_errors_
¶ the standard errors of the estimates
Type: Series

baseline_cumulative_hazard_
¶ Type: DataFrame

baseline_survival_
¶ Type: DataFrame

fit
(df, id_col, event_col, start_col='start', stop_col='stop', weights_col=None, show_progress=False, step_size=None, robust=False, strata=None, initial_point=None)¶ Fit the Cox Proportional Hazard model to a time varying dataset. Tied survival times are handled using Efron’s tie-method.
Parameters:  df (DataFrame) – a Pandas DataFrame with necessary columns duration_col and event_col, plus other covariates. duration_col refers to the lifetimes of the subjects. event_col refers to whether the ‘death’ events was observed: 1 if observed, 0 else (censored).
 id_col (string) – A subject could have multiple rows in the DataFrame. This column contains the unique identifier per subject.
 event_col (string) – the column in DataFrame that contains the subjects’ death observation. If left as None, assume all individuals are uncensored.
 start_col (string) – the column that contains the start of a subject’s time period.
 stop_col (string) – the column that contains the end of a subject’s time period.
 weights_col (string, optional) – the column that contains (possibly timevarying) weight of each subjectperiod row.
 show_progress (boolean, optional) – since the fitter is iterative, show convergence diagnostics.
 robust (boolean, optional (default: False)) – Compute the robust errors using the Huber sandwich estimator, aka Wei-Lin estimate. This does not handle ties, so if there are a high number of ties, results may significantly differ. See “The Robust Inference for the Cox Proportional Hazards Model”, Journal of the American Statistical Association, Vol. 84, No. 408 (Dec., 1989), pp. 1074-1078
 step_size (float, optional) – set an initial step size for the fitting algorithm.
 strata (list or string, optional) – specify a column or list of columns to use in stratification. This is useful if a categorical covariate does not obey the proportional hazard assumption. This is used similar to the strata expression in R. See http://courses.washington.edu/b515/l17.pdf.
 initial_point ((d,) numpy array, optional) – initialize the starting point of the iterative algorithm. Default is the zero vector.
Returns: self – self, with additional properties like
hazards_
andprint_summary
Return type: self

fit_right_censoring
(*args, **kwargs)¶ Alias for
fit
See also
fit

log_likelihood_ratio_test
()¶ This function computes the likelihood ratio test for the Cox model. We compare the existing model (with all the covariates) to the trivial model of no covariates.
Conveniently, we can use the CoxPHFitter class to do most of the work.
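The test statistic itself is simple: twice the gap in log-likelihood between the fitted model and the no-covariate model, compared against a chi-squared distribution with degrees of freedom equal to the number of covariates. A sketch of the statistic (the p-value lookup via a chi-squared CDF is omitted; the log-likelihood values below are made up for illustration):

```python
def likelihood_ratio_statistic(ll_full, ll_null):
    """LR statistic: 2 * (log-likelihood of the full model minus the
    log-likelihood of the no-covariate model).  Asymptotically chi-squared
    with d degrees of freedom under the null, d = number of covariates."""
    return 2.0 * (ll_full - ll_null)

# Hypothetical fitted log-likelihoods:
print(likelihood_ratio_statistic(-640.2, -650.0))  # approximately 19.6
```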

plot
(columns=None, **errorbar_kwargs)¶ Produces a visual representation of the coefficients, including their standard errors and magnitudes.
Parameters:  columns (list, optional) – specify a subset of the columns to plot
 errorbar_kwargs – pass in additional plotting commands to matplotlib errorbar command
Returns: ax – the matplotlib axis that can be edited.
Return type: matplotlib axis

predict_log_partial_hazard
(X)¶ This is equivalent to R’s linear.predictors. Returns the log of the partial hazard for the individuals, partial since the baseline hazard is not included. Equal to \((x - \bar{x})'\beta\)
Parameters: X (numpy array or DataFrame) – a (n,d) covariate numpy array or DataFrame. If a DataFrame, columns can be in any order. If a numpy array, columns must be in the same order as the training data. Returns: Return type: DataFrame Note
If X is a DataFrame, the order of the columns do not matter. But if X is an array, then the column ordering is assumed to be the same as the training dataset.

predict_partial_hazard
(X)¶ Returns the partial hazard for the individuals, partial since the baseline hazard is not included. Equal to \(\exp{(x - \bar{x})'\beta}\)
Parameters: X (numpy array or DataFrame) – a (n,d) covariate numpy array or DataFrame. If a DataFrame, columns can be in any order. If a numpy array, columns must be in the same order as the training data. Returns: Return type: DataFrame Note
If X is a DataFrame, the order of the columns do not matter. But if X is an array, then the column ordering is assumed to be the same as the training dataset.
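The formula can be computed directly; a toy sketch with hypothetical covariate values and coefficients (partial_hazard is an illustrative helper, not the lifelines method):

```python
import math

def partial_hazard(x, x_mean, beta):
    """exp((x - x_mean)' beta): the subject's hazard relative to baseline."""
    return math.exp(sum((xi - mi) * bi for xi, mi, bi in zip(x, x_mean, beta)))

# A subject exactly at the training means has a partial hazard of 1.0.
print(partial_hazard([50.0, 1.0], [50.0, 1.0], [0.02, 0.5]))  # 1.0
# One unit above the mean on the second covariate multiplies the hazard by e^0.5.
print(partial_hazard([50.0, 2.0], [50.0, 1.0], [0.02, 0.5]))
```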

print_summary
(decimals=2, **kwargs)¶ Print summary statistics describing the fit, the coefficients, and the error bounds.
Parameters:  decimals (int, optional (default=2)) – specify the number of decimal places to show
 kwargs – print additional meta data in the output (useful to provide model names, dataset names, etc.) when comparing multiple outputs.

summary
¶ Summary statistics describing the fit. Set alpha property in the object before calling.
Returns: df – Contains columns coef, np.exp(coef), se(coef), z, p, lower, upper Return type: DataFrame
CoxPHFitter¶

class
lifelines.fitters.coxph_fitter.
CoxPHFitter
(alpha=0.05, tie_method='Efron', penalizer=0.0, strata=None)¶ Bases:
lifelines.fitters.BaseFitter
This class implements fitting Cox’s proportional hazard model:
\[h(t|x) = h_0(t) \exp((x - \overline{x})' \beta)\]Parameters:  alpha (float, optional (default=0.05)) – the level in the confidence intervals.
 tie_method (string, optional) – specify how the fitter should deal with ties. Currently only ‘Efron’ is available.
 penalizer (float, optional (default=0.0)) – Attach an L2 penalizer to the size of the coefficients during regression. This improves stability of the estimates and controls for high correlation between covariates. For example, this shrinks the absolute value of \(\beta_i\). The penalty is \(\frac{1}{2} \text{penalizer} \beta^2\).
 strata (list, optional) – specify a list of columns to use in stratification. This is useful if a categorical covariate does not obey the proportional hazard assumption. This is used similar to the strata expression in R. See http://courses.washington.edu/b515/l17.pdf.
Examples
>>> from lifelines.datasets import load_rossi
>>> from lifelines import CoxPHFitter
>>> rossi = load_rossi()
>>> cph = CoxPHFitter()
>>> cph.fit(rossi, 'week', 'arrest')
>>> cph.print_summary()

params_
¶ The estimated coefficients. Changed in version 0.22.0: used to be
.hazards_
Type: Series

hazard_ratios_
¶ The exp(coefficients)
Type: Series

confidence_intervals_
¶ The lower and upper confidence intervals for the hazard coefficients
Type: DataFrame

durations
¶ The durations provided
Type: Series

event_observed
¶ The event_observed variable provided
Type: Series

weights
¶ The weights variable provided
Type: Series

variance_matrix_
¶ The variance matrix of the coefficients
Type: numpy array

strata
¶ the strata provided
Type: list

standard_errors_
¶ the standard errors of the estimates
Type: Series

score_
¶ the concordance index of the model.
Type: float

baseline_hazard_
¶ Type: DataFrame

baseline_cumulative_hazard_
¶ Type: DataFrame

baseline_survival_
¶ Type: DataFrame

check_assumptions
(training_df, advice=True, show_plots=False, p_value_threshold=0.01, plot_n_bootstraps=10, columns=None)¶ Use this function to test the proportional hazards assumption. See usage example at https://lifelines.readthedocs.io/en/latest/jupyter_notebooks/Proportional%20hazard%20assumption.html
Parameters:  training_df (DataFrame) – the original DataFrame used in the call to fit(...), or a subsampled version.
 advice (boolean, optional) – display advice as output to the user’s screen
 show_plots (boolean, optional) – display plots of the scaled schoenfeld residuals and loess curves. This is an eyeball test for violations. This will slow down the function significantly.
 p_value_threshold (float, optional) – the threshold to use to alert the user of violations. See note below.
 plot_n_bootstraps – in the plots displayed, also display plot_n_bootstraps bootstrapped loess curves. This will slow down the function significantly.
 columns (list, optional) – specify a subset of columns to test.
Examples
>>> from lifelines.datasets import load_rossi
>>> from lifelines import CoxPHFitter
>>>
>>> rossi = load_rossi()
>>> cph = CoxPHFitter().fit(rossi, 'week', 'arrest')
>>>
>>> cph.check_assumptions(rossi)
Notes
The p_value_threshold is arbitrarily set at 0.01. Under the null, some covariates will fall below the threshold by chance. This is compounded when there are many covariates. Similarly, when there are many observations, even minor deviations from the proportional hazard assumption will be flagged.
With that in mind, it’s best to use a combination of statistical tests and eyeball tests to determine the most serious violations.
References
section 5 in https://socialsciences.mcmaster.ca/jfox/Books/Companion/appendices/AppendixCoxRegression.pdf, http://www.mwsug.org/proceedings/2006/stats/MWSUG2006SD08.pdf, http://eprints.lse.ac.uk/84988/1/06_ParkHendry2015ReassessingSchoenfeldTests_Final.pdf

compute_residuals
(training_dataframe, kind)¶ Parameters:  training_dataframe (pandas DataFrame) – the same training DataFrame given in fit
 kind (string) – {‘schoenfeld’, ‘score’, ‘delta_beta’, ‘deviance’, ‘martingale’, ‘scaled_schoenfeld’}

fit
(df, duration_col=None, event_col=None, show_progress=False, initial_point=None, strata=None, step_size=None, weights_col=None, cluster_col=None, robust=False, batch_mode=None)¶ Fit the Cox proportional hazard model to a dataset.
Parameters:  df (DataFrame) – a Pandas DataFrame with necessary columns duration_col and event_col (see below), covariate columns, and special columns (weights, strata). duration_col refers to the lifetimes of the subjects. event_col refers to whether the ‘death’ event was observed: 1 if observed, 0 otherwise (censored).
 duration_col (string) – the name of the column in DataFrame that contains the subjects’ lifetimes.
 event_col (string, optional) – the name of the column in DataFrame that contains the subjects’ death observation. If left as None, assume all individuals are uncensored.
 weights_col (string, optional) – an optional column in the DataFrame, df, that denotes the weight per subject. This column is expelled and not used as a covariate, but as a weight in the final regression. Default weight is 1. This can be used for case weights. For example, a weight of 2 means there were two subjects with identical observations. This can also be used for sampling weights. In that case, use robust=True to get more accurate standard errors.
 show_progress (boolean, optional (default=False)) – since the fitter is iterative, show convergence diagnostics. Useful if convergence is failing.
 initial_point ((d,) numpy array, optional) – initialize the starting point of the iterative algorithm. Default is the zero vector.
 strata (list or string, optional) – specify a column or list of columns to use in stratification. This is useful if a categorical covariate does not obey the proportional hazard assumption. It is used similarly to the strata expression in R. See http://courses.washington.edu/b515/l17.pdf.
 step_size (float, optional) – set an initial step size for the fitting algorithm. Setting to 1.0 may improve performance, but could also hurt convergence.
 robust (boolean, optional (default=False)) – Compute the robust errors using the Huber sandwich estimator, aka the Wei-Lin estimate. This does not handle ties, so if there is a high number of ties, results may significantly differ. See “The Robust Inference for the Cox Proportional Hazards Model”, Journal of the American Statistical Association, Vol. 84, No. 408 (Dec., 1989), pp. 1074-1078
 cluster_col (string, optional) – specifies what column has unique identifiers for clustering covariances. Using this forces the sandwich estimator (robust variance estimator) to be used.
 batch_mode (bool, optional) – enabling batch_mode can be faster for datasets with a large number of ties. If left as None, lifelines will choose the best option.
Returns: self – self with additional new properties: print_summary, hazards_, confidence_intervals_, baseline_survival_, etc.
Return type: CoxPHFitter
Note
Tied survival times are handled using Efron’s tie method.
Examples
>>> from lifelines import CoxPHFitter
>>>
>>> df = pd.DataFrame({
>>>     'T': [5, 3, 9, 8, 7, 4, 4, 3, 2, 5, 6, 7],
>>>     'E': [1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0],
>>>     'var': [0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2],
>>>     'age': [4, 3, 9, 8, 7, 4, 4, 3, 2, 5, 6, 7],
>>> })
>>>
>>> cph = CoxPHFitter()
>>> cph.fit(df, 'T', 'E')
>>> cph.print_summary()
>>> cph.predict_median(df)
>>> from lifelines import CoxPHFitter
>>>
>>> df = pd.DataFrame({
>>>     'T': [5, 3, 9, 8, 7, 4, 4, 3, 2, 5, 6, 7],
>>>     'E': [1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0],
>>>     'var': [0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2],
>>>     'weights': [1.1, 0.5, 2.0, 1.6, 1.2, 4.3, 1.4, 4.5, 3.0, 3.2, 0.4, 6.2],
>>>     'month': [10, 3, 9, 8, 7, 4, 4, 3, 2, 5, 6, 7],
>>>     'age': [4, 3, 9, 8, 7, 4, 4, 3, 2, 5, 6, 7],
>>> })
>>>
>>> cph = CoxPHFitter()
>>> cph.fit(df, 'T', 'E', strata=['month', 'age'], robust=True, weights_col='weights')
>>> cph.print_summary()
>>> cph.predict_median(df)

fit_right_censoring
(*args, **kwargs)¶ Alias for
fit
See also
fit

log_likelihood_ratio_test
()¶ This function computes the likelihood ratio test for the Cox model. We compare the existing model (with all the covariates) to the trivial model of no covariates.

plot
(columns=None, hazard_ratios=False, **errorbar_kwargs)¶ Produces a visual representation of the coefficients (i.e. log hazard ratios), including their standard errors and magnitudes.
Parameters:  columns (list, optional) – specify a subset of the columns to plot
 hazard_ratios (bool, optional) – by default, plot will present the log-hazard ratios (the coefficients). However, by turning this flag to True, the hazard ratios are presented instead.
 errorbar_kwargs – pass in additional plotting commands to matplotlib errorbar command
Examples
>>> from lifelines import datasets, CoxPHFitter
>>> rossi = datasets.load_rossi()
>>> cph = CoxPHFitter().fit(rossi, 'week', 'arrest')
>>> cph.plot(hazard_ratios=True)
Returns: ax – the matplotlib axis that can be edited.
Return type: matplotlib axis

plot_covariate_groups
(covariates, values, plot_baseline=True, **kwargs)¶ Produces a plot comparing the baseline survival curve of the model versus what happens when a covariate(s) is varied over values in a group. This is useful to compare subjects’ survival as we vary covariate(s), all else being held equal. The baseline survival curve is equal to the predicted survival curve at the average values of all covariates in the original dataset.
Parameters:  covariates (string or list) – a string (or list of strings) of the covariate(s) in the original dataset that we wish to vary.
 values (1d or 2d iterable) – an iterable of the specific values we wish the covariate(s) to take on.
 plot_baseline (bool) – also display the baseline survival, defined as the survival at the mean of the original dataset.
 kwargs – pass in additional plotting commands.
Returns: ax – the matplotlib axis that can be edited.
Return type: matplotlib axis, or list of axes
Examples
>>> from lifelines import datasets, CoxPHFitter
>>> rossi = datasets.load_rossi()
>>> cph = CoxPHFitter().fit(rossi, 'week', 'arrest')
>>> cph.plot_covariate_groups('prio', values=np.arange(0, 15, 3), cmap='coolwarm')
>>> # multiple variables at once
>>> cph.plot_covariate_groups(['prio', 'paro'], values=[
>>>     [0, 0],
>>>     [5, 0],
>>>     [10, 0],
>>>     [0, 1],
>>>     [5, 1],
>>>     [10, 1]
>>> ], cmap='coolwarm')
>>> # if you have categorical variables, you can do the following to see the
>>> # effect of all the categories on one plot.
>>> cph.plot_covariate_groups(['dummy1', 'dummy2', 'dummy3'], values=[[1, 0, 0], [0, 1, 0], [0, 0,