Unistat Statistics Software | Survival-Cox Regression

9.4.4. Cox Regression

Apart from time and status variables, data for Survival Analysis often contain measurements on one or more continuous variables, such as temperature, dosage, age or one or more categorical variables such as gender, region, treatment. In such cases it is desirable to construct Life Tables (or survival functions) which reflect the effects of these continuous or categorical variables (which are also called covariates).

A method which combines the elements of nonparametric Life Table analysis and the parametric Regression Analysis was introduced by D R Cox in 1972. It is known as the Cox Regression or Cox’s proportional hazards model. The latter reflects a fundamental assumption of this model, namely that the hazard function of an individual in one group is proportional to the hazard function of another in another group at any time period. In graphical terms, this is equivalent to assuming that the hazard curves of different groups do not cross each other.

Let X be a matrix of p independent variables (covariates) for n cases. Then the hazard function for the j^th case can be written as:

where:

· h₀(t) is the baseline hazard function and

· is the vector of coefficients.

Rearranging the terms we obtain:

Although the right-hand side of this equation is linear, the existence of another unknown on the left-hand side, h₀(t), makes it impossible to solve it using the Linear Regression method. Instead, a maximum likelihood solution for a log-likelihood function that is derived from the above equation is found:

Survival-Cox Regression

where R(t_j) is the set of cases at risk at the start of time j and d_j is the number of cases that terminate at time j.

The maximum likelihood solution of this proportional hazards model is computed employing a modified Newton-Raphson minimisation algorithm. The nature of this method implies that a solution (convergence) cannot always be achieved. In such cases, you are advised to edit the convergence parameters provided, in order to find the right levels for the particular problem at hand. Three convergence parameters can be edited in a dialogue that pops up just after the Variable Selection Dialogue (see 9.4.4.2. Cox Regression Intermediate Inputs).

Stratified Analysis: It may be the case that with some data sets the assumption of proportionality of individual hazard functions cannot be maintained. Under such circumstances it may be possible to define some subgroups within which the hazard functions can assumed to be proportional. This type of analysis is called a stratified analysis and the subgroups within which the proportionality assumption is maintained are called the strata.

The strata are defined by the levels of a factor column, which can be selected from the Variable Selection Dialogue. The program then assumes that different strata have different baseline hazard functions, but that they all have the same coefficient estimates for covariates. A maximum of six strata can be fitted simultaneously.

Baseline Functions at Means of Covariates: In order to maintain numerical stability, UNISTAT always centres the covariates by subtracting their respective means first. Most results, including the coefficient estimates, are invariant to centring. However, the baseline functions are different with and without centring and some sources prefer reporting the centred baseline functions while the others the uncentred ones (baseline functions relative to the origin). UNISTAT makes both sets of results available. See 9.4.4.2. Cox Regression Intermediate Inputs below for how to make this selection. The only other result that is affected by centring is the Linear Fit (XBeta) reported under Case (Diagnostic) Statistics (see 9.4.4.3.2. Cox Regression Further Output Options).

Predictions: If a case (row) of data does not contain any missing values except for the time variable, then a prediction of survival times is generated for this case. If at least one such case exists in data, then a fourth option Predicted Survival is added to the Baseline Functions dialogue. An additional column of predicted survival times is output for each predicted case.

9.4.4.1. Cox Regression Variable Selection

Survival-Cox Regression

Variable: Any number of columns containing numeric data can be selected as independent variables (covariates). However, unlike any other regression procedures, a Cox regression can be run without selecting any independent variables. This is called the initial model (or the null model). UNISTAT reports the -2Log Likelihood value for the initial model for any model fitted.

Interaction: This button is used to create independent variables, which are the products of existing numeric variables. If only one variable is highlighted, then the new independent variable created will be the product of the selected column by itself. If two or more variables are highlighted, then the new term will be the product of these variables. Maximum three-way interactions are allowed. Interactions of dummy variables or lags are not allowed. In order to create interaction terms for dummy variables, create interactions first, and then create dummy variables for them. For further information see 2.1.4. Creating Interaction, Dummy and Lag/Lead Variables.

Dummy: This button is used to create n or n – 1 new independent (dummy or indicator) variables for a factor column containing n levels. Each dummy variable corresponds to a level of the factor column. A case in a dummy column will have the value of 1 if the factor contains the corresponding level in the same row, and 0 otherwise. If the selected variable is an interaction term, then dummy variables will be created for this interaction term. Up to three-way interactions are allowed and short or long string columns can be selected as factors. It is possible to include all n levels in the analysis or to omit the first or the last level to remove the inherent over-parameterisation of the model. For further information see 2.1.4. Creating Interaction, Dummy and Lag/Lead Variables.

Full: This button becomes activated when two or more categorical variables are highlighted. Like the [Dummy] button, it is also used to create dummy variables. The only difference is that this button will create all necessary dummy variables and their interactions to specify a complete model. For instance, if two categorical variables are highlighted, then this button will create two sets of dummy variables representing the main effects and a third set representing the interaction term. Maximum three-way interactions are supported.

Time: It is compulsory to select at least one column containing numeric data as the time variable. Two types of data can be selected for Cox Regression; (1) Enter Durations where one column is selected containing the duration of each case and (2) Enter Begin and End Times where one column is selected as the starting time and another column as the termination time. In the latter case, the program subtracts the second column from the first one to obtain the duration of each case. If this results in some non positive values, then they are treated as missing cases (see 3.0.2.4. Time Data).

Censored: This variable is selected to show whether the termination time of a case is not known or the termination event has occurred. If the value in the column is zero then it is assumed that the case has been censored, which means that it has been excluded from the study before the termination event has occurred. Non-zero values (usually one) indicate that the termination event has occurred for this case. If a censor variable is not selected, then it is assumed that termination event has occurred for all cases. The default value of 0 indicating the censored cases can be edited to any other value from the next dialogue (see 9.4.4.2. Cox Regression Intermediate Inputs).

Weight: A column containing continuous numeric data can be selected as a weight variable. In this case, the program will normalise this column so that its sum is equal to the valid number of cases and then multiply each covariate by its square root before running the regression.

Factor: A categorical variable containing a limited number of distinct values can be selected as a factor. Unlike other Survival Analysis procedures though, this variable is not used in selecting subgroups to be included in the analysis. All cases are included in the analysis, but a stratified model is fitted. For more information on stratified analysis see the beginning of section 9.4.4. Cox Regression.

9.4.4.2. Cox Regression Intermediate Inputs

Survival-Cox Regression

Tolerance: This value is used to control the sensitivity of nonlinear minimisation procedure employed. Under normal circumstances, you do not need to edit this value. If a convergence cannot be achieved, then larger values of this parameter can be tried by removing one or more zeros.

Maximum Number of Iterations: When convergence cannot be achieved with the default value of 100 function evaluations, a higher value can be tried.

Expansion Factor: When the nonlinear minimisation algorithm moves away from a valid range of input parameters, this expansion factor is used to start a new search in a wider area. Under normal circumstances, you do not need to edit this value. If convergence cannot be achieved, then other values of similar (but larger) magnitude can be tried.

Value for Censored Cases: If a column of the spreadsheet has been selected as a censor variable, UNISTAT will consider cases with a zero entry in this column as censored. However, you can specify any value here to indicate which cases are to be treated as censored. Once a change is made here, it will apply to all Survival Analysis procedures throughout the session.

At Mean of Covariates: This box is used to select the type of baseline survival and hazard functions. A zero value results in baseline functions being reported relative to the origin and other values relative to the covariate means.

One other result that is affected by this choice is the linear fit (XBeta) vector reported under the Case (Diagnostic) Statistics option.

Omit Level: This box will appear only when one or more dummy variables have been included in the model from the Variable Selection Dialogue. Three options are available; (0) do not omit any levels, (1) omit the first level and (2) omit the last level. When no levels are omitted, the model will usually be over-parameterised (see 2.1.4. Creating Interaction, Dummy and Lag/Lead Variables).

9.4.4.3. Cox Regression Output Options

When the maximum likelihood model has converged or the maximum number of iterations has been reached without convergence, an Output Options Dialogue will provide access to the following options.

Survival-Cox Regression

When no covariates have been included in the model, only the -2 Log likelihood value will be reported in the Regression Results output. In this case, a Life Table will be displayed with a survival function that is identical to the Kaplan-Meier estimate of the same model (see 9.4.2. Kaplan-Meier Analysis).

9.4.4.3.1. Cox Regression Coefficient Output

Regression Results: The main regression output displays a table for coefficients of the estimated regression equation, their standard errors, Wald statistics, probability values and confidence intervals for the significance level specified in the Variable Selection Dialogue. If any independent variables have been omitted due to multicollinearity, they are reported at the end of the table. If you do not wish to display these variables enter the following line in the [Options] section of Documents\Unistat10\Unistat10.ini file:

DispCollin=0

In Stand-Alone Mode, the estimated regression coefficients can be saved to the data matrix by clicking on the UNISTAT icon situated on the Output Medium Toolbar.

Wald Statistic: This is defined as:

and has a chi-square distribution with one degree of freedom.

Confidence Intervals: The confidence intervals for regression coefficients are computed from:

where each coefficient’s standard error, , is the square root of the diagonal element of covariance matrix.

-2 Log-Likelihood Initial Model: This is the value when all independent variables are excluded from the model:

-2 Log-Likelihood Final Model: This is -2 times the value of the log likelihood function when convergence is achieved.

Pseudo R-squared: In Cox Regression, an r-squared statistic as in the OLS regression is not available. This is because Cox Regression employs an iterative maximum likelihood estimation method. Equivalent statistics to test the goodness of fit have been proposed using the initial (L₀) and maximum (L₁) likelihood values.

McFadden:

Adjusted McFadden:

Cox & Snell:

Survival-Cox Regression

Nagelkerke:

Survival-Cox Regression

Likelihood Ratio: This is a test statistic for the null hypothesis that “all regression coefficients for covariates are zero”. It is equal to -2 times the difference between the initial and final model likelihood values and has a chi-square distribution with k degrees of freedom (the number of independent variables in the model).

If no covariates have been included in the model, then only the -2 Log likelihood value will be reported.

Hazard Ratio:

Hazard ratio is defined as:

Values of hazard ratio above 1 indicate an increased hazard as the values of the corresponding covariate increase at any time period, and vice versa for values below 1.

The standard error of the hazard ratio is found as:

and its confidence intervals as:

which are simply the exponential of the coefficient confidence intervals.

This option is not available when no covariates have been included in the model.

Correlation Matrix of Regression Coefficients: This is a symmetric matrix with unity diagonal elements. The off-diagonal elements give the correlations between the regression coefficients.

This output is not available when no covariates have been included in the model.

Covariance Matrix of Regression Coefficients: This is a symmetric matrix where diagonal elements are the square of parameter standard errors.

This output is not available when no covariates have been included in the model.

9.4.4.3.2. Cox Regression Case Output

Case (Diagnostic) Statistics: Case statistics are useful in determining the influence of individual observations on the overall fit of the model. For further information see 7.2.1.2.2. Linear Regression Case Output.

The values reported here are not sorted according to strata and the time variable. Therefore, when this output is sent to the Data Processor for further analysis, a case-by-case correspondence with the original data will be maintained.

Survival-Cox Regression

Survival Function:

Cumulative Hazard Function:

Cox-Snell Residuals:

These are identical to the cumulative hazard function.

Modified Cox-Snell Residuals:

where is if the case terminates at time j and if it is censored.

Martingale Residuals:

Deviance Residuals:

Score (Partial) Residuals:

for uncensored cases and missing otherwise for all covariates i = 1,…,p, where:

Survival-Cox Regression

Delta-Beta:

where is the p x p covariance matrix of regression coefficients and is an n x p matrix, elements of which are defined as:

Survival-Cox Regression

for j = 1,…,n, i = 1,…, p where is defined as in score residuals above.

Delta-beta is defined as the change in an estimated coefficient when a case is omitted from the analysis. However, unlike in Linear Regression (see 7.2.1.2.2. Linear Regression Case Output), where an exact estimate can be computed without having to run n regressions, the method employed here is an approximation (see Pettitt & Bin Daud, 1989). The user must be aware that different sources and statistical packages may adopt different approximation methods. The best way to find out which method gives the most accurate results is to run a small example and compute each delta-beta by running n regressions, each time omitting one row from the analysis.

Standardised Delta-Beta:

for i = 1,…, p.

This is delta-beta divided by its standard error for each coefficient.

Likelihood Displacement:

where is the m x m covariance matrix of regression coefficients and is an n x m matrix, elements of which are as defined in delta-betas. This statistic reflects the effect of deleting a case on the likelihood function.

Linear Fit (XBeta):

Defined as:

Alongside baseline functions, this is the only result that is affected by whether the baseline functions are reported centred or relative to the origin (see 9.4.4.2. Cox Regression Intermediate Inputs).

Baseline Functions: Baseline survival, hazard and cumulative hazard functions are reported for each stratum, which are ordered according to the time variable in ascending order. Baseline functions reported here would in general be different for a model where the covariates are centred. If there are no predicted cases, then the Baseline Functions dialogue contains three options.

Survival-Cox Regression

If at least one case (row) of data does not contain any missing values except for the time variable, then a fourth option Predicted Survival is added to the Baseline Functions dialogue.

Survival-Cox Regression

Baseline Survival:

where is obtained by solving the following equation:

where D(t_j) is the set of cases that terminate at time j.

Baseline Hazard:

where is as defined for the baseline survival.

Baseline Cumulative Hazard:

Predicted Survival: This is computed for cases containing no missing values except for the time variable. For each such case, a column of predicted survival times is computed as:

Plot of Baseline Survival Function: The baseline survival function is plotted against time. The appearance of this graph depends on the selection of estimation at covariate means (see 9.4.4.2. Cox Regression Intermediate Inputs).

Survival-Cox Regression

Plot of Baseline Cumulative Hazard: The baseline cumulative hazard function is plotted against time. The appearance of this graph depends on the selection of estimation method, at origin or at covariate means (see 9.4.4.2. Cox Regression Intermediate Inputs).

Survival-Cox Regression

9.4.4.4. Cox Regression Examples

Example 1

Data on survival times of patients in a study on multiple myeloma is given in Table 1.3 (p. 9), in Collett, D. (1994).

Open SURVIVAL and select Statistics 2 → Survival Analysis → Cox Regression. From the Variable Selection Dialogue select the data option 1 Enter Durations and Survival time (C10) as [Time], Status (C11) as [Censored] and Age, Sex, BUN, CA, HB, PC and BJ (C12 to C18) as [Variable]s.

The Regression Results output reproduces the coefficient estimates and their standard errors as Collett reports them on Table 3.1 (p. 71):

Cox Regression

Time Variable: Survival time

Censor Variable: Status

Number of Cases Censored: 12 ( 25.0%)

Valid Number of Cases: 48, 0 Omitted

Regression Results

	Coefficient	Standard Error	Wald statistic	Probability	Lower 95%	Upper 95%
Age	-0.0194	0.0279	0.4806	0.4882	-0.0741	0.0354
Sex	-0.2509	0.4023	0.3890	0.5328	-1.0394	0.5376
BUN	0.0208	0.0059	12.3396	0.0004	0.0092	0.0324
CA	0.0131	0.1324	0.0098	0.9211	-0.2465	0.2727
HB	-0.1352	0.0689	3.8537	0.0496	-0.2703	-0.0002
PC	-0.0016	0.0066	0.0587	0.8085	-0.0145	0.0113
BJ	-0.6404	0.4267	2.2529	0.1334	-1.4767	0.1959

-2 Log likelihood:
Initial Model =	215.9399
Final Model =	199.7009
Pseudo R-squared:
McFadden =	0.0752
Adjusted McFadden =	0.0104
Cox & Snell =	0.2870
Nagelkerke =	0.2903
Likelihood Ratio Statistic:
Chi-Square Statistic =	16.2390
Degrees of Freedom =	7
Right-Tail Probability =	0.0230

Example 2

Survival times of patients classified according to age group and whether or not they have had a nephrectomy is given in Table 3.3 (p. 76), in Collett, D. (1994). Example 3.2 (p. 69).

Open SURVIVAL and select Statistics 2 → Survival Analysis → Cox Regression. From the Variable Selection Dialogue select the data option 1 Enter Durations and Time (C27) as [Time], Status (C28) as [Censored]. Then highlight Age group (C30) and click [Dummy] then highlight Nephrectomy (C29) and click [Dummy] again. This will add two terms to the model: Dummy(C30 Age Group) and Dummy(C29 Nephrectomy). The next dialogue will ask for a number of model parameters. Leave all entries unchanged as suggested by the program, but change the last one Omit Level? from 0 to 1, to omit the first level of each factor from the model.

The Regression Results output reproduces the coefficient estimates and their standard errors as Collett reports them in Example 3.12 (p. 100):

Cox Regression

Time Variable: Time

Censor Variable: Status

Number of Cases Censored: 4 ( 11.1%)

Valid Number of Cases: 36, 0 Omitted

Regression Results

	Coefficient	Standard Error	Wald Statistic	Signifi cance	Lower 95%	Upper 95%
Age Group = 2	0.0125	0.4246	0.0009	0.9765	-0.8197	0.8447
3	1.3416	0.5918	5.1396	0.0234	0.1817	2.5014
Nephrectomy = 2	-1.4115	0.5152	7.5044	0.0062	-2.4213	-0.4016

-2 Log likelihood:
Initial Model =	177.6665
Final Model =	165.5084
Pseudo R-squared:
McFadden =	0.0684
Adjusted McFadden =	0.0347
Cox & Snell =	0.2866
Nagelkerke =	0.2887
Likelihood Ratio Statistic:
Chi-Square Statistic =	12.1581
Degrees of Freedom =	3
Right-Tail Probability =	0.0069

Hazard Ratio

	Hazard Ratio	Standard Error	Lower 95%	Upper 95%
Age Group = 2	1.0126	0.4299	0.4406	2.3273
3	3.8250	2.2635	1.1993	12.1996
Nephrectomy = 2	0.2438	0.1256	0.0888	0.6692

Correlation Matrix of Regression Coefficients

	Age Group = 2	3	Nephrectomy = 2
Age Group = 2	1.0000	0.3119	0.0767
3	0.3119	1.0000	0.0818
Nephrectomy = 2	0.0767	0.0818	1.0000

The baseline functions are reported as in Collett Table 3.10 (p. 101):

Cox Regression

Baseline Functions

Time	Baseline Survival	Baseline Hazard	Baseline Cumulative Hazard
5	0.9502	0.0498	0.0510
6	0.8516	0.1038	0.1607
8	0.7552	0.1132	0.2808
9	0.5761	0.2371	0.5514
10	0.5339	0.0733	0.6275
12	0.4861	0.0896	0.7214
14	0.4335	0.1082	0.8359
15	0.3831	0.1163	0.9595
17	0.3326	0.1318	1.1008
18	0.2379	0.2848	1.4360
21	0.1939	0.1851	1.6406
26	0.1198	0.3822	2.1222
35	0.0920	0.2319	2.3860
36	0.0512	0.4430	2.9712
38	0.0369	0.2792	3.2985
48	0.0259	0.2993	3.6543
52	0.0114	0.5598	4.4748
56	0.0070	0.3823	4.9565
68	0.0041	0.4207	5.5024
72	0.0022	0.4673	6.1323
84	0.0009	0.5975	7.0424
108	0.0002	0.8072	8.6886

Survival-Cox Regression

Example 3

Data on times to removal of a catheter following a kidney infection is given in Table 5.1 (p. 157), in Collett, D. (1994), Example 3.2 (p. 69).

Open SURVIVAL and select Statistics 2 → Survival Analysis → Cox Regression. From the Variable Selection Dialogue select the data option 1 Enter Durations and Time (C32) as [Time], Status (C33) as [Censored]. Then select Age (C34) and Sex (C35) as [Variable]s. From the Output Options Dialogue select Case (Diagnostic) Statistics and at the next dialogue click All to select all options.

Different types of residuals calculated are reported by Collett in Table 5.2 (p. 157). Collett also reports the approximate estimated delta-betas in Table 5.4 (p. 172) and the likelihood displacement values in Table 5.5 (p.176).

Cox Regression

Case (Diagnostic) Statistics

Row	Survival Function	Cumulative Hazard Function	Cox-Snell Residuals	Modified Cox-Snell Residuals	Martingale Residuals	Deviance Residuals
1	0.7200	0.3286	0.3286	0.3286	0.6714	0.9398
2	0.9245	0.0785	0.0785	0.0785	0.9215	1.8020
3	0.2386	1.4331	1.4331	1.4331	-0.4331	-0.3828
4	0.9104	0.0939	0.0939	0.0939	0.9061	1.7087
5	0.1697	1.7736	1.7736	1.7736	-0.7736	-0.6334
6	0.7322	0.3117	0.3117	1.3117	-0.3117	-0.7895
7	0.7668	0.2655	0.2655	0.2655	0.7345	1.0877
8	0.5836	0.5386	0.5386	0.5386	0.4614	0.5611
9	0.1916	1.6523	1.6523	1.6523	-0.6523	-0.5480
10	0.2409	1.4234	1.4234	1.4234	-0.4234	-0.3751
11	0.2415	1.4207	1.4207	1.4207	-0.4207	-0.3730
12	0.0914	2.3927	2.3927	2.3927	-1.3927	-1.0201
…	…	…	…	…	…	…

Row	Score (Partial) Residuals Age	Score (Partial) Residuals Sex	Delta-Beta Age	Delta-Beta Sex	Standardised Delta-Beta Age	Standardised Delta-Beta Sex
1	-1.0850	-0.2416	0.0020	-0.1977	0.0750	-0.1804
2	14.4930	0.6644	0.0004	0.5433	0.0155	0.4957
3	3.1291	-0.3065	-0.0011	0.0741	-0.0402	0.0677
4	-10.2215	0.4341	-0.0119	0.5943	-0.4528	0.5423
5	-16.5882	-0.5504	0.0049	0.0139	0.1869	0.0127
6	*	*	-0.0005	-0.1192	-0.0206	-0.1088
7	-17.8286	-0.0000	-0.0095	0.1269	-0.3607	0.1158
8	-7.6201	-0.0000	-0.0032	-0.0346	-0.1235	-0.0315
9	17.0910	-0.0000	-0.0073	-0.0733	-0.2771	-0.0669
10	10.2390	-0.0000	0.0032	-0.2023	0.1232	-0.1846
11	2.8575	-0.0000	0.0060	-0.2158	0.2279	-0.1970
12	5.5338	-0.0000	0.0048	-0.1939	0.1829	-0.1770
…	…	…	…	…	…	…

Row	Likelihood Displacement	Linear Fit (XBeta)
1	0.0328	-1.8604
2	0.3387	-4.0852
3	0.0046	-1.7389
…	…	…
9	0.1334	-3.5993
10	0.0353	-4.1156
11	0.0611	-4.5104
12	0.0432	-4.4800
…	…	…

Previous topic | Next topic