We will start our discussion with a few events that can be observed: the death of a person due to a disease, the attrition of an employee from an organization and the occurrence of a natural calamity (an earthquake or a flood). These examples come from completely different domains, but they have one thing in common: time or, more precisely, the time until an event occurs. Time is crucial in all these situations. If we know beforehand that a certain event is likely to occur at a specific time, then a lot of lives and resources can be saved. Survival analysis is a collection of statistical techniques for longitudinal data in which the time until an event is the main quantity of interest. It is used in biology, medicine, engineering, marketing, and the social and behavioral sciences. In engineering and operations research, it is also known as reliability theory. It is a complex subject, and the reader would need expertise in probability, statistics, calculus and optimization to grasp it fully.
In this chapter, we will explore some basic concepts of survival analysis, its nomenclature and some sample datasets.
Concept of Failure Time
We have already talked about events. In general, survival analysis deals with events related to failure. A failure can, of course, occur one or more times for any subject. For the topics discussed in this book, it is assumed that failure occurs only once for a subject. We will use the term subject throughout this book to represent the entity that passes through some phases and to which the failure (or the event) is attached. A subject may be a person, a machine, a river, or even an entire geographic region. There are numerous use cases where survival analysis can be applied to estimate the chances of an event occurring. Some of them are:
Death of a person by any disease
Suicide
Failure of machine tools
Attrition of employees from organization
Divorce
Occurrence of any natural catastrophe (flood, earthquake, volcanic eruption, etc.)
In this book, we will mostly discuss the death by disease use cases, as this is where survival analysis is most widely used. Death by disease is typically analyzed in drug development, where survival analysis plays a crucial role in identifying the right drug through a comparative study of several options.
We are talking about time a lot. But what does it signify? By time, we mean the years, months, weeks or days from the beginning of the analysis of the data until an event (like death, exit of an employee, earthquake, etc.) occurs. As said earlier, an event is also termed a failure. So, the time taken until failure is referred to as the failure time or survival time. Time may not always be a physical unit; there are cases where it can be used as a logical indicator. The following points need to be taken care of before defining a time scale:
Origin of the time must be unambiguously defined.
The scale for measuring the time difference must be defined.
Definition of failure must be clear.
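The three points above can be illustrated with a minimal sketch. It assumes calendar dates for the time origin and the failure, and the helper name failure_time and the unit conversion table are hypothetical choices made only for this illustration (months and years use average lengths).

```python
from datetime import date

# Assumed conversion factors from days to each time scale (illustrative only;
# months and years use average lengths).
DAYS_PER_UNIT = {"days": 1.0, "weeks": 7.0, "months": 30.4375, "years": 365.25}


def failure_time(origin: date, failure: date, unit: str = "days") -> float:
    """Return the time from a well-defined origin to the failure date.

    origin  : the unambiguous starting point (e.g., date of joining)
    failure : the date on which the clearly defined failure occurred
    unit    : the scale used to measure the time difference
    """
    if unit not in DAYS_PER_UNIT:
        raise ValueError(f"unknown time scale: {unit}")
    if failure < origin:
        raise ValueError("failure cannot precede the time origin")
    elapsed_days = (failure - origin).days
    return elapsed_days / DAYS_PER_UNIT[unit]


# Hypothetical employee attrition example: origin = joining date,
# failure = exit date, scale = weeks.
joined = date(2020, 1, 1)
left = date(2020, 1, 29)
print(failure_time(joined, left, unit="weeks"))  # 28 days -> 4.0 weeks
```

Changing the origin (for example, date of diagnosis instead of date of birth) or the scale changes the resulting failure time, which is why both must be fixed before the analysis starts.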
Concept of Survival
When we speak about survival, we mean probabilities. The probability that an event does not occur until some time is the survival probability. In other words, the survival probability is the probability that the event occurs only after a certain time. For example, when we say that the survival probability of a heart patient at age 71 is 0.23, it means that there is a probability of 0.23 that the patient will survive beyond age 71. Age is the time scale here. Similarly, there could be a probability of 0.40 that he/she will survive beyond age 50. The reason is clear: to survive beyond 71, the patient must first survive beyond 50, so the survival probability at the younger age can never be lower. So, we can have a survival probability distribution over the random variable time (here age) as shown below:
Table 1.1 A Sample Survival Probability Distribution
Time (Age) | 40 | 45 | 50 | 60 | 65 | 70 |
Survival Probability | 0.51 | 0.42 | 0.38 | 0.36 | 0.28 | 0.24 |
One of the purposes of survival analysis is to find this probability distribution. A lot of other domain-specific statistical inferences can also be drawn from it. It can be observed that the survival probability decreases over time. This is a very important feature of the distribution. We will discuss it in greater detail in Chapter 2. As with the heart patient use case, the same analysis can be done for employee attrition in an organization. The purpose there is to find the survival probability distribution of an employee's exit at various times after he/she joins. The interesting part is that the term survival is very generic here. It need not always mean saving yourself from something, nor is it always related to disease, patients or healthcare. Survival means the non-occurrence of an event until some time. The event could be any one from the list discussed in the section ‘Concept of Failure Time’ or something else.
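To make Table 1.1 concrete, the following minimal sketch stores its values in a dictionary, checks that the survival probability never increases with age, and derives the probability of failure between two tabulated ages as the difference of the two survival probabilities. The function names are hypothetical, and the sketch assumes the survival probability at age t means the probability of surviving beyond t; it works only for the ages listed in the table.

```python
# Survival probabilities from Table 1.1, keyed by age.
survival = {40: 0.51, 45: 0.42, 50: 0.38, 60: 0.36, 65: 0.28, 70: 0.24}


def is_non_increasing(table: dict) -> bool:
    """Check that survival probability never rises as time (age) increases."""
    ages = sorted(table)
    return all(table[a] >= table[b] for a, b in zip(ages, ages[1:]))


def failure_probability_between(table: dict, start: int, end: int) -> float:
    """Probability that failure occurs after `start` but not after `end`.

    Under the assumption S(t) = P(T > t), this equals S(start) - S(end).
    Both ages must be present in the table.
    """
    if start > end:
        raise ValueError("start must not exceed end")
    return table[start] - table[end]


print(is_non_increasing(survival))  # True
print(round(failure_probability_between(survival, 40, 70), 2))  # 0.27
```

The second result reads as follows: of the 0.51 probability of surviving beyond age 40, 0.24 remains beyond age 70, so the remaining 0.27 is the probability that the failure falls between those two ages.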
Censoring
Most survival analyses must consider a very important analytical problem called censoring. It is caused by not observing some subjects fo...