1.1 Big Data: Introduction
The sheer quantity of data collected in our daily lives, in many different forms, shows how central data has become. Data is generated in enormous quantities every minute of the day: when one half of the earth sleeps, the other half starts its morning with web surfing, so one can say that data never sleeps. A 2018 Forbes article estimated that 2.5 quintillion bytes of data are created each day, and the figure grows every year. This data comes from Netflix, Amazon, Google, office meetings, messages and emails, hospital and health records, financial firms, the entertainment world, the government sector, social networking sites, shopping sites, and so on. This wealth of data enables personalized experiences for users, and all of it is made usable by big data [1, 26].
“Big data” is a term that covers datasets so large and complex that they cannot be processed using traditional methods of data management. Such data can nevertheless be stored, processed, and analyzed computationally to yield useful results. There is no exact threshold beyond which data counts as big data; a common working definition is any data that cannot be handled with traditional management models, and in general, the more data one has, the more meaningful and resourceful the analysis can be. The big data concept gathered momentum in the early 2000s and has brought major changes to the information management industry, so one needs to know how to turn this data into information and knowledge [2, 27].
Big data passes through various stages, the last of which is analysis, which extracts all the information the dataset can offer. Traditional data analysis was exploratory, considering the past and present form of the data, whereas big data analysis is predictive, focusing on the current state and the future outcome of the data; analytics has shifted from a model-driven to a data-driven process [3, 28]. Another difference is that analysts still build models on structured, clean data but now want to apply those models to unstructured data, which traditional data management methods cannot handle. In big data analysis, models are built using statistical and probabilistic methods, which make it possible to produce real-time predictions and detect anomalies in ways that were not feasible before. The real-time data around us in day-to-day life can take any form, such as finance or government records, research or biological data, and many more; all of it is useless unless it is filtered and conclusions are drawn from it. I have worked with healthcare data to show how big data can be used to gain knowledge and derive conclusions [4, 29].
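To make the idea of statistical anomaly detection concrete, the following is a minimal sketch in Python; the heart-rate values, the function name, and the 2-standard-deviation threshold are illustrative assumptions, not a method prescribed in this chapter.

```python
# Minimal sketch: flag readings that deviate strongly from the mean.
# The samples and the 2-sigma threshold are illustrative assumptions.
import statistics

def detect_anomalies(readings, threshold=2.0):
    """Return readings more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(readings)
    stdev = statistics.stdev(readings)
    return [x for x in readings if abs(x - mean) > threshold * stdev]

# Hypothetical heart-rate samples (beats per minute); 180 is the anomaly.
samples = [72, 75, 71, 74, 73, 70, 76, 180, 74, 72]
print(detect_anomalies(samples))  # -> [180]
```

In a real big data pipeline the same idea would run over streams of millions of records with more robust statistics, but the principle of modeling the data distribution and flagging deviations from it is the same.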
1.2 Big Data: 5 Vs
Big data was first characterized by 3 Vs, but as the term has expanded it is now commonly defined by 5 Vs: Volume, Velocity, Variety, Veracity, and Value. These are the characteristics of big data or, put another way, the parameters used to decide whether a dataset qualifies as big data [5, 30].
1.2.1 Volume
As the name suggests, big data comprises large chunks of data, which is what the word “volume” captures. Volume plays an important role in judging the worth of data, that is, in deciding whether a chunk of data should be considered big data at all. In the healthcare sector, for example, a huge amount of data about each individual is generated daily; this data must be handled properly, and this is where big data techniques come into use.
1.2.2 Velocity
Since the amount of data is very large, the rate at which it is collected must keep pace as well; this is why the term “velocity” is used in the context of big data. Velocity matters because data circulates and accumulates continuously, so it must be processed and analyzed at the same rate if valuable information is to be extracted from it. Suppose a survey is conducted to identify the actual causes of malnutrition among children below the age of 5: the data is collected and interpreted simultaneously, so that every contributing cause can be identified as the responses arrive.
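As a sketch of this streaming, record-at-a-time style of analysis, the following Python fragment tallies causes as survey records arrive, producing an interim result after every record instead of waiting for the full batch; the record layout and the "cause" field are hypothetical.

```python
# Minimal sketch: analyze records at the rate they arrive.
# The survey records and the "cause" field are hypothetical.
def running_counts(records):
    """Incrementally tally reported causes as each record streams in."""
    counts = {}
    for record in records:            # records arrive one at a time
        cause = record["cause"]
        counts[cause] = counts.get(cause, 0) + 1
        yield dict(counts)            # interim tally after every record

stream = [{"cause": "poor diet"}, {"cause": "infection"}, {"cause": "poor diet"}]
for snapshot in running_counts(stream):
    print(snapshot)
```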
1.2.3 Variety
Variety refers to the diversity of the huge amount of data being collected. In the context of big data, “variety” describes whether the data is sorted, that is, structured so that items of the same category fall into the same group, or unsorted, that is, unstructured, with no arrangement at all and no relationship readily established between items.
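The contrast can be sketched in a few lines of Python: the same fact stored as a structured record can be queried directly, while its unstructured form must be parsed first. The field names, the sentence, and the regular expression are illustrative assumptions.

```python
import re

# The same fact in structured and unstructured form (illustrative).
structured = {"patient_id": 101, "age": 4, "weight_kg": 11.2}   # fixed schema
unstructured = "Patient 101, age four, weighed about 11.2 kilograms at intake."

# Structured data can be queried directly by field name...
print(structured["weight_kg"])                                   # -> 11.2

# ...whereas unstructured text must be parsed before any value emerges.
match = re.search(r"(\d+(?:\.\d+)?)\s*kilograms", unstructured)
print(float(match.group(1)) if match else None)                  # -> 11.2
```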
1.2.4 Veracity
Veracity concerns the trustworthiness of data. It tells us about the reliability of the data that has been collected, that is, whether the data is accurate and can be trusted as a basis for analysis.