Computer Science
Big Data
Big Data refers to extremely large and complex datasets that traditional data processing applications are unable to handle. It encompasses the collection, storage, and analysis of vast amounts of information to extract valuable insights and make data-driven decisions. Big Data technologies and techniques are essential for managing the volume, velocity, and variety of data in today's digital world.
Written by Perlego with AI-assistance
11 Key excerpts on "Big Data"
- eBook - PDF
Big Data Computing
A Guide for Business and Technology Managers
- Vivek Kale(Author)
- 2016(Publication Date)
- Chapman and Hall/CRC(Publisher)
The answer to these challenges is a scalable, integrated computer systems hardware and software architecture designed for parallel processing of Big Data computing applications. This chapter explores the challenges of Big Data computing.
9.1.1 What Is Big Data?
Big Data can be defined as volumes of data available in varying degrees of complexity, generated at different velocities and varying degrees of ambiguity, that cannot be processed using traditional technologies, processing methods, algorithms, or any commercial off-the-shelf solutions. Data defined as Big Data includes weather, geospatial, and geographic information system (GIS) data; consumer-driven data from social media; enterprise-generated data from legal, sales, marketing, procurement, finance, and human-resources department; and device-generated data from sensor networks, nuclear plants, X-ray and scanning devices, and airplane engines (Figures 9.1 and 9.2).
9.1.1.1 Data Volume
The most interesting data for any organization to tap into today is social media data. The amount of data generated by consumers every minute provides extremely important insights into choices, opinions, influences, connections, brand loyalty, brand management, and much more. Social media sites not only provide consumer perspectives but also competitive positioning, trends, and access to communities formed by common interest. Organizations today leverage the social media pages to personalize marketing of products and services to each customer.
[Figure 9.1: 4V characteristics of Big Data – data volume, velocity, variety, and veracity.]
Many additional applications are being developed and are slowly becoming a reality.
- eBook - ePub
Big Data Analysis for Green Computing
Concepts and Applications
- Rohit Sharma, Dilip Kumar Sharma, Dhowmya Bhatt, Binh Thai Pham(Authors)
- 2021(Publication Date)
- CRC Press(Publisher)
The term Big Data is used to describe the enormous volume of data that cannot be stored, processed, or analyzed using traditional database technologies. These data or data sets are too large and complex for existing traditional databases. The idea of Big Data is broad and includes extensive procedures to recognize the information and interpret it into new insights. Although the term Big Data has been available since the last century, its true meaning was only recognized after social media became popular. Thus, one can say that the term is relatively new to the IT industry and organizations. However, there are several instances where researchers have used the term in their literature. The authors in [9] defined a large volume of scientific data required for visualization as Big Data. Several authors have defined Big Data in different ways. One of the earliest definitions was given by Gartner in 2001 [1]. The Gartner definition does not use the keyword "Big Data"; instead, it characterizes the phenomenon through three Vs (volume, velocity, and variety) and discusses the increasing rate and size of data. This definition was later adopted by various agencies and authors, such as NIST [10], Gartner itself in 2012 [11], and later IBM [12]; others added a fourth V, veracity, to the original three. The authors in [13] explained the term Big Data as "the volume of data that is beyond the current technology's capability to efficiently store, manage, and process". The authors in [14] and [15] used Gartner's 2001 classification of three Vs, volume, variety, and velocity, to define Big Data (Figure 3.1: Three Vs of Big Data). So, based on this discussion, the term Big Data can be defined as a high volume of a variety of data generated at a rapid speed, which the traditional database system cannot fully store, process, and analyze in real time. Let us look into the three Vs as defined by various authors in their work.
- Volume is the measure of information created from a variety of sources that keeps on growing. The major benefit of such a large collection of information is that it helps decision making by identifying hidden patterns through data analytics. Forecasting the weather with, say, 200 parameters will predict it better than forecasting with 4–6 parameters. The volume in Big Data refers to sizes on the order of zettabytes (ZB, 1000^7 bytes, i.e., 10^21 bytes) or yottabytes (YB, 1000^8 bytes, i.e., 10^24 bytes). Thus, it becomes a huge challenge for current infrastructure to store this amount of data. Most companies like to put their old data into archives or logs, i.e., in an offline mode, but the disadvantage of this is that the data are then not available for processing. Thus, Big Data requires scalable and distributed storage, which the cloud offers as object, file, or block storage.
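To make these volume units concrete, here is a minimal Python sketch of the arithmetic; the daily generation rate used is an assumed, illustrative figure rather than one taken from the excerpt:

```python
# Hypothetical back-of-the-envelope calculation of Big Data "volume" scale.
ZB = 1000 ** 7   # 1 zettabyte = 10^21 bytes
YB = 1000 ** 8   # 1 yottabyte = 10^24 bytes

daily_rate_bytes = 2.5 * 10**18   # assumed ~2.5 exabytes generated per day (illustrative)

days_to_one_zb = ZB / daily_rate_bytes
print(f"1 ZB = {ZB:.3e} bytes, 1 YB = {YB:.3e} bytes")
print(f"At {daily_rate_bytes:.1e} bytes/day, one zettabyte accumulates in "
      f"{days_to_one_zb:.0f} days (~{days_to_one_zb / 365:.1f} years)")
```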
Given that the enormous volume of data cannot be processed by a traditional database, the options left for processing are either breaking the data into chunks for a massively parallel processing framework such as Apache Hadoop, or using a massively parallel database like Greenplum. A data warehouse or database requires predefined schemas to be entered into the system; given the other V – variety – this is again not feasible for Big Data. Apache Hadoop places no such condition on the structure of the data and can process it without a predefined schema. For storing data, the Hadoop framework uses its own distributed file system, known as the Hadoop Distributed File System (HDFS). When data are required by any node, HDFS provides that data to the node. A typical Hadoop framework has three steps for storing data [16
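As a rough, self-contained illustration of the schema-free processing style described above, the following is a minimal word-count job written as a Hadoop Streaming mapper and reducer in Python. It is a generic sketch rather than the book's own example; the HDFS paths, file names, and submission command in the docstring are assumptions:

```python
#!/usr/bin/env python3
"""Minimal Hadoop Streaming word count (illustrative sketch).

Run as mapper:   python3 wordcount.py map
Run as reducer:  python3 wordcount.py reduce

An assumed invocation, once the raw file has been copied into HDFS:
  hdfs dfs -put access_log.txt /data/raw/
  hadoop jar hadoop-streaming.jar \
      -input /data/raw/access_log.txt -output /data/counts \
      -mapper "python3 wordcount.py map" -reducer "python3 wordcount.py reduce" \
      -file wordcount.py
"""
import sys

def mapper():
    # Emit one "word<TAB>1" pair per word; no schema is imposed on the input.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Hadoop sorts mapper output by key, so equal words arrive consecutively.
    current, count = None, 0
    for line in sys.stdin:
        word, _, value = line.rstrip("\n").partition("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(value)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```

Because the mapper and reducer only read and write plain text on standard streams, the job imposes no schema on its input, which is the property the excerpt highlights.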
- eBook - PDF
Guide to Cloud Computing for Business and Technology Managers
From Distributed Computing to Cloudware Applications
- Vivek Kale(Author)
- 2014(Publication Date)
- Chapman and Hall/CRC(Publisher)
The answer to these challenges is a scalable, integrated computer systems hardware and software architecture designed for parallel processing of Big Data computing applications. This chapter explores the challenges of Big Data computing.
21.1.1 What Is Big Data?
Big Data can be defined as volumes of data available in varying degrees of complexity, generated at different velocities and varying degrees of ambiguity, which cannot be processed using traditional technologies, processing methods, algorithms, or any commercial off-the-shelf solutions. Data defined as Big Data include weather; geospatial and GIS data; consumer-driven data from social media; enterprise-generated data from legal, sales, marketing, procurement, finance, and human-resources department; and device-generated data from sensor networks, nuclear plants, x-ray and scanning devices, and airplane engines.
21.1.1.1 Data Volume
The most interesting data for any organization to tap into today are social media data. The amount of data generated by consumers every minute provides extremely important insights into choices, opinions, influences, connections, brand loyalty, brand management, and much more. Social media sites provide not only consumer perspectives but also competitive positioning, trends, and access to communities formed by common interest. Organizations today leverage the social media pages to personalize marketing of products and services to each customer. Every enterprise has massive amounts of e-mails that are generated by its employees, customers, and executives on a daily basis. These e-mails are all considered an asset of the corporation and need to be managed as such. After Enron and the collapse of many audits in enterprises, the US government mandated that all enterprises should have a clear life-cycle management of e-mails and that e-mails should be available and auditable on a case-by-case basis.
- eBook - PDF
- Peter Bühlmann, Petros Drineas, Michael Kane, Mark van der Laan(Authors)
- 2016(Publication Date)
- Chapman and Hall/CRC(Publisher)
All of these factors suggest a kind of ubiquity of data, but also contain a functionally vague understanding, which is situationally determined; because of that, the concept can be deployed in many contexts, has many advocates, and can be claimed by many as well. Partly because of this context-sensitive definition of the concept of Big Data, it is by no means a phenomenon of our time or a novelty, but has a long genealogy that goes back to the earliest civilizations. Some aspects of this phenomenon will be discussed in the following sections. In addition, we will show in this chapter how Big Data embody a conception of data science on at least two levels. First of all, data science is the technical-scientific discipline specialized in managing the multitude of data: collect, store, access, analyze, visualize, interpret, and protect. It is rooted in computer science and statistics; computer science is traditionally oriented toward data structures, algorithms, and scalability, while statistics is focused on analyzing and interpreting the data. In particular, we may identify here the triptych of database technology/information retrieval, computational intelligence/machine learning, and finally inferential statistics. The first pillar concerns database/information retrieval technology. Both have been core disciplines of computer science for many decades. Emerging from this tradition in recent years, notably researchers at Google and Yahoo have been working on techniques to cluster many computers in a data center, making data accessible and allowing for data-intensive calculations: think, for example, of Bigtable, the Google File System, a programming paradigm such as MapReduce, and the open-source variant Hadoop. The paper of Halevy precedes this development as well. The second pillar relates to intelligent algorithms from the field of computational intelligence (machine learning and data mining).
- eBook - ePub
The Data Revolution
A Critical Analysis of Big Data, Open Data and Data Infrastructures
- Rob Kitchin(Author)
- 2021(Publication Date)
- SAGE Publications Ltd(Publisher)
The most common initial delineators of Big Data were ‘the 3Vs’ of volume, velocity and variety (Laney 2001; Zikopoulos et al. 2012). Big Data are:
- huge in volume, consisting of terabytes or petabytes of data;
- high in velocity, being created in real time;
- diverse in variety, being structured and unstructured in nature.
Prior to Big Data, databases were constrained across these three attributes and it was only possible for two to exist at any one time (volume and velocity; varied and velocity; volume and varied) (Croll 2012). A number of technological developments over the past three decades enabled these three attributes to become simultaneously achievable, including:
- ubiquitous computing and the widespread rollout of information and communication technologies, especially fixed and mobile internet;
- the embedding of software into all kinds of objects, machines and systems, transforming them from ‘dumb’ to ‘smart’, as well as the creation of purely digital devices and systems, and advances in database design and systems of information management;
- distributed and ‘forever storage’ of data at affordable costs; and
- new forms of data analytics designed to cope with data abundance as opposed to data scarcity.
(See Chapter 5 of the first edition for full discussion of these enablers.) The 3Vs, however, are not the only attributes of Big Data enabled by these developments. Other identified qualities include:
Big Data Architect's Handbook
A guide to building proficiency in tools and systems used by leading big data experts
- Syed Muhammad Fahad Akhtar(Author)
- 2018(Publication Date)
- Packt Publishing(Publisher)
If we take a simpler definition, big data can basically be stated as a huge volume of data that cannot be stored and processed using the traditional approach. As this data may contain valuable information, it needs to be processed in a short span of time. This valuable information can be used for predictive analyses, as well as for marketing and many other purposes. If we use the traditional approach, we will not be able to accomplish this task within the given time frame, as the storage and processing capacity would not be sufficient for these types of tasks. That was a simpler definition, to help understand the concept of big data. The more precise version is as follows: data that is massive in volume, with respect to the processing system, with a variety of structured and unstructured data containing different data patterns to be analyzed. From traffic patterns and music downloads to web history and medical records, data is recorded, stored, and analyzed to enable technology and services to produce the meaningful output that the world relies on every day. If we just keep holding on to the data without processing it, or if we don't store the data, considering it of no value, this may work to the company's disadvantage. Have you ever considered how YouTube suggests the videos that you are most likely to watch? How Google serves you localized ads, specifically targeted as ones that you are going to open, or for the product you are looking for? These companies keep track of all the activities you perform on their websites and utilize them for an overall better user experience, as well as for their own benefit, to generate revenue. There are many examples of this type of behavior, and they are increasing as more and more companies realize the power of data. This raises a challenge for technology researchers: coming up with more robust and efficient solutions that can cater to new challenges and requirements. Now that we have some understanding of what big data is, we will move ahead and discuss its different characteristics.
Characteristics of Big Data
These are also known as the dimensions of Big Data. In 2001, Doug Laney first presented what became known as the three Vs of Big Data to describe some of the characteristics that make Big Data different from other data processing. These three Vs are volume, velocity, and variety. This is an era of technological advancement, and a great deal of research is ongoing. As a result of this research and these advancements, the three Vs have now become the six Vs of Big Data, and the number may grow further in the future. As of now, the six Vs of Big Data are volume, velocity, variety, veracity, variability, and value, as illustrated
- Zhu Han, Mingyi Hong, Dan Wang(Authors)
- 2017(Publication Date)
- Cambridge University Press(Publisher)
Part I Overview of Big Data Applications
1 Introduction
1.1 Background
Today, scientists, engineers, educators, citizens, and decision-makers have unprecedented amounts and types of data available to them. Data come from many disparate sources, including scientific instruments, medical devices, telescopes, microscopes, satellites; digital media including text, video, audio, e-mail, weblogs, twitter feeds, image collections, click streams, and financial transactions; dynamic sensor, social, and other types of networks; scientific simulations, models, and surveys; or computational analysis of observational data. Data can be temporal, spatial, or dynamic; structured or unstructured. Information and knowledge derived from data can differ in representation, complexity, granularity, context, provenance, reliability, trustworthiness, and scope. Data can also differ in the rate at which they are generated and accessed. The phrase "Big Data" refers to the kinds of data that challenge existing analytical methods due to size, complexity, or rate of availability. The challenges in managing and analyzing "Big Data" can require fundamentally new techniques and technologies in order to handle the size, complexity, or rate of availability of these data. At the same time, the advent of Big Data offers unprecedented opportunities for data-driven discovery and decision-making in virtually every area of human endeavor. A key example of this is the scientific discovery process, which is a cycle involving data analysis, hypothesis generation, the design and execution of new experiments, hypothesis testing, and theory refinement. Realizing the transformative potential of Big Data requires addressing many challenges in the management of data and knowledge, computational methods for data analysis, and automating many aspects of data-enabled discovery processes.
- eBook - ePub
- Marine Corlosquet-Habart, Jacques Janssen(Authors)
- 2018(Publication Date)
- Wiley-ISTE(Publisher)
This expansive volume of data is what brought forth the Big Data phenomenon. With current data stores unable to absorb such growth in data volumes, companies, engineers and researchers have had to create new solutions, notably offering distributed storage and processing of these masses of data (see section 1.4). The places that store this data, the famous data centers, also raise significant questions in terms of energy consumption. One report highlights the fact that data centers handling American data consumed 91 billion kWh of electricity in 2013, equivalent to the annual output of 34 large coal-fired power plants [DEL 14]. This figure is likely to reach 140 billion kWh in 2020, equivalent to the annual output of 50 power plants, costing the American population $13 billion per year in electricity bills. If we add to this the emission of 100 million metric tons of CO2 per year, it is easy to see why large organizations have very quickly started taking this problem seriously, as demonstrated by the frequent installation of data centers in cold regions around the world, with ingenious systems for recycling natural energy [EUD 16].
1.3.3. Velocity
The last of the three historic Vs, the V for velocity, represents what would probably more naturally be called speed. It also covers multiple components, and it is intrinsic to the Big Data phenomenon. This is clear from the figures above regarding the development of the concept and volume of data, like a film in fast-forward. Speed can refer to the speed at which the data are generated, the speed at which they are transmitted and processed, and also the speed at which they can change form, provide value and, of course, disappear. Today, we must confront large waves of masses of data that must be processed in real time.
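As a small illustration of processing data at the moment it arrives rather than after it has been archived, here is a minimal Python sketch of a sliding-window count over a simulated event stream; the stream source, event rate, and window length are invented for illustration and are not taken from the excerpt:

```python
from collections import deque
import random
import time

def event_stream(n=50):
    """Simulate a high-velocity stream of (timestamp, value) events."""
    for _ in range(n):
        yield time.time(), random.random()
        time.sleep(0.01)  # events arrive continuously, not in batches

# Count events seen in the last `window` seconds, updating as each event arrives.
window = 0.1
recent = deque()
for ts, value in event_stream():
    recent.append(ts)
    while recent and ts - recent[0] > window:
        recent.popleft()  # discard events that fell out of the window
    print(f"events in last {window}s: {len(recent)}")
```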
- eBook - PDF
- Matthias Dehmer, Frank Emmert-Streib, Stefan Pickl, Andreas Holzinger(Authors)
- 2016(Publication Date)
- Chapman and Hall/CRC(Publisher)
Section 2.3.4 addresses further applications before we conclude this chapter in Section 2.4.
2.2 Background
Data generation and collection have grown continually over the last 10 years. At the same time, storage has become more and more affordable [1, 6]. The appearance of smartphones and tablets enabled people to be connected to the Internet almost anywhere at any time. Along with the expansion of broadband networks, those devices are enabled by integrated sensors to generate additional usable data, for example, motion profiles. Health and fitness tracking is now also possible due to so-called wearables. These developments opened a huge market for new companies in software development but also generated new growth opportunities for established companies (Internet of things, smart metering, etc.). With respect to analytics and enabling new products, Big Data may create significant value for organizations [8]. Big Data is classified mostly with the three V's of volume, velocity, and variety [8]. Regarding quality and accuracy, there is a fourth V for veracity [9]. These four V's present the—mostly technical—challenges. In addition, the three F's of fast, flexible, and focused, together with other functions and software components, are aspects that need to be considered to find a holistic approach for a Big Data platform [10]. The beginning of Big Data can be dated to 2004, when Google worked on their Bigtable project [3]. In the same year Facebook was founded. Since then there have been many new, more or less popular, developments which are capable of coping with the vast amount of data [4, 11, 12]. It is desirable that performance increases linearly with the number of servers [13]. However, this linear scalability is not achieved by the implemented server clusters. As shown by [3], the performance per server drops by a factor of 2–5 when increasing the number of servers from 1 to 500. There are also other approaches addressing scalability.
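To make the sub-linear scaling observation concrete, here is a small Python sketch comparing ideal linear scaling with the per-server slowdown of roughly 2-5x at 500 servers that the excerpt cites; the single-server baseline throughput is an assumed figure:

```python
# Ideal vs observed aggregate throughput when per-server performance
# drops by a factor of 2-5 at 500 servers (degradation figures from the
# excerpt; the single-server baseline is an assumed number for illustration).
base = 10_000  # requests/s on a single server (assumption)

ideal_500 = 500 * base
for drop in (2, 5):
    observed_500 = 500 * (base / drop)
    print(f"500 servers, per-server drop {drop}x: "
          f"ideal {ideal_500:,} req/s, observed {observed_500:,.0f} req/s "
          f"({observed_500 / ideal_500:.0%} of linear scaling)")
```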
- eBook - PDF
Big Data at Work
Dispelling the Myths, Uncovering the Opportunities
- Thomas Davenport(Author)
- 2014(Publication Date)
- Harvard Business Review Press(Publisher)
Another commonly used tool is MapReduce, a Google-developed framework for dividing Big Data processing across a group of linked computer nodes. Hadoop contains a version of MapReduce. These new technologies are by no means the only ones that organizations need to investigate. In fact, the technology environment for Big Data has changed dramatically over the past several years, and it will continue to do so. There are new forms of databases such as columnar (or vertical) databases; new programming languages—interactive scripting languages like Python, Pig, and Hive are particularly popular for Big Data; and new hardware architectures for processing data, such as Big Data appliances (specialized servers) and in-memory analytics (computing analytics entirely within a computer's memory, as opposed to moving on and off disk storage). There is another key aspect of the Big Data technology environment that differs from traditional information management. In that previous world, the goal of data analysis was to segregate data into a separate pool for analysis—typically a data warehouse (which contains a wide variety of data sets addressing a variety of purposes and topics) or mart (which typically contains a smaller amount of data for a single purpose or business function). However, the volume and velocity of Big Data—remember, it can sometimes be described as a fast-moving river of information that never stops—means that it can rapidly overcome any segregation approach. Just to give one example: eBay, which collects a massive amount of online clickstream data from its customers, has more than 40 petabytes of data in its data warehouse—much more than most organizations would be willing to store. And it has much more data in a set of Hadoop clusters—nobody seems to know exactly (and the number changes daily), but well over 100 petabytes.
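As a rough sketch of why the columnar (vertical) layouts mentioned above suit in-memory analytics, the following Python example contrasts a row-oriented layout with a column-oriented one for a simple aggregate query; the table, field names, and sizes are invented for illustration:

```python
import random

# Row-oriented layout: one record per row, as a transactional database stores it.
rows = [{"user_id": i, "price": random.uniform(1, 100), "qty": random.randint(1, 5)}
        for i in range(100_000)]

# Column-oriented layout: one contiguous list per column, as a columnar store keeps it.
columns = {
    "user_id": [r["user_id"] for r in rows],
    "price":   [r["price"] for r in rows],
    "qty":     [r["qty"] for r in rows],
}

# An analytical query ("total revenue") touches only two of the three columns.
# Row layout: every record must be visited and each field accessed per record.
revenue_rows = sum(r["price"] * r["qty"] for r in rows)

# Column layout: only the two relevant columns are scanned, sequentially.
revenue_cols = sum(p * q for p, q in zip(columns["price"], columns["qty"]))

assert abs(revenue_rows - revenue_cols) < 1e-6
print(f"total revenue: {revenue_cols:,.2f}")
```

In a real columnar engine the per-column layout additionally enables compression and vectorized execution, which is a large part of what makes in-memory analytics over wide tables practical.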
- eBook - PDF
Big Data, Big Analytics
Emerging Business Intelligence and Analytic Trends for Today's Businesses
- Michael Minelli, Michele Chambers, Ambiga Dhiraj(Authors)
- 2012(Publication Date)
- Wiley(Publisher)
Today we can run the algorithm, look at the results, extract the results, and feed the business process—automatically and at massive scale, using all of the data available. We continue our conversation with Mehta later in the book. For the moment, let's boil his observations down to three main points:
1. The technology stack has changed. New proprietary technologies and open-source inventions enable different approaches that make it easier and more affordable to store, manage, and analyze data.
2. Hardware and storage is affordable and continuing to get cheaper to enable massive parallel processing.
3. The variety of data is on the rise and the ability to handle unstructured data is on the rise.
Data Discovery: Work the Way People's Minds Work
There is a lot of buzz in the industry about data discovery, the term used to describe the new wave of business intelligence that enables users to explore data, make discoveries, and uncover insights in a dynamic and intuitive way versus predefined queries and preconfigured drill-down dashboards. This approach has resonated with many business users who are looking for the freedom and flexibility to view Big Data. In fact, there are two software companies that stand out in the crowd by growing their businesses at unprecedented rates in this space: Tableau Software and QlikTech International. Both companies' approach to the market is much different than the traditional BI software vendor. They grew through a sales model that many refer to as "land and expand." It basically works by getting intuitive software in the hands of some business users to get in the door and grow upward. In the past, BI players typically went for the big IT sale to be the preferred tool for IT to build reports for the business users to then come and use. In order to succeed at the BI game of the "land and expand" model, you need a product that is easy to use with lots of sexy output.
Index pages curate the most relevant extracts from our library of academic textbooks. They’ve been created using an in-house natural language model (NLM), each adding context and meaning to key research topics.










