Mathematics
Large Data Set
A large data set refers to a collection of data that is too large to be processed or analyzed using traditional methods. These data sets typically contain millions or billions of data points and require specialized tools and techniques to extract meaningful insights. The study of large data sets is known as big data analytics.
Written by Perlego with AI-assistance
3 Key excerpts on "Large Data Set"
- eBook - PDF
- Brett Laursen, Todd D. Little, Noel A. Card (Authors)
- 2012(Publication Date)
- The Guilford Press(Publisher)
We then discuss some specific statistical issues in using large-scale longitudinal data sets and end with a section on how to apply this method to developmental science questions.
What Is a Large-Scale Data Set?
The first issue that should be addressed is what is meant by the term large-scale data set. Basically, a large-scale data set is a data set that has broad or expansive research applications. These types of data sets need not be “large” in terms of sample size (i.e., a large N), but they do need to serve as a rich source of research, whether the source of the data is primary or secondary. Secondary data analyses are often represented by large national population studies. These data sets are collected for all researchers to use and thus are not the “primary” data of any single researcher or research team. Primary data sets, however, are often associated with community samples in which individual researchers have collected data based on their own theories or models. These data sets have generally been proprietary and not available to other researchers except those on the research team. Thus the distinguishing element between primary and secondary data is whether or not the researcher(s) collected the data themselves. However, both types of data are “large-scale” data sets provided that they can be used to answer a broad array of research questions. Large-scale data sets typically fall into one of two categories: population or process. Population data sets can be thought of as national, representative data sets. These data sets are usually collected by research organizations that gather data across the country, and the sample is chosen by a specific sampling strategy to make sure that all racial/ethnic, gender, and economic groups are represented in the data set. They are generally collected to be used by the general research community and may not be based on any particular theory of development or family functioning.
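The "specific sampling strategy" mentioned in the excerpt is, in effect, stratified sampling: the population is divided into groups and each group is sampled separately so that none is left out. As a rough sketch of the idea (not taken from the chapter; the tiny DataFrame, the "group" column, and the 40% fraction are invented purely for illustration), a stratified draw can be expressed in Python with pandas:

```python
import pandas as pd

# Hypothetical illustration of stratified sampling (not from the excerpt):
# draw the same fraction from every group so each group is represented.
population = pd.DataFrame({
    "group":  ["A", "A", "A", "B", "B", "C", "C", "C", "C", "C"],
    "income": [42, 55, 61, 38, 47, 52, 58, 63, 70, 49],
})

# Sample 40% of each group rather than 40% of the whole table, so smaller
# groups are not crowded out by larger ones.
stratified_sample = population.groupby("group").sample(frac=0.4, random_state=0)
print(stratified_sample)
```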
- eBook - ePub
An Introduction to Numerical Methods
A MATLAB® Approach
- Abdelwahab Kharab, Ronald Guenther (Authors)
- 2023(Publication Date)
- Chapman and Hall/CRC(Publisher)
Chapter 17 Dealing with Large Sets of Data
DOI: 10.1201/9781003354284-17
In this chapter, we give a brief description of how to deal with large, often vast data sets. Given a massive volume of data, even modern data processing software can be challenged. Some examples of big data sets are stock exchanges, social media sites, banking and securities, medical applications, energy and utilities, and so on. Given effective ways to gather and store data, we are faced with the challenge of interpreting the data, learning from it, and using it to make decisions. We shall assume that the data we are dealing with is good data; that is, every effort has been made to gather and store honestly collected data, and it has not been “fudged” or changed for nefarious reasons. If the data has been tampered with, or if that is suspected, then it may be a matter for criminal prosecution, and the point of the investigation of the data and the questions asked are quite different from what we will be doing today.
17.1 Introduction
We begin with a few obvious remarks and then illustrate some of the basic problems with a project not usually done in a mathematics course. First, data often comes from the results of experiments. No experiment is perfectly repeatable. One tries very hard to repeat an experiment under the same conditions, and one comes very close, but perfect agreement is never possible, and so one obtains data that is close, but not perfect. How does one deal with data that purportedly measures the same phenomenon under very similar conditions but does not quite agree? Even more difficulties occur when one has a massive amount of data to interpret. We begin with a simple experiment. One of the ways that the acceleration due to gravity was initially determined was to study the motion of a simple pendulum. The point here is not to determine g
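The excerpt is cut off before the worked example, but the idea behind the pendulum project can be sketched: for small swings of a pendulum of length L, the period is T = 2π√(L/g), so each timed swing yields an estimate g = 4π²L/T². The following Python sketch (not the textbook's MATLAB code; the length, noise level, and number of trials are invented for illustration) shows how repeated measurements that do not quite agree are combined into a single estimate with a spread:

```python
import numpy as np

# Hypothetical sketch (not the textbook's code): estimate g from repeated,
# imperfectly repeatable timings of a simple pendulum.
# For small swings of a pendulum of length L, the period is
#   T = 2*pi*sqrt(L/g)  =>  g = 4*pi^2 * L / T^2.

rng = np.random.default_rng(0)

L = 1.00                                  # pendulum length in metres (assumed)
true_g = 9.81                             # used only to simulate the "experiment"
true_T = 2 * np.pi * np.sqrt(L / true_g)

# Fifty repeated timings; small independent errors stand in for the fact that
# no two runs of the experiment agree exactly.
measured_T = true_T + rng.normal(0.0, 0.02, size=50)

# One estimate of g per trial, then summarize the agreement across trials.
g_estimates = 4 * np.pi**2 * L / measured_T**2
print(f"mean g = {g_estimates.mean():.3f} m/s^2, "
      f"spread (std dev) = {g_estimates.std(ddof=1):.3f} m/s^2")
```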
- Subhendu Kumar Pani, Somanath Tripathy, George Jandieri, Sumit Kundu, Talal Ashraf Butt (Authors)
- 2022(Publication Date)
- River Publishers(Publisher)
Machine learning is promising when large-scale data processing is considered, as it has the capability of learning from previous data and incorporating those findings into new incoming data. Although its applications in big data are limitless, it faces a vast number of challenges. This paper gives an introduction to the implementation of machine learning in big data. We discuss four of the five Vs of big data and the challenges faced in this regard. Some previous works are also discussed in the related work section. Some of the tools for batch analysis, stream analysis, and interactive analysis are explained here. Under the section on machine learning algorithms in big data, popular algorithms in the supervised, semi-supervised, and unsupervised categories are briefly discussed. Machine learning also finds its way into sectors like healthcare, financial services, automotive, etc. It can also be used in chatbots, recommendation engines, user modeling, predictive analytics, etc. Then come the challenges faced by machine learning algorithms. Under each characteristic there are several areas that pose a challenge and can be improved to enhance the productivity of the algorithm. Large and complex datasets are also difficult to process. New techniques need to evolve, or the existing technologies need to be updated, for enhanced processing and better results.
Keywords: Big data, machine learning algorithms, tools and techniques, applications.
7.1 Introduction
With the onset of large volumes of data, traditional methods of data storage need to be enhanced. The authors in [1] define big data as a collection of very large, complex datasets whose processing becomes difficult using traditional data-processing applications. This paper [1] also mentions that the biggest source of data, about 90% of it, is social media, and also discusses the five Vs, or big data characteristics. The first is volume, which refers to the complex and large quantity of data generated every second. The authors in [2] mention that in the near future data sets will be generated in zettabytes, which is far more than the processing capacity of current database systems. Next comes velocity, which is the speed at which data is generated. It is mentioned in [2] that this is the speed with which data comes from different sources and the speed with which it flows. The authors in [3] refer to velocity as the requirement that data be collected and analyzed in a timely manner. Variety of the data refers to the different categories of data available; it may be unstructured or structured. Value of the data refers to how much relevant information the data can provide us. It is also mentioned in [3] that, for the purpose of revenue, the data can be sold to third parties. Veracity of the data refers to both its validity and reliability. Veracity is described in [3] as the quality of the data, which needs to be good; proper sanitization and cleaning of the data need to be done so that the results we get are effective and accurate. Figure 7.1 represents the 5 Vs of big data.
Coming to machine learning, it is a component of artificial intelligence and can make intelligent decisions. Machine learning algorithms can be of three types: supervised, semi-supervised, and unsupervised. This paper mainly focuses on the introduction of machine learning in big data. With increasing amounts of data, building prediction models is becoming more difficult with the existing set of tools and techniques. Therefore, the authors in [4] mention the necessity of processing large amounts of real-time data in a way that increases the accuracy of predictions and can be incorporated into applications of different domains. Section 7.2 discusses some of the previous works that implement machine learning in big data. Section 7.3 discusses some of the tools used in big data. Section 7.4 mentions some of the machine learning algorithms that are used in big data. Sections 7.5 and 7.6 list its applications and challenges, respectively. We finally end the paper with Sections 7.7
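To make the volume characteristic concrete, one common workaround when a data set exceeds available memory is to stream it in chunks and keep only running aggregates. The sketch below is purely illustrative and not from the chapter; the CSV file name and the "amount" column are hypothetical:

```python
import pandas as pd

# Illustrative sketch only (not from the chapter): when a table is too large
# to hold in memory, read it in chunks and keep running aggregates.
# "transactions.csv" and its "amount" column are hypothetical names.

total = 0.0
rows = 0
for chunk in pd.read_csv("transactions.csv", chunksize=100_000):
    total += chunk["amount"].sum()
    rows += len(chunk)

print(f"rows processed: {rows}, mean amount: {total / rows:.2f}")
```

The same chunk-and-aggregate pattern is what batch-analysis tools automate at much larger scale, distributing the chunks across many machines instead of reading them sequentially.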
Index pages curate the most relevant extracts from our library of academic textbooks. They’ve been created using an in-house natural language model (NLM), each adding context and meaning to key research topics.


