
Spark Big Data

Spark Big Data refers to the use of Apache Spark, a powerful open-source distributed computing system, for processing and analyzing large-scale datasets. It provides a fast and efficient way to handle big data workloads, offering features such as in-memory processing, fault tolerance, and support for various data sources. Spark Big Data is widely used in data analytics, machine learning, and real-time processing applications.

Written by Perlego with AI-assistance

Key excerpts on "Spark Big Data"

  • Supercomputing Frontiers: 4th Asian Conference, SCFA 2018, Singapore, March 26-29, 2018, Proceedings
    Such frameworks keep data processing in memory and therefore try to efficiently exploit its high speed. Among these frameworks, Spark [36] has become the de facto framework for in-memory Big Data analytics. Spark is now used to run a diverse set of applications, including machine learning and stream processing. For example, Netflix has a Spark cluster of over 8000 machines processing multiple petabytes of data in order to improve the customer experience by providing better recommendations for their streaming services [5]. On the other hand, high performance computing (HPC) systems have recently gained huge interest as a promising platform for performing fast Big Data processing given their high performance nature [1, 17]. HPC systems are equipped with low-latency networks and thousands of nodes with many cores and therefore have the potential to perform fast Big Data processing. For instance, PayPal recently shipped its fraud detection software to HPC systems to be able to detect fraud among millions of transactions in a timely manner [26]. However, when introducing Big Data processing to HPC systems, one should be aware of the different architectural designs of current Big Data processing and HPC systems. Big Data processing systems have a shared-nothing architecture and nodes are equipped with individual disks, so they can co-locate the data and compute resources on the same machine (i.e., the data-centric paradigm). HPC systems, in contrast, employ a shared architecture (e.g., parallel file systems) [19], which results in the separation of data resources from the compute nodes (i.e., the compute-centric paradigm). Figure 1 illustrates these differences in the design of the two systems, and they introduce two major challenges: Big Data applications will face high latencies when performing I/O due to the necessary data transfers between the parallel file system and computation nodes.
  • Hands-On Data Science and Python Machine Learning

    Apache Spark - Machine Learning on Big Data

    So far in this book, we've talked about a lot of general data mining and machine learning techniques that you can use in your data science career, but they've all been running on your desktop. As such, you can only process as much data as a single machine can handle using technologies such as Python and scikit-learn.
    Now, everyone talks about big data, and odds are you might be working for a company that does in fact have big data to process. Big data here means data you can't wrangle on just one system; you need to compute it using the resources of an entire cloud, a cluster of computing resources. That's where Apache Spark comes in. Apache Spark is a very powerful tool for managing big data and doing machine learning on large datasets (a short illustrative sketch follows the topic list below). By the end of this chapter, you will have an in-depth knowledge of the following topics:
    • Installing and working with Spark
    • Resilient Distributed Datasets (RDDs)
    • The MLlib (Machine Learning Library)
    • Decision Trees in Spark
    • K-Means Clustering in Spark
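    To make these topics a little more concrete before diving in, here is a minimal sketch (not the book's own code) that builds an RDD and clusters it with MLlib's K-Means, all in Spark's local mode. The sample points, the app name, and k=2 are made-up values used purely for illustration.

        # Illustrative sketch: an RDD of points clustered with MLlib K-Means in local mode.
        # The data and parameters below are invented for demonstration, not taken from the book.
        from numpy import array
        from pyspark import SparkConf, SparkContext
        from pyspark.mllib.clustering import KMeans

        conf = SparkConf().setMaster("local[*]").setAppName("KMeansSketch")
        sc = SparkContext(conf=conf)

        # An RDD of feature vectors (normally loaded from a file or a distributed store).
        points = sc.parallelize([
            array([0.0, 0.0]), array([0.1, 0.2]),
            array([9.0, 9.0]), array([9.1, 8.8]),
        ])

        # Train a two-cluster K-Means model directly on the RDD.
        model = KMeans.train(points, k=2, maxIterations=10)
        print(model.clusterCenters)                 # learned centroids
        print(model.predict(array([0.2, 0.1])))     # cluster index for a new point

        sc.stop()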

    Installing Spark

    In this section, I'm going to get you set up using Apache Spark, and show you some examples of using Apache Spark to solve some of the same problems that we solved on a single computer earlier in this book. The first thing we need to do is get Spark set up on your computer, so we're going to walk through how to do that in the next couple of sections. It's pretty straightforward stuff, but there are a few gotchas. So, don't just skip these sections; there are a few things you need to pay special attention to in order to get Spark running successfully, especially on a Windows system. Let's get Apache Spark set up on your system, so you can actually dive in and start playing around with it.
    We're going to be running this just on your own desktop for now. But the same programs that we write in this chapter could be run on an actual Hadoop cluster: you can take these scripts that we develop and run locally on your desktop in Spark standalone mode, run them from the master node of a real Hadoop cluster, and let them scale up to the entire power of that cluster to process massive datasets. Even though we're going to set things up to run locally on your own computer, keep in mind that these same concepts will scale up to running on a cluster as well.
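    As a concrete illustration of the "write it locally, scale it to a cluster" idea, here is a minimal sketch (not from the book) of a standalone PySpark script that counts non-empty lines in a text file. The file name input.txt, the script name line_count.py, and the app name are all placeholders; run locally it uses a local[*] master, and the same script could later be handed to spark-submit against a real cluster without changing the logic.

        # Minimal standalone Spark script (hypothetical file name: line_count.py).
        # Runs in local mode; pointing the master at a cluster scales the same code out.
        from pyspark import SparkConf, SparkContext

        conf = SparkConf().setMaster("local[*]").setAppName("LineCount")
        sc = SparkContext(conf=conf)

        lines = sc.textFile("input.txt")                     # placeholder input path
        non_empty = lines.filter(lambda line: len(line.strip()) > 0)
        print("Non-empty lines:", non_empty.count())

        sc.stop()

    Locally this could simply be run with python line_count.py (or spark-submit line_count.py); on a cluster the same script would typically be launched through spark-submit with the appropriate master setting.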
  • Software Architecture for Big Data and the Cloud
    • Ivan Mistrik, Rami Bahsoon, Nour Ali, Maritta Heisel, Bruce Maxim(Authors)
    • 2017(Publication Date)
    • Morgan Kaufmann
      (Publisher)
    It's the first general-purpose compute platform to have emerged after the removal of cluster resource management from the MapReduce paradigm. The Spark computing framework grew from work at UC Berkeley and has quickly gained momentum within the data-intensive computing community due to its performance and flexibility [9]. The speed increase is due, in part, to the Resilient Distributed Dataset (RDD) abstraction that allows working data to be cached in memory, eliminating the need for costly intermediate-stage disk writes [78]. At its core, Apache Spark is a cluster-computing platform that provides an API allowing users to create distributed applications, although it has grown to be the key component in the larger Berkeley Data Analytics Stack (BDAS). Spark was designed to alleviate some of the constraints of the MapReduce programming model, specifically its poor performance when reusing the same dataset for iterative compute processes due to its lack of an abstraction for distributed memory access [78]. Spark has been optimized for compute tasks that require the reuse of the same working dataset across multiple parallel operations, especially iterative machine learning and data analytic tasks, whilst also maintaining scalability and fault tolerance [64]. It has been argued that modern data analytic tasks that require the reuse of data are increasingly common in iterative machine learning and graph computation algorithms [77]. Examples of such algorithms include PageRank, used to rank the popularity of web pages and other linked data sources, and K-means clustering, used to group common members of a dataset together. From the application developer's perspective, Spark allows for the creation of standalone programs in Java, Scala, R, and Python. Interestingly, Spark also offers users the ability to use an interactive shell that runs atop the cluster, behaving much like an interactive Python interpreter.
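    The caching behaviour described above can be illustrated with a small sketch (not from the chapter): the RDD is marked with cache(), so an iterative loop repeatedly touches the in-memory copy instead of rebuilding it each pass. The data, the number of iterations, and the per-pass computation are arbitrary stand-ins for a real iterative algorithm.

        # Illustrative sketch of the iterative-reuse pattern Spark's RDDs were designed for.
        # Values below are arbitrary; a real job would run K-Means, PageRank, etc.
        from pyspark import SparkConf, SparkContext

        sc = SparkContext(conf=SparkConf().setMaster("local[*]").setAppName("CacheSketch"))

        # Working dataset; cache() keeps it in memory after the first action materializes it.
        data = sc.parallelize(range(1, 100001)).cache()

        result = 0.0
        for i in range(10):                         # stand-in for an iterative algorithm
            # Each pass re-reads the cached RDD instead of recomputing it from scratch.
            result = data.map(lambda x: x * (i + 1)).mean()

        print("result of final pass:", result)
        sc.stop()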
  • Practical Big Data Analytics
    • Nataraj Dasgupta, Giancarlo Zaccone, Patrick Hannah(Authors)
    • 2018(Publication Date)
    • Packt Publishing
      (Publisher)
    # COMMAND ----------
    # Pair RDDs
    wordPairs = wordsRDD.map(lambda word: (word, 1))
    print wordPairs.collect()
    # COMMAND ----------
    # #### Part 2: Counting with pair RDDs
    # There are multiple ways of performing group-by operations in Spark.
    # One such method is groupByKey().
    #
    # ** Using groupByKey() **
    #
    # In the pair RDD above, each key (in this case a word) is assigned a value of 1 for
    # our word-count operation. groupByKey() then combines all of the values for each key
    # into a single list. This can be quite memory intensive, especially if the dataset is large.
    # COMMAND ----------
    # Using groupByKey
    wordsGrouped = wordPairs.groupByKey()
    for key, value in wordsGrouped.collect():
        print '{0}: {1}'.format(key, list(value))
    # COMMAND ----------
    # Summation of the key values (to get the word count)
    wordCountsGrouped = wordsGrouped.map(lambda (k, v): (k, sum(v)))
    print wordCountsGrouped.collect()
    # COMMAND ----------
    # ** (2c) Counting using reduceByKey **
    #
    # reduceByKey creates a new pair RDD by iteratively applying the given function to
    # the values of each key, combining them into a single count per key.
    # COMMAND ----------
    wordCounts = wordPairs.reduceByKey(lambda a, b: a + b)
    print wordCounts.collect()
    # COMMAND ----------
    # %md
    # ** Combining all of the above into a single statement **
    # COMMAND ----------
    wordCountsCollected = (wordsRDD
                           .map(lambda word: (word, 1))
                           .reduceByKey(lambda a, b: a + b)
                           .collect())
    print wordCountsCollected
    # COMMAND ----------
    # %md
    #
    # This tutorial has provided a basic overview of Spark and introduced the Databricks
    # community edition, where users can upload and execute their own Spark notebooks.
    # There are various in-depth tutorials on Spark on the web and at Databricks, and
    # users are encouraged to peruse them if interested in learning more about Spark.

    Summary

    In this chapter, we read about some of the core features of Spark, one of the most prominent technologies in the Big Data landscape today. Spark has matured rapidly since 2014, when its first stable release established it as a Big Data solution that alleviated many of Hadoop's shortcomings, such as I/O contention.
    Today, Spark has several components, including dedicated ones for streaming analytics and machine learning, and is being actively developed. Databricks is the leading provider of the commercially supported version of Spark and also hosts a very convenient cloud-based Spark environment with limited resources that any user can access at no charge. This has dramatically lowered the barrier to entry as users do not need to install a complete Spark environment to learn and use the platform.
    In the next chapter, we will begin our discussion on machine learning. Most of the text until this section has focused on the management of large-scale data; making use of that data effectively and gaining insights from it is the focus of the chapters that follow.
  • Networking for Big Data
    • Shui Yu, Xiaodong Lin, Jelena Misic, Xuemin (Sherman) Shen(Authors)
    • 2015(Publication Date)
    Application parallelization is a natural computational paradigm for approaching Big Data problems. However, getting additional computational resources is not as simple as upgrading to a bigger and more powerful machine on the fly, and traditional serial algorithms are inefficient for Big Data. If there is enough data parallelism in the application, users can take advantage of the cloud's reduced cost model to use hundreds of computers at short-term cost [4]. BIG DATA MANAGEMENT SYSTEMS: Many researchers have suggested that commercial database management systems (DBMSs) are not suitable for processing extremely large-scale data, as the classic database server architecture becomes a potential bottleneck when faced with peak workloads. To accommodate various large data processing models, Kossmann et al. [5] presented four different architectures based on the classic multitier database application architecture: partitioning, replication, distributed control, and caching. It is clear that the alternative providers have different business models and target different kinds of applications; Google seems to be more interested in small applications with light workloads, whereas Azure, Microsoft's cloud platform, is currently the most affordable service for medium to large services. Most recent cloud service providers use a hybrid architecture capable of satisfying their actual service requirements. In this section, we mainly discuss Big Data architecture from three key aspects: distributed file systems, nonstructured data storage, and semistructured data storage. [Figure 4.3: Traditional data management.] Distributed File System: The Google File System (GFS) is a distributed file system that supports fault tolerance by data partitioning and replication.
  • Big Data at Work: Dispelling the Myths, Uncovering the Opportunities

    Another commonly used tool is MapReduce, a Google-developed framework for dividing big data processing across a group of linked computer nodes. Hadoop contains a version of MapReduce. These new technologies are by no means the only ones that organizations need to investigate. In fact, the technology environment for big data has changed dramatically over the past several years, and it will continue to do so. There are new forms of databases such as columnar (or vertical) databases; new programming languages—interactive scripting languages like Python, Pig, and Hive are particularly popular for big data; and new hardware architectures for processing data, such as big data appliances (specialized servers) and in-memory analytics (computing analytics entirely within a computer's memory, as opposed to moving data on and off disk storage). There is another key aspect of the big data technology environment that differs from traditional information management. In that previous world, the goal of data analysis was to segregate data into a separate pool for analysis—typically a data warehouse (which contains a wide variety of data sets addressing a variety of purposes and topics) or mart (which typically contains a smaller amount of data for a single purpose or business function). However, the volume and velocity of big data—remember, it can sometimes be described as a fast-moving river of information that never stops—means that it can rapidly overcome any segregation approach. Just to give one example: eBay, which collects a massive amount of online clickstream data from its customers, has more than 40 petabytes of data in its data warehouse—much more than most organizations would be willing to store. And it has much more data in a set of Hadoop clusters—nobody seems to know exactly how much (and the number changes daily), but well over 100 petabytes.
  • Data Mining
    • Ciza Thomas(Author)
    • 2018(Publication Date)
    • IntechOpen
      (Publisher)
    Although many parallel systems are inherently homogeneous, the recent trend in HPC systems is the use of heterogeneous computing resources, where the heterogeneity is generally the result of technological advancement over time. With increasing heterogeneity, cloud computing has emerged as a technology aimed at facilitating heterogeneous and distributed computing platforms. Cloud computing is an important choice for efficient distribution and management of big datasets that cannot be stored in a commodity computer's memory alone. Not only does the data volume increase, but the difficulty of indexing, searching, and transferring the data also grows exponentially with data explosion [19, 20]. Effective data storage, management, distribution, visualization, and especially multi-modal processing in real/near-real-time applications remain open issues for RS [21]. 2. Big data. Big data is characterized as 3V by many studies: volume, velocity, and variety [22, 23]. Volume is the most important big data quality and expresses the size of the dataset. Velocity indicates the rate at which big data is produced, and an increasing rate of production also creates the need for faster processing of the data. Variety refers to the diversity of the different sources of data. Given this variety, the majority of the data obtained is unstructured or semi-structured [21]. Velocity requirements vary according to application area; in general, velocity is addressed in terms of processing within a specific time interval, such as batch processing, near-real-time requirements, continuous input-output (real-time) requirements, and stream processing requirements.
  • Communication, Management and Information Technology
    International Conference on Communication, Management and Information Technology (ICCMIT 2016, Cosenza, Italy, 26-29 April 2016)

    • Marcelo Sampaio de Alencar(Author)
    • 2016(Publication Date)
    • CRC Press
      (Publisher)
    In recent years, many technologies have been developed to process these huge volumes of data. Apache Hadoop [18] is an open source software framework that enables the distributed processing of large data sets across clusters of commodity hardware using simple programming models. There are two main components of Hadoop: the Hadoop Distributed File System (HDFS) and MapReduce. HDFS is a distributed, scalable file system written in Java for the Hadoop framework. MapReduce is a programming paradigm that allows users to define two functions, map and reduce, to process large amounts of data in parallel. Companies like Facebook, Yahoo!, Amazon, Baidu, AOL, and IBM use Hadoop on a daily basis. Hadoop has many advantages, including [19]: cost effectiveness, fault tolerance, flexibility, and scalability. Hadoop also has many related software projects that use the MapReduce and HDFS frameworks, such as Apache Pig, Apache Hive, Apache Mahout, Apache HBase, and others [18]. Apache Pig [1] was originally developed at Yahoo in 2006 for processing big data; in 2007, it was moved into the Apache Software Foundation. It allows people using Hadoop to focus more on analyzing large data sets and spend less time having to write MapReduce programs. Apache Hive [20] was developed at Facebook in 2009. It is data warehouse software for querying and managing large datasets residing in distributed storage, built on top of Apache Hadoop. Hive defines a simple SQL-like query language, called Hive Query Language (HQL), which enables users familiar with SQL to query the data. Hive is optimized for scalability, extensibility, and fault tolerance. Apache HBase [21] is a distributed columnar database that supports structured data storage for very large tables. Jaql [22] was created by workers at IBM Research Labs in 2008 and released to open source. It is a query language for JavaScript Object Notation (JSON), but it supports more than just JSON, such as XML, CSV, flat files, and more.
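    To show what "define two functions, map and reduce" looks like in practice, here is a small sketch (not from this chapter) written in the Hadoop Streaming style, where the mapper and reducer are plain scripts that read lines from standard input and emit tab-separated key/value pairs; the file names mapper.py and reducer.py are assumed for illustration.

        # mapper.py (assumed name): emit (word, 1) for every word read from stdin.
        import sys

        for line in sys.stdin:
            for word in line.strip().split():
                print("%s\t%d" % (word, 1))

        # reducer.py (assumed name): sum the counts for each word. Hadoop sorts the
        # mapper output by key, so identical words arrive on consecutive lines.
        import sys

        current_word, current_count = None, 0
        for line in sys.stdin:
            word, count = line.rstrip("\n").split("\t")
            if word == current_word:
                current_count += int(count)
            else:
                if current_word is not None:
                    print("%s\t%d" % (current_word, current_count))
                current_word, current_count = word, int(count)
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))

    With Hadoop Streaming these two scripts would typically be supplied to the job via its -mapper and -reducer options; the same word-count logic in Spark form appears in the reduceByKey example earlier on this page.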
  • Signal Processing and Networking for Big Data Applications
    Part I, Overview of Big Data Applications; 1 Introduction; 1.1 Background. Today, scientists, engineers, educators, citizens, and decision-makers have unprecedented amounts and types of data available to them. Data come from many disparate sources, including scientific instruments, medical devices, telescopes, microscopes, satellites; digital media including text, video, audio, e-mail, weblogs, Twitter feeds, image collections, click streams, and financial transactions; dynamic sensor, social, and other types of networks; scientific simulations, models, and surveys; or computational analysis of observational data. Data can be temporal, spatial, or dynamic; structured or unstructured. Information and knowledge derived from data can differ in representation, complexity, granularity, context, provenance, reliability, trustworthiness, and scope. Data can also differ in the rate at which they are generated and accessed. The phrase "big data" refers to the kinds of data that challenge existing analytical methods due to size, complexity, or rate of availability. The challenges in managing and analyzing "big data" can require fundamentally new techniques and technologies in order to handle the size, complexity, or rate of availability of these data. At the same time, the advent of big data offers unprecedented opportunities for data-driven discovery and decision-making in virtually every area of human endeavor. A key example of this is the scientific discovery process, which is a cycle involving data analysis, hypothesis generation, the design and execution of new experiments, hypothesis testing, and theory refinement. Realizing the transformative potential of big data requires addressing many challenges in the management of data and knowledge, computational methods for data analysis, and automating many aspects of data-enabled discovery processes.
Index pages curate the most relevant extracts from our library of academic textbooks. They’ve been created using an in-house natural language model (NLM), each adding context and meaning to key research topics.