Computer Science
Map Reduce and Filter
Map, reduce, and filter are fundamental higher-order functions used in functional programming and data processing. Map applies a function to each element in a collection, producing a new collection. Reduce combines elements of a collection into a single value using a specified operation. Filter selects elements from a collection based on a given condition, creating a new collection containing only the matching elements.
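For instance, in Python the three operations are available as built-ins (plus functools.reduce); the short sketch below is only an illustration of the definitions above:

```python
from functools import reduce

numbers = [1, 2, 3, 4, 5]

# Map: apply a function to each element, producing a new collection.
squares = list(map(lambda x: x * x, numbers))          # [1, 4, 9, 16, 25]

# Filter: keep only the elements that satisfy a condition.
evens = list(filter(lambda x: x % 2 == 0, numbers))    # [2, 4]

# Reduce: combine all elements into a single value with a binary operation.
total = reduce(lambda acc, x: acc + x, numbers, 0)     # 15

print(squares, evens, total)
```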
Written by Perlego with AI-assistance
10 Key excerpts on "Map Reduce and Filter"
- Jimmy Lin, Chris Dyer (Authors)
- 2022 (Publication Date)
- Springer (Publisher)
Therefore, a developer can focus on what computations need to be performed, as opposed to how those computations are actually carried out or how to get the data to the processes that depend on them. Like OpenMP and MPI, MapReduce provides a means to distribute computation without burdening the programmer with the details of distributed computing (but at a different level of granularity). However, organizing and coordinating large amounts of computation is only part of the challenge. Large-data processing by definition requires bringing data and code together for computation to occur—no small feat for datasets that are terabytes and perhaps petabytes in size! MapReduce addresses this challenge by providing a simple abstraction for the developer, transparently handling most of the details behind the scenes in a scalable, robust, and efficient manner. As we mentioned in Chapter 1, instead of moving large amounts of data around, it is far more efficient, if possible, to move the code to the data. This is operationally realized by spreading data across the local disks of nodes in a cluster and running processes on nodes that hold the data. The complex task of managing storage in such a processing environment is typically handled by a distributed file system that sits underneath MapReduce. This chapter introduces the MapReduce programming model and the underlying distributed file system. We start in Section 2.1 with an overview of functional programming, from which MapReduce draws its inspiration. Section 2.2 introduces the basic programming model, focusing on mappers and reducers. Section 2.3 discusses the role of the execution framework in actually running MapReduce programs (called jobs). Section 2.4 fills in additional details by introducing partitioners and combiners, which provide greater control over data flow.
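The excerpt only names partitioners and combiners; as a rough, single-process sketch of where they sit in the data flow, the following Python simulation assumes a word-count style job (the function names and sample data are invented for illustration, not taken from the book):

```python
from collections import defaultdict

NUM_REDUCERS = 2

def mapper(document):
    # Emit an intermediate (word, 1) pair for every word in the document.
    for word in document.split():
        yield (word.lower(), 1)

def combiner(pairs):
    # Local aggregation on the map side: sum counts per word before the shuffle.
    local = defaultdict(int)
    for word, count in pairs:
        local[word] += count
    return local.items()

def partitioner(key):
    # Decide which reducer a given key is routed to.
    return hash(key) % NUM_REDUCERS

def reducer(key, values):
    return key, sum(values)

documents = ["the cat sat", "the dog sat", "the cat ran"]

# One "map task" per document, each followed by its own combiner pass.
reducer_inputs = [defaultdict(list) for _ in range(NUM_REDUCERS)]
for doc in documents:
    for word, count in combiner(mapper(doc)):
        reducer_inputs[partitioner(word)][word].append(count)

# Each reducer combines the values for the keys routed to it.
for bucket in reducer_inputs:
    for word in sorted(bucket):
        print(reducer(word, bucket[word]))
```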
- Jure Leskovec, Anand Rajaraman, Jeffrey David Ullman (Authors)
- 2014 (Publication Date)
- Cambridge University Press (Publisher)
2.2 MapReduce
MapReduce is a style of computing that has been implemented in several systems, including Google’s internal implementation (simply called MapReduce) and the popular open-source implementation Hadoop, which can be obtained, along with the HDFS file system, from the Apache Foundation. You can use an implementation of MapReduce to manage many large-scale computations in a way that is tolerant of hardware faults. All you need to write are two functions, called Map and Reduce, while the system manages the parallel execution, coordination of tasks that execute Map or Reduce, and also deals with the possibility that one of these tasks will fail to execute. In brief, a MapReduce computation executes as follows:
(1) Some number of Map tasks each are given one or more chunks from a distributed file system. These Map tasks turn the chunk into a sequence of key-value pairs. The way key-value pairs are produced from the input data is determined by the code written by the user for the Map function.
(2) The key-value pairs from each Map task are collected by a master controller and sorted by key. The keys are divided among all the Reduce tasks, so all key-value pairs with the same key wind up at the same Reduce task.
(3) The Reduce tasks work on one key at a time, and combine all the values associated with that key in some way. The manner of combination of values is determined by the code written by the user for the Reduce function.
Figure 2.2 suggests this computation. [Figure 2.2: Schematic of a MapReduce computation]
2.2.1 The Map Tasks
We view input files for a Map task as consisting of elements, which can be any type: a tuple or a document, for example. A chunk is a collection of elements, and no element is stored across two chunks.
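A minimal single-machine sketch of the three steps just described, assuming a toy job that finds the maximum reading per sensor (the data, the comma-separated record format, and the function names are illustrative only):

```python
from itertools import groupby
from operator import itemgetter

chunks = [
    ["s1,20", "s2,31"],   # chunk handled by one Map task
    ["s1,25", "s2,18"],   # chunk handled by another Map task
]

# Step (1): each Map task turns the elements of its chunk into key-value pairs.
def map_task(chunk):
    pairs = []
    for record in chunk:
        sensor, reading = record.split(",")
        pairs.append((sensor, int(reading)))
    return pairs

pairs = [kv for chunk in chunks for kv in map_task(chunk)]

# Step (2): the pairs are collected and sorted by key, so equal keys are adjacent.
pairs.sort(key=itemgetter(0))

# Step (3): each Reduce task combines all the values associated with one key.
def reduce_task(key, values):
    return key, max(values)

results = [reduce_task(k, [v for _, v in group])
           for k, group in groupby(pairs, key=itemgetter(0))]
print(results)   # [('s1', 25), ('s2', 31)]
```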
Cloud Computing
Data-Intensive Computing and Scheduling
- Frederic Magoules, Jie Pan, Fei Teng (Authors)
- 2016 (Publication Date)
- Chapman and Hall/CRC (Publisher)
reducer tasks’ start-up is restricted. The loss of task results or failed execution of tasks also produces a wrong final result. With MapReduce, complex issues such as fault-tolerance, data distribution and load balancing are all hidden from the users. MapReduce can handle them automatically. In this way, the MapReduce programming model simplifies parallel programming. This simplicity is retained in all frameworks that implement the MapReduce model. By using these frameworks, the users only have to define two functions, map and reduce, according to their applications.
Fundamentals of the MapReduce model. The idea of MapReduce was inspired by higher-order functions and functional programming. Map and reduce are two primitives in functional programming languages, such as Lisp, Haskell, etc. A map function processes a fragment of a key-value pair list to generate a list of intermediate key-value pairs. A reduce function merges all intermediate values associated with the same key, and produces a list of key-value pairs as output. Refer to the reference [Dean and Ghemawat, 2004] for a more formal description. The syntax of the MapReduce model is the following:
map (key1, value1) → list(key2, value2)
reduce (key2, list(value2)) → list(key2, value3)
In the above expressions, the input data of the map function is a large set of (key1, value1) pairs. Each key-value pair is processed by the map function without depending on other peer key-value pairs. The map function produces another pair of key-values, noted as (key2, value2), where the key (denoted as key2) is not the original key as in the input argument (denoted as key1). The output of the map phase is processed before entering the reduce phase, that is, key-value pairs (key2, value2) are grouped into lists of (key2, value2), each group having the same value of key2. These lists of (key2, value2) are taken as input. [Figure 5.1: Logical view of the MapReduce model.]
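Rendered as Python type hints, the two signatures might be sketched as follows (the type-variable names and the word-count instance are assumptions for illustration):

```python
from typing import Callable, Iterable, TypeVar

K1 = TypeVar("K1"); V1 = TypeVar("V1")
K2 = TypeVar("K2"); V2 = TypeVar("V2"); V3 = TypeVar("V3")

# map (key1, value1) -> list(key2, value2)
MapFn = Callable[[K1, V1], list[tuple[K2, V2]]]

# reduce (key2, list(value2)) -> list(key2, value3)
ReduceFn = Callable[[K2, Iterable[V2]], list[tuple[K2, V3]]]

# One concrete instance: key1 is a document id, value1 the document text.
def wordcount_map(key1: str, value1: str) -> list[tuple[str, int]]:
    return [(word, 1) for word in value1.split()]

def wordcount_reduce(key2: str, values: Iterable[int]) -> list[tuple[str, int]]:
    return [(key2, sum(values))]
```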
Data-Intensive Computing
Architectures, Algorithms, and Applications
- Ian Gorton, Deborah K. Gracio (Authors)
- 2012 (Publication Date)
- Cambridge University Press (Publisher)
The Map function processes a block of input producing a sequence of (key, value) pairs, while the Reduce function processes a set of values associated with a single key. The framework itself is responsible for “shuffling” the output of the Map tasks to the appropriate Reduce task using a distributed sort. [Figure 8.1: Search popularity for the terms “mapreduce” and “hadoop” from 2006 to 2012. (Source: Google Trends.)] The model is sufficiently expressive to capture a variety of algorithms and high-level programming models while allowing programmers to largely ignore the challenges of distributed computing and focus instead on the semantics of their task. Examples include machine learning [53], relational query processing [45, 58, 63, 72], web data processing [21], and spatio-temporal indexing [9]. Hadoop and the MapReduce model have been shown to scale to hundreds or thousands of nodes as early as 2009 [63]. MapReduce clusters can be constructed inexpensively from commodity computers connected in a shared-nothing configuration (that is, neither memory nor storage are shared across nodes). Although the discussion of MapReduce frequently turns to performance and scalability, it is important to realize that the original motivation was to simplify parallel processing on large-scale clusters, calling to mind the title of the original 2004 MapReduce paper by Dean et al. [21]. The popularity of MapReduce indicates that it filled a real gap in the IT landscape, but it is reasonable to ask why this gap existed.
- Anand Rajaraman, Jeffrey David Ullman (Authors)
- 2011 (Publication Date)
- Cambridge University Press (Publisher)
2 Large-Scale File Systems and Map-Reduce
Modern Internet applications have created a need to manage immense amounts of data quickly. In many of these applications, the data is extremely regular, and there is ample opportunity to exploit parallelism. Important examples are: (1) The ranking of Web pages by importance, which involves an iterated matrix-vector multiplication where the dimension is in the tens of billions, and (2) Searches in “friends” networks at social-networking sites, which involve graphs with hundreds of millions of nodes and many billions of edges. To deal with applications such as these, a new software stack has developed. It begins with a new form of file system, which features much larger units than the disk blocks in a conventional operating system and also provides replication of data to protect against the frequent media failures that occur when data is distributed over thousands of disks. On top of these file systems, we find higher-level programming systems developing. Central to many of these is a programming system called map-reduce. Implementations of map-reduce enable many of the most common calculations on large-scale data to be performed on large collections of computers, efficiently and in a way that is tolerant of hardware failures during the computation. Map-reduce systems are evolving and extending rapidly. We include in this chapter a discussion of generalizations of map-reduce, first to acyclic workflows and then to recursive algorithms. We conclude with a discussion of communication cost and what it tells us about the most efficient algorithms in this modern computing environment.
2.1 Distributed File Systems
Most computing is done on a single processor, with its main memory, cache, and local disk (a compute node). In the past, applications that called for parallel processing, such as large scientific calculations, were done on special-purpose parallel computers with many processors and specialized hardware.
The Cloud Computing Book
The Future of Computing Explained
- Douglas Comer (Author)
- 2021 (Publication Date)
- Chapman and Hall/CRC (Publisher)
Hadoop software includes interfaces that support multiple programming languages. That is, a programmer who decides to write Map and Reduce functions can choose a programming language, which can differ from the language in which Hadoop is implemented. Popular language interfaces include:
- Hadoop's Java API for Java programs
- Hadoop's Streaming mechanism for scripting languages (e.g., Python)
- Hadoop's Pipe API for use with C and C++
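For example, with the Streaming mechanism the Map and Reduce functions can be ordinary Python scripts that read standard input and write tab-separated key-value lines; the word-count sketch below is an assumption-laden illustration (file names are arbitrary, and the exact hadoop-streaming invocation varies by distribution):

```python
#!/usr/bin/env python3
# mapper.py -- emit "word<TAB>1" for every word read from standard input.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- input arrives sorted by key, so counts for a word are contiguous.
import sys

current, total = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word != current:
        if current is not None:
            print(f"{current}\t{total}")
        current, total = word, 0
    total += int(count)
if current is not None:
    print(f"{current}\t{total}")
```

A job of this shape is typically submitted with the hadoop-streaming jar (for example, hadoop jar hadoop-streaming-*.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input in -output out), but the jar location and options should be checked against the Hadoop version in use.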
Unlike software systems that restrict programmers to one programming language, Hadoop offers mechanisms that give a programmer a choice among popular languages.
11.20 Summary
Although VMs allow arbitrary applications to be transported to a cloud data center, building cloud-native software offers a chance for improved security, reduced flaws, enhanced performance, increased scalability, and higher reliability.
One particular style of computation is especially appropriate for a cloud data center: parallel processing. The basis for using parallel computers to increase performance lies in three steps: partitioning a problem into pieces, using multiple computers to process the pieces at the same time, and combining the results. Parallel processing does have potential limitations, including unsplittable problems, limited parallelism, limited resources and cost, competition from databases, and communication overhead (especially for I/O-bound computations).
The MapReduce programming paradigm (sometimes called the MapReduce algorithm) offers a formalized framework for parallel processing. MapReduce divides processing into five steps: Split, Map, Shuffle, Reduce, and Merge. Depending on the problem being solved, data can be split into meaningful pieces, fixed-size chunks, or by using a hash of the data items.
MapReduce works best for large, complex problems. For problems that process a small amount of data or for which the computation is trivial, MapReduce introduces unnecessary overhead. A programmer must also consider data transmission because sending data takes time. Data copying can be reduced by arranging for the processors that perform Map operations to access data directly.
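A small sketch of the two mechanical splitting strategies just mentioned, fixed-size chunks versus a hash of the data items (the chunk size, bucket count, and sample records below are arbitrary):

```python
def split_fixed(items, chunk_size):
    # Fixed-size chunks: consecutive slices of the input.
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]

def split_by_hash(items, num_buckets):
    # Hash of the data items: equal items always land in the same piece.
    buckets = [[] for _ in range(num_buckets)]
    for item in items:
        buckets[hash(item) % num_buckets].append(item)
    return buckets

records = ["apple", "pear", "apple", "plum", "fig", "pear"]
print(split_fixed(records, 2))
print(split_by_hash(records, 3))
```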
Big Data Analytics with Hadoop 3
Build highly effective analytics solutions to gain valuable insight into your big data
- Sridhar Alla (Author)
- 2018 (Publication Date)
- Packt Publishing (Publisher)
Big Data Processing with MapReduce
This chapter puts everything we have learned in the book into a practical use case of building an end-to-end pipeline to perform big data analytics. In a nutshell, the following topics will be covered throughout this chapter:
- The MapReduce framework
- MapReduce job types:
  - Single mapper jobs
  - Single mapper reducer jobs
  - Multiple mappers reducer jobs
- MapReduce patterns:
  - Aggregation patterns
  - Filtering patterns
  - Join patterns
The MapReduce framework
MapReduce is a framework used to compute a large amount of data in a Hadoop cluster. MapReduce uses YARN to schedule the mappers and reducers as tasks, using the containers. The MapReduce framework enables you to write distributed applications to process large amounts of data from a filesystem, such as a Hadoop Distributed File System (HDFS), in a reliable and fault-tolerant manner. When you want to use the MapReduce framework to process data, it works through the creation of a job, which then runs on the framework to perform the tasks needed. A MapReduce job usually works by splitting the input data across worker nodes, running the mapper tasks in a parallel manner. An example of using a MapReduce job to count frequencies of words is shown in the following diagram. MapReduce uses YARN as a resource manager, which is shown in the following diagram. At this time, any failures that happen, either at the HDFS level or the failure of a mapper task, are handled automatically, to be fault-tolerant. Once the mappers have completed, the results are copied over the network to other machines running the reducer tasks.
The term MapReduce actually refers to two separate and distinct tasks that Hadoop programs perform. The first is the map job, which takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). The reduce job takes the output from a map as input and combines those data tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the reduce job is always performed after the map job.
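Since the diagrams themselves are not reproduced here, a minimal in-memory Python approximation of the word-count flow may help; it is only a sketch and deliberately ignores YARN, HDFS, and fault tolerance:

```python
from collections import defaultdict

lines = ["hello world", "hello hadoop", "world of hadoop"]

# Map: each input line becomes a list of (word, 1) tuples.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle: group the tuples by key so each reducer sees one word's values.
grouped = defaultdict(list)
for word, one in mapped:
    grouped[word].append(one)

# Reduce: combine the values for each word into a smaller set of tuples.
counts = {word: sum(values) for word, values in grouped.items()}
print(counts)   # {'hello': 2, 'world': 2, 'hadoop': 2, 'of': 1}
```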
- Gabor Szabo, Gungor Polatkan, P. Oscar Boykin, Antonios Chalkiopoulos (Authors)
- 2018 (Publication Date)
- Wiley (Publisher)
Listing 5.1: An illustration of the MapReduce paradigm with a simple Python example. (mapreduce_def.py)

    def map2(fun, items):
        '''The Map operation in MapReduce.
        Slightly different from Python's standard "map".'''
        result = []
        for x in items:
            result.extend(fun(x))
        return result

    def reduce(fun, items):
        '''The Reduce operation in MapReduce.'''
        result = None
        for x in items:
            if result is not None:
                result = fun(result, x)
            else:
                result = x
        return result

    def times2(x):
        yield 2 * x

    def add(x, y):
        return x + y

    if __name__ == '__main__':
        print(map2(times2, [1, 2, 3]))   # Prints [2, 4, 6]
        print(reduce(add, [1, 2, 3]))    # Prints 6

Google developed a system based on these two functions called MapReduce. A similar implementation in Java was created at Yahoo called Hadoop, which became hugely popular as an open-source implementation of the approach. An additional twist that MapReduce and Hadoop put in the toy example presented in the preceding code is that the output of the map function is a pair that is considered as a key and a value. Then, rather than a single reduce function seeing all the map output, for each key, the reduce function is applied to the groups of records having the same key. These systems harness hundreds or thousands of computers, each with some subset of a global filesystem on the local disk. Thus, the first phase, or map phase of the operation, can be performed on computers that are already storing the input data. After the map tasks run, the output is a key-value pair, and all pairs with the same key are sent to the same reducer task.
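Following the "additional twist" described above, the toy example can be extended so that map output is treated as (key, value) pairs and the reduce function is applied per key. The extension below is a sketch in the same spirit, not part of the original listing; the purchase records are invented sample data:

```python
from itertools import groupby
from operator import itemgetter

def map_kv(fun, items):
    '''Like map2, but fun yields (key, value) pairs.'''
    result = []
    for x in items:
        result.extend(fun(x))
    return result

def reduce_by_key(fun, pairs):
    '''Apply the reduce function separately to the values of each key.'''
    out = {}
    for key, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        values = [v for _, v in group]
        result = values[0]
        for v in values[1:]:
            result = fun(result, v)
        out[key] = result
    return out

def purchases(record):
    # Each record is "customer:amount"; emit a (customer, amount) pair.
    name, amount = record.split(":")
    yield (name, float(amount))

pairs = map_kv(purchases, ["ann:3.5", "bob:2.0", "ann:1.5"])
print(reduce_by_key(lambda x, y: x + y, pairs))   # {'ann': 5.0, 'bob': 2.0}
```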
- Nitin Kumar (Author)
- 2021 (Publication Date)
- Mercury Learning and Information (Publisher)
CHAPTER 7: MAP REDUCE
MAPREDUCE PROCESS
MapReduce is a framework on Hadoop for processing a massive amount of data in parallel in a distributed environment. It's a reliable and fault-tolerant process running on common hardware, which makes it cost-effective.
The HDFS stores files in identical, fixed-sized blocks (such as 64 MB or 128 MB). The MapReduce framework accesses and processes these blocks in a distributed parallel environment. MapReduce spreads the jobs across these fixed-size blocks to execute in parallel and later aggregates the output to write into external storage, such as the HDFS or AWS S3. The MapReduce framework works on key-value pairs; it has two main parts, the Mapper and Reducer. Each mapper task uses a small data block as the key-value (<key, value>) pairs and passes the input to the reducer, and the reducer processes a subset of the map task output, aggregates the data, and writes it into the storage, e.g., the HDFS.
The MapReduce framework splits data into a key-value pair, which is passed as <key, value> in different phases and produces a different (modified) set of key-value pairs as the output. Since different stages of the key-value data are written into storage, it is serialized by implementing Writable.
(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
MapReduce tasks go through different phases before writing the final output to the HDFS. The input file first passes to the Mapper task, which converts the input into key-value pairs using InputFormat. Sorting and shuffling provide the opportunity to decide to which Reducer the map output collection, i.e., the key-value pairs, is passed.
In theory, the Reducer starts processing once the entire Mapper task is completed; however, sorting provides a way to start processing the Reducer while the Mapper is still processing. The Reducer aggregates the input, processes it, and then writes into the HDFS file. It's also possible to have zero Reducer tasks. A zero reducer
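A map-only (zero-reducer) job is often used for simple filtering, with mapper output written directly to storage; the sketch below simulates one in plain Python, and the log-filtering scenario and all names are invented for illustration:

```python
def filter_mapper(record):
    # A map-only job: emit the record unchanged when it matches, otherwise nothing.
    if "ERROR" in record:
        yield record

log_lines = [
    "2021-01-01 INFO  starting job",
    "2021-01-01 ERROR disk failure on node 7",
    "2021-01-02 ERROR task timed out",
    "2021-01-02 INFO  job finished",
]

# With zero reducers, the mapper output is the final output.
filtered = [out for line in log_lines for out in filter_mapper(line)]
print(filtered)
```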
Functional Programming in C#
Classic Programming Techniques for Modern Projects
- Oliver Sturm (Author)
- 2011 (Publication Date)
- Wiley (Publisher)
A functional programmer evaluates the order in which certain results, and thereby calculations, depend upon one another. With MapReduce, the process is the same, but the system works differently. It is probably closer to functional programming, but the firm requirement to implement everything in only two formalized steps still seems restrictive. Experts claim that the family of problems that can be solved with the help of MapReduce is extremely large, and there’s the benefit of hiding away all the details of parallelization — every implementation of a problem solution that adheres to the pattern will automatically be parallelized by the engine. For example, consider a structure of data initialized like this:

    static List<Order> InitOrders() {
      return new List<Order> {
        new Order {
          Name = "Customer 1 Order",
          Lines = new List<OrderLine> {
            new OrderLine { ProductName = "Rubber Chicken", ProductPrice = 8.95m, Count = 5 },
            new OrderLine { ProductName = "Pulley", ProductPrice = 0.99m, Count = 5 }
          }
        },
        new Order {
          Name = "Customer 2 Order",
          Lines = new List<OrderLine> {
            new OrderLine { ProductName = "Canister of Grog", ProductPrice = 13.99m, Count = 10 }
          }
        }
      };
    }

This type of data structure may be returned from a relational database, or you might have in-memory objects to work with. Calculating some summaries over this data with the help of MapReduce is easy enough:

    var orderValues = MapReduce(
      o => Functional.Map(ol => Tuple.Create(o.Name, ol.ProductPrice * ol.Count), o.Lines),
      (r, t) => r + t.Item2, 0m, orders);
    foreach (var result in orderValues)
      Console.WriteLine("Order: {0}, Value: {1}", result.Item1, result.Item2);

    var orderLineCounts = MapReduce(
      o => Functional.Map(ol => Tuple.Create(o.Name, 1), o.Lines),
      (r, t) => r + 1, 0, orders);
    foreach (var result in orderLineCounts)
      Console.WriteLine("Order: {0}, Lines: {1}", result.Item1, result.Item2);

The hard part is arriving at the answers.
Index pages curate the most relevant extracts from our library of academic textbooks. They’ve been created using an in-house natural language model (NLM), each adding context and meaning to key research topics.