Saturday, July 23, 2016

MapReduce in Apache Spark

Based on the course CS120x: Distributed Machine Learning with Apache Spark.

Basically, the map/reduce paradigm can be summarized as follows (a short plain-Python sketch of the three operations appears right after the list):
  • Map: transforms a series of elements by applying a function individually to each element in the series. It then returns the series of transformed elements.
  • Filter: applies a function individually to each element in a series, but the function evaluates to True or False, and only the elements that evaluate to True are retained.
  • Reduce: operates on pairs of elements in a series. It applies a function that takes in two values and returns a single value. Using this function, reduce is able to, iteratively, “reduce” a series to a single value.
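Before moving to Spark, here is a minimal sketch of the same three operations using plain Python built-ins (map, filter, and functools.reduce); the variable names are only for this example.

from functools import reduce

nums = list(range(10))
mapped = list(map(lambda x: x * 5, nums))              # [0, 5, 10, ..., 45]
filtered = list(filter(lambda x: x % 2 == 0, mapped))  # keep only the even values
total = reduce(lambda a, b: a + b, filtered)           # 100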
We define a list of 10 elements and transform it into a Resilient Distributed Dataset (RDD) with 4 partitions:
numberRDD = range(0,10)
numberRDD = sc.parallelize(numberRDD, 4)
numberRDD.collect()
> Out[1]: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] 
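Since we asked for 4 partitions, we can also check how the elements are distributed across them; a small sketch using getNumPartitions and glom (which groups each partition's elements into a list). The exact split may vary with the Spark version and configuration.

numberRDD.getNumPartitions()   # 4
numberRDD.glom().collect()     # e.g. [[0, 1], [2, 3, 4], [5, 6], [7, 8, 9]]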
Map the numberRDD using a lambda function that multiplies each element by 5:
numberRDD.map(lambda x:x*5).collect()
> Out[2]: [0, 5, 10, 15, 20, 25, 30, 35, 40, 45] 
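Note that map is a transformation: it is lazy and only builds a new RDD, so nothing is computed until an action such as collect or count is called. A small sketch:

multiplied = numberRDD.map(lambda x: x * 5)   # lazy: defines a new RDD, no job runs yet
multiplied.count()                            # action: triggers the computation, returns 10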
Filter the numberRDD to keep only the elements that are multiples of 2:
numberRDD.filter(lambda x:x%2==0).collect()
> Out[3]: [0, 2, 4, 6, 8]
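The lambda can also be replaced by a named predicate, which helps readability for more complex conditions; is_even below is just an example name, not part of the original notebook.

def is_even(x):
    return x % 2 == 0

numberRDD.filter(is_even).collect()   # [0, 2, 4, 6, 8]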
Reduce the numberRDD by summing pairs of numbers:
numberRDD.reduce(lambda x1,x2:x1+x2)
> Out[4]: 45
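The function passed to reduce should be commutative and associative, because Spark applies it within and across partitions in no guaranteed order. For a plain sum, operator.add works just as well as the lambda; a small sketch:

from operator import add

numberRDD.reduce(add)   # 45, same result as the lambda above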
Putting it all together, we obtain the sum of the elements that are still even after being multiplied by 5 (equivalently, five times the sum of the even numbers in the original RDD):
numberRDD.map(lambda x:x*5).filter(lambda x:x%2==0).reduce(lambda x1,x2:x1+x2)
> Out[5]: 100
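The chain reads as a single pipeline because each transformation returns a new RDD. As an aside, for this particular case the final reduce could be replaced by the sum action with the same result; a sketch:

numberRDD.map(lambda x: x * 5) \
         .filter(lambda x: x % 2 == 0) \
         .sum()   # 100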
This post has been written using Markdown and Dillinger. Here is an interesting Markdown Cheatsheet.
