Based on the Course CS120x Distributed Machine Learning with Apache Spark.

The map/reduce paradigm can be summarized as follows:

Map: transforms a series of elements by applying a function to each element individually, and returns the series of transformed elements.
Filter: also applies a function to each element individually, but the function evaluates to True or False, and only the elements that evaluate to True are retained.
Reduce: operates on pairs of elements in a series. It applies a function that takes two values and returns a single value. By applying this function repeatedly, reduce collapses the series into a single value.
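
As a rough illustration, the same three operations exist in plain Python as the built-ins map and filter and as functools.reduce. The sketch below runs locally without Spark; the variable names are only illustrative.

```python
from functools import reduce

numbers = [1, 2, 3, 4]

# map: apply a function to each element individually
squared = list(map(lambda x: x * x, numbers))

# filter: keep only the elements for which the function returns True
odd = list(filter(lambda x: x % 2 == 1, numbers))

# reduce: combine elements pairwise until a single value remains
total = reduce(lambda a, b: a + b, numbers)

print(squared)  # [1, 4, 9, 16]
print(odd)      # [1, 3]
print(total)    # 10
```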

We define a sequence of 10 elements and, using the SparkContext sc, transform it into a Resilient Distributed Dataset (RDD) split across 4 partitions:

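numbers = range(0, 10)  # local Python sequence 0..9
numberRDD = sc.parallelize(numbers, 4)  # distribute it over 4 partitions
numberRDD.collect()  # bring all elements back to the driver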
> Out[1]: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
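
The second argument to sc.parallelize is the number of partitions the data is split into. The plain-Python sketch below imitates such a split with a hypothetical helper, split_into_partitions, that cuts the sequence into contiguous, nearly equal chunks. This chunking rule is an assumption made for illustration, and the layout Spark actually chooses may differ; numberRDD.glom().collect() shows the real one.

```python
def split_into_partitions(data, num_partitions):
    """Hypothetical helper: cut data into contiguous, nearly equal chunks."""
    items = list(data)
    n = len(items)
    # chunk i covers the index range [i*n // p, (i+1)*n // p)
    return [
        items[i * n // num_partitions:(i + 1) * n // num_partitions]
        for i in range(num_partitions)
    ]

partitions = split_into_partitions(range(0, 10), 4)
print(partitions)  # [[0, 1], [2, 3, 4], [5, 6], [7, 8, 9]]
```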

Map numberRDD using a lambda function that multiplies each element by 5:

numberRDD.map(lambda x:x*5).collect()
> Out[2]: [0, 5, 10, 15, 20, 25, 30, 35, 40, 45]

Filter numberRDD to keep only the multiples of 2:

numberRDD.filter(lambda x:x%2==0).collect()
> Out[3]: [0, 2, 4, 6, 8]

Reduce numberRDD by summing pairs of numbers:

numberRDD.reduce(lambda x1,x2:x1+x2)
> Out[4]: 45
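
Because the elements live in several partitions, a distributed reduce typically combines the values inside each partition first and then merges the partial results, which is why the function passed to reduce should be commutative and associative. The sketch below imitates this in plain Python; the partition layout and the helper partitioned_reduce are assumptions for illustration, not Spark's actual implementation.

```python
from functools import reduce
from operator import add, sub

# Assumed layout of the 10 numbers over 4 partitions (illustrative only)
partitions = [[0, 1], [2, 3, 4], [5, 6], [7, 8, 9]]

def partitioned_reduce(func, parts):
    """Hypothetical helper: reduce each partition, then merge the partial results."""
    partials = [reduce(func, part) for part in parts]
    return reduce(func, partials)

flat = [x for part in partitions for x in part]

# Addition is associative and commutative: the grouping does not matter
print(partitioned_reduce(add, partitions), reduce(add, flat))  # 45 45

# Subtraction is not: the grouped result differs from the sequential one
print(partitioned_reduce(sub, partitions), reduce(sub, flat))  # 15 -45
```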

Putting it all together, we multiply each number by 5, keep only the even results, and sum them:

numberRDD.map(lambda x:x*5).filter(lambda x:x%2==0).reduce(lambda x1,x2:x1+x2)
> Out[5]: 100
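
To see where this result comes from, the chain can be traced step by step in plain Python. This is a local sketch without Spark, and the intermediate variable names are illustrative.

```python
from functools import reduce

numbers = list(range(0, 10))

# Step 1: multiply each element by 5
mapped = [x * 5 for x in numbers]

# Step 2: keep only the even results
evens = [x for x in mapped if x % 2 == 0]

# Step 3: sum the remaining values pairwise
total = reduce(lambda x1, x2: x1 + x2, evens)

print(mapped)  # [0, 5, 10, 15, 20, 25, 30, 35, 40, 45]
print(evens)   # [0, 10, 20, 30, 40]
print(total)   # 100
```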

This post was written using Markdown and Dillinger. Here is an interesting Markdown Cheatsheet.
