Saturday, July 23, 2016

MapReduce in Apache Spark

Based on the Course CS120x Distributed Machine Learning with Apache Spark.

In short, the map/reduce paradigm can be summarized as follows:

Map: transforms a series of elements by applying a function individually to each element in the series. It then returns the series of transformed elements.
Filter: also applies a function individually to each element in a series, but the function evaluates to True or False, and only the elements that evaluate to True are retained.
Reduce: operates on pairs of elements in a series. It applies a function that takes in two values and returns a single value. By applying this function repeatedly, reduce collapses a series into a single value.
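The three operations can also be tried on a single machine, without Spark, using Python's built-in `map` and `filter` together with `functools.reduce`. This is only a local sketch of the semantics: the list `numbers` and the lambdas are illustrative choices, not taken from the course.

```python
from functools import reduce

numbers = list(range(10))

# Map: apply a function to every element and keep all the results
doubled = list(map(lambda x: x * 2, numbers))

# Filter: keep only the elements for which the function returns True
odds = list(filter(lambda x: x % 2 == 1, numbers))

# Reduce: combine the elements two at a time until one value is left
total = reduce(lambda a, b: a + b, numbers)

print(doubled)  # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
print(odds)     # [1, 3, 5, 7, 9]
print(total)    # 45
```

Note that map and filter return a series, while reduce returns a single value.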

We define a list of 10 elements and transform it into a Resilient Distributed Dataset (RDD) split across 4 partitions:

# sc is the SparkContext provided by the notebook
numbers = range(0, 10)
numberRDD = sc.parallelize(numbers, 4)
numberRDD.collect()
> Out[1]: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

Map numberRDD using a lambda function that multiplies each element by 5:

numberRDD.map(lambda x:x*5).collect()
> Out[2]: [0, 5, 10, 15, 20, 25, 30, 35, 40, 45]

Filter numberRDD to keep only the multiples of 2:

numberRDD.filter(lambda x:x%2==0).collect()
> Out[3]: [0, 2, 4, 6, 8]

Reduce numberRDD by summing pairs of numbers:

numberRDD.reduce(lambda x1,x2:x1+x2)
> Out[4]: 45
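Because an RDD is split into partitions, Spark's reduce combines elements within each partition and then combines the partial results, so the function passed to it should be commutative and associative. The sketch below imitates this on a single machine in plain Python. The helpers `split_into_partitions` and `partitioned_reduce` are hypothetical, and the way the list is cut into chunks is an assumption made for illustration, not necessarily how Spark distributes the data.

```python
from functools import reduce


def split_into_partitions(data, num_partitions):
    """Hypothetical helper: cut a list into roughly equal contiguous chunks."""
    size, extra = divmod(len(data), num_partitions)
    chunks, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def partitioned_reduce(data, func, num_partitions):
    """Reduce each non-empty chunk on its own, then reduce the partial results."""
    chunks = split_into_partitions(data, num_partitions)
    partials = [reduce(func, chunk) for chunk in chunks if chunk]
    return reduce(func, partials)


numbers = list(range(10))
add = lambda a, b: a + b
sub = lambda a, b: a - b

# Addition is commutative and associative: the partitioning does not matter
sum_1 = partitioned_reduce(numbers, add, 1)
sum_4 = partitioned_reduce(numbers, add, 4)

# Subtraction is neither: the result depends on how the data is split
diff_1 = partitioned_reduce(numbers, sub, 1)
diff_4 = partitioned_reduce(numbers, sub, 4)

print(sum_1, sum_4)    # 45 45
print(diff_1, diff_4)  # -45 5
```

With addition, both partitionings give 45. With subtraction, the answer changes with the number of partitions, which is why such a function is a poor fit for reduce.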

Putting it all together, we multiply each number by 5, keep only the even results, and sum them (0 + 10 + 20 + 30 + 40):

numberRDD.map(lambda x:x*5).filter(lambda x:x%2==0).reduce(lambda x1,x2:x1+x2)
> Out[5]: 100
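To check the chained expression by hand, the same three steps can be traced one at a time in plain Python. This is a local sketch without Spark, and the intermediate names `mapped` and `evens` are illustrative.

```python
from functools import reduce

numbers = list(range(10))

# Step 1 (map): multiply every element by 5
mapped = [x * 5 for x in numbers]

# Step 2 (filter): keep only the even results
evens = [x for x in mapped if x % 2 == 0]

# Step 3 (reduce): sum the remaining elements pairwise
total = reduce(lambda x1, x2: x1 + x2, evens)

print(mapped)  # [0, 5, 10, 15, 20, 25, 30, 35, 40, 45]
print(evens)   # [0, 10, 20, 30, 40]
print(total)   # 100
```

The filter runs on the values produced by the map, not on the original numbers, so the order of the steps in the chain matters.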

This post was written using Markdown and Dillinger. Here is an interesting Markdown Cheatsheet.

Saturday, July 16, 2016

TensorFlow on Databricks

TensorFlow is an Open Source Software Library for Machine Learning and AI tasks.

In recent months it has become a widely used tool in the AI community (and beyond).

Databricks is an interesting Cluster Manager based on Apache Spark. It offers a Community Edition for free (pricing).

Since some ML tasks can be very computationally intensive (e.g. training deep networks), it could be a good idea to run them on a Databricks cluster.

You can run this Notebook on your Databricks cluster (or import it).

Even though the Notebook says that "It is not required for the Databricks Community Edition", in my experience it is necessary for the Community Edition as well.

Notes on Machine Learning, AI, Big Data etc etc
