Which Spark Machine Learning API Should You Use?

A brief introduction to Spark MLlib’s APIs for basic statistics, classification, clustering, and collaborative filtering, and what they can do for you.

But what can machine learning do for you? And how will you find out? There’s a good place to start close to home, if you’re already using Apache Spark for batch and stream processing. Along with Spark SQL and Spark Streaming, which you’re probably already using, Spark provides MLLib, which is, among other things, a library of machine learning and statistical algorithms in API form.

Here is a brief guide to four of the most essential MLlib APIs, what they do, and how you might use them.

Basic statistics

Mainly you’ll use these APIs for A-B testing or A-B-C testing. Frequently in business we assume that if two averages are the same then the two things are roughly equivalent. That isn’t necessarily true. Consider if a car manufacturer replaces the seat in a car and surveys customers on how comfortable it is. At one end the shorter customers may say the seat is much more comfortable. At the other end, taller customers will say it is really uncomfortable to the point that they wouldn’t buy the car and the people in the middle balance out the difference.

Read more at InfoWorld