Spark MLlib (Machine Learning Library)

          Official docs: http://spark.apache.org/docs/1.6.3/mllib-guide.html

     MLlib is Spark’s machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. It consists of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, as well as lower-level optimization primitives and higher-level pipeline APIs.
    In short: common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction.

          There are two packages, spark.mllib and spark.ml:

    It divides into two packages:
        spark.mllib contains the original API built on top of RDDs.
        spark.ml provides higher-level API built on top of DataFrames for constructing ML pipelines.

         spark.mllib is RDD-based while spark.ml is DataFrame-based, so spark.ml is the higher-level API; we start with spark.mllib here.
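
To see the difference concretely, here is a minimal sketch, assuming the Spark 1.6 APIs and a spark-shell session where sc and sqlContext already exist (logistic regression is only used here as an example algorithm), that trains one model with each package:

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.ml.classification.LogisticRegression

// Tiny toy data set: two labeled rows.
val trainingRDD = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(1.0, 0.0)),
  LabeledPoint(0.0, Vectors.dense(0.0, 1.0))))

// spark.mllib: the RDD-based API works directly on RDD[LabeledPoint].
val mllibModel = new LogisticRegressionWithLBFGS().run(trainingRDD)

// spark.ml: the DataFrame-based API expects a DataFrame with "label" and "features" columns.
val trainingDF = sqlContext.createDataFrame(trainingRDD)
val mlModel = new LogisticRegression().fit(trainingDF)

Both models are trained on the same toy data; the spark.ml estimator additionally plugs into Pipeline stages, which is what makes the DataFrame API the higher-level one.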

Data Types

        Local vector — a vector stored locally; holds the features X1 ~ XN
        Labeled point — a local vector paired with a label; represents one row of data (Y together with X1 ~ XN)
        Local matrix — a matrix stored locally
        Distributed matrix — a matrix distributed across the cluster
               RowMatrix
               IndexedRowMatrix
               CoordinateMatrix
               BlockMatrix

    Local vector

            Stored on a single node, with integer-typed, 0-based indices and double-typed values.
            Two flavors: dense vectors and sparse vectors:

                     A dense vector stores a value for every position, zeros included.

                     When many positions are zero there is no need to waste space: a sparse vector only records which positions are nonzero and what their values are.

import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Create a dense vector (1.0, 0.0, 3.0).
val dv: Vector = Vectors.dense(1.0, 0.0, 3.0)  

// Create a sparse vector (1.0, 0.0, 3.0) by specifying its indices and values corresponding to nonzero entries.
// Sparse vector: the first argument is the vector size (3), the first array holds the indices of the nonzero entries, and the second array holds their values.
val sv1: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))
// Create a sparse vector (1.0, 0.0, 3.0) by specifying its nonzero entries.
// Alternatively, pass a Seq of (index, value) pairs.
val sv2: Vector = Vectors.sparse(3, Seq((0, 1.0), (2, 3.0)))

Labeled point

        A labeled point is a local vector, either dense or sparse, associated with a label/response.

       A LabeledPoint consists of a label and a vector and represents one row of data: the label is Y, the vector holds X1 ~ XN.

         For binary classification, a label should be either 0 (negative) or 1 (positive).
         For multiclass classification, labels should be class indices starting from zero: 0, 1, 2, ....

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Create a labeled point with a positive label and a dense feature vector.
val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))

// Create a labeled point with a negative label and a sparse feature vector.
val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))

MLlib supports data stored in LIBSVM format, which looks like this:

label index1:value1 index2:value2 ...

Reading a LIBSVM-formatted file produces an RDD[LabeledPoint] (indices are one-based in the file and converted to zero-based by the loader):

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD

val examples: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
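
The reverse direction is available as well: MLUtils.saveAsLibSVMFile writes an RDD[LabeledPoint] back out in LIBSVM text format (the output directory below is only an example path):

// Write the loaded examples back out in LIBSVM format (one part file per partition).
MLUtils.saveAsLibSVMFile(examples, "data/mllib/sample_libsvm_out")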

Local matrix
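
        A local matrix is stored on a single machine, with integer-typed row and column indices and double-typed values. A minimal sketch using the Matrices factory methods from org.apache.spark.mllib.linalg (dense values are laid out in column-major order):

import org.apache.spark.mllib.linalg.{Matrix, Matrices}

// Create a dense matrix ((1.0, 2.0), (3.0, 4.0), (5.0, 6.0)): 3 rows, 2 columns,
// with the values listed in column-major order.
val dm: Matrix = Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0))

// Create a sparse matrix ((9.0, 0.0), (0.0, 8.0), (0.0, 6.0)) in CSC form:
// 3 rows, 2 columns, then column pointers, row indices, and nonzero values.
val sm: Matrix = Matrices.sparse(3, 2, Array(0, 1, 3), Array(0, 2, 1), Array(9.0, 6.0, 8.0))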


