如何进行Spark中MLlib的本质分析,相信很多没有经验的人对此束手无策,为此本文总结了问题出现的原因和解决方法,通过这篇文章希望你能解决这个问题。
org.apache.spark.ml(http://spark.apache.org/docs/latest/ml-guide.html )
org.apache.spark.ml.attribute org.apache.spark.ml.classification org.apache.spark.ml.clustering org.apache.spark.ml.evaluation org.apache.spark.ml.feature org.apache.spark.ml.param org.apache.spark.ml.recommendation org.apache.spark.ml.regression org.apache.spark.ml.source.libsvm org.apache.spark.ml.tree org.apache.spark.ml.tuning org.apache.spark.ml.util
org.apache.spark.mllib (http://spark.apache.org/docs/latest/mllib-guide.html )
org.apache.spark.mllib.classification
org.apache.spark.mllib.clustering
org.apache.spark.mllib.evaluation
org.apache.spark.mllib.feature
org.apache.spark.mllib.fpm
org.apache.spark.mllib.linalg
org.apache.spark.mllib.linalg.distributed
org.apache.spark.mllib.pmml
org.apache.spark.mllib.random
org.apache.spark.mllib.rdd
org.apache.spark.mllib.recommendation
org.apache.spark.mllib.regression
org.apache.spark.mllib.stat
org.apache.spark.mllib.stat.distributed
org.apache.spark.mllib.stat.test
org.apache.spark.mllib.tree
org.apache.spark.mllib.tree.configuration
org.apache.spark.mllib.tree.impurity
org.apache.spark.mllib.tree.loss
org.apache.spark.mllib.tree.model
org.apache.spark.mllib.util
ML概念
DataFrame: Spark ML uses DataFrame from Spark SQL as an ML dataset, which can hold a variety of data types. E.g., a DataFrame could have different columns storing text, feature vectors, true labels, and predictions.
Transformer: A Transformer is an algorithm which can transform one DataFrame into another DataFrame. E.g., an ML model is a Transformer which transforms DataFrame with features into a DataFrame with predictions.
Estimator: An Estimator is an algorithm which can be fit on a DataFrame to produce a Transformer. E.g., a learning algorithm is an Estimator which trains on a DataFrame and produces a model.
Pipeline: A Pipeline chains multiple Transformers and Estimators together to specify an ML workflow.
Parameter: All Transformers and Estimators now share a common API for specifying parameters.
ML分类和回归
Classification
Logistic regression
Decision tree classifier
Random forest classifier
Gradient-boosted tree classifier
Multilayer perceptron classifier
One-vs-Rest classifier (a.k.a. One-vs-All)
Regression
Linear regression
Decision tree regression
Random forest regression
Gradient-boosted tree regression
Survival regression
Decision trees
Tree Ensembles
Random Forests
Gradient-Boosted Trees (GBTs)
ML聚类
K-means
Latent Dirichlet allocation (LDA)
MLlib 数据类型
Local vector
Labeled point
Local matrix
Distributed matrix
RowMatrix
IndexedRowMatrix
CoordinateMatrix
BlockMatrix
MLlib 分类和回归
Binary Classification: linear SVMs, logistic regression, decision trees, random forests, gradient-boosted trees, naive Bayes
Multiclass Classification:logistic regression, decision trees, random forests, naive Bayes
Regression:linear least squares, Lasso, ridge regression, decision trees, random forests, gradient-boosted trees, isotonic regression
MLlib 聚类
K-means
Gaussian mixture
Power iteration clustering (PIC,多用于图像识别)
Latent Dirichlet allocation (LDA,多用于主题分类)
Bisecting k-means
Streaming k-means
MLlib Models
DecisionTreeModel DistributedLDAModel GaussianMixtureModel GradientBoostedTreesModel IsotonicRegressionModel KMeansModel LassoModel LDAModel LinearRegressionModel LocalLDAModel LogisticRegressionModel MatrixFactorizationModel NaiveBayesModel PowerIterationClusteringModel RandomForestModel RidgeRegressionModel StreamingKMeansModel SVMModel Word2VecModel
Example
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.sql.Row
val training = sqlContext.createDataFrame(Seq( (1.0, Vectors.dense(0.0, 1.1, 0.1)), (0.0, Vectors.dense(2.0, 1.0, -1.0)), (0.0, Vectors.dense(2.0, 1.3, 1.0)), (1.0, Vectors.dense(0.0, 1.2, -0.5)) ))
.toDF("label", "features")
val lr = new LogisticRegression()
println("LogisticRegression parameters:\n" + lr.explainParams() + "\n")
lr.setMaxIter(10).setRegParam(0.01)
val model1 = lr.fit(training)
println("Model 1 was fit using parameters: " + model1.parent.extractParamMap)
val paramMap = ParamMap(lr.maxIter -> 20)
.put(lr.maxIter, 30)
.put(lr.regParam -> 0.1, lr.threshold -> 0.55)
val paramMap2 = ParamMap(lr.probabilityCol -> "myProbability")
val paramMapCombined = paramMap ++ paramMap2
val model2 = lr.fit(training, paramMapCombined)
println("Model 2 was fit using parameters: " + model2.parent.extractParamMap)
test = sqlContext.createDataFrame(Seq( (1.0, Vectors.dense(-1.0, 1.5, 1.3)), (0.0, Vectors.dense(3.0, 2.0, -0.1)), (1.0, Vectors.dense(0.0, 2.2, -1.5)) ))
.toDF("label", "features")
model2.transform(test)
.select("features", "label", "myProbability", "prediction")
.collect()
.foreach { case Row(features: Vector, label: Double, prob: Vector, prediction: Double) => println(s"($features, $label) -> prob=$prob, prediction=$prediction") }
看完上述内容,你们掌握如何进行Spark中MLlib的本质分析的方法了吗?如果还想学到更多技能或想了解更多相关内容,欢迎关注亿速云行业资讯频道,感谢各位的阅读!
亿速云「云服务器」,即开即用、新一代英特尔至强铂金CPU、三副本存储NVMe SSD云盘,价格低至29元/月。点击查看>>
免责声明:本站发布的内容(图片、视频和文字)以原创、转载和分享为主,文章观点不代表本网站立场,如果涉及侵权请联系站长邮箱:is@yisu.com进行举报,并提供相关证据,一经查实,将立刻删除涉嫌侵权内容。
原文链接:https://my.oschina.net/igooglezm/blog/636491