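Click-through rate (CTR) prediction is a very common machine learning problem, and Spark is well suited to processing large datasets and building models for it. In this example, we use Spark to build a CTR prediction model that estimates the probability that a given ad will be clicked.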
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("CTR Prediction").getOrCreate()
// Load data from a CSV file
val data = spark.read.option("header", "true").option("inferSchema", "true").csv("data.csv")
import org.apache.spark.ml.feature.{StringIndexer, OneHotEncoder, VectorAssembler}
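// Convert the categorical ad_id column into a numeric index, then one-hot encode it.
// handleInvalid = "keep" puts ad_ids not seen during fitting into an extra bucket instead of throwing an error.
val indexer = new StringIndexer().setInputCol("ad_id").setOutputCol("ad_id_index").setHandleInvalid("keep")
val encoder = new OneHotEncoder().setInputCol("ad_id_index").setOutputCol("ad_id_encoded")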
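// Assemble all features into a single feature vector.
// VectorAssembler only accepts numeric, boolean, or vector columns, so ad_content and ad_position are assumed to be numeric here.
val assembler = new VectorAssembler().setInputCols(Array("ad_id_encoded", "ad_content", "ad_position")).setOutputCol("features")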
import org.apache.spark.ml.classification.LogisticRegression
// Instantiate a logistic regression model
val lr = new LogisticRegression().setLabelCol("clicked").setFeaturesCol("features")
import org.apache.spark.ml.Pipeline
// Create a pipeline for feature engineering and model training
val pipeline = new Pipeline().setStages(Array(indexer, encoder, assembler, lr))
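// Split the data into training and test sets (the 80/20 ratio and the seed are illustrative choices)
val Array(trainingData, testData) = data.randomSplit(Array(0.8, 0.2), seed = 42L)
// Fit the pipeline to the training data
val model = pipeline.fit(trainingData)
// Generate predictions on the held-out test data
val predictions = model.transform(testData)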
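import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
// Evaluate the model using the raw scores rather than the hard 0/1 predictions,
// so that the ROC curve reflects the model's ranking of examples
val evaluator = new BinaryClassificationEvaluator().setLabelCol("clicked").setRawPredictionCol("rawPrediction").setMetricName("areaUnderROC")
val roc = evaluator.evaluate(predictions)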
println(s"Area under ROC curve: $roc")
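To make the printed metric easier to interpret: the area under the ROC curve equals the probability that a randomly chosen clicked example receives a higher score than a randomly chosen non-clicked one, with ties counted as one half. The following is a minimal sketch of that pairwise definition in plain Scala, not Spark's actual implementation. The object name `AucSketch` and the sample scores are hypothetical and only serve as illustration.

```scala
object AucSketch {
  // Each element is (model score, label), where label is 1 for clicked and 0 for not clicked.
  // Returns the fraction of (positive, negative) pairs in which the positive scores higher;
  // tied pairs count as half a win.
  def auc(scored: Seq[(Double, Int)]): Double = {
    val pos = scored.collect { case (s, 1) => s }
    val neg = scored.collect { case (s, 0) => s }
    require(pos.nonEmpty && neg.nonEmpty, "need at least one example of each class")
    val wins = for (p <- pos; n <- neg) yield
      if (p > n) 1.0 else if (p == n) 0.5 else 0.0
    wins.sum / (pos.size.toLong * neg.size)
  }
}

// Hypothetical scores: three of the four (positive, negative) pairs are ordered correctly.
val sample = Seq((0.9, 1), (0.8, 0), (0.7, 1), (0.3, 0))
println(AucSketch.auc(sample)) // prints 0.75
```

This pairwise version is quadratic in the number of examples, so it is only suitable for small samples. A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking.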
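This example shows how to build a CTR prediction model with Spark and evaluate it on held-out data. Hopefully this tutorial helps you better understand how to use Spark for machine learning tasks.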