Spark ML TF-IDF example in Java (Spark 1.x, taken from the Apache Spark examples): tokenize each sentence, hash the words into a fixed-size term-frequency vector with HashingTF, then rescale the counts with a fitted IDF model.

package org.apache.spark.examples.ml;

// $example on$
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.ml.feature.HashingTF;
import org.apache.spark.ml.feature.IDF;
import org.apache.spark.ml.feature.IDFModel;
import org.apache.spark.ml.feature.Tokenizer;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
// $example off$

public class JavaTfIdfExample {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("JavaTfIdfExample");
    JavaSparkContext jsc = new JavaSparkContext(conf);
    SQLContext sqlContext = new SQLContext(jsc);

    // $example on$
    // A tiny corpus of labelled sentences. Labels must be doubles to match the schema below.
    JavaRDD<Row> jrdd = jsc.parallelize(Arrays.asList(
      RowFactory.create(0.0, "Hi I heard about Spark"),
      RowFactory.create(0.0, "I wish Java could use case classes"),
      RowFactory.create(1.0, "Logistic regression models are neat")
    ));
    StructType schema = new StructType(new StructField[]{
      new StructField("label", DataTypes.DoubleType, false, Metadata.empty()),
      new StructField("sentence", DataTypes.StringType, false, Metadata.empty())
    });
    DataFrame sentenceData = sqlContext.createDataFrame(jrdd, schema);

    // Split each sentence into words.
    Tokenizer tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words");
    DataFrame wordsData = tokenizer.transform(sentenceData);

    // Hash the words into a term-frequency vector of fixed size numFeatures.
    int numFeatures = 20;
    HashingTF hashingTF = new HashingTF()
      .setInputCol("words")
      .setOutputCol("rawFeatures")
      .setNumFeatures(numFeatures);
    DataFrame featurizedData = hashingTF.transform(wordsData);

    // Fit an IDF model on the corpus and rescale the raw term frequencies to TF-IDF.
    IDF idf = new IDF().setInputCol("rawFeatures").setOutputCol("features");
    IDFModel idfModel = idf.fit(featurizedData);
    DataFrame rescaledData = idfModel.transform(featurizedData);

    // Print the TF-IDF vector and label for each document.
    for (Row r : rescaledData.select("features", "label").take(3)) {
      Vector features = r.getAs(0);
      Double label = r.getDouble(1);
      System.out.println(features);
      System.out.println(label);
    }
    // $example off$

    jsc.stop();
  }
}
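The example above targets Spark 1.x (SQLContext and DataFrame). On Spark 2.x and later, the entry point is SparkSession and DataFrame is Dataset<Row>. A minimal sketch of the same pipeline under that assumption (class name JavaTfIdfExampleSpark2 is just illustrative) might look like this:

import java.util.Arrays;
import java.util.List;

import org.apache.spark.ml.feature.HashingTF;
import org.apache.spark.ml.feature.IDF;
import org.apache.spark.ml.feature.IDFModel;
import org.apache.spark.ml.feature.Tokenizer;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class JavaTfIdfExampleSpark2 {
  public static void main(String[] args) {
    // SparkSession replaces SparkConf/JavaSparkContext/SQLContext in Spark 2.x+.
    SparkSession spark = SparkSession.builder().appName("JavaTfIdfExampleSpark2").getOrCreate();

    List<Row> data = Arrays.asList(
      RowFactory.create(0.0, "Hi I heard about Spark"),
      RowFactory.create(0.0, "I wish Java could use case classes"),
      RowFactory.create(1.0, "Logistic regression models are neat")
    );
    StructType schema = new StructType(new StructField[]{
      new StructField("label", DataTypes.DoubleType, false, Metadata.empty()),
      new StructField("sentence", DataTypes.StringType, false, Metadata.empty())
    });
    Dataset<Row> sentenceData = spark.createDataFrame(data, schema);

    // Same pipeline as above: tokenize, hash to term frequencies, rescale with IDF.
    Dataset<Row> wordsData = new Tokenizer()
      .setInputCol("sentence").setOutputCol("words").transform(sentenceData);
    Dataset<Row> featurizedData = new HashingTF()
      .setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(20).transform(wordsData);
    IDFModel idfModel = new IDF()
      .setInputCol("rawFeatures").setOutputCol("features").fit(featurizedData);

    // Show the label and TF-IDF vector for each document without truncation.
    idfModel.transform(featurizedData).select("label", "features").show(false);

    spark.stop();
  }
}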