Original poster: Lisrelchen

[Case Study] Simple Text Classification using Java

package org.apache.spark.examples.ml;

import java.util.List;

import com.google.common.collect.Lists;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.feature.HashingTF;
import org.apache.spark.ml.feature.Tokenizer;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;

/**
 * A simple text classification pipeline that recognizes "spark" from input text. It uses the Java
 * bean classes {@link LabeledDocument} and {@link Document} defined in the Scala counterpart of
 * this example {@link SimpleTextClassificationPipeline}. Run with
 * <pre>
 * bin/run-example ml.JavaSimpleTextClassificationPipeline
 * </pre>
 */
public class JavaSimpleTextClassificationPipeline {

  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("JavaSimpleTextClassificationPipeline");
    JavaSparkContext jsc = new JavaSparkContext(conf);
    SQLContext jsql = new SQLContext(jsc);

    // Prepare training documents, which are labeled.
    List<LabeledDocument> localTraining = Lists.newArrayList(
      new LabeledDocument(0L, "a b c d e spark", 1.0),
      new LabeledDocument(1L, "b d", 0.0),
      new LabeledDocument(2L, "spark f g h", 1.0),
      new LabeledDocument(3L, "hadoop mapreduce", 0.0));
    DataFrame training = jsql.createDataFrame(jsc.parallelize(localTraining), LabeledDocument.class);

    // Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
    Tokenizer tokenizer = new Tokenizer()
      .setInputCol("text")
      .setOutputCol("words");
    HashingTF hashingTF = new HashingTF()
      .setNumFeatures(1000)
      .setInputCol(tokenizer.getOutputCol())
      .setOutputCol("features");
    LogisticRegression lr = new LogisticRegression()
      .setMaxIter(10)
      .setRegParam(0.001);
    Pipeline pipeline = new Pipeline()
      .setStages(new PipelineStage[] {tokenizer, hashingTF, lr});

    // Fit the pipeline to training documents.
    PipelineModel model = pipeline.fit(training);

    // Prepare test documents, which are unlabeled.
    List<Document> localTest = Lists.newArrayList(
      new Document(4L, "spark i j k"),
      new Document(5L, "l m n"),
      new Document(6L, "spark hadoop spark"),
      new Document(7L, "apache hadoop"));
    DataFrame test = jsql.createDataFrame(jsc.parallelize(localTest), Document.class);

    // Make predictions on test documents.
    DataFrame predictions = model.transform(test);
    for (Row r: predictions.select("id", "text", "probability", "prediction").collect()) {
      System.out.println("(" + r.get(0) + ", " + r.get(1) + ") --> prob=" + r.get(2)
          + ", prediction=" + r.get(3));
    }

    jsc.stop();
  }
}
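
The listing relies on two bean classes, LabeledDocument and Document, that the post does not show (its Javadoc says they come from the Scala counterpart of the example). For readers who want to see their shape, here is a minimal sketch of Java equivalents. The holder class DocumentBeans is hypothetical and only there to keep the sketch in one file; the field names id, text and label are inferred from the constructor calls and column names in the listing. In a real project each bean would more likely be a public top-level class in its own file.

```java
import java.io.Serializable;

// Hypothetical holder class so both beans fit in one self-contained file.
public class DocumentBeans {

  // Assumed shape of the unlabeled bean: a numeric id plus the raw text.
  public static class Document implements Serializable {
    private long id;
    private String text;

    public Document() {}

    public Document(long id, String text) {
      this.id = id;
      this.text = text;
    }

    // Getters and setters are what bean introspection uses to find the columns.
    public long getId() { return id; }
    public void setId(long id) { this.id = id; }
    public String getText() { return text; }
    public void setText(String text) { this.text = text; }
  }

  // Assumed shape of the labeled bean: a Document plus a numeric class label.
  public static class LabeledDocument extends Document {
    private double label;

    public LabeledDocument() {}

    public LabeledDocument(long id, String text, double label) {
      super(id, text);
      this.label = label;
    }

    public double getLabel() { return label; }
    public void setLabel(double label) { this.label = label; }
  }
}
```

With beans like these, the "text" column read by the tokenizer and the "id" column selected at the end map onto the getText() and getId() properties.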
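
To make the first two pipeline stages less of a black box, here is a rough plain-Java sketch of what tokenizing and hashed term frequencies amount to. It is not Spark's implementation: the class HashingSketch and its methods are hypothetical, and String.hashCode() is used as a stand-in hash. Spark's real hash function depends on the version, so the bucket indices here should not be expected to match what HashingTF produces.

```java
// Hypothetical illustration class; not part of Spark.
public class HashingSketch {

  // Lowercase the text and split it on whitespace, roughly what the tokenizer stage does.
  public static String[] tokenize(String text) {
    String trimmed = text.toLowerCase().trim();
    if (trimmed.isEmpty()) {
      return new String[0];
    }
    return trimmed.split("\\s+");
  }

  // Map each word to a bucket in [0, numFeatures) and count occurrences per bucket.
  // Assumption: String.hashCode() stands in for whatever hash Spark actually uses.
  public static double[] termFrequencies(String[] words, int numFeatures) {
    double[] tf = new double[numFeatures];
    for (String w : words) {
      int index = Math.floorMod(w.hashCode(), numFeatures);
      tf[index] += 1.0;
    }
    return tf;
  }

  public static void main(String[] args) {
    String[] words = tokenize("spark hadoop spark");
    double[] tf = termFrequencies(words, 1000);
    int sparkBucket = Math.floorMod("spark".hashCode(), 1000);
    System.out.println("bucket for 'spark' = " + sparkBucket + ", count = " + tf[sparkBucket]);
  }
}
```

The vector has a fixed length (1000 in the listing, via setNumFeatures) no matter how many distinct words appear, which is why two different words can occasionally land in the same bucket.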

