OP: ReneeBK

Chalk: A Scala Library for Natural Language Processing



Hidden content of this post:

https://github.com/scalanlp/chalk





#2 ReneeBK posted on 2016-04-21 10:37:43
package chalk.corpora

import io.Source
import collection.mutable.ArrayBuffer
import nak.data.{Observation, Example}
import java.io.{File, FileInputStream, InputStream}

/**
 * Reads tag sequences in the conll shared task format. See http://mlcomp.org/faq/domains "Sequence Tagging" for the spec.
 * @author dlwh
 */
object CONLLSequenceReader {
  def readTrain(f: InputStream, name: String = "sequence", splitToken: String = " "): Iterator[Example[IndexedSeq[String], IndexedSeq[IndexedSeq[String]]]] = {
    val source = Source.fromInputStream(f).getLines()
    new Iterator[Example[IndexedSeq[String], IndexedSeq[IndexedSeq[String]]]] {
      def hasNext = source.hasNext
      var index = 0
      def next(): Example[IndexedSeq[String], IndexedSeq[IndexedSeq[String]]] = {
        val inputs = new ArrayBuffer[IndexedSeq[String]]()
        val outputs = new ArrayBuffer[String]
        import scala.util.control.Breaks._
        breakable {
          while (source.hasNext) {
            val line = source.next()
            if (line.trim().isEmpty) break

            val split = line.split(splitToken)
            inputs += split.take(split.length - 1).toIndexedSeq
            outputs += split.last
          }
        }
        val id = name + "-" + index
        index += 1
        Example(outputs, inputs, id)
      }
    }
  }

  /**
   * This format reads a CONLL file with the last column (i.e. the label) missing. If you have the label,
   * use readTrain, even if you plan on testing with it. Silly, I know.
   * @param f
   * @param name
   * @return
   */
  def readTest(f: InputStream, name: String = "test-sequence", splitToken: String = " "): Iterator[Observation[IndexedSeq[IndexedSeq[String]]]] = {
    val source = Source.fromInputStream(f).getLines()
    new Iterator[Observation[IndexedSeq[IndexedSeq[String]]]] {
      def hasNext = source.hasNext
      var index = 0
      def next() = {
        val inputs = new ArrayBuffer[IndexedSeq[String]]()
        import scala.util.control.Breaks._
        breakable {
          while (source.hasNext) {
            val line = source.next()
            if (line.trim().isEmpty) break

            val split = line.split(splitToken)
            inputs += split
          }
        }
        val id = name + "-" + index
        index += 1
        Observation(inputs, id)
      }
    }
  }

  def main(args: Array[String]) {
    println(readTrain(new FileInputStream(new File(args(0)))).length)
    println(readTest(new FileInputStream(new File(args(1)))).length)
  }
}
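The per-sentence splitting inside readTrain above can be sketched without the nak dependency. This is a minimal stand-alone sketch; `ConllSketch` and `parseBlock` are illustrative names, not part of chalk:

```scala
import scala.collection.mutable.ArrayBuffer

// A minimal, dependency-free sketch of readTrain's inner loop: each non-blank
// line is "feat1 feat2 ... label"; a blank line ends the sentence block.
object ConllSketch {
  def parseBlock(lines: Iterator[String], splitToken: String = " ")
      : (IndexedSeq[IndexedSeq[String]], IndexedSeq[String]) = {
    val inputs  = ArrayBuffer[IndexedSeq[String]]()
    val outputs = ArrayBuffer[String]()
    while (lines.hasNext) {
      val line = lines.next()
      if (line.trim.isEmpty)
        return (inputs.toIndexedSeq, outputs.toIndexedSeq)
      val split = line.split(splitToken)
      inputs  += split.take(split.length - 1).toIndexedSeq // all but the last column: features
      outputs += split.last                                // last column: the tag
    }
    (inputs.toIndexedSeq, outputs.toIndexedSeq)
  }
}
```

Fed the block "John NNP B-PER" / "runs VBZ O" followed by a blank line, this yields the feature rows ("John", "NNP") and ("runs", "VBZ") with tags "B-PER" and "O".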

#3 ReneeBK posted on 2016-04-21 10:38:52
package chalk.corpora

/*
 Copyright 2009 David Hall, Daniel Ramage

 Licensed under the Apache License, Version 2.0 (the "License");
 you may not use this file except in compliance with the License.
 You may obtain a copy of the License at

 http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
*/

//import breeze.util.Implicits._
import java.net._
import java.io.File
import scala.xml.XML

/**
 * Class to help load corpora. Each corpus is represented by a class
 * of type "Corpus" which knows how to load the data.
 *
 * Corpora are located in one of two ways:
 * <ul>
 *  <li>Provide a class name for the corpus for a class already in the class path.</li>
 *  <li>Located through one or more public repositories, usually your
 *     MAVEN_HOME and the breeze maven repository.
 *
 *     Files located in this way must be installed in:
 *      /path/to/repository/org/breeze/corpora/(corpus name in lower case)/(version)/(corpus-name)-(version).jar
 *
 *     The jar file itself must have a file named breeze-corpus.xml which has the format:
 *   <code>
 *       <corpus>
 *         <classname>absolute classname</classname>
 *       </corpus>
 *   </code>
 *    This class is then created and returned.
 *   </li>
 * </ul>
 *
 * @author dlwh
 */
class Corpora(val repositories: Seq[URL]) {
  import Corpora._
  def this(remoteURL: String) = this(Array(CorpusUtils.defaultCorpusRepo, new URL(remoteURL)))

  /**
   * Create a new instance of the corpus.
   */
  def load[T](clss: Class[U] forSome { type U <: Corpus[T] }) = {
    clss.newInstance
  }

  /**
   * Either loads the Corpus with the given class name, or searches the repositories for the corpus with the (lowercase) name.
   */
  def load[T](name: String): Corpus[T] = {
    val clss = cForName[T](name) orElse cForName[T]("breeze.corpora." + name)
    clss match {
      case None => locateCorpus(name)
      case Some(c) => load(c.asSubclass(classOf[Corpus[T]]))
    }
  }

  /**
   * Searches for a corpus in the available repositories.
   */
  def locateCorpus[T](name: String): Corpus[T] = {
    val jarLocation = findJar(name)
    val cl = new URLClassLoader(Array(jarLocation))
    val className = {
      val strm = cl.getResourceAsStream("breeze-corpus.xml")
      val xml = XML.load(strm)
      println(xml)
      strm.close()
      (xml \ "classname").text.trim()
    }
    println(className + " " + jarLocation)
    load(cl.loadClass(className).asSubclass(classOf[Corpus[T]]))
  }

  private def findJar(name: String) = {
    val lowerName = name.toLowerCase
    val urls = repositories.iterator

    var jar: URL = null
    while (urls.hasNext && (jar eq null)) {
      val url = urls.next
      try { // is it a directory?
        val f = new File(url.toURI)
        val location = new File(new File(f, "org/breeze/corpora/"), lowerName)
        println(location)
        if (location.exists && location.isDirectory) {
          val latestVersion = location.listFiles.filter(_.isDirectory).reduceLeft { (x, y) =>
            new File(lexicographicOrder(x.getName, y.getName))
          }
          println(latestVersion)
          jar = latestVersion.listFiles.filter(_.getName.endsWith(".jar"))(0).toURI.toURL
        }
      } catch { // not a file, so we'll hope it's a maven repo
        case _: Exception =>
          val location = url.toString + "/org/breeze/corpora/" + lowerName
          val metadata = new URL(location + "/maven-metadata.xml")
          try {
            val stream = metadata.openStream()
            try {
              val xml = XML.load(stream)
              val latestVersion = (xml \ "version").text
              jar = new URL(location + "/" + latestVersion + "/" + lowerName + "-" + latestVersion + ".jar")
            } finally {
              stream.close()
            }
          } catch {
            case _: Exception => ()
          }
      }
    }
    if (jar == null) throw new RuntimeException("Couldn't find the corpus " + name)
    jar
  }

  private def lexicographicOrder(f1: String, f2: String) = {
    if (f1.split(".").zip(f2.split(".")).forall { case (s1, s2) =>
      try {
        s1.toInt > s2.toInt
      } catch {
        case _: Exception => s1 > s2
      }
    }) f1 else f2
  }

  private def cForName[T](name: String) = try {
    Some(Class.forName(name).asInstanceOf[Class[Corpus[T]]])
  } catch {
    case _: Exception => None
  }
}

object Corpora extends Corpora("http://repo.breeze.org/repo/")

object CorpusUtils {
  protected[corpora] def defaultCorpusRepo = {
    val repoFile = {
      var a = System.getenv("SCALANLP_CORPORA")
      if (a == null) a = System.getenv("MAVEN_HOME")
      if (a == null) a = System.getenv("HOME") + "/.m2/"
      a
    } + "repository/"
    new File(repoFile).toURI.toURL
  }
}
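One caveat worth flagging in lexicographicOrder above: Scala's String.split takes a regular expression, so f1.split(".") matches every character and returns an empty array, making the version comparison vacuous. A corrected sketch of the intended numeric-then-lexicographic comparison (VersionOrder is an illustrative name, not part of chalk):

```scala
// String.split takes a regex, so version.split(".") yields an empty array;
// the dot must be escaped. Compare components numerically where possible.
object VersionOrder {
  def latest(v1: String, v2: String): String = {
    val parts1 = v1.split("\\.")
    val parts2 = v2.split("\\.")
    val cmp = parts1.zip(parts2).iterator.map { case (a, b) =>
      try a.toInt compare b.toInt
      catch { case _: NumberFormatException => a compare b } // fall back to string order
    }.find(_ != 0).getOrElse(parts1.length compare parts2.length)
    if (cmp >= 0) v1 else v2
  }
}
```

With this fix, VersionOrder.latest("1.10.0", "1.9.0") picks "1.10.0", where a plain string comparison would wrongly prefer "1.9.0".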

#4 ReneeBK posted on 2016-04-21 10:39:30
package chalk.corpora;

/*
 Copyright 2009 David Hall, Daniel Ramage

 Licensed under the Apache License, Version 2.0 (the "License");
 you may not use this file except in compliance with the License.
 You may obtain a copy of the License at

 http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
*/

import scala.io._;
import java.io.Reader;
import java.net._;
import java.io._;
import scala.collection.mutable.ArrayBuffer;

/**
 * Represents a corpus of something.
 *
 * @author dlwh
 */
trait Corpus[+T] {
  def name: String;
  /** Splits could be "train" "test", or "small", "large", or whatever. */
  def splits: Map[String, Seq[T]];
  def license = "Unknown";
  def author = "Unknown";
}

/**
 * Pretty useful for creating a corpus from a jar file where the
 * resources are packaged in the jar file. Simply specify a mapFun
 * to transform the data and put a "data/categories" file in the
 * jar that has "splitName resourceName" pairs separated by a space,
 * where the resourceName points to another resource with a list of
 * files to be used in that split. It's easier than it sounds.
 *
 * @author dlwh
 */
trait JarCorpus[+T] extends Corpus[T] {
  protected def mapFun(category: String, path: String, s: InputStream): T

  // Utility to read "k v" pairs from a resource.
  protected def stringsFromFile(cl: ClassLoader, s: String) = {
    val strm = cl.getResourceAsStream(s)
    val src = Source.fromInputStream(strm);
    val result = (for {
      line <- src.getLines();
      trimmed = line.trim
    } yield trimmed).toIndexedSeq;
    strm.close();
    result;
  }

  // place for top level categories files
  protected def categoryFile: String = "data/categories";

  /**
   * Provides a list of (categoryName, resourceName), where the
   * categoryName is the name of the split (like "train") and the
   * resourceName is a resource path that has a list of the files
   * to load.
   */
  protected def categories: Seq[(String, String)] = stringsFromFile(classLoader, categoryFile) map { x =>
    val res = x.split(" ");
    (res(0), res(1));
  };

  lazy val splits = {
    Map[String, Seq[T]]() ++ categories.map { case (cat, res) =>
      val paths = stringsFromFile(classLoader, res).toStream;
      val data = for (path <- paths) yield {
        val strm = classLoader.getResourceAsStream(path)
        val result = mapFun(cat, path, new BufferedInputStream(strm));
        strm.close();
        result;
      }
      (cat, data)
    }
  }

  protected lazy val classLocation = {
    this.getClass.getProtectionDomain.getCodeSource.getLocation;
  }

  protected lazy val classLoader = {
    this.getClass.getClassLoader;
  }
}
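A minimal in-memory implementation may make the Corpus contract concrete. The trait is reproduced here so the sketch compiles standalone, and ToyCorpus is a hypothetical example, not a corpus shipped with chalk:

```scala
// The Corpus trait from the post, reproduced for a self-contained sketch.
trait Corpus[+T] {
  def name: String
  def splits: Map[String, Seq[T]]
  def license = "Unknown"
  def author = "Unknown"
}

// A hypothetical in-memory corpus with conventional "train"/"test" splits.
object ToyCorpus extends Corpus[String] {
  def name = "toy"
  def splits = Map(
    "train" -> Seq("the cat sat", "the dog ran"),
    "test"  -> Seq("a bird flew")
  )
}
```

A consumer only depends on name and splits; license and author fall back to the trait's "Unknown" defaults unless overridden.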

#5 ReneeBK posted on 2016-04-21 10:40:53
package chalk.data;

/*
 Copyright 2009 David Hall, Daniel Ramage

 Licensed under the Apache License, Version 2.0 (the "License");
 you may not use this file except in compliance with the License.
 You may obtain a copy of the License at

 http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
*/

import scala.collection.mutable.StringBuilder;
import scala.io.Source;
import java.io.File;

import nak.data.{Observation, Example}

/**
 * Represents a sequence of text. This is just a string with an id, and a nice map method.
 *
 * @author dlwh
 */
case class Text(id: String, contents: String) extends Observation[String] {
  def features = contents;

  def withLabel[L](l: L) = new LabeledText[L](id, l, contents);
}

object Text {
  def fromFile(f: File) = {
    new Text(f.getName, Source.fromFile(f).getLines().toSeq.foldLeft(new StringBuilder)(_ append _).toString);
  }
}

/**
 * A text with a label.
 * @author dlwh
 */
class LabeledText[L](id: String, val label: L, contents: String) extends Text(id, contents) with Example[L, String];

object LabeledText {
  /**
   * Creates a new text with the file as the id, and the directory name as the label.
   */
  def fromFile(f: File) = {
    new Text(f.getName, Source.fromFile(f).getLines().toSeq.foldLeft(new StringBuffer)(_ append _).toString).withLabel(f.getParentFile.getName);
  }
}
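Both fromFile helpers above concatenate a file's lines by folding them into a builder. The same idiom on an in-memory sequence of lines, with no file or nak dependency:

```scala
// fromFile folds a file's lines into a StringBuilder; the identical fold
// on an in-memory Seq of lines:
val lines = Seq("Hello, ", "world", "!")
val contents = lines.foldLeft(new StringBuilder)(_ append _).toString
println(contents) // prints "Hello, world!"
```

Note that getLines() drops the newline characters, so the folded result in fromFile runs the lines together with no separators, exactly as in this sketch.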


#7 ReneeBK posted on 2016-04-21 10:43:11
package chalk.slab

import akka.actor.{Actor, ActorLogging, ActorSystem, Props}
import akka.pattern.{ask, pipe}
import akka.util.Timeout
import scala.collection.mutable.ListBuffer
import scala.concurrent.duration._
import scala.concurrent.Future

/**
 * An actor that mixes in an AnalysisFunction and hands Slabs contained in Process messages over
 * to the function.
 */
trait AnalysisComponent[C, B, I <: B, O <: B] extends Actor with ActorLogging with AnalysisFunction[C, B, I, O] {
  import AnalysisComponent._
  def receive = {
    case Process(slab) => sender ! apply(slab.asInstanceOf[Slab[C, B, I]])
  }
}

/**
 * Companion object, e.g. to hold messages that can be processed by an AnalysisComponent actor.
 */
object AnalysisComponent {
  case class Process[C, B, I <: B](slab: Slab[C, B, I])
}

trait StringAnalysisComponent[I <: Span, O <: Span]
    extends AnalysisComponent[String, Span, I, O]

/**
 * An actor that uses SentenceSegmenter.
 */
class SentenceSegmenterActor extends SentenceSegmenter[Span]
    with StringAnalysisComponent[Span, Sentence]

/**
 * An actor that uses Tokenizer.
 */
class TokenizerActor extends Tokenizer[Sentence] with StringAnalysisComponent[Sentence, Token]

/**
 * An analysis engine that runs Slabs through a pipeline of AnalysisComponents. It currently
 * requires explicit declaration of the analyzers, but this would ideally be done through
 * configuration. No compile-time consistency check for the types in the pipeline is performed.
 * Anyhoo, this gives the basic idea for how an actor-based AnalysisEngine might work, so
 * it should be a good starting point.
 */
class AnalysisEngine extends Actor with ActorLogging {

  import AnalysisComponent._
  import AnalysisEngine._
  import Span._
  import Slab.StringSlab
  implicit val ec = context.dispatcher
  implicit val timeout = Timeout(10 seconds)

  val sentenceSegmenter = context.system.actorOf(Props[SentenceSegmenterActor])
  val tokenizer = context.system.actorOf(Props[TokenizerActor])

  def receive = {
    case Process(slab) =>
      log.info("Processing slab:\n " + slab.content)
      (for {
        slab1 <- (sentenceSegmenter ? Process(slab)).mapTo[StringSlab[Sentence]]
        slab2 <- (tokenizer ? Process(slab1)).mapTo[StringSlab[Sentence with Token]]
      } yield {
        slab2
      }) pipeTo sender

    case ProcessCorpus(texts) =>
      Future.traverse(texts)(text => self ? Process(Slab(text))) pipeTo sender
  }
}

/**
 * Example application doing actor-based Slab processing.
 */
object AnalysisEngine {

  case class ProcessCorpus(corpus: Iterator[String])

  import AnalysisComponent._
  import Span._
  import Slab.StringSlab

  val text1 = "Here is an example text. It has four sentences and it mentions Jimi Hendrix and Austin, Texas! In this third sentence, it also brings up Led Zeppelin and Radiohead, but does it ask a question? It also has a straggler sentence that doesn't end with punctuation"

  val text2 = "How much wood can a woodchuck chuck? Can a woodchuck chuck wood?"

  val text3 = "The Association for Computational Linguistics is proud to present its first Lifetime Achievement Award to Prof. Aravind Joshi of the University of Pennsylvania. Aravind Joshi was born in 1929 in Pune, India, where he completed his secondary education as well as his first degree in Mechanical and Electrical Engineering, the latter in 1950. He worked as a research assistant in Linguistics at Penn from 1958-60, while completing his Ph.D. in Electrical Engineering, in 1960. Joshi's work and the work of his Penn colleagues at the frontiers of Cognitive Science was rewarded in 1991 by the establishment of a National Science Foundation Science and Technology Center for Research in Cognitive Science, which Aravind Joshi co-directed until 2001. Dr. Joshi has supervised thirty-six Ph.D. theses to-date, on topics including information and coding theory, and also pure linguistics."

  def main(args: Array[String]) {

    val system = ActorSystem("ChalkSystem")
    implicit val ec = system.dispatcher
    implicit val timeout = Timeout(10 seconds)

    val engine = system.actorOf(Props[AnalysisEngine])
    val corpus = Iterator(text1, text2, text3)

    for {
      slabs <- (engine ? ProcessCorpus(corpus)).mapTo[Iterator[StringSlab[Sentence with Token]]]
      slab <- slabs
    } {
      // Notice that the last sentence (lacking EOS char) is missing.
      val sentences = slab.iterator[Sentence].toList
      println("\nSENTENCES\n\n" + sentences.map(_.in(slab).content).mkString("\n"))

      val tokens = slab.iterator[Token].toList
      println("\nTOKENS\n\n" + tokens.map(_.in(slab).content).mkString("\n"))
    }
    Thread.sleep(3000)
    system.shutdown
  }
}
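The engine above chains its actor stages with ask and pipeTo. The same pipeline shape can be sketched with plain Futures and no Akka dependency; segment and tokenize here are naive stand-ins for illustration, not chalk's actual SentenceSegmenter and Tokenizer:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

// Hypothetical stand-ins for the segmenter and tokenizer stages.
def segment(text: String): Future[Seq[String]] =
  Future(text.split("(?<=[.!?]) ").toSeq)         // split after sentence-final punctuation
def tokenize(sentences: Seq[String]): Future[Seq[Seq[String]]] =
  Future(sentences.map(_.split("\\s+").toSeq))    // whitespace tokenization

// The same for-comprehension shape as the engine's receive block.
val pipeline = for {
  sentences <- segment("How much wood can a woodchuck chuck? Can a woodchuck chuck wood?")
  tokens    <- tokenize(sentences)
} yield tokens

val result = Await.result(pipeline, 10.seconds)
println(result.map(_.mkString(" | ")).mkString("\n"))
```

Each stage's Future output feeds the next stage, which is exactly what ask + mapTo + pipeTo accomplish in the actor version, minus the concurrency across corpus documents.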

#8 neuroexplorer posted on 2016-04-21 11:39:46

Thanks for sharing.


#9 garfiny posted on 2016-05-10 13:48:13

Thanks for sharing.

