OP: ReneeBK

Chalk: A Scala Library for Natural Language Processing



Hidden content of this post:

https://github.com/scalanlp/chalk





#2 ReneeBK posted on 2016-04-21 10:37:43
package chalk.corpora

import io.Source
import collection.mutable.ArrayBuffer
import nak.data.{Observation, Example}
import java.io.{File, FileInputStream, InputStream}

/**
 * Reads tag sequences in the conll shared task format. See http://mlcomp.org/faq/domains "Sequence Tagging" for the spec.
 * @author dlwh
 */
object CONLLSequenceReader {
  def readTrain(f: InputStream, name: String = "sequence", splitToken: String = " "): Iterator[Example[IndexedSeq[String], IndexedSeq[IndexedSeq[String]]]] = {
    val source = Source.fromInputStream(f).getLines()
    new Iterator[Example[IndexedSeq[String], IndexedSeq[IndexedSeq[String]]]] {
      def hasNext = source.hasNext
      var index = 0
      def next(): Example[IndexedSeq[String], IndexedSeq[IndexedSeq[String]]] = {
        val inputs = new ArrayBuffer[IndexedSeq[String]]()
        val outputs = new ArrayBuffer[String]
        import scala.util.control.Breaks._
        breakable {
          while (source.hasNext) {
            val line = source.next()
            if (line.trim().isEmpty) break

            val split = line.split(splitToken)
            inputs += split.take(split.length - 1).toIndexedSeq
            outputs += split.last
          }
        }
        val id = name + "-" + index
        index += 1
        Example(outputs, inputs, id)
      }
    }
  }

  /**
   * This format reads a CONLL file with the last column (i.e. the label) missing. If you have the label,
   * use readTrain, even if you plan on testing with it. Silly, I know.
   * @param f
   * @param name
   * @return
   */
  def readTest(f: InputStream, name: String = "test-sequence", splitToken: String = " "): Iterator[Observation[IndexedSeq[IndexedSeq[String]]]] = {
    val source = Source.fromInputStream(f).getLines()
    new Iterator[Observation[IndexedSeq[IndexedSeq[String]]]] {
      def hasNext = source.hasNext
      var index = 0
      def next() = {
        val inputs = new ArrayBuffer[IndexedSeq[String]]()
        import scala.util.control.Breaks._
        breakable {
          while (source.hasNext) {
            val line = source.next()
            if (line.trim().isEmpty) break

            val split = line.split(splitToken)
            inputs += split
          }
        }
        val id = name + "-" + index
        index += 1
        Observation(inputs, id)
      }
    }
  }

  def main(args: Array[String]) {
    println(readTrain(new FileInputStream(new File(args(0)))).length)
    println(readTest(new FileInputStream(new File(args(1)))).length)
  }
}
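The per-sentence splitting inside readTrain above can be sketched without the nak dependency. This is a minimal stand-alone sketch; `ConllSketch` and `parseBlock` are illustrative names, not part of chalk:

```scala
import scala.collection.mutable.ArrayBuffer

// A minimal, dependency-free sketch of readTrain's inner loop: each non-blank
// line is "feat1 feat2 ... label"; a blank line ends the sentence block.
object ConllSketch {
  def parseBlock(lines: Iterator[String], splitToken: String = " ")
      : (IndexedSeq[IndexedSeq[String]], IndexedSeq[String]) = {
    val inputs  = ArrayBuffer[IndexedSeq[String]]()
    val outputs = ArrayBuffer[String]()
    while (lines.hasNext) {
      val line = lines.next()
      if (line.trim.isEmpty)
        return (inputs.toIndexedSeq, outputs.toIndexedSeq)
      val split = line.split(splitToken)
      inputs  += split.take(split.length - 1).toIndexedSeq // all but the last column: features
      outputs += split.last                                // last column: the tag
    }
    (inputs.toIndexedSeq, outputs.toIndexedSeq)
  }
}
```

Fed the block "John NNP B-PER" / "runs VBZ O" followed by a blank line, this yields the feature rows ("John", "NNP") and ("runs", "VBZ") with tags "B-PER" and "O".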

#3 ReneeBK posted on 2016-04-21 10:38:52
package chalk.corpora

/*
 Copyright 2009 David Hall, Daniel Ramage

 Licensed under the Apache License, Version 2.0 (the "License");
 you may not use this file except in compliance with the License.
 You may obtain a copy of the License at

 http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
*/

//import breeze.util.Implicits._
import java.net._
import java.io.File
import scala.xml.XML

/**
 * Class to help load corpora. Each corpus is represented by a class
 * of type "Corpus" which knows how to load the data.
 *
 * Corpora are located in one of two ways:
 * <ul>
 *  <li>Provide a class name for the corpus for a class already in the class path.</li>
 *  <li>Located through one or more public repositories, usually your
 *     MAVEN_HOME and the breeze maven repository.
 *
 *     Files located in this way must be installed in:
 *      /path/to/repository/org/breeze/corpora/(corpus name in lower case)/(version)/(corpus-name)-(version).jar
 *
 *     The jar file itself must have a file named breeze-corpus.xml which has the format:
 *   <code>
 *       <corpus>
 *         <classname>absolute classname</classname>
 *       </corpus>
 *   </code>
 *    This class is then created and returned.
 *   </li>
 * </ul>
 *
 * @author dlwh
 */
class Corpora(val repositories: Seq[URL]) {
  import Corpora._
  def this(remoteURL: String) = this(Array(CorpusUtils.defaultCorpusRepo, new URL(remoteURL)))

  /**
   * Create a new instance of the corpus.
   */
  def load[T](clss: Class[U] forSome { type U <: Corpus[T] }) = {
    clss.newInstance
  }

  /**
   * Either loads the Corpus with the given class name, or searches the repositories for the corpus with the (lowercase) name.
   */
  def load[T](name: String): Corpus[T] = {
    val clss = cForName[T](name) orElse cForName[T]("breeze.corpora." + name)
    clss match {
      case None => locateCorpus(name)
      case Some(c) => load(c.asSubclass(classOf[Corpus[T]]))
    }
  }

  /**
   * Searches for a corpus in the available repositories.
   */
  def locateCorpus[T](name: String): Corpus[T] = {
    val jarLocation = findJar(name)
    val cl = new URLClassLoader(Array(jarLocation))
    val className = {
      val strm = cl.getResourceAsStream("breeze-corpus.xml")
      val xml = XML.load(strm)
      println(xml)
      strm.close()
      (xml \ "classname").text.trim()
    }
    println(className + " " + jarLocation)
    load(cl.loadClass(className).asSubclass(classOf[Corpus[T]]))
  }

  private def findJar(name: String) = {
    val lowerName = name.toLowerCase
    val urls = repositories.iterator

    var jar: URL = null
    while (urls.hasNext && (jar eq null)) {
      val url = urls.next
      try { // is it a directory?
        val f = new File(url.toURI)
        val location = new File(new File(f, "org/breeze/corpora/"), lowerName)
        println(location)
        if (location.exists && location.isDirectory) {
          val latestVersion = location.listFiles.filter(_.isDirectory).reduceLeft { (x, y) =>
            new File(lexicographicOrder(x.getName, y.getName))
          }
          println(latestVersion)
          jar = latestVersion.listFiles.filter(_.getName.endsWith(".jar"))(0).toURI.toURL
        }
      } catch { // not a file, so we'll hope it's a maven repo
        case _: Exception =>
          val location = url.toString + "/org/breeze/corpora/" + lowerName
          val metadata = new URL(location + "/maven-metadata.xml")
          try {
            val stream = metadata.openStream()
            try {
              val xml = XML.load(stream)
              val latestVersion = (xml \ "version").text
              jar = new URL(location + "/" + latestVersion + "/" + lowerName + "-" + latestVersion + ".jar")
            } finally {
              stream.close()
            }
          } catch {
            case _: Exception => ()
          }
      }
    }
    if (jar == null) throw new RuntimeException("Couldn't find the corpus " + name)
    jar
  }

  private def lexicographicOrder(f1: String, f2: String) = {
    if (f1.split(".").zip(f2.split(".")).forall { case (s1, s2) =>
      try {
        s1.toInt > s2.toInt
      } catch {
        case _: Exception => s1 > s2
      }
    }) f1 else f2
  }

  private def cForName[T](name: String) = try {
    Some(Class.forName(name).asInstanceOf[Class[Corpus[T]]])
  } catch {
    case _: Exception => None
  }
}

object Corpora extends Corpora("http://repo.breeze.org/repo/")

object CorpusUtils {
  protected[corpora] def defaultCorpusRepo = {
    val repoFile = {
      var a = System.getenv("SCALANLP_CORPORA")
      if (a == null) a = System.getenv("MAVEN_HOME")
      if (a == null) a = System.getenv("HOME") + "/.m2/"
      a
    } + "repository/"
    new File(repoFile).toURI.toURL
  }
}
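One caveat worth flagging in lexicographicOrder above: Scala's String.split takes a regular expression, so f1.split(".") matches every character and returns an empty array, making the version comparison vacuous. A corrected sketch of the intended numeric-then-lexicographic comparison (VersionOrder is an illustrative name, not part of chalk):

```scala
// String.split takes a regex, so version.split(".") yields an empty array;
// the dot must be escaped. Compare components numerically where possible.
object VersionOrder {
  def latest(v1: String, v2: String): String = {
    val parts1 = v1.split("\\.")
    val parts2 = v2.split("\\.")
    val cmp = parts1.zip(parts2).iterator.map { case (a, b) =>
      try a.toInt compare b.toInt
      catch { case _: NumberFormatException => a compare b } // fall back to string order
    }.find(_ != 0).getOrElse(parts1.length compare parts2.length)
    if (cmp >= 0) v1 else v2
  }
}
```

With this fix, VersionOrder.latest("1.10.0", "1.9.0") picks "1.10.0", where a plain string comparison would wrongly prefer "1.9.0".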

#4 ReneeBK posted on 2016-04-21 10:39:30
package chalk.corpora;

/*
 Copyright 2009 David Hall, Daniel Ramage

 Licensed under the Apache License, Version 2.0 (the "License");
 you may not use this file except in compliance with the License.
 You may obtain a copy of the License at

 http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
*/

import scala.io._;
import java.io.Reader;
import java.net._;
import java.io._;
import scala.collection.mutable.ArrayBuffer;

/**
 * Represents a corpus of something.
 *
 * @author dlwh
 */
trait Corpus[+T] {
  def name: String;
  /** Splits could be "train" "test", or "small", "large", or whatever. */
  def splits: Map[String, Seq[T]];
  def license = "Unknown";
  def author = "Unknown";
}

/**
 * Pretty useful for creating a corpus from a jar file where the
 * resources are packaged in the jar file. Simply specify a mapFun
 * to transform the data and put a "data/categories" file in the
 * jar that has "splitName resourceName" pairs separated by a space,
 * where the resourceName points to another resource with a list of
 * files to be used in that split. It's easier than it sounds.
 *
 * @author dlwh
 */
trait JarCorpus[+T] extends Corpus[T] {
  protected def mapFun(category: String, path: String, s: InputStream): T

  // Utility to read "k v" pairs from a resource.
  protected def stringsFromFile(cl: ClassLoader, s: String) = {
    val strm = cl.getResourceAsStream(s)
    val src = Source.fromInputStream(strm);
    val result = (for {
      line <- src.getLines();
      trimmed = line.trim
    } yield trimmed).toIndexedSeq;
    strm.close();
    result;
  }

  // place for top level categories files
  protected def categoryFile: String = "data/categories";

  /**
   * Provides a list of (categoryName, resourceName), where the
   * categoryName is the name of the split (like "train") and the
   * resourceName is a resource path that has a list of the files
   * to load.
   */
  protected def categories: Seq[(String, String)] = stringsFromFile(classLoader, categoryFile) map { x =>
    val res = x.split(" ");
    (res(0), res(1));
  };

  lazy val splits = {
    Map[String, Seq[T]]() ++ categories.map { case (cat, res) =>
      val paths = stringsFromFile(classLoader, res).toStream;
      val data = for (path <- paths) yield {
        val strm = classLoader.getResourceAsStream(path)
        val result = mapFun(cat, path, new BufferedInputStream(strm));
        strm.close();
        result;
      }
      (cat, data)
    }
  }

  protected lazy val classLocation = {
    this.getClass.getProtectionDomain.getCodeSource.getLocation;
  }

  protected lazy val classLoader = {
    this.getClass.getClassLoader;
  }
}
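A minimal in-memory implementation may make the Corpus contract concrete. The trait is reproduced here so the sketch compiles standalone, and ToyCorpus is a hypothetical example, not a corpus shipped with chalk:

```scala
// The Corpus trait from the post, reproduced for a self-contained sketch.
trait Corpus[+T] {
  def name: String
  def splits: Map[String, Seq[T]]
  def license = "Unknown"
  def author = "Unknown"
}

// A hypothetical in-memory corpus with conventional "train"/"test" splits.
object ToyCorpus extends Corpus[String] {
  def name = "toy"
  def splits = Map(
    "train" -> Seq("the cat sat", "the dog ran"),
    "test"  -> Seq("a bird flew")
  )
}
```

A consumer only depends on name and splits; license and author fall back to the trait's "Unknown" defaults unless overridden.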

#5 ReneeBK posted on 2016-04-21 10:40:53
package chalk.data;

/*
 Copyright 2009 David Hall, Daniel Ramage

 Licensed under the Apache License, Version 2.0 (the "License");
 you may not use this file except in compliance with the License.
 You may obtain a copy of the License at

 http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
*/

import scala.collection.mutable.StringBuilder;
import scala.io.Source;
import java.io.File;

import nak.data.{Observation, Example}

/**
 * Represents a sequence of text. This is just a string with an id, and a nice map method.
 *
 * @author dlwh
 */
case class Text(id: String, contents: String) extends Observation[String] {
  def features = contents;

  def withLabel[L](l: L) = new LabeledText[L](id, l, contents);
}

object Text {
  def fromFile(f: File) = {
    new Text(f.getName, Source.fromFile(f).getLines().toSeq.foldLeft(new StringBuilder)(_ append _).toString);
  }
}

/**
 * A text with a label.
 * @author dlwh
 */
class LabeledText[L](id: String, val label: L, contents: String) extends Text(id, contents) with Example[L, String];

object LabeledText {
  /**
   * Creates a new text with the file as the id, and the directory name as the label.
   */
  def fromFile(f: File) = {
    new Text(f.getName, Source.fromFile(f).getLines().toSeq.foldLeft(new StringBuffer)(_ append _).toString).withLabel(f.getParentFile.getName);
  }
}
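Both fromFile helpers above concatenate a file's lines by folding them into a builder. The same idiom on an in-memory sequence of lines, with no file or nak dependency:

```scala
// fromFile folds a file's lines into a StringBuilder; the identical fold
// on an in-memory Seq of lines:
val lines = Seq("Hello, ", "world", "!")
val contents = lines.foldLeft(new StringBuilder)(_ append _).toString
println(contents) // prints "Hello, world!"
```

Note that getLines() drops the newline characters, so the folded result in fromFile runs the lines together with no separators, exactly as in this sketch.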


#7 ReneeBK posted on 2016-04-21 10:43:11
package chalk.slab

import akka.actor.{Actor, ActorLogging, ActorSystem, Props}
import akka.pattern.{ask, pipe}
import akka.util.Timeout
import scala.collection.mutable.ListBuffer
import scala.concurrent.duration._
import scala.concurrent.Future

/**
 * An actor that mixes in an AnalysisFunction and hands Slabs contained in Process messages over
 * to the function.
 */
trait AnalysisComponent[C, B, I <: B, O <: B] extends Actor with ActorLogging with AnalysisFunction[C, B, I, O] {
  import AnalysisComponent._
  def receive = {
    case Process(slab) => sender ! apply(slab.asInstanceOf[Slab[C, B, I]])
  }
}

/**
 * Companion object, e.g. to hold messages that can be processed by an AnalysisComponent actor.
 */
object AnalysisComponent {
  case class Process[C, B, I <: B](slab: Slab[C, B, I])
}

trait StringAnalysisComponent[I <: Span, O <: Span]
    extends AnalysisComponent[String, Span, I, O]

/**
 * An actor that uses SentenceSegmenter.
 */
class SentenceSegmenterActor extends SentenceSegmenter[Span]
    with StringAnalysisComponent[Span, Sentence]

/**
 * An actor that uses Tokenizer.
 */
class TokenizerActor extends Tokenizer[Sentence] with StringAnalysisComponent[Sentence, Token]

/**
 * An analysis engine that runs Slabs through a pipeline of AnalysisComponents. It currently
 * requires explicit declaration of the analyzers, but this would ideally be done through
 * configuration. No compile-time consistency check for the types in the pipeline is performed.
 * Anyhoo, this gives the basic idea for how an actor-based AnalysisEngine might work, so
 * it should be a good starting point.
 */
class AnalysisEngine extends Actor with ActorLogging {

  import AnalysisComponent._
  import AnalysisEngine._
  import Span._
  import Slab.StringSlab
  implicit val ec = context.dispatcher
  implicit val timeout = Timeout(10 seconds)

  val sentenceSegmenter = context.system.actorOf(Props[SentenceSegmenterActor])
  val tokenizer = context.system.actorOf(Props[TokenizerActor])

  def receive = {
    case Process(slab) =>
      log.info("Processing slab:\n " + slab.content)
      (for {
        slab1 <- (sentenceSegmenter ? Process(slab)).mapTo[StringSlab[Sentence]]
        slab2 <- (tokenizer ? Process(slab1)).mapTo[StringSlab[Sentence with Token]]
      } yield {
        slab2
      }) pipeTo sender

    case ProcessCorpus(texts) =>
      Future.traverse(texts)(text => self ? Process(Slab(text))) pipeTo sender
  }
}

/**
 * Example application doing actor-based Slab processing.
 */
object AnalysisEngine {

  case class ProcessCorpus(corpus: Iterator[String])

  import AnalysisComponent._
  import Span._
  import Slab.StringSlab

  val text1 = "Here is an example text. It has four sentences and it mentions Jimi Hendrix and Austin, Texas! In this third sentence, it also brings up Led Zeppelin and Radiohead, but does it ask a question? It also has a straggler sentence that doesn't end with punctuation"

  val text2 = "How much wood can a woodchuck chuck? Can a woodchuck chuck wood?"

  val text3 = "The Association for Computational Linguistics is proud to present its first Lifetime Achievement Award to Prof. Aravind Joshi of the University of Pennsylvania. Aravind Joshi was born in 1929 in Pune, India, where he completed his secondary education as well as his first degree in Mechanical and Electrical Engineering, the latter in 1950. He worked as a research assistant in Linguistics at Penn from 1958-60, while completing his Ph.D. in Electrical Engineering, in 1960. Joshi's work and the work of his Penn colleagues at the frontiers of Cognitive Science was rewarded in 1991 by the establishment of a National Science Foundation Science and Technology Center for Research in Cognitive Science, which Aravind Joshi co-directed until 2001. Dr. Joshi has supervised thirty-six Ph.D. theses to-date, on topics including information and coding theory, and also pure linguistics."

  def main(args: Array[String]) {

    val system = ActorSystem("ChalkSystem")
    implicit val ec = system.dispatcher
    implicit val timeout = Timeout(10 seconds)

    val engine = system.actorOf(Props[AnalysisEngine])
    val corpus = Iterator(text1, text2, text3)

    for {
      slabs <- (engine ? ProcessCorpus(corpus)).mapTo[Iterator[StringSlab[Sentence with Token]]]
      slab <- slabs
    } {
      // Notice that the last sentence (lacking EOS char) is missing.
      val sentences = slab.iterator[Sentence].toList
      println("\nSENTENCES\n\n" + sentences.map(_.in(slab).content).mkString("\n"))

      val tokens = slab.iterator[Token].toList
      println("\nTOKENS\n\n" + tokens.map(_.in(slab).content).mkString("\n"))
    }
    Thread.sleep(3000)
    system.shutdown
  }
}
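The engine above chains its actor stages with ask and pipeTo. The same pipeline shape can be sketched with plain Futures and no Akka dependency; segment and tokenize here are naive stand-ins for illustration, not chalk's actual SentenceSegmenter and Tokenizer:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

// Hypothetical stand-ins for the segmenter and tokenizer stages.
def segment(text: String): Future[Seq[String]] =
  Future(text.split("(?<=[.!?]) ").toSeq)         // split after sentence-final punctuation
def tokenize(sentences: Seq[String]): Future[Seq[Seq[String]]] =
  Future(sentences.map(_.split("\\s+").toSeq))    // whitespace tokenization

// The same for-comprehension shape as the engine's receive block.
val pipeline = for {
  sentences <- segment("How much wood can a woodchuck chuck? Can a woodchuck chuck wood?")
  tokens    <- tokenize(sentences)
} yield tokens

val result = Await.result(pipeline, 10.seconds)
println(result.map(_.mkString(" | ")).mkString("\n"))
```

Each stage's Future output feeds the next stage, which is exactly what ask + mapTo + pipeTo accomplish in the actor version, minus the concurrency across corpus documents.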

#8 neuroexplorer posted on 2016-04-21 11:39:46

Thanks for sharing.


#9 garfiny posted on 2016-05-10 13:48:13

Thanks for sharing.

