Counting the number of words in a LaTeX file

Posted by janyiyi on 2016-10-18 10:43:03

Here's one such jolly feature of the stringi package. Many LaTeX users may find it very useful.

Loading a text file with encoding auto-detection

Here's a LaTeX document consisting of a Polish poem. Probably, most of you wouldn't have been able to guess the file's character encoding if I hadn't left some hints. But it's OK, we have a little challenge.

Let's use some (currently experimental) stringi functions to guess the file's encoding.

First of all, we should read the file as a raw vector (anyway, each text file is a sequence of bytes).

library(stringi)
# fetch the sample file; mode = "wb" keeps the bytes intact on every platform
download.file("http://www.rexamine.com/manual_upload/powrot_taty_latin2.tex",
              destfile = "powrot_taty_latin2.tex", mode = "wb")
# experimental function (as per stringi_0.2-5):
file <- stri_read_raw("powrot_taty_latin2.tex")
head(file, 15)
## [1] 25 25 20 45 4e 43 4f 44 49 4e 47 20 3d 20 49
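
The first bytes shown in the `head(file, 15)` output are plain ASCII, so the hint left in the file header can be read directly. A minimal base-R sketch, with the byte values copied from that output (the variable names are illustrative only):

```r
# First 15 bytes of the file, copied from the head(file, 15) output.
header_bytes <- as.raw(c(0x25, 0x25, 0x20, 0x45, 0x4e, 0x43, 0x4f, 0x44,
                         0x49, 0x4e, 0x47, 0x20, 0x3d, 0x20, 0x49))

# All values are below 0x80, i.e. pure ASCII, so decoding is unambiguous.
stopifnot(all(as.integer(header_bytes) < 128))

header_text <- rawToChar(header_bytes)
print(header_text)
## [1] "%% ENCODING = I"
```

The header is a LaTeX comment that starts to spell out the encoding. It is the non-ASCII bytes further into the file that make detection necessary.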

Let's try to detect the file's character encoding automatically.

stri_enc_detect(file)[[1]] # experimental function
## $Encoding
## [1] "ISO-8859-2" "ISO-8859-1" "ISO-8859-9"
##
## $Language
## [1] "pl" "pt" "tr"
##
## $Confidence
## [1] 0.46 0.19 0.07

Encoding detection is, at best, an imprecise operation based on statistics and heuristics. ICU indicates that we are most probably dealing with Polish text in ISO-8859-2 (a.k.a. latin2) here. What a coincidence: it's true.
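
To see what the detector has to work with, one can build a byte sequence whose encoding is known in advance. The sketch below assumes stringi is installed; `txt`, `bytes`, `guess` and `back` are illustrative names, and the sample is a single verse line from the poem. On such a short input the ranking returned by `stri_enc_detect()` may well differ from the one reported for the whole file, so only the round trip is checked.

```r
library(stringi)

# A known UTF-8 string (the \u escapes keep this script ASCII-only).
txt <- "P\u00f3jd\u017acie, o dziatki, p\u00f3jd\u017acie wszystkie razem"

# Encode it as ISO-8859-2; to_raw = TRUE returns a list of raw vectors.
bytes <- stri_encode(txt, from = "UTF-8", to = "ISO-8859-2", to_raw = TRUE)[[1]]

# ISO-8859-2 is a single-byte encoding: one byte per character.
stopifnot(length(bytes) == stri_length(txt))

# Ranked guesses; short inputs give the heuristics little to work with.
guess <- stri_enc_detect(bytes)[[1]]
print(guess$Encoding)

# Decoding with the true encoding restores the original text exactly.
back <- stri_encode(bytes, from = "ISO-8859-2", to = "UTF-8")
stopifnot(back == txt)
```

If the top guess were wrong, the conversion would still run without error but would produce the wrong letters, which is why a known encoding should be preferred over a detected one whenever it is available.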

Let's re-encode the file. Our target encoding will be UTF-8, as it can represent every character found in any of these 8-bit encodings. We really love portable code:

file <- stri_conv(file, stri_enc_detect(file)[[1]]$Encoding[1], "UTF-8")
file <- stri_split_lines1(file) # split a string into text lines
print(file[22:28]) # text sample
## [1] ",,Pójdźcie, o dziatki, pójdźcie wszystkie razem"
## [2] ""
## [3] "Za miasto, pod słup na wzgórek,"
## [4] ""
## [5] "Tam przed cudownym klęknijcie obrazem,"
## [6] ""
## [7] "Pobożnie zmówcie paciórek."

Of course, if we knew a priori that the file is in ISO-8859-2, we'd just call:

file <- stri_conv(
  readLines("http://www.rexamine.com/manual_upload/powrot_taty_latin2.tex"),
  "ISO-8859-2", "UTF-8")

So far so good.

Word count

LaTeX word counting is quite a complicated task and there are many possible approaches to it. Most often, they rely on running some external tools (which may be a bit inconvenient for some users). Personally, I've always been most satisfied with the output produced by the Kile LaTeX IDE for the KDE desktop environment.

As not everyone has Kile installed, I decided to grab Kile's algorithm (the power of open source!), made some not-too-invasive stringi-specific tweaks, and here we are:

stri_stats_latex(file)
## CharsWord CharsCmdEnvir CharsWhite Words Cmds
## 2283 335 576 461 32
## Envirs
## 2

Some other aggregates are also available (they are meaningful for any text file, not just LaTeX):

stri_stats_general(file)
## Lines LinesNEmpty Chars CharsNWhite
## 232 122 3308 2930
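
As a small check of what these aggregates mean, here is a sketch on a three-line input (assuming stringi is installed; the sample lines and the name `txt_lines` are made up for illustration). The `stri_stats_general()` figures are compared against base-R computations. For `stri_stats_latex()` only the field names are printed, since the exact split between words and commands follows the Kile-derived rules.

```r
library(stringi)

# Three lines: one with a LaTeX command, one empty, one plain.
txt_lines <- c("Hello \\textbf{world}", "", "foo bar")

gen <- stri_stats_general(txt_lines)
print(gen)

# Cross-check each aggregate with base R (valid for this ASCII input).
stopifnot(
  gen[["Lines"]]       == length(txt_lines),
  gen[["LinesNEmpty"]] == sum(nzchar(trimws(txt_lines))),
  gen[["Chars"]]       == sum(nchar(txt_lines)),
  gen[["CharsNWhite"]] == sum(nchar(gsub("\\s", "", txt_lines)))
)

# The LaTeX-aware statistics accept the same kind of input.
tex <- stri_stats_latex(txt_lines)
print(names(tex))
```

Both functions expect one string per text line, which is why the file was split with `stri_split_lines1()` before being passed in.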

Finally, here's the word count for my R programming book (in Polish). Importantly, each chapter is stored in a separate .tex file (there are 30 files), so "clicking out" the answer in Kile would be a bit problematic:

apply(
  sapply(
    list.files(path = "~/Publikacje/ProgramowanieR/rozdzialy/",
               pattern = glob2rx("*.tex"), recursive = TRUE, full.names = TRUE),
    function(x)
      stri_stats_latex(readLines(x))
  ), 1, sum)
## CharsWord CharsCmdEnvir CharsWhite Words Cmds Envirs
## 718755 458403 281989 120202 37055 6119
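
The same per-file aggregation can be tried without the book's sources. The sketch below assumes stringi is installed and writes two hypothetical chapter files (`ch1.tex`, `ch2.tex`, with made-up content) to a temporary directory; `per_file` and `totals` are illustrative names.

```r
library(stringi)

# Hypothetical chapter files written to a fresh temporary directory.
chapter_dir <- tempfile("chapters")
dir.create(chapter_dir)
writeLines(c("\\section{One}", "Hello world."),
           file.path(chapter_dir, "ch1.tex"))
writeLines(c("\\section{Two}", "Another short chapter."),
           file.path(chapter_dir, "ch2.tex"))

files <- list.files(chapter_dir, pattern = "\\.tex$",
                    recursive = TRUE, full.names = TRUE)

# One column of statistics per file; the rows are the six counters.
per_file <- sapply(files, function(f) stri_stats_latex(readLines(f)))

# Summing across columns gives the totals for the whole set of files.
totals <- rowSums(per_file)
print(totals)
```

Keeping the intermediate `per_file` matrix around also makes it easy to see which chapter contributes most to the total. If the files are not in the native encoding, passing them through `stri_conv()` first, as done earlier for the poem, is the safer route.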

Notably, my publisher was satisfied with the above estimate.

Next time we'll take a look at ICU's very powerful transliteration services.

More information

For more information check out the stringi package website and its on-line documentation.

For bug reports and feature requests visit our GitHub profile.

Any comments and suggestions are warmly welcome.
