Original poster: janyiyi

Counting the number of words in a LaTeX file

Here's one such jolly feature of the stringi package for R. Many LaTeX users may find it very useful.

Loading a text file with encoding auto-detection

Here's a LaTeX document consisting of a Polish poem. Probably, most of you wouldn't have been able to guess the file's character encoding if I hadn't left some hints. But it's OK, we have a little challenge.

Let's use some (currently experimental) stringi functions to guess the file's encoding.

First of all, we should read the file as a raw vector (anyway, each text file is a sequence of bytes).

    library(stringi)
    # mode = "wb" keeps the bytes intact on Windows
    download.file("http://www.rexamine.com/manual_upload/powrot_taty_latin2.tex",
        destfile = "powrot_taty_latin2.tex", mode = "wb")
    # stri_read_raw() is an experimental function (as per stringi_0.2-5):
    file <- stri_read_raw("powrot_taty_latin2.tex")
    head(file, 15)
    ##  [1] 25 25 20 45 4e 43 4f 44 49 4e 47 20 3d 20 49
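The 15 bytes printed above are plain ASCII, so they can be decoded by hand to see the start of the hint left in the file. A minimal base-R sketch, with the byte values copied from that output (the variable names are only illustrative):

```r
# First 15 bytes of the file, copied from the head(file, 15) output above.
first_bytes <- as.raw(c(0x25, 0x25, 0x20, 0x45, 0x4e, 0x43, 0x4f, 0x44,
                        0x49, 0x4e, 0x47, 0x20, 0x3d, 0x20, 0x49))

# ASCII bytes map one-to-one onto characters, so rawToChar() is enough here.
hint <- rawToChar(first_bytes)
print(hint)
## [1] "%% ENCODING = I"
```

So the file opens with a LaTeX comment that starts to name its own encoding; the non-ASCII bytes further down are the ones that need real detection.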

Let's try to detect the file's character encoding automatically.

    stri_enc_detect(file)[[1]]  # experimental function
    ## $Encoding
    ## [1] "ISO-8859-2" "ISO-8859-1" "ISO-8859-9"
    ##
    ## $Language
    ## [1] "pl" "pt" "tr"
    ##
    ## $Confidence
    ## [1] 0.46 0.19 0.07

Encoding detection is, at best, an imprecise operation based on statistics and heuristics. ICU suggests that we are most probably dealing with Polish text in ISO-8859-2 (a.k.a. latin2) here. What a coincidence: it's true.
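Because the result is only a ranked guess, it may be worth checking the confidence values before trusting the top candidate. The base-R sketch below works on the detection result copied by hand from the output above; the helper pick_encoding and its thresholds (min_conf, min_gap) are hypothetical and chosen arbitrarily for illustration, they are not part of stringi or ICU.

```r
# Detection result, copied by hand from the output above (not computed here).
guess <- list(
  Encoding   = c("ISO-8859-2", "ISO-8859-1", "ISO-8859-9"),
  Language   = c("pl", "pt", "tr"),
  Confidence = c(0.46, 0.19, 0.07)
)

# Hypothetical helper: accept the top candidate only if it is reasonably
# confident and clearly ahead of the runner-up; otherwise return NA.
pick_encoding <- function(guess, min_conf = 0.3, min_gap = 0.1) {
  ord  <- order(guess$Confidence, decreasing = TRUE)
  conf <- guess$Confidence[ord]
  gap  <- if (length(conf) > 1) conf[1] - conf[2] else conf[1]
  if (conf[1] >= min_conf && gap >= min_gap) {
    guess$Encoding[ord[1]]
  } else {
    NA_character_
  }
}

best <- pick_encoding(guess)                     # accepted: "ISO-8859-2"
strict <- pick_encoding(guess, min_conf = 0.5)   # rejected: NA
```

With a stricter threshold the guess is rejected, which would be the cue to inspect the file manually instead of converting it blindly.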

Let's re-encode the file. Our target encoding will be UTF-8, as it is a "superset" of all 8-bit encodings: it can represent every character they contain. We really love portable code:

    file <- stri_conv(file, stri_enc_detect(file)[[1]]$Encoding[1], "UTF-8")
    file <- stri_split_lines1(file)  # split a string into text lines
    print(file[22:28])  # text sample
    ## [1] ",,Pójdźcie, o dziatki, pójdźcie wszystkie razem"
    ## [2] ""
    ## [3] "Za miasto, pod słup na wzgórek,"
    ## [4] ""
    ## [5] "Tam przed cudownym klęknijcie obrazem,"
    ## [6] ""
    ## [7] "Pobożnie zmówcie paciórek."
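For comparison, base R's iconv() performs the same kind of conversion. The sketch below builds the word "Pójdźcie" from its ISO-8859-2 bytes (typed in by hand here, not read from the file) and converts it to UTF-8. It also shows what goes wrong when the same bytes are misread as ISO-8859-1, the runner-up in the detection result. It assumes the platform's iconv knows both encoding names, which is usually the case.

```r
# "Pójdźcie" as ISO-8859-2 bytes: ó is 0xF3 and ź is 0xBC in that encoding.
latin2_bytes <- as.raw(c(0x50, 0xf3, 0x6a, 0x64, 0xbc, 0x63, 0x69, 0x65))
word_raw <- rawToChar(latin2_bytes)

# Correct interpretation of the bytes.
utf8 <- iconv(word_raw, from = "ISO-8859-2", to = "UTF-8")

# Wrong interpretation: the same bytes read as ISO-8859-1.
wrong <- iconv(word_raw, from = "ISO-8859-1", to = "UTF-8")

# Compare the Unicode code points of the fifth character.
utf8ToInt(utf8)[5]    # 378 = U+017A, LATIN SMALL LETTER Z WITH ACUTE
utf8ToInt(wrong)[5]   # 188 = U+00BC, VULGAR FRACTION ONE QUARTER
```

Both conversions succeed without any error, which is exactly why a wrong source encoding is easy to miss: the text just comes out subtly garbled.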

Of course, if we knew a priori that the file is in ISO-8859-2, we'd just call:

    file <- stri_conv(readLines("http://www.rexamine.com/manual_upload/powrot_taty_latin2.tex"),
        "ISO-8859-2", "UTF-8")

So far so good.

Word count

LaTeX word counting is quite a complicated task and there are many possible approaches to it. Most often, they rely on running some external tools (which may be a bit inconvenient for some users). Personally, I've always been most satisfied with the output produced by Kile, the LaTeX IDE for the KDE desktop environment.

As not everyone has Kile installed, I decided to grab Kile's algorithm (the power of open source!) and make some not-too-invasive stringi-specific tweaks, and here we are:

    stri_stats_latex(file)
    ##     CharsWord CharsCmdEnvir    CharsWhite         Words          Cmds        Envirs
    ##          2283           335           576           461            32             2
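To make the idea of LaTeX-aware counting concrete, here is a deliberately naive base-R sketch. It is not Kile's algorithm and not what stri_stats_latex() does internally; the function name count_latex_words and its rules (drop comments, count and drop environments and commands, then count tokens containing a letter) are assumptions made purely for illustration.

```r
# Toy LaTeX word counter (illustrative only; rules are simplistic assumptions).
count_latex_words <- function(lines) {
  n_matches <- function(pattern, x) {
    sum(lengths(regmatches(x, gregexpr(pattern, x, perl = TRUE))))
  }
  # 1. Drop comments: everything from an unescaped % to the end of the line.
  txt <- sub("(?<!\\\\)%.*$", "", lines, perl = TRUE)
  # 2. Count \begin{...} as environments, then drop \begin{...} and \end{...}.
  envirs <- n_matches("\\\\begin\\{[^}]*\\}", txt)
  txt <- gsub("\\\\(begin|end)\\{[^}]*\\}", " ", txt, perl = TRUE)
  # 3. Count and drop the remaining command names such as \emph or \item.
  cmds <- n_matches("\\\\[A-Za-z]+", txt)
  txt <- gsub("\\\\[A-Za-z]+", " ", txt, perl = TRUE)
  # 4. Turn braces and brackets into spaces; their contents stay as text.
  txt <- gsub("[{}\\[\\]]", " ", txt, perl = TRUE)
  # 5. A "word" is any whitespace-separated token with at least one letter.
  tokens <- unlist(strsplit(txt, "[[:space:]]+"))
  words <- sum(grepl("[[:alpha:]]", tokens))
  c(Words = words, Cmds = cmds, Envirs = envirs)
}

sample_lines <- c(
  "\\section{Intro} % a comment",
  "Hello \\emph{brave} world, 100\\% done.",
  "\\begin{itemize}",
  "\\item one",
  "\\end{itemize}"
)
res <- count_latex_words(sample_lines)
print(res)
```

On this sample the sketch reports 6 words, 3 commands and 1 environment. Real documents contain math, verbatim blocks, and command arguments that should not be counted as text, which is why a tested implementation is preferable for anything that matters.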

Some other aggregates are also available (they are meaningful for any text file):

    stri_stats_general(file)
    ##       Lines LinesNEmpty       Chars CharsNWhite
    ##         232         122        3308        2930

Finally, here's the word count for my R programming book (in Polish). Importantly, each chapter is stored in a separate .tex file (there are 30 files), so "clicking out" the answer in Kile would be a bit problematic:

    apply(
        sapply(
            list.files(path = "~/Publikacje/ProgramowanieR/rozdzialy/",
                pattern = glob2rx("*.tex"), recursive = TRUE, full.names = TRUE),
            function(x) stri_stats_latex(readLines(x))
        ), 1, sum)
    ##     CharsWord CharsCmdEnvir    CharsWhite         Words          Cmds        Envirs
    ##        718755        458403        281989        120202         37055          6119
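The same per-file-then-sum pattern can be tried without the book's sources. The sketch below is base R only: it writes two small temporary .tex files, computes a few simple aggregates per file with a hypothetical helper file_stats (a stand-in for stri_stats_latex, not a replacement for it), and adds them up with rowSums(), which is equivalent to apply(..., 1, sum) on a numeric matrix.

```r
# Hypothetical helper: a few simple aggregates for a single file.
file_stats <- function(path) {
  x <- readLines(path, encoding = "UTF-8")
  c(Lines       = length(x),
    LinesNEmpty = sum(nzchar(trimws(x))),
    Chars       = sum(nchar(x)))
}

# Two tiny "chapters" in a temporary directory (made-up content).
chapter_dir <- tempfile("chapters")
dir.create(chapter_dir)
writeLines(c("\\chapter{One}", "", "Some text."), file.path(chapter_dir, "ch1.tex"))
writeLines(c("\\chapter{Two}", "More text here."), file.path(chapter_dir, "ch2.tex"))

paths <- list.files(chapter_dir, pattern = glob2rx("*.tex"),
                    recursive = TRUE, full.names = TRUE)

# One column per file, one row per statistic; vapply() checks the shape.
per_file <- vapply(paths, file_stats,
                   FUN.VALUE = c(Lines = 0, LinesNEmpty = 0, Chars = 0))
totals <- rowSums(per_file)
print(totals)

unlink(chapter_dir, recursive = TRUE)  # clean up the temporary files
```

Using vapply() instead of sapply() is a small design choice: it fails loudly if any file yields a result of unexpected length, rather than silently returning a list.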

Notably, my publisher was satisfied with the above estimate.

Next time we'll take a look at ICU's very powerful transliteration services.

More information

For more information check out the stringi package website and its on-line documentation.

For bug reports and feature requests visit our GitHub profile.

Any comments and suggestions are warmly welcome.

