OP: liuxf666

[Study Notes] How to Design a Spelling Corrector


OP liuxf666 posted on 2019-5-15 10:59:13

Probability Theory

Problem: find the correction c, out of all possible candidate corrections, that maximizes the probability that c is the intended correction, given the original word w:

    argmax_{c ∈ candidates} P(c|w)

By Bayes' theorem this is equal to

    argmax_{c ∈ candidates} P(c) P(w|c) / P(w)

and since P(w) is the same for every candidate c, we can factor it out, giving

    argmax_{c ∈ candidates} P(c) P(w|c)


The four parts of this expression are:
  • Selection Mechanism: argmax
    We choose the candidate with the highest combined probability.
  • Candidate Model: c ∈ candidates
    This tells us which candidate corrections, c, to consider.
  • Language Model: P(c)
    The probability that c appears as a word of English text. For example, occurrences of "the" make up about 7% of English text, so we should have P(the) = 0.07.
  • Error Model: P(w|c)
    The probability that w would be typed in a text when the author meant c. For example, P(teh|the) is relatively high, but P(theeexyz|the) would be very low.
One obvious question is: why take a simple expression like P(c|w) and replace it with a more complex expression involving two models rather than one? The answer is that P(c|w) is already conflating two factors, and it is easier to separate the two out and deal with them explicitly. Consider the misspelled word w="thew" and the two candidate corrections c="the" and c="thaw". Which has a higher P(c|w)? Well, "thaw" seems good because the only change is "a" to "e", which is a small change. On the other hand, "the" seems good because "the" is a very common word, and while adding a "w" seems like a larger, less probable change, perhaps the typist's finger slipped off the "e". The point is that to estimate P(c|w) we have to consider both the probability of c and the probability of the change from c to w anyway, so it is cleaner to formally separate the two factors.
How It Works in Python

The four parts of the program are:
1. Selection Mechanism: In Python, max with a key argument does 'argmax'.
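The original code block is missing from this copy of the post. A minimal runnable sketch of the selection mechanism, using toy stand-ins for the `candidates` and `P` functions that the later parts build:

```python
def P(word):
    # Toy language model: hand-picked relative frequencies, standing in for
    # the counts built from big.txt in part 3.
    freqs = {'the': 0.07, 'thaw': 0.0001}
    return freqs.get(word, 0.0)

def candidates(word):
    # Toy candidate model, standing in for part 2: both corrections of "thew".
    return {'the', 'thaw'}

def correction(word):
    "Most probable spelling correction: max with key=P implements argmax."
    return max(candidates(word), key=P)

print(correction('thew'))  # 'the' wins, since P('the') > P('thaw')
```

`max(..., key=P)` returns the element of the candidate set for which `P` is largest, which is exactly the argmax over candidates.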


2. Candidate Model: First a new concept: a simple edit to a word is a deletion (remove one letter), a transposition (swap two adjacent letters), a replacement (change one letter to another) or an insertion (add a letter). The function edits1 returns a set of all the edited strings (whether words or not) that can be made with one simple edit:
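The code block did not survive extraction; here is a sketch of `edits1` closely following the version published in Norvig's essay:

```python
def edits1(word):
    "All strings one simple edit (delete, transpose, replace, insert) away."
    letters    = 'abcdefghijklmnopqrstuvwxyz'
    splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
    deletes    = [L + R[1:]               for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
    inserts    = [L + c + R               for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)
```

For instance, 'the' is in `edits1('teh')` via one transposition, and in `edits1('thz')` via one replacement.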


This can be a big set. For a word of length n, there will be n deletions, n-1 transpositions, 26n replacements, and 26(n+1) insertions, for a total of 54n+25 (of which a few are typically duplicates). For example, a word of length 8 yields 54×8+25 = 457 edit strings before duplicates collapse.
However, if we restrict ourselves to words that are known—that is, in the dictionary— then the set is much smaller:
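The `known` filter is another one-liner in the essay; a runnable sketch, with a toy word counter standing in for the big.txt-based WORDS of part 3:

```python
from collections import Counter

# Toy stand-in for the WORDS counter built from big.txt in part 3.
WORDS = Counter({'something': 90, 'soothing': 5, 'the': 700})

def known(words):
    "The subset of `words` that appear in the WORDS dictionary."
    return set(w for w in words if w in WORDS)

print(known(['something', 'somthing', 'soothing']))
# only the dictionary words survive; the misspelling 'somthing' is dropped
```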


We'll also consider corrections that require two simple edits. This generates a much bigger set of possibilities, but usually only a few of them are known words.
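`edits2` just applies `edits1` twice. A sketch, with `edits1` repeated so the snippet runs on its own; returning a generator rather than a set avoids materializing the very large result:

```python
def edits1(word):
    "All strings one simple edit away (repeated from part 2)."
    letters    = 'abcdefghijklmnopqrstuvwxyz'
    splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
    deletes    = [L + R[1:]               for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
    inserts    = [L + c + R               for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    "All strings two simple edits away from `word` (lazily generated)."
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))
```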


We say that the results of edits2(w) have an edit distance of 2 from w.
3. Language Model: We can estimate the probability of a word, P(word), by counting the number of times each word appears in a text file of about a million words, big.txt. It is a concatenation of public domain book excerpts from Project Gutenberg and lists of most frequent words from Wiktionary and the British National Corpus. The function words breaks text into words, then the variable WORDS holds a Counter of how often each word appears, and P estimates the probability of each word, based on this Counter.
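A sketch of the language model, with a short sample string standing in for the roughly one-million-word big.txt:

```python
import re
from collections import Counter

def words(text):
    "Break text into lowercase words."
    return re.findall(r'\w+', text.lower())

# The essay builds WORDS from big.txt; a tiny sample text stands in here.
sample = 'The quick brown fox and the lazy dog and the cat'
WORDS = Counter(words(sample))

def P(word, N=sum(WORDS.values())):
    "Probability of `word`, estimated by its relative frequency."
    return WORDS[word] / N

print(P('the'))  # 'the' is 3 of the 11 tokens in the sample
```

Because WORDS is a Counter, unseen words get a count of 0, so `P` returns 0 for any word not in the corpus.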


4. Error Model: When I started to write this program, sitting on a plane in 2007, I had no data on spelling errors, and no internet connection (I know that may be hard to imagine today). Without data I couldn't build a good spelling error model, so I took a shortcut: I defined a trivial, flawed error model that says all known words of edit distance 1 are infinitely more probable than known words of edit distance 2, and infinitely less probable than a known word of edit distance 0. So we can make candidates(word) produce the first non-empty list of candidates in order of priority:
  • The original word, if it is known; otherwise
  • The list of known words at edit distance one away, if there are any; otherwise
  • The list of known words at edit distance two away, if there are any; otherwise
  • The original word, even though it is not known.


Then we don't need to multiply by a P(w|c) factor, because every candidate at the chosen priority will have the same probability (according to our flawed model). That gives us:
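Putting the pieces together, an end-to-end runnable sketch of the priority scheme, with a toy corpus in place of big.txt:

```python
import re
from collections import Counter

# Toy corpus standing in for big.txt.
WORDS = Counter(re.findall(r'\w+', 'the cat sat on the mat and the dog thawed'.lower()))

def P(word, N=sum(WORDS.values())):
    return WORDS[word] / N

def known(words):
    return set(w for w in words if w in WORDS)

def edits1(word):
    letters    = 'abcdefghijklmnopqrstuvwxyz'
    splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
    deletes    = [L + R[1:]               for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
    inserts    = [L + c + R               for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))

def candidates(word):
    "First non-empty candidate set, in the priority order listed above."
    return known([word]) or known(edits1(word)) or known(edits2(word)) or [word]

def correction(word):
    "Most probable spelling correction for word."
    return max(candidates(word), key=P)

print(correction('teh'))  # one transposition away from 'the'
```

Python's `or` returns the first truthy operand, so the chain in `candidates` falls through to lower-priority sets only when the higher-priority ones are empty.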




Rated by 1 member: 经管之家编辑部 gave +100 forum coins, +3 academic level, +3 helpfulness, +3 credit rating ("Excellent post").

#2 经管之家编辑部 (verified) posted 2019-5-15 11:32:51
Liked!

#3 充实每一天 posted 2019-5-15 11:36:38, from mobile
Like

#4 tianwk posted 2019-5-15 14:08:19
Keep it up!

#5 jessie68us posted 2019-5-15 23:27:52
Selection Mechanism: argmax
Candidate Model: c ∈ candidates
Language Model: P(c)
Error Model: P(w|c)
These four models are classic. Liked!

#6 从1万到一亿 (verified) posted 2019-5-16 15:10:01

