OP: liuxf666

[Study Notes] How to Design a Spelling Corrector


OP liuxf666 posted on 2019-5-15 10:59:13

Probability Theory

Problem: find the correction c, out of all possible candidate corrections, that maximizes the probability that c is the intended correction, given the original word w:

    argmax_{c ∈ candidates} P(c|w)

By Bayes' theorem this is equal to

    argmax_{c ∈ candidates} P(c) P(w|c) / P(w)

and since P(w) is the same for every candidate c, we can factor it out, giving

    argmax_{c ∈ candidates} P(c) P(w|c)


The four parts of this expression are:
  • Selection Mechanism: argmax
    We choose the candidate with the highest combined probability.
  • Candidate Model: c ∈ candidates
    This tells us which candidate corrections, c, to consider.
  • Language Model: P(c)
    The probability that c appears as a word of English text. For example, occurrences of "the" make up about 7% of English text, so we should have P(the) = 0.07.
  • Error Model: P(w|c)
    The probability that w would be typed in a text when the author meant c. For example, P(teh|the) is relatively high, but P(theeexyz|the) would be very low.
One obvious question is: why take a simple expression like P(c|w) and replace it with a more complex expression involving two models rather than one? The answer is that P(c|w) is already conflating two factors, and it is easier to separate the two out and deal with them explicitly. Consider the misspelled word w="thew" and the two candidate corrections c="the" and c="thaw". Which has a higher P(c|w)? Well, "thaw" seems good because the only change is "a" to "e", which is a small change. On the other hand, "the" seems good because "the" is a very common word, and while adding a "w" seems like a larger, less probable change, perhaps the typist's finger slipped off the "e". The point is that to estimate P(c|w) we have to consider both the probability of c and the probability of the change from c to w anyway, so it is cleaner to formally separate the two factors.
How It Works in Python

The four parts of the program are:
1. Selection Mechanism: In Python, max with a key argument does 'argmax'.
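The original code block is missing from this copy of the post. A minimal runnable sketch of the selection mechanism, using toy stand-ins for the `candidates` and `P` functions that the later parts build:

```python
def P(word):
    # Toy language model: hand-picked relative frequencies, standing in for
    # the counts built from big.txt in part 3.
    freqs = {'the': 0.07, 'thaw': 0.0001}
    return freqs.get(word, 0.0)

def candidates(word):
    # Toy candidate model, standing in for part 2: both corrections of "thew".
    return {'the', 'thaw'}

def correction(word):
    "Most probable spelling correction: max with key=P implements argmax."
    return max(candidates(word), key=P)

print(correction('thew'))  # 'the' wins, since P('the') > P('thaw')
```

`max(..., key=P)` returns the element of the candidate set for which `P` is largest, which is exactly the argmax over candidates.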


2. Candidate Model: First a new concept: a simple edit to a word is a deletion (remove one letter), a transposition (swap two adjacent letters), a replacement (change one letter to another) or an insertion (add a letter). The function edits1 returns a set of all the edited strings (whether words or not) that can be made with one simple edit:
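The code block did not survive extraction; here is a sketch of `edits1` closely following the version published in Norvig's essay:

```python
def edits1(word):
    "All strings one simple edit (delete, transpose, replace, insert) away."
    letters    = 'abcdefghijklmnopqrstuvwxyz'
    splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
    deletes    = [L + R[1:]               for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
    inserts    = [L + c + R               for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)
```

For instance, 'the' is in `edits1('teh')` via one transposition, and in `edits1('thz')` via one replacement.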


This can be a big set. For a word of length n, there will be n deletions, n-1 transpositions, 26n replacements, and 26(n+1) insertions, for a total of 54n+25 (of which a few are typically duplicates). For example, a word of length 8 yields 54×8+25 = 457 edit strings before duplicates collapse.
However, if we restrict ourselves to words that are known—that is, in the dictionary— then the set is much smaller:
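The `known` filter is another one-liner in the essay; a runnable sketch, with a toy word counter standing in for the big.txt-based WORDS of part 3:

```python
from collections import Counter

# Toy stand-in for the WORDS counter built from big.txt in part 3.
WORDS = Counter({'something': 90, 'soothing': 5, 'the': 700})

def known(words):
    "The subset of `words` that appear in the WORDS dictionary."
    return set(w for w in words if w in WORDS)

print(known(['something', 'somthing', 'soothing']))
# only the dictionary words survive; the misspelling 'somthing' is dropped
```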


We'll also consider corrections that require two simple edits. This generates a much bigger set of possibilities, but usually only a few of them are known words.
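`edits2` just applies `edits1` twice. A sketch, with `edits1` repeated so the snippet runs on its own; returning a generator rather than a set avoids materializing the very large result:

```python
def edits1(word):
    "All strings one simple edit away (repeated from part 2)."
    letters    = 'abcdefghijklmnopqrstuvwxyz'
    splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
    deletes    = [L + R[1:]               for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
    inserts    = [L + c + R               for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    "All strings two simple edits away from `word` (lazily generated)."
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))
```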


We say that the results of edits2(w) have an edit distance of 2 from w.
3. Language Model: We can estimate the probability of a word, P(word), by counting the number of times each word appears in a text file of about a million words, big.txt. It is a concatenation of public domain book excerpts from Project Gutenberg and lists of most frequent words from Wiktionary and the British National Corpus. The function words breaks text into words, then the variable WORDS holds a Counter of how often each word appears, and P estimates the probability of each word, based on this Counter.
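A sketch of the language model, with a short sample string standing in for the roughly one-million-word big.txt:

```python
import re
from collections import Counter

def words(text):
    "Break text into lowercase words."
    return re.findall(r'\w+', text.lower())

# The essay builds WORDS from big.txt; a tiny sample text stands in here.
sample = 'The quick brown fox and the lazy dog and the cat'
WORDS = Counter(words(sample))

def P(word, N=sum(WORDS.values())):
    "Probability of `word`, estimated by its relative frequency."
    return WORDS[word] / N

print(P('the'))  # 'the' is 3 of the 11 tokens in the sample
```

Because WORDS is a Counter, unseen words get a count of 0, so `P` returns 0 for any word not in the corpus.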


4. Error Model: When I started to write this program, sitting on a plane in 2007, I had no data on spelling errors, and no internet connection (I know that may be hard to imagine today). Without data I couldn't build a good spelling error model, so I took a shortcut: I defined a trivial, flawed error model that says all known words of edit distance 1 are infinitely more probable than known words of edit distance 2, and infinitely less probable than a known word of edit distance 0. So we can make candidates(word) produce the first non-empty list of candidates in order of priority:
  • The original word, if it is known; otherwise
  • The list of known words at edit distance one away, if there are any; otherwise
  • The list of known words at edit distance two away, if there are any; otherwise
  • The original word, even though it is not known.


Then we don't need to multiply by a P(w|c) factor, because every candidate at the chosen priority will have the same probability (according to our flawed model). That gives us:
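Putting the pieces together, an end-to-end runnable sketch of the priority scheme, with a toy corpus in place of big.txt:

```python
import re
from collections import Counter

# Toy corpus standing in for big.txt.
WORDS = Counter(re.findall(r'\w+', 'the cat sat on the mat and the dog thawed'.lower()))

def P(word, N=sum(WORDS.values())):
    return WORDS[word] / N

def known(words):
    return set(w for w in words if w in WORDS)

def edits1(word):
    letters    = 'abcdefghijklmnopqrstuvwxyz'
    splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
    deletes    = [L + R[1:]               for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
    inserts    = [L + c + R               for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))

def candidates(word):
    "First non-empty candidate set, in the priority order listed above."
    return known([word]) or known(edits1(word)) or known(edits2(word)) or [word]

def correction(word):
    "Most probable spelling correction for word."
    return max(candidates(word), key=P)

print(correction('teh'))  # one transposition away from 'the'
```

Python's `or` returns the first truthy operand, so the chain in `candidates` falls through to lower-priority sets only when the higher-priority ones are empty.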




Rated by 1 member: 经管之家编辑部 gave +100 forum coins, +3 academic level, +3 helpfulness, +3 credit rating ("Excellent post").

#2 经管之家编辑部 (verified) posted 2019-5-15 11:32:51
Liked!

#3 充实每一天 posted 2019-5-15 11:36:38, from mobile
Like

#4 tianwk posted 2019-5-15 14:08:19
Keep it up!

#5 jessie68us posted 2019-5-15 23:27:52
Selection Mechanism: argmax
Candidate Model: c ∈ candidates
Language Model: P(c)
Error Model: P(w|c)
These four models are classic. Liked!

#6 从1万到一亿 (verified) posted 2019-5-16 15:10:01

