Original poster: 宽客老丁

[Other] Using Modularization to Handle Complex Projects

11
宽客老丁 posted on 2022-1-19 09:16:41
Dictionary Learning
One such method is dictionary learning, which learns the sparse representation of the original data. The resulting matrix is known as the dictionary, and the vectors in the dictionary are known as atoms.
Assuming there are d features in the original data and n atoms in the dictionary, we can have a dictionary that is either undercomplete, where n < d, or overcomplete, where n > d.

13
宽客老丁 posted on 2022-1-19 09:17:32
Search for the Optimal Number of Principal Components
Now, let’s perform a few experiments by reducing the number of principal components PCA generates and evaluate the fraud detection results. We need the PCA-based fraud detection solution to have enough error on the rare cases that it can meaningfully separate fraud cases from the normal ones. But the error cannot be so low or so high for all the transactions that the rare and normal transactions are virtually indistinguishable.

14
宽客老丁 posted on 2022-1-19 09:17:55
Results using normal PCA and 27 principal components
As you can see, we are able to catch 80% of the fraud with 75% precision. This is very impressive considering that we did not use any labels. To make these results more tangible, consider that there are 190,820 transactions in the training set and only 330 are fraudulent.
Using PCA, we calculated the reconstruction error for each of these 190,820 transactions. If we sort these transactions by highest reconstruction error (also referred to as anomaly score) in descending order and extract the top 350 transactions from the list, we can see that 264 of these transactions are fraudulent.
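The figures quoted above can be checked with a few lines of arithmetic. The sketch below also includes a small, hypothetical helper (precision_recall_at_k) that ranks rows by anomaly score and measures the top k; the helper and its toy inputs are illustrative, not part of the original text.

```python
def precision_recall_at_k(scores, labels, k):
    """Rank rows by score (descending) and evaluate the top k.

    labels are 1 for fraud and 0 for normal.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = sum(label for _, label in ranked[:k])
    return hits / k, hits / sum(labels)


# Figures quoted in the text
flagged = 350      # top transactions extracted from the sorted list
caught = 264       # of those, how many are fraudulent
total_fraud = 330  # fraudulent transactions in the training set

precision = caught / flagged    # share of flagged transactions that are fraud
recall = caught / total_fraud   # share of all fraud that was flagged

print(f"precision = {precision:.1%}, recall = {recall:.1%}")

# Toy usage of the helper: one of the top two scores is a true positive
toy_precision, toy_recall = precision_recall_at_k(
    scores=[0.9, 0.1, 0.8, 0.3], labels=[1, 0, 0, 1], k=2
)
```

So 264 of 350 is about 75% precision, and 264 of 330 is 80% recall, matching the numbers in the post.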

16
宽客老丁 posted on 2022-1-19 09:18:32
Kernel PCA Anomaly Detection
Now let’s design a fraud detection solution using kernel PCA, which is a nonlinear form of PCA and is useful if the fraud transactions are not linearly separable from the nonfraud transactions.
We need to specify the number of components we would like to generate, the kernel (we will use the RBF kernel as we did in the previous chapter), and the gamma (which is set to 1/n_features by default, so 1/30 in our case). We also need to set fit_inverse_transform to True to apply the built-in inverse_transform function provided by Scikit-Learn.
Finally, because kernel PCA is so expensive to train, we will train on just the first two thousand samples in the transactions dataset. This is not ideal, but it is necessary to perform experiments quickly.
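A minimal sketch of that setup follows. It assumes synthetic data with 30 features in place of the transactions dataset and uses only 500 rows so it runs quickly. The choice of 27 components is an assumption borrowed from the linear PCA run, since the text does not give the number used here.

```python
import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.RandomState(0)
X = rng.normal(size=(500, 30))  # synthetic stand-in: 500 rows, 30 features

kpca = KernelPCA(
    n_components=27,             # assumed; the text does not state the number
    kernel="rbf",
    gamma=1.0 / X.shape[1],      # same as the default of 1/n_features, here 1/30
    fit_inverse_transform=True,  # required before inverse_transform can be called
    random_state=0,
)

X_reduced = kpca.fit_transform(X)
X_reconstructed = kpca.inverse_transform(X_reduced)

# Anomaly score: sum of squared reconstruction errors per row
scores = np.sum((X - X_reconstructed) ** 2, axis=1)
print(X_reduced.shape, X_reconstructed.shape, scores.shape)
```

Without fit_inverse_transform=True, calling inverse_transform raises an error, because scikit-learn only learns the approximate inverse mapping when asked to.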

17
宽客老丁 posted on 2022-1-19 09:18:55
Dictionary Learning Anomaly Detection
Let’s use dictionary learning to develop a fraud detection solution. Recall that, in dictionary learning, the algorithm learns the sparse representation of the original data. Using the vectors in the learned dictionary, each instance in the original data can be reconstructed as a weighted sum of these learned vectors.
For anomaly detection, we want to learn an undercomplete dictionary so that the vectors in the dictionary are fewer in number than the original dimensions. With this constraint, it will be easier to reconstruct the more frequently occurring normal transactions and much more difficult to reconstruct the rarer fraud transactions.
In our case, we will generate 28 vectors (or components).

19
宽客老丁 posted on 2022-1-19 09:19:37
Rules-Based vs. Machine Learning
Using a rules-based approach, we can design a spam filter with explicit rules to catch spam, such as flagging emails that use “u” instead of “you,” “4” instead of “for,” “BUY NOW,” etc. But this system would be difficult to maintain over time as bad guys change their spam behavior to evade the rules. If we used a rules-based system, we would have to frequently adjust the rules manually just to stay up-to-date. Also, it would be very expensive to set up: think of all the rules we would need to create to make this a well-functioning system.
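A toy version of such a filter, written as a rough sketch, shows why it is brittle. The three rules mirror the examples above; the regular expressions, the RULES list, and the is_spam function are illustrative assumptions rather than a real filter.

```python
import re

# One hand-written pattern per rule mentioned in the text
RULES = [
    re.compile(r"\bu\b", re.IGNORECASE),  # "u" instead of "you"
    re.compile(r"\b4\b"),                 # "4" instead of "for"
    re.compile(r"BUY NOW"),               # literal phrase
]


def is_spam(text):
    """Flag the text if any rule matches."""
    return any(rule.search(text) for rule in RULES)


print(is_spam("thanks 4 helping u"))       # matches the "u" and "4" rules
print(is_spam("BUY NOW and save"))         # matches the phrase rule
print(is_spam("See you at the meeting"))   # no rule matches
print(is_spam("BUY N0W today"))            # a zero in place of "O" evades every rule
```

The last call shows the maintenance problem: one small change by the sender slips past the rules, and someone has to write a new rule by hand. The "4" rule would also flag a legitimate message such as "I have 4 apples".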
