Original poster: oliyiyi

Machine Learning for Drug Adverse Event Discovery

We can use unsupervised machine learning to identify which drugs are associated with which adverse events. Specifically, machine learning can help us create clusters based on gender, age, outcome of the adverse event, route the drug was administered by, purpose the drug was used for, body mass index, and so on. This helps us quickly discover hidden associations between drugs and adverse events.

Clustering is an unsupervised learning technique with wide applications. Some examples where clustering is commonly applied are market segmentation, social network analysis, and astronomical data analysis. Clustering is the grouping of data into sub-groups so that objects within a cluster are highly similar to one another but very dissimilar to objects in other clusters. For clustering, each pattern is represented as a vector in multidimensional space, and a distance measure is used to quantify the dissimilarity between instances. In this post, we will see how we can use hierarchical clustering to identify drug adverse events. You can read about hierarchical clustering on Wikipedia.
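To make the distance idea concrete, here is a toy sketch (the two patients and their encoded values are made up for illustration):

  # two patients encoded as vectors in numeric space
  p1 <- c(age = 63, male = 1, oral = 1)
  p2 <- c(age = 22, male = 0, oral = 0)
  # pairwise Euclidean distance between them
  dist(rbind(p1, p2), method = "euclidean")

Note how the raw age difference dominates the distance; this is why the variables are normalized before clustering later in this post.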

Data

Let’s create fake drug adverse event data in which we can visually identify the clusters, and then see whether our machine learning algorithm can find them. If we had millions of rows of adverse event data, clustering would help us summarize the data and get insights quickly.

Let’s assume a drug AAA results in the adverse events shown below. We will see, for each group (cluster), what kind of reactions (adverse events) the drug produces.
In the table shown below, I have created four clusters (a sketch of how such data could be generated follows the list):

  • Route=ORAL, Age=60s, Sex=M, Outcome code=OT, Indication=RHEUMATOID ARTHRITIS and Reaction=VASCULITIC RASH + some noise
  • Route=TOPICAL, Age=early 20s, Sex=F, Outcome code=HO, Indication=URINARY TRACT INFECTION and Reaction=VOMITING + some noise
  • Route=INTRAVENOUS, Age=about 5, Sex=F, Outcome code=LT, Indication=TONSILLITIS and Reaction=VOMITING + some noise
  • Route=OPHTHALMIC, Age=early 50s, Sex=F, Outcome code=DE, Indication=Senile osteoporosis and Reaction=Sepsis + some noise
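
For readers who want to reproduce the exercise without the download, here is a minimal sketch of how comparable fake data could be generated. The make_cluster helper and the exact noise scheme are my assumptions, not the original file; the cluster sizes (20, 13, 15, 24) match the counts reported later in the post:

  set.seed(1)
  # hypothetical helper: n rows sharing one categorical profile, with noisy ages
  make_cluster <- function(n, route, age_mean, sex, outc, indi, pt) {
    data.frame(route = route,
               age = pmax(1, round(rnorm(n, mean = age_mean, sd = 3))),
               sex = sex, outc_cod = outc, indi_pt = indi, pt = pt)
  }
  my_data <- rbind(
    make_cluster(20, "ORAL",        63, "M", "OT", "RHEUMATOID ARTHRITIS",    "VASCULITIC RASH"),
    make_cluster(13, "TOPICAL",     22, "F", "HO", "URINARY TRACT INFECTION", "VOMITING"),
    make_cluster(15, "INTRAVENOUS",  5, "F", "LT", "TONSILLITIS",             "VOMITING"),
    make_cluster(24, "OPHTHALMIC",  52, "F", "DE", "Senile osteoporosis",     "Sepsis")
  )
  # flipping a few sex or route values would add the categorical noise the post mentions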

Below is a preview of my data. You can download the data here

  head(my_data)
    route age sex outc_cod              indi_pt              pt
  1  ORAL  63   M       OT RHEUMATOID ARTHRITIS VASCULITIC RASH
  2  ORAL  66   F       OT RHEUMATOID ARTHRITIS VASCULITIC RASH
  3  ORAL  66   M       OT RHEUMATOID ARTHRITIS VASCULITIC RASH
  4  ORAL  57   M       OT RHEUMATOID ARTHRITIS VASCULITIC RASH
  5  ORAL  66   M       OT RHEUMATOID ARTHRITIS VASCULITIC RASH
  6  ORAL  66   M       OT RHEUMATOID ARTHRITIS VASCULITIC RASH

Hierarchical Clustering

To perform hierarchical clustering, we need to convert the text variables to numeric values so that we can calculate distances. Since age is already numeric, we will set it aside and convert the character variables into binary indicator columns in multidimensional numeric space.

  library(dplyr)  # for select()
  age = my_data$age
  my_data = select(my_data, -age)

Create a Matrix
  my_matrix = as.data.frame(do.call(cbind, lapply(my_data, function(x) table(1:nrow(my_data), x))))
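
To see what the one-liner above is doing: for each column x, table(1:nrow(my_data), x) cross-tabulates the row index against the column's values, which is a compact base-R way to build 0/1 indicator (one-hot) columns, one per level. A tiny illustration:

  x <- c("ORAL", "TOPICAL", "ORAL")
  table(1:3, x)
  #    x
  #      ORAL TOPICAL
  #    1    1       0
  #    2    0       1
  #    3    1       0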

Now, we can add the age column:

  my_matrix$Age = age
  head(my_matrix)
    INTRAVENOUS OPHTHALMIC ORAL TOPICAL F M DE HO LT OT RHEUMATOID ARTHRITIS Senile osteoporosis
  1           0          0    1       0 0 1  0  0  0  1                    1                   0
  2           0          0    1       0 1 0  0  0  0  1                    1                   0
  3           0          0    1       0 0 1  0  0  0  1                    1                   0
  4           0          0    1       0 0 1  0  0  0  1                    1                   0
  5           0          0    1       0 0 1  0  0  0  1                    1                   0
  6           0          0    1       0 0 1  0  0  0  1                    1                   0
    TONSILLITIS URINARY TRACT INFECTION Sepsis VASCULITIC RASH VOMITING Age
  1           0                       0      0               1        0  63
  2           0                       0      0               1        0  66
  3           0                       0      0               1        0  66
  4           0                       0      0               1        0  57
  5           0                       0      0               1        0  66
  6           0                       0      0               1        0  66

Let’s normalize our variables using the caret package.

  library(caret)
  preproc = preProcess(my_matrix)
  my_matrixNorm = as.matrix(predict(preproc, my_matrix))
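
A note on what preProcess does here: with no method argument, caret's preProcess defaults to centering and scaling the columns, which is what we need before computing Euclidean distances. Written out explicitly, it would be:

  preproc = preProcess(my_matrix, method = c("center", "scale"))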

Next, let’s calculate the distances, apply hierarchical clustering, and plot the dendrogram.

  distances = dist(my_matrixNorm, method = "euclidean")
  clusterdrug = hclust(distances, method = "ward.D")
  plot(clusterdrug, labels = FALSE, xlab = "", sub = "", cex = 0.5)
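
A small caveat on the linkage choice: in R's hclust, "ward.D" historically expects squared Euclidean distances, while "ward.D2" implements Ward's criterion on the distances themselves. Either choice recovers the four groups in this fake data, but the textbook Ward criterion would be:

  clusterdrug = hclust(distances, method = "ward.D2")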

You will get this plot:

[Dendrogram plot]


From the dendrogram shown above, we see that four distinct clusters can be cut from the fake data we created. Let’s use different colors to identify the four clusters.

  # install.packages("dendextend")  # one-time installation, if not already installed
  library(dendextend)
  dend <- as.dendrogram(clusterdrug)
  # Color the branches based on the four clusters:
  dend <- color_branches(dend, k = 4)
  # Hang the dendrogram a bit:
  dend <- hang.dendrogram(dend, hang_height = 0.1)
  # Reduce the size of the labels:
  dend <- set(dend, "labels_cex", 0.5)
  plot(dend)
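
If you prefer to stay in base R instead of using dendextend, a quick alternative (a sketch, not part of the original post) is to draw colored boxes around the four clusters on the plain dendrogram:

  plot(clusterdrug, labels = FALSE)
  rect.hclust(clusterdrug, k = 4, border = 2:5)  # one colored rectangle per cluster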

Here is the plot:

[Dendrogram plot with the four clusters colored]


Now, let’s cut the tree into four cluster groups.

  clusterGroups = cutree(clusterdrug, k = 4)

Now, let’s attach the cluster assignments (and the age column we set aside) to the original data.

  my_data = cbind(data.frame(Cluster = clusterGroups), my_data, age)
  head(my_data)
    Cluster route sex outc_cod              indi_pt              pt age
  1       1  ORAL   M       OT RHEUMATOID ARTHRITIS VASCULITIC RASH  63
  2       1  ORAL   F       OT RHEUMATOID ARTHRITIS VASCULITIC RASH  66
  3       1  ORAL   M       OT RHEUMATOID ARTHRITIS VASCULITIC RASH  66
  4       1  ORAL   M       OT RHEUMATOID ARTHRITIS VASCULITIC RASH  57
  5       1  ORAL   M       OT RHEUMATOID ARTHRITIS VASCULITIC RASH  66
  6       1  ORAL   M       OT RHEUMATOID ARTHRITIS VASCULITIC RASH  66

Number of Observations in Each Cluster
  observationsH = c()
  for (i in seq(1, 4)) {
    # count the reports assigned to cluster i
    observationsH = c(observationsH, sum(clusterGroups == i))
  }
  observationsH = as.data.frame(list(cluster = c(1:4), Number_of_observations = observationsH))
  observationsH
    cluster Number_of_observations
  1       1                     20
  2       2                     13
  3       3                     15
  4       4                     24
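
The same counts can be obtained in one line with base R's table(); a sketch:

  as.data.frame(table(cluster = clusterGroups))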

What is the most common observation in each cluster?

Let’s calculate the column average for each cluster. Since the columns are 0/1 indicators, each column mean is simply the share of reports in that cluster with that attribute.

  z = do.call(cbind, lapply(1:4, function(i) round(colMeans(subset(my_matrix, clusterGroups == i)), 2)))
  colnames(z) = paste0('cluster', seq(1, 4))
  z
              cluster1 cluster2 cluster3 cluster4
  INTRAVENOUS     0.00     0.00     1.00     0.00
  OPHTHALMIC      0.00     0.00     0.00     0.92
  ORAL            1.00     0.08     0.00     0.08
  TOPICAL         0.00     0.92     0.00     0.00
  F               0.10     0.85     0.80     1.00
  M               0.90     0.15     0.20     0.00
  DE              0.00     0.00     0.00     0.83
  .....

Next, let’s extract the most common observation (the modal level of each categorical variable) in each cluster:

  Age = z[nrow(z), ]        # the last row of z holds the mean age per cluster
  z = z[1:(nrow(z) - 1), ]  # drop it, keeping only the indicator rows
  # my_data now also contains Cluster and age, so list the categorical columns explicitly
  cat_vars = c("route", "sex", "outc_cod", "indi_pt", "pt")
  my_result = matrix(0, ncol = 4, nrow = length(cat_vars))
  for (i in seq(1, 4)) {
    for (j in seq_along(cat_vars)) {
      # levels of this variable, then the level with the highest within-cluster share
      q = as.vector(as.matrix(unique(my_data[cat_vars[j]])))
      my_result[j, i] = names(sort(z[q, i], decreasing = TRUE)[1])
    }
  }
  colnames(my_result) = paste0('Cluster', seq(1, 4))
  rownames(my_result) = cat_vars
  my_result = rbind(Age, my_result)
  my_result <- cbind(Attribute = c("Age", "Route", "Sex", "Outcome Code", "Indication preferred term", "Adverse event"), my_result)
  rownames(my_result) <- NULL
  my_result

  Attribute                  Cluster1              Cluster2                 Cluster3      Cluster4
  Age                        61.8                  17.54                    5.8           44.62
  Route                      ORAL                  TOPICAL                  INTRAVENOUS   OPHTHALMIC
  Sex                        M                     F                        F             F
  Outcome Code               OT                    HO                       LT            DE
  Indication preferred term  RHEUMATOID ARTHRITIS  URINARY TRACT INFECTION  TONSILLITIS   Senile osteoporosis
  Adverse event              VASCULITIC RASH       VOMITING                 VOMITING      Sepsis
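
As a quick sanity check that the recovered clusters line up with the designed groups, you can cross-tabulate the cluster labels against any of the original variables (a sketch):

  # rows = cluster, columns = route; a clean one-route-per-cluster pattern confirms the recovery
  table(my_data$Cluster, my_data$route)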

Summary

We have created the clusters using hierarchical clustering. Cluster 1 shows that for males in their 60s, the drug results in a vasculitic rash when taken for rheumatoid arthritis. We can interpret the other clusters similarly. Remember, this is not real data; it is fake data made to show the application of clustering to drug adverse event study. From this short post, we see that clustering can be used for knowledge discovery in drug adverse event reactions. Especially in cases where the data has millions of observations and we cannot get any insight visually, clustering becomes handy for summarizing our data, getting statistical insights, and discovering new knowledge.

