Original poster: 暗示效应184

[English-language paper] Robustness Analysis of a Website Categorization Procedure based on Machine ...

暗示效应184 posted on 2005-6-15 00:48:19

Paper: Robustness Analysis of a Website Categorization Procedure based on Machine Learning
Authors: Renato Bruni, Gianpiero Bianchi
Abstract:
Website categorization has recently emerged as a very important task in several contexts. A huge amount of information is freely available through websites, and it could be used to accomplish statistical surveys, saving the cost of the surveys, or to validate already surveyed data. However, the information of interest for the specific categorization has to be mined among that huge amount. This turns out to be a difficult task in practice. This work describes techniques that can be used to convert website categorization into a supervised classification problem. To do so, each data record should summarize the content of an entire website. We generate this kind of records by using web scraping and optical character recognition, followed by a number of automated feature engineering steps. When such records have been produced, we apply to them state-of-the-art classification techniques to categorize the websites according to the aspect of interest. We use Support Vector Machines, Random Forest and Logistic classifiers. Since in many applicative cases the labels available for the training set may be noisy, we analyze the robustness of our procedure with respect to the presence of misclassified training records. We present results on real-world data for the problem of the detection of websites providing e-commerce facilities.
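
To make the idea in the abstract concrete, here is a rough sketch of a label-noise robustness check. It is not the authors' code, and almost everything in it is an assumption made for illustration: the "scraped" text is synthetic (no real scraping or OCR), the vocabulary `VOCAB` is hypothetical, the feature step is a plain term-frequency vector, and only a small hand-written logistic classifier is used (the paper also uses Support Vector Machines and Random Forest). The function names (`featurize`, `flip_labels`, `robustness_curve`) are made up for this example.

```python
import math
import random
from collections import Counter

# Hypothetical vocabulary; the first four terms are assumed to hint at e-commerce.
VOCAB = ["cart", "checkout", "payment", "shipping",
         "about", "news", "contact", "history"]


def featurize(text):
    """Summarize one website's text as term frequencies over VOCAB."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty text
    return [counts[t] / total for t in VOCAB]


def synth_site(is_ecommerce, rng):
    """Stand-in for scraped text: e-commerce sites favor the first four terms."""
    weights = [4, 4, 4, 4, 1, 1, 1, 1] if is_ecommerce else [1, 1, 1, 1, 4, 4, 4, 4]
    return " ".join(rng.choices(VOCAB, weights=weights, k=30))


def make_dataset(n, rng):
    """Return (feature vectors, true labels) for n synthetic websites."""
    labels = [rng.randint(0, 1) for _ in range(n)]
    return [featurize(synth_site(y, rng)) for y in labels], labels


def flip_labels(labels, noise_rate, rng):
    """Copy labels, flipping each one independently with probability noise_rate."""
    return [1 - y if rng.random() < noise_rate else y for y in labels]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(X, y, epochs=100, lr=1.0):
    """Fit logistic regression with full-batch gradient descent."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def accuracy(w, b, X, y):
    """Fraction of records whose predicted label matches y."""
    hits = 0
    for xi, yi in zip(X, y):
        pred = 1 if sum(wj * xj for wj, xj in zip(w, xi)) + b > 0 else 0
        hits += pred == yi
    return hits / len(y)


def robustness_curve(noise_rates, n_train=300, n_test=200, seed=0):
    """Train on increasingly noisy labels; always evaluate on clean test labels."""
    rng = random.Random(seed)
    X_train, y_train = make_dataset(n_train, rng)
    X_test, y_test = make_dataset(n_test, rng)
    results = {}
    for rate in noise_rates:
        noisy = flip_labels(y_train, rate, rng)
        w, b = train_logistic(X_train, noisy)
        results[rate] = accuracy(w, b, X_test, y_test)
    return results


if __name__ == "__main__":
    for rate, acc in robustness_curve([0.0, 0.1, 0.2, 0.3]).items():
        print(f"noise={rate:.1f}  clean-test accuracy={acc:.3f}")
```

The key design point is that only the training labels are perturbed while the test labels stay clean, so any drop in accuracy can be attributed to the misclassified training records. On real data the synthetic generator would be replaced by actual per-website records, and the same loop could be repeated for each classifier being compared.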
