File name: Data_Mining_the_Web_Uncovering_Patterns_in_Web_Content__Structure_and_Usage.pdf
Download link: https://bbs.pinggu.org/a-611305.html

DATA MINING THE WEB
Uncovering Patterns in Web Content, Structure, and Usage

Publication information:

Copyright © 2007 by John Wiley & Sons, Inc. All rights reserved.
Published by John Wiley & Sons, Inc., Hoboken, New Jersey
Published simultaneously in Canada

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400, fax 978-750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, 201-748-6011, fax 201-748-6008, or online at http://www.wiley.com/go/permission.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at 877-762-2974, outside the United States at 317-572-3993 or fax 317-572-4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.

Wiley Bicentennial Logo: Richard J. Pacifico

Library of Congress Cataloging-in-Publication Data:
Markov, Zdravko, 1956–
Data-mining the Web : uncovering patterns in Web content, structure, and usage / by Zdravko Markov & Daniel T. Larose.
p. cm.
Includes index.
978-0-471-66655-4 (cloth)
1. Data mining. 2. Web databases. I. Larose, Daniel T. II. Title.
QA76.9.D343M38 2007
005.74 – dc22
2006025099

Table of contents:

CONTENTS

PREFACE

PART I WEB STRUCTURE MINING

1 INFORMATION RETRIEVAL AND WEB SEARCH
    Web Challenges
    Web Search Engines
    Topic Directories
    Semantic Web
    Crawling the Web
    Web Basics
    Web Crawlers
    Indexing and Keyword Search
    Document Representation
    Implementation Considerations
    Relevance Ranking
    Advanced Text Search
    Using the HTML Structure in Keyword Search
    Evaluating Search Quality
    Similarity Search
    Cosine Similarity
    Jaccard Similarity
    Document Resemblance
    References
    Exercises

2 HYPERLINK-BASED RANKING
    Introduction
    Social Networks Analysis
    PageRank
    Authorities and Hubs
    Link-Based Similarity Search
    Enhanced Techniques for Page Ranking
    References
    Exercises

PART II WEB CONTENT MINING

3 CLUSTERING
    Introduction
    Hierarchical Agglomerative Clustering
    k-Means Clustering
    Probability-Based Clustering
    Finite Mixture Problem
    Classification Problem
    Clustering Problem
    Collaborative Filtering (Recommender Systems)
    References
    Exercises

4 EVALUATING CLUSTERING
    Approaches to Evaluating Clustering
    Similarity-Based Criterion Functions
    Probabilistic Criterion Functions
    MDL-Based Model and Feature Evaluation
    Minimum Description Length Principle
    MDL-Based Model Evaluation
    Feature Selection
    Classes-to-Clusters Evaluation
    Precision, Recall, and F-Measure
    Entropy
    References
    Exercises

5 CLASSIFICATION
    General Setting and Evaluation Techniques
    Nearest-Neighbor Algorithm
    Feature Selection
    Naive Bayes Algorithm
    Numerical Approaches
    Relational Learning
    References
    Exercises

PART III WEB USAGE MINING

6 INTRODUCTION TO WEB USAGE MINING
    Definition of Web Usage Mining
    Cross-Industry Standard Process for Data Mining
    Clickstream Analysis
    Web Server Log Files
    Remote Host Field
    Date/Time Field
    HTTP Request Field
    Status Code Field
    Transfer Volume (Bytes) Field
    Common Log Format
    Identification Field
    Authuser Field
    Extended Common Log Format
    Referrer Field
    User Agent Field
    Example of a Web Log Record
    Microsoft IIS Log Format
    Auxiliary Information
    References
    Exercises

7 PREPROCESSING FOR WEB USAGE MINING
    Need for Preprocessing the Data
    Data Cleaning and Filtering
    Page Extension Exploration and Filtering
    De-Spidering the Web Log File
    User Identification
    Session Identification
    Path Completion
    Directories and the Basket Transformation
    Further Data Preprocessing Steps
    References
    Exercises

8 EXPLORATORY DATA ANALYSIS FOR WEB USAGE MINING
    Introduction
    Number of Visit Actions
    Session Duration
    Relationship between Visit Actions and Session Duration
    Average Time per Page
    Duration for Individual Pages
    References
    Exercises

9 MODELING FOR WEB USAGE MINING: CLUSTERING, ASSOCIATION, AND CLASSIFICATION
    Introduction
    Modeling Methodology
    Definition of Clustering
    The BIRCH Clustering Algorithm
    Affinity Analysis and the A Priori Algorithm
    Discretizing the Numerical Variables: Binning
    Applying the A Priori Algorithm to the CCSU Web Log Data
    Classification and Regression Trees
    The C4.5 Algorithm
    References
    Exercises

INDEX