楼主: oliyiyi
2821 7

K-means Clustering with Tableau [推广有奖]

版主

泰斗

0%

还不是VIP/贵宾

-

TA的文库  其他...

计量文库

威望
7
论坛币
272066 个
通用积分
31322.8677
学术水平
1435 点
热心指数
1554 点
信用等级
1345 点
经验
383805 点
帖子
9608
精华
66
在线时间
5472 小时
注册时间
2007-5-21
最后登录
2024-5-9

初级学术勋章 初级热心勋章 初级信用勋章 中级信用勋章 中级学术勋章 中级热心勋章 高级热心勋章 高级学术勋章 高级信用勋章 特级热心勋章 特级学术勋章 特级信用勋章

+2 论坛币
k人 参与回答

经管之家送您一份

应届毕业生专属福利!

求职就业群
赵安豆老师微信:zhaoandou666

经管之家联合CDA

送您一个全额奖学金名额~ !

感谢您参与论坛问题回答

经管之家送您两个论坛币!

+2 论坛币

本帖隐藏的内容

By Rathnadevi Manivannan, Treselle Systems.

Objective

In this blog, we will discuss about clustering of customer activities for 24 hours by using K-means clustering feature in Tableau 10. Tableau 10 clustering feature automatically groups similar data points together. This type of clustering helps you create statistically-based segments that provide insights about similarities in different groups and performance of the groups when compared to each other.

You can use clustering on any type of visualization ranging from scatter plots to text tables and even maps.

In our previous blog post – “Call Detail Record Analysis – K-means Clustering with R”, we have discussed about CDR analysis using unsupervised K-means clustering algorithm.

Data Description

A daily activity file from Dandelion API is used as a data source, where the file contains CDR records generated by the Telecom Italia cellular network over the city of Milano. The daily CDR activity file contains information for 10,000 grids about SMS in and out, Call in and out, and Internet activity.

This dataset has 5 Million records and the size of the dataset is 314 MB.

The below table, created in Tableau, shows the total activity of SMS, Call, and Internet activity by hours and total number of records per hour. The Grand Total section shows the cumulative total activity for SMS, call, and Internet.

Data Preprocessing

To preprocess data, perform the following steps:

  • Derive new fields such as “activity_start_time”, “activity_date”, and “activity hour” from “time interval” field.
  • Find total activity, which is the sum of SMS in and out activity, call in and out activity, and Internet traffic activity.
  • Find total SMS activity, which is the sum of SMS in and out activity.
  • Find total call activity, which is the sum of call in and out activity.

Calculation of the above new fields can be done in Tableau easily with “Create Calculated Field” features.

Calculated Fields in Tableau

The below screenshot shows the formula used in Tableau for calculating “activity_start_time”.

If the epoch time is in milliseconds, then divide the value by 1000 to convert into seconds.

The formula for calculating other fields are as follows:

activity_date : DATE ([activity_start_time])

activity_hour : DATEPART (‘hour’, [activity_start_time])

total_activity : SUM(IFNULL([call_in_activity],0) + IFNULL([call_out_activity],0) + IFNULL([sms_in_activity],0) + IFNULL([sms_out_activity],0) + IFNULL([internet_traffic_activity],0))

total_sms_activity : SUM (IFNULL ([sms_in_activity], 0) + IFNULL ([sms_out_activity], 0))

total_call_activity : SUM (IFNULL ([call_in_activity], 0) + IFNULL ([call_out_activity], 0))

Note: IFNULL () is used to replace null value with zero while doing SUM ().

The below screenshot shows the derived fields in Tableau:

CDR Exploratory Data Analysis (EDA)

Tableau is bundled with rich set of visualizations to analyze the data. Exploratory Data Analysis is the process of analyzing the data visually. It involves outlier detection, anomaly detection, missing values detection, aggregating the values, and producing the meaningful insights.

The following visualizations are created as part of EDA on 5 million data:

Total Activity by Activity Hours

This visualization is used to find out:

  • Total activities by hour.
  • Hours producing more traffic respective to total activity.
  • Hours producing less traffic respective to total activity.

From the above visualization, it is evident that most of the activities happened in the hour of 23 and very less activity happened in the hours of 5 and 6.

Top and Bottom 10 Square Grids by Total Activity

This visualization is used to find out:

  • Top 10 Square grids producing more traffic with total activity in those grids.
  • Bottom 10 Square grids producing more traffic with total activity in those grids.

From the above visualization, it is evident that most of the activities happened in the square grid ID 5059 and less activities happened in the square grid ID 497.



二维码

扫码加我 拉你入群

请注明:姓名-公司-职位

以便审核进群资格,未注明则拒绝

关键词:Clustering Tableau Cluster k-means Table

缺少币币的网友请访问有奖回帖集合
https://bbs.pinggu.org/thread-3990750-1-1.html
沙发
oliyiyi 发表于 2017-6-19 10:04:55 |只看作者 |坛友微信交流群
Top and Bottom 10 Countries by Total Activity

This visualization is used to find out:

  • Top 10 Countries producing more traffic with total activity.
  • Bottom 10 Countries producing more traffic with total activity.

From the above visualization, it is evident that the country code 39 has the highest activity and no activities in the country codes 257 and 1819.

Note: Finding out top and bottom N in Tableau is pretty simple. We have created parameter, newly calculated field’s index, and Top and Bottom N filter to achieve that.

Call Detail Record Clustering in TableauClustering in Tableau

If the number of clusters is not specified by a user, Tableau picks the number of clusters corresponding to the first local maximum of the Calinski-Harabasz index. Tableau uses the following:

  • K-means algorithm for clustering.
  • Lloyd’s algorithm with squared Euclidean distances to compute the k-means clustering for each k.
  • Calinski-Harabasz criterion to assess cluster quality.

By default, K-means will be run for up to 25 clusters if the first local maximum of the index is not reached for a smaller value of k. A maximum value of 50 clusters can be set.

To determine optimal number of clusters (k), consider Between-group sum of squares (SSB), within group sum of squares (SSW), and total sum of squares from the cluster result.

Total Activity by Activity Hours Cluster

Sometimes, data groupings make immediate sense. “total_activity” measure is used as a variable for clustering and K value is provided as 8.

The Describe clusters dialog box provides information about the models that Tableau computed for clustering. These statistics can be used to assess clustering quality.

Describe Clusters – Models Tab

Analysis of variance (ANOVA) is a collection of statistical models and associated procedures useful for analyzing variation within and between observations that have been partitioned into groups or clusters.

In this use case, ANOVA is computed for “total_activity” variable and the resulting analysis of variance table is used to determine “total_activity” variable effectiveness to distinguish clusters.

Describe Clusters – Summary Tab

The Summary tab identifies the inputs used to generate the clusters and provides some statistics characterizing the clusters.

The clusters are ranked using “Total Activity by Activity Hours” cluster from more to less traffic as follows:

  • Cluster 1 produced more traffic activities, which include only activity hour 23.
  • Cluster 7 is second and includes activity hours 11, 15, 16, 17, and 18.
  • Cluster 2 is third and includes activity hours 0, 10, 12, 13, 14, 19, and 20.
  • Cluster 8 is fourth and includes activity hour 21.
  • Cluster 3 is fifth and includes activity hours 1, 9, and 22.
  • Cluster 4 is sixth and includes activity hour 2.
  • Cluster 5 is seventh and includes activity hours 3 and 8.
  • Cluster 6 is less traffic and includes activity hours 4, 5, 6, and 7.
Total SMS Activity by Activity Hours Cluster

This cluster is based on “total_sms_activity” measure as the variable for clustering and K value provided as 8.

Total Call Activity by Activity Hours Cluster

This cluster is based on “total_call_activity” measure as the variable for clustering and K value provided as 8.

Internet Activity by Activity Hours Cluster

This cluster is based on “sum (internet_traffic_activity)” measure as the variable for clustering and K value provided as 8.

SMS In and Out Activity by Activity Hours Cluster

This cluster is based on “sum (sms_in_activity) and sum (sms_out_activity)” measures as the variable for clustering and K value provided as 6.

Call In and Out Activity by Activity Hours Cluster

This cluster is based on “sum (call_in_activity) and sum (call_out_activity)” measures as the variable for clustering and K value provided as 6.

CDR Analysis (by Treselle Systems)

https://public.tableau.com/profile/tresellesystems#!/vizhome/CDRAnalysisbyTreselleSystems/CDRAnalysisbyTreselleSystems

Conclusion

By using this clustering mechanism, you can find the clusters making more traffic to the telecom network in the measure of total activity. Similarly, you can obtain more information like square grid and country code information to understand the square grid likely creating more revenue and more traffic to the telecom network and to target high customers based on their geo location.

References

Tableau workbook is available in the below GitHub location: https://github.com/treselle-systems/CDR_analysis_using_K_means_in_Tableau

How Clustering Works in Tableau: https://onlinehelp.tableau.com/current/pro/desktop/en-us/clustering_howitworks.html

Statistics for Clustering: http://onlinehelp.tableau.com/v10.2/pro/desktop/en-us/help.html#clustering_description.html

Bio: Rathnadevi Manivannan is working as a Senior Technical Writer in Treselle Systems, experienced and passionate about writing on different technologies and domains such as Big Data, Cloud Computing, Virtualization, Storage, Data Analytics, Business Analytics.


缺少币币的网友请访问有奖回帖集合
https://bbs.pinggu.org/thread-3990750-1-1.html

使用道具

藤椅
whiteice 发表于 2017-6-19 10:08:32 |只看作者 |坛友微信交流群
好好好好
已有 1 人评分论坛币 收起 理由
oliyiyi + 5 精彩帖子

总评分: 论坛币 + 5   查看全部评分

使用道具

板凳
zlhai 在职认证  发表于 2017-6-19 10:22:13 |只看作者 |坛友微信交流群
thanks for sharing
已有 1 人评分论坛币 收起 理由
oliyiyi + 5 精彩帖子

总评分: 论坛币 + 5   查看全部评分

使用道具

报纸
MouJack007 发表于 2017-6-19 11:26:16 |只看作者 |坛友微信交流群
谢谢楼主分享!

使用道具

地板
MouJack007 发表于 2017-6-19 11:27:00 |只看作者 |坛友微信交流群

使用道具

7
edmcheng 发表于 2017-6-19 19:03:38 |只看作者 |坛友微信交流群
Thanks a lot

使用道具

8
minixi 发表于 2017-6-20 10:13:05 |只看作者 |坛友微信交流群
谢谢分享

使用道具

您需要登录后才可以回帖 登录 | 我要注册

本版微信群
加好友,备注jltj
拉您入交流群

京ICP备16021002-2号 京B2-20170662号 京公网安备 11010802022788号 论坛法律顾问:王进律师 知识产权保护声明   免责及隐私声明

GMT+8, 2024-5-11 06:42