Original poster: oliyiyi

Data Science Dictionary


Top 50 data science / big data techniques, each described in fewer than 40 words, for decision makers. Please help us: any definition you submit will have your name attached to it. Send your definition, or a new term and definition, to vincentg@datashaping.com. See also The Data Science Alphabet.

Adjusted R^2 (R-Square): The method preferred by statisticians for determining which variables to include in a model. It is a modified version of R^2 that penalizes each new variable on the basis of how many have already been admitted. By construction, R^2 always increases as you add new variables, which results in models that over-fit the data and have poor predictive ability. Adjusted R^2 yields more parsimonious models that admit new variables only if the improvement in fit is larger than the penalty, which serves the ultimate goal of out-of-sample prediction. (Submitted by Santiago Perez)
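The relationship between the two measures can be sketched in a few lines of Python (a minimal illustration; the function names are invented for this sketch):

```python
import numpy as np

def r_squared(y, y_hat):
    """R^2 = 1 - SS_res / SS_tot: always increases as variables are added."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

def adjusted_r_squared(y, y_hat, p):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1),
    for n observations and p predictors: penalizes each added variable."""
    n = len(y)
    return 1.0 - (1.0 - r_squared(y, y_hat)) * (n - 1) / (n - p - 1)
```

For any imperfect fit, the adjusted value is strictly below the raw R^2, and the gap widens as p grows.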
Bayesian Networks
Boosted Models
Cluster Analysis: Methods for assigning a set of objects into groups, called clusters, such that objects in a cluster are more similar to each other than to those in other clusters. Well-known algorithms include hierarchical clustering, k-means, fuzzy clustering, and supervised clustering. (submitted by Markus Schmidberger)
Cross Validation
Decision Trees: A tree of questions to guide an end user to a conclusion based on values from a single vector of data. The classic example is a medical diagnosis based on a set of symptoms for a particular patient. A common problem in data science is to automatically or semi-automatically generate decision trees based on large sets of data coupled to known conclusions. Example algorithms are CART and ID3. (Submitted by Michael Malak)
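A hand-built tree for the medical-diagnosis example above might look like this in Python (purely illustrative; algorithms such as CART and ID3 learn such trees from data rather than hard-coding them):

```python
# A tiny hand-built decision tree for a medical-diagnosis toy example.
# Internal nodes ask about one feature; leaves carry the conclusion.
tree = {
    "feature": "fever",
    "branches": {
        True: {
            "feature": "cough",
            "branches": {True: "flu", False: "other infection"},
        },
        False: "healthy",
    },
}

def classify(node, record):
    """Walk the tree using one record (a dict of feature values) until a leaf."""
    while isinstance(node, dict):
        node = node["branches"][record[node["feature"]]]
    return node
```

Classifying a patient is then a sequence of feature lookups, one per level of the tree.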
Design of Experiments
EM Algorithm
Ensemble Methods
Factor Analysis: Used as a variable reduction technique to identify groups of clustered variables. (submitted by Vincent Granville)
Feature Selection
General Linear Model
Goodness of Fit: The degree to which the values predicted by a model minimize error in cross-validation tests. However, over-fitting the data is dangerous, as it results in a model with no predictive power on fresh data. True goodness of fit is determined by how well the model fits new data, i.e., its predictive ability. (submitted by Santiago Perez)
Hadoop: Hadoop is an open-source framework that supports large-scale data analysis by allowing one to decompose a question into discrete chunks that can be executed independently, very close to slices of the data in question, and ultimately reassembled into an answer to the question posed. (submitted by Philip Best)
Hidden Decision Trees
Hierarchical Bayesian Models
K-Means: Popular clustering algorithm that, for a given (a priori) K, finds K clusters by iteratively moving cluster centers to their clusters' centers of gravity and adjusting the cluster assignments. (Submitted by Michael Malak)
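The iteration described above can be sketched with NumPy (a minimal illustration; production code would typically use a library implementation such as scikit-learn's):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means: alternate nearest-center assignment and centroid update."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # start from k data points
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Assign each point to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to its cluster's center of gravity.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):  # converged
            break
        centers = new_centers
    return centers, labels
```

Note that k-means converges to a local optimum, so results can depend on the random initialization.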
Kernel Density estimator
Linear Discrimination
Logistic Regression
Mahout: Apache Mahout is an Apache project to produce free implementations of distributed or otherwise scalable machine learning algorithms focused primarily in the areas of collaborative filtering, clustering and classification, often leveraging, but not limited to, the Hadoop platform.
MapReduce: Model for processing large amounts of data efficiently. Original problem is "mapped" to smaller problems (which may themselves become "original" problems). Smaller problems are processed in parallel. Results of smaller problems are combined, or "reduced", into solution to original problem. (submitted by Melanie Jutras)
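The map/shuffle/reduce flow can be sketched in plain Python as a toy word count (illustrative only; a real MapReduce framework distributes the map and reduce calls across many machines):

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input chunk."""
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group the intermediate values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine each key's values into the final answer."""
    return {key: sum(values) for key, values in groups.items()}

documents = ["big data big ideas", "data science"]
# Each document could be mapped on a different worker, in parallel.
intermediate = [pair for doc in documents for pair in map_phase(doc)]
counts = reduce_phase(shuffle(intermediate))
```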
Maximum Likelihood
MCMC
Mixture Models
Model Fitting
Monte-Carlo Simulations: Computing expectations and probabilities in models of random phenomena using many randomly sampled values. Akin to computing the probability of winning a given roulette bet (say, black) by repeatedly placing it and counting the success ratio. Useful in complex models characterized by uncertainty. (submitted by Renato Vitolo)
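The roulette example can be sketched in Python (assuming a European wheel with 18 black pockets out of 37; the function name is invented for this sketch):

```python
import random

def simulate_black_bet(n_spins=100_000, seed=42):
    """Estimate the probability of winning a bet on black in European roulette
    (18 black pockets out of 37) by simulating many spins."""
    rng = random.Random(seed)
    wins = sum(1 for _ in range(n_spins) if rng.randrange(37) < 18)
    return wins / n_spins
```

With 100,000 spins the estimate lands close to the exact value 18/37 ≈ 0.486.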
Multidimensional Scaling: Reduces dimensionality by projecting an N*N (N = number of observations) similarity matrix into a 2-dimensional visual representation. A classical example is producing a geographic map of cities when the only data available is the travel time between every pair of cities. (submitted by Vincent Granville)
Naive Bayes
Non-parametric Statistics
NoSQL: "Not only SQL" refers to a group of database management systems. Data is not stored in tables as in a relational database, and storage is not based on mathematical relationships between tables. It is a way of storing and retrieving unstructured data quickly. (submitted by Markus Schmidberger)
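Classical (Torgerson) multidimensional scaling, as described in the Multidimensional Scaling entry, can be sketched with NumPy (a minimal illustration, not a full implementation):

```python
import numpy as np

def classical_mds(D, dim=2):
    """Classical (Torgerson) MDS: embed N objects in `dim` dimensions,
    given an N*N matrix of pairwise Euclidean distances D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (D ** 2) @ J              # double-centered squared distances
    eigvals, eigvecs = np.linalg.eigh(B)     # eigenvalues in ascending order
    idx = np.argsort(eigvals)[::-1][:dim]    # keep the `dim` largest
    scale = np.sqrt(np.maximum(eigvals[idx], 0.0))
    return eigvecs[:, idx] * scale           # N x dim coordinates
```

When the distances really are 2-D Euclidean, the recovered map reproduces them exactly (up to rotation and reflection).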
Pig: Pig is a scripting interface to Hadoop, meaning a lack of MapReduce programming experience won't hold you back. It's also known for being able to process a large variety of different data types.
Principal Component Analysis
Sensitivity Analysis
Six Sigma
Stepwise Regression: A variable selection process for multivariate regression. In forward stepwise selection, a seed variable is selected and each additional variable is entered into the model, but kept only if it significantly improves goodness of fit (as measured by the increase in R^2). Backward selection starts with all variables and removes them one by one until removing another would decrease R^2 by a non-trivial amount. Two deficiencies of this method are that the seed chosen disproportionately impacts which variables are kept, and that the decision is made using R^2, not Adjusted R^2. (submitted by Santiago Perez)
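Forward selection can be sketched in Python with NumPy, here using Adjusted R^2 as the stopping criterion to address the second deficiency noted in the entry (an illustrative sketch; the function names are invented):

```python
import numpy as np

def adj_r2(y, cols):
    """Fit OLS (with intercept) on the given predictor columns; return Adjusted R^2."""
    n, p = len(y), len(cols)
    X = np.column_stack([np.ones(n)] + cols)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    r2 = 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

def forward_stepwise(y, X):
    """Greedily add the predictor that most improves Adjusted R^2; stop when none does."""
    remaining = list(range(X.shape[1]))
    chosen, best = [], -np.inf
    while remaining:
        score, i = max((adj_r2(y, [X[:, j] for j in chosen + [i]]), i)
                       for i in remaining)
        if score <= best:  # no candidate improves the penalized fit
            break
        chosen.append(i)
        remaining.remove(i)
        best = score
    return chosen
```

Because the criterion is penalized, purely noisy variables tend to be rejected rather than admitted.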
Supervised Clustering
Support Vector Machines
Time Series: A set of (t, x) values where x is usually a scalar (though could be a vector) and the t values are usually sampled at regular intervals (though some time series are irregularly sampled). In the case of regularly sampled time series, the t is usually dropped from the actual data, replaced with just a t0 (start time) and delta-t that apply to the whole series. (Submitted by Michael Malak)
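The t0/delta-t representation described above can be sketched in Python (an illustrative toy class, not a standard library type):

```python
from datetime import datetime, timedelta

class RegularTimeSeries:
    """A regularly sampled series: store only the start time and the sampling
    interval, not one timestamp per observation."""

    def __init__(self, t0, delta, values):
        self.t0, self.delta, self.values = t0, delta, values

    def timestamp(self, i):
        """Reconstruct the i-th timestamp from t0 and delta."""
        return self.t0 + i * self.delta

# Hourly readings starting at midnight, 2024-01-01.
ts = RegularTimeSeries(datetime(2024, 1, 1), timedelta(hours=1), [3.1, 2.7, 4.0])
```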

