Original poster: oliyiyi

Data Science Dictionary


Top 50 data science / big data techniques, each described in fewer than 40 words, for decision makers. Please help us: any definition you submit will have your name attached to it. Send your definition, or a new term and definition, to vincentg@datashaping.com. See also The Data Science Alphabet.

Adjusted R^2 (R-Square): The method preferred by statisticians for determining which variables to include in a model. It is a modified version of R^2 that penalizes each new variable on the basis of how many have already been admitted. By construction, R^2 always increases as you add new variables, which results in models that over-fit the data and have poor predictive ability. Adjusted R^2 yields more parsimonious models that admit new variables only if the improvement in fit is larger than the penalty, which serves the ultimate goal of out-of-sample prediction. (Submitted by Santiago Perez)
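The relationship between the two measures can be sketched in a few lines of Python (a minimal illustration; the function names are invented for this sketch):

```python
import numpy as np

def r_squared(y, y_hat):
    """R^2 = 1 - SS_res / SS_tot: always increases as variables are added."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

def adjusted_r_squared(y, y_hat, p):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1),
    for n observations and p predictors: penalizes each added variable."""
    n = len(y)
    return 1.0 - (1.0 - r_squared(y, y_hat)) * (n - 1) / (n - p - 1)
```

For any imperfect fit, the adjusted value is strictly below the raw R^2, and the gap widens as p grows.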
Bayesian Networks
Boosted Models
Cluster Analysis: Methods for assigning a set of objects into groups, called clusters, such that objects in a cluster are more similar to each other than to those in other clusters. Well-known algorithms include hierarchical clustering, k-means, fuzzy clustering, and supervised clustering. (submitted by Markus Schmidberger)
Cross Validation
Decision Trees: A tree of questions to guide an end user to a conclusion based on values from a single vector of data. The classic example is a medical diagnosis based on a set of symptoms for a particular patient. A common problem in data science is to automatically or semi-automatically generate decision trees based on large sets of data coupled to known conclusions. Example algorithms are CART and ID3. (Submitted by Michael Malak)
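A hand-built tree for the medical-diagnosis example above might look like this in Python (purely illustrative; algorithms such as CART and ID3 learn such trees from data rather than hard-coding them):

```python
# A tiny hand-built decision tree for a medical-diagnosis toy example.
# Internal nodes ask about one feature; leaves carry the conclusion.
tree = {
    "feature": "fever",
    "branches": {
        True: {
            "feature": "cough",
            "branches": {True: "flu", False: "other infection"},
        },
        False: "healthy",
    },
}

def classify(node, record):
    """Walk the tree using one record (a dict of feature values) until a leaf."""
    while isinstance(node, dict):
        node = node["branches"][record[node["feature"]]]
    return node
```

Classifying a patient is then a sequence of feature lookups, one per level of the tree.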
Design of Experiments
EM Algorithm
Ensemble Methods
Factor Analysis: Used as a variable reduction technique to identify groups of clustered variables. (submitted by Vincent Granville)
Feature Selection
General Linear Model
Goodness of Fit: The degree to which the values predicted by a model minimize error in cross-validation tests. However, over-fitting the data is dangerous, as it results in a model with no predictive power on fresh data. True goodness of fit is determined by how well the model fits new data, i.e., its predictive ability. (submitted by Santiago Perez)
Hadoop: Hadoop is an open-source framework that supports large-scale data analysis by allowing one to decompose a question into discrete chunks that can be executed independently, very close to slices of the data in question, and ultimately reassembled into an answer to the question posed. (submitted by Philip Best)
Hidden Decision Trees
Hierarchical Bayesian Models
K-Means: Popular clustering algorithm that, for a given (a priori) K, finds K clusters by iteratively moving cluster centers to their clusters' centers of gravity and adjusting the cluster assignments. (Submitted by Michael Malak)
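The iteration described above can be sketched with NumPy (a minimal illustration; production code would typically use a library implementation such as scikit-learn's):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means: alternate nearest-center assignment and centroid update."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # start from k data points
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Assign each point to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to its cluster's center of gravity.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):  # converged
            break
        centers = new_centers
    return centers, labels
```

Note that k-means converges to a local optimum, so results can depend on the random initialization.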
Kernel Density estimator
Linear Discrimination
Logistic Regression
Mahout: Apache Mahout is an Apache project to produce free implementations of distributed or otherwise scalable machine learning algorithms focused primarily in the areas of collaborative filtering, clustering and classification, often leveraging, but not limited to, the Hadoop platform.
MapReduce: Model for processing large amounts of data efficiently. Original problem is "mapped" to smaller problems (which may themselves become "original" problems). Smaller problems are processed in parallel. Results of smaller problems are combined, or "reduced", into solution to original problem. (submitted by Melanie Jutras)
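The map/shuffle/reduce flow can be sketched in plain Python as a toy word count (illustrative only; a real MapReduce framework distributes the map and reduce calls across many machines):

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input chunk."""
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group the intermediate values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine each key's values into the final answer."""
    return {key: sum(values) for key, values in groups.items()}

documents = ["big data big ideas", "data science"]
# Each document could be mapped on a different worker, in parallel.
intermediate = [pair for doc in documents for pair in map_phase(doc)]
counts = reduce_phase(shuffle(intermediate))
```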
Maximum Likelihood
MCMC
Mixture Models
Model Fitting
Monte-Carlo Simulations: Computing expectations and probabilities in models of random phenomena using many randomly sampled values. Akin to computing the probability of winning a given roulette bet (say, black) by repeatedly placing it and counting the success ratio. Useful in complex models characterized by uncertainty. (submitted by Renato Vitolo)
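The roulette example can be sketched in Python (assuming a European wheel with 18 black pockets out of 37; the function name is invented for this sketch):

```python
import random

def simulate_black_bet(n_spins=100_000, seed=42):
    """Estimate the probability of winning a bet on black in European roulette
    (18 black pockets out of 37) by simulating many spins."""
    rng = random.Random(seed)
    wins = sum(1 for _ in range(n_spins) if rng.randrange(37) < 18)
    return wins / n_spins
```

With 100,000 spins the estimate lands close to the exact value 18/37 ≈ 0.486.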
Multidimensional Scaling: Reduces dimensionality by projecting an N*N (N = number of observations) similarity matrix into a 2-dimensional visual representation. A classical example is producing a geographic map of cities when the only data available is the travel time between every pair of cities. (submitted by Vincent Granville)
Naive Bayes
Non-parametric Statistics
NoSQL: "Not only SQL" refers to a group of database management systems. Data is not stored in tables as in a relational database, and storage is not based on mathematical relationships between tables. It is a way of storing and retrieving unstructured data quickly. (submitted by Markus Schmidberger)
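Classical (Torgerson) multidimensional scaling, as described in the Multidimensional Scaling entry, can be sketched with NumPy (a minimal illustration, not a full implementation):

```python
import numpy as np

def classical_mds(D, dim=2):
    """Classical (Torgerson) MDS: embed N objects in `dim` dimensions,
    given an N*N matrix of pairwise Euclidean distances D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (D ** 2) @ J              # double-centered squared distances
    eigvals, eigvecs = np.linalg.eigh(B)     # eigenvalues in ascending order
    idx = np.argsort(eigvals)[::-1][:dim]    # keep the `dim` largest
    scale = np.sqrt(np.maximum(eigvals[idx], 0.0))
    return eigvecs[:, idx] * scale           # N x dim coordinates
```

When the distances really are 2-D Euclidean, the recovered map reproduces them exactly (up to rotation and reflection).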
Pig: Pig is a scripting interface to Hadoop, meaning a lack of MapReduce programming experience won't hold you back. It's also known for being able to process a large variety of different data types.
Principal Component Analysis
Sensitivity Analysis
Six Sigma
Stepwise Regression: A variable selection process for multivariate regression. In forward stepwise selection, a seed variable is selected and each additional variable is entered into the model, but kept only if it significantly improves goodness of fit (as measured by the increase in R^2). Backward selection starts with all variables and removes them one by one until removing another would decrease R^2 by a non-trivial amount. Two deficiencies of this method are that the seed chosen disproportionately impacts which variables are kept, and that the decision is made using R^2, not Adjusted R^2. (submitted by Santiago Perez)
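Forward selection can be sketched in Python with NumPy, here using Adjusted R^2 as the stopping criterion to address the second deficiency noted in the entry (an illustrative sketch; the function names are invented):

```python
import numpy as np

def adj_r2(y, cols):
    """Fit OLS (with intercept) on the given predictor columns; return Adjusted R^2."""
    n, p = len(y), len(cols)
    X = np.column_stack([np.ones(n)] + cols)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    r2 = 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

def forward_stepwise(y, X):
    """Greedily add the predictor that most improves Adjusted R^2; stop when none does."""
    remaining = list(range(X.shape[1]))
    chosen, best = [], -np.inf
    while remaining:
        score, i = max((adj_r2(y, [X[:, j] for j in chosen + [i]]), i)
                       for i in remaining)
        if score <= best:  # no candidate improves the penalized fit
            break
        chosen.append(i)
        remaining.remove(i)
        best = score
    return chosen
```

Because the criterion is penalized, purely noisy variables tend to be rejected rather than admitted.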
Supervised Clustering
Support Vector Machines
Time Series: A set of (t, x) values where x is usually a scalar (though could be a vector) and the t values are usually sampled at regular intervals (though some time series are irregularly sampled). In the case of regularly sampled time series, the t is usually dropped from the actual data, replaced with just a t0 (start time) and delta-t that apply to the whole series. (Submitted by Michael Malak)
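The t0/delta-t representation described above can be sketched in Python (an illustrative toy class, not a standard library type):

```python
from datetime import datetime, timedelta

class RegularTimeSeries:
    """A regularly sampled series: store only the start time and the sampling
    interval, not one timestamp per observation."""

    def __init__(self, t0, delta, values):
        self.t0, self.delta, self.values = t0, delta, values

    def timestamp(self, i):
        """Reconstruct the i-th timestamp from t0 and delta."""
        return self.t0 + i * self.delta

# Hourly readings starting at midnight, 2024-01-01.
ts = RegularTimeSeries(datetime(2024, 1, 1), timedelta(hours=1), [3.1, 2.7, 4.0])
```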

