17 More Must-Know Data Science Interview Questions and Answers

Lisrelchen posted on 2017-4-16 02:08:17


Lisrelchen posted on 2017-4-16 02:10:00
  1. Q1. What are the Data Science lessons from the failure to predict the 2016 US Presidential election (and from the Super Bowl LI comeback)?
  2. Gregory Piatetsky answers:

  3. [Figure: NY Times Upshot forecast giving Trump ~15% chance of winning]
  4. Just before the Nov 8, 2016 election, most pollsters gave Hillary Clinton an edge of ~3% in the popular vote and a 70-95% chance of victory in the Electoral College. Nate Silver's FiveThirtyEight gave Trump the highest chance of victory at ~30%, while the New York Times Upshot and the Princeton Election Consortium estimated only ~15%, and other pollsters like Huffington Post gave Trump only a 2% chance of victory. Still, Trump won. So what are the lessons for Data Scientists?

  5. To make a statistically valid prediction we need

  6. 1) enough historical data and

  7. 2) the assumption that past events are sufficiently similar to the current event we are trying to predict.

  8. Events can be placed on a scale from deterministic (2 + 2 will always equal 4) to strongly predictable (e.g. orbits of planets and moons, the average number of heads when tossing a fair coin) to weakly predictable (e.g. elections and sporting events) to random (e.g. an honest lottery).

  9. If we toss a fair coin 100 million times, the expected number of heads (mean) is 50 million and the standard deviation is 5,000 (using the formula 0.5 * SQRT(N)), so we can predict that 99.7% of the time the observed number of heads will be within 3 standard deviations of the mean, i.e. between 49,985,000 and 50,015,000 (a short numerical check appears at the end of this answer).

  10. But using polling to predict the votes of 100 million people is much more difficult. Pollsters need to get a representative sample, estimate the likelihood of a person actually voting, make many justified and unjustified assumptions, and avoid following their conscious and unconscious biases.

  11. In the case of the US Presidential election, correct prediction is even more difficult because of the antiquated Electoral College system, in which each state (except Maine and Nebraska) awards all of its electoral votes to the winner, and because results have to be polled and predicted for each state separately.

  12. The chart below shows that in the 2016 US presidential election, pollsters were off the mark in many states. They mostly underestimated the Trump vote, especially in the three critical states of Michigan, Wisconsin, and Pennsylvania, which all flipped to Trump.

  13. [Figure: US elections 2016 poll shift, according to FiveThirtyEight]

  14. Source: @NateSilver538 tweet, Nov 9, 2016.

  15. A few statisticians, like Salil Mehta (@salilstatistics), were warning about the unreliability of polls, and David Wasserman of FiveThirtyEight actually described this scenario in Sep 2016 in "How Trump Could Win The White House While Losing The Popular Vote", but most pollsters were way off.

  16. So a good lesson for Data Scientists is to question their assumptions and to be very skeptical when predicting a weakly predictable event, especially when based on human behavior.

  17. Other important lessons are:

  18. Examine data quality - in this election polls were not reaching all likely voters
  19. Beware of your own biases: many pollsters were likely Clinton supporters and did not want to question the results that favored their candidate. For example, Huffington Post had forecast over 95% chance of Clinton Victory.
  20. See also other analyses of 2016 polling failures:

  21. Wired: Trump’s Win Isn’t the Death of Data—It Was Flawed All Along.
  22. NYTimes How Data Failed Us in Calling an Election
  23. Datanami Six Data Science Lessons from the Epic Polling Failure
  24. InformationWeek Trump's Election: Poll Failures Hold Data Lessons For IT
  25. Why I Had to Eat a Bug on CNN, by Sam Wang, Princeton, whose Princeton Election Consortium gave Trump 15% to win.
  26. (Note: this answer is based on a previous KDnuggets post: http://www.kdnuggets.com/2016/11/trump-shows-limits-prediction.html)
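
A minimal numerical check of the coin-toss figures in point 9 above, using only the Python standard library (the 99.7% figure is the usual three-standard-deviation rule for an approximately normal distribution):

```python
import math

N = 100_000_000   # number of fair-coin tosses
p = 0.5           # probability of heads

mean = N * p                       # expected number of heads: 50,000,000
sd = math.sqrt(N * p * (1 - p))    # binomial std. dev. = 0.5 * sqrt(N) = 5,000

lo, hi = mean - 3 * sd, mean + 3 * sd
print(f"mean = {mean:,.0f}, standard deviation = {sd:,.0f}")
print(f"~99.7% of runs should land in [{lo:,.0f}, {hi:,.0f}]")
```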

Lisrelchen posted on 2017-4-16 02:11:35
  1. Q2. What problems arise if the distribution of the new (unseen) test data is significantly different than the distribution of the training data?
  2. Gregory Piatetsky and Thuy Pham answer:

  3. The main problem is that the predictions will be wrong!

  4. If the new test data differs sufficiently from the training data in the key parameters of the prediction model, then the predictive model is no longer valid.

  5. The main reasons this can happen are sample selection bias, population drift, or non-stationary environment.

  6. a) Sample selection bias
  7. Here the data is static, but the training examples have been obtained through a biased method, such as non-uniform selection or non-random split of data into train and test.

  8. If you have a large static dataset, then you should randomly split it into train/test data, so that the distribution of the test data is similar to that of the training data.

  9. [Figure: covariate shift illustration]
  10. b) Covariate shift, aka population drift
  11. Here the data is not static: one population is used as the training data, and another population is used for testing.
  12. (Figure from http://iwann.ugr.es/2011/pdf/InvitedTalk-FHerrera-IWANN11.pdf)

  13. Sometimes the training data and the test data are derived via different processes - e.g. a drug tested on one population is later given to a new population that may have significant differences. As a result, a classifier built on the training data will perform poorly.

  14. One proposed solution is to apply a statistical test to decide whether the probabilities of the target classes and the key variables used by the classifier are significantly different between the two datasets, and if they are, to retrain the model using the new data (a minimal sketch of such a test appears at the end of this answer).

  15. c) Non-stationary environments
  16. The training environment differs from the test environment, whether due to a temporal or a spatial change.

  17. This is similar to case (b), but applies to situations where the data is not static: we have a stream of data and we periodically sample it to develop predictive models of future behavior. This happens in adversarial classification problems, such as spam filtering and network intrusion detection, where spammers and hackers constantly change their behavior in response. Another typical case is customer analytics, where customer behavior changes over time. A telephone company develops a model for predicting customer churn, or a credit card company develops a model to predict transaction fraud. The training data is historical data, while the (new) test data is the current data.

  18. Such models need to be retrained periodically. To determine when, compare the distribution of the key variables in the predictive model between the old data (training set) and the new data; if there is a sufficiently significant difference, retrain the model.

  19. For a more detailed and technical discussion, see references below.
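
As a concrete illustration of the statistical test mentioned in point 14, here is a minimal sketch (assuming NumPy and SciPy, with made-up data and an arbitrary significance threshold) that uses a two-sample Kolmogorov-Smirnov test to flag drift in one key variable:

```python
import numpy as np
from scipy.stats import ks_2samp

def drift_detected(train_values, new_values, alpha=0.01):
    """Flag covariate shift in one key variable by comparing its
    distribution in the training data and in newly arriving data."""
    result = ks_2samp(train_values, new_values)
    return result.pvalue < alpha

# Toy example: the new population is shifted relative to the training one.
rng = np.random.default_rng(0)
train_age = rng.normal(40, 10, size=5_000)
new_age = rng.normal(48, 10, size=5_000)

if drift_detected(train_age, new_age):
    print("Covariate shift detected - consider retraining on recent data.")
```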

Lisrelchen posted on 2017-4-16 02:12:19
  1. Q3. What are bias and variance, and what is their relation to modeling data?
  2. Matthew Mayo answers:

  3. Bias is how far removed a model's predictions are from correctness, while variance is the degree to which these predictions vary between model iterations.

  4. [Figure: Bias vs. Variance]

  5. Bias vs. Variance, image source

  6. As an example, consider a simple, flawed Presidential election survey. Errors in the survey can be explained through the twin lenses of bias and variance: selecting survey participants from a phonebook is a source of bias, while a small sample size is a source of variance.

  7. Minimizing total model error relies on balancing bias and variance errors. Ideally, models are the result of a collection of unbiased data of low variance. Unfortunately, the more complex a model becomes, the more it tends toward less bias but greater variance; an optimal model therefore needs to balance these two properties.

  8. The statistical evaluation method of cross-validation is useful both for demonstrating the importance of this balance and for actually searching it out. The number of data folds to use -- the value of k in k-fold cross-validation -- is an important decision: the lower the value, the higher the bias in the error estimates and the lower the variance (a small sketch comparing different values of k appears at the end of this answer).

  9. [Figure: bias and variance contributing to total error, image source]
  10. Conversely, when k is set equal to the number of data instances (leave-one-out cross-validation), the error estimate has very low bias but can have high variance.
  11. The most important takeaways are that bias and variance are two sides of an important trade-off when building models, and that even the most routine of statistical evaluation methods are directly reliant upon such a trade-off.
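
The sketch below (assuming scikit-learn and its bundled diabetes dataset) illustrates the point about k: with fewer folds each model is trained on less data, so the error estimates are more pessimistic (biased) but vary less from fold to fold, and vice versa. The fold-to-fold spread printed here is only a rough proxy for the variance of the estimate itself.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)
model = Ridge()

for k in (2, 5, 20):
    # neg_mean_squared_error is negated so that higher is better; flip the sign back
    errors = -cross_val_score(model, X, y, cv=k, scoring="neg_mean_squared_error")
    print(f"k={k:>2}: mean CV error = {errors.mean():8.1f}, "
          f"spread across folds (std) = {errors.std():8.1f}")
```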

Lisrelchen posted on 2017-4-16 02:13:33
  1. Q4. Why might it be preferable to include fewer predictors over many?
  2. Anmol Rajpurohit answers:

  3. Here are a few reasons why it might be a better idea to have fewer predictor variables rather than having many of them:

  4. Redundancy/Irrelevance:

  5. If you are dealing with many predictor variables, then the chances are high that there are hidden relationships between some of them, leading to redundancy. Unless you identify and handle this redundancy (by selecting only the non-redundant predictor variables) in the early phase of data analysis, it can be a huge drag on your succeeding steps.

  6. It is also likely that not all predictor variables have a considerable impact on the dependent variable(s). You should make sure that the set of predictor variables you select does not include any irrelevant ones – even if you know that the data model will take care of them by giving them lower significance.

  7. Note: Redundancy and Irrelevance are two different notions – a relevant feature can be redundant due to the presence of other relevant feature(s).

  8. Overfitting:

  9. Even when you have a large number of predictor variables with no relationships between any of them, it would still be preferable to work with fewer predictors. Data models with a large number of predictors (also referred to as complex models) often suffer from the problem of overfitting, in which case the model performs great on training data but poorly on test data.

  10. Productivity:

  11. Let’s say you have a project where there are a large number of predictors and all of them are relevant (i.e. have a measurable impact on the dependent variable). So, you would obviously want to work with all of them in order to have a data model with a very high success rate. While this approach may sound very enticing, practical considerations (such as the amount of data available, storage and compute resources, time taken for completion, etc.) make it nearly impossible.

  12. Thus, even when you have a large number of relevant predictor variables, it is a good idea to work with fewer predictors (shortlisted through feature selection or developed through feature extraction; a minimal feature-selection sketch appears at the end of this answer). This is essentially similar to the Pareto principle, which states that for many events, roughly 80% of the effects come from 20% of the causes.

  13. Focusing on the 20% most significant predictor variables will be of great help in building data models with a considerable success rate in a reasonable time, without needing an impractical amount of data or other resources.


  14. Training error & test error vs model complexity (Source: Posted on Quora by Sergul Aydore)

  15. Understandability:

  16. Models with fewer predictors are way easier to understand and explain. As the data science steps will be performed by humans and the results will be presented (and hopefully, used) by humans, it is important to consider the comprehension capacity of the human brain. This is basically a trade-off – you are letting go of some potential benefit to your data model’s success rate, while simultaneously making the model easier to understand and optimize.

  17. This factor is particularly important if at the end of your project you need to present your results to someone who is interested not just in a high success rate, but also in understanding what is happening “under the hood”.
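
A minimal feature-selection sketch, as referenced in point 12 (assuming scikit-learn and its bundled breast-cancer dataset; univariate selection is just one of many shortlisting methods):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

data = load_breast_cancer()
X, y = data.data, data.target

# Keep only the 6 predictors with the strongest univariate relationship
# to the target instead of feeding all 30 predictors into the model.
selector = SelectKBest(score_func=f_classif, k=6).fit(X, y)
selected = data.feature_names[selector.get_support()]

print("Selected predictors:", list(selected))
print("Reduced design matrix shape:", selector.transform(X).shape)
```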

Lisrelchen posted on 2017-4-16 02:14:04
  1. Q5. What error metric would you use to evaluate how good a binary classifier is? What if the classes are imbalanced? What if there are more than 2 groups?
  2. Prasad Pore answers:

  3. Binary classification involves classifying the data into two groups, e.g. whether or not a customer buys a particular product (Yes/No), based on independent variables such as gender, age, location, etc.

  4. As the target variable is not continuous, a binary classification model predicts the probability of the target variable being Yes/No. To evaluate such a model, a metric called the confusion matrix is used, also called the classification or coincidence matrix. With the help of a confusion matrix, we can calculate important performance measures:

  5. True Positive Rate (TPR) or Hit Rate or Recall or Sensitivity = TP / (TP + FN)
  6. False Positive Rate (FPR) or False Alarm Rate = 1 - Specificity = 1 - (TN / (TN + FP))
  7. Accuracy = (TP + TN) / (TP + TN + FP + FN)
  8. Error Rate = 1 – accuracy or (FP + FN) / (TP + TN + FP + FN)
  9. Precision = TP / (TP + FP)
  10. F-measure: 2 / ( (1 / Precision) + (1 / Recall) )
  11. ROC curve (Receiver Operating Characteristic) = plot of TPR (y-axis) against FPR (x-axis)
  12. AUC (Area Under the Curve)
  13. Kappa statistics
  14. You can find more details about these measures here: The Best Metric to Measure Accuracy of Classification Models. A small worked sketch of these formulas appears at the end of this answer.

  15. All of these measures should be used together with domain knowledge and kept in balance; for example, if you only achieve a high TPR for identifying patients who don’t have cancer, that will not help at all in diagnosing cancer.

  16. In the same example of cancer diagnosis data, if only 2% or fewer of the patients have cancer, then this is a case of class imbalance, as the percentage of cancer patients is very small compared to the rest of the population. There are 2 main approaches to handle this issue:

  17. Use of a cost function: In this approach, the cost associated with misclassifying data is evaluated with the help of a cost matrix (similar to the confusion matrix, but more concerned with False Positives and False Negatives). The main aim is to reduce the cost of misclassification. In this example, the cost of a False Negative is much higher than the cost of a False Positive: wrongly predicting a cancer patient to be cancer-free is more dangerous than wrongly predicting a cancer-free patient to have cancer.
  18. Total Cost = Cost of FN * Count of FN + Cost of FP * Count of FP

  19. Use of different sampling methods: In this approach, you can use over-sampling, under-sampling, or hybrid sampling. In over-sampling, minority-class observations are replicated to balance the data. Replication of observations can lead to overfitting, causing good accuracy on training data but lower accuracy on unseen data. In under-sampling, majority-class observations are removed, causing loss of information; this helps reduce processing time and storage, but is only useful if you have a large data set.
  20. Find more about class imbalance here.

  21. If there are multiple classes in the target variable, then a confusion matrix with dimensions equal to the number of classes is formed, and all performance measures can be calculated for each class. This is called a multiclass confusion matrix. E.g. if there are 3 classes X, Y, Z in the response variable, recall for each class is calculated as below:

  22. Recall_X = TP_X/(TP_X+FN_X)

  23. Recall_Y = TP_Y/(TP_Y+FN_Y)

  24. Recall_Z = TP_Z/(TP_Z+FN_Z)
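
A small worked sketch of the formulas above (assuming scikit-learn and a made-up toy label set), including per-class recall for the multiclass case:

```python
from sklearn.metrics import confusion_matrix, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # 1 = positive class
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

tpr       = tp / (tp + fn)                   # recall / sensitivity
fpr       = fp / (fp + tn)                   # 1 - specificity
accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
f_measure = 2 / ((1 / precision) + (1 / tpr))

print(f"TPR={tpr}, FPR={fpr}, accuracy={accuracy}, "
      f"precision={precision}, F-measure={f_measure:.3f}")

# Multiclass case: per-class recall, as in Recall_X, Recall_Y, Recall_Z above.
y_true_mc = ["X", "Y", "Z", "X", "Z", "Y", "X"]
y_pred_mc = ["X", "Z", "Z", "X", "Y", "Y", "Z"]
print(recall_score(y_true_mc, y_pred_mc, labels=["X", "Y", "Z"], average=None))
```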

Lisrelchen posted on 2017-4-16 02:14:42
  1. Q6. What are some ways I can make my model more robust to outliers?
  2. Thuy Pham answers:

  3. There are several ways to make a model more robust to outliers, from different points of view (data preparation or model building). In this question and answer, an outlier is assumed to be an unwanted, unexpected, or must-be-wrong value given current human knowledge (e.g. no one is 200 years old), rather than a rare event which is possible but unlikely.

  4. Outliers are usually defined in relation to the distribution. Thus they can be removed in the pre-processing step (before any learning step) by using standard deviations (for normal data) or interquartile ranges (for non-normal or unknown distributions) as threshold levels (a small sketch appears at the end of this answer).


  5. [Figure: outliers illustration; image source]

  6. Moreover, data transformation (e.g. log transformation) may help if the data have a noticeable tail. When outliers are related to the sensitivity of the collecting instrument, which may not precisely record small values, Winsorization may be useful. This type of transformation (named after Charles P. Winsor (1895–1951)) has the same effect as clipping signals (i.e. it replaces extreme data values with less extreme values). Another option to reduce the influence of outliers is to use mean absolute difference rather than mean squared error.

  7. For model building, some models are resistant to outliers (e.g. tree-based approaches), as are non-parametric tests. Similar to the median effect, tree models divide each node into two at each split, so at each split all data points in a bucket can be treated equally regardless of any extreme values they may have. The study [Pham 2016] proposed a detection model that incorporates interquartile information of the data to predict its outliers.
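
A small sketch of the pre-processing options above (assuming NumPy and SciPy, with a made-up age column): the IQR rule flags the impossible value, while Winsorization clips the extremes instead of dropping them.

```python
import numpy as np
from scipy.stats.mstats import winsorize

ages = np.array([23, 25, 31, 35, 38, 41, 44, 52, 57, 200])  # 200 is clearly wrong

# IQR rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(ages, [25, 75])
iqr = q3 - q1
keep = (ages >= q1 - 1.5 * iqr) & (ages <= q3 + 1.5 * iqr)
print("Kept after IQR filtering:", ages[keep])

# Winsorization: clip the most extreme 10% at each tail instead of dropping them
print("Winsorized:", winsorize(ages, limits=(0.1, 0.1)))
```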

auirzxp (student verified) posted on 2017-4-16 02:16:31
Thanks for sharing.


Lisrelchen posted on 2017-4-16 02:16:36
  1. Q13. What makes a good data visualization?

  2. Gregory Piatetsky answers:

  3. Note: This answer contains excerpts from the recent post What makes a good data visualization – a Data Scientist perspective.

  4. Data Science is more than just building predictive models - it is also about explaining the models and using them to help people to understand data and make decisions. Data visualization is an integral part of presenting data in a convincing way.

  5. There is a ton of research on good data visualization and on how people best perceive information – see the work by Stephen Few and many others.

  6. Guidelines on improving human perception include:

  7. position data along a common scale
  8. bars are more effective than circles or squares in communicating size
  9. color is more discernible than shape in scatterplots
  10. avoid pie charts unless they are for showing proportions
  11. avoid 3D charts and reduce chartjunk
  12. Sunburst visualizations are more effective for hierarchical plots
  13. use small multiples (even though animation looks cool, it is less effective for understanding changing data)
  14. See 39 studies about human perception, by Washington Post graphics editor for a lot more detail.

  15. From a Data Science point of view, what makes a visualization important is that it highlights the key aspects of the data: which variables are the most important, what their relative importance is, and what the changes and trends are.

  16. [Figure: an extreme example of chart junk]
  17. Data visualization should be visually appealing, but not at the expense of loading a chart with unnecessary junk, as in the extreme example above.

  18. How do we make a good data visualization?

  19. To do that, choose the right type of chart for your data:

  20. Line Charts to track changes or trends over time and show the relationship between two or more variables.
  21. Bar Charts to compare quantities of different categories.
  22. Scatter Plots show joint variation of two data items.
  23. Pie Charts to compare parts of a whole - use them sparingly, since people have a hard time comparing the areas of pie slices
  24. You can show additional variables on a 2-D plot using color, shape, and size (a small plotting sketch appears at the end of this answer)
  25. Use interactive dashboards to allow experiments with key variables
  26. Here is an example of a visualization of US Presidential Elections, 1976-2016, that shows multiple variables at once: the electoral college vote difference (y-axis), the popular vote % difference (x-axis), the size of the popular vote (circle area), the winning party (color), and the winner's name and year (label). See my post on What makes a good data visualization for more details.

  27. [Figure: visualization of US elections, 1976-2016]

  28. US Presidential Elections, 1976-2016.

  29. References:

  30. What makes a good visualization, David McCandless, Information is Beautiful
  31. 5 Data Visualization Best Practices, GoodData
  32. 39 studies about human perception in 30 minutes, Kenn Elliott
  33. Data Visualization for Human Perception, landmark work by Stephen Few (key ideas summarized here)
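
A small plotting sketch, as referenced in point 24 (assuming matplotlib and toy data): color and marker size encode two extra variables on top of a 2-D scatter plot, in the spirit of the elections chart described above.

```python
import matplotlib.pyplot as plt

# Toy data: x/y positions plus two extra variables encoded as color and size.
x      = [1, 2, 3, 4, 5, 6]
y      = [2.1, 3.4, 1.8, 4.9, 3.2, 5.5]
group  = [0, 0, 1, 1, 2, 2]           # categorical variable -> color
weight = [20, 45, 80, 130, 60, 200]   # numeric variable -> marker area

fig, ax = plt.subplots()
points = ax.scatter(x, y, c=group, s=weight, cmap="viridis", alpha=0.8)
ax.set_xlabel("x variable")
ax.set_ylabel("y variable")
fig.colorbar(points, ax=ax, label="group")
plt.show()
```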

Lisrelchen posted on 2017-4-16 02:17:29
  1. Q14. What are some of the common data quality issues when dealing with Big Data? What can be done to avoid them or to mitigate their impact?

  2. Anmol Rajpurohit answers:

  3. The most common data quality issues observed when dealing with Big Data can be best understood in terms of the key characteristics of Big Data – Volume, Velocity, Variety, Veracity, and Value.

  4. Volume:

  5. In the traditional data warehouse environment, comprehensive data quality assessment and reporting was at least possible (if not ideal). However, in Big Data projects the scale of the data makes it impossible. Thus, data quality measurements can at best be approximations (i.e. they need to be described in terms of probability and confidence intervals, not absolute values; a small sampling sketch appears at the end of this answer). We also need to re-define most of the data quality metrics based on the specific characteristics of the Big Data project, so that those metrics have a clear meaning, can be measured (well approximated), and can be used for evaluating alternative strategies for data quality improvement.

  6. Despite the great volume of underlying data, it is not uncommon to find that some desired data was not captured or is not available for other reasons (such as high cost, delay in getting it, etc.). It is ironic but true that data availability continues to be a prominent data quality concern in the Big Data era.

  7. Velocity:

  8. The tremendous pace of data generation and collection makes it incredibly hard to monitor data quality with a reasonable overhead in time and resources (storage, compute, human effort, etc.). So, by the time the data quality assessment completes, the output might be outdated and of little use, particularly if the Big Data project serves any real-time or near real-time business needs. In such scenarios, you would need to re-define data quality metrics so that they are relevant as well as feasible in the real-time context.

  9. Sampling can help you gain speed in the data quality efforts, but this comes at the cost of bias (which eventually makes the end result less useful), because samples are rarely a fully accurate representation of the entire data. Smaller samples give higher speed, but with bigger bias.

  10. Another impact of velocity is that you might have to do data quality assessments on the fly, i.e. plugged in somewhere within the data collection/transfer/storage processes, as the critical time constraint does not give you the luxury of making a copy of a selected data subset, storing it elsewhere, and running data quality assessments on it.

  11. Variety:

  12. One of the biggest data quality issues in Big Data is that the data includes several data types (structured, semi-structured, and unstructured) coming in from different data sources. Thus, often a single data quality metric will not be applicable to the entire data set, and you will need to define data quality metrics separately for each data type. Moreover, assessing and improving the data quality of unstructured or semi-structured data is far more tricky and complex than that of structured data. For example, when mining physician notes from medical records across the world (related to a particular medical condition), even if the language (and the grammar) is the same, the meaning might be very different due to local dialects and slang. This leads to low data interpretability, another data quality measure.

  13. Data from different sources often has serious semantic differences. For example, “profit” can have widely varied definitions across the business units of an organization or external agencies. Thus, fields with identical names may not mean the same thing. This problem is made worse by the lack of adequate and consistent metadata from each data source. In order to make sense of data, you need reliable metadata (for example, to make sense of sales numbers from a store, you need other information such as date-time, items purchased, coupons used, etc.). Usually, a lot of these data sources are outside the organization, and thus it is very hard to ensure good metadata for such data.

  14. Another common issue is syntactic inconsistencies. For example, “time-stamp” values from different sources would be incompatible unless they are captured along with the time zone information.



  15. [Figure; image source]
  16. Veracity:

  17. Veracity, one of the most overlooked Big Data characteristics, is directly related to data quality, as it refers to the inherent biases, noise, and abnormality in data. Because of veracity, the data values might not be exact true values; rather, they might be approximations. In other words, the data might have some inherent imprecision and uncertainty. Besides data inaccuracies, Veracity also includes data consistency (defined by the statistical reliability of the data) and data trustworthiness (based on data origin, data collection and processing methods, security infrastructure, etc.). These data quality issues in turn impact data integrity and data accountability.

  18. While the other V’s are relatively well-defined and can be easily measured, Veracity is a complex theoretical construct with no standard approach for measurement. In a way this reflects how complex the topic of “data quality” is within the Big Data context.

  19. Data users and data providers are often different organizations with very different goals and operational procedures. Thus, it is no surprise that their notions of data quality are very different. In many cases, the data providers have no clue about the business use cases of data users (data providers might not even care about it, unless they are getting paid for the data). This disconnect between data source and data use is one of the prime reasons behind the data quality issues symbolized by Veracity.

  20. Value:

  21. The Value characteristic connects directly to the end purpose. Organizations are harnessing Big Data for many diverse business pursuits, and those pursuits are the real drivers of how data quality is defined, measured, and improved.

  22. A common and old definition of data quality is “fitness for use” for the data consumer. This means that data quality is dependent on what you plan to do with the data. Thus, for the same data, two different organizations with different business goals will most likely have widely different measurements of data quality. This nuance is often not well understood – data quality is a “relative” term. A Big Data project might involve incomplete and inconsistent data; however, it is possible that those data quality issues do not impact the utility of the data towards the business goal. In such a case, the business would say that the data quality is great (and will not be interested in investing in data quality improvements). For example, for a producer of mashed potato cans, a batch of small potatoes would be of the same quality as a batch of big potatoes. However, for a fast food restaurant making fries, the quality of the two batches would be radically different.

  23. The Value aspect also brings in the “cost-benefit” perspective on data quality – whether it is worth resolving a given data quality issue, which issues should be resolved first, etc.

  24. Putting it all together:

  25. Data quality in Big Data projects is a very complex topic, where the theory and the practice often differ. I haven’t come across any standard theory yet that is widely accepted, and I see little interest in the industry towards this goal. In practice, data quality does play an important role in the design of a Big Data architecture. All data quality efforts must start from a solid understanding of the high-priority business use cases, and use that insight to navigate various trade-offs (samples given below) to optimize the quality of the final output.

  26. Sample trade-offs related to data quality:

  27. Is it worth improving the timeliness of data at the expense of data completeness and/or inadequate assessment of accuracy?
  28. Should we select data for cleaning based on cost of cleaning effort or based on how frequently the data is used or based on its relative importance within the data models consuming it? Or, a combination of those factors? What sort of combination?
  29. Is it a good idea to improve data accuracy through getting rid of incomplete or erroneous data? While removing some data, how do we ensure that no bias is getting introduced?
  30. Given the enormous scope of work and the very limited resources (relatively!), one common way to run data quality efforts on Big Data projects is to adopt a baseline approach, in which the data users are surveyed to identify and document the bare minimum data quality needed to ensure that the business processes they support are not disrupted. These minimum satisfactory levels of data quality are referred to as the baseline, and the data quality efforts focus on ensuring that the quality of each data element does not fall below its baseline level. This is a good starting point, and you may later move into more advanced endeavors (based on business needs and available budget).

  31. Summary of Recommendations to improve data quality in Big Data projects:

  32. Identify and prioritize the business use cases (then, use them to define data quality metrics, measurement methodology, improvement goals, etc.)
  33. Based on a strong understanding of the business use cases and the Big Data architecture implemented to achieve them, design and implement an optimal layer of data governance (data definitions, metadata requirements, data ownership, data flow diagrams, etc.)
  34. Document baseline quality levels for key data (think of “critical-path” diagram and “throughput-bottleneck” assessment)
  35. Define ROI for data quality efforts (in order to create feedback loop on the ROI metric to improve efficiency and to sustain funding for data quality efforts)
  36. Integrate data quality efforts (to achieve efficiency through minimizing redundancy)
  37. Automate data quality monitoring (to reduce cost as well as to let employees stay focused on complex tasks)
  38. Do not rely on machine learning to automatically take care of poor data quality (machine learning is science and not magic!)

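
A toy sketch of the "approximate metric with a confidence interval" idea from the Volume discussion above; the record format, the customer_id field, and the sampling rate are all hypothetical, and the interval uses a simple normal approximation.

```python
import math
import random

def estimate_missing_rate(record_stream, sample_prob=0.01, z=1.96):
    """Approximate a data quality metric (share of records missing a field)
    from a random sample of a stream, reported with a confidence interval
    rather than as an exact value."""
    sampled = missing = 0
    for record in record_stream:
        if random.random() < sample_prob:
            sampled += 1
            if record.get("customer_id") is None:   # hypothetical field
                missing += 1
    if sampled == 0:
        raise ValueError("sample_prob too low: nothing was sampled")
    rate = missing / sampled
    half_width = z * math.sqrt(rate * (1 - rate) / sampled)
    return rate, (rate - half_width, rate + half_width)

# Toy stream: 1 million records, ~3% of them missing the customer_id field.
stream = ({"customer_id": None if random.random() < 0.03 else i}
          for i in range(1_000_000))
print(estimate_missing_rate(stream))
```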
