The data that you use, and how you use it, will likely define the success of your predictive modeling problem.
Data and the framing of your problem may be the point of biggest leverage on your project.
Choosing the wrong data or the wrong framing for your problem may lead to a model with poor performance or, at worst, a model that cannot converge.
It is not possible to analytically calculate what data to use or how to use it, but it is possible to use a trial-and-error process to discover how to best use the data that you have.
In this post, you will discover how to get the most from your data on your machine learning project.
After reading this post, you will know:
- The importance of exploring alternate framings of your predictive modeling problem.
- The need to develop a suite of “views” on your input data and to systematically test each.
- The notion that feature selection, engineering, and preparation are ways of creating more views on your problem.
How to Get the Most From Your Machine Learning Data
Photo by Jean-Marc Bolfing, some rights reserved.
Overview
This post is divided into 8 parts; they are:
- Problem Framing
- Collect More Data
- Study Your Data
- Training Data Sample Size
- Feature Selection
- Feature Engineering
- Data Preparation
- Go Further
1. Problem Framing
The framing of the problem means the combination of:
- Inputs
- Outputs
- Problem Type
Question each of these elements:
- Can you use more or less data as inputs to the model?
- Can you predict something else instead?
- Can you change the problem to be regression/classification/sequence/etc.?
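As one illustration of changing the problem type, a regression target can be binned into classes so the same data can be tried as a classification problem. This is only a minimal sketch: the `prices` values, the thresholds, and the `to_classes` helper are all made up for the example.

```python
# Hypothetical regression targets (for example, prices in thousands).
prices = [120.0, 340.5, 89.9, 560.0, 230.0, 410.0]


def to_classes(values, thresholds, labels):
    """Map each value to the label of the first threshold it falls below."""
    assert len(labels) == len(thresholds) + 1
    classes = []
    for value in values:
        for threshold, label in zip(thresholds, labels):
            if value < threshold:
                classes.append(label)
                break
        else:
            # No threshold matched, so the value belongs to the top band.
            classes.append(labels[-1])
    return classes


# Assumed cut points; in practice they would come from the domain.
price_bands = to_classes(prices, thresholds=[150.0, 400.0],
                         labels=["low", "mid", "high"])
print(price_bands)
```

Whether the classification framing is actually better is something to test, not assume.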
Use ideas from other projects, papers, and the domain itself.
Brainstorm. Write down all of the ideas, even if they are crazy.
I have some frameworks that will help with brainstorming the framing here:
I talk a little about changing the problem type in this post:
2. Collect More Data
Get more data than you need, even data that is tangentially related to the outcome being predicted.
We cannot know how much data will be needed.
Data is the currency spent during model development. It is the oxygen needed by the project to breathe. Each time you use some data for one task, less data is available for other tasks.
You need to spend data on tasks like:
- Model training.
- Model evaluation.
- Model tuning.
- Model validation.
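To make the idea of "spending" data concrete, here is a small sketch that divides one dataset into three non-overlapping parts. The 60/20/20 proportions, the `split_data` helper, and the mapping of parts to tasks (fit, tune, final check) are assumptions for illustration, not a recommendation from this post.

```python
import random


def split_data(rows, train_frac=0.6, val_frac=0.2, seed=1):
    """Shuffle rows and split them into train, validation, and test parts."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = round(len(rows) * train_frac)
    n_val = round(len(rows) * val_frac)
    train = rows[:n_train]                 # spent on fitting the model
    val = rows[n_train:n_train + n_val]    # spent on tuning
    test = rows[n_train + n_val:]          # held back for a final check
    return train, val, test


rows = list(range(100))  # stand-in for 100 observations
train, val, test = split_data(rows)
print(len(train), len(val), len(test))
```

Every row lands in exactly one part, which is why a larger budget for one task leaves less for the others.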
3. Study Your Data
Use every data visualization you can think of to look at your data from every angle.
- Looking at raw data helps. You will notice things.
- Looking at summary statistics helps. Again, you will notice things.
- Data visualization is like a beautiful combination of these two ways of learning. You will notice a lot more things.
Apply every data visualization you can think of, or can glean from books and papers, to your data.
- Review plots.
- Save plots.
- Annotate plots.
- Show plots to domain experts.
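Before plotting, a quick pass over summary statistics is often enough to make you "notice things". The sketch below uses only the Python standard library. The `ages` column and the `summarize` helper are hypothetical, with one missing value and one suspicious entry planted on purpose.

```python
import statistics

# Hypothetical raw column with a missing value and a likely data-entry error.
ages = [34, 29, None, 41, 38, 290, 33]


def summarize(values):
    """Return simple summary statistics, counting missing values separately."""
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "missing": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        "mean": statistics.fmean(present),
        "median": statistics.median(present),
        "stdev": statistics.stdev(present),
    }


summary = summarize(ages)
for name, value in summary.items():
    print(f"{name:8s} {value}")
```

A maximum far above the median, as here, is the kind of thing worth raising with a domain expert.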
4. Training Data Sample Size
Perform a sensitivity analysis with your data sample to see how much (or little) data you actually need.
You do not have all observations. If you did, you would not need to make predictions for new data.
Instead, you are working with a sample of the data. Therefore, there is an open question as to how much data will be needed to fit the model.
Don’t assume that more is better. Test.
- Design experiments to see how model skill changes with sample size.
- Use statistics to see how important trends and tendencies change with sample size.
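An experiment of this kind might look like the sketch below. Everything in it is assumed for illustration: the data is synthetic (a noisy straight line), the model is a simple least-squares line, and the sample sizes and number of repeats are arbitrary. The point is the shape of the experiment: fix a test set, vary the training sample size, and repeat each size several times.

```python
import random
import statistics

rng = random.Random(7)


def make_data(n):
    """Synthetic observations: y = 3x + 2 plus Gaussian noise."""
    data = []
    for _ in range(n):
        x = rng.uniform(0, 10)
        data.append((x, 3.0 * x + 2.0 + rng.gauss(0, 2.0)))
    return data


def fit_line(data):
    """Ordinary least squares for a single input variable."""
    mean_x = statistics.fmean(x for x, _ in data)
    mean_y = statistics.fmean(y for _, y in data)
    sxx = sum((x - mean_x) ** 2 for x, _ in data)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rmse(model, data):
    slope, intercept = model
    squared = sum((slope * x + intercept - y) ** 2 for x, y in data)
    return (squared / len(data)) ** 0.5


train_pool = make_data(2000)
test_set = make_data(500)  # fixed test set shared by every run

results = {}
for n in [10, 50, 200, 1000, 2000]:
    # Repeat each size with different random samples to see the spread.
    scores = [rmse(fit_line(rng.sample(train_pool, n)), test_set)
              for _ in range(10)]
    results[n] = (statistics.fmean(scores), statistics.stdev(scores))
    print(f"n={n:5d}  rmse mean={results[n][0]:.3f}  std={results[n][1]:.3f}")
```

If the error stops improving well before the largest sample size, the remaining data may be better spent on evaluation or tuning.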
Learn more about sample size in this post:
5. Feature Selection
Create many different views of your input features and test each one.
You don’t know what variables will be helpful or most helpful in your predictive modeling problem.
- You can guess.
- You can use advice from domain experts.
- You can even use suggestions from feature selection methods.
Each set of suggested input features is a “view” on your problem: an idea of which features might be useful for modeling and predicting the output variable.
Brainstorm, compute, and collect as many different views of your input data as you can.
Design experiments and carefully test and compare each view. Use data to inform you which features and which view are the most predictive.
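The comparison of views can be sketched as a small experiment. The setup here is entirely hypothetical: the data is synthetic (the label depends on features 0 and 1, while feature 2 is noise), the views are hand-picked column subsets, and the model is a simple k-nearest neighbors classifier scored with 5-fold cross-validation.

```python
import random

rng = random.Random(3)

# Synthetic data: the label depends on features 0 and 1; feature 2 is noise.
rows = []
for _ in range(200):
    f = [rng.gauss(0, 1) for _ in range(3)]
    rows.append((f, 1 if f[0] + f[1] > 0 else 0))

# Each view is a named subset of column indices.
views = {
    "all": [0, 1, 2],
    "informative": [0, 1],
    "first_only": [0],
    "noise_only": [2],
}


def knn_predict(train, point, k=5):
    """Majority vote among the k training rows closest to point."""
    nearest = sorted(
        train,
        key=lambda r: sum((a - b) ** 2 for a, b in zip(r[0], point)),
    )[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > k else 0


def cross_val_accuracy(rows, columns, folds=5):
    """Accuracy of k-NN under k-fold cross-validation for one view."""
    projected = [([f[c] for c in columns], label) for f, label in rows]
    correct = 0
    for i in range(folds):
        test = projected[i::folds]
        train = [r for j, r in enumerate(projected) if j % folds != i]
        correct += sum(knn_predict(train, x) == y for x, y in test)
    return correct / len(projected)


scores = {name: cross_val_accuracy(rows, cols) for name, cols in views.items()}
for name, acc in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name:12s} accuracy={acc:.3f}")
```

Using the same folds and the same model for every view keeps the comparison fair, so differences in score can be attributed to the features.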
For more on feature selection, see this post:
6. Feature Engineering
Use feature engineering to create additional features and views on your predictive modeling problem.
Sometimes you have all of the data you can get, but a given feature or set of features locks up knowledge that is too dense for the machine learning methods to learn and map to the outcome variable.
Examples include:
- Date/Times.
- Transactions.
- Descriptions.
Make things as simple as you can for the modeling process.
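Taking date/times as an example, a single timestamp can be unpacked into several simpler features that a model can use directly. This is a minimal sketch with the standard library. The `expand_timestamp` helper and the particular features chosen are assumptions, and the most useful breakdown will depend on your problem.

```python
from datetime import datetime


def expand_timestamp(text):
    """Split an ISO 8601 timestamp string into simpler numeric features."""
    ts = datetime.fromisoformat(text)
    return {
        "year": ts.year,
        "month": ts.month,
        "day_of_week": ts.weekday(),           # Monday=0 ... Sunday=6
        "hour": ts.hour,
        "is_weekend": int(ts.weekday() >= 5),  # 1 for Saturday or Sunday
    }


features = expand_timestamp("2017-07-15T22:30:00")
print(features)
```

Each derived column can then be added to, or swapped into, one of your views and tested like any other feature.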
For more on feature engineering, see the post: