Let’s face it – cleaning data is a waste of time.
If only the data had been collected and entered carefully in the first place, you wouldn’t be faced with days of data cleaning to do. Worse still, your boss probably doesn’t understand why you can’t just do it in a few minutes. After all, you only need to click a few buttons in Excel, don’t you?
Yeah, right…
Well, we all hate data cleaning, but if we get organised and learn a few tricks there are ways to fast-track it and get it done in a fraction of the time.
In fact, there are just 5 steps to getting your data clean and analysis-ready quickly and painlessly.
[color=rgb(255, 255, 255) !important]
Step 1: Plan, Plan, Plan – Then Plan Some More…
I’ve been involved in many studies at the data collection stage without being consulted in the planning of the study. In every single case, it turned out that the study had not been thought through properly, there were big problems with the study and we had to go back to the beginning to plan it all again. I guess they all thought that there’s no need to involve a data analyst until you actually have some data.
Oh, how wrong can they be…
You see, data analysts and statisticians start thinking about the end game right at the beginning, and that includes deciding which statistical tests will be used on the data, even before the data have been collected. They’ll consider which variables are important, which interactions should be interrogated and which statistical package will be used for the analysis.
Each statistical package has its own particular quirks, and if you know what they are you can arrange your data accordingly right from the beginning.
This is what I mean by planning. It’s not just about collecting your data. It’s about collecting your data to the necessary degree of precision, in the correct format, and making sure that it is fit-for-purpose and capable of answering your research questions.
[color=rgb(255, 255, 255) !important]
Plan, Plan, Plan - Then Plan Some More
Ask yourself if you’re sure that the data you plan to collect will fit into the nice, neat boxes and categories you’ve designed. If you’re not absolutely sure, then do a pilot study first – go out and collect some data. The data you collect might surprise you, and it might change the nature of your study.
So then you go back to the drawing board and plan some more. Keep doing it until you KNOW how your study will progress.
A wise man once said that ‘a well formed question contains its own answer’. As far as I’m concerned, if you’ve planned your study well enough you’ll already know what the outcome is likely to be.
Well, maybe, but at least there will be few surprises…
Step 2: Data Collection
Making sure that your data is as clean as it can be even before you start data cleaning is the best and easiest way to hold on to your sanity.
Of course, if your data is inherited from someone else there may be little you can do about it, but if you’re collecting your own data, deciding on a few standards before you get started will save a lot of pain later.
[color=rgb(255, 255, 255) !important]
It's Data Collection Jim, But Not As We Know It
Photo Credit: JD Hancock via photopin cc
For example, if your dataset is small enough to fit on a single Excel worksheet, then enter it into a single worksheet. If you enter it across multiple worksheets and then need to sort your data you’re likely to make mistakes that can’t easily be corrected. Oops – you’ve just screwed up your dataset and need to start again.
Most statistics and analysis packages require that your data is arranged so that each column is a single variable (Height, Weight, Inside Leg Measurement, etc.) and each row corresponds to a single sample (patient, test-tube, customer, etc.), so get into the habit of formatting your data like this. Oh yes, and row 1 – and only row 1, not 2, 3 or 4 – is reserved for the name of your variable.
I also highly recommend creating a unique ID column in column A, numbered in consecutive integers. You’re going to need to sort your data by different columns and you’ll need a way to restore the original order, and this is the best and easiest way to do it.
Also, did you know that Excel has a built in Data Entry Form that you can use to enter your data quickly and easily? It’s probably Excel’s best kept secret – hardly anybody knows about it, but it’s a really useful feature.
Step 3: Data Cleaning
Data cleaning isn’t really about data cleaning. It’s about being organised. Anybody can clean data, but not everybody can clean data quickly and efficiently. Organising your Excel workbook before you get started with your data collection or data entry is a skill that is worth learning.
You should create a worksheet for your Raw Data, another for Cleaning In Progress, a third worksheet for Cleaned Data and one for Data For Analysis. Each of these worksheets will show your dataset in various stages of preparation, and – if done correctly – when you discover an error in later worksheets you will be able to follow the trail back to the point at which the error was introduced. I guarantee you’ll feel a flush of satisfaction when that happens!
Other sheets that you’ll need in your workbook include a Codes sheet, a Notes sheet, Spare Sheets 1, 2, 3, etc., where you’ll clean your data in independent columns. Well, you don’t expect to do your data cleaning in the same worksheet where your data is stored do you? Does the Find & Replace feature work only on the column you’ve selected or does it apply to the whole worksheet? Are you sure? Really REALLY sure?
And what about the Invisible Man? I really hate this guy. He lurks around in your dataset looking smug and self-satisfied. Well, at least, that what he would look like if you could see him! Trailing and leading spaces can wreak havoc on your analyses, so finding and removing them is a critical skill to have. Fortunately, Excel has a few formulae – including TRIM, CLEAN and SUBSTITUTE – that when used in combination can remove trailing and leading spaces and all non-printing characters from your entire dataset in as little as 60 seconds. Yup, you read that right – 60 seconds, irrespective of the size of your dataset! Learning this little trick can save weeks of data cleaning all on its own.
[color=rgb(255, 255, 255) !important]
Data Cleaning Just Makes Me Want To Sing...
Excel also has a plethora of other data cleaning tools for your delectation that will help streamline the whole process, such as Remove Duplicates, Find & Replace, tools for standardising the case of your text data, such as LOWER, UPPER and PROPER.
Oh, and did anybody say Spell Checker?