Time series forecasting is a process, and the only way to get good forecasts is to practice this process. In this tutorial, you will discover how to forecast the number of monthly armed robberies in Boston with Python. Working through this tutorial will provide you with a framework for the steps and the tools for working through your own time series forecasting problems.

After completing this tutorial, you will know:

Let’s get started.

In this tutorial, we will work through a time series forecasting project from end to end, from downloading the dataset and defining the problem to training a final model and making predictions. This project is not exhaustive, but it shows how you can get good results quickly by working through a time series forecasting problem systematically.

The steps of this project that we will work through are as follows:

This will provide a template for working through a time series prediction problem that you can use on your own dataset.

This tutorial assumes an installed and working SciPy environment and dependencies, including:

I used Python 2.7. Are you on Python 3? Let me know how you go in the comments.

This script will help you check your installed versions of these libraries.

The results on my workstation used to write this tutorial are as follows:

The problem is to predict the number of monthly armed robberies in Boston, USA. The dataset provides the number of monthly armed robberies in Boston from January 1966 to October 1975, or just under 10 years of data. The values are counts, and there are 118 observations. The dataset is credited to McCleary & Hay (1980).

You can learn more about this dataset and download it directly from DataMarket. Download the dataset as a CSV file and place it in your current working directory with the filename “robberies.csv”.

We must develop a test harness to investigate the data and evaluate candidate models. This involves two steps: defining a validation dataset and developing a method for model evaluation.

The dataset is not current.
This means that we cannot easily collect updated data to validate the model. Therefore, we will pretend that it is October 1974 and withhold the last year of data from analysis and model selection. This final year of data will be used to validate the final model.

The dataset is loaded as a Pandas Series and split in two: one part for model development (dataset.csv) and the other for validation (validation.csv). Running this step creates the two files and prints the number of observations in each. The specific contents of these files are: dataset.csv holds the 106 observations from January 1966 to October 1974, and validation.csv holds the 12 observations from November 1974 to October 1975. The validation dataset is about 10% of the original dataset.

Note that the saved datasets do not have a header line, so we do not need to cater for one when working with these files later.

Model evaluation will only be performed on the data in dataset.csv prepared in the previous section. Model evaluation involves two elements: a performance measure and a test strategy.

The observations are a count of robberies. We will evaluate the performance of predictions using the root mean squared error (RMSE). This gives more weight to predictions that are grossly wrong and has the same units as the original data. Any transforms to the data must be reversed before the RMSE is calculated and reported, so that performance is directly comparable between methods.

We can calculate the RMSE using the scikit-learn helper function mean_squared_error(), which calculates the mean squared error between a list of expected values (the test set) and a list of predictions. We can then take the square root of this value to get an RMSE score.

Candidate models will be evaluated using walk-forward validation. This is because the problem definition calls for a rolling-forecast type model, in which one-step forecasts are made given all available data. The walk-forward validation will work as follows:

Given the small size of the data, we will allow a model to be re-trained given all available data prior to each prediction.
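Returning to the validation split described earlier in this section, withholding the final year can be sketched as follows. This is a minimal sketch rather than the tutorial's original listing: the helper name `split_for_validation` is hypothetical, the demo series holds placeholder values (not the robbery counts), and the commented usage assumes that robberies.csv has a header row, dates in the first column and counts in the second.

```python
import pandas as pd


def split_for_validation(series, n_validation=12):
    """Split a series into a development part and a final hold-out part.

    The last `n_validation` observations (one year of monthly data here)
    are withheld so they can be used to validate the final model.
    """
    split_point = len(series) - n_validation
    return series.iloc[:split_point], series.iloc[split_point:]


# Placeholder series with the same shape as the problem: 118 monthly
# values starting in January 1966. These are NOT the real counts.
demo = pd.Series(
    range(118),
    index=pd.date_range("1966-01-01", periods=118, freq="MS"),
)
dataset, validation = split_for_validation(demo)
print("Dataset %d, Validation %d" % (len(dataset), len(validation)))

# Usage with the real file (assumed layout: header row, dates in
# column 0, counts in column 1), saving both parts without a header:
# series = pd.read_csv("robberies.csv", header=0, index_col=0).iloc[:, 0]
# dataset, validation = split_for_validation(series)
# dataset.to_csv("dataset.csv", header=False)
# validation.to_csv("validation.csv", header=False)
```

Splitting by position rather than by date keeps the helper independent of how the index is parsed, which seems the safer choice when the exact date format of the file is not known.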
We can write the code for the test harness using simple NumPy and Python code. Firstly, we can split the dataset into train and test sets directly. We’re careful to always convert a loaded dataset to float32 in case the loaded data still has some String or Integer data types.

Next, we can iterate over the time steps in the test dataset. The train dataset is stored in a Python list, as we need to easily append a new observation each iteration and NumPy array concatenation feels like overkill.

The prediction made by the model is called yhat by convention, as the outcome or observation is referred to as y, and yhat (a ‘y’ with a mark above) is the mathematical notation for the prediction of the y variable.

The prediction and observation are printed at each iteration as a sanity check in case there are issues with the model.
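The harness described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the tutorial's original listing: the 50% train/test split is an assumed value, the names `walk_forward_rmse` and `persistence_forecast` are hypothetical, and a naive persistence forecast (predict the last observed value) stands in for whichever candidate model is being evaluated. The RMSE is computed directly with NumPy here; taking the square root of scikit-learn's mean_squared_error() on the same values gives the same score.

```python
import numpy as np


def persistence_forecast(history):
    """Stand-in model (an assumption): predict the last observed value."""
    return history[-1]


def walk_forward_rmse(values, forecast=persistence_forecast, train_fraction=0.5):
    """Walk-forward validation with one-step forecasts, scored by RMSE."""
    # Convert to float32 in case the loaded data holds strings or integers.
    X = np.asarray(values, dtype="float32")
    train_size = int(len(X) * train_fraction)  # assumed split point
    train, test = X[:train_size], X[train_size:]

    # A plain list makes it easy to append one observation per step.
    history = [x for x in train]
    predictions = []
    for i in range(len(test)):
        yhat = forecast(history)   # one-step forecast from all data so far
        predictions.append(yhat)
        obs = test[i]
        history.append(obs)        # the observation becomes available
        print(">Predicted=%.3f, Expected=%.3f" % (yhat, obs))

    # RMSE: square root of the mean squared error, in the data's own units.
    errors = test - np.asarray(predictions)
    return float(np.sqrt(np.mean(np.square(errors))))


# Placeholder values for illustration only, not the robbery counts.
rmse = walk_forward_rmse([10, 12, 11, 15, 14, 18, 17, 21])
print("RMSE: %.3f" % rmse)
```

Passing the model in as a function means the same loop can score any candidate that maps the history so far to a single next-step prediction.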
Time Series Forecast Case Study with Python – Monthly Armed Robberies in Boston
Photo by Tim Sackton, some rights reserved.
Overview
2. Problem Description
3.1 Validation Dataset
3.2.1 Performance Measure
3.2.2 Test Strategy