Validating machine learning time series models

Machine learning for data over time requires special methods for validation.

In many data mining situations, data arrives over time. If the world is stable and we simply happen to collect data at different times, we can ignore the timing, or perhaps add a calendar variable such as day of the week. This makes sense, for example, for crime data over a one-year span. But if we think the underlying world is changing (evolving) over time, we need to take this into account when we validate our predictions. Examples include tracking disease outbreaks or tracking crime over multiple years.
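As a small illustration of the "add a variable" option, here is a minimal Python sketch that derives a day-of-week feature from a date. The record layout (a date plus a count) and the sample values are made up for illustration; only the standard library is used.

```python
from datetime import date

# Hypothetical records: (date of observation, number of incidents).
records = [
    (date(2024, 1, 1), 12),
    (date(2024, 1, 6), 30),
    (date(2024, 1, 7), 25),
]

def add_day_of_week(rows):
    """Append a day-of-week feature (0 = Monday ... 6 = Sunday) to each row."""
    return [(d, count, d.weekday()) for d, count in rows]

with_dow = add_day_of_week(records)
```

The extra column can then be fed to the model like any other predictor. This only helps when the pattern is stable; it does not address a world that is drifting over time.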

The simplest approach is to split the data into two groups, but based on time rather than at random. For example, if you have five years of data, use the first four years to train your model and the last year to validate and test. A more sophisticated version is a rolling forecast, where you train on early data, test on the next period, and then repeat with a continuously moving “window”.
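Both ideas can be sketched in a few lines of Python. This is an illustrative sketch, not code from the posts linked below: the function names `time_split` and `rolling_windows`, the monthly toy data, and the window sizes are all assumptions chosen for the example.

```python
from datetime import date

def time_split(rows, cutoff):
    """Split (date, value) rows: train is strictly before cutoff, test is on or after."""
    train = [r for r in rows if r[0] < cutoff]
    test = [r for r in rows if r[0] >= cutoff]
    return train, test

def rolling_windows(n_periods, train_size, horizon=1):
    """Yield (train_indices, test_indices) for a fixed-length window that moves forward."""
    start = 0
    while start + train_size + horizon <= n_periods:
        train_idx = list(range(start, start + train_size))
        test_idx = list(range(start + train_size, start + train_size + horizon))
        yield train_idx, test_idx
        start += horizon  # slide the window forward by one test period

# Hypothetical data: one observation per month for five years (2015 to 2019).
rows = [(date(2015 + y, m, 1), float(y * 12 + m)) for y in range(5) for m in range(1, 13)]

# Simple split: first four years to train, last year to validate and test.
train, test = time_split(rows, date(2019, 1, 1))

# Rolling forecast: train on 24 months, test on the following 12, then move on.
folds = list(rolling_windows(n_periods=len(rows), train_size=24, horizon=12))
```

In each fold you would fit the model on the rows at `train_idx` and score it on the rows at `test_idx`, then average the errors across folds. An expanding window (always starting at index 0) is a common variant when older data is still informative.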

In any case, if you use a random selection of data from all time periods for validation and testing, you will be committing the sin of “time travel”: training data from “the future” is being used to predict events in “the past.” If change over time is in fact slow, the resulting errors won’t be too large, so you can treat the randomly validated result as a best-case estimate of error.
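To make "time travel" concrete, here is a small Python sketch that checks whether a split lets the training set see observations later than the earliest test observation. The helper `leaks_future`, the integer time stamps, and the 80/20 split are assumptions for illustration.

```python
import random

def leaks_future(train_times, test_times):
    """Return True if any training observation is later than the earliest test observation."""
    return max(train_times) > min(test_times)

times = list(range(100))  # hypothetical time stamps 0..99, in order

# Random 80/20 split, ignoring time.
rng = random.Random(0)
shuffled = times[:]
rng.shuffle(shuffled)
random_train, random_test = shuffled[:80], shuffled[80:]

# Time-based 80/20 split: the first 80 periods train, the last 20 test.
time_train, time_test = times[:80], times[80:]

random_leaks = leaks_future(random_train, random_test)  # True: future data in training
time_leaks = leaks_future(time_train, time_test)        # False: training strictly precedes test
```

Running the same model under both splits and comparing the two error estimates gives a rough sense of how much the world is changing: a large gap suggests the random-split figure is too optimistic.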

This short blog post spells out R code for rolling forecasts. Its author uses an ARIMA model, but you can use any other machine learning model, such as Lasso or random forest. He has a little more discussion of the issues here. A third, more sophisticated discussion is worth reading after these two. All three use R.


Author: Roger Bohn

Professor of Technology Management, UC San Diego. Visiting Stanford Medical School. Twitter: Roger.Bohn