It’s fairly intuitive that a model should accurately represent the data. But what exactly does that mean?
Fitting the data
For a model to be useful at all, it should fit the data reasonably well. What constitutes ‘reasonable’ depends on the fitting method as well as the purpose of the model.
- Prediction
If the goal of the model is prediction, the modeler should take extra care not to overfit the data. Overfitting can be checked for by randomly splitting the available data into a training set and a testing set, or by repeatedly leaving out different portions of the data (cross-validation), and then measuring the error on the data the model was not fitted to. If that error is much larger than the error on the fitted data, a handful of data points is probably determining the behavior of the model. Assuming the data was not obtained in a biased manner, a model that does well on held-out data should have good predictive capability.
- Inference
In contrast, overfitting is less of a concern when modeling for inference. Sometimes data may be difficult to obtain, or some data points may be more representative of the underlying system. Of course, overfitting means that the model is starting to treat random chance as information. In a model for inference, however, this random variation may be considered part of the system as a whole, or random effects may contribute to only a small fraction of the data. In such cases, overfitting is a minor concern.
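To make the prediction case concrete, here is a minimal sketch of a random train/test split and a leave-portions-out (k-fold) check. Everything in it is an assumption made for illustration: the synthetic data (a noisy straight line), the two candidate models (a least-squares line and a "memorizer" that returns the value of the nearest training point), and the helper names `fit_line`, `fit_memorizer`, `mse`, and `k_fold_mse`. It uses only the Python standard library.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Fit an ordinary least-squares line and return a predict function."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    intercept = y_bar - slope * x_bar
    return lambda x: slope * x + intercept


def fit_memorizer(xs, ys):
    """Return the y of the nearest training x; this reproduces the noise too."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def mse(predict, xs, ys):
    """Mean squared error of a predict function on the given points."""
    return mean((predict(x) - y) ** 2 for x, y in zip(xs, ys))


def k_fold_mse(fit, xs, ys, k=5):
    """Average held-out error when each of k portions is left out in turn."""
    n = len(xs)
    errors = []
    for fold in range(k):
        held_out = set(range(fold, n, k))
        train_x = [xs[i] for i in range(n) if i not in held_out]
        train_y = [ys[i] for i in range(n) if i not in held_out]
        test_x = [xs[i] for i in sorted(held_out)]
        test_y = [ys[i] for i in sorted(held_out)]
        errors.append(mse(fit(train_x, train_y), test_x, test_y))
    return mean(errors)


# Hypothetical data: y = 2x + 1 plus Gaussian noise (standard deviation 2).
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(400)]
ys = [2 * x + 1 + rng.gauss(0, 2) for x in xs]

# Random split: 75% of the points for training, 25% for testing.
idx = list(range(len(xs)))
rng.shuffle(idx)
train, test = idx[:300], idx[300:]
train_x, train_y = [xs[i] for i in train], [ys[i] for i in train]
test_x, test_y = [xs[i] for i in test], [ys[i] for i in test]

line = fit_line(train_x, train_y)
memorizer = fit_memorizer(train_x, train_y)

train_err_line = mse(line, train_x, train_y)
test_err_line = mse(line, test_x, test_y)
train_err_mem = mse(memorizer, train_x, train_y)
test_err_mem = mse(memorizer, test_x, test_y)

cv_line = k_fold_mse(fit_line, xs, ys)
cv_mem = k_fold_mse(fit_memorizer, xs, ys)

print("line      train/test MSE:", train_err_line, test_err_line)
print("memorizer train/test MSE:", train_err_mem, test_err_mem)
print("5-fold MSE (line, memorizer):", cv_line, cv_mem)
```

The memorizer fits its training points perfectly, yet it should do noticeably worse than the line on points it has not seen. That gap between fitted and held-out error is the signature of overfitting that the split is meant to expose. For inference, the same numbers can still be computed, but as discussed above they carry less weight in judging the model.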