During my journey through machine learning (ML) and statistics, I have been confronted many times with surprisingly different usages of terms. To improve mutual understanding between data scientists and statisticians, I present a dictionary and hope the humour does not go unnoticed.
| data scientist | statistician | comment |
| --- | --- | --- |
| sample | observation | |
| (training) set | sample | |
| feature | covariate, predictor | many more terms |
| label | categorical response | |
| inference | prediction, forecast | |
| statistics | inference | |
| training | fitting | |
| training error | in-sample error | |
| test/validation set | hold-out sample | |
| regression | regression | |
| classification | regression (on categorical response) + decision making | thus the name logistic / multinomial regression! |
| supervised machine learning | regression | |
| AI | AI for funding, else regression | see EU AI Act, Article 3 |
| confidence score | predicted probability | confidence scores might not represent probabilities |
| (binary/multiclass) cross-entropy | (binomial/multinomial) log likelihood | a.k.a. log loss |
| unbalanced data problem | 🤷‍♂️ what problem? | if any, a small data problem |
| SMOTE | devil’s work | |
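
Two rows of the dictionary can be checked with a few lines of Python. The sketch below is only an illustration. The labels `y`, the predicted probabilities `p`, the 0.5 threshold and all function names are made up for the example. It shows that binary cross-entropy is the Bernoulli log likelihood up to sign and a factor of 1/n, and that classification is a predicted probability plus a decision rule.

```python
import math


def log_loss(y_true, p_pred):
    """Mean binary cross-entropy, the data scientist's loss."""
    total = sum(
        y * math.log(p) + (1 - y) * math.log(1 - p)
        for y, p in zip(y_true, p_pred)
    )
    return -total / len(y_true)


def bernoulli_log_likelihood(y_true, p_pred):
    """Log likelihood of independent Bernoulli responses, the statistician's view."""
    return sum(math.log(p if y == 1 else 1 - p) for y, p in zip(y_true, p_pred))


def classify(p_pred, threshold=0.5):
    """Decision making on top of predicted probabilities (threshold is an assumption)."""
    return [int(p >= threshold) for p in p_pred]


# Made-up data, purely for illustration.
y = [1, 0, 1, 1]
p = [0.9, 0.2, 0.6, 0.4]

ce = log_loss(y, p)
ll = bernoulli_log_likelihood(y, p)
labels = classify(p)

# Cross-entropy is the negative log likelihood divided by the number of observations.
print(math.isclose(ce, -ll / len(y)))
print(labels)
```

The threshold is the only place where a decision enters. Everything before it is plain regression on a binary response.
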
Statistics is about the honest interpretation of data, which is much less appealing than less honest interpretation.
by Prof. Simon Wood, a.k.a. Mr GAM/mgcv