During my journey through machine learning (ML) and statistics, I have many times come across surprisingly different usage of the same terms. To improve mutual understanding between data scientists and statisticians, I present a dictionary and hope the humour does not go unnoticed.
data scientist | statistician | comment |
sample | observation | |
(training) set | sample | |
feature | covariate, predictor | many more terms |
label | categorical response | |
inference | prediction, forecast | |
statistics | inference | |
training | fitting | |
training error | in-sample error | |
test/validation set | hold-out sample | |
regression | regression | |
classification | regression (on categorical response) + decision making | thus the name logistic / multinomial regression! |
supervised machine learning | regression | |
AI | AI for funding, else regression | see EU AI Act article 3 |
confidence score | predicted probability | confidence scores might not represent probabilities |
(binary/multiclass) cross-entropy | (binomial/multinomial) log likelihood | a.k.a. log loss |
unbalanced data problem | 🤷‍♂️ what problem? | if any, a small data problem |
SMOTE | devil’s work | |
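For the sceptics, two rows of the dictionary can be checked with a few lines of Python. This is only a minimal sketch on made-up toy data: the function names, the numbers and the 0.5 threshold are my own assumptions for illustration, not taken from any library. It shows that the mean binary cross-entropy is the negative Bernoulli log likelihood divided by the sample size, and that "classification" is a predicted probability followed by a decision rule.

```python
import math


def binary_cross_entropy(y_true, p_pred):
    """Mean binary cross-entropy (log loss), as usually defined in ML."""
    n = len(y_true)
    return -sum(
        y * math.log(p) + (1 - y) * math.log(1 - p)
        for y, p in zip(y_true, p_pred)
    ) / n


def bernoulli_log_likelihood(y_true, p_pred):
    """Log likelihood of independent Bernoulli observations."""
    return sum(
        math.log(p if y == 1 else 1 - p)
        for y, p in zip(y_true, p_pred)
    )


# Hypothetical toy data: observed 0/1 responses and predicted probabilities.
y = [1, 0, 1, 1, 0]
p = [0.9, 0.2, 0.6, 0.7, 0.4]

bce = binary_cross_entropy(y, p)
loglik = bernoulli_log_likelihood(y, p)

# Same quantity up to sign and a factor of 1/n.
print(math.isclose(bce, -loglik / len(y)))

# "Classification" = predicted probability + decision making.
# The threshold of 0.5 is an arbitrary choice for this sketch.
threshold = 0.5
labels = [int(p_i >= threshold) for p_i in p]
print(labels)
```

The threshold is where the decision making enters. With different costs for false positives and false negatives, a value other than 0.5 may well be the better choice, while the predicted probabilities stay the same.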
Statistics is about the honest interpretation of data, which is much less appealing than less honest interpretation.
by Prof. Simon Wood, a.k.a. Mr GAM/mgcv