Dictionary for Data Scientists and Statisticians

During my journey through machine learning (ML) and statistics, I have been confronted many times with surprisingly different usage of terms. To improve mutual understanding between data scientists and statisticians, I present a dictionary and hope the humour does not go unnoticed.

data scientist	statistician	comment
sample	observation
(training) set	sample
feature	covariate, predictor	many more terms
label	categorical response
inference	prediction, forecast
statistics	inference
training	fitting
training error	in-sample error
test/validation set	hold-out sample
regression	regression
classification	regression (on categorical response) + decision making	thus the name logistic / multinomial regression!
supervised machine learning	regression
AI	AI for funding, else regression	see EU AI Act article 3
confidence score	predicted probability	confidence scores might not represent probabilities
(binary/multiclass) cross-entropy	(binomial/multinomial) log likelihood	a.k.a. log loss
unbalanced data problem	🤷‍♂️ what problem?	if any, a small data problem
SMOTE	devil’s work
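
Two of the rows above can be checked numerically. The following is a minimal sketch in plain Python, not taken from any library: the function names (`binary_cross_entropy`, `bernoulli_log_likelihood`, `classify`), the toy data and the default threshold of 0.5 are all assumptions made for illustration. It shows that the mean binary cross-entropy is just the negative Bernoulli log likelihood divided by the number of observations, and that "classification" amounts to a predicted probability followed by a separate decision step.

```python
import math


def binary_cross_entropy(y_true, p_pred):
    """Mean cross-entropy between one-hot targets and predicted class distributions."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        target = (1 - y, y)        # one-hot encoding over the classes (0, 1)
        predicted = (1 - p, p)     # predicted distribution over the classes (0, 1)
        total -= sum(t * math.log(q) for t, q in zip(target, predicted) if t > 0)
    return total / len(y_true)


def bernoulli_log_likelihood(y_true, p_pred):
    """Log likelihood of the outcomes under independent Bernoulli(p) models."""
    return sum(math.log(p if y == 1 else 1 - p) for y, p in zip(y_true, p_pred))


def classify(p_pred, threshold=0.5):
    """Decision step: turn predicted probabilities into hard labels."""
    return [int(p >= threshold) for p in p_pred]


# Toy data, purely illustrative.
y = [1, 0, 1, 1]
p = [0.9, 0.2, 0.6, 0.4]

ce = binary_cross_entropy(y, p)
ll = bernoulli_log_likelihood(y, p)
labels = classify(p)
```

Here `ce` equals `-ll / len(y)` up to floating point error. The threshold in `classify` is a decision that sits outside the probability model; with different costs for false positives and false negatives, a value other than 0.5 may well be the better choice.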

Statistics is about the honest interpretation of data, which is much less appealing than less honest interpretation.

by Prof. Simon Wood, a.k.a. Mr GAM/mgcv
