
Question

There seem to be two different worlds in statistics. On one hand, there are the practitioners who run the same tests again and again. On the other hand, there is this overwhelming and seemingly endless world of statistics and machine learning where one gets lost easily in specific questions - just like here on Cross Validated.

So my question is: What do you consider a person must know about statistics and machine learning? I know there will be comments that it depends on the area where you work. But still, there are things all statisticians (should) know, like multicollinearity, power analysis or linear regression. I really would love to have a profound foundation in statistics, but for me it is hard to tell where to go next. So if statistics and machine learning were a craft occupation, what knowledge and what tests / methods would you put in your toolbox?

Why I think this question should be open

I can imagine that experienced statisticians are going to hate this question since it seems pretty broad and therefore naive. But at the same time I can imagine that there are many people like me who are wondering what topics are basic and should be elaborated.

I was already afraid this question was going to be closed, and that is why I anticipated the criticism in my question. I do understand the argument why this question should be on hold. On the other hand: where should I post this question if not on the best Q&A website for statistics? I am being serious about "the best" here. The argument that my question requires non-objective answers and thus doesn't belong on Cross Validated seems valid, but why are there posts like: What is your favorite "data analysis" cartoon? That is a pretty highly rated question, so you probably didn't simply miss it. But this question is perfectly subjective and I see no statistical insight in the answers at all. On the other hand, the answers to my question can give many people at the beginning of their career a feeling for what needs to be known to be a statistician. The two answers so far are pretty helpful to me and I was looking forward to reading more, and thus I hope that this question gets reopened.

I am voting to reopen this question and convert it to a wiki. — 22 hours ago
@igoR87 if you want to open a discussion about the closure of this question perhaps the CV Meta site is the better place? — 21 hours ago
You are referring to the 6th question, from when stats.stackexchange was still in its infancy. The standards have changed a lot over time. StackExchange is a Q&A website, not a discussion website. This means that questions should be clear enough to be able to see how and why a certain answer is acceptable. Sure, there may be more and less useful answers, for instance, because of differences in elegance or detail. In those aspects, answers may be rated in a subjective way. But that does not mean that a question can be so broad that it will be unclear whether an answer is correct or not. — 16 hours ago
@Ferdi Community Wiki is not a solution to off-topic questions. We've used it that way in the past, but it's discouraged. meta.stackexchange.com/questions/258006/… — 16 hours ago
@igoR87 You can review what is and is not on-topic in the help center. Part of what makes stats.SE a good website is that it's oriented as a Q&A, not a freewheeling discussion. If you like open-ended discussions, maybe reddit is more for you. Importantly, just because a question is about statistics does not mean that it is well-suited for stats.SE -- we don't host all statistics questions, just the ones that are suitable according to the help center. — 16 hours ago

score 11 · Accepted Answer · 2019-04-11 00:37:57Z

The two worlds that you describe aren't really two different kinds of statistician, but rather:

"statistics on rails," to coin a phrase: an attempt to teach non-technical people enough to be able to use statistics in a few narrow contexts.

statistics proper, as understood by mathematicians, statisticians, data scientists, etc.

The deal is this. To understand statistics in even moderate depth, you need to know a considerable amount of mathematics. You need to be comfortable with set theory, outer product spaces, functions between high dimensional spaces, a bit of linear algebra, a bit of calculus, and a smidgen of measure theory. It's not as bad as it sounds: all this is usually covered adequately in the first 2-3 years of undergraduate for hard science majors. But for other majors... I can't even formally define a random variable or the normal distribution for someone who doesn't have those prerequisites. Yet, most people only need to know how to conduct a simple A/B test or the like. And the fact is, we can give someone without those prerequisites a set of formulas and look-up tables and tell them to plug-and-chug. Or today, more commonly a user-friendly GUI program like SPSS. As long as they follow some reasonable rules of experiment design and follow a step-by-step procedure, they will be able to accomplish what they need to.

The problem is that without a fairly in-depth understanding, they:

are very likely to misuse statistics

can't stray from the garden path

Issue one is so common it even gets its own Wikipedia article, and issue two can only really be addressed by going back to fundamentals and explaining where those tests came from in the first place. Or by continually exhorting people to stay within the lines, follow the checklist, and consult with a statistician if anything seems weird.

The following poem comes to mind:

A little learning is a dangerous thing;

Drink deep, or taste not the Pierian spring:

There shallow draughts intoxicate the brain,

And drinking largely sobers us again.

- Alexander Pope, A Little Learning

I would liken the gap between the "on rails" version of statistics that you see in AP stats or early undergraduate classes for non-majors and statistics proper to the difference between reading WebMD articles and going to med school. The information in the WebMD article is the most essential conclusion and summary of current medical recommendations. But it's not intended as a replacement for medical school, and I wouldn't call someone who had read a WebMD article "Doctor."

What do you consider as must to know in statistics and machine learning?

The Kolmogorov axioms, the definition of a random variable (including random vectors, matrices, etc.), the algebra of random variables, the concept of a distribution and the various theorems that tie these together. You should know about moments. You should know the law of large numbers, the various inequality theorems such as Chebyshev's inequality, and the central limit theorems. If you want to know how to prove them (optional), you will also need to learn about characteristic functions, which can occasionally be useful in their own right if you ever need to calculate exact closed-form distributions for, say, a ratio distribution.

This stuff would usually be covered in the first (or maybe second?) semester of a class on mathematical statistics. There is also a reasonably good and completely free online textbook which I mainly use for reference but which does develop the topic starting from first principles.
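To make the law of large numbers and Chebyshev's inequality a little more concrete, here is a minimal simulation sketch using only the Python standard library. The choice of a Uniform(0, 1) population, the sample sizes, and the helper names `sample_mean` and `tail_fraction` are all assumptions made for illustration.

```python
import random
import statistics

def sample_mean(n, rng):
    """Mean of n independent draws from Uniform(0, 1); the true mean is 0.5."""
    return statistics.fmean(rng.random() for _ in range(n))

def tail_fraction(n, k, trials, rng):
    """Fraction of simulated sample means lying more than k standard errors
    away from the true mean."""
    mu = 0.5
    standard_error = (1 / 12) ** 0.5 / n ** 0.5  # Var(Uniform(0, 1)) = 1/12
    misses = sum(abs(sample_mean(n, rng) - mu) > k * standard_error
                 for _ in range(trials))
    return misses / trials

rng = random.Random(0)  # fixed seed so the run is repeatable

# Law of large numbers: a large sample's mean sits close to 0.5.
small = sample_mean(10, rng)
large = sample_mean(100_000, rng)

# Chebyshev: at most 1/k^2 of sample means fall beyond k standard errors.
k = 2
observed_tail = tail_fraction(n=30, k=k, trials=2000, rng=rng)
chebyshev_bound = 1 / k ** 2

print(f"mean of 10 draws: {small:.4f}, mean of 100000 draws: {large:.4f}")
print(f"observed tail: {observed_tail:.4f}, Chebyshev bound: {chebyshev_bound:.4f}")
```

The observed tail is typically far below the Chebyshev bound, because the bound holds for any distribution with finite variance, while the central limit theorem gives a much sharper approximation for sample means.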

There are a few crucial distributions everyone must know: Normal, Binomial, Beta, Chi-Squared, F, Student's t, Multivariate Normal. Possibly also Poisson and Exponential for Poisson processes, Multinomial/Dirichlet if you work with multi-class data a lot, and others as needed. Oh, and Uniform - can't forget Uniform!

At this point, you're ready to learn the basic structure of a hypothesis test; which is to say, what a "sample" is, and about null hypothesis and critical values, etc. You will be able to use the algebra of random variables and integrals involving distributions to derive pretty much all of the statistical hypothesis tests you've seen in AP stats.
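As a rough illustration of that structure (a null hypothesis, a sample, and a probability computed from a known distribution), here is a sketch of an exact two-sided binomial test written from the binomial distribution alone. The coin-flip scenario and the function names are assumptions for the example, not something from a particular textbook.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def binom_test_two_sided(k, n, p0=0.5):
    """Exact two-sided p-value under H0: success probability equals p0.

    Sums the probability of every outcome that is no more likely under H0
    than the observed count k.
    """
    observed = binom_pmf(k, n, p0)
    pmfs = [binom_pmf(i, n, p0) for i in range(n + 1)]
    # The small relative tolerance guards against floating-point ties.
    p_value = sum(q for q in pmfs if q <= observed * (1 + 1e-9))
    return min(1.0, p_value)

# Hypothetical sample: 60 heads in 100 flips, H0: the coin is fair.
p_value = binom_test_two_sided(60, 100)
reject_at_5_percent = p_value < 0.05
print(f"p-value: {p_value:.4f}, reject H0 at 5%: {reject_at_5_percent}")
```

The same pattern (pick a statistic, work out its distribution under the null, and measure how extreme the observed value is) underlies the z, t, chi-squared and F tests as well.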

But you're not really done; in fact we're just getting to the good part: fitting models to data. There are various procedures, but the first one to learn is MLE (maximum likelihood estimation). For me personally, this is the only reason why we developed all the above machinery. The key thing to understand about fitting models is that we pose each one as an optimization problem where we (or rather, very powerful computers) find the "best" possible set of "parameters" for the model that "fit" a sample. The resulting model can be validated, examined and interpreted in various ways. The first two models to learn are linear regression and logistic regression, although if you've come through the hard way you might as well study the GLM (generalized linear model), which includes them both and more besides. A very good book on using logistic regression in practice is Hosmer et al. Understanding these models in detail is very demanding, and encompasses ANOVA, regularization and many other useful techniques.

If you're going to go around calling yourself a statistician, you will definitely want to complement all that theoretical knowledge with a solid, thorough understanding of the design of experiments and power analysis. This is one of the most common things statisticians are asked to provide input on.
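To show what "posing the fit as an optimization problem" can look like, here is a small sketch that fits the rate of an exponential distribution by numerically minimizing the negative log-likelihood. The data values, the search interval, and the use of ternary search are assumptions chosen to keep the example dependency-free; real work would normally use a proper optimizer.

```python
import math

def neg_log_likelihood(rate, data):
    """Negative log-likelihood of i.i.d. Exponential(rate) observations:
    -(n * log(rate) - rate * sum(data))."""
    return -(len(data) * math.log(rate) - rate * sum(data))

def fit_rate(data, lo=1e-6, hi=100.0, iterations=200):
    """Minimize the negative log-likelihood over [lo, hi] by ternary search.

    This works because the objective is convex in the rate, so it has a
    single minimum on the interval.
    """
    for _ in range(iterations):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if neg_log_likelihood(m1, data) < neg_log_likelihood(m2, data):
            hi = m2  # the minimum lies to the left of m2
        else:
            lo = m1  # the minimum lies to the right of m1
    return (lo + hi) / 2

# Hypothetical waiting times, made up for the example.
data = [0.5, 1.2, 0.3, 2.0, 0.9, 1.1]

numeric_mle = fit_rate(data)
closed_form_mle = len(data) / sum(data)  # known analytic answer: 1 / mean
print(f"numeric: {numeric_mle:.6f}, closed form: {closed_form_mle:.6f}")
```

For this model the optimum has a closed form, which makes it a convenient check on the numerical search. Logistic regression and most GLMs have no such closed form, which is why the optimization view matters.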
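A typical power-analysis question is "how many subjects do I need per group?" The sketch below uses the standard normal-approximation formula for a two-sided, two-sample comparison of means, n = 2 * ((z_{1-alpha/2} + z_{power}) / d)^2, where d is the standardized effect size. The function name and default values are assumptions; an exact t-based calculation gives slightly larger numbers.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided, two-sample test
    of means, using the normal approximation.

    effect_size is the standardized difference (Cohen's d).
    """
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)  # critical value
    z_power = standard_normal.inv_cdf(power)          # quantile for the target power
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)

medium_effect = n_per_group(0.5)
small_effect = n_per_group(0.2)
print(f"d = 0.5: {medium_effect} per group; d = 0.2: {small_effect} per group")
```

The main lesson is the inverse-square relationship: halving the effect size you want to detect roughly quadruples the required sample.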

Depending on how much model building you're doing, you may also need to know about cross validation, feature selection, model selection, etc. Although maybe I'm biased towards model building and you could get away without this stuff? In any case, a reasonably good book, especially if you're using R, is Applied Predictive Modeling by Max Kuhn.
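Cross validation itself is simple enough to sketch in a few lines: hold out each fold in turn, fit on the rest, and average the held-out error. The example below compares a straight-line fit against a constant (mean-only) baseline on synthetic data; the data-generating process, the fold scheme, and all function names are assumptions for illustration. Libraries such as scikit-learn or caret provide more complete implementations.

```python
import random
from statistics import fmean

def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k disjoint folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns a prediction function."""
    mean_x, mean_y = fmean(xs), fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return lambda x: mean_y + slope * (x - mean_x)

def fit_mean(xs, ys):
    """Baseline model that ignores x and always predicts the training mean."""
    mean_y = fmean(ys)
    return lambda x: mean_y

def cv_mse(fit, xs, ys, k=5):
    """Mean squared prediction error on held-out points, averaged over k folds."""
    squared_errors = []
    for held_out in k_fold_indices(len(xs), k):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        squared_errors.extend((ys[i] - model(xs[i])) ** 2 for i in held_out)
    return fmean(squared_errors)

# Synthetic data: y = 2x + 1 plus Gaussian noise (made up for the example).
rng = random.Random(1)
xs = [i / 10 for i in range(50)]
ys = [2 * x + 1 + rng.gauss(0, 0.5) for x in xs]

line_error = cv_mse(fit_line, xs, ys)
baseline_error = cv_mse(fit_mean, xs, ys)
print(f"line CV MSE: {line_error:.3f}, baseline CV MSE: {baseline_error:.3f}")
```

Model selection then amounts to preferring the candidate with the lower held-out error, with the usual caveat that the selection step itself should sit inside the cross-validation loop if you also want an honest error estimate.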

At this point you'll have the "must know" knowledge you asked about. But you'll also have learned that inventing a new model is as easy as adding a new term to a loss function, and consequently a huge number of models and approaches exist. No one can learn them all. Sometimes it seems as if which ones are in fashion in a given field is completely arbitrary, or an accident of history. Instead of trying to learn them all, rest assured that you can use the foundation you've built to understand any particular model you need in a few hours of study, and focus on those that are commonly used in your field or which seem promising to you.

What tests/ methods would you put in your toolbox?

All right, laundry list time! A lot of these come from The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman, which is a very good book by three highly respected authors. Another good resource is scikit-learn, which tends to implement most of the most mature and popular models. Ditto for R's caret package, although it's really focused on predictive modeling. Others are just models I've seen mentioned and/or used frequently. In roughly descending order of popularity:

Ridge, Lasso, and ElasticNet Regression

Local Regression (LOESS)

Kernel Density Estimates

PCA

Factor Analysis

K-means

GMM (and other mixture models)

Decision Trees, Random Forest, and XGBoost

Time Series Analysis: ARIMA, possibly exponential smoothing

SVM (Support Vector Machines)

Hidden Markov Models

GAM (Generalized Additive Models)

Bayes Networks and Structural Equation Modeling

Neural Nets, CNNs (for images), RNN (for sequences). See the Deep Learning Book by Goodfellow, Bengio, and Courville.

Bayesian inference a la Stan

Survival Analysis (Cox PH, Kaplan-Meier estimator, etc.)

Extreme value theory

Vapnik–Chervonenkis theory

Causality

Pairwise/Preference modeling, e.g. Bradley-Terry

IRT (item response theory, used for surveys and tests)

This is a pretty idiosyncratic list. Certainly I don't know everything on it, and even where I do, my knowledge level varies from superficial to long experience. That's going to be true for everyone. Everyone is going to have their own additions to this list, and above all their own priorities. Some people will tell you to dive right into neural nets and ignore the rest. Some people (actuaries) spend their entire career focusing on survival analysis and extreme value theory. I can't give you any real guidance except to study techniques that are used in your field and apply to your problems.

Nice list; I just would not include the central limit theorem because it doesn't apply to analysis data since it doesn't "work well enough". We've moved past that with flexible Bayesian models, nonparametrics, and semi-parametric models, plus the bootstrap. — 17 hours ago

score 8 · Accepted Answer · 2019-04-11 09:38:56Z

Speaking from a professional perspective (not an academic one), and based on having interviewed several candidates and having been interviewed myself many times as well, I would argue that deep or wide knowledge in stats is not considered a "must know", but having a very solid grasp of the basics (linear regression, hypothesis testing, probability 101, etc.) is essential, as well as some basic knowledge of algorithms (merging/joining tables, dynamic programming, search methods, etc.). I would rather have someone who understands very well how to apply Bayes' rule and who knows how to unit test a Python function than someone who can give me a fancy explanation of how Bayesian optimization works and has experience with TensorFlow, but doesn't seem to grasp the concept of conditional probability or how to sort an array.
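For a sense of the level meant by "apply Bayes' rule and unit test a Python function", here is a small sketch that does both. The diagnostic-test scenario, the numbers, and the function names are assumptions for illustration; the tests are written as plain functions with bare asserts, in the style that a runner such as pytest would collect.

```python
import math

def posterior(prior, sensitivity, specificity):
    """P(condition | positive test) by Bayes' rule.

    prior        P(condition)
    sensitivity  P(positive | condition)
    specificity  P(negative | no condition)
    """
    true_positive = sensitivity * prior
    false_positive = (1 - specificity) * (1 - prior)
    return true_positive / (true_positive + false_positive)

def test_rare_condition():
    # 1% prevalence, 99% sensitivity, 95% specificity:
    # 0.0099 / (0.0099 + 0.0495) = 1/6, far below the test's "accuracy".
    assert math.isclose(posterior(0.01, 0.99, 0.95), 1 / 6)

def test_uninformative_test_returns_prior():
    # A test that is positive half the time regardless of the condition
    # should leave the prior unchanged.
    assert math.isclose(posterior(0.3, 0.5, 0.5), 0.3)

def test_perfect_test_is_certain():
    assert math.isclose(posterior(0.01, 1.0, 1.0), 1.0)
```

The first test case is the classic base-rate example: even with a seemingly accurate test, a positive result for a rare condition is wrong most of the time.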

Beyond the basics, most good companies or teams will quiz you on what you claim you know, not what they think you should know. If you put SVM on your resume, make sure you truly understand SVM, and have some experience using it.

Good companies or teams will also test your hands-on experience more than the depth of your theoretical knowledge.

It is incredibly unlikely that someone could explain in a fancy manner how Bayesian optimization works yet doesn't understand conditional probability or sorting an array. The theoretical knowledge helps guide a lot of applied problems where only applied knowledge keeps blinders on the horse and can limit quality of the work. — 17 hours ago
@LSC you'd be surprised. I have run into a lot of candidates who fit exactly that description. I think it's one of the main reasons more and more companies put people through a hands-on technical screening as the first step in their interview process. — 15 hours ago
I would be surprised. Although, I have met many "data scientists", "ML", and "big data"/"six sigma" people who use the big words/fancy methodology names but have no real statistical background and therefore don't understand the words they use or the big picture (like people using certain kinds of mixed models or LASSO but mistakenly believe a p-value is an error probability). I agree with exploring the depth of someone's claimed knowledge and experience. — 15 hours ago

score 5 · Accepted Answer · 2019-04-11 12:54:11Z

What a person needs to know is going to depend on a lot of things. I can only answer from my perspective. I've worked as a data analyst for 20 years, working with researchers in the social, behavioral and medical sciences. I say "data analyst" to make clear that I view my job as a practical one: I help people figure out what their data means. (In an ideal situation, I also help them figure out what data they need, but ... the world is not ideal).

What my clients need to know is to consult me (or someone else) early and often. I find it fascinating but rather odd that scientists with advanced degrees and a lot of experience in their fields will simultaneously

Say that statistics is hard

Admit that they have little training or expertise in it and

Do it on their own anyway.

No. This is the wrong way to proceed. And if this question is viewed as an attempt to figure out what a researcher needs to know, then I think the question is rather wrong-headed. It's like asking how much medicine you need to know in order to visit the doctor.

What I need to know is

When I am out of my depth. No one knows all this stuff, certainly I don't.

A whole lot about models, methods and such, when each can be applied, what each does, how it goes wrong, alternatives etc.

Also, how to run these models in some statistical package and read the results, detect bugs etc. (I use SAS and R, but other choices are fine).

How to ask questions. A good data analyst asks a lot of questions.

Enough matrix algebra and calculus to at least read articles. But that's not all that much.

Others will say that this is inadequate and that I should really have a full grasp of (some list of advanced math here). All I can say is that I have not felt the lack, nor have my clients. True, I cannot invent new methods, but 1) I have rarely felt the need, since there is a huge variety of existing methods, and 2) most of my clients have a hard enough time recognizing that you can't always use OLS regression; trying to get them to accept a totally new method would be nearly impossible and, if they did accept it, their PHBs would not. (PHB = pointy-haired boss, a la Dilbert, and could be a committee chair, a journal editor, a colleague or an actual boss.)
