Some discussion of data mining is nonsense – like everything else on the internet

Criticizing an article that appeared on Data Science Central, about logistic regression.

I recently came across a Twitter discussion of an article on a site called Data Science Central. The article was “Why Logistic Regression should be the last thing you learn when becoming a Data Scientist.” [TL;DR: Don’t believe the headline!]

The article purports to explain that logistic regression is a bad technique, and nobody should use it. The article is nonsense. I critiqued it in the comments, but I’m not sure the editor will allow my comment to stand. Data Science Central appears to be a one-man site, with 90% of the material written by David Granville, and it’s hard not to conclude that he made a serious mistake in writing his attack on logistic regression.

So here is my response to his article. For my students – if you read something about Data Analytics that does not make sense to you, or contradicts something you have been taught, be suspicious. You can see some of the Twitter criticism here.

I am sorry to report that this article is nonsense. The problem is not the conclusion – use it or don’t use it; there are now many alternatives to logistic regression (which in the machine learning world is a “linear classifier”).

The difficulty is that most of the discussion is Just Wrong. Analytically incorrect. No correspondence to the usual definitions, use, and interpretation of logistic regression.

  • The diagram is incomprehensible. If it is intended to be the standard representation of logistic regression, it has multiple errors.
    • LR maps from -infinity to +infinity (on the x axis), not from 0 to 1.
    • The y axis is correct.
    • The colors and the points show the curve (called the logistic curve or similar) as the boundary between positive and negative outcomes, for points defined by two independent variables (shown as x and y). That is not at all what the curve means. See e.g.
  • “There are hundreds of types of logistic regression.” Maybe in a world with a different definition, but the standard definition does not include Poisson models. Of course, as always, there are a variety of possible algorithms that can be used to solve a logistic model.
    • From “Logistic regression is a statistical method for analyzing a dataset in which there are one or more independent variables that determine an outcome. The outcome is measured with a dichotomous variable (in which there are only two possible outcomes).
      In logistic regression, the dependent variable is binary or dichotomous, i.e. it only contains data coded as 1 (TRUE, success, pregnant, etc.) or 0 (FALSE, failure, non-pregnant, etc.).”
  • “If you transform your variable you can instead use linear regression.” Yes, and that is how logistic regressions are usually solved! That is, LRs are solved by transforming the variables (using a logit transform) and solving the resulting equation, which is linear in the variables. In practice, many other transformation equations can be used instead, but the logit transform has a nice interpretation.
    • where logit(p) = ln(p / (1 − p)) and p is the probability of the outcome.
  • “Coefficients are not easy to interpret.” I suppose that easy is in the eye of the beholder, but there is a standard and straightforward interpretation.
    • “The logistic regression coefficients show the change in the predicted logged odds of having the characteristic of interest for a one-unit change in the independent variables.” It does take a few examples to figure out what “log odds” means, unless you do a lot of horse racing. But after that, it is a clever and powerful way to think about changes in the probability of an outcome.
    • The (corrected) version of the logistic curve corresponds to an equivalent way to interpret the coefficient values.
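
To make the points above about the curve and the coefficients concrete, here is a minimal sketch in Python. The coefficients b0 and b1 are hypothetical, chosen only for illustration; they are not taken from the article or from any fitted model. The sketch assumes a single independent variable x. It shows that x can be any real number while the predicted probability stays between 0 and 1, and that a one-unit change in x shifts the log odds by exactly b1.

```python
import math


def logistic(z: float) -> float:
    """Map any real number z to a probability strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def logit(p: float) -> float:
    """Inverse of logistic(): the log odds of a probability p."""
    return math.log(p / (1.0 - p))


# Hypothetical coefficients, for illustration only (not from any real data).
b0, b1 = -1.0, 0.8


def predicted_probability(x: float) -> float:
    """Predicted probability of the outcome for a single independent variable x."""
    return logistic(b0 + b1 * x)


# x ranges over the whole real line; p always stays inside (0, 1).
for x in (-10.0, 0.0, 2.0, 10.0):
    p = predicted_probability(x)
    print(f"x = {x:6.1f}   p = {p:.4f}   log odds = {logit(p):.4f}")

# A one-unit increase in x adds b1 to the log odds, which is the same as
# multiplying the odds by exp(b1), regardless of the starting value of x.
log_odds_change = logit(predicted_probability(3.0)) - logit(predicted_probability(2.0))
odds_ratio = math.exp(b1)
print(f"change in log odds per unit of x: {log_odds_change:.4f}")
print(f"odds ratio per unit of x: {odds_ratio:.4f}")
```

The change in probability for a one-unit step in x depends on where you start on the curve, but the change in log odds does not. That is why the coefficients are usually read on the log odds scale.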

There certainly are some mild criticisms of logistic regression, but in situations where a linear model is reasonably accurate, it is a good quick model to try. Of course, if the situation is highly nonlinear, a tree model is going to be better. Furthermore, the particular logistic equation generally used should not be considered sacred.

My interpretation is that this article is an attack on a straw man, an undefined and radically unconventional model that is here being called “logistic regression.” It would be a shame if anyone took it seriously. We will see if the author/site manager leaves this comment up. If he does, I invite him to respond and explain the meaning of his diagram.

By the way, I agree with much of the discussion on the medcalcweb site I’m quoting, but not all of it.