Sign up for Wed. presentations

The goal of your presentation is to astound and interest your classmates. (Educating them is nice, also.) So think in terms of an “elevator pitch” for your research, aimed at someone who has no idea what you have done but does know about data mining. Show a few viewgraphs.

I suggest being light on prose. Instead, show samples of the world you were investigating (e.g. real tweets), nice infographics about the problem or what you found, etc. This is practice for talking with an outside audience, and NOT for talking to academics.

Sign up for the presentation sequence, first come, first served, at https://doodle.com/poll/qyqwmfnaaf2ibaxw The times in this Doodle poll are wrong! Expect 5 minutes, so present only 1 or 2 cool viewgraphs.

Don’t forget guidance on final reports. It’s at https://bda2020.files.wordpress.com/2018/04/bda18-writing-your-final-report.pdf
It includes a rubric, checklist, etc. I am editing this document to make it more readable.

You can text me to ask about irregular office hours. Tonight (Monday) IFF anyone asks, and other times by arrangement. Be sure to tell me who you are. +1 858 381-2015


Some discussion of data mining is nonsense – like everything else on the internet

Criticizing an article that appeared on Data Science Central, about logistic regression.

I recently came across a Twitter discussion of an article on a site called Data Science Central. The article was Why Logistic Regression should be the last thing you learn when becoming a Data Scientist. [TL;DR: Don’t believe the headline!]

The article purports to explain that logistic regression is a bad technique and that nobody should use it. The article is nonsense. I critiqued it in the comments, but I’m not sure the editor will allow my comment to stand. Data Science Central appears to be a one-man site, with 90% of the material written by Vincent Granville, and it’s hard not to conclude that he made a serious mistake in writing his attack on logistic regression.

So here is my response to his article. For my students – if you read something about Data Analytics that does not make sense to you, or contradicts something you have been taught, be suspicious. You can see some of the Twitter criticism here.

I am sorry to report that this article is nonsense. The problem is not its conclusion (use logistic regression or don’t; there are now many alternatives). In the machine learning world, logistic regression is a “linear classifier.”

The difficulty is that most of the discussion is Just Wrong: analytically incorrect, with no correspondence to the usual definitions, use, and interpretation of logistic regression.

  • The diagram is incomprehensible. If it is intended to be the standard representation of logistic regression, it has multiple errors.
    • The logistic curve runs from -infinity to +infinity on the x axis, not from 0 to 1.
    • The y axis is correct.
    • The colors and the points show the curve (called the logistic curve or similar) as the boundary between positive and negative outcomes, for points defined by two independent variables (shown as x and y). That is not at all what the curve means. See e.g. https://en.wikipedia.org/wiki/File:Logistic-curve.svg
  • “There are hundreds of types of logistic regression.” Maybe in a world with a different definition, but the standard definition does not include Poisson models. Of course, as always, there are a variety of possible algorithms that can be used to solve a logistic model.
    • From https://www.medcalc.org/manual/logistic_regression.php “Logistic regression is a statistical method for analyzing a dataset in which there are one or more independent variables that determine an outcome. The outcome is measured with a dichotomous variable (in which there are only two possible outcomes).
      In logistic regression, the dependent variable is binary or dichotomous, i.e. it only contains data coded as 1 (TRUE, success, pregnant, etc.) or 0 (FALSE, failure, non-pregnant, etc.).”
  • “If you transform your variable you can instead use linear regression.” Yes, and that is how logistic regressions are usually solved! That is, LRs are solved by transforming the dependent variable (using a logit transform) and solving the resulting equation, which is linear in the independent variables. In practice, many other transformations can be used instead, but the logit transform has a nice interpretation.
    • where logit(p) = ln(p / (1 − p)) = b0 + b1·x1 + … + bk·xk, and p is the probability of the outcome.
  • “Coefficients are not easy to interpret.” I suppose that easy is in the eye of the beholder, but there is a standard and straightforward interpretation.
    • “The logistic regression coefficients show the change in the predicted logged odds of having the characteristic of interest for a one-unit change in the independent variables.” It does take a few examples to figure out what “log odds” means, unless you do a lot of horse racing. But after that, it is a clever and powerful way to think about changes in the probability of an outcome.
    • The (corrected) version of the logistic curve corresponds to an equivalent way to interpret the coefficient values.
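To make the “transform and solve” and “log odds” points above concrete, here is a minimal sketch (in Python rather than R, with invented coefficients) showing that the logistic curve maps the whole real line into (0, 1), and that a one-unit change in an independent variable changes the log odds by exactly its coefficient:

```python
import math

def logit(p):
    # Log-odds: maps a probability in (0, 1) to the whole real line.
    return math.log(p / (1 - p))

def inv_logit(z):
    # The logistic curve: maps any real number back into (0, 1).
    return 1 / (1 + math.exp(-z))

# Hypothetical fitted model (coefficients invented for illustration):
#   log-odds of the outcome = b0 + b1 * x
b0, b1 = -2.0, 0.5

p4 = inv_logit(b0 + b1 * 4)  # predicted probability at x = 4
p5 = inv_logit(b0 + b1 * 5)  # predicted probability at x = 5

# A one-unit change in x changes the log odds by exactly b1 = 0.5:
print(logit(p5) - logit(p4))
```

Both predicted probabilities stay strictly between 0 and 1, no matter how extreme x gets; that is the point of the transform.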

There certainly are some mild criticisms of logistic regression, but in  situations where a linear model is reasonably accurate, it is a good quick model to try. Of course, if the situation is highly nonlinear, a tree model is going to be better. Furthermore, the particular logistic equation generally used should not be considered sacred.

My interpretation is that this article is an attack on a straw man, an undefined  and radically unconventional model that is here being called  “logistic regression.” It would be a shame if anyone took it seriously. We will see if the author/site manager leaves this comment up. If he does, I invite him to respond and explain  the meaning of his diagram.

By the way, I agree with much of the discussion on the medcalc web site I’m quoting, but not all of it.


Uber self-driving crash: Software set to ignore objects on road

A self-driving Uber that struck and killed an Arizona pedestrian in March may have been set to ignore “false positives,” according to a report about the company’s investigation.

Source: Uber self-driving crash: Software set to ignore objects on road

RB note: I’m posting this tragedy because of the last paragraph. Uber had adjusted the sensitivity threshold on its obstacle-detection algorithm. Too high ==> too many false positives ==> car keeps jamming on the brakes ==> passengers are uncomfortable. So they lowered it (detect fewer events).
Too low ==> ignore a few obstacles on the road ==> hit a bike or pedestrian.
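The tradeoff can be sketched with a toy example (all detection scores invented for illustration): raising the score threshold removes false alarms but starts missing real obstacles.

```python
# Hypothetical detection scores, higher = more obstacle-like:
phantoms  = [0.2, 0.35, 0.5, 0.55]  # non-obstacles (plastic bags, shadows)
obstacles = [0.45, 0.7, 0.9]        # real obstacles

def outcomes(threshold):
    # Count the two kinds of errors at a given detection threshold.
    false_alarms = sum(s >= threshold for s in phantoms)   # car brakes for nothing
    misses       = sum(s <  threshold for s in obstacles)  # car ignores a real obstacle
    return false_alarms, misses

print(outcomes(0.3))  # low threshold (high sensitivity): jerky ride, nothing missed
print(outcomes(0.6))  # high threshold (low sensitivity): smooth ride, one obstacle ignored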

According to the Information’s report on Uber’s investigation, the company may have tuned the self-driving software to not be too sensitive to objects around it because it is trying to achieve a smooth self-driving ride. Other autonomous-vehicle rides can reportedly be jerky as the cars react to perceived threats — that are sometimes non-existent — in their way.

It’s not mentioned in this short article, but even with a less sensitive algorithm, the car should have been continuously updating the probability of an obstacle. So as they got closer, it should have eventually jammed on the brakes. It still would have hit the pedestrian, but at a lower speed, possibly not killing her. So not only did they set the threshold wrong, but perhaps their real-time updating method was nonexistent or ineffective.
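The continuous-updating idea can be sketched as repeated Bayesian updates: each successive sensor frame showing the same object should push the probability of a real obstacle upward, eventually crossing any braking threshold. The likelihood ratio and starting probability below are invented for illustration.

```python
def bayes_update(prior, likelihood_ratio):
    # One Bayesian update in odds form: posterior odds = prior odds * LR.
    odds = prior / (1 - prior) * likelihood_ratio
    return odds / (1 + odds)

# Hypothetical numbers: suppose each successive frame showing the same object
# is 3x as likely if it is a real obstacle as if it is a phantom.
p = 0.10  # probability of a real obstacle after one weak detection
for frame in range(5):
    p = bayes_update(p, 3.0)
    print(frame, round(p, 3))
```

Even starting from a deliberately insensitive 10%, the probability exceeds 90% within a handful of frames, so a rule like “brake when p > 0.9” would still fire before impact.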

The original article, with much more detail, is here (email required). It includes the following paragraph:

Hiring better drivers and giving them better tools to avoid such accidents—such as visual or audio alerts when the system decides to ignore certain objects it doesn’t think they’re threats—also may be necessary. (emphasis added)

Frankly, that sounds pretty obvious. A detection system should have multiple levels of sensitivity:

  • Least sensitive when the safety driver is clearly looking at the road.
  • Intermediate sensitivity, with an alarm sounded to the driver, when it detects a possible obstacle and he/she is not looking at the road.
  • Most sensitive when the driver is heavily distracted, and a model of driver behavior suggests they will take a long time to respond after an alarm.

But as my friend Don Norman would point out, there are lots of feedback loops operating here, whose effects need to be determined. In the long run, drivers may be retraining cars. But even in a few hours of driving, an automated system will “train” a driver. If the system always beeps for obstacles, a driver might get even more complacent.

Lecture note supplements

From time to time I write  guides/tutorials on topics in lectures that people find confusing. Taken together, they add up to a supplemental textbook.

Data-related jobs

A few of a stream of job ads looking for people with data analytic skills.

I get a newsletter about R and data analytics which includes a job board: https://www.r-users.com/jobs/ The skill levels being requested are highly variable. For example, one ad, excerpted below, fits many GPS grads. It specifically mentions STATA, R, and ArcGIS. Remember that company ads aim higher than they realistically expect to find.

An alumnus at Raytheon is looking for multiple people with higher skills. US citizens only. He writes:

Searching for a Data Engineer and a Data Scientist for my team at Raytheon Integrated Defense Systems (IDS) in Massachusetts. Let me know if you want to discuss or if you have people in mind that may be interested. Forward along at will. Thanks!

We seek an entrepreneurial data analyst capable of working across functional and business areas with minimal supervision in order to support the application of data science methods and statistical techniques to data for internal use at Raytheon. You will work directly with the Manager of Advanced Analytics at IDS to develop and execute analytics products for internal business users. Solutions will span manufacturing, engineering, business development, supply chain, and quality control functions.

Job from Blog site

  • Bachelor’s degree in social sciences or quantitative field required; Master’s preferred
  • Self-motivated and experience working in a highly collaborative, fast-paced environment
  • Excellent organization skills with strong attention to detail, ability to prioritize and capacity to handle multiple tasks simultaneously
  • Expert level of experience with Microsoft Office programs, including Excel, Word, PowerPoint, Outlook
  • Proficient in programming with R, STATA, SPSS, or other statistical software
  • Proficient in ArcGIS or similar mapping software

The newsletter/blog site is https://www.r-bloggers.com. It’s rather geeky. The jobs board is just a sideline.


Misc. announcements: homework, Sony projects, tutorial on R, etc.

Several notices about events tomorrow and later.

Student Question: What cutoff in homework problem 10.4i?

Dana writes:

Hey everyone!

I have a question about 10.4 i. exercise. In Rattle we use default setting and cannot change the cutoff, so are we supposed to guess the cutoff for the most accurate classification? Or are we supposed to get it some other way?
Response: Good question. I have 3 levels of answer. At the simplest, should you push the cutoff up or down from 50%? (Be sure to specify which way is which – sometimes it can be ambiguous).
At the next level, Rattle produces an ROC curve, as we did with the airplane data on Wednesday. See textbook p 131. The ROC curve is traced out by moving the threshold all the way from 0 to 1.
Third level: Soon, you will learn how to grab the code produced by Rattle, and run it in RStudio. There you can change the cutoff parameter and calculate different confusion matrices. For example function confusion.matrix(obs, pred, threshold = 0.5) allows any threshold. Through trial and error, you can experiment with different thresholds. Of course, there are more specialized functions that can come up with the optimal answer.
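For anyone curious what a function like confusion.matrix computes, here is a minimal sketch (in Python rather than the R we use in class, with invented data) of counting a confusion matrix at several cutoffs. Sweeping the cutoff this way is exactly what traces out the ROC curve.

```python
# Invented predicted probabilities and true classes, for illustration only:
pred = [0.9, 0.8, 0.65, 0.55, 0.4, 0.3, 0.2, 0.1]
obs  = [1,   1,   0,    1,    0,   1,   0,   0]

def confusion(obs, pred, threshold):
    # Returns (true positives, false positives, false negatives, true negatives)
    tp = sum(o == 1 and p >= threshold for o, p in zip(obs, pred))
    fp = sum(o == 0 and p >= threshold for o, p in zip(obs, pred))
    fn = sum(o == 1 and p <  threshold for o, p in zip(obs, pred))
    tn = sum(o == 0 and p <  threshold for o, p in zip(obs, pred))
    return tp, fp, fn, tn

for t in (0.25, 0.5, 0.75):
    print(t, confusion(obs, pred, t))
```

As the cutoff rises, false positives fall and false negatives rise; the best cutoff depends on the relative costs of the two errors.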

Sony Playstation Network project

Anyone considering the Sony Playstation Network project, read this memo: BDA18 Sony PSN project update 4-19. It points you to some data and requests a revised proposal as soon as possible. Or switch to another topic. Make comments on this page if you are looking for a partner, or on the Final paper ‘dating site’.

R Tutorial Friday in room 3201 at 1pm

Feiyang will provide several resources for learning R at the level we need it in BDA. She will demonstrate how functions work, and illustrate function use with examples from the textbook. Other topics will include using R Help, and good cheat sheets. This will all be useful next week when we move away from Rattle and toward straight R.

Next week: Linear continuous models (Linear Regression)

Next week, we will look at a method that everyone has seen in a different context, OLS linear regression. I don’t like the textbook treatment of the topic, so I’m assigning a supplemental book. Please read:
Gareth James et al., An Introduction to Statistical Learning with Applications in R (supplementary textbook). It’s available from SpringerLink. Review section 3.2, which should be familiar, and read section 3.3 on variations on the basic linear model. Also read the DMBA main textbook, sections 6.1 and 6.2 only.
A more detailed assignment will be posted Saturday, and nothing is due Sunday.

Two Data mining application seminars

Two announcements about data mining applications. Either could be a model for a project, especially if the authors would share some data. The first one is clearly interested in hiring. I can’t predict what fraction of his talk will be about pure algorithms, and what fraction will be relevant to our course, but there will be some of each.


1. Tech Talk: Machine Learning and AI Monday night

Monday, April 16, 2018 6:00 – 7:00 PM.
Jacobs Hall – Qualcomm Conference Center
Speaker: Abhijeet Gulati M.S. (UCSD) Director, AI & Product Delivery
Learn how Mitchell International is using Machine Learning & AI to fundamentally disrupt the automotive claims management and collision repair industry. Abhijeet Gulati will discuss Mitchell’s path to Artificial Intelligence and its game-changing future.
Machine Learning as a service [MLaaS]
  • Claims workflow lifecycle
  • First notice of loss and damage triage
  • Guided Estimating



Office hours are set!

Professor Roger Bohn
Office = RBC 1315. Phone and text (858) 381-2015.
Email Rbohn@ucsd.edu. Put #BDA18 in the subject line of all emails.

TA Feiyang.Chen@rady.ucsd.edu

Office Hours (final). There are office hours on Monday, Wednesday, Friday.

  • Wednesday 6:30 to 7:30 PM in Peet’s Coffee. Prof. will stay later if there is a long line. Text (858) 381-2015 to reserve a late time.
  • Fridays 4:00 to 5:15 in my office RBC 1315. Prof. will stay as long as needed to talk to anyone who is in line at 5:05.
  • TA Office hours Monday 1 to 2 in RBC 3128 (shortly after class).
  • TA R tutorial Fridays 1 to 2 in Gardner Auditorium