Wk 8: Feature engineering, other topics

Wednesday May 23
Assignment – Oversampling, feature engineering

  1. Read main textbook section 5.5 on oversampling. The credit card fraud case is an example: the authors used extreme oversampling of the fraud cases because there were so few in the original data. Most real classification problems have uneven numbers of each case or uneven costs of errors, so oversampling is frequently important. Without it, or some other way to accomplish the same thing, you can get a model that is accurate but useless, such as one that predicts that every case is the most common outcome.
  2. We will continue discussing and practicing feature engineering.
    Assignment: Email me 5 to 10 proposed new variables for the credit card fraud case. The actual case is posted above, and in my lecture notes. You can see what variables the authors decided to use. Devise some additional/better ones. Can they be calculated from the available 44 million transaction records?

    1. Notice that you do not have to know what effect a variable will have in order to decide that it’s a potentially good feature. A LASSO or Random Forest can sort out whether it is useful or not.
  3. Baseball homework from last week – the Hitters baseball data had a number of variables, but some of them seem strange. For example, “lifetime hits” just rewards players who have been playing for a long time. We also have a variable that directly measures how long they have been playing. So it is not surprising if LASSO decides that only one of the time-dependent variables is important. What changes could make a more useful measurement?
    We will discuss this on Wednesday.
    If you are not familiar with baseball, read about measurements of “batting performance,” i.e. how hitters are traditionally measured. For example, a key statistic is batting average. Note that many key statistics are ratios. Looking back at your homework from last week, what would have been good additional variables beyond what was in the original data?
  4. For messing around. Google is sponsoring an open-source project for quick data exploration. This one seems to be able to display about 5 dimensions of data at once, by using:
    Positions on x,y axes
    Facets on x, y axes (especially for categorical variables).
    I may demo it on Wednesday. It seems to be only for exploration, but exploration is well worth one to several hours of project time.
  5. Monday handouts — feature engineering,
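
To make the oversampling point in item 1 concrete, here is a minimal Python sketch. It is not taken from the textbook. The 1,000-row data set and its 1% fraud rate are invented, and `oversample_minority` is a hypothetical helper that simply duplicates minority rows at random; section 5.5 may describe a different procedure.

```python
import random
from collections import Counter

# Invented data: 1,000 transactions, 1% labeled fraud (1), the rest legitimate (0).
labels = [1] * 10 + [0] * 990

# A "model" that always predicts the most common outcome.
majority = Counter(labels).most_common(1)[0][0]
predictions = [majority] * len(labels)

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
frauds_caught = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
print(f"accuracy = {accuracy:.1%}, frauds caught = {frauds_caught}")


def oversample_minority(rows, minority_label, rng):
    """Hypothetical helper: duplicate minority rows (with replacement)
    until both classes are the same size. Assumes binary labels."""
    minority = [r for r in rows if r["label"] == minority_label]
    majority_rows = [r for r in rows if r["label"] != minority_label]
    extra = [rng.choice(minority) for _ in range(len(majority_rows) - len(minority))]
    return majority_rows + minority + extra


rows = [{"id": i, "label": y} for i, y in enumerate(labels)]
balanced = oversample_minority(rows, minority_label=1, rng=random.Random(0))
print(Counter(r["label"] for r in balanced))
```

The always-majority model scores high on accuracy while catching no fraud at all. If you try this on real data, oversample only the training rows, so that the held-out test data still reflects the original class mix.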

Handouts, resources

Cheat sheet on performance measurements: sensitivity, specificity, and many other measures (contingency table definitions). Save this and keep it for reference.
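
As a quick companion to the cheat sheet, the following Python sketch computes a few of the standard measures from the four cells of a 2x2 contingency table. The function name `classification_measures` and the counts are invented for illustration; the cheat sheet itself covers many more measures.

```python
def classification_measures(tp, fp, fn, tn):
    """Standard measures from the four cells of a 2x2 contingency table."""
    return {
        "sensitivity": tp / (tp + fn),   # true positive rate (recall)
        "specificity": tn / (tn + fp),   # true negative rate
        "precision": tp / (tp + fp),     # positive predictive value
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }


# Invented counts for a fraud classifier scored on 1,000 transactions.
measures = classification_measures(tp=8, fp=40, fn=2, tn=950)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

With these made-up counts, accuracy and specificity both look excellent while precision is poor, which is why a single number rarely tells the whole story.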

Handout: some key plots for exploring data. Specifically for Hitters data. R Graphics Cookbook – Excerpt
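
For the Hitters homework question about better variables, here is a small Python sketch of ratio features. The two players and their numbers are invented. The column names (`CHits`, `CAtBat`, `Years`) follow the Hitters data as I understand it, and the new feature names `career_avg` and `hits_per_year` are hypothetical.

```python
# Invented rows using column names from the Hitters data
# (CHits = career hits, CAtBat = career at-bats, Years = years in the majors).
players = [
    {"name": "Veteran", "Years": 15, "CAtBat": 7500, "CHits": 1950},
    {"name": "Newcomer", "Years": 3, "CAtBat": 1400, "CHits": 420},
]

for p in players:
    # Ratio features separate skill from longevity.
    p["career_avg"] = p["CHits"] / p["CAtBat"]      # career batting average
    p["hits_per_year"] = p["CHits"] / p["Years"]    # productivity per season
    print(f'{p["name"]}: CHits={p["CHits"]}, avg={p["career_avg"]:.3f}, '
          f'hits/yr={p["hits_per_year"]:.0f}')
```

The veteran has far more lifetime hits, but the newcomer has the better rate on both ratios. That is the kind of information a raw count hides.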

Lecture notes: The medical testing paradox and how to solve it. BDA18 feature engineering case study. In the same lecture is “Feature engineering for detecting credit card fraud”.
An abbreviated version of the paper, which is all you need for this week. “Data mining for credit card fraud: A comparative study” by Siddhartha Bhattacharyya et al. Credit card fraud – excerpt.  The original article for credit card fraud detection. Includes their feature engineering solution. The full article is available here. (On-campus or VPN only.)

Explaining test results: article in Science. In class we applied this to catching terrorists in the country of Blabia. “Risk literacy in medical decision-making: How can we better represent the statistical structure of risk?” by Joachim T. Operskalski and Aron K. Barbey. Science, 22 April 2016, Vol. 352, Issue 6284. Risk literacy in medical decision-making
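
The arithmetic behind the medical testing paradox can be sketched in a few lines of Python. The prevalence, sensitivity, and specificity below are invented for illustration; they are not taken from the article or the lecture notes. The second half repeats the calculation as head counts in a population of 100,000, which many people find easier to follow.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """P(condition | positive test), by Bayes' rule."""
    true_pos = prevalence * sensitivity
    false_pos = (1 - prevalence) * (1 - specificity)
    return true_pos / (true_pos + false_pos)


# Invented numbers: a rare condition and a test that looks accurate.
prevalence, sensitivity, specificity = 0.001, 0.99, 0.95
ppv = positive_predictive_value(prevalence, sensitivity, specificity)

# The same calculation as counts in a population of 100,000 people.
population = 100_000
sick = population * prevalence                                   # about 100 people
sick_and_positive = sick * sensitivity                           # about 99 people
healthy_and_positive = (population - sick) * (1 - specificity)   # about 4,995 people
total_positive = sick_and_positive + healthy_and_positive

print(f"P(condition | positive) = {ppv:.3f}")
print(f"{sick_and_positive:.0f} of {total_positive:.0f} positives are real")
```

Even with a test this good, most positives are false alarms because the condition is so rare. The same logic applies to searching for rare events such as fraud.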

Lecture notes, Wednesday: BDA18 Lecture feature engineering


Uber self-driving crash: Software set to ignore objects on road

A self-driving Uber that struck and killed an Arizona pedestrian in March may have been set to ignore “false positives,” according to a report about the company’s investigation.

Source: Uber self-driving crash: Software set to ignore objects on road

RB note: I’m posting this tragedy because of the last paragraph. Uber had adjusted the sensitivity threshold on its obstacle detection algorithm. Sensitivity too high ==> too many false positives ==> car keeps jamming on the brakes ==> passengers are uncomfortable. So they lowered it (detect fewer events).
Sensitivity too low ==> ignore a few obstacles on the road ==> hit a bike or pedestrian.
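
The trade-off in this note can be sketched with a toy detector in Python. The scores and labels are invented. Treating the detector as a single confidence score compared against a cutoff is an assumption for illustration, not a description of Uber's software. Note the direction: a lower cutoff means a more sensitive detector.

```python
# Each pair is (detector confidence score, whether a real obstacle was present).
# All values are invented for illustration.
observations = [
    (0.10, False), (0.20, False), (0.35, False), (0.45, False), (0.60, False),
    (0.40, True), (0.55, True), (0.70, True), (0.85, True), (0.95, True),
]


def count_errors(cutoff):
    """Brake whenever score >= cutoff; count both kinds of mistakes."""
    false_alarms = sum(score >= cutoff and not real for score, real in observations)
    missed = sum(score < cutoff and real for score, real in observations)
    return false_alarms, missed


for cutoff in (0.3, 0.5, 0.8):
    false_alarms, missed = count_errors(cutoff)
    print(f"cutoff {cutoff}: {false_alarms} false alarms, {missed} missed obstacles")
```

No cutoff removes both kinds of error in this toy data. Moving it only trades false alarms for missed obstacles, so the right setting depends on the relative cost of each mistake.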

According to the Information’s report on Uber’s investigation, the company may have tuned the self-driving software to not be too sensitive to objects around it because it is trying to achieve a smooth self-driving ride. Other autonomous-vehicle rides can reportedly be jerky as the cars react to perceived threats — that are sometimes non-existent — in their way.

It’s not mentioned in this short article, but even with a less sensitive algorithm, the car should have been continuously updating the probability of an obstacle. So as the car got closer, it should have eventually jammed on the brakes. It still would have hit the pedestrian, but at a lower speed, possibly not killing her. So not only did they set the threshold wrong, but perhaps their real-time updating method was nonexistent or ineffective.
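
Continuous updating of an obstacle probability can be sketched as repeated applications of Bayes' rule in Python. Everything here is an assumption for illustration: the prior, the two likelihoods, the braking cutoff, and the idea that each sensor frame is an independent hit-or-no-hit reading. Real perception systems are far more complex.

```python
def update(prior, detected, p_hit_if_obstacle=0.7, p_hit_if_clear=0.2):
    """One Bayes update of P(obstacle) after a single sensor frame."""
    if detected:
        like_obstacle, like_clear = p_hit_if_obstacle, p_hit_if_clear
    else:
        like_obstacle, like_clear = 1 - p_hit_if_obstacle, 1 - p_hit_if_clear
    numerator = like_obstacle * prior
    return numerator / (numerator + like_clear * (1 - prior))


probability = 0.01     # invented prior: obstacles are rare
brake_cutoff = 0.9     # invented decision threshold
frames = [True, True, False, True, True, True, True, True]  # invented readings

for frame_number, detected in enumerate(frames, start=1):
    probability = update(probability, detected)
    action = "BRAKE" if probability >= brake_cutoff else "continue"
    print(f"frame {frame_number}: P(obstacle) = {probability:.3f} -> {action}")
```

In this sketch no single detection is enough to trigger braking, but the evidence accumulates frame by frame until the probability crosses the cutoff. A system without this kind of accumulation has to rely on one noisy reading at a time.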

The original article, with much more detail, is here. (email required) It includes the following paragraph:

Hiring better drivers and giving them better tools to avoid such accidents—such as visual or audio alerts when the system decides to ignore certain objects it doesn’t think they’re threats—also may be necessary. (emphasis added)

Frankly, that sounds pretty obvious. A detection system should have multiple levels of sensitivity:

  • Least sensitive when the safety driver is clearly looking at the road.
  • Intermediate sensitivity, plus an alarm for the driver, when it detects a possible obstacle and he/she is not looking at the road.
  • Most sensitive when the driver is heavily distracted and a model of driver behavior suggests they will take a long time to respond after an alarm.

But as my friend Don Norman would point out, there are lots of feedback loops operating here, whose effects need to be determined. In the long run, drivers may be retraining cars. But even in a few hours of driving, an automated system will “train” a driver. If the system always beeps for obstacles, a driver might get even more complacent.

Data-related jobs

A few of a stream of job ads looking for people with data analytic skills.

I get a newsletter about R and data analytics, which includes a job board: https://www.r-users.com/jobs/ The skill levels being requested are highly variable. For example, one ad, excerpted below, fits many GPS grads. It specifically mentions STATA, R, and ArcGIS. Remember that company ads aim higher than they realistically expect to find.

An alumnus at Raytheon is looking for multiple people with higher skills. US citizens only. He writes:

Searching for a Data Engineer and a Data Scientist for my team at Raytheon Integrated Defense Systems (IDS) in Massachusetts. Let me know if you want to discuss or if you have people in mind that may be interested. Forward along at will. Thanks!

We seek an entrepreneurial data analyst capable of working across functional and business areas with minimal supervision in order to support the application of data science methods and statistical techniques to data for internal use at Raytheon. You will work directly with the Manager of Advanced Analytics at IDS to develop and execute analytics products for internal business users. Solutions will span manufacturing, engineering, business development, supply chain, and quality control functions.

Job from Blog site

  • Bachelor’s degree in social sciences or quantitative field required; Master’s preferred
  • Self-motivated and experience working in a highly collaborative, fast-paced environment
  • Excellent organization skills with strong attention to detail, ability to prioritize and capacity to handle multiple tasks simultaneously
  • Expert level of experience with Microsoft Office programs, including Excel, Word, PowerPoint, Outlook
  • Proficient in programming with R, STATA, SPSS, or other statistical software
  • Proficient in ArcGIS or similar mapping software

The newsletter/blog site is https://www.r-bloggers.com. It’s rather geeky. The jobs board is just a sideline.


Two Data mining application seminars

Two announcements about data mining applications. Either could be a model for a project, especially if the authors would share some data. The first speaker is clearly interested in hiring. I can’t predict what fraction of his talk will be about pure algorithms and what fraction will be relevant to our course, but there will be some of each.


1. Tech Talk: Machine Learning and AI Monday night

Monday, April 16, 2018 6:00 – 7:00 PM.
Jacobs Hall – Qualcomm Conference Center
Speaker: Abhijeet Gulati M.S. (UCSD) Director, AI & Product Delivery
Learn how Mitchell International is using Machine Learning & AI to fundamentally disrupt the automotive claims management and collision repair industry. Abhijeet Gulati will discuss Mitchell’s path to Artificial Intelligence and its game-changing future.
Machine Learning as a service [MLaaS]
  • Claims workflow lifecycle
  • First notice of loss and damage triage
  • Guided Estimating



Homework, project datasets, other announcements

I have re-posted next week’s homework on TritonEd. This is the same information as in the syllabus, but it is broken down day by day. Look in the “Content” section.

I have updated a list of sources of datasets. Projects: easily available data sets. Google, the US Government, Github, Kaggle, and other organizations have pages that are devoted to listing freely available datasets. Many of these datasets are big enough for the course, and are well documented. This list means that a shortage of interesting data is no longer a constraint. It may still take time to find interesting data on a particular topic you want – for that, you still need to do Internet searches and ask a librarian for help.

Project deliverables are coming up. The first proposal is due Saturday April 14. Look for more details soon.

Rattle, job hunting, other announcements

On Thursday our TA had a session on installing Rattle on the Mac. She may repeat it on Friday. Send her an email if you are interested.

Please refer to this page for instructions. Installing Rattle on Mac

Data sources pages:

Data sets from Google and Kaggle: https://www.kaggle.com/datasets

A page of useful links

Data sources and project ideas related to pollution.

Projects: easily available data sets

Five strategies for locating interesting data sets. (From Dataquest)

Some data projects that encourage other people to use the data they collected.

Past student papers:


Job Hunting Opportunity

In 2 weeks, the Jacobs School of Engineering is running a day with lots of employers visiting. Student passes are $10, although you can probably sneak in if you want to. http://jacobsschool.ucsd.edu/re/




Preparing for first week of BDA

As of today, March 29, the course is oversubscribed. Come to the first class anyway, because by the third week lots of people will have dropped the course for various reasons. See the page “Should I take Big Data Analytics in 2018?” for more information.
Class will probably meet in RBC 3203, but there is a chance it will move to the Gardner Room. 

Here are the steps that you need to complete by Tuesday, April 3. If you can get most of it done before the first class, even better. Most important: get the software installed. Some students will run into problems, and we don’t want to wait to discover them.

  • Installing R is straightforward and covered in many places. We will use R version 3.4.4 (Someone to Lean On), which was released on 2018-03-15. Here is a Coursera video on how to install. The official web site for latest versions is https://cran.r-project.org.
  • Start R, make sure it runs. Set up at least one folder/directory where your R programs will go.
  • Install Rattle. Its home page is https://rattle.togaware.com. To install Rattle, start up R, then follow the instructions on the Rattle page.
  • Download the Rattle textbook from Springerlink.com. It also has instructions for installing Rattle. https://link.springer.com/book/10.1007/978-1-4419-9890-3
  • Get the main textbook. You can try using the library’s online copies, especially for the first chapters. Instructions are on this web site (BDA2020.wordpress.com).
    • Read chapter 1 on your own.
    • Start on Chapter 2.

The TA for the course will be Feiyang Chan. She will hold informal office hours before and after the Wednesday class for anyone who is having trouble installing the software: 10:30 to 11 AM, and again from 12:30 onward, in the main classroom, RBC 3203.

What is Big Data Analytics?

This confuses students every year, and for excellent reasons. A variety of terms are thrown around without clear definitions, or clear distinctions among them. The concepts and applications are evolving so fast that there is no consensus. You should think of all of the following as closely related, and all covered by this course:

  • Data Analytics
  • Data Mining
  • Machine Learning
  • Business Analytics
  • Data Science
  • at least 5 others.

It is very helpful to look at a range of case studies where these ideas have been used successfully. Here are a few. Some may be bogus – as we will try to discuss during the course.

Assignment: send me other examples. Either put them in the comments, or email them to me and I will post them.

Big Data At Caesars Entertainment – A One Billion Dollar Asset? – Forbes

BDA examples: Pollution and health

Popular Press Articles

Analyzing 170,000,000 NYC Taxi trips


Articles about “data science”

The next iteration of my course starts on April 2, 2018. For people who are baffled by all the buzz words and conflicting advice (and who isn’t?), I’m going to post some article links here. It will be a potpourri. When the course starts, I may go back and reorganize the material by topic.

I will also reactivate my Twitter tag #BDA in my Twitter feed. But #BDA seems to refer to something in Spanish (does anyone know what?), so I will pair it with #DataMining. Look for tweets tagged @RogerBohn #BDA #DataMining. I won’t put anything mission-critical in the tweets.

Udacity: 4 Types of Data Science Jobs 

Although it’s good for discussing the range of useful “technical” skills, this article ignores the business and application side of data science. If you can’t help answer the “So what?” question, you won’t be very useful. And the skills for “So what?” are quite different from the technical skills in the article. MBA versus computer science, basically.