Wednesday May 23
Assignment – Oversampling, feature engineering
- Read main textbook section 5.5 on oversampling. The credit card fraud case used extreme oversampling of the fraud cases, because there were so few in the original data. Most real classification problems have unequal class sizes or unequal costs of errors, so oversampling is frequently important. Without it, or some other way to accomplish the same thing, you can get a model that is accurate but useless, such as one that predicts the most common outcome for every case.
- We will continue discussing and practicing feature engineering.
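To make the oversampling reading concrete, here is a minimal Python sketch of random oversampling. It is an illustration only, not the method used in the textbook or the fraud paper. The data, the `is_fraud` field, and the function name `oversample_minority` are all made up for the example.

```python
import random

def oversample_minority(rows, label_key="is_fraud", seed=0):
    """Duplicate randomly chosen minority-class rows until both classes are equal in size."""
    rng = random.Random(seed)
    positives = [r for r in rows if r[label_key] == 1]
    negatives = [r for r in rows if r[label_key] == 0]
    minority, majority = sorted([positives, negatives], key=len)
    # Sample WITH replacement from the minority class to fill the gap.
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    balanced = majority + minority + extra
    rng.shuffle(balanced)
    return balanced

# Hypothetical training data: 990 legitimate transactions, 10 frauds.
rows = ([{"amount": 20.0, "is_fraud": 0} for _ in range(990)]
        + [{"amount": 900.0, "is_fraud": 1} for _ in range(10)])

# A "model" that always predicts the most common outcome (not fraud)
# is 99% accurate here, yet it catches zero frauds.
baseline_accuracy = sum(r["is_fraud"] == 0 for r in rows) / len(rows)

balanced = oversample_minority(rows)
fraud_share = sum(r["is_fraud"] for r in balanced) / len(balanced)
print(baseline_accuracy, len(balanced), fraud_share)
```

Oversample only the training data. The test data should keep the original class proportions, so that performance estimates reflect what the model will face in practice.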
Assignment: Email me 5 to 10 proposed new variables for the credit card fraud case. The actual case is posted above, and in my lecture notes. You can see what variables the authors decided to use. Devise some additional/better ones. Can they be calculated from the available 44 million transaction records?
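One way to check whether a proposed variable can be calculated from raw transaction records is to sketch the calculation on a few made-up rows. The Python sketch below assumes hypothetical fields `account_id`, `day` (an integer day number), and `amount`. These are not the actual column names in the case data, and the two features are only examples of the general pattern "compare this transaction to the account's own history."

```python
from collections import defaultdict

def add_history_features(transactions, window_days=7):
    """For each transaction, summarise the same account's earlier activity."""
    history = defaultdict(list)  # account_id -> [(day, amount), ...] seen so far
    enriched = []
    for t in sorted(transactions, key=lambda t: t["day"]):
        past = history[t["account_id"]]
        # Earlier transactions on this account within the last `window_days` days.
        recent = [amt for day, amt in past if t["day"] - day <= window_days]
        past_mean = sum(amt for _, amt in past) / len(past) if past else None
        enriched.append({
            **t,
            "n_recent": len(recent),
            # None when there is no history (or the past mean is zero).
            "amount_vs_past_mean": t["amount"] / past_mean if past_mean else None,
        })
        past.append((t["day"], t["amount"]))
    return enriched

# Hypothetical records, for illustration only.
transactions = [
    {"account_id": "A", "day": 1, "amount": 20.0},
    {"account_id": "A", "day": 3, "amount": 30.0},
    {"account_id": "B", "day": 4, "amount": 50.0},
    {"account_id": "A", "day": 20, "amount": 500.0},
]
features = add_history_features(transactions)
for row in features:
    print(row)
```

At the scale of 44 million records the same logic would more likely be run in a database or a data frame library. The point here is only that each feature uses information available at the time of the transaction.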
- Notice that you do not have to know what effect a variable will have in order to decide that it’s a potentially good feature. LASSO or a random forest can sort out whether it is useful.
- Baseball homework from last week – the Hitters baseball data had a number of variables, but some of them seem strange. For example, “lifetime hits” just rewards players who have been playing for a long time. We also have a variable that directly measures how long they have been playing. So it is not surprising if LASSO decides that only one of the time-dependent variables is important. What changes could make a more useful measurement?
We will discuss this on Wednesday.
If you are not familiar with baseball, read about measurements of “batting performance,” i.e. how hitters are traditionally measured. For example, a key statistic is batting average. Note that many key statistics are ratios. Looking back at your homework from last week, what would have been good additional variables beyond what was in the original data?
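As a starting point for the ratio idea, the Python sketch below turns raw counts into rates. The keys `AtBat`, `Hits`, `Years`, `CAtBat`, and `CHits` follow the Hitters column names. The new feature names and the player's numbers are invented for illustration, and these are only a few of many possible answers.

```python
def add_rate_features(player):
    """Return a copy of the player record with ratio features added.

    Assumes AtBat, CAtBat and Years are all greater than zero.
    """
    out = dict(player)
    # Batting average for the season: hits per at-bat.
    out["BatAvg"] = player["Hits"] / player["AtBat"]
    # Career batting average: removes the reward for simply playing a long time.
    out["CareerBatAvg"] = player["CHits"] / player["CAtBat"]
    # Career hits per year played.
    out["CHitsPerYear"] = player["CHits"] / player["Years"]
    return out

# Illustrative numbers, not a real row from the data.
player = {"AtBat": 400, "Hits": 100, "Years": 10, "CAtBat": 4000, "CHits": 1200}
rates = add_rate_features(player)
print(rates["BatAvg"], rates["CareerBatAvg"], rates["CHitsPerYear"])
```

Dividing career totals by at-bats or years separates "how good" from "how long," which is the problem with lifetime hits noted above.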
- For messing around: Google is sponsoring an open-source project for quick data exploration. It seems to be able to display about 5 dimensions of data at once, by using:
Positions on x, y axes
Facets on x, y axes (especially for categorical variables)
I may give a demo of it on Wednesday. It seems to be only for exploration, but exploration is well worth one to several hours of project time.
- Monday handouts – feature engineering
Cheat sheet on performance measurements: sensitivity, specificity, and many other measures (contingency table definitions). Keep this for reference.
Handout: some key plots for exploring data, specifically the Hitters data. R Graphics Cookbook – Excerpt
Lecture notes: The medical testing paradox and how to solve it. BDA18 feature engineering case study. The same lecture includes “Feature engineering for detecting credit card fraud.”
An abbreviated version of the paper “Data mining for credit card fraud: A comparative study” by Siddhartha Bhattacharyya et al., which is all you need for this week: Credit card fraud – excerpt. This is the original article for credit card fraud detection, and it includes their feature engineering solution. The full article is available here (on-campus or VPN only).
Explaining test results: article in Science. In class we applied this to catching terrorists in the country of Blabia. “Risk literacy in medical decision-making: How can we better represent the statistical structure of risk?” by Joachim T. Operskalski and Aron K. Barbey, Science, 22 April 2016, vol. 352, issue 6284. Risk literacy in medical decision-making
Lecture notes, Wednesday BDA18 Lecture feature engineering
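For reference alongside the cheat sheet and the medical testing paradox notes, here is a small Python sketch of sensitivity, specificity, and the chance that a positive result is a true positive. The counts and the 1% prevalence are hypothetical. They are not taken from the handout, the lecture, or the Science article.

```python
def sensitivity(tp, fn):
    """Share of actual positives that the test flags (true positive rate)."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Share of actual negatives that the test clears (true negative rate)."""
    return tn / (tn + fp)

def positive_predictive_value(sens, spec, prevalence):
    """P(actually positive | test positive), by Bayes' rule."""
    true_pos = sens * prevalence
    false_pos = (1 - spec) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# Hypothetical contingency table counts.
tp, fn, tn, fp = 90, 10, 900, 100
sens = sensitivity(tp, fn)   # 0.9
spec = specificity(tn, fp)   # 0.9

# Even with a 90%/90% test, a rare condition (1% prevalence) means
# most positive results are false alarms.
ppv = positive_predictive_value(sens, spec, prevalence=0.01)
print(round(sens, 3), round(spec, 3), round(ppv, 3))
```

With these made-up numbers only about 1 in 12 positive results is a true positive. That is the paradox: a test can have high sensitivity and specificity and still be wrong most of the time it raises an alarm.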