Wk 8: Feature engineering, other topics

Wednesday May 23
Assignment – Oversampling, feature engineering

  1. Read main textbook section 5.5 on oversampling. The credit card fraud case is an example: the authors used extreme oversampling of the fraud cases because there were so few in the original data. Most real classification problems have uneven numbers of each case or uneven costs of errors, so oversampling is frequently important. Without it, or some other way to accomplish the same thing, you can get a model that is accurate but useless, such as one that predicts that every case is the most common outcome.
  2. We will continue discussing and practicing feature engineering.
    Assignment: Email me 5 to 10 proposed new variables for the credit card fraud case. The actual case is posted above, and in my lecture notes. You can see what variables the authors decided to use. Devise some additional/better ones. Can they be calculated from the available 44 million transaction records?

    1. Notice that you do not have to know what effect a variable will have in order to decide that it’s a potentially good feature. A LASSO or Random Forest can sort out whether it is useful or not.
  3. Baseball homework from last week – the Hitters baseball data had a number of variables, but some of them seem strange. For example, “lifetime hits” just rewards players who have been playing for a long time. We also have a variable that directly measures how long they have been playing. So it is not surprising if LASSO decides that only one of the time-dependent variables is important. What changes could make a more useful measurement?
    We will discuss this on Wednesday.
    If you are not familiar with baseball, read about measurements of “batting performance,” i.e. how hitters are traditionally measured. For example, a key statistic is batting average. Note that many key statistics are ratios. Looking back at your homework from last week, what would have been good additional variables beyond what was in the original data?
    http://m.mlb.com/glossary/standard-stats/batting-average
  4. For messing around. Google is sponsoring an open-source project for quick data exploration. This one seems to be able to display about 5 dimensions of data at once, by using:
    Colors
    Positions on x,y axes
    Facets on x, y axes (especially for categorical variables).
    I may give it a demo on Wednesday. It seems to be only for exploration, but exploration is well worth one to several hours of project time.
  5. Monday handouts — feature engineering,
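
Item 1 above says a model can be accurate but useless when one outcome is rare. The sketch below illustrates that with made-up numbers (990 legitimate transactions, 10 frauds) and then applies simple random oversampling. It is written in Python purely for illustration; the course itself works in R, and the function name `oversample_minority`, the `fraud` field, and the 50% target are all assumptions, not the textbook's or the paper's method.

```python
import random

def oversample_minority(rows, label_key, minority_label, target_fraction, seed=0):
    """Duplicate minority rows at random until they are about target_fraction of the data."""
    rng = random.Random(seed)
    minority = [r for r in rows if r[label_key] == minority_label]
    majority = [r for r in rows if r[label_key] != minority_label]
    if not minority:
        raise ValueError("no minority rows to oversample")
    # Solve n_min / (n_min + n_maj) = target_fraction for n_min.
    needed = int(round(target_fraction * len(majority) / (1.0 - target_fraction)))
    extra = [rng.choice(minority) for _ in range(max(0, needed - len(minority)))]
    balanced = majority + minority + extra
    rng.shuffle(balanced)
    return balanced

# Made-up data: 990 legitimate transactions and 10 frauds.
rows = [{"fraud": 0} for _ in range(990)] + [{"fraud": 1} for _ in range(10)]

# "Always predict the most common outcome" is 99% accurate and catches no fraud.
baseline_accuracy = sum(1 for r in rows if r["fraud"] == 0) / len(rows)

# Oversample the fraud cases until they are half of the data.
balanced = oversample_minority(rows, "fraud", 1, target_fraction=0.5)
fraud_share = sum(r["fraud"] for r in balanced) / len(balanced)

print(baseline_accuracy, len(balanced), fraud_share)
```

One caution that is common practice rather than something stated in the assignment: oversample only the training data, and measure performance on data that keeps the original mix of cases.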

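Item 3 above points out that many key baseball statistics are ratios. As a sketch of what that could look like, the Python snippet below (illustrative only; the course uses R) builds three ratio features from column names found in the Hitters data (`AtBat`, `Hits`, `Years`, `CAtBat`, `CHits`). The two players and the new feature names `BatAvg`, `CBatAvg`, and `CHitsPerYear` are made up for this example.

```python
def add_ratio_features(player):
    """Return a copy of one player's record with ratio-style features added."""
    out = dict(player)
    # Season batting average: hits per at-bat this season.
    out["BatAvg"] = player["Hits"] / player["AtBat"] if player["AtBat"] else 0.0
    # Career batting average: removes the advantage of simply having played longer.
    out["CBatAvg"] = player["CHits"] / player["CAtBat"] if player["CAtBat"] else 0.0
    # Career hits per year played.
    out["CHitsPerYear"] = player["CHits"] / player["Years"] if player["Years"] else 0.0
    return out

# Two made-up players: a long-serving veteran and a strong newcomer.
veteran = add_ratio_features(
    {"AtBat": 400, "Hits": 100, "Years": 15, "CAtBat": 6000, "CHits": 1500}
)
newcomer = add_ratio_features(
    {"AtBat": 500, "Hits": 150, "Years": 2, "CAtBat": 900, "CHits": 270}
)

for name, p in [("veteran", veteran), ("newcomer", newcomer)]:
    print(name, p["CHits"], round(p["CBatAvg"], 3), round(p["CHitsPerYear"], 1))
```

The veteran has far more lifetime hits, but the newcomer is ahead on both ratios, which is the kind of information a raw count hides.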
Handouts, resources

Cheat sheet on performance measurements: sensitivity, specificity, and many other measures (contingency table definitions). Keep this for reference.
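
As a companion to the cheat sheet, here is a minimal Python sketch (illustrative only; the course uses R) of how a few of these measures follow from the four cells of a 2x2 contingency table. The function names and the counts are made up for this example and are not taken from the handout.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None

def classification_measures(tp, fp, fn, tn):
    """Compute common measures from true/false positive and negative counts."""
    return {
        # Share of all cases classified correctly.
        "accuracy": safe_ratio(tp + tn, tp + fp + fn + tn),
        # Share of actual positives that the model catches.
        "sensitivity": safe_ratio(tp, tp + fn),
        # Share of actual negatives that the model correctly clears.
        "specificity": safe_ratio(tn, tn + fp),
        # Share of predicted positives that really are positive.
        "precision": safe_ratio(tp, tp + fp),
    }

# Made-up counts: 10 real positives among 1000 cases.
measures = classification_measures(tp=8, fp=30, fn=2, tn=960)
for name, value in measures.items():
    print(name, round(value, 3))
```

With these counts accuracy is high while precision is low, which is why a single measure is rarely enough when one outcome is rare.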

Handout: some key plots for exploring data. Specifically for Hitters data. R Graphics Cookbook – Excerpt

Lecture notes: The medical testing paradox and how to solve it. BDA18 feature engineering case study. The same lecture includes “Feature engineering for detecting credit card fraud”.
Credit card fraud – excerpt: an abbreviated version of the paper, which is all you need for this week. The original article is “Data mining for credit card fraud: A comparative study” by Siddhartha Bhattacharyya et al., and it includes their feature engineering solution. The full article is available here. (On-campus or VPN only.)
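
For the proposed-variables assignment, one family of candidate features summarizes a card's recent activity. The Python sketch below (illustrative only; the course uses R) computes two hypothetical features for each transaction: how many earlier transactions the same card made in the previous 24 hours, and their total amount. The field names `card_id`, `timestamp`, and `amount`, the 24-hour window, and the sample records are all assumptions; this is not the feature set used in the paper.

```python
from collections import defaultdict, deque

def add_velocity_features(transactions, window_seconds=24 * 3600):
    """Add per-card counts and totals of earlier transactions inside a time window."""
    recent = defaultdict(deque)  # card_id -> (timestamp, amount) pairs still in the window
    enriched_rows = []
    # One pass in time order, so the cost grows roughly linearly after sorting.
    for tx in sorted(transactions, key=lambda t: t["timestamp"]):
        window = recent[tx["card_id"]]
        # Drop earlier transactions that have fallen out of the window.
        while window and window[0][0] < tx["timestamp"] - window_seconds:
            window.popleft()
        enriched = dict(tx)
        enriched["prior_count_24h"] = len(window)
        enriched["prior_amount_24h"] = sum(amount for _, amount in window)
        enriched_rows.append(enriched)
        window.append((tx["timestamp"], tx["amount"]))
    return enriched_rows

# Made-up transactions; timestamps are seconds from an arbitrary start.
transactions = [
    {"card_id": "A", "timestamp": 0, "amount": 10},
    {"card_id": "B", "timestamp": 50, "amount": 7},
    {"card_id": "A", "timestamp": 3600, "amount": 20},
    {"card_id": "A", "timestamp": 100000, "amount": 5},
]
features = add_velocity_features(transactions)
for row in features:
    print(row["card_id"], row["timestamp"], row["prior_count_24h"], row["prior_amount_24h"])
```

Because each record is visited once after sorting, a feature of this kind is at least plausible to compute over tens of millions of records, which is the practical question the assignment asks you to consider.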

Explaining test results: article in Science. In class we applied this to catching terrorists in the country of Blabia. “Risk literacy in medical decision-making: How can we better represent the statistical structure of risk?” by Joachim T. Operskalski and Aron K. Barbey. Science, 22 April 2016, Vol. 352, Issue 6284.
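
The arithmetic behind the medical testing paradox is short enough to sketch. The Python snippet below (illustrative only; the course uses R) computes the chance that a person who tests positive actually has the condition, once with Bayes' rule and once by counting people in a population of 10,000. The prevalence, sensitivity, and specificity values are made up for this example and do not come from the article or the lecture.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """Probability of having the condition given a positive test (Bayes' rule)."""
    true_positive_rate = prevalence * sensitivity
    false_positive_rate = (1.0 - prevalence) * (1.0 - specificity)
    return true_positive_rate / (true_positive_rate + false_positive_rate)

# Made-up test: 1% of people have the condition, 90% sensitivity, 91% specificity.
prevalence, sensitivity, specificity = 0.01, 0.90, 0.91
ppv = positive_predictive_value(prevalence, sensitivity, specificity)

# The same calculation as head counts in a population of 10,000.
population = 10000
sick = population * prevalence                        # about 100 people
sick_and_positive = sick * sensitivity                # about 90 of them test positive
healthy = population - sick                           # about 9,900 people
healthy_and_positive = healthy * (1.0 - specificity)  # about 891 false alarms
ppv_from_counts = sick_and_positive / (sick_and_positive + healthy_and_positive)

print(round(ppv, 4), round(ppv_from_counts, 4))
```

Under these assumed numbers, most positive results are false alarms even though the test looks accurate, because healthy people vastly outnumber sick ones.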

Lecture notes, Wednesday: BDA18 Lecture feature engineering

Resources for Mining + R language

Lots of books and web sites to use.

[Updated 05/08/2018] There are a lot of good resources for the R programming language, and for data mining/machine learning/AI/BDA. There are video courses, books, reference sites, discussion boards, and plenty more. The single best place to look for resources is probably Computerworld, and its guide 60+ R resources to improve your data skills. Each has a few sentences of description. Another good place for information is the UCSD library, which has a well-organized UCSD Guide to Business Analytics, just as it has guides for political science, international studies, etc.

Remember that you don’t need to learn much R in order to use it for analytics. What you need in the course is enough to 1) glue together pieces of R code that do particular tasks, and 2) read guides to specialized topics, such as web scraping, text mining, or particular algorithms, that you need for an individual project.

Cheat Sheets

Good reference books about R

Several books and web sites contain “recipes,” meaning chunks of R code to do particular tasks. These are big time-savers, although they are not a good way to learn the language. Everyone should get at least one of these, as e-book, physical book, or permanent bookmark in your notes! Here are a few:

http://proquest.safaribooksonline.com/book/programming/r/9780596809287

  • Data Wrangling with R – on the library’s Springerlink site for complete downloading. Springerlink.com, but only from VPN or on campus.
  • There are several other good books, but they are expensive. If you are not on a budget, ask me.
  • The tidyverse, a set of new tools for file and data manipulation. Much more efficient than raw R, and faster to write code with. Chapter 5 is probably the place to start. The book is available, but the same material is on a web site: http://r4ds.had.co.nz/

 

Library downloadable books on Data Mining using R

These books are the ones to study when you want to learn a data mining technique. All of them are textbooks about machine learning, and all use R as the primary language. The earlier books are about the R language, and are written as reference books.

    Rattle, the second textbook for BDA. Used because it has an easy interface. http://link.springer.com/book/10.1007/978-1-4419-9890-3

 ISLR = An Introduction to Statistical Learning, with Applications in R. http://link.springer.com/book/10.1007/978-1-4614-7138-7 Course textbook #3. More theoretical than the other books in this list, it has good explanations of how and why important algorithms work.

 ggplot2 http://link.springer.com/book/10.1007%2F978-0-387-98141-3

The main graphics system we use. This book was written by ggplot2’s developer, and covers the early version of the software. A new edition is due out in 2018.

  http://link.springer.com/book/10.1007/978-1-4419-1318-0

If you know Stata and are learning R, this book is good for looking up “how do I do that?”

http://link.springer.com/book/10.1007%2F978-3-319-12066-9

A short book that covers the basics of data mining, with everything written in R 

R for specific kinds of analysis (networks, GIS, marketing, ….)

Springerlink publishes a series of more than 60 books on different uses of R: https://link.springer.com/bookseries/6991 They are at the intermediate level, about right for refining your knowledge of special techniques needed for a project. Examples: Spatial analysis in R, R for Market Research, Data Wrangling with R, ggplot2 (several books), Political analysis using R, Analyzing Networks using R, Phylogenetics with R, etc. All of them are free to download, or you can buy them as paperback books for $25.

Because the field is so trendy, practically every business and textbook publisher has books on data mining and related topics. You can search them through the UCSD book catalog, UCSD.worldcat.org. For example, here are 2000+ e-books about ‘Machine Learning’. That is not a misprint, and all are available through UCSD in some form.

Last, there are literally dozens of books about R/statistics written for a particular audience, or exploring a particular applied statistics topic. The list below gives books I have found especially relevant to this course. Note that many of them are specifically for reference: when you need to do something specific, look it up in one of these books. Others are intended for learning from.

For searching on your own, e.g. on Google Scholar, good phrases are “data mining”, “machine learning”, “data analytics” (broader), “data science”, and specific topics for your application, such as “fraud detection”. Use quotation marks around these phrases! Finally, there are many 20 to 50 page articles that cover the basics of particular R topics. These are often more up to date than books, and are better ways to get a start on a topic.

Mining Text Data    

R for Marketing Research and Analytics

A User’s Guide to Network Analysis in R

Statistical Analysis of Network Data with R

Applied Spatial Data Analysis with R (Geographic Info systems)

Graphical Models with R

Six Sigma with R

Introductory Time Series with R

Applied Econometrics with R

Nonlinear Regression in R

Data Manipulation with R