Wk 8: Feature engineering, other topics

Wednesday May 23
Assignment – Oversampling, feature engineering

  1. Read main textbook section 5.5 on oversampling. The credit card fraud case used extreme oversampling of the fraud cases, because there were so few in the original data. Most real classification problems have uneven numbers of each class or uneven costs of errors, so oversampling is frequently important. Without it, or some other way to accomplish the same thing, you can get a model that is accurate but useless, such as one that predicts that every case is the most common outcome.
  2. We will continue discussing and practicing feature engineering.
    Assignment: Email me 5 to 10 proposed new variables for the credit card fraud case. The actual case is posted above, and in my lecture notes. You can see what variables the authors decided to use. Devise some additional/better ones. Can they be calculated from the available 44 million transaction records?

    1. Notice that you do not have to know what effect a variable will have in order to decide that it’s a potentially good feature. A LASSO or Random Forest can sort out whether it is useful or not.
  3. Baseball homework from last week – the Hitters baseball data had a number of variables, but some of them seem strange. For example, “lifetime hits” just rewards players who have been playing for a long time. We also have a variable that directly measures how long they have been playing. So it is not surprising if LASSO decides that only one of the time-dependent variables is important. What changes could make a more useful measurement?
    We will discuss this on Wednesday.
    If you are not familiar with baseball, read about measurements of “batting performance,” i.e. how hitters are traditionally measured. For example, a key statistic is batting average. Note that many key statistics are ratios. Looking back at your homework from last week, what would have been good additional variables beyond what was in the original data?
  4. For messing around. Google is sponsoring an open-source project for quick data exploration. This one seems to be able to display about 5 dimensions of data at once, by using:
    Positions on x,y axes
    Facets on x, y axes (especially for categorical variables).
    I may give it a demo on Wednesday. It seems to be only for exploration, but exploration is well worth one to several hours of project time.
  5. Monday handouts: feature engineering.
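
To make the oversampling point in item 1 concrete, here is a minimal sketch in Python (the course itself uses R). The data are invented for illustration: 990 legitimate and 10 fraudulent transactions. The function names `majority_baseline_accuracy` and `oversample_minority` are hypothetical, not from the textbook. It shows how a model that always predicts the most common class can look accurate, and one simple way to rebalance the training set by duplicating minority cases at random.

```python
import random
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy of a 'model' that always predicts the most common class."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


def oversample_minority(rows, labels, minority_label, target_fraction=0.5, seed=0):
    """Duplicate minority rows at random (sampling with replacement) until
    they make up about target_fraction of the returned training set."""
    rng = random.Random(seed)
    minority = [(r, y) for r, y in zip(rows, labels) if y == minority_label]
    majority = [(r, y) for r, y in zip(rows, labels) if y != minority_label]
    if not minority:
        raise ValueError("no minority cases to oversample")
    # Minority rows needed so that minority / total is about target_fraction.
    needed = round(target_fraction * len(majority) / (1 - target_fraction))
    extra = [rng.choice(minority) for _ in range(max(0, needed - len(minority)))]
    combined = majority + minority + extra
    rng.shuffle(combined)
    return [r for r, _ in combined], [y for _, y in combined]


# Hypothetical toy data: 990 legitimate transactions, 10 fraudulent ones.
labels = ["ok"] * 990 + ["fraud"] * 10
rows = list(range(len(labels)))  # stand-ins for real feature rows

baseline = majority_baseline_accuracy(labels)  # high accuracy, catches no fraud
bal_rows, bal_labels = oversample_minority(rows, labels, "fraud")
balanced_counts = Counter(bal_labels)
print(baseline, balanced_counts)
```

Oversampling of this kind is normally applied to the training partition only. The validation and test partitions should keep the original class proportions, so that measured performance reflects the real data.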

Handouts, resources

Cheat sheet on performance measurements: Sensitivity, Specificity, and many other measures. contingency table defns SAVE Keep this for reference.
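
As a companion to the cheat sheet, the following Python sketch computes a few of the standard measures from the four cells of a 2x2 contingency table. The counts are invented for illustration, and the function name `classification_measures` is hypothetical.

```python
def classification_measures(tp, fp, fn, tn):
    """Standard measures from a 2x2 contingency (confusion) table.

    tp: actual positives predicted positive
    fp: actual negatives predicted positive
    fn: actual positives predicted negative
    tn: actual negatives predicted negative
    """
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # share of actual positives caught
        "specificity": tn / (tn + fp),  # share of actual negatives cleared
        "precision": tp / (tp + fp),    # share of positive calls that are right
    }


# Hypothetical counts: 10 actual fraud cases among 1,000 transactions.
measures = classification_measures(tp=8, fp=30, fn=2, tn=960)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

With these made-up counts, accuracy is high while precision is low, which is why accuracy alone is a poor summary when the classes are uneven.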

Handout: some key plots for exploring data. Specifically for Hitters data. R Graphics Cookbook – Excerpt

Lecture notes: The medical testing paradox and how to solve it. BDA18 feature engineering case study. The same lecture includes “Feature engineering for detecting credit card fraud”.
An abbreviated version of the paper, which is all you need for this week. “Data mining for credit card fraud: A comparative study” by Siddhartha Bhattacharyya et al. Credit card fraud – excerpt.  The original article for credit card fraud detection. Includes their feature engineering solution. The full article is available here. (On-campus or VPN only.)

Explaining test results: article in Science. In class we applied this to catching terrorists in the country of Blabia. Risk literacy in medical decision-making: How can we better represent the statistical structure of risk?   By Joachim T. Operskalski  and Aron K. Barbey.  Science, 22 APRIL 2016 • VOL 352 ISSUE 6284. Risk literacy in medical decision-making
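
The arithmetic behind the testing paradox can be sketched in a few lines of Python. The numbers below (a condition affecting 1 person in 1,000, and a test with 99% sensitivity and 99% specificity) are assumptions chosen for illustration. They are not taken from the Science article or from the class example.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """Probability that a positive test result is a true positive (Bayes' rule)."""
    true_pos = prevalence * sensitivity
    false_pos = (1 - prevalence) * (1 - specificity)
    return true_pos / (true_pos + false_pos)


# Assumed values, for illustration only.
ppv = positive_predictive_value(prevalence=0.001, sensitivity=0.99, specificity=0.99)
print(f"Chance that a positive result is real: {ppv:.1%}")

# The same calculation as natural frequencies, per 100,000 people tested:
population = 100_000
affected = population * 0.001                  # 100 people have the condition
true_positives = affected * 0.99               # 99 of them test positive
false_positives = (population - affected) * 0.01  # 999 healthy people also test positive
```

Under these assumptions, fewer than one in ten positive results is real, even though the test is "99% accurate", because the unaffected group is so much larger than the affected group.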

Lecture notes, Wednesday   BDA18 Lecture feature engineering
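
One way to act on the baseball question above is to turn cumulative totals into rates. The Python sketch below assumes the column names `CHits`, `CAtBat`, and `Years` from the ISLR Hitters data. The two player records are invented, and the new feature names are hypothetical.

```python
def add_ratio_features(player):
    """Return a copy of a player record with rate-based features added."""
    out = dict(player)
    # Career batting average: hits per at-bat, independent of career length.
    out["career_batting_avg"] = (
        player["CHits"] / player["CAtBat"] if player["CAtBat"] else 0.0
    )
    # Hits per season: removes the reward for simply having played longer.
    out["hits_per_year"] = player["CHits"] / player["Years"] if player["Years"] else 0.0
    return out


# Hypothetical players: a long-serving veteran and a newer, better hitter.
veteran = add_ratio_features({"CHits": 2000, "CAtBat": 8000, "Years": 16})
rookie = add_ratio_features({"CHits": 300, "CAtBat": 1000, "Years": 2})
print(veteran["career_batting_avg"], rookie["career_batting_avg"])
```

On raw lifetime hits the veteran looks far stronger. On both ratios the newer player comes out ahead, which is the kind of information a ratio feature can expose to LASSO.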

Wk. 7 LASSO and Regularization

  1. Readings and homework assignment on LASSO. BDA18 HW Lasso May 14-18e  To be done in class on Wednesday May 16. The data file is here. Be ready with your set of reference books, old code, and cheat sheets. Note the new method called cross-validation. Most algorithms we use have cross-validation functions pre-written for them.
    • Hand in a memorandum version of this assignment on Friday. Also hand in some additional notes on R. Details are on TritonEd as usual.
  2. Lecture notes from Monday. RB Lasso lecture 2018
  3. Lecture notes for Wednesday. Most of Wednesday was devoted to coding the homework problem. These notes include: list of functions, some material from ISLR Chapter 6 (best description of LASSO), discussion of debugging strategies. Notes for wk 7 LASSO
  4. I will not be providing complete R code for the homework.
  5. NEW!  How to write a great final report. This includes a suggested table of contents, a checklist, the grading template I use for final reports, and other information. BDA18 Writing your final report
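
To show what cross-validation does when it chooses a LASSO penalty, here is a deliberately tiny Python sketch. It assumes one predictor, no intercept, and synthetic data. In that special case the LASSO solution has a closed form (soft-thresholding), so no library is needed. All function names are hypothetical. For the homework in R you would normally use a pre-written function such as `cv.glmnet` from the glmnet package rather than code like this.

```python
import random


def soft_threshold(z, lam):
    """Shrink z toward zero by lam, and set it to zero if |z| <= lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def fit_lasso_1d(xs, ys, lam):
    """Minimize (1/2n) * sum((y - b*x)^2) + lam * |b| for one predictor."""
    n = len(xs)
    sxy = sum(x * y for x, y in zip(xs, ys)) / n
    sxx = sum(x * x for x in xs) / n
    return soft_threshold(sxy, lam) / sxx


def k_fold_indices(n, k, seed=0):
    """Shuffle row indices and deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cv_error(xs, ys, lam, k=5, seed=0):
    """Mean squared prediction error on held-out folds for one penalty value."""
    sq_err = 0.0
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        b = fit_lasso_1d([xs[i] for i in train], [ys[i] for i in train], lam)
        sq_err += sum((ys[i] - b * xs[i]) ** 2 for i in held_out)
    return sq_err / len(xs)


# Synthetic data: y is roughly 2*x plus noise.
rng = random.Random(1)
xs = [rng.gauss(0, 1) for _ in range(100)]
ys = [2.0 * x + rng.gauss(0, 0.5) for x in xs]

lambdas = [0.0, 0.1, 0.5, 1.0, 3.0]
errors = {lam: cv_error(xs, ys, lam) for lam in lambdas}
best_lambda = min(errors, key=errors.get)
print(errors, best_lambda)
```

The key idea is that each penalty value is judged on rows that were not used to fit the coefficient, so a penalty that shrinks too much or too little shows up as a larger held-out error.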

Week 6 Notes Text Mining

This page is the weekly summary of relevant material for assignments, projects, etc. Edited May 8.

  1. Projects: all projects should have downloaded live data by this weekend. If you have trouble with that deadline, it is time to drastically prune your goals. You can always add more elaborate analysis after you do the basics, but the opposite is not true.
    • If you have not submitted a project update last weekend (nothing since #3, April 29), come to Wednesday night office hours, or text me to suggest another time.
    • Every team should expect to meet with me at least twice after your basic project is approved. The first meeting usually discusses whether you have formulated the problem in a way that is solvable with the data you have. Roughly 30% of the projects need to change the problem statement at this stage, which is easier than getting more data.
    • The second meeting generally looks at your results, and finds ways to boost them / make them more interesting.
  2. The next homework is due Friday May 11. It is problem 20.3 from the main DMBA textbook. Data files are available on the book’s website. Details are on the linked assignment from the syllabus page. Latest syllabus, assignments, + notes for #BDA Big Data Analytics at UC San Diego
    1. Error in file name! The file AutoElectronics.zip has been scrambled by the book authors. On their web site they call it AutoAndElectronics.zip. (They have yet a third name in the textbook!!) You can DL it from them at http://www.dataminingbook.com/system/files/AutoAndElectronics.zip or at the course Google page under either name.
  3. Key resource list: Reference books on R and on specialized data mining methods. Resources for Mining + R language. Information to solve 98% of your R and data mining algorithm problems is available from this page! (You still need to figure out how to formulate your business problem as Data Mining. Many of the references give advice about this, but it is not reducible to a “cookbook.”)
  4. Text-mining specific resources. Text-mining resources for projects
  5. Check out the seminar next Monday on careers. Careers in Data Analytics discussion Monday May 14, 12:30
  6. One new book is particularly worth checking out if you are struggling with messy data sets. It discusses packages that have been developed specifically for common data manipulation problems in machine learning/data mining. Several of Feiyang’s tutorials use this book.
    R for Data Science: covers the tidyverse, a set of new tools for file and data manipulation. Much more efficient than raw R, and faster to write code with. Chapter 5 is probably the place to start. This book is available for $20 or from the library, but the same material is on a web site: http://r4ds.had.co.nz/
  7. Weekly notes. Text-Mining Bohn Day 1+2.
  8. The file we used for a tm tutorial on Monday. Called Basic Text Mining in R. It uses slightly different techniques than the textbook, such as no Latent Semantic Analysis, but instead taking out the rare words.
  9. Files for Wednesday’s class, section 20.5
    1. May 9 list of functions. Keep this as a “cheat sheet,” and add to it as you read and do mining.
    2. Assignment: Text mining #2 2018b
    3. Lecture notes. Text-Mining Bohn Day 1+2
  10. The R files cannot be stored on WordPress. Get them on the course’s Google drive page. You can reconstruct my work with two files:
    • Files ending in .R contain the code which I wrote and kept in RStudio’s Source window (upper left). This code can be run. Note: if I ran additional code but did not save it in the Source file, it appears in History but not in Source.
    • Files ending in .RData contain the entire Global Environment of variables.
    • By loading both of these into RStudio, you can recreate the status at the time I saved them. Thus all calculations are preserved.
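
The "take out the rare words" step from item 8 can be sketched as follows. This is a Python illustration, not the Monday tutorial's R code. The four short documents are invented, the function names are hypothetical, and the threshold of two documents is an arbitrary choice. In R, the tm package offers `removeSparseTerms` for a similar purpose.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def document_frequencies(docs):
    """Count, for each term, how many documents contain it at least once."""
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return df


def term_document_matrix(docs, min_doc_freq=2):
    """Build per-document term counts, keeping only terms that appear in at
    least min_doc_freq documents (rare terms are dropped)."""
    df = document_frequencies(docs)
    vocab = sorted(t for t, n in df.items() if n >= min_doc_freq)
    matrix = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        matrix.append([counts[t] for t in vocab])
    return vocab, matrix


# Hypothetical posts, loosely in the spirit of the auto vs. electronics data.
docs = [
    "Brake pads squeal when the car stops",
    "New brake pads fixed the car",
    "The laptop battery drains fast",
    "Replaced the laptop battery yesterday",
]
vocab, matrix = term_document_matrix(docs)
print(vocab)
for row in matrix:
    print(row)
```

Dropping rare terms shrinks the number of columns sharply. That reduction is the job Latent Semantic Analysis does in the textbook's approach, although the two techniques work very differently.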

Week 5: Random Forests, R, debugging

The ostensible material this week is Random Forests. They are a generalization of Classification/Regression Trees. No assignment for Monday; everything will be done in class. Redo the class material and hand it in on Wednesday (not Friday – we will go over the results in class.)

Here is the assignment for Wednesday, including the readings on Random Forests.  BDA18 Random Forest Assign May 2, 2018.

Here is the material used in class. BDA18 Random Forests2018B

The other agenda this week is to develop your skills in debugging. This is the key skill of writing code: figuring out what is wrong and how to fix it.

For Week 6, text mining. Here is the assignment for next Monday BDA18 Text mining #1 assign. (Short assignment handed in Sunday night.)

Homework week 4: Linear regression

This week we have 3 learning goals. It will take the entire week to do them.

  1. Linear regression for prediction. How it differs from hypothesis testing.
  2. Showing how to use R instead of, or in conjunction with, Rattle.
  3. Many specific tricks and issues that come up with linear regression, such as word equations and creating interaction variables.
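
As a small illustration of the last goal, the Python sketch below builds a new variable as the product of two existing ones. The car records, the column names, and the helper `add_interaction` are all hypothetical. In the course itself this would be done in R.

```python
def add_interaction(rows, a, b, name=None):
    """Return copies of the rows with a new column equal to row[a] * row[b]."""
    name = name or f"{a}_x_{b}"
    return [dict(row, **{name: row[a] * row[b]}) for row in rows]


# Hypothetical used-car records; is_diesel is a 0/1 dummy variable.
cars = [
    {"age_years": 2, "is_diesel": 1, "km": 40000},
    {"age_years": 5, "is_diesel": 0, "km": 90000},
]
cars_with_interaction = add_interaction(cars, "age_years", "is_diesel")
for car in cars_with_interaction:
    print(car)
```

A product of a numeric variable and a 0/1 dummy lets a linear regression fit a different slope for each group, here a separate effect of age for diesel cars.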

Please see the attached document, which includes the readings, specific homework due Friday, and supplemental information about various useful ideas and techniques.

BDA18 Week 4 Readings + assign

You can get  data files here:  https://bda2020.wordpress.com/data-sets/

There is nothing due on Monday.


Comments on early project proposals

Thanks to everyone who submitted these project proposals on time. I have returned half (Saturday night); the others are coming. A few are still missing.

Here is some general advice that applies to some proposals. If you already know it, please help other BDA students with it.

  1. For this course, all projects need to do prediction or classification of individual-level characteristics of some kind. You have already had numerous courses where the goal is to test a hypothesis or to measure the strength of the causal relationships among variables. These are important, but in BDA we are learning a new skill, and indeed a new purpose for analyzing quantitative data. Mathematically, the two are closely related, but prediction requires a different mind-set. We will talk more about this in the weeks to come.
  2. Internet search: Everyone knows how to do a simple search on Google. But that is a low skill level. It is possible to do research far better in 2018. Before you leave UCSD, get good at it. Here are some simple questions to test yourself.
    1. What is the difference between searching for Yelp reviews and “Yelp Reviews” (with or without quotes and capital letters)?
    2. Find pages that mention dates between 1985 and 1995. (Not that were written then, but that contain those dates.)
    3. Almost everything on Google Scholar can be found in a regular Google search. Why, then, is it so useful to search on Google Scholar instead of (or sometimes in addition to) Google.com?
    4. What extra benefits do you get on Google Scholar when you are on campus or VPNed to campus?**
      Where on this listing should you click to get the paper most easily? (Screenshot: 2018-04-14_20-48-57)
    5. Name two good databases that have massive collections of non-public business-related papers, reports, magazines, etc. Material that is not available through either Google or Google Scholar.
  3. A few people forgot the syllabus request that all written work be turned in with a professional appearance, as if you were in a company or organization. I got more than one file called “proposal.pdf.”
    I have therefore updated and reposted the syllabus section on homework and reports. BDA18 Syllabus = HW formats + grading 2018-04-14. Parts of it, such as discussion of figures, are overkill for these early project descriptions.



** This one is pretty obscure. The answer is in the picture. If you are on campus, you get the UC-eLinks entry. For recent published papers, that is often the only way to download them.

Week 2 #BDA assignments

Feiyang and I are working to get everyone up to speed for next week. Here is a list of items to be aware of, in no particular order. I will get this material organized better over the weekend. As always, you can post comments on this message if you have questions.

  1. Rattle now works on the Mac! see Installing Rattle on Mac
  2. The homework for Monday does use Rattle. Do it in teams, and if only one of you has a working version of Rattle that is ok as long as you physically work together. In class, I will call on someone randomly for your solutions.
  3. You can find the homework at the end of the syllabus. Currently, it is version 1.05, but I expect to revise it late today. Latest syllabus, assignments, + notes. The Monday assignment is on page 12, or search for “Early Assignments”. Update: individual assignments are also in TritonEd.
  4. The assignment due Tuesday night takes some people a long time because it requires using R, Rattle, and the first data mining algorithm, called CART. It also asks you to do some data manipulation. So set aside time, and do it with a teammate. Homeworks are due 11pm.
  5. To get the course information immediately, subscribe to this web site. Look for the subscribe button on the bottom right. (BDA2020.wordpress.com)
  6. Feiyang can provide assistance with Rattle by email (and then phone etc.). When asking for computer assistance with problems, provide basic information for debugging. Do not say “it didn’t work,” unless you don’t need any assistance. The more complete, the better. Give the exact error message. It is even ok to copy and paste an entire stream of activities and resulting error messages into the bottom of an email. She won’t read it all, but it gives important clues.
  7. I have not confirmed this, but she should be available before or after class on Monday for anyone still having trouble with Rattle.
  8. For R in general, get in the habit of googling error messages. Often this will send you to the Stack Overflow site.
    • There are various tricks involved in googling errors, which I will discuss in class.
    • A few newcomers to UCSD may not know how to do compound searches on Google. This is a basic life skill.

      “Data Mining” Rattle “text of error message” 

      is a much better search than
      Data Mining Rattle text of error message.  Why?

    • To start with, use Google’s Advanced Search page.
    • Search tips for dates, for example: https://www.makeuseof.com/tag/6-ways-to-search-by-date-on-google/
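
The compound search in item 8 can also be built programmatically. The Python sketch below puts any multi-word term in double quotes, so that it is searched as an exact phrase, and produces a Google search URL. The helper name `build_search_url` is hypothetical. The error-message text is the placeholder from the example above.

```python
from urllib.parse import urlencode


def build_search_url(terms):
    """Quote multi-word terms as exact phrases and build a Google search URL."""
    parts = [f'"{t}"' if " " in t else t for t in terms]
    query = " ".join(parts)
    return "https://www.google.com/search?" + urlencode({"q": query})


url = build_search_url(["Data Mining", "Rattle", "text of error message"])
print(url)
```

Without the quotes, each word is matched separately and in any order, so the results drift away from the exact error message you are trying to find.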

Latest syllabus, assignments, + notes for #BDA Big Data Analytics at UC San Diego

This page links to the latest versions of course material. Some PDF, some HTML. Update May 29, 2018

Lecture Notes (chronological order)

  1. BDA18-D3 Chap9_CART RB.  For the class of April 9, on CART
  2. BDA18 Class 4 Lecture notes Toyota  For the class of April 11, on CART + Toyota
  3. Logistic Regression 2018  Class of April 16 on classification using linear models aka logistic regression.
  4. Class of April 18 on linear categorical models aka logistic regression. BDA18 illustration of Rattle use 04-18
  5. Notes on Linear Regression, Week 4, April 23, 25. BDA18 regression 04/25.pdf. BDA18 regression slides 4-23. Use primarily the April 25 version; 4/23 has a few additional slides.
  6. How to go from Rattle to R. BDA18 Rattle to R code 4-25.pdf
  7. Lecture Notes Week 5 Random Forests. BDA18 Random Forests2018B
  8. Lecture Notes Week 6 Text Mining, Day 1
    Tutorial worked through in class. Basic Text Mining in R 2017 version
  9. Week 6 Text mining #2 2018b 
  10. Week 7 LASSO, Monday May 14.
  11. Week 8  lecture notes. Monday May 21. BDA18 feature engineering case study

Advice, tutorials, reference books, other useful material

Special topics – for specific papers

The Big Data Analytics course introduces data mining with techniques and concepts that are broadly applicable. Individual topics and projects have specific techniques, needs, and resources. In keeping with the theme “Borrow and re-use, don’t invent anything yourself,” here are some resources that are especially suited to particular topics.

Don’t forget to try the site’s Search window (usually near the upper right) to look up possible keywords. Many of these topics also have entire books about them, such as on Springerlink.

Other links:

Google folder for the course. There you will find all datasets for the textbook.

The official textbook web site is http://www.dataminingbook.com/book/r-edition
Once you register, you can get these datasets, and the R Code. (It’s better to type the R Code by hand, the first time.)

Contact Information

Personal web site: Art2science.org