Additional tips; turning in report

(updated 11am June 12) Where to turn in your paper: You may put it under the door of my office (room 1315, 3rd floor), or in my faculty mailbox. Submit the PDF version on TritonEd, in the TurnitIn link.

Improving the course:  I think I have approximately the right mix of topics in Big Data Analytics, but there are other areas that I would like to improve. For example, how can I better smooth the workload in the projects, to reduce the end-of-year crunch? Here is a BDA End-of-year questionnaire that asks a series of questions about what I should change.

Thanks for taking the class – I certainly enjoy teaching it.  Enjoy graduation! Enjoy life after the university!

===========================

We are finished with formal classes, so I will post here some additional advice for your projects. Most of this is based on things I noticed either in interim reports or in the final presentations on June 6. It’s going to take a few days to write and edit all of these notes, so keep checking this page through Saturday.

Some of this advice was written with one specific project or team in mind, but all of it applies to multiple projects.

  1. Don’t use loops in R. Most computer languages make heavy use of FOR loops. R avoids almost all uses of loops, and it runs faster and is easier to write and debug without them. Here is an example of how to rewrite code without needing a loop, taken from one of this year’s projects. In some cases, R code that avoids loops will run 100x faster (literally). BDA18 Avoid loops in R
  2. Don’t use CSV data when working with large datasets. CSV (comma-separated values) files have become a lingua franca for exchanging data among different computer languages and environments. For that purpose, they are decent. But they are very inefficient, in terms of both speed and memory use. One team mentioned that they were running into memory limits, but the problem was most likely due to their keeping CSV files around! Solution: Use CSV files with read.csv to get data into R. But after that, store your data as R objects (dataframes or some other kind). If you want to store intermediate results in a file, create the object inside R, then use RStudio to save it as an R object. (File name ends in .RData.) When you want it again, use the RStudio File/Open File command to load it. No additional conversion will be needed.
    I will add something about this to the notes on Handling Big Data. Dealing with the “big” in Big Data.
  3. Fix unbalanced accuracy in confusion matrices. Yesterday, I noticed several confusion matrices with much higher accuracy for one case than the other. Reminder: We spent 1.5 classes on this topic, and there are multiple solutions. It’s usually due to having much more data of one type than the other. See:
    1. Week 8 assignments and notes
  4. Good graphics. Because graphics are a concise way to communicate, even with non-specialists, I recommend having at least one superb image in every report. (See BDA18 Writing your final report.) I will try to post some examples from your projects on Friday, with comments on how to make them even better.
  5. Revised discussion on how to write a good final report. Writing your final report June 8
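As a small illustration of tip 1 (avoiding loops), here is a minimal sketch in base R. The data and variable names are made up for this example and are not taken from any class project. It flags purchases over $100 twice, once with a FOR loop and once with a single vectorized comparison, and checks that the two results agree.

```r
# Hypothetical data: 100,000 purchase amounts (not from any class project)
set.seed(1)
amount <- runif(1e5, min = 0, max = 200)

# Loop version: test one element at a time
flag_loop <- logical(length(amount))
for (i in seq_along(amount)) {
  flag_loop[i] <- amount[i] > 100
}

# Vectorized version: one comparison applied to the whole vector at once
flag_vec <- amount > 100

# Both approaches give the same answer
identical(flag_loop, flag_vec)
```

The vectorized line is shorter and has no index bookkeeping to debug. To compare speed on your own machine, wrap each version in system.time().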

Wk 8: Feature engineering, other topics

Wednesday May 23
Assignment – Oversampling, feature engineering

  1. Read main textbook section 5.5 on oversampling. The credit card fraud case was an example where they used extreme oversampling of the fraud cases, because there were so few in the original data. Most real classification problems have uneven numbers of each case or uneven costs of errors. Therefore, oversampling is frequently important. Without it, or some other way to accomplish the same thing, you can get a model that is accurate but useless, such as predicting that all cases are the most common outcome.
  2. We will continue discussing and practicing feature engineering.
    Assignment: Email me 5 to 10 proposed new variables for the credit card fraud case. The actual case is posted above, and in my lecture notes. You can see what variables the authors decided to use. Devise some additional/better ones. Can they be calculated from the available 44 million transaction records?

    1. Notice that you do not have to know what effect a variable will have in order to decide that it’s a potentially good feature. A LASSO  or Random Forest can sort out whether it is useful, or not.
  3. Baseball homework from last week – the Hitters baseball data had a number of variables, but some of them seem  strange. For example, “lifetime hits” just rewards players who have been playing for a long time. We also have a variable that directly measures how long they have been playing. So it is not surprising if LASSO decides that only one of the time-dependent variables is important. What changes could make a more useful  measurement?
    We will discuss this on Wednesday.
    If you are not familiar with baseball, read about measurements of “batting performance,” i.e. how hitters are traditionally measured. For example, a key statistic is batting average. Note that many key statistics are ratios. Looking back at your homework from last week, what would have been good additional variables beyond what was in the original data?
    http://m.mlb.com/glossary/standard-stats/batting-average
  4. For messing around. Google is sponsoring an open-source project for quick data exploration. This one seems to be able to display about 5 dimensions of data at once, by using:
    Colors
    Positions on x,y axes
    Facets on x, y axes (especially for categorical variables).
    I may give a demo of it on Wednesday. It seems to be only for exploration, but exploration is well worth one to several hours of project time.
  5. Monday handouts — feature engineering,
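The oversampling idea in item 1 can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the data frame, its column names, and the 2% fraud rate are hypothetical, and the textbook's own procedure in section 5.5 may differ in detail. The sketch draws the rare class with replacement until both classes are the same size.

```r
set.seed(42)

# Hypothetical training data: 1,000 transactions, 2% of them fraud
train <- data.frame(
  amount = round(runif(1000, min = 1, max = 500), 2),
  fraud  = c(rep(1, 20), rep(0, 980))
)

fraud_rows <- which(train$fraud == 1)
legit_rows <- which(train$fraud == 0)

# Oversample: draw fraud rows with replacement until the classes match
extra    <- sample(fraud_rows, size = length(legit_rows), replace = TRUE)
balanced <- train[c(legit_rows, extra), ]

table(train$fraud)     # class counts before oversampling
table(balanced$fraud)  # class counts after oversampling
```

Oversampling is normally applied to the training data only. The validation data should keep the original class proportions, so that measured performance reflects what the model will face in use.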

Handouts, resources

Cheat sheet on performance measurements: Sensitivity, Specificity, and many other measures. contingency table defns SAVE Keep this for reference.
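For a quick reminder of how the main measures on the cheat sheet are computed, here is a minimal sketch in R. The four counts are hypothetical numbers chosen for illustration, not results from the class.

```r
# Hypothetical confusion-matrix counts (made up for illustration)
TP <- 40   # actual positive, predicted positive
FN <- 10   # actual positive, predicted negative
FP <- 30   # actual negative, predicted positive
TN <- 920  # actual negative, predicted negative

sensitivity <- TP / (TP + FN)                   # 40 / 50
specificity <- TN / (TN + FP)                   # 920 / 950
precision   <- TP / (TP + FP)                   # 40 / 70
accuracy    <- (TP + TN) / (TP + FN + FP + TN)  # 960 / 1000

round(c(sensitivity = sensitivity, specificity = specificity,
        precision = precision, accuracy = accuracy), 3)
```

With these made-up counts, overall accuracy is 96% even though one positive case in five is missed. That is why accuracy alone can mislead when one class is much larger than the other.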

Handout: some key plots for exploring data. Specifically for Hitters data. R Graphics Cookbook – Excerpt

Lecture notes: The medical testing paradox and how to solve it. BDA18 feature engineering case study   In the same lecture is  “Feature engineering for detecting credit card fraud”.
An abbreviated version of the paper, which is all you need for this week. “Data mining for credit card fraud: A comparative study” by Siddhartha Bhattacharyya et al. Credit card fraud – excerpt.  The original article for credit card fraud detection. Includes their feature engineering solution. The full article is available here. (On-campus or VPN only.)

Explaining test results: article in Science. In class we applied this to catching terrorists in the country of Blabia. Risk literacy in medical decision-making: How can we better represent the statistical structure of risk?   By Joachim T. Operskalski  and Aron K. Barbey.  Science, 22 APRIL 2016 • VOL 352 ISSUE 6284. Risk literacy in medical decision-making
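The arithmetic behind the testing paradox is easy to check by counting people rather than juggling probabilities. The sketch below uses hypothetical rates (1% prevalence, 90% sensitivity, 9% false positive rate). They are not the numbers from the lecture or the article.

```r
# Hypothetical rates, chosen only for illustration
prevalence     <- 0.01   # share of the population that has the condition
sensitivity    <- 0.90   # P(test positive | condition)
false_pos_rate <- 0.09   # P(test positive | no condition)
population     <- 10000

# Count people in each group
sick      <- population * prevalence   # 100 people
healthy   <- population - sick         # 9,900 people
true_pos  <- sick * sensitivity        # 90 people
false_pos <- healthy * false_pos_rate  # 891 people

# Of everyone who tests positive, what share actually has the condition?
ppv <- true_pos / (true_pos + false_pos)  # 90 / 981
round(ppv, 3)
```

Under these assumptions fewer than one positive result in ten is a true case, because the false positives from the large healthy group swamp the true positives from the small sick group.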

Lecture notes, Wednesday   BDA18 Lecture feature engineering

Wk. 7 LASSO and Regularization

  1. Readings and homework assignment on LASSO. BDA18 HW Lasso May 14-18e  To be done in class on Wednesday May 16. The data file is here. Be ready with your set of reference books, old code, and cheat sheets. Note the new method called cross-validation. Most algorithms we use have cross-validation functions pre-written for them.
    • Hand in a memorandum version of this assignment on Friday. Also hand in some additional notes on R. Details are on TritonEd as usual.
  2. Lecture notes from Monday. RB Lasso lecture 2018
  3. Lecture notes for Wednesday. Most of Wednesday was devoted to coding the homework problem. These notes include: list of functions, some material from ISLR Chapter 6 (best description of LASSO), discussion of debugging strategies. Notes for wk 7 LASSO
  4. I will not be providing complete R code for the homework.
  5. NEW!  How to write a great final report. This includes a suggested table of contents, a checklist, the grading template I use for final reports, and other information. BDA18 Writing your final report
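To show what cross-validation (item 1) does under the hood, here is a hand-rolled 5-fold sketch in base R. It is an illustration only: it uses ordinary linear regression on the built-in mtcars data rather than LASSO or the homework data, and the choice of predictors is arbitrary. Pre-written functions, such as cv.glmnet in the glmnet package for LASSO, automate these same steps.

```r
set.seed(7)
k <- 5

# Assign each row of the built-in mtcars data to one of k folds at random
fold <- sample(rep(1:k, length.out = nrow(mtcars)))

# For each fold: fit on the other folds, then measure error on the held-out fold
cv_mse <- sapply(1:k, function(j) {
  fit  <- lm(mpg ~ wt + hp, data = mtcars[fold != j, ])
  pred <- predict(fit, newdata = mtcars[fold == j, ])
  mean((mtcars$mpg[fold == j] - pred)^2)
})

cv_mse        # one mean squared error per fold
mean(cv_mse)  # cross-validated estimate of prediction error
```

Every row is used for testing exactly once and is never in the training set when it is tested, which is what makes the averaged error an honest estimate.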

Week 6 Notes Text Mining

This page is the weekly summary of relevant material for assignments, projects, etc. Edited May 8.

  1. Projects: all projects should have downloaded live data by this weekend. If you have trouble with that deadline, it is time to drastically prune your goals. You can always add more elaborate analysis after you do the basics, but the opposite is not true.
    • If you have not submitted a project update last weekend (nothing since #3, April 29), come to Wednesday night office hours, or text me to suggest another time.
    • Every team should expect to meet with me at least twice after your basic project is approved. The first meeting usually discusses whether you have formulated the problem in a way that is solvable with the data you have. Roughly 30% of the projects need to change the problem statement at this stage, which is easier than getting more data.
    • The second meeting generally looks at your results, and finds ways to boost them / make them more interesting.
  2. The next homework is due Friday May 11. It is problem 20.3 from the main DMBA textbook. Data files are available on the book’s website. Details are on the linked assignment from the syllabus page. Latest syllabus, assignments, + notes for #BDA Big Data Analytics at UC San Diego
    1. Error in file name! The file AutoElectronics.zip has been scrambled by the book authors. On their web site they call it AutoAndElectronics.zip. (They have yet a third name in the textbook!!) You can DL it from them at http://www.dataminingbook.com/system/files/AutoAndElectronics.zip or at the course Google page under either name.
  3. Key resource list  Reference books on R and on specialized data mining methods. Resources for Mining + R language. Information to solve 98% of your R and data mining algorithm problems is available from this page!  (You still need to figure out how to formulate your business problem as Data Mining. Many of the references give advice about this, but it is not reducible to a “cookbook.”)
  4. Text-mining specific resources. Text-mining resources for projects
  5.  Check out the seminar next Monday on careers. Careers in Data Analytics discussion Monday May 14, 12:30
  6. One new book is particularly worth checking out if you are struggling with messy data sets. It discusses packages that have been developed specifically for common data manipulation problems in machine learning/data mining. Several of Feiyang’s tutorials use this book.
    R for Data Science: The tidyverse is a set of new tools for file and data manipulation. Much more efficient than raw R, and faster to write code with. Chapter 5 is probably the place to start. This book is available for $20 or from the library, but the same material is on a web site http://r4ds.had.co.nz/
  7. Weekly notes. Text-Mining Bohn Day 1+2.
  8. The file we used for a tm tutorial on Monday. Called Basic Text Mining in R. It uses slightly different techniques than the textbook, such as no Latent Semantic Analysis, but instead taking out the rare words.
  9. Files for Wednesday’s class, section 20.5
    1. May 9 list of functions. Keep this as a “cheat sheet,” and add to it as you read and do mining.
    2. Assignment: Text mining #2 2018b
    3. Lecture notes. Text-Mining Bohn Day 1+2
  10. The R files cannot be stored on WordPress. Get them on the course’s Google drive page. You can reconstruct my work with two files:
    • Files ending in .R contain the code which I wrote and kept in RStudio’s Source window (upper left). This code can be run.  Note: if I actually ran additional code but did not save it in the Source file, it appears in History, but not in Source.
    • Files ending in .RData contain the entire Global Environment of variables.
    • By loading both of these into RStudio, you can recreate the status at the time I saved them. Thus all calculations are preserved.
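The same save-and-restore step can be done from code instead of the RStudio menus. The sketch below uses a small hypothetical data frame and writes to a temporary folder; the object and file names are made up for illustration.

```r
# Hypothetical intermediate result (made up for illustration)
results <- data.frame(id = 1:3, score = c(0.2, 0.7, 0.9))

# Option 1: .RData files can hold one or more named objects
rdata_path <- file.path(tempdir(), "results.RData")
save(results, file = rdata_path)

# Option 2: .rds files hold exactly one object, named when you read it back
rds_path <- file.path(tempdir(), "results.rds")
saveRDS(results, rds_path)

rm(results)                        # simulate starting a fresh session
load(rdata_path)                   # restores 'results' under its original name
results_copy <- readRDS(rds_path)  # same data, under a name you choose

identical(results, results_copy)
```

To save every variable in the Global Environment at once, as in the .RData files described above, use save.image() with a file name. Unlike a CSV file, these formats keep column types and factors intact, so nothing has to be converted on reload.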

Lecture note supplements

From time to time I write  guides/tutorials on topics in lectures that people find confusing. Taken together, they add up to a supplemental textbook.

Material for wk. 4, linear regression. Update 4/26

  1. NEW!  Playstation  #SIE-PSN teams – I am setting up a page for information about this project. THE DEADLINE IS NOW – you must be either IN or OUT so we can move forward. On the + side, Sony is very serious about hiring, and the experience you will get will be immediately relevant to many potential employers. On the – side, you will need to be a self-starter, seek out documentation and information yourself, and generally have a “hands-on” attitude. For example, in order to understand the data coming from the Sony SIE website, you will need to actually visit and examine the site.
  2. NEW!  Lecture notes from Monday 4/23 and 4/25 Updated to Wednesday. BDA18 regression 2018B.key
  3. For the EPA homework, I found a Word document containing the R code. It may be easier to work with: BDA16S BohnDA-gram data editing in R. Other than the format, it should be almost identical to the one posted with the homework.
  4. The best books and sites about R and data mining. For each project, there are some specialty books and web sites that can save you hours or even days of effort.  It is organized into:
    • Reference books
    • Cheat sheets
    • Resources for learning R
    • Books on special topics.
  5. The minute you start working with R directly, run to this page and download a few resources. Especially get at least 1 cheat sheet. I handed one out in class – now get the e-version.
  6. Answers to some short homework questions from last week – I will post them soon.
  7. NEW! Answers to questions that were on the yellow post-its. DONE
  8. Study guides for ROC curves and other topics about classification models (week 3). I have put them on a new page of supplemental notes. Lecture note supplements
  9. We got a fan note from the author of our Rattle book:
    Hi Roger,

    Just saw your blog post on installing Rattle on mac OS X. Thank you so much for that.
    I’ve added a pointer on my rattle install page (https://rattle.togaware.com/rattle-install-mac.html  … which needs a refresh one day :-).
    You are correct, I don’t have any ready access to a Mac and spend most of my time on Linux.
    It is great that you shared your experience – I know many others will find this useful and no doubt will be appreciative as well.
    All the best.

    Regards,
    Graham

 

Class 4, Classification trees and Toyota cars + projects. Update 4/12 1pm

This page has been expanded after class.

  1. (new).  Important: list of key ideas. In draft form only. BDA18 Class 4 key conceptsC 
  2. (new) Lecture notes: BDA18 Class 4 Lecture notes Toyota. This has answers to some of the questions asked on post-its. I still need to post additional Q&A.
  3. (new) Updated page with more Questions & Answers about the Toyota case. Q&A about CART + Toyota What to put in your write-up, what to do about “Model” variable, etc.
  4. Feiyang will have an R session, this Friday at  1pm. Gardner Auditorium. These classes are required for the R certification. See also the new page Resources for R language. My office hours tonight 6:30 pm – let’s talk about paper topics.


Notes from class 3, CART using Rattle

MEMORANDUM
To: Big Data Analytics students
From: Prof. Roger Bohn
Subject: Class #3 Monday April 9 – next steps, Q&A, homework schedule
Date: April 9, 2018

The lecture notes were provided before class. Visit Latest handouts. We did not cover all of them, and will continue with the CART algorithm on Wednesday before discussing Toyota.

Another topic we discussed, not in the notes: Benefits and disadvantages of open source software.

Please email (or put in comments on this page) questions about the Weather exercise from the Rattle book. Several people asked good questions about Toyota after class. If there are no more questions about how to use Rattle, we will move right into the next segment on Wednesday.

Toyota homework now due Friday at Noon. The TritonEd assignment has been updated.

Still having trouble with Rattle? Feiyang 4pm today. Location unclear, check near GPS office 3132. Feiyang is polling about what her tutorial hours should be. Please respond to her Doodle poll at https://goo.gl/forms/5rjhpIjevaewMBjJ2.  No response = you don’t get a vote.

Other questions asked in class and not answered:

  • Can we have a group of 3 for homework? No. You can discuss with others if you put their names in a note. But only 2 people should work on the actual memo answers.
  • Grading scale, grading policy. I will post something about this. Homework is graded on a 0 to 10 scale. An average of 8 is fine.
  • How to find other people who are interested in projects. I just created a page specifically for that. Final paper ‘dating site’
  • Where to learn more R. Attend the TA tutorials, and I will shortly post a list of recommended websites and readings.  This page is a starting point. Resources for R language