(updated 11am June 12) Where to turn in your paper: You may put it under the door of my office (room 1315, 3rd floor), or in my faculty mailbox. Submit the PDF version on TritonEd, in the TurnitIn link.
Improving the course: I think I have approximately the right mix of topics in Big Data Analytics, but there are other areas that I would like to improve. For example, how can I better smooth the workload in the projects, to reduce the end-of-year crunch? Here is a BDA End-of-year questionnaire that asks a series of questions about what I should change.
Thanks for taking the class – I certainly enjoy teaching it. Enjoy graduation! Enjoy life after the university!
We are finished with formal classes, so I will post here some additional advice for your projects. Most of this is based on things I noticed either in interim reports or in the final presentations on June 6. It’s going to take a few days to write and edit all of these notes, so keep checking this page through Saturday.
Some of this advice was written with one specific project or team in mind, but all of it applies to multiple projects.
- Don’t use loops in R. Most computer languages make heavy use of FOR loops. R avoids almost all uses of loops, and it runs faster and is easier to write and debug without them. Here is an example of how to rewrite code without needing a loop, taken from one of this year’s projects. In some cases, R code that avoids loops will run 100x faster (literally). BDA18 Avoid loops in R
- Don’t use CSV data when working with large datasets. CSV (comma-separated values) files have become a lingua franca for exchanging data among different computer languages and environments. For that purpose, they are decent. But they are very inefficient, in terms of both speed and memory use. One team mentioned that they were running into memory limits, but the problem was most likely due to their keeping CSV files around! Solution: Use CSV files with read.csv to get data into R. But after that, store your data as R objects (dataframes or some other kind). If you want to store intermediate results in a file, create the object inside R, then use RStudio to save it as an R object. (File name ends in .Rdata.) When you want it again, use the RStudio File/Open File command to load it. No additional conversion will be needed.
I will add something about this to the notes on Handling Big Data. Dealing with the “big” in Big Data.
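The two points above (no loops, no lingering CSV files) can be sketched together in a few lines. This is only an illustration: the `sales` data frame, its columns, and the file name are made up, and `save()`/`load()` are the script equivalents of the RStudio menu commands.

```r
# Hypothetical data: 100,000 transactions with a price and a quantity.
set.seed(1)
sales <- data.frame(
  price    = runif(1e5, min = 1, max = 100),
  quantity = sample(1:10, 1e5, replace = TRUE)
)

# Loop version: fills the result one element at a time.
revenue_loop <- numeric(nrow(sales))
for (i in seq_len(nrow(sales))) {
  revenue_loop[i] <- sales$price[i] * sales$quantity[i]
}

# Vectorized version: one expression over whole columns, no loop.
sales$revenue <- sales$price * sales$quantity

# Both approaches give the same numbers.
stopifnot(isTRUE(all.equal(revenue_loop, sales$revenue)))

# Save the data frame as an R object (hypothetical file name in a temp folder).
rdata_file <- file.path(tempdir(), "sales.RData")
save(sales, file = rdata_file)

# Later, or in a new session: load() restores the object under its
# original name, with column types intact and no CSV parsing.
rm(sales)
load(rdata_file)
```

How much faster the vectorized line runs depends on the task and the machine, so time your own code with `system.time()` rather than relying on a rule of thumb.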
- Fix unbalanced accuracy in confusion matrices. Yesterday, I noticed several confusion matrices with much higher accuracy for one case than the other. Reminder: We spent 1.5 classes on this topic, and there are multiple solutions. It’s usually due to having much more data of one type than the other. See:
- Week 8 assignments and notes
- Good graphics. Because graphics are a concise way to communicate, even with non-specialists, I recommend having at least one superb image in every report. (See BDA18 Writing your final report.) On Friday, I will try to post some examples from your projects, with comments on how to make them even better.
- Revised discussion on how to write a good final report. Writing your final report June 8
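For the point above about unbalanced accuracy in confusion matrices, here is a minimal sketch of one of the possible fixes, oversampling the rare class. Everything in it is an assumption made for illustration: the simulated data, the variable names, the logistic regression, and the 0.5 cutoff. It is not taken from any team's project.

```r
set.seed(42)

# Hypothetical data generator: many "ok" cases, few "fraud" cases.
make_data <- function(n_ok, n_fraud) {
  data.frame(
    amount = c(rnorm(n_ok, mean = 50, sd = 20),
               rnorm(n_fraud, mean = 90, sd = 30)),
    label  = factor(c(rep("ok", n_ok), rep("fraud", n_fraud)),
                    levels = c("ok", "fraud"))
  )
}
train <- make_data(950, 50)
valid <- make_data(950, 50)   # validation data is never oversampled

# Oversample: draw fraud rows with replacement until the two classes
# are the same size in the training data.
rare_rows   <- which(train$label == "fraud")
common_rows <- which(train$label == "ok")
extra_rows  <- sample(rare_rows, length(common_rows), replace = TRUE)
balanced    <- train[c(common_rows, extra_rows), ]

# Accuracy for each actual class, read off the confusion matrix.
per_class_accuracy <- function(fit, data) {
  p_fraud   <- predict(fit, newdata = data, type = "response")
  predicted <- factor(ifelse(p_fraud > 0.5, "fraud", "ok"),
                      levels = c("ok", "fraud"))
  cm <- table(actual = data$label, predicted = predicted)
  diag(cm) / rowSums(cm)
}

fit_raw <- glm(label ~ amount, data = train,    family = binomial)
fit_bal <- glm(label ~ amount, data = balanced, family = binomial)

acc_raw <- per_class_accuracy(fit_raw, valid)
acc_bal <- per_class_accuracy(fit_bal, valid)
print(rbind(raw = acc_raw, oversampled = acc_bal))
```

Compare the two rows of the printed table: the question is whether accuracy on the rare class improves, and how much accuracy on the common class you give up to get it.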
Wednesday May 23
Assignment – Oversampling, feature engineering
- Read main textbook section 5.5 on oversampling. The credit card fraud case was an example where they used extreme oversampling of the fraud cases, because there were so few in the original data. Most real classification problems have uneven numbers of each case or uneven costs of errors. Therefore, oversampling is frequently important. Without it, or some other way to accomplish the same thing, you can get a model that is accurate but useless, such as predicting that all cases are the most common outcome.
- We will continue discussing and practicing feature engineering.
Assignment: Email me 5 to 10 proposed new variables for the credit card fraud case. The actual case is posted above, and in my lecture notes. You can see what variables the authors decided to use. Devise some additional/better ones. Can they be calculated from the available 44 million transaction records?
- Notice that you do not have to know what effect a variable will have in order to decide that it’s a potentially good feature. A LASSO or Random Forest can sort out whether it is useful, or not.
- Baseball homework from last week – the Hitters baseball data had a number of variables, but some of them seem strange. For example, “lifetime hits” just rewards players who have been playing for a long time. We also have a variable that directly measures how long they have been playing. So it is not surprising if LASSO decides that only one of the time-dependent variables is important. What changes could make a more useful measurement?
We will discuss this on Wednesday.
If you are not familiar with baseball, read about measurements of “batting performance,” i.e. how hitters are traditionally measured. For example, a key statistic is batting average. Note that many key statistics are ratios. Looking back at your homework from last week, what would have been good additional variables beyond what was in the original data?
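As a starting point for the discussion, here is a small sketch of ratio features. It assumes the column names used in the ISLR package's copy of the Hitters data (AtBat, Hits, Years, CAtBat, CHits); the three rows are typed in by hand for illustration, and the new variable names are my own.

```r
# A few illustrative rows, using Hitters-style column names (assumed).
hitters <- data.frame(
  AtBat  = c(315, 479, 496),    # at-bats this season
  Hits   = c(81, 130, 141),     # hits this season
  Years  = c(14, 3, 11),        # years in the major leagues
  CAtBat = c(3449, 1624, 5628), # career at-bats
  CHits  = c(835, 457, 1575)    # career hits
)

# Batting average is a ratio: hits per at-bat.
hitters$BatAvg       <- hitters$Hits  / hitters$AtBat
hitters$CareerBatAvg <- hitters$CHits / hitters$CAtBat

# Dividing a lifetime total by years played removes the "has simply
# been playing for a long time" effect.
hitters$HitsPerYear  <- hitters$CHits / hitters$Years

print(round(hitters[, c("BatAvg", "CareerBatAvg", "HitsPerYear")], 3))
```

Note that all three new columns are computed for every player at once, without a loop.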
- For messing around. Google is sponsoring an open-source project for quick data exploration. This one seems to be able to display about 5 dimensions of data at once, by using:
Positions on x,y axes
Facets on x, y axes (especially for categorical variables).
I may give it a demo on Wednesday. It seems to be only for exploration, but exploration is well worth one to several hours of project time.
- Monday handouts — feature engineering,
Cheat sheet on performance measurements: Sensitivity, Specificity, and many other measures. contingency table defns SAVE Keep this for reference.
Handout: some key plots for exploring data. Specifically for Hitters data. R Graphics Cookbook – Excerpt
Lecture notes: The medical testing paradox and how to solve it. BDA18 feature engineering case study In the same lecture is “Feature engineering for detecting credit card fraud”.
An abbreviated version of the paper, which is all you need for this week. “Data mining for credit card fraud: A comparative study” by Siddhartha Bhattacharyya et al. Credit card fraud – excerpt. The original article for credit card fraud detection. Includes their feature engineering solution. The full article is available here. (On-campus or VPN only.)
Explaining test results: article in Science. In class we applied this to catching terrorists in the country of Blabia. Risk literacy in medical decision-making: How can we better represent the statistical structure of risk? By Joachim T. Operskalski and Aron K. Barbey. Science, 22 APRIL 2016 • VOL 352 ISSUE 6284. Risk literacy in medical decision-making
Lecture notes, Wednesday BDA18 Lecture feature engineering
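The medical testing paradox in the lecture notes above comes down to a few lines of arithmetic. The sketch below uses made-up numbers (a condition that affects 1 person in 1,000, and a test with 99% sensitivity and 99% specificity); they are not the numbers from the lecture or from the Science article.

```r
# Probability that a person who tests positive really has the condition.
positive_predictive_value <- function(prevalence, sensitivity, specificity) {
  true_pos  <- sensitivity * prevalence              # sick and flagged
  false_pos <- (1 - specificity) * (1 - prevalence)  # healthy but flagged
  true_pos / (true_pos + false_pos)
}

# Hypothetical numbers, chosen only for illustration.
ppv <- positive_predictive_value(prevalence  = 0.001,
                                 sensitivity = 0.99,
                                 specificity = 0.99)
print(round(ppv, 3))

# The same calculation as counts in a population of 100,000 people.
population  <- 100000
sick        <- population * 0.001          # 100 people
true_alarm  <- sick * 0.99                 # 99 of them test positive
false_alarm <- (population - sick) * 0.01  # 999 healthy people test positive
print(true_alarm / (true_alarm + false_alarm))
```

With these numbers only about 9% of positive results are real, even though the test is "99% accurate." The rare condition is swamped by false alarms from the much larger healthy group, which is the same imbalance problem that motivates oversampling.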
This page is the weekly summary of relevant material for assignments, projects, etc. Edited May 8.
- Projects: all projects should have downloaded live data by this weekend. If you have trouble with that deadline, it is time to drastically prune your goals. You can always add more elaborate analysis after you do the basics, but the opposite is not true.
- If you did not submit a project update last weekend (nothing since #3, April 29), come to Wednesday night office hours, or text me to suggest another time.
- Every team should expect to meet with me at least twice after your basic project is approved. The first meeting usually discusses whether you have formulated the problem in a way that is solvable with the data you have. Roughly 30% of the projects need to change the problem statement at this stage, which is easier than getting more data.
- The second meeting generally looks at your results, and finds ways to boost them / make them more interesting.
- The next homework is due Friday May 11. It is problem 20.3 from the main DMBA textbook. Data files are available on the book’s website. Details are on the linked assignment from the syllabus page. Latest syllabus, assignments, + notes for #BDA Big Data Analytics at UC San Diego
- Error in file name! The file AutoElectronics.zip has been scrambled by the book authors. On their web site they call it AutoAndElectronics.zip. (They have yet a third name in the textbook!!) You can DL it from them at http://www.dataminingbook.com/system/files/AutoAndElectronics.zip or at the course Google page under either name.
- Key resource list: Reference books on R and on specialized data mining methods. Resources for Mining + R language. Information to solve 98% of your R and data mining algorithm problems is available from this page! (You still need to figure out how to formulate your business problem as Data Mining. Many of the references give advice about this, but it is not reducible to a “cookbook.”)
- Text-mining specific resources. Text-mining resources for projects
- Check out the seminar next Monday on careers. Careers in Data Analytics discussion Monday May 14, 12:30
- One new book is particularly worth checking out if you are struggling with messy data sets. It discusses packages that have been developed specifically for common data manipulation problems in machine learning/data mining. Several of Feiyang’s tutorials use this book.
R for Data Science: The tidyverse, a set of new tools for file and data manipulation. Much more efficient than raw R, and faster to write code with. Chapter 5 is probably the place to start. This book is available for $20 or from the library, but the same material is on the web site http://r4ds.had.co.nz/
- Weekly notes. Text-Mining Bohn Day 1+2.
- The file we used for a tm tutorial on Monday. Called Basic Text Mining in R. It uses slightly different techniques than the textbook, such as no Latent Semantic Analysis, but instead taking out the rare words.
- Files for Wednesday’s class, section 20.5
- May 9 list of functions. Keep this as a “cheat sheet,” and add to it as you read and do mining.
- Assignment: Text mining #2 2018b
- lecture notes. Text-Mining Bohn Day 1+2
- The R files cannot be stored on WordPress. Get them on the course’s Google drive page. You can reconstruct my work with two files:
- Files ending in .R contain the code which I wrote and kept in RStudio’s Source window (upper left). This code can be run. Note: if I actually ran additional code but did not save it in the Source file, it appears in History, but not in Source.
- Files ending in .RData contain the entire Global Environment of variables.
- By loading both of these into RStudio, you can recreate the status at the time I saved them. Thus all calculations are preserved.
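Going back to the tm tutorial listed above, which takes out rare words instead of using Latent Semantic Analysis: the idea can be sketched in base R as follows. This is not the tutorial's code. The three documents and the "appears in at least 2 documents" cutoff are invented for illustration.

```r
# Three tiny hypothetical documents.
docs <- c(
  "the battery died after two weeks",
  "battery life is great and the screen is great",
  "the screen cracked and the battery overheated"
)

# Split into lowercase words, dropping any empty strings.
tokens <- strsplit(tolower(docs), "[^a-z]+")
tokens <- lapply(tokens, function(w) w[nzchar(w)])

# Document-term matrix: one row per document, one column per word.
vocab <- sort(unique(unlist(tokens)))
dtm <- t(sapply(tokens, function(words) {
  table(factor(words, levels = vocab))
}))

# Document frequency: in how many documents does each word appear?
doc_freq <- colSums(dtm > 0)

# Take out the rare words: keep only words found in 2 or more documents.
dtm_small <- dtm[, doc_freq >= 2, drop = FALSE]
print(dtm_small)
```

The tm package has a function, `removeSparseTerms()`, that applies the same idea to its own document-term matrices. On a real corpus this step usually shrinks the matrix from tens of thousands of columns to something a model can handle.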
From time to time I write guides/tutorials on topics in lectures that people find confusing. Taken together, they add up to a supplemental textbook.
- Supplemental notes classification scoring, ROC curves, etc.
- How to clean and add to a data set, using R, based on the EPA homework. Two versions of this guide, one a PDF BDA16S BohnDA-gram data editing in R, and the other in .doc form so you can edit out the line numbers on the R code. BDA16S BohnDA-gram data editing in R. Other than the format, they should be identical. It is also attached to the EPA regression assignment of April 25.
This page has been expanded after class.
- (new). Important: list of key ideas. In draft form only. BDA18 Class 4 key conceptsC
- (new) Lecture notes BDA18 Class 4 Lecture notes Toyota. This has answers to some of the questions asked on post-its. I still need to post additional Q&A.
- (new) Updated page with more Questions & Answers about the Toyota case. Q&A about CART + Toyota What to put in your write-up, what to do about “Model” variable, etc.
- Feiyang will have an R session, this Friday at 1pm. Gardner Auditorium. These classes are required for the R certification. See also the new page Resources for R language. My office hours tonight 6:30 pm – let’s talk about paper topics.
Continue reading “Class 4, Classification trees and Toyota cars + projects. Update 4/12 1pm”
To: Big Data Analytics students
From: Prof. Roger Bohn
Subject: Class #3 Monday April 9 – next steps, Q&A, homework schedule
Date: April 9, 2018
The lecture notes were provided before class. Visit Latest handouts. We did not cover all of them, and will continue with the CART algorithm on Wednesday before discussing Toyota.
Another topic we discussed, not in the notes: Benefits and disadvantages of open source software.
Please email (or put in comments on this page) questions about the Weather exercise from the Rattle book. Several people asked good questions about Toyota after class. If there are no more questions about how to use Rattle, we will move right into the next segment on Wednesday.
Toyota homework now due Friday at Noon. The TritonEd assignment has been updated.
Still having trouble with Rattle? Feiyang 4pm today. Location unclear, check near GPS office 3132. Feiyang is polling about what her tutorial hours should be. Please respond to her Doodle poll at https://goo.gl/forms/5rjhpIjevaewMBjJ2. No response = you don’t get a vote.
Other questions asked in class and not answered:
- Can we have a group of 3 for homework? No. You can discuss with others if you put their names in a note. But only 2 people should work on the actual memo answers.
- Grading scale, grading policy. I will post something about this. Homework is graded on a 0 to 10 scale. An average of 8 is fine.
- How to find other people who are interested in projects. I just created a page specifically for that. Final paper ‘dating site’
- Where to learn more R. Attend the TA tutorials, and I will shortly post a list of recommended websites and readings. This page is a starting point. Resources for R language