Material for wk. 4, linear regression. Update 4/26

  1. NEW! PlayStation #SIE-PSN teams – I am setting up a page for information about this project. THE DEADLINE IS NOW – you must be either IN or OUT so we can move forward. On the + side, Sony is very serious about hiring, and the experience you will get will be immediately relevant to many potential employers. On the – side, you will need to be a self-starter, seek out documentation and information yourself, and generally have a “hands-on” attitude. For example, in order to understand the data coming from the Sony SIE website, you will need to actually visit and examine the site.
  2. NEW! Lecture notes from Monday 4/23 and Wednesday 4/25, now updated through Wednesday: BDA18 regression 2018B.key
  3. For the EPA homework, I found a Word document containing the R code. It may be easier to work with: BDA16S BohnDA-gram data editing in R. Other than the format, it should be almost identical to the one posted with the homework.
  4. The best books and sites about R and data mining. For each project, there are some specialty books and web sites that can save you hours or even days of effort. The page is organized into:
    • Reference books
    • Cheat sheets
    • Resources for learning R
    • Books on special topics.
  5. The minute you start working with R directly, run to this page and download a few resources. Especially get at least 1 cheat sheet. I handed one out in class – now get the e-version.
  6. Answers to some short homework questions from last week – I will post them soon.
  7. NEW! Answers to questions that were on the yellow post-its. DONE
  8. Study guides for ROC curves and other topics about classification models (week 3). I have put them on a new page of supplemental notes: Lecture note supplements
  9. We got a fan note from the author of our Rattle book:
    Hi Roger,

    Just saw your blog post on installing Rattle on mac OS X. Thank you so much for that.
    I’ve added a pointer on my rattle install page (  … which needs a refresh one day :-).
    You are correct, I don’t have any ready access to a Mac and spend most of my time on Linux.
    It is great that you shared your experience – I know many others will find this useful and no doubt will be appreciative as well.
    All the best.



Latest syllabus, assignments, + notes for #BDA Big Data Analytics at UC San Diego

This page links to the latest versions of course material. Some PDF, some HTML. Update May 29, 2018

Lecture Notes (chronological order)

  1. BDA18-D3 Chap9_CART RB.  For the class of April 9, on CART
  2. BDA18 Class 4 Lecture notes Toyota  For the class of April 11, on CART + Toyota
  3. Logistic Regression 2018  Class of April 16 on classification using linear models aka logistic regression.
  4. Class of April 18 on linear categorical models aka logistic regression. BDA18 illustration of Rattle use 04-18
  5. Notes on Linear Regression, Week 4,  April 23, 25  BDA18 regression 04/25.pdf.   BDA18 regression slides 4-23. Use primarily the April 25 version; 4/23 has a few additional  slides.
  6. How to go from Rattle to R: BDA18 Rattle to R code 4-25.pdf
  7. Lecture Notes Week 5, Random Forests: BDA18 Random Forests2018B
  8. Lecture Notes Week 6 Text Mining, Day 1
    Tutorial worked through in class. Basic Text Mining in R 2017 version
  9. Week 6 Text mining #2 2018b 
  10. Week 7 LASSO, Monday May 14.
  11. Week 8  lecture notes. Monday May 21. BDA18 feature engineering case study

Advice, tutorials, reference books, other useful material

Special topics – for specific papers

The Big Data Analytics course introduces data mining with techniques and concepts that are broadly applicable. Individual topics and projects have specific techniques, needs, and resources. In keeping with the theme “Borrow and re-use, don’t invent anything yourself,” here are some resources that are especially suited to particular topics.

Don’t forget to try the site’s Search window (usually near the upper right) to look up possible keywords. Many of these topics also have entire books about them, such as on Springerlink.

Other links:

Google folder for the course. There you will find all datasets for the textbook.

The official textbook web site is
Once you register, you can get these datasets, and the R Code. (It’s better to type the R Code by hand, the first time.)

Contact Information

Personal web site:

Random Forests + LASSO Lecture May 11

Here are the lecture notes on Random Forests from Thursday May 11: BDA17 Random Forests May 11 Bohn. Remember, Random Forests are a technique everyone should try.

LASSO, also discussed on Wednesday, is great when you have lots of variables. With fewer than 20 variables, it’s not as necessary. BUT remember that you will often want to add interaction terms (and jump terms/quadratic terms/etc.) to linear models. As soon as you start that, the number of variables balloons.
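
To get a feel for how fast the variable count grows, here is a minimal sketch in Python, used purely for the arithmetic. The function name expanded_term_count is made up for illustration. It assumes you add one squared term per variable plus every pairwise interaction, which is only one of many ways to expand a linear model.

```python
from math import comb


def expanded_term_count(p: int) -> int:
    """Count model terms after expanding p original variables.

    Assumption (for illustration only): the expansion adds one squared
    term per variable and one interaction term per pair of variables.
    """
    main_effects = p
    squared_terms = p
    pairwise_interactions = comb(p, 2)  # p * (p - 1) / 2 distinct pairs
    return main_effects + squared_terms + pairwise_interactions


# Print the expanded count for a few starting sizes.
for p in (5, 10, 20, 50):
    print(p, "original variables ->", expanded_term_count(p), "terms")
```

Under these assumptions, a model that starts with 20 variables ends up with 230 candidate terms, so LASSO becomes useful much sooner than the raw variable count suggests.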