BDA 2018: final schedule (updated)

I have created a new list of resources for specific project types, such as spatial analysis and Twitter analysis. It is on the Latest Handouts page, under the heading Special topics for individual papers.

Summary of the last 2 weeks of the course:

  • Only nominal homework – readings and one figure.
  • Work on projects. Ask for help if desired.  No more interim reports are due.
  • R Certification: If you want R certification for the course, take a one-hour quiz and meet some other requirements.
  • Make an in-class presentation: two-person teams only.
  • Final paper due

Wednesday, May 30. Handling unbalanced data, and other useful techniques.
Reading: Chapter 5.5, also 5.3 and 5.4. These were assigned previously.
Nothing to be turned in

Saturday, June 2: No progress report is due.

Monday, June 4: A/B Testing and other emerging topics in Big Data

  • Look up specific techniques for your project: spatial data and GIS, text processing, crime, graphics, or Twitter. One or more applies to every project. See Special topics for individual papers.
  • Turn in: One careful plot from your project, in hard copy, with comments written on it by hand. Format the plot carefully and clearly, including scales, colors, definitions, etc. Please turn these in by hand in class; this is to encourage hand-writing of comments. Circle and explain at least one interesting or important feature of your plot.
    • Include a caption. Captions in scientific papers are sometimes several sentences long.
    • The goal of the assignment is to help you focus intensively on one result of your project, and how to explain it visually. It does not have to be a data-mining result.
  • Reading: “The A/B Test: Inside the Technology That’s Changing the Rules of Business,” Wired Magazine, 04.25.12.
  • Visit an e-commerce website and think about how to improve it using A/B testing.
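
As a starting point for the A/B testing exercise, here is a minimal sketch of how two versions of a page might be compared. It is written in Python rather than the course's R, only to stay self-contained. The function name, the visitor counts, and the choice of a pooled two-proportion z-test are all illustrative assumptions, not material from the Wired article or the lecture.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Compare conversion rates of page versions A and B.

    conv_a, conv_b: number of visitors who converted on each version.
    n_a, n_b: number of visitors shown each version.
    Returns (z, two-sided p-value) using a pooled standard error.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled conversion rate under the null hypothesis of no difference.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided normal tail probability: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 1000 visitors per version, invented for illustration.
z, p = two_proportion_z_test(conv_a=100, n_a=1000, conv_b=150, n_b=1000)
print(f"z = {z:.2f}, two-sided p = {p:.4f}")
```

A small p-value suggests the difference in conversion rates is unlikely to be chance alone. Whether the difference is large enough to matter for the business is a separate question.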

Wednesday, June 6: All two-person project teams will give 5 to 7 minute presentations. The goal is to fascinate, impress, and surprise your audience. Think of this as the “elevator pitch” for your project.

Friday, June 8, 1pm or other times as agreed: Quiz for R Certification. The quiz emphasizes data manipulation in R: selecting data subsets, creating new variables, and rearranging and redefining data such as event logs. The other requirements for R certificates are completing your project using appropriate R programming and attending 50% of TA tutorials.

Friday, June 8 midnight: Formal due date for final project papers.
All project teams that request one receive an automatic extension until Wednesday.
Submit both hard copy and PDF files. Submit via Turnitin, on TritonEd.

Wednesday, June 13: Deadline for projects that requested the extension.

Resources on data manipulation

Some are elsewhere on this site.  for May 23.


Wk. 7 LASSO and Regularization

  1. Readings and homework assignment on LASSO: BDA18 HW Lasso May 14-18e. To be done in class on Wednesday, May 16. The data file is here. Be ready with your set of reference books, old code, and cheat sheets. Note the new method called cross-validation. Most algorithms we use have cross-validation functions pre-written for them.
    • Hand in a memorandum version of this assignment on Friday. Also hand in some additional notes on R. Details are on TritonEd as usual.
  2. Lecture notes from Monday. RB Lasso lecture 2018
  3. Lecture notes for Wednesday. Most of Wednesday was devoted to coding the homework problem. These notes include: list of functions, some material from ISLR Chapter 6 (best description of LASSO), discussion of debugging strategies. Notes for wk 7 LASSO
  4. I will not be providing complete R code for the homework.
  5. NEW!  How to write a great final report. This includes a suggested table of contents, a checklist, the grading template I use for final reports, and other information. BDA18 Writing your final report
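
To make the idea of cross-validation from item 1 concrete, here is a minimal sketch of using k-fold cross-validation to choose a LASSO penalty. It is written in Python with no libraries, rather than the course's R, and it uses a deliberately tiny LASSO: one predictor and no intercept, where the fitted coefficient has a closed-form soft-threshold solution. All function names and the toy data are hypothetical. This is not the homework solution, and it is not a description of how the pre-written cross-validation functions work internally.

```python
import random


def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty, stopping at zero."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def fit_lasso_1d(xs, ys, penalty):
    """Minimize 0.5 * sum((y - b*x)^2) + penalty * |b| for a single predictor."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return soft_threshold(sxy, penalty) / sxx


def k_fold_indices(n, k, seed=0):
    """Shuffle row indices 0..n-1 and deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cv_error(xs, ys, penalty, k=5):
    """Mean squared prediction error on held-out rows, averaged over all rows."""
    total = 0.0
    for held_out in k_fold_indices(len(xs), k):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        # Fit on the training rows only, then score on the held-out rows.
        b = fit_lasso_1d([xs[i] for i in train], [ys[i] for i in train], penalty)
        total += sum((ys[i] - b * xs[i]) ** 2 for i in held_out)
    return total / len(xs)


def choose_penalty(xs, ys, penalties, k=5):
    """Return the candidate penalty with the lowest cross-validated error."""
    return min(penalties, key=lambda lam: cv_error(xs, ys, lam, k))


# Toy data invented for illustration: y is roughly 2 * x.
toy_x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
toy_y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9]
print(choose_penalty(toy_x, toy_y, [0.0, 1.0, 10.0, 100.0]))
```

The key point is that each candidate penalty is scored only on rows that were not used to fit the model, so the chosen penalty reflects prediction on new data rather than fit to the training data.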

Text mining homework: speeding up calculations

A few hours ago I received an email from Emily about the farm advertisement problem due on May 11. I wrote her back. But her question raises general issues for many projects. I’m sure other students also had similar problems with Friday’s homework.

In response, I just wrote BDA18 Memo =My program runs too slowly v 1.1.
New version! BDA18 My program =slow v 1.2. The memo includes most of my specific suggestions about the May 11 homework. (This may be too late for some students, but most of the ideas were also discussed briefly in class on Monday or Wednesday.)

There are probably multiple typos and errors in the memo. Please send me corrections by email, for class credit.

    I am working on the assignment due tomorrow and have encountered a problem. When reducing the TF-IDF matrix to 20 concepts, RStudio always stops working (as indicated by the little ‘Stop’ sign in the console). I’m thinking this is because the farm-ads.csv dataset is too large. Without reducing the concepts, I am unable to move forward with the random forest part of the assignment. I am wondering if there is a solution to this problem or a way to work around it.
    Apologies in advance for not approaching you with this question earlier. It’s been a very hectic week!
    Thanks for your help,
By the way, I’m 98% sure that in fact RStudio did not “stop working.” It was probably still cranking away. Check the Activity Monitor application on your computer to be sure.

Week 5: Random Forests, R, debugging

The ostensible material this week is Random Forests. They are a generalization of Classification/Regression Trees. No assignment for Monday; everything will be done in class. Redo the class material and hand it in on Wednesday (not Friday – we will go over the results in class.)

Here is the assignment for Wednesday, including the readings on Random Forests.  BDA18 Random Forest Assign May 2, 2018.

Here is the material used in class. BDA18 Random Forests2018B

The other agenda this week is to develop your skills in debugging. This is the key skill of writing code: figuring out what is wrong and how to fix it.

Week 6 is text mining. Here is the assignment for next Monday: BDA18 Text mining #1 assign. (Short assignment handed in Sunday night.)

Weekly project reports 3, 4, 5, …

What to submit each week to show progress on your project.

Each week, each team should submit a project progress report via Ted/TritonEd. The purpose of these reports is partly to help you focus on what you have accomplished and what needs to be done next. I will also scan them and offer comments from time to time. If you have a specific question, or want advice on something, also send your report to me via email #BDA18. It’s especially important to let me know by email or visit if you are held up by a bottleneck, such as getting specific data, a technique that you have not figured out (often, a text mining question), or anything else.

Here is the “generic assignment:”

Project reports continue to be due weekly, preferably on Saturdays unless we discuss another time for your next report (such as just after a meeting during office hours). Follow the usual rules if you are a team – both people submit identical files.

The content of each report depends on your stage of activity. A general guide is on the website. It’s especially important to 1) show steady progress, 2) gather and look at real data, even if you don’t know yet how you are going to analyze it, and 3) use an approach of incremental modeling, rather than trying to create one giant analysis.

You need only write a paragraph or two of text. Emphasize your major new insights and analyses. Attach printouts of outputs (including exploratory analysis etc.), highlighting anything that you think is noteworthy. Make the exhibits self-explanatory by including a good caption, circling key numbers, etc. Informal exhibits are OK, even handwritten ones.

Memos and examples about projects and project reports

BDA18 Project assignment example 4-17 

Examples of past intermediate reports.


BDA Assign 2016-02-14_Hyerim Kim_Project 3+ comments

Added material for the #SIE-PSN project.


BDA18 Sony PSN project update 4-19

Playstation #SIE-PSN information

Material for wk. 4, linear regression. Update 4/26

  1. NEW!  Playstation  #SIE-PSN teams – I am setting up a page for information about this project. THE DEADLINE IS NOW – you must be either IN or OUT so we can move forward. On the + side, Sony is very serious about hiring, and the experience you will get will be immediately relevant to many potential employers. On the – side, you will need to be a self-starter, seek out documentation and information yourself, and generally have a “hands-on” attitude. For example, in order to understand the data coming from the Sony SIE website, you will need to actually visit and examine the site.
  2. NEW!  Lecture notes from Monday 4/23 and Wednesday 4/25, updated to Wednesday: BDA18 regression 2018B.key
  3. For the EPA homework, I found a Word document containing the R code. It may be easier to work with: BDA16S BohnDA-gram data editing in R. Other than the format, it should be almost identical to the one posted with the homework.
  4. The best books and sites about R and data mining. For each project, there are some specialty books and web sites that can save you hours or even days of effort. The list is organized into:
    • Reference books
    • Cheat sheets
    • Resources for learning R
    • Books on special topics.
  5. The minute you start working with R directly, run to this page and download a few resources. Especially get at least 1 cheat sheet. I handed one out in class – now get the e-version.
  6. Answers to some short homework questions from last week – I will post them soon.
  7. NEW! Answers to questions that were on the yellow post-its. DONE
  8. Study guides for ROC curves and other topics about classification models (week 3). I have put them on a new page of supplemental notes. Lecture note supplements
  9. We got a fan note from the author of our Rattle book:
    Hi Roger,

    Just saw your blog post on installing Rattle on mac OS X. Thank you so much for that.
    I’ve added a pointer on my rattle install page (  … which needs a refresh one day :-).
    You are correct, I don’t have any ready access to a Mac and spend most of my time on Linux.
    It is great that you shared your experience – I know many others will find this useful and no doubt will be appreciative as well.
    All the best.



Homework week 4: Linear regression

This week we have 3 learning goals. It will take the entire week to do them.

  1. Linear regression for prediction. How it differs from hypothesis testing.
  2. Showing how to use R instead of, or in conjunction with, Rattle.
  3. Many specific tricks and issues that come up with linear regression, such as word equations and creating interaction variables.

Please see the attached document, which includes the readings, specific homework due Friday, and supplemental information about various useful ideas and techniques.

BDA18 Week 4 Readings + assign

You can get  data files here:

There is nothing due on Monday.