Sign up for Wed. presentations

The goal of your presentation is to astound and interest your classmates. (Educating them is nice, also.) So think in terms of an “elevator pitch” for your research. Imagine a listener who has zero idea what you have done, but does know about data mining. Show a few viewgraphs.

I suggest being light on prose. Instead, show samples of the world you were investigating (e.g. real tweets), nice infographics about the problem or what you found, etc. This is practice for talking with an outside audience, and NOT for talking to academics.

Sign up for the presentation sequence: first come, first served, at the Doodle poll. The times in this Doodle poll are wrong! Expect 5 minutes, so present only 1 or 2 cool viewgraphs.

Don’t forget the guidance on final reports. It includes a rubric, checklist, etc. I am editing this document to make it more readable.

You can text me to ask about irregular office hours. Tonight (Monday) IFF anyone asks, and other times by arrangement. Be sure to tell me who you are. +1 858 381-2015


BDA 2018: final schedule (updated)

I have created a new list of resources for specific project types such as spatial analysis and Twitter analysis. It is a heading on the Latest Handouts page, at Special topics for individual papers.

Summary of the last 2 weeks of the course:

  • Only nominal homework – readings and one figure.
  • Work on projects. Ask for help if desired.  No more interim reports are due.
  • R Certification: If you want R certification for the course, take a one-hour quiz and meet some other requirements.
  • Make an in-class presentation: two-person teams only.
  • Final paper due

Wednesday, May 30. Handling unbalanced data, and other useful techniques.
Reading: Chapter 5.5, also 5.3 and 5.4. These were assigned previously.
Nothing to be turned in

Saturday, June 2: No progress report is due.

Monday, June 4: A/B Testing and other emerging topics in  Big Data

  • Look up specific techniques for your project. Spatial data and GIS, Text processing, Crime, Graphics, or Twitter. One or more applies to every project. Special topics for individual papers. 
  •   Turn in: One careful plot from your project. Hard copy, with comments on it by hand. Format the plot carefully and clearly including scales, colors, definitions, etc. Please turn these in by hand in class. This is to encourage hand-writing of comments.  Circle and explain at least one interesting/important feature of your plot.
    • Include a caption. Captions in scientific papers are sometimes several sentences long.
    • The goal of the assignment is to help you focus intensively on one result of your project, and how to explain it visually. It does not have to be a data-mining result.
  • Reading: “The A/B Test: Inside the Technology That’s Changing the Rules of Business,” Wired Magazine, 04.25.12.
  • Visit an e-commerce website and think about how to improve it using A/B testing.
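
To make the A/B testing idea concrete, here is a minimal sketch in base R of how the results of one test might be compared. The visitor and purchase counts are made up for illustration, and a two-sample test of proportions is only one of several reasonable ways to analyze such a test.

```r
# Hypothetical A/B test results for one e-commerce page (made-up counts).
visitors    <- c(A = 5000, B = 5000)   # visitors shown each version
conversions <- c(A = 200,  B = 260)    # visitors who bought something

# Conversion rate for each version.
rates <- conversions / visitors
print(rates)

# Two-sample test of equal proportions (prop.test is in base R's stats package).
ab_test <- prop.test(x = conversions, n = visitors)
print(ab_test$p.value)

# Absolute lift of version B over version A.
lift <- unname(rates["B"] - rates["A"])
print(lift)
```

A small p-value suggests the difference in conversion rates is unlikely to be chance alone, but whether the lift is large enough to matter is a business question.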

Wednesday, June 6: All two-person project teams will give 5 to 7 minute presentations. The goal is to fascinate, impress, and surprise your audience. Think of this as the “elevator pitch” for your project.

Friday, June 8, 1pm or other times as agreed: Quiz for R Certification. The quiz emphasizes data manipulation in R: selecting data subsets, creating new variables, and rearranging and redefining data such as event logs. The other requirements for R certificates are completing your project using appropriate R programming, and attending 50% of TA tutorials.
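
As a rough illustration of the kinds of data manipulation the quiz emphasizes, here is a short base R sketch. The event log, its column names, and its values are invented for this example; the actual quiz questions may look quite different.

```r
# Hypothetical event log: one row per event (made-up data).
events <- data.frame(
  user    = c("u1", "u1", "u2", "u2", "u2", "u3"),
  event   = c("view", "buy", "view", "view", "buy", "view"),
  seconds = c(30, 120, 45, 15, 200, 60),
  stringsAsFactors = FALSE
)

# 1. Select a subset of rows and columns.
purchases <- events[events$event == "buy", c("user", "seconds")]

# 2. Create a new variable from an existing one.
events$minutes <- events$seconds / 60

# 3. Rearrange the log: one row per user, one column per event type.
counts   <- table(events$user, events$event)
per_user <- as.data.frame.matrix(counts)
per_user$user <- rownames(per_user)

# 4. Redefine the data at a different level: total time per user.
time_per_user <- aggregate(seconds ~ user, data = events, FUN = sum)

print(per_user)
print(time_per_user)
```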

Friday, June 8 midnight: Formal due date for final project papers.
All project teams that request one receive an automatic extension until Wednesday.
Submit both hard copy and PDF files. Submit via Turnitin, on TritonEd.

June 11.   Wednesday, June 13. Deadline for  projects.

Careers in Data Analytics discussion Monday May 14, 12:30

GPS alumnus Nick Beaudoin (Class of 2015) will hold a session next Monday, May 14, on “Careers in Data Analytics” from 12:30 to 1:30 pm. It’s in Room 3107 (Building 3, downstairs corner conference room).

Nick works for Deloitte, but it’s not a recruiting session. He will discuss general advice and skills needed to pursue a career in data analytics with any (large) company.  Please  RSVP for the event on GPScareers. Ask Kristen if you have any questions.

Week 6 Notes: Text Mining

This page is the weekly summary of relevant material for assignments, projects, etc. Edited May 8.

  1. Projects: all projects should have downloaded live data by this weekend. If you have trouble with that deadline, it is time to drastically prune your goals. You can always add more elaborate analysis after you do the basics, but the opposite is not true.
    • If you did not submit a project update last weekend (nothing since #3, April 29), come to Wednesday night office hours, or text me to suggest another time.
    • Every team should expect to meet with me at least twice after your basic project is approved. The first meeting usually discusses whether you have formulated the problem in a way that is solvable with the data you have. Roughly 30% of the projects need to change the problem statement at this stage, which is easier than getting more data.
    • The second meeting generally looks at your results, and finds ways to boost them / make them more interesting.
  2. The next homework is due Friday May 11. It is problem 20.3 from the main DMBA textbook. Data files are available on the book’s website. Details are on the linked assignment from the syllabus page. Latest syllabus, assignments, + notes for #BDA Big Data Analytics at UC San Diego
    1. Error in file name! The book authors have scrambled the file name. On their web site they call it by a different name. (They have yet a third name in the textbook!!) You can download it from them or from the course Google page under either name.
  3. Key resource list: reference books on R and on specialized data mining methods. Resources for Mining + R language. Information to solve 98% of your R and data mining algorithm problems is available from this page! (You still need to figure out how to formulate your business problem as Data Mining. Many of the references give advice about this, but it is not reducible to a “cookbook.”)
  4. Text-mining specific resources. Text-mining resources for projects
  5.  Check out the seminar next Monday on careers. Careers in Data Analytics discussion Monday May 14, 12:30
  6. One new book is particularly worth checking out if you are struggling with messy data sets. It discusses packages that have been developed specifically for common data manipulation problems in machine learning/data mining. Several of Feiyang’s tutorials use this book.
    R for Data Science: covers the tidyverse, a set of new tools for file and data manipulation. Much more efficient than raw R, and faster to write code with. Chapter 5 is probably the place to start. This book is available for $20 or from the library, but the same material is on a web site.
  7. Weekly notes. Text-Mining Bohn Day 1+2.
  8. The file we used for a tm tutorial on Monday, called Basic Text Mining in R. It uses slightly different techniques than the textbook: instead of Latent Semantic Analysis, it takes out the rare words.
  9. Files for Wednesday’s class, section 20.5
    1. May 9 list of functions. Keep this as a “cheat sheet,” and add to it as you read and do mining.
    2. Assignment: Text mining #2 2018b
    3. Lecture notes: Text-Mining Bohn Day 1+2
  10. The R files cannot be stored on WordPress. Get them on the course’s Google drive page. You can reconstruct my work with two files:
    • Files ending in .R contain the code which I wrote and kept in RStudio’s Source window (upper left). This code can be run. Note: if I actually ran additional code but did not save it in the Source file, it appears in History, but not in Source.
    • Files ending in .RData contain the entire Global Environment of variables.
    • By loading both of these into RStudio, you can recreate the state at the time I saved them. Thus all calculations are preserved.
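
Item 8 above mentions taking out the rare words rather than using Latent Semantic Analysis. Here is a minimal sketch of that idea in base R, on a made-up toy corpus. It is not the code from the tutorial file; the tm package offers removeSparseTerms() for the same purpose, and the threshold min_docs is an assumed value that you would tune for your own data.

```r
# Toy corpus (made-up documents, not the class data).
docs <- c(
  "the game crashed during login",
  "login failed and the game froze",
  "great game with smooth login",
  "refund requested after purchase"
)

# Tokenize: lower-case, then split on anything that is not a letter.
tokens <- strsplit(tolower(docs), "[^a-z]+")

# Build a document-term matrix of word counts (rows = documents).
vocab <- sort(unique(unlist(tokens)))
dtm <- t(sapply(tokens, function(words) table(factor(words, levels = vocab))))

# Document frequency: in how many documents does each term appear?
doc_freq <- colSums(dtm > 0)

# Keep only terms that appear in at least min_docs documents.
min_docs  <- 2   # assumed threshold; tune it for your corpus
dtm_small <- dtm[, doc_freq >= min_docs, drop = FALSE]

print(dim(dtm))
print(colnames(dtm_small))
```

Dropping rare terms shrinks the matrix a great deal, which makes later modeling faster, at the cost of losing words that might matter for a few documents.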

Week 5: Random Forests, R, debugging

The ostensible material this week is Random Forests. They are a generalization of Classification/Regression Trees. No assignment for Monday; everything will be done in class. Redo the class material and hand it in on Wednesday (not Friday – we will go over the results in class.)

Here is the assignment for Wednesday, including the readings on Random Forests.  BDA18 Random Forest Assign May 2, 2018.

Here is the material used in class. BDA18 Random Forests2018B

The other agenda this week is to develop your skills in debugging. This is the key skill of writing code: figuring out what is wrong and how to fix it.

Week 6 covers text mining. Here is the assignment for next Monday: BDA18 Text mining #1 assign. (Short assignment handed in Sunday night.)

Playstation #SIE-PSN information

Information dissemination for Sony.

This page is only for BDA participants who are working with the Sony Playstation data set. The information flow will usually be:

SIE   ===> William ===> Professor ====> This page ====> Participants

Ground Rules for this Project

  • You must have signed an NDA and gotten it to William before getting any data. No exceptions. You must take the NDA seriously, including protecting the safety of electronic data.
  • For questions and information flow in the other direction, William will answer questions submitted by email. But expect responses to take 3 days, so mainly figure out answers on your own.
  • This project will require initiative on your part. Huge amounts of information relevant to the project are around, but you have to identify and locate it yourselves. Examples include thorough documentation for Adobe’s web statistics package, the structure of the web site itself, and the meanings of the variables (many of which are standard Adobe material). You will deal with incomplete and occasionally even erroneous data.
  • Creativity and curiosity will be well rewarded. This is a great opportunity – take advantage of it.

Next steps

Write to me with confirmation that you did the NDA, and that both team members are willing and able to work on the project. I will send back the URLs for the data. Write by team. Be sure to use the hashtag #SIE-PSN in all your email about this project.

Advice and Insights from Wednesday’s Lunch

  1. The supplied data is of two kinds: Sony’s mobile PSN app, and the PSN website. You probably want to concentrate on one or the other. Both are of great interest to Sony.
  2. One hour of data should be plenty to start with. Web and mobile sessions average less than an hour, so you should have many complete sessions in one hour. Limiting the data in this way speeds up development activity.
  3. As Blake put it, visiting the actual website and using the mobile app yourself are “table stakes.” In any project, what you see in a database is only a pale reflection of reality. Tracing their behaviors carefully will give lots of insights.
  4. I allow and encourage cooperation between the teams. This may be especially useful in the early data munging, and in finding reference material. This is not a zero-sum activity; collaboration and producing public goods will earn you “points.”
  5. You can immediately strip the data down to less than 100 variables. For example, he described how many variables come in two variants, pre and post processing. You can throw away the unprocessed variables.
  6. I recommend studying raw data at first, but then transitioning to session data. Each time a user “hits” a Sony server, it creates a transaction record. Informally we call those “clickstream data,” although it is actually more aggregated than that.
  7. Getting session data involves sorting everything by user ID. dplyr can probably do the sorting of a single hour in one pass, but only if you have shrunk the dataset first in various ways. As few as 50 variables may be enough to get started with.
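
Points 5 to 7 above (shrink the data, sort by user ID, then build session data) can be sketched roughly as follows. This uses base R on made-up data. The column names user_id, hit_time, and page are hypothetical and are not the real Sony or Adobe variable names. With dplyr, the same steps would use select(), arrange(), group_by(), and summarise().

```r
# Hypothetical hit-level data (made-up; real column names will differ).
hits <- data.frame(
  user_id  = c("b", "a", "b", "a", "a", "c"),
  hit_time = c(40, 5, 10, 65, 20, 30),    # seconds since the start of the hour
  page     = c("store", "home", "home", "cart", "store", "home"),
  unused1  = NA,                           # stand-ins for variables you drop
  unused2  = NA,
  stringsAsFactors = FALSE
)

# 1. Shrink first: keep only the variables you need.
keep <- c("user_id", "hit_time", "page")
hits <- hits[, keep]

# 2. Sort by user, then by time within each user.
hits <- hits[order(hits$user_id, hits$hit_time), ]

# 3. Collapse hits into one row per user (a crude "session").
sessions <- data.frame(
  user_id  = sort(unique(hits$user_id)),
  n_hits   = as.vector(tapply(hits$hit_time, hits$user_id, length)),
  duration = as.vector(tapply(hits$hit_time, hits$user_id,
                              function(t) max(t) - min(t)))
)

print(sessions)
```

This treats all of a user's hits in the hour as one session, which is a simplifying assumption. A more careful version would start a new session after a long gap between hits.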

Files and sites

  • BDA18 Sony PSN project update 4-19. What kinds of insights is Sony looking for? What variables should you start with?
  • An R file for early processing of the data: Sony_BDA_Preprocessing. It has lots of comments and explanations. It looks like routine R stuff that you can easily figure out yourself. For example:
    • # How do I call out a single variable in my dataframe?  dataframe$column, for example hit_data$prop75
    • # What are the unique values in a variable?    unique(hit_data$prop75)
    • # How many unique values?      length(unique(hit_data$prop75))
  • Powerpoint presentation used in class. SIE Adobe for GPS BDA 2018
  • Sony web sites: playstation network

Participants approved so far

  • Allen Tian, Sylvia Wu



Material for wk. 4, linear regression. Update 4/26

  1. NEW!  Playstation  #SIE-PSN teams – I am setting up a page for information about this project. THE DEADLINE IS NOW – you must be either IN or OUT so we can move forward. On the + side, Sony is very serious about hiring, and the experience you will get will be immediately relevant to many potential employers. On the – side, you will need to be a self-starter, seek out documentation and information yourself, and generally have a “hands-on” attitude. For example, in order to understand the data coming from the Sony SIE website, you will need to actually visit and examine the site.
  2. NEW!  Lecture notes from Monday 4/23 and 4/25, updated through Wednesday. BDA18 regression 2018B.key
  3. For the EPA homework, I found a Word document containing the R code. It may be easier to work with: BDA16S BohnDA-gram data editing in R. Other than the format, it should be almost identical to the one posted with the homework.
  4. The best books and sites about R and data mining. For each project, there are some specialty books and web sites that can save you hours or even days of effort.  It is organized into:
    • Reference books
    • Cheat sheets
    • Resources for learning R
    • Books on special topics.
  5. The minute you start working with R directly, run to this page and download a few resources. Especially get at least 1 cheat sheet. I handed one out in class – now get the e-version.
  6. Answers to some short homework questions from last week – I will post them soon.
  7. NEW! Answers to questions that were on the yellow post-its. DONE
  8. Study guides for ROC curves and other topics about classification models (week 3). I have put them on a new page of supplemental notes. Lecture note supplements
  9. We got a fan note from the author of our Rattle book:
    Hi Roger,

    Just saw your blog post on installing Rattle on mac OS X. Thank you so much for that.
    I’ve added a pointer on my rattle install page (  … which needs a refresh one day :-).
    You are correct, I don’t have any ready access to a Mac and spend most of my time on Linux.
    It is great that you shared your experience – I know many others will find this useful and no doubt will be appreciative as well.
    All the best.



Homework week 4: Linear regression

This week we have 3 learning goals. It will take the entire week to do them.

  1. Linear regression for prediction. How it differs from hypothesis testing.
  2. Showing how to use R instead of, or in conjunction with, Rattle.
  3. Many specific tricks and issues that come up with linear regression, such as word equations and creating interaction variables.
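
Goals 1 and 3 can be illustrated with a short base R sketch. It uses the built-in mtcars data purely as a stand-in for project data, and the choice of predictors and the 8-row holdout are assumptions made for the example, not part of the homework.

```r
# Built-in mtcars data, used only as a stand-in for your project data.
data(mtcars)

# Hold out some rows: prediction is judged on data the model never saw.
set.seed(1)
test_rows <- sample(nrow(mtcars), 8)
train <- mtcars[-test_rows, ]
test  <- mtcars[test_rows, ]

# wt * hp expands to wt + hp + wt:hp, where wt:hp is the interaction term.
fit <- lm(mpg ~ wt * hp, data = train)
print(coef(fit))

# For prediction, look at out-of-sample error (RMSE) rather than p-values.
pred <- predict(fit, newdata = test)
rmse <- sqrt(mean((test$mpg - pred)^2))
print(rmse)
```

In hypothesis testing you would study the coefficient table and its p-values. For prediction, the holdout error is the number that matters.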

Please see the attached document, which includes the readings, specific homework due Friday, and supplemental information about various useful ideas and techniques.

BDA18 Week 4 Readings + assign

You can get  data files here:

There is nothing due on Monday.