Week 6 Notes: Text Mining

This page is the weekly summary of relevant material for assignments, projects, etc. Edited May 8.

  1. Projects: all projects should have downloaded live data by this weekend. If you have trouble with that deadline, it is time to drastically prune your goals. You can always add more elaborate analysis after you do the basics, but the opposite is not true.
    • If you did not submit a project update last weekend (nothing since #3, April 29), come to Wednesday night office hours, or text me to suggest another time.
    • Every team should expect to meet with me at least twice after your basic project is approved. The first meeting usually discusses whether you have formulated the problem in a way that is solvable with the data you have. Roughly 30% of the projects need to change the problem statement at this stage, which is easier than getting more data.
    • The second meeting generally looks at your results, and finds ways to boost them / make them more interesting.
  2. The next homework is due Friday May 11. It is problem 20.3 from the main DMBA textbook. Data files are available on the book’s website. Details are in the assignment linked from the syllabus page: Latest syllabus, assignments, + notes for #BDA Big Data Analytics at UC San Diego.
    1. Error in file name! The book authors have scrambled the name of the file AutoElectronics.zip. On their web site they call it AutoAndElectronics.zip. (They have yet a third name in the textbook!!) You can download it from them at http://www.dataminingbook.com/system/files/AutoAndElectronics.zip or from the course Google page under either name.
  3. Key resource list: Reference books on R and on specialized data mining methods. Resources for Mining + R language. Information to solve 98% of your R and data mining algorithm problems is available from this page! (You still need to figure out how to formulate your business problem as Data Mining. Many of the references give advice about this, but it is not reducible to a “cookbook.”)
  4. Text-mining specific resources. Text-mining resources for projects
  5. Check out the seminar next Monday on careers. Careers in Data Analytics discussion Monday May 14, 12:30
  6. One new book is particularly worth checking out if you are struggling with messy data sets. It discusses packages that have been developed specifically for common data manipulation problems in machine learning/data mining. Several of Feiyang’s tutorials use this book.
    R for Data Science: The tidyverse is a set of new tools for file and data manipulation. It is much more efficient than raw R, and faster to write code with. Chapter 5 is probably the place to start. This book is available for $20 or from the library, but the same material is on a web site http://r4ds.had.co.nz/
  7. Weekly notes. Text-Mining Bohn Day 1+2.
  8. The file we used for a tm tutorial on Monday is called Basic Text Mining in R. Its techniques differ slightly from the textbook’s: for example, it does not use Latent Semantic Analysis, and instead takes out the rare words.
  9. Files for Wednesday’s class, section 20.5
    1. May 9 list of functions. Keep this as a “cheat sheet,” and add to it as you read and do mining.
    2. Assignment: Text mining #2 2018b
    3. Lecture notes: Text-Mining Bohn Day 1+2
  10. The R files cannot be stored on WordPress. Get them on the course’s Google drive page. You can reconstruct my work with two files:
    • Files ending in .R contain the code which I wrote and kept in RStudio’s Source window (upper left). This code can be run. Note: any additional code that I ran but did not save in the Source file appears in History, but not in Source.
    • Files ending in .RData contain the entire Global Environment of variables.
    • By loading both of these into RStudio, you can recreate the state at the time I saved them. Thus all calculations are preserved.
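
As a rough sketch of how the two file types fit together, the round trip below saves a tiny workspace, clears it, and restores it. All file and variable names are made up for illustration; the real files on the Google drive have different names.

```r
# Self-contained round trip: save a small workspace, clear it, restore it.
# demo_session.RData and word_counts are hypothetical names.
demo_file <- file.path(tempdir(), "demo_session.RData")
word_counts <- c(engine = 12, battery = 7, screen = 3)

save(word_counts, file = demo_file)  # save.image(demo_file) would save every variable
rm(word_counts)                      # the variable is now gone from the environment

restored <- load(demo_file)          # load() returns the names of the restored objects
print(restored)
print(word_counts)

# With the real course files (placeholder names), the equivalent steps would be:
# load("classfile.RData")            # restores the saved Global Environment
# source("classfile.R", echo = TRUE) # reruns the saved Source window code
```

In RStudio, double-clicking the .RData file or using the Environment pane's load button does the same thing as load().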

Homework week 4: Linear regression

This week we have 3 learning goals. It will take the entire week to do them.

  1. Linear regression for prediction. How it differs from hypothesis testing.
  2. Showing how to use R instead of, or in conjunction with, Rattle.
  3. Many specific tricks and issues that come up with linear regression, such as word equations and creating interaction variables.
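
A minimal sketch of goals 1 and 3 in plain R, assuming item 3 refers to interaction terms. It uses the mtcars data set that ships with R as a stand-in for the homework data; the split size and the choice of variables are arbitrary.

```r
# mtcars ships with R; it stands in for the homework data here.
set.seed(1)
train_rows <- sample(nrow(mtcars), 22)   # arbitrary split: 22 rows to train on
train <- mtcars[train_rows, ]
holdout <- mtcars[-train_rows, ]

# wt * hp expands to wt + hp + wt:hp, where wt:hp is the interaction term.
fit <- lm(mpg ~ wt * hp, data = train)
coef_names <- names(coef(fit))
print(coef_names)

# For prediction, the model is judged on rows it has not seen,
# not on the p-values of its coefficients.
pred <- predict(fit, newdata = holdout)
rmse <- sqrt(mean((holdout$mpg - pred)^2))
print(rmse)
```

The holdout error is the number to compare across candidate models; summary(fit) still gives the usual hypothesis-testing output if you want to see both views.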

Please see the attached document, which includes the readings, specific homework due Friday, and supplemental information about various useful ideas and techniques.

BDA18 Week 4 Readings + assign

You can get data files here: https://bda2020.wordpress.com/data-sets/

There is nothing due on Monday.


Misc. announcements: homework, Sony projects, tutorial on R, etc.

Several notices about events tomorrow and later.

Student Question: What cutoff in homework problem 10.4i?

Dana writes:

Hey everyone!

I have a question about 10.4 i. exercise. In Rattle we use default setting and cannot change the cutoff, so are we supposed to guess the cutoff for the most accurate classification? Or are we supposed to get it some other way?
Response: Good question. I have 3 levels of answer. At the simplest, should you push the cutoff up or down from 50%? (Be sure to specify which way is which – sometimes it can be ambiguous).
At the next level, Rattle produces an ROC curve, as we did with the airplane data on Wednesday. See textbook p 131. The ROC curve is traced out by moving the threshold all the way from 0 to 1.
Third level: Soon, you will learn how to grab the code produced by Rattle, and run it in RStudio. There you can change the cutoff parameter and calculate different confusion matrices. For example, the function confusion.matrix(obs, pred, threshold = 0.5) accepts any threshold, so you can experiment with different values by trial and error. Of course, there are more specialized functions that can come up with the optimal answer.
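
The trial-and-error idea can be sketched in base R without any extra package. The outcomes and probabilities below are made up for illustration, and confusion_at and accuracy_at are hypothetical helper names, not functions from Rattle or the textbook.

```r
# Made-up actual outcomes and predicted probabilities, for illustration only.
obs  <- c(1, 0, 1, 1, 0, 0, 1, 0)
prob <- c(0.90, 0.40, 0.65, 0.45, 0.20, 0.35, 0.80, 0.10)

# Hypothetical helper: predict class 1 when the probability is at least the cutoff.
confusion_at <- function(obs, prob, cutoff) {
  pred <- as.integer(prob >= cutoff)
  table(actual = factor(obs, levels = c(0, 1)),
        predicted = factor(pred, levels = c(0, 1)))
}

# Hypothetical helper: share of cases classified correctly at a given cutoff.
accuracy_at <- function(obs, prob, cutoff) {
  mean(as.integer(prob >= cutoff) == obs)
}

# Try several cutoffs and compare.
cutoffs <- c(0.25, 0.50, 0.75)
accuracy <- sapply(cutoffs, function(k) accuracy_at(obs, prob, k))

print(confusion_at(obs, prob, 0.50))
print(data.frame(cutoff = cutoffs, accuracy = accuracy))
```

Raising the cutoff trades false positives for false negatives, so the most accurate cutoff is not necessarily the best one if the two kinds of error have different costs.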

Sony Playstation Network project

Anyone considering the Sony Playstation Network project, read this memo: BDA18 Sony PSN project update 4-19. It points you to some data, and requests a revised proposal as soon as possible. Or, switch to another topic. Make comments on this page if you are looking for a partner, or on the Final paper ‘dating site’.

R Tutorial Friday in room 3201 at 1pm

Feiyang will provide several resources for learning R at the level we need it in BDA. She will demonstrate how functions work, and illustrate function use with examples from the textbook. Other topics will include using R Help, and good cheat sheets. This will all be useful next week when we move away from Rattle and toward straight R.

Next week: Linear continuous models (Linear Regression)

Next week, we will look at a method that everyone has seen in a different context, OLS linear regression. I don’t like the textbook treatment of the topic, so I’m assigning a supplemental book. Please read:
Gareth James et al, An Introduction to Statistical Learning with Applications in R. (Supplementary textbook.) It’s available from Springerlink. Review section 3.2, which should be familiar, and read section 3.3 on variations on the basic linear model. Also read the DMBA main textbook, only sections 6.1 and 6.2.
A more detailed assignment will be posted Saturday, and nothing is due Sunday.

Office hours are set!

Professor Roger Bohn
Office = RBC 1315. Phone and text (858) 381-2015.
Email Rbohn@ucsd.edu. Put #BDA18 in the subject line of all emails.

TA Feiyang.Chen@rady.ucsd.edu

Office Hours (final). There are office hours on Monday, Wednesday, Friday.

  • Wednesday 6:30 to 7:30 PM in Peet’s Coffee. Prof. will stay later if there is a long line. Text (858) 381-2015 to reserve a late time.
  • Fridays 4:00 to 5:15 in my office RBC 1315. Prof. will stay as long as needed to talk to anyone who is in line at 5:05.
  • TA Office hours Monday 1 to 2 in RBC 3128 (shortly after class).
  • TA R tutorial Fridays 1 to 2 in Gardner Auditorium

Notes from class 3, CART using Rattle

To: Big Data Analytics students
From: Prof. Roger Bohn
Subject: Class #3 Monday April 9 – next steps, Q&A, homework schedule
Date: April 9, 2018

The lecture notes were provided before class. Visit Latest handouts. We did not cover all of them, and will continue with the CART algorithm on Wednesday before discussing Toyota.

Another topic we discussed, not in the notes: Benefits and disadvantages of open source software.

Please email (or put in comments on this page) questions about the Weather exercise from the Rattle book. Several people asked good questions about Toyota after class. If there are no more questions about how to use Rattle, we will move right into the next segment on Wednesday.

Toyota homework now due Friday at Noon. The TritonEd assignment has been updated.

Still having trouble with Rattle? See Feiyang at 4pm today. Location unclear, check near GPS office 3132. Feiyang is polling about what her tutorial hours should be. Please respond to her Doodle poll at https://goo.gl/forms/5rjhpIjevaewMBjJ2. No response = you don’t get a vote.

Other questions asked in class and not answered:

  • Can we have a group of 3 for homework? No. You can discuss with others if you put their names in a note. But only 2 people should work on the actual memo answers.
  • Grading scale, grading policy. I will post something about this. Homework is graded on a 0 to 10 scale. An average of 8 is fine.
  • How to find other people who are interested in projects. I just created a page specifically for that. Final paper ‘dating site’
  • Where to learn more R. Attend the TA tutorials, and I will shortly post a list of recommended websites and readings. This page is a starting point: Resources for R language