Week 6 Notes: Text Mining

This page is the weekly summary of relevant material for assignments, projects, etc. Edited May 8.

  1. Projects: all projects should have downloaded live data by this weekend. If you have trouble with that deadline, it is time to drastically prune your goals. You can always add more elaborate analysis after you do the basics, but the opposite is not true.
    • If you did not submit a project update last weekend (nothing since #3, April 29), come to Wednesday night office hours, or text me to suggest another time.
    • Every team should expect to meet with me at least twice after your basic project is approved. The first meeting usually discusses whether you have formulated the problem in a way that is solvable with the data you have. Roughly 30% of the projects need to change the problem statement at this stage, which is easier than getting more data.
    • The second meeting generally looks at your results, and finds ways to boost them / make them more interesting.
  2. The next homework is due Friday May 11. It is problem 20.3 from the main DMBA textbook. Data files are available on the book’s website. Details are in the assignment linked from the syllabus page: Latest syllabus, assignments, + notes for #BDA Big Data Analytics at UC San Diego
    1. Error in file name! The book authors have used inconsistent names for the file AutoElectronics.zip. On their web site they call it AutoAndElectronics.zip. (The textbook uses yet a third name!) You can download it from them at http://www.dataminingbook.com/system/files/AutoAndElectronics.zip or from the course Google page under either name.
  3. Key resource list: reference books on R and on specialized data mining methods. Resources for Mining + R language. Information to solve 98% of your R and data mining algorithm problems is available from this page! (You still need to figure out how to formulate your business problem as a data mining problem. Many of the references give advice about this, but it is not reducible to a “cookbook.”)
  4. Text-mining specific resources. Text-mining resources for projects
  5. Check out next Monday’s seminar on careers: Careers in Data Analytics discussion, Monday May 14, 12:30
  6. One new book is particularly worth checking out if you are struggling with messy data sets. It discusses packages that have been developed specifically for common data manipulation problems in machine learning/data mining. Several of Feiyang’s tutorials use this book.
    R for Data Science: covers the tidyverse, a set of new tools for file and data manipulation. Much more efficient than raw R, and faster to write code with. Chapter 5 is probably the place to start. This book is available for $20 or from the library, but the same material is on the web at http://r4ds.had.co.nz/
  7. Weekly notes. Text-Mining Bohn Day 1+2.
  8. The file we used for the tm tutorial on Monday is called Basic Text Mining in R. It uses slightly different techniques than the textbook: for example, it does not use Latent Semantic Analysis, and instead takes out the rare words.
  9. Files for Wednesday’s class, section 20.5
    1. May 9 list of functions. Keep this as a “cheat sheet,” and add to it as you read and do mining.
    2. Assignment: Text mining #2 2018b
    3. Lecture notes: Text-Mining Bohn Day 1+2
  10. The R files cannot be stored on WordPress. Get them from the course’s Google Drive page. You can reconstruct my work with two kinds of files:
    • Files ending in .R contain the code which I wrote and kept in RStudio’s Source window (upper left). This code can be run. Note: if I ran additional code but did not save it in the Source file, it appears in History, but not in Source.
    • Files ending in .RData contain the entire Global Environment of variables.
    • By loading both of these into RStudio, you can recreate the state at the time I saved them. Thus all calculations are preserved.
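
The two-file workflow in item 10 can be sketched roughly as follows. This is only an illustration: the file names, the variable `word_counts`, and the one-line script are hypothetical stand-ins created in a temporary folder, not the actual course files. It uses base R's `save()`, `load()`, and `source()`.

```r
# Hypothetical names throughout; the real course files have their own names.
work_dir    <- tempdir()
rdata_path  <- file.path(work_dir, "week6_demo.RData")
script_path <- file.path(work_dir, "week6_demo.R")

# 1. Stand-in for a saved workspace: one variable written to an .RData file.
word_counts <- c(engine = 12, warranty = 7, battery = 3)
save(word_counts, file = rdata_path)

# 2. Stand-in for a saved Source window: a one-line .R script.
writeLines("doc_total <- sum(word_counts)", script_path)

# 3. Restore the variables, then rerun the code against them.
#    A separate environment keeps this demo from overwriting your own
#    variables; in an interactive session you could omit it.
demo_env <- new.env()
restored_names <- load(rdata_path, envir = demo_env)
source(script_path, local = demo_env)

print(restored_names)                      # names of the restored objects
print(get("doc_total", envir = demo_env))  # value computed by the script
```

In RStudio, opening the .RData file or calling `load()` without the `envir` argument places the variables directly in the Global Environment, which is closer to what the notes describe.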

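Item 8 mentions taking out rare words instead of applying Latent Semantic Analysis. A minimal base-R sketch of that idea is below. The toy documents and the `min_docs` threshold are made up for illustration, and this is not necessarily how the tutorial file does it. In the tm package, `removeSparseTerms()` serves a similar purpose on a document-term matrix.

```r
# Toy corpus (hypothetical documents, for illustration only).
docs <- c(
  "the engine stalls when the engine is cold",
  "battery and engine warranty claim",
  "new battery for the laptop",
  "laptop screen warranty"
)

# Lower-case and split each document into words on whitespace.
tokens <- strsplit(tolower(docs), "\\s+")

# Build a document-term matrix of raw counts: one row per document,
# one column per vocabulary word.
vocab <- sort(unique(unlist(tokens)))
dtm <- t(vapply(
  tokens,
  function(tok) as.integer(table(factor(tok, levels = vocab))),
  integer(length(vocab))
))
colnames(dtm) <- vocab

# Document frequency: in how many documents does each word appear?
doc_freq <- colSums(dtm > 0)

# Keep only words that appear in at least min_docs documents.
min_docs <- 2   # assumed threshold; tune it for your own data
dtm_pruned <- dtm[, doc_freq >= min_docs, drop = FALSE]

print(dim(dtm))          # size before pruning
print(dim(dtm_pruned))   # size after pruning
print(colnames(dtm_pruned))
```

Dropping rare words shrinks the matrix by removing columns, whereas Latent Semantic Analysis shrinks it by combining columns into a smaller set of derived dimensions. Pruning is simpler and keeps the remaining columns interpretable as actual words.
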
Author: Roger Bohn

Professor of Technology Management, UC San Diego. Visiting Stanford Medical School. Rbohn@ucsd.edu. Twitter: Roger.Bohn