Lecture note supplements

From time to time I write guides/tutorials on topics in lectures that people find confusing. Taken together, they add up to a supplemental textbook.


Resources for Mining + R language

Lots of books and web sites to use.

[Updated 05/08/2018] There are a lot of good resources for the R programming language, and for data mining/machine learning/AI/BDA. There are video courses, books, reference sites, discussion boards, and plenty more. The single best place to look for resources is probably Computerworld, and its guide 60+ R resources to improve your data skills. Each resource has a few sentences of description. Another good place for information is the UCSD library, which has a well-organized UCSD Guide to Business Analytics, just as it has guides for political science, international studies, etc.

Remember that you don’t need to learn much R in order to use it for analytics. What you need in the course is enough to 1) glue together pieces of R code that do particular tasks, and 2) read guides to specialized topics, such as web scraping, text mining, or particular algorithms, that you need for an individual project.

Cheat Sheets

Good reference books about R

Several books and web sites contain “recipes,” meaning chunks of R code to do particular tasks. These are big time-savers, although they are not a good way to learn the language. Everyone should get at least one of these, as an e-book, a physical book, or a permanent bookmark in your notes! Here are a few:

http://proquest.safaribooksonline.com/book/programming/r/9780596809287

  • Data Wrangling with R: available for complete download from the library’s Springerlink site (Springerlink.com), but only on campus or via VPN.
  • There are several other good books, but they are expensive. If you are not on a budget, ask me.
  • The tidyverse is a set of new tools for file and data manipulation. It is much more efficient than base R, and faster to write code with. Chapter 5 is probably the place to start. The book is available, but the same material is on a web site: http://r4ds.had.co.nz/

 

Library downloadable books on Data Mining using R

These books are the ones to study when you want to learn a data mining technique. All of them use R as the primary language. They are textbooks about machine learning, whereas the books listed earlier are about the R language itself and are written as reference books.

Rattle, the second textbook for BDA. Used because it has an easy interface. http://link.springer.com/book/10.1007/978-1-4419-9890-3

ISLR = An Introduction to Statistical Learning with Applications in R: http://link.springer.com/book/10.1007/978-1-4614-7138-7 Course textbook #3. More theoretical than the other books in this list, it has good explanations of how and why important algorithms work.

 ggplot2 http://link.springer.com/book/10.1007%2F978-0-387-98141-3

The main graphics system we use. This book was written by ggplot2’s developer, and covers the early version of the software. A new edition is due out in 2018.

  http://link.springer.com/book/10.1007/978-1-4419-1318-0

If you know Stata and are learning R, this book is good for looking up “how do I do that?”

http://link.springer.com/book/10.1007%2F978-3-319-12066-9

A short book that covers the basics of data mining, with everything written in R 

R for specific kinds of analysis (networks, GIS, marketing, ….)

Springerlink publishes a series of more than 60 books on different uses of R: https://link.springer.com/bookseries/6991 They are at the intermediate level, about right for refining your knowledge of special techniques needed for a project. Examples: Spatial analysis in R, R for Market Research, Data Wrangling with R, ggplot2 (several books), Political analysis using R, Analyzing Networks using R, Phylogenetics with R, etc. All of them are free to download, or you can buy them as paperback books for $25.

Because the field is so trendy, practically every business and textbook publisher has books on data mining and related topics. You can search them through the UCSD book catalog, UCSD.worldcat.org. For example, here are 2000+ e-books about ‘Machine Learning’. That is not a misprint, and all are available through UCSD in some form.

Last, there are literally dozens of books about R/statistics written for a particular audience, or exploring a particular applied statistics topic. The list below contains books I have found especially relevant to this course. Note that many of them are specifically for reference: when you need to do something specific, look it up in one of these books. Others are intended for learning from.

For searching on your own, e.g. on Google Scholar, good phrases are data mining, machine learning, data analytics (broader), data science, and specific topics for your application, such as fraud detection. Use quotation marks around these phrases! Finally, there are many 20 to 50 page articles that cover the basics of particular R topics. These are often more up to date than books, and are a better way to get started on a topic.

Mining Text Data    

R for Marketing Research and Analytics

A User’s Guide to Network Analysis in R

Statistical Analysis of Network Data with R

Applied Spatial Data Analysis with R (Geographic Info systems)

Graphical Models with R

Six Sigma with R

Introductory Time Series with R

Applied Econometrics with R

Nonlinear Regression in R

Data Manipulation with R

 

2018 Big Data Analytics: materials

Welcome! We will use one principal textbook, with a variety of supplements.

The TA will be available to help with software installation before and after the class of April 4. Location uncertain. Look for her in the lobby of the auditorium. 

1. Main Text DMBA: Data Mining for Business Analytics: Concepts, Techniques, and Applications in R, by Galit Shmueli et al.

This textbook is required.  It’s a good survey of the topic. It uses R, which is the only language we will use in the course. This book will be referred to as DMBA.

You can get the book on Amazon for about $106, or on Kindle for $90, or as an online version from a company called VitalSource for about $100. I’m not familiar with VitalSource, but they appear to have sensible study aids, and they claim “read anywhere, 100% offline.” Their listing is Data Mining for Business Analytics: Concepts, Techniques, and Applications in R, 1st edition, ISBN 9781118879368. You can even rent the book from Amazon for $44. With these options, everyone should find it worthwhile to have their own copy for the assignments, studying, etc. I have asked the UCSD bookstore to get it. I suspect they will charge close to the list price.

The UCSD library has the e-book. It is on their ProQuest platform. The rules on that platform limit use to 3 simultaneous users, so when you are done reading a section, close the page. http://roger.ucsd.edu/record=b9688724

2. Supplemental text for first 2 weeks: DMRR

Our supplemental textbook is: Data Mining with Rattle and R: The Art of Excavating Data for Knowledge Discovery, by Graham Williams.

Log in from campus (or VPN) so you don’t have to pay for it.

3. Other reference books (optional)

The UCSD library, via Springerlink.com and other sources, has a variety of good books on data mining, AI, business analytics, and so forth. All are available free, and most can be downloaded as PDFs. We will use chapters from some of these books. By week 4, check out Springerlink, which has thousands of free technical books on every computer language and applied math method you can think of. If you like physical books, hard copy versions of all Springerlink books are available for $25 each. Springer has a collection called “Use R!” of about 100 books, at http://link.springer.com/search/page/2?facet-content-type=%22Book%22&query=%22use+R%22

ISLR: For reference and to fill holes on statistical issues, I will mainly refer to An Introduction to Statistical Learning with Applications in R  (ISLR) by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani.

R-Stata: If you are proficient with Stata, get this book: R for Stata Users. Similarly, there are “translation” books for other computer languages.

Some of these books and web courses are also available in Chinese. A  few are available in other languages.

4. Software

We will use the following software: the R statistical language; the Rattle package for R, which provides a graphical user interface (GUI); RStudio; and numerous special-purpose analysis packages that you load via R as the course unfolds. All of this software is free.