Resources for Mining + R language

Lots of books and web sites to use.


[Updated 05/08/2018] There are a lot of good resources for the R programming language, and for data mining/machine learning/AI/BDA: video courses, books, reference sites, discussion boards, and plenty more. The single best place to look is probably Computerworld's guide, 60+ R resources to improve your data skills, which gives a few sentences of description for each resource. Another good source is the UCSD library, which has a well-organized UCSD Guide to Business Analytics, just as it has guides for political science, international studies, etc.

Remember that you don’t need to learn much R in order to use it for analytics. What you need in the course is enough to 1) glue together pieces of R code that do particular tasks, and 2) read guides to specialized topics, such as web scraping, text mining, or particular algorithms, that you need for an individual project.
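As a rough illustration of what "gluing pieces together" can look like, here is a minimal sketch in base R. It is not taken from any course assignment: the built-in mtcars data set, the 24/8 train/test split, and the choice of a linear model are all assumptions made just for this example.

```r
# Piece 1: get some data (the built-in mtcars data stands in for a project file)
data(mtcars)

# Piece 2: split the rows into a training set and a test set
set.seed(42)                                   # make the random split repeatable
train_rows <- sample(nrow(mtcars), size = 24)  # 24 of the 32 rows for training
train <- mtcars[train_rows, ]
test  <- mtcars[-train_rows, ]

# Piece 3: fit a model on the training data
fit <- lm(mpg ~ wt + hp, data = train)

# Piece 4: predict on the held-out rows and measure the error
pred <- predict(fit, newdata = test)
rmse <- sqrt(mean((test$mpg - pred)^2))
cat("Test RMSE:", round(rmse, 2), "\n")
```

Each piece could be swapped for a recipe from one of the books below (a different data source, a different algorithm) without touching the others, which is the level of R you need for the course.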

Cheat Sheets

Good reference books about R

Several books and web sites contain “recipes,” meaning chunks of R code to do particular tasks. These are big time-savers, although they are not a good way to learn the language. Everyone should get at least one of these, as an e-book, a physical book, or a permanent bookmark in your notes! Here are a few:

http://proquest.safaribooksonline.com/book/programming/r/9780596809287

  • Data Wrangling with R – available for complete download from the library’s Springerlink site (Springerlink.com), but only from the VPN or on campus.
  • There are several other good books, but they are expensive. If you are not on a budget, ask me.
  • R for Data Science covers the tidyverse, a set of new tools for file and data manipulation. They are much more efficient than raw R, and faster to write code with. Chapter 5 is probably the place to start. The book is available, but the same material is on a web site: http://r4ds.had.co.nz/
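To give a feel for the difference between raw R and the tidyverse, here is a small sketch that computes the same summary both ways. The task (average mpg per cylinder count in the built-in mtcars data) is my own choice for illustration, and the dplyr part only runs if you have already installed that package.

```r
# Base R: mean mpg for each cylinder count
base_result <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
print(base_result)

# Tidyverse (dplyr): the same summary written as a pipeline.
# Skipped if dplyr is not installed; install.packages("dplyr") would add it.
if (requireNamespace("dplyr", quietly = TRUE)) {
  suppressPackageStartupMessages(library(dplyr))
  tidy_result <- mtcars %>%
    group_by(cyl) %>%
    summarise(mpg = mean(mpg))
  print(tidy_result)
}
```

The pipeline reads top to bottom as "take the data, group it, summarise it", which is a large part of why tidyverse code tends to be faster to write.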

 

Library downloadable books on Data Mining using R

These books are the ones to study when you want to learn a data mining technique. All of them use R as the primary language. They are textbooks about machine learning, whereas the books listed earlier are about the R language itself and are written as reference books.

Rattle, the second textbook for BDA. Used because it has an easy interface. http://link.springer.com/book/10.1007/978-1-4419-9890-3

ISLR = An Introduction to Statistical Learning with Applications in R. http://link.springer.com/book/10.1007/978-1-4614-7138-7 Course textbook #3. More theoretical than the other books in this list, it has good explanations of how and why important algorithms work.

ggplot2 http://link.springer.com/book/10.1007%2F978-0-387-98141-3

The main graphics system we use. This book was written by ggplot2’s developer, and covers the early version of the software. A new edition is due out in 2018.

http://link.springer.com/book/10.1007/978-1-4419-1318-0

If you know Stata and are learning R, this book is good for looking up “how do I do that?”

http://link.springer.com/book/10.1007%2F978-3-319-12066-9

A short book that covers the basics of data mining, with everything written in R.

R for specific kinds of analysis (networks, GIS, marketing, ….)

Springerlink publishes a series of more than 60 books on different uses of R (https://link.springer.com/bookseries/6991). They are at the intermediate level, about right for refining your knowledge of special techniques needed for a project. Examples: Spatial analysis in R, R for Market Research, Data Wrangling with R, ggplot2 (several books), Political analysis using R, Analyzing Networks using R, Phylogenetics with R, etc. All of them are free to download, or you can buy them as paperback books for $25.

Because the field is so trendy, practically every business and textbook publisher has books on data mining and related topics. You can search them through the UCSD book catalog, UCSD.worldcat.org. For example, here are 2000+ e-books about ‘Machine Learning’. That is not a misprint, and all are available through UCSD in some form.

Last, there are literally dozens of books about R/statistics written for a particular audience, or exploring a particular applied statistics topic.  The following lists books I have found especially relevant to this course. Note that many of them are specifically for reference: when you need  to do something specific, look it up in one of these books. Others are intended for learning from.

When searching on your own, e.g. on Google Scholar, good phrases are “data mining”, “machine learning”, “data analytics” (broader), “data science”, and specific topics for your application, such as “fraud detection”. Use quotation marks around these phrases! Finally, there are many 20 to 50 page articles that cover the basics of particular R topics. These are often more up to date than books, and are better ways to get a start on a topic.

Mining Text Data    

R for Marketing Research and Analytics

A User’s Guide to Network Analysis in R

Statistical Analysis of Network Data with R

Applied Spatial Data Analysis with R (geographic information systems)

Graphical Models with R

Six Sigma with R

Introductory Time Series with R

Applied Econometrics with R

Nonlinear Regression in R

Data Manipulation with R

 

Preparing for first week of BDA

As of today, March 29, the course is oversubscribed. Come to the first class anyway, because by the third week lots of people will have dropped the course, for various reasons. See the page “Should I take Big Data Analytics in 2018?” for more information.
Class will probably meet in RBC 3203, but there is a chance it will move to the Gardner Room.

Here are the steps you need to complete by Tuesday, April 3. If you can get most of them done before the first class, even better. Most important: get the software installed. Some students will run into problems, and we want to discover those problems early.

  • Installing R is straightforward and covered in many places. We will use R version 3.4.4 (Someone to Lean On), which was released on 2018-03-15. Here is a Coursera video on how to install it. The official web site for the latest versions is https://cran.r-project.org.
  • Start R, make sure it runs. Set up at least one folder/directory where your R programs will go.
  • Install Rattle. Its home page is https://rattle.togaware.com. To install Rattle, start up R, then follow the instructions on the Rattle page.
  • Download the Rattle textbook from Springerlink.com. It also has instructions for installing Rattle. https://link.springer.com/book/10.1007/978-1-4419-9890-3
  • Get the main textbook. You can try using the library’s online copies, especially for the first chapters. Instructions are on this web site (BDA2020.wordpress.com).
    • Read chapter 1 on your own.
    • Start on Chapter 2.
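Once R starts, a quick check like the following can confirm that the basics are in place. This is only a sketch: the helper name check_setup and the folder name "BDA" are made up for this example, and the Rattle home page remains the authority on how to install Rattle itself, since it may need extra steps on your operating system.

```r
# Hypothetical helper: checks the R version, creates a course folder,
# and reports whether the rattle package is already installed.
check_setup <- function(course_dir, min_version = "3.4.4") {
  if (!dir.exists(course_dir)) {
    dir.create(course_dir, recursive = TRUE)
  }
  list(
    r_version_ok     = getRversion() >= min_version,
    folder_ready     = dir.exists(course_dir),
    # system.file() returns "" when a package is not installed
    rattle_installed = nzchar(system.file(package = "rattle"))
  )
}

# tempdir() is used here only so the example is harmless to run;
# replace it with a permanent location such as your Documents folder.
status <- check_setup(file.path(tempdir(), "BDA"))
str(status)

if (!status$rattle_installed) {
  message('Rattle not found. The usual first step is install.packages("rattle"); ',
          'see the Rattle page for the rest.')
}
```

If all three entries come back TRUE, you are ready for the first class. If not, bring the output to office hours.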

The TA for the course will be Feiyang Chan. She will hold informal office hours before and after the Wednesday class, for anyone who is having trouble installing the software. That means 10:30 to 11 AM, and again from 12:30 onward, in the main classroom, RBC 3203.

What is Big Data Analytics?

This confuses students every year, and for excellent reasons. A variety of terms are thrown around without clear definitions, or clear distinctions among them. The concepts and applications are evolving so fast that there is no consensus. You should think of all of the following as closely related, and all covered by this course:

  • Data Analytics
  • Data Mining
  • Machine Learning
  • Business Analytics
  • Data Science
  • at least 5 others.

It is very helpful to look at a range of case studies where these ideas have been used successfully. Here are a few.  Some may be bogus – as we will try to discuss during the course.

Assignment: send me other examples. Either put them in the comments, or email them to me and I will post them.

Big Data At Caesars Entertainment – A One Billion Dollar Asset? – Forbes

BDA examples: Pollution and health

Popular Press Articles

Analyzing 170,000,000 NYC Taxi trips

 

Course status update April 23 (with a hidden prize)

The following message went out as an announcement. If there are updates, I will put them here, rather than send another announcement.

1. Assignments are posted for the next 3 classes: Tuesday April 25, Thursday April 27, and Tuesday May 2 (due on May 1 at midnight).

2. Remember that Sai’s TA workshop is still on Wednesdays at 6 PM, but his office hours are now on Tuesday mornings at 9:30 AM. If you want help with the homework that is due Monday night, you had better email him early on the weekend, and hope for the best. The new schedule is discussed in a blog post on https://irgn452.wordpress.com.

3. Projects:

  • If you will be looking at spatial information in your project, check out my recent blog post on GIS in R. https://irgn452.wordpress.com
  • I have returned comments on all project proposals, and met with many of you to discuss them. Some project proposals are great; others are just getting started, and may change direction or get a lot more specific.
  • Many of you who are still working alone would benefit from collaborating. Take a look at “Discussions” on TritonEd, and post a message about your plans. Each of the first 5 people to post a message there will be entered in a lottery for a prize. (“Hello world” messages don’t count.)
  • Some teams are planning to work on projects based on the same data sets (e.g. Airbnb). In that situation, get together and exchange information about what you are finding, especially about retrieving and cleaning the data.
  • If we have time in class on Tuesday, I will ask some of you to introduce yourselves and discuss/ask about projects.

R-bloggers pulls useful articles from multiple R blogs

R has an active user community, and therefore numerous blogs, dedicated pages, etc. This site, R-bloggers, pulls interesting articles from a lot of them. I do not recommend it for novices, but if you want to explore more advanced methods for particular purposes, it’s worth checking out. For example, its “most cited articles of the week” list today includes the following:

  1. R tutorials
  2. In-depth introduction to machine learning in 15 hours of expert videos
  3. Using apply, sapply, lapply in R
  4. How to perform a Logistic Regression in R
  5. Working with databases in R

Typically these articles are very short, and give a quick introduction and example of a topic. Source: Data Types, Part 3: Factors! | R-bloggers
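To show the flavor of two of the topics in that list, here is a short sketch of my own (not taken from the R-bloggers articles) using the apply family and a logistic regression. The built-in iris and mtcars data sets and the model am ~ wt are assumptions picked only because they ship with R.

```r
# lapply returns a list: here, the class of every column in iris
col_classes <- lapply(iris, class)

# sapply simplifies its result to a vector: the mean of each numeric column
numeric_cols <- sapply(iris, is.numeric)
col_means <- sapply(iris[, numeric_cols], mean)
print(col_means)

# apply works over the rows (1) or columns (2) of a matrix:
# here, the largest measurement in each row
row_max <- apply(as.matrix(iris[, numeric_cols]), 1, max)

# Logistic regression with glm(): probability that a car in mtcars
# has a manual transmission (am = 1), given its weight
logit_fit <- glm(am ~ wt, data = mtcars, family = binomial)
pred_prob <- predict(logit_fit, type = "response")  # fitted probabilities
print(summary(logit_fit)$coefficients)
```

The articles themselves go further, for example into interpreting the coefficients, so treat this as a starting point only.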