Preparing for first week of BDA

As of today, March 29, the course is oversubscribed. Come to the first class anyway, because by the third week many people will have dropped the course for various reasons. See the page “Should I take Big Data Analytics in 2018?” for more information.
Class will probably meet in RBC 3203, but there is a chance it will move to the Gardner Room. 

Here are steps that you need to do by Tuesday, April 3. If you can get most of it done before the first class, even better.  Most important: get the software installed. Some students will run into problems, and we don’t want to wait to discover them.

  • Installing R is straightforward and covered in many places. We will use R version 3.4.4 (Someone to Lean On), which was released on 2018-03-15. Here is a Coursera video on how to install it. The official web site for the latest versions is https://cran.r-project.org.
  • Start R, make sure it runs. Set up at least one folder/directory where your R programs will go.
  • Install Rattle. Its home page is https://rattle.togaware.com. To install Rattle, start up R, then follow the instructions on the Rattle page.
  • Download the Rattle textbook from Springerlink.com. It also has instructions for installing Rattle. https://link.springer.com/book/10.1007/978-1-4419-9890-3
  • Get the main textbook. You can try using the library’s online copies, especially for the first chapters. Instructions are on this web site (BDA2020.wordpress.com).
    • Read chapter 1 on your own.
    • Start on Chapter 2.
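
One way to confirm that R runs and to set up a folder from inside R is sketched below. The folder name BDA is only a placeholder chosen for this example; any folder you prefer will do.

```r
# Show which R version is running; the course uses 3.4.4.
print(R.version.string)

# Create a course folder under the home directory.
# "BDA" is a placeholder name, not a course requirement.
course_dir <- file.path(path.expand("~"), "BDA")
if (!dir.exists(course_dir)) {
  dir.create(course_dir, recursive = TRUE)
}

# Make it the working directory so files are read from and saved to it.
setwd(course_dir)
print(getwd())
```

If the version line prints and the folder path appears without an error, these two steps are done.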

The TA for the course will be Feiyang Chan. She will hold informal office hours before and after the Wednesday class for anyone who is having trouble installing the software: 10:30 to 11 AM, and again from 12:30 onward, in the main classroom, RBC 3203.

What is Big Data Analytics?

This confuses students every year, and for excellent reasons. A variety of terms are thrown around without clear definitions, or clear distinctions among them. The concepts and applications are evolving so fast that there is no consensus. You should think of all of the following as closely related, and all covered by this course:

  • Data Analytics
  • Data Mining
  • Machine Learning
  • Business Analytics
  • Data Science
  • at least 5 others.

It is very helpful to look at a range of case studies where these ideas have been used successfully. Here are a few.  Some may be bogus – as we will try to discuss during the course.

Assignment: send me other examples. Either put them in the comments, or email them to me and I will post them.

Big Data At Caesars Entertainment – A One Billion Dollar Asset? – Forbes

BDA examples: Pollution and health

Popular Press Articles

Analyzing 170,000,000 NYC Taxi trips

 

Chart Relationship diagram from Financial Times

This diagram of about 80 kinds of charts, with clear explanations of their purposes, is impressive. It is the most comprehensive such list I have seen, and it’s quite easy to understand. I have not looked for an R/ggplot version of this, but if one does not exist yet I suspect someone will soon create it. Here is a web site with more details about each type of chart.

[Image: Visual-vocabulary.png]

R “cheat sheets”

This page is obsolete. Please see the “cheat sheet” section of the page Resources for Mining + R language instead.

There must be 50 R summaries in the form of “cheat sheets”.  Each is designed for slightly different purposes, e.g. for ggplot2, RStudio, etc. Here are a few that I find are especially good. Feel free to list your own favorites in the comments.

Ultimate R_Cheat_ for Data Management is a good place to start. This covers importing, summarizing, and basic manipulation. Here are its first few rows. (The author uses Z=c(1,2) for assignment. IMO it is better to use Z <- c(1,2), complete with extra spaces.)

  • dat1 <- read.csv("name.csv") to import a standard CSV file (the first row holds the variable names).
  • attach(dat1) to set a table as the default place to look for variables. Use detach() to release it.
  • dat1 <- read.delim("name.txt") to import a standard tab-delimited file.
  • dat1 <- read.fwf("name.prn", widths=c(8,8,8)) to import a fixed-width file (3 variables, each 8 characters wide).
  • head(dat1) to check the first few rows and variable names of the data table you imported.
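
To try these commands without a data file of your own, here is a small sketch. The three-row table is made up for illustration, and the example uses with() as one alternative to attach()/detach().

```r
# Write a tiny CSV to a temporary file so the example runs anywhere.
# The data are invented for this illustration.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("name,age,height",
             "Ann,23,165",
             "Bob,31,180",
             "Cam,27,172"), csv_path)

# Import it; the first row supplies the variable names.
dat1 <- read.csv(csv_path)

head(dat1)   # first rows and variable names
str(dat1)    # type of each column

# with() reaches the columns by name without attach()/detach().
mean_age <- with(dat1, mean(age))
print(mean_age)
```

Using with() keeps the lookup local to one expression, so there is nothing to remember to detach afterwards.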

More advanced cheat sheets, covering the dplyr package and other more advanced functions, are available from RStudio. Do not start with these, but they can be useful for specialized purposes.

  • The whole list: https://www.rstudio.com/resources/cheatsheets/
  • One that covers RStudio commands.

Probably the most useful one overall, in my experience, is from 2012. You will see lots of variations of this, all of which started with a 2004 version. Skip the earlier ones.

Installing Rattle on the Mac

Rattle sometimes gives errors when installed on Macs.

Rattle is not needed until next week, but Yicong Li figured out some very useful information. In the past, many Mac owners have had trouble installing Rattle. Here is what she learned, after considerable research. Also: It is worth repeating the regular Rattle install several times. The basic problem is that some additional libraries must be installed in order for Rattle to run. These libraries seem to install only gradually.

Please give comments on this message if you run into additional installation problems. Also respond if you have suggestions that might be helpful.

===============
Hi Roger,
Just want to update with you that I think I found a solution for installing rattle packages.
Problem: running >library(rattle) gives an error about installing GTK+.
Note: No need to re-install R or RStudio; just install XQuartz and GTK+ from the links provided.
Then run >install.packages("rattle", repos="http://rattle.togaware.com", type="source")
                >library(rattle)
                >rattle()
It works on my Mac now. Hope it can help others who are having the same problem.
Here is what I did on another machine and it worked.
  1. Install XQuartz from this link.
  2. Install GTK+ from this link.
  3. Run install.packages("rattle", dependencies = TRUE)
This will work.
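
Since the extra libraries seem to install only gradually, a check along these lines can show what is still missing after each attempt. It is only a sketch: it assumes that Rattle's GTK+ interface comes from the RGtk2 package, which is not named in the message above.

```r
# Packages to look for.
# "RGtk2" is an assumption about which package supplies Rattle's GTK+ bindings.
needed <- c("rattle", "RGtk2")

# requireNamespace() returns TRUE or FALSE instead of stopping
# when a package is missing or fails to load.
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
print(installed)

missing_pkgs <- names(installed)[!installed]
if (length(missing_pkgs) > 0) {
  message("Still missing: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("Both packages load; try library(rattle) and then rattle().")
}
```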

More quick R methods

We will go into most of these topics at the appropriate times in the course. But, it is very useful to have summaries and code for key techniques. This list is from 2016, and I have not rechecked the links for 2017.

Summarizing data: making quick tabular and distribution summaries of variables. http://www.r-bloggers.com/summarizing-data/
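
As a rough illustration of quick summaries (not taken from the linked post), the sketch below uses mtcars, a small data set that ships with R.

```r
# mtcars ships with R, so no file is needed.
data(mtcars)

# Minimum, quartiles, median, and mean of a numeric variable.
summary(mtcars$mpg)

# Counts for a categorical variable (number of cylinders).
cyl_counts <- table(mtcars$cyl)
print(cyl_counts)

# Two-way table: cylinders by transmission (am: 0 = automatic, 1 = manual).
print(table(mtcars$cyl, mtcars$am))

# Mean mpg within each cylinder group.
mpg_by_cyl <- tapply(mtcars$mpg, mtcars$cyl, mean)
print(round(mpg_by_cyl, 1))
```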

Matrix indexing (the topic of the April 13 2016 quiz). http://www.r-bloggers.com/quick-and-easy-subsetting/
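
For a feel of the basic indexing forms, here is a minimal sketch on a small made-up matrix. It is my own illustration, not the quiz or the linked post.

```r
# A 3 x 4 matrix filled column by column with 1..12.
m <- matrix(1:12, nrow = 3, ncol = 4)
print(m)

m[2, 3]          # single element: row 2, column 3
m[2, ]           # the whole second row
m[, c(1, 4)]     # columns 1 and 4
m[m[, 1] > 1, ]  # rows whose first-column value exceeds 1
m[-1, ]          # everything except row 1
```

The pattern is always m[rows, columns]; leaving one side empty means "all of them".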

Introduction to basic scatter plots http://www.r-bloggers.com/what-a-nice-looking-scatterplot/
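
A basic scatter plot in base R might look like the sketch below. It uses the built-in mtcars data rather than the example in the linked post, and the colors and labels are arbitrary choices.

```r
data(mtcars)

# Weight on the x-axis, fuel economy on the y-axis.
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lb)", ylab = "Miles per gallon",
     main = "Fuel economy versus weight",
     pch = 19, col = "steelblue")

# Add a least-squares line to show the overall trend.
trend <- lm(mpg ~ wt, data = mtcars)
abline(trend, lty = 2)
```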

Converting continuous variables into categorical variables (like the years from 1984 to 2016)  http://www.r-bloggers.com/from-continuous-to-categorical/
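
One common way to do this in base R is cut(). The sketch below groups a few made-up years into three periods; the break points are chosen only for illustration.

```r
years <- c(1984, 1990, 1999, 2000, 2008, 2016)

# Intervals are right-closed by default:
# (1983,1994], (1994,2005], (2005,2016].
period <- cut(years,
              breaks = c(1983, 1994, 2005, 2016),
              labels = c("1984-1994", "1995-2005", "2006-2016"))

print(data.frame(years, period))
print(table(period))
```

Note that the lowest break must sit below the smallest value (1983 here); otherwise 1984 would fall outside every interval and become NA.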

Factor variables (R's data type for categorical variables) http://www.r-bloggers.com/data-types-part-3-factors/
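
Here is a short sketch of how factors behave, using an invented vector of sizes.

```r
sizes <- c("small", "large", "medium", "small", "large")

# Without levels=, R orders the categories alphabetically.
f_default <- factor(sizes)
levels(f_default)

# Give the levels explicitly to control their order.
f_ordered <- factor(sizes, levels = c("small", "medium", "large"))
levels(f_ordered)

table(f_ordered)        # counts follow the level order
as.integer(f_ordered)   # the integer codes R stores internally
```

The level order matters later on, because tables, plots, and model output list the categories in that order.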

Quick intro to logistic regression.  http://www.r-bloggers.com/how-to-perform-a-logistic-regression-in-r/
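
As a minimal sketch of the idea (my own example on the built-in mtcars data, not the one in the linked post), a logistic regression in base R uses glm() with family = binomial.

```r
data(mtcars)

# Model the chance of a manual transmission (am = 1) from car weight (wt).
fit <- glm(am ~ wt, data = mtcars, family = binomial)
summary(fit)

# Predicted probability of a manual transmission for a car with wt = 3
# (weight is recorded in units of 1000 lb).
p <- predict(fit, newdata = data.frame(wt = 3), type = "response")
print(p)
```

type = "response" returns a probability between 0 and 1; without it, predict() returns the value on the log-odds scale.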

 

 

Source: Summarizing Data | R-bloggers