Notes from class 3, CART using Rattle

MEMORANDUM
To: Big Data Analytics students
From: Prof. Roger Bohn
Subject: Class #3 Monday April 9 – next steps, Q&A, homework schedule
Date: April 9, 2018

The lecture notes were provided before class; see Latest handouts. We did not cover all of them, and will continue with the CART algorithm on Wednesday before discussing Toyota.

Another topic we discussed, not in the notes: Benefits and disadvantages of open source software.

Please email (or put in comments on this page) questions about the Weather exercise from the Rattle book. Several people asked good questions about Toyota after class. If there are no more questions about how to use Rattle, we will move right into the next segment on Wednesday.

Toyota homework now due Friday at Noon. The TritonEd assignment has been updated.

Still having trouble with Rattle? Feiyang will be available at 4pm today. The location is unclear; check near GPS office 3132. Feiyang is polling about what her tutorial hours should be. Please respond to her Doodle poll at https://goo.gl/forms/5rjhpIjevaewMBjJ2. No response = you don’t get a vote.

Other questions asked in class and not answered:

  • Can we have a group of 3 for homework? No. You can discuss with others if you put their names in a note. But only 2 people should work on the actual memo answers.
  • Grading scale, grading policy. I will post something about this. Homework is graded on a 0 to 10 scale. An average of 8 is fine.
  • How to find other people who are interested in projects. I just created a page specifically for that. Final paper ‘dating site’
  • Where to learn more R. Attend the TA tutorials, and I will shortly post a list of recommended websites and readings.  This page is a starting point. Resources for R language


Week 2 #BDA assignments

Feiyang and I are working to get everyone up to speed for next week. Here is a list of items to be aware of, in no particular order. I will get this material organized better over the weekend. As always, you can post comments on this message if you have questions.

  1. Rattle now works on the Mac! see Installing Rattle on Mac
  2. The homework for Monday does use Rattle. Do it in teams, and if only one of you has a working version of Rattle that is ok as long as you physically work together. In class, I will call on someone randomly for your solutions.
  3. You can find the homework at the end of the syllabus. Currently, it is version 1.05, but I expect to revise it late today. See Latest syllabus, assignments, + notes. The Monday assignment is on page 12, or search for “Early Assignments”. Update: individual assignments are also in TritonEd.
  4. The assignment due Tuesday night takes some people a long time because it requires using R, Rattle, and the first data mining algorithm, called CART. It also asks you to do some data manipulation. So set aside time, and do it with a teammate. Homeworks are due 11pm.
  5. To get the course information immediately, subscribe to this web site. Look for the subscribe button on the bottom right. (BDA2020.wordpress.com)
  6. Feiyang can provide assistance with Rattle by email (and then by phone, etc.). When asking for computer assistance, provide basic information for debugging. Do not just say “it didn’t work” unless you don’t need any assistance. The more complete, the better. Give the exact error message. It is even OK to copy and paste an entire stream of activities and resulting error messages into the bottom of an email. She won’t read it all, but it gives important clues.
  7. I have not confirmed this, but she should be available before or after class on Monday for anyone still having trouble with Rattle.
  8. For R in general, get in the habit of googling error messages. Often this will send you to the Stack Overflow site.
    • There are various tricks involved in googling errors, which I will discuss in class.
    • A few newcomers to UCSD may not know how to do compound searches on Google. This is a basic life skill.

      “Data Mining” Rattle “text of error message” 

      is a much better search than
      Data Mining Rattle text of error message.  Why?

    • To start with, use Google’s Advanced Search page.
    • Search tips for dates, for example: https://www.makeuseof.com/tag/6-ways-to-search-by-date-on-google/
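
Item 6 above asks for basic debugging information along with the exact error message. As a rough sketch (not part of the original instructions), the Terminal function below gathers a few system facts to paste into an email. The name collect_debug_info is hypothetical, and it assumes a Mac or Linux Terminal; sw_vers is only present on macOS, so it is skipped elsewhere.

```shell
#!/bin/sh
# Hypothetical helper: print basic system facts to paste into a help email.
collect_debug_info() {
    echo "== Operating system =="
    uname -a
    # sw_vers exists only on macOS; skip it on other systems.
    if command -v sw_vers >/dev/null 2>&1; then
        sw_vers
    fi
    echo "== R version =="
    # Report the first line of R's version banner, if R is on the PATH.
    if command -v R >/dev/null 2>&1; then
        R --version | head -n 1
    else
        echo "R not found on PATH"
    fi
}

collect_debug_info
```

Inside R itself, sessionInfo() prints the R version and the loaded packages, which is also worth pasting alongside the error message.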

Installing Rattle on Mac

To start: Update your macOS to version 10.13.3, which takes a little while. You also need Xcode, which can take 20 minutes to install.

A. Download the two key files: the RGtk2 and cairoDevice source packages (the .tar.gz files used in the commands below).

B.  One method for installing Rattle is available at https://zhiyzuo.github.io/installation-rattle/. This method does not require Xcode.

C. The second method is on this page.

It requires downloading Xcode, which takes a while, so start on it early.

1. Upgrade your Mac operating system to High Sierra (10.13.3).
2. Install Xcode, which can be downloaded from the App Store. This takes a long time (20 minutes).
3. In the following instructions, type these commands in your Terminal application (available in Applications/Utilities, shift-Cmd-U). The prompt “MyComputer$” will look slightly different on your machine.

MyComputer$ xcode-select -p
/Applications/Xcode.app/Contents/Developer

### If you see above line, you have xcode command line tools installed.
### If you don’t see that output, execute the command below:

MyComputer$ xcode-select --install

### Then, execute the commands below, in Terminal. The port commands assume MacPorts (which installs into /opt/local) is already on your Mac.

MyComputer$ export PATH=/opt/local/bin:/opt/local/sbin:$PATH

MyComputer$ sudo port selfupdate

MyComputer$ sudo port install pkgconfig

MyComputer$ sudo port install gtk2 +x11

MyComputer$ R CMD INSTALL ~/Downloads/RGtk2_2.20.34.tar.gz

MyComputer$ R CMD INSTALL ~/Downloads/cairoDevice_2.24.tar.gz

### The last two command lines depend on your file names and folders, so make sure that each name matches your own files and folders. For example, when you download RGtk2 and cairoDevice, they may arrive with file names that look like RGtk2_2.20.34.tar (no .gz). If so, use the name on your system.
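
Because the downloaded file names vary, one way to avoid typing them by hand is to let the shell find them. This is a minimal sketch, not part of the original instructions. The function name find_tarball is hypothetical, and it assumes the files sit in ~/Downloads with names like RGtk2_<version>.tar.gz or RGtk2_<version>.tar.

```shell
#!/bin/sh
# Hypothetical helper: print the first file (in alphabetical order) in a
# folder whose name starts with "<package>_" and contains ".tar".
# This matches both the .tar.gz and the plain .tar variants.
find_tarball() {
    dir=$1
    pkg=$2
    for f in "$dir"/"$pkg"_*.tar*; do
        # If nothing matched, $f is the unexpanded pattern and -f fails.
        if [ -f "$f" ]; then
            printf '%s\n' "$f"
            return 0
        fi
    done
    echo "No $pkg tarball found in $dir" >&2
    return 1
}

# Usage (uncomment to run the actual installs):
# R CMD INSTALL "$(find_tarball "$HOME/Downloads" RGtk2)"
# R CMD INSTALL "$(find_tarball "$HOME/Downloads" cairoDevice)"
```

If more than one version of a package is in the folder, the function picks the first match alphabetically, so delete older downloads first or type the exact name as shown above.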

You can now quit Terminal and start Rstudio. Follow the regular instructions for installing Rattle. For example (from Rstudio, not from Terminal):

install.packages("rattle", repos="https://rattle.togaware.com", type="source")

library(rattle)

rattle()

 

Once you get this far, you should never have to install Rattle again. You can just run it, from Rstudio or from R. And you won’t have to install.packages again, either. Just:

library(rattle)

rattle()

D. Why all this trouble? – An educational discussion

Despite the inconvenience of this installation, there are some useful lessons here about modern software and software ecosystems. Rattle and R are both open-source software, meaning that no company is formally responsible for their development. Instead, they are built by volunteers (some of whom are paid by their employers to work on the software). The results are that the software is available for free, and portions of it may be very sophisticated, and better than anything available from conventional software companies.

But conversely, with open source software programs the robustness (it doesn’t break) and usability (non-specialists can use it) depend on how popular that program is, which determines how much effort the open source software community puts into developing it. The core pieces of R and Rstudio are both heavily used and relatively well developed. Rattle, on the other hand, is more limited. The inventor, Graham Williams, as far as I can tell uses Linux and Windows, but does not use Macintosh software. Therefore, when he updates Rattle he develops and tests it first for Windows. Other users then have to come along and test/adapt it for Mac.

The same situation applies to some packages that Rattle depends on, especially something called RGtk2. As of today (4/5/2018), the CRAN page for RGtk2 says the following:

Reference manual: RGtk2.pdf
Package source: RGtk2_2.20.34.tar.gz
Windows binaries: r-prerel: RGtk2_2.20.34.zip, r-release: RGtk2_2.20.34.zip, r-oldrel: RGtk2_2.20.31.zip
OS X binaries: r-prerel: not available, r-release: not available

In other words, there is no compiled (binary) build of RGtk2 version 2.20.34 for the Macintosh operating system. What you are doing in the above instructions is compiling it yourself from source, so that it will be available when Rattle wants it.

 

Rattle, job hunting, other announcements

On Thursday our TA had a session on installing Rattle on the Mac. She may repeat it on Friday. Send her an email if you are interested.

Please refer to this page for instructions. Installing Rattle on Mac

Data sources pages:

Data sets from Google and Kaggle: https://www.kaggle.com/datasets

A page of useful links

Data sources and project ideas related to pollution.

Projects: easily available data sets

Five strategies for locating interesting data sets. (From Dataquest)

Some data projects that encourage other people to use the data they collected.

Past student papers:

 

Job Hunting Opportunity

In 2 weeks, the Jacobs School of Engineering is running a day with lots of employers visiting. Student passes are $10, although you can probably sneak in if you want to. http://jacobsschool.ucsd.edu/re/

 

 

 

Preparing for first week of BDA

As of today, March 29, the course is oversubscribed. Come to the first class anyway, because by the third week lots of people will drop the course, for various reasons. See the page on Should I take Big Data Analytics in 2018? for more information.
Class will probably meet in RBC 3203, but there is a chance it will move to the Gardner Room. 

Here are steps that you need to do by Tuesday, April 3. If you can get most of it done before the first class, even better.  Most important: get the software installed. Some students will run into problems, and we don’t want to wait to discover them.

  • Installing R is straightforward and covered in many places. We will use R version 3.4.4 (Someone to Lean On), which was released on 2018-03-15. Here is a Coursera video on how to install. The official web site for latest versions is https://cran.r-project.org.
  • Start R, make sure it runs. Set up at least one folder/directory where your R programs will go.
  • Install Rattle. Its home page is https://rattle.togaware.com . To install Rattle, start up R, then follow the instructions on the Rattle page.
  • Download the Rattle textbook from Springerlink.com. It also has instructions for installing Rattle. https://link.springer.com/book/10.1007/978-1-4419-9890-3
  • Get the main textbook. You can try using the library’s online copies, especially for the first chapters. Instructions are on this web site (BDA2020.wordpress.com).
    • Read chapter 1 on your own.
    • Start on Chapter 2.

The TA for the course will be Feiyang Chan. She will hold informal office hours before and after the Wednesday class for anyone who is having trouble installing the software: 10:30 to 11AM, and again from 12:30 onward, in the main classroom, RBC 3203.

What is Big Data Analytics?

This confuses students every year, and for excellent reasons. A variety of terms are thrown around without clear definitions, or clear distinctions among them. The concepts and applications are evolving so fast that there is no consensus. You should think of all of the following as closely related, and all covered by this course:

  • Data Analytics
  • Data Mining
  • Machine Learning
  • Business Analytics
  • Data Science
  • at least 5 others.

It is very helpful to look at a range of case studies where these ideas have been used successfully. Here are a few.  Some may be bogus – as we will try to discuss during the course.

Assignment: send me other examples. Either put them in the comments, or email them to me and I will post them.

Big Data At Caesars Entertainment – A One Billion Dollar Asset? – Forbes

BDA examples: Pollution and health

Popular Press Articles

Analyzing 170,000,000 NYC Taxi trips