Misc. announcements: homework, Sony projects, tutorial on R, etc.

Several notices about events tomorrow and later.

Student Question: What cutoff in homework problem 10.4i?

Dana writes:

Hey everyone!

I have a question about 10.4 i. exercise. In Rattle we use default setting and cannot change the cutoff, so are we supposed to guess the cutoff for the most accurate classification? Or are we supposed to get it some other way?
Response: Good question. I have 3 levels of answer. At the simplest, should you push the cutoff up or down from 50%? (Be sure to specify which way is which – sometimes it can be ambiguous).
At the next level, Rattle produces an ROC curve, as we did with the airplane data on Wednesday. See textbook p 131. The ROC curve is traced out by moving the threshold all the way from 0 to 1.
Third level: Soon, you will learn how to grab the code produced by Rattle and run it in RStudio. There you can change the cutoff parameter and calculate different confusion matrices. For example, the function confusion.matrix(obs, pred, threshold = 0.5) accepts any threshold, so you can experiment with different values by trial and error. Of course, there are more specialized functions that can find the optimal cutoff directly.
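
To make the role of the cutoff concrete, here is a minimal sketch in Python (it is not the R function named above). The function confusion_matrix, the helper accuracy, and the eight observations are all made up for illustration. The only idea taken from the discussion is that a record is classified as 1 when its predicted probability reaches the threshold.

```python
def confusion_matrix(obs, pred, threshold=0.5):
    """Count TP, FP, TN, FN when probabilities >= threshold are classed as 1.

    obs  : list of actual classes (0 or 1)
    pred : list of predicted probabilities of class 1
    """
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for actual, prob in zip(obs, pred):
        predicted = 1 if prob >= threshold else 0
        if predicted == 1 and actual == 1:
            counts["TP"] += 1
        elif predicted == 1 and actual == 0:
            counts["FP"] += 1
        elif predicted == 0 and actual == 0:
            counts["TN"] += 1
        else:
            counts["FN"] += 1
    return counts


def accuracy(cm):
    """Share of records classified correctly."""
    return (cm["TP"] + cm["TN"]) / sum(cm.values())


# Made-up data: actual classes and predicted probabilities of class 1.
obs = [1, 1, 1, 0, 0, 0, 1, 0]
pred = [0.9, 0.8, 0.45, 0.4, 0.3, 0.55, 0.6, 0.1]

# Trial and error over a few cutoffs.
for cutoff in (0.3, 0.5, 0.7):
    cm = confusion_matrix(obs, pred, threshold=cutoff)
    print(cutoff, cm, "accuracy =", accuracy(cm))
```

With these made-up numbers, accuracy is 0.625 at a cutoff of 0.3 and 0.75 at both 0.5 and 0.7. Raising the cutoff trades false positives for false negatives, which is why accuracy alone may not settle the choice.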

Sony Playstation Network project

Anyone considering the Sony Playstation Network project, read this memo: BDA18 Sony PSN project update 4-19. It points you to some data and requests a revised proposal as soon as possible. Otherwise, switch to another topic. If you are looking for a partner, comment on this page or on the Final paper ‘dating site.’

R Tutorial Friday in room 3201 at 1pm

Feiyang will provide several resources for learning R at the level we need it in BDA. She will demonstrate how functions work, and illustrate function use with examples from the textbook. Other topics will include using R Help, and good cheat sheets. This will all be useful next week when we move away from Rattle and toward straight R.

Next week: Linear continuous models (Linear Regression)

Next week, we will look at a method that everyone has seen in a different context, OLS linear regression. I don’t like the textbook treatment of the topic, so I’m assigning a supplemental book. Please read:
Gareth James et al., An Introduction to Statistical Learning with Applications in R (supplementary textbook). It’s available from Springerlink. Review section 3.2, which should be familiar, and read section 3.3 on variations on the basic linear model. Also read sections 6.1 and 6.2 only of the main textbook (DMBA).
A more detailed assignment will be posted Saturday, and nothing is due Sunday.

Comments on early project proposals

Thanks to everyone who submitted these project proposals on time. I have returned half (Saturday night); the others are coming. A few are still missing.

Here is some general advice that applies to some proposals. If you already know it, please help other BDA students with it.

  1. For this course, all projects need to do prediction or classification of individual-level characteristics of some kind. You have already had numerous courses where the goal is to test a hypothesis or to measure the strength of the causal relationships among variables. These are important, but in BDA we are learning a new skill, and indeed a new purpose for analyzing quantitative data. Mathematically, the two purposes are closely related, but prediction requires a different mind-set. We will talk more about this in the weeks to come.
  2. Internet search: Everyone knows how to do a simple search on Google. But that is a low skill level. It is possible to do research far better in 2018. Before you leave UCSD, get good at it. Here are some simple questions to test yourself.
    1. What is the difference between searching for Yelp reviews and “Yelp Reviews” (that is, with or without quotes and capital letters)?
    2. Find pages that mention dates between 1985 and 1995. (Not pages that were written then, but pages that contain those dates.)
    3. Almost everything on Google Scholar can be found in a regular Google search. Why, then, is it so useful to search on Google Scholar instead of (or sometimes in addition to) Google.com?
    4. What extra benefits do you get on Google Scholar when you are on campus or VPNed to campus?**
      Where on this listing should you click to get the paper most easily? [Screenshot: 2018-04-14_20-48-57]
    5. Name two good databases that have massive collections of non-public business-related papers, reports, magazines, etc., that is, material not available through either Google or Google Scholar.
  3. A few people forgot the syllabus request that all written work be turned in with a professional appearance, as if you were in a company or organization. I got more than one file called “proposal.pdf.”
    I have therefore updated and reposted the syllabus section on homework and reports. BDA18 Syllabus = HW formats + grading 2018-04-14. Parts of it, such as discussion of figures, are overkill for these early project descriptions.



** This one is pretty obscure. The answer is in the picture. If you are on campus, you get the UC-eLinks entry. For recently published papers, that is often the only way to download them.

For Wk 3 on linear classification. Updated 4/18 after Wednesday class

This page has notes and advice for the classes of April 16 and 18, 2018. Now includes lecture notes from both days.

  1. Done in class on Wednesday: logistic model (categorical linear models) of airline delays. Here is a fairly complete log of what I did on my own, before our actual class. This should be all you need to do the eBay work mechanically. You still need to think about what the results tell you.
  2. Main homework that was due the night of April 17 is moved to Friday afternoon, just like last week. However, please at least load the eBay data, and run some histograms and other exploratory analysis before class.
  3. Submitting homework on TritonEd. Several students had trouble with Ted/TritonEd. Feiyang suggests changing your browser (e.g., try Chrome).
    • My personal experience/advice is that although I use Safari most of the time, about 5% of the sites I try to work with cannot handle it. After a few tries on a site, I switch browsers.
  4. Two visitors from Sony will talk about their work on Wednesday, April 18. Update: changed to the following week, approximately April 25.
  5. Week of April 16. Linear models. 80% of the class has studied “logistic regression” in a statistics or economics class. We will be doing something closely related, and I will rely on people being familiar with several of the basic ideas. If you are unfamiliar with logistic regression, study the textbook quite carefully.
  6. IMPORTANT FOR HOMEWORK: The Rattle book/manual does not discuss how to do linear models.
  7. RStudio and Rattle are now on UCSD computers near student affairs. (4 Windows machines, each with 16 GB of RAM. First come, first served.) They are also on the UCSD virtual machines. I tested the Windows machines myself. BUT it’s not clear exactly what is needed to download the libraries for either method. So don’t try using them yet.
     So far, only machines 001 and 005 have been shown to work; 002 does not work.
    These machines have some limitations, but they are clearly fine for homework. We are testing to find out what their limits are.

Office hours are set!

Professor Roger Bohn
Office = RBC 1315. Phone and text (858) 381-2015.
Email Rbohn@ucsd.edu. Put #BDA18 in the subject line of all emails.

TA Feiyang.Chen@rady.ucsd.edu

Office Hours (final). There are office hours on Monday, Wednesday, Friday.

  • Wednesday 6:30 to 7:30 PM in Peet’s Coffee. Prof. will stay later if there is a long line. Text (858) 381-2015 to reserve a late time.
  • Fridays 4:00 to 5:15 in my office RBC 1315. Prof. will stay as long as needed to talk to anyone who is in line at 5:05.
  • TA Office hours Monday 1 to 2 in RBC 3128 (shortly after class).
  • TA R tutorial Fridays 1 to 2 in Gardner Auditorium

Q&A about CART + Toyota

Some questions about Wednesday’s Toyota case. Now postponed to Friday April 13.

Question: what should we include in our write-up? (Several questions asking this in different ways.) For example:

We were only thinking of presenting our preliminary analysis (plot of key variables and our best result tree and error matrix), but should we add the discussion of the Quarterly Road Tax and omitting other variables as well?

Answer: Detailed trees take up lots of space and are not very interesting. The error matrix is very important – as well as what it means both technically and managerially (see Chapter 5 of DMBA and Chapter 15 of DMRR). Whatever you do, don’t give a chronological list of everything you tried. That is boring, and it’s more important to discuss where you ended up (your best analysis) than all the things you tried and later discarded.

But use your judgment. Enhancing your judgment is one of the goals of this course, and you only learn it by practicing.

Question: My partner and I are getting different results. Changing the random seed also changes the results. Why?

Answer: The CART algorithm is “unstable.” See the lecture notes from yesterday, and the textbooks. Small changes in inputs can produce large changes in results. However, although the tree may change, the accuracy of the model is still about the same. When we get to Random Forests in a few weeks, these problems will mostly go away. So don’t overthink them for now.
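
The instability can be seen even in a toy setting. The following is a minimal sketch in Python, not the CART algorithm itself: a hypothetical one-split classifier (best_stump) picks whichever binary feature agrees with the label most often. The data are made up. Changing a single record switches which feature is chosen, while accuracy stays the same.

```python
def best_stump(rows, labels):
    """Return (feature_index, accuracy) for the single binary feature
    that matches the label most often. Ties go to the earlier feature."""
    best_feature, best_correct = 0, -1
    for j in range(len(rows[0])):
        correct = sum(1 for row, y in zip(rows, labels) if row[j] == y)
        if correct > best_correct:
            best_feature, best_correct = j, correct
    return best_feature, best_correct / len(labels)


# Made-up data: eight records, two binary features each.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
rows_a = [(0, 0), (0, 0), (0, 0), (0, 1), (1, 1), (1, 1), (1, 0), (0, 1)]

# Same data with one record changed (record 6).
rows_b = list(rows_a)
rows_b[6] = (0, 1)

print(best_stump(rows_a, labels))  # splits on feature 0
print(best_stump(rows_b, labels))  # splits on feature 1, same accuracy
```

A real tree repeats this kind of choice at every node, so a different first split can lead to a very different-looking tree with similar overall accuracy.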

Question: How do I fine-tune the CART parameters (bin size etc.) for Toyota? Is there a cookbook? (paraphrased)

Answer: There is lots of additional information on the CART algorithm. In the syllabus I recommend another general book on data mining, Introduction to Statistical Learning, which is better on theory. There are also the Springerlink books. But all I expect is that you read the textbook (DMRR) discussion of these parameters, so you understand what each one does. I would prefer that you save your detailed study for other, more important, algorithms. CART is simple and easy to explain, but it is basically obsolete compared with methods we will do later, especially Random Forests.

There is no “cookbook.” That’s why data mining is still difficult and still pays big $. It takes experience and judgment to do it well. Of course, there are guidelines and advice. But there are no detailed recipes that you just read and follow! For example, if you have 100,000 records, then a minimum split of 50 is “small.” But if you have 1000 records, that would be 5% of your entire sample, and is probably too large.
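
The arithmetic behind that example is worth checking for your own data. A minimal sketch in Python; the helper name min_split_share is made up for illustration, and the record counts are the two from the example above.

```python
def min_split_share(min_split, n_records):
    """Fraction of the whole sample represented by the minimum split size."""
    return min_split / n_records


for n_records in (100_000, 1_000):
    share = min_split_share(50, n_records)
    print(f"{n_records} records: minimum split of 50 is {share:.2%} of the sample")
```

The same setting of 50 is 0.05% of 100,000 records but 5% of 1000 records, so a sensible value depends on the size of the data set.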

This is already a long assignment. At this stage of the course, you will get more benefit, and more learning, out of learning to transform variables. Here is a clue: Air conditioning is important. But there are 2 variables related to AC, for a total of 4 possibilities. Should you merge them in some fashion? (See my lecture notes for one solution.)
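
As one possible illustration of merging two yes/no variables into one, here is a minimal sketch in Python. The input names airco and automatic_airco and the function merge_ac are assumptions, and the coding (0 = none, 1 = manual, 2 = automatic) is only one option, not necessarily the one in the lecture notes.

```python
def merge_ac(airco, automatic_airco):
    """Collapse two 0/1 air-conditioning flags into one ordered level.

    0 = no AC, 1 = manual AC only, 2 = automatic AC.
    Assumption: a car flagged as automatic counts as level 2
    even if the plain AC flag is 0.
    """
    if automatic_airco == 1:
        return 2
    if airco == 1:
        return 1
    return 0


# All four possible combinations of the two flags.
for airco in (0, 1):
    for automatic_airco in (0, 1):
        print(airco, automatic_airco, "->", merge_ac(airco, automatic_airco))
```

Whether an ordered level, a set of categories, or something else works best is a judgment call to test against the error matrix.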


Week 2 #BDA assignments

Feiyang and I are working to get everyone up to speed for next week. Here is a list of items to be aware of, in no particular order. I will get this material organized better over the weekend. As always, you can post comments on this message if you have questions.

  1. Rattle now works on the Mac! See “Installing Rattle on Mac.”
  2. The homework for Monday does use Rattle. Do it in teams, and if only one of you has a working version of Rattle that is ok as long as you physically work together. In class, I will call on someone randomly for your solutions.
  3. You can find the homework at the end of the syllabus. Currently, it is version 1.05, but I expect to revise it late today. See “Latest syllabus, assignments, + notes.” The Monday assignment is on page 12, or search for “Early Assignments”. Update: individual assignments are also in TritonEd.
  4. The assignment due Tuesday night takes some people a long time because it requires using R, Rattle, and the first data mining algorithm, called CART. It also asks you to do some data manipulation. So set aside time, and do it with a teammate. Homework is due at 11 pm.
  5. To get the course information immediately, subscribe to this web site. Look for the subscribe button on the bottom right. (BDA2020.wordpress.com)
  6. Feiyang can provide assistance with Rattle by email (and then phone, etc.). When asking for computer assistance with problems, provide basic information for debugging. Do not say “it didn’t work,” unless you don’t need any assistance. The more complete, the better. Give the exact error message. It is even OK to copy and paste an entire stream of activities and resulting error messages into the bottom of an email. She won’t read it all, but it gives important clues.
  7. I have not confirmed this, but she should be available before or after class on Monday for anyone still having trouble with Rattle.
  8. For R in general, get in the habit of googling error messages. Often this will send you to the Stack Overflow site.
    • There are various tricks involved in googling errors, which I will discuss in class.
    • A few newcomers to UCSD may not know how to do compound searches on Google. This is a basic life skill.

      “Data Mining” Rattle “text of error message” 

      is a much better search than
      Data Mining Rattle text of error message.  Why?

    • To start with, use Google’s Advanced Search page.
    • Search tips for dates, for example: https://www.makeuseof.com/tag/6-ways-to-search-by-date-on-google/

Latest syllabus, assignments, + notes for #BDA Big Data Analytics at UC San Diego

This page links to the latest versions of course material. Some PDF, some HTML. Update May 29, 2018

Lecture Notes (chronological order)

  1. BDA18-D3 Chap9_CART RB.  For the class of April 9, on CART
  2. BDA18 Class 4 Lecture notes Toyota  For the class of April 11, on CART + Toyota
  3. Logistic Regression 2018  Class of April 16 on classification using linear models aka logistic regression.
  4. Class of April 18 on linear categorical models aka logistic regression. BDA18 illustration of Rattle use 04-18
  5. Notes on Linear Regression, Week 4,  April 23, 25  BDA18 regression 04/25.pdf.   BDA18 regression slides 4-23. Use primarily the April 25 version; 4/23 has a few additional  slides.
  6. How to go from Rattle to R: BDA18 Rattle to R code 4-25.pdf
  7. Lecture Notes Week 5 Random Forests BDA18 Random Forests2018B
  8. Lecture Notes Week 6 Text Mining, Day 1
    Tutorial worked through in class. Basic Text Mining in R 2017 version
  9. Week 6 Text mining #2 2018b 
  10. Week 7 LASSO, Monday May 14.
  11. Week 8  lecture notes. Monday May 21. BDA18 feature engineering case study

Advice, tutorials, reference books, other useful material

Special topics – for specific papers

The Big Data Analytics course introduces data mining with techniques and concepts that are broadly applicable. Individual topics and projects have specific techniques, needs, and resources. In keeping with the theme “Borrow and re-use, don’t invent anything yourself,” here are some resources that are especially suited to particular topics.

Don’t forget to try the site’s Search window (usually near the upper right) to look up possible keywords. Many of these topics also have entire books about them, such as on Springerlink.

Other links:

Google folder for the course. There you will find all datasets for the textbook.

The official textbook web site is http://www.dataminingbook.com/book/r-edition
Once you register, you can get these datasets, and the R Code. (It’s better to type the R Code by hand, the first time.)

Contact Information

Personal web site: Art2science.org