Week 5: Random Forests, R, debugging

The ostensible material this week is Random Forests. They are a generalization of Classification/Regression Trees. No assignment for Monday; everything will be done in class. Redo the class material and hand it in on Wednesday (not Friday – we will go over the results in class.)

Here is the assignment for Wednesday, including the readings on Random Forests: BDA18 Random Forest Assign May 2, 2018.

Here is the material used in class: BDA18 Random Forests2018B

The other agenda this week is to develop your skills in debugging. This is the key skill of writing code: figuring out what is wrong and how to fix it.

Week 6 is text mining. Here is the assignment for next Monday: BDA18 Text mining #1 assign. (It is a short assignment, to be handed in Sunday night.)

Q&A about CART + Toyota

Some questions about Wednesday’s Toyota case. Now postponed to Friday April 13.

Question: what should we include in our write-up? (Several questions asking this in different ways.) For example:

We were only thinking of presenting our preliminary analysis (plot of key variables and our best result tree and error matrix), but should we add the discussion of the Quarterly Road Tax and omitting other variables as well?

Answer: Detailed trees take up lots of space and are not very interesting. The error matrix is very important, and so is what it means both technically and managerially (see Chapter 5 of DMBA and Chapter 15 of DMRR). Whatever you do, don’t give a chronological list of everything you tried. That is boring, and it’s more important to discuss where you ended up (your best analysis) than all the things you tried and later discarded.

But use your judgment. Enhancing your judgment is one of the goals of this course, and you only learn it by practicing.

Question: My partner and I are getting different results. Changing the random seed also changes the results. Why?

Answer: The CART algorithm is “unstable.” See the lecture notes from yesterday, and the textbooks. Small changes in inputs can produce large changes in results. However, although the tree may change, the accuracy of the model is still about the same. When we get to Random Forests in a few weeks, these problems will mostly go away. So don’t overthink them for now.
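To see this kind of instability on a small scale, here is a rough sketch in Python. Everything in it is an assumption made for illustration: the data are synthetic (two nearly identical predictors), the "tree" is a single regression split rather than a full CART implementation, and the function names are made up. The seed only decides which 70% of the records go into the training sample, yet the chosen variable and cut-point can differ from seed to seed.

```python
import random
from statistics import mean


def sse(values):
    """Sum of squared deviations from the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def best_split(rows):
    """Return (feature index, threshold) of the single best regression split.

    Each row is a tuple whose last element is the target.
    """
    best = (None, None, float("inf"))
    n_features = len(rows[0]) - 1
    for j in range(n_features):
        values = sorted({r[j] for r in rows})
        # Candidate thresholds are midpoints between adjacent distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [r[-1] for r in rows if r[j] <= t]
            right = [r[-1] for r in rows if r[j] > t]
            total = sse(left) + sse(right)
            if total < best[2]:
                best = (j, t, total)
    return best[0], best[1]


def make_data(n=200, seed=0):
    """Synthetic records (x0, x1, y); x1 is almost a copy of x0."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x0 = rng.random()
        x1 = x0 + rng.gauss(0, 0.05)
        y = x0 + rng.gauss(0, 0.2)
        rows.append((x0, x1, y))
    return rows


def root_split_for_seed(data, seed, train_share=0.7):
    """Draw a training sample with the given seed and find its best split."""
    rng = random.Random(seed)
    train = rng.sample(data, int(train_share * len(data)))
    return best_split(train)


data = make_data()
for seed in range(1, 6):
    feature, threshold = root_split_for_seed(data, seed)
    print(f"seed {seed}: split on x{feature} at {threshold:.3f}")
```

Running it with the same seed always gives the same split, which is why you and your partner should agree on a seed if you want matching output.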

Question: How do I fine-tune the CART parameters (bin size etc.) for Toyota? Is there a cookbook? (paraphrased)

Answer: There is lots of additional information on the CART algorithm. In the syllabus I recommend another general book on data mining, Introduction to Statistical Learning, which is better on theory. There are also the SpringerLink books. But all I expect is that you read the textbook (DMRR) discussion of these parameters, so that you understand what each one does. I would prefer that you save your detailed study for other, more important, algorithms. CART is simple and easy to explain, but it is basically obsolete compared with methods we will do later, especially Random Forests.

There is no “cookbook.” That’s why data mining is still difficult and still pays big $. It takes experience and judgment to do it well. Of course, there are guidelines and advice. But there are no detailed recipes that you just read and follow! For example, if you have 100,000 records, then a minimum split of 50 is “small.” But if you have 1000 records, that would be 5% of your entire sample, and is probably too large.
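The arithmetic behind that example is easy to check. The short Python sketch below is only an illustration; the function name min_split_share is made up and is not a parameter of any CART package.

```python
def min_split_share(min_split, n_records):
    """Return the minimum split size as a fraction of the whole sample."""
    return min_split / n_records


for n_records in (100_000, 1_000):
    share = min_split_share(50, n_records)
    print(f"{n_records:>7} records: a minimum split of 50 is {share:.2%} of the sample")
```

The same setting of 50 is 0.05% of the larger sample but 5% of the smaller one, which is why a fixed number cannot serve as a recipe.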

This is already a long assignment. At this stage of the course, you will get more benefit, and more learning, out of learning to transform variables. Here is a clue: Air conditioning is important. But there are 2 variables related to AC, for a total of 4 possibilities. Should you merge them in some fashion? (See my lecture notes for one solution.)
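As one possible way to think about merging, here is a small Python sketch. It is not necessarily the solution in the lecture notes. The column names airco and automatic_airco are assumptions, as is the choice to treat any car flagged as automatic AC as "automatic" regardless of the other flag.

```python
def merge_ac(airco, automatic_airco):
    """Collapse two 0/1 AC indicators into one three-level variable.

    The four possible combinations map to three levels; a car flagged
    automatic but not flagged airco is treated as automatic (an assumption).
    """
    if automatic_airco:
        return "automatic"
    if airco:
        return "manual"
    return "none"


# Hypothetical records covering all four combinations of the two flags.
cars = [
    {"airco": 0, "automatic_airco": 0},
    {"airco": 1, "automatic_airco": 0},
    {"airco": 1, "automatic_airco": 1},
    {"airco": 0, "automatic_airco": 1},
]

for car in cars:
    car["ac_type"] = merge_ac(car["airco"], car["automatic_airco"])
    print(car)
```

Whether three levels is the right encoding is a judgment call; before settling on it, check how many records fall into each of the four combinations.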
