Some questions about Wednesday’s Toyota case, now postponed to Friday, April 13.
Question: what should we include in our write-up? (Several questions asking this in different ways.) For example:
We were only thinking of presenting our preliminary analysis (plot of key variables and our best result tree and error matrix), but should we add the discussion of the Quarterly Road Tax and omitting other variables as well?
Answer: Detailed trees take up lots of space and are not very interesting. The error matrix is very important, as is what it means both technically and managerially (see Chapter 5 of DMBA and Chapter 15 of DMRR). Whatever you do, don’t give a chronological list of everything you tried. That is boring, and it’s more important to discuss where you ended up (your best analysis) than all the things you tried that you later discarded.
But use your judgment. Enhancing your judgment is one of the goals of this course, and you only learn it by practicing.
Question: My partner and I are getting different results. Changing the random seed also changes the results. Why?
Answer: The CART algorithm is “unstable.” See the lecture notes from yesterday, and the textbooks. Small changes in inputs can produce large changes in results. However, although the tree may change, the accuracy of the model is still about the same. When we get to Random Forests in a few weeks, these problems will mostly go away. So don’t overthink them for now.
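To see why a different seed is a "small change in inputs," here is a minimal Python sketch (not Rattle itself). It assumes the seed controls which rows are sampled into the training partition, and that 70% of rows go to training. The function name, the 1000-row size, and the seed values are made up for illustration.

```python
import random

def training_rows(n_rows, seed, train_share=0.7):
    """Return the set of row indices sampled into the training partition."""
    rng = random.Random(seed)          # private generator: the seed fully determines the sample
    k = round(n_rows * train_share)    # number of training rows (70% share is an assumption)
    return set(rng.sample(range(n_rows), k))

mine = training_rows(1000, seed=42)
partner = training_rows(1000, seed=43)

# Fraction of my training rows that my partner's tree was also trained on.
shared = len(mine & partner) / len(mine)
print(f"Share of my training rows that my partner also trained on: {shared:.0%}")
```

Two trees grown on partly different rows can pick different splits, even though their accuracy on held-out data ends up about the same.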
Question: How do I fine-tune the CART parameters (bin size etc.) for Toyota? Is there a cookbook? (paraphrased)
Answer: There is lots of additional information on the CART algorithm. In the syllabus I recommend another general book on data mining, Introduction to Statistical Learning, which is better on theory. There are also the SpringerLink books. But all I expect is that you read the textbook (DMRR) discussion of these parameters, so you understand what each one does. I would prefer that you save your detailed study for other, more important, algorithms. CART is simple and easy to explain, but it is basically obsolete compared with methods we will do later, especially Random Forests.
There is no “cookbook.” That’s why data mining is still difficult and still pays big $. It takes experience and judgment to do it well. Of course, there are guidelines and advice. But there are no detailed recipes that you just read and follow! For example, if you have 100,000 records, then a minimum split of 50 is “small.” But if you have 1000 records, that would be 5% of your entire sample, and is probably too large.
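The arithmetic behind that example can be checked in a few lines. This is a Python sketch for illustration only; the function name is made up, and it does not touch Rattle's actual parameter settings.

```python
def min_split_share(min_split, n_records):
    """Return the minimum split size as a fraction of the whole sample."""
    return min_split / n_records

# The same minimum split of 50 looks very different at the two sample sizes.
for n in (100_000, 1_000):
    share = min_split_share(50, n)
    print(f"{n} records: a minimum split of 50 is {share:.2%} of the data")
```

Thinking of the parameter as a share of your data, rather than as an absolute number, is one way to judge whether a setting is "small" or "large" for your sample.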
This is already a long assignment. At this stage of the course, you will get more benefit, and more learning, out of learning to transform variables. Here is a clue: Air conditioning is important. But there are 2 variables related to AC, for a total of 4 possibilities. Should you merge them in some fashion? (See my lecture notes for one solution.)
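As a hedged illustration of what "merging" could look like, here is a Python sketch that collapses two 0/1 AC flags into one three-level variable. This is one possible merge, not necessarily the one in the lecture notes. The argument names `ac` and `auto_ac` are hypothetical, and treating every car with the automatic flag as "automatic" is an assumption.

```python
def merge_ac(ac, auto_ac):
    """Collapse two 0/1 air-conditioning flags into a single label."""
    if auto_ac == 1:
        return "automatic"   # assumption: automatic AC outranks the plain AC flag
    if ac == 1:
        return "manual"
    return "none"

# Walk through all 4 combinations of the two flags.
for ac in (0, 1):
    for auto_ac in (0, 1):
        print(f"ac={ac}, auto_ac={auto_ac} -> {merge_ac(ac, auto_ac)}")
```

The four combinations become three levels, so the tree sees one variable about AC instead of two overlapping ones.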
>Question: How does Rattle execute this selection of inputs? Are we supposed to manipulate input/ignore options to construct different decision trees with different combinations of variables?
Answer: Choosing variables that matter is the essence of the CART algorithm (not Rattle). Read the chapters about how CART makes these decisions. Do not try to outguess the decision trees. However, you should think about some variables that are probably not worth considering at all. For example, if 99.5% of the cars are the same, leave out that variable unless you have 10,000 observations or more.
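The 99.5% rule of thumb can be written down as a quick check. This is a Python sketch with made-up toy data; the function names are hypothetical, and the two thresholds are simply the numbers from the guideline above, not fixed rules.

```python
from collections import Counter

def dominant_share(values):
    """Fraction of records that take the single most common value."""
    counts = Counter(values)
    return counts.most_common(1)[0][1] / len(values)

def is_near_constant(values, threshold=0.995, large_sample=10_000):
    """Flag a variable as a candidate to drop: almost all records share one value
    and the sample is too small for the rare cases to be informative."""
    return len(values) < large_sample and dominant_share(values) >= threshold

# Toy data: 996 of 1000 cars share the same value.
small_sample = [0] * 996 + [1] * 4
print(is_near_constant(small_sample))   # candidate to leave out

# Toy data: same proportion of sameness, but 20,000 records.
big_sample = [0] * 19_900 + [1] * 100
print(is_near_constant(big_sample))     # enough rare cases to keep it in play
```

A flag from this check is a prompt to think about the variable, not an automatic decision.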
>Another issue I encounter is the “model” variable. If I keep it as an input for the decision tree, Rattle uses it as one of the splitting variables, but it lists so many model names as one criterion and I don’t know how to interpret it. Should I just ignore this variable and instead use other criteria such as age and KM?
This is an excellent question, and one we will discuss in class. How many models are there? How many cars are in each category? If you have a ‘model’ with only a few cars of that type, then no statistical procedure will be able to work with it very well.
An easy place to start is to leave it out. Later, you can think about using it and see if results improve.
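Answering "how many models, and how many cars in each?" is a simple counting exercise. Here is a Python sketch on made-up labels; the function name, the model names, and the cutoff of 5 cars are all hypothetical choices for illustration, not values from the Toyota data.

```python
from collections import Counter

def sparse_levels(labels, min_count=5):
    """Return the categories that have fewer than min_count records."""
    counts = Counter(labels)
    return {level: n for level, n in counts.items() if n < min_count}

# Toy labels standing in for the 'model' column.
models = ["Model A"] * 40 + ["Model B"] * 12 + ["Model C"] * 2 + ["Model D"] * 1

print(f"Number of distinct models: {len(set(models))}")
print(f"Models with too few cars: {sparse_levels(models)}")
```

If most categories turn out to be sparse, that supports leaving the variable out at first, or grouping the rare categories before trying it again.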