This page has been expanded after class.
- (new) Important: list of key ideas, in draft form only. BDA18 Class 4 key concepts
- (new) Lecture notes: BDA18 Class 4 Lecture notes Toyota. This has answers to some of the questions asked on post-its. I still need to post additional Q&A.
- (new) Updated page with more Questions & Answers about the Toyota case: Q&A about CART + Toyota. What to put in your write-up, what to do about the “Model” variable, etc.
- Feiyang will have an R session, this Friday at 1pm. Gardner Auditorium. These classes are required for the R certification. See also the new page Resources for R language. My office hours tonight 6:30 pm – let’s talk about paper topics.
- A new page for trading information about possible topics: Final paper ‘dating site’. Remember that your first proposals are due Saturday. You may turn in more than one, even with different combinations of people.
- If you find any dataset resources that other students will be especially interested in, please say something, either on this page or in class on Wednesday.
- The next step will be feedback on the proposals from me. I will usually say either a) flesh this out more and resubmit it, or b) this is interesting, now set up a meeting with me to discuss it.
- Decision Sciences has some data for a possible student project. It appears to me like a 3-dimensional analog of GIS data, such as the Big Pixel project. So it is more an engineering project than business or social science. On the other hand, knowing how to analyze 3-D data is a specialized skill and will be increasingly valuable. Here is their introduction to the project.
- Decision Sciences has developed and commercialized a technology capable of characterizing the contents of volumes based on the natural flux of cosmic-ray muons and electrons. Applying no artificial radiation, the system can be used to inspect the contents of large cargo conveyances for the trafficking of smuggled contraband and threat materials without interfering with the flow of commerce. Because it is passive and completely safe, inspection can occur concurrently with other activities, such as manifest review and driver questioning. MMPDS is staged to revolutionize cargo and vehicle scanning.
- Sony Entertainment has huge amounts of data on customers’ online gaming behavior. We could easily have 2 teams working on this data, and conceivably 3. Here is a brief description of the data.
Our data dumps work on a chronological, unsampled basis. So students could have one hour of all the activity, two hours, etc. There would probably be in excess of 20,000 individuals having sessions during such an hour from the mobile app alone.
Every *hit* in the clickstream may or may not carry information such as the name of the page, the device/browser information, error codes, the page URL, visitor frequency info (what session # is this?) and, naturally, account and device IDs.
What questions to ask: things that correlate with or indicate a commercial activity (such as the Go Buy This Game Online button clicks) or indicate high “engagement” (high visit numbers, long visits in time) are of high interest, so start there and work backwards. Clustering of users would also be of interest.
This is a great opportunity, since Sony is a local company and will be interested in anything you learn. Given the large size of the dataset, it would be good to have someone on your team who is not afraid of getting their hands dirty with heavy-duty number crunching. I know what needs to be done, but I will just explain it once – after that, the team will have to go do it.
- If you want to work on this data, submit a proposal that shows serious interest, and also that your team has the qualifications (or the willingness to learn) to get past the early data-crunching hurdles. For example, do some research on how another online platform or game company analyzes streams of user transactions.
- Next week we will look at linear classification models, in contrast to week 2 when we looked at CART, a model for very nonlinear situations. Assignments have been posted on Ted.
- I will provide answers to the post-it questions. They are great questions. Some we can answer now; others we will deal with in a few weeks.
- Next week we will also “dig in” to Rattle. What are all the mysterious check boxes? Keep in mind that Rattle is just a front-end for R. Everything that Rattle does is taken from the behavior of some function in R. Every choice that you make in Rattle is there due to a corresponding choice in R. For example, here is the R code that Rattle creates to choose random rows.
> set.seed(crv$seed) # This uses the seed number that you input into the first Rattle screen.
> crs$nobs <- nrow(crs$dataset) # 1436 observations
> crs$sample <- crs$train <- sample(nrow(crs$dataset), 0.7*crs$nobs) # 1005 observations. sample(n, size) with a single number n draws size values at random from 1:n, so this returns row numbers, not the rows themselves.
> crs$validate <- sample(setdiff(seq_len(nrow(crs$dataset)), crs$train), 0.15*crs$nobs) # 215 observations. Chooses random validation rows from the rows not already used for training.
> crs$numeric <- c("Age_08_04", "Mfg_Year", "KM", "HP", "Doors", "Quarterly_Tax", "Weight", "Guarantee_Period") # Use straight quotes; R does not accept curly quotes.
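This comes from the Log tab in Rattle, after you execute the model. Try typing these lines yourself directly in RStudio. First, create a garbage data frame that you name crs$dataset, with at least 1000 rows. If Rattle is not loaded, crs and crv will not exist yet, so create them yourself (as lists or environments) and set crv$seed to any number. Everything else should then work.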
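Here is one way to try this outside Rattle. It is only a sketch: the empty environments standing in for crs and crv, the seed value 42, and the made-up KM and HP columns are my assumptions, not what Rattle itself creates. The row count of 1436 matches the Toyota data, so the sample sizes should come out the same as in the log.

```r
# Stand-ins for the objects Rattle normally sets up (assumed, not Rattle's own code).
crv <- new.env()   # holds settings such as the seed
crs <- new.env()   # holds the dataset and the sampled row numbers
crv$seed <- 42     # arbitrary seed; Rattle uses whatever you typed on its first screen

# Garbage data frame with the same number of rows as the Toyota data.
crs$dataset <- data.frame(
  KM = round(runif(1436, min = 1, max = 250000)),
  HP = sample(c(69, 86, 90, 110), 1436, replace = TRUE)
)

# The lines from the Rattle log, without the prompt.
set.seed(crv$seed)
crs$nobs <- nrow(crs$dataset)
crs$sample <- crs$train <- sample(nrow(crs$dataset), 0.7 * crs$nobs)
crs$validate <- sample(setdiff(seq_len(nrow(crs$dataset)), crs$train), 0.15 * crs$nobs)

# Check the sizes and confirm that no row is in both sets.
length(crs$train)
length(crs$validate)
length(intersect(crs$train, crs$validate))

# The sampled values are row numbers; use them to pull out the actual rows.
head(crs$dataset[crs$train, ])
```

The two sample sizes are not whole numbers (0.7 * 1436 is 1005.2), and sample() simply drops the fraction. Change crv$seed and rerun to see that a different seed gives different rows but the same set sizes.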