Many projects involve text mining, and they need to go far beyond Chapter 20 and the homework assignments for the course. I have put together specific resources on text mining, including examples, discussions, and R code for particular purposes: Text Mining Material. This page is mandatory reading for anyone doing a TM project. Finding the right guides can save huge amounts of time and frustration.
The ostensible material this week is Random Forests. They are a generalization of Classification/Regression Trees: instead of relying on a single tree, a forest combines the predictions of many trees. No assignment for Monday; everything will be done in class. Redo the class material and hand it in on Wednesday (not Friday; we will go over the results in class).
I’m getting some good questions about projects and project topics. Rather than answer individual emails, I will put questions and answers here. Questions are edited for brevity. You can ask questions by putting them in the Comments on this page, or by sending me an email. (Don’t forget: Put #BDA18 in your subject line.)
This page and similar ones are in the blog category Projects. Select that category to see all posts on that topic, in reverse chronological order.
Q: What makes projects easier? What makes them harder?
Harder: Projects with these features are more educational and more challenging. If your project has several of these features, use a two-person team.
Data that you scrape yourself from some source
Working with text
Merging data from two very different sources, such as weather data with event data.
Large data sets
Data that requires lots of rearranging before you use it. For example, the Sony data arrives in buckets of one hour. For most purposes, you then need to a) pick out individuals and find all the rows that correspond to those individuals, and b) combine many hours or days of data.
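To make the rearranging in the last item concrete, here is a minimal sketch of the two steps, written in Python with only the standard library. The field names (`hour`, `user_id`, `minutes`) and the sample rows are invented for illustration; they are not the actual layout of the Sony data, and the same idea can be expressed in R.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical hourly buckets: one row per (hour, individual).
# Field names and values are made up; this is not the real Sony schema.
hourly_rows = [
    {"hour": "2018-04-02 09:00", "user_id": "u1", "minutes": 12},
    {"hour": "2018-04-02 09:00", "user_id": "u2", "minutes": 30},
    {"hour": "2018-04-02 10:00", "user_id": "u1", "minutes": 45},
    {"hour": "2018-04-03 09:00", "user_id": "u1", "minutes": 20},
    {"hour": "2018-04-03 21:00", "user_id": "u2", "minutes": 5},
]


def rows_by_individual(rows):
    """Step (a): collect every row under the individual it belongs to."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["user_id"]].append(row)
    return dict(grouped)


def daily_totals(rows):
    """Step (b): combine one individual's hourly rows into per-day totals."""
    totals = defaultdict(int)
    for row in rows:
        day = datetime.strptime(row["hour"], "%Y-%m-%d %H:%M").date().isoformat()
        totals[day] += row["minutes"]
    return dict(totals)


per_user = rows_by_individual(hourly_rows)
daily = {user: daily_totals(rows) for user, rows in per_user.items()}
print(daily)
```

With real data the grouping key and the aggregation (sum, mean, count) depend on the question you are asking, so treat the choices here as placeholders.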
Easier: These features reduce the effort in parts of the project. There may still be plenty of other areas that need a lot of work. If a project has a lot of these features, it is probably only appropriate for one person working alone.