Text-mining resources for projects

(This page will be augmented the week of May 6.)

Many projects involve text mining, and they need to go far beyond Chapter 20 and the homework assignments for the course. I have put together specific resources on text mining, including examples, discussions, and R code for particular purposes.
Text Mining Material: This page is mandatory reading for anyone doing a text-mining (TM) project. Finding the right guides can save huge amounts of time and frustration.

Here is a page on web scraping: Scraping Twitter and other web sources. Not all text-mining projects need to scrape their own data, but scraping is the only way to get the latest information.

Week 5: Random Forests, R, debugging

The ostensible material this week is Random Forests. They extend Classification/Regression Trees by combining many trees into a single ensemble model. There is no assignment for Monday; everything will be done in class. Redo the class material and hand it in on Wednesday (not Friday; we will go over the results in class).

Here is the assignment for Wednesday, including the readings on Random Forests: BDA18 Random Forest Assign May 2, 2018.

Here is the material used in class: BDA18 Random Forests2018B

The other agenda item this week is developing your debugging skills. Debugging is the key skill in writing code: figuring out what is wrong and how to fix it.

Week 6 covers text mining. Here is the assignment for next Monday: BDA18 Text mining #1 assign. (It is a short assignment, to be handed in Sunday night.)

Project topics – questions and answers

I’m getting some good questions about projects and project topics. Rather than answer individual emails, I will put questions and answers here. Questions are edited for brevity. You can ask questions by putting them in the Comments on this page, or by sending me an email. (Don’t forget: Put #BDA18 in your subject line.)

This page, like similar ones, is in the blog category Projects. Select that category to see all posts on that topic, newest first.

My email filter for BDA. It makes use of #BDA hashtags. Use similar filters on your email.

Q: What makes projects easier? What makes them harder?

Harder: Projects are more educational, and more challenging, with these features. If your project has several of them, use a two-person team.

  • Data that you scrape yourself from some source
  • Working with text
  • Merging data from two very different sources, such as weather data with event data.
  • Large data sets
  • Data that requires lots of rearranging before you use it. For example, the Sony data arrives in buckets of one hour. For most purposes, you then need to a) pick out individuals, and find all the rows that correspond to those individuals, and b) combine many hours or days of data.
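
To make the last bullet concrete, here is a minimal sketch of that kind of rearranging. It is written in plain Python only to keep it self-contained (the course itself works in R, and the same two steps carry over). The bucket layout and the field names hour, user_id, and minutes are invented for illustration; they are not the actual Sony format.

```python
from collections import defaultdict

# Hypothetical hourly buckets: each bucket holds the rows logged in one hour.
# Field names (hour, user_id, minutes) are assumptions, not the real data layout.
hourly_buckets = [
    {"hour": "2018-05-01 09:00", "rows": [
        {"user_id": "u1", "minutes": 12},
        {"user_id": "u2", "minutes": 30},
    ]},
    {"hour": "2018-05-01 10:00", "rows": [
        {"user_id": "u1", "minutes": 45},
    ]},
    {"hour": "2018-05-02 09:00", "rows": [
        {"user_id": "u2", "minutes": 5},
        {"user_id": "u1", "minutes": 20},
    ]},
]


def rows_by_individual(buckets):
    """Step (a): collect every row for each individual, across all buckets."""
    grouped = defaultdict(list)
    for bucket in buckets:
        for row in bucket["rows"]:
            # Carry the bucket's hour along so it is not lost after regrouping.
            grouped[row["user_id"]].append({"hour": bucket["hour"], **row})
    return dict(grouped)


def daily_totals(grouped):
    """Step (b): combine hourly rows into one total per individual per day."""
    totals = defaultdict(int)
    for user_id, rows in grouped.items():
        for row in rows:
            day = row["hour"].split(" ")[0]  # keep only the date part
            totals[(user_id, day)] += row["minutes"]
    return dict(totals)


grouped = rows_by_individual(hourly_buckets)
totals = daily_totals(grouped)
```

Even in this toy version, most of the code is bookkeeping rather than analysis, which is why data in this shape adds real effort to a project.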

Easier: These features reduce the effort in parts of the project. There may still be plenty of other areas that need a lot of work. If your project has many of these features, it is probably only appropriate for working by yourself.

Continue reading “Project topics – questions and answers”