Text mining homework: speeding up calculations

A few hours ago I received an email from Emily about the farm advertisement problem due on May 11. I wrote her back. But her question raises general issues for many projects. I’m sure other students also had similar problems with Friday’s homework.

In response, I just wrote BDA18 Memo: My program runs too slowly, v 1.1.
New version! BDA18 Memo: My program runs too slowly, v 1.2. The memo includes most of my specific suggestions about the May 11 homework. (This may be too late for some students, but most of the ideas were also discussed briefly in class on Monday or Wednesday.)

There are probably multiple typos and errors in the memo. Please send me corrections by email, for class credit.

I am working on the assignment due tomorrow and have encountered a problem. When reducing the TF-IDF matrix to 20 concepts, RStudio always stops working (as indicated by the little ‘Stop’ sign in the console). I’m thinking this is because the farm-ads.csv dataset is too large. Without reducing the concepts, I am unable to move forward with the random forest part of the assignment. I am wondering if there is a solution to this problem or a way to work around it.
Apologies in advance for not approaching you with this question earlier. It’s been a very hectic week!
Thanks for your help,
By the way, I’m 98% sure that in fact RStudio did not “stop working.” It was probably still cranking away. Check the Activity Monitor application on your computer to be sure.
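
A likely culprit is calling a full SVD on a dense version of the document-term matrix, which can grind for a very long time on a dataset the size of farm-ads.csv. One common workaround (not from the assignment handout; the matrix below is a toy stand-in, and the `irlba` package is an assumption on my part) is to keep the TF-IDF matrix sparse and compute only the top 20 singular vectors:

```r
# Sketch: partial SVD on a sparse TF-IDF matrix.
# `tfidf` here is a random toy matrix; replace it with your real one.
library(Matrix)   # sparse matrix classes
library(irlba)    # truncated SVD; much faster than svd() for a few concepts

set.seed(1)
tfidf <- rsparsematrix(4000, 30000, density = 0.001)

# svd(as.matrix(tfidf)) would densify the matrix and can run for hours;
# irlba() computes only the k largest singular vectors.
k <- 20
dec <- irlba(tfidf, nv = k)

# Document scores on the 20 concepts, ready to feed a random forest.
concepts <- dec$u %*% diag(dec$d)
dim(concepts)   # 4000 documents x 20 concepts
```

The point of the design is that you never materialize the full dense matrix or the full decomposition; you only ever ask for the 20 concepts you actually need.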

Two warnings as projects heat up

At this stage each year, many teams run into either or both of two problems.

A. Getting error messages in R that appear to indicate their computer is out of memory (RAM). This is annoying but almost always straightforward to fix. At least two teams have already run into this problem in 2018 and assumed it would be a major difficulty.
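
Two fixes cover most of these cases: keep big text matrices in a sparse representation instead of converting them to dense ones, and explicitly discard intermediates you no longer need. A small illustration (object names are made up):

```r
# Sketch: common R memory fixes.
library(Matrix)

# 1. Keep large document-term matrices sparse. as.matrix() on a big one
#    explodes RAM: a 40,000 x 100,000 dense double matrix is ~32 GB by itself.
m <- rsparsematrix(1000, 5000, density = 0.001)
print(object.size(m))              # tiny: only nonzero entries are stored
# object.size(as.matrix(m))        # dense copy would be ~40 MB even for this toy

# 2. Drop intermediates you no longer need and reclaim the memory.
rm(m)
gc()
```

If the error persists after this, check whether some step is silently densifying the data (many modeling functions do), rather than assuming you need new hardware.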

B. More subtle and harder to fix is getting bogged down somewhere and running out of time. A common place this happens is in data acquisition and cleaning. This is easy to fix “in theory,” but my sad experience is that some teams sink into the trap of “just a little longer, and we will be finished.” This stage can last for weeks!

A few sad examples

  • More than one team has spent several weeks locating, downloading, cleaning, and merging data about crime (or other topics) in multiple cities. When they started to analyze it carefully they discovered that the crime reporting systems in the cities were quite different. By then it was week 8 of the course, and they only had time for a partial analysis of one city.
  • A team had too little time to tune their models and algorithms. The result was a prediction that had too much error to be useful.
  • A team was racing to finish, and when they got their model results they did not take the time to check that they were reasonable. They submitted a report claiming a prediction error below 1 percent. That means, invariably, that there is some “time travel” in their data: one of the seemingly independent variables is actually a converted version of what they are predicting. Example: EPA fuel mileage, where fuel efficiency, oil consumption, and CO2 emissions all measure approximately the same thing.
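
A quick sanity check catches most of this. Before trusting a suspiciously low error rate, correlate each predictor with the target; anything near ±1 is probably the target in disguise. A sketch using the fuel-mileage idea (the numbers and variable names are made up for illustration):

```r
# Sketch: leakage ("time travel") check before trusting a <1% error rate.
# co2 here is roughly 8887 / mpg, mimicking how CO2 emissions are
# essentially a rescaled copy of fuel efficiency.
df <- data.frame(
  mpg = c(30, 22, 18, 35, 27),
  co2 = c(296, 404, 494, 254, 329),
  hp  = c(110, 150, 200, 95, 130)
)
round(cor(df$mpg, df[, c("co2", "hp")]), 2)
# A correlation with magnitude close to 1 (here, co2) flags a predictor
# that should be dropped before fitting the model.
```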

What to do?

I will gradually provide notes on avoiding, or solving, both of these problems. Please take them seriously. A few hours invested now can save (literally) a week or longer later in your project.

  1. Memo: What to do if you run out of memory? BDA18 Running out of memory, v 1.3
  2. Don’t get bogged down!! Keep moving! You can go back and improve it later!