Text mining homework: speeding up calculations

A few hours ago I received an email from Emily about the farm advertisement problem due on May 11. I wrote her back, but her question raises general issues for many projects. I’m sure other students had similar problems with Friday’s homework.

In response, I just wrote BDA18 Memo =My program runs too slowly v 1.1.
New version! BDA18 My program =slow v 1.2. The memo includes most of my specific suggestions about the May 11 homework. (This may be too late for some students, but most of the ideas were also discussed briefly in class on Monday or Wednesday.)

There are probably multiple typos and errors in the memo. Please send me corrections by email, for class credit.

I am working on the assignment due tomorrow and have encountered a problem. When reducing the TF-IDF matrix to 20 concepts, RStudio always stops working (as indicated by the little ‘Stop’ sign in the console). I’m thinking this is because the farm-ads.csv dataset is too large. Without reducing the concepts, I am unable to move forward with the random forest part of the assignment. I am wondering if there is a solution to this problem or a way to work around it.
Apologies in advance for not approaching you with this question earlier. It’s been a very hectic week!
Thanks for your help,
By the way, I’m 98% sure that in fact RStudio did not “stop working.” It was probably still cranking away. Check the Activity Monitor application on your computer to be sure.

Author: Roger Bohn

Professor of Technology Management, UC San Diego. Visiting Stanford Medical School. Rbohn@ucsd.edu. Twitter =Roger.Bohn

2 thoughts on “Text mining homework: speeding up calculations”

  1. In this homework, I applied the “Sparse” function (removeSparseTerms), which is really useful for reducing the running time.

    With parameter 0.9, the number of terms decreases from about 50000 to about 200, and LSA takes only a few seconds to run.
    With parameter 0.99, the number of terms decreases from about 50000 to about 2700, and LSA takes 5 minutes to run.

    This works well for me. Hope it will also work for others!



    1. Thanks for the information, Jessica. The memo attached to this message, BDA18 Memo =My program runs too slowly v 1.1, explains this method in more detail, and how you can use it either in addition to LSA, or instead of LSA. I used the function removeSparseTerms in the Monday and Wednesday code that we did in class. It was suggested in Monday’s handout, Basic Text Mining in R by Philip Murphy.
      Initially, I found the parameter (e.g. .95 or .98) confusing, so I experimented with it until I got the number of terms I wanted (e.g. 200 or 2700 in Jessica’s examples).
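
      For anyone who finds the sparsity parameter as confusing as I did, here is a minimal sketch of the idea in Python rather than R. It is not the tm package's code. It assumes the rule is "keep a term only if the fraction of documents that do not contain it is below the parameter", which is how I read the removeSparseTerms documentation. The function name remove_sparse_terms and the four toy documents are made up for illustration. A higher parameter keeps more terms, which matches the 0.9 versus 0.99 numbers above.

```python
from collections import Counter


def remove_sparse_terms(docs, sparse):
    """Return the set of terms to keep (hypothetical helper, not the tm API).

    docs   : list of documents, each a list of term strings
    sparse : number strictly between 0 and 1; a term is kept only if the
             fraction of documents that do NOT contain it is below this value
    """
    if not 0 < sparse < 1:
        raise ValueError("sparse must be strictly between 0 and 1")

    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        # Count each term once per document, however often it occurs there.
        doc_freq.update(set(doc))

    kept = set()
    for term, df in doc_freq.items():
        sparsity = 1 - df / n_docs  # share of documents missing this term
        if sparsity < sparse:
            kept.add(term)
    return kept


# Toy data invented for this sketch (not the farm-ads.csv dataset).
docs = [
    ["farm", "tractor", "feed"],
    ["farm", "feed", "horse"],
    ["farm", "tractor", "seed"],
    ["farm", "organic"],
]

# Experiment with the parameter, as described above, and watch the term count.
for threshold in (0.3, 0.6, 0.8):
    kept = remove_sparse_terms(docs, threshold)
    print(threshold, len(kept), sorted(kept))
```

      The loop at the end is the experiment I described: try a few values and look at how many terms survive before running anything slow such as LSA or a random forest.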

