Additional tips; turning in report

(updated 11am June 12) Where to turn in your paper: You may put it under the door of my office (room 1315, 3rd floor) or in my faculty mailbox. Submit the PDF version on TritonEd, via the Turnitin link.

Improving the course: I think I have approximately the right mix of topics in Big Data Analytics, but there are other areas that I would like to improve. For example, how can I better smooth the workload in the projects, to reduce the end-of-year crunch? Here is a BDA End-of-year questionnaire that asks a series of questions about what I should change.

Thanks for taking the class – I certainly enjoy teaching it. Enjoy graduation! Enjoy life after the university!

===========================

We are finished with formal classes, so I will post here some additional advice for your projects. Most of this is based on things I noticed either in interim reports or in the final presentations on June 6. It’s going to take a few days to write and edit all of these notes, so keep checking this page through Saturday.

Some of this advice was written with one specific project or team in mind, but all of it applies to multiple projects.

  1. Don’t use loops in R. Most computer languages make heavy use of FOR loops, but R avoids almost all uses of loops, and code runs faster and is easier to write and debug without them. In some cases, R code that avoids loops will literally run 100x faster. Here is an example, taken from one of this year’s projects, of how to rewrite code without needing a loop: BDA18 Avoid loops in R
  2. Don’t use CSV data when working with large datasets. CSV (comma-separated value) files have become a lingua franca for exchanging data among different computer languages and environments. For that purpose, they are decent. But they are very inefficient, in terms of both speed and memory use. One team mentioned that they were running into memory limits; the problem was most likely due to their keeping CSV files around! Solution: use read.csv to get CSV data into R. After that, store your data as R objects (dataframes or some other kind). If you want to store intermediate results in a file, create the object inside R, then use RStudio to save it as an R object (file name ends in .RData). When you want it again, use RStudio’s File/Open File command to load it. No additional conversion will be needed.
    I will add something about this to the notes on Handling Big Data. Dealing with the “big” in Big Data.
  3. Fix unbalanced accuracy in confusion matrices. Yesterday I noticed several confusion matrices with much higher accuracy for one class than the other. Reminder: we spent 1.5 classes on this topic, and there are multiple solutions. The imbalance is usually due to having much more data of one class than the other. See:
    1. Week 8 assignments and notes
  4. Good graphics. Because graphics are a concise way to communicate, even with non-specialists, I recommend having at least one superb image in every report. (See BDA18 Writing your final report.) On Friday I will try to post some examples from your projects, with comments on how to make them even better.
  5. Revised discussion on how to write a good final report. Writing your final report June 8
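To illustrate point 1, here is a minimal sketch of replacing a loop with vectorized code. The data here are invented for illustration (not from any team’s project); both versions produce identical results, but on large vectors the vectorized form is dramatically faster.

```r
# Hypothetical data: per-item prices and quantities
set.seed(1)
prices     <- runif(1e5, min = 1, max = 100)
quantities <- sample(1:10, 1e5, replace = TRUE)

# Loop version: slow, verbose, and easy to get wrong
revenue_loop <- numeric(length(prices))
for (i in seq_along(prices)) {
  revenue_loop[i] <- prices[i] * quantities[i]
}

# Vectorized version: one line, no loop
revenue_vec <- prices * quantities

all.equal(revenue_loop, revenue_vec)  # TRUE
```

The same idea extends to sums (sum), conditional replacement (ifelse), and group summaries (tapply), which together cover most of the loops that show up in project code.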
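To illustrate point 2, here is a self-contained sketch of the read-once, then save-as-R-object workflow. The file paths use tempfile() placeholders rather than real project files, and saveRDS/readRDS are the code equivalents of the RStudio menu commands mentioned above.

```r
# Stand-in for your real data, written to a temporary CSV for the demo
df <- data.frame(x = 1:3, y = c("a", "b", "c"))
csv_path <- tempfile(fileext = ".csv")
write.csv(df, csv_path, row.names = FALSE)

# Step 1: parse the CSV exactly once
df <- read.csv(csv_path)

# Step 2: save the parsed object in R's native binary format
rds_path <- tempfile(fileext = ".rds")
saveRDS(df, rds_path)

# Later sessions: load it back directly, no re-parsing or conversion
df2 <- readRDS(rds_path)
identical(df, df2)  # TRUE
```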
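For point 3, one of the simplest fixes for unbalanced classes is to downsample the majority class before training. This sketch uses made-up labels (950 “no” versus 50 “yes”); packages such as caret offer ready-made functions for the same idea.

```r
# Hypothetical imbalanced outcome: 950 "no" vs 50 "yes"
set.seed(42)
y <- factor(c(rep("no", 950), rep("yes", 50)))

# Keep all minority-class rows; sample an equal number of majority rows
idx_yes <- which(y == "yes")
idx_no  <- sample(which(y == "no"), length(idx_yes))
balanced_idx <- c(idx_no, idx_yes)   # use these row indices for training

table(y[balanced_idx])  # 50 "no", 50 "yes"
```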

Keep moving! Don’t bog down trying to be perfect!

Aiming for a perfect data mining project leads to disaster. Instead, use incremental prototyping.

MEMORANDUM on Projects
DIRE WARNINGS BELOW - READ CAREFULLY

To: Students in Big Data Analytics BDA18
Subject: Managing a data mining project for speed and success. Avoid Perfectionism!
From: Your boss’s boss, Prof. Roger Bohn
Date: May 3, 2017; updated May 1, 2018
PDF version of this memo, June 5, 2018: Dire warning about big data projects

Introduction

Managing projects is a key life skill, and it’s something that you will never stop improving. These memos are intended to help you manage your projects with insight, and to learn from your management experience. They contain insights that make Data Mining projects successful overall, even though they don’t correspond to a particular formula or R function. By comparison, the weekly project assignments are intended more as step-by-step guides.

Early in this course it may seem difficult to know what projects will be feasible, and therefore to write a proposal. You don’t yet know what techniques will be taught, you don’t know how to manipulate data in R, and so forth. It will turn out that these are not big difficulties in successfully completing projects. Rather, the big issues are:

A. Can you find, or create from the web, a large data set with interesting variables in it? The best data sets have event-level data, not aggregated data. For example, for crime there is an entry for every reported crime. For e-commerce, there is an entry for every transaction, or every item in the catalog, or every customer. (All three would be ideal.) This is now easy – there are huge data sets publicly available on numerous topics.

Is the data suitable in other ways? It must not be confidential, it must be clean or cleanable, and so on. It does not have to be in the right format already, just some format that you can get your hooks into.

B. Do you have some interesting questions/issues to investigate that this data contains information about? This is limited mainly by your imagination and your search skills in Proquest and Google Scholar. Look for papers about analogous issues in other countries/industries/data sets. Search their references (backward linking), and papers that reference them (forward linking).

C. Specific mining techniques, like Random Forests versus Nearest Neighbor, are just tools, and they will not make or break your project. Not having interesting questions can break your project.

D. Once you are moving, the biggest issues for most teams are:

  1. Project-specific data mining concepts and techniques. Each project relies on a few key methods that go beyond what the course covers. Examples are geographic analysis, time series, scraping data from the web, and text analysis. Find a narrowly targeted book or web tutorial that already has R code in it. Use those methods where appropriate.
  2. Figuring out how to incorporate chunks of R code without having to write them yourselves.
  3. Managing the research project: Assembling the data, doing the analysis, writing up your results.
  4. Running out of RAM (memory) in your computer. This is actually a minor problem for almost everyone, but you will need to learn a few tricks to make it go away. Look for separate memoranda on this topic.
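A couple of those memory tricks can be shown in a few lines of base R. This is only a sketch (exact sizes will vary); the key habits are measuring which objects are large, removing the ones you no longer need, and letting the garbage collector reclaim the space.

```r
# Measure: how much RAM does an object actually use?
big <- matrix(rnorm(1e6), nrow = 1000)   # one million doubles
print(object.size(big), units = "MB")    # about 7.6 MB

# Reclaim: drop objects you are done with, then garbage-collect
rm(big)
invisible(gc())
```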

Skirting the Pit of Perfectionism

Here is a warning that I gave a team in week 5. This team has great data (potentially), covering multiple years with roughly 100,000 observations each year. They reported that “By our next weekly report, we hope to have merged all four years of data, and be able to produce some charts from the data.” Here is what I wrote back to them: