Additional tips; turning in report

(updated 11am June 12) Where to turn in your paper: You may put it under the door of my office (room 1315, 3rd floor), or in my faculty mailbox. Also submit the PDF version on TritonEd, using the Turnitin link.

Improving the course: I think I have approximately the right mix of topics in Big Data Analytics, but there are other areas that I would like to improve. For example, how can I better smooth the workload in the projects, to reduce the end-of-year crunch? Here is a BDA End-of-year questionnaire that asks a series of questions about what I should change.

Thanks for taking the class – I certainly enjoy teaching it.  Enjoy graduation! Enjoy life after the university!

===========================

We are finished with formal classes, so I will post here some additional advice for your projects. Most of this is based on things I noticed either in interim reports or in the final presentations on June 6. It’s going to take a few days to write and edit all of these notes, so keep checking this page through Saturday.

Some of this advice was written with one specific project or team in mind, but all of it applies to multiple projects.

  1. Don’t use loops in R. Most computer languages make heavy use of FOR loops. Idiomatic R avoids most loops by operating on whole vectors and data frames at once, and the resulting code runs faster and is easier to write and debug. Here is an example of how to rewrite code without needing a loop, taken from one of this year’s projects: BDA18 Avoid loops in R. In some cases, R code that avoids loops will run 100x faster (literally).
  2. Don’t use CSV data when working with large datasets. CSV (comma-separated values) files have become a lingua franca for exchanging data among different computer languages and environments. For that purpose, they are decent. But they are very inefficient, in both speed and memory use. One team mentioned that they were running into memory limits, but the problem was most likely due to their keeping CSV files around! Solution: Use read.csv once to get CSV data into R. After that, store your data as R objects (dataframes or some other kind). If you want to store intermediate results in a file, create the object inside R, then use RStudio to save it as an R object. (File name ends in .Rdata.) When you want it again, use the RStudio File/Open File command to load it. No additional conversion will be needed.
    I will add something about this to the notes on Handling Big Data. Dealing with the “big” in Big Data.
  3. Fix unbalanced accuracy in confusion matrices. Yesterday, I noticed several confusion matrices with much higher accuracy for one case than the other. Reminder: We spent 1.5 classes on this topic, and there are multiple solutions. It’s usually due to having much more data of one type than the other. See:
    1. Week 8 assignments and notes
  4. Good graphics. Because graphics are a concise way to communicate, even with non-specialists, I recommend having at least one superb image in every report. (See BDA18 Writing your final report.) On Friday I will try to post some examples from your projects, with comments on how to make them even better.
  5. Revised discussion on how to write a good final report. Writing your final report June 8
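
To make tip 1 concrete, here is a minimal sketch of the same calculation written with and without a loop. The data and variable names are made up for illustration and are not taken from any project.

```r
set.seed(1)

# Hypothetical data: 100,000 trip distances in miles
miles <- runif(1e5, min = 0, max = 50)

# Loop version: handles one element at a time
km_loop <- numeric(length(miles))
label_loop <- character(length(miles))
for (i in seq_along(miles)) {
  km_loop[i] <- miles[i] * 1.609344
  if (miles[i] > 25) {
    label_loop[i] <- "long"
  } else {
    label_loop[i] <- "short"
  }
}

# Vectorized version: operates on the whole vector at once
km_vec <- miles * 1.609344
label_vec <- ifelse(miles > 25, "long", "short")

# Both versions give the same answer
stopifnot(isTRUE(all.equal(km_loop, km_vec)))
stopifnot(identical(label_loop, label_vec))
```

To see the speed difference on your own machine, wrap each version in system.time(). The size of the gain depends on the calculation and on the size of the data.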

Sign up for Wed. presentations

The goal of your presentation is to astound and interest your classmates. (Educating them is nice, also.) So think in terms of an “elevator pitch” for your research, aimed at someone who has zero idea what you have done but does know about data mining. Show a few viewgraphs.

I suggest being light on prose. Instead, show samples of the world you were investigating (e.g. real tweets), nice infographics about the problem or what you found, etc. This is practice for talking with an outside audience, and NOT for talking to academics.

Sign up for your place in the presentation sequence at https://doodle.com/poll/qyqwmfnaaf2ibaxw (first come, first served). The times in this Doodle poll are wrong! Expect 5 minutes, so present 1 or 2 cool viewgraphs only.

Don’t forget guidance on final reports. It’s at https://bda2020.files.wordpress.com/2018/04/bda18-writing-your-final-report.pdf
It includes a rubric, checklist, etc. I am editing this document to make it more readable.

You can text me to ask about irregular office hours. Tonight (Monday) IFF anyone asks, and other times by arrangement. Be sure to tell me who you are. +1 858 381-2015

BDA 2018; final schedule (updated)

I have created a new list of resources for specific project types, such as spatial analysis and Twitter analysis. It is under a heading on the Latest Handouts page: Special topics for individual papers.

Summary of the last 2 weeks of the course:

  • Only nominal homework – readings and one figure.
  • Work on projects. Ask for help if desired.  No more interim reports are due.
  • R Certification: If you want R certification for the course, take a one-hour quiz and meet some other requirements.
  • Make an in-class presentation: two-person teams only.
  • Final paper due

Wednesday, May 30. Handling unbalanced data, and other useful techniques.
Reading: Chapter 5.5, also 5.3 and 5.4. These were assigned previously.
Nothing to be turned in
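
As a reminder of what the unbalanced-data problem looks like, here is a minimal sketch in base R. The confusion matrix and the data set are invented for illustration, and downsampling the majority class is only one of the several remedies covered in class.

```r
set.seed(42)

# Made-up confusion matrix: rows are actual classes, columns are predictions
cm <- matrix(c(940, 40, 10, 10), nrow = 2,
             dimnames = list(actual = c("no", "yes"),
                             predicted = c("no", "yes")))

overall_accuracy <- sum(diag(cm)) / sum(cm)  # looks good: 0.95
per_class <- diag(cm) / rowSums(cm)          # "no" about 0.99, "yes" only 0.20

# Made-up training data with 950 "no" rows and 50 "yes" rows
d <- data.frame(x = rnorm(1000),
                y = factor(c(rep("no", 950), rep("yes", 50))))

# Downsample: keep a random subset of each class, the size of the rarest class
minority_n <- min(table(d$y))
balanced <- do.call(rbind, lapply(split(d, d$y), function(g) {
  g[sample(nrow(g), minority_n), ]
}))

table(balanced$y)  # 50 rows of each class
```

Whichever remedy you use, rebalance only the training data and keep the validation data in its original proportions, so that the reported accuracy reflects the real mix of cases.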

Saturday, June 2: No progress report is due.

Monday, June 4: A/B Testing and other emerging topics in  Big Data

  • Look up specific techniques for your project. Spatial data and GIS, Text processing, Crime, Graphics, or Twitter. One or more applies to every project. Special topics for individual papers. 
  • Turn in: One careful plot from your project. Hard copy, with comments on it by hand. Format the plot carefully and clearly, including scales, colors, definitions, etc. Please turn these in by hand in class. This is to encourage hand-writing of comments. Circle and explain at least one interesting/important feature of your plot.
    • Include a caption. Captions in scientific papers are sometimes several sentences long.
    • The goal of the assignment is to help you focus intensively on one result of your project, and how to explain it visually. It does not have to be a data-mining result.
  • Reading: “The A/B Test: Inside the Technology That’s Changing the Rules of Business,” Wired Magazine, 04.25.12. https://www.wired.com/2012/04/ff_abtesting/all/
  • Visit an e-commerce website and think about how to improve it using A/B testing.

Wednesday, June 6: All two-person project teams will give 5 to 7 minute presentations. The goal is to fascinate, impress, and surprise your audience. Think of this as the “elevator pitch” for your project.

Friday, June 8, 1pm or other times as agreed: Quiz for R Certification. The quiz emphasizes data manipulation in R: selecting data subsets, creating new variables, and rearranging and redefining data such as event logs. The other requirements for R certificates are completing your project using appropriate R programming, and attending 50% of TA tutorials.
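
For a sense of the kind of data manipulation involved, here is a small base R sketch on a made-up event log. It is not taken from the quiz, and the column names are hypothetical.

```r
# Hypothetical event log: one row per transaction, in no particular order
events <- data.frame(
  user = c("b", "a", "c", "a", "b", "b"),
  time = as.POSIXct(c("2018-06-01 09:30:00", "2018-06-01 09:00:00",
                      "2018-06-01 11:00:00", "2018-06-01 10:15:00",
                      "2018-06-01 09:45:00", "2018-06-01 12:00:00"),
                    tz = "UTC"),
  amount = c(5, 10, 15, 25, 0, 40),
  stringsAsFactors = FALSE
)

# Select a subset: only transactions above 10
large <- events[events$amount > 10, ]

# Create a new variable: hour of day of each event
events$hour <- as.integer(format(events$time, "%H"))

# Rearrange: put the log in time order
events_sorted <- events[order(events$time), ]

# Redefine: collapse the event log to one row per user
per_user <- aggregate(amount ~ user, data = events, FUN = sum)
per_user
```

None of these steps needs a loop; each one operates on whole columns at once.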

Friday, June 8 midnight: Formal due date for final project papers.
All projects that request one receive an automatic extension until Wednesday.
Submit both hard copy and PDF files. Submit via Turnitin, on TritonEd.

Wednesday, June 13 (previously June 11): Deadline for projects.

Text mining homework: speeding up calculations

A few hours ago I received an email from Emily about the farm advertisement problem due on May 11. I wrote her back. But her question raises general issues for many projects. I’m sure other students also had similar problems with Friday’s homework.

In response, I just wrote BDA18 Memo =My program runs too slowly v 1.1.
New version! BDA18 My program =slow v 1.2. The memo includes most of my specific suggestions about the May 11 homework. (This may be too late for some students, but most of the ideas were also discussed briefly in class on Monday or Wednesday.)

There are probably multiple typos and errors in the memo. Please send me corrections by email, for class credit.

I am working on the assignment due tomorrow and have encountered a problem. When reducing the TF-IDF matrix to 20 concepts, RStudio always stops working (as indicated by the little ‘Stop’ sign in the console). I’m thinking this is because the farm-ads.csv dataset is too large. Without reducing the concepts, I am unable to move forward with the random forest part of the assignment. I am wondering if there is a solution to this problem or a way to work around it.
Apologies in advance for not approaching you with this question earlier. It’s been a very hectic week!
Thanks for your help,
Emily
By the way, I’m 98% sure that in fact RStudio did not “stop working.” It was probably still cranking away. Check the Activity Monitor application on your computer to be sure.
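
One general way to tell “slow” from “stopped” is to run the heavy step on a small random sample first and time it. The sketch below illustrates that idea only; it is not a summary of the memo. The matrix is random stand-in data (not farm-ads.csv), and base R's svd() is swapped in for whatever concept-reduction function the homework uses.

```r
set.seed(1)

# Stand-in for a TF-IDF matrix: 2000 documents (rows) by 300 terms (columns)
tfidf <- matrix(as.numeric(rpois(2000 * 300, lambda = 0.2)), nrow = 2000)

# Try the expensive step on a 10% random sample of documents first
sample_rows <- sample(nrow(tfidf), 200)
small <- tfidf[sample_rows, ]

k <- 20  # number of concepts to keep
timing <- system.time(
  s <- svd(small, nu = k, nv = k)
)

# Document coordinates in the k-concept space
concepts <- s$u %*% diag(s$d[1:k])

dim(concepts)      # 200 documents by 20 concepts
timing["elapsed"]  # seconds for the sample; expect the full data to take much longer
```

If the sample finishes in a reasonable time, increase the sample size in steps and watch how the elapsed time grows before committing to the full matrix.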

Text-mining resources for projects

(This page will be augmented the week of May 6.)

Many projects involve text mining, and they need to go far beyond Chapter 20 and the homework assignments for the course. I have put together specific resources on text mining, including examples, discussions, and R code for particular purposes.
Text Mining Material: This page is mandatory for anyone doing a text-mining (TM) project. Finding the right guides can save huge amounts of time and frustration.

Here is a page on web scraping. Not all text-mining projects need to scrape their own data, but it is the only way to get the latest information.  Scraping Twitter and other web sources. 

Keep moving! Don’t bog down trying to be perfect!

Aiming for a perfect data mining project leads to disaster. Instead, use incremental prototyping.

MEMORANDUM on Projects
DIRE WARNINGS BELOW- READ CAREFULLY

To: Students in Big Data Analytics BDA18
Subject: Managing a data mining project for speed and success. Avoid Perfectionism!
From: Your boss’s boss, Prof. Roger Bohn
Date: May 3, 2017; updated May 1, 2018
PDF version of this memo, June 5, 2018: Dire warning about big data projects

Introduction

Managing projects is a key life skill, and it’s something that you will never stop improving. These memos  are intended to help you manage your projects with insight, and to learn from your management experience.  They contain insights that make Data Mining projects successful overall, even though they don’t correspond to a particular formula or R function. By comparison, the weekly project assignments are intended more as step-by-step guides.

Early in this course it may seem difficult to know what projects will be feasible, and therefore to write a proposal. You don’t yet know what techniques will be taught, you don’t know how to manipulate data in R, and so forth. It will turn out that these are not big difficulties in successfully completing projects. Rather, the  big issues are:

A. Can you find, or create from the web, a large data set with interesting variables in it? The best data sets have event-level data, not aggregated data. For example, in crime data there is an entry for every reported crime. In e-commerce data there is an entry for every transaction, or every item in the catalog, or every customer. (All 3 would be ideal.) This is now easy: there are huge data sets, publicly available on numerous topics.

Is the data suitable in various ways? It must not be confidential, it must be clean or cleanable, etc. It does not have to already be in the right format, just some format that you can get your hooks on.

B. Do you have some interesting questions/issues to investigate that this data contains information about? This is limited mainly by your imagination and your search skills in Proquest and Google Scholar. Look for papers about analogous issues in other countries/industries/data sets. Search their references (backward linking), and papers that reference them (forward linking).

C. Specific mining techniques, like Random Forests versus Nearest Neighbor, are just tools, and they will not make or break your project. Not having interesting questions can break your project.

D. Once you are moving, the biggest issues for most teams are:

  1. Project-specific data mining concepts and techniques. Each project relies on a few key methods that go beyond what the course covers. Examples are geographic analysis, time series, scraping data from the web, or text analysis. Find a narrowly targeted book or web tutorial that already has R code in it. Use those methods where appropriate.
  2. Figuring out how to incorporate chunks of R code without having to write them yourselves.
  3. Managing the research project: Assembling the data, doing the analysis, writing up your results.
  4. Running out of RAM (memory) in your computer. This is actually a minor problem for almost everyone, but you will need to learn a few tricks to make it go away. Look for separate memoranda on this topic.

Skirting the Pit of Perfectionism

Here is a warning that I gave a team in week 5. This team has great data (potentially), covering multiple years with roughly 100,000 observations each year. They reported that “By our next weekly report, we hope to have merged all four years of data, and be able to produce some charts from the data.” Here is what I wrote back to them: Continue reading “Keep moving! Don’t bog down trying to be perfect!”

Two warnings as projects heat up

At this stage each year, many teams run into either or both of two problems.

A. Getting error messages in R that appear to indicate their computer is out of memory (RAM). This is annoying but almost always straightforward to fix.  At least two teams have already run into this problem in 2018, and assumed that it would be a major difficulty.

B. More subtle and harder to fix is getting bogged down somewhere and running out of time. A common place this happens is in data acquisition and cleaning. This is easy to fix “in theory,” but my sad experience is that some teams sink into the trap of “just a little longer, and we will be finished.” This stage can last for weeks!

A few sad examples

  • More than one team has spent several weeks locating, downloading, cleaning, and merging data about crime (or other topics) in multiple cities. When they started to analyze it carefully they discovered that the crime reporting systems in the cities were quite different. By then it was week 8 of the course, and they only had time for a partial analysis of one city.
  • A team had too little time to tune their models and algorithms. The result was a prediction that had too much error to be useful.
  • A team was racing to finish, and when they got their model results they did not take the time to check that they were reasonable. They submitted a report claiming a prediction error below 1 percent. That means, invariably, that there is some “time travel” in their data: one of the seemingly independent variables is actually a converted version of what they are predicting. Example: EPA fuel mileage, where fuel efficiency, oil consumption, and CO2 emissions all measure approximately the same thing.
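
A quick sanity check for the “time travel” problem in the last example is to look for predictors that track the outcome almost perfectly. The sketch below uses invented car data. The 0.95 threshold is an arbitrary assumption, and the conversion factor of roughly 8887 grams of CO2 per gallon of gasoline is approximate.

```r
set.seed(7)
n <- 200

# Made-up car data: weight is a legitimate predictor of fuel mileage
weight_lb <- runif(n, 2000, 4500)
mpg <- 45 - 0.005 * weight_lb + rnorm(n, sd = 4)

car_data <- data.frame(
  mpg = mpg,
  weight_lb = weight_lb,
  model_year = sample(2000:2018, n, replace = TRUE),
  # Leak: CO2 per mile is just mpg converted to different units
  co2_g_per_mile = 8887 / mpg
)

# Rank correlation of every predictor with the outcome.
# Spearman is used so that nonlinear conversions (like 1/mpg) are caught too.
predictors <- car_data[, setdiff(names(car_data), "mpg")]
rank_cor <- sapply(predictors, function(x) {
  abs(cor(x, car_data$mpg, method = "spearman"))
})

# Anything this close to 1 deserves a hard look before modeling
suspects <- names(rank_cor)[rank_cor > 0.95]
suspects  # "co2_g_per_mile"
```

A check like this catches only one-variable leaks. A suspiciously low prediction error still deserves a careful look at how each variable was created and when it would be known.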

What to do?

I will gradually provide notes on avoiding, or solving, both of these problems. Please take them seriously. A few hours invested now can save (literally) a week or longer later in your project.

  1. Memo: What to do if you run out of memory?  BDA18 Running out of memory v1.3  
  2. Don’t get bogged down!! Keep moving! You can go back and improve it later!
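
For the memory problem, the memo has the details. As a rough sketch of one habit that helps (my own illustration, using temporary files and made-up data): read a CSV once, save the result as an R object, and remove objects you no longer need.

```r
# Temporary file names so that this sketch does not touch your project folder
csv_path <- tempfile(fileext = ".csv")
rdata_path <- tempfile(fileext = ".Rdata")

# Create a made-up CSV file to stand in for your raw data
raw <- data.frame(id = 1:1000, value = rnorm(1000))
write.csv(raw, csv_path, row.names = FALSE)

# Step 1: read the CSV once
dat <- read.csv(csv_path)

# Step 2: save it as an R object; later sessions start from this file
save(dat, file = rdata_path)

# Step 3: free memory by removing objects you no longer need
rm(raw, dat)
invisible(gc())

# Later: load() restores the object under its original name, dat
load(rdata_path)
print(object.size(dat), units = "Kb")
```

In a real project, replace the temporary paths with file names in your project folder. The save() and load() functions do the same job as saving and opening an R object from the RStudio menus.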

 

Weekly project reports 3, 4, 5, …

What to submit each week to show progress on your project.

Each week, each team should submit a project progress report. Submit via Ted/TritonEd. The purpose of these reports is partly to help you focus on what you have accomplished and what needs to be done next. I will also scan them, and offer comments from time to time. If you have a specific question or want advice on something, also send your report to me via email #BDA18. It’s especially important to let me know by email or visit if you have become hung up by a bottleneck such as getting specific data, a technique that you have not figured out (often, a text mining question), or anything else.

Here is the “generic assignment:”

Project reports continue to be due weekly, preferably on Saturdays unless we discuss another time for your next report (such as just after a meeting during office hours). Follow the usual rules if you are a team – both people submit identical files.

The content of each report depends on your stage of activity. A general guide is on the website. It’s especially important to 1) show steady progress, 2) gather and look at real data, even if you don’t know yet how you are going to analyze it, and 3) use an approach of incremental modeling, rather than trying to create one giant analysis.

You only need to write a paragraph or two of text. Emphasize your major new insights and analyses. Attach printouts of outputs (including exploratory analysis etc.), highlighting anything that you think is noteworthy. Make the exhibits self-explanatory by including a good caption, circling key numbers, etc. Informal exhibits are ok, even handwriting.

Memos and examples about projects and project reports

BDA18 Project assignment example 4-17 

Examples of past intermediate reports.

Literature-Review_DD&ZYYL_Feb14th*JeffBender_Feb12_4thProjectReport+comments.

BDA Assign 2016-02-14_Hyerim Kim_Project 3+ comments

Added material for the #SIE-PSN project.

Sony_BDA_Preprocessing

BDA18 Sony PSN project update 4-19

Playstation #SIE-PSN information