I have created a new list of resources for specific project types, such as spatial analysis and Twitter analysis. It is on the Latest Handouts page under the heading Special topics for individual papers.
Summary of the last 2 weeks of the course:
- Only nominal homework – readings and one figure.
- Work on projects. Ask for help if desired. No more interim reports are due.
- R Certification: If you want R certification for the course, take a one-hour quiz and meet some other requirements.
- Make an in-class presentation: two-person teams only.
- Final paper due
Wednesday, May 30. Handling unbalanced data, and other useful techniques.
Reading: Chapter 5.5, also 5.3 and 5.4. These were assigned previously.
Nothing to be turned in
Saturday, June 2: No progress report is due.
Monday, June 4: A/B Testing and other emerging topics in Big Data
- Look up specific techniques for your project: spatial data and GIS, text processing, crime, graphics, or Twitter. One or more applies to every project. See Special topics for individual papers.
- Turn in: One careful plot from your project. Hard copy, with comments on it by hand. Format the plot carefully and clearly including scales, colors, definitions, etc. Please turn these in by hand in class. This is to encourage hand-writing of comments. Circle and explain at least one interesting/important feature of your plot.
- Include a caption. Captions in scientific papers are sometimes several sentences long.
- The goal of the assignment is to help you focus intensively on one result of your project, and how to explain it visually. It does not have to be a data-mining result.
- Reading: “The A/B Test: Inside the Technology That’s Changing the Rules of Business,” Wired Magazine, 04.25.12. https://www.wired.com/2012/04/ff_abtesting/all/
- Visit an e-commerce website and think about how to improve it using A/B testing.
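To make the A/B testing idea concrete, here is a minimal sketch of the arithmetic behind comparing two versions of a page. It is written in Python purely for illustration and uses a standard two-proportion z-test; the visitor and conversion counts are made up, and the function name is hypothetical. In R, the built-in prop.test runs a closely related test.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates between A and B.

    Returns (rate_b - rate_a, z statistic, two-sided p-value).
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that A and B convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Standard normal CDF computed from the error function.
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    p_value = 2 * (1 - cdf)
    return p_b - p_a, z, p_value


# Made-up example: 10,000 visitors see each version of a checkout page.
diff, z, p = two_proportion_z_test(conv_a=200, n_a=10_000, conv_b=250, n_b=10_000)
print(f"difference = {diff:.4f}, z = {z:.2f}, p = {p:.4f}")
```

When you look at an e-commerce site, the same calculation suggests how many visitors a proposed change would need before a small difference in conversion rate becomes distinguishable from noise.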
Wednesday, June 6: All two-person project teams will give 5 to 7 minute presentations. The goal is to fascinate, impress, and surprise your audience. Think of this as the “elevator pitch” for your project.
Friday, June 8, 1pm or other times as agreed: Quiz for R Certification. The quiz emphasizes data manipulation in R: selecting data subsets, creating new variables, and rearranging and redefining data such as event logs. The other requirements for R certificates are completing your project using appropriate R programming, and attending 50% of TA tutorials.
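As a rough illustration of the three kinds of manipulation named above (not the quiz itself), here is a small sketch on a made-up event log. It is in Python for illustration only; the data and field names are hypothetical. In R, the same steps correspond roughly to filtering rows, adding a column, and aggregating by group.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical event log: one row per event (long format).
events = [
    {"user": "u1", "time": "2018-05-01 09:00", "event": "login"},
    {"user": "u1", "time": "2018-05-01 09:12", "event": "purchase"},
    {"user": "u2", "time": "2018-05-01 10:30", "event": "login"},
    {"user": "u1", "time": "2018-05-02 08:45", "event": "login"},
    {"user": "u2", "time": "2018-05-03 11:05", "event": "purchase"},
]

# 1) Create a new variable: parse the timestamp and derive the day of week.
for row in events:
    ts = datetime.strptime(row["time"], "%Y-%m-%d %H:%M")
    row["weekday"] = ts.weekday()  # Monday = 0, Sunday = 6

# 2) Select a subset: keep only the purchase events.
purchases = [row for row in events if row["event"] == "purchase"]

# 3) Rearrange: collapse the log to one row per user, with a count per event type.
counts = defaultdict(lambda: defaultdict(int))
for row in events:
    counts[row["user"]][row["event"]] += 1

per_user = [
    {"user": user, "logins": c["login"], "purchases": c["purchase"]}
    for user, c in sorted(counts.items())
]

for row in per_user:
    print(row)
```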
Friday, June 8 midnight: Formal due date for final project papers.
Any project team that requests one will receive an automatic extension until Wednesday.
Submit both hard copy and PDF files. Submit via Turnitin, on TritonEd.
June 11. Wednesday, June 13. Deadline for projects.
Resources on data manipulation, for May 23. Some are elsewhere on this site.
- BohnDAgram on Data plumbing
- RStudio cheat sheets for wrangling. https://www.rstudio.com/resources/cheatsheets/
- Base R cheat sheets (these should be old)
A few hours ago I received an email from Emily about the farm advertisement problem due on May 11. I wrote her back, but her question raises general issues for many projects. I’m sure other students had similar problems with Friday’s homework.
In response, I just wrote
BDA18 Memo =My program runs too slowly v 1.1.
New version! BDA18 My program =slow v 1.2. The memo includes most of my specific suggestions about the May 11 homework. (This may be too late for some students, but most of the ideas were also discussed briefly in class on Monday or Wednesday.)
There are probably multiple typos and errors in the memo. Please send me corrections by email, for class credit.
I am working on the assignment due tomorrow and have encountered a problem. When reducing the TF-IDF matrix to 20 concepts, RStudio always stops working (as indicated by the little ‘Stop’ sign in the console). I’m thinking this is because the farm-ads.csv dataset is too large. Without reducing the concepts, I am unable to move forward with the random forest part of the assignment. I am wondering if there is a solution to this problem or a way to work around it.
Apologies in advance for not approaching you with this question earlier. It’s been a very hectic week!
Thanks for your help,
By the way, I’m 98% sure that in fact RStudio did not “stop working.” It was probably still cranking away. Check the Activity Monitor application on your computer to be sure.
The ostensible material this week is Random Forests. They are a generalization of Classification/Regression Trees. No assignment for Monday; everything will be done in class. Redo the class material and hand it in on Wednesday (not Friday – we will go over the results in class.)
Here is the assignment for Wednesday, including the readings on Random Forests. BDA18 Random Forest Assign May 2, 2018.
Here is the material used in class. BDA18 Random Forests2018B
The other agenda this week is to develop your skills in debugging. This is the key skill of writing code: figuring out what is wrong and how to fix it.
For Week 6, text mining. Here is the assignment for next Monday BDA18 Text mining #1 assign. (Short assignment handed in Sunday night.)
What to submit each week to show progress on your project.
Each week, each team should submit a project progress report. Submit via Ted/TritonEd. The purpose of these reports is partly to help you focus on what you have accomplished and what needs to be done next. I will also scan them and offer comments from time to time. If you have a specific question or want advice, also send your report to me via email #BDA18. It’s especially important to let me know, by email or in person, if you are stuck on a bottleneck such as getting specific data, a technique that you have not figured out (often, a text mining question), or anything else.
Here is the “generic assignment:”
Project reports continue to be due weekly, preferably on Saturdays unless we discuss another time for your next report (such as just after a meeting during office hours). Follow the usual rules if you are a team – both people submit identical files.
The content of each report depends on your stage of activity. A general guide is on the website. It’s especially important to 1) show steady progress, 2) gather and look at real data, even if you don’t know yet how you are going to analyze it, and 3) use an approach of incremental modeling, rather than trying to create one giant analysis.
You only need to write a paragraph or two of text. Emphasize your major new insights and analyses. Attach printouts of outputs (including exploratory analysis etc.), highlighting anything that you think is noteworthy. Make the exhibits self-explanatory by including a good caption, circling key numbers, etc. Informal exhibits are ok, even handwriting.
Memos and examples about projects and project reports
BDA18 Project assignment example 4-17
Examples of past intermediate reports.
BDA Assign 2016-02-14_Hyerim Kim_Project 3+ comments
Added material for the #SIE-PSN project.
BDA18 Sony PSN project update 4-19
Playstation #SIE-PSN information
This week we have 3 learning goals. It will take the entire week to do them.
- Linear regression for prediction. How it differs from hypothesis testing.
- Showing how to use R instead of, or in conjunction with, Rattle.
- Many specific tricks and issues that come up with linear regression, such as word equations and creating interaction variables.
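The first and third goals can be sketched together. When regression is used for prediction, a model is judged by its error on held-out data rather than by p-values on its coefficients. The example below is a minimal sketch in Python (for illustration only; in R you would normally use lm). It fits a regression by ordinary least squares with and without an interaction variable and compares hold-out error. The data are simulated, and the variable names (price, promo, sales) and the coefficients of the "true" word equation are assumptions made up for the example.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


def rmse(rows, y, coef):
    """Root mean squared prediction error of the fitted coefficients."""
    errs = [(sum(c * v for c, v in zip(coef, r)) - yi) ** 2 for r, yi in zip(rows, y)]
    return (sum(errs) / len(errs)) ** 0.5


def design(price, promo, with_interaction):
    """Build one row of predictors: intercept, price, promo, optionally price * promo."""
    row = [1.0, price, promo]
    if with_interaction:
        row.append(price * promo)
    return row


# Simulated data. Assumed word equation:
# sales = 50 - 3 * price + 5 * promo + 2 * price * promo + noise
random.seed(1)
data = []
for _ in range(200):
    price = random.uniform(1, 10)
    promo = float(random.random() < 0.5)
    sales = 50 - 3 * price + 5 * promo + 2 * price * promo + random.gauss(0, 1)
    data.append((price, promo, sales))

# Hold out the last 50 rows to measure prediction error.
train, test = data[:150], data[150:]

holdout_rmse = {}
for with_interaction in (False, True):
    x_train = [design(p, d, with_interaction) for p, d, _ in train]
    x_test = [design(p, d, with_interaction) for p, d, _ in test]
    coef = fit_ols(x_train, [s for _, _, s in train])
    holdout_rmse[with_interaction] = rmse(x_test, [s for _, _, s in test], coef)

print("hold-out RMSE without interaction:", round(holdout_rmse[False], 2))
print("hold-out RMSE with interaction:", round(holdout_rmse[True], 2))
```

Because the simulated promotion changes the slope on price, the model with the interaction variable should predict the held-out rows noticeably better than the model without it.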
Please see the attached document, which includes the readings, specific homework due Friday, and supplemental information about various useful ideas and techniques.
BDA18 Week 4 Readings + assign
You can get data files here: https://bda2020.wordpress.com/data-sets/
There is nothing due on Monday.