(updated 11am June 12) Where to turn in your paper: You may slide it under the door of my office (room 1315, 3rd floor), or put it in my faculty mailbox. Also submit the PDF version on TritonEd, via the Turnitin link.
Improving the course: I think I have approximately the right mix of topics in Big Data Analytics, but there are other areas that I would like to improve. For example, how can I better smooth the workload in the projects to reduce the end-of-year crunch? Here is a BDA End-of-year questionnaire that asks a series of questions about what I should change.
Thanks for taking the class – I certainly enjoy teaching it. Enjoy graduation! Enjoy life after the university!
We are finished with formal classes, so I will post here some additional advice for your projects. Most of this is based on things I noticed either in interim reports or in the final presentations on June 6. It’s going to take a few days to write and edit all of these notes, so keep checking this page through Saturday.
Some of this advice was written with one specific project or team in mind, but all of it applies to multiple projects.
- Don’t use loops in R. Most computer languages make heavy use of FOR loops, but R can avoid almost all of them, and loop-free code runs faster and is easier to write and debug. Here is an example of how to rewrite code without a loop, taken from one of this year’s projects. In some cases, R code that avoids loops will run 100x faster (literally). BDA18 Avoid loops in R
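To illustrate the idea (this is a made-up example, not the project code): a per-row calculation written as a loop can almost always be replaced by one vectorized expression.

```r
# Hypothetical example: compute a per-row ratio in a large data frame.
df <- data.frame(revenue = runif(1e5, 100, 200),
                 cost    = runif(1e5,  50, 150))

# Loop version -- slow, verbose, and easy to get wrong:
ratio_loop <- numeric(nrow(df))
for (i in seq_len(nrow(df))) {
  ratio_loop[i] <- df$revenue[i] / df$cost[i]
}

# Vectorized version -- one line, and much faster:
ratio_vec <- df$revenue / df$cost

all.equal(ratio_loop, ratio_vec)  # TRUE
```

The vectorized line operates on whole columns at once, which is where R's speed comes from.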
- Don’t use CSV data when working with large datasets. CSV (comma-separated value) files have become a lingua franca for exchanging data among different computer languages and environments. For that purpose, they are decent. But they are very inefficient in terms of both speed and memory. One team mentioned that they were running into memory limits, and the problem was most likely due to their keeping CSV files around! Solution: use read.csv to get CSV data into R, but after that, store your data as R objects (dataframes or some other kind). If you want to store intermediate results in a file, create the object inside R, then use RStudio to save it as an R object (the file name ends in .Rdata). When you want it again, use RStudio’s File/Open File command to load it. No additional conversion will be needed.
I will add something about this to the notes on Handling Big Data. Dealing with the “big” in Big Data.
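In code, that workflow looks roughly like this. The sketch below is self-contained (it round-trips through a temporary file); in your project you would use your real file names and do the read.csv step only once.

```r
# Hypothetical demo: read a CSV once, then keep the data as a
# native .Rdata file from then on.
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:5, y = letters[1:5]),
          csv_path, row.names = FALSE)

mydata <- read.csv(csv_path)      # slow text parsing: do this once

rdata_path <- tempfile(fileext = ".Rdata")
save(mydata, file = rdata_path)   # compact binary format

rm(mydata)
load(rdata_path)                  # restores 'mydata' with no conversion
```

Loading the .Rdata file restores the data frame exactly as you saved it, with column types intact.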
- Fix unbalanced accuracy in confusion matrices. Yesterday, I noticed several confusion matrices with much higher accuracy for one class than the other. Reminder: we spent 1.5 classes on this topic, and there are multiple solutions. The imbalance is usually due to having much more data of one type than the other. See:
- Week 8 assignments and notes
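One of the standard fixes, sketched here with simulated data (not any project's dataset), is to downsample the majority class so the two classes are balanced before fitting:

```r
# Hypothetical sketch: rebalance a binary outcome by downsampling
# the majority class, then fit a logistic model.
set.seed(1)
n <- 1000
dat <- data.frame(x = rnorm(n))
dat$y <- rbinom(n, 1, plogis(-2 + dat$x))   # roughly 15% positives

pos <- dat[dat$y == 1, ]
neg <- dat[dat$y == 0, ]
neg_sample <- neg[sample(nrow(neg), nrow(pos)), ]  # match the positives
balanced <- rbind(pos, neg_sample)                 # now 50/50

model <- glm(y ~ x, data = balanced, family = binomial)
```

With balanced training data, a 0.5 cutoff no longer favors the majority class; adjusting the probability cutoff or weighting the observations are alternative fixes covered in the Week 8 material.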
- Good graphics. Because graphics are a concise way to communicate, even with non-specialists, I recommend having at least one superb image in every report. (See BDA18 Writing your final report.) I will try to post some examples from your projects on Friday, with comments on how to make them even better.
- Revised discussion on how to write a good final report. Writing your final report June 8
What to submit each week to show progress on your project.
Each week, each team should submit a project progress report via Ted/TritonEd. Its purpose is partly to help you focus on what you have accomplished and what needs to be done next. I will also scan the reports and offer comments from time to time. If you have a specific question, or want advice, also send your report to me via email (#BDA18). It’s especially important to let me know, by email or a visit, if you have become hung up on a bottleneck such as getting specific data, a technique that you have not figured out (often a text-mining question), or anything else.
Here is the “generic assignment:”
Project reports continue to be due weekly, preferably on Saturdays unless we discuss another time for your next report (such as just after a meeting during office hours). Follow the usual rules if you are a team – both people submit identical files.
The content of each report depends on your stage of activity. A general guide is on the website. It’s especially important to 1) show steady progress; 2) gather and look at real data, even if you don’t know yet how you are going to analyze it; and 3) model incrementally, rather than trying to create one giant analysis.
You only need to write a paragraph or two of text. Emphasize your major new insights and analyses. Attach printouts of outputs (including exploratory analysis, etc.), highlighting anything that you think is noteworthy. Make the exhibits self-explanatory by including a good caption, circling key numbers, etc. Informal exhibits are OK, even handwriting.
Memos and examples about projects and project reports
BDA18 Project assignment example 4-17
Examples of past intermediate reports.
BDA Assign 2016-02-14_Hyerim Kim_Project 3+ comments
Added material for the #SIE-PSN project.
BDA18 Sony PSN project update 4-19
Playstation #SIE-PSN information
This page has notes and advice for the classes of April 16 and 18, 2018. Now includes lecture notes from both days.
- Done in class on Wednesday: a logistic model (categorical linear model) of airline delays. Here is a fairly complete log of what I did on my own before our actual class. This should be all you need to do the eBay work mechanically. You still need to think about what the results tell you.
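For reference, here is a minimal sketch of fitting a logistic model with base R's glm. The data frame and column names below are simulated stand-ins, not the actual airline or eBay files.

```r
# Illustrative only: simulated stand-in for airline-delay data.
set.seed(42)
flights <- data.frame(distance = runif(500, 100, 2500),
                      dep_hour = sample(5:22, 500, replace = TRUE))
flights$delayed <- rbinom(500, 1,
                          plogis(-3 + 0.35 * flights$dep_hour))

# Logistic (binomial) linear model, then a confusion matrix:
fit <- glm(delayed ~ distance + dep_hour, data = flights,
           family = binomial)
pred <- ifelse(predict(fit, type = "response") > 0.5, 1, 0)
table(actual = flights$delayed, predicted = pred)
```

The mechanics are two lines (glm, then predict); the real work is interpreting the coefficients and the confusion matrix.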
- The main homework that was due the night of April 17 is moved to Friday afternoon, just like last week. However, please at least load the eBay data and run some histograms and other exploratory analysis before class.
- Submitting homework on TritonEd. Several students had trouble with Ted/TritonEd. Feiyang suggests changing your browser (e.g., try Chrome).
- My personal experience/advice: although I use Safari most of the time, about 5% of the sites I try to work with cannot handle it. After a few tries on such a site, I switch browsers.
- Two visitors from Sony will talk about their work on Wednesday, May 18. Changed to the following week, approximately April 25.
- Week of April 16. Linear models. 80% of the class has studied “logistic regression” in a statistics or economics class. We will be doing something closely related, and I will rely on people being familiar with several of the basic ideas. If you are unfamiliar with logistic regression, study the textbook quite carefully.
- IMPORTANT FOR HOMEWORK: The Rattle book/manual does not discuss how to do linear models.
- RStudio and Rattle are now on UCSD computers near student affairs. (4 Windows machines, each with 16 GB of RAM. First come, first served.) They are also on the UCSD virtual machines. I tested the Windows machines myself. BUT it’s not clear exactly what is needed to download the libraries for either method. So don’t try using them yet.
So far, only machines 001 and 005 have been verified to work; 002 does not work.
These machines have some limitations, but they are clearly fine for homework. We are testing to find out what their limits are.
On Thursday our TA had a session on installing Rattle on the Mac. She may repeat it on Friday. Send her an email if you are interested.
Please refer to this page for instructions. Installing Rattle on Mac
Data sources pages:
Data sets from Google and Kaggle. https://www.kaggle.com/datasets
A page of useful links
Data sources and project ideas related to pollution.
Projects: easily available data sets
Five strategies for locating interesting data sets. (From Dataquest)
Some data projects that encourage other people to use the data they collected.
Past student papers:
Job Hunting Opportunity
In 2 weeks, the Jacobs School of Engineering is running a day with lots of employers visiting. Student passes are $10, although you can probably sneak in if you want to. http://jacobsschool.ucsd.edu/re/
The only firm requirement for taking Big Data Analytics is knowledge of statistics through linear regression. Knowledge of the R computer language is not required or even expected.
The workload is severe. There are weekly problem sets that have to be run on your computer. Most important, the course requires an ambitious final project, which you define for yourself. Graduate students learn best from each other, and almost all work can be done in pairs (teams of 2).
I do recommend a short introduction to R, via Coursera – see the page what to do in the first week. The TA will be leading sessions specifically about R. But BDA is not a programming course, and unless you make extra effort, you won’t become proficient in R as a general programming language. You will learn to use R code that other people have written, and glue it together for your own needs.
Other useful background:
- Knowledge of probability, such as decision trees.
- Some empirical area that you are informed about and would like to explore. It does not have to be quantitative, but you must be able to find real data about it.
- Knowledge of regression beyond straight linear regression, including binary outcomes, discrete variables, panel data, etc.
- Experience getting your hands dirty with data. Real data does not arrive in neatly formatted matrices. It has missing values, ambiguous definitions, outright errors, internally conflicting numbers, and so forth.
- Some programming experience, in any language. Programming is a mental discipline, because the computer hates you and wants you to mess up. If you have never programmed before, this can be frustrating.
- Nobody will have all of these pieces (except possibly a few PhD students who occasionally take the course). You will have lots of opportunities to learn about them in the course.
All this will be discussed further in the first meeting.
Projects are a central part of BDA. Each of you can choose your own topic and data set, subject to my approval. This year, we are arranging some potential projects with local organizations. I hope someone (or a team of 2) will sign up for each of them, since they have real clients, and often have potential to lead to jobs.
Sony Playstation Network is an important part of Sony Interactive Entertainment, a profitable division of the parent company. It includes a very active web portal, which generates a lot of detailed data.
A small company near San Diego is developing technology that uses muons to detect smuggled plant matter and other illegal substances. Making the product commercially successful will depend on, among other things, how quickly and accurately it can screen. An installation is currently running and generating lots of data. Better analytics have direct value to the company.
Russian Twitter disinformatzia posts
NBC has a nice database of Russian disinformation tweets. If you are interested in Twitter or politics, they might provide the data for a good course project. Twitter deleted 200,000 Russian troll tweets. Read them here. Twitter makes it hard to find them, but NBC got some help putting together this data set.
By the way, disinformatzia is quite old. Here is a 1995 article in the NYTimes.