(updated 11am June 12) Where to turn in your paper: You may put it under the door of my office (room 1315, 3rd floor), or in my faculty mailbox. Submit the PDF version on TritonEd, in the TurnitIn link.
Improving the course: I think I have approximately the right mix of topics in Big Data Analytics, but there are other areas that I would like to improve. For example, how can I better smooth the workload in the projects, to reduce the end-of-year crunch? Here is a BDA End-of-year questionnaire that asks a series of questions about what I should change.
Thanks for taking the class – I certainly enjoy teaching it. Enjoy graduation! Enjoy life after the university!
We are finished with formal classes, so I will post here some additional advice for your projects. Most of this is based on things I noticed either in interim reports or in the final presentations on June 6. It’s going to take a few days to write and edit all of these notes, so keep checking this page through Saturday.
Some of this advice was written with one specific project or team in mind, but all of it applies to multiple projects.
- Don’t use loops in R. Most computer languages make heavy use of FOR loops. Idiomatic R avoids almost all of them by operating on whole vectors and data frames at once, and the resulting code runs faster and is easier to write and debug. In some cases, R code that avoids loops will run 100x faster (literally). Here is an example, taken from one of this year’s projects, of how to rewrite code so that it does not need a loop: BDA18 Avoid loops in R
- Don’t use CSV data when working with large datasets. CSV (comma-separated values) files have become a lingua franca for exchanging data among different computer languages and environments. For that purpose, they are decent. But they are very inefficient, in both speed and memory use. One team mentioned that they were running into memory limits, but the problem was most likely due to their keeping CSV files around! Solution: Use read.csv once to get the CSV data into R. After that, store your data as R objects (data frames or some other kind). If you want to store intermediate results in a file, create the object inside R, then use RStudio to save it as an R object (the file name ends in .Rdata). When you want it again, use the RStudio File/Open File command to load it. No additional conversion will be needed.
I will add something about this to the notes on Handling Big Data: Dealing with the “big” in Big Data.
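A minimal sketch of the CSV-to-R-object workflow described above, done from the console instead of the RStudio menus. The data frame `trips`, its columns, and the temporary file paths are hypothetical and exist only for illustration; `read.csv`, `save`, and `load` are base R functions.

```r
# Hypothetical example data; in a real project this would be your large CSV file.
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:5, fare = c(7.5, 12, 3.25, 9, 20)),
          csv_path, row.names = FALSE)

# Step 1: read the CSV once.
trips <- read.csv(csv_path)

# Step 2: store the data frame as an R object in an .Rdata file.
rdata_path <- tempfile(fileext = ".Rdata")
save(trips, file = rdata_path)

# Step 3: later, restore the object under the same name, with no parsing or conversion.
rm(trips)
load(rdata_path)
str(trips)
```

If you only need to store a single object and want to choose its name when you read it back, `saveRDS()` and `readRDS()` are a reasonable alternative.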
- Fix unbalanced accuracy in confusion matrices. Yesterday, I noticed several confusion matrices with much higher accuracy for one class than for the other. Reminder: We spent 1.5 classes on this topic, and there are multiple solutions. The imbalance is usually due to having much more data of one type than the other. See:
- Week 8 assignments and notes
- Good graphics. Because graphics are a concise way to communicate, even with non-specialists, I recommend having at least one superb image in every report. (See BDA18 Writing your final report.) On Friday, I will try to post some examples from your projects, with comments on how to make them even better.
- Revised discussion on how to write a good final report. Writing your final report June 8
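To illustrate the advice in this list about avoiding loops in R, here is a small sketch. It is not the example from the class projects; the vectors `price` and `qty` are made up, and the point is only to show the same calculation written with a loop and then as a single vectorized expression.

```r
set.seed(42)
n <- 100000
price <- runif(n, min = 1, max = 100)      # hypothetical data
qty   <- sample(1:10, n, replace = TRUE)   # hypothetical data

# Loop version: fills the result one element at a time.
revenue_loop <- numeric(n)
for (i in seq_len(n)) {
  revenue_loop[i] <- price[i] * qty[i]
}

# Vectorized version: one expression operates on the whole vectors.
revenue_vec <- price * qty

# Conditional logic without a loop: ifelse() works element by element.
size_label <- ifelse(qty >= 5, "large", "small")

# Check that the two versions agree.
all.equal(revenue_loop, revenue_vec)
```

Wrapping each version in `system.time()` lets you compare the speeds on your own machine; the size of the speedup depends heavily on what the loop body does.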
The goal of your presentation is to astound and interest your classmates. (Educating them is nice, also.) So think in terms of an “elevator pitch” for your research, aimed at someone who has zero idea what you have done, but does know about data mining. Show a few viewgraphs.
I suggest being light on prose. Instead, show samples of the world you were investigating (e.g. real tweets), nice infographics about the problem or what you found, etc. This is practice for talking with an outside audience, and NOT for talking to academics.
Sign up for the presentation sequence: first come, first served, at https://doodle.com/poll/qyqwmfnaaf2ibaxw The times in this Doodle poll are wrong! Expect 5 minutes, so present 1 or 2 cool viewgraphs only.
Don’t forget guidance on final reports. It’s at https://bda2020.files.wordpress.com/2018/04/bda18-writing-your-final-report.pdf
It includes a rubric, checklist, etc. I am editing this document to make it more readable.
You can text me to ask about irregular office hours. Tonight (Monday) IFF anyone asks, and other times by arrangement. Be sure to tell me who you are. +1 858 381-2015
I have created a new list of resources for specific project types, such as spatial analysis and Twitter analysis. It is on the Latest Handouts page, under the heading Special topics for individual papers.
Summary of the last 2 weeks of the course:
- Only nominal homework – readings and one figure.
- Work on projects. Ask for help if desired. No more interim reports are due.
- R Certification: If you want R certification for the course, take a one-hour quiz and meet some other requirements.
- Make an in-class presentation: two-person teams only.
- Final paper due
Wednesday, May 30. Handling unbalanced data, and other useful techniques.
Reading: Chapter 5.5, also 5.3 and 5.4. These were assigned previously.
Nothing to be turned in
Saturday, June 2: No progress report is due.
Monday, June 4: A/B Testing and other emerging topics in Big Data
- Look up specific techniques for your project. Spatial data and GIS, Text processing, Crime, Graphics, or Twitter. One or more applies to every project. Special topics for individual papers.
- Turn in: One careful plot from your project. Hard copy, with comments on it by hand. Format the plot carefully and clearly including scales, colors, definitions, etc. Please turn these in by hand in class. This is to encourage hand-writing of comments. Circle and explain at least one interesting/important feature of your plot.
- Include a caption. Captions in scientific papers are sometimes several sentences long.
- The goal of the assignment is to help you focus intensively on one result of your project, and how to explain it visually. It does not have to be a data-mining result.
- Reading: “The A/B Test: Inside the Technology That’s Changing the Rules of Business,” Wired Magazine, 04.25.12. https://www.wired.com/2012/04/ff_abtesting/all/
- Visit an e-commerce website and think about how to improve it using A/B testing.
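As a rough sketch of how the results of an A/B test like the ones in the Wired reading might be checked in R: the visitor and purchase counts below are invented for illustration and do not come from the article. The example uses `prop.test` from base R's stats package to compare two conversion rates.

```r
# Hypothetical results: visitors shown each page version, and how many bought.
visitors    <- c(A = 2000, B = 2000)
conversions <- c(A = 120,  B = 150)

# Observed conversion rate for each version.
rates <- conversions / visitors
rates

# Two-sample test of equal proportions.
ab_test <- prop.test(x = conversions, n = visitors)
ab_test$p.value    # small values suggest the difference is not just chance
ab_test$conf.int   # confidence interval for rate(A) minus rate(B)
```

In practice the sample size and the stopping rule should be decided before the test starts; checking the p-value repeatedly while data arrives inflates the false-positive rate.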
Wednesday, June 6: All two-person project teams will give 5 to 7 minute presentations. The goal is to fascinate, impress, and surprise your audience. Think of this as the “elevator pitch” for your project.
Friday, June 8, 1pm or other times as agreed: Quiz for R Certification. The quiz emphasizes data manipulation in R: selecting data subsets, creating new variables, and rearranging and redefining data such as event logs. The other requirements for R certificates are completing your project using appropriate R programming, and attending 50% of TA tutorials.
Friday, June 8 midnight: Formal due date for final project papers.
All project teams that request one receive an automatic extension until Wednesday.
Submit both hard copy and PDF files. Submit via Turnitin, on TritonEd.
Wednesday, June 13 (previously June 11): Deadline for projects.
Criticizing an article that appeared on Data Science Central, about logistic regression.
I recently came across a Twitter discussion of an article on a site called Data Science Central. The article was Why Logistic Regression should be the last thing you learn when becoming a Data Scientist. [TL;DR Don’t believe the headline!]
The article purports to explain that logistic regression is a bad technique, and nobody should use it. The article is nonsense. I critiqued it in the comments, but I’m not sure the editor will allow my comment to stand. Data Science Central appears to be a one-man site, with 90% of the material written by Vincent Granville, and it’s hard not to conclude that he made a serious mistake in writing his attack on logistic regression.
So here is my response to his article. For my students – if you read something about Data Analytics that does not make sense to you, or contradicts something you have been taught, be suspicious. You can see some of the Twitter criticism here.
I am sorry to report that this article is nonsense. The problem is not the conclusion: use logistic regression or don’t use it, there are now many alternatives. (In the machine learning world, logistic regression is a “linear classifier.”)
The difficulty is that most of the discussion is Just Wrong. Analytically incorrect. No correspondence to the usual definitions, use, and interpretation of logistic regression.
- The diagram is incomprehensible. If it is intended to be the standard representation of logistic regression, it has multiple errors.
- LR maps from -infinity to +infinity (on the X scale), not from 0 to 1.
- The y axis is correct.
- The colors and the points show the curve (called the logistic curve or similar) as the boundary between positive and negative outcomes, for points defined by two independent variables (shown as x and y). That is not at all what the curve means. See e.g. https://en.wikipedia.org/wiki/File:Logistic-curve.svg
- “There are hundreds of types of logistic regression.” Maybe in a world with a different definition, but the standard definition does not include Poisson models. Of course, as always, there are a variety of possible algorithms that can be used to solve a logistic model.
- From https://www.medcalc.org/manual/logistic_regression.php “Logistic regression is a statistical method for analyzing a dataset in which there are one or more independent variables that determine an outcome. The outcome is measured with a dichotomous variable (in which there are only two possible outcomes).
In logistic regression, the dependent variable is binary or dichotomous, i.e. it only contains data coded as 1 (TRUE, success, pregnant, etc.) or 0 (FALSE, failure, non-pregnant, etc.).”
- “If you transform your variable you can instead use linear regression.” Yes, and that is how logistic regressions are usually solved! That is, LRs are solved by transforming the variables (using a logit transform) and solving the resulting equation, which is linear in the variables. In practice, many other transformation equations can be used instead, but the logit transform has a nice interpretation.
- “Coefficients are not easy to interpret.” I suppose that easy is in the eye of the beholder, but there is a standard and straightforward interpretation.
- “The logistic regression coefficients show the change in the predicted logged odds of having the characteristic of interest for a one-unit change in the independent variables.” It does take a few examples to figure out what “log odds” means, unless you do a lot of horse racing. But after that, it is a clever and powerful way to think about changes in the probability of an outcome.
- The (corrected) version of the logistic curve corresponds to an equivalent way to interpret the coefficient values.
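To make the standard interpretation concrete, here is a small sketch using R's built-in `mtcars` data, chosen here only for illustration (it is not from the article or the comment thread). It fits a logistic regression with `glm`, reads the coefficient as a change in log odds, and uses `plogis` to show that the logistic curve takes any x from -infinity to +infinity and returns a probability between 0 and 1.

```r
# Outcome: am (1 = manual transmission, 0 = automatic).
# Predictor: wt (car weight in units of 1000 lbs).
fit <- glm(am ~ wt, data = mtcars, family = binomial)

# Coefficients are on the log-odds scale: each extra unit of wt
# changes the log odds of am = 1 by this amount.
log_odds_change <- coef(fit)[["wt"]]

# Exponentiating gives an odds ratio, which is often easier to explain.
odds_ratio <- exp(log_odds_change)

# The logistic curve maps any real number to a probability in (0, 1).
x <- c(-10, -1, 0, 1, 10)
p <- plogis(x)

# Predicted probabilities of a manual transmission for a few car weights.
new_cars <- data.frame(wt = c(2, 3, 4))
pred <- predict(fit, newdata = new_cars, type = "response")

round(c(log_odds_change = log_odds_change, odds_ratio = odds_ratio), 3)
round(pred, 3)
```

An odds ratio below 1 means the odds of the outcome fall as the predictor increases, which is the "change in the predicted logged odds" reading quoted above, expressed on the odds scale.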
There certainly are some mild criticisms of logistic regression, but in situations where a linear model is reasonably accurate, it is a good quick model to try. Of course, if the situation is highly nonlinear, a tree model is going to be better. Furthermore, the particular logistic equation generally used should not be considered sacred.
My interpretation is that this article is an attack on a straw man, an undefined and radically unconventional model that is here being called “logistic regression.” It would be a shame if anyone took it seriously. We will see if the author/site manager leaves this comment up. If he does, I invite him to respond and explain the meaning of his diagram.
By the way, I agree with much of the discussion on the medcalc web site I’m quoting, but not all of it.
Some are elsewhere on this site. for May 23.
- BohnDAgram on Data plumbing
- Rstudio cheat sheets for wrangling. https://www.rstudio.com/resources/cheatsheets/
- Base R cheat sheets (these should be old)
Wednesday May 23
Assignment – Oversampling, feature engineering
- Read main textbook section 5.5 on oversampling. The credit card fraud case was an example where they used extreme oversampling of the fraud cases, because there were so few in the original data. Most real classification problems have uneven numbers of each case or uneven costs of errors. Therefore, oversampling is frequently important. Without it, or some other way to accomplish the same thing, you can get a model that is accurate but useless, such as one that predicts that all cases are the most common outcome.
- We will continue discussing and practicing feature engineering.
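A minimal sketch of the oversampling idea, using simulated data instead of the credit card case. The data frame `d`, its column names, and the 2% fraud rate are all invented for illustration. It shows why overall accuracy can mislead when one class is rare, and one simple way to oversample the rare class with base R.

```r
set.seed(123)
n <- 5000
# Simulated, hypothetical data: about 2% of cases are "fraud".
d <- data.frame(amount = rexp(n, rate = 1 / 50),
                fraud  = rbinom(n, size = 1, prob = 0.02))

# A "model" that always predicts the most common outcome.
pred_all_zero <- rep(0, n)
accuracy    <- mean(pred_all_zero == d$fraud)          # high
sensitivity <- mean(pred_all_zero[d$fraud == 1] == 1)  # 0: no fraud is caught

# Oversample the rare class (with replacement) until the classes are balanced.
rare     <- d[d$fraud == 1, ]
common   <- d[d$fraud == 0, ]
rare_up  <- rare[sample(nrow(rare), nrow(common), replace = TRUE), ]
balanced <- rbind(common, rare_up)

table(original = d$fraud)
table(balanced = balanced$fraud)
```

A common practice is to oversample only the training data and to evaluate on an untouched validation set, so that the reported confusion matrix reflects the real class proportions.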
Assignment: Email me 5 to 10 proposed new variables for the credit card fraud case. The actual case is posted above, and in my lecture notes. You can see what variables the authors decided to use. Devise some additional/better ones. Can they be calculated from the available 44 million transaction records?
- Notice that you do not have to know what effect a variable will have in order to decide that it’s a potentially good feature. A LASSO or Random Forest can sort out whether it is useful, or not.
- Baseball homework from last week – the Hitters baseball data had a number of variables, but some of them seem strange. For example, “lifetime hits” just rewards players who have been playing for a long time. We also have a variable that directly measures how long they have been playing. So it is not surprising if LASSO decides that only one of the time-dependent variables is important. What changes could make a more useful measurement?
We will discuss this on Wednesday.
If you are not familiar with baseball, read about measurements of “batting performance,” i.e. how hitters are traditionally measured. For example, a key statistic is batting average. Note that many key statistics are ratios. Looking back at your homework from last week, what would have been good additional variables beyond what was in the original data?
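A sketch of the kind of ratio variables this suggests. The column names follow the Hitters data (AtBat, Hits, Years, CAtBat, CHits), but the three rows of values are made up, and the new variable names are only suggestions.

```r
# Made-up rows using Hitters-style column names.
hitters <- data.frame(
  AtBat  = c(400, 550, 300),
  Hits   = c(100, 165, 81),
  Years  = c(2, 10, 15),
  CAtBat = c(700, 5000, 6000),
  CHits  = c(180, 1400, 1500)
)

# Ratios separate skill from length of career.
hitters$BatAvg       <- hitters$Hits / hitters$AtBat     # this season's batting average
hitters$CareerBatAvg <- hitters$CHits / hitters$CAtBat   # career batting average
hitters$HitsPerYear  <- hitters$CHits / hitters$Years    # career hits per season played

hitters[, c("Years", "BatAvg", "CareerBatAvg", "HitsPerYear")]
```

Because the new columns are computed for all rows at once, the same three lines work unchanged on the full data set.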
- For messing around. Google is sponsoring an open-source project for quick data exploration. This one seems to be able to display about 5 dimensions of data at once, by using:
Positions on x, y axes
Facets on x, y axes (especially for categorical variables)
I may give a demo of it on Wednesday. It seems to be only for exploration, but exploration is well worth one to several hours of project time.
- Monday handouts: feature engineering
Cheat sheet on performance measurements: Sensitivity, Specificity, and many other measures. contingency table defns SAVE Keep this for reference.
Handout: some key plots for exploring data. Specifically for Hitters data. R Graphics Cookbook – Excerpt
Lecture notes: The medical testing paradox and how to solve it. BDA18 feature engineering case study In the same lecture is “Feature engineering for detecting credit card fraud”.
An abbreviated version of the paper, which is all you need for this week. “Data mining for credit card fraud: A comparative study” by Siddhartha Bhattacharyya et al. Credit card fraud – excerpt. The original article for credit card fraud detection. Includes their feature engineering solution. The full article is available here. (On-campus or VPN only.)
Explaining test results: article in Science. In class we applied this to catching terrorists in the country of Blabia. Risk literacy in medical decision-making: How can we better represent the statistical structure of risk? By Joachim T. Operskalski and Aron K. Barbey. Science, 22 APRIL 2016 • VOL 352 ISSUE 6284. Risk literacy in medical decision-making
Lecture notes, Wednesday BDA18 Lecture feature engineering
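The arithmetic behind the medical testing paradox can be sketched in a few lines of R. The prevalence, sensitivity, and specificity below are hypothetical numbers chosen for illustration, not figures from the lecture notes or the Science article.

```r
# Hypothetical test characteristics.
prevalence  <- 0.01   # 1% of people have the condition
sensitivity <- 0.90   # P(test positive | condition)
specificity <- 0.95   # P(test negative | no condition)

# Natural-frequency version: imagine 10,000 people.
population      <- 10000
true_positives  <- population * prevalence * sensitivity
false_positives <- population * (1 - prevalence) * (1 - specificity)

# Positive predictive value: P(condition | test positive).
ppv <- true_positives / (true_positives + false_positives)

round(c(true_positives  = true_positives,
        false_positives = false_positives,
        ppv             = ppv), 3)
```

With a rare condition, the false positives from the large healthy group can outnumber the true positives, so a positive result from a fairly accurate test may still correspond to a low probability of having the condition.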
A few hours ago I received an email from Emily about the farm advertisement problem due on May 11. I wrote her back. But her question raises general issues for many projects. I’m sure other students also had similar problems with Friday’s homework.
In response, I just wrote
BDA18 Memo =My program runs too slowly v 1.1.
New version! BDA18 My program =slow v 1.2. The memo includes most of my specific suggestions about the May 11 homework. (This may be too late for some students, but most of the ideas were also discussed briefly in class on Monday or Wednesday.)
There are probably multiple typos and errors in the memo. Please send me corrections by email, for class credit.
I am working on the assignment due tomorrow and have encountered a problem. When reducing the TF-IDF matrix to 20 concepts, RStudio always stops working (as indicated by the little ‘Stop’ sign in the console). I’m thinking this is because the farm-ads.csv dataset is too large. Without reducing the concepts, I am unable to move forward with the random forest part of the assignment. I am wondering if there is a solution to this problem or a way to work around it.
Apologies in advance for not approaching you with this question earlier. It’s been a very hectic week!
Thanks for your help,
By the way, I’m 98% sure that in fact RStudio did not “stop working.” It was probably still cranking away. Check the Activity Monitor application on your computer to be sure.
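One general way to tell whether a long computation is merely slow or truly stuck is to time it on progressively larger samples before running it on everything. The sketch below is not taken from the memo. The random matrix is a hypothetical stand-in for a document-term matrix, and base R's `svd` stands in for whatever reduction step your own code uses.

```r
set.seed(1)
# Hypothetical stand-in for a document-term matrix: 2000 documents, 300 terms.
m <- matrix(rpois(2000 * 300, lambda = 0.1), nrow = 2000)

# Time a reduction to 20 dimensions on a random sample of rows.
time_svd <- function(n_rows) {
  idx <- sample(nrow(m), n_rows)
  system.time(svd(m[idx, ], nu = 20, nv = 20))[["elapsed"]]
}

sizes   <- c(250, 500, 1000)
timings <- sapply(sizes, time_svd)

# How the run time grows with sample size hints at how long the full data will take.
data.frame(rows = sizes, seconds = timings)
```

If the time grows much faster than the number of rows, that is a sign to reduce the data first (fewer terms, a sample of documents) instead of waiting for the full run.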