Wednesday May 23
Assignment – Oversampling, feature engineering
- Read main textbook section 5.5 on oversampling. The credit card fraud case used extreme oversampling of the fraud cases, because there were so few in the original data. Most real classification problems have uneven numbers of each class or uneven costs of errors, so oversampling is frequently important. Without it, or some other way to accomplish the same thing, you can get a model that is accurate but useless, such as one that predicts every case is the most common outcome.
- We will continue discussing and practicing feature engineering.
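To make the oversampling point concrete, here is a minimal sketch in Python (the course itself uses R; this is only an illustration). The toy data, the `oversample_minority` helper, and the field names `amount` and `fraud` are all hypothetical, not taken from the textbook or the fraud paper. It shows how "always predict the majority class" scores 99% accuracy while catching no fraud, and how random oversampling with replacement balances the classes.

```python
import random
from collections import Counter


def oversample_minority(rows, label_key, seed=0):
    """Randomly duplicate rows of the smaller classes until every class
    has as many rows as the largest class (sampling with replacement)."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Fill the gap to the target size with random duplicates.
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced


# Hypothetical toy data: 990 legitimate transactions and 10 frauds.
transactions = [{"amount": 20 + i % 50, "fraud": 0} for i in range(990)]
transactions += [{"amount": 900 + i, "fraud": 1} for i in range(10)]

# "Accurate but useless": predicting "not fraud" for every case.
majority_accuracy = sum(1 for t in transactions if t["fraud"] == 0) / len(transactions)

train_balanced = oversample_minority(transactions, "fraud")
counts = Counter(t["fraud"] for t in train_balanced)

print("accuracy of always predicting 'not fraud':", majority_accuracy)
print("class counts after oversampling:", dict(counts))
```

In practice, oversample only the training partition and judge the model on validation data that keeps the original class proportions.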
Assignment: Email me 5 to 10 proposed new variables for the credit card fraud case. The actual case is posted above, and in my lecture notes. You can see what variables the authors decided to use. Devise some additional/better ones. Can they be calculated from the available 44 million transaction records?
- Notice that you do not have to know what effect a variable will have in order to decide that it’s a potentially good feature. A LASSO or Random Forest can sort out whether it is useful, or not.
- Baseball homework from last week – the Hitters baseball data had a number of variables, but some of them seem strange. For example, “lifetime hits” just rewards players who have been playing for a long time. We also have a variable that directly measures how long they have been playing. So it is not surprising if LASSO decides that only one of the time-dependent variables is important. What changes could make a more useful measurement?
We will discuss this on Wednesday.
If you are not familiar with baseball, read about measurements of “batting performance,” i.e. how hitters are traditionally measured. For example, a key statistic is batting average. Note that many key statistics are ratios. Looking back at your homework from last week, what would have been good additional variables beyond what was in the original data?
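As one possible illustration of the ratio idea, the sketch below (in Python, although the course uses R) turns raw career totals into rates. The column names `Hits`, `AtBat`, `CHits`, `CAtBat`, and `Years` are assumed to match the Hitters data; the two players and their numbers are invented, and `add_rate_features` is a hypothetical helper.

```python
def add_rate_features(player):
    """Return a copy of a player record with ratio features added.
    Ratios remove the built-in reward for simply playing a long time."""
    out = dict(player)
    out["season_batting_avg"] = player["Hits"] / player["AtBat"] if player["AtBat"] else 0.0
    out["career_batting_avg"] = player["CHits"] / player["CAtBat"] if player["CAtBat"] else 0.0
    out["hits_per_year"] = player["CHits"] / player["Years"] if player["Years"] else 0.0
    return out


# Invented records: a long-serving player and a strong newcomer.
veteran = {"Hits": 120, "AtBat": 480, "CHits": 1800, "CAtBat": 7200, "Years": 15}
newcomer = {"Hits": 150, "AtBat": 500, "CHits": 300, "CAtBat": 1000, "Years": 2}

for name, player in [("veteran", veteran), ("newcomer", newcomer)]:
    features = add_rate_features(player)
    print(name, features["career_batting_avg"], features["hits_per_year"])
```

The veteran has six times the lifetime hits, but the newcomer is ahead on both career batting average and hits per year, which is the kind of signal a raw total hides.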
- For messing around. Google is sponsoring an open-source project for quick data exploration. This one seems to be able to display about 5 dimensions of data at once, by using:
Positions on x,y axes
Facets on x, y axes (especially for categorical variables).
I may demo it on Wednesday. It seems to be only for exploration, but exploration is well worth one to several hours of project time.
- Monday handouts: feature engineering
Cheat sheet on performance measurements: Sensitivity, Specificity, and many other measures. contingency table defns SAVE Keep this for reference.
Handout: some key plots for exploring data. Specifically for Hitters data. R Graphics Cookbook – Excerpt
Lecture notes: The medical testing paradox and how to solve it. BDA18 feature engineering case study In the same lecture is “Feature engineering for detecting credit card fraud”.
An abbreviated version of the paper, which is all you need for this week. “Data mining for credit card fraud: A comparative study” by Siddhartha Bhattacharyya et al. Credit card fraud – excerpt. The original article for credit card fraud detection. Includes their feature engineering solution. The full article is available here. (On-campus or VPN only.)
Explaining test results: article in Science. In class we applied this to catching terrorists in the country of Blabia. Risk literacy in medical decision-making: How can we better represent the statistical structure of risk? By Joachim T. Operskalski and Aron K. Barbey. Science, 22 APRIL 2016 • VOL 352 ISSUE 6284. Risk literacy in medical decision-making
Lecture notes, Wednesday BDA18 Lecture feature engineering
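The measures on the cheat sheet and the medical testing paradox both reduce to a few ratios from the contingency table. The sketch below is in Python rather than the course's R, and the population size, prevalence, and test quality are made-up numbers, not those used in class or in the Science article. `expected_counts` and `classification_rates` are hypothetical helper names.

```python
def expected_counts(population, prevalence, sensitivity, specificity):
    """Expected contingency-table cells for a test applied to a population."""
    positives = population * prevalence
    negatives = population - positives
    tp = positives * sensitivity      # condition present, test positive
    fn = positives - tp               # condition present, test negative
    tn = negatives * specificity      # condition absent, test negative
    fp = negatives - tn               # condition absent, test positive
    return tp, fn, fp, tn


def classification_rates(tp, fn, fp, tn):
    """Common performance measures computed from the four cells."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "ppv": tp / (tp + fp),  # chance a positive result is a true positive
    }


# Made-up scenario: 100,000 people, 1% prevalence, 90% sensitivity, 95% specificity.
tp, fn, fp, tn = expected_counts(100_000, 0.01, 0.90, 0.95)
rates = classification_rates(tp, fn, fp, tn)
for name, value in rates.items():
    print(f"{name}: {value:.3f}")
```

Under these assumed numbers the test looks strong on sensitivity and specificity, yet only about 15% of positive results are true positives, because the condition is rare.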
This page is the weekly summary of relevant material for assignments, projects, etc. Edited May 8.
- Projects: all projects should have downloaded live data by this weekend. If you have trouble with that deadline, it is time to drastically prune your goals. You can always add more elaborate analysis after you do the basics, but the opposite is not true.
- If you did not submit a project update last weekend (nothing since #3, April 29), come to Wednesday night office hours, or text me to suggest another time.
- Every team should expect to meet with me at least twice after your basic project is approved. The first meeting usually discusses whether you have formulated the problem in a way that is solvable with the data you have. Roughly 30% of the projects need to change the problem statement at this stage, which is easier than getting more data.
- The second meeting generally looks at your results, and finds ways to boost them / make them more interesting.
- The next homework is due Friday May 11. It is problem 20.3 from the main DMBA textbook. Data files are available on the book’s website. Details are on the linked assignment from the syllabus page. Latest syllabus, assignments, + notes for #BDA Big Data Analytics at UC San Diego
- Error in file name! The file AutoElectronics.zip has been scrambled by the book authors. On their web site they call it AutoAndElectronics.zip. (They have yet a third name in the textbook!!) You can DL it from them at http://www.dataminingbook.com/system/files/AutoAndElectronics.zip or at the course Google page under either name.
- Key resource list: Reference books on R and on specialized data mining methods. Resources for Mining + R language. Information to solve 98% of your R and data mining algorithm problems is available from this page! (You still need to figure out how to formulate your business problem as Data Mining. Many of the references give advice about this, but it is not reducible to a “cookbook.”)
- Text-mining specific resources. Text-mining resources for projects
- Check out the seminar next Monday on careers. Careers in Data Analytics discussion Monday May 14, 12:30
- One new book is particularly worth checking out if you are struggling with messy data sets. It discusses packages that have been developed specifically for common data manipulation problems in machine learning/data mining. Several of Feiyang’s tutorials use this book.
R for Data Science: The tidyverse, a set of new tools for file and data manipulation. Much more efficient than raw R, and faster to write code with. Chapter 5 is probably the place to start. This book is available for $20 or from the library, but the same material is on a web site http://r4ds.had.co.nz/
- Weekly notes. Text-Mining Bohn Day 1+2.
- The file we used for a tm tutorial on Monday. Called Basic Text Mining in R. It uses slightly different techniques than the textbook, such as no Latent Semantic Analysis, but instead taking out the rare words.
- Files for Wednesday’s class, section 20.5
- May 9 list of functions. Keep this as a “cheat sheet,” and add to it as you read and do mining.
- Assignment: Text mining #2 2018b
- lecture notes. Text-Mining Bohn Day 1+2
- The R files cannot be stored on WordPress. Get them on the course’s Google drive page. You can reconstruct my work with two files:
- Files ending in .R contain the code which I wrote and kept in RStudio’s Source window (upper left). This code can be run. Note: if I ran additional code but did not save it in the Source file, it appears in History, but not in Source.
- Files ending in .RData contain the entire Global Environment of variables.
- By loading both of these into RStudio, you can recreate the status at the time I saved them. Thus all calculations are preserved.
The ostensible material this week is Random Forests. They are a generalization of Classification/Regression Trees. No assignment for Monday; everything will be done in class. Redo the class material and hand it in on Wednesday (not Friday – we will go over the results in class.)
Here is the assignment for Wednesday, including the readings on Random Forests. BDA18 Random Forest Assign May 2, 2018.
Here is the material used in class. BDA18 Random Forests2018B
The other agenda this week is to develop your skills in debugging. This is the key skill of writing code: figuring out what is wrong and how to fix it.
For Week 6, text mining. Here is the assignment for next Monday BDA18 Text mining #1 assign. (Short assignment handed in Sunday night.)
This week we have 3 learning goals. It will take the entire week to do them.
- Linear regression for prediction. How it differs from hypothesis testing.
- Showing how to use R instead of, or in conjunction with, Rattle.
- Many specific tricks and issues that come up with linear regression, such as word equations and creating interaction variables.
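One way to see how regression for prediction differs from hypothesis testing: the model is judged by its error on data it has not seen, not by p-values on the coefficients. The sketch below is in Python rather than R, uses a single predictor, and runs on invented data; `fit_line` and `rmse` are hypothetical helpers, and the every-third-row split is a deterministic stand-in for a random partition.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rmse(actual, predicted):
    """Root mean squared error between actual and predicted values."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Invented data: price falls with age, plus a small deterministic wiggle.
ages = list(range(1, 21))
prices = [20000 - 800 * a + (150 if a % 2 else -150) for a in ages]

# Every third row goes to validation; the rest is used for training.
train = [(a, p) for i, (a, p) in enumerate(zip(ages, prices)) if i % 3 != 2]
valid = [(a, p) for i, (a, p) in enumerate(zip(ages, prices)) if i % 3 == 2]

intercept, slope = fit_line([a for a, _ in train], [p for _, p in train])
train_rmse = rmse([p for _, p in train], [intercept + slope * a for a, _ in train])
valid_rmse = rmse([p for _, p in valid], [intercept + slope * a for a, _ in valid])

print(f"slope: {slope:.1f}")
print(f"training RMSE: {train_rmse:.1f}   validation RMSE: {valid_rmse:.1f}")
```

The number to report for a predictive model is the validation RMSE; the training RMSE only shows how well the line fits the rows it was built from.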
Please see the attached document, which includes the readings, specific homework due Friday, and supplemental information about various useful ideas and techniques.
BDA18 Week 4 Readings + assign
You can get data files here: https://bda2020.wordpress.com/data-sets/
There is nothing due on Monday.
Thanks to everyone who submitted these project proposals on time. I have returned half (Saturday night); the others are coming. A few are still missing.
Here is some general advice that applies to some proposals. If you already know it, please help other BDA students with it.
- For this course, all projects need to do prediction or classification of individual-level characteristics of some kind. You have already had numerous courses where the goal is to test a hypothesis or to measure the strength of the causal relationships among variables. These are important, but in BDA we are learning a new skill, and indeed a new purpose for analyzing quantitative data. Mathematically, the two are closely related, but prediction requires a different mind-set. We will talk more about this in the weeks to come.
- Internet search: Everyone knows how to do a simple search on Google. But that is a low skill level. It is possible to do research far better in 2018. Before you leave UCSD, get good at it. Here are some simple questions to test yourself.
- What is the difference between searching for Yelp reviews and “Yelp Reviews” (with or without quotes and capital letters)?
- Find pages that mention dates between 1985 and 1995. (Not that were written then, but that contain those dates.)
- Almost everything on Google Scholar can be found in a regular Google search. Why, then, is it so useful to search on Google Scholar instead of (or sometimes in addition to) Google.com?
- What extra benefits do you get on Google Scholar when you are on campus or VPNed to campus?**
Where on this listing should you click to get the paper most easily?
- Name two good databases that have massive collections of non-public business-related papers, reports, magazines, etc. Material that is not available through either Google or Google Scholar.
- A few people forgot the syllabus request that all written work be turned in with a professional appearance, as if you were in a company or organization. I got more than one file called “proposal.pdf.”
I have therefore updated and reposted the syllabus section on homework and reports. BDA18 Syllabus = HW formats + grading 2018-04-14. Parts of it, such as discussion of figures, are overkill for these early project descriptions.
** This one is pretty obscure. The answer is in the picture. If you are on campus, you get the UC-eLinks entry. For recently published papers, that is often the only way to download them.
Feiyang and I are working to get everyone up to speed for next week. Here is a list of items to be aware of, in no particular order. I will get this material organized better over the weekend. As always, you can post comments on this message if you have questions.
- Rattle now works on the Mac! see Installing Rattle on Mac
- The homework for Monday does use Rattle. Do it in teams, and if only one of you has a working version of Rattle that is ok as long as you physically work together. In class, I will call on someone randomly for your solutions.
- You can find the homework at the end of the syllabus. Currently, it is version 1.05, but I expect to revise it late today. Latest syllabus, assignments, + notes The Monday assignment is on page 12, or search for “Early Assignments”. Update: individual assignments are also in TritonEd.
- The assignment due Tuesday night takes some people a long time because it requires using R, Rattle, and the first data mining algorithm, called CART. It also asks you to do some data manipulation. So set aside time, and do it with a teammate. Homeworks are due 11pm.
- To get the course information immediately, subscribe to this web site. Look for the subscribe button on the bottom right. (BDA2020.wordpress.com)
- Feiyang can provide assistance with Rattle by email (and then phone etc.). When asking for computer assistance with problems, provide basic information for debugging. Do not say “it didn’t work,” unless you don’t need any assistance. The more complete, the better. Give the exact error message. It is even ok to copy and paste an entire stream of activities and resulting error messages into the bottom of an email. She won’t read it all, but it gives important clues.
- I have not confirmed this, but she should be available before or after class on Monday for anyone still having trouble with Rattle.
- For R in general, get in the habit of googling error messages. Often this will send you to the Stack Overflow site.
This page links to the latest versions of course material. Some PDF, some HTML. Update May 29, 2018
Lecture Notes (chronological order)
- BDA18-D3 Chap9_CART RB. For the class of April 9, on CART.
- BDA18 Class 4 Lecture notes Toyota For the class of April 11, on CART + Toyota
- Logistic Regression 2018 Class of April 16 on classification using linear models aka logistic regression.
- Class of April 18 on linear categorical models aka logistic regression. BDA18 illustration of Rattle use 04-18
- Notes on Linear Regression, Week 4, April 23, 25 BDA18 regression 04/25.pdf. BDA18 regression slides 4-23. Use primarily the April 25 version; 4/23 has a few additional slides.
- How to go from Rattle to R. BDA18 Rattle to R code 4-25.pdf
- Lecture Notes Week 5 Random Forests BDA18 Random Forests2018B
- Lecture Notes Week 6 Text Mining, Day 1
Tutorial worked through in class. Basic Text Mining in R 2017 version
- Week 6 Text mining #2 2018b
- Week 7 LASSO, Monday May 14.
- Week 8 lecture notes. Monday May 21. BDA18 feature engineering case study
Advice, tutorials, reference books, other useful material
Special topics – for specific papers
The Big Data Analytics course introduces data mining with techniques and concepts that are broadly applicable. Individual topics and projects have specific techniques, needs, and resources. In keeping with the theme “Borrow and re-use, don’t invent anything yourself,” here are some resources that are especially suited to particular topics.
Don’t forget to try the site’s Search window (usually near the upper right) to look up possible keywords. Many of these topics also have entire books about them, such as on Springerlink.
- Especially useful R books for the course. Resources for Mining + R language
- Text processing. Start with this list: Text Mining Resources for Projects. Then look at https://bda2020.files.wordpress.com/2017/04/bda17-text-mining-resources.pdf These two pages alone will save many hours of programming time. There are also many books on this subject. Specific books include: Mining Text Data; R for Marketing Research and Analytics.
- Spatial data, Geographic Information Systems. For projects on taxis, bicycle sharing, crime, and many other topics where the underlying data is geographically distributed, and location affects behavior. Read this page: Spatial (GIS) data in R: easy maps One of many books is Applied Spatial Data Analysis with R. Also Spatial analysis in R
- Time series require a special kind of validation, in which you train the model on early years, and then validate it on later years. You can do this in rolling fashion. For example use years 1-5 for training, and validate on year 6. Then use years 1 to 6 for training (or 2 to 6), and then validate on year 7. Validating machine learning time series models
- Twitter and other social networking sites. In addition to material on text mining, see R for Marketing Research and Analytics and Text mining of Amazon reviews. Also be sure to read about “Regular Expressions.” Handling and Processing Strings in R by Gaston Sanchez is a 100-page mini-book on manipulating text. Look here when you need to do something with text like “find all words that start with ‘UCSD’.” Finally, there are many previous student papers in BDA that use Twitter data.
- Local crime. Local crime models are tricky because they require predicting events that are spread out over space and time. If you set up your data with “buckets” that are geographically and temporally small, then most buckets are empty. But if you make the buckets too large, such as “Any time on Mondays, for the lower half of Manhattan,” then the buckets are too big to be useful to decision makers. Wk 8: Feature engineering, other topics CHRONological handouts, 2016. Lectures 2017
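The rolling validation scheme described in the time series item above can be sketched as follows. This is Python rather than the course's R, and `rolling_origin_splits` is a hypothetical helper name, not a function from any package mentioned here. With `expanding=True` the training window grows (years 1 to 5, then 1 to 6); with `expanding=False` it slides (years 1 to 5, then 2 to 6).

```python
def rolling_origin_splits(periods, min_train, expanding=True):
    """Yield (training_periods, validation_period) pairs in time order.
    Training data always comes strictly before the validation period."""
    for end in range(min_train, len(periods)):
        start = 0 if expanding else end - min_train
        yield periods[start:end], periods[end]


years = [1, 2, 3, 4, 5, 6, 7]

print("expanding window:")
for train_years, valid_year in rolling_origin_splits(years, min_train=5):
    print("  train on", train_years, "validate on", valid_year)

print("sliding window:")
for train_years, valid_year in rolling_origin_splits(years, min_train=5, expanding=False):
    print("  train on", train_years, "validate on", valid_year)
```

Averaging the validation error over all the splits gives a steadier estimate than a single train/validate cut, while never letting the model see the future.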
Google folder for the course. There you will find all datasets for the textbook,
The official textbook web site is http://www.dataminingbook.com/book/r-edition
Once you register, you can get these datasets, and the R Code. (It’s better to type the R Code by hand, the first time.)
PROFESSOR ROGER BOHN OFFICE = RBC 1315 PHONE 858 534-7630
EMAIL: RBOHNat UCSDdotEDU.
Personal web site: Art2science.org