I have created a new list of resources for specific project types, such as spatial analysis and Twitter analysis. It is on the Latest Handouts page, under the heading Special topics for individual papers.
Summary of the last 2 weeks of the course:
- Only nominal homework – readings and one figure.
- Work on projects. Ask for help if desired. No more interim reports are due.
- R Certification: If you want R certification for the course, take a one-hour quiz and meet some other requirements.
- Make an in-class presentation: two-person teams only.
- Final paper due
Wednesday, May 30: Handling unbalanced data and other useful techniques.
Reading: Chapter 5.5, also 5.3 and 5.4. These were assigned previously.
Nothing to be turned in
Saturday, June 2: No progress report is due.
Monday, June 4: A/B Testing and other emerging topics in Big Data
- Look up specific techniques for your project: spatial data and GIS, text processing, crime, graphics, or Twitter. One or more applies to every project. See Special topics for individual papers.
- Turn in: One careful plot from your project. Hard copy, with comments on it by hand. Format the plot carefully and clearly including scales, colors, definitions, etc. Please turn these in by hand in class. This is to encourage hand-writing of comments. Circle and explain at least one interesting/important feature of your plot.
- Include a caption. Captions in scientific papers are sometimes several sentences long.
- The goal of the assignment is to help you focus intensively on one result of your project, and how to explain it visually. It does not have to be a data-mining result.
- Reading: “The A/B Test: Inside the Technology That’s Changing the Rules of Business,” Wired Magazine, 04.25.12. https://www.wired.com/2012/04/ff_abtesting/all/
- Visit an e-commerce website and think about how to improve it using A/B testing.
Wednesday, June 6: All two-person project teams will give 5 to 7 minute presentations. The goal is to fascinate, impress, and surprise your audience. Think of this as the “elevator pitch” for your project.
Friday, June 8, 1pm or other times as agreed: Quiz for R Certification. The quiz emphasizes data manipulation in R: selecting data subsets, creating new variables, and rearranging and redefining data such as event logs. The other requirements for R certificates are completing your project using appropriate R programming, and attending 50% of TA tutorials.
Friday, June 8 midnight: Formal due date for final project papers.
All project teams that request one receive an automatic extension until Wednesday.
Submit both hard copy and PDF files. Submit via Turnitin, on TritonEd.
June 11. Wednesday, June 13. Deadline for projects.
This page links to the latest versions of course material. Some are PDF, some HTML. Updated May 29, 2018.
Lecture Notes (chronological order)
- BDA18-D3 Chap9_CART RB. For the class of April 9, on CART.
- BDA18 Class 4 Lecture notes Toyota For the class of April 11, on CART + Toyota
- Logistic Regression 2018 Class of April 16 on classification using linear models aka logistic regression.
- Class of April 18 on linear categorical models aka logistic regression. BDA18 illustration of Rattle use 04-18
- Notes on Linear Regression, Week 4, April 23, 25 BDA18 regression 04/25.pdf. BDA18 regression slides 4-23. Use primarily the April 25 version; 4/23 has a few additional slides.
- How to go from Rattle to R. BDA18 Rattle to R code 4-25.pdf
- Lecture Notes Week 5 Random Forests BDA18 Random Forests2018B
- Lecture Notes Week 6 Text Mining, Day 1
Tutorial worked through in class. Basic Text Mining in R 2017 version
- Week 6 Text mining #2 2018b
- Week 7 LASSO, Monday May 14.
- Week 8 lecture notes. Monday May 21. BDA18 feature engineering case study
Advice, tutorials, reference books, other useful material
Special topics – for specific papers
The Big Data Analytics course introduces data mining with techniques and concepts that are broadly applicable. Individual topics and projects have specific techniques, needs, and resources. In keeping with the theme “Borrow and re-use, don’t invent anything yourself,” here are some resources that are especially suited to particular topics.
Don’t forget to try the site’s Search window (usually near the upper right) to look up possible keywords. Many of these topics also have entire books about them, such as on Springerlink.
- Especially useful R books for the course. Resources for Mining + R language
- Text processing. Start with this list: Text Mining Resources for Projects. Then look at https://bda2020.files.wordpress.com/2017/04/bda17-text-mining-resources.pdf. These two pages alone will save many hours of programming time. There are also many books on this subject. Specific books include Mining Text Data and R for Marketing Research and Analytics.
- Spatial data, Geographic Information Systems. For projects on taxis, bicycle sharing, crime, and many other topics where the underlying data is geographically distributed and location affects behavior. Read this page: Spatial (GIS) data in R: easy maps. One of many books is Applied Spatial Data Analysis with R. Also see Spatial analysis in R.
- Time series. These require a special kind of validation, in which you train the model on early years and then validate it on later years. You can do this in rolling fashion. For example, use years 1 to 5 for training and validate on year 6. Then use years 1 to 6 (or 2 to 6) for training and validate on year 7. See Validating machine learning time series models.
- Twitter and other social networking sites. In addition to the material on text mining, see R for Marketing Research and Analytics and Text mining of Amazon reviews. Also be sure to read about “Regular Expressions.” Handling and Processing Strings in R by Gaston Sanchez is a 100-page mini-book on manipulating text. Look there when you need to do something with text like “find all words that start with ‘UCSD’.” Finally, there are many previous student papers in BDA that use Twitter data.
- Local crime. Local crime models are tricky because they require predicting events that are spread out over space and time. If you set up your data with “buckets” that are geographically and temporally small, then most buckets are empty. But if you make the buckets too large, such as “Any time on Mondays, for the lower half of Manhattan,” then the buckets are too big to be useful to decision makers. See Wk 8: Feature engineering, other topics.
CHRONological handouts, 2016. Lectures 2017
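The rolling validation scheme described in the time series item above can be sketched in a few lines. This is only an illustration, written in Python rather than the R used in the course. The function name `rolling_origin_splits` and its parameters `min_train` and `expanding` are hypothetical names chosen for this example, and the "years" are simply the integers 1 to 7.

```python
def rolling_origin_splits(periods, min_train=5, expanding=True):
    """Yield (training_periods, validation_period) pairs in time order.

    Hypothetical helper for illustration only.
    expanding=True  -> training window grows: years 1-5, then 1-6, ...
    expanding=False -> training window slides: years 1-5, then 2-6, ...
    """
    periods = sorted(periods)
    for i in range(min_train, len(periods)):
        start = 0 if expanding else i - min_train
        # Train only on periods strictly before the validation period.
        yield periods[start:i], periods[i]


years = range(1, 8)  # years 1 through 7, as in the example in the text

for train, valid in rolling_origin_splits(years):
    print("train on years", train, "-> validate on year", valid)
```

With `expanding=False` the training window slides instead, so year 7 is validated using years 2 to 6. Either way, the model is never trained on data from after the year it is validated on.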
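The text task mentioned in the Twitter item above, finding all words that start with "UCSD", is a typical use of regular expressions. Here is a minimal sketch in Python (the course itself uses R, where the Sanchez book covers the equivalent functions). The function name `words_starting_with_ucsd` and the sample sentence are made up for this example.

```python
import re

# \b is a word boundary, so the match must begin at the start of a word.
# \w* then takes any letters, digits, or underscores that follow "UCSD".
UCSD_WORD = re.compile(r"\bUCSD\w*")


def words_starting_with_ucsd(text):
    """Return every word in `text` that begins with 'UCSD' (case-sensitive)."""
    return UCSD_WORD.findall(text)


# Made-up sample text, for illustration only.
sample = "UCSD students love UCSDpride and UCSD_news, but not myUCSD."
print(words_starting_with_ucsd(sample))
# -> ['UCSD', 'UCSDpride', 'UCSD_news']
```

Note that "myUCSD" is not matched, because "UCSD" does not start that word. To ignore case, one option is to compile the pattern with `re.IGNORECASE`.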
Google folder for the course. There you will find all datasets for the textbook.
The official textbook web site is http://www.dataminingbook.com/book/r-edition
Once you register, you can get these datasets, and the R Code. (It’s better to type the R Code by hand, the first time.)
PROFESSOR ROGER BOHN OFFICE = RBC 1315 PHONE 858 534-7630
EMAIL: RBOHNat UCSDdotEDU.
Personal web site: Art2science.org