Course status update April 23 (with a hidden prize)

The following message went out as an announcement. If there are updates, I will put them here, rather than send another announcement.

1. Assignments are posted for the next 3 classes: Tuesday April 25, Thursday April 27, and Tuesday May 2 (due on May 1 at midnight).

2. Remember that Sai’s TA workshop is still on Wednesdays at 6 PM, but his office hours are now on Tuesday mornings at 9:30 AM. If you want help with the homework that is due Monday night, email him early on the weekend and hope for the best. The new schedule is discussed in a blog post.

3. Projects:

  • If you will be looking at spatial information in your project, check out my recent blog post on GIS in R.
  • I have returned comments on all project proposals, and met with many of you to discuss them. Some project proposals are great; others are just getting started, and may change direction or get a lot more specific.
  • If you are still working by yourself, many of you will benefit from collaborating. Take a look at “Discussions” on TritonEd, and post a message about your plans. Each of the first 5 people to post a message there will be entered in a lottery for a prize. (“Hello world” messages don’t count.)
  • Some teams are planning to work on projects based on the same data sets (e.g. Airbnb). In that situation, get together and exchange information about what you are finding, especially about retrieving and cleaning the data.
  • If we have time in class on Tuesday, I will ask some of you to introduce yourselves and discuss/ask about projects.

Spatial (GIS) data in R: easy maps

Most, if not all, paper topics will benefit from finding books and articles discussing (and giving code for) relevant techniques. Common examples in the past have been text mining and web scraping. Another example is analyzing spatial data. There are R functions for a full range of geographic modeling, including both analysis and display. Several groups are planning to look at Airbnb or other data where space is important.

New weekly sequence Thursday/Tuesday

Based on in-class discussion today (April 20), we will aim for the following schedule in future weeks. We will not always follow this exact schedule.

  • Thursdays: Lecture introducing a new topic. Homework assigned for the following Tuesday.
  • Tuesdays: Homework assignment due, using those concepts. That gives you the weekend to do the work. Homework is discussed in class on Tuesday.
  • Sundays: Weekly project report due.
  • Wednesdays 6 PM: TA workshop. Bring your computer, get practice using R and running programs.
  • Tuesdays 9:30 AM (new time): TA office hours. Bring your questions. You can also contact him by email to ask for an appointment.
  • Tuesdays and Thursdays 5:10 PM: Professor’s office hours, in room 1315 on Tuesdays and at Peet’s Coffee on Thursdays. You can also contact him by email to request an appointment.

Next week only, your project assignment is due on Tuesday, April 25.

Project Progress Reports

I have posted the first two progress report assignments. One is due Tuesday night, and the second is due the following Sunday (April 23). These are not being graded, but I am giving feedback.

Schedule a meeting with me, during or after office hours, by Friday, April 28. The purpose is to go over your proposal, and help you think about how to do something feasible and interesting. If you have already met with me to discuss topics, you do not need another meeting unless I specify one in my feedback.

Work with other students, including ones you don’t know, to find interesting data sets and topics. Use the TritonEd forum for this purpose for now. The best teams consist of two very different people (in background, culture, gender, field of study, etc.).

Kaggle competition Meetup team in SD this weekend

A San Diego group of Kagglers is looking for beginner-level members to join a contest. They are meeting this weekend (Saturday AM) near campus. My personal take: Since we are only in week 2 of the course, a Kaggle project is not a good use of time for most students in BDA. But you might want to use this data for a project, in which case you will learn a lot by watching the contest over the next few weeks. And getting to know this Meetup group might lead to additional benefits in May.

Kaggle Competition – MLS San Diego Team

Saturday, Apr 8, 2017, 12:00 PM

Bella Vista Social Club and Cafe
2880 Torrey Pines Scenic Dr, San Diego, CA

“Congratulations to the SD MLS team for placing 46th out of 421 on the leaderboard in the Satellite Data Classification competition. This will be our third Kaggle Team competition and we have figured out a winning strategy! We will form 2 teams per city (San Diego and New York).…”

Instructions for joining the team:

“Observer” team – (Beginner – Intermediate)

This team invites students, coders of all levels and anyone that has a desire to learn more about data science through hands-on experience. No limit to team size. This team will be led by a volunteer Machine Learning Society data scientist with extensive experience in the field. This team’s challenge is the Two Sigma Connect: Rental Listing Inquiries, which ends on April 25, 2017.

You may have to join the Kaggle site to see the contest description, so I am posting part of it here. I have not looked at the data set in detail, but at least the variables are labeled. The data includes optional photographs for each observation. Analyzing photos turns it into a much more challenging project.

Finding the perfect place to call your new home should be more than browsing through endless listings. RentHop makes apartment search smarter by using data to sort rental listings by quality. But while looking for the perfect apartment is difficult enough, structuring and making sense of all available real estate data programmatically is even harder. Two Sigma and RentHop, a portfolio company of Two Sigma Ventures, invite Kagglers to unleash their creative engines to uncover business value in this unique recruiting competition.
Two Sigma invites you to apply your talents in this recruiting competition featuring rental listing data from RentHop. Kagglers will predict the number of inquiries a new listing receives based on the listing’s creation date and other features. Doing so will help RentHop better handle fraud control, identify potential listing quality issues, and allow owners and agents to better understand renters’ needs and preferences.
Two Sigma has been at the forefront of applying technology and data science to financial forecasts. While their pioneering advances in big data, AI, and machine learning in the financial world have been pushing the industry forward, as with all other scientific progress, they are driven to make continual progress. This challenge is an opportunity for competitors to gain a sneak peek into Two Sigma’s data science work outside of finance.


This competition is co-hosted by Two Sigma and RentHop (a portfolio company of Two Sigma Ventures, which is a division of Two Sigma Investments) to encourage creativity in using real world data to solve everyday problems.

BDA for pollution

There is a lot of pollution data easily available for projects.



To: BDA Students

From: Roger Bohn

Subject: How to do BDA analysis of pollution causes, forecasts, or effects. 

Date: March 29, 2016

Pollution data is available now on a daily basis from sensors all over the world. This has a multitude of uses, from mapping exposures for people who live in different places, to predicting air pollution levels in order to issue health alerts. Some of this is controversial, of course, because pollution is sensitive for many governments around the world. 

Here is an informal article that walks through how to forecast air pollution from such sensor data. 

Environmental Monitoring using Big Data

It uses some fancier math than necessary for the BDA course, and of course it is only a sketch. There are many academic articles on similar topics. 

Finding pollution sensor data. The US EPA has a central location for air pollution data. It includes daily or hourly data from about 5000 sensors around the country, covering about 10 different air pollutants. The same page has data on pollutant emissions.

Mash-ups: Weather data is available for many cities around the world. You could mash the weather data against pollution data, to see how well you can forecast pollution one, two, and three days ahead. A project like this is probably best done for one city or region of one country. Another mash-up could look at health data (such as the Mexico death certificates I mentioned in class) against the pollution data. 
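
As a rough illustration of the weather/pollution mash-up, the sketch below joins two daily series on date and fits a one-variable least-squares line for each forecast horizon (one, two, and three days ahead). It is written in Python using only the standard library, which is my assumption for compactness; the course itself uses R, and the same join-and-lag steps carry over. The readings, the variable names, and the choice of wind speed as the predictor are all made up for illustration. They are not real EPA or weather-service data.

```python
from datetime import date, timedelta

# Hypothetical toy data for one city: daily mean wind speed (m/s) and PM2.5 (ug/m3).
start = date(2016, 3, 1)
wind = {start + timedelta(days=i): w for i, w in enumerate(
    [2.0, 3.5, 1.0, 4.0, 2.5, 1.5, 3.0, 5.0, 2.0, 1.0, 3.5, 4.5])}
pm25 = {start + timedelta(days=i): p for i, p in enumerate(
    [30.0, 22.0, 41.0, 19.0, 28.0, 36.0, 25.0, 14.0, 31.0, 40.0, 21.0, 16.0])}


def lagged_pairs(predictor, target, horizon):
    """Pair the predictor on day t with the target on day t + horizon (join on date)."""
    pairs = []
    for day, x in sorted(predictor.items()):
        future = day + timedelta(days=horizon)
        if future in target:
            pairs.append((x, target[future]))
    return pairs


def fit_line(pairs):
    """Ordinary least squares for y = intercept + slope * x; returns (intercept, slope)."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def mean_abs_error(pairs, intercept, slope):
    """Average absolute gap between observed and fitted values."""
    return sum(abs(y - (intercept + slope * x)) for x, y in pairs) / len(pairs)


results = {}
for horizon in (1, 2, 3):
    pairs = lagged_pairs(wind, pm25, horizon)
    intercept, slope = fit_line(pairs)
    results[horizon] = mean_abs_error(pairs, intercept, slope)
    print(f"{horizon}-day-ahead: n={len(pairs)}, in-sample MAE={results[horizon]:.1f}")
```

The error reported here is in-sample, so it flatters the model. In a real project you would hold out the most recent weeks and measure the forecast error on those.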

A general search strategy: Find an article that describes a project in one city or region, for one pollutant, health problem, etc. Then look for data that will let you do something similar in another location/different year/different pollutant. There is now so much environmental data (at least in a few countries) that the odds are good you can find the data you need. 


BDA examples: Pollution and health

Added May 2018: EPA pollution data is still available, despite the change in administrators. Here are its sources for US data. I just poked around on the EPA web site and found hourly data, and daily data on multiple pollutants. It appears that even more fine-grained data is available if you use the API.

========== Original message, 2016 ============

This page lists numerous examples of studies related to data mining on health and pollution, and their relationships. Many of the papers use a geographic framework (GIS), but a full-blown GIS is not needed for this course. Remember that the references in an article are sometimes key: they lead you to other articles, closer to what you are looking for.

  • A recent study of infant health inequalities among Bangladeshi immigrant women in New York City found that their infants were most vulnerable to poor health outcomes, such as low birth weight, when living either in very isolated settings or in areas of the highest ethnic density (11).
  • Using data from 15 million mobile-phone subscribers, Wesolowski et al. (13) could examine the complex interactions between human and animal movements and the spread of malaria in Kenya.
  • Researchers have also used GIS data, spatial statistics, and interactive mapping to identify HIV concentration hotspots near the Mexico–U.S. border (10). This kind of GIS-based analysis enables proactive and timely delivery of tailored prevention and treatment strategies, such as HIV testing, antiretroviral therapy intervention, and education to the affected communities. This and the previous two examples are all from “Spatial Turn in Health Research” by Douglas B. Richardson et al., 2013.


  • Fine-Particulate Air Pollution and Life Expectancy in the United States

“This study directly evaluated the changes in life expectancy associated with differential changes in fine particulate air pollution that occurred in the United States during the 1980s and 1990s.”

This study was done in 2009. It mashed data on air pollution, health, and socioeconomic variables at the county level. But they only looked at 211 US counties, a small fraction of the total. Nowadays this could be done in many different countries, and at the city or sub-city level. This study was cited by 500 other articles, so there are many examples.

  • Ozone, area social conditions, and mortality in Mexico City

“We investigated whether the association of daily mortality and ambient ozone differs by age and area social conditions of the region of residence using a time-series analysis.”
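
The county-level mash-up described for the life-expectancy study above comes down to joining several tables on a shared county key and then comparing changes. The sketch below, in Python with only the standard library (the same join is easy to do in R), shows that step. The county codes, variable names, and numbers are hypothetical placeholders, not figures from the study.

```python
# Hypothetical county-level tables keyed by a county code (all values are made up).
pm25_change = {"A": -6.0, "B": -2.5, "C": -9.0, "D": -4.0}            # ug/m3, later minus earlier
life_exp_change = {"A": 2.9, "B": 2.3, "C": 3.4, "E": 2.0}            # years
median_income = {"A": 41_000, "B": 38_500, "C": 45_200, "D": 39_900}  # dollars


def inner_join(*tables):
    """Keep only the keys present in every table; return {key: tuple of values}."""
    shared = set(tables[0])
    for table in tables[1:]:
        shared &= set(table)
    return {key: tuple(table[key] for table in tables) for key in sorted(shared)}


merged = inner_join(pm25_change, life_exp_change, median_income)

for county, (d_pm, d_life, income) in merged.items():
    print(f"county {county}: PM2.5 {d_pm:+.1f}, life expectancy {d_life:+.1f} yr, income {income}")

print(f"{len(merged)} of {len(pm25_change)} counties have data in every table")
```

Counties missing from any one source drop out of an inner join, so it is worth counting how many survive before you start modeling.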