Data files for multiple-instance regression studies
Kiri Wagstaff, January 2007
kiri.wagstaff@jpl.nasa.gov

In the multiple-instance setting, each county is a bag of pixels, and
each county is associated with two targets: corn yield and wheat yield
(each year).

----------------------------------------------------------------------
Remote sensing data:

- CA01.txt: MODIS data observations for CA in 2001.
  Similar files exist for 2002, 2003, 2004, 2005.
- KS01.txt: MODIS data observations for KS in 2001.
  Similar files exist for 2002, 2003, 2004, 2005.

There are 100 randomly selected pixels included for each county, one
per line.

Format (per item/pixel/line of the file):

  countyid latitude longitude surface_reflectance

where surface_reflectance contains 92 values:
- observations in red for 46 timepoints across the year (every 8 days)
- observations in IR for 46 timepoints across the year (every 8 days)

These features are interleaved in time order, so the feature values
are actually:

  [R1 IR1 R2 IR2 R3 IR3 ... R46 IR46]

Notes:
- Pixels with zeros or -32767 (both bad values) for feature values
  have not been removed. You may want to prune these out of the
  analysis.
- Each pixel represents 250m x 250m on the surface of the Earth.
- Original data source: MODIS instrument on the Terra spacecraft.
  These values are surface reflectance data (the MOD09 product).
  The full data set can be obtained at the EOS Data Gateway:
    http://edcdaac.usgs.gov/main.asp
  The relevant products are:
    MODIS/TERRA SURFACE REFLECTANCE 8-DAY L3 GLOBAL 250M ISIN GRID V003
    MODIS/TERRA SURFACE REFLECTANCE 8-DAY L3 GLOBAL 250M SIN GRID V004

----------------------------------------------------------------------
Targets:

- yields-CA-2001.txt: Crop yields for the state of CA for 2001.
  Similar files exist for 2002, 2003, 2004, 2005.
- yields-KS-2001.txt: Crop yields for the state of KS for 2001.
  Similar files exist for 2002, 2003, 2004, 2005.

Format (yields are in bushels per acre):

  countyid corn_yield wheat_yield

Notes:
- Counties with zero yield for a crop have not been removed. You may
  want to omit these from the regression, since 0 can mean either
  "did not grow any of this crop" or "data has not been reported".
- Original data source: USDA records, obtained from the NASS:
    http://www.nass.usda.gov/Data_and_Statistics/Quick_Stats/

----------------------------------------------------------------------
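Example (reading the files):

The two file formats above can be combined into (bag, target) pairs
along the following lines. This is only a minimal sketch in Python,
not code shipped with the data. It assumes that fields are separated
by whitespace, that county ids are written identically in the pixel
and yield files (they are compared as strings), and that a pixel
should be dropped if any of its 92 values is bad. Dropping the whole
pixel is just one possible pruning policy. All function names are
hypothetical.

```python
BAD_VALUES = {0.0, -32767.0}  # both are listed as bad values in the notes
N_TIMEPOINTS = 46             # 8-day composites across one year


def parse_pixel_line(line):
    """Split one pixel record into (countyid, lat, lon, red, ir)."""
    fields = line.split()  # assumption: whitespace-separated fields
    expected = 3 + 2 * N_TIMEPOINTS
    if len(fields) != expected:
        raise ValueError(f"expected {expected} fields, got {len(fields)}")
    countyid = fields[0]  # kept as a string; the id format is not specified
    lat, lon = float(fields[1]), float(fields[2])
    values = [float(v) for v in fields[3:]]
    red = values[0::2]  # R1, R2, ..., R46
    ir = values[1::2]   # IR1, IR2, ..., IR46
    return countyid, lat, lon, red, ir


def build_bags(pixel_lines, drop_bad=True):
    """Group pixels by county: {countyid: [(lat, lon, red, ir), ...]}."""
    bags = {}
    for line in pixel_lines:
        if not line.strip():
            continue  # skip blank lines
        countyid, lat, lon, red, ir = parse_pixel_line(line)
        # Assumed policy: drop a pixel if any of its values is bad.
        if drop_bad and any(v in BAD_VALUES for v in red + ir):
            continue
        bags.setdefault(countyid, []).append((lat, lon, red, ir))
    return bags


def parse_yields(yield_lines):
    """Read yield records: {countyid: (corn_yield, wheat_yield)}."""
    yields = {}
    for line in yield_lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        yields[fields[0]] = (float(fields[1]), float(fields[2]))
    return yields


def regression_pairs(bags, yields, crop="corn"):
    """Pair each bag with its target, omitting zero or missing yields."""
    if crop not in ("corn", "wheat"):
        raise ValueError("crop must be 'corn' or 'wheat'")
    idx = 0 if crop == "corn" else 1
    pairs = []
    for countyid, instances in bags.items():
        target = yields.get(countyid, (0.0, 0.0))[idx]
        if target > 0:  # 0 may mean "not grown" or "not reported"
            pairs.append((countyid, instances, target))
    return pairs
```

A typical use, with the file names listed above:

  with open("CA01.txt") as f:
      bags = build_bags(f)
  with open("yields-CA-2001.txt") as f:
      yields = parse_yields(f)
  corn_pairs = regression_pairs(bags, yields, crop="corn")

----------------------------------------------------------------------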