IMBUE: Interactive Machine Learning for Big Data Understanding and Explanation


We seek to develop innovative software solutions to big data problems. Our problem scope includes challenges associated with overwhelming data volumes in streaming applications, massive data archives, and in-situ operations with limited communication bandwidth. The unifying theme is the use of machine learning to elicit, model, and incorporate investigator preferences into fast, automated analysis. Innovative aspects of our solutions include:

  1. The ability to adapt to multiple users and priorities,
  2. A focus on producing interpretable summaries and explanations for system decisions, and
  3. Context-aware, introspective systems that can reason about their own capabilities.

These methods can also assist data analysis and real-time decisions in mission operations.

Comparison to state of the art

Several data triage systems exist for detecting rare events of high scientific interest amidst large (but not yet overwhelming) data volumes. The V-FASTR system at the VLBA is at the forefront of commensal (piggyback) real-time detection for radio astronomy [Wayth et al., 2011]. It autonomously detects transient radio events such as pulses from pulsars and other astrophysical phenomena. The Virtual Observatory performs a similar function across a network of optical telescopes, and the OGLE survey autonomously detects rare exoplanet microlensing events.

However, existing data triage systems are model-driven "one-size-fits-none" solutions. They generally obtain detection, excision, and prioritization rules exclusively from prior physical models or laboratory measurements. These rules seldom extrapolate to new science goals and users, let alone to the wide range of observing conditions experienced by a petascale instrument. For example, observing conditions can change from one observation to the next, as in radio transient detection, where the noise environment fluctuates strongly due to local interference. Further, most instruments must support multiple scientific goals and users. This is currently achieved through manual discussion and prioritization, which is too slow to keep up with anticipated data streams.

Connections to science investigations

We are working with scientists in a variety of disciplines that face big data challenges. Our collaborations include:

  • VLBA Radio astronomy (transients): The commensal V-FASTR (VLBA Fast Transient) detection system has been actively analyzing radio astronomy data collected by the VLBA since July 2011. It serves as a trailblazer in commensal transient detection and a model for similar future systems to be employed by other radio telescope facilities and the Square Kilometre Array. We developed, maintain, and continue to extend the capabilities of this real-time system. The latest results are visible at the V-FASTR web portal.
  • PTF Optical astronomy (transients): The Palomar Transient Factory (PTF) detects thousands of candidate optical transients every day. We developed a real/bogus classifier to separate true transient events from false detections due to noise or artifacts in the data.
  • Kepler Optical astronomy (exoplanet detection): The Kepler telescope has observed more than 150,000 stars for over 4 years, yielding 17 TB of data. A key challenge is how best to prioritize planetary candidates for potential follow-up with ground-based telescopes to determine which stars are true exoplanet hosts. We are using the DEMUD algorithm to automatically prioritize the candidates and to explain which features make each one interesting.
  • ChemCam Planetary science (Martian mineralogy): The ChemCam spectrometer on the Mars Science Laboratory rover uses a laser to zap rocks and reveal their chemical composition. The mission has already accumulated thousands of complex, individual spectra. We are analyzing these spectra to identify interesting or unusual observations in the context of the entire mission.

Similar data prioritization challenges exist for Earth orbital multi-angle data (from MISR), hyperspectral data (from AVIRIS), and exoplanet atmospheric spectra (from Spitzer).
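
To make the idea of prioritization with explanations concrete, here is a minimal sketch of novelty-based ranking. It is a simplified illustration, not the DEMUD implementation itself: it scores each unseen item by how poorly a low-rank model of the already-selected items reconstructs it, and reports the per-feature residual as the explanation. The function names (`fit_model`, `residuals`, `prioritize`), the choice to seed with the item farthest from the data mean, and the full SVD refit at every step are all assumptions made for brevity.

```python
import numpy as np

def fit_model(seen, k):
    """Fit a mean and up to k principal directions to the items selected so far."""
    mu = seen.mean(axis=0)
    _, s, vt = np.linalg.svd(seen - mu, full_matrices=False)
    keep = s > 1e-10              # drop directions with no variance
    return mu, vt[keep][:k]

def residuals(X, mu, basis):
    """Per-feature part of each row of X that the low-rank model cannot reconstruct."""
    centered = X - mu
    return centered - centered @ basis.T @ basis

def prioritize(X, k=2, n_select=3):
    """Return [(index, explanation), ...] in priority order.

    The explanation is the residual vector; features with large absolute
    residuals are the ones that made the item look novel. The seed item
    has no explanation (None) because no model exists yet.
    """
    X = np.asarray(X, dtype=float)
    remaining = list(range(len(X)))
    # Assumption: seed with the item farthest from the overall data mean.
    first = int(np.argmax(np.linalg.norm(X - X.mean(axis=0), axis=1)))
    remaining.remove(first)
    ranking = [(first, None)]
    while remaining and len(ranking) < n_select:
        seen = X[[i for i, _ in ranking]]
        mu, basis = fit_model(seen, k)
        res = residuals(X[remaining], mu, basis)
        best = int(np.argmax((res ** 2).sum(axis=1)))
        ranking.append((remaining.pop(best), res[best]))
    return ranking

if __name__ == "__main__":
    # Hypothetical toy data: three near-identical items and two outliers.
    data = [[0, 0, 0], [0.1, 0, 0], [0, 0.1, 0], [5, 0, 0], [0, 0, 7]]
    for index, why in prioritize(data):
        top = None if why is None else int(np.argmax(np.abs(why)))
        print(index, "most novel feature:", top)
```

Each selected item is folded into the model before the next one is scored, so the ranking favors items unlike anything already chosen. A production system would likely update the model incrementally rather than refitting it from scratch at every step.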