
class: center, middle

.title[Big Data in Time] 
.subtitle[Progress and Challenges from Oceanography] 
  
 
  
  
  
.author[Jonathan Lilly] 
 
.institution[NorthWest Research Associates, Seattle] 
  
.coauthor[Sofia Olhede<sup>1</sup>, Adam Sykulski<sup>1,2</sup>, Jeffrey Early<sup>2</sup>] 
.institution[<sup>1</sup>University College London, <sup>2</sup>NWRA] 
  
 
.date[January 8, 2016] 
  
  
  
.center[[{www.jmlilly.net}](http://www.jmlilly.net)] 
 
.footnote[Created with [{Remark.js}](http://remarkjs.com/) using [{Markdown}](https://daringfireball.net/projects/markdown/) + [{MathJax}](https://www.mathjax.org/)]

---
class: center
#  What is “Big” Data?   
--

Too big for one personal computer, or ... 
--

... just too big for one person.strike[al computer].  
--
        
        
Too big to analyze and inspect manually.  
--

##  Strong Implications for Analysis Methods
--

Must be fully objective and autonomous.    
--

Cannot rely on free or “tunable” parameters.  
--

Must be testable / ground-truthable!

---
class: center
## An Oceanographic Time Series Dataset
 
<img style="width:100%" src="../figures/highresspaghetti.png">
 
Trajectories from NOAA's Global Drifter Program 
30 million data points, 20 thousand time series 
 
(That's big for oceanography!) 
 
---
class: center
## Surface Drifter Basics
 
.left-column[<img style="width:100%;" src="../figures/drifterschematic.gif">] 
.right-column[<img style="width:100%;margin-top:-1em;" src="../figures/drifterpic1.jpg"> 
 
A basketball-sized surface float with a 6 m long drogue centered at 15 m depth.

Position communicated by satellite roughly every hour.]

---
class: center
## How to Deploy a Surface Drifter
 
<img style="width:100%;margin-top:-0.7em;margin-bottom:-0.8em" src="../figures/deploying.jpg">
 
---
class: center
## Animation of the Surface Drifter Dataset
 
<video preload="auto" width="100%" height="auto" data-setup="{}" autoplay loop controls><source src="../videos/driftermovie.mp4" type="video/mp4" /></video>

Note: non-uniform sampling distribution, temporal variation,  
        superposition of spatial scales

---
class: center
## Mean Surface Current Speed
 
<img style="width:100%;margin-top:-0.7em;margin-bottom:-0.8em" src="../figures/globalinertialspeed.png">
 
Formed by binning in latitude and longitude, then averaging. 
Easy to compute maps of low-order statistics: mean, variance, etc.
 
Clearly does not capture full richness of dataset. 
What else can be done?
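
---
class: left
### Sketch: Binning and Averaging

A minimal sketch of the bin-and-average step, assuming positions in degrees and one value (e.g. speed) per sample. The function name `bin_mean` and the one-degree default are illustrative choices, not taken from the talk's toolbox.

```python
import numpy as np

def bin_mean(lat, lon, values, bin_deg=1.0):
    """Average `values` in latitude-longitude bins of width `bin_deg` degrees."""
    lat_edges = np.linspace(-90.0, 90.0, int(round(180.0 / bin_deg)) + 1)
    lon_edges = np.linspace(-180.0, 180.0, int(round(360.0 / bin_deg)) + 1)
    bins = [lat_edges, lon_edges]
    # Sum of the values and number of samples falling in each bin.
    total, _, _ = np.histogram2d(lat, lon, bins=bins, weights=values)
    count, _, _ = np.histogram2d(lat, lon, bins=bins)
    # Bins with no samples stay NaN rather than zero.
    mean = np.full(count.shape, np.nan)
    np.divide(total, count, out=mean, where=count > 0)
    return mean, count

# Four made-up samples: three share a bin, one is far away.
lat = np.array([10.2, 10.7, 10.4, -33.5])
lon = np.array([-150.3, -150.9, -150.1, 20.5])
speed = np.array([0.2, 0.4, 0.6, 1.0])
mean_speed, count = bin_mean(lat, lon, speed)
```

The `count` array is returned alongside the mean because the sampling is non-uniform, so the reliability of each bin varies.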

---
class: left
## Properties of the Surface Drifter Dataset
--
                
1.  Non-homogeneously distributed in space 
--
                
2.  Non-uniformly sampled in time  
--
        
3.  Irregular length or duration  
--
        
4.  Non-stationary statistics         
--
        
5.  Superposition of various physical processes       
--

## Processes of interest  
--
        
1.  Low-frequency diffusive behavior  
--
        
2.  Intermediate-frequency coherent eddy structures  
--
        
3.  Higher-frequency wave motions  
--
        
4.  Small-scale roughness  
--

## Strategy  
--

Construct particular analysis methods aimed at isolating various physical processes.  Here, focus on dispersion (#1) and eddies (#2).   
--

Initially, work within simplified dynamics of numerical models.        
        
---
class: center

### A Numerical Simulation of an Unstable Current
 
<img style="width:65%" src="../figures/qqb.png">
 
A highly idealized version of the Gulf Stream. 
Note formation of vortices or “coherent eddies”.
 
---
class: center

### Identifying Vortices from Particle Trajectories
 
<video poster="../figures/vortexframe68.jpg" preload="auto" style="margin-top:0em;" width="70%" height="auto" data-setup="{}" autoplay loop controls><source src="../videos/vortexmovie.mp4" type="video/mp4" /></video>

Vortices (loops) are identified using only particle trajectories (dots).  
---
class: center

### Extraction of Time-Varying Vortex Currents
 
<img style="width:70%" src="../figures/vortex-decomposition.png">
 
Vortices can be identified, studied, and distinguished from waves. 
From Lilly, Scott, and Olhede (2011), *GRL*.

---
class: center
## Application to the Global Drifter Dataset
 
<img style="width:100%" src="../figures/highresspaghetti.png">
 
Apply eddy detection algorithm to 20K time series. 
Use a range of frequencies defined relative to the Coriolis frequency `$f$`.
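
---
class: left
### Sketch: The Coriolis Frequency

The Coriolis frequency is `$f = 2\Omega\sin\phi$`, with `$\Omega$` the Earth's rotation rate and `$\phi$` the latitude. The sketch below computes it, together with a frequency band expressed as fractions of `$|f|$`. The band limits `low` and `high` are hypothetical placeholders, not the values used in the analysis.

```python
import numpy as np

OMEGA = 7.2921e-5  # Earth's rotation rate in rad/s

def coriolis_frequency(lat_deg):
    """Coriolis frequency f = 2 * Omega * sin(latitude), in rad/s."""
    return 2.0 * OMEGA * np.sin(np.deg2rad(lat_deg))

def frequency_band(lat_deg, low=0.1, high=1.0):
    """Hypothetical search band (low * |f|, high * |f|), in rad/s."""
    f = np.abs(coriolis_frequency(lat_deg))
    return low * f, high * f

f45 = coriolis_frequency(45.0)
inertial_period_hours = 2.0 * np.pi / abs(f45) / 3600.0
band_low, band_high = frequency_band(45.0)
```

Because `$f$` changes sign across the equator and vanishes on it, a band defined this way narrows toward low latitudes.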
 
---
class: center
### Eddy Detection in the Global Drifter Dataset
 
<img style="width:100%" src="../figures/alleddies.png"> 
red = same direction as Earth's rotation, blue = opposite

---
class: center
### Statistically Significant Eddies 
 
<img style="width:100%" src="../figures/significanteddies.png"> 
After comparison with a null hypothesis of red noise.
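
---
class: left
### Sketch: Testing Against Red Noise

One common way to build a red-noise null hypothesis is a Monte Carlo test against first-order autoregressive, AR(1), surrogates. The sketch below is an illustration of that general idea, not necessarily the test used for the figure. The statistic `peak_power` and all function names are assumptions.

```python
import numpy as np

def ar1_surrogates(x, n_surrogates, rng):
    """AR(1) series matching the variance and lag-1 autocorrelation of x."""
    x = np.asarray(x, dtype=float)
    x = x - np.mean(x)
    n = len(x)
    phi = np.sum(x[1:] * x[:-1]) / np.sum(x * x)   # lag-1 autocorrelation
    sigma = np.sqrt(np.var(x) * (1.0 - phi**2))    # innovation standard deviation
    eps = sigma * rng.standard_normal((n_surrogates, n))
    y = np.empty_like(eps)
    y[:, 0] = eps[:, 0] / np.sqrt(1.0 - phi**2)    # draw from the stationary distribution
    for k in range(1, n):
        y[:, k] = phi * y[:, k - 1] + eps[:, k]
    return y

def peak_power(x):
    """Largest periodogram value along the last axis (illustrative statistic)."""
    n = np.size(x, -1)
    return np.max(np.abs(np.fft.rfft(x, axis=-1)) ** 2, axis=-1) / n

def p_value(observed, null_values):
    """Fraction of null statistics at least as large as the observed one."""
    null_values = np.asarray(null_values)
    return (1 + np.sum(null_values >= observed)) / (1 + len(null_values))

# Synthetic example: an oscillation with a period of 8 samples, plus noise.
rng = np.random.default_rng(0)
t = np.arange(512)
x = np.cos(2 * np.pi * t / 8) + 0.5 * rng.standard_normal(512)
surrogates = ar1_surrogates(x, 200, rng)
p = p_value(peak_power(x), peak_power(surrogates))
```

A small `p` means that red noise with the same variance and persistence rarely produces an oscillation this strong.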
 
---
class: left
 
### Analyzing Modulated Multivariate Oscillations
--
 
We have developed a method for analyzing and interpreting *time-varying* properties of quasi-periodic or quasi-oscillatory signals.

*  The notion of the *analytic signal* lets an *instantaneous* amplitude and frequency be associated with a given time series.  
.cite[Gabor (1946), Vakman and Vainshtein (1977), Cohen (1995), Picinbono (1997)]
--

*  Similarly, a bivariate (x,y) signal defines the geometrical properties of a time-varying ellipse. 
.cite[Lilly and Olhede (2010a)]  
--
        
*  This may be extended to 3D (e.g. seismographs) or N-D signals.  .cite[Lilly (2011), Lilly and Olhede (2012a)]
--

*  Modulated oscillations can be extracted from noise using the *wavelet ridge method*, a local best fit onto an oscillatory test function. .cite[Delprat et al. (1992), Mallat (1999), Lilly and Olhede (2010b)]  
--

*  Modulated multivariate oscillations can be treated similarly  
.cite[Lilly and Olhede (2012a)]  
--

*  Errors are proportional to modulation strength, and are minimized by a suitable choice of wavelet.  
.cite[Lilly and Olhede (2010b), Lilly and Olhede (2012a)]   
--

*  Best choice of wavelet is found by considering a superfamily encompassing all other continuous analytic wavelets.   
.cite[Lilly and Olhede (2009, 2012b)]
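
---
class: left
### Sketch: Instantaneous Amplitude and Frequency

A minimal sketch of the analytic signal idea for a univariate series, built here with a plain FFT by removing negative frequencies. This shows the basic construction only, not the wavelet ridge method, and the function names are illustrative.

```python
import numpy as np

def analytic_signal(x):
    """Analytic signal of a real series: zero the negative frequencies, double the positive."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    weights = np.zeros(n)
    weights[0] = 1.0                      # keep the mean as is
    if n % 2 == 0:
        weights[n // 2] = 1.0             # Nyquist term for even lengths
        weights[1:n // 2] = 2.0
    else:
        weights[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(np.fft.fft(x) * weights)

def instantaneous_moments(x, dt=1.0):
    """Instantaneous amplitude and frequency (radians per unit time)."""
    z = analytic_signal(x)
    amplitude = np.abs(z)
    phase = np.unwrap(np.angle(z))
    frequency = np.gradient(phase, dt)    # time derivative of the phase
    return amplitude, frequency

# Amplitude-modulated test signal: 20 cycles with a slowly varying envelope.
dt = 0.001
t = np.arange(1000) * dt
envelope = 1.0 + 0.5 * np.cos(2 * np.pi * t)
x = envelope * np.cos(2 * np.pi * 20 * t)
amplitude, frequency = instantaneous_moments(x, dt)
```

For this test signal the recovered amplitude follows `envelope`, and the frequency stays at 20 cycles per unit time (`2 * np.pi * 20` in radians).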

---
class: center

## Identifying Vortices in 2D Turbulence
 
<video preload="auto" width="100%" style="margin-top:-1.1em;" height="auto" data-setup="{}" autoplay loop controls><source src="../videos/fplanemovie.mp4" type="video/mp4" /></video>
From a 2D model of isotropic forced / dissipative turbulence. 
 
---
class: center

### Particle Trajectories in 2D Turbulence versus a Three-Parameter Stochastic Model
 
<video style="margin-top:-1.3em;margin-left:-7em" preload="auto" width="138%" height="auto" data-setup="{}" autoplay loop controls><source src="../videos/maternparticlemovie.mp4" type="video/mp4" /></video>
Left: all trajectories  |  Middle: eddy-free trajectories 
Right: stochastic model fit to eddy-free trajectories

The stochastic model is *damped* fractional Brownian motion.

---
class: center

## Modeling Dispersion in 2D Turbulence
 
<video style="margin-top:-0.5em;" preload="auto" width="100%" height="auto" data-setup="{}" autoplay loop controls><source src="../videos/dispersionmovie.mp4" type="video/mp4" /></video>

Particle trajectories are all offset to begin at the origin.
The right-hand side is a three-parameter model, fit to the data.
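
---
class: left
### Sketch: Offsetting Trajectories and Measuring Dispersion

A minimal sketch of the offset step, assuming each trajectory is stored as one row of a complex array `z = x + iy`. The mean squared displacement used here is one simple dispersion measure, chosen for illustration.

```python
import numpy as np

def offset_to_origin(z):
    """Shift each trajectory (one per row) so that it starts at the origin."""
    z = np.asarray(z, dtype=complex)
    return z - z[:, :1]

def dispersion(z):
    """Mean squared displacement from the starting point, as a function of time."""
    return np.mean(np.abs(offset_to_origin(z)) ** 2, axis=0)

# Two made-up trajectories with three time steps each.
z = np.array([[1 + 1j, 2 + 1j, 4 + 1j],
              [0 + 0j, 0 + 1j, 0 + 3j]])
shifted = offset_to_origin(z)
msd = dispersion(z)
```

Storing positions as complex numbers keeps both coordinates in one array, which matches the complex-valued treatment of trajectories elsewhere in the talk.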
    
---
class: left
        
### Modeling Trajectories in Ocean Turbulence
--
        
We have developed a method for stochastic modeling of trajectories in ocean turbulence, and inferring parameters from large datasets.  
--

*  Create an appropriate stochastic model for particle trajectories. 
.cite[Sykulski, Olhede, Lilly, and Danioux (2015)]  
--

* A key ingredient is a *damped* version of fractional Brownian motion.  .cite[Lilly, Sykulski, Early, and Olhede (2016), in prep.]  
--

* Parameter estimation is best done in the frequency domain.  
.cite[Whittle (1953)]  
--
        
* Parameter estimation must be adjusted to handle *complex-valued* time series. .cite[Sykulski, Olhede, Lilly, and Early (2016a), in prep.]    
--
        
* Parameter estimation must be corrected for bias due to small sample size effects.  .cite[Sykulski, Olhede, Lilly, and Early (2016b), in prep.]
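
---
class: left
### Sketch: Whittle Likelihood for a Complex Series

A minimal sketch of frequency-domain estimation. The Whittle likelihood compares the periodogram `$I(\omega)$` with a model spectrum `$S(\omega)$` through the sum of `$\log S + I/S$` over Fourier frequencies. The three-parameter spectral form below is an assumed Matérn-type shape, and the bias correction mentioned above is not included.

```python
import numpy as np

def periodogram(z, dt=1.0):
    """Two-sided periodogram of a (possibly complex) series at nonzero frequencies."""
    z = np.asarray(z)
    n = len(z)
    omega = 2.0 * np.pi * np.fft.fftfreq(n, d=dt)   # positive and negative frequencies
    power = (dt / n) * np.abs(np.fft.fft(z - np.mean(z))) ** 2
    keep = omega != 0.0                             # the mean carries no information here
    return omega[keep], power[keep]

def matern_spectrum(omega, amplitude, slope, damping):
    """Assumed form A^2 / (omega^2 + lambda^2)^alpha; constants are absorbed into A."""
    return amplitude**2 / (omega**2 + damping**2) ** slope

def whittle_nll(params, omega, power, spectrum=matern_spectrum):
    """Negative Whittle log-likelihood, to be minimized over `params`."""
    s = spectrum(omega, *params)
    return np.sum(np.log(s) + power / s)

# Placeholder complex-valued series (white noise), just to exercise the functions.
rng = np.random.default_rng(0)
z = rng.standard_normal(1024) + 1j * rng.standard_normal(1024)
omega, power = periodogram(z)
nll = whittle_nll((1.0, 1.0, 0.1), omega, power)
```

For a complex-valued series the positive and negative frequencies carry different information, so both sides of the spectrum enter the sum. In practice `whittle_nll` would be passed to a numerical optimizer.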

---
class: left
# Important Lessons
--
   
Problem: Creating a “hands-free” version of a sophisticated analysis method for a big dataset is really hard!    
--

In the time series literature, new methods are generally prototyped on a handful of test time series—sometimes highly idealized.  
--
             
                
## Distribution of time and effort
--
        
* 5%  Determining what features of time series to examine.
--
        
*  90%  Building and testing a suitable analysis method.
--
    
* 5%  Applying the method to the dataset of interest.
--

## Solution
--
 
* Long-term interactions of experts in theory and application.
--
 
* Clear, two-way communication using a shared vocabulary. 
--
 
* Careful refinement of approach and analysis goals.
--
 
* Patience.

---
class: center, middle 
# Thank you! 
 
.center[Visit [{www.jmlilly.net}](http://www.jmlilly.net) for papers, this talk, and a Matlab toolbox of all numerical code.] 
 
 
 .center[.footnote[P.S. Like the way this presentation looks? Check out [{Liminal}](https://github.com/jonathanlilly/liminal).]]