class: center, middle

.title[Big Data in Time]

.subtitle[Progress and Challenges from Oceanography]

.author[Jonathan Lilly]

.institution[NorthWest Research Associates, Seattle]

.coauthor[Sofia Olhede<sup>1</sup>, Adam Sykulski<sup>1,2</sup>, Jeffrey Early<sup>2</sup>]

.institution[<sup>1</sup>University College London, <sup>2</sup>NWRA]

.date[January 8, 2016]

.center[[{www.jmlilly.net}](http://www.jmlilly.net)]

.footnote[Created with [{Remark.js}](http://remarkjs.com/) using [{Markdown}](https://daringfireball.net/projects/markdown/) + [{MathJax}](https://www.mathjax.org/)]

---
class: center

# What is “Big” Data?

--

Too big for one personal computer, or ...

--

... just too big for one person.strike[al computer].

--

Too big to analyze and inspect manually.

--

## Strong Implications for Analysis Methods

--

Must be fully objective and autonomous.

--

Cannot rely on free or “tunable” parameters.

--

Must be testable / ground-truthable!

---
class: center

## An Oceanographic Time Series Dataset
Trajectories from NOAA's Global Drifter Program

30 million data points, 20 thousand time series

(That's big for oceanography!)

---
class: center

## Surface Drifter Basics

.left-column[
]

.right-column[
A basketball-sized surface float with a 6 m long drogue centered at 15 m depth.

Position communicated by satellite roughly every hour.
]

---
class: center

## How to Deploy a Surface Drifter
---
class: center

## Animation of the Surface Drifter Dataset
Note: non-uniform sampling distribution, temporal variation, superposition of spatial scales

---
class: center

## Mean Surface Current Speed
Formed by binning in latitude and longitude, then averaging.

Easy to compute maps of low-order statistics: mean, variance, etc.

Clearly does not capture full richness of dataset. What else can be done?

---
class: left

## Properties of the Surface Drifter Dataset

--

1. Non-homogeneously distributed in space

--

2. Non-uniformly sampled in time

--

3. Irregular length or duration

--

4. Non-stationary statistics

--

5. Superposition of various physical processes

--

## Processes of Interest

--

1. Low-frequency diffusive behavior

--

2. Intermediate-frequency coherent eddy structures

--

3. Higher-frequency wave motions

--

4. Small-scale roughness

--

## Strategy

--

Construct particular analysis methods aimed at isolating various physical processes. Here, focus on dispersion (#1) and eddies (#2).

--

Initially, work within simplified dynamics of numerical models.

---
class: center

### A Numerical Simulation of an Unstable Current
A highly idealized version of the Gulf Stream.

Note formation of vortices or “coherent eddies”.

---
class: center

### Identifying Vortices from Particle Trajectories
Vortices (loops) are identified using only particle trajectories (dots).

---
class: center

### Extraction of Time-Varying Vortex Currents
Vortices can be identified, studied, and distinguished from waves.

From Lilly, Scott, and Olhede (2011), *GRL*.

---
class: center

## Application to the Global Drifter Dataset
Apply eddy detection algorithm to 20K time series.

Use a range of frequencies compared to the Coriolis frequency `\(f\)`.

---
class: center

### Eddy Detection in the Global Drifter Dataset
red = same direction as Earth's rotation, blue = opposite

---
class: center

### Statistically Significant Eddies
After comparison with a null hypothesis of red noise.

---
class: left

### Analyzing Modulated Multivariate Oscillations

--

We have developed a method for analyzing and interpreting *time-varying* properties of quasi-periodic or quasi-oscillatory signals.

* The notion of the *analytic signal* lets an *instantaneous* amplitude and frequency be associated with a given time series. .cite[Gabor (1946), Vakman and Vainshtein (1977), Cohen (1995), Picinbono (1997)]

--

* Similarly, a bivariate (x,y) signal defines the geometrical properties of a time-varying ellipse. .cite[Lilly and Olhede (2010a)]

--

* This may be extended to 3D (e.g. seismographs) or N-D signals. .cite[Lilly (2011), Lilly and Olhede (2012a)]

--

* Modulated oscillations can be extracted from noise using the *wavelet ridge method*, a local best fit onto an oscillatory test function. .cite[Delprat et al. (1992), Mallat (1999), Lilly and Olhede (2010b)]

--

* Modulated multivariate oscillations can be treated similarly. .cite[Lilly and Olhede (2012a)]

--

* Errors are proportional to modulation strength, and are minimized by a suitable choice of wavelet. .cite[Lilly and Olhede (2010b), Lilly and Olhede (2012a)]

--

* Best choice of wavelet is found by considering a superfamily encompassing all other continuous analytic wavelets. .cite[Lilly and Olhede (2009, 2012b)]

---
class: center

## Identifying Vortices in 2D Turbulence
From a 2D model of isotropic forced / dissipative turbulence.

---
class: center

### Particle Trajectories in 2D Turbulence versus a Three-Parameter Stochastic Model
Left: all trajectories | Middle: eddy-free trajectories | Right: stochastic model fit to eddy-free trajectories

The stochastic model is *damped* fractional Brownian motion.

---
class: center

## Modeling Dispersion in 2D Turbulence
Particle trajectories are all offset to begin at the origin.

The right-hand side is a three-parameter model, fit to the data.

---
class: left

### Modeling Trajectories in Ocean Turbulence

--

We have developed a method for stochastic modeling of trajectories in ocean turbulence, and inferring parameters from large datasets.

--

* Create an appropriate stochastic model for particle trajectories. .cite[Sykulski, Olhede, Lilly, and Danioux (2015)]

--

* A key ingredient is a *damped* version of fractional Brownian motion. .cite[Lilly, Sykulski, Early, and Olhede (2016), in prep.]

--

* Parameter estimation is best done in the frequency domain. .cite[Whittle (1953)]

--

* Parameter estimation must be adjusted to handle *complex-valued* time series. .cite[Sykulski, Olhede, Lilly, and Early (2016a), in prep.]

--

* Parameter estimation must be corrected for bias due to small sample size effects. .cite[Sykulski, Olhede, Lilly, and Early (2016b), in prep.]

---
class: left

# Important Lessons

--

Problem: Creating a “hands-free” version of a sophisticated analysis method for a big dataset is really hard!

--

In the time series literature, new methods are generally prototyped on a handful of test time series, sometimes highly idealized.

--

## Distribution of Time and Effort

--

* 5% Determining what features of time series to examine.

--

* 90% Building and testing a suitable analysis method.

--

* 5% Applying the method to the dataset of interest.

--

## Solution

--

* Long-term interactions of experts in theory and application.

--

* Clear, two-way communication using a shared vocabulary.

--

* Careful refinement of approach and analysis goals.

--

* Patience.

---
class: center, middle

# Thank you!

.center[Visit [{www.jmlilly.net}](http://www.jmlilly.net) for papers, this talk, and a Matlab toolbox of all numerical code.]

.center[.footnote[P.S. Like the way this presentation looks? Check out [{Liminal}](https://github.com/jonathanlilly/liminal).]]
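---
class: left

### Backup: Binned Averaging, Sketched in Python

The mean surface current speed map is described as formed by binning in latitude and longitude, then averaging. The block below is a minimal sketch of that one step, assuming NumPy is available. The function name `binned_mean`, the 1° bin edges, and the four sample observations are illustrative assumptions, not the code or data behind the map in this talk.

```python
import numpy as np

def binned_mean(lon, lat, values, lon_edges, lat_edges):
    """Average `values` within each longitude/latitude bin.

    Returns an array of shape (len(lon_edges) - 1, len(lat_edges) - 1);
    bins containing no observations are set to NaN.
    """
    bins = [lon_edges, lat_edges]
    # Sum of the values falling in each bin.
    total, _, _ = np.histogram2d(lon, lat, bins=bins, weights=values)
    # Number of observations falling in each bin.
    count, _, _ = np.histogram2d(lon, lat, bins=bins)
    mean = np.full(count.shape, np.nan)
    np.divide(total, count, out=mean, where=count > 0)
    return mean

# Made-up observations (hypothetical): positions in degrees, speeds in cm/s.
lon = np.array([0.2, 0.7, 1.5, 1.6])
lat = np.array([0.1, 0.4, 0.5, 1.5])
speed = np.array([10.0, 20.0, 30.0, 50.0])
edges = np.array([0.0, 1.0, 2.0])
mean_speed = binned_mean(lon, lat, speed, edges, edges)
```

Replacing `values` with squared anomalies would give a variance map in the same way. Empty bins are kept as NaN so that unsampled regions are not mistaken for regions of zero speed.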
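---
class: left

### Backup: Analytic Signal, Sketched in Python

The analytic signal lets an instantaneous amplitude and frequency be associated with a time series. The block below is a minimal sketch of the standard FFT-based construction, assuming NumPy and a uniformly sampled real-valued series. The function names and the test cosine are illustrative assumptions. This is not the toolbox code referred to in this talk, and it does not implement the wavelet ridge method.

```python
import numpy as np

def analytic_signal(x):
    """Return the analytic signal of a real series via the FFT.

    Negative-frequency Fourier coefficients are set to zero and
    positive-frequency ones doubled, then the result is inverted.
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    weights = np.zeros(n)
    weights[0] = 1.0                      # keep the mean unchanged
    if n % 2 == 0:
        weights[n // 2] = 1.0             # keep the Nyquist term unchanged
        weights[1:n // 2] = 2.0           # double positive frequencies
    else:
        weights[1:(n + 1) // 2] = 2.0     # double positive frequencies
    return np.fft.ifft(np.fft.fft(x) * weights)

def instantaneous_amplitude_frequency(x, dt=1.0):
    """Instantaneous amplitude, and frequency in cycles per unit time."""
    z = analytic_signal(x)
    amplitude = np.abs(z)
    phase = np.unwrap(np.angle(z))
    # Finite-difference phase derivative; one sample shorter than x.
    frequency = np.diff(phase) / (2.0 * np.pi * dt)
    return amplitude, frequency

# Hypothetical test signal: amplitude 2, exactly 16 cycles in 256 samples.
t = np.arange(256)
x = 2.0 * np.cos(2.0 * np.pi * 16 * t / 256)
amplitude, frequency = instantaneous_amplitude_frequency(x)
```

For this test signal the recovered amplitude is 2 and the recovered frequency is 16/256 cycles per sample, to within rounding. For a bivariate series, the same construction can be applied to x and y separately as a starting point.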