Distributional Analysis

Distributional analysis is a term I coined for a very simple yet powerful way of analyzing datasets. It means that you think of the dataset as a distribution within a large multidimensional space, which you then can examine through its marginal statistics in any two-dimensional subspace.

The best way to understand this is through examples. So let's turn our attention to exploring a dataset of freely drifting subsurface oceanographic floats. These instruments record latitude, longitude, and temperature as they drift around with the currents at more-or-less fixed pressure levels. You'll need to download "floats.nc" from my web site.

You'll also need to have my jlab toolbox installed.

The first thing you notice is that most of the variables are things called "cells". A cell array is a very useful type of data structure in Matlab, so it's worthwhile to take a few minutes to introduce it to you.

Introduction to Cell Arrays

The floats.mat dataset consists primarily of the positions of floats as they follow the currents, giving us time series of latitude and longitude. However, the float trajectories all have different lengths.

Let's say that we were to try to store this data as a matrix. We could put time in rows, with each column corresponding to a different float. By the way, this is a good time to mention that in my convention, time always increases down the rows. In other words, if I work with a single time series, I always use a column vector, and not a row vector. In Matlab, x=[1:10]' is a column vector while x=[1:10] is a row vector.

If we were to store the float data as a matrix, then the number of rows would have to be the same as the number of data points in the longest trajectory. This would be very inefficient because we would then have a lot of empty space in the matrix. It works ok for small datasets, but for large datasets you can quickly run into memory trouble. So it doesn't really make sense to store this data as a matrix.
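To put a number on the wasted space, here is a minimal sketch with three made-up trajectory lengths. The lengths and the stand-in data are purely illustrative and are not taken from the float dataset.

```matlab
% Hypothetical trajectory lengths for three floats (illustrative values only)
len = [1000 10 5];

% Pad every trajectory with NaNs up to the length of the longest one
padded = nan(max(len), numel(len));
for k = 1:numel(len)
    padded(1:len(k), k) = (1:len(k))';   % stand-in data for float k
end

nstored = numel(padded);              % entries the matrix has to hold
ndata   = sum(~isnan(padded(:)));     % entries that are actual data
fprintf('%d of %d entries are padding\n', nstored - ndata, nstored);
```

With one long trajectory and two short ones, most of the matrix is padding, and the problem only gets worse as the number of floats grows.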

A cell array offers us an easy solution. A cell array is like a regular array, except the entries can be anything. Here is an example:
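A minimal stand-in example, since any short cell array will do. The variable name a matches the text that follows, but the particular strings are my own illustrative choice.

```matlab
% A 1-by-3 cell array whose entries are character strings
% (the contents are illustrative only)
a = {'hello', 'world', 'again'};

% Each entry can have a different length, which a char matrix would not allow
len = cellfun(@length, a);
```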

Individual entries of cell arrays are accessed through curly braces, whereas parentheses let us access a subset of the cell array itself. Note that a(1:2) is another cell array, whereas a{1} is a string in this case.
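The difference between the two kinds of indexing is easy to check directly. This is a small sketch using an illustrative cell array of strings; the contents are assumed, not taken from the dataset.

```matlab
% Illustrative cell array of strings
a = {'hello', 'world', 'again'};

first = a{1};     % curly braces return the contents of the entry
sub   = a(1:2);   % parentheses return a smaller cell array

disp(class(first))   % class of the entry itself
disp(class(sub))     % class of the subset
```

Here first is the character string stored in the first entry, while sub is a 1-by-2 cell array containing the first two entries.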

Cell arrays can also be of different sizes, and can even combine objects of different types:
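As a sketch, here is a two-by-two cell array mixing several types. The variable name b matches the text, while the specific contents are illustrative assumptions.

```matlab
% A 2-by-2 cell array mixing different types (contents are illustrative)
b = cell(2, 2);
b{1,1} = 'temperature';   % a character string
b{1,2} = 7;               % an integer value (stored as a double by default)
b{2,1} = [1 2 3];         % a numeric row vector
b{2,2} = {'nested'};      % even another cell array
```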

Now b{1,1} is a string while b{1,2} is an integer. As you can see, cell arrays are pretty flexible. However, in what follows we are going to limit ourselves to working with one particular format: cell arrays of numerical arrays.

To keep track of arrays of different lengths, we can store them as a cell array.
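Here is a minimal sketch with three column vectors of different lengths. The names x, y, z, and c follow the text; the values are made up, and r is a hypothetical second variable added only to show the row-vector case.

```matlab
% Three column vectors of different lengths (values are illustrative)
x = [1 2 3]';
y = [4 5]';
z = [6 7 8 9]';

% Two-index assignment builds a cell array shaped like a column vector
c = {};
c{1,1} = x;
c{2,1} = y;
c{3,1} = z;

% Single-index assignment into an empty cell grows a row vector instead
r = {};
r{1} = x;
r{2} = y;
r{3} = z;

len = cellfun(@length, c);   % length of each stored array
```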

Now c is a cell array in the shape of a column vector, and each of its elements is itself a column vector. We can think of this as being like stacking x, y, and z on top of each other. (Note that if we had just written c{1}=x rather than c{1,1}=x and so forth, c would have become a row vector rather than a column vector.)

For simplicity we will refer to this format as "cell array format." This is the format used by floats.mat. This turns out to be hugely useful in many applications. I use it so often I wrote a whole module for working with data in this format called jCell, which you can read about by typing "help jcell".

For now though, I just wanted to explain cell arrays and the basic structure of the floats.mat variables.

We see that we have a dataset with many different variables and two dimensions: column and segment. The column dimension has all data points from all float trajectories concatenated together, with NaNs marking the tails, while the segment dimension has one element per float trajectory (e.g. float id number).
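To make the link between the concatenated column layout and cell array format concrete, here is a rough sketch in plain Matlab. It assumes each trajectory ends with exactly one NaN, and the data values are made up; it is not the code used to build the dataset.

```matlab
% One long column with a NaN at the tail of each trajectory
% (illustrative values, assuming exactly one NaN per trajectory)
col = [1; 2; 3; NaN; 4; 5; NaN; 6; NaN];

iend   = find(isnan(col));           % position of each terminating NaN
istart = [1; iend(1:end-1) + 1];     % first data point of each trajectory
n      = numel(iend);                % number of trajectories (segments)

% Split the column into a cell array with one entry per trajectory
c = cell(n, 1);
for k = 1:n
    c{k} = col(istart(k):iend(k)-1);
end

% Going back: append a NaN to each entry and stack them up again
col2 = cell2mat(cellfun(@(v) [v; NaN], c, 'UniformOutput', false));
```

The round trip recovers the original column, so the two layouts carry the same information.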

Exploring Floats.nc

After that detour into cell arrays, we're now ready to make a basic plot of floats.nc, with each trajectory drawn in a different color. A handy jCell function called cellplot lets us plot data in cell array format without needing to loop. Go ahead and evaluate the following, which might take a while because the dataset is pretty large:

Note that double-clicking on this plot will make it larger, and will also show it with better resolution.

The first argument to cellplot, 180, cuts the longitude as it crosses from +180 back to -180 or vice-versa, preventing horizontal lines that would otherwise appear in this plot.
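The idea behind cutting the longitude can be sketched in plain Matlab by inserting a NaN wherever the longitude jumps by more than 180 degrees, since plot breaks a line at NaNs. This is only an illustration of the idea with made-up positions, and not necessarily how cellplot does it internally.

```matlab
% A short made-up trajectory that crosses the dateline
lon = [170; 175; 179; -178; -174; -170];
lat = [10; 11; 12; 13; 14; 15];

% Indices of the points just before each wrap-around
jump = find(abs(diff(lon)) > 180);

% Insert NaNs after those points, working backwards so that the
% remaining indices stay valid as the arrays grow
lonc = lon;
latc = lat;
for k = numel(jump):-1:1
    i = jump(k);
    lonc = [lonc(1:i); NaN; lonc(i+1:end)];
    latc = [latc(1:i); NaN; latc(i+1:end)];
end
```

Plotting lonc against latc then draws two separate line segments instead of one line streaking across the whole map.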

Here topoplot is a convenient function for plotting bathymetry and topography. We will look at this more later so you don't need to worry about it too much for now.

The other command, latratio, sets the aspect ratio of the plot to be correct at the input latitude. In other words, latratio(30) sets the aspect ratio to be 1 to 1 at 30 degrees latitude. This enables us to decide what aspect ratio we want a latitude/longitude plot to have. Otherwise, stretching a figure window in Matlab will stretch latitude and longitude accordingly, and we can often end up with some really distorted-looking maps.
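The geometry here is that a degree of longitude covers cos(latitude) times the distance of a degree of latitude. A comparable effect can be sketched with Matlab's built-in daspect; this is my own illustration under that assumption and is not latratio's implementation. The plotted line is just placeholder data.

```matlab
lat0 = 30;                        % latitude at which the map should be 1:1

% 1/cosd(lat0) degrees of longitude span the same distance as
% 1 degree of latitude at this latitude
ratio = [1/cosd(lat0) 1 1];

fig = figure('Visible', 'off');   % hidden figure, for illustration only
plot([-80 -60], [20 40]);         % placeholder longitude/latitude data
daspect(ratio);                   % lock the data aspect ratio
close(fig);
```

Once the data aspect ratio is locked in this way, resizing the figure window no longer distorts the map.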

Two-Dimensional Histograms

After all of that we are finally ready to begin working with distributional analysis. First we want to examine the data density, that is, the number of observations per grid point. To do this we will make a two-dimensional histogram in latitude-longitude space.

Two-dimensional histograms are made in jLab by the routine twodhist. This is essentially the same functionality as Matlab's histcounts2, but as twodhist predates histcounts2, I continue to use the former function.

We see that the observation density is high throughout the Atlantic, and is particularly high in the Gulf of Mexico, the Gulf of Aden adjacent to the Red Sea, and several other hot spots.

The plotting command jpcolor is a version of Matlab's pcolor that does a better job of plotting the values in our matrix at their correct locations. Unlike pcolor, jpcolor does not cut off one row and one column of the plotted matrix. See the jpcolor help for details.

The call to twodhist requests a two-dimensional histogram of longitude and latitude locations. The last two arguments of twodhist specify the bin edges. In the output of twodhist, xmid and ymid are the bin centers corresponding to the input edge arrays, while mat is the number of data points per bin.

Here we have plotted the logarithm of the data density in order to better visualise the vast geographic differences in this quantity.
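If you don't have jLab at hand, the same kind of calculation can be sketched with Matlab's built-in histcounts2. The random positions and bin edges below are made up for illustration, and the transpose reflects my assumption that we want latitude along rows for plotting.

```matlab
% Synthetic positions standing in for float data (illustrative only)
rng(1);
lon = -80 + 60*rand(5000, 1);
lat =  20 + 40*rand(5000, 1);

% Bin edges in longitude and latitude
lonedges = -80:5:-20;
latedges =  20:5:60;

% histcounts2 returns counts with x along rows, so transpose to get
% latitude along rows and longitude along columns
mat = histcounts2(lon, lat, lonedges, latedges).';

% Bin centers corresponding to the edge arrays
xmid = (lonedges(1:end-1) + lonedges(2:end))/2;
ymid = (latedges(1:end-1) + latedges(2:end))/2;

% Logarithm of the data density; any empty bins would become -Inf
logmat = log10(mat);
```

Each entry of mat is the number of points falling in the bin centered at the corresponding xmid and ymid.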

Twodhist works whether the input arrays are in cell array format, as they are here, or are regular numerical arrays. Because we're just counting the number of data points in bins, the format of the input variables doesn't matter. They can be matrices, arrays, or arrays in cell array format, as long as they're all the same size as each other.

A final point is that twodhist works through a clever algorithm that uses no explicit loops, by directly looking up the index into the bin count matrix mat. This enables it to be fast (by Matlab standards) even for very large datasets.
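To give a flavor of loop-free binning, here is a rough sketch that computes each point's bin index directly and lets accumarray do the counting. It assumes uniformly spaced bins and uses made-up data; it illustrates the general idea and is not necessarily how twodhist is written.

```matlab
% A handful of made-up points
x = [0.5; 1.5; 1.7; 2.2; 0.1];
y = [0.2; 0.4; 1.9; 1.1; 0.3];

% Uniformly spaced bin edges (an assumption of this sketch)
xedges = 0:1:3;
yedges = 0:1:2;
dx = xedges(2) - xedges(1);
dy = yedges(2) - yedges(1);
nx = numel(xedges) - 1;
ny = numel(yedges) - 1;

% Look up the bin index of every point at once, with no loop
ix = floor((x - xedges(1))/dx) + 1;
iy = floor((y - yedges(1))/dy) + 1;

% Keep only points that fall inside the grid
ok = ix >= 1 & ix <= nx & iy >= 1 & iy <= ny;

% Accumulate a count of one per point into the bin count matrix
mat = accumarray([iy(ok) ix(ok)], 1, [ny nx]);
```

The whole calculation is a few vectorized operations, which is why this style of binning scales well to large datasets.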

Two-Dimensional Means and Standard Deviations

Another important statistic we can plot is the speed of the mean flow. The two-dimensional mean is computed in jLab with the function twodstats.

In the North Atlantic, we see that the mean flow is large over the Gulf Stream, as expected, as well as around the periphery of Greenland where there is a known boundary current. The high velocities in the South Atlantic and North Pacific correspond to areas where the preceding map showed a low sample density, and therefore, the mean flows in these areas are probably not well resolved.
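The speed of the mean flow is the magnitude of the bin-averaged velocity, which is different from the average of the speed. As a rough plain-Matlab sketch of a bin-averaged mean, here is one way to do it with discretize and accumarray; the positions, velocities u and v, and bin edges are all made up, and this is not the twodstats code.

```matlab
% Made-up positions and velocity components
lon = [0.5; 0.6; 1.5; 1.2];
lat = [0.5; 0.4; 0.5; 0.7];
u   = [1; 3; 0; 0];
v   = [0; 0; 2; 4];

% Bin edges: two longitude bins and one latitude bin
lonedges = 0:1:2;
latedges = 0:1:1;

% Bin index of each point; NaN marks points outside the grid
ix = discretize(lon, lonedges);
iy = discretize(lat, latedges);
ok = ~isnan(ix) & ~isnan(iy);
sz = [numel(latedges)-1, numel(lonedges)-1];

% Average each velocity component within each bin (NaN for empty bins)
ubar = accumarray([iy(ok) ix(ok)], u(ok), sz, @mean, NaN);
vbar = accumarray([iy(ok) ix(ok)], v(ok), sz, @mean, NaN);

% Speed of the mean flow
speed = sqrt(ubar.^2 + vbar.^2);
```

Bins with only a few points give a noisy mean, which is why the data density map is worth consulting alongside the mean flow map.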

Another interesting statistic is the velocity standard deviation. This is also implemented by twodstats, where it is returned as the fifth output argument.
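A binned standard deviation can be sketched in plain Matlab in the same spirit, by handing @std to accumarray. The data and bin edges are made up, only a single velocity component u is treated for simplicity, and this is an assumption-laden illustration, not the twodstats implementation.

```matlab
% Made-up positions and one velocity component
lon = [0.5; 0.6; 1.5; 1.2];
lat = [0.5; 0.4; 0.5; 0.7];
u   = [1; 3; 0; 0];

% Three longitude bins (the last one is left empty) and one latitude bin
lonedges = 0:1:3;
latedges = 0:1:1;

% Bin index of each point; NaN marks points outside the grid
ix = discretize(lon, lonedges);
iy = discretize(lat, latedges);
ok = ~isnan(ix) & ~isnan(iy);
sz = [numel(latedges)-1, numel(lonedges)-1];

% Standard deviation of u within each bin, with NaN for empty bins
ustd = accumarray([iy(ok) ix(ok)], u(ok), sz, @std, NaN);
```

Note that std normalizes by one less than the number of points, so a bin containing a single observation reports a standard deviation of zero rather than something meaningful.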