Distributional Analysis

Distributional analysis is a term I coined for a very simple yet powerful way of analyzing datasets. It means that you think of the dataset as a distribution within a large multidimensional space, which you can then examine through its marginal statistics in any two-dimensional subspace.

The best way to understand this is through examples. So let's turn our attention to exploring a dataset of freely-drifting subsurface oceanographic floats. These instruments record latitude, longitude, and temperature as they drift around with the currents at more-or-less fixed pressure levels. You'll need to download "floats.nc" from my web site.

We'll be using the Python packages NumPy, Matplotlib, SciPy, and Cartopy. You can find out more about those here:

https://numpy.org
https://matplotlib.org
https://www.scipy.org
https://scitools.org.uk/cartopy

In particular, here is a nice tutorial on Cartopy, a mapping package for Python:
https://coderzcolumn.com/tutorials/data-science/cartopy-basic-maps-scatter-map-bubble-map-and-connection-map

To start we'll import the required Python packages.

We see that we have a dataset with many different variables and two dimensions: column and segment. The column dimension has all data points from all float trajectories concatenated together, with NaNs marking the tail of each trajectory, while the segment dimension has one element per float trajectory (e.g. the float ID number).
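
To make the concatenated layout concrete, here is a minimal sketch of how NaN-terminated trajectories can be pulled apart again. It uses a small made-up array rather than floats.nc, and it assumes exactly one NaN at the end of each trajectory; the variable names are chosen for illustration only.

```python
import numpy as np

# Synthetic stand-in for one "column" variable: three short trajectories
# concatenated together, each followed by a NaN marking its tail.
lat = np.array([10.0, 10.5, 11.0, np.nan,
                -5.0, -4.5, np.nan,
                30.0, 30.2, 30.4, 30.6, np.nan])

# Indices of the NaN markers; each one closes a trajectory.
tails = np.flatnonzero(np.isnan(lat))

# Split just after each NaN, then drop the NaN from every piece.
pieces = np.split(lat, tails + 1)
trajectories = [p[~np.isnan(p)] for p in pieces if p.size > 0]

for k, traj in enumerate(trajectories):
    print(f"trajectory {k}: {traj.size} points")
```

The number of pieces recovered this way should match the length of the segment dimension, if the file follows the layout assumed here.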

Exploring Floats.nc

Let's make a basic plot of floats.nc. We will loop over all of the trajectories in order to plot each trajectory with a different color.
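
As a rough sketch of that loop, the example below builds three synthetic random-walk trajectories in the same NaN-terminated layout and plots each one separately. The data are made up, the names lon, lat, starts, and ends are assumptions, and plain Matplotlib axes are used here instead of a Cartopy map projection.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Synthetic stand-ins for the longitude and latitude columns: three
# random-walk "trajectories", each terminated by a NaN.
lon_parts, lat_parts = [], []
for n in (50, 80, 30):
    lon_parts.append(np.append(-40 + np.cumsum(rng.normal(0, 0.2, n)), np.nan))
    lat_parts.append(np.append(30 + np.cumsum(rng.normal(0, 0.2, n)), np.nan))
lon = np.concatenate(lon_parts)
lat = np.concatenate(lat_parts)

# Start and end index of each trajectory, found from the NaN markers.
ends = np.flatnonzero(np.isnan(lon))
starts = np.concatenate(([0], ends[:-1] + 1))

fig, ax = plt.subplots()
for a, b in zip(starts, ends):
    # Each call to plot() advances Matplotlib's color cycle,
    # so consecutive trajectories are drawn in different colors.
    ax.plot(lon[a:b], lat[a:b])
ax.set_xlabel("Longitude")
ax.set_ylabel("Latitude")
```

Note that the default color cycle has ten entries, so with many trajectories the colors repeat. Call plt.show() to display the figure when running outside a notebook.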

Two-Dimensional Histograms

Now we're going to start exploring this dataset as a distribution. The first step is to plot the two-dimensional histogram of observation locations in latitude-longitude space.

In the code below, stats.binned_statistic_2d() is the central function, which we can use to compute a number of different statistics. First we are interested in the number of observations in each bin, which we request with statistic="count". The third argument, where observation values would normally go, can be None in such a case.
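
Here is a minimal sketch of that call on synthetic data. The uniformly scattered positions and the one-degree bin edges are assumptions made for illustration, not values taken from floats.nc.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic observation locations (assumed values, not the float data).
lon = rng.uniform(-80, 0, 5000)
lat = rng.uniform(10, 60, 5000)

# One-degree bin edges in each direction.
lon_bins = np.arange(-80, 1, 1)
lat_bins = np.arange(10, 61, 1)

# With statistic="count" the values argument is not used, so None is passed.
result = stats.binned_statistic_2d(lon, lat, None, statistic="count",
                                   bins=[lon_bins, lat_bins])

# Counts per bin, with shape (len(lon_bins) - 1, len(lat_bins) - 1).
counts = result.statistic
print(counts.shape, counts.sum())
```

The returned array is indexed as [longitude bin, latitude bin], so it usually needs to be transposed before being passed to a plotting routine such as pcolormesh that expects rows to run along the y-axis. With the real data you may also want to remove the NaN markers from the column arrays before binning.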