Distributional analysis is a term I coined for a very simple yet powerful way of analyzing datasets. It means that you think of the dataset as a distribution within a large multidimensional space, which you then can examine through its marginal statistics in any two-dimensional subspace.
The best way to understand this is through examples. So let's turn our attention to exploring a dataset of freely-drifting subsurface oceanographic floats. These instruments record latitude, longitude, and temperature as they drifter around with the currents at more-or-less fixed pressure levels. You'll need to download "floats.nc" from my web site. (This is included in the full distribution of the course materials.)
We'll be using the Python packages Numpy, Matplotlib, Scipy, and Cartopy. You can find more about those here:
https://numpy.org
https://matplotlib.org
https://www.scipy.org
https://scitools.org.uk/cartopy
In particular, here is a nice tutorial on Cartopy, a mapping package for Python:
https://coderzcolumn.com/tutorials/data-science/cartopy-basic-maps-scatter-map-bubble-map-and-connection-map
Many thanks to Jan-Adrian Kallmyr for helping with translating the Matlab tutorial into Python.
To start we'll import the required Python packages.
import numpy as np
import matplotlib.pyplot as plt
import xarray as xr
from scipy import stats
import cartopy.crs as ccrs
import cartopy.feature as cfeature
import warnings #ignore warnings for readability
warnings.filterwarnings('ignore')
datadir = "/Users/lilly/Desktop/Dropbox/Python/time_series_analysis/data/" #path to the directory where you've put floats.nc
floats = xr.open_dataset(datadir + "floats.nc")
floats
<xarray.Dataset> Dimensions: (segment: 3091, column: 1267240) Dimensions without coordinates: segment, column Data variables: expid (segment) float64 1.0 1.0 1.0 1.0 1.0 ... 52.0 52.0 52.0 52.0 52.0 typeid (segment) float64 3.0 3.0 3.0 3.0 3.0 3.0 ... 3.0 3.0 3.0 3.0 3.0 id (segment) float64 1.0 2.0 3.0 4.0 ... 2.191e+03 2.192e+03 2.193e+03 num (column) float64 ... lat (column) float64 ... lon (column) float64 ... u (column) float64 ... v (column) float64 ... p (column) float64 ... t (column) float64 ... filled (column) float64 ... Attributes: title: FLOATS.NC subsurface float dataset link: http://www.aoml.noaa.gov/phod/float_traj/data.php about: For more information, see ABOUT_FLOATS.M. note: In Matlab, try 'ncload floats' using jLab routine ncload. creator: J. M. Lilly timestamp: 13-Jan-2019 16:46:27
We see that we have a data set with many different variables and two dimensions: column and segment. The column dimension has all data points from all float trajectories concatenated together, with nans marking the tails, while the segment dimension has one element per float trajectory (e.g. float id number).
Let's make a basic plot of floats.nc. We will loop over all of the trajectories in order to plot each trajectory with a different color.
lon = floats.lon.values
lat = floats.lat.values
figsize=np.array([18, 12]);
projection=ccrs.PlateCarree();
#some other projections we might try
#projection=ccrs.Robinson();
#projection=ccrs.LambertCylindrical();
#for future reference, define a function to set up our map
def map_setup(figsize,projection):
fig = plt.figure(figsize=figsize)
ax = fig.add_subplot(projection=projection)
ax.add_feature(cfeature.LAND, facecolor='grey')
ax.gridlines(draw_labels=True)
return fig, ax
fig, ax = map_setup(figsize,projection)
#find the head (a) and tail (b) of each nan-separated segment, excluding the nans
b = np.argwhere(np.isnan(lat)) - 1
a = np.roll(b,1)+2
#set lon to nan when we jump across the dateline, for plotting purposes
lon_nojumps = lon
lon_nojumps[np.abs(lon-np.roll(lon, -1))>10] = np.nan; #nan for lon jumps > 10 degrees
#plot each float trajectory with its own color ... this takes a minute
for n in range(0,np.size(floats.id)):
#index into values belonging to the nth trajectory
index = np.arange(a.item(n),b.item(n))
#note you can't use a[n],b[n] because those are length 1 arrays, not scalars
if index.size > 0:
plt.plot(lon_nojumps[index], lat[index],transform=ccrs.PlateCarree())
#The "transform = ccrs.PlateCarree()" specifies that the data are given as latitude and longitude
#By default, the data is assumed to be in the same projection as that of the axis object
Now we're going to start exploring this dataset as a distribution. The first step is to plot the two dimensional histogram of observation locations in latitude--longitude space.
In the code below, stats.binned_statistic_2d() is the central function, which we can use to compute a number of different statistics. First we are interested in the number of observations in each bin specified by statistic="count". The third argument, where observation values would normally go, can be "None" in such a case.
#cmap = plt.cm.get_cmap("plasma_r", 40)
#some other colormaps. Note that the "_r" reverses the map
#cmap = plt.cm.get_cmap("magma_r", 40)
#cmap = plt.cm.get_cmap("inferno_r", 40)
#cmap = plt.cm.get_cmap("viridis_r", 40)
#cmap = plt.cm.get_cmap("RdYlBu_r", 64)
cmap = plt.cm.get_cmap("Spectral_r", 64)
fig, ax = map_setup(figsize,projection)
dlatlon = 0.5
lonbins = np.arange(-180, 180, dlatlon)
latbins = np.arange(-80, 80, dlatlon)
hist = stats.binned_statistic_2d(lat, lon, None, bins=[latbins, lonbins], statistic="count") # Returns a object
hist.statistic[hist.statistic == 0] = np.nan # Sets all values of 0 to nan as log10(0) = -inf
image = plt.pcolormesh(lonbins, latbins, np.log10(hist.statistic), cmap=cmap, shading="flat", transform=ccrs.PlateCarree())
plt.clim(0.5, 3)
fig.colorbar(image, ax=ax, orientation='horizontal', fraction=0.1, aspect=40, pad=0.08, label="Log10 Number of Observations")
<matplotlib.colorbar.Colorbar at 0x7fe5996b0a60>