Date created: April 21, 2023
Author: P. Alexander Burnham
Summary: The main objective of the spacetime Python library is to make tasks like loading, rescaling, merging, and conducting mathematical operations on spatiotemporal (or other D-dimensional) data sets easier for the user by providing a set of concise yet powerful functions. spacetime operations use a cube-like structure for all data sets, which makes storing and manipulating large D-dimensional datasets more efficient. For scientists working with spatiotemporal data (such as climate or weather data sets), spacetime is an ideal platform that allows the user to focus on the science rather than the coding. Here we explore the process of loading raster files, rescaling, and writing to disk for a number of different data-cleaning scenarios.
Spacetime takes file names/paths as lists. Here we set up a list of GeoTIFF files. The initial read_data() function in spacetime uses GDAL's Python bindings at its core, so it can read any file type that GDAL can read; for the full list of supported types, see the GDAL API. When a great number of files need to be read in, the list of file paths can be pulled from a given directory automatically, based on file extension or some other feature of the file names, using tools like Python's glob module or the shell utility grep.
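For example, here is a minimal sketch using Python's built-in glob module (the demoData directory and the .tif extension simply reflect this vignette's layout):
# collect all GeoTIFF paths in a directory, sorted for a stable order
import glob
paths = sorted(glob.glob("demoData/*.tif"))
Here, though, we pull in a few different GeoTIFFs manually.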
# create our list of file paths relative to our directory
data = ["demoData/N0_AKestrel_1976_1980.tif", "demoData/N0_AKestrel_1977_1981.tif", "demoData/N0_AKestrel_1978_1982.tif"]
# spacetime function "read_data()" to create a file object
ds = read_data(data)
We have now created a file object. This object may be queried using a series of methods documented in the API. Each method returns a list whose length equals the number of files read in by the read_data() function. For instance, we can look at the EPSG code for each file; this returns a list of length 3.
# examine the EPSG code for each file
ds.get_epsg_code()
## ['EPSG:4269', 'EPSG:4269', 'EPSG:4269']
Information like nodata values, upper-left corners, number of bands, etc. can be queried at this point, too. A full list of methods can be found in the spacetime API.
ds.get_nodata_value()
## [-3.3999999521443642e+38, -3.3999999521443642e+38, -3.3999999521443642e+38]
ds.get_UL_corner()
## [[1156212.8177116, 1200926.03642844], [1156212.8177116, 1200926.03642844], [1156212.8177116, 1200926.03642844]]
ds.get_band_number()
## [1, 1, 1]
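Since every method returns its results in the same file order, the lists can be zipped together for a quick side-by-side look. This is just a convenience sketch built from the methods shown above:
# print one row of metadata per file
for path, epsg, nodata, bands in zip(data, ds.get_epsg_code(), ds.get_nodata_value(), ds.get_band_number()):
    print(path, epsg, nodata, bands)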
We can also extract a list of numpy arrays that contain the actual data from the files. Let's visualize the first one as an image rather than a large matrix.
# import matplotlib
import matplotlib.pyplot as plt
# extract the data sets from the file object
plotData = ds.get_data_array()
# plot the first file removing values below 0 (i.e. nodata vals)
plt.imshow(plotData[0], vmin=0)
plt.show()
Here we look at the dimensions of the first dataset in the list:
# what is the shape of the first file
ds.get_data_array()[0].shape
## (109, 93, 1)
Each file, in the case of the kestrel dataset, is a year (or rather a 4-year average). As we saw above, each file has a spatial dimension of 109 x 93 and a single band. We want a temporal-spatial cube with dimensions 109 x 93 x 3 (lat x lon x time).
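As a mental model only, the equivalent stacking in plain numpy would look like the sketch below; make_cube() handles this for us and also writes the result to disk:
import numpy as np
# each file's array is (109, 93, 1); drop the band axis and stack along a new time axis
stack = np.stack([a.squeeze() for a in ds.get_data_array()], axis=-1)
print(stack.shape)  # (109, 93, 3) -> lat x lon x time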
When we create a cube, we can pass time vectors and variable names to the make_cube() function. Otherwise, the function will make time an integer vector from 0 to T, and variables will be assigned the characters A, B, C, ... through N. In this case, we only have one variable contained in the dataset, but we have a time dimension that we can make explicit. First let's create a time vector of length 3 using the cube_time() function.
# make a time vec
yearObj = cube_time(start="1976", length=3, scale = "year")
print(yearObj)
## DatetimeIndex(['1976-12-31', '1977-12-31', '1978-12-31'], dtype='datetime64[ns]', freq=None)
Now we will make a cube object from this file object. When cubes are created, a .nc4 file is written to disk. This allows for virtualization through a number of third-party Python libraries like dask and netCDF4. The arguments organizeFiles and organizeBands set up the structure of the cube based on the data files. "filestotime" and "bandstotime" together will take every file, and every band within every file, and treat them as time slices when compiling the cube. If variables are stored in different files or bands, the "filestovar" or "bandstovar" arguments would be used instead.
# make a cube
cube3d = make_cube(data = ds, fileName = "test1.nc4", organizeFiles = "filestotime", organizeBands="bandstotime", timeObj = yearObj)
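Since the cube is persisted as a standard netCDF file, it can also be opened independently of spacetime. For instance, a quick sanity check with xarray (assuming xarray and a netCDF backend are installed):
# open the file make_cube() just wrote, outside of spacetime
import xarray as xr
check = xr.open_dataset("test1.nc4")
print(check)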
Here we have a set of files named 2001 and 2002. Each has three bands, which are RGB color bands. Let's read the data in, check the number of bands to make sure, create a time vector, pass the correct variable names, and output a cleaned cube object.
# read files
data1 = ["demoData/2001.tif", "demoData/2002.tif"]
# make file object
ds1 = read_data(data1)
# check bands
ds1.get_band_number()
## [3, 3]
We have the correct number of bands, 3. Let's assemble our cube. We use the "filestotime" and "bandstovar" arguments and pass R, G, and B to varNames as a list.
# make time vector
yearObj1 = cube_time(start="2001", length=2, scale = "year")
# make cube
cube4d = make_cube(data = ds1, fileName = "test2.nc4", organizeFiles = "filestotime", organizeBands="bandstovar", varNames=["R","G","B"], timeObj=yearObj1)
Let's take a look at our cube object. All of the same commands work on a cube object as on a file object; however, what is returned is slightly different. Instead of a list of multiple returns, we now have one unified cube that returns a single value or object. Let's look at the extracted data from the 3d and 4d cubes we just made.
# look at data
cube3d.get_data_array()
## <xarray.DataArray (time: 3, lat: 109, lon: 93)>
## array([[[-3.4e+38, -3.4e+38, -3.4e+38, ..., -3.4e+38, -3.4e+38, -3.4e+38],
##         [-3.4e+38, -3.4e+38, -3.4e+38, ..., -3.4e+38, -3.4e+38, -3.4e+38],
##         [-3.4e+38, -3.4e+38, -3.4e+38, ..., -3.4e+38, -3.4e+38, -3.4e+38],
##         ...,
##         [-3.4e+38, -3.4e+38, -3.4e+38, ..., -3.4e+38, -3.4e+38, -3.4e+38],
##         [-3.4e+38, -3.4e+38, -3.4e+38, ..., -3.4e+38, -3.4e+38, -3.4e+38],
##         [-3.4e+38, -3.4e+38, -3.4e+38, ..., -3.4e+38, -3.4e+38, -3.4e+38]],
##
##        [[-3.4e+38, -3.4e+38, -3.4e+38, ..., -3.4e+38, -3.4e+38, -3.4e+38],
##         [-3.4e+38, -3.4e+38, -3.4e+38, ..., -3.4e+38, -3.4e+38, -3.4e+38],
##         [-3.4e+38, -3.4e+38, -3.4e+38, ..., -3.4e+38, -3.4e+38, -3.4e+38],
##         ...
##         [-3.4e+38, -3.4e+38, -3.4e+38, ..., -3.4e+38, -3.4e+38, -3.4e+38],
##         [-3.4e+38, -3.4e+38, -3.4e+38, ..., -3.4e+38, -3.4e+38, -3.4e+38],
##         [-3.4e+38, -3.4e+38, -3.4e+38, ..., -3.4e+38, -3.4e+38, -3.4e+38]],
##
##        [[-3.4e+38, -3.4e+38, -3.4e+38, ..., -3.4e+38, -3.4e+38, -3.4e+38],
##         [-3.4e+38, -3.4e+38, -3.4e+38, ..., -3.4e+38, -3.4e+38, -3.4e+38],
##         [-3.4e+38, -3.4e+38, -3.4e+38, ..., -3.4e+38, -3.4e+38, -3.4e+38],
##         ...,
##         [-3.4e+38, -3.4e+38, -3.4e+38, ..., -3.4e+38, -3.4e+38, -3.4e+38],
##         [-3.4e+38, -3.4e+38, -3.4e+38, ..., -3.4e+38, -3.4e+38, -3.4e+38],
##         [-3.4e+38, -3.4e+38, -3.4e+38, ..., -3.4e+38, -3.4e+38, -3.4e+38]]],
##       dtype=float32)
## Coordinates:
##   * lon      (lon) float32 1.201e+06 1.211e+06 1.221e+06 ... 2.111e+06 2.121e+06
##   * lat      (lat) float32 1.156e+06 1.146e+06 1.136e+06 ... 8.621e+04 7.621e+04
##   * time     (time) datetime64[ns] 1976-12-31 1977-12-31 1978-12-31
cube4d.get_data_array()
## <xarray.DataArray (variables: 3, time: 2, lat: 1001, lon: 1001)>
## array([[[[   0.,  799.,  788., ..., 1061., 1433., 2430.],
##          [   0.,  774.,  792., ..., 1027., 1125., 1733.],
##          [   0.,  797.,  801., ..., 1046., 1040., 1100.],
##          ...,
##          [   0.,  826.,  837., ..., 1562., 1489., 1740.],
##          [   0.,  804.,  825., ..., 1553., 1582., 1619.],
##          [   0.,    0.,    0., ...,    0.,    0.,    0.]],
##
##         [[   0.,  799.,  788., ..., 1061., 1433., 2430.],
##          [   0.,  774.,  792., ..., 1027., 1125., 1733.],
##          [   0.,  797.,  801., ..., 1046., 1040., 1100.],
##          ...,
##          [   0.,  826.,  837., ..., 1562., 1489., 1740.],
##          [   0.,  804.,  825., ..., 1553., 1582., 1619.],
##          [   0.,    0.,    0., ...,    0.,    0.,    0.]]],
##
##        [[[   0.,  555.,  546., ..., 1106., 1353., 2280.],
##          [   0.,  534.,  533., ..., 1074., 1154., 1650.],
##          [   0.,  548.,  553., ..., 1077., 1070., 1151.],
##          ...
##          [   0.,  644.,  627., ..., 1440., 1351., 1859.],
##          [   0.,  628.,  634., ..., 1498., 1605., 1562.],
##          [   0.,    0.,    0., ...,    0.,    0.,    0.]]],
##
##        [[[   0.,  330.,  322., ...,  823., 1263., 2408.],
##          [   0.,  304.,  313., ...,  771.,  859., 1492.],
##          [   0.,  320.,  323., ...,  790.,  817.,  915.],
##          ...,
##          [   0.,  397.,  403., ..., 1562., 1415., 1825.],
##          [   0.,  392.,  408., ..., 1573., 1680., 1780.],
##          [   0.,    0.,    0., ...,    0.,    0.,    0.]],
##
##         [[   0.,  330.,  322., ...,  823., 1263., 2408.],
##          [   0.,  304.,  313., ...,  771.,  859., 1492.],
##          [   0.,  320.,  323., ...,  790.,  817.,  915.],
##          ...,
##          [   0.,  397.,  403., ..., 1562., 1415., 1825.],
##          [   0.,  392.,  408., ..., 1573., 1680., 1780.],
##          [   0.,    0.,    0., ...,    0.,    0.,    0.]]]], dtype=float32)
## Coordinates:
##   * variables  (variables) <U1 'R' 'G' 'B'
##   * lon        (lon) float32 5.905e+05 5.905e+05 ... 6.005e+05 6.005e+05
##   * lat        (lat) float32 5.791e+06 5.791e+06 ... 5.781e+06 5.781e+06
##   * time       (time) datetime64[ns] 2001-12-31 2002-12-31
The xarray data structures that are returned can be operated on like any numpy array. They also contain some useful metadata, like the dimensions of the cube and the names of those dimensions. Under the coordinates for the 4d cube, we see the additional dimension named variables, which contains our three levels: R, G, and B. For a full list of cube object methods, see the spacetime API.
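For example, standard xarray operations apply directly. A minimal sketch (the dimension and coordinate names come from the output above):
# label-based selection and a cell-wise reduction on the 4d cube
arr = cube4d.get_data_array()
print(arr.dims)                       # ('variables', 'time', 'lat', 'lon')
print(arr.sel(variables="R").mean())  # mean over the red band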
In the previous section, we set up several cubes from GeoTIFF files and explored some of the basic querying and cube-building functionality of spacetime. However, all of the files we were working with had the same scales and grid sizes. Spacetime has the functionality to rescale rasters of different extents, grid sizes, EPSG codes, etc. and create a cleaned cube where all of those factors are consistent across the data layers.
Let’s load two rasters that contain population data from India. We can then load the data up as a file object and compare some attributes between them.
# read files and load data
data = ["demoData/LULC_1995.tif", "demoData/India_cropped-area_1km_2016.tif"]
ds = read_data(data)
# check number of bands
ds.get_band_number()
## [1, 1]
# check the epsg codes
ds.get_epsg_code()
## ['EPSG:32644', 'EPSG:4326']
# check the raster dimensions
ds.get_dims()
## [(29484, 33555), (2373, 3269)]
# check pixel size
ds.get_pixel_size()
## [100.0, 0.008868160000000002]
# check no data values
ds.get_nodata_value()
## [127.0, -1.7e+308]
Let’s plot the two files and compare them.
import matplotlib.pyplot as plt
# Get the array data
arrays = ds.get_data_array()
# plot file 1
plt.imshow(arrays[0], vmin=0, vmax=17)
plt.show()
# plot file 2
plt.imshow(arrays[1], vmin=0, vmax=100)
plt.show()
Apart from the fact that both datasets are from approximately the same region, we can see that almost all of the attributes of these two files are different. Using raster_align() we can get these two files to have the same spatial reference system, pixel size, and nodata value, with cells aligned to the same lat and lon values.
# let's align these files
alignedDS = raster_align(data=ds, resolution=.08, SRS=4326, noneVal=127)
Let’s look at these attributes now that we have run the align function.
# check epsg codes
alignedDS.get_epsg_code()
## ['EPSG:4326', 'EPSG:4326']
# get raster dimensions
alignedDS.get_dims()
## [(409, 383), (264, 363)]
# check pixel size
alignedDS.get_pixel_size()
## [0.08, 0.08]
# check no data values
alignedDS.get_nodata_value()
## [127.0, 127.0]
Let's also look at the rescaled images.
# get rescaled data
scaledArray = alignedDS.get_data_array()
# plot file 1
plt.imshow(scaledArray[0], vmin=0, vmax=17)
plt.show()
# plot file 2
plt.imshow(scaledArray[1], vmin=0, vmax=100)
plt.show()
We can now see that the only remaining difference is the extent of the files; the first file covers more surface area of the globe. Let's crop the files to have the same greatest common bounding box.
# trim the file object
trimmedDS = raster_trim(alignedDS)
# lets compare the dimensions of the new objects
trimmedDS.get_dims()
## [(264, 363), (264, 363)]
Comparing the plotted images reveals that the files are now properly aligned and trimmed.
# get data
trimmedArray = trimmedDS.get_data_array()
# plot file 1
plt.imshow(trimmedArray[0], vmin=0, vmax=17)
plt.show()
# plot file 2
plt.imshow(trimmedArray[1], vmin=0, vmax=100)
plt.show()
Now that we have cleaned data sets, we can assemble them into a cube as shown in the previous vignette. Let's start by creating a time object. cube_time() allows you to create a time vector that skips user-specified chunks of time using the skips argument.
# create time vec
yearObj = cube_time(start="1995", length=2, scale = "year", skips = 21)
We will now create a cube by assigning files and bands to time.
cube = make_cube(data = trimmedDS, fileName = "indiaCube.nc4", organizeFiles = "filestotime", organizeBands="bandstotime", timeObj = yearObj)
Finally, let's look at the data.
cube.get_data_array()
## <xarray.DataArray (time: 2, lat: 363, lon: 264)>
## array([[[127., 127., 127., ..., 127., 127., 127.],
##         [127., 127., 127., ..., 127., 127., 127.],
##         [127., 127., 127., ..., 127., 127., 127.],
##         ...,
##         [127., 127., 127., ..., 127., 127., 127.],
##         [127., 127., 127., ..., 127., 127., 127.],
##         [127., 127., 127., ..., 127., 127., 127.]],
##
##        [[127., 127., 127., ..., 127., 127., 127.],
##         [127., 127., 127., ..., 127., 127., 127.],
##         [127., 127., 127., ..., 127., 127., 127.],
##         ...,
##         [127., 127., 127., ..., 127., 127., 127.],
##         [127., 127., 127., ..., 127., 127., 127.],
##         [127., 127., 127., ..., 127., 127., 127.]]], dtype=float32)
## Coordinates:
##   * lon      (lon) float32 68.08 68.16 68.24 68.32 ... 88.88 88.96 89.04 89.12
##   * lat      (lat) float32 37.12 37.04 36.96 36.88 36.8 ... 8.4 8.32 8.24 8.16
##   * time     (time) datetime64[ns] 1995-12-31 2016-12-31
Let's load two raster files that contain 101 bands each, where each band is a year. The datasets contain the same data; however, they are at different spatial scales. We will first create the file object.
# read files and load data
data = ["demoData/Carya_ovata_sim_disc_10km.tif", "demoData/Carya_ovata_sim_disc_1km.tif"]
ds = read_data(data)
Next, we will rescale and trim the data to the same grid size. With no arguments specified, the functions will use the first file’s attributes to rescale the rest.
# scale data
scaledData = raster_align(ds)
trimmedData = raster_trim(scaledData)
Finally, we will make a cube object. For demonstration purposes, we will assign each band to be a month for 101 months. Each file will be assigned to a variable (10km and 1km).
# set up time vec for 101 months starting from 2000-10-12
monthObj = cube_time(start="2000-10-12", length=101, scale = "month")
monthObj
## DatetimeIndex(['2000-10-31', '2000-11-30', '2000-12-31', '2001-01-31',
##                '2001-02-28', '2001-03-31', '2001-04-30', '2001-05-31',
##                '2001-06-30', '2001-07-31',
##                ...
##                '2008-05-31', '2008-06-30', '2008-07-31', '2008-08-31',
##                '2008-09-30', '2008-10-31', '2008-11-30', '2008-12-31',
##                '2009-01-31', '2009-02-28'],
##               dtype='datetime64[ns]', length=101, freq=None)
# make cube
cube = make_cube(data = trimmedData, fileName = "cube1.nc4", organizeFiles = "filestovar", organizeBands="bandstotime", timeObj = monthObj, varNames = ["10km", "1km"])
The scale_time() function takes a cube with a time dimension and rescales along that time dimension based on an intended scale and summarizing method. Here we calculate yearly averages by passing "year" to scale and "mean" to method.
# scale time by year
timeScaled = scale_time(cube=cube, scale="year", method="mean")
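Other summary statistics should follow the same pattern. For instance, a hypothetical yearly-maximum cube; note that "max" as a method value is an assumption and is not demonstrated elsewhere in this vignette:
# hypothetical variant: yearly maxima instead of means ("max" is assumed, not verified)
timeMax = scale_time(cube=cube, scale="year", method="max")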
Let's examine the structure of timeScaled after rescaling.
# extract data from cube
timeScaled.get_data_array()
## <xarray.DataArray (variables: 2, time: 10, lat: 149, lon: 297)>
## array([[[[      nan,       0.     ,       0.     , ...,       nan,       nan,       nan],
##          [      nan,       0.     ,       0.     , ...,       nan,       nan,       nan],
##          [      nan,       0.     ,       0.     , ...,       nan,       nan,       nan],
##          ...,
##          [      nan,   90162.84   ,   72148.33   , ...,       nan,       nan,       nan],
##          [      nan,   52556.793  ,   42711.94   , ...,       nan,       nan,       nan],
##          [      nan,   53957.2    ,   45003.793  , ...,       nan,       nan,       nan]],
##
##         [[      nan,       0.     ,       0.     , ...,       nan,       nan,       nan],
##          [      nan,       0.     ,       0.     , ...,       nan,       nan,       nan],
##          [      nan,       0.     ,       0.     , ...,       nan,       nan,       nan],
##          ...
##          [      nan,     705.8333 ,     657.3333 , ...,       nan,       nan,       nan],
##          [      nan,     462.91666,     679.8333 , ...,       nan,       nan,       nan],
##          [      nan,    1292.3334 ,     922.75   , ...,       nan,       nan,       nan]],
##
##         [[      nan,       0.     ,       0.     , ...,       nan,       nan,       nan],
##          [      nan,       0.     ,       0.     , ...,       nan,       nan,       nan],
##          [      nan,       0.     ,       0.     , ...,       nan,       nan,       nan],
##          ...,
##          [      nan,     560.     ,     544.5    , ...,       nan,       nan,       nan],
##          [      nan,     343.     ,     536.5    , ...,       nan,       nan,       nan],
##          [      nan,    1195.     ,     789.5    , ...,       nan,       nan,       nan]]]],
##       dtype=float32)
## Coordinates:
##   * time       (time) datetime64[ns] 2000-12-31 2001-12-31 ... 2009-12-31
##   * variables  (variables) <U4 '10km' '1km'
##   * lon        (lon) float32 -83.75 -83.67 -83.58 -83.5 ... -59.25 -59.17 -59.08
##   * lat        (lat) float32 49.0 48.92 48.83 48.75 ... 36.92 36.83 36.75 36.67
Now we have 10 time points at the yearly scale, where averages across the months form the temporal slices in the cube. The spatial scale has not changed.
In addition to some simple rescaling, which will be expanded upon soon, we have the ability to select temporal slices within the cube using the select_time() function. We will add a new time vector to our cube at the daily scale for demonstration purposes.
# a new time vec
monthObjDay = cube_time(start="2000-10-12", length=101, scale = "day")
# make cube
cubeDay = make_cube(data = trimmedData, fileName = "cubeDay.nc4", organizeFiles = "filestovar", organizeBands="bandstotime", timeObj = monthObjDay, varNames = ["10km", "1km"])
Using the select_time() function, we can select a specific day/month/year within a range or for the entire dataset. For example, to choose only the first of each month for the entire cube we would write this:
# the first of the month
selectedDay = select_time(cube=cubeDay, range="entire", scale = "day", element=1)
# print the time vector
selectedDay.get_time()
## DatetimeIndex(['2000-11-01', '2000-12-01', '2001-01-01'], dtype='datetime64[ns]', freq=None)
Now we have a cube with a time dimension of length 3, where each slice is the first of a month in the dataset. We could also select months, and do so within a range. Let's return to our monthly data set and select Aprils between 2000 and 2005.
# select Aprils between 2000 and 2005
selectedApril = select_time(cube=cube, range=["2000", "2005"], scale = "month", element=4)
# print the time vector
selectedApril.get_time()
## DatetimeIndex(['2001-04-30', '2002-04-30', '2003-04-30', '2004-04-30',
## '2005-04-30'],
## dtype='datetime64[ns]', freq=None)
Let’s start by creating a cube as we have done previously for the yearly Carya ovata data.
We will rescale and trim the data sets.
# scale data
scaledData = raster_align(ds)
trimmedData = raster_trim(scaledData)
Finally, we will make a cube object. Each file will be assigned to a variable (10km and 1km).
# set up time vec for 101 years
yearObj = cube_time(start="2000", length=101, scale = "year")
# make cube
cube = make_cube(data = trimmedData, fileName = "yearCube.nc4", organizeFiles = "filestovar", organizeBands="bandstotime", timeObj = yearObj, varNames = ["10km", "1km"])
We can use a combination of cubes, numpy arrays, and scalar values to rescale values in a cube or compute some important cell-wise quantity. Here we take our cube and add 1000 raised to the 2nd power to each cell (note that ** binds tighter than +, so the equation evaluates as a + (b ** c)). parentCube is the cube whose structure you want the output cube to use.
import numpy as np
answer = cube_smasher(eq = "a + b ** c", a=cube, b=1000, c = 2, parentCube = cube)
# pre-operation value = 0
cube.get_data_array()[0,0,50,50]
## <xarray.DataArray ()>
## array(0., dtype=float32)
## Coordinates:
##     variables  <U4 '10km'
##     lon        float32 -79.58
##     lat        float32 44.83
##     time       datetime64[ns] 2000-12-31
# after the operation = 1000000
answer.get_data_array()[0,0,50,50]
## <xarray.DataArray ()>
## array(1000000.)
## Coordinates:
##     variables  <U4 '10km'
##     lon        float32 -79.58
##     lat        float32 44.83
##     time       datetime64[ns] 2000-12-31
That checks out: 0 + 1000 ** 2 = 1000000.
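The same arithmetic can be cross-checked with plain numpy on the extracted cell values; this sketch uses only values shown above:
# verify the cell-wise arithmetic independently of cube_smasher
val = float(cube.get_data_array()[0, 0, 50, 50])    # 0.0
res = float(answer.get_data_array()[0, 0, 50, 50])  # 1000000.0
assert np.isclose(val + 1000 ** 2, res)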