Date created: April 21, 2023
Author: P. Alexander Burnham
Summary: The main objective of the spacetime python library is to make tasks like loading, rescaling, merging, and conducting mathematical operations on spatiotemporal (or other D-dimensional) data sets easier for the user by providing a set of concise yet powerful functions. spacetime operations utilize a cube-like structure for all data sets that makes storing and manipulating large D-dimensional datasets more efficient. For scientists working with spatiotemporal data (such as climate or weather data sets), spacetime is an ideal platform that allows the user to focus on the science rather than the coding. Here we explore the process of loading raster files, rescaling, and writing to disk for a number of different data cleaning scenarios.
Spacetime takes file names/paths as lists. Here we set up a list of GeoTIFF files. The initial read_data() function in spacetime uses GDAL's Python bindings at its base, so it can read any file type that GDAL can read; for the full list of supported types, see the GDAL API. When a great number of files need to be read in, automated tools such as Python's glob module (or a command-line utility like grep) can pull lists of file paths from a given directory based on file extension or some other feature of the file names. Here we pull in a few different tifs manually.
# create our list of file paths relative to our directory
data = ["demoData/N0_AKestrel_1976_1980.tif", "demoData/N0_AKestrel_1977_1981.tif", "demoData/N0_AKestrel_1978_1982.tif"]
# spacetime function "read_data()" to create a file object
ds = read_data(data)
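The automated route mentioned above can be sketched with Python's standard glob module. The directory and file names below are invented for illustration (a throwaway folder stands in for a real data directory):

```python
import glob
import os
import tempfile

# hypothetical stand-in for a real data directory: a throwaway folder
# holding a couple of empty .tif files plus one unrelated file
demo_dir = tempfile.mkdtemp()
for name in ["N0_AKestrel_1976_1980.tif", "N0_AKestrel_1977_1981.tif", "notes.txt"]:
    open(os.path.join(demo_dir, name), "w").close()

# glob by extension, then sort so the files land in chronological order
data = sorted(glob.glob(os.path.join(demo_dir, "*.tif")))
print([os.path.basename(p) for p in data])
# ['N0_AKestrel_1976_1980.tif', 'N0_AKestrel_1977_1981.tif']
```

The resulting list can be handed to read_data() exactly like the manually typed one above.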
We have now created a file object. This object may be queried using a series of methods documented in the API. Each method returns a list whose length equals the number of files read in by the read_data() function. For instance, we can look at the EPSG code for each file, which will return a list of length 3.
# examine the EPSG code for each file
ds.get_epsg_code()
## ['EPSG:4269', 'EPSG:4269', 'EPSG:4269']
Information like nodata values, upper-left corners, number of bands, etc. can be queried at this point, too. A full list of methods can be found in the spacetime API.
ds.get_nodata_value()
## [-3.3999999521443642e+38, -3.3999999521443642e+38, -3.3999999521443642e+38]
ds.get_UL_corner()
## [[1156212.8177116, 1200926.03642844], [1156212.8177116, 1200926.03642844], [1156212.8177116, 1200926.03642844]]
ds.get_band_number()
## [1, 1, 1]
We can extract a list of numpy arrays that contain the actual data from the files as well. Let's visualize the first one as an image rather than a large matrix.
# import matplotlib
import matplotlib.pyplot as plt
# extract the data sets from the file object
plotData = ds.get_data_array()
# plot the first file removing values below 0 (i.e. nodata vals)
plt.imshow(plotData[0], vmin=0)
plt.show()
Here we look at the dimensions of the first dataset in the list.
# what is the shape of the first file
ds.get_data_array()[0].shape
## (109, 93, 1)
Each file, in the case of the kestrel dataset, is a year (or rather a four-year average). As we saw above, each file has a spatial dimension of 109 x 93 and a single band. We want a spatiotemporal cube with dimensions 109 x 93 x 3 (lat x lon x time).
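For intuition about that target shape, stacking three per-year arrays along a new trailing axis with plain numpy gives exactly those dimensions. This is only a shape sketch, not the spacetime API:

```python
import numpy as np

# three placeholder yearly layers, each 109 x 93 like the files above
years = [np.zeros((109, 93), dtype="float32") for _ in range(3)]

# stack along a new trailing axis: lat x lon x time
cube = np.stack(years, axis=-1)
print(cube.shape)  # (109, 93, 3)
```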
When we create a cube, we can pass time vectors and variable names to the make_cube() function. Otherwise, the function will make time an integer vector from 0 to T, and variables will be assigned as characters A, B, C… through N. In this case, we only have one variable contained in the dataset, but we have a time dimension that we can make explicit. First, let's create a time vector of length 3 using the cube_time() function.
# make a time vec
yearObj = cube_time(start="1976", length=3, scale = "year")
print(yearObj)
## DatetimeIndex(['1976-12-31', '1977-12-31', '1978-12-31'], dtype='datetime64[ns]', freq=None)
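The printed result is a pandas DatetimeIndex, so an equivalent vector can be built by hand with pandas if you ever need one outside of spacetime (a sketch, assuming only that pandas is installed):

```python
import pandas as pd

# year-end timestamps for 1976-1978, matching the cube_time() output above
idx = pd.DatetimeIndex([f"{y}-12-31" for y in range(1976, 1979)])
print(list(idx.year))  # [1976, 1977, 1978]
```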
Now we will make a cube object from this file object. When cubes are created, a .nc4 file is written to disk. This allows for virtualization through a number of third-party Python libraries like dask and netCDF4. The arguments organizeFiles and organizeBands set up the structure of the cube based on the data files. “filestotime” and “bandstotime” together will take every file and every band within every file and treat them as time slices when compiling the cube. If variables are stored in different files or bands, the “filestovar” or “bandstovar” arguments would be used instead.
# make a cube
cube3d = make_cube(data = ds, fileName = "test1.nc4", organizeFiles = "filestotime", organizeBands="bandstotime", timeObj = yearObj)
Here we have a set of files named 2001 and 2002. Each has three bands, which are RGB color bands. Let's read the data in, check the number of bands to make sure, create a time vector, pass the correct variable names, and output a cleaned cube object.
# read files
data1 = ["demoData/2001.tif", "demoData/2002.tif"]
# make file object
ds1 = read_data(data1)
# check bands
ds1.get_band_number()
## [3, 3]
We have the correct number of bands, 3. Let's assemble our cube. We use the “filestotime” and “bandstovar” arguments and pass R, G, and B to varNames as a list.
# make time vector
yearObj1 = cube_time(start="2001", length=2, scale = "year")
# make cube
cube4d = make_cube(data = ds1, fileName = "test2.nc4", organizeFiles = "filestotime", organizeBands="bandstovar", varNames=["R","G","B"], timeObj=yearObj1)
Let's take a look at our cube object. All of the same methods work on a cube object as on a file object; however, what is returned is slightly different. Instead of a list of multiple returns, we now have one unified cube that returns a single value or object. Let's look at the extracted data from the 3d and 4d cubes we just made.
# look at data
cube3d.get_data_array()
<xarray.DataArray (time: 3, lat: 109, lon: 93)>
array([[[-3.4e+38, -3.4e+38, -3.4e+38, ..., -3.4e+38, -3.4e+38, -3.4e+38],
        [-3.4e+38, -3.4e+38, -3.4e+38, ..., -3.4e+38, -3.4e+38, -3.4e+38],
        [-3.4e+38, -3.4e+38, -3.4e+38, ..., -3.4e+38, -3.4e+38, -3.4e+38],
        ...,
        [-3.4e+38, -3.4e+38, -3.4e+38, ..., -3.4e+38, -3.4e+38, -3.4e+38],
        [-3.4e+38, -3.4e+38, -3.4e+38, ..., -3.4e+38, -3.4e+38, -3.4e+38],
        [-3.4e+38, -3.4e+38, -3.4e+38, ..., -3.4e+38, -3.4e+38, -3.4e+38]],

       [[-3.4e+38, -3.4e+38, -3.4e+38, ..., -3.4e+38, -3.4e+38, -3.4e+38],
        [-3.4e+38, -3.4e+38, -3.4e+38, ..., -3.4e+38, -3.4e+38, -3.4e+38],
        [-3.4e+38, -3.4e+38, -3.4e+38, ..., -3.4e+38, -3.4e+38, -3.4e+38],
        ...
        [-3.4e+38, -3.4e+38, -3.4e+38, ..., -3.4e+38, -3.4e+38, -3.4e+38],
        [-3.4e+38, -3.4e+38, -3.4e+38, ..., -3.4e+38, -3.4e+38, -3.4e+38],
        [-3.4e+38, -3.4e+38, -3.4e+38, ..., -3.4e+38, -3.4e+38, -3.4e+38]],

       [[-3.4e+38, -3.4e+38, -3.4e+38, ..., -3.4e+38, -3.4e+38, -3.4e+38],
        [-3.4e+38, -3.4e+38, -3.4e+38, ..., -3.4e+38, -3.4e+38, -3.4e+38],
        [-3.4e+38, -3.4e+38, -3.4e+38, ..., -3.4e+38, -3.4e+38, -3.4e+38],
        ...,
        [-3.4e+38, -3.4e+38, -3.4e+38, ..., -3.4e+38, -3.4e+38, -3.4e+38],
        [-3.4e+38, -3.4e+38, -3.4e+38, ..., -3.4e+38, -3.4e+38, -3.4e+38],
        [-3.4e+38, -3.4e+38, -3.4e+38, ..., -3.4e+38, -3.4e+38, -3.4e+38]]],
      dtype=float32)
Coordinates:
  * lon      (lon) float32 1.201e+06 1.211e+06 1.221e+06 ... 2.111e+06 2.121e+06
  * lat      (lat) float32 1.156e+06 1.146e+06 1.136e+06 ... 8.621e+04 7.621e+04
  * time     (time) datetime64[ns] 1976-12-31 1977-12-31 1978-12-31
cube4d.get_data_array()
<xarray.DataArray (variables: 3, time: 2, lat: 1001, lon: 1001)>
array([[[[   0.,  799.,  788., ..., 1061., 1433., 2430.],
         [   0.,  774.,  792., ..., 1027., 1125., 1733.],
         [   0.,  797.,  801., ..., 1046., 1040., 1100.],
         ...,
         [   0.,  826.,  837., ..., 1562., 1489., 1740.],
         [   0.,  804.,  825., ..., 1553., 1582., 1619.],
         [   0.,    0.,    0., ...,    0.,    0.,    0.]],

        [[   0.,  799.,  788., ..., 1061., 1433., 2430.],
         [   0.,  774.,  792., ..., 1027., 1125., 1733.],
         [   0.,  797.,  801., ..., 1046., 1040., 1100.],
         ...,
         [   0.,  826.,  837., ..., 1562., 1489., 1740.],
         [   0.,  804.,  825., ..., 1553., 1582., 1619.],
         [   0.,    0.,    0., ...,    0.,    0.,    0.]]],

       [[[   0.,  555.,  546., ..., 1106., 1353., 2280.],
         [   0.,  534.,  533., ..., 1074., 1154., 1650.],
         [   0.,  548.,  553., ..., 1077., 1070., 1151.],
         ...
         [   0.,  644.,  627., ..., 1440., 1351., 1859.],
         [   0.,  628.,  634., ..., 1498., 1605., 1562.],
         [   0.,    0.,    0., ...,    0.,    0.,    0.]]],

       [[[   0.,  330.,  322., ...,  823., 1263., 2408.],
         [   0.,  304.,  313., ...,  771.,  859., 1492.],
         [   0.,  320.,  323., ...,  790.,  817.,  915.],
         ...,
         [   0.,  397.,  403., ..., 1562., 1415., 1825.],
         [   0.,  392.,  408., ..., 1573., 1680., 1780.],
         [   0.,    0.,    0., ...,    0.,    0.,    0.]],

        [[   0.,  330.,  322., ...,  823., 1263., 2408.],
         [   0.,  304.,  313., ...,  771.,  859., 1492.],
         [   0.,  320.,  323., ...,  790.,  817.,  915.],
         ...,
         [   0.,  397.,  403., ..., 1562., 1415., 1825.],
         [   0.,  392.,  408., ..., 1573., 1680., 1780.],
         [   0.,    0.,    0., ...,    0.,    0.,    0.]]]],
      dtype=float32)
Coordinates:
  * variables  (variables) <U1 'R' 'G' 'B'
  * lon        (lon) float32 5.905e+05 5.905e+05 ... 6.005e+05 6.005e+05
  * lat        (lat) float32 5.791e+06 5.791e+06 ... 5.781e+06 5.781e+06
  * time       (time) datetime64[ns] 2001-12-31 2002-12-31
The xarray data structures that are returned can be operated upon like any numpy array. They also contain some useful metadata, like the dimensions of the cube and the names of those dimensions. Under coordinates for the 4d cube, we see the additional dimension named variables. This contains our three levels: R, G, and B. For a full list of cube object methods, see the spacetime API.
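For example, with a small toy DataArray shaped like the 4d cube above, the named variables coordinate lets us pull out a single band and reduce over time. This sketch uses xarray directly, with dimension and coordinate names mirroring the output above; it is not a spacetime call:

```python
import numpy as np
import xarray as xr

# toy stand-in for the 4d cube: 3 variables x 2 time steps x 2 lat x 2 lon
da = xr.DataArray(
    np.arange(24, dtype="float32").reshape(3, 2, 2, 2),
    dims=("variables", "time", "lat", "lon"),
    coords={"variables": ["R", "G", "B"]},
)

red = da.sel(variables="R")      # select one variable by name
mean_img = red.mean(dim="time")  # average that band over time
print(mean_img.shape)  # (2, 2)
```

The same .sel()/.mean() pattern applies to the full-size arrays returned by get_data_array().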
In the previous section, we set up several cubes from GeoTIFF files and explored some of the basic querying and cube-building functionality of spacetime. However, all of the files we were working with had the same scales and grid sizes. Spacetime has the functionality to rescale rasters of different extents, grid sizes, EPSG codes, etc. and create a cleaned cube where all of those factors are consistent across the data layers.
Let’s load two rasters that contain population data from India. We can then load the data up as a file object and compare some attributes between them.
# read files and load data
data = ["demoData/LULC_1995.tif", "demoData/India_cropped-area_1km_2016.tif"]
ds = read_data(data)
# check number of bands
ds.get_band_number()
## [1, 1]
# check the epsg codes
ds.get_epsg_code()
## ['EPSG:32644', 'EPSG:4326']
# check the raster dimensions
ds.get_dims()
## [(29484, 33555), (2373, 3269)]
# check pixel size
ds.get_pixel_size()
## [100.0, 0.008868160000000002]
# check no data values
ds.get_nodata_value()
## [127.0, -1.7e+308]
Let’s plot the two files and compare them.
import matplotlib.pyplot as plt
# Get the array data
arrays = ds.get_data_array()
# plot file 1
plt.imshow(arrays[0], vmin=0, vmax=17)
plt.show()
# plot file 2