The tasks I am faced with is to get predicted daily minimum and maximum temperature values under RCP4.5 and RCP8.5 scenarios, for specific locations in the U.S. Based on the tasks, I need to understand how the following subtasks can be done
- load a .nc file
- subsetting spacially, as I only need continental US data
- spatial interpolation, i.e. compute the temperature value for a specific latitude-longitude location, given values on a spacial grid
- “averaging” two files cell by cell, as I need to compute the average of the daily mean and daily max temperature for each grid point
- mutate values, as I need to get the count of days in each temperature interval
- aggregation over time, as I need to count the number of days in a month with the daily mean temperature between two threshold
There are a few R packages for dealing with NetCDF data: ncdf, ncdf4, RNetCDF, raster, stars. From this post, ncdf4 might be the best, as it is the most tolerate of NetCDF file versions, as fast as C, and have a closer resemblance to the R style. Using RNetCDF might be okay too, as CMIP5 data is restricted to NetCDF 3 (see this documentation).
Following I will proceed to investigate how each of the above tasks is done with RNetCDF and ncdf4, based on a few online tutorials like this.
load a .nc file with RNetCDF
ncdfData = RNetCDF::open.nc(con="tasmin_day_HadCM3_rcp45_r6i1p1_20310101-20351230.nc")
view the content summary with RNetCDF
This prints the dimensions, variables, and global attributes of the dataset I just loaded.
The following shows the loaded dataset has 4 dimensions. Three variables have fixed size: latitude (lat), longitude (lon), and bounds (bnds). One variable has “UNLIMITED” size. According to here, data can be added along this dimension in the future. The bounds dimension is usually 2, as there are usually a lower and an upper bounds for the variables. According to this google archive page of the NOAA website, “The CF standard for netCDF files defines a bound attribute for coordinate axes, where the upper and lower bounds of the grid cells along an axis are specified by a bounds variable which is of size n*2 for an axis of length N”
time = UNLIMITED ; // (1800 currently)
lat = 73 ;
lon = 96 ;
bnds = 2 ;
The following is the print output of variables in the loaded dataset. Variable time, lat, lon all have a corresponding “_bnds” variable. This means the elements of these three variables have bounded ranges.
double time(time) ;
time:bounds = "time_bnds" ;
time:units = "days since 1859-12-01" ;
time:calendar = "360_day" ;
time:axis = "T" ;
time:long_name = "time" ;
time:standard_name = "time" ;
double time_bnds(bnds, time) ;
double lat(lat) ;
lat:bounds = "lat_bnds" ;
lat:units = "degrees_north" ;
lat:axis = "Y" ;
lat:long_name = "latitude" ;
lat:standard_name = "latitude" ;
double lat_bnds(bnds, lat) ;
double lon(lon) ;
lon:bounds = "lon_bnds" ;
lon:units = "degrees_east" ;
lon:axis = "X" ;
lon:long_name = "longitude" ;
lon:standard_name = "longitude" ;
double lon_bnds(bnds, lon) ;
double height ;
height:units = "m" ;
height:axis = "Z" ;
height:positive = "up" ;
height:long_name = "height" ;
height:standard_name = "height" ;
float tasmin(lon, lat, time) ;
tasmin:standard_name = "air_temperature" ;
tasmin:long_name = "Daily Minimum Near-Surface Air Temperature" ;
tasmin:units = "K" ;
tasmin:original_name = "mo: m01s03i236" ;
tasmin:cell_methods = "time: minimum" ;
tasmin:cell_measures = "area: areacella" ;
tasmin:history = "2011-08-13T07:05:54Z altered by CMOR: Treated scalar dimension: 'height'. 2011-08-13T07:05:54Z altered by CMOR: replaced missing value flag (-1.07374e+09) with standard missing value (1e+20). 2011-08-13T07:05:54Z altered by CMOR: Inverted axis: lat. 2011-08-13T08:28:56Z altered by CMOR: Inverted axis: lat." ;
tasmin:coordinates = "height" ;
tasmin:missing_value = 1e+20 ;
tasmin:_FillValue = 1e+20 ;
tasmin:associated_files = "baseURL: http://cmip-pcmdi.llnl.gov/CMIP5/dataLocation gridspecFile: gridspec_atmos_fx_HadCM3_rcp45_r0i0p0.nc areacella: areacella_fx_HadCM3_rcp45_r0i0p0.nc" ;
Each variable is an array of one type of elements (source). All elements are of floating point type in the following cases. “double time(time)” means a 1-d array named “time”, with its 1st and only dimension called “time”. “float tasmin (lon, lat, time)” means variable “tasmin” has 3 dimensions, namely lon, lat, and time. The 6 indented lines below “double time(time)” are attributes about variable “time”. ” time:bounds = “time_bnds” ” means the “bounds” attribute of “time” variable has one possible value, namely “time_bnds” (source).
load one variable to an R array with RNetCDF
lat = var.get.nc(ncdfData, variable = "lat")
lon = var.get.nc(ncdfData, variable = "lon")
time = var.get.nc(ncdfData, variable = "time")
height = var.get.nc(ncdfData, variable = "height")
tasmin = var.get.nc(ncdfData, variable = "tasmin")
To check the dimension of the arrays, I use the “dim” function.
 96 73 1800
Following this post, time variable is tricky to deal with, as it usually does not follow the typical Gregorian calendar. From the “print” output above, we can see time uses a “360_day” calendar. Each element denotes the number of days since 1859-12-01. It might not be too important to convert the time to the actual “Date” object.
Visualization provides a good sanity check. This post showed a simple visualization using “image”
tasmin1 = tasmin[,,1]
image(lon, lat, tasmin1, col=heat.colors(15))
This gives the following output
This post gives a more map-like visualization
## first convert to lon to 0-180 longitude
lon180 = (lon + 180) %% 360 - 180
## create map
expand.grid(lon180, lat) %>%
dplyr::rename(lon180 = Var1, lat = Var2) %>%
dplyr::mutate(lon180 = ifelse(lon180 > 180, -(360 - lon180), lon180),
tasmin1 = as.vector(tasmin1)) %>%
dplyr::mutate(tasmin1 = convert_temperature(tasmin1, "k", "c")) %>%
ggplot2::geom_point(aes(x = lon180, y = lat, color = tasmin1),
size = 0.8) +
borders("world", colour="black", fill=NA) +
viridis::scale_color_viridis(option = "inferno", name = "Temperature (C)") +
ggtitle("Modeled min daily surface air temperature on 2031-01-01",
subtitle = "HadCM3 model, RCP4.5 experiment, r6i1p1 ensemble member")
This gives the following plot
subsetting spacially to the US
From this source, bounding box latitude longitude for the US is
top = 49.3457868 # north lat
bottom = 24.7433195 # south lat
left180= -124.7844079 # west long
right180 = -66.9513812 # east long
## convert to 0-360 range
left360 = left %% 360
right360 = right %% 360
idx_top = min(which(lat > top))
idx_bottom = max(which(lat < bottom))
idx_left = max(which(lon < left360))
idx_right = min(which(lon > right360))
## change range of longitude from 0 to 360 to -180 to 180
lat_US = lat[idx_bottom:idx_top]
lon_US = lon[idx_left:idx_right]
## first convert to 0-180 longitude
tasmin1_US = tasmin[idx_left:idx_right,idx_bottom:idx_top,1]
Plotting this gives
Averaging two NetCDF variables by cell
For averaging two variables with the same dimension, we can simply average the two variable arrays
## read the daily max data
ncdfData2 = RNetCDF::open.nc(con="tasmax_day_HadCM3_rcp45_r6i1p1_20310101-20351230.nc")
tasmax = var.get.nc(ncdfData2, variable = "tasmax")
## averaging the daily min and daily max temperature
tasavg = (tasmin + tasmax) / 2
We assume the time period is the same, as averaging two variables over different time is not a meaningful task. When the spatial resolution is different, for example, when averaging outputs from different climate models with different spatial grids, one might need to first interpolate each output to the same spatial resolution, and then apply the simple averaging as above.
Aggregation one NetCDF variable over time
The simple but less elegant way of aggregating over time is to use the “apply” function demonstrated in this post.
## a very strict version to create a grouping variable along the time dimension
group = sort(rep(1:(length(time) / 30), 30))
tasmin_agg = aperm(apply(tasmin, c(1, 2), by, group, sum), c(2, 3, 1))
A more general way to create the grouping variable is to actually convert the numeric time to a date object, and retrieve the month (assuming we want to get the average monthly value). As is demonstrated here, there is a convenient function in R package netcdf4.helper that deals with this 360-day calendar.
ncdfData = ncdf4::nc_open("tasmin_day_HadCM3_rcp45_r6i1p1_20310101-20351230.nc")
tas_time <- ncdf4.helpers::nc.get.time.series(ncdfData, v = "tasmin",
time.dim.name = "time")
## prints the starting and ending elements
tas_time[c(1:3, length(tas_time) - 2:0)]
This gives the correct output that matches the 360_day calendar (unlike the raster package) and the start and end date recorded in the file name. To use this helper function, one must use ncdf4 package to read in the .nc file.
 "2031-01-01 12:00:00" "2031-01-02 12:00:00" "2031-01-03 12:00:00"
 "2035-12-28 12:00:00" "2035-12-29 12:00:00" "2035-12-30 12:00:00"
Then get the grouping variable as follows
tas_time_str = strftime(tas_time, format="%Y-%m-%d %H:%M:%S")
group = substr(tas_time_str, start=1, stop=7)
An alternative way of using the “apply” function is to use the “tbl_cube” object. However, it requires user to include a grouping dimension, and modify the measurement array to have the additional dimension as well. The tbl_cube approach is as prone to mistake as apply and considerably slower, so it’s not a good approach.
Spatial interpolation or regridding
This post gives an overview of common regridding methods.
Common approaches to interpolate geospatial data include bilinear interpolation, nearest-neighbor interpolation, bicubic interpolation (used here), and Kriging (here).
Using package raster
This package seems to be convenient as it automatically converts the time numeric value to the actual date time. There are also built-in visualization and summary routines. The fundamental issue is that the date conversion is not correct.
## load the variable in the nc file to a RasterBrick object
tasmin_br = raster::brick("tasmin_day_HadCM3_rcp45_r6i1p1_20310101-20351230.nc", varname="tasmin")
class : RasterBrick
dimensions : 73, 96, 7008, 1800 (nrow, ncol, ncell, nlayers)
resolution : 3.75, 2.5 (x, y)
extent : -1.875, 358.125, -91.25, 91.25 (xmin, xmax, ymin, ymax)
coord. ref. : NA
data source : /Users/yujiex/Dropbox/thesis/code/cmip5/data-raw/tasmin_day_HadCM3_rcp45_r6i1p1_20310101-20351230.nc
names : X2030.12.31, X2031.01.01, X2031.01.02, X2031.01.03, X2031.01.04, X2031.01.05, X2031.01.06, X2031.01.07, X2031.01.08, X2031.01.09, X2031.01.10, X2031.01.11, X2031.01.12, X2031.01.13, X2031.01.14, ...
Date : 2030-12-31, 2035-12-30 (min, max)
varname : tasmin
The “names” of the RasterBrick object has the converted time. We can see 1) the starting date is one day earlier than is indicated in the .nc filename and 2) the dates contain the 31st, which should not be present in the 360_day calendar type.
According to this post, the 360_day calendar type should have already been added to the package, but it seems that update has not made into CRAN yet.
The conclusion is that either one has to perform the date conversion by themselves and update the default names of the RasterBrick object, or has to find some other package that does this correctly.
Using package stars
This package reads .nc file into a “stars” object. The issue here is again having the wrong time stamp (see the output after typing “nc”).
nc = stars::read_stars("tasmin_day_HadCM3_rcp45_r6i1p1_20310101-20351230.nc")
The nc object is as follows
stars object with 3 dimensions and 1 attribute
attribute(s), summary of first 1e+05 cells:
from to offset delta refsys point values
x 1 96 -1.875 3.75 NA NA NULL [x]
y 1 73 91.25 -2.5 NA NA NULL [y]
time 1 1800 2028-07-17 12:00:00 1 days POSIXct NA NULL
to be done…