Processing CMIP5 NetCDF daily temperature data with R

The tasks I am faced with is to get predicted daily minimum and maximum temperature values under RCP4.5 and RCP8.5 scenarios, for specific locations in the U.S. Based on the tasks, I need to understand how the following subtasks can be done

  • load a .nc file
  • subsetting spacially, as I only need continental US data
  • spatial interpolation, i.e. compute the temperature value for a specific latitude-longitude location, given values on a spacial grid
  • “averaging” two files cell by cell, as I need to compute the average of the daily mean and daily max temperature for each grid point
  • mutate values, as I need to get the count of days in each temperature interval
  • aggregation over time, as I need to count the number of days in a month with the daily mean temperature between two threshold

There are a few R packages for dealing with NetCDF data: ncdf, ncdf4, RNetCDF, raster, stars. From this post, ncdf4 might be the best, as it is the most tolerate of NetCDF file versions, as fast as C, and have a closer resemblance to the R style. Using RNetCDF might be okay too, as CMIP5 data is restricted to NetCDF 3 (see this documentation).

Following I will proceed to investigate how each of the above tasks is done with RNetCDF and ncdf4, based on a few online tutorials like this.

Using RNetCDF

load a .nc file with RNetCDF

library(RNetCDF)
ncdfData = RNetCDF::open.nc(con="tasmin_day_HadCM3_rcp45_r6i1p1_20310101-20351230.nc")

view the content summary with RNetCDF

RNetCDF::nc.print(ncfdData)

This prints the dimensions, variables, and global attributes of the dataset I just loaded.

The following shows the loaded dataset has 4 dimensions. Three variables have fixed size: latitude (lat), longitude (lon), and bounds (bnds). One variable has “UNLIMITED” size. According to here, data can be added along this dimension in the future. The bounds dimension is usually 2, as there are usually a lower and an upper bounds for the variables. According to this google archive page of the NOAA website, “The CF standard for netCDF files defines a bound attribute for coordinate axes, where the upper and lower bounds of the grid cells along an axis are specified by a bounds variable which is of size n*2 for an axis of length N”

dimensions:
time = UNLIMITED ; // (1800 currently)
lat = 73 ;
lon = 96 ;
bnds = 2 ;

The following is the print output of variables in the loaded dataset. Variable time, lat, lon all have a corresponding “_bnds” variable. This means the elements of these three variables have bounded ranges.

variables:
        double time(time) ;
                time:bounds = "time_bnds" ;
                time:units = "days since 1859-12-01" ;
                time:calendar = "360_day" ;
                time:axis = "T" ;
                time:long_name = "time" ;
                time:standard_name = "time" ;
        double time_bnds(bnds, time) ;
        double lat(lat) ;
                lat:bounds = "lat_bnds" ;
                lat:units = "degrees_north" ;
                lat:axis = "Y" ;
                lat:long_name = "latitude" ;
                lat:standard_name = "latitude" ;
        double lat_bnds(bnds, lat) ;
        double lon(lon) ;
                lon:bounds = "lon_bnds" ;
                lon:units = "degrees_east" ;
                lon:axis = "X" ;
                lon:long_name = "longitude" ;
                lon:standard_name = "longitude" ;
        double lon_bnds(bnds, lon) ;
        double height ;
                height:units = "m" ;
                height:axis = "Z" ;
                height:positive = "up" ;
                height:long_name = "height" ;
                height:standard_name = "height" ;
        float tasmin(lon, lat, time) ;
                tasmin:standard_name = "air_temperature" ;
                tasmin:long_name = "Daily Minimum Near-Surface Air Temperature" ;
                tasmin:units = "K" ;
                tasmin:original_name = "mo: m01s03i236" ;
                tasmin:cell_methods = "time: minimum" ;
                tasmin:cell_measures = "area: areacella" ;
                tasmin:history = "2011-08-13T07:05:54Z altered by CMOR: Treated scalar dimension: 'height'. 2011-08-13T07:05:54Z altered by CMOR: replaced missing value flag (-1.07374e+09) with standard missing value (1e+20). 2011-08-13T07:05:54Z altered by CMOR: Inverted axis: lat. 2011-08-13T08:28:56Z altered by CMOR: Inverted axis: lat." ;
                tasmin:coordinates = "height" ;
                tasmin:missing_value = 1e+20 ;
                tasmin:_FillValue = 1e+20 ;
                tasmin:associated_files = "baseURL: http://cmip-pcmdi.llnl.gov/CMIP5/dataLocation gridspecFile: gridspec_atmos_fx_HadCM3_rcp45_r0i0p0.nc areacella: areacella_fx_HadCM3_rcp45_r0i0p0.nc" ;

Each variable is an array of one type of elements (source). All elements are of floating point type in the following cases. “double time(time)” means a 1-d array named “time”, with its 1st and only dimension called “time”. “float tasmin (lon, lat, time)” means variable “tasmin” has 3 dimensions, namely lon, lat, and time. The 6 indented lines below “double time(time)” are attributes about variable “time”. ” time:bounds = “time_bnds” ” means the “bounds” attribute of “time” variable has one possible value, namely “time_bnds” (source).

load one variable to an R array with RNetCDF

lat = var.get.nc(ncdfData, variable = "lat")
lon = var.get.nc(ncdfData, variable = "lon")
time = var.get.nc(ncdfData, variable = "time")
height = var.get.nc(ncdfData, variable = "height")
tasmin = var.get.nc(ncdfData, variable = "tasmin")

To check the dimension of the arrays, I use the “dim” function.

> dim(lat)
[1] 73
> dim(lon)
[1] 96
> dim(time)
[1] 1800
> dim(height)
[1] 1
> dim(tasmin)
[1]   96   73 1800

Following this post, time variable is tricky to deal with, as it usually does not follow the typical Gregorian calendar. From the “print” output above, we can see time uses a “360_day” calendar. Each element denotes the number of days since 1859-12-01. It might not be too important to convert the time to the actual “Date” object.

visualizing

Visualization provides a good sanity check. This post showed a simple visualization using “image”

tasmin1 = tasmin[,,1]
image(lon, lat, tasmin1, col=heat.colors(15))

This gives the following output

simple_plot_with_image

This post gives a more map-like visualization

## first convert to lon to 0-180 longitude
lon180 = (lon + 180) %% 360 - 180

## create map
expand.grid(lon180, lat) %>%
  dplyr::rename(lon180 = Var1, lat = Var2) %>%
  dplyr::mutate(lon180 = ifelse(lon180 > 180, -(360 - lon180), lon180),
         tasmin1 = as.vector(tasmin1)) %>%
  dplyr::mutate(tasmin1 = convert_temperature(tasmin1, "k", "c")) %>%
  ggplot2::ggplot() +
  ggplot2::geom_point(aes(x = lon180, y = lat, color = tasmin1),
             size = 0.8) +
  borders("world", colour="black", fill=NA) +
  viridis::scale_color_viridis(option = "inferno", name = "Temperature (C)") +
  theme_void() +
  coord_quickmap() +
  ggtitle("Modeled min daily surface air temperature on 2031-01-01",
          subtitle = "HadCM3 model, RCP4.5 experiment, r6i1p1 ensemble member")

 

This gives the following plot

better_map

subsetting spacially to the US

From this source, bounding box latitude longitude for the US is

top = 49.3457868 # north lat
bottom =  24.7433195 # south lat
left180= -124.7844079 # west long
right180 = -66.9513812 # east long

## convert to 0-360 range
left360 = left %% 360
right360 = right %% 360

idx_top = min(which(lat > top))
idx_bottom = max(which(lat < bottom))
idx_left = max(which(lon < left360))
idx_right = min(which(lon > right360))

## change range of longitude from 0 to 360 to -180 to 180
lat_US = lat[idx_bottom:idx_top]
lon_US = lon[idx_left:idx_right]

## first convert to 0-180 longitude
tasmin1_US = tasmin[idx_left:idx_right,idx_bottom:idx_top,1]

Plotting this gives

better_map_us

Averaging two NetCDF variables by cell

For averaging two variables with the same dimension, we can simply average the two variable arrays

## read the daily max data
ncdfData2 = RNetCDF::open.nc(con="tasmax_day_HadCM3_rcp45_r6i1p1_20310101-20351230.nc")
tasmax = var.get.nc(ncdfData2, variable = "tasmax")

## averaging the daily min and daily max temperature
tasavg = (tasmin + tasmax) / 2

We assume the time period is the same, as averaging two variables over different time is not a meaningful task. When the spatial resolution is different, for example, when averaging outputs from different climate models with different spatial grids, one might need to first interpolate each output to the same spatial resolution, and then apply the simple averaging as above.

Aggregation one NetCDF variable over time

The simple but less elegant way of aggregating over time is to use the “apply” function demonstrated in this post.

## a very strict version to create a grouping variable along the time dimension
group = sort(rep(1:(length(time) / 30), 30))

tasmin_agg = aperm(apply(tasmin, c(1, 2), by, group, sum), c(2, 3, 1))

A more general way to create the grouping variable is to actually convert the numeric time to a date object, and retrieve the month (assuming we want to get the average monthly value). As is demonstrated here, there is a convenient function in R package netcdf4.helper that deals with this 360-day calendar.

library(ncdf4)
library(ncdf4.helper)
library(PCICt)

ncdfData = ncdf4::nc_open("tasmin_day_HadCM3_rcp45_r6i1p1_20310101-20351230.nc")
tas_time <- ncdf4.helpers::nc.get.time.series(ncdfData, v = "tasmin",
time.dim.name = "time")
## prints the starting and ending elements
tas_time[c(1:3, length(tas_time) - 2:0)]

This gives the correct output that matches the 360_day calendar (unlike the raster package) and the start and end date recorded in the file name. To use this helper function, one must use ncdf4 package to read in the .nc file.

[1] "2031-01-01 12:00:00" "2031-01-02 12:00:00" "2031-01-03 12:00:00"
[4] "2035-12-28 12:00:00" "2035-12-29 12:00:00" "2035-12-30 12:00:00"

Then get the grouping variable as follows

tas_time_str = strftime(tas_time, format="%Y-%m-%d %H:%M:%S")
group = substr(tas_time_str, start=1, stop=7)

An alternative way of using the “apply” function is to use the “tbl_cube” object. However, it requires user to include a grouping dimension, and modify the measurement array to have the additional dimension as well. The tbl_cube approach is as prone to mistake as apply and considerably slower, so it’s not a good approach.

Spatial interpolation or regridding

This post gives an overview of common regridding methods.

Common approaches to interpolate geospatial data include bilinear interpolation, nearest-neighbor interpolation, bicubic interpolation (used here), and Kriging (here).

Using package raster

This package seems to be convenient as it automatically converts the time numeric value to the actual date time. There are also built-in visualization and summary routines. The fundamental issue is that the date conversion is not correct.

library(raster)

## load the variable in the nc file to a RasterBrick object
tasmin_br = raster::brick("tasmin_day_HadCM3_rcp45_r6i1p1_20310101-20351230.nc", varname="tasmin")

> tasmin_br
class       : RasterBrick 
dimensions  : 73, 96, 7008, 1800  (nrow, ncol, ncell, nlayers)
resolution  : 3.75, 2.5  (x, y)
extent      : -1.875, 358.125, -91.25, 91.25  (xmin, xmax, ymin, ymax)
coord. ref. : NA 
data source : /Users/yujiex/Dropbox/thesis/code/cmip5/data-raw/tasmin_day_HadCM3_rcp45_r6i1p1_20310101-20351230.nc 
names       : X2030.12.31, X2031.01.01, X2031.01.02, X2031.01.03, X2031.01.04, X2031.01.05, X2031.01.06, X2031.01.07, X2031.01.08, X2031.01.09, X2031.01.10, X2031.01.11, X2031.01.12, X2031.01.13, X2031.01.14, ... 
Date        : 2030-12-31, 2035-12-30 (min, max)
varname     : tasmin

The “names” of the RasterBrick object has the converted time. We can see 1) the starting date is one day earlier than is indicated in the .nc filename and 2) the dates contain the 31st, which should not be present in the 360_day calendar type.

According to this post, the 360_day calendar type should have already been added to the package, but it seems that update has not made into CRAN yet.

The conclusion is that either one has to perform the date conversion by themselves and update the default names of the RasterBrick object, or has to find some other package that does this correctly.

Using package stars

This package reads .nc file into a “stars” object. The issue here is again having the wrong time stamp (see the output after typing “nc”).

library(stars)
nc = stars::read_stars("tasmin_day_HadCM3_rcp45_r6i1p1_20310101-20351230.nc")

The nc object is as follows

> nc
stars object with 3 dimensions and 1 attribute
attribute(s), summary of first 1e+05 cells:
 tasmin_day_HadCM3_rcp45_r6i1p1_20310101-20351230.nc 
 Min.   :200.0                                       
 1st Qu.:253.1                                       
 Median :277.5                                       
 Mean   :273.6                                       
 3rd Qu.:294.5                                       
 Max.   :306.1                                       
dimension(s):
     from   to              offset  delta  refsys point values    
x       1   96              -1.875   3.75      NA    NA   NULL [x]
y       1   73               91.25   -2.5      NA    NA   NULL [y]
time    1 1800 2028-07-17 12:00:00 1 days POSIXct    NA   NULL

Using ncdf4

to be done…

 

 

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s