Downloading CMIP5 data

The ESGF portal I used is this one hosted by Department of Energy and Lawrence Livermore National Laboratory. In order to download data, one needs to register an account and get an OpenID. For my use case, I filter by the following criteria

project = CMIP5, experiment = rcp45, time frequency = day, 
variable = tasman,tasmin

This gives me 98 results of about 30 models.

screen shot 2019-01-13 at 1.11.11 pm

The most convenient way to bulk download data is to use the WGET script. First, I add the ones I want to download into the “Data Cart”. Then click “My Data Cart” to go to the cart. You can filter the results by variables by entering the variable name in the top right text box.

screen shot 2019-01-13 at 1.23.42 pm

For example, before filtering, there will be 319 files in the first dataset. After filtering with “tasmin”, when you click “List Files” of the first dataset, there are only 11 files left.

After applying the filters, I click “Select All Datasets”, then click on the “WGET Script” at the very top. This prompt me a small window. I click on the “WGET Script for” link in that window. This will download the WGET script for downloading all selected items in my data cart.

Last, I open the terminal and type the following to start downloading. It will prompt you to enter the OpenID and password.

bash -H

Files will be downloaded in the same folder as the wget script. Depending on time duration, some files are several hundred MBs, and some are over 1GB.

How to read and process the downloaded .nc files is described in another post.

Use CMIP5 data for climate impact projection for climate change scenarios (RCP 4.5 vs 8.5)


My main project is to estimate the association between the total energy consumption of two portfolios and the temperature conditions. I want to add a projection of how much energy the two portfolios would consume under future climate change conditions, using the method in Deschênese and Greenstone.

One sub-task of the impact projection analysis is to download daily minimum and maximum data under two climate change scenarios: RCP4.5 and RCP 8.5. RCP stands for representative concentration pathways. 4.5 and 8.5 means radioactive forcing will increase by 4.5 W/m2 or 8.5 W/m2 by the end of the century (source).

Coupled Model Intercomparison Project Phase 5 (CMIP5) contains climate simulation data of various models under various experimental settings. It is the basis of the IPCC 5th assessment report (source). This page has an overview of CMIP5.

How to get the data

Getting the raw data

I retrieved CMIP5 data from one of the suggested portals, PCMDI: Registration is required to download data. The variables I need to download are tasmax (“Daily Maximum Near-Surface Air Temperature”) and tasmin (“Daily Minimum Near-Surface Air Temperature”). The list of variables and their meanings can be found here.

The searching and downloading portal is here. In my case, I chose the following search filters: project = CMIP5, experiment = rcp45 or rcp85, time frequency = day, variable = tasmax or tasmin. This gives me 166 search results (97 results are in the rcp45 experiment), and 33 models. Each search result contains a list of files, each with format NetCDF (.nc), a self-documenting multi-dimensional array containing one variable.

The R package RNetCDF or ncdf4 can process NetCDF files. This page has a tutorial of using ncdf4 to process NetCDF files. For RNetCDF, see my post here using a sample file

We need to understand several important terms in order correctly use the data, the explanations can be found here and here. I summarized some key concepts based on these two sources

  • experiment: simulation runs regarding one question, for example, rcp45 and rcp85 are experiments. There is also “experimental family”, concerning a larger scale question, for example, “rcp” is the experimental family of the above two experiments.
  • ensemble member: according to source 1, an ensemble member is a simulation run with specific settings of initial states, initialization methods, and physical perturbations. Each such ensemble member is denoted with r<a>i<b>p<c>, with a, b, c being integers. r1i2p3 means the first run with initial state 1, initialization method 2, physical setting (or perturbation) 3. “0” is reserved for variables not changing over time. Historical and rcp simulations with the same realization number can be concatenated
  • ensemble: haven’t found a definition
  • model: this page listed the series of models used in the CMIP5 project. From the table, we can see different models have different spatial resolutions. All models have atmospheric data, some don’t have ocean data or only time-invariant ocean data. Some models have different resolutions for atmosphere and ocean.

A few questions remain. Following is a brief investigation of such questions.

Which model to choose? Does not matter, as long as you use ensemble average of many models.

(Pierce et al. 2009) investigated the model selection and aggregating methods for using CMIP3 data to conduct regional “detection and attribution” studies in the western US. They used surface min temperature as the target variable. 42 metrics are used to evaluate model accuracy. Each metric is a “skill score” derived from spatial MSE of a variable. They that selecting models with top skill scores is not critical for the D&A analysis. The best strategy is to use the ensemble average of several models.

How do I aggregate multiple models? Challenging…

Aggregating multiple models are challenging: (Knutti et al. 2010) discussed some of the challenges in aggregating results of multiple models. The major issues are: the prediction accuracy of the current period might not well reflect the accuracy of future; averaging models could wipe out some important patterns, and there are no agreed-upon rules to rate the quality of a model.

Should I use raw data or calibrate it? Calibrate it.

Also according to this post, it is bad to use one model’s raw data directly for impact analysis.

(Ho et al. 2012) have discussed two major calibration methods: the “bias correction” method and the “change factor” method”. Both use CDF and quantile functions to match changes. The difference is that “bias correction” matches the changes between the model and the observed, and “change factor” matches the changes between the current and future.

How to get the climate prediction for specific locations using the climate models?

to be investigated…

Getting the downscaled data directly

Downloading raw data for all model and ensemble members is itself time and space consuming (a couple hundred GB). After getting the raw data, there’s still the issue of unifying their spatial resolution, and ensemble the model outputs. A neater choice is to get the spatially downscaled and bias-corrected US data from the archive of Downscaled CMIP3 and CMIP5 Climate and Hydrology Projections. According to this page, three downscaling products are available in this archive: BCCA and LOCA have daily data, and BCSD has monthly data. LOCA has a higher spatial resolution (1/6 x 1/6 degree, or around 6km x 6km) than BCCA (1/8 x 1/8 degree, or around 12km x 12km). This page has some information about the downscaling methods.

You can even query a spatial-temporal subset of the downscaled data under “Projections: Subset request“.

Processing CMIP5 NetCDF daily temperature data with R

The tasks I am faced with is to get predicted daily minimum and maximum temperature values under RCP4.5 and RCP8.5 scenarios, for specific locations in the U.S. Based on the tasks, I need to understand how the following subtasks can be done

  • load a .nc file
  • subsetting spacially, as I only need continental US data
  • spatial interpolation, i.e. compute the temperature value for a specific latitude-longitude location, given values on a spacial grid
  • “averaging” two files cell by cell, as I need to compute the average of the daily mean and daily max temperature for each grid point
  • mutate values, as I need to get the count of days in each temperature interval
  • aggregation over time, as I need to count the number of days in a month with the daily mean temperature between two threshold

There are a few R packages for dealing with NetCDF data: ncdf, ncdf4, RNetCDF, raster, stars. From this post, ncdf4 might be the best, as it is the most tolerate of NetCDF file versions, as fast as C, and have a closer resemblance to the R style. Using RNetCDF might be okay too, as CMIP5 data is restricted to NetCDF 3 (see this documentation).

Following I will proceed to investigate how each of the above tasks is done with RNetCDF and ncdf4, based on a few online tutorials like this.

Using RNetCDF

load a .nc file with RNetCDF

ncdfData ="")

view the content summary with RNetCDF


This prints the dimensions, variables, and global attributes of the dataset I just loaded.

The following shows the loaded dataset has 4 dimensions. Three variables have fixed size: latitude (lat), longitude (lon), and bounds (bnds). One variable has “UNLIMITED” size. According to here, data can be added along this dimension in the future. The bounds dimension is usually 2, as there are usually a lower and an upper bounds for the variables. According to this google archive page of the NOAA website, “The CF standard for netCDF files defines a bound attribute for coordinate axes, where the upper and lower bounds of the grid cells along an axis are specified by a bounds variable which is of size n*2 for an axis of length N”

time = UNLIMITED ; // (1800 currently)
lat = 73 ;
lon = 96 ;
bnds = 2 ;

The following is the print output of variables in the loaded dataset. Variable time, lat, lon all have a corresponding “_bnds” variable. This means the elements of these three variables have bounded ranges.

        double time(time) ;
                time:bounds = "time_bnds" ;
                time:units = "days since 1859-12-01" ;
                time:calendar = "360_day" ;
                time:axis = "T" ;
                time:long_name = "time" ;
                time:standard_name = "time" ;
        double time_bnds(bnds, time) ;
        double lat(lat) ;
                lat:bounds = "lat_bnds" ;
                lat:units = "degrees_north" ;
                lat:axis = "Y" ;
                lat:long_name = "latitude" ;
                lat:standard_name = "latitude" ;
        double lat_bnds(bnds, lat) ;
        double lon(lon) ;
                lon:bounds = "lon_bnds" ;
                lon:units = "degrees_east" ;
                lon:axis = "X" ;
                lon:long_name = "longitude" ;
                lon:standard_name = "longitude" ;
        double lon_bnds(bnds, lon) ;
        double height ;
                height:units = "m" ;
                height:axis = "Z" ;
                height:positive = "up" ;
                height:long_name = "height" ;
                height:standard_name = "height" ;
        float tasmin(lon, lat, time) ;
                tasmin:standard_name = "air_temperature" ;
                tasmin:long_name = "Daily Minimum Near-Surface Air Temperature" ;
                tasmin:units = "K" ;
                tasmin:original_name = "mo: m01s03i236" ;
                tasmin:cell_methods = "time: minimum" ;
                tasmin:cell_measures = "area: areacella" ;
                tasmin:history = "2011-08-13T07:05:54Z altered by CMOR: Treated scalar dimension: 'height'. 2011-08-13T07:05:54Z altered by CMOR: replaced missing value flag (-1.07374e+09) with standard missing value (1e+20). 2011-08-13T07:05:54Z altered by CMOR: Inverted axis: lat. 2011-08-13T08:28:56Z altered by CMOR: Inverted axis: lat." ;
                tasmin:coordinates = "height" ;
                tasmin:missing_value = 1e+20 ;
                tasmin:_FillValue = 1e+20 ;
                tasmin:associated_files = "baseURL: gridspecFile: areacella:" ;

Each variable is an array of one type of elements (source). All elements are of floating point type in the following cases. “double time(time)” means a 1-d array named “time”, with its 1st and only dimension called “time”. “float tasmin (lon, lat, time)” means variable “tasmin” has 3 dimensions, namely lon, lat, and time. The 6 indented lines below “double time(time)” are attributes about variable “time”. ” time:bounds = “time_bnds” ” means the “bounds” attribute of “time” variable has one possible value, namely “time_bnds” (source).

load one variable to an R array with RNetCDF

lat =, variable = "lat")
lon =, variable = "lon")
time =, variable = "time")
height =, variable = "height")
tasmin =, variable = "tasmin")

To check the dimension of the arrays, I use the “dim” function.

> dim(lat)
[1] 73
> dim(lon)
[1] 96
> dim(time)
[1] 1800
> dim(height)
[1] 1
> dim(tasmin)
[1]   96   73 1800

Following this post, time variable is tricky to deal with, as it usually does not follow the typical Gregorian calendar. From the “print” output above, we can see time uses a “360_day” calendar. Each element denotes the number of days since 1859-12-01. It might not be too important to convert the time to the actual “Date” object.


Visualization provides a good sanity check. This post showed a simple visualization using “image”

tasmin1 = tasmin[,,1]
image(lon, lat, tasmin1, col=heat.colors(15))

This gives the following output


This post gives a more map-like visualization

## first convert to lon to 0-180 longitude
lon180 = (lon + 180) %% 360 - 180

## create map
expand.grid(lon180, lat) %>%
  dplyr::rename(lon180 = Var1, lat = Var2) %>%
  dplyr::mutate(lon180 = ifelse(lon180 > 180, -(360 - lon180), lon180),
         tasmin1 = as.vector(tasmin1)) %>%
  dplyr::mutate(tasmin1 = convert_temperature(tasmin1, "k", "c")) %>%
  ggplot2::ggplot() +
  ggplot2::geom_point(aes(x = lon180, y = lat, color = tasmin1),
             size = 0.8) +
  borders("world", colour="black", fill=NA) +
  viridis::scale_color_viridis(option = "inferno", name = "Temperature (C)") +
  theme_void() +
  coord_quickmap() +
  ggtitle("Modeled min daily surface air temperature on 2031-01-01",
          subtitle = "HadCM3 model, RCP4.5 experiment, r6i1p1 ensemble member")


This gives the following plot


subsetting spacially to the US

From this source, bounding box latitude longitude for the US is

top = 49.3457868 # north lat
bottom =  24.7433195 # south lat
left180= -124.7844079 # west long
right180 = -66.9513812 # east long

## convert to 0-360 range
left360 = left %% 360
right360 = right %% 360

idx_top = min(which(lat > top))
idx_bottom = max(which(lat < bottom))
idx_left = max(which(lon < left360))
idx_right = min(which(lon > right360))

## change range of longitude from 0 to 360 to -180 to 180
lat_US = lat[idx_bottom:idx_top]
lon_US = lon[idx_left:idx_right]

## first convert to 0-180 longitude
tasmin1_US = tasmin[idx_left:idx_right,idx_bottom:idx_top,1]

Plotting this gives


Averaging two NetCDF variables by cell

For averaging two variables with the same dimension, we can simply average the two variable arrays

## read the daily max data
ncdfData2 ="")
tasmax =, variable = "tasmax")

## averaging the daily min and daily max temperature
tasavg = (tasmin + tasmax) / 2

We assume the time period is the same, as averaging two variables over different time is not a meaningful task. When the spatial resolution is different, for example, when averaging outputs from different climate models with different spatial grids, one might need to first interpolate each output to the same spatial resolution, and then apply the simple averaging as above.

Aggregation one NetCDF variable over time

The simple but less elegant way of aggregating over time is to use the “apply” function demonstrated in this post.

## a very strict version to create a grouping variable along the time dimension
group = sort(rep(1:(length(time) / 30), 30))

tasmin_agg = aperm(apply(tasmin, c(1, 2), by, group, sum), c(2, 3, 1))

A more general way to create the grouping variable is to actually convert the numeric time to a date object, and retrieve the month (assuming we want to get the average monthly value). As is demonstrated here, there is a convenient function in R package netcdf4.helper that deals with this 360-day calendar.


ncdfData = ncdf4::nc_open("")
tas_time <- ncdf4.helpers::nc.get.time.series(ncdfData, v = "tasmin", = "time")
## prints the starting and ending elements
tas_time[c(1:3, length(tas_time) - 2:0)]

This gives the correct output that matches the 360_day calendar (unlike the raster package) and the start and end date recorded in the file name. To use this helper function, one must use ncdf4 package to read in the .nc file.

[1] "2031-01-01 12:00:00" "2031-01-02 12:00:00" "2031-01-03 12:00:00"
[4] "2035-12-28 12:00:00" "2035-12-29 12:00:00" "2035-12-30 12:00:00"

Then get the grouping variable as follows

tas_time_str = strftime(tas_time, format="%Y-%m-%d %H:%M:%S")
group = substr(tas_time_str, start=1, stop=7)

An alternative way of using the “apply” function is to use the “tbl_cube” object. However, it requires user to include a grouping dimension, and modify the measurement array to have the additional dimension as well. The tbl_cube approach is as prone to mistake as apply and considerably slower, so it’s not a good approach.

Spatial interpolation or regridding

This post gives an overview of common regridding methods.

Common approaches to interpolate geospatial data include bilinear interpolation, nearest-neighbor interpolation, bicubic interpolation (used here), and Kriging (here).

Using package raster

This package seems to be convenient as it automatically converts the time numeric value to the actual date time. There are also built-in visualization and summary routines. The fundamental issue is that the date conversion is not correct.


## load the variable in the nc file to a RasterBrick object
tasmin_br = raster::brick("", varname="tasmin")

> tasmin_br
class       : RasterBrick 
dimensions  : 73, 96, 7008, 1800  (nrow, ncol, ncell, nlayers)
resolution  : 3.75, 2.5  (x, y)
extent      : -1.875, 358.125, -91.25, 91.25  (xmin, xmax, ymin, ymax)
coord. ref. : NA 
data source : /Users/yujiex/Dropbox/thesis/code/cmip5/data-raw/ 
names       : X2030.12.31, X2031.01.01, X2031.01.02, X2031.01.03, X2031.01.04, X2031.01.05, X2031.01.06, X2031.01.07, X2031.01.08, X2031.01.09, X2031.01.10, X2031.01.11, X2031.01.12, X2031.01.13, X2031.01.14, ... 
Date        : 2030-12-31, 2035-12-30 (min, max)
varname     : tasmin

The “names” of the RasterBrick object has the converted time. We can see 1) the starting date is one day earlier than is indicated in the .nc filename and 2) the dates contain the 31st, which should not be present in the 360_day calendar type.

According to this post, the 360_day calendar type should have already been added to the package, but it seems that update has not made into CRAN yet.

The conclusion is that either one has to perform the date conversion by themselves and update the default names of the RasterBrick object, or has to find some other package that does this correctly.

Using package stars

This package reads .nc file into a “stars” object. The issue here is again having the wrong time stamp (see the output after typing “nc”).

nc = stars::read_stars("")

The nc object is as follows

> nc
stars object with 3 dimensions and 1 attribute
attribute(s), summary of first 1e+05 cells: 
 Min.   :200.0                                       
 1st Qu.:253.1                                       
 Median :277.5                                       
 Mean   :273.6                                       
 3rd Qu.:294.5                                       
 Max.   :306.1                                       
     from   to              offset  delta  refsys point values    
x       1   96              -1.875   3.75      NA    NA   NULL [x]
y       1   73               91.25   -2.5      NA    NA   NULL [y]
time    1 1800 2028-07-17 12:00:00 1 days POSIXct    NA   NULL

Using ncdf4

to be done…



Weather normalization

According to Portfolio Manager Technical Reference: Climate and Weather, “weather normalized energy is the energy your building would have used under average conditions”. This is a causal problem, rather than a prediction problem, as the “would have” implies actively setting weather condition to the climate normal. Essentially we try to estimate

E[\text{energy} \mid \text{set weather} = w].

Following is a causal graph. There are many variables affecting the energy consumption of a building, but none of them causally affects weather (neglecting global climate change), so there are no confounding variables to adjust for. This makes it viable to estimate the causal effect of weather using prediction techniques.


The causal graph when thinking of weather as a whole

However, “weather” itself is complicated. Temperature is usually thought of as the most representative variable of weather, but other factors such as solar radiation, cloud cover, and wind speed, can cause changes in weather and changes in building energy consumption. From this point of view, if we wanted to know the causal effect of “temperature” (or sometimes degree day) on building energy consumption, solar radiation, cloud cover, wind, etc. becomes confounding factors (see the causal graph below)


The causal graph when thinking of weather as just temperature

From the above reasoning, the following question arises: is it viable to use usually prediction techniques to conduct model and variable selection when constructing the appropriate “weather” variable? i.e. select which of the weather variables to include in the \vec{\text{weather}} when we estimate E[\text{energy} \mid \text{weather}].

Now if we step back and look at the original goal of weather normalization: comparing the “performance” (efficiency?) of energy consumption over time, then this seems to call for some metric describing efficiency. The causal effect of weather on temperature might be one of them. Using this metric, maybe the “base load” (non-weather-dependent consumption) could be evaluated as E[\text{energy} \mid \text{temperature} = 75F].

After discussing with a stats professor, I got the idea that I need to look into doubly robust estimator to estimate the quantity of E[\text{energy} \mid \text{weather} = w], where w is a set of possible weather variables.