Intro
My main project is to estimate the association between the total energy consumption of two portfolios and temperature conditions. I want to add a projection of how much energy the two portfolios would consume under future climate change conditions, using the method in Deschênes and Greenstone.
One sub-task of the impact projection analysis is to download daily minimum and maximum temperature data under two climate change scenarios: RCP4.5 and RCP8.5. RCP stands for Representative Concentration Pathway; 4.5 and 8.5 mean that radiative forcing will increase by 4.5 W/m2 or 8.5 W/m2 by the end of the century (source).
Coupled Model Intercomparison Project Phase 5 (CMIP5) contains climate simulation data from various models under various experimental settings. It is the basis of the IPCC Fifth Assessment Report (source). This page has an overview of CMIP5.
How to get the data
Getting the raw data
I retrieved CMIP5 data from one of the suggested portals, PCMDI: http://pcmdi9.llnl.gov/. Registration is required to download data. The variables I need to download are tasmax (“Daily Maximum Near-Surface Air Temperature”) and tasmin (“Daily Minimum Near-Surface Air Temperature”). The list of variables and their meanings can be found here.
The search and download portal is here. In my case, I chose the following search filters: project = CMIP5, experiment = rcp45 or rcp85, time frequency = day, variable = tasmax or tasmin. This gives me 166 search results (97 of them in the rcp45 experiment) across 33 models. Each search result contains a list of files in NetCDF (.nc) format, a self-documenting multi-dimensional array format; each file contains one variable.
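As an aside, ESGF portals like this one also expose a RESTful search API that mirrors the web filters. The endpoint path and facet names below are my assumptions based on the ESGF search API and may vary by portal version, so treat this as a sketch rather than a tested recipe:

```r
# Assumed ESGF search endpoint and facet names; verify against your portal.
url <- paste0(
  "http://pcmdi9.llnl.gov/esg-search/search",
  "?project=CMIP5",
  "&experiment=rcp45",
  "&time_frequency=day",
  "&variable=tasmax",
  "&limit=10"
)
res <- readLines(url, warn = FALSE)  # XML/JSON listing of matching datasets
```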
The R packages RNetCDF and ncdf4 can process NetCDF files. This page has a tutorial on using ncdf4 to process NetCDF files. For RNetCDF, see my post here using a sample file tasmin_day_HadCM3_rcp45_r6i1p1_20310101-20351230.nc.
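As a quick illustration, here is a minimal ncdf4 sketch that opens the sample file above and reads the tasmin variable (CMIP5 temperatures are stored in Kelvin); adjust the path to wherever you saved the file:

```r
library(ncdf4)

nc <- nc_open("tasmin_day_HadCM3_rcp45_r6i1p1_20310101-20351230.nc")
print(nc)  # lists the dimensions (lon, lat, time) and variable metadata

lon    <- ncvar_get(nc, "lon")
lat    <- ncvar_get(nc, "lat")
time   <- ncvar_get(nc, "time")           # days since a reference date
tunits <- ncatt_get(nc, "time", "units")  # e.g. "days since ..."
tasmin <- ncvar_get(nc, "tasmin")         # lon x lat x time array, in Kelvin

tasmin_day1_C <- tasmin[, , 1] - 273.15   # first day, converted to Celsius

nc_close(nc)
```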
We need to understand several important terms in order to use the data correctly; the explanations can be found here and here. I summarized some key concepts based on these two sources:
- experiment: simulation runs addressing one question; for example, rcp45 and rcp85 are experiments. There is also the "experiment family", which concerns a larger-scale question; for example, "rcp" is the experiment family of the above two experiments.
- ensemble member: according to source 1, an ensemble member is a simulation run with specific settings for the initial state, initialization method, and physics perturbation. Each ensemble member is denoted r<a>i<b>p<c>, with a, b, and c being integers: r1i2p3 denotes realization 1, initialization method 2, and physics version (or perturbation) 3. "0" (as in r0i0p0) is reserved for variables not changing over time. Historical and rcp simulations with the same realization number can be concatenated. The ensemble member is also encoded in the file names (see the parsing sketch after this list).
- ensemble: I haven't found a definition.
- model: this page lists the models used in the CMIP5 project. From the table, we can see that different models have different spatial resolutions. All models have atmospheric data; some have no ocean data or only time-invariant ocean data, and some have different resolutions for the atmosphere and the ocean.
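Since the ensemble member is encoded in the CMIP5 file name convention (roughly <variable>_<MIP table>_<model>_<experiment>_<ensemble>_<dates>.nc), a small helper can recover this metadata. The field labels below are my own, chosen for illustration:

```r
# Parse the CMIP5 file name convention, using the sample file from above.
f <- "tasmin_day_HadCM3_rcp45_r6i1p1_20310101-20351230.nc"
parts <- strsplit(sub("\\.nc$", "", f), "_")[[1]]
meta <- setNames(as.list(parts),
                 c("variable", "table", "model", "experiment",
                   "ensemble", "dates"))

# Split r<a>i<b>p<c> into its three integers:
# realization, initialization method, physics version -> 6 1 1 here
rip <- as.integer(regmatches(meta$ensemble,
                             gregexpr("[0-9]+", meta$ensemble))[[1]])
```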
A few questions remain. Below is a brief investigation of each.
Which model to choose? It does not matter much, as long as you use the ensemble average of many models.
Pierce et al. (2009) investigated model selection and aggregation methods for using CMIP3 data to conduct regional "detection and attribution" studies in the western US, with surface minimum temperature as the target variable. They used 42 metrics to evaluate model accuracy, each a "skill score" derived from the spatial MSE of a variable. They found that selecting models with top skill scores is not critical for the D&A analysis; the best strategy is to use the ensemble average of several models.
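To make the idea concrete, here is a toy sketch of an MSE-based skill score and a multi-model ensemble mean, assuming the model fields have already been regridded to a common grid. The exact score in Pierce et al. (2009) may be formulated differently; this only illustrates the generic idea:

```r
# Skill score: 1 = perfect match with observations,
# 0 = no better than predicting the observed spatial mean everywhere.
skill_score <- function(model_field, obs_field) {
  mse <- mean((model_field - obs_field)^2)
  1 - mse / mean((obs_field - mean(obs_field))^2)
}

# Multi-model ensemble mean: element-wise average over a list of
# matrices on the same grid.
ensemble_mean <- function(fields) Reduce(`+`, fields) / length(fields)
```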
How do I aggregate multiple models? Challenging…
Aggregating multiple models is challenging. Knutti et al. (2010) discussed some of the challenges in aggregating the results of multiple models. The major issues are: prediction accuracy for the current period might not reflect accuracy for the future; averaging models could wipe out some important patterns; and there are no agreed-upon rules for rating the quality of a model.
Should I use raw data or calibrate it? Calibrate it.
Also, according to this post, it is a bad idea to use one model's raw output directly for impact analysis.
Ho et al. (2012) discussed two major calibration methods: the "bias correction" method and the "change factor" method. Both use CDFs and quantile functions to map values. The difference is that "bias correction" corrects future model output using the current-period discrepancy between the model and the observations, while "change factor" applies the modeled change between the current and future periods to the observations.
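Here is a toy sketch of the two ideas via empirical quantile mapping, assuming three vectors of daily temperatures (names are illustrative): obs (observed, current period), mod_cur, and mod_fut (model output for the current and future periods). This is my reading of the general approach, not the exact procedure in Ho et al. (2012):

```r
# Bias correction: map each future model value through the current-period
# model CDF, then invert with the observed CDF.
bias_correct <- function(obs, mod_cur, mod_fut) {
  p <- ecdf(mod_cur)(mod_fut)
  quantile(obs, probs = p, names = FALSE)
}

# Change factor: shift each observed value by the modeled
# current-to-future change at its quantile.
change_factor <- function(obs, mod_cur, mod_fut) {
  p <- ecdf(obs)(obs)
  obs + (quantile(mod_fut, p, names = FALSE) -
         quantile(mod_cur, p, names = FALSE))
}
```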
How to get climate predictions for specific locations using the climate models?
to be investigated…
Getting the downscaled data directly
Downloading the raw data for all models and ensemble members is itself time- and space-consuming (a couple hundred GB). After getting the raw data, there is still the issue of unifying the spatial resolutions and ensembling the model outputs. A neater choice is to get the spatially downscaled and bias-corrected US data from the archive of Downscaled CMIP3 and CMIP5 Climate and Hydrology Projections. According to this page, three downscaling products are available in this archive: BCCA and LOCA have daily data, and BCSD has monthly data. LOCA has a higher spatial resolution (1/16 x 1/16 degree, or around 6 km x 6 km) than BCCA (1/8 x 1/8 degree, or around 12 km x 12 km). This page has some information about the downscaling methods.
You can even query a spatio-temporal subset of the downscaled data under "Projections: Subset request".