Around the world, many climate change models are being developed (100+) under the umbrella of the World Climate Research Programme to assess the rate of climate change. Published data is generally publicly available to download for research and other (non-commercial) purposes through partner organizations in the Earth Systems Grid Federation.
The data are all formatted to comply with the CF Metadata Conventions, a set of standards to support standardization among research groups and published data sets. These conventions greatly facilitate use and analysis of the climate projections because standard processing work flows (should) work across the various data sets.
On the flip side, the CF Metadata Conventions needs to cater to a wide range of modeling requirements and that means that some of the areas covered by the standards are more complex than might be assumed. One of those areas is the temporal dimension of the data sets. The CF Metadata Conventions supports no less than nine different calendar definitions, that, upon analysis, fall into five distinct calendars (from the perspective of computation of climate projections):
standard
or gregorian
: The international
civil calendar that is in common use in many countries around the world,
adopted by edict of Pope Gregory XIII in 1582 and in effect from 15
October of that year. The proleptic_gregorian
calendar is
the same as the gregorian
calendar, but with validity
extended to periods prior to 1582-10-15
.julian
: Adopted in the year 45 BCE, every fourth year
is a leap year. Originally, the julian calendar did not have a
monotonically increasing year assigned to it and there are indeed
several julian calendars in use around the world today with different
years assigned to them. Common interpretation is currently that the year
is the same as that of the standard calendar. The julian calendar is
currently 13 days behind the gregorian calendar.365_day
or noleap
: No years have a leap
day.366_day
or all_leap
: All years have a leap
day.360_day
: Every year has 12 months of 30 days each.The three latter calendars are specific to the CF Metadata
Conventions to reduce computational complexities of working with dates.
These three, and the julian calendar, are not compliant with the
standard POSIXt
date/time facilities in R
and
using standard date/time procedures would quickly lead to problems. In
the below code snippet, the date of 1949-12-01
is the
datum from which other dates are calculated. When adding 43,289
days to this datum for a data set that uses the
360_day
calendar, that should yield a date some 120 years
after the datum:
# POSIXt calculations on a standard calendar - INCORRECT
as.Date("1949-12-01") + 43289
#> [1] "2068-06-08"
# CFtime calculation on a "360_day" calendar - CORRECT
# See below examples for details on the two functions
as_timestamp(CFtime("days since 1949-12-01", "360_day", 43289))
#> [1] "2070-02-30"
Using standard POSIXt
calculations gives a result that
is about 21 months off from the correct date - obviously an undesirable
situation. This example is far from artificial: 1949-12-01
is the datum for all CORDEX data, covering the period 1951 - 2005 for
historical experiments and the period 2006 - 2100 for RCP experiments
(with some deviation between data sets), and several models used in the
CORDEX set use the 360_day
calendar. The
365_day
or noleap
calendar deviates by about 1
day every 4 years (disregarding centurial years), or about 24 days in a
century. The 366_day
or all_leap
calendar
deviates by about 3 days every 4 years, or about 76 days in a
century.
The CFtime
package deals with the complexity of the
different calendars allowed by the CF Metadata Conventions. It properly
formats dates and times (even oddball dates like
2070-02-30
) and it can generate calendar-aware factors for
further processing of the data.
The character of CF time series - a number of numerical offsets from
a base date - implies that there should only be a single time zone
associated with the time series. The time zone offset from UTC is stored
in the datum and can be retrieved with the timezone()
function. If a vector of character timestamps with time zone information
is parsed with the CFparse()
function and the time zones
are found to be different from the datum time zone, a warning message is
generated but the timestamp is interpreted as being in the datum time
zone. No correction of timestamp to datum time zone is performed.
Data sets that are compliant with the CF Metadata Conventions always
include a datum, a specific point in time in reference to a
specified calendar, from which other points in time are
calculated by adding a specified offset of a certain
unit. This approach is encapsulated in the CFtime
package by the S4 class CFtime
.
# Create a CF time object from a definition string, a calendar and some offsets
cf <- CFtime("days since 1949-12-01", "360_day", 19830:90029)
cf
#> CF datum of origin:
#> Origin : 1949-12-01 00:00:00
#> Units : days
#> Calendar: 360_day
#> CF time series:
#> Elements: [2005-01-01 .. 2199-12-30] (average of 1.000000 days between 70200 elements)
#> Bounds : not set
The CFtime()
function takes a datum description
(which is actually a unit - “days” - in reference to a datum -
“1949-12-01”), a calendar description, and a vector of offsets
from that datum. Once a CFtime
instance is created its
datum and calendar cannot be changed anymore. Offsets may be added.
In practice, these parameters will be taken from the data set of
interest. CF Metadata Conventions require data sets to be in the NetCDF
format, with all metadata describing the data set included in a single
file, including the mandatory “Conventions” global attribute which
should have a string identifying the version of the CF Metadata
Conventions that this file adheres to (among possible others). Not
surprisingly, all the pieces of interest are contained in the mandatory
time
dimension of the file. The process then becomes as
follows, for a CMIP6 file of daily precipitation:
# Opening a data file that is included with the package and showing some attributes.
# Usually you would `list.files()` on a directory of your choice.
fn <- list.files(path = system.file("extdata", package = "CFtime"), full.names = TRUE)[1]
nc <- nc_open(fn)
attrs <- ncatt_get(nc, "")
attrs$title
#> [1] "NOAA GFDL GFDL-ESM4 model output prepared for CMIP6 update of RCP4.5 based on SSP2"
# "Conventions" global attribute must have a string like "CF-1.*" for this package to work reliably
attrs$Conventions
#> [1] "CF-1.7 CMIP-6.0 UGRID-1.0"
# Create the CFtime instance from the metadata in the file.
cf <- CFtime(nc$dim$time$units,
nc$dim$time$calendar,
nc$dim$time$vals)
cf
#> CF datum of origin:
#> Origin : 1850-01-01 00:00:00
#> Units : days
#> Calendar: noleap
#> CF time series:
#> Elements: [2015-01-01 12:00:00 .. 2099-12-31 12:00:00] (average of 1.000000 days between 31025 elements)
#> Bounds : not set
You can see from the global attribute “Conventions” that the file
adheres to the CF Metadata Conventions, among others. According to the
CF conventions, units
and calendar
are
required attributes of the time
dimension in the NetCDF
file, and nc$dim$time$vals
are the offset values, or
dimnames()
in R
terms, for the
time
dimension of the data.
The above example (and others in this vignette) use the
ncdf4
package. If you are using the RNetCDF
package, checking for CF conventions and then creating a
CFtime
instance goes like this:
library(RNetCDF)
nc <- open.nc(fn)
att.get.nc(nc, -1, "Conventions")
#> [1] "CF-1.7 CMIP-6.0 UGRID-1.0"
cf <- CFtime(att.get.nc(nc, "time", "units"),
att.get.nc(nc, "time", "calendar"),
var.get.nc(nc, "time"))
cf
#> CF datum of origin:
#> Origin : 1850-01-01 00:00:00
#> Units : days
#> Calendar: noleap
#> CF time series:
#> Elements: [2015-01-01 12:00:00 .. 2099-12-31 12:00:00] (average of 1.000000 days between 31025 elements)
#> Bounds : not set
The corresponding character representations of the time series can be easily generated:
dates <- as_timestamp(cf, format = "date")
dates[1:10]
#> [1] "2015-01-01" "2015-01-02" "2015-01-03" "2015-01-04" "2015-01-05"
#> [6] "2015-01-06" "2015-01-07" "2015-01-08" "2015-01-09" "2015-01-10"
…as well as the full range of the time series:
Note that in this latter case, if any of the timestamps in the time
series have a time that is other than 00:00:00
then the
time of the extremes of the time series is also displayed. This is a
common occurrence because the CF Metadata Conventions prescribe that the
middle of the time period (month, day, etc) is recorded, which for
months with 31 days would be something like
2005-01-15T12:00:00
.
When working with high resolution climate projection data, typically at a “day” resolution, one of the processing steps would be to aggregate the data to some lower resolution such as a dekad (10-day period), a month or a meteorological season, and then compute a derivative value such as the dekadal sum of precipitation, monthly minimum/maximum daily temperature, or seasonal average daily short-wave irradiance.
It is also possible to create factors for multiple “epochs” in one go. This greatly reduces programming effort if you want to calculate anomalies over multiple future periods. A complete example is provided in the vignette “Processing climate projection data”.
It is easy to generate the factors that you need once you have a
CFtime
instance prepared:
# Create a dekad factor for the whole `cf` time series that was created above
f_k <- CFfactor(cf, "dekad")
str(f_k)
#> Factor w/ 3060 levels "2015D01","2015D02",..: 1 1 1 1 1 1 1 1 1 1 ...
#> - attr(*, "epoch")= int -1
#> - attr(*, "period")= chr "dekad"
#> - attr(*, "CFtime")=Formal class 'CFtime' [package "CFtime"] with 4 slots
#> .. ..@ datum :Formal class 'CFdatum' [package "CFtime"] with 5 slots
#> .. .. .. ..@ definition: chr "days since 1850-01-01"
#> .. .. .. ..@ unit : int 4
#> .. .. .. ..@ origin :'data.frame': 1 obs. of 8 variables:
#> .. .. .. .. ..$ year : int 1850
#> .. .. .. .. ..$ month : num 1
#> .. .. .. .. ..$ day : num 1
#> .. .. .. .. ..$ hour : num 0
#> .. .. .. .. ..$ minute: num 0
#> .. .. .. .. ..$ second: num 0
#> .. .. .. .. ..$ tz : chr "+0000"
#> .. .. .. .. ..$ offset: num 0
#> .. .. .. ..@ calendar : chr "noleap"
#> .. .. .. ..@ cal_id : int 4
#> .. ..@ resolution: num 14.9
#> .. ..@ offsets : num [1:3060] 60230 60240 60250 60261 60271 ...
#> .. ..@ bounds : logi TRUE
# Create monthly factors for a baseline epoch and early, mid and late 21st century epochs
baseline <- CFfactor(cf, epoch = 1991:2020)
future <- CFfactor(cf, epoch = list(early = 2021:2040, mid = 2041:2060, late = 2061:2080))
str(future)
#> List of 3
#> $ early: Factor w/ 12 levels "01","02","03",..: NA NA NA NA NA NA NA NA NA NA ...
#> ..- attr(*, "epoch")= int 20
#> ..- attr(*, "period")= chr "month"
#> $ mid : Factor w/ 12 levels "01","02","03",..: NA NA NA NA NA NA NA NA NA NA ...
#> ..- attr(*, "epoch")= int 20
#> ..- attr(*, "period")= chr "month"
#> $ late : Factor w/ 12 levels "01","02","03",..: NA NA NA NA NA NA NA NA NA NA ...
#> ..- attr(*, "epoch")= int 20
#> ..- attr(*, "period")= chr "month"
For the “epoch” version, there are two interesting things to note here:
NA
values where the time series values are not falling in
the epoch. This ensures that the factor is compatible with the data set
which the time series describes, such that functions like
tapply()
will not throw an error.There are six periods defined for CFfactor()
:
year
, to summarize data to yearly timescalesseason
, the meteorological seasons. Note that the month
of December will be added to the months of January and February of the
following year, so the date “2020-12-01” yields the factor value
“2021S1”.quarter
, the standard quarters of the year.month
, monthly summaries, the default period.dekad
, 10-day period. Each month is subdivided in
dekads as follows: (1) days 01 - 10; (2) days 11 - 20; (3) remainder of
the month.day
, to summarize sub-daily data.A CFtime instance describes the “time” dimension of an associated
data set. When you process that dimension of the data set using
CFfactor()
or another method to filter or otherwise subset
the “time” dimension, the resulting data set will have a different
“time” dimension. To associate a proper CFtime instance with your
processing result, the methods in this package return that CFtime
instance as an attribute:
new_time <- attr(f_k, "CFtime")
new_time
#> CF datum of origin:
#> Origin : 1850-01-01 00:00:00
#> Units : days
#> Calendar: noleap
#> CF time series:
#> Elements: [1974-12-26 12:00:00 .. 2099-12-16 00:00:00] (average of 14.911572 days between 3060 elements)
#> Bounds : regular and consecutive
In the vignette “Processing climate projection data” is a fully worked out example of this.
You can test if your time series is complete with the function
CFcomplete()
. A time series is considered complete if the
time steps between the two extreme values are equally spaced. There is a
“fuzzy” assessment of completeness for time series with a datum unit of
“days” or smaller where the time steps are months or years apart - these
have different lengths in days in different months or years (e.g. a leap
year).
If your time series is incomplete, for instance because it has missing time steps, you should recognize that in your further processing. As an example, you might want to filter out months that have fewer than 90% of daily data from further processing or apply weights based on the actual coverage.
# Is the time series complete?
is_complete(cf)
#> [1] TRUE
# How many time units fit in a factor level?
CFfactor_units(cf, baseline)
#> [1] 31 28 31 30 31 30 31 31 30 31 30 31
# What's the absolute and relative coverage of our time series
CFfactor_coverage(cf, baseline, "absolute")
#> [1] 186 168 186 180 186 180 186 186 180 186 180 186
CFfactor_coverage(cf, baseline, "relative")
#> [1] 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2
The time series is complete but coverage of the baseline epoch is
only 20%! Recall that the time series starts in 2015 while the baseline
period in the factor is for 1991:2020
so that’s only 6
years of time series data out of 30 years of the baseline factor.
An artificial example of missing data:
# 4 years of data on a `365_day` calendar, keep 80% of values
n <- 365 * 4
cov <- 0.8
offsets <- sample(0:(n-1), n * cov)
cf <- CFtime("days since 2020-01-01", "365_day", offsets)
cf
#> CF datum of origin:
#> Origin : 2020-01-01 00:00:00
#> Units : days
#> Calendar: 365_day
#> CF time series:
#> Elements: [2020-01-01 .. 2023-12-31] (average of 1.250214 days between 1168 elements)
#> Bounds : not set
# Note that there are about 1.25 days between observations
mon <- CFfactor(cf, "month")
CFfactor_coverage(cf, mon, "absolute")
#> [1] 26 25 28 27 26 25 26 21 24 23 24 25 26 26 24 26 25 23 26 24 22 21 20 19 26
#> [26] 18 21 25 27 24 24 26 22 27 22 25 26 22 27 27 27 20 27 28 23 23 27 22
CFfactor_coverage(cf, mon, "relative")
#> [1] 0.8387097 0.8928571 0.9032258 0.9000000 0.8387097 0.8333333 0.8387097
#> [8] 0.6774194 0.8000000 0.7419355 0.8000000 0.8064516 0.8387097 0.9285714
#> [15] 0.7741935 0.8666667 0.8064516 0.7666667 0.8387097 0.7741935 0.7333333
#> [22] 0.6774194 0.6666667 0.6129032 0.8387097 0.6428571 0.6774194 0.8333333
#> [29] 0.8709677 0.8000000 0.7741935 0.8387097 0.7333333 0.8709677 0.7333333
#> [36] 0.8064516 0.8387097 0.7857143 0.8709677 0.9000000 0.8709677 0.6666667
#> [43] 0.8709677 0.9032258 0.7666667 0.7419355 0.9000000 0.7096774
Keep in mind, though, that there are data sets where the time unit is
lower than the intended resolution of the data. Since the CF conventions
recommend that the coarsest time unit is “day”, many files with monthly
data sets have a definition like days since 2016-01-01
with
offset values for the middle of the month like
15, 44, 74, 104, ...
. Even in these scenarios you can
verify that your data set is complete with the function
CFcomplete()
.
The CF Metadata Conventions support nine different calendars. Of
these, only the standard
, gregorian
and
proleptic_gregorian
calendars are fully compatible with
POSIXt. The other calendars have varying degrees of discrepancies:
julian
: Every fouth year is a leap year. Dates like
2100-02-29
and 2200-02-29
are valid.365_day
or noleap
: No leap year exists.
2020-02-29
does not occur.366_day
or all_leap
: All years are leap
years.360_day
: All months have 30 days in every year. This
means that 31 January, March, May, July, August, October and December
never occur, while 29 and 30 February occur in every year.Converting time series using these incompatible calendars to
Date
s is likely to produce problems. This is most
pronounced for the 360_day
calendar:
# Days in January and February
cf <- CFtime("days since 2023-01-01", "360_day", 0:59)
cf_days <- as_timestamp(cf, "date")
as.Date(cf_days)
#> [1] "2023-01-01" "2023-01-02" "2023-01-03" "2023-01-04" "2023-01-05"
#> [6] "2023-01-06" "2023-01-07" "2023-01-08" "2023-01-09" "2023-01-10"
#> [11] "2023-01-11" "2023-01-12" "2023-01-13" "2023-01-14" "2023-01-15"
#> [16] "2023-01-16" "2023-01-17" "2023-01-18" "2023-01-19" "2023-01-20"
#> [21] "2023-01-21" "2023-01-22" "2023-01-23" "2023-01-24" "2023-01-25"
#> [26] "2023-01-26" "2023-01-27" "2023-01-28" "2023-01-29" "2023-01-30"
#> [31] "2023-02-01" "2023-02-02" "2023-02-03" "2023-02-04" "2023-02-05"
#> [36] "2023-02-06" "2023-02-07" "2023-02-08" "2023-02-09" "2023-02-10"
#> [41] "2023-02-11" "2023-02-12" "2023-02-13" "2023-02-14" "2023-02-15"
#> [46] "2023-02-16" "2023-02-17" "2023-02-18" "2023-02-19" "2023-02-20"
#> [51] "2023-02-21" "2023-02-22" "2023-02-23" "2023-02-24" "2023-02-25"
#> [56] "2023-02-26" "2023-02-27" "2023-02-28" NA NA
31 January is missing from the vector of Date
s because
the 360_day
calendar does not include it and 29 and 30
February are NA
s because POSIXt rejects them. This will
produce problems later on when processing your data.
The general advice is therefore: do not convert CFtime
objects to Date objects unless you are sure that the
CFtime
object uses a POSIXt-compatible calendar.
One reason to convert the “time” dimension from different climate projection data sets is to be able to compare the data from different models and produce a multi-model ensemble. The correct procedure to do this is to first calculate for each data set individually the property of interest (e.g. average daily rainfall per month anomaly for some future period relative to a baseline period), which will typically involve aggregation to a lower resolution (such as from daily data to monthly averages), and only then combine the aggregate data from multiple data sets to compute statistically interesting properties (such as average among models and standard deviation, etc).
Once data is aggregated from daily or higher-resolution values, the
different calendars no longer matter (although if you do need to convert
averaged data to absolute data you should use
CFfactor_units()
to make sure that you use the correct
scaling factor).
Otherwise, there really shouldn’t be any reason to convert the time
series in the data files to Date
s. Climate projection data
is virtually never compared on a day-to-day basis between different
models and neither does complex date arithmetic make much sense (such as
adding intervals) - CFtime
can support basic arithmetic by
manipulation the offsets of the CFtime
object. The
character representations that are produced are perfectly fine to use
for dimnames()
on an array or as rownames()
in
a data.frame
and these also support basic logical
operations such as "2023-02-30" < "2023-03-01"
. So ask
yourself, do you really need Date
s when working with
unprocessed climate projection data? (If so, open an issue on
GitHub).
A complete example of creating a multi-model ensemble is provided in the vignette “Processing climate projection data”.