Green Friday - Is climate change real?

1. Climate change: reality or fake news?
2. Let's get started
3. Going back in time…
- a. looking at distributions
- b. looking at time series
4. Is the trend significant?
5. Additional analysis
6. Further data

1. Climate change: reality or fake news?

Everyone is talking about climate change, and many doubt that it really exists… But climate is one of the best-documented areas of science, and there are tons of data available. We just need to crunch the numbers with the right tools!

So let us use the German weather records and see whether the effects of climate change in Germany (temperatures, rain, …) can be seen in the data. We will use the data from the Deutscher Wetterdienst (DWD, the German Weather Service), which is made available through the R package rdwd (see this page). This package contains functions for querying the huge database of the DWD.

2. Let's get started

a. accessing the data

First, we need to install some packages:

## install the package
#install.packages("rdwd") # You might need to install RTools to build the rdwd package; see https://cran.r-project.org/bin/windows/Rtools/
## load the package
library(rdwd)

We can query the database using the name of a weather station (e.g. Potsdam), and collect a time series of various weather variables. This time series is available for different time intervals, for example daily or monthly. If you want to use another location, you can check this interactive map, and look for the blue dots.

## this creates a link to the requested data
link = selectDWD("Potsdam", res="daily", var="kl", per="recent")
## this downloads the corresponding file
file = dataDWD(link, read=FALSE, dir="~/", quiet=TRUE, force=NA, overwrite=TRUE)
## and this reads the content of the file into R
clim = readDWD(file, varnames=TRUE)

Have a look at the downloaded table to see what kind of variables are available:

clim[1:5,]

  STATIONS_ID MESS_DATUM QN_3 FX.Windspitze FM.Windgeschwindigkeit QN_4
1        3987 2018-05-27   10           9.2                    3.2    3
2        3987 2018-05-28   10          10.0                    4.5    3
3        3987 2018-05-29   10          12.5                    5.2    3
4        3987 2018-05-30   10          10.8                    3.6    3
5        3987 2018-05-31   10           8.1                    3.3    3
  RSK.Niederschlagshoehe RSKF.Niederschlagsform SDK.Sonnenscheindauer
1                    0.4                      6                10.250
2                    0.0                      0                13.433
3                    0.0                      0                15.383
4                    0.0                      0                10.317
5                    0.0                      0                11.733
  SHK_TAG.Schneehoehe NM.Bedeckungsgrad VPM.Dampfdruck PM.Luftdruck
1                   0               4.3           15.7      1008.94
2                   0               4.5           16.0      1007.71
3                   0               0.8           13.3      1004.02
4                   0               3.3           15.6      1002.91
5                   0               3.2           16.6      1003.17
  TMK.Lufttemperatur UPM.Relative_Feuchte TXK.Lufttemperatur_Max
1               20.5                67.58                   28.4
2               24.3                56.46                   32.2
3               25.6                43.17                   32.6
4               24.1                53.42                   33.2
5               24.4                57.63                   31.9
  TNK.Lufttemperatur_Min TGK.Lufttemperatur_5cm_min eor
1                   13.8                       11.4 eor
2                   15.8                       13.1 eor
3                   18.2                       14.9 eor
4                   18.2                       15.1 eor
5                   17.1                       14.4 eor

b. comparing variables

We have downloaded a recent dataset, containing the data for the last 2 years. We can have a look at the time series for a certain variable, for example TMK.Lufttemperatur (air temperature): we will plot the data with time on the x-axis and temperature on the y-axis:

plot(clim$MESS_DATUM, clim$TMK.Lufttemperatur,type='l',xlab='',ylab='temperature')

Try to plot the sunshine duration for the same period:

plot(clim$MESS_DATUM, clim$TMK.Lufttemperatur,type='l',xlab='',ylab='temperature')

plot(clim$MESS_DATUM,clim$SDK.Sonnenscheindauer,type='l',col='red',xlab='',ylab='sunshine duration (hours)')

This seems to be tightly correlated with the temperature!

We can quantify the correlation between temperature and sunshine duration:

cor(clim$SDK.Sonnenscheindauer,clim$TMK.Lufttemperatur)

[1] 0.6359194

plot(clim$SDK.Sonnenscheindauer,clim$TMK.Lufttemperatur,pch=20,
     xlab='sunshine duration (hours)',
     ylab='temperature (degree)'
     )
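
A note on missing values: cor returns NA as soon as one of its inputs contains an NA, which may happen with other stations or variables. The following is a minimal sketch using made-up vectors (sun and temp here are hypothetical values, not DWD data). It shows the use="complete.obs" option, and cor.test, which additionally reports a p-value for the correlation.

```r
## made-up example data (NOT from the DWD): 8 days, one missing value
sun  = c(10.2, 13.4, 15.4, NA, 11.7, 2.1, 0.0, 6.5)
temp = c(20.5, 24.3, 25.6, 24.1, 24.4, 14.2, 12.8, 17.9)
## without further options, a single NA makes the result NA
r.na = cor(sun, temp)
## use only the days where both values are available
r = cor(sun, temp, use="complete.obs")
## cor.test drops incomplete pairs itself and also reports a p-value
ct = cor.test(sun, temp)
print(r)
print(ct$p.value)
```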

c. comparing locations

Is Freiburg warmer than Potsdam? Let us get the data from these 2 locations and compare the monthly temperatures over the last 2 years:

## Start with Potsdam
## this creates a link to the requested data
link = selectDWD("Potsdam", res="monthly", var="kl", per="recent")
## this downloads the corresponding file
file = dataDWD(link, read=FALSE, dir="~/", quiet=TRUE, force=NA, overwrite=TRUE)
## and this reads the content of the file into R
clim.po = readDWD(file, varnames=TRUE)
## same for Freiburg
## this creates a link to the requested data
link = selectDWD("Freiburg", res="monthly", var="kl", per="recent")
## this downloads the corresponding file
file = dataDWD(link, read=FALSE, dir="~/", quiet=TRUE, force=NA, overwrite=TRUE)
## and this reads the content of the file into R
clim.fr = readDWD(file, varnames=TRUE)
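
Since the same three steps are repeated for every station, they could be wrapped in a small function. The sketch below is only a suggestion: fetch.dwd is a hypothetical helper name (it is not part of rdwd), and it simply reuses the calls and arguments shown above.

```r
## hypothetical helper (NOT part of rdwd): wraps the three steps used above
fetch.dwd = function(station, res="monthly", per="recent") {
  ## create a link to the requested data
  link = rdwd::selectDWD(station, res=res, var="kl", per=per)
  ## download the corresponding file
  file = rdwd::dataDWD(link, read=FALSE, dir="~/", quiet=TRUE, force=NA, overwrite=TRUE)
  ## read the content of the file into R
  rdwd::readDWD(file, varnames=TRUE)
}
## usage (needs the rdwd package and an internet connection, hence commented out):
## clim.po = fetch.dwd("Potsdam")
## clim.fr = fetch.dwd("Freiburg")
```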

More rain? More heat?

## monthly precipitation
rain.po = clim.po$MO_RR.Niederschlagshoehe
rain.fr = clim.fr$MO_RR.Niederschlagshoehe
## monthly sunshine duration
sun.po = clim.po$MO_SD_S.Sonnenscheindauer
sun.fr = clim.fr$MO_SD_S.Sonnenscheindauer
## monthly temperature
temp.po = clim.po$MO_TT.Lufttemperatur
temp.fr = clim.fr$MO_TT.Lufttemperatur
## group the values by location for the boxplots
rain = list(Freiburg=rain.fr,
            Potsdam=rain.po)
sun = list(Freiburg=sun.fr,
           Potsdam=sun.po)
temp = list(Freiburg=temp.fr,
            Potsdam=temp.po)

## arrange the plots on a 2x2 grid with small margins
par(mfrow=c(2,2),mar=c(2,2,2,2))
boxplot(rain,main='Rain')
boxplot(sun,main='Sunshine')
boxplot(temp,main='Temperature')

Are these differences statistically significant? Or could they simply be due to statistical fluctuations in this time period? Since the time points match between the 2 locations, we can perform a paired t-test:

## Test Rain
t.test(rain.fr,rain.po,paired=TRUE)


    Paired t-test

data:  rain.fr and rain.po
t = 2.789, df = 17, p-value = 0.01259
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  7.056499 50.899056
sample estimates:
mean of the differences 
               28.97778

## Test Sun
t.test(sun.fr,sun.po,paired=TRUE)


    Paired t-test

data:  sun.fr and sun.po
t = -1.0272, df = 18, p-value = 0.318
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -40.70075  13.97128
sample estimates:
mean of the differences 
              -13.36474

## Test Temperature
t.test(temp.fr,temp.po,paired=TRUE)


    Paired t-test

data:  temp.fr and temp.po
t = 1.0981, df = 18, p-value = 0.2866
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.2441508  0.7788877
sample estimates:
mean of the differences 
              0.2673684

For which of these variables is the difference significant?
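
To see what the paired test does, the sketch below uses made-up monthly values for two hypothetical stations (not DWD data). A paired t-test is equivalent to a one-sample t-test on the month-by-month differences, and months with a missing value in either series are dropped. This is presumably why the rain test above has one degree of freedom fewer than the other two.

```r
## made-up monthly values for two hypothetical stations (NOT DWD data)
a = c(55, 80, 42, NA, 95, 61, 70, 38)
b = c(40, 62, 45, 51, 70, 48, 66, 30)
## paired test, as used above
p1 = t.test(a, b, paired=TRUE)
## equivalent: one-sample test on the month-by-month differences
p2 = t.test(a - b, mu=0)
## the month with a missing value is dropped: 7 complete pairs, so df = 6
print(p1$parameter)
print(c(p1$p.value, p2$p.value))
```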

3. Going back in time…

Climate change happens on a much longer time scale than 2 years. Hence, we need to download historical data! We will use monthly intervals:

## this creates a link to the requested data
link = selectDWD("Potsdam", res="monthly", var="kl", per="historical")
## this downloads the corresponding file
file = dataDWD(link, read=FALSE, dir="~/", quiet=TRUE, force=NA, overwrite=TRUE)
## and this reads the content of the file into R
clim = readDWD(file, varnames=TRUE)

a. looking at distributions

Let us explore the distribution of the temperature in one month of the year (for example April) across many years. April is encoded as <year>-04-15 in the column MESS_DATUM of the clim table, so we need to find the rows of the table for which the column MESS_DATUM contains the pattern 04-15:

## the function grep searches for a certain string in a vector of strings, and returns the indices of the entries which contain this string
rows.april = grep('04-15',clim$MESS_DATUM)
##
year = clim$MESS_DATUM[rows.april]
temp.april = clim$MO_TT.Lufttemperatur[rows.april]
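
As an alternative to pattern matching, the month can be extracted from the date itself. This is a minimal sketch on made-up dates that mimic the mid-month format of MESS_DATUM (the vector dates is an assumption for illustration, not DWD data). It assumes the column is stored as a date, as the printed table above suggests.

```r
## made-up dates in the same mid-month format as MESS_DATUM (NOT DWD data)
dates = seq(as.Date("2000-01-15"), as.Date("2002-12-15"), by="month")
## pattern matching on the text form of the date, as above
rows.grep = grep('04-15', dates)
## alternative: extract the month number and compare it
rows.fmt = which(format(dates, "%m") == "04")
## the year can be extracted the same way, as a number
years = as.numeric(format(dates[rows.fmt], "%Y"))
print(years)
```

Having the year as a plain number can be convenient for plotting or for selecting a range of years.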

Let us plot the distribution of these values as a histogram:

hist(temp.april,breaks=15,xlab='Temperature in April',main='',col='lightgrey')

Compute the mean and standard deviation of these April temperatures:

m = mean(temp.april)
s = sd(temp.april)

We can compare this histogram with a “theoretical” normal distribution with same mean and standard deviation:

## generate a vector of x values from 4 to 14 with 0.1 increments
x.norm = seq(4,14,by=.1)
## compute the y value according to a normal distribution
y.norm = dnorm(x.norm,mean=m,sd=s)

Now overlay the histogram with the theoretical distribution:

hist(temp.april,breaks=15,xlab='Temperature in April',main='',freq = FALSE,col='lightgrey');lines(x.norm,y.norm,lwd=3,col='blue')

We can also compare the histogram to a normal distribution using a QQ-plot:

qqnorm(temp.april);qqline(temp.april)

Kind of…
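
If a number is preferred over a visual impression, one option is the Shapiro-Wilk test (shapiro.test in base R), whose null hypothesis is that the data come from a normal distribution. The sketch below runs on simulated values, not on the DWD data; with the real data one would pass temp.april instead.

```r
## simulated "April temperatures" (NOT DWD data), fixed seed for reproducibility
set.seed(1)
x = rnorm(120, mean=8.5, sd=1.6)
## Shapiro-Wilk test: the null hypothesis is that the data are normally distributed
sw = shapiro.test(x)
print(sw$p.value)
## a clearly non-normal (skewed) sample for comparison
sw.skew = shapiro.test(rexp(120))
print(sw.skew$p.value)
```

A small p-value speaks against normality; a large one only means that no deviation was detected.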

b. looking at time series

As we did for the recent data, we can now look at the monthly temperatures for April over a long time range.

##
plot(year,temp.april,type='l',ylab='Temperature',main='Temperature in April')

Do we see a trend? Fake news? Try to overlay the temperature profile for July onto this plot:

rows.july = grep('07-15',clim$MESS_DATUM)
temp.july = clim$MO_TT.Lufttemperatur[rows.july]
##
plot(year,temp.april,type='l',ylab='Temperature',main='Temperatures',ylim=c(5,25));lines(year,temp.july,type='l',col='red')

Is there a correlation between time and temperature? We can encode time as a numerical vector (1,2,…) and compute a Spearman correlation between this time vector and the temperatures:

time = 1:length(year)
##
cor(time,temp.april,method='spearman')

[1] 0.3427779

Damn … that's it, then!!

Can you determine for which month of the year this correlation is highest?
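
One possible way to approach this question is to repeat the computation for each month in a loop. The sketch below uses simulated monthly temperatures with a small artificial warming trend (not DWD data; all variable names are chosen for illustration). With the real table, the dates and temperatures would come from clim$MESS_DATUM and clim$MO_TT.Lufttemperatur.

```r
## simulated monthly temperatures with a small warming trend (NOT DWD data)
set.seed(42)
dates = seq(as.Date("1900-01-15"), as.Date("2018-12-15"), by="month")
month = format(dates, "%m")
yr = as.numeric(format(dates, "%Y"))
seasonal = 9 - 10 * cos(2 * pi * (as.numeric(month) - 1) / 12)
temp = seasonal + 0.01 * (yr - 1900) + rnorm(length(dates), sd=1.5)
## Spearman correlation between time and temperature, one value per month
rho = sapply(sprintf("%02d", 1:12), function(m) {
  rows = which(month == m)
  cor(1:length(rows), temp[rows], method='spearman')
})
print(round(rho, 2))
## month with the highest correlation
print(names(which.max(rho)))
```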

4. Is the trend significant?

Let us compare the April temperatures of the years 1900-1918 and 2000-2018:

# extract the rows corresponding to April
clim.month = clim[grep('04-15',clim$MESS_DATUM),]
##
## now extract the rows corresponding to the early and late time period:
i.early = which(clim.month$MESS_DATUM_BEGINN >= 19000000 & clim.month$MESS_DATUM_ENDE <= 19190000)
i.late = which(clim.month$MESS_DATUM_BEGINN >= 20000000 & clim.month$MESS_DATUM_ENDE <= 20190000)
##
temp.19 = clim.month$MO_TT.Lufttemperatur[i.early]
temp.20 = clim.month$MO_TT.Lufttemperatur[i.late]

Let us visualize the data as a boxplot:

boxplot(list('1900-1918'=temp.19,
             '2000-2018'=temp.20))

There is obviously a difference between these 2 distributions; but is this difference really statistically significant?

## is there a significant difference between 1900-1918 and 2000-2018?
t.test(temp.19,temp.20)


    Welch Two Sample t-test

data:  temp.19 and temp.20
t = -4.6079, df = 35.689, p-value = 5.026e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -3.653731 -1.419953
sample estimates:
mean of x mean of y 
 7.648421 10.185263

What kind of test was performed? Interpret the output of the test! Try performing a one-sided t-test.

Can you see a similar effect for other climate variables (rain, wind,…)?
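
For the one-sided test mentioned above, t.test has an alternative argument. The following sketch uses simulated temperatures for two periods (not DWD data; early and late are made-up names). With the real data one would pass temp.19 and temp.20.

```r
## simulated April temperatures for two periods (NOT DWD data)
set.seed(7)
early = rnorm(19, mean=7.6, sd=1.7)
late  = rnorm(19, mean=10.2, sd=1.7)
## two-sided Welch test, as above
two = t.test(early, late)
## one-sided: the alternative is that the early mean is LESS than the late mean
one = t.test(early, late, alternative="less")
print(c(two$p.value, one$p.value))
```

When the observed difference points in the direction of the alternative, the one-sided p-value is half of the two-sided one.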

5. Additional analysis

Redo this analysis for other climate variables, such as rain or sunshine duration. Do you also see a temporal trend?

6. Further data

Check on Kaggle for related datasets with climate data: here is a list
Additional R packages and tools here