Cherry Blossom prediction using Time Series

Cherry Blossom is one of the most scenic visuals one can experience. Cherry blossom season marks the arrival of spring season and can be considered as transition from winter to summer. People try to make plans travel to enjoy this phenomenon. So how about using some simple statistical techniques to try and forecast / predict the peak cherry blossom time ?

Along with some of my fellow PhD classmates, I participated in the International Cherry Blossom Prediction Competition hosted by George Mason university. We explored a lot of models and I am going to show a very basic model which I tried during the early stages. The model is the Autogegressive (AR) model. The notation of this model is AR(1) model is as follows

\[ Y_{t} = \beta_{1} Y_{t-1} + \epsilon_{t} \] where $Y_{t}$ is the bloom day for year $t$, $\beta_{i}$ is the model parameter and $\epsilon_{t}$ is the white noise

This simply means the present value of the response variable $Y$ (in our case the bloom day for this year) is influenced by previous value of the response variable (in our case the bloom day of the previous year). If you are aware of the simple linear regression, think of this as the explanatory variable being the same as the predictor variable in rough sense. In the competition, we tried to predict the bloom date for multiple location across the world based on the data available provide by the university. However, for the purpose of this post, I will show the analysis only for one location, Kyoto in Japan.

Let us start with first reading in the dataset and loading the R packages required for the analysis

# Load the packages
library(forecast)
library(ggplot2)
library(fpp2)
library(dplyr)
library(vars)

# Load the dataset
kyoto <- read.csv("https://raw.githubusercontent.com/GMU-CherryBlossomCompetition/peak-bloom-prediction/main/data/kyoto.csv",header=T)

# Plot of the bloom date over the years
ggplot(kyoto,aes(x=year,y=bloom_doy))+
  geom_point()+
  labs(x="Year",y="Bloom Day")+
  ggtitle("Bloom Day by Year")

As we can see from the plot, towards the later end (in the recent past), the bloom day has started to go down. This means in the recent past, the bloom day is happening earlier than before. Let us look at the plot only from the year 1950.

# Filter data only for year since 1951
kyoto_new <- kyoto %>% filter(year>1950)

# Plot of the bloom date over the years
ggplot(kyoto_new,aes(x=year,y=bloom_doy))+
  geom_point()+
  labs(x="Year",y="Bloom Day")+
  ggtitle("Bloom Day by Year")

As we can see from the plot for year 1951 onward, there seems to be a downward trend which indicates the bloom date is in general arriving earlier.

We will use the data from 1951 to 2022 to predict the bloom date for year 2023 and compare that to actual bloom date. For this, we will use the bloom day as response and the year as the predictor

# Prepare the data for ARIMA(1,0,0) model
y_kyoto <- kyoto_new$bloom_doy # bloom day as the response

Exclude the year 2023 from response and explanatory variable to test the model on year 2023

# First test on 2023 model
ytest <- y_kyoto[-length(y_kyoto)] # exclude the bloom day for 2023 year

# Model based on year 1951 to 2022
fit_kyoto_test <- Arima(ytest, order=c(1,0,0)) # order=c(1,0,0) indicates AR(1) model
fit_kyoto_test

Series: ytest 
ARIMA(1,0,0) with non-zero mean 

Coefficients:
         ar1     mean
      0.2821  97.2885
s.e.  0.1182   0.7483

sigma^2 = 21.84:  log likelihood = -215.16
AIC=436.32   AICc=436.67   BIC=443.19

#Forecast
fcast_kyoto_test <- forecast(fit_kyoto_test)
fcast_kyoto_test

   Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
74       93.53975 87.55120  99.5283 84.38105 102.6984
75       96.23094 90.00866 102.4532 86.71479 105.7471
76       96.99013 90.74963 103.2306 87.44611 106.5342
77       97.20430 90.96235 103.4463 87.65806 106.7505
78       97.26472 91.02265 103.5068 87.71830 106.8111
79       97.28176 91.03969 103.5238 87.73533 106.8282
80       97.28657 91.04450 103.5286 87.74014 106.8330
81       97.28793 91.04585 103.5300 87.74150 106.8344
82       97.28831 91.04624 103.5304 87.74188 106.8347
83       97.28842 91.04634 103.5305 87.74199 106.8349

# Check actual bloom date for 2023
y_kyoto[length(y_kyoto)]

[1] 95

The AR(1) model predicts 96 as the bloom day for year 2023 whereas the actual bloom day was 84 for year 2023 which is a difference of 12 days. Considering its a basic model, this does not seem to be too bad.

Now we check the performance of the model using some charts where we first check the prediction plot, then the Regression and model errors.

# Plot the prediction
autoplot(fcast_kyoto_test) + xlab("Year") +
  ylab("Percentage change")

# recover estimates of nu(t) and epsilon(t) 
cbind("Regression Errors" = residuals(fit_kyoto_test, type="regression"),
      "ARIMA errors" = residuals(fit_kyoto_test, type="innovation")) %>%
  autoplot(facets=TRUE)

There does not seem to be any issues with either of the plots. Now we check the residuals to see if they are normally distributed.

# Check the residuals
checkresiduals(fit_kyoto_test)


    Ljung-Box test

data:  Residuals from ARIMA(1,0,0) with non-zero mean
Q* = 13.92, df = 9, p-value = 0.1252

Model df: 1.   Total lags used: 10

The residuals seem to be normally distributed and the Ljung-Box test indicates we have little evidence against the null hypothesis of independently distributed errors.

Now we will use the data from 1951 upto 2023 to predict the bloom date for the year 2024. Basically its the same model with one extra data point available

# Now use data upto 2023 to predict 2024
fit_kyoto <- Arima(y_kyoto, order=c(1,0,0)) # order=c(1,0,0) indicates AR(1) model
fit_kyoto

Series: y_kyoto 
ARIMA(1,0,0) with non-zero mean 

Coefficients:
         ar1     mean
      0.2707  97.3162
s.e.  0.1111   0.7264

sigma^2 = 21.56:  log likelihood = -217.65
AIC=441.3   AICc=441.64   BIC=448.21

#Forecast
fcast_kyoto <- forecast(fit_kyoto)
fcast_kyoto

   Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
75       96.68923 90.73853 102.6399 87.58842 105.7900
76       97.14647 90.98163 103.3113 87.71816 106.5748
77       97.27023 91.08999 103.4505 87.81837 106.7221
78       97.30373 91.12237 103.4851 87.85015 106.7573
79       97.31280 91.13135 103.4943 87.85909 106.7665
80       97.31526 91.13380 103.4967 87.86154 106.7690
81       97.31592 91.13447 103.4974 87.86220 106.7696
82       97.31610 91.13465 103.4976 87.86238 106.7698
83       97.31615 91.13469 103.4976 87.86243 106.7699
84       97.31616 91.13471 103.4976 87.86244 106.7699

The model predicts the cherry blossom to bloom on day 93 which is 2nd April 2024 (due to 2024 being a leap year) for Kyoto, Tokyo. Now lets look at the diagnostics of the model to see if the model is reasonable.

# Plot the forecast
autoplot(fcast_kyoto) + xlab("Year") +
  ylab("Percentage change")

# recover estimates of nu(t) and epsilon(t) 
cbind("Regression Errors" = residuals(fit_kyoto, type="regression"),
      "ARIMA errors" = residuals(fit_kyoto, type="innovation")) %>%
  autoplot(facets=TRUE)

# Check the residuals
checkresiduals(fit_kyoto)


    Ljung-Box test

data:  Residuals from ARIMA(1,0,0) with non-zero mean
Q* = 13.9, df = 9, p-value = 0.1259

Model df: 1.   Total lags used: 10

Again, the residuals seem to be normally distributed and the Ljung-Box test indicates we have little evidence against the null hypothesis of independently distributed errors.