Creating a simple Forecast model¶
In this notebook, we show how to create a simple forecast model using the mosqlient package. The package includes a baseline model that uses ARIMA to forecast the number of cases of a disease.
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
import os
from dotenv import load_dotenv
load_dotenv()
api_key = os.getenv("API_KEY")
import pandas as pd
from datetime import date
from mosqlient.datastore import Infodengue
from mosqlient.forecast import Arima
Before applying the baseline model, we need to load the data. In this example, we use the dengue data for the city of Rio de Janeiro (geocode = 3304557). The baseline model is univariate, so only one column is forecast: the casos column. This column must be renamed to y before being passed to the model. The dataframe index must also be a datetime index.
disease = 'dengue'
geocode = 3304557
end_date = date.today().strftime('%Y-%m-%d')
df = Infodengue.get(api_key = api_key,
                    disease = disease,
                    start = "2010-01-01",
                    end = end_date,
                    geocode = geocode)
df = pd.DataFrame(df)
df['data_iniSE'] = pd.to_datetime(df['data_iniSE'])
df.set_index('data_iniSE', inplace = True )
df = df[['casos']].rename(columns = {'casos':'y'})
df = df.resample('W-SUN').sum()
df.head()
100%|██████████| 2/2 [00:02<00:00, 1.08s/requests]
| data_iniSE | y |
|---|---|
| 2010-01-03 | 30 |
| 2010-01-10 | 44 |
| 2010-01-17 | 46 |
| 2010-01-24 | 47 |
| 2010-01-31 | 68 |
The cell below calls the class associated with the Arima baseline model.
The ARIMA model is defined by:
$$y_t = c + \phi_1 y_{t-1} + \dots + \phi_p y_{t-p} + \theta_1 \epsilon_{t-1} + \dots + \theta_q \epsilon_{t-q} + \epsilon_t,$$
$p$ - order of the autoregressive part;
$d$ - degree of first differencing involved;
$q$ - order of the moving average part.
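Here $y_t$ is the original series $Y_t$ after differencing $d$ times, and $\epsilon_t$ is the error term. If $d=0$, no differencing is applied and $y_t = Y_t$.
If $d=1$: $y_t = Y_t - Y_{t-1},$
If $d=2$: $y_t = (Y_t - Y_{t-1}) - (Y_{t-1} - Y_{t-2}),$ and so on.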
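The differencing step can be illustrated with a minimal sketch in plain Python. The helper `difference` is hypothetical and written only for this illustration; it is not part of mosqlient or pmdarima. The input values are the first five weekly counts from the table above.

```python
def difference(series, d=1):
    """Apply first differencing d times to a list of numbers."""
    out = list(series)
    for _ in range(d):
        # Each pass replaces the series with consecutive differences.
        out = [curr - prev for prev, curr in zip(out, out[1:])]
    return out


# First five weekly case counts from the table above.
Y = [30, 44, 46, 47, 68]

first_diff = difference(Y, d=1)   # Y_t - Y_{t-1}
second_diff = difference(Y, d=2)  # (Y_t - Y_{t-1}) - (Y_{t-1} - Y_{t-2})

print(first_diff)
print(second_diff)
```

Each pass shortens the series by one observation, so a series of length 5 has 4 values after one difference and 3 after two.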
To instantiate the class, pass the df parameter, which holds the data used to train the model and to make the in-sample and out-of-sample predictions.
Internally, before the model is trained and applied, the data is transformed using a Box-Cox transformation. The lambda parameter is estimated by optimization. During the training step, the model's parameters are selected by minimizing the AIC with the auto_arima function of the pmdarima package.
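As a rough illustration of the transformation mentioned above, the sketch below implements the standard Box-Cox formula and its inverse in plain Python. The functions `boxcox` and `inv_boxcox` are written for this example only, and the value `lam = 0.5` is an arbitrary assumption: the package estimates lambda by optimization, and how it handles zero counts is not shown here.

```python
import math


def boxcox(y, lam):
    """Box-Cox transform of a single positive value."""
    if y <= 0:
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1) / lam


def inv_boxcox(z, lam):
    """Inverse Box-Cox transform, mapping back to the original scale."""
    if lam == 0:
        return math.exp(z)
    return (lam * z + 1) ** (1 / lam)


lam = 0.5  # assumed value, for illustration only
cases = [30, 44, 46, 47, 68]

transformed = [boxcox(y, lam) for y in cases]
restored = [inv_boxcox(z, lam) for z in transformed]

print(transformed)
print(restored)
```

Because the model is fitted on the transformed scale, predictions and interval bounds need the inverse transform before they can be compared with case counts.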
m_arima = Arima(df = df)
m_arima
<mosqlient.forecast.baseline.Arima at 0x7ae14dd6c440>
Train the model¶
To use this method, you must define the start and end dates of the training period. This date filter is applied to the df passed in the previous step.
model = m_arima.train(train_ini_date='2010-01-01', train_end_date = '2021-12-31')
model
Performing stepwise search to minimize aic
 ARIMA(2,1,2)(0,0,0)[0] intercept : AIC=-292.056, Time=0.12 sec
 ARIMA(0,1,0)(0,0,0)[0] intercept : AIC=-263.166, Time=0.03 sec
 ARIMA(1,1,0)(0,0,0)[0] intercept : AIC=-297.770, Time=0.04 sec
 ARIMA(0,1,1)(0,0,0)[0] intercept : AIC=-289.846, Time=0.06 sec
 ARIMA(0,1,0)(0,0,0)[0] : AIC=-265.069, Time=0.02 sec
 ARIMA(2,1,0)(0,0,0)[0] intercept : AIC=-302.929, Time=0.05 sec
 ARIMA(3,1,0)(0,0,0)[0] intercept : AIC=-302.144, Time=0.06 sec
 ARIMA(2,1,1)(0,0,0)[0] intercept : AIC=-309.026, Time=0.16 sec
 ARIMA(1,1,1)(0,0,0)[0] intercept : AIC=-301.338, Time=0.10 sec
 ARIMA(3,1,1)(0,0,0)[0] intercept : AIC=-307.027, Time=0.25 sec
 ARIMA(1,1,2)(0,0,0)[0] intercept : AIC=-305.684, Time=0.20 sec
 ARIMA(3,1,2)(0,0,0)[0] intercept : AIC=-306.429, Time=0.35 sec
 ARIMA(2,1,1)(0,0,0)[0] : AIC=-310.932, Time=0.06 sec
 ARIMA(1,1,1)(0,0,0)[0] : AIC=-303.191, Time=0.05 sec
 ARIMA(2,1,0)(0,0,0)[0] : AIC=-304.794, Time=0.03 sec
 ARIMA(3,1,1)(0,0,0)[0] : AIC=-308.934, Time=0.08 sec
 ARIMA(2,1,2)(0,0,0)[0] : AIC=inf, Time=0.20 sec
 ARIMA(1,1,0)(0,0,0)[0] : AIC=-299.603, Time=0.02 sec
 ARIMA(1,1,2)(0,0,0)[0] : AIC=-307.585, Time=0.10 sec
 ARIMA(3,1,0)(0,0,0)[0] : AIC=-304.014, Time=0.04 sec
 ARIMA(3,1,2)(0,0,0)[0] : AIC=-308.337, Time=0.10 sec
Best model: ARIMA(2,1,1)(0,0,0)[0]
Total fit time: 2.126 seconds
ARIMA(2,1,1)(0,0,0)[0]
Parameters

| Parameter | Value |
|---|---|
| order | (2, ...) |
| seasonal_order | (0, ...) |
| start_params | None |
| method | 'lbfgs' |
| maxiter | 100 |
| suppress_warnings | True |
| out_of_sample_size | 0 |
| scoring | 'mse' |
| scoring_args | {} |
| trend | None |
| with_intercept | False |
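The stepwise search log above reports one AIC value per candidate, and the selected model is the candidate with the lowest AIC. The sketch below reproduces that final comparison in plain Python for a few (order, AIC) pairs copied from the log (the variants without intercept). It only illustrates the selection rule; it does not reproduce how auto_arima explores the candidates.

```python
import math

# A few (p, d, q) orders and their AIC values, copied from the log above
# (the variants fitted without an intercept).
candidates = {
    (2, 1, 1): -310.932,
    (3, 1, 1): -308.934,
    (3, 1, 2): -308.337,
    (1, 1, 2): -307.585,
    (2, 1, 2): math.inf,  # reported as AIC=inf in the log
}

# The best candidate is the one with the smallest AIC.
best_order = min(candidates, key=candidates.get)

print(best_order, candidates[best_order])
```

A candidate reported with `AIC=inf` can never win this comparison, so it is effectively excluded from the selection.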
Predictions in sample¶
Performance of the model in sample:
df_in_sample = m_arima.predict_in_sample(plot = True)
df_in_sample.head()
| | lower_95 | upper_95 | lower_90 | upper_90 | lower_80 | upper_80 | lower_50 | upper_50 | pred | date | data |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 17.195727 | 54.459095 | 18.759022 | 49.336901 | 20.762191 | 44.089349 | 24.666946 | 36.657726 | 29.999999 | 2010-01-10 | 44.0 |
| 2 | 22.919422 | 73.066754 | 25.014745 | 66.154560 | 27.701173 | 59.078259 | 32.942249 | 49.066560 | 40.108696 | 2010-01-17 | 46.0 |
| 3 | 26.956948 | 87.411540 | 29.456736 | 79.019537 | 32.666442 | 70.444107 | 38.942106 | 58.341994 | 47.548751 | 2010-01-24 | 47.0 |
| 4 | 27.526324 | 89.185832 | 30.077199 | 80.629434 | 33.352272 | 71.885250 | 39.755079 | 59.543504 | 48.534852 | 2010-01-31 | 68.0 |
| 5 | 35.896928 | 120.884538 | 39.330306 | 108.901185 | 43.753112 | 96.706100 | 52.442703 | 79.592217 | 64.438812 | 2010-02-07 | 56.0 |
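One way to read the interval columns is to check how often the observed value (`data`) falls inside each interval. The sketch below does this in plain Python for the five rows shown above, with values copied from the table. The helper `coverage` is hypothetical and not part of mosqlient, and five rows are far too few to judge calibration; this only shows the calculation.

```python
# (lower_95, upper_95, lower_50, upper_50, data) for the five rows above.
rows = [
    (17.195727, 54.459095, 24.666946, 36.657726, 44.0),
    (22.919422, 73.066754, 32.942249, 49.066560, 46.0),
    (26.956948, 87.411540, 38.942106, 58.341994, 47.0),
    (27.526324, 89.185832, 39.755079, 59.543504, 68.0),
    (35.896928, 120.884538, 52.442703, 79.592217, 56.0),
]


def coverage(intervals):
    """Fraction of (lower, upper, observed) triples with lower <= observed <= upper."""
    hits = sum(1 for lower, upper, obs in intervals if lower <= obs <= upper)
    return hits / len(intervals)


coverage_95 = coverage([(lo95, up95, obs) for lo95, up95, _, _, obs in rows])
coverage_50 = coverage([(lo50, up50, obs) for _, _, lo50, up50, obs in rows])

print(coverage_95, coverage_50)
```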
Predictions out of sample¶
The out-of-sample evaluation of the model starts after the last date used to train the model, which is the train_end_date parameter defined in the train() method.
In this method, you must define the end date of the evaluation and the forecast horizon. The out-of-sample prediction essentially repeats the forecast step multiple times. For example, if our data has indices 1-16 and the horizon is 4, the first forecast covers indices 1-4. After that, we update the Arima model with the actual observations of indices 1-4 and forecast indices 5-8. This is repeated until the end date.
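The block-by-block scheme in the example above can be sketched in plain Python. The generator `rolling_windows` is hypothetical and only enumerates the index blocks; it does not call mosqlient or fit any model, and the update step is represented by a comment.

```python
def rolling_windows(n_obs, horizon):
    """Yield (first, last) 1-based index pairs, one per forecast block."""
    start = 1
    while start <= n_obs:
        # The last block is truncated if n_obs is not a multiple of horizon.
        end = min(start + horizon - 1, n_obs)
        yield start, end
        start = end + 1


windows = list(rolling_windows(n_obs=16, horizon=4))

for first, last in windows:
    # 1. Forecast indices first..last with the current model.
    # 2. Update the model with the actual observations of first..last.
    print(f"forecast indices {first}-{last}, then update with their observations")
```

With 16 observations and a horizon of 4 this yields four blocks, which is consistent with the row index of the output restarting at 0 every four rows.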
df_out = m_arima.predict_out_of_sample(horizon = 4, end_date = '2023-12-31', plot = True)
df_out.head()
| | lower_95 | upper_95 | lower_90 | upper_90 | lower_80 | upper_80 | lower_50 | upper_50 | pred | date | data |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.380135 | 5.611532 | 2.541974 | 5.220175 | 2.744047 | 4.806410 | 3.123123 | 4.194429 | 3.614649 | 2022-01-02 | 21.0 |
| 1 | 2.039724 | 5.982174 | 2.213044 | 5.457454 | 2.433656 | 4.915421 | 2.859450 | 4.139010 | 3.433217 | 2022-01-09 | 14.0 |
| 2 | 1.717638 | 6.491374 | 1.897142 | 5.785903 | 2.130666 | 5.076974 | 2.596284 | 4.099097 | 3.252114 | 2022-01-16 | 22.0 |
| 3 | 1.490893 | 7.007500 | 1.671502 | 6.119112 | 1.910814 | 5.247761 | 2.401294 | 4.084722 | 3.118683 | 2022-01-23 | 19.0 |
| 0 | 36.296197 | 129.720212 | 39.933812 | 116.222528 | 44.643670 | 102.575269 | 53.967930 | 83.592642 | 66.974354 | 2022-01-30 | 31.0 |
Forecast¶
To forecast, the model must be trained first. The forecast starts after the last date used in the training step.
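As a small illustration of which dates a forecast covers, the sketch below lists the dates that follow a given last training date. It assumes weekly steps of 7 days, which matches the weekly resampling used in this notebook; the helper `forecast_dates` is hypothetical and not part of mosqlient.

```python
from datetime import date, timedelta


def forecast_dates(last_train_date, horizon, step_days=7):
    """Return the horizon dates that follow last_train_date at a fixed step."""
    return [
        last_train_date + timedelta(days=step_days * h)
        for h in range(1, horizon + 1)
    ]


# Last training date and horizon used in this section.
dates = forecast_dates(date(2023, 12, 31), horizon=4)

for d in dates:
    print(d.isoformat())
```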
model = m_arima.train( train_ini_date='2010-01-01', train_end_date = '2023-12-31')
df_for = m_arima.forecast(horizon = 4, plot = True, last_obs = 10)
df_for.head()
Performing stepwise search to minimize aic
 ARIMA(2,1,2)(0,0,0)[0] intercept : AIC=-217.169, Time=0.36 sec
 ARIMA(0,1,0)(0,0,0)[0] intercept : AIC=-171.494, Time=0.05 sec
 ARIMA(1,1,0)(0,0,0)[0] intercept : AIC=-201.036, Time=0.03 sec
 ARIMA(0,1,1)(0,0,0)[0] intercept : AIC=-194.051, Time=0.04 sec
 ARIMA(0,1,0)(0,0,0)[0] : AIC=-173.284, Time=0.02 sec
 ARIMA(1,1,2)(0,0,0)[0] intercept : AIC=-206.652, Time=0.23 sec
 ARIMA(2,1,1)(0,0,0)[0] intercept : AIC=-219.023, Time=0.27 sec
 ARIMA(1,1,1)(0,0,0)[0] intercept : AIC=-208.193, Time=0.13 sec
 ARIMA(2,1,0)(0,0,0)[0] intercept : AIC=-209.672, Time=0.09 sec
 ARIMA(3,1,1)(0,0,0)[0] intercept : AIC=-217.074, Time=0.28 sec
 ARIMA(3,1,0)(0,0,0)[0] intercept : AIC=-208.066, Time=0.16 sec
 ARIMA(3,1,2)(0,0,0)[0] intercept : AIC=-217.652, Time=0.44 sec
 ARIMA(2,1,1)(0,0,0)[0] : AIC=-220.856, Time=0.07 sec
 ARIMA(1,1,1)(0,0,0)[0] : AIC=-209.924, Time=0.04 sec
 ARIMA(2,1,0)(0,0,0)[0] : AIC=-211.428, Time=0.03 sec
 ARIMA(3,1,1)(0,0,0)[0] : AIC=-218.906, Time=0.11 sec
 ARIMA(2,1,2)(0,0,0)[0] : AIC=-219.000, Time=0.13 sec
 ARIMA(1,1,0)(0,0,0)[0] : AIC=-202.743, Time=0.02 sec
 ARIMA(1,1,2)(0,0,0)[0] : AIC=-208.393, Time=0.11 sec
 ARIMA(3,1,0)(0,0,0)[0] : AIC=-209.830, Time=0.04 sec
 ARIMA(3,1,2)(0,0,0)[0] : AIC=-219.483, Time=0.15 sec
Best model: ARIMA(2,1,1)(0,0,0)[0]
Total fit time: 2.832 seconds
| | lower_95 | upper_95 | lower_90 | upper_90 | lower_80 | upper_80 | lower_50 | upper_50 | pred | date |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 979.362067 | 5024.709106 | 1107.330069 | 4364.576361 | 1277.962061 | 3718.601511 | 1630.683211 | 2859.986027 | 2151.813739 | 2024-01-07 |
| 1 | 892.690685 | 7411.080508 | 1043.224008 | 6152.763743 | 1252.035583 | 4983.911469 | 1710.156843 | 3535.853971 | 2444.298253 | 2024-01-14 |
| 2 | 731.871572 | 10361.523055 | 885.822456 | 8163.881869 | 1108.574490 | 6241.143478 | 1629.777828 | 4041.398186 | 2542.431130 | 2024-01-21 |
| 3 | 627.565873 | 14406.506677 | 783.035174 | 10804.031635 | 1016.402917 | 7824.478511 | 1594.361430 | 4657.239879 | 2689.506657 | 2024-01-28 |