Quick Intro: auto_arima from pmdarima package

Steven Kyle
Apr 14, 2021 · 6 min read

*This blog post is intended for newer data scientists who know the basics of time series and are starting to get their feet wet with modeling.*

Time series analysis can quickly become confusing when you first dive in, so it is important for data scientists to gather as many helpful tools as possible. ARIMA/SARIMA models are among the top choices for time series modeling, but they require a handful of parameters that must be known to create an accurate model. There are different methods for finding the optimal parameters; my favorite, and the one I want to highlight in this blog post, is the auto_arima method from the pmdarima package. We will cover how this method can do most of the heavy lifting when finding the best initial parameters to start your time series modeling.

Typically, when implementing ARIMA/SARIMA models, the parameters p, d, and q (for ARIMA) and p, d, q, P, D, Q, and m (for SARIMA) need to be known (a brief sketch of where these plug into a model appears after the list below).
p is the trend autoregressive order
d is the trend differencing order
q is the trend moving average order
P is the seasonal autoregressive order
D is the seasonal differencing order
Q is the seasonal moving average order
m is the number of steps in a seasonal period
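
To make these parameters concrete, here is a minimal sketch, not from the original post, of where they go in statsmodels' SARIMAX: the trend terms form the order tuple and the seasonal terms form the seasonal_order tuple. The series my_series and the specific values (1, 1, 1) are placeholders for illustration only.

import pandas as pd
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

# A tiny placeholder monthly series just to show where the parameters go
idx = pd.date_range('2011-01-01', periods=48, freq='MS')
my_series = pd.Series(np.random.default_rng(0).normal(size=48).cumsum(), index=idx)

# order = (p, d, q) handles the trend; seasonal_order = (P, D, Q, m) handles seasonality
sketch_model = SARIMAX(my_series,
                       order=(1, 1, 1),               # p, d, q
                       seasonal_order=(1, 1, 1, 12))  # P, D, Q, m (m=12 for monthly data)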

pmdarima.auto_arima() is a useful tool that will run a grid search to find the best possible parameters. The only one it will not find is the m parameter; this is something we have to figure out and supply ourselves. We can figure it out by running a seasonal_decompose to find the number of steps in a seasonal period. (It is also possible to know this parameter simply from how your data is collected: monthly, quarterly, or yearly.)

Walkthrough of implementing pmdarima.auto_arima on housing prices

For a walkthrough of pmdarima.auto_arima we will use the housing prices for zipcode “83703” provided through Zillow. I have chosen to look at the prices from 2011-01-01 to 2018-04-01. First, to get familiar with the data, we should see if there is any seasonality in this time series. To do this we can use seasonal_decompose from statsmodels.

Median housing prices for Zipcode 83703
from statsmodels.tsa.seasonal import seasonal_decompose

# Taking the decomposition
decomposition = seasonal_decompose(BlogData)

# Gathering the trend, seasonality, and residuals
trend = decomposition.trend
seasonal = decomposition.seasonal
residual = decomposition.resid

Taking the decomposition of a time series separates it into three components (trend, seasonal, and residual). We can then plot these out to see if we can gather any useful insights.

The top graph shows the original raw data, the second graph shows the trend, the third graph shows the seasonality, and the fourth graph shows the residuals after the trend and seasonality have been removed.
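
As a quick aside, here is a minimal sketch of how those four panels could be reproduced with matplotlib. It relies on the BlogData series and the trend, seasonal, and residual variables gathered above; the layout details are my own assumption, not taken from the original figure.

import matplotlib.pyplot as plt

# Plot the raw data plus the three decomposition components in one column
fig, axes = plt.subplots(4, 1, figsize=(10, 8), sharex=True)
axes[0].plot(BlogData)
axes[0].set_ylabel('Observed')
axes[1].plot(trend)
axes[1].set_ylabel('Trend')
axes[2].plot(seasonal)
axes[2].set_ylabel('Seasonal')
axes[3].plot(residual)
axes[3].set_ylabel('Residual')
plt.tight_layout()
plt.show()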

A pattern can clearly be seen in the seasonality plot. The purpose of this seasonal decomposition is to get more familiar with the data and also to see if we can find the number of steps in a seasonal period (m). We will now look closer at the seasonal component to see if we can identify the m value. To find the seasonal period, we can look at the minimums and maximums and see how far apart they are.

import matplotlib.pyplot as plt

# Plot the seasonal component and mark two consecutive minimums and maximums
ax = seasonal.plot(label='Seasonality', color='blue')

min_ = seasonal.idxmin()
max_ = seasonal.idxmax()
min_2 = seasonal[max_:].idxmin()
max_2 = seasonal[min_2:].idxmax()

ax.axvline(min_, label='min 1', c='red')
ax.axvline(min_2, label='min 2', c='red', ls=':')
ax.axvline(max_, label='max 1', c='green')
ax.axvline(max_2, label='max 2', c='green', ls=':')
plt.legend(loc='upper right', fontsize='x-small')
print(f'The time difference between the two minimums is {min_2 - min_}')

As you can see in the output/graph, the seasonal period is one year. In our case the data is recorded monthly, so m = 12, since 12 months make up a year. Whenever the seasonal period is one year and the data is monthly, m will always be 12. Now that we have the m value, we can apply pmdarima's auto_arima method to our data to get the best parameters.

Installing pmdarima

First we can install the pmdarima package using pip:

$ pip install pmdarima

Once installed, we can import the package and use it on our time series. We will also use the train_test_split method from the package, splitting so that we train on six years of data (72 monthly observations) and use the rest to check our model.

Implementing auto_arima

import pmdarima as pm
from pmdarima.model_selection import train_test_split

# Train/test split: 72 monthly observations (6 years) for training
train, test = train_test_split(BlogData, train_size=72)

# Running the auto_arima search
auto_model = pm.auto_arima(y=train, start_p=0, start_q=0, max_p=5,
                           max_q=5, start_P=0, start_Q=0, max_P=5,
                           max_Q=5, m=12, max_order=None,
                           trace=True)

The auto_arima will do a stepwise search to find the best parameters within the bounds that are set. More details about the options for setting the auto_arima bounds can be found in the pmdarima documentation.
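
As an illustration (not part of the original walkthrough), here are a few other auto_arima arguments you may want to reach for when tuning the search. The keyword names come from the pmdarima documentation; the values shown are just example settings.

# A sketch of some commonly tuned auto_arima arguments (values are illustrative)
auto_model = pm.auto_arima(y=train,
                           m=12,                         # seasonal period
                           seasonal=True,                # fit the seasonal (P, D, Q, m) terms
                           stepwise=True,                # stepwise search instead of a full grid
                           information_criterion='aic',  # criterion used to rank candidate models
                           suppress_warnings=True,       # silence convergence warnings during the search
                           error_action='ignore',        # skip parameter combinations that fail to fit
                           trace=True)                   # print each model as it is evaluated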

Once the auto_arima has finished the search we can take a look at the summary output and also the diagnostics of the model.

display(auto_model.summary())
auto_model.plot_diagnostics(figsize=(12,6));

The summary output shows the aspects of the model, and the diagnostic plots can be used to make sure the model follows the assumptions it needs. We will not go into detail on what the assumptions mean, but the model here follows all of them except in the residual correlogram (bottom right). There should be no correlation among the residuals, yet there seems to be a negative correlation at a lag of 2. For the simplicity of this article we will not try to remove the correlation in the residuals.
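
If you want a numeric check to go along with the correlogram, one option (not used in the original post) is the Ljung-Box test from statsmodels, which tests whether the residual autocorrelations are jointly zero; small p-values suggest the residuals are still correlated at the tested lags.

from statsmodels.stats.diagnostic import acorr_ljungbox

# Ljung-Box test on the fitted model's residuals (lags chosen here are illustrative)
residuals = auto_model.resid()
print(acorr_ljungbox(residuals, lags=[2, 12], return_df=True))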

We can look at the model parameters by calling .order and .seasonal_order on the model:

# (p, d, q) chosen by the search
auto_model.order
# (P, D, Q, m) chosen by the search
auto_model.seasonal_order

Using the parameters the auto_arima discovered

Now that we have the model parameters from auto_arima, we can put them into a SARIMAX model and see how it does against our test data.

from statsmodels.tsa.statespace.sarimax import SARIMAX

# Creating the final SARIMAX model with the orders found by auto_arima
Final_model = SARIMAX(train.astype('int'),
                      order=auto_model.order,
                      seasonal_order=auto_model.seasonal_order,
                      enforce_invertibility=False,
                      enforce_stationarity=False)
Final_output = Final_model.fit()

# Displaying the model summary and diagnostics
display(Final_output.summary())
Final_output.plot_diagnostics();

# Getting the forecast for 16 months
forecast = Final_output.get_forecast(steps=16)
forecast_conf = forecast.conf_int()

# Plot observed values
ax = BlogData['2013':].plot()

# Plot forecasted values
forecast.predicted_mean.plot(ax=ax, label='Forecast', color='red', alpha=0.9)

# Plot the range of the confidence intervals
ax.fill_between(forecast_conf.index,
                forecast_conf.iloc[:, 0],
                forecast_conf.iloc[:, 1], color='g', alpha=0.1)

# Set axes labels
ax.set_xlabel('Date')
ax.set_ylabel('Price')
plt.legend()
plt.show()
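
Before wrapping up, if you want to put a number on how closely the forecast tracks the held-out data, here is a small sketch (not from the original post) that computes the mean absolute error between the forecast and the test split. It assumes the test set from the earlier split and the forecast object above are still in scope.

import numpy as np

# Align the forecast with the held-out observations and compute the mean absolute error
test_vals = np.asarray(test, dtype=float).ravel()
pred_vals = np.asarray(forecast.predicted_mean, dtype=float).ravel()
n = min(len(test_vals), len(pred_vals))
mae = np.mean(np.abs(test_vals[:n] - pred_vals[:n]))
print(f'Mean absolute error over the test period: {mae:,.0f}')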

Conclusion

We can see that even with the correlation in the residuals, the model did fairly well at predicting future values. In conclusion, the auto_arima function in the pmdarima package is a great tool for finding initial parameters for a model. The model can most likely be improved with a little fine-tuning, but auto_arima gives us a great starting set of initial parameters to work from.
