
Overview

Spark is an open-source distributed computing framework designed for large-scale data processing. This guide demonstrates how to use TimeGPT with Spark to perform forecasting and cross-validation across distributed clusters. Spark is well suited to enterprise environments with existing Hadoop infrastructure and datasets exceeding 100 million observations, and its distributed architecture handles massive-scale time series forecasting with fault tolerance and high performance.

Why Use Spark for Time Series Forecasting?

Spark offers unique advantages for enterprise-scale time series forecasting:
  • Enterprise-grade scalability: Handle datasets with 100M+ observations across distributed clusters
  • Hadoop integration: Seamlessly integrate with existing HDFS and Hadoop ecosystems
  • Fault tolerance: Automatic recovery from node failures ensures reliable computation
  • Mature ecosystem: Leverage Spark’s rich ecosystem of tools and libraries
  • Multi-language support: Work with Python (PySpark), Scala, or Java
Choose Spark when you have enterprise infrastructure, datasets exceeding 100 million observations, or need robust fault tolerance for mission-critical forecasting.

What you’ll learn:
  • Install Fugue with Spark support for distributed computing
  • Convert pandas DataFrames to Spark DataFrames
  • Run TimeGPT forecasting and cross-validation on Spark clusters

Prerequisites

Before proceeding, make sure you have an API key from Nixtla. If executing on a distributed Spark cluster, ensure the nixtla library is installed on all worker nodes for consistent execution.
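As a minimal sketch of the API-key setup: rather than hard-coding the key in your scripts, you can export it through the `NIXTLA_API_KEY` environment variable, which the Nixtla client reads when no `api_key` argument is passed. The key value below is a placeholder.

```python
import os

# Placeholder key -- substitute your own. With this variable set (for
# example in your shell profile or in the cluster's worker environment),
# NixtlaClient() can be instantiated without an explicit api_key argument.
os.environ["NIXTLA_API_KEY"] = "my_api_key_provided_by_nixtla"
```

On a distributed cluster, setting the variable in the worker environment avoids shipping the key inside your job code.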

How to Use TimeGPT with Spark


Step 1: Install Fugue and Spark

Fugue provides a convenient interface for distributing Python code across frameworks such as Spark. Install Fugue with Spark support:
pip install "fugue[spark]"
To work with TimeGPT, make sure the nixtla library is installed as well:
pip install nixtla

Step 2: Load Your Data

Load the dataset into a pandas DataFrame. In this example, we use hourly electricity price data from different markets:
import pandas as pd

df = pd.read_csv(
    'https://raw.githubusercontent.com/Nixtla/transfer-learning-time-series/main/datasets/electricity-short.csv',
    parse_dates=['ds'],
)
df.head()
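TimeGPT expects the data in long format: one row per series and timestamp, with a series identifier, a timestamp, and a target value. As a sketch, a tiny synthetic frame with the same layout as the electricity dataset (the market codes and values below are illustrative, not from the real file):

```python
import pandas as pd

# Long-format layout expected by TimeGPT:
#   unique_id -- series identifier (here, an electricity market)
#   ds        -- timestamp
#   y         -- target value (here, hourly price)
toy_df = pd.DataFrame(
    {
        "unique_id": ["BE", "BE", "FR", "FR"],
        "ds": pd.to_datetime(
            ["2016-10-22 00:00", "2016-10-22 01:00",
             "2016-10-22 00:00", "2016-10-22 01:00"]
        ),
        "y": [70.0, 37.1, 54.7, 51.2],
    }
)
print(toy_df)
```

If your columns use different names, the client methods accept `id_col`, `time_col`, and `target_col` parameters to map them.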

Step 3: Initialize Spark

Create a Spark session and convert your pandas DataFrame to a Spark DataFrame:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark_df = spark.createDataFrame(df)
spark_df.show(5)

Step 4: Use TimeGPT on Spark

To use TimeGPT with Spark, pass a Spark DataFrame to Nixtla’s client methods instead of a pandas DataFrame; otherwise the workflow is the same as local usage. Instantiate the NixtlaClient class to interact with Nixtla’s API:
from nixtla import NixtlaClient

nixtla_client = NixtlaClient(
    api_key='my_api_key_provided_by_nixtla'
)
You can use any method from the NixtlaClient, such as forecast or cross_validation.

Forecast example:
fcst_df = nixtla_client.forecast(spark_df, h=12)
fcst_df.show(5)

Cross-validation example:
cv_df = nixtla_client.cross_validation(spark_df, h=12, n_windows=5, step_size=2)
cv_df.show(5)
When using Azure AI endpoints, specify model="azureai":
nixtla_client.forecast(
    spark_df,
    h=12,
    model="azureai"
)
The public API supports two models: timegpt-1 (default) and timegpt-1-long-horizon. For long horizon forecasting, see the long-horizon model tutorial.

Step 5: Stop Spark

After completing your tasks, stop the Spark session to free resources:
spark.stop()

Working with Exogenous Variables

TimeGPT with Spark also supports exogenous variables. Refer to the Exogenous Variables Tutorial for details; simply substitute Spark DataFrames for pandas DataFrames, and the API remains identical.