Bulk Data Extracts
This tutorial demonstrates how to combine bulk forecast data extracts with live API data for backtesting and production optimization workflows. This approach is essential when you need forecasts from the same model across both historical analysis and live trading applications.
Many power market participants use price forecasts as inputs for downstream activities like in-house optimizations, asset management policies, and trading strategies. For backtesting purposes, it's crucial to use forecasts from a specific machine learning model to ensure consistency between historical analysis and production deployment.
Overview
Enertel provides bulk data extracts as part of paid subscriptions, allowing users to:
- Train decision layers using our price forecasts as inputs
- Maintain model consistency between backtesting and production
- Access historical data without API rate limits
- Combine historical and live data seamlessly
We're developing a streamlined bulk download feature that will eliminate the need for data stitching. Stay tuned for updates!
Prerequisites
- Bulk data extract: Request from the Enertel team (typically takes 2-3 business days)
- API token: For accessing live forecast data
- Python environment: with the `pandas`, `requests`, and `arrow` libraries installed
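If you are unsure whether your environment already has these dependencies (they can be installed with `pip install pandas requests arrow`), a quick import check like the following confirms they are available:

```python
# Quick sanity check that the required libraries are importable.
import pandas as pd
import requests
import arrow

print("pandas", pd.__version__)
print("requests", requests.__version__)
print("arrow", arrow.__version__)
```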
Requesting Bulk Extracts
Contact the Enertel team to request a bulk data extract. Provide:
- Date range: Historical period you need
- Price nodes: Specific locations of interest
- Data series: DALMP, RTLMP, or other series
- Model ID: Specific model for consistency (optional)
- Format preference: CSV or Parquet
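As a sketch, you might collect these details in a simple structure before sending them to the team. The node names, dates, and model ID below are hypothetical placeholders:

```python
# Hypothetical example of the details to include in a bulk extract request.
bulk_extract_request = {
    "date_range": {"start": "2023-01-01", "end": "2024-06-30"},  # placeholder dates
    "price_nodes": ["HB_NORTH", "HB_WEST"],                      # placeholder node names
    "data_series": ["DALMP", "RTLMP"],
    "model_id": 123,                                             # optional, but recommended for consistency
    "format": "parquet",
}
```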
Processing typically takes 2-3 business days, after which you'll receive a secure download link.
Data Validation
Before combining bulk and live data, validate your extract to ensure data quality:
```python
import os

import pandas as pd
import requests
import arrow
from datetime import datetime, timedelta

# Load your bulk data extract
df = pd.read_csv('<path_to_your_extract_file>')
print(f"Loaded {len(df)} rows from bulk extract")
print(f"Date range: {df['scheduled_at'].min()} to {df['scheduled_at'].max()}")

# Validate model consistency
model_ids = df['model_id'].unique().tolist()
print(f"Model IDs in extract: {model_ids}")

# Ensure a single model (recommended for consistency)
if len(model_ids) == 1:
    print(f"✓ Single model ID confirmed: {model_ids[0]}")
    model_id = model_ids[0]
else:
    print("⚠ Multiple model IDs found. Consider filtering to one model for consistency.")
    model_id = model_ids[0]  # Use the first model

# Check for missing data
missing_data = df.isnull().sum().sum()
if missing_data == 0:
    print("✓ No missing values found")
else:
    print(f"⚠ Found {missing_data} missing values")

# Check for duplicates
duplicate_cols = ['timestamp', 'feature_id', 'scheduled_at', 'model_id']
duplicates = df.duplicated(subset=duplicate_cols).sum()
if duplicates == 0:
    print("✓ No duplicate forecasts found")
else:
    print(f"⚠ Found {duplicates} duplicate forecasts")

# Validate data completeness
print("\nData summary:")
print(f"Price nodes: {df['object_name'].nunique()}")
print(f"Data series: {df['series_name'].unique()}")
print(f"Date range: {df['scheduled_at'].min()} to {df['scheduled_at'].max()}")
```
Accessing Live Data
Now we'll query for live forecast data to extend our historical dataset:
```python
def get_live_forecasts(token, model_id, start_date, end_date):
    """
    Fetch live forecasts from the dashboard API for a specific model.
    """
    base_url = "https://app.enertel.ai/api"
    headers = {"Authorization": f"Bearer {token}"}

    # Calculate date ranges (the API is limited to ~2-week increments)
    current_date = arrow.get(start_date)
    end_date = arrow.get(end_date)

    ranges = []
    while current_date < end_date:
        range_end = min(current_date.shift(days=14), end_date)
        ranges.append((
            current_date.format('YYYY-MM-DDTHH:mm:ss'),
            range_end.format('YYYY-MM-DDTHH:mm:ss')
        ))
        current_date = range_end

    print(f"Will make {len(ranges)} API calls to cover date range")

    all_data = []
    for i, (start, end) in enumerate(ranges):
        print(f"Fetching range {i+1}/{len(ranges)}: {start[:10]} to {end[:10]}")

        params = {
            "start": start,
            "end": end,
            "models": str(model_id)  # Specify the same model as the bulk data
        }

        try:
            response = requests.get(
                f"{base_url}/dashboard/forecasts",
                headers=headers,
                params=params,
                timeout=30
            )

            if response.status_code == 200:
                batch_data = response.json()
                parsed_data = parse_dashboard_forecasts(batch_data)
                all_data.append(parsed_data)
                print(f"  Retrieved {len(parsed_data)} forecast points")
            else:
                print(f"  Error: {response.status_code} - {response.text}")

        except requests.exceptions.RequestException as e:
            print(f"  Request failed: {e}")

    if all_data:
        return pd.concat(all_data, ignore_index=True)
    else:
        return pd.DataFrame()


def parse_dashboard_forecasts(response_data):
    """
    Parse the nested dashboard API response into a flat DataFrame.
    """
    forecasts = []
    for obj in response_data:
        for target in obj["targets"]:
            for vintage in target["vintages"]:
                for batch in vintage["batches"]:
                    for forecast in batch["forecasts"]:
                        forecasts.append({
                            "object_name": obj["object_name"],
                            "series_name": target.get("series_name", ""),
                            "target_id": target["target_id"],
                            "target_description": target["description"],
                            "scheduled_at": vintage["scheduled_at"],
                            "batch_id": batch["batch_id"],
                            "model_id": batch["model_id"],
                            "feature_id": batch["feature_id"],
                            **forecast
                        })
    return pd.DataFrame(forecasts)


# Example usage
token = '<your-api-token>'

# Start from the end of the bulk data and go up to the current date
bulk_end_date = df['scheduled_at'].max()
current_date = arrow.now().floor('day')

print(f"Fetching live data from {bulk_end_date} to {current_date.format('YYYY-MM-DD')}")

live_df = get_live_forecasts(
    token=token,
    model_id=model_id,
    start_date=bulk_end_date,
    end_date=current_date.format('YYYY-MM-DD')
)

print(f"Retrieved {len(live_df)} live forecast points")
```
Combining Historical and Live Data
Now combine the bulk extract with live API data:
```python
def combine_bulk_and_live_data(bulk_df, live_df, model_id):
    """
    Combine the bulk extract with live API data, ensuring consistency.
    """
    print("Combining bulk and live data...")

    # Filter live data to the same model (if multiple models were returned)
    if not live_df.empty:
        live_df = live_df[live_df['model_id'] == model_id]
        print(f"Filtered live data to model {model_id}: {len(live_df)} points")

    # Ensure a consistent column structure
    bulk_columns = set(bulk_df.columns)
    live_columns = set(live_df.columns) if not live_df.empty else set()

    # Find common columns
    common_columns = bulk_columns.intersection(live_columns)
    if live_df.empty:
        common_columns = bulk_columns
    print(f"Using {len(common_columns)} common columns")

    # Select common columns and combine
    if not live_df.empty:
        combined_df = pd.concat([
            bulk_df[list(common_columns)],
            live_df[list(common_columns)]
        ], ignore_index=True)
    else:
        combined_df = bulk_df.copy()

    # Validate the combined dataset
    print("\nCombined dataset validation:")
    print(f"Total rows: {len(combined_df)}")

    # Check for duplicates
    duplicate_cols = ['timestamp', 'feature_id', 'scheduled_at', 'model_id']
    duplicates = combined_df.duplicated(subset=duplicate_cols).sum()
    print(f"Duplicate rows: {duplicates}")

    # Remove duplicates if found
    if duplicates > 0:
        combined_df = combined_df.drop_duplicates(subset=duplicate_cols)
        print(f"Removed duplicates, final rows: {len(combined_df)}")

    # Verify model consistency
    unique_models = combined_df['model_id'].nunique()
    print(f"Unique model IDs: {unique_models}")

    # Date range
    print(f"Date range: {combined_df['scheduled_at'].min()} to {combined_df['scheduled_at'].max()}")

    return combined_df


# Combine the datasets
combined_df = combine_bulk_and_live_data(df, live_df, model_id)
```
Data Quality Checks
Perform final validation on your combined dataset:
```python
def validate_combined_dataset(df):
    """
    Comprehensive validation of the combined forecast dataset.
    """
    print("=== Dataset Validation Report ===")

    # Basic statistics
    print(f"Total forecast points: {len(df):,}")
    print(f"Date range: {df['scheduled_at'].min()} to {df['scheduled_at'].max()}")
    print(f"Price nodes: {df['object_name'].nunique()}")
    print(f"Data series: {', '.join(df['series_name'].unique())}")

    # Model consistency
    model_ids = df['model_id'].unique()
    print(f"Model IDs: {model_ids}")
    if len(model_ids) == 1:
        print("✓ Single model ID - consistent for backtesting")
    else:
        print("⚠ Multiple model IDs - may affect backtesting consistency")

    # Data completeness
    missing_by_column = df.isnull().sum()
    if missing_by_column.sum() == 0:
        print("✓ No missing values")
    else:
        print("⚠ Missing values found:")
        print(missing_by_column[missing_by_column > 0])

    # Duplicate check
    duplicate_cols = ['timestamp', 'feature_id', 'scheduled_at', 'model_id']
    duplicates = df.duplicated(subset=duplicate_cols).sum()
    if duplicates == 0:
        print("✓ No duplicate forecasts")
    else:
        print(f"⚠ {duplicates} duplicate forecasts found")

    # Forecast distribution check
    print("\nForecast statistics (p50):")
    print(df['p50'].describe())

    # Check for unrealistic values
    if (df['p50'] < 0).any():
        print("⚠ Negative prices found")
    if (df['p50'] > 1000).any():
        print("⚠ Very high prices found (>$1000/MWh)")

    # Temporal coverage
    df['scheduled_at'] = pd.to_datetime(df['scheduled_at'])
    date_gaps = df.groupby('object_name')['scheduled_at'].apply(
        lambda x: (x.max() - x.min()).days
    )
    print("\nTemporal coverage by price node (days):")
    print(date_gaps.describe())

    return df


# Validate the final dataset
final_df = validate_combined_dataset(combined_df)
```
Advanced Usage Examples
Backtesting Framework
```python
def prepare_for_backtesting(df, lookback_hours=24):
    """
    Prepare the dataset for backtesting with proper time alignment.
    """
    # Ensure datetime types
    df['timestamp'] = pd.to_datetime(df['timestamp'])
    df['scheduled_at'] = pd.to_datetime(df['scheduled_at'])

    # Calculate the operating-hour offset (hours between forecast issue time and delivery)
    df['operating_hour_offset'] = (
        df['timestamp'] - df['scheduled_at']
    ).dt.total_seconds() / 3600

    # Filter for specific forecast horizons (e.g., day-ahead)
    day_ahead_df = df[
        (df['operating_hour_offset'] >= 12) &
        (df['operating_hour_offset'] <= 36)
    ].copy()

    # Sort for time series analysis
    day_ahead_df = day_ahead_df.sort_values(['object_name', 'timestamp', 'scheduled_at'])

    print(f"Prepared {len(day_ahead_df)} day-ahead forecasts for backtesting")
    return day_ahead_df


# Prepare the backtesting dataset
backtest_df = prepare_for_backtesting(final_df)
```
Export for External Systems
```python
def export_combined_dataset(df, format='parquet', filename_prefix='enertel_combined'):
    """
    Export the combined dataset in various formats.
    """
    timestamp = datetime.now().strftime('%Y%m%d_%H%M')

    if format == 'parquet':
        filename = f"{filename_prefix}_{timestamp}.parquet"
        df.to_parquet(filename, index=False)
    elif format == 'csv':
        filename = f"{filename_prefix}_{timestamp}.csv"
        df.to_csv(filename, index=False)
    elif format == 'hdf':
        filename = f"{filename_prefix}_{timestamp}.h5"
        df.to_hdf(filename, key='forecasts', mode='w', index=False)

    print(f"Exported {len(df)} rows to {filename}")
    print(f"File size: {os.path.getsize(filename) / (1024**2):.1f} MB")

    return filename


# Export the final dataset
# export_file = export_combined_dataset(final_df, format='parquet')
```
Best Practices
Performance Optimization
```python
# For large datasets, consider using categorical data types
categorical_columns = ['object_name', 'series_name', 'iso']
for col in categorical_columns:
    if col in final_df.columns:
        final_df[col] = final_df[col].astype('category')

# Use efficient data types for numeric columns
numeric_optimizations = {
    'model_id': 'int32',
    'target_id': 'int32',
    'feature_id': 'int32',
    'batch_id': 'int64'
}

for col, dtype in numeric_optimizations.items():
    if col in final_df.columns:
        final_df[col] = final_df[col].astype(dtype)
```
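To see the effect of these conversions, you can inspect the DataFrame's memory footprint with pandas' built-in `memory_usage` (an optional check, not required for the workflow):

```python
# Optional: report the in-memory size of the optimized DataFrame.
size_mb = final_df.memory_usage(deep=True).sum() / (1024 ** 2)
print(f"Combined dataset memory usage: {size_mb:.1f} MB")
```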
Data Lineage Tracking
```python
# Add metadata to track data sources
final_df['data_source'] = 'bulk'

if not live_df.empty:
    # Mark live data points
    live_start_date = pd.to_datetime(bulk_end_date)
    final_df.loc[
        pd.to_datetime(final_df['scheduled_at']) > live_start_date,
        'data_source'
    ] = 'live_api'

print("Data source distribution:")
print(final_df['data_source'].value_counts())
```
Conclusion
This tutorial showed how to combine Enertel bulk data extracts with live API data into a single, seamless dataset for backtesting and production use. Key benefits include:
- Model consistency: Use the same model across historical and live data
- Seamless integration: No gaps between bulk and live data
- Quality assurance: Comprehensive validation ensures data integrity
- Scalability: Efficient data structures for large-scale analysis
When requesting bulk extracts, specify the same model IDs you plan to use in production. This ensures consistency between backtesting results and live trading performance.
Always specify model IDs in your API calls to ensure you receive forecasts from the same model used in your bulk extract. Without this specification, you'll receive forecasts from our current "best" model, which may change over time.
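For example, echoing the request pattern used earlier in this tutorial, passing the `models` parameter pins the dashboard query to the model recovered from your bulk extract (the dates below are placeholders):

```python
# Pin the dashboard query to the same model as the bulk extract.
params = {
    "start": "2024-07-01T00:00:00",  # placeholder dates
    "end": "2024-07-14T00:00:00",
    "models": str(model_id),         # model_id recovered from the bulk extract
}
response = requests.get(
    "https://app.enertel.ai/api/dashboard/forecasts",
    headers={"Authorization": f"Bearer {token}"},
    params=params,
    timeout=30,
)
```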
More information: Dashboard Forecasts API Documentation