Data Exploration

This document provides step-by-step instructions for exploring and visualizing forecast data. This should be the first step in sorting through a data extract from Enertel.

Concepts Overview

Prior to diving in, users may want to review the Enertel-specific concepts we use for the retrieval and organization of forecasts.

Sample of Data

We provide samples of data to be used as inputs into this tutorial at your price nodes of interest as part of our free trial. Just reach out!


Setup

Import the necessary libraries for data manipulation, API requests, and time handling.

import pandas as pd
import requests
import arrow

Load and Explore Data

Load Data

Read the CSV file into a Pandas DataFrame:

df = pd.read_csv('data_sample_from_enertel.csv')
df.info()

Basic Data Exploration

Check for Duplicate Composite Keys

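The composite key in the dataset is formed by the combination of timestamp, feature_id, scenario_id, and model_id. To confirm there are no duplicates, group by these columns and count the rows in each group. Sorting the counts in descending order puts any duplicated key at the top, so every value shown should be 1: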

df.groupby(['timestamp', 'feature_id', 'scenario_id', 'model_id']).size().sort_values(ascending=False).head()
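
If the group sizes show a count above 1, it helps to pull out the offending rows. The following is a minimal sketch on made-up data: the `sample` frame and its values are hypothetical stand-ins for the extract, and only the four key column names come from this tutorial.

```python
import pandas as pd

# Hypothetical toy data standing in for the extract; one key is repeated on purpose.
sample = pd.DataFrame({
    "timestamp": [
        "2024-01-01T00:00:00+00:00",
        "2024-01-01T00:00:00+00:00",
        "2024-01-01T01:00:00+00:00",
    ],
    "feature_id": [1, 1, 1],
    "scenario_id": [10, 10, 10],
    "model_id": [7, 7, 7],
    "p50": [25.0, 26.0, 30.0],
})

key_cols = ["timestamp", "feature_id", "scenario_id", "model_id"]

# keep=False flags every row whose key also appears on another row.
dupes = sample[sample.duplicated(subset=key_cols, keep=False)]
n_dupes = len(dupes)
print(f"{n_dupes} rows share a composite key")
print(dupes)
```

On the real extract, replace `sample` with `df`; an empty `dupes` frame means the composite key is unique.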

Check the Forecast Timestamp Range

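Find the minimum and maximum timestamps in the dataset to determine the range of forecasted intervals. Because read_csv loads the timestamp column as strings by default, parse it into timezone-aware datetimes first so the comparison is chronological rather than alphabetical:

# utc=True returns timezone-aware values in UTC; naive inputs are assumed to already be UTC
ts = pd.to_datetime(df['timestamp'], utc=True)
print(ts.min())
print(ts.max())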

Identify Price Nodes

To identify the price nodes included in the dataset and the distinct series, use:

print(df.object_name.unique())
print(df.series_name.unique())
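
Beyond listing the distinct values, it can be useful to see how many rows each node and series combination contributes. This is a small sketch on hypothetical data: the node and series names below are invented for illustration and are not taken from an actual extract.

```python
import pandas as pd

# Hypothetical node and series names, for illustration only.
sample = pd.DataFrame({
    "object_name": ["NODE_A", "NODE_A", "NODE_B", "NODE_B", "NODE_B"],
    "series_name": ["DALMP", "RTLMP", "DALMP", "DALMP", "RTLMP"],
})

# Rows per (node, series) pair, pivoted so each series becomes a column.
counts = (
    sample.groupby(["object_name", "series_name"])
    .size()
    .unstack(fill_value=0)
)
print(counts)
```

A zero in the table would indicate a node with no rows for that series.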

Analyze Forecast Performance

To analyze how the forecasts performed when actual prices were at their highest, sort the DataFrame by the actual column in descending order. Select relevant columns for inspection:

df.sort_values(by='actual', ascending=False)[['object_name', 'series_name', 'timestamp', 'scheduled_at', 'p50', 'p95', 'p99', 'actual']].head()
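
Sorting shows the extreme cases one row at a time; a summary per node can complement it. The sketch below assumes that p95 and p99 are upper-quantile forecasts, so the actual price should rarely exceed them. That interpretation, the toy values, and the helper column names (`above_p95`, `above_p99`, `p50_error`) are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical values; only the column names mirror the extract.
sample = pd.DataFrame({
    "object_name": ["NODE_A"] * 4,
    "p50": [20.0, 22.0, 25.0, 30.0],
    "p95": [40.0, 45.0, 50.0, 60.0],
    "p99": [60.0, 70.0, 80.0, 90.0],
    "actual": [18.0, 48.0, 85.0, 31.0],
})

checked = sample.assign(
    above_p95=sample["actual"] > sample["p95"],   # actual beyond the p95 forecast
    above_p99=sample["actual"] > sample["p99"],   # actual beyond the p99 forecast
    p50_error=sample["actual"] - sample["p50"],   # signed miss of the median forecast
)

# Share of intervals per node in which the actual exceeded each quantile.
exceed_rates = checked.groupby("object_name")[["above_p95", "above_p99"]].mean()
print(exceed_rates)
```

Under the quantile assumption, exceedance rates far above roughly 5% for p95 or 1% for p99 over a large sample would be worth a closer look.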

Visualize Data

Histograms of Forecasts

Generate a histogram of the p50 forecasts for each unique object_name. Stack the subplots vertically and share the x-axis so the distributions can be compared directly.

import matplotlib.pyplot as plt

# Assuming 'df' is your dataframe
unique_objects = df['object_name'].unique()
num_objects = len(unique_objects)

# Create a vertical subplot for each unique object_name
fig, axes = plt.subplots(num_objects, 1, figsize=(8, 4 * num_objects), sharex=True)

# If there's only one object_name, axes won't be an array
if num_objects == 1:
    axes = [axes]

for ax, obj_name in zip(axes, unique_objects):
    data = df[df['object_name'] == obj_name]['p50']
    ax.hist(data, bins=20, edgecolor='black', alpha=0.7)
    ax.set_title(f'Histogram of p50 for {obj_name}', fontsize=14)
    ax.set_ylabel('Frequency', fontsize=12)
    ax.grid(True, linestyle='--', alpha=0.6)

# Add a shared x-label
fig.text(0.5, 0.04, 'p50', ha='center', fontsize=14)

plt.tight_layout(rect=[0, 0.05, 1, 1]) # Adjust layout for the shared x-label
plt.show()

Notes

  • Ensure data_sample_from_enertel.csv exists in your working directory.
  • Replace column names if your dataset uses different terminology.

Happy Exploring! 🚀