Data Exploration
This document provides step-by-step instructions for exploring and visualizing forecast data. This should be the first step in sorting through a data extract from Enertel.
Prior to diving in, users may want to review the Enertel-specific concepts we use for the retrieval and organization of forecasts.
As part of our free trial, we provide sample data at your price nodes of interest to use as input for this tutorial. Just reach out!
Setup
Import the necessary libraries for data manipulation, API requests, and time handling.
import pandas as pd
import requests
import arrow
Load and Explore Data
Load Data
Read the CSV file into a Pandas DataFrame:
df = pd.read_csv('data_sample_from_enertel.csv')
df.info()
Basic Data Exploration
Check for Duplicate Composite Keys
The composite key in the dataset is formed by the combination of timestamp, feature_id, scenario_id, and model_id. To confirm there are no duplicates, group by these columns and check the group sizes; if the key is unique, the largest size is 1:
df.groupby(['timestamp', 'feature_id', 'scenario_id', 'model_id']).size().sort_values(ascending=False).head()
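If the largest group size is greater than 1, it helps to see the offending rows themselves. The following is a minimal sketch, not part of any Enertel tooling: find_duplicate_keys is a hypothetical helper name, and it assumes the four key columns are present in the DataFrame under the names given above.

```python
import pandas as pd

# Composite key columns as named in this tutorial.
KEY_COLS = ["timestamp", "feature_id", "scenario_id", "model_id"]


def find_duplicate_keys(df: pd.DataFrame, key_cols=KEY_COLS) -> pd.DataFrame:
    """Return every row whose composite key appears more than once.

    Hypothetical helper for illustration only.
    """
    # keep=False marks all members of a duplicated group, not just the repeats.
    mask = df.duplicated(subset=key_cols, keep=False)
    return df[mask].sort_values(key_cols)
```

Calling find_duplicate_keys(df) returns an empty DataFrame when the composite key is unique.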
Check the Forecast Timestamp Range
Find the minimum and maximum timestamps in the dataset to determine the range of forecasted intervals. Confirm that the printed values carry a UTC offset, i.e. that the timestamps are timezone-aware:
print(df['timestamp'].min())
print(df['timestamp'].max())
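Because pd.read_csv loads the timestamp column as plain strings unless told otherwise, the minimum and maximum above are string comparisons. A sketch of one way to parse the column into timezone-aware datetimes first is shown here; timestamp_range is a hypothetical helper, and it assumes the column holds ISO 8601 strings.

```python
import pandas as pd


def timestamp_range(df: pd.DataFrame, col: str = "timestamp"):
    """Parse `col` as timezone-aware UTC datetimes and return (min, max).

    Hypothetical helper for illustration only.
    """
    # utc=True converts strings that carry an offset to UTC. Assumption of
    # this sketch: strings without an offset are treated as already UTC.
    ts = pd.to_datetime(df[col], utc=True)
    return ts.min(), ts.max()
```

For example, start, end = timestamp_range(df) gives two pandas Timestamp objects in UTC that can be compared or subtracted directly.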
Identify Price Nodes
To identify the price nodes included in the dataset and the distinct series, use:
print(df.object_name.unique())
print(df.series_name.unique())
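Beyond listing the names, it can be useful to see how many rows each node and series contributes, since an unexpectedly small count may point to a gap in the extract. This is a small sketch under the assumption that object_name and series_name are the only columns needed to tell series apart; rows_per_series is a hypothetical helper name.

```python
import pandas as pd


def rows_per_series(df: pd.DataFrame) -> pd.Series:
    """Count rows for each (object_name, series_name) pair, largest first.

    Hypothetical helper for illustration only.
    """
    counts = df.groupby(["object_name", "series_name"]).size()
    return counts.sort_values(ascending=False)
```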
Analyze Forecast Performance
To analyze how the forecasts performed when actual prices were at their highest, sort the DataFrame by the actual column in descending order. Select relevant columns for inspection:
df.sort_values(by='actual', ascending=False)[['object_name', 'series_name', 'timestamp', 'scheduled_at', 'p50', 'p95', 'p99', 'actual']].head()
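Inspecting the top rows can be complemented with a per-node summary. The sketch below is one possible approach rather than an Enertel-defined metric: it assumes p50, p95, and p99 are the 50th, 95th, and 99th percentile forecasts, and summarize_forecasts is a hypothetical helper name. Under that assumption, a well-calibrated forecast would see actuals exceed p95 in roughly 5% of intervals and p99 in roughly 1%.

```python
import pandas as pd


def summarize_forecasts(df: pd.DataFrame) -> pd.DataFrame:
    """Per-series p50 error and upper-quantile exceedance rates.

    Hypothetical helper; assumes p50/p95/p99 are percentile forecasts.
    """
    # Skip intervals that have no actual value to score against.
    scored = df.dropna(subset=["actual"]).copy()
    scored["abs_err_p50"] = (scored["actual"] - scored["p50"]).abs()
    scored["above_p95"] = scored["actual"] > scored["p95"]
    scored["above_p99"] = scored["actual"] > scored["p99"]
    return scored.groupby(["object_name", "series_name"]).agg(
        n=("actual", "size"),
        mae_p50=("abs_err_p50", "mean"),
        share_above_p95=("above_p95", "mean"),
        share_above_p99=("above_p99", "mean"),
    )
```

Calling summarize_forecasts(df) returns one row per object_name and series_name pair, which makes it easier to spot nodes where the upper quantiles are exceeded more often than expected.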
Visualize Data
Histograms of Forecasts
Generate a histogram of the p50 forecasts for each unique object_name. Stack the subplots vertically and share the x-axis so the distributions are directly comparable.
import matplotlib.pyplot as plt
# Assuming 'df' is your dataframe
unique_objects = df['object_name'].unique()
num_objects = len(unique_objects)
# Create a vertical subplot for each unique object_name
fig, axes = plt.subplots(num_objects, 1, figsize=(8, 4 * num_objects), sharex=True)
# If there's only one object_name, axes won't be an array
if num_objects == 1:
    axes = [axes]
for ax, obj_name in zip(axes, unique_objects):
    data = df[df['object_name'] == obj_name]['p50']
    ax.hist(data, bins=20, edgecolor='black', alpha=0.7)
    ax.set_title(f'Histogram of p50 for {obj_name}', fontsize=14)
    ax.set_ylabel('Frequency', fontsize=12)
    ax.grid(True, linestyle='--', alpha=0.6)
# Add a shared x-label
fig.text(0.5, 0.04, 'p50', ha='center', fontsize=14)
plt.tight_layout(rect=[0, 0.05, 1, 1]) # Adjust layout for the shared x-label
plt.show()
Notes
- Ensure data_sample_from_enertel.csv exists in your working directory.
- Replace column names if your dataset uses different terminology.
Happy Exploring! 🚀