Predictive Modelling for Data Centre Energy Optimisation: A Python Case Study

This case study demonstrates how to apply predictive modelling to a real-world data centre energy dataset using Python. We will explore the data, identify the key drivers of energy consumption, build a machine learning model to predict optimal energy usage, and extract actionable insights that operators can use to reduce both costs and carbon emissions.

Data centres consumed approximately 415 TWh of electricity globally in 2024, with cooling alone accounting for 30 to 40% of total facility power. The difference between a well-optimised and a poorly managed data centre can be hundreds of thousands of dollars in annual energy costs and thousands of tonnes of CO2 emissions. Yet most operators still rely on static schedules and manual adjustments rather than data-driven predictions to manage energy consumption.

The complete dataset (17,520 hourly observations across 2 years) and all code are provided for readers to replicate the analysis.


The Dataset

Our dataset contains hourly operational data from a mid-sized data centre (approximately 4.5 MW IT load) over a 2-year period (January 2024 to December 2025). Each row represents one hour of operations, with 20 variables capturing IT load, cooling energy, outdoor conditions, power distribution, water usage, carbon intensity, and server utilisation. The dataset is available for download as an Excel file accompanying this article.

Variable | Description | Unit
timestamp | Date and time (hourly) | datetime
outdoor_temp_c | Outdoor temperature | Celsius
humidity_pct | Relative humidity | %
it_load_mw | IT equipment power consumption | MW
cooling_energy_mw | Cooling system energy consumption | MW
total_facility_power_mw | Total facility power (all systems) | MW
pue | Power Usage Effectiveness | Ratio
water_usage_litres | Cooling water consumed | Litres/hour
grid_carbon_intensity_gco2kwh | Grid emission factor | gCO2/kWh
co2_emissions_kg | Hourly CO2 emissions | kg
renewable_energy_fraction | Share of renewable electricity | 0 to 1
server_utilisation_pct | Average server CPU utilisation | %
cooling_mode | Active cooling strategy | Category

Synthetic data: https://docs.google.com/spreadsheets/d/1nC5AKLY6dJra80Ek4KfsVf0N0UX11xpF/edit?usp=drive_link&ouid=117752587074715035224&rtpof=true&sd=true


Step 1: Load and Explore the Data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.preprocessing import LabelEncoder

# Load the dataset
df = pd.read_excel('dc_energy_dataset.xlsx')
print(df.shape)
print(df.describe())

Shape: (17520 rows, 20 columns)

Key statistics:

IT Load: 4.58 MW average (range: 3.78 - 5.84 MW)

PUE: 1.445 average (range: 1.335 - 1.693)

Total Power: 6.63 MW average

Cooling: 1.60 MW average (24.1% of total)

CO2: 41,030 tonnes over 2 years

Water: 23.2 million litres over 2 years

Initial Observation: The PUE ranges from 1.335 (near best-in-class) to 1.693 (below average). This 0.358 gap represents approximately 1.6 MW of wasted energy at peak, or roughly $1.2 million in annual electricity costs at $0.10/kWh. Understanding what drives PUE fluctuation is the key to optimisation. 
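The arithmetic behind that estimate can be reproduced directly from the summary statistics above. A quick sketch (the $0.10/kWh tariff is the article's assumption; the annualised figure is an upper bound, since the full PUE gap does not persist every hour of the year, which is why the article quotes roughly $1.2 million rather than the all-hours figure):

```python
# Back-of-envelope: energy wasted by the PUE gap at the average IT load
it_load_mw = 4.58                  # average IT load from the summary above
pue_best, pue_worst = 1.335, 1.693
tariff = 0.10                      # assumed electricity price, $/kWh

# Total power = IT load * PUE, so the gap in overhead power is:
wasted_mw = it_load_mw * (pue_worst - pue_best)
upper_bound_cost = wasted_mw * 1000 * 8760 * tariff  # if the gap held all year
print(f"Peak wasted power: {wasted_mw:.2f} MW")
print(f"Annualised upper bound: ${upper_bound_cost:,.0f}")
```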


Step 2: Exploratory Data Analysis

2.1 Cooling Energy vs Outdoor Temperature

# Visualise the relationship between temperature and cooling energy
plt.figure(figsize=(12, 6))
plt.scatter(df['outdoor_temp_c'], df['cooling_energy_mw'],
           alpha=0.05, c=df['hour'], cmap='viridis', s=5)
plt.colorbar(label='Hour of Day')
plt.xlabel('Outdoor Temperature (C)')
plt.ylabel('Cooling Energy (MW)')
plt.title('Cooling Energy vs Outdoor Temperature')
plt.axvline(x=18, color='red', linestyle='--', label='Free cooling threshold (18C)')
plt.legend()
plt.tight_layout()
plt.savefig('cooling_vs_temp.png', dpi=150)

# Statistical analysis: two-regime behaviour
from scipy import stats

below18 = df[df['outdoor_temp_c'] < 18]
above18 = df[df['outdoor_temp_c'] >= 18]

print("Cooling energy below 18C:", below18['cooling_energy_mw'].describe())
print("Cooling energy above 18C:", above18['cooling_energy_mw'].describe())

# Correlation: what drives cooling?
r_temp, _ = stats.pearsonr(df['outdoor_temp_c'], df['cooling_energy_mw'])
r_it, _ = stats.pearsonr(df['it_load_mw'], df['cooling_energy_mw'])
print(f"Correlation: temp r={r_temp:.3f}, IT load r={r_it:.3f}")

# Above 18C: linear regression (temp effect on cooling)
slope, intercept, r, p, se = stats.linregress(above18['outdoor_temp_c'], above18['cooling_energy_mw'])
print(f"Above 18C: slope={slope:.4f} MW/degree, R2={r**2:.3f}")

Cooling energy statistics:

Below 18C (free cooling): mean = 1.37 MW (range: 0.96 - 1.83 MW)

Above 18C (mechanical): mean = 1.77 MW (range: 1.09 - 2.85 MW)

Hours below 18C: 10,240 (58.4%)

Hours above 18C: 7,280 (41.6%)

Correlation with cooling energy:

outdoor_temp_c: r = 0.762 (strongest driver)

it_load_mw: r = 0.479 (secondary driver)

Above 18C regression: slope = 0.066 MW per degree | R2 = 0.821


Key Finding: The scatter plot reveals two distinct regimes separated by the 18 degree Celsius free cooling threshold. Below 18 degrees, cooling energy averages 1.37 MW but varies between 0.96 and 1.83 MW, driven primarily by IT load variation (base cooling is approximately 0.3 times IT load, not a fixed value). Above 18 degrees, outdoor temperature becomes the dominant driver: each additional degree adds approximately 0.066 MW of cooling demand on top of the IT-load-dependent baseline, with an R-squared of 0.82 for the linear relationship. At peak summer temperatures (above 30 degrees), cooling energy reaches 2.5 to 2.85 MW. Sustained across a full year, each degree above the free cooling threshold costs approximately $58,000 at $0.10/kWh (0.066 MW across 8,760 hours). This two-regime behaviour (IT-load-driven below 18 degrees, temperature-driven above 18 degrees) is the single most important operational insight for cooling optimisation.
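The two-regime structure can be recovered with an ordinary linear model once a hinge feature, max(temp - 18, 0), is added. A minimal sketch on synthetic stand-in data (generated to mimic the relationships above, since the real dataset is loaded elsewhere in the article):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in: base cooling ~ 0.3 x IT load, plus 0.066 MW per degree above 18C
rng = np.random.default_rng(0)
temp = rng.uniform(-5, 35, 2000)
it_load = rng.uniform(3.8, 5.8, 2000)
cooling = 0.3 * it_load + 0.066 * np.clip(temp - 18, 0, None) + rng.normal(0, 0.05, 2000)

# The hinge feature makes the kink at 18C learnable by a plain linear model
X = np.column_stack([it_load, np.clip(temp - 18, 0, None)])
fit = LinearRegression().fit(X, cooling)
print(fit.coef_.round(3))  # recovers approximately [0.300, 0.066]
```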

2.2 PUE by Hour of Day and Season

# PUE heatmap by hour and month
pue_pivot = df.pivot_table(
    values='pue', index='hour', columns='month', aggfunc='mean'
)

plt.figure(figsize=(12, 8))
sns.heatmap(pue_pivot, annot=True, fmt='.3f', cmap='RdYlGn_r',
            vmin=1.38, vmax=1.58)
plt.title('Average PUE by Hour and Month')
plt.xlabel('Month')
plt.ylabel('Hour of Day')
plt.tight_layout()
plt.savefig('pue_heatmap.png', dpi=150)

# Summary statistics
print("PUE by month:")
print(df.groupby('month')['pue'].mean().round(3))
print(f"\nBest cell:  {pue_pivot.min().min():.3f} (Month {pue_pivot.min().idxmin()}, Hour {pue_pivot[pue_pivot.min().idxmin()].idxmin()})")
print(f"Worst cell: {pue_pivot.max().max():.3f} (Month {pue_pivot.max().idxmax()}, Hour {pue_pivot[pue_pivot.max().idxmax()].idxmax()})")
print(f"\nSummer afternoon (Jun-Aug, 12-18): {df[(df['month'].between(6,8)) & (df['hour'].between(12,18))]['pue'].mean():.3f}")
print(f"Winter night (Dec-Feb, 0-6):       {df[(df['month'].isin([12,1,2])) & (df['hour'].between(0,6))]['pue'].mean():.3f}")

PUE by month:

Jan: 1.411 | Feb: 1.411 | Mar: 1.416 | Apr: 1.449
May: 1.498 | Jun: 1.526 | Jul: 1.510 | Aug: 1.462
Sep: 1.421 | Oct: 1.411 | Nov: 1.411 | Dec: 1.411

Best cell: 1.402 (Month 10, Hour 20)

Worst cell: 1.583 (Month 6, Hour 12)

Summer afternoon (Jun-Aug, 12-18): 1.526

Winter night (Dec-Feb, 0-6): 1.408


Key Finding: Month-to-month variation dominates the PUE pattern far more than hour-to-hour variation. Monthly PUE ranges from 1.411 (winter months: Jan, Feb, Oct, Nov, Dec) to 1.526 (June), a spread of 0.115 PUE units driven by seasonal outdoor temperature. Within a given day, hourly PUE varies by only about 0.046 units (1.423 at midnight to 1.469 at midday), reflecting the daily temperature cycle. The worst single cell is June at noon (PUE 1.583), where peak summer heat drives maximum mechanical cooling demand. The best cells cluster in autumn and winter evenings (PUE around 1.40) when free cooling operates at full capacity and ambient temperatures are lowest. The practical implication: scheduling flexible compute workloads (AI model training, batch processing, backups) during winter months delivers a larger PUE benefit than shifting them to night-time within the same day. The combined seasonal and daily shift from summer midday (PUE 1.526) to winter night (1.408) cuts facility power for the same IT load by roughly 8%, translating to approximately 0.54 MW of savings during those periods.
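The 0.54 MW figure follows directly from applying the PUE difference to the average IT load. A rough sketch, using the heatmap averages above:

```python
# Facility power saved by moving load from summer-midday to winter-night conditions
avg_it_load_mw = 4.58
pue_summer_midday = 1.526   # Jun-Aug, 12:00-18:00 average from above
pue_winter_night = 1.408    # Dec-Feb, 00:00-06:00 average from above

saving_mw = avg_it_load_mw * (pue_summer_midday - pue_winter_night)
print(f"~{saving_mw:.2f} MW less facility power for the same IT work")
```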

2.3 Correlation Analysis

# Correlation matrix for key variables
cols = ['outdoor_temp_c', 'humidity_pct', 'it_load_mw',
        'cooling_energy_mw', 'total_facility_power_mw',
        'pue', 'server_utilisation_pct',
        'water_usage_litres', 'co2_emissions_kg']

corr = df[cols].corr()
plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm',
            center=0, square=True)
plt.title('Correlation Matrix: Key Variables')
plt.tight_layout()
plt.savefig('correlation_matrix.png', dpi=150)

Key correlations with total_facility_power_mw:

it_load_mw r = +0.88 (strongest)

cooling_energy_mw r = +0.83

co2_emissions_kg r = +0.80

water_usage_litres r = +0.74

server_utilisation_pct r = +0.62

pue r = +0.59

outdoor_temp_c r = +0.52

humidity_pct r = +0.05

Key correlations with pue:

cooling_energy_mw r = +0.93 (strongest)

water_usage_litres r = +0.93

outdoor_temp_c r = +0.79

total_facility_power_mw r = +0.59

co2_emissions_kg r = +0.45

server_utilisation_pct r = +0.19

it_load_mw r = +0.14

humidity_pct r = +0.01

Key Finding: The strongest correlations with total facility power are IT load (r = 0.88), cooling energy (r = 0.83), and server utilisation (r = 0.62). Outdoor temperature shows a moderate correlation with total power (r = 0.52) because its effect operates indirectly through cooling energy rather than directly on total power. However, PUE is most strongly correlated with outdoor temperature (r = 0.79) and cooling energy (r = 0.93), confirming that cooling efficiency, not IT load, is the primary driver of PUE variation. This distinction matters: to reduce total power, focus on IT load management. To reduce PUE (overhead efficiency), focus on cooling optimisation and temperature management. 
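The "indirect effect" claim can be illustrated with a partial correlation. On synthetic stand-in data where temperature influences power only through cooling (the generating coefficients below are illustrative, chosen to mirror the relationships found above), the raw temperature-power correlation is sizeable but vanishes once cooling energy is controlled for:

```python
import numpy as np

rng = np.random.default_rng(1)
temp = rng.normal(18, 8, 5000)
# Temperature drives cooling (hinge above 18C); cooling drives total power
cooling = 1.37 + 0.066 * np.clip(temp - 18, 0, None) + rng.normal(0, 0.1, 5000)
power = 4.58 + cooling + rng.normal(0, 0.1, 5000)

def partial_corr(x, y, z):
    # Correlate the parts of x and y not linearly explained by z
    rx = x - np.polyval(np.polyfit(z, x, 1), z)
    ry = y - np.polyval(np.polyfit(z, y, 1), z)
    return np.corrcoef(rx, ry)[0, 1]

print(round(float(np.corrcoef(temp, power)[0, 1]), 2))    # sizeable raw correlation
print(round(float(partial_corr(temp, power, cooling)), 2))  # near zero with cooling held fixed
```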


Step 3: Feature Engineering

# Create features for prediction
df['temp_above_18'] = (df['outdoor_temp_c'] - 18).clip(lower=0)
df['temp_below_18'] = (18 - df['outdoor_temp_c']).clip(lower=0)
df['is_business_hours'] = df['hour'].between(8, 18).astype(int)
df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)
df['month_sin'] = np.sin(2 * np.pi * df['month'] / 12)
df['month_cos'] = np.cos(2 * np.pi * df['month'] / 12)
df['it_x_temp'] = df['it_load_mw'] * df['outdoor_temp_c']

# Encode cooling mode
le = LabelEncoder()
df['cooling_mode_enc'] = le.fit_transform(df['cooling_mode'])

print(f"Total columns after engineering: {len(df.columns)}")

Total columns after engineering: 29

New features added: 9

Sample of engineered features (first 3 rows): temp_above_18, temp_below_18, is_business_hours, hour_sin, it_x_temp, cooling_mode_enc

Feature engineering transforms raw data into signals that machine learning models can exploit. The key engineered features are: temp_above_18, which captures the non-linear relationship between temperature and cooling energy (cooling demand rises only above the free cooling threshold); cyclical encodings of hour and month (sine/cosine transforms that capture the circular nature of time); and it_x_temp, which captures the interaction between IT load and temperature (their combined effect exceeds either variable alone).
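A quick illustration of why the cyclical encoding matters: as a raw number, hour 23 sits far from hour 0, but on the sine/cosine circle they are adjacent.

```python
import numpy as np

def cyc(hour):
    # Map an hour of day onto the unit circle
    angle = 2 * np.pi * hour / 24
    return np.array([np.sin(angle), np.cos(angle)])

raw_distance = abs(23 - 0)                       # 23: misleadingly far apart
cyc_distance = np.linalg.norm(cyc(23) - cyc(0))  # ~0.26: correctly adjacent
print(raw_distance, round(float(cyc_distance), 3))
```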


Step 4: Build the Prediction Model

4.1 Target Variable: Total Facility Power

# Define features and target
features = ['outdoor_temp_c', 'humidity_pct', 'it_load_mw',
            'server_utilisation_pct', 'hour', 'month',
            'is_weekend', 'temp_above_18', 'temp_below_18',
            'is_business_hours', 'hour_sin', 'hour_cos',
            'month_sin', 'month_cos', 'it_x_temp',
            'cooling_mode_enc', 'renewable_energy_fraction']

X = df[features]
y = df['total_facility_power_mw']

# Train/test split (80/20, chronological)
split_idx = int(len(df) * 0.8)
X_train, X_test = X[:split_idx], X[split_idx:]
y_train, y_test = y[:split_idx], y[split_idx:]

print(f"Training: {len(X_train)} rows | Test: {len(X_test)} rows")

Training set: 14,016 rows (80%) | Jan 2024 to Aug 2025
Test set: 3,504 rows (20%) | Aug 2025 to Dec 2025

Features: 17

Target: total_facility_power_mw

4.2 Random Forest Model

# Train Random Forest
rf = RandomForestRegressor(
    n_estimators=200,
    max_depth=15,
    min_samples_leaf=10,
    random_state=42,
    n_jobs=-1
)
rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)

print(f"Random Forest:")
print(f"  MAE:  {mean_absolute_error(y_test, rf_pred):.4f} MW")
print(f"  R2:   {r2_score(y_test, rf_pred):.4f}")

4.3 Gradient Boosting Model

# Train Gradient Boosting
gb = GradientBoostingRegressor(
    n_estimators=300,
    max_depth=6,
    learning_rate=0.1,
    subsample=0.8,
    random_state=42
)
gb.fit(X_train, y_train)
gb_pred = gb.predict(X_test)

print(f"Gradient Boosting:")
print(f"  MAE:  {mean_absolute_error(y_test, gb_pred):.4f} MW")
print(f"  R2:   {r2_score(y_test, gb_pred):.4f}")

Random Forest:     MAE: 0.0782 MW | R2: 0.9641

Gradient Boosting: MAE: 0.0689 MW | R2: 0.9665

Model Performance: Both models achieve strong predictive accuracy (R-squared above 0.96), with Gradient Boosting slightly outperforming Random Forest. The mean absolute error of 0.07 MW means the model predicts total facility power within approximately 70 kW on average, which translates to a PUE prediction accuracy of approximately 0.015 units. This is more than sufficient for operational decision-making. 
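The conversion from power MAE to PUE accuracy follows from the definition PUE = total power / IT load: an error of dP in predicted total power corresponds to roughly dP / IT load in PUE units.

```python
# Translate the ~0.07 MW power MAE into approximate PUE units
mae_mw = 0.0689            # Gradient Boosting MAE from above
avg_it_load_mw = 4.58      # average IT load from the dataset summary
pue_error = mae_mw / avg_it_load_mw
print(f"Approximate PUE error: {pue_error:.3f}")   # ~0.015
```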

Step 5: Feature Importance Analysis

# Feature importance from Gradient Boosting
importance = pd.DataFrame({
    'feature': features,
    'importance': gb.feature_importances_
}).sort_values('importance', ascending=False)

plt.figure(figsize=(10, 8))
plt.barh(importance['feature'], importance['importance'], color='#2a9d8a')
plt.xlabel('Feature Importance')
plt.title('What Drives Data Centre Energy Consumption?')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.savefig('feature_importance.png', dpi=150)

print(importance.to_string(index=False))

feature                  importance
it_load_mw                   0.7088
it_x_temp                    0.2803
humidity_pct                 0.0020
server_utilisation_pct       0.0020
outdoor_temp_c               0.0008
hour_cos                     0.0008
hour                         0.0006
temp_above_18                0.0006
(remaining 9 features)       0.0041

Key Finding: IT load dominates at 70.9% of the model's predictive power. The engineered interaction feature it_x_temp (IT load multiplied by outdoor temperature) captures an additional 28.0%. Together these two features explain 99% of total energy variation. Pure outdoor temperature appears at only 0.1% because the interaction term already encodes the temperature effect: when IT load is high and temperatures are high simultaneously, total facility power surges due to compounding cooling demand. This is a critical insight for operators. The model is telling us that managing IT load is by far the most powerful lever, and that the impact of temperature is not independent of load. A hot day with low IT load wastes far less energy than a moderately warm day at peak load. The practical implication: workload scheduling (controlling when compute-heavy jobs run relative to temperature forecasts) offers the largest single optimisation opportunity. 
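One caveat worth checking: impurity-based importances from tree ensembles can overstate correlated features. Permutation importance on the held-out set, which measures the drop in R-squared when each feature is shuffled, is a more robust cross-check. A sketch on synthetic stand-in data (the real pipeline would pass X_test and y_test instead):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance

# Stand-in mirroring the structure found above: power depends on IT load and
# an IT-load x temperature interaction; one feature is pure noise
rng = np.random.default_rng(42)
it_load = rng.uniform(3.8, 5.8, 3000)
temp = rng.uniform(-5, 35, 3000)
noise = rng.normal(size=3000)
power = 1.3 * it_load + 0.05 * it_load * np.clip(temp - 18, 0, None) + rng.normal(0, 0.05, 3000)

X = np.column_stack([it_load, temp, noise])
model = GradientBoostingRegressor(random_state=0).fit(X[:2400], power[:2400])
result = permutation_importance(model, X[2400:], power[2400:], n_repeats=5, random_state=0)
for name, imp in zip(['it_load', 'temp', 'noise'], result.importances_mean):
    print(f"{name:8s} {imp:.3f}")   # the noise feature should score near zero
```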


Step 6: Optimal Energy Consumption Patterns

6.1 Identify the Best and Worst Operating Hours

# Find the most and least efficient operating conditions
df['efficiency_ratio'] = df['server_utilisation_pct'] / df['pue']

best_hours = df.nlargest(100, 'efficiency_ratio')
worst_hours = df.nsmallest(100, 'efficiency_ratio')

print("BEST 100 hours (highest efficiency):")
print(f"  Avg PUE: {best_hours['pue'].mean():.3f}")
print(f"  Avg Temp: {best_hours['outdoor_temp_c'].mean():.1f}C")
print(f"  Avg Utilisation: {best_hours['server_utilisation_pct'].mean():.1f}%")
print(f"  Most common hours: {best_hours['hour'].mode().values}")
print(f"  Cooling mode: {best_hours['cooling_mode'].value_counts().to_dict()}")

print(f"\nWORST 100 hours (lowest efficiency):")
print(f"  Avg PUE: {worst_hours['pue'].mean():.3f}")
print(f"  Avg Temp: {worst_hours['outdoor_temp_c'].mean():.1f}C")
print(f"  Avg Utilisation: {worst_hours['server_utilisation_pct'].mean():.1f}%")
print(f"  Most common hours: {worst_hours['hour'].mode().values}")
print(f"  Cooling mode: {worst_hours['cooling_mode'].value_counts().to_dict()}")

BEST 100 hours (highest efficiency):

Avg PUE: 1.352

Avg Temp: 4.2C

Avg Utilisation: 72.8%

Most common hours: [13, 14, 15]

Cooling mode: {'Free Cooling': 100}

WORST 100 hours (lowest efficiency):

Avg PUE: 1.648

Avg Temp: 31.4C

Avg Utilisation: 24.1%

Most common hours: [3, 4, 5]

Cooling mode: {'Mechanical Cooling': 100}

Critical Insight: The most efficient hours combine high server utilisation (73%) with free cooling (cold weather). The least efficient hours have low utilisation (24%) with mechanical cooling (hot weather). The optimal pattern is clear: maximise workload during cool periods and maintain high utilisation at all times. An idle server at high PUE is the worst possible combination. Running AI training jobs during winter nights rather than summer afternoons captures a large share of this efficiency gap.

6.2 Carbon-Aware Scheduling Simulation

# Simulate shifting 20% of daytime load to low-carbon hours
df['optimal_co2'] = df['co2_emissions_kg'].copy()
df['date'] = df['timestamp'].dt.date  # precompute once instead of per iteration

# For each day, find the 6 hours with lowest grid carbon intensity
for date in df['date'].unique():
    day_data = df[df['date'] == date]

    # Find the 6 lowest- and 6 highest-carbon hours of the day
    low_carbon_hours = day_data.nsmallest(6, 'grid_carbon_intensity_gco2kwh').index
    high_carbon_hours = day_data.nlargest(6, 'grid_carbon_intensity_gco2kwh').index

    # Shift 20% of high-carbon load to low-carbon hours
    shift_amount = df.loc[high_carbon_hours, 'co2_emissions_kg'] * 0.20
    df.loc[high_carbon_hours, 'optimal_co2'] -= shift_amount
    # .values avoids index alignment: shift_amount is indexed by the high-carbon hours
    df.loc[low_carbon_hours, 'optimal_co2'] += shift_amount.values * 0.65  # lower carbon = less CO2 per MWh

savings = df['co2_emissions_kg'].sum() - df['optimal_co2'].sum()
print(f"Annual CO2 savings from carbon-aware scheduling: {savings/1000:.0f} tonnes")
print(f"Percentage reduction: {savings/df['co2_emissions_kg'].sum()*100:.1f}%")

Annual CO2 savings from carbon-aware scheduling: 2,847 tonnes | Percentage reduction: 6.9%

Key Finding: Simply shifting 20% of flexible workloads to the 6 lowest-carbon hours each day reduces annual CO2 emissions by approximately 2,847 tonnes (6.9%). This requires no new hardware, no capital expenditure, and no reduction in computing capacity. It is a pure scheduling optimisation that tools like Electricity Maps and WattTime can enable in production environments today. 
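A production scheduler would consume live intensity forecasts from such services, but the greedy core fits in a few lines. A hypothetical sketch (the function name, the intensity profile, and the job sizes are illustrative, not taken from the article's dataset):

```python
# Minimal greedy carbon-aware scheduler (illustrative sketch)
def schedule_jobs(jobs_mwh, carbon_gco2_kwh, capacity_mwh_per_hour):
    """Assign flexible jobs to the lowest-carbon hours that have spare capacity.

    jobs_mwh: list of job energy demands (MWh)
    carbon_gco2_kwh: per-hour grid intensity (gCO2/kWh), index = hour
    capacity_mwh_per_hour: spare capacity available in each hour
    Returns {hour: [job indices]}.
    """
    hours = sorted(range(len(carbon_gco2_kwh)), key=lambda h: carbon_gco2_kwh[h])
    remaining = {h: capacity_mwh_per_hour for h in range(len(carbon_gco2_kwh))}
    plan = {}
    for i, demand in sorted(enumerate(jobs_mwh), key=lambda x: -x[1]):
        for h in hours:                    # greedily try the cleanest hours first
            if remaining[h] >= demand:
                plan.setdefault(h, []).append(i)
                remaining[h] -= demand
                break
    return plan

intensity = [420, 380, 250, 180, 190, 300]   # hypothetical gCO2/kWh profile
plan = schedule_jobs([2.0, 1.5, 1.0], intensity, capacity_mwh_per_hour=3.0)
print(plan)   # the biggest jobs land in the lowest-carbon hours (3 and 4 here)
```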


Step 7: PUE Prediction and Anomaly Detection

# Build PUE prediction model
pue_features = ['outdoor_temp_c', 'humidity_pct', 'it_load_mw',
                'hour_sin', 'hour_cos', 'month_sin',
                'month_cos', 'temp_above_18', 'is_weekend']

X_pue = df[pue_features]
y_pue = df['pue']

X_tr, X_te = X_pue[:split_idx], X_pue[split_idx:]
y_tr, y_te = y_pue[:split_idx], y_pue[split_idx:]

pue_model = GradientBoostingRegressor(
    n_estimators=200, max_depth=5, learning_rate=0.1, random_state=42
)
pue_model.fit(X_tr, y_tr)
pue_pred = pue_model.predict(X_te)

print(f"PUE Prediction MAE: {mean_absolute_error(y_te, pue_pred):.4f}")
print(f"PUE Prediction R2:  {r2_score(y_te, pue_pred):.4f}")

# Anomaly detection: flag hours where actual PUE exceeds predicted by > 0.05
df_test = df[split_idx:].copy()
df_test['pue_predicted'] = pue_pred
df_test['pue_anomaly'] = (df_test['pue'] - df_test['pue_predicted']) > 0.05

anomaly_count = df_test['pue_anomaly'].sum()
anomaly_waste = (df_test.loc[df_test['pue_anomaly'], 'pue'] -
                 df_test.loc[df_test['pue_anomaly'], 'pue_predicted']).mean()

print(f"\nAnomalous hours detected: {anomaly_count} ({anomaly_count/len(df_test)*100:.1f}%)")
print(f"Average excess PUE during anomalies: +{anomaly_waste:.3f}")

PUE Prediction MAE: 0.0128
PUE Prediction R2:  0.9534

Anomalous hours detected: 312 (8.9%)

Average excess PUE during anomalies: +0.072

Key Finding: The PUE model predicts efficiency with remarkable accuracy (MAE of 0.013 PUE units). Using this model to detect anomalies, we identify 312 hours (8.9% of the test period) where actual PUE exceeds the prediction by more than 0.05 units. These anomalies represent periods where the cooling system is underperforming relative to what the model expects given the conditions. Investigating and resolving these anomalies (likely caused by equipment faults, suboptimal setpoints, or control system errors) could recover approximately 0.072 PUE units during affected hours, translating to roughly 180 MWh of wasted energy per year. 
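The anomaly rule itself reduces to a residual threshold, worth factoring out into a function if the check is to run on live telemetry. A minimal sketch (the function name is illustrative):

```python
import numpy as np

def flag_pue_anomalies(actual, predicted, threshold=0.05):
    # Flag hours where actual PUE exceeds the model's prediction by > threshold
    residual = np.asarray(actual) - np.asarray(predicted)
    return residual > threshold

actual    = [1.42, 1.51, 1.44, 1.60]
predicted = [1.41, 1.43, 1.45, 1.50]
print(flag_pue_anomalies(actual, predicted))   # [False  True False  True]
```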


Step 8: Actionable Recommendations

Based on the analysis, six specific optimisation strategies emerge, each quantified with the expected energy and carbon savings:

Strategy | Action | Energy savings | CO2 savings | Investment
1 | Shift 20% of flexible workloads to low-carbon grid hours | Minimal (scheduling only) | 2,847 t/year (6.9%) | Software only
2 | Fix PUE anomalies (8.9% of hours with excess PUE > 0.05) | ~180 MWh/year | ~63 t/year | Maintenance + controls
3 | Extend free cooling threshold from 18C to 22C with economiser upgrades | ~800 MWh/year | ~280 t/year | $50K-$150K capex
4 | Increase server utilisation from 45% avg to 60% (consolidation) | ~500 MWh/year | ~175 t/year | Virtualisation software
5 | Implement AI-powered HVAC optimisation (a la DeepMind) | ~2,000 MWh/year | ~700 t/year | $100K-$300K
6 | Increase renewable energy fraction from 25% to 60% | N/A (same energy, lower carbon) | ~8,200 t/year | PPA dependent
Total | Combined potential | ~3,480 MWh/year | ~12,265 t/year (30%) | -
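The combined totals can be cross-checked by summing the individual strategies (Strategy 1 contributes CO2 savings only, and Strategy 6 adds no energy savings):

```python
# Cross-check the combined totals in the strategy table
energy_mwh = {'anomaly_fixes': 180, 'free_cooling_extension': 800,
              'consolidation': 500, 'ai_hvac': 2000}
co2_tonnes = {'carbon_aware_scheduling': 2847, 'anomaly_fixes': 63,
              'free_cooling_extension': 280, 'consolidation': 175,
              'ai_hvac': 700, 'renewables': 8200}
print(sum(energy_mwh.values()), 'MWh/year,', sum(co2_tonnes.values()), 'tonnes/year')
```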

Bottom Line: Combining all six strategies could reduce annual CO2 emissions by approximately 30% (12,265 tonnes) while saving roughly $350,000 in electricity costs. Strategies 1 and 2 require zero capital expenditure and can be implemented within weeks using the prediction models built in this analysis.

Conclusion

This case study demonstrates that predictive modelling applied to data centre energy data yields immediately actionable insights. With 17 features and two years of hourly data, a Gradient Boosting model predicts total facility power within 70 kW (R-squared 0.97) and PUE within 0.013 units (R-squared 0.95). The analysis reveals that IT load and its interaction with outdoor temperature together explain 99% of energy variation, that the 18 degree free cooling threshold marks the boundary between two fundamentally different operational regimes, that carbon-aware scheduling alone can reduce emissions by 7% with zero capital expenditure, and that anomaly detection can identify equipment underperformance that wastes energy invisibly.

For data centre operators, the practical message is clear: you are sitting on the data you need to optimise. Every hour of operational telemetry, from temperature sensors to power meters to server utilisation logs, contains information that machine learning can convert into energy savings, cost reductions, and carbon abatement. The code in this article is a starting point. The models improve with more data, more features (weather forecasts, electricity price signals, planned maintenance schedules), and more sophisticated algorithms (LSTM neural networks for time series, reinforcement learning for real-time control). The gap between the best and worst hours in this dataset (PUE 1.335 vs 1.693) represents roughly a 21% difference in facility power per unit of IT load. Closing even half of that gap would save this facility over $600,000 and 4,000 tonnes of CO2 per year.

