03
From Reactive to Predictive

Automated Intelligence Gateway

Betfred's infrastructure spans AWS cloud, on-premise retail shop systems, and hybrid workloads. Today, these are managed with separate tools, manual processes, and reactive monitoring. This initiative creates a unified intelligence gateway that proactively manages, monitors, and optimises the entire estate — predicting issues before they impact customers.

Current State Assessment

Where Betfred Stands Today: Manual, Reactive, Fragmented

Betfred's infrastructure operations rely on manual provisioning, reactive monitoring, and separate toolchains for cloud and on-premise environments. During peak events like the Grand National, the team scales manually based on experience rather than predictive analytics.

| Capability | Current Approach | Gap | Status |
| --- | --- | --- | --- |
| Infrastructure Provisioning | Manual Terraform/CloudFormation with human-initiated deploys | Slow provisioning (hours), human error risk, no self-service | Manual Process |
| Scaling | Reactive auto-scaling with fixed thresholds (CPU > 80%) | Reacts after load arrives, too late for Grand National spikes | Reactive Only |
| Monitoring | CloudWatch + Datadog with separate dashboards per environment | Fragmented visibility, no unified cross-environment view | Fragmented |
| Incident Response | PagerDuty alerts → manual investigation → manual remediation | MTTR 45+ minutes, human bottleneck at 3am | Manual Process |
| Deployment | Jenkins CI/CD with manual approval gates | Slow rollouts, no canary/blue-green automation | Manual Process |
| Cost Management | Monthly AWS billing review, manual rightsizing | Over-provisioned 60% of the time, wasting £280K/year on idle compute | Reactive Only |
| Hybrid Management | Separate tooling for cloud (AWS) and on-premise (retail shops) | No unified control plane, inconsistent policies | Fragmented |
| Compliance | Manual audit trails, periodic security scans | Compliance drift between audits, no continuous enforcement | Reactive Only |

Cost of Staying Reactive
Average MTTR: 45 min
Annual downtime cost: £340K
Over-provisioned time: 60%
Separate tool chains: 3+
Before & After

Infrastructure Transformation: Current Architecture


Current: Manual Ops with Separate Tools
AWS Cloud: CloudWatch + manual scaling
On-Premise: Separate monitoring stack
Jenkins CI/CD: Manual approval gates
PagerDuty: Alert → human → manual fix
Datadog: Dashboards, no automation
CURRENT PAIN POINTS
Manual scaling — team pre-provisions based on gut feel for Grand National
45-minute average MTTR — human investigation bottleneck
Separate dashboards for cloud vs. on-premise — no unified view
Jenkins pipelines with manual approval gates — slow deployments
Over-provisioned 60% of the time — wasting £280K/year on idle compute
No predictive capability — always reacting, never anticipating
Financial Case

Total Cost of Ownership: Current vs. Intelligence Gateway

Annual Operational Cost Comparison

| Cost Category | Manual Ops (Current) | Intelligence Gateway (Future) | Annual Saving |
| --- | --- | --- | --- |
| Over-Provisioned Compute: 60% idle time → right-sized with predictive scaling | £280,000 | £70,000 | £210,000 |
| Downtime & Incident Cost: 45-min MTTR → 5-min MTTR with self-healing | £340,000 | £68,000 | £272,000 |
| Ops Team (Manual Tasks): Manual provisioning/monitoring → 80% automated | £520,000 | £260,000 | £260,000 |
| Multiple Monitoring Tools: CloudWatch + Datadog + custom → unified platform | £180,000 | £95,000 | £85,000 |
| Intelligence Gateway Platform: Kubernetes + ArgoCD + Crossplane + ML pipeline | £0 | £320,000 | -£320,000 |
| Peak Event Preparation: Manual war rooms → automated predictive scaling | £120,000 | £30,000 | £90,000 |
| Compliance & Audit: Manual audit trails → continuous policy enforcement | £95,000 | £25,000 | £70,000 |
| Deployment Failures: Manual rollbacks → automated canary with rollback | £85,000 | £15,000 | £70,000 |
| Total Annual Cost | £1,620,000 | £883,000 | £737,000 (45% reduction) |

Net Annual Saving

Net of the £320K annual platform cost, Betfred is projected to save £737K a year while gaining predictive capabilities, self-healing automation, and unified hybrid management.
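The net figure can be checked by summing the cost categories. The sketch below simply re-adds the current and future figures from the comparison above; the dictionary layout and the function name `summarise` are illustrative choices, not part of any Betfred tooling.

```python
# Annual figures in GBP, copied from the cost comparison above:
# category -> (manual ops today, intelligence gateway future)
COSTS = {
    "Over-Provisioned Compute": (280_000, 70_000),
    "Downtime & Incident Cost": (340_000, 68_000),
    "Ops Team (Manual Tasks)": (520_000, 260_000),
    "Multiple Monitoring Tools": (180_000, 95_000),
    "Intelligence Gateway Platform": (0, 320_000),
    "Peak Event Preparation": (120_000, 30_000),
    "Compliance & Audit": (95_000, 25_000),
    "Deployment Failures": (85_000, 15_000),
}


def summarise(costs):
    """Return (current total, future total, net saving, saving as a fraction)."""
    current = sum(now for now, _ in costs.values())
    future = sum(later for _, later in costs.values())
    saving = current - future
    return current, future, saving, saving / current


current, future, saving, fraction = summarise(COSTS)
print(f"Current £{current:,} | Future £{future:,} | Saving £{saving:,} ({fraction:.0%})")
```

Because the platform line is included as a negative saving, the £737K total is already net of the new platform spend.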

Annual Saving: £737K
Target Uptime: 99.95%
Target MTTR: 5 min
Solution Architecture

Four Pillars of the Intelligence Gateway

The gateway is built on four interconnected pillars, each addressing a critical gap in Betfred's current operations.

01

Unified Control Plane

Single pane of glass for all infrastructure
Today

Separate CloudWatch, Datadog, and on-premise monitoring tools with no cross-environment correlation.

Future

Crossplane + Kubernetes providing a single API for managing AWS, on-premise, and edge resources. One dashboard, one policy engine, one source of truth.

Crossplane, Kubernetes, ArgoCD, Grafana
02

Predictive Scaling Engine

Scale before demand arrives
Today

Reactive auto-scaling with fixed CPU thresholds. Manual pre-provisioning for the Grand National based on the previous year's estimates.

Future

ML models trained on 3 years of Betfred event data (Grand National, Cheltenham, Premier League) predicting load 2 hours ahead. KEDA scales pods based on predictions, not reactions.

KEDA, Prophet, LSTM, SageMaker
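To illustrate how a load forecast might be turned into a scaling decision, here is a minimal sketch that sizes a service for the peak of a predicted window. It does not call KEDA or any forecasting library; the per-pod capacity, headroom, replica bounds, and traffic numbers are all hypothetical values chosen for the example.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingPolicy:
    rps_per_pod: float        # assumed sustainable requests/sec per pod
    headroom: float = 0.2     # assumed safety margin above the forecast
    min_replicas: int = 2     # hypothetical floor
    max_replicas: int = 200   # hypothetical ceiling


def replicas_for_forecast(forecast_rps, policy):
    """Size for the peak of the forecast window, clamped to the policy bounds."""
    peak = max(forecast_rps)
    needed = math.ceil(peak * (1 + policy.headroom) / policy.rps_per_pod)
    return max(policy.min_replicas, min(policy.max_replicas, needed))


policy = ScalingPolicy(rps_per_pod=50)
normal_day = [400, 420, 410, 430]       # hypothetical forecast, requests/sec
race_day = [450, 1200, 4300, 3900]      # hypothetical spike of roughly 10x

print(replicas_for_forecast(normal_day, policy))
print(replicas_for_forecast(race_day, policy))
```

Sizing for the window's peak rather than the current reading is what separates this from a fixed CPU threshold: capacity is requested before the load arrives.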
03

Self-Healing Automation

Fix issues before humans notice
Today

PagerDuty alert → on-call engineer wakes up → investigates → manually remediates. Average 45-minute MTTR.

Future

Automated runbooks triggered by anomaly detection. 80% of incidents resolved without human intervention. Escalation only for novel or critical scenarios.

Argo Events, OPA, Prometheus, Custom Operators
04

Intelligent Deployment

Zero-downtime progressive rollouts
Today

Jenkins pipelines with manual approval gates. Full rollouts with manual rollback if issues are detected. Average 2-hour deployment cycle.

Future

GitOps-driven canary deployments with automated observability gates. Traffic progressively shifts from 5% → 100% with automatic rollback if error rate exceeds threshold.

Argo Rollouts, Flagger, GitHub Actions, OPA
Self-Healing Detail

Automated Remediation Runbooks

Each runbook defines a trigger condition, automated action, and escalation path. These replace the current manual investigation and remediation workflow.

| Sev. | Trigger Condition | Automated Action | Escalation Path |
| --- | --- | --- | --- |
| Medium | API latency > 500ms for 2 min | Scale API pods horizontally (2x) | If latency persists after 5 min → alert on-call |
| High | Database connection pool > 85% | Spin up read replica, redirect read traffic | If pool > 95% → failover to standby |
| High | Memory usage > 90% on node | Evict low-priority pods, cordon node | If OOM kill detected → replace node |
| Critical | Certificate expiry < 7 days | Auto-renew via cert-manager | If renewal fails → alert security team |
| Medium | Disk usage > 80% | Archive old logs to S3, compress temp files | If > 90% → expand volume automatically |
| High | Health check failure (3 consecutive) | Restart pod, route traffic to healthy replicas | If restart fails 3x → replace node |
| Critical | Grand National T-24h | Pre-scale all services to 5x baseline | Activate war room, continuous monitoring |
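One way to express runbooks like these is as data: a trigger predicate, an action, and an escalation path. The sketch below models five of the threshold-based rows from the table. The metric names (`api_latency_ms` and so on) are hypothetical, and it assumes each metric value has already been aggregated over the required window (for example, latency sustained for 2 minutes) before evaluation.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Runbook:
    severity: str
    metric: str                        # hypothetical metric name
    breached: Callable[[float], bool]  # trigger condition from the table
    action: str
    escalation: str


RUNBOOKS = [
    Runbook("Medium", "api_latency_ms", lambda v: v > 500,
            "Scale API pods horizontally (2x)",
            "Alert on-call if latency persists after 5 min"),
    Runbook("High", "db_pool_pct", lambda v: v > 85,
            "Spin up read replica, redirect read traffic",
            "Failover to standby if pool > 95%"),
    Runbook("High", "node_memory_pct", lambda v: v > 90,
            "Evict low-priority pods, cordon node",
            "Replace node if OOM kill detected"),
    Runbook("Critical", "cert_days_to_expiry", lambda v: v < 7,
            "Auto-renew via cert-manager",
            "Alert security team if renewal fails"),
    Runbook("Medium", "disk_usage_pct", lambda v: v > 80,
            "Archive old logs to S3, compress temp files",
            "Expand volume automatically if > 90%"),
]


def triggered(metrics):
    """Return the runbooks whose trigger condition holds for this snapshot."""
    return [rb for rb in RUNBOOKS
            if rb.metric in metrics and rb.breached(metrics[rb.metric])]


snapshot = {"api_latency_ms": 620, "disk_usage_pct": 75, "cert_days_to_expiry": 3}
for rb in triggered(snapshot):
    print(f"[{rb.severity}] {rb.metric}: {rb.action} (escalation: {rb.escalation})")
```

Keeping the rules declarative means a new runbook is a new table row rather than new control flow, which also makes the set easy to review and audit.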
Deployment Architecture

Progressive Delivery Pipeline

From code commit to production in under 35 minutes with automated quality gates, canary analysis, and instant rollback.

1. Code Commit (GitHub Actions): Lint, unit tests, security scan (Snyk). ~3 min
2. Build & Package (Docker + ECR): Multi-stage build, vulnerability scan. ~5 min
3. Integration Test (K8s Ephemeral Env): Spin up isolated namespace, run E2E tests. ~8 min
4. Canary Deploy (Argo Rollouts): 5% → 25% → 50% → 100% with automated rollback. ~15 min
5. Observability Gate (Prometheus + OPA): Error rate < 0.1%, latency p99 < target. Continuous
6. Production (ArgoCD GitOps): Declarative state, drift detection, auto-sync. ~2 min
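The canary and observability-gate stages can be sketched as a loop over traffic weights that stops at the first failed check. This is plain Python, not the Argo Rollouts API; the `observe` callback, the `Sample` type, and the 250 ms p99 target used in the example are hypothetical, since the pipeline only specifies "p99 < target".

```python
from dataclasses import dataclass

CANARY_STEPS = (5, 25, 50, 100)   # traffic weights from the pipeline above
MAX_ERROR_RATE = 0.001            # the 0.1% error-rate gate from the pipeline above


@dataclass(frozen=True)
class Sample:
    error_rate: float   # fraction of failed requests at this weight
    p99_ms: float       # observed p99 latency in milliseconds


def run_canary(observe, p99_target_ms):
    """Advance through the weights; roll back on the first failed gate.

    `observe(weight)` is a hypothetical callback returning a Sample
    measured while the canary receives `weight` percent of traffic.
    """
    for weight in CANARY_STEPS:
        sample = observe(weight)
        if sample.error_rate >= MAX_ERROR_RATE or sample.p99_ms >= p99_target_ms:
            return 0, f"rolled back at {weight}%"
    return 100, "promoted"


def healthy(weight):
    return Sample(error_rate=0.0002, p99_ms=180)


def degrades_under_load(weight):
    # Hypothetical release that starts failing once it takes half the traffic.
    return Sample(error_rate=0.004 if weight >= 50 else 0.0002, p99_ms=180)


print(run_canary(healthy, p99_target_ms=250))
print(run_canary(degrades_under_load, p99_target_ms=250))
```

A failed gate returns the canary to 0% traffic, so at most the current step's share of users is exposed to a bad release.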
Peak Event Management

Grand National: Predictive Scaling in Action

The Grand National generates 10x normal traffic in a 15-minute window. Today, Betfred manually pre-provisions based on last year's numbers. The Intelligence Gateway uses ML to predict and pre-scale automatically.

Current Approach (Manual)

Preparation: 2 weeks of manual planning
Scaling Trigger: Manual, based on last year's data
Pre-Scale Window: 24 hours before (over-provisioned)
Cost: £120K per event (over-provisioned)
Risk: Under-provision = outage, over-provision = waste
Post-Event: Manual scale-down over 2–3 days

Future Approach (Predictive)

Preparation: Automated, using an ML model trained on 3 years of data
Scaling Trigger: ML prediction 2 hours ahead of demand
Pre-Scale Window: 2 hours (precise, not over-provisioned)
Cost: £30K per event (right-sized)
Risk: Automated with human override available
Post-Event: Automatic scale-down within 30 minutes
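The timing of the predictive approach can be made concrete with a short calculation: scale up 2 hours before the event and scale down within 30 minutes of the 15-minute peak window ending. The function name `scaling_window` and the event date below are placeholders for illustration only.

```python
from datetime import datetime, timedelta


def scaling_window(event_start, peak_duration,
                   lead=timedelta(hours=2),
                   cooldown=timedelta(minutes=30)):
    """Return (scale-up time, latest scale-down time) for one peak event."""
    scale_up_at = event_start - lead
    scale_down_by = event_start + peak_duration + cooldown
    return scale_up_at, scale_down_by


# Placeholder start time, not an actual race schedule.
race_start = datetime(2025, 4, 5, 16, 0)
scale_up_at, scale_down_by = scaling_window(race_start, timedelta(minutes=15))
elevated_for = scale_down_by - scale_up_at

print(f"Scale up at   {scale_up_at:%H:%M}")
print(f"Scale down by {scale_down_by:%H:%M}")
print(f"Elevated capacity held for {elevated_for}")
```

Under these assumptions the estate runs at elevated capacity for under three hours, compared with a 24-hour pre-scale followed by a 2 to 3 day manual scale-down today.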

Future Solution: What Betfred Will See

End-State Vision

Intelligence Gateway Dashboard

Unified control plane showing all infrastructure across AWS, on-premise, and edge. Real-time health status, resource utilisation, cost tracking, and anomaly alerts in a single view.

Projected Business Impact

Annual Saving: £737K
Target Uptime: 99.95%
Target MTTR: < 5 min
Auto-Remediated: 80%