Utilities at scale: the real cost of moving from pilot to 100k devices
Pilots make scale look cheap: a handful of devices, a friendly network, a short change window, a team that can WhatsApp each other if something breaks. Then you scale to 10k, 50k, 100k devices across regions and partners. The monthly bill tells one story; your actual cost tells another: truck rolls, overtime, audits, incident reviews, and the drag of “temporary” exceptions that never went away.

Where pilots hide cost (and how it explodes later)
1) Truck rolls and on-site swaps
One site visit can wipe out months of nominal savings. At fleet scale, even a small site-visit rate adds up: 1.2% a year across 100,000 devices is 1,200 visits. Track site-visit rate and first-time-fix rate; those two numbers dominate field effort at scale.
2) Roaming and single-IMSI fragility
Dropouts look like “data noise” in pilots. At scale they become missed reads, retries, and SLA credits. Watch attach success, packet loss, and missed read % by region.
3) VPN and jump-host overhead
Flat site-to-site VPNs seem free. The real weight is standing access you must review, exceptions you never closed, and investigations that take hours because logs show “user connected”, not which device or flow.
4) Firmware and change windows
Pilots tolerate wide-open windows. At 100k devices you need scope-limited updates, rollback, and evidence. Otherwise you pay in rollbacks, after-hours work, and rework.
5) Complex billing and data waste
Per-SIM bundles look tidy in a lab. At fleet size you need pooling, anomaly detection, and automatic quarantine for runaway devices to stabilise consumption.
6) Compliance debt
NIS2/IEC 62443 evidence isn’t free. If your connectivity layer can’t produce session-level audit and device-type policies, teams carry that burden manually.
A simple TCO frame you can actually use
Think of cost in four buckets. Fill each with counts, rates and hours; then stress-test with a “scale to 100k” multiplier.
- Connectivity & platform (run) – SIM/data, eUICC management, private routing, policy control.
- Operations (change) – provisioning, updates, planned windows.
- Field (unplanned) – truck rolls, swap-outs, urgent site access.
- Risk & compliance (assurance) – investigation hours, audit prep, exception clean-up.
Rule of thumb: if field + assurance (visits and investigation hours) outweigh run + change, your design is costing more than your data plan ever will.
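The four buckets and the rule of thumb can be sketched in a few lines of Python. This is an illustration rather than a costing model: the names `FleetInputs`, `tco_buckets` and `field_outweighs_plan` are hypothetical, every figure is a placeholder, and the sketch assumes all buckets have been converted to one currency per year and that each grows linearly with device count.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class FleetInputs:
    """Counts, rates and hours for one fleet size (all values hypothetical)."""
    devices: int
    run_cost_per_device: float          # SIM/data, eUICC, routing, policy
    change_hours_per_1k_devices: float  # provisioning, updates, windows
    visits_per_device: float            # annual site-visit rate, e.g. 0.012
    cost_per_visit: float
    incidents_per_1k_devices: float
    hours_per_incident: float           # investigation and audit effort
    hourly_rate: float


def tco_buckets(f: FleetInputs) -> dict[str, float]:
    """Return the annual cost of each of the four buckets."""
    thousands = f.devices / 1000
    return {
        "run": f.devices * f.run_cost_per_device,
        "change": thousands * f.change_hours_per_1k_devices * f.hourly_rate,
        "field": f.devices * f.visits_per_device * f.cost_per_visit,
        "assurance": (thousands * f.incidents_per_1k_devices
                      * f.hours_per_incident * f.hourly_rate),
    }


def field_outweighs_plan(b: dict[str, float]) -> bool:
    """Rule of thumb: field + assurance outweigh run + change."""
    return b["field"] + b["assurance"] > b["run"] + b["change"]


# Placeholder pilot figures, then the "scale to 100k" stress test.
pilot = FleetInputs(devices=500, run_cost_per_device=12.0,
                    change_hours_per_1k_devices=40.0, visits_per_device=0.012,
                    cost_per_visit=400.0, incidents_per_1k_devices=1.8,
                    hours_per_incident=2.5, hourly_rate=60.0)
at_scale = replace(pilot, devices=100_000)
buckets_at_scale = tco_buckets(at_scale)
```

Because every bucket here scales linearly, the ratio between buckets does not change with fleet size; the value of the exercise is seeing the absolute numbers, then adjusting the rates you expect to worsen at scale (visit rate, hours per incident) and re-running the check.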
Design for scale: choices that bend the TCO curve
Multi-IMSI + eUICC by default
- Why it helps: fewer dropouts, fewer visits. Switch profiles and steer by policy when a network degrades; meet local profile rules without site swaps.
- Measure: attach success %, packet loss %, round-trip time, profile swap success rate, visits avoided.
Private by design
- Why it helps: smaller blast radius, fewer incidents, faster approvals. Keep traffic on private routes (private APN or direct to SCADA/MDM/cloud).
- Measure: public endpoints (target zero), external scan hits (target zero).
Per-session maintenance (no flat VPNs)
- Why it helps: less firefighting, faster reviews. One engineer, one device, one job, auto-expiry and usable audit.
- Measure: median time to approve access, % sessions that auto-expire, investigation time per incident.
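As a rough sketch of what "one engineer, one device, one job, auto-expiry" could look like as a data model: the `MaintenanceSession` class, its field names and the record layout below are assumptions made for illustration, not any particular platform's API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class MaintenanceSession:
    """A time-boxed grant: one engineer, one device, one job (hypothetical)."""
    engineer: str
    device_id: str
    job: str
    granted_at: datetime
    ttl: timedelta

    @property
    def expires_at(self) -> datetime:
        return self.granted_at + self.ttl

    def is_active(self, now: datetime) -> bool:
        # Access lapses on its own once the TTL has passed; nobody has to
        # remember to revoke it.
        return self.granted_at <= now < self.expires_at

    def audit_record(self) -> dict[str, str]:
        # Who / what / when, at session level rather than "user connected".
        return {
            "engineer": self.engineer,
            "device_id": self.device_id,
            "job": self.job,
            "granted_at": self.granted_at.isoformat(),
            "expires_at": self.expires_at.isoformat(),
        }


session = MaintenanceSession(
    engineer="contractor-017",       # placeholder identity
    device_id="meter-000123",        # placeholder device
    job="replace-comms-module",      # placeholder task name
    granted_at=datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc),
    ttl=timedelta(hours=2),
)
```

The point of the shape is that the audit record names the device and the job, so an investigation starts from a specific session instead of a shared tunnel.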
Protocol/FQDN allow-lists per device type
- Why it helps: kills “temporary any/any” rules and sideways risk.
  - IEC-104/DNP3 → named SCADA masters
  - DLMS/COSEM → MDM FQDNs (no wildcards)
  - MQTT → broker FQDN (TLS/mTLS)
- Measure: exceptions opened/closed, policy-breach alerts, blocked off-policy attempts.
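A per-device-type allow-list with default deny can be sketched as a simple lookup. The device-type labels, protocol labels and `example.com` hostnames below are placeholders chosen for illustration; a real policy engine would enforce this in the network path rather than in application code.

```python
# Hypothetical policy table: device type -> allowed (protocol, FQDN) pairs.
ALLOW_LIST: dict[str, set[tuple[str, str]]] = {
    "substation-rtu": {
        ("iec-104", "scada-master-1.example.com"),
        ("dnp3", "scada-master-2.example.com"),
    },
    "smart-meter": {
        ("dlms-cosem", "mdm.example.com"),
    },
    "der-gateway": {
        ("mqtt-tls", "broker.example.com"),
    },
}


def is_allowed(device_type: str, protocol: str, fqdn: str) -> bool:
    """Default deny: only exact (protocol, FQDN) matches pass."""
    if "*" in fqdn:
        return False  # wildcards are never accepted
    key = (protocol.lower(), fqdn.lower().rstrip("."))
    return key in ALLOW_LIST.get(device_type, set())
```

Unknown device types and off-policy destinations both fall through to a deny, and each denial is the kind of event that could feed the "blocked off-policy attempts" measure.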
Pooled data with guardrails
- Why it helps: less waste, fewer overages, faster anomaly detection.
- Measure: pool variance %, % devices <10% or >200% of expected, time-to-quarantine.
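The outlier bands in the measure above translate directly into a daily check. In this sketch the function names, the "investigate" and "quarantine" labels, and the definition of pool variance (pool total against expected total, as a percentage) are assumptions for illustration; only the 10% and 200% thresholds come from the text.

```python
def classify_usage(actual_mb: float, expected_mb: float,
                   low: float = 0.10, high: float = 2.00) -> str:
    """Flag devices using <10% or >200% of their expected volume."""
    if expected_mb <= 0:
        raise ValueError("expected_mb must be positive")
    ratio = actual_mb / expected_mb
    if ratio > high:
        return "quarantine"   # runaway device: cap or isolate it
    if ratio < low:
        return "investigate"  # near-silent device: possibly offline
    return "ok"


def pool_variance_pct(actual_mb: list[float], expected_per_device_mb: float) -> float:
    """Pool total vs expected total, in percent (an assumed definition)."""
    expected_total = expected_per_device_mb * len(actual_mb)
    return (sum(actual_mb) - expected_total) / expected_total * 100


def outlier_share_pct(actual_mb: list[float], expected_per_device_mb: float) -> float:
    """Percentage of devices outside the low/high bands."""
    flagged = sum(
        1 for mb in actual_mb
        if classify_usage(mb, expected_per_device_mb) != "ok"
    )
    return flagged / len(actual_mb) * 100


# Placeholder daily readings for four devices expected to use 100 MB each.
daily_mb = [5.0, 250.0, 100.0, 95.0]
```

Running the classification on a schedule and timestamping each "quarantine" decision would also give you the time-to-quarantine figure.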
Change windows that revert themselves
- Why it helps: fewer sticky exceptions, less out-of-hours work. Signed firmware, cohorts, and auto-revert when the window closes.
- Measure: completion rate per wave, rollback count, exceptions left open (target zero).
What scales cleanly from day one (practical build list)
- Identity at provisioning: bind SIM/eUICC to the device in the warehouse; attach a baseline policy before first boot.
- Two policy sets: one for AMI, one for substation/DER (different thresholds and failover priorities).
- Health-based steering: operator priorities per country plus thresholds for loss/latency/attach; evaluate on breach and on a schedule.
- Named contractor access: no shared jump hosts; pre-approve standard tasks with time-boxed sessions.
- Logs you can use: stream who/what/when/policy to your SIEM; keep retention aligned to regulation.
- Pooled plans + alerts: daily variance checks; quarantine policy for outliers.
- Rollback muscle memory: treat profile swaps and firmware as reversible by default; rehearse on small cohorts.
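The health-based steering item, combined with the two policy sets, can be sketched as a priority walk with thresholds. Everything named here is hypothetical (the `Thresholds` and `OperatorHealth` classes, the operator labels and every threshold value); the sketch only shows the decision logic, not how metrics are collected or how a profile swap is triggered.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Thresholds:
    max_loss_pct: float
    max_rtt_ms: float
    min_attach_pct: float


@dataclass(frozen=True)
class OperatorHealth:
    loss_pct: float
    rtt_ms: float
    attach_pct: float


# Two policy sets with different tolerances (placeholder values).
AMI = Thresholds(max_loss_pct=2.0, max_rtt_ms=800.0, min_attach_pct=98.0)
SUBSTATION = Thresholds(max_loss_pct=0.5, max_rtt_ms=150.0, min_attach_pct=99.5)


def is_healthy(h: OperatorHealth, t: Thresholds) -> bool:
    return (h.loss_pct <= t.max_loss_pct
            and h.rtt_ms <= t.max_rtt_ms
            and h.attach_pct >= t.min_attach_pct)


def pick_operator(priority: list[str],
                  health: dict[str, OperatorHealth],
                  t: Thresholds) -> Optional[str]:
    """Return the highest-priority operator that meets every threshold."""
    for name in priority:
        h = health.get(name)
        if h is not None and is_healthy(h, t):
            return name
    return None  # nothing healthy: keep the current profile and raise an alert


# Placeholder per-country priority order and latest measurements.
priority = ["op-a", "op-b"]
health = {
    "op-a": OperatorHealth(loss_pct=4.0, rtt_ms=120.0, attach_pct=99.9),
    "op-b": OperatorHealth(loss_pct=0.3, rtt_ms=180.0, attach_pct=99.0),
}
```

With these sample numbers the AMI policy would fail over to the second operator, while the stricter substation policy finds no acceptable candidate. The same function can be called on a threshold breach and on a timer, which matches the "evaluate on breach and on a schedule" item.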
A worked example (purely in ops terms)
- Fleet: 100,000 devices across 8 countries.
- Baseline: single-IMSI, broad APN tunnels, jump-host access.
- Pain today: 1.2% annual visit rate; 2.5 hours average investigation per incident; dropouts in two border regions.
Introduce: multi-IMSI + eUICC, policy-based steering, per-session access, allow-lists, and pooled data.
After three months (typical outcomes):
- Visit rate halves: 1,200 → 600 visits per year (−600).
- Investigation time drops: 2.5h → 1.0h on ~45 incidents per quarter (−67.5 hours per quarter).
- Data waste reduces by ~11 percentage points (e.g., 18% → 7%).
- Two roaming hotspots stabilise; missed reads down 40–60%.
Even if platform overhead rises slightly to add eUICC/multi-IMSI and policy, field effort and investigation hours usually drop more, bending the total cost curve in your favour.
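The arithmetic behind the worked example is easy to re-run with your own figures. The function and key names below are illustrative; the inputs are the ones quoted in the example.

```python
def worked_example(devices: int = 100_000,
                   visit_rate: float = 0.012,
                   hours_before: float = 2.5,
                   hours_after: float = 1.0,
                   incidents_per_quarter: int = 45,
                   waste_before_pct: int = 18,
                   waste_after_pct: int = 7) -> dict[str, float]:
    """Recompute the before/after deltas from the worked example."""
    visits_before = round(devices * visit_rate)  # annual visits today
    visits_after = visits_before // 2            # "visit rate halves"
    return {
        "visits_before": visits_before,
        "visits_after": visits_after,
        "visits_avoided": visits_before - visits_after,
        "hours_saved_per_quarter":
            (hours_before - hours_after) * incidents_per_quarter,
        "waste_reduction_pp": waste_before_pct - waste_after_pct,
    }


result = worked_example()
```

Swapping in your own visit rate and incident count gives a quick first estimate of how much field and investigation effort is in play before you price anything.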
Checklist: bake these into tenders and SOWs
- Resilience: multi-IMSI on every SIM; eUICC with fleet-scale swap and rollback; operator priorities per country.
- Security: private routing; per-device protocol/FQDN allow-lists; per-session maintenance with auto-expiry.
- Operations: cohort-based firmware with auto-revert; pooled data with anomaly quarantine; named contractor access.
- Observability: per-device roaming metrics; session-level logs to SIEM; before/after reports on policy/profile changes.
- KPIs: attach success %, missed reads %, investigation hours per incident, visit rate, exceptions left open, data waste %.
At 100k devices, the expensive line items aren’t megabytes. They’re visits, exceptions and investigation hours. Build for identity, policy and proof from the start: multi-IMSI + eUICC for resilience, private routes, per-session access, and allow-lists per device type, and your total cost bends the right way as you scale.