“Uptime is 99.9%.”
Systems are reliable. SLAs are met. Monitoring is comprehensive. The lights are green. Infrastructure is invisible because it works.
Uptime is 99.9%. The experience is not.
The system didn’t go down. It just got slow. Not slow enough to trigger an alert — slow enough for users to start building workarounds. The page loads. The query returns. Just not fast enough for the person who runs it 200 times a day. Your SLA says you’re fine. The person on the other end says something different. The gap between uptime and usability is where trust erodes without ever tripping a monitor.
When the SLA is green and users are frustrated, the metric isn’t measuring what the user experiences.
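The gap is easy to see in numbers. The sketch below uses invented figures to show how a month can clear a 99.9% availability SLA while the tail latency, the part a 200-times-a-day user actually lives in, looks nothing like the green dashboard:

```python
# Hypothetical illustration: a service can meet a 99.9% uptime SLA
# while tail latency quietly degrades the experience.
# All numbers below are invented for the sketch.

def availability(total_seconds: int, downtime_seconds: int) -> float:
    """Fraction of the period the service answered at all."""
    return 1 - downtime_seconds / total_seconds

def percentile(samples: list[float], p: float) -> float:
    """p-th percentile (nearest-rank) of a list of latencies, in seconds."""
    ordered = sorted(samples)
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

# A month with 40 minutes of downtime still clears 99.9%.
month = 30 * 24 * 3600
print(f"availability: {availability(month, 40 * 60):.4%}")  # SLA: green

# 1,000 requests: most are fast, but 5% take over four seconds.
latencies = [0.3] * 950 + [4.5] * 50
print(f"median: {percentile(latencies, 50):.1f}s")  # what the dashboard averages
print(f"p99:    {percentile(latencies, 99):.1f}s")  # what the heavy user feels
```

The point of the sketch: availability and median latency can both look healthy while the high percentiles carry the frustration. If the SLA doesn't include a tail-latency target, the metric can't see the problem.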
“Submit a ticket and we’ll handle it.”
The service desk is staffed. Tickets are triaged. Response times are tracked. There’s a process for everything.
They stopped submitting tickets. They just work around it.
The ticket takes two days. The workaround takes five minutes. So people stopped asking IT and started solving it themselves — unsecured personal tools, unapproved apps, data in spreadsheets that should be in systems. Your ticket volume is down. Your shadow IT is up. The service desk thinks things are getting better. The security team will eventually discover they’re not.
When ticket volume drops without service improvement, the problems migrated — they didn’t disappear.
“Our systems are integrated.”
We have an ERP. We have a CRM. The data flows. Systems talk to each other. The architecture was designed to connect.
The integration broke two years ago. Someone built a script. It’s still running.
The API between finance and HR hasn’t synced correctly since the last upgrade. A contractor wrote a Python script to reconcile the data nightly. The contractor is gone. The script runs on a server nobody monitors. If it stops, payroll will be wrong and nobody will know for two weeks. Your integrated architecture has a human-shaped patch in the middle that nobody budgeted to maintain or replace.
When integration depends on scripts nobody owns, the architecture has already been replaced by artifacts.
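The failure mode above is silence: the script dies and nothing notices for two weeks. The cheapest mitigation is a heartbeat, so that silence itself becomes alertable. This is a minimal sketch of the pattern, not the actual script; the names (`reconcile`, the heartbeat record) are hypothetical, and in practice the staleness check would run from a separate, monitored host:

```python
# Sketch of a heartbeat around an unowned nightly job: record the last
# success, and let a watchdog elsewhere alert when the record goes stale.
# The reconcile() body is a placeholder for the real reconciliation logic.

import logging
from datetime import datetime, timedelta, timezone

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("nightly-reconcile")

def reconcile() -> int:
    """Placeholder for the real finance/HR reconciliation step."""
    return 0  # rows corrected

def run_with_heartbeat(heartbeat: dict) -> None:
    """Run the job and record the success time for an external watchdog."""
    try:
        corrected = reconcile()
        heartbeat["last_success"] = datetime.now(timezone.utc)
        log.info("reconciled, %d rows corrected", corrected)
    except Exception:
        log.exception("reconciliation failed")  # failure is loud...
        raise

def is_stale(heartbeat: dict, max_age: timedelta = timedelta(hours=26)) -> bool:
    """...and so is silence: checked by a watchdog on a *different* host."""
    last = heartbeat.get("last_success")
    return last is None or datetime.now(timezone.utc) - last > max_age
```

A heartbeat doesn't fix the deeper problem, an unowned script is still unowned, but it converts "payroll will be wrong and nobody will know for two weeks" into a page within a day.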
“Our infrastructure is well-documented.”
Network diagrams are current. Change management is logged. The CMDB is maintained. We know what’s running and where.
You know what’s running. You don’t know why it’s configured that way.
There’s a firewall rule that allows traffic from an IP range nobody recognizes. The change log says it was added in 2020 by someone who left in 2021. The comment field says “temp — vendor integration.” The vendor contract ended. The rule is still there. Removing it might break something. Keeping it is a security risk. Your infrastructure is full of decisions that outlived the people and the reasons that created them.
When the infrastructure has rules nobody can explain, the system is governed by inherited risk.
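Inherited rules like that one can at least be surfaced systematically. The sketch below sweeps a rule export for entries marked "temp" that outlived any plausible temporary window, or whose owner has left. The rule format here is invented for illustration; real firewall exports are richer, but the audit logic is the same:

```python
# Hedged sketch: flag firewall rules whose stated rationale has expired.
# The tuple format and example rules are hypothetical.

from datetime import date

rules = [  # (rule_id, comment, date_added, owner_still_employed)
    ("fw-0012", "temp - vendor integration", date(2020, 3, 1), False),
    ("fw-0451", "office VPN range",          date(2023, 6, 1), True),
]

def suspect(rule, today: date, max_temp_days: int = 90):
    """Return (rule_id, reason) if the rule deserves review, else None."""
    rule_id, comment, added, owned = rule
    age = (today - added).days
    if "temp" in comment.lower() and age > max_temp_days:
        return (rule_id, f"temporary rule is {age} days old")
    if not owned:
        return (rule_id, "owner has left; rationale is unrecoverable")
    return None

findings = [f for r in rules if (f := suspect(r, date(2026, 1, 1)))]
for rule_id, reason in findings:
    print(rule_id, "->", reason)
```

The audit doesn't answer the hard question, whether removing the rule breaks something, but it turns invisible inherited risk into a reviewable list before the people who could explain it are gone.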
“IT keeps the business running.”
Infrastructure, security, support, compliance, vendor management — IT handles the backbone. The team is small but effective.
The team is small because nobody sees the load until it breaks.
Your team of six manages infrastructure for 3,000 employees. When it works, nobody thinks about you. When it breaks, everyone calls at once. The budget is justified by uptime. The work that prevents downtime — the patching, the monitoring, the vendor wrangling, the 2am alerts — none of that appears in the business review. You’re invisible when you succeed and blamed when you don’t. That’s not a staffing problem. It’s a visibility problem.
When a function is only visible during failure, the organization will always underinvest in it.
Every lens sees the same system. Shared language is how the system starts to learn.
These aren’t failures of people. They’re the physics of organizations operating at scale and speed.