Network Monitoring Is a Racket Built on Free Protocols

SolarWinds charges six figures to run SNMP queries you could do from a Raspberry Pi. AI is about to make that business model very uncomfortable.

Let me tell you how a $20 billion industry works. A switch exposes its metrics over SNMP — a protocol that’s been free and open since 1988. A monitoring platform runs snmpwalk against that switch, stores the numbers in a database, and draws a line graph. Then it charges you $150 per node per year for the privilege.

That’s it. That’s the business.

I’m oversimplifying, of course. There are MIBs and discovery engines and alerting rules and dashboards and role-based access and compliance reports. It’s a real product. But the foundation — the thing that makes all the pretty graphs possible — is a handful of protocols that ship free with every network device on the planet. SNMP, ICMP, LLDP, CDP, syslog, NetFlow. The raw materials cost nothing. The packaging costs a fortune.

And I think AI is about to blow the doors off this entire racket.

The State of Things: Paying Premium for Commodity Data

If you’ve ever worked in network operations, you know the drill. You pick a monitoring platform — SolarWinds, PRTG, WhatsUp Gold, Nagios, Zabbix, LogicMonitor, Datadog, pick your poison — and you start adding devices. Each device gets polled via SNMP (usually v2c because v3 is a pain to configure, and nobody wants to admit that in the security audit). The platform discovers interfaces, pulls counters, graphs bandwidth, monitors CPU and memory, and sends you an email at 3 AM when something crosses a threshold you set six months ago and forgot about.

# This is basically what your $200k/year monitoring platform does
# Behind all the dashboards and enterprise branding

# SNMP walk a switch for interface data
snmpwalk -v2c -c public 10.0.1.1 1.3.6.1.2.1.2.2.1

# Get system uptime
snmpget -v2c -c public 10.0.1.1 1.3.6.1.2.1.1.3.0

# Check interface status (up/down)
snmpwalk -v2c -c public 10.0.1.1 1.3.6.1.2.1.2.2.1.8

# Pull interface traffic counters (in/out octets)
snmpwalk -v2c -c public 10.0.1.1 1.3.6.1.2.1.31.1.1.1.6  # ifHCInOctets
snmpwalk -v2c -c public 10.0.1.1 1.3.6.1.2.1.31.1.1.1.10 # ifHCOutOctets

# Get CPU utilization, 5-minute average (Cisco cpmCPUTotal5minRev)
snmpget -v2c -c public 10.0.1.1 1.3.6.1.4.1.9.9.109.1.1.1.1.8.1

# That's it. That's the $150/node/year secret sauce.
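
The math behind the resulting bandwidth graphs is just as unglamorous. Here is a minimal Go sketch of what every platform does with those ifHCOctets counters: subtract two polls, scale to bits per second, divide by link speed. The sample values are invented.

```go
package main

import "fmt"

// ifHCInOctets is a monotonically increasing 64-bit counter.
// Bandwidth is just the delta between two polls, scaled to bits/sec.
func rateBitsPerSec(prev, curr uint64, intervalSec float64) float64 {
	delta := curr - prev // uint64 subtraction handles counter wrap
	return float64(delta) * 8 / intervalSec
}

// Utilization as a percentage of the link speed (e.g. 1e9 for 1 Gbps).
func utilizationPct(bps, linkSpeedBps float64) float64 {
	return bps / linkSpeedBps * 100
}

func main() {
	// Two samples of ifHCInOctets taken 60 seconds apart
	bps := rateBitsPerSec(1_000_000_000, 1_750_000_000, 60)
	fmt.Printf("%.1f Mbps, %.1f%% of a 1 Gbps link\n",
		bps/1e6, utilizationPct(bps, 1e9))
}
```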

The more advanced platforms add path analysis. Cisco ThousandEyes, for example, deploys agents that trace routes to destinations and measure latency, jitter, packet loss, and MOS scores across every hop. That’s genuinely useful — it tells you not just that a service is slow, but where it’s slow. But even that is built on traceroute, ping, and TCP probes. Tools that have been free on every Unix system since before most of the engineers using them were born.

# ThousandEyes-style path analysis, from your terminal

# TCP traceroute to a specific port (like ThousandEyes does)
traceroute -T -p 443 api.example.com

# Continuous ping for jitter analysis (-D prefixes each reply with a Unix timestamp)
ping -D -i 0.5 api.example.com

# MTR combines traceroute + ping for per-hop stats
mtr --report --report-cycles 100 api.example.com

# Check a specific TCP service
nc -zv -w5 api.example.com 443

# Measure HTTP response time
curl -o /dev/null -s -w "DNS: %{time_namelookup}s\nConnect: %{time_connect}s\nTTFB: %{time_starttransfer}s\nTotal: %{time_total}s\n" https://api.example.com

Here’s what a typical enterprise is paying for this:

Enterprise Network Monitoring — Typical Annual Costs
═════════════════════════════════════════════════════

SolarWinds NPM + NCM + NTA:
  500 nodes:           $30,000-50,000/year (with maintenance)
  2,000 nodes:         $100,000-180,000/year
  5,000 nodes:         $250,000-400,000/year

LogicMonitor:
  Per-device pricing:  $15-25/device/month
  500 devices:         $90,000-150,000/year
  2,000 devices:       $360,000-600,000/year

Datadog Infrastructure:
  Per-host pricing:    $15-23/host/month
  500 hosts:           $90,000-138,000/year
  + network monitoring add-on: another 30-50%

Cisco ThousandEyes:
  Per-agent licensing:  Don't even ask
  Enterprise deal:      "Contact sales" (translation: a lot)

What the underlying technology actually costs:
  SNMP:                Free (built into every device)
  Ping/traceroute:     Free (built into every OS)
  A Linux VM to run it: $5/month on any cloud provider
  Time-series database: Free (VictoriaMetrics, Prometheus)
  Grafana dashboards:   Free (open source)

Look at that gap. It's not a 2x or 5x markup. It's a 100x to 1000x markup. The vendors will tell you that you're paying for support, for the MIB library, for the enterprise-grade reliability, for the compliance certifications. And they're not wrong — those things have value. But a thousand-fold markup on commodity data collection? That's not a value proposition. That's a toll booth.

Why Nobody’s Disrupted This Yet (And Why AI Changes That)

The incumbents have survived this long for a few reasons, and they’re all starting to crack.

Reason 1: Integration Is Hard (Was Hard)

Stitching together SNMP polling, a time-series database, an alerting engine, a dashboard system, and a discovery mechanism used to be genuinely difficult. It required deep systems knowledge, weeks of configuration, and ongoing maintenance. The open-source alternatives — Nagios, Zabbix, LibreNMS — work, but they demand a dedicated engineer who actually enjoys writing XML configuration files (these people exist, but they are rare).

AI changes this because the integration work is exactly the kind of thing LLMs are good at. Generating configuration files, building API connectors, writing data transformation pipelines — this is boilerplate code that an AI can produce in minutes. The “putting it all together” moat has evaporated.

Reason 2: MIB Support Is a Moat (Was a Moat)

SNMP devices expose their data through MIBs — Management Information Bases — which are essentially schema definitions for what data a device can report. A Cisco Catalyst switch has different MIBs than a Palo Alto firewall, which has different MIBs than a NetApp storage array. The big monitoring platforms have spent decades building libraries of vendor-specific MIBs and the code to interpret them.

But here’s the thing: MIBs are text files. They follow a standard syntax (ASN.1). An LLM can read and interpret a MIB file better than most junior network engineers. You feed it a MIB, and it can tell you exactly what OIDs to poll and what the values mean. The MIB library moat is now about as deep as a puddle.

# A MIB is literally a text file that describes what data is available
# Any LLM can read this and know what to poll

-- From CISCO-PROCESS-MIB
cpmCPUTotal5secRev OBJECT-TYPE
    SYNTAX          Gauge32
    MAX-ACCESS      read-only
    STATUS          current
    DESCRIPTION
        "The overall CPU busy percentage in the last
         5 second period."
    ::= { cpmCPUTotalEntry 6 }

# An AI reads this and immediately knows:
# OID: 1.3.6.1.4.1.9.9.109.1.1.1.1.6
# What it means: CPU utilization over 5 seconds
# How to use it: poll every 30-60s, graph it, alert if > 80%

Reason 3: Enterprise Sales Cycles Protect Incumbents (Still True, But…)

The people who buy SolarWinds have procurement processes, compliance checklists, and vendor evaluation matrices that take 6-12 months. You don’t just switch monitoring platforms on a whim. There are integrations, training, historical data migrations. The switching costs are real.

This is the incumbents’ strongest remaining defense. But it’s also a ceiling on their innovation. When your customers are locked in for 3-5 year contracts, you have very little incentive to make the product dramatically better. You just need to make it incrementally better — enough to justify the renewal, not enough to justify the development cost.

Reason 4: Nobody’s Built the AI-Native Alternative (Yet)

This is the actual opportunity. Not “SolarWinds but cheaper.” Not “another open-source monitoring tool.” Something fundamentally different: a monitoring system where the primary interface is a conversation, not a dashboard.

What an AI-Native Network Monitor Actually Looks Like

Here’s where the speculation gets fun. What if you started from scratch — no legacy code, no “we’ve always done it this way” — and designed a network monitoring system assuming AI exists?

The Collection Layer: Stupid Simple

Lightweight agents — tiny Go or Rust binaries, under 10MB — that run on a Linux VM in each site or even on a Raspberry Pi. Each agent does exactly three things:

  1. SNMP polling on a schedule (device discovery via LLDP/CDP, interface counters, system health)
  2. Active probing (ping, traceroute, TCP port checks, HTTP/HTTPS response time)
  3. CLI scraping when SNMP isn’t enough (SSH into devices, run show commands, parse the output)

// Conceptual agent — discovery + polling loop
// The entire collection agent is maybe 2,000 lines of Go

type Interface struct {
    Name     string
    SpeedBps uint64
    OperUp   bool
}

type Metrics map[string]float64

type Device struct {
    IP         string
    Hostname   string
    Vendor     string
    Model      string
    Interfaces []Interface
    LastSeen   time.Time
}

type Agent struct{}

func (a *Agent) DiscoverNetwork() []Device {
    // 1. ARP scan the local subnets
    // 2. SNMP query each responding IP for sysDescr/sysName
    // 3. Pull LLDP/CDP neighbor tables to build topology
    // 4. Classify device type from sysObjectID
    // Result: complete network inventory, zero configuration
    return nil // sketch: the steps above are the real work
}

func (a *Agent) PollDevice(d Device) Metrics {
    // Standard MIBs cover 80% of what anyone needs:
    // - IF-MIB (interfaces, bandwidth, errors)
    // - HOST-RESOURCES-MIB (CPU, memory, disk)
    // - ENTITY-MIB (hardware inventory)
    // - IP-MIB (routing table, ARP)
    // Poll interval: 60 seconds for counters, 300 for inventory
    return nil
}

The agents stream metrics to a central time-series database. VictoriaMetrics, Prometheus, InfluxDB — all free, all capable of handling millions of data points. There’s no reason to build a custom storage engine. This problem has been solved.
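
Shipping those samples is similarly plumbing-grade work. As a sketch of the wire format, assuming VictoriaMetrics' Influx-compatible /write endpoint on its default port 8428 (the device and interface names here are made up):

```go
package main

import (
	"fmt"
	"strings"
)

// One metric sample in InfluxDB line protocol, which VictoriaMetrics
// (and InfluxDB itself) accept over a plain HTTP POST to /write.
func lineProtocol(metric, device, ifName string, value uint64, tsNano int64) string {
	return fmt.Sprintf("%s,device=%s,ifName=%s value=%d %d",
		metric, device, ifName, value, tsNano)
}

func main() {
	lines := []string{
		lineProtocol("ifHCInOctets", "core-sw-01", "Te1/0/1", 1750000000, 1700000000000000000),
		lineProtocol("ifHCOutOctets", "core-sw-01", "Te1/0/1", 900000000, 1700000000000000000),
	}
	body := strings.Join(lines, "\n")
	// In a real agent:
	// http.Post("http://vm-host:8428/write", "text/plain", strings.NewReader(body))
	fmt.Println(body)
}
```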

The AI Layer: Where It Gets Interesting

This is where everything changes. Instead of a human staring at dashboards, trying to correlate spikes across 47 graphs to figure out why the VPN is slow, the AI is continuously analyzing every metric, every log, every config change, and building a model of “normal” for your specific network.

Natural language troubleshooting:

Imagine walking into the NOC on Monday morning and instead of clicking through twelve dashboards, you type:

“Anything unusual over the weekend?”

And the system responds:

“Two things. First, the uplink on core-sw-02 port Te1/0/1 hit 94% utilization Saturday at 2 AM for about 40 minutes — looks like the backup window overlapping with the replication job again. I’ve seen this three Saturdays in a row now. Second, the WAN latency to the Dallas office jumped from 12ms baseline to 45ms starting Friday at 6 PM and hasn’t come back down. The ISP’s looking glass shows a route change at their edge — they’re sending traffic through Chicago now instead of direct. Might want to call them.”

No clicking. No hunting through graphs. No “let me check the interface utilization on… wait which switch is the uplink on again?” Just the answer.

Anomaly detection without threshold tuning:

Traditional monitoring: “Alert when CPU > 85%.” Then you get paged at 3 AM because a switch hit 87% during a scheduled process, and now you have alert fatigue, and now you ignore alerts, and now you miss the actual problem.

AI monitoring: the system learns that this switch’s CPU spikes to 90% every night at 2 AM during the backup window, and that’s normal. It learns that interface Gi0/24 usually runs at 40% utilization during business hours. When that same interface hits 40% at 3 AM on a Sunday, that’s the anomaly — even though it’s technically below any threshold you’d set.

No threshold tuning. No alert fatigue. The system learns your network’s rhythms and tells you when something breaks the pattern.
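
One way to make "learns your network's rhythms" concrete: keep a running mean and standard deviation per hour-of-week, and score each new sample against its own bucket. A minimal Go sketch using Welford's online algorithm; the utilization numbers are invented:

```go
package main

import (
	"fmt"
	"math"
)

// Baseline keeps a running mean/stddev of a metric per hour-of-week
// (168 buckets), so 3 AM Sunday is judged against past 3 AM Sundays,
// not against a single global threshold.
type Baseline struct {
	n    [168]float64
	mean [168]float64
	m2   [168]float64 // sum of squared deviations (Welford's algorithm)
}

func (b *Baseline) Observe(hourOfWeek int, v float64) {
	b.n[hourOfWeek]++
	d := v - b.mean[hourOfWeek]
	b.mean[hourOfWeek] += d / b.n[hourOfWeek]
	b.m2[hourOfWeek] += d * (v - b.mean[hourOfWeek])
}

// ZScore says how many standard deviations v is from this hour's normal.
func (b *Baseline) ZScore(hourOfWeek int, v float64) float64 {
	if b.n[hourOfWeek] < 2 {
		return 0
	}
	sd := math.Sqrt(b.m2[hourOfWeek] / (b.n[hourOfWeek] - 1))
	if sd == 0 {
		return 0
	}
	return (v - b.mean[hourOfWeek]) / sd
}

func main() {
	var b Baseline
	// Four weeks of history: ~5% utilization Sunday 03:00, ~40% Monday 10:00
	for week := 0; week < 4; week++ {
		b.Observe(3, 5+float64(week))  // Sunday 03:00
		b.Observe(34, 40+float64(week)) // Monday 10:00
	}
	// 40% at 3 AM Sunday is below any static threshold but wildly abnormal here
	fmt.Printf("z-score for 40%% at Sunday 3AM:  %.1f\n", b.ZScore(3, 40))
	fmt.Printf("z-score for 40%% at Monday 10AM: %.1f\n", b.ZScore(34, 40))
}
```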

Predictive failures:

Your monitoring platform sees that CRC errors on a fiber uplink have gone from 0 per day to 3 per day to 12 per day over the past two weeks. A traditional monitor won’t alert until you hit whatever arbitrary threshold you set. The AI sees the trend and says: “This optic is degrading. Based on the error acceleration curve, you’ve probably got 2-3 weeks before it starts dropping packets. Here’s the part number for a replacement SFP.”
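
The "error acceleration curve" can start as something very simple: a least-squares line over daily error counts, extrapolated to a failure threshold. A Go sketch, where the CRC counts and the 50-errors/day threshold are purely illustrative:

```go
package main

import "fmt"

// Least-squares slope/intercept over daily error counts (x = day index).
func linearFit(y []float64) (slope, intercept float64) {
	n := float64(len(y))
	var sumX, sumY, sumXY, sumXX float64
	for i, v := range y {
		x := float64(i)
		sumX += x
		sumY += v
		sumXY += x * v
		sumXX += x * x
	}
	slope = (n*sumXY - sumX*sumY) / (n*sumXX - sumX*sumX)
	intercept = (sumY - slope*sumX) / n
	return
}

// daysUntil extrapolates when the fitted line crosses a threshold,
// counted from the most recent sample. Returns -1 if not degrading.
func daysUntil(y []float64, threshold float64) float64 {
	slope, intercept := linearFit(y)
	if slope <= 0 {
		return -1
	}
	return (threshold-intercept)/slope - float64(len(y)-1)
}

func main() {
	// Two weeks of daily CRC error counts on a fiber uplink
	crc := []float64{0, 0, 1, 1, 2, 3, 3, 4, 6, 7, 8, 10, 11, 12}
	d := daysUntil(crc, 50) // alert well before packets actually drop
	fmt.Printf("projected to reach 50 errors/day in ~%.0f days\n", d)
}
```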

Cross-domain correlation:

The help desk reports that users in Building B are complaining about slow internet. Traditional approach: check the WAN link (fine), check the firewall (fine), check the building switch (fine), check the AP utilization (fine), check the DNS server (fine), check the proxy server — oh, the proxy server’s disk is full and it’s swapping to a degraded RAID array, causing 500ms latency on every web request.

That investigation takes a human 45 minutes of SSH sessions and dashboard hopping. An AI that ingests metrics from network, server, and application layers simultaneously spots it in seconds: “Building B slowness is caused by disk I/O latency on proxy-01. The RAID array has a degraded disk and the filesystem hit 95%, triggering swap. Users are experiencing 400-600ms additional latency on HTTP requests routed through this proxy.”

The Interface: Chat First, Dashboards Second

The biggest conceptual shift: the primary interface isn’t a dashboard. It’s a conversation.

Dashboards still exist — you need them for NOC wall displays, for quick visual overviews, for executives who want to see green icons. But the daily workflow for a network engineer is conversational:

  • “Show me the top 10 interfaces by utilization right now”
  • “Which devices haven’t been backed up in the last 7 days?”
  • “What changed on the firewall config yesterday?”
  • “If I shut down this link for maintenance, what traffic will be affected?”
  • “Draft a change request for upgrading the IOS on the distribution switches”

Each of these questions currently requires navigating to a specific tool, finding the right report, applying the right filters, and interpreting the results. In an AI-native system, you just ask.

Config Management as Conversation

This one might be the sleeper feature. Network configuration management — compliance checks, drift detection, change tracking — is typically a separate product (SolarWinds NCM, Oxidized, RANCID) that requires its own setup and maintenance.

In an AI-native system, config management is just another data source the AI understands:

“Make sure no switch has telnet enabled.”

The AI SSHes into every switch, checks the running config, and reports back: “Three switches still have telnet enabled: access-sw-14, access-sw-22, and dist-sw-03. Want me to generate the remediation commands?”

“Yes, and add a note to disable telnet on any new switch deployments.”

The AI generates the config snippets, adds the compliance rule to its knowledge base, and flags any future config that includes transport input telnet. No manual compliance rules. No XML policy definitions. You told it what you want in English and it remembers.
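
The check itself is trivially mechanizable: a rule stated in English becomes, internally, a predicate over the pulled running config. A Go sketch of the telnet rule, assuming IOS-style syntax; the remediation snippet is illustrative:

```go
package main

import (
	"fmt"
	"strings"
)

// telnetEnabled scans a running config for VTY transport lines
// that still permit telnet.
func telnetEnabled(runningConfig string) bool {
	for _, line := range strings.Split(runningConfig, "\n") {
		t := strings.TrimSpace(line)
		if strings.HasPrefix(t, "transport input") && strings.Contains(t, "telnet") {
			return true
		}
	}
	return false
}

// remediation returns the IOS-style fix the AI would propose (sketch).
func remediation() string {
	return "line vty 0 15\n transport input ssh"
}

func main() {
	config := "line vty 0 15\n transport input telnet ssh\n login local"
	if telnetEnabled(config) {
		fmt.Println("NON-COMPLIANT: telnet enabled")
		fmt.Println(remediation())
	}
}
```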

The Economics: Why This Could Actually Work

Here’s the part that makes this more than a thought experiment.

Cost structure — AI-native network monitor
══════════════════════════════════════════

Collection agents:
  Development cost:    One-time (Go binary, open-source it)
  Deployment cost:     A Linux VM or Raspberry Pi per site
  Running cost:        Negligible (SNMP polling is lightweight)

Time-series storage:
  VictoriaMetrics:     Free (open source)
  Storage costs:       ~$0.02/GB/month (commodity storage)
  1,000 devices:       Maybe 50GB/month of metrics
  Monthly cost:        ~$1

AI inference:
  This is the real cost
  Per-query (Claude/GPT): $0.01-0.10 depending on complexity
  Daily automated analysis: Maybe $1-5/day for a mid-size network
  Monthly AI cost:     $30-150 for most deployments

Total monthly cost for 500 devices:
  Infrastructure:      $20-50
  AI inference:        $30-150
  Total:               $50-200/month

vs. SolarWinds for 500 devices:
  $30,000-50,000/year ($2,500-4,200/month)

Even with generous estimates for AI inference costs, the economics are absurdly favorable. The collection is free (open protocols, open-source tools). The storage is free (open-source databases). The only real cost is AI inference, and that’s dropping fast.

You could price this at $5-10/device/month — which feels expensive relative to the actual cost but cheap relative to incumbents — and still have 80%+ margins. Or you could go flat-rate per site and make the per-device pricing model obsolete entirely.

What Would It Take to Build This

Let’s get specific about what an MVP looks like.

Phase 1: Collection + Storage (2-4 weeks)

A Go agent that does SNMP discovery and polling, pushes metrics to VictoriaMetrics. Handles the standard MIBs (IF-MIB, HOST-RESOURCES, ENTITY-MIB). Discovers topology via LLDP/CDP. This is solved-problem territory — there are open-source libraries for all of it.

Phase 2: AI Analysis Layer (2-4 weeks)

Connect the metrics data to an LLM. Build the query interface — user asks a question, the system translates it to metrics queries, retrieves the data, and generates a natural-language response. Start with simple stuff: “what’s the bandwidth utilization on this link?” Then build up to correlation: “why is this link saturated?”

Phase 3: Baseline + Anomaly Detection (4-6 weeks)

The system watches your network for 2-4 weeks and builds a model of “normal.” After that, it can detect anomalies without configured thresholds. This is where it starts feeling like magic compared to traditional tools.

Phase 4: Config Management + Compliance (2-4 weeks)

SSH into devices, pull configs, store them, diff them, and let the AI understand them. “Is our network PCI compliant?” becomes an answerable question instead of a month-long audit project.
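
The diffing piece needs nothing fancy to start: a line-set comparison between the stored snapshot and the freshly pulled config already surfaces drift. A Go sketch; note it ignores line ordering and duplicates, which a real implementation would not:

```go
package main

import (
	"fmt"
	"strings"
)

// drift reports config lines added and removed relative to the
// last known-good snapshot, as two unordered sets.
func drift(baseline, current string) (added, removed []string) {
	base := map[string]bool{}
	curr := map[string]bool{}
	for _, l := range strings.Split(baseline, "\n") {
		base[strings.TrimSpace(l)] = true
	}
	for _, l := range strings.Split(current, "\n") {
		curr[strings.TrimSpace(l)] = true
	}
	for l := range curr {
		if l != "" && !base[l] {
			added = append(added, l)
		}
	}
	for l := range base {
		if l != "" && !curr[l] {
			removed = append(removed, l)
		}
	}
	return
}

func main() {
	a, r := drift(
		"snmp-server community public RO\nip ssh version 2",
		"snmp-server community s3cret RO\nip ssh version 2")
	fmt.Println("added:", a, "removed:", r)
}
```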

Total time to MVP: maybe 3-4 months of focused work. Total infrastructure cost: close to zero until you have customers. The AI API costs are the main variable, and they’re pay-per-use.

The Honest Challenges

I’m not going to pretend this is all upside. There are real obstacles.

Enterprise trust: “The AI says the network is fine” is a terrifying sentence for a network manager whose job depends on uptime. You need to show your work — every AI conclusion needs to link back to the raw data that supports it. Transparency isn’t optional; it’s the entire credibility model.

SNMP is dying (slowly): The industry is moving toward streaming telemetry (gNMI, gRPC-based). SNMP isn’t going away for a decade, but a new platform should support both. The good news: streaming telemetry is actually easier to work with than SNMP. The bad news: fewer devices support it today.

The last 20% is 80% of the work: Standard MIBs cover most use cases, but every enterprise has that one weird legacy device that only speaks SNMPv1 with a custom MIB that was written in 1997 by someone who no longer works there. Supporting the long tail of edge cases is what separates a demo from a product.

Security: A system that can SSH into every network device and has AI making recommendations about configurations is a security auditor’s nightmare. The access controls, audit logging, and approval workflows need to be bulletproof. This isn’t a “move fast and break things” domain — it’s a “move carefully and don’t bring down the production network” domain.

“In network monitoring, the cost of a false negative (missing a real problem) is a 3 AM outage. The cost of a false positive (alerting on nothing) is alert fatigue that causes the next real problem to be ignored. Both paths lead to 3 AM.” — Every network engineer, at 3 AM

Selling to enterprises is slow: Even if the product is 10x better and 10x cheaper, getting a Fortune 500 company to rip out SolarWinds and install your startup’s monitoring agent takes 12-18 months of sales cycles, security reviews, and pilot programs. The realistic go-to-market is MSPs and mid-market first, enterprise later.

Where This Idea Stands

This sits firmly in “researching” territory. The technical feasibility is obvious — every component exists today, either as open-source software or as an API call. The economics make sense. The product differentiation (AI-native, conversation-first) is genuine, not just a buzzword stapled onto a traditional dashboard.

The questions I’m still chewing on:

  1. Build vs. integrate: Do you build the collection layer from scratch, or bolt AI onto an existing open-source platform like LibreNMS or Prometheus?
  2. Self-hosted vs. SaaS: Enterprise network teams are paranoid about sending device data to the cloud. A self-hosted option might be necessary, which complicates the business model.
  3. Who’s the first customer? MSPs managing 50-200 client networks seem like the sweet spot — they’re cost-sensitive, they manage lots of similar environments, and they’d kill for AI-assisted troubleshooting across all their clients simultaneously.
  4. The Ollama angle: Could you run the AI inference locally on the same box as the collection agent, using a small open-source model? That eliminates the cloud dependency entirely. A quantized 7B model running on a decent CPU might be enough for basic analysis.

The network monitoring industry has been selling the same fundamental product — repackaged SNMP data — for 30 years. The packaging has gotten prettier, but the core hasn’t changed. AI doesn’t just make the packaging cheaper. It makes the product fundamentally different. That’s not an incremental improvement. That’s a category shift.

And category shifts don’t care about your 5,000-device SolarWinds contract renewal.


FAQ

How much do enterprise network monitoring platforms actually cost?

It varies wildly, but the numbers are eye-watering. SolarWinds Network Performance Monitor starts around $2,500 for 100 nodes and scales to six figures for large deployments. LogicMonitor charges $15-25 per device per month. Datadog’s infrastructure monitoring runs $15-23 per host per month, plus add-ons for network monitoring, APM, and log management. Cisco ThousandEyes is priced per agent and typically requires a “contact sales” conversation, which is enterprise code for “more than you want to spend.” For a mid-size enterprise with 2,000-5,000 devices, annual monitoring costs of $200,000-500,000 are not unusual when you factor in the platform license, maintenance, support contracts, and the salary of the person who maintains it all.

What is SNMP and why is it central to network monitoring?

SNMP — Simple Network Management Protocol — is a standardized protocol for querying network devices about their status and performance. It’s been around since 1988 and is supported by virtually every managed network device: switches, routers, firewalls, wireless access points, UPS units, printers, servers. When a monitoring platform “monitors” a device, it’s usually sending SNMP queries asking “what’s your CPU utilization?” or “how many bytes passed through this interface?” The device responds with the data. That’s the entire interaction. SNMP v2c uses community strings (basically passwords sent in plaintext), while SNMP v3 adds encryption and authentication. The protocol itself is free, open, and supported practically everywhere. The monitoring platforms are essentially selling you a pretty frontend for SNMP queries.

Could an AI-powered monitoring tool really replace SolarWinds?

For a direct feature-for-feature replacement, not immediately. SolarWinds has 20+ years of vendor MIB support, integrations with ticketing systems, compliance reporting, and enterprise features like multi-tenant views and role-based access. An AI-native tool wouldn’t try to replicate all of that. Instead, it would offer something SolarWinds can’t: intelligent analysis, natural-language troubleshooting, and automatic anomaly detection. The replacement path is more likely gradual — a team deploys the AI tool alongside their existing monitoring, starts using it for troubleshooting and analysis, realizes they’re looking at SolarWinds less and less, and eventually doesn’t renew the contract. That’s how disruption usually works: not a rip-and-replace, but a slow migration driven by daily utility.

What open-source tools exist for network monitoring today?

The open-source landscape is actually quite mature. LibreNMS and Zabbix are full-featured monitoring platforms with web interfaces, alerting, and SNMP support. Prometheus collects metrics and Grafana visualizes them — together they handle most monitoring use cases. Oxidized and RANCID handle network device configuration backup. NetBox provides network inventory and IPAM. The challenge with open-source isn’t capability — it’s integration. Making all these tools work together coherently requires significant engineering effort, which is exactly the gap an AI layer could fill.

What’s the realistic go-to-market for something like this?

MSPs (Managed Service Providers) are probably the best entry point. They manage networks for dozens or hundreds of small businesses, they’re extremely cost-sensitive, and they’d benefit enormously from AI-assisted troubleshooting across all their client networks. A single MSP technician managing 50 client networks could use an AI-native tool to spot problems across all of them simultaneously — something that’s impossible with traditional per-client dashboard monitoring. After proving the model with MSPs, you move upmarket to mid-size enterprises, then eventually enterprise. The homelabber and sysadmin community is also a great proving ground — free tier, get feedback, build credibility.

Isn’t this just Grafana with ChatGPT bolted on?

It might look like that from a distance, but the difference is architectural. Grafana shows you data and you interpret it. A ChatGPT wrapper would let you ask questions about what’s on the screen. An AI-native monitoring system has the AI as the core processing engine — it’s continuously analyzing every metric stream, building behavioral baselines, correlating events across devices and layers, and proactively surfacing insights. The AI isn’t reading your dashboards; it’s reading the raw telemetry and building an understanding of your network that’s deeper than any dashboard could represent. The conversation interface is the output, not the product. The product is the understanding.