The Call
I got a call from a friend, a fellow contractor, early yesterday morning. “Just woke up and my entire network is broken. I have a deadline.” Devices were dropping off, some had no IP address, others only IPv6, and nothing could resolve. For him, everything was down.
I connected to his router over VPN without issue, which narrowed things down quickly. The edge was up, which meant the failure sat somewhere inside his local network. The symptoms were familiar. Devices losing addresses and name resolution together usually points toward DHCP and DNS, though at that stage it could still have been a broader internal failure.
As we talked it through, the background came into focus. He had been running his own DNS filtering service for years because it was more flexible and had better monitoring than his router. The service began as a trial setup alongside other DNS options and simply remained in place. It was capable, so for similar reasons he later moved DHCP onto it as well, and at that point it became central to his environment.
What Changed
I checked the service on the host and found it running, which ruled out an application failure and pushed the problem down a layer. When I checked the host's network configuration, the issue was immediately clear. There was no IPv4 address, which for a DHCP server is a hard stop.
The host itself had been configured to obtain its network settings via DHCP. That worked while it was only providing DNS, but once it became its own DHCP server it was a circular dependency: the host needed a DHCP server to get an address, and it was the DHCP server.
Delayed Failure
Nothing had changed that morning. DHCP had been moved from the router to the service almost two weeks earlier, so it did not immediately appear related. The lease duration in the router's previous DHCP settings told the real story: just under two weeks.
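The delay is what the standard DHCP timers predict. A client renews its lease at roughly 50% of the lease duration (T1), falls back to broadcast rebinding at 87.5% (T2), and only drops the address when the lease fully expires. A quick sketch of that timeline, assuming a hypothetical 13-day lease to stand in for "just under two weeks":

```shell
# DHCP lease timers (RFC 2131 defaults), with an assumed 13-day lease.
LEASE_DAYS=13
LEASE_S=$(( LEASE_DAYS * 86400 ))
T1_S=$(( LEASE_S / 2 ))       # 50%: unicast renewal to the original server
T2_S=$(( LEASE_S * 7 / 8 ))   # 87.5%: broadcast rebind to any server
# Integer division here just gives whole days for readability.
echo "renew around day $(( T1_S / 86400 )), rebind around day $(( T2_S / 86400 )), expire on day $LEASE_DAYS"
```

Under these timers nothing visibly breaks until leases issued before the cutover begin to run out, which lines up with a failure surfacing roughly two weeks after the change.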
As leases issued by the router expired, devices began renewing against a service whose own host no longer had an address. The host's lease had expired too, and with the only DHCP server unable to answer, it had nowhere to renew from, so it never recovered. The correction was simple: assign a fixed address to the service host and restart. It took about a minute, and everything came back.
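The fix itself amounts to a static address for the service host. A minimal sketch, assuming an Ubuntu host using netplan; the interface name, addresses, and filename are placeholders, not the actual environment:

```shell
# Sketch: pin a static IPv4 on the DHCP/DNS host instead of relying on DHCP.
# In practice this file would live in /etc/netplan/; written locally here.
cat > 01-static.yaml <<'EOF'
network:
  version: 2
  ethernets:
    eth0:                       # placeholder interface name
      dhcp4: false              # stop asking ourselves for an address
      addresses: [192.168.1.2/24]
      routes:
        - to: default
          via: 192.168.1.1      # placeholder router address
      nameservers:
        addresses: [127.0.0.1]  # the host runs its own resolver
EOF
# Then apply and restart the service:
#   sudo netplan apply && sudo systemctl restart <dns/dhcp service>
```

Any equivalent mechanism works just as well, including a DHCP reservation on the router, as long as the server's address no longer depends on the server itself.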
If that had been done during the initial setup, or at any time in the intervening years, this scramble would not have happened. “Good enough” can be the enemy of the good.
From his side, diagnosis was harder because he was inside the failure domain. His devices depended on the same service that was failing, so every test path was affected. I was coming in through the router VPN, outside that boundary, and could see the problem more directly.
What This Becomes
This is not a story about inexperience. He is a capable and smart person. It is about what happens when something introduced for a specific purpose becomes structural without the design changing along with it.
In this case nothing in the service was wrong. It had been working for years. The failure sat in the environment around it, unchanged from when it was first set up. It only surfaced once the service became something the network depended on.
It took about a minute to fix.