Years ago, I reluctantly set up ntp on some servers using IP addresses for the source servers; at the time, the ntp/ntpd version I had was incompatible with DNS names in ntp.conf, and I didn’t want to go out of my way to compile a newer one from scratch.
Today, several years later, I realized that decision has slowly been coming back to bite me in the butt.
Back then, the IP addresses were supposed to be rock-solid ntp references. But over the years they went offline, and as of about a month ago all of them were gone.
I would not have caught the drift if it weren’t for my use of pt-heartbeat (mk-heartbeat) and a review of our Cacti graphs.
Usually I check them every Monday as a routine, but I’ve been so busy for the last several months that I haven’t had time to do that. I figured that if it hits the fan, our alerts/thresholds will let me know, which on a few occasions worked as needed for an Apache server.
pt-heartbeat, a tool in the Percona Toolkit, uses a common table across the replicated servers, in which each server updates a record with a datetime value.
The time difference between the value written for server A and the moment it is seen, replicated, on server B is the ‘lag’. Lag should only come from temporary spikes in traffic (or intentional delaying). Needless to say, my gut sank when I saw that something weird was going on that was causing a small, yet unrecoverable, and linearly increasing lag time. After SHOW SLAVE STATUS confirmed that everything was up to date, it was apparent that the mechanisms involved in generating the graph were at fault. A quick side-by-side examination of server A’s and server B’s ‘heartbeat’ tables showed that the times were off by a few seconds from each other.
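To make the symptom concrete, here is a minimal sketch of why an unsynchronized clock shows up as a straight, ever-rising line on a lag graph. It is not pt-heartbeat's code; the function names and the drift rate of 0.1 seconds per day are assumptions picked only for illustration. The idea is that the monitor subtracts the replicated timestamp from the replica's own clock, so any clock offset is added to the real lag.

```python
from datetime import datetime, timedelta


def apparent_lag(heartbeat_ts, replica_now):
    """Lag as a monitor on the replica would see it:
    the replica's own clock minus the replicated heartbeat timestamp."""
    return (replica_now - heartbeat_ts).total_seconds()


def simulate(days, drift_s_per_day, true_lag_s=0.0):
    """Return the apparent lag sampled once per day.

    drift_s_per_day is a hypothetical rate at which the replica's
    clock runs ahead of the master's clock when ntp is not syncing.
    """
    start = datetime(2024, 1, 1)  # arbitrary starting point
    lags = []
    for day in range(days + 1):
        true_now = start + timedelta(days=day)
        # The heartbeat row was stamped by the master true_lag_s seconds ago.
        heartbeat_ts = true_now - timedelta(seconds=true_lag_s)
        # The replica's clock has drifted away from true time.
        replica_now = true_now + timedelta(seconds=drift_s_per_day * day)
        lags.append(apparent_lag(heartbeat_ts, replica_now))
    return lags


if __name__ == "__main__":
    for day, lag in enumerate(simulate(30, 0.1)):
        if day % 10 == 0:
            print(f"day {day:2d}: apparent lag {lag:.1f}s")
```

With replication itself fully caught up (true lag of zero), the reported lag still climbs by the drift rate every day, which matches a small, linearly increasing lag that never recovers.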
I restarted the pt-heartbeat daemons and the issue persisted – the next culprit was simple to identify: ntp.
The output of ntpq -p quickly showed that the ntp server hadn’t synchronized for over a month.
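Since this went unnoticed for a month, a small check on the ntpq -p output could feed an alert. The sketch below is one possible approach, not something from my setup: the sample text is made up (the 192.0.2.x addresses are documentation placeholders, not the original servers), and the function names are hypothetical. It relies on two things ntpq -p does print: a tally character in the first column, where `*` marks the peer currently selected for synchronization, and a `reach` column, an octal register of the last eight polls where 0 means none succeeded.

```python
def parse_ntpq_peers(text):
    """Parse the peer table printed by `ntpq -p` into a list of dicts."""
    peers = []
    for line in text.splitlines():
        stripped = line.strip()
        # Skip blank lines, the column header, and the ===== separator.
        if not stripped or stripped.startswith("remote") or set(stripped) == {"="}:
            continue
        tally, fields = line[0], line[1:].split()
        if len(fields) != 10:
            continue  # not a peer row we understand
        peers.append({
            "tally": tally,
            "remote": fields[0],
            "refid": fields[1],
            "reach": int(fields[6], 8),   # reach is printed in octal
            "offset_ms": float(fields[8]),
        })
    return peers


def is_synchronized(peers):
    """True if some peer is selected ('*') and has answered recent polls."""
    return any(p["tally"] == "*" and p["reach"] != 0 for p in peers)


# Made-up sample: every source unreachable, nothing selected.
STALE_SAMPLE = "\n".join([
    "     remote           refid      st t when poll reach   delay   offset  jitter",
    "==============================================================================",
    " 192.0.2.10      .INIT.          16 u    - 1024    0    0.000    0.000   0.000",
    " 192.0.2.11      .INIT.          16 u    - 1024    0    0.000    0.000   0.000",
])

# Made-up sample: one reachable, selected source.
HEALTHY_SAMPLE = "\n".join([
    "     remote           refid      st t when poll reach   delay   offset  jitter",
    "==============================================================================",
    "*192.0.2.20      .GPS.            1 u   33   64  377   12.345   -0.512   0.210",
])

if __name__ == "__main__":
    print("stale sample synchronized:", is_synchronized(parse_ntpq_peers(STALE_SAMPLE)))
    print("healthy sample synchronized:", is_synchronized(parse_ntpq_peers(HEALTHY_SAMPLE)))
```

In practice the text would come from running ntpq -p on each server (for example through subprocess), and a False result would be the thing to alert on.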
Over the years, periodic apt upgrades had brought a newer version of ntp through the pipeline, so all that was needed was an update of ntp.conf to use a new host (I opted for stratum 1 ‘time.nist.gov’).
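For reference, the change amounts to something like the fragment below. This is a sketch rather than my exact file: the commented-out address is a placeholder, the file usually lives at /etc/ntp.conf, and the iburst option (which speeds up the initial synchronization) is my assumption of a sensible default rather than something the fix required.

```
# ntp.conf (fragment)

# Old IP-based source, now dead. 192.0.2.10 is a placeholder address.
# server 192.0.2.10

# Name-based source, resolved through DNS.
server time.nist.gov iburst
```

After restarting the ntp service, ntpq -p should eventually show a `*` next to the selected source and a non-zero reach value.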