We currently lack a metric that tells us how many reboots have occurred on a cluster. A reboot impacts availability, tells us when admins or hardware decide to take an outage, and might be an unintended side effect of a change in our software. By keeping a counter of reboots per node we can track the total number of reboots over time and gain better insight into how machines are managed in each environment.
The wtmp log (readable with the last command on RHCOS) is an effective counter for boots. Our node_exporter should read wtmp on startup and write the number of boots to a file picked up by the textfile collector, so the count is reported per node. We should then sum that value over all nodes and report the total number of reboots back to telemetry.
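As a rough illustration of the proposal above, here is a minimal Python sketch that counts boot records in wtmp and writes the result in the text format the textfile collector reads. It is not the node_exporter implementation. It assumes the glibc utmp record layout on x86_64 Linux (384-byte records with ut_type in the first two bytes, BOOT_TIME equal to 2). The metric name node_boots_total, the file name boots.prom, and the function names are hypothetical.

```python
import os
import struct
import tempfile

# Assumption: glibc utmp layout on x86_64 Linux, 384 bytes per record.
UTMP_RECORD_SIZE = 384
# ut_type value for a boot record, per <utmp.h>.
BOOT_TIME = 2


def count_boots(wtmp_bytes: bytes) -> int:
    """Count BOOT_TIME records in the raw contents of a wtmp file."""
    count = 0
    # Step through whole records only; a trailing partial record is ignored.
    last_start = len(wtmp_bytes) - UTMP_RECORD_SIZE
    for offset in range(0, last_start + 1, UTMP_RECORD_SIZE):
        # ut_type is a 16-bit integer at the start of each record.
        (ut_type,) = struct.unpack_from("<h", wtmp_bytes, offset)
        if ut_type == BOOT_TIME:
            count += 1
    return count


def render_metric(boots: int) -> str:
    """Render the count in the Prometheus text exposition format."""
    # The metric name is hypothetical, chosen for this sketch.
    return (
        "# HELP node_boots_total Boot records found in wtmp.\n"
        "# TYPE node_boots_total counter\n"
        f"node_boots_total {boots}\n"
    )


def write_textfile(directory: str, boots: int) -> str:
    """Write the metric into the textfile collector directory."""
    # Write to a temporary file and rename it, so the collector
    # never reads a half-written file.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(render_metric(boots))
    os.chmod(tmp_path, 0o644)
    final_path = os.path.join(directory, "boots.prom")
    os.replace(tmp_path, final_path)
    return final_path
```

A caller would read /var/log/wtmp in binary mode, pass the bytes to count_boots, and hand the result to write_textfile along with the directory the textfile collector is configured to scan. If wtmp is rotated on the node, the count covers only the records still in the file, so the value can drop after rotation.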
This is part of gaining overall insight into the disruption we inject into customer clusters.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.