Bug 1906570

Summary: Number of disruptions caused by reboots on a cluster cannot be measured
Product: OpenShift Container Platform
Reporter: Clayton Coleman <ccoleman>
Component: Monitoring
Assignee: Sergiusz Urbaniak <surbania>
Status: CLOSED ERRATA
QA Contact: Junqi Zhao <juzhao>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.7
CC: alegrand, anpicker, erooth, juzhao, kakkoyun, lcosic, mnguyen, pkrupa, surbania
Target Milestone: ---
Target Release: 4.7.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-02-24 15:41:57 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Clayton Coleman 2020-12-10 19:19:55 UTC
We currently lack a metric that tells us how many reboots have occurred on a cluster. A reboot impacts availability, tells us when admins or hardware decide to take an outage, and might be an accidental outcome of our software incorrectly changing. By tracking a counter of reboots per node we can track the total number of reboots over time and gain better insight into how machines are managed by environment.

The wtmp log (accessible via last on RHCOS) represents an effective counter for boots. Our node_exporter should read wtmp on startup and write the number of boots to a file picked up by the textfile collector on the node. We should then sum that over all nodes and report the number of reboots back to telemetry.
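As a rough illustration of the proposal above (not the change that was eventually shipped), the following Python sketch counts boot records in raw wtmp bytes and renders them in the Prometheus text exposition format that a textfile collector reads. It assumes the glibc struct utmp layout on x86_64 Linux (384-byte records, ut_type stored as a short at the start of each record, BOOT_TIME equal to 2). The metric name node_boots_total and both function names are hypothetical.

```python
import struct

# Assumption: sizeof(struct utmp) on x86_64 Linux with glibc.
UTMP_RECORD_SIZE = 384
# ut_type value that marks a boot record in <utmp.h>.
BOOT_TIME = 2


def count_boot_records(data: bytes) -> int:
    """Count BOOT_TIME entries in the raw contents of a wtmp file."""
    boots = 0
    # Ignore a trailing partial record, if any.
    usable = len(data) - len(data) % UTMP_RECORD_SIZE
    for offset in range(0, usable, UTMP_RECORD_SIZE):
        # ut_type is a native-endian 16-bit integer at the record start.
        (ut_type,) = struct.unpack_from("=h", data, offset)
        if ut_type == BOOT_TIME:
            boots += 1
    return boots


def render_textfile(boots: int, metric: str = "node_boots_total") -> str:
    """Render the count in Prometheus text exposition format.

    The default metric name is a placeholder, not an existing metric.
    """
    return (
        f"# HELP {metric} Number of boot records found in wtmp.\n"
        f"# TYPE {metric} counter\n"
        f"{metric} {boots}\n"
    )
```

In use, the contents of /var/log/wtmp would be read in binary mode, passed to count_boot_records, and the rendered text written to a .prom file in the directory the textfile collector watches. Because wtmp can be rotated, the count may drop back toward zero, which a counter-style query would treat as a reset.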

This is part of gaining overall insight into the disruption we inject into customer clusters.

Comment 8 errata-xmlrpc 2021-02-24 15:41:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633