Bug 1906570 - Number of disruptions caused by reboots on a cluster cannot be measured
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: Sergiusz Urbaniak
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-12-10 19:19 UTC by Clayton Coleman
Modified: 2021-02-24 15:42 UTC

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-24 15:41:57 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift cluster-monitoring-operator pull 1017 | 0 | None | closed | Bug 1906570: Mount /var/log/wtmp into node_exporter init container | 2021-02-21 08:10:53 UTC
Github | openshift node_exporter pull 74 | 0 | None | closed | Bug 1906570: Capture the number of boots by reading wtmp | 2021-02-21 08:10:53 UTC
Red Hat Product Errata | RHSA-2020:5633 | 0 | None | None | None | 2021-02-24 15:42:24 UTC

Description Clayton Coleman 2020-12-10 19:19:55 UTC
We currently lack a metric that tells us how many reboots have occurred on a cluster. A reboot impacts availability, tells us when admins or hardware decide to take an outage, and might be an accidental outcome of an incorrect change in our software. By tracking a counter of reboots per node, we can track the total number of reboots over time and gain better insight into how machines are managed by environment.

The wtmp log (accessible via the last command on RHCOS) is an effective counter for boots. Our node_exporter should read wtmp on startup and write the number of boots to a file for the textfile collector, so that each node reports its own count. We should then sum that count over all nodes and report the number of reboots back to telemetry.
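
The approach described above could be sketched roughly as follows. This is an illustration only, not the code from the linked pull requests. It assumes the glibc utmp record layout on 64-bit little-endian Linux (384-byte records, with ut_type as the first 16-bit field and BOOT_TIME equal to 2). The metric name node_boots_total and the helper names are hypothetical.

```python
import struct
from typing import Mapping

# Assumption: glibc `struct utmp` on 64-bit Linux is 384 bytes per record.
UTMP_RECORD_SIZE = 384
# ut_type value that marks a boot record in <utmp.h>.
BOOT_TIME = 2


def count_boots(wtmp_bytes: bytes) -> int:
    """Count BOOT_TIME records in the raw contents of a wtmp file."""
    boots = 0
    # Ignore a trailing partial record, if any.
    usable = len(wtmp_bytes) - len(wtmp_bytes) % UTMP_RECORD_SIZE
    for offset in range(0, usable, UTMP_RECORD_SIZE):
        # ut_type is a 16-bit integer at the start of each record
        # (little-endian is assumed here).
        (ut_type,) = struct.unpack_from("<h", wtmp_bytes, offset)
        if ut_type == BOOT_TIME:
            boots += 1
    return boots


def render_textfile_metric(boots: int, name: str = "node_boots_total") -> str:
    """Render the count in Prometheus text format for a textfile collector.

    The metric name is hypothetical, chosen only for this sketch.
    """
    return (
        f"# HELP {name} Number of boots recorded in wtmp.\n"
        f"# TYPE {name} counter\n"
        f"{name} {boots}\n"
    )


def cluster_boots(per_node_boots: Mapping[str, int]) -> int:
    """Sum the per-node boot counts into a single cluster-wide number."""
    return sum(per_node_boots.values())
```

On a node, the input would be the raw file contents, for example open("/var/log/wtmp", "rb").read(). Because wtmp can be rotated, a count derived this way only covers the boots still present in the file.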

This is part of gaining overall insight into the disruption we inject into customer clusters.

Comment 8 errata-xmlrpc 2021-02-24 15:41:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

