Bug 2093016
| Field | Value | Field | Value |
|---|---|---|---|
| Summary | [azure disk] add metric and alert to help identify cascading test failures | | |
| Product | OpenShift Container Platform | Reporter | Jonathan Dobson <jdobson> |
| Component | Storage | Assignee | Jan Safranek <jsafrane> |
| Storage sub component | Storage | QA Contact | Rohit Patil <ropatil> |
| Status | CLOSED ERRATA | CC | jsafrane, ropatil |
| Severity | medium | Priority | medium |
| Version | 4.11 | Target Release | 4.12.0 |
| Hardware | All | OS | Linux |
| Doc Type | No Doc Update | Type | Bug |
| Last Closed | 2023-01-17 19:49:30 UTC | | |
Description
Jonathan Dobson
2022-06-02 18:15:04 UTC
We need to check:

* Why the volume is not mounted (and whether it was attached correctly).
* Whether we can get this info in a synthetic CI test, https://github.com/openshift/origin/tree/master/pkg/synthetictests.
* Whether we can detect it in the CSI driver and emit a metric (+ alert) for it, with some useful info on how to fix it.

The root cause of this issue was missing udev rules in RHCOS 8.6. This has been fixed in https://github.com/openshift/os/pull/836

Now, how can we report such errors better?

I added an alert to OCP that will report when all volume mounts or volume attachments are failing for a volume plugin on a node for 5 minutes. A single success will make the alert go away.

For example, create a PV pointing to a non-existing iSCSI volume and run a Pod with it. The alert should be Pending in 5 minutes and Firing in 10 minutes. Create a second Pod with a working iSCSI volume and run it on the same node as the first Pod. The alert should go away soon(ish), as there will be mixed iSCSI volume mount successes and failures on the node, and even a single success is enough not to trigger the alert.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399
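
The alert condition described in this report (every mount or attach for a volume plugin on a node fails, and a single success clears it) can be illustrated with a small Python sketch. This is not the actual alert rule shipped in OCP. The `VolumeOp` type, the `failing_groups` function, and the node and plugin names are hypothetical, and the 5-minute window and the Pending/Firing timing are not modeled; the input list stands for whatever operations were observed in one evaluation window.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple, Set, Tuple


class VolumeOp(NamedTuple):
    """One observed mount/attach operation (hypothetical record type)."""
    node: str    # node the operation ran on
    plugin: str  # volume plugin name, e.g. "iscsi" (illustrative value)
    ok: bool     # True if the operation succeeded


def failing_groups(ops: Iterable[VolumeOp]) -> Set[Tuple[str, str]]:
    """Return (node, plugin) pairs that saw failures and no successes."""
    successes = defaultdict(int)
    failures = defaultdict(int)
    for op in ops:
        key = (op.node, op.plugin)
        if op.ok:
            successes[key] += 1
        else:
            failures[key] += 1
    # A single success for the same (node, plugin) pair clears the condition.
    return {key for key, n in failures.items() if n > 0 and successes[key] == 0}


# Scenario from the report: a Pod with a non-existing iSCSI volume keeps failing.
window = [VolumeOp("node-a", "iscsi", False)] * 3
print(failing_groups(window))  # {('node-a', 'iscsi')}

# A second Pod with a working iSCSI volume lands on the same node.
window.append(VolumeOp("node-a", "iscsi", True))
print(failing_groups(window))  # set()
```

Grouping by (node, plugin) mirrors the behavior stated above: a success on a different node or with a different plugin would not clear the condition for the failing pair.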