Bug 1851928
Summary: | "100% of the job ovnkube-node /ovn-kubernetes-node targets in namespace openshift-ovn-kubernetes are down" in 4.5.ci installs
---|---
Product: | OpenShift Container Platform
Reporter: | Vadim Rutkovsky <vrutkovs>
Component: | Networking
Assignee: | Surya Seetharaman <surya>
Networking sub component: | ovn-kubernetes
QA Contact: | Ross Brattain <rbrattai>
Status: | CLOSED ERRATA
Severity: | high
Priority: | medium
CC: | aconstan, daniel.webster
Version: | 4.6
Target Release: | 4.6.0
Hardware: | Unspecified
OS: | Unspecified
Doc Type: | If docs needed, set a value
: | 1851930
Last Closed: | 2020-10-27 16:10:28 UTC
Type: | Bug
Bug Blocks: | 1851930, 1854736
Description
Vadim Rutkovsky
2020-06-29 13:18:02 UTC
I created an OCP 4.6.0 cluster and saw that this alert is always firing. Moving this bug to the Verified state.

Investigating further, this is the TargetDown alert defined in the cluster-monitoring-operator:
https://github.com/openshift/cluster-monitoring-operator/blob/91b6c2073b231770f7829b0ab2b41a876e60569f/assets/prometheus-k8s/rules.yaml#L1823-L1829

The query used for this alert seems to check whether the percentage (number of ovnkube-node pods that are down (up==0)) / (total number of ovnkube-node pods that are up (up==1)) is greater than 10. I am not sure why Prometheus thinks the pods are not up when they actually are. I need to look further into this to find out why, and to see whether this is something on the OVN side or on the monitoring side.

> Moving this bug to the Verified state.
I'm not following - the alert should not be firing, it needs to be fixed
(In reply to Vadim Rutkovsky from comment #3)
> > Moving this bug to the Verified state.
>
> I'm not following - the alert should not be firing, it needs to be fixed

Sorry, wrong assumption about what the "Verified" state means (please excuse my ignorance). What I meant to do was "mark it as reproduced", but I just realized that Verified doesn't stand for this. I am working on the fix and will have a patch up shortly.

(In reply to Surya Seetharaman from comment #4)
> (In reply to Vadim Rutkovsky from comment #3)
> > > Moving this bug to the Verified state.
> >
> > I'm not following - the alert should not be firing, it needs to be fixed
>
> Sorry Wrong assumption of what "Verified" state means (please excuse my
> ignorance). What I meant to do was "mark it as reproduced" but I just
> realized that Verified doesn't stand for this. I am working on the fix. Will
> have a patch up shortly.

I see, Assigned is a better state for this. Verified means QA has ensured that the fix has been merged and released.

Verified on 4.6.0-0.nightly-2020-07-14-092216

oc get pods -o yaml -l app=ovnkube-node | grep --color=auto inactivity-probe
--inactivity-probe="${OVN_CONTROLLER_INACTIVITY_PROBE}" \
--inactivity-probe="${OVN_CONTROLLER_INACTIVITY_PROBE}" \
--inactivity-probe="${OVN_CONTROLLER_INACTIVITY_PROBE}" \
--inactivity-probe="${OVN_CONTROLLER_INACTIVITY_PROBE}" \
--inactivity-probe="${OVN_CONTROLLER_INACTIVITY_PROBE}" \
--inactivity-probe="${OVN_CONTROLLER_INACTIVITY_PROBE}" \

Also seeing this on 4.5.0-0.okd-2020-06-29-110348-beta6

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2020:4196
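
For reference, the TargetDown check discussed in the comments can be sketched in Python. This is only an illustration, not the rule itself: it assumes the expression has the form `100 * count(up == 0) / count(up) > 10`, grouped by job, namespace and service, which matches the "100% of the ... targets are down" wording of the alert. The linked rules.yaml remains authoritative. If the denominator counted only targets that are up, as one comment's wording suggests, the ratio would be undefined when every target is down. The function names and the sample tuples are hypothetical.

```python
from collections import defaultdict


def target_down_percent(samples):
    """Return the percentage of down targets per (job, namespace, service).

    `samples` is an iterable of (job, namespace, service, up) tuples, where
    `up` is 1 if the last scrape succeeded and 0 otherwise. Groups with no
    down targets are omitted, mirroring how an empty `count(up == 0)` result
    drops the group from the division (assumed rule shape, see lead-in).
    """
    total = defaultdict(int)
    down = defaultdict(int)
    for job, namespace, service, up in samples:
        key = (job, namespace, service)
        total[key] += 1
        if up == 0:
            down[key] += 1
    return {
        key: 100.0 * down[key] / total[key]
        for key in total
        if down[key] > 0
    }


def firing_groups(samples, threshold=10.0):
    """Return the groups whose down percentage exceeds `threshold`."""
    percents = target_down_percent(samples)
    return {key: pct for key, pct in percents.items() if pct > threshold}


if __name__ == "__main__":
    # Hypothetical scrape results: six ovnkube-node targets, all reported down.
    group = ("ovnkube-node", "openshift-ovn-kubernetes", "ovn-kubernetes-node")
    samples = [group + (0,) for _ in range(6)]
    for key, pct in firing_groups(samples).items():
        print(f"{pct:.4g}% of the {key[0]}/{key[2]} targets in {key[1]} are down")
```

Under this assumed rule shape, a group in which Prometheus cannot scrape any target evaluates to 100 even when the pods themselves are running, which is consistent with the symptom reported here.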