Description of problem:
OCP 4.5 with OVN alerts about misconfigured ovn-kubernetes-node metrics: the `100% of the job ovnkube-node /ovn-kubernetes-node targets in namespace openshift-ovn-kubernetes are down` alert is displayed shortly after install. SDN is functioning correctly, so it appears to be a metrics issue. 4.5.rc.4 is not affected.

Version-Release number of selected component (if applicable):
4.5.0-0.ci-2020-06-29-090822
4.6.0-0.ci-2020-06-29-103328

How reproducible:
Always
I created an OCP 4.6.0 cluster and saw that this alert is always firing. Moving this bug to the Verified state.

Investigating further, this is the TargetDown alert defined in the cluster-monitoring-operator: https://github.com/openshift/cluster-monitoring-operator/blob/91b6c2073b231770f7829b0ab2b41a876e60569f/assets/prometheus-k8s/rules.yaml#L1823-L1829

The query used for this alert seems to check whether the percentage (number of ovnkube-node pods that are down (up==0)) / (total number of ovnkube-node pods that are up (up==1)) is greater than 10. I am not sure why Prometheus thinks the pods are not up when they actually are. I need to look further into this to find out why, and whether this is something on the OVN side or on the monitoring side.
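To make the alert arithmetic concrete, here is a minimal Python sketch of the check as I read it. It is an illustration only, not the rule itself. It assumes the expression groups `up` samples by `job`, `namespace` and `service`, and that the denominator counts all scraped targets in the group rather than only those with up==1. The second assumption is my reading: with an up==1 denominator, a job whose targets are all down would have no denominator series and could not report 100%. The function names and sample data are hypothetical.

```python
from collections import defaultdict


def target_down_percent(samples):
    """Return {(job, namespace, service): percent of targets with up == 0}.

    `samples` is an iterable of (labels, up_value) pairs, standing in for the
    Prometheus `up` series. Groups without any down target are omitted, which
    mirrors how a count over `up == 0` yields no series for healthy groups.
    """
    down = defaultdict(int)
    total = defaultdict(int)
    for labels, up in samples:
        key = (labels["job"], labels["namespace"], labels["service"])
        total[key] += 1  # assumed denominator: every scraped target
        if up == 0:
            down[key] += 1
    return {k: 100.0 * down[k] / total[k] for k in total if down[k] > 0}


def firing(samples, threshold=10.0):
    """Groups whose down percentage exceeds the threshold (10 per the comment above)."""
    return {k: v for k, v in target_down_percent(samples).items() if v > threshold}


# Hypothetical data: three ovnkube-node targets, all reported down by Prometheus.
ovn_labels = {
    "job": "ovnkube-node",
    "namespace": "openshift-ovn-kubernetes",
    "service": "ovn-kubernetes-node",
}
ovn_samples = [(ovn_labels, 0)] * 3
print(firing(ovn_samples))
```

With every target in the group scraped as down, the group evaluates to 100%, which matches the alert text in the description. That points at the scrape failing (endpoint, port or TLS configuration) rather than at the pods themselves being unhealthy.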
> Moving this bug to the Verified state.

I'm not following - the alert should not be firing, it needs to be fixed.
(In reply to Vadim Rutkovsky from comment #3)
> > Moving this bug to the Verified state.
>
> I'm not following - the alert should not be firing, it needs to be fixed

Sorry, wrong assumption about what the "Verified" state means (please excuse my ignorance). What I meant to do was "mark it as reproduced", but I just realized that Verified doesn't stand for that. I am working on the fix and will have a patch up shortly.
(In reply to Surya Seetharaman from comment #4)
> (In reply to Vadim Rutkovsky from comment #3)
> > > Moving this bug to the Verified state.
> >
> > I'm not following - the alert should not be firing, it needs to be fixed
>
> Sorry Wrong assumption of what "Verified" state means (please excuse my
> ignorance). What I meant to do was "mark it as reproduced" but I just
> realized that Verified doesn't stand for this.
> I am working on the fix. Will
> have a patch up shortly.

I see, Assigned is a better state for this. Verified means QA has ensured the fix has been merged and released.
Verified on 4.6.0-0.nightly-2020-07-14-092216

oc get pods -o yaml -l app=ovnkube-node | grep --color=auto inactivity-probe
            --inactivity-probe="${OVN_CONTROLLER_INACTIVITY_PROBE}" \
            --inactivity-probe="${OVN_CONTROLLER_INACTIVITY_PROBE}" \
            --inactivity-probe="${OVN_CONTROLLER_INACTIVITY_PROBE}" \
            --inactivity-probe="${OVN_CONTROLLER_INACTIVITY_PROBE}" \
            --inactivity-probe="${OVN_CONTROLLER_INACTIVITY_PROBE}" \
            --inactivity-probe="${OVN_CONTROLLER_INACTIVITY_PROBE}" \
Also seeing this on 4.5.0-0.okd-2020-06-29-110348-beta6
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196