Bug 1851928

Summary: "100% of the job ovnkube-node /ovn-kubernetes-node targets in namespace openshift-ovn-kubernetes are down" in 4.5.ci installs
Product: OpenShift Container Platform
Component: Networking
Networking sub component: ovn-kubernetes
Version: 4.6
Target Release: 4.6.0
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: high
Priority: medium
Type: Bug
Reporter: Vadim Rutkovsky <vrutkovs>
Assignee: Surya Seetharaman <surya>
QA Contact: Ross Brattain <rbrattai>
CC: aconstan, daniel.webster
Doc Type: If docs needed, set a value
Last Closed: 2020-10-27 16:10:28 UTC
Bug Blocks: 1851930, 1854736

Description Vadim Rutkovsky 2020-06-29 13:18:02 UTC
Description of problem:
On OCP 4.5 with OVN, an alert about the ovn-kubernetes-node metrics targets is displayed shortly after install: `100% of the job ovnkube-node /ovn-kubernetes-node targets in namespace openshift-ovn-kubernetes are down`. The SDN is functioning correctly, so this appears to be a metrics issue.

4.5.rc.4 is not affected

Version-Release number of selected component (if applicable):
4.5.0-0.ci-2020-06-29-090822
4.6.0-0.ci-2020-06-29-103328


How reproducible:
Always

Comment 1 Surya Seetharaman 2020-07-01 08:05:21 UTC
I created an OCP 4.6.0 cluster and saw that this alert always fires. Moving this bug to the Verified state.

Investigating further, this is the TargetDown Alert defined in the cluster-monitoring-operator: https://github.com/openshift/cluster-monitoring-operator/blob/91b6c2073b231770f7829b0ab2b41a876e60569f/assets/prometheus-k8s/rules.yaml#L1823-L1829

The query used for this alert seems to check whether the percentage (number of ovnkube-node pods that are down (up==0)) / (total number of ovnkube-node pods that are up (up==1)) is greater than 10. I am not sure why Prometheus thinks the pods are not up when they actually are. I need to look further into this to find out why, and to see whether this is something on the OVN side or on the monitoring side.
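
For readers following along, here is a rough Python sketch of the percentage check described in this comment. It is not the actual Prometheus rule. It assumes the denominator is the count of all scraped targets in the group (up or down), since that is what would let the alert report 100%. The function names and the sample tuple layout are hypothetical; the label values are taken from the alert text in the description.

```python
from collections import defaultdict


def target_down_percent(samples):
    """Return the percentage of down targets per (job, namespace, service).

    `samples` is an iterable of (job, namespace, service, up) tuples, where
    `up` is 1 if the last scrape succeeded and 0 if it failed.
    Assumption: the denominator counts every target in the group.
    """
    total = defaultdict(int)
    down = defaultdict(int)
    for job, namespace, service, up in samples:
        key = (job, namespace, service)
        total[key] += 1
        if up == 0:
            down[key] += 1
    return {key: 100.0 * down[key] / total[key] for key in total}


def firing_groups(samples, threshold=10.0):
    """Return only the groups whose down percentage exceeds the threshold."""
    percents = target_down_percent(samples)
    return {key: pct for key, pct in percents.items() if pct > threshold}


# Three ovnkube-node targets whose scrapes all fail, as in this bug.
example = [
    ("ovnkube-node", "openshift-ovn-kubernetes", "ovn-kubernetes-node", 0),
    ("ovnkube-node", "openshift-ovn-kubernetes", "ovn-kubernetes-node", 0),
    ("ovnkube-node", "openshift-ovn-kubernetes", "ovn-kubernetes-node", 0),
]
print(firing_groups(example))
```

Under these assumptions, a group reaches 100% only when every scrape of every target in it fails, which would fit pods that are running but whose metrics endpoint cannot be scraped.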

Comment 3 Vadim Rutkovsky 2020-07-01 08:22:09 UTC
>Moving this bug to the Verified state.

I'm not following - the alert should not be firing, it needs to be fixed

Comment 4 Surya Seetharaman 2020-07-01 10:43:04 UTC
(In reply to Vadim Rutkovsky from comment #3)
> >Moving this bug to the Verified state.
> 
> I'm not following - the alert should not be firing, it needs to be fixed

Sorry, I made a wrong assumption about what the "Verified" state means (please excuse my ignorance). What I meant to do was "mark it as reproduced", but I just realized that Verified doesn't stand for this. I am working on the fix and will have a patch up shortly.

Comment 5 Vadim Rutkovsky 2020-07-01 10:45:54 UTC
(In reply to Surya Seetharaman from comment #4)
> (In reply to Vadim Rutkovsky from comment #3)
> > >Moving this bug to the Verified state.
> > 
> > I'm not following - the alert should not be firing, it needs to be fixed
> 
> Sorry Wrong assumption of what "Verified" state means (please excuse my
> ignorance). What I meant to do was "mark it as reproduced" but I just
> realized that Verified doesn't stand for this. 
> I am working on the fix. Will
> have a patch up shortly.

I see, Assigned is a better state for this. Verified means QA has ensured the fix has been merged and released.

Comment 7 Ross Brattain 2020-07-14 19:32:57 UTC
Verified on 4.6.0-0.nightly-2020-07-14-092216

oc get pods -o yaml -l app=ovnkube-node | grep --color=auto inactivity-probe
          --inactivity-probe="${OVN_CONTROLLER_INACTIVITY_PROBE}" \
          --inactivity-probe="${OVN_CONTROLLER_INACTIVITY_PROBE}" \
          --inactivity-probe="${OVN_CONTROLLER_INACTIVITY_PROBE}" \
          --inactivity-probe="${OVN_CONTROLLER_INACTIVITY_PROBE}" \
          --inactivity-probe="${OVN_CONTROLLER_INACTIVITY_PROBE}" \
          --inactivity-probe="${OVN_CONTROLLER_INACTIVITY_PROBE}" \

Comment 8 Daniel Webster 2020-07-18 04:29:49 UTC
Also seeing this on

4.5.0-0.okd-2020-06-29-110348-beta6

Comment 10 errata-xmlrpc 2020-10-27 16:10:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196