Bug 1851928 - "100% of the job ovnkube-node /ovn-kubernetes-node targets in namespace openshift-ovn-kubernetes are down" in 4.5.ci installs
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 4.6.0
Assignee: Surya Seetharaman
QA Contact: Ross Brattain
URL:
Whiteboard:
Depends On:
Blocks: 1851930 1854736
Reported: 2020-06-29 13:18 UTC by Vadim Rutkovsky
Modified: 2020-10-27 16:10 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1851930 (view as bug list)
Environment:
Last Closed: 2020-10-27 16:10:28 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift cluster-network-operator pull 689 | 0 | None | closed | Bug 1851928: [metrics] TargetDown alert is always fired in ovnkube-node job | 2020-10-10 15:35:10 UTC
Red Hat Product Errata | RHBA-2020:4196 | 0 | None | None | None | 2020-10-27 16:10:55 UTC

Description Vadim Rutkovsky 2020-06-29 13:18:02 UTC
Description of problem:
On OCP 4.5 with OVN, an alert about the ovn-kubernetes-node metrics fires shortly after install: `100% of the job ovnkube-node /ovn-kubernetes-node targets in namespace openshift-ovn-kubernetes are down`. The SDN is functioning correctly, so this appears to be a metrics issue.

4.5.rc.4 is not affected

Version-Release number of selected component (if applicable):
4.5.0-0.ci-2020-06-29-090822
4.6.0-0.ci-2020-06-29-103328


How reproducible:
Always

Comment 1 Surya Seetharaman 2020-07-01 08:05:21 UTC
I created an OCP 4.6.0 cluster and saw that this alert always fires. Moving this bug to the Verified state.

Investigating further, this is the TargetDown Alert defined in the cluster-monitoring-operator: https://github.com/openshift/cluster-monitoring-operator/blob/91b6c2073b231770f7829b0ab2b41a876e60569f/assets/prometheus-k8s/rules.yaml#L1823-L1829

The query used for this alert seems to check whether the percentage (number of ovnkube-node pods that are down (up==0)) / (total number of ovnkube-node pods that are up (up==1)) is greater than 10. I am not sure why Prometheus thinks the pods are not up when they actually are. I need to look into this further to find out why, and whether the problem is on the OVN side or on the monitoring side.
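
To make the threshold check concrete, here is a minimal Python sketch of the arithmetic described above. It is not the actual rule from the cluster-monitoring-operator. The function names and sample values are hypothetical, and it assumes the denominator is every scraped target in the job (both up and down), which may differ from what the real query does.

```python
def percent_down(up_samples):
    """Return the percentage of targets whose `up` sample is 0.

    `up_samples` is a hypothetical list holding one value per scrape
    target: 1 if Prometheus scraped it successfully, 0 otherwise.
    Assumption: the denominator is all targets, not only the up ones.
    """
    if not up_samples:
        raise ValueError("no targets scraped for this job")
    down = sum(1 for value in up_samples if value == 0)
    return 100.0 * down / len(up_samples)


def target_down_fires(up_samples, threshold=10.0):
    """Return True when the down percentage is strictly above the threshold."""
    return percent_down(up_samples) > threshold
```

Under these assumptions, a job where every target reports `up == 0` gives 100% down and trips the check, which matches the alert text even when the pods themselves are healthy and only the scrape is failing.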

Comment 3 Vadim Rutkovsky 2020-07-01 08:22:09 UTC
>Moving this bug to the Verified state.

I'm not following - the alert should not be firing, it needs to be fixed

Comment 4 Surya Seetharaman 2020-07-01 10:43:04 UTC
(In reply to Vadim Rutkovsky from comment #3)
> >Moving this bug to the Verified state.
> 
> I'm not following - the alert should not be firing, it needs to be fixed

Sorry, wrong assumption about what the "Verified" state means (please excuse my ignorance). What I meant to do was "mark it as reproduced", but I just realized that Verified doesn't stand for this. I am working on the fix and will have a patch up shortly.

Comment 5 Vadim Rutkovsky 2020-07-01 10:45:54 UTC
(In reply to Surya Seetharaman from comment #4)
> (In reply to Vadim Rutkovsky from comment #3)
> > >Moving this bug to the Verified state.
> > 
> > I'm not following - the alert should not be firing, it needs to be fixed
> 
> Sorry Wrong assumption of what "Verified" state means (please excuse my
> ignorance). What I meant to do was "mark it as reproduced" but I just
> realized that Verified doesn't stand for this. 
> I am working on the fix. Will
> have a patch up shortly.

I see. Assigned is a better state for this. Verified means QA has ensured the fix has been merged and released.

Comment 7 Ross Brattain 2020-07-14 19:32:57 UTC
Verified on 4.6.0-0.nightly-2020-07-14-092216

oc get pods -o yaml -l app=ovnkube-node | grep --color=auto inactivity-probe
          --inactivity-probe="${OVN_CONTROLLER_INACTIVITY_PROBE}" \
          --inactivity-probe="${OVN_CONTROLLER_INACTIVITY_PROBE}" \
          --inactivity-probe="${OVN_CONTROLLER_INACTIVITY_PROBE}" \
          --inactivity-probe="${OVN_CONTROLLER_INACTIVITY_PROBE}" \
          --inactivity-probe="${OVN_CONTROLLER_INACTIVITY_PROBE}" \
          --inactivity-probe="${OVN_CONTROLLER_INACTIVITY_PROBE}" \

Comment 8 Daniel Webster 2020-07-18 04:29:49 UTC
Also seeing this on

4.5.0-0.okd-2020-06-29-110348-beta6

Comment 10 errata-xmlrpc 2020-10-27 16:10:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

