Bug 2070047

Summary:	Kuryr: Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured
Product:	OpenShift Container Platform	Reporter:	Maysa Macedo <mdemaced>
Component:	Networking	Assignee:	Maysa Macedo <mdemaced>
Networking sub component:	kuryr	QA Contact:	Itay Matza <imatza>
Status:	CLOSED ERRATA	Docs Contact:
Severity:	medium
Priority:	medium	CC:	imatza, mbooth, mdulko, prachaud
Version:	4.6	Keywords:	Triaged
Target Milestone:	---
Target Release:	4.11.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	No Doc Update
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2022-08-10 11:02:42 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	2077384

Description Maysa Macedo 2022-03-30 11:20:13 UTC

Description of problem:

With Kuryr, CNI requests can take a considerable
time given that it has to wait for a VIF from Neutron.
We've seen KuryrCNISlow warning alerts being raised and reported by the following test:
"Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured". The test failure causes the Kuryr upgrade to fail.
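
For illustration, the condition in the test title can be approximated as a filter over the JSON returned by the Prometheus /api/v1/alerts endpoint. This is only a minimal sketch, not the actual test code: the function name unexpected_firing_alerts and the sample payload are hypothetical, and the allow-list contains only the two alerts named in the test title.

```python
# Alerts that the test title allows to be in firing state.
ALLOWED = {"Watchdog", "AlertmanagerReceiversNotConfigured"}


def unexpected_firing_alerts(payload, allowed=ALLOWED):
    """Return the sorted names of firing alerts that are not allow-listed.

    `payload` is the decoded JSON body of GET /api/v1/alerts.
    """
    names = set()
    for alert in payload.get("data", {}).get("alerts", []):
        name = alert.get("labels", {}).get("alertname")
        # Pending alerts are ignored; only firing ones count.
        if name and alert.get("state") == "firing" and name not in allowed:
            names.add(name)
    return sorted(names)


# Hypothetical payload, made up for this example (not captured from a cluster).
sample = {
    "status": "success",
    "data": {
        "alerts": [
            {"labels": {"alertname": "Watchdog"}, "state": "firing"},
            {"labels": {"alertname": "KuryrCNISlow"}, "state": "firing"},
            {"labels": {"alertname": "NodeClockNotSynchronising"}, "state": "pending"},
        ]
    },
}

print(unexpected_firing_alerts(sample))
```

With this made-up payload only KuryrCNISlow is reported, since Watchdog is allow-listed and the pending alert is not firing.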


Version-Release number of selected component (if applicable):


How reproducible:

Upgrade from OCP 4.9 to OCP 4.10.



Comment 4 Itay Matza 2022-04-28 13:53:27 UTC

Verified with the following steps:

- Installed OCP 4.10.0-0.nightly-2022-04-27-212741 on top of RHOS-16.1-RHEL-8-20220329.n.1 with Kuryr.

- Made sure the cluster is up and the Watchdog and AlertmanagerReceiversNotConfigured alerts exist:
```
(shiftstack) [stack@undercloud-0 ~]$ curl -sk -H "Authorization: Bearer $token" 'https://prometheus-k8s-openshift-monitoring.apps.ostest.shiftstack.com/api/v1/alerts' | jq '.data.alerts[] | select(.labels.alertname) | .labels.alertname'
"Watchdog"
"NodeClockNotSynchronising"
"NodeClockNotSynchronising"
"NodeClockNotSynchronising"
"NodeClockNotSynchronising"
"NodeClockNotSynchronising"
"APIRemovedInNextEUSReleaseInUse"
"APIRemovedInNextEUSReleaseInUse"
"AlertmanagerReceiversNotConfigured"
```

- Upgraded successfully to 4.11.0-0.nightly-2022-04-26-181148 using the upgrade command:
```
$ oc adm upgrade --to-image="registry.ci.openshift.org/ocp/release:4.11.0-0.nightly-2022-04-26-181148" --allow-explicit-upgrade --force=true 
```

- Made sure the cluster is up.

- Checked the alerts: the Watchdog and AlertmanagerReceiversNotConfigured alerts exist, but KuryrCNISlow does not.
```
(shiftstack) [stack@undercloud-0 ~]$ curl -sk -H "Authorization: Bearer $token" 'https://prometheus-k8s-openshift-monitoring.apps.ostest.shiftstack.com/api/v1/alerts' | jq '.data.alerts[] | select(.labels.alertname) | .labels.alertname'
"NodeClockNotSynchronising"
"NodeClockNotSynchronising"
"NodeClockNotSynchronising"
"NodeClockNotSynchronising"
"NodeClockNotSynchronising"
"AlertmanagerReceiversNotConfigured"
"Watchdog"
```

- Kept checking the alerts and made sure KuryrCNISlow is not raised.

- Destroyed and re-created the cluster with OCP 4.11.0-0.nightly-2022-04-26-181148.

- Kept checking the alerts and made sure KuryrCNISlow is not raised.
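
The repeated "keep checking" steps above could be automated with a small polling loop. The following is a rough sketch only, assuming the same /api/v1/alerts endpoint and bearer token used in the curl commands; the helper names fetch_alerts and alert_never_raised, the number of checks, and the interval are all hypothetical choices, not part of the verification that was actually run.

```python
import json
import ssl
import time
import urllib.request


def fetch_alerts(base_url, token):
    """GET <base_url>/api/v1/alerts with a bearer token.

    Certificate verification is disabled to mirror `curl -k`;
    do not do this outside a test environment.
    """
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    req = urllib.request.Request(
        base_url.rstrip("/") + "/api/v1/alerts",
        headers={"Authorization": "Bearer " + token},
    )
    with urllib.request.urlopen(req, context=ctx, timeout=30) as resp:
        return json.load(resp)["data"]["alerts"]


def alert_never_raised(fetch, alertname, checks=10, interval=60.0, sleep=time.sleep):
    """Call `fetch()` up to `checks` times, `interval` seconds apart.

    Returns False as soon as `alertname` shows up, True otherwise.
    """
    for i in range(checks):
        names = {a.get("labels", {}).get("alertname") for a in fetch()}
        if alertname in names:
            return False
        if i < checks - 1:
            sleep(interval)
    return True


# Example wiring (not executed here; needs a reachable cluster and a token):
# url = "https://prometheus-k8s-openshift-monitoring.apps.ostest.shiftstack.com"
# ok = alert_never_raised(lambda: fetch_alerts(url, token), "KuryrCNISlow")
```

Passing the fetch function in as an argument keeps the polling logic independent of the HTTP call, so it can be exercised with a stub instead of a live cluster.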

Comment 6 Prasad Chaudhari 2022-06-23 08:00:05 UTC

A similar issue is seen on version 4.8.45.


Description of problem:

The test
"[sig-instrumentation] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]"
is failing consistently on the latest 4.8.45 build.


Version-Release number of selected component (if applicable):

[root@rdr-zscurst-348a-bastion-0 ~]# oc version
Client Version: 4.8.44
Server Version: 4.8.45
Kubernetes Version: v1.21.11+6b3cbdd


How reproducible:
Deploy the newly released 4.8.45 on the Power platform and run the e2e tests.


Actual results:
Test is failing.

Flaky invariants:

[sig-arch] Monitor cluster while tests execute

Failing tests:

[sig-instrumentation] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]

Expected results:
Test should pass without any error.

Comment 7 errata-xmlrpc 2022-08-10 11:02:42 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069