Bug 2070047 - Kuryr: Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.11.0
Assignee: Maysa Macedo
QA Contact: Itay Matza
URL:
Whiteboard:
Depends On:
Blocks: 2077384
 
Reported: 2022-03-30 11:20 UTC by Maysa Macedo
Modified: 2022-08-10 11:03 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-08-10 11:02:42 UTC
Target Upstream Version:
Embargoed:




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift cluster-network-operator pull 1359 | 0 | None | open | Bug 2070047: Bump max value of hist quantile for kuryr_cni_request_duration | 2022-03-30 11:21:24 UTC
Github | openshift kuryr-kubernetes pull 647 | 0 | None | open | Bug 2070047: Increase cni_request_duration buckets | 2022-04-01 09:29:19 UTC
Red Hat Product Errata | RHSA-2022:5069 | 0 | None | None | None | 2022-08-10 11:03:01 UTC

Description Maysa Macedo 2022-03-30 11:20:13 UTC
Description of problem:

With Kuryr, CNI requests can take a considerable time because each request has to wait for a VIF from Neutron.
We've seen KuryrCNISlow warning alerts raised and reported by the following test:
"Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured". The test failure causes the Kuryr upgrade to fail.
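
The two linked pull requests (see Links) adjust the `kuryr_cni_request_duration` histogram: one increases the buckets, the other bumps the maximum value used with the histogram quantile. As a rough illustration of why bucket bounds matter, the Python sketch below mimics how a Prometheus-style quantile estimate behaves: when the requested quantile falls in the `+Inf` bucket, the estimate is capped at the highest finite bucket bound. The bucket bounds and counts are made-up values, not Kuryr's actual configuration, and the real alert expression is not reproduced here.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q (0 < q <= 1) from cumulative (upper_bound, count) buckets.

    The last bucket must have an upper bound of +Inf, as in Prometheus histograms.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile lies beyond the last finite bucket: the estimate
                # is capped at the highest finite upper bound.
                return buckets[-2][0]
            # Linear interpolation inside the bucket that contains the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count


# Hypothetical data: 40 of 100 requests took longer than 10 seconds.
narrow = [(1.0, 10), (5.0, 40), (10.0, 60), (math.inf, 100)]
# The same requests observed with additional, larger buckets.
wide = [(1.0, 10), (5.0, 40), (10.0, 60), (60.0, 95), (120.0, 100), (math.inf, 100)]

narrow_p90 = histogram_quantile(0.9, narrow)  # capped at the 10 s bound
wide_p90 = histogram_quantile(0.9, wide)      # interpolated inside the 10-60 s bucket
print(narrow_p90, wide_p90)
```

With the narrow buckets the estimate cannot exceed the last finite bound, so slow requests are indistinguishable from very slow ones. Larger buckets let the estimate, and any threshold compared against it, reflect the actual durations.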


Version-Release number of selected component (if applicable):


How reproducible:

Upgrade from OCP 4.9 to OCP 4.10.


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 4 Itay Matza 2022-04-28 13:53:27 UTC
Verified with the following steps:

- Installed OCP 4.10.0-0.nightly-2022-04-27-212741 on top of RHOS-16.1-RHEL-8-20220329.n.1 with Kuryr.

- Make sure the cluster is up and the Watchdog and AlertmanagerReceiversNotConfigured alerts exist:
```
(shiftstack) [stack@undercloud-0 ~]$ curl -sk -H "Authorization: Bearer $token" 'https://prometheus-k8s-openshift-monitoring.apps.ostest.shiftstack.com/api/v1/alerts' | jq '.data.alerts[] | select(.labels.alertname) | .labels.alertname'
"Watchdog"
"NodeClockNotSynchronising"
"NodeClockNotSynchronising"
"NodeClockNotSynchronising"
"NodeClockNotSynchronising"
"NodeClockNotSynchronising"
"APIRemovedInNextEUSReleaseInUse"
"APIRemovedInNextEUSReleaseInUse"
"AlertmanagerReceiversNotConfigured"
```

- Upgraded successfully to 4.11.0-0.nightly-2022-04-26-181148 using the upgrade command:
```
$ oc adm upgrade --to-image="registry.ci.openshift.org/ocp/release:4.11.0-0.nightly-2022-04-26-181148" --allow-explicit-upgrade --force=true 
```

- Make sure the cluster is up.

- Check the alerts: Watchdog and AlertmanagerReceiversNotConfigured exist, but KuryrCNISlow does not.
```
(shiftstack) [stack@undercloud-0 ~]$ curl -sk -H "Authorization: Bearer $token" 'https://prometheus-k8s-openshift-monitoring.apps.ostest.shiftstack.com/api/v1/alerts' | jq '.data.alerts[] | select(.labels.alertname) | .labels.alertname'
"NodeClockNotSynchronising"
"NodeClockNotSynchronising"
"NodeClockNotSynchronising"
"NodeClockNotSynchronising"
"NodeClockNotSynchronising"
"AlertmanagerReceiversNotConfigured"
"Watchdog"
```

- Keep checking the alerts and make sure the KuryrCNISlow alert is not raised.

- Destroy the cluster and recreate it with OCP 4.11.0-0.nightly-2022-04-26-181148.

- Keep checking the alerts and make sure the KuryrCNISlow alert is not raised.
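
The repeated check in the steps above can be scripted. Below is a minimal Python sketch, assuming a caller-supplied `fetch_names` callable that returns the current list of alert names, for example by wrapping the curl and jq command shown above. The function name, parameters, and defaults are hypothetical and not part of any OpenShift tooling.

```python
import time


def alert_stays_absent(fetch_names, alert_name, checks=5, interval=60.0, sleep=time.sleep):
    """Poll `checks` times; return False as soon as `alert_name` shows up.

    fetch_names: callable returning an iterable of alert names (assumed helper).
    sleep: injectable so the loop can be exercised without waiting.
    """
    for attempt in range(checks):
        if alert_name in fetch_names():
            return False
        if attempt < checks - 1:
            sleep(interval)
    return True


# Dry run with canned responses instead of a live Prometheus endpoint.
canned = iter([
    ["Watchdog", "AlertmanagerReceiversNotConfigured"],
    ["Watchdog", "NodeClockNotSynchronising"],
    ["Watchdog", "KuryrCNISlow"],
])
result = alert_stays_absent(lambda: next(canned), "KuryrCNISlow",
                            checks=3, interval=0, sleep=lambda _: None)
print(result)
```

In the dry run the alert appears on the third poll, so the function reports that it did not stay absent.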

Comment 6 Prasad Chaudhari 2022-06-23 08:00:05 UTC
A similar issue is seen on version 4.8.45.


Description of problem:

The test
"[sig-instrumentation] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]"

is failing consistently on the latest 4.8.45 build.


Version-Release number of selected component (if applicable):

[root@rdr-zscurst-348a-bastion-0 ~]# oc version
Client Version: 4.8.44
Server Version: 4.8.45
Kubernetes Version: v1.21.11+6b3cbdd


How reproducible:
Deploy the newly released 4.8.45 on the Power platform and run the e2e tests.


Actual results:
The test is failing.

Flaky invariants:

[sig-arch] Monitor cluster while tests execute

Failing tests:

[sig-instrumentation] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]

Expected results:
The test should pass without any errors.
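
For triage, the condition in the failing test's name can be approximated offline against a saved response from the Prometheus `/api/v1/alerts` endpoint. The Python sketch below lists firing alerts other than Watchdog and AlertmanagerReceiversNotConfigured. The sample payload and its alert states are illustrative, not captured from the affected cluster, and the real e2e test may apply additional rules that are not modeled here.

```python
import json

ALLOWED = {"Watchdog", "AlertmanagerReceiversNotConfigured"}


def unexpected_firing(payload, allowed=ALLOWED):
    """Return the sorted, de-duplicated names of firing alerts not in `allowed`."""
    names = set()
    for alert in payload.get("data", {}).get("alerts", []):
        name = alert.get("labels", {}).get("alertname")
        if name and alert.get("state") == "firing" and name not in allowed:
            names.add(name)
    return sorted(names)


# Illustrative payload in the shape returned by /api/v1/alerts.
sample = json.loads("""
{
  "status": "success",
  "data": {
    "alerts": [
      {"labels": {"alertname": "Watchdog"}, "state": "firing"},
      {"labels": {"alertname": "NodeClockNotSynchronising"}, "state": "firing"},
      {"labels": {"alertname": "NodeClockNotSynchronising"}, "state": "firing"},
      {"labels": {"alertname": "KuryrCNISlow"}, "state": "pending"},
      {"labels": {"alertname": "AlertmanagerReceiversNotConfigured"}, "state": "firing"}
    ]
  }
}
""")

offenders = unexpected_firing(sample)
print(offenders)
```

Any name in the resulting list is a candidate cause for the test failure. Pending alerts are ignored because the test name refers only to alerts in the firing state.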

Comment 7 errata-xmlrpc 2022-08-10 11:02:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

