Bug 2008911 - Prometheus repeatedly scaling prometheus-operator replica set
Summary: Prometheus repeatedly scaling prometheus-operator replica set
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.10.0
Assignee: Simon Pasquier
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-09-29 13:30 UTC by Stephen Benjamin
Modified: 2022-03-10 16:14 UTC
CC: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
[sig-arch] events should not repeat pathologically [sig-arch][Late] operators should not create watch channels very often [Suite:openshift/conformance/parallel]
Last Closed: 2022-03-10 16:13:56 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-monitoring-operator pull 1407 0 None open Bug 2008911: Revert "Configure prometheus operator TLS based on the cluster APIServer config" 2021-09-29 20:21:25 UTC
Github openshift cluster-monitoring-operator pull 1409 0 None open Bug 2008911: Set arguments in a deterministic order 2021-09-30 11:31:57 UTC
Red Hat Product Errata RHSA-2022:0056 0 None None None 2022-03-10 16:14:21 UTC

Description Stephen Benjamin 2021-09-29 13:30:02 UTC
It looks like, possibly since https://github.com/openshift/cluster-monitoring-operator/pull/1380, we're seeing high rates of failures in 2 tests.

Example job: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-aws-serial/1443049181332639744

It looks like the prometheus-operator deployment is repeatedly being scaled up and down:

- [sig-arch] events should not repeat pathologically

event happened 21 times, something is wrong: ns/openshift-monitoring deployment/prometheus-operator - reason/ScalingReplicaSet (combined from similar events): Scaled down replica set prometheus-operator-7b849bbfbf to 0
event happened 22 times, something is wrong: ns/openshift-monitoring deployment/prometheus-operator - reason/ScalingReplicaSet (combined from similar events): Scaled up replica set prometheus-operator-6c9df867b5 to 1
event happened 28 times, something is wrong: ns/openshift-monitoring deployment/prometheus-operator - reason/ScalingReplicaSet (combined from similar events): Scaled up replica set prometheus-operator-69f878c5b5 to 1

The second test is:

- [sig-arch][Late] operators should not create watch channels very often [Suite:openshift/conformance/parallel]

operator=prometheus-operator, watchrequestcount=221, upperbound=180, ratio=1.2277777777777779
Operator prometheus-operator produces more watch requests than expected

If the increase in watches is expected, we can up that watch limit, but I wasn't sure if it was related to the first failure.
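
One quick way to pull the repeated scaling events out of a live cluster (just a sketch; it assumes the cluster is still up and the churn is still happening, since events age out) is a field-selector query on the Deployment's events:

# oc -n openshift-monitoring get events --field-selector involvedObject.kind=Deployment,involvedObject.name=prometheus-operator,reason=ScalingReplicaSet --sort-by=.lastTimestamp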

Comment 3 Junqi Zhao 2021-09-30 05:39:10 UTC
Checked with 4.10.0-0.nightly-2021-09-29-060153, which includes https://github.com/openshift/cluster-monitoring-operator/pull/1380/files; prometheus-operator is restarted frequently:
# while true; do date; oc -n openshift-monitoring get pod | grep prometheus-operator; oc -n openshift-monitoring get rs | grep prometheus-operator; echo -e "\n";sleep 20s; done
Thu Sep 30 01:30:20 EDT 2021
prometheus-operator-859d6b85dc-fr42q           2/2     Running   0               74s
prometheus-operator-58c776545            0         0         0       19m
prometheus-operator-5f6b8599fc           0         0         0       21m
prometheus-operator-5fd66599c7           0         0         0       26m
prometheus-operator-5fdbc74fdf           0         0         0       14m
prometheus-operator-68fbbbcfcc           0         0         0       16m
prometheus-operator-76db79b8cc           0         0         0       101s
prometheus-operator-7c478957b9           0         0         0       11m
prometheus-operator-85874c5899           0         0         0       14m
prometheus-operator-859d6b85dc           1         1         1       80s
prometheus-operator-85fd4d457c           0         0         0       16m
prometheus-operator-c7b897c47            0         0         0       6m41s


Thu Sep 30 01:30:50 EDT 2021
prometheus-operator-58b7dd8db7-fq775           2/2     Running   0               5s
prometheus-operator-58b7dd8db7           1         1         1       10s
prometheus-operator-58c776545            0         0         0       19m
prometheus-operator-5f6b8599fc           0         0         0       22m
prometheus-operator-5fdbc74fdf           0         0         0       15m
prometheus-operator-68fbbbcfcc           0         0         0       17m
prometheus-operator-76db79b8cc           0         0         0       2m11s
prometheus-operator-7c478957b9           0         0         0       12m
prometheus-operator-85874c5899           0         0         0       14m
prometheus-operator-859d6b85dc           0         0         0       110s
prometheus-operator-85fd4d457c           0         0         0       16m
prometheus-operator-c7b897c47            0         0         0       7m11s


Thu Sep 30 01:31:20 EDT 2021
prometheus-operator-77f978fc69-pbsms           2/2     Running   0               16s
prometheus-operator-58b7dd8db7           0         0         0       40s
prometheus-operator-58c776545            0         0         0       20m
prometheus-operator-5fdbc74fdf           0         0         0       15m
prometheus-operator-68fbbbcfcc           0         0         0       17m
prometheus-operator-76db79b8cc           0         0         0       2m41s
prometheus-operator-77f978fc69           1         1         1       21s
prometheus-operator-7c478957b9           0         0         0       12m
prometheus-operator-85874c5899           0         0         0       15m
prometheus-operator-859d6b85dc           0         0         0       2m20s
prometheus-operator-85fd4d457c           0         0         0       17m
prometheus-operator-c7b897c47            0         0         0       7m41s


Thu Sep 30 01:31:51 EDT 2021
prometheus-operator-77f978fc69-pbsms           2/2     Running   0               48s
prometheus-operator-58b7dd8db7           0         0         0       72s
prometheus-operator-58c776545            0         0         0       20m
prometheus-operator-5fdbc74fdf           0         0         0       16m
prometheus-operator-68fbbbcfcc           0         0         0       18m
prometheus-operator-76db79b8cc           0         0         0       3m13s
prometheus-operator-77f978fc69           1         1         1       53s
prometheus-operator-7c478957b9           0         0         0       13m
prometheus-operator-85874c5899           0         0         0       15m
prometheus-operator-859d6b85dc           0         0         0       2m52s
prometheus-operator-85fd4d457c           0         0         0       17m
prometheus-operator-c7b897c47            0         0         0       8m13s

Comment 4 Junqi Zhao 2021-09-30 05:40:25 UTC
Also checked with 4.9.0-0.nightly-2021-09-29-172320; no such issue:
# while true; do date; oc -n openshift-monitoring get pod | grep prometheus-operator; oc -n openshift-monitoring get rs | grep prometheus-operator; echo -e "\n";sleep 20s; done
Thu Sep 30 01:30:07 EDT 2021
prometheus-operator-9f8f5bf6b-dzfzx            2/2     Running   1 (6h13m ago)   6h14m
prometheus-operator-9f8f5bf6b            1         1         1       6h14m


Thu Sep 30 01:30:39 EDT 2021
prometheus-operator-9f8f5bf6b-dzfzx            2/2     Running   1 (6h14m ago)   6h14m
prometheus-operator-9f8f5bf6b            1         1         1       6h14m


Thu Sep 30 01:31:11 EDT 2021
prometheus-operator-9f8f5bf6b-dzfzx            2/2     Running   1 (6h14m ago)   6h15m
prometheus-operator-9f8f5bf6b            1         1         1       6h15m


Thu Sep 30 01:31:43 EDT 2021
prometheus-operator-9f8f5bf6b-dzfzx            2/2     Running   1 (6h15m ago)   6h15m
prometheus-operator-9f8f5bf6b            1         1         1       6h15m

Comment 6 Junqi Zhao 2021-09-30 06:19:24 UTC
Also checked with 4.10.0-0.nightly-2021-09-27-141350, which does not include https://github.com/openshift/cluster-monitoring-operator/pull/1380/files; no such issue:
# while true; do date; oc -n openshift-monitoring get pod | grep prometheus-operator; oc -n openshift-monitoring get rs | grep prometheus-operator; echo -e "\n";sleep 30s; done
Thu Sep 30 02:05:13 EDT 2021
prometheus-operator-778d67588-t8ltm            2/2     Running   1 (23m ago)   24m
prometheus-operator-778d67588            1         1         1       24m


Thu Sep 30 02:05:54 EDT 2021
prometheus-operator-778d67588-t8ltm            2/2     Running   1 (24m ago)   25m
prometheus-operator-778d67588            1         1         1       25m
...
Thu Sep 30 02:14:04 EDT 2021
prometheus-operator-778d67588-t8ltm            2/2     Running   1 (32m ago)   33m
prometheus-operator-778d67588            1         1         1       33m


Thu Sep 30 02:14:45 EDT 2021
prometheus-operator-778d67588-t8ltm            2/2     Running   1 (33m ago)   34m
prometheus-operator-778d67588            1         1         1       34m

Comment 8 Eran Cohen 2021-09-30 11:02:09 UTC
This issue is failing the "operators should not create watch channels very often" test in the nightly-4.10-e2e-aws-single-node job.
The last 5 runs failed with:
https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-single-node/1443429427324129280
Sep 30 05:56:44.317: INFO: operator=prometheus-operator, watchrequestcount=217, upperbound=180, ratio=1.2055555555555555

https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-single-node/1443320451118927872
Sep 29 22:22:20.760: INFO: operator=prometheus-operator, watchrequestcount=185, upperbound=180, ratio=1.0277777777777777

https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-single-node/1443228422246502400
Sep 29 16:30:48.651: INFO: operator=prometheus-operator, watchrequestcount=263, upperbound=180, ratio=1.461111111111111

https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-single-node/1443094835195023360
Sep 29 07:37:32.531: INFO: operator=prometheus-operator, watchrequestcount=184, upperbound=180, ratio=1.0222222222222221

https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-single-node/1443038333637758976
Sep 29 03:57:22.291: INFO: operator=prometheus-operator, watchrequestcount=308, upperbound=180, ratio=1.711111111111111

Comment 9 Simon Pasquier 2021-10-04 08:44:04 UTC
The change that introduced the issue has been reverted in https://github.com/openshift/cluster-monitoring-operator/pull/1407. Moving to MODIFIED.
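
As a rough extra check on a fixed build (a sketch, separate from the e2e tests), the Deployment's metadata.generation should stay constant once the operator stops rewriting the pod template; on the broken builds it keeps climbing:

# while true; do date; oc -n openshift-monitoring get deployment prometheus-operator -o jsonpath='{.metadata.generation}{"\n"}'; sleep 30s; done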

Comment 12 Junqi Zhao 2021-10-08 03:26:06 UTC
Tested with 4.10.0-0.nightly-2021-10-07-175229; only one prometheus-operator ReplicaSet now:
# while true; do date; oc -n openshift-monitoring get pod | grep prometheus-operator; oc -n openshift-monitoring get rs | grep prometheus-operator; echo -e "\n";sleep 20s; done
Thu Oct  7 22:48:35 EDT 2021
prometheus-operator-79b57fc9bf-vpgnj           2/2     Running   1 (113m ago)   115m
prometheus-operator-79b57fc9bf           1         1         1       115m


Thu Oct  7 22:49:06 EDT 2021
prometheus-operator-79b57fc9bf-vpgnj           2/2     Running   1 (114m ago)   115m
prometheus-operator-79b57fc9bf           1         1         1       116m


Thu Oct  7 22:49:36 EDT 2021
prometheus-operator-79b57fc9bf-vpgnj           2/2     Running   1 (114m ago)   116m
prometheus-operator-79b57fc9bf           1         1         1       116m


Thu Oct  7 22:50:07 EDT 2021
prometheus-operator-79b57fc9bf-vpgnj           2/2     Running   1 (115m ago)   116m
prometheus-operator-79b57fc9bf           1         1         1       117m


Thu Oct  7 22:50:37 EDT 2021
prometheus-operator-79b57fc9bf-vpgnj           2/2     Running   1 (115m ago)   117m
prometheus-operator-79b57fc9bf           1         1         1       117m

Comment 17 errata-xmlrpc 2022-03-10 16:13:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

