It looks like, since https://github.com/openshift/cluster-monitoring-operator/pull/1380, we're seeing high failure rates in 2 tests. Example job: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-aws-serial/1443049181332639744

It looks like prometheus-operator is repeatedly being scaled up and down:

- [sig-arch] events should not repeat pathologically
  event happened 21 times, something is wrong: ns/openshift-monitoring deployment/prometheus-operator - reason/ScalingReplicaSet (combined from similar events): Scaled down replica set prometheus-operator-7b849bbfbf to 0
  event happened 22 times, something is wrong: ns/openshift-monitoring deployment/prometheus-operator - reason/ScalingReplicaSet (combined from similar events): Scaled up replica set prometheus-operator-6c9df867b5 to 1
  event happened 28 times, something is wrong: ns/openshift-monitoring deployment/prometheus-operator - reason/ScalingReplicaSet (combined from similar events): Scaled up replica set prometheus-operator-69f878c5b5 to 1

The second test is:

- [sig-arch][Late] operators should not create watch channels very often [Suite:openshift/conformance/parallel]
  operator=prometheus-operator, watchrequestcount=221, upperbound=180, ratio=1.2277777777777779
  Operator prometheus-operator produces more watch requests than expected

If the increase in watches is expected, we can raise that watch limit, but I wasn't sure whether it was related to the first failure.
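For reference, the ratio the test reports is just watchrequestcount divided by upperbound; a quick sanity check of the number above (plain shell, nothing cluster-specific):

```shell
# The watch-channel test's ratio is watchrequestcount / upperbound:
# 221 watch requests against an upper bound of 180.
awk 'BEGIN { printf "ratio=%.3f\n", 221 / 180 }'
# prints ratio=1.228
```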
search.ci's graph shows it started yesterday: https://search.ci.openshift.org/chart?search=something+is+wrong%3A+ns%2Fopenshift-monitoring+deployment%2Fprometheus-operator+-+reason%2FScalingReplicaSet+&maxAge=48h&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
checked with 4.10.0-0.nightly-2021-09-29-060153, which includes https://github.com/openshift/cluster-monitoring-operator/pull/1380/files; prometheus-operator is restarted frequently:

# while true; do date; oc -n openshift-monitoring get pod | grep prometheus-operator; oc -n openshift-monitoring get rs | grep prometheus-operator; echo -e "\n"; sleep 20s; done
Thu Sep 30 01:30:20 EDT 2021
prometheus-operator-859d6b85dc-fr42q   2/2   Running   0   74s
prometheus-operator-58c776545    0   0   0   19m
prometheus-operator-5f6b8599fc   0   0   0   21m
prometheus-operator-5fd66599c7   0   0   0   26m
prometheus-operator-5fdbc74fdf   0   0   0   14m
prometheus-operator-68fbbbcfcc   0   0   0   16m
prometheus-operator-76db79b8cc   0   0   0   101s
prometheus-operator-7c478957b9   0   0   0   11m
prometheus-operator-85874c5899   0   0   0   14m
prometheus-operator-859d6b85dc   1   1   1   80s
prometheus-operator-85fd4d457c   0   0   0   16m
prometheus-operator-c7b897c47    0   0   0   6m41s

Thu Sep 30 01:30:50 EDT 2021
prometheus-operator-58b7dd8db7-fq775   2/2   Running   0   5s
prometheus-operator-58b7dd8db7   1   1   1   10s
prometheus-operator-58c776545    0   0   0   19m
prometheus-operator-5f6b8599fc   0   0   0   22m
prometheus-operator-5fdbc74fdf   0   0   0   15m
prometheus-operator-68fbbbcfcc   0   0   0   17m
prometheus-operator-76db79b8cc   0   0   0   2m11s
prometheus-operator-7c478957b9   0   0   0   12m
prometheus-operator-85874c5899   0   0   0   14m
prometheus-operator-859d6b85dc   0   0   0   110s
prometheus-operator-85fd4d457c   0   0   0   16m
prometheus-operator-c7b897c47    0   0   0   7m11s

Thu Sep 30 01:31:20 EDT 2021
prometheus-operator-77f978fc69-pbsms   2/2   Running   0   16s
prometheus-operator-58b7dd8db7   0   0   0   40s
prometheus-operator-58c776545    0   0   0   20m
prometheus-operator-5fdbc74fdf   0   0   0   15m
prometheus-operator-68fbbbcfcc   0   0   0   17m
prometheus-operator-76db79b8cc   0   0   0   2m41s
prometheus-operator-77f978fc69   1   1   1   21s
prometheus-operator-7c478957b9   0   0   0   12m
prometheus-operator-85874c5899   0   0   0   15m
prometheus-operator-859d6b85dc   0   0   0   2m20s
prometheus-operator-85fd4d457c   0   0   0   17m
prometheus-operator-c7b897c47    0   0   0   7m41s

Thu Sep 30 01:31:51 EDT 2021
prometheus-operator-77f978fc69-pbsms   2/2   Running   0   48s
prometheus-operator-58b7dd8db7   0   0   0   72s
prometheus-operator-58c776545    0   0   0   20m
prometheus-operator-5fdbc74fdf   0   0   0   16m
prometheus-operator-68fbbbcfcc   0   0   0   18m
prometheus-operator-76db79b8cc   0   0   0   3m13s
prometheus-operator-77f978fc69   1   1   1   53s
prometheus-operator-7c478957b9   0   0   0   13m
prometheus-operator-85874c5899   0   0   0   15m
prometheus-operator-859d6b85dc   0   0   0   2m52s
prometheus-operator-85fd4d457c   0   0   0   17m
prometheus-operator-c7b897c47    0   0   0   8m13s
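One way to quantify the churn visible in the `oc get rs` listings above is to count how many ReplicaSets exist versus how many are still active (a healthy Deployment keeps exactly one active RS). A minimal sketch over a few sample rows, assuming the standard NAME/DESIRED/CURRENT/READY/AGE columns of `oc get rs`:

```shell
# Sample rows as captured from `oc -n openshift-monitoring get rs`;
# column 2 is DESIRED, so rows with $2 == 0 are scaled-down leftovers.
rs_output='prometheus-operator-58c776545 0 0 0 19m
prometheus-operator-859d6b85dc 1 1 1 80s
prometheus-operator-c7b897c47 0 0 0 6m41s'

total=$(printf '%s\n' "$rs_output" | awk 'END { print NR }')
active=$(printf '%s\n' "$rs_output" | awk '$2 > 0 { n++ } END { print n+0 }')
echo "total=$total active=$active"
# prints total=3 active=1
```

A total well above 1 (double digits within ~25 minutes, as in the listings above) means the Deployment is churning through ReplicaSets rather than rolling out once.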
also checked with 4.9.0-0.nightly-2021-09-29-172320, no such issue:

# while true; do date; oc -n openshift-monitoring get pod | grep prometheus-operator; oc -n openshift-monitoring get rs | grep prometheus-operator; echo -e "\n"; sleep 20s; done
Thu Sep 30 01:30:07 EDT 2021
prometheus-operator-9f8f5bf6b-dzfzx   2/2   Running   1 (6h13m ago)   6h14m
prometheus-operator-9f8f5bf6b   1   1   1   6h14m

Thu Sep 30 01:30:39 EDT 2021
prometheus-operator-9f8f5bf6b-dzfzx   2/2   Running   1 (6h14m ago)   6h14m
prometheus-operator-9f8f5bf6b   1   1   1   6h14m

Thu Sep 30 01:31:11 EDT 2021
prometheus-operator-9f8f5bf6b-dzfzx   2/2   Running   1 (6h14m ago)   6h15m
prometheus-operator-9f8f5bf6b   1   1   1   6h15m

Thu Sep 30 01:31:43 EDT 2021
prometheus-operator-9f8f5bf6b-dzfzx   2/2   Running   1 (6h15m ago)   6h15m
prometheus-operator-9f8f5bf6b   1   1   1   6h15m
also checked with 4.10.0-0.nightly-2021-09-27-141350, which does not include https://github.com/openshift/cluster-monitoring-operator/pull/1380/files, no such issue:

# while true; do date; oc -n openshift-monitoring get pod | grep prometheus-operator; oc -n openshift-monitoring get rs | grep prometheus-operator; echo -e "\n"; sleep 30s; done
Thu Sep 30 02:05:13 EDT 2021
prometheus-operator-778d67588-t8ltm   2/2   Running   1 (23m ago)   24m
prometheus-operator-778d67588   1   1   1   24m

Thu Sep 30 02:05:54 EDT 2021
prometheus-operator-778d67588-t8ltm   2/2   Running   1 (24m ago)   25m
prometheus-operator-778d67588   1   1   1   25m

...

Thu Sep 30 02:14:04 EDT 2021
prometheus-operator-778d67588-t8ltm   2/2   Running   1 (32m ago)   33m
prometheus-operator-778d67588   1   1   1   33m

Thu Sep 30 02:14:45 EDT 2021
prometheus-operator-778d67588-t8ltm   2/2   Running   1 (33m ago)   34m
prometheus-operator-778d67588   1   1   1   34m
This issue is failing the "operators should not create watch channels very often" test in nightly-4.10-e2e-aws-single-node. The last 5 runs fail due to:

https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-single-node/1443429427324129280
Sep 30 05:56:44.317: INFO: operator=prometheus-operator, watchrequestcount=217, upperbound=180, ratio=1.2055555555555555

https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-single-node/1443320451118927872
Sep 29 22:22:20.760: INFO: operator=prometheus-operator, watchrequestcount=185, upperbound=180, ratio=1.0277777777777777

https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-single-node/1443228422246502400
Sep 29 16:30:48.651: INFO: operator=prometheus-operator, watchrequestcount=263, upperbound=180, ratio=1.461111111111111

https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-single-node/1443094835195023360
Sep 29 07:37:32.531: INFO: operator=prometheus-operator, watchrequestcount=184, upperbound=180, ratio=1.0222222222222221

https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-single-node/1443038333637758976
Sep 29 03:57:22.291: INFO: operator=prometheus-operator, watchrequestcount=308, upperbound=180, ratio=1.711111111111111
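To compare runs against the bound quickly, the INFO lines can be parsed and the ratios recomputed; a minimal sketch over two of the lines above (the field positions assume the exact key=value, comma-separated format shown):

```shell
# Recompute watchrequestcount / upperbound from the test's INFO lines.
# Splitting on '=' and ',' puts the count in field 4 and the bound in field 6.
logs='operator=prometheus-operator, watchrequestcount=217, upperbound=180
operator=prometheus-operator, watchrequestcount=308, upperbound=180'

printf '%s\n' "$logs" | awk -F'[=,]' '{ printf "count=%s ratio=%.3f\n", $4, $4 / $6 }'
# prints:
# count=217 ratio=1.206
# count=308 ratio=1.711
```

Anything with ratio > 1.0 fails the test, so even the 184/185-count runs are over the limit, just barely.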
The change that introduced this has been reverted in https://github.com/openshift/cluster-monitoring-operator/pull/1407. Moving to MODIFIED.
tested with 4.10.0-0.nightly-2021-10-07-175229; only one prometheus-operator replicaset now:

# while true; do date; oc -n openshift-monitoring get pod | grep prometheus-operator; oc -n openshift-monitoring get rs | grep prometheus-operator; echo -e "\n"; sleep 20s; done
Thu Oct 7 22:48:35 EDT 2021
prometheus-operator-79b57fc9bf-vpgnj   2/2   Running   1 (113m ago)   115m
prometheus-operator-79b57fc9bf   1   1   1   115m

Thu Oct 7 22:49:06 EDT 2021
prometheus-operator-79b57fc9bf-vpgnj   2/2   Running   1 (114m ago)   115m
prometheus-operator-79b57fc9bf   1   1   1   116m

Thu Oct 7 22:49:36 EDT 2021
prometheus-operator-79b57fc9bf-vpgnj   2/2   Running   1 (114m ago)   116m
prometheus-operator-79b57fc9bf   1   1   1   116m

Thu Oct 7 22:50:07 EDT 2021
prometheus-operator-79b57fc9bf-vpgnj   2/2   Running   1 (115m ago)   116m
prometheus-operator-79b57fc9bf   1   1   1   117m

Thu Oct 7 22:50:37 EDT 2021
prometheus-operator-79b57fc9bf-vpgnj   2/2   Running   1 (115m ago)   117m
prometheus-operator-79b57fc9bf   1   1   1   117m
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056