Description of problem:

package-server-manager is crashing/restarting and going through leader elections during the kube-apiserver rollout, which currently takes around 60 seconds now that shutdown-delay-duration and gracefulTerminationDuration are set to 0 and 15 seconds respectively (https://github.com/openshift/cluster-kube-apiserver-operator/pull/1168 and https://github.com/openshift/library-go/pull/1104). The package-server-manager leader election timeout should be set to > 60 seconds to handle the downtime gracefully in SNO.

Values reference: https://github.com/openshift/enhancements/pull/832/files#diff-2e28754e69aa417e5b6d89e99e42f05bfb6330800fa823753383db1d170fbc2fR183

Here is the trace of the events during the rollout: http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/chaos/sno/package-server-manager/cerberus_cluster_state.log

Alternatively, leader election can be disabled entirely, given that there is no HA in SNO.

Version-Release number of selected component (if applicable):
4.9.0-0.nightly-2021-07-24-113438

How reproducible:
Always

Steps to Reproduce:
1. Install a SNO cluster using the latest nightly payload.
2. Trigger a kube-apiserver rollout or outage that lasts for at least 60 seconds (a kube-apiserver rollout on a cluster built using a payload with https://github.com/openshift/cluster-kube-apiserver-operator/pull/1168 should take ~60 seconds):
   $ oc patch kubeapiserver/cluster --type merge -p '{"spec":{"forceRedeploymentReason":"ITERATIONX"}}'
   where X can be 1, 2, ... n
3. Observe the state of package-server-manager.

Actual results:
package-server-manager is crashing/restarting and going through leader elections.

Expected results:
package-server-manager should handle the API rollout/outage gracefully.

Additional info:
Logs including must-gather: http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/chaos/sno/package-server-manager/
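The timing argument above can be sketched numerically. This is a minimal, hypothetical illustration using the SNO leader-election values proposed in the linked enhancement (lease duration 270s, renew deadline 240s, retry period 60s); these are assumptions for illustration, and the values actually compiled into package-server-manager may differ:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Hypothetical SNO leader-election values, taken from the linked
	// enhancement proposal rather than from package-server-manager itself.
	leaseDuration := 270 * time.Second
	renewDeadline := 240 * time.Second
	retryPeriod := 60 * time.Second

	// Observed kube-apiserver rollout window in this bug.
	outage := 60 * time.Second

	// The leader keeps retrying lease renewal every retryPeriod until
	// renewDeadline expires, so the last renewal attempt that can still
	// succeed happens roughly at renewDeadline - retryPeriod. An API
	// outage shorter than that window does not cost the leader its lease.
	tolerated := renewDeadline - retryPeriod
	fmt.Printf("lease=%v renew=%v retry=%v\n", leaseDuration, renewDeadline, retryPeriod)
	fmt.Printf("tolerated outage: %v, observed outage: %v\n", tolerated, outage)
	if tolerated > outage {
		fmt.Println("leader survives the rollout")
	}
}
```

By contrast, the upstream client-go defaults (15s lease / 10s renew deadline) give up well inside a 60-second outage, which matches the restarts observed here.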
Cluster version is 4.9.0-0.nightly-2021-08-06-020133

[cloud-user@preserve-olm-env jian]$ oc -n openshift-operator-lifecycle-manager exec catalog-operator-6cd746b48b-plwtn -- olm --version
OLM version: 0.18.3
git commit: 3a21821b786493b59da83ab4ce16d6ed16dcccad

1. Install an SNO cluster.
[cloud-user@preserve-olm-env jian]$ oc get infrastructure cluster -o=jsonpath='{.status.infrastructureTopology}'
SingleReplica
[cloud-user@preserve-olm-env jian]$ oc get nodes
NAME                                        STATUS   ROLES           AGE   VERSION
ip-10-0-133-45.us-east-2.compute.internal   Ready    master,worker   49m   v1.21.1+8268f88

2. Trigger a kube-apiserver rollout or outage which lasts for at least 60 seconds.
[cloud-user@preserve-olm-env jian]$ oc patch kubeapiserver/cluster --type merge -p '{"spec":{"forceRedeploymentReason":"ITERATION1"}}'
kubeapiserver.operator.openshift.io/cluster patched
[cloud-user@preserve-olm-env jian]$ oc get pods -n openshift-kube-apiserver
NAME                                                                       READY   STATUS      RESTARTS   AGE
installer-10-ip-10-0-133-45.us-east-2.compute.internal                     1/1     Running     0          42s
installer-5-ip-10-0-133-45.us-east-2.compute.internal                      0/1     Completed   0          47m
installer-7-ip-10-0-133-45.us-east-2.compute.internal                      0/1     Completed   0          46m
installer-8-ip-10-0-133-45.us-east-2.compute.internal                      0/1     Completed   0          40m
installer-9-ip-10-0-133-45.us-east-2.compute.internal                      0/1     Completed   0          39m
kube-apiserver-ip-10-0-133-45.us-east-2.compute.internal                   5/5     Running     0          38m
kube-apiserver-startup-monitor-ip-10-0-133-45.us-east-2.compute.internal   0/1     Pending     0          32s
revision-pruner-10-ip-10-0-133-45.us-east-2.compute.internal               0/1     Completed   0          49s
revision-pruner-7-ip-10-0-133-45.us-east-2.compute.internal                0/1     Completed   0          44m
revision-pruner-8-ip-10-0-133-45.us-east-2.compute.internal                0/1     Completed   0          41m
revision-pruner-9-ip-10-0-133-45.us-east-2.compute.internal                0/1     Completed   0          39m

3. Observe the state of package-server-manager.
No crash, looks good. Verified.
[cloud-user@preserve-olm-env jian]$ oc get pods
NAME                                      READY   STATUS      RESTARTS   AGE
catalog-operator-6cd746b48b-plwtn         1/1     Running     0          54m
collect-profiles-27136980-g4448           0/1     Completed   0          40m
collect-profiles-27136995-zcqx6           0/1     Completed   0          25m
collect-profiles-27137010-q4nwq           0/1     Completed   0          10m
olm-operator-c9cb64896-5wlfq              1/1     Running     0          54m
package-server-manager-5856789494-6nhgc   1/1     Running     0          54m
packageserver-5ddb77696f-59nwr            1/1     Running     0          52m
*** Bug 1989418 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759