Bug 1985697 - package-server-manager needs to handle 60 seconds downtime of API server gracefully in SNO
Summary: package-server-manager needs to handle 60 seconds downtime of API server gracefully in SNO
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: OLM
Version: 4.9
Hardware: Unspecified
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.9.0
Assignee: tflannag
QA Contact: Jian Zhang
URL:
Whiteboard: chaos
Duplicates: 1989418
Depends On:
Blocks: 1984730
 
Reported: 2021-07-25 03:59 UTC by Naga Ravi Chaitanya Elluri
Modified: 2021-10-18 17:41 UTC
CC List: 2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-10-18 17:40:56 UTC
Target Upstream Version:
Embargoed:




Links
Github openshift/operator-framework-olm pull 136 (open): Bug 1985697: Update the package-server-manager leader election configuration - last updated 2021-07-27 21:38:07 UTC
Red Hat Product Errata RHSA-2021:3759 - last updated 2021-10-18 17:41:15 UTC

Description Naga Ravi Chaitanya Elluri 2021-07-25 03:59:57 UTC
Description of problem:
Package-server-manager is crashing/restarting and going through leader elections during the kube-apiserver rollout, which currently takes around 60 seconds now that shutdown-delay-duration and gracefulTerminationDuration are set to 0 and 15 seconds respectively (https://github.com/openshift/cluster-kube-apiserver-operator/pull/1168 and https://github.com/openshift/library-go/pull/1104). The package-server-manager leader election timeout should be set to > 60 seconds to handle the downtime gracefully in SNO. Values reference: https://github.com/openshift/enhancements/pull/832/files#diff-2e28754e69aa417e5b6d89e99e42f05bfb6330800fa823753383db1d170fbc2fR183.

Here is the trace of the events during the rollout: 
http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/chaos/sno/package-server-manager/cerberus_cluster_state.log

The leader election can be disabled given that there's no HA in SNO.
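For reference, below is a minimal sketch of what SNO-tolerant leader election settings could look like for a controller-runtime based manager such as package-server-manager. The lease/renew/retry durations are the common OpenShift defaults (137s/107s/26s), which comfortably exceed the ~60 second API server outage; the lock name and overall wiring are illustrative assumptions, not the exact change merged for this bug.

// Illustrative sketch only: SNO-tolerant leader election for a
// controller-runtime based manager. The durations exceed the ~60s
// kube-apiserver outage so the single replica never loses its lease;
// alternatively, LeaderElection could simply be set to false on SNO.
package main

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

func main() {
	leaseDuration := 137 * time.Second // must outlive the API server downtime
	renewDeadline := 107 * time.Second
	retryPeriod := 26 * time.Second

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), manager.Options{
		LeaderElection:          true,
		LeaderElectionID:        "package-server-manager-lock", // hypothetical lock name
		LeaderElectionNamespace: "openshift-operator-lifecycle-manager",
		LeaseDuration:           &leaseDuration,
		RenewDeadline:           &renewDeadline,
		RetryPeriod:             &retryPeriod,
	})
	if err != nil {
		panic(err)
	}
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}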

Version-Release number of selected component (if applicable):
4.9.0-0.nightly-2021-07-24-113438

How reproducible:
Always

Steps to Reproduce:
1. Install a SNO cluster using the latest nightly payload.
2. Trigger a kube-apiserver rollout or an outage that lasts at least 60 seconds (a kube-apiserver rollout on a cluster built from a payload that includes https://github.com/openshift/cluster-kube-apiserver-operator/pull/1168 should take ~60 seconds): $ oc patch kubeapiserver/cluster --type merge -p '{"spec":{"forceRedeploymentReason":"ITERATIONX"}}' where X can be 1,2...n. A programmatic equivalent is sketched after this list.
3. Observe the state of package-server-manager.
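For automated chaos runs, the forceRedeploymentReason patch from step 2 can also be applied programmatically. A minimal sketch using client-go's dynamic client follows (kubeconfig discovery and error handling are simplified; the group/version/resource mirrors the oc command above).

// Programmatic equivalent of the `oc patch kubeapiserver/cluster` command
// in step 2, for use from a chaos/test harness. Sketch only.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// kubeapiservers.operator.openshift.io/cluster is the cluster-scoped
	// resource patched by the oc command above.
	gvr := schema.GroupVersionResource{Group: "operator.openshift.io", Version: "v1", Resource: "kubeapiservers"}
	patch := []byte(`{"spec":{"forceRedeploymentReason":"ITERATION1"}}`)
	if _, err := client.Resource(gvr).Patch(context.TODO(), "cluster", types.MergePatchType, patch, metav1.PatchOptions{}); err != nil {
		panic(err)
	}
	fmt.Println("kube-apiserver redeployment triggered")
}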

Actual results:
package-server-manager is crashing/restarting and going through leader elections.

Expected results:
package-server-manager should handle the API server rollout/outage gracefully.

Additional info:
Logs including must-gather: http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/chaos/sno/package-server-manager/

Comment 3 Jian Zhang 2021-08-06 03:41:23 UTC
Cluster version is 4.9.0-0.nightly-2021-08-06-020133
[cloud-user@preserve-olm-env jian]$ oc -n openshift-operator-lifecycle-manager exec catalog-operator-6cd746b48b-plwtn -- olm --version
OLM version: 0.18.3
git commit: 3a21821b786493b59da83ab4ce16d6ed16dcccad

1, Install an SNO cluster.
[cloud-user@preserve-olm-env jian]$ oc get infrastructure cluster -o=jsonpath='{.status.infrastructureTopology}'
SingleReplica
[cloud-user@preserve-olm-env jian]$ oc get nodes
NAME                                        STATUS   ROLES           AGE   VERSION
ip-10-0-133-45.us-east-2.compute.internal   Ready    master,worker   49m   v1.21.1+8268f88
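For reference, the SingleReplica check above can also be made programmatically with the OpenShift config client; a minimal sketch (not the actual package-server-manager code, client construction simplified):

// Illustrative sketch: detect single-node topology the same way the
// jsonpath query above does, then decide on leader election settings.
package main

import (
	"context"
	"fmt"

	configv1 "github.com/openshift/api/config/v1"
	configclient "github.com/openshift/client-go/config/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := configclient.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	infra, err := client.ConfigV1().Infrastructures().Get(context.TODO(), "cluster", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// SingleReplica means no HA, so relaxed (or disabled) leader election
	// is appropriate for components like package-server-manager.
	if infra.Status.InfrastructureTopology == configv1.SingleReplicaTopologyMode {
		fmt.Println("single-node topology: relax or disable leader election")
	}
}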

2, Trigger a kube-apiserver rollout or outage that lasts at least 60 seconds.

[cloud-user@preserve-olm-env jian]$ oc patch kubeapiserver/cluster --type merge -p '{"spec":{"forceRedeploymentReason":"ITERATION1"}}'
kubeapiserver.operator.openshift.io/cluster patched

[cloud-user@preserve-olm-env jian]$ oc get pods -n openshift-kube-apiserver 

NAME                                                                       READY   STATUS      RESTARTS   AGE
installer-10-ip-10-0-133-45.us-east-2.compute.internal                     1/1     Running     0          42s
installer-5-ip-10-0-133-45.us-east-2.compute.internal                      0/1     Completed   0          47m
installer-7-ip-10-0-133-45.us-east-2.compute.internal                      0/1     Completed   0          46m
installer-8-ip-10-0-133-45.us-east-2.compute.internal                      0/1     Completed   0          40m
installer-9-ip-10-0-133-45.us-east-2.compute.internal                      0/1     Completed   0          39m
kube-apiserver-ip-10-0-133-45.us-east-2.compute.internal                   5/5     Running     0          38m
kube-apiserver-startup-monitor-ip-10-0-133-45.us-east-2.compute.internal   0/1     Pending     0          32s
revision-pruner-10-ip-10-0-133-45.us-east-2.compute.internal               0/1     Completed   0          49s
revision-pruner-7-ip-10-0-133-45.us-east-2.compute.internal                0/1     Completed   0          44m
revision-pruner-8-ip-10-0-133-45.us-east-2.compute.internal                0/1     Completed   0          41m
revision-pruner-9-ip-10-0-133-45.us-east-2.compute.internal                0/1     Completed   0          39m

3, Observe the state of package-server-manager. No crashes or restarts; looks good, verifying the bug.
[cloud-user@preserve-olm-env jian]$ oc get pods
NAME                                      READY   STATUS      RESTARTS   AGE
catalog-operator-6cd746b48b-plwtn         1/1     Running     0          54m
collect-profiles-27136980-g4448           0/1     Completed   0          40m
collect-profiles-27136995-zcqx6           0/1     Completed   0          25m
collect-profiles-27137010-q4nwq           0/1     Completed   0          10m
olm-operator-c9cb64896-5wlfq              1/1     Running     0          54m
package-server-manager-5856789494-6nhgc   1/1     Running     0          54m
packageserver-5ddb77696f-59nwr            1/1     Running     0          52m

Comment 4 Kevin Rizza 2021-08-09 17:50:55 UTC
*** Bug 1989418 has been marked as a duplicate of this bug. ***

Comment 7 errata-xmlrpc 2021-10-18 17:40:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

