Description of problem:

Hello team,

An issue was caught on a cluster during an upgrade from 4.9.28 to 4.10.17: the openshift-apiserver became degraded, with production impact. During analysis on the remote session we saw several apiserver pods with a high number of restarts, continuously restarting and hitting a timeout while executing the poststarthook: "poststarthook/authorization.openshift.io-bootstrapclusterroles check failed: healthz"

The cluster is currently running with the CV in the Unmanaged state, with the workaround (below) applied. Looking for a permanent solution.

Version-Release number of selected component (if applicable):
4.10.17

How reproducible:

After careful analysis we can say that the pod takes too long to boot up and is killed in the meantime:
- events are full of "unhealthy" warnings, readiness probe failures and even killing messages.
- the logs from the apiserver pods show "[graceful-termination]".
- there is no significant error in the logs; the apiserver just starts serving and suddenly begins termination.
- The most meaningful message is the reason the service stops:
~~~
2022-06-21T07:20:37.792249662Z I0621 07:20:37.792195       1 healthz.go:257] poststarthook/authorization.openshift.io-bootstrapclusterroles check failed: healthz
2022-06-21T07:20:37.792249662Z [-]poststarthook/authorization.openshift.io-bootstrapclusterroles failed: not finished
~~~

Steps to Reproduce:
Bare-metal install with proxy:
1. Set the CV to Unmanaged and change the "failureThreshold" from 3 to 10 via `oc edit deploy apiserver -n openshift-apiserver`, to give the deployment more time to pass the health check;
2. Finish the upgrade process;
3. Return the CV to the Managed state.

Actual results:
The healthcheck failure due to the poststarthook remains.

Expected results:
Probes and healthchecks pass for the apiserver pods.

Additional info:
Case linked. Feel free to get in touch.

Best,
Gabriel
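For reference, the workaround in step 1 amounts to raising the readiness probe's failureThreshold on the openshift-apiserver deployment. A minimal sketch of the relevant fragment follows; the container name and probe layout here are assumptions for illustration, not taken from the cluster (the actual deployment is operator-managed, which is why the CV must be Unmanaged for the edit to stick):

```yaml
# Hypothetical sketch of the workaround applied via
# `oc edit deploy apiserver -n openshift-apiserver`.
# Container name and probe fields are assumed, not verified.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: apiserver
  namespace: openshift-apiserver
spec:
  template:
    spec:
      containers:
      - name: openshift-apiserver   # assumed container name
        readinessProbe:
          failureThreshold: 10      # raised from the default of 3
```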
Hello, I'm updating the bug on behalf of Pawan. Please find the comment below from the customer. When they revert to the default values the API pods fail, while with the suggested values the pods stay up and running. Also, the customer uses a proxy, the standard one used across all their OpenShift clusters. ~~~~ It is still running with the override values suggested in the ticket; if we put back the default values (failureThreshold 3) the api-server pods go into CrashLoopBackOff state. "Set Unmanaged from the CV and scaled the operator to 0. Increased the probe failureThreshold to 10 (from 3), and it seems to be holding. If we set everything back to Managed it fails again. ```yaml apiVersion: config.openshift.io/v1 kind: ClusterVersion metadata: name: version spec: overrides: - group: apps kind: Deployment name: openshift-apiserver-operator namespace: openshift-apiserver-operator unmanaged: true ```" ~~~
I can see from the original must-gather that the kube-apiserver is also failing in its role-bootstrapping logic. How many role, rolebinding, clusterrole and clusterrolebinding objects are there in the cluster? Are there any admission webhooks present in the cluster that operate on RBAC resources? Michal Fojtik also discovered that there were a few HTTP 500 responses to the OAS with regards to some cluster/rolebindings retrieval. Would it be possible to get a must-gather that contains: - audit logs - logs of failing openshift-apiserver pods - kube-apiserver logs from the time period when the openshift-apiserver pods above were failing - possibly even kube-apiserver logs that contain the kube-apiserver startup (note that the logs retrieved by must-gather can be truncated)
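A quick way to answer the object-count and webhook questions above, sketched as `oc` commands. This assumes cluster-admin access on a live cluster and only reports counts, not object contents:

```shell
# Count cluster-scoped RBAC objects.
for kind in clusterroles clusterrolebindings; do
  echo -n "$kind: "
  oc get "$kind" --no-headers | wc -l
done

# Count namespaced RBAC objects across all namespaces.
for kind in roles rolebindings; do
  echo -n "$kind (all namespaces): "
  oc get "$kind" --all-namespaces --no-headers | wc -l
done

# List admission webhooks that might intercept RBAC requests.
oc get validatingwebhookconfigurations,mutatingwebhookconfigurations

# Collect the requested logs (must-gather output may be truncated).
oc adm must-gather --dest-dir=./must-gather
```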
(In reply to Standa Laznicka from comment #27) > I can see from the original must-gather that the kube-apiserver is also > failing in its role-bootstrapping logic. > > How many role, rolebinding, clusterrole and clusterrolebinding objects are > there in the cluster? Are there any admission webhooks present in the > cluster that operate on RBAC resources? > > Michal Fojtik also discovered that there were a few HTTP 500s responses to > the OAS with regards to some cluster/rolebindings retrieval. > > Would it be possible to get a must-gather that contains: > - audit logs > - logs of failing openshift-apiserver pods > - kube-apiserver logs from the time period when the openshift-apiserver pods > above were failing > - possibly even kube-apiserver logs that contain the kube-apiserver startup > (note that the logs retrieved by must-gather can be truncated) Hello Standa, I think the customer has provided audit logs two or three times, in comment22 as well; weren't those helpful? Would it be possible for someone from the engineering team to join a call and collect all the logs at once? Maybe that would speed up troubleshooting. I will try to ask for the required info in the meantime. Regards, Pawan
Fixed in 4.10.25.
If we're saying the fix was delivered in Bug 2109235 we should've marked this as a dupe so that no one has to read through every comment to arrive at that conclusion. *** This bug has been marked as a duplicate of bug 2109235 ***