Bug 2100233

Summary:	Baremetal UPI - apiserver high number of restarts due to poststarthook/authorization.openshift.io-bootstrapclusterroles check failed: healthz
Product:	OpenShift Container Platform	Reporter:	Gabriel Scheffer <gscheffe>
Component:	openshift-apiserver	Assignee:	Luis Sanchez <sanchezl>
Status:	CLOSED DUPLICATE	QA Contact:	Rahul Gangwar <rgangwar>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	4.10	CC:	akashem, aygarg, bsmitley, jkaur, mfojtik, pawankum, pkhaire, sanchezl, sar, simore, slaznick, smaudet, sponnaga, vkochuku, wlewis
Target Milestone:	---
Target Release:	---
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2022-08-25 14:56:32 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Gabriel Scheffer 2022-06-22 19:26:11 UTC

Description of problem:

Hello team,
An issue was caught on a cluster during an upgrade phase from 4.9.28 to 4.10.17 with the openshift-apiserver degraded with production impact.

During the analysis on the remote session we saw several apiserver pods with high number of restarts and continuously restarting, hitting a timeout while executing the poststarthook "poststarthook/authorization.openshift.io-bootstrapclusterroles check failed: healthz"

Cluster is currently running with CV Unmanaged state, workaround (below) applied. Looking for a permanent solution.

Version-Release number of selected component (if applicable): 4.10.17


How reproducible:
So after careful analysis we could say that the pod is taking too much time to boot up and we kill it in the meantime:
- events are full of warnings for unhealthy errors, readiness probe fails and even killing messages.
- checking the logs from apiserver pods they are doing "[graceful-termination]"
- there is no significant error on the logs, also the apiserver just start serving and sudden begin the termination:

- Most meaningful message is the reason to stop the service:
~~~
  2022-06-21T07:20:37.792249662Z I0621 07:20:37.792195       1 healthz.go:257] poststarthook/authorization.openshift.io-bootstrapclusterroles check failed: healthz
  2022-06-21T07:20:37.792249662Z [-]poststarthook/authorization.openshift.io-bootstrapclusterroles failed: not finished
~~~


Steps to Reproduce:

Baremetal install with proxy;

1 - Set the CV to unmanaged and edit/change the "failureThreshold" from 3 to 10 on `oc edit deploy apiserver -n openshift-apiserver`, to give more time to the deployment to finish health check;
2 - Finish the upgrade process;
3 - Return the CV back to managed state;

Actual results:
The issue on the healthcheck due to the poststarthook remains.


Expected results:

Probe and healthcheck pass for the apiserver pods.


Additional info:
Case linked

Feel free to get in touch.

Best,
Gabriel

Comment 14 Santhiya R 2022-07-21 09:03:51 UTC

Hello,

I'm updating the Bug on behalf of Pawan. Please find the below comment from the Customer. When we are reverting back to default values API-pods are failing,  while with suggested values pods are up and running. Also, the Customer uses proxy and it is the standard one used across all the openshift clusters.


~~~~

Still it is running with override values suggested in the ticket, if we put the default values (failureThreshold 3) api-server pods are in crashloopbackup state. 

"Set UnManaged from CV and Scaled Operator to 0.
Increased probe failure to 10 (from 3), and it seems to be holding.
If we set everything back to managed it fails again.
```yaml
apiVersion: config.openshift.io/v1
kind: ClusterVersion
metadata:
  name: version
spec:
  overrides:
    - group: apps
      kind: Deployment
      name: openshift-apiserver-operator
      namespace: openshift-apiserver-operator
      unmanaged: true
```"

~~~

Comment 27 Standa Laznicka 2022-07-28 12:39:52 UTC

I can see from the original must-gather that the kube-apiserver is also failing in its role-bootstrapping logic.

How many role, rolebinding, clusterrole and clusterrolebinding objects are there in the cluster? Are there any admission webhooks present in the cluster that operate on RBAC resources?

Michal Fojtik also discovered that there were a few HTTP 500s responses to the OAS with regards to some cluster/rolebindings retrieval.

Would it be possible to get a must-gather that contains:
- audit logs
- logs of failing openshift-apiserver pods
- kube-apiserver logs from the time period when the openshift-apiserver pods above were failing
- possibly even kube-apiserver logs that contain the kube-apiserver startup (note that the logs retrieved by must-gather can be truncated)

Comment 28 pawankum 2022-07-28 17:04:43 UTC

(In reply to Standa Laznicka from comment #27)
> I can see from the original must-gather that the kube-apiserver is also
> failing in its role-bootstrapping logic.
> 
> How many role, rolebinding, clusterrole and clusterrolebinding objects are
> there in the cluster? Are there any admission webhooks present in the
> cluster that operate on RBAC resources?
> 
> Michal Fojtik also discovered that there were a few HTTP 500s responses to
> the OAS with regards to some cluster/rolebindings retrieval.
> 
> Would it be possible to get a must-gather that contains:
> - audit logs
> - logs of failing openshift-apiserver pods
> - kube-apiserver logs from the time period when the openshift-apiserver pods
> above were failing
> - possibly even kube-apiserver logs that contain the kube-apiserver startup
> (note that the logs retrieved by must-gather can be truncated)

Hello Standa,

I think customer has provided audit logs 2-3 times, in comment22 as well, wasn't those helpful?

Will it be possible for someone from engineering team to go on call and collect all log for once? May be this will help in quicker troubleshooting. 

I will try to ask for required info in the meantime.



Regards,
Pawan

Comment 45 Luis Sanchez 2022-08-25 14:56:32 UTC

Fixed in 4.10.25.

Comment 46 Scott Dodson 2022-08-25 17:41:43 UTC

If we're saying the fix was delivered in Bug 2109235 we should've marked this as a dupe so that no one has to read through every comment to arrive at that conclusion.

*** This bug has been marked as a duplicate of bug 2109235 ***