Bug 2102011 - [IPI on Alibabacloud] cluster operators "network" and "kube-apiserver" turned degraded after rebooting each node of the cluster
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-apiserver
Version: 4.11
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Abu Kashem
QA Contact: Ke Wang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-06-29 06:14 UTC by Jianli Wei
Modified: 2023-05-11 07:49 UTC
CC List: 6 users

Fixed In Version:
Doc Type: Known Issue
Doc Text:
Cause: Each cluster node is rebooted. Consequence: The cluster operators "network" and "kube-apiserver" become degraded. Workaround (if any): n/a. Result: The cluster becomes unhealthy.
Clone Of:
Environment:
Last Closed: 2023-01-16 11:33:33 UTC
Target Upstream Version:
Embargoed:


Attachments
logs of 2 problem pods (1.81 MB, application/x-tar)
2022-06-29 06:14 UTC, Jianli Wei

Description Jianli Wei 2022-06-29 06:14:08 UTC
Created attachment 1893328 [details]
logs of 2 problem pods

Version:
$ openshift-install version
openshift-install 4.11.0-0.nightly-2022-06-28-160049
built from commit 6daed68b9863a9b2ecebdf8a4056800aa5c60ad3
release image registry.ci.openshift.org/ocp/release@sha256:b79b1be6aa4f9f62c691c043e0911856cf1c11bb81c8ef94057752c6e5a8478a
release architecture amd64
$ 

Platform: alibabacloud

Please specify: IPI

What happened?
After rebooting the compute nodes and then the control-plane nodes one by one, the cluster operators "network" and "kube-apiserver" became degraded. Note that everything was healthy before the nodes were rebooted.

What did you expect to happen?
All cluster operators should remain stable and available, with none becoming degraded.
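
A quick way to check this (a minimal one-liner keyed to the AVAILABLE/PROGRESSING/DEGRADED columns, equivalent to the grep filter used in the output below) is:

$ oc get co --no-headers | awk '$3 != "True" || $4 != "False" || $5 != "False" {print $1}'

On a healthy cluster the command prints nothing; after the reboots it prints "kube-apiserver" and "network" here.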

How to reproduce it (as minimally and precisely as possible)?
Always: reboot the compute nodes and then the control-plane nodes one by one.
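
For reference, a minimal sketch of the reboot loop (illustrative only; assumes cluster-admin credentials and "oc debug" access to the nodes):

$ for node in $(oc get nodes -l node-role.kubernetes.io/worker -o name); do
    oc debug "$node" -- chroot /host systemctl reboot   # the debug pod dies with the reboot, so the command may report an error
    sleep 60                                            # give the kubelet time to report NotReady
    oc wait --for=condition=Ready "$node" --timeout=15m # wait until the node is back before the next reboot
  done

The same loop with -l node-role.kubernetes.io/master covers the control-plane nodes.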

Anything else we need to know?
>FYI We also tried the scenario with 4.10.20-x86_64, and did not hit this issue.
>FYI We also tried the scenario with 4.11.0-0.nightly-2022-06-28-160049 on GCP, and did not hit this issue.

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-06-28-160049   True        False         49m     Error while reconciling 4.11.0-0.nightly-2022-06-28-160049: an unknown error has occurred: MultipleErrors
$ oc get nodes
NAME                                             STATUS   ROLES    AGE   VERSION
jiwei-2822411871-szqwg-master-0                  Ready    master   65m   v1.24.0+9ddc8b1
jiwei-2822411871-szqwg-master-1                  Ready    master   67m   v1.24.0+9ddc8b1
jiwei-2822411871-szqwg-master-2                  Ready    master   67m   v1.24.0+9ddc8b1
jiwei-2822411871-szqwg-worker-us-east-1a-jr66l   Ready    worker   56m   v1.24.0+9ddc8b1
jiwei-2822411871-szqwg-worker-us-east-1b-q9tdq   Ready    worker   57m   v1.24.0+9ddc8b1
$ oc get co | grep -Ev "True        False         False"
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
kube-apiserver                             4.11.0-0.nightly-2022-06-28-160049   True        False         True       52m     StaticPodsDegraded: pod/kube-apiserver-jiwei-2822411871-szqwg-master-1 container "kube-apiserver" is waiting: CreateContainerError: error reserving ctr name k8s_kube-apiserver_kube-apiserver-jiwei-2822411871-szqwg-master-1_openshift-kube-apiserver_51a2e7955439eae96d60f88e7b0f3a70_2 for id f82c4afbbf9b55a6e5e7db185c26412f1329a9df5680198485877eb16e1d4ff7: name is reserved...
network                                    4.11.0-0.nightly-2022-06-28-160049   True        True          True       65m     DaemonSet "/openshift-sdn/sdn" rollout is not making progress - last change 2022-06-29T03:32:51Z
$ oc get pods -n openshift-sdn -o wide
NAME                   READY   STATUS    RESTARTS   AGE   IP             NODE                                             NOMINATED NODE   READINESS GATES
sdn-5dwzj              1/2     Running   2          57m   10.0.105.72    jiwei-2822411871-szqwg-worker-us-east-1b-q9tdq   <none>           <none>
sdn-controller-hjrf6   2/2     Running   2          67m   10.0.105.69    jiwei-2822411871-szqwg-master-2                  <none>           <none>
sdn-controller-ltq9q   2/2     Running   2          67m   10.0.176.206   jiwei-2822411871-szqwg-master-1                  <none>           <none>
sdn-controller-zflnm   2/2     Running   2          66m   10.0.105.70    jiwei-2822411871-szqwg-master-0                  <none>           <none>
sdn-dgf8z              2/2     Running   2          66m   10.0.105.69    jiwei-2822411871-szqwg-master-2                  <none>           <none>
sdn-h2jmx              1/2     Running   6          56m   10.0.176.207   jiwei-2822411871-szqwg-worker-us-east-1a-jr66l   <none>           <none>
sdn-l6bmw              1/2     Running   2          66m   10.0.105.70    jiwei-2822411871-szqwg-master-0                  <none>           <none>
sdn-m8vfq              2/2     Running   2          66m   10.0.176.206   jiwei-2822411871-szqwg-master-1                  <none>           <none>
$ oc get pods -n openshift-kube-apiserver -o wide | grep -Ev "Completed"
NAME                                                   READY   STATUS                 RESTARTS      AGE   IP             NODE                              NOMINATED NODE   READINESS GATES
apiserver-watcher-jiwei-2822411871-szqwg-master-0      1/1     Running                1             66m   10.0.105.70    jiwei-2822411871-szqwg-master-0   <none>           <none>
apiserver-watcher-jiwei-2822411871-szqwg-master-1      1/1     Running                1             67m   10.0.176.206   jiwei-2822411871-szqwg-master-1   <none>           <none>
apiserver-watcher-jiwei-2822411871-szqwg-master-2      1/1     Running                1             67m   10.0.105.69    jiwei-2822411871-szqwg-master-2   <none>           <none>
kube-apiserver-guard-jiwei-2822411871-szqwg-master-0   1/1     Running                1             51m   10.130.0.8     jiwei-2822411871-szqwg-master-0   <none>           <none>
kube-apiserver-guard-jiwei-2822411871-szqwg-master-1   1/1     Running                1             63m   10.128.0.20    jiwei-2822411871-szqwg-master-1   <none>           <none>
kube-apiserver-guard-jiwei-2822411871-szqwg-master-2   1/1     Running                1             52m   10.129.0.10    jiwei-2822411871-szqwg-master-2   <none>           <none>
kube-apiserver-jiwei-2822411871-szqwg-master-0         4/5     CreateContainerError   5 (25m ago)   51m   10.0.105.70    jiwei-2822411871-szqwg-master-0   <none>           <none>
kube-apiserver-jiwei-2822411871-szqwg-master-1         3/5     CreateContainerError   5 (29m ago)   54m   10.0.176.206   jiwei-2822411871-szqwg-master-1   <none>           <none>
kube-apiserver-jiwei-2822411871-szqwg-master-2         3/5     CreateContainerError   5 (27m ago)   53m   10.0.105.69    jiwei-2822411871-szqwg-master-2   <none>           <none>
$ oc logs -n openshift-kube-apiserver kube-apiserver-jiwei-2822411871-szqwg-master-0 | grep E0629
E0629 03:33:35.671323      16 reflector.go:138] vendor/k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.ResourceQuota: failed to list *v1.ResourceQuota: Get "https://[::1]:6443/api/v1/resourcequotas?limit=500&resourceVersion=0": x509: certificate has expired or is not yet valid: current time 2022-06-29T03:33:35Z is before 2022-06-29T10:33:30Z
E0629 03:33:35.671887      16 reflector.go:138] pkg/client/informers/externalversions/factory.go:117: Failed to watch *v1.APIService: failed to list *v1.APIService: Get "https://[::1]:6443/apis/apiregistration.k8s.io/v1/apiservices?limit=500&resourceVersion=0": x509: certificate has expired or is not yet valid: current time 2022-06-29T03:33:35Z is before 2022-06-29T10:33:30Z
E0629 03:33:35.672363      16 reflector.go:138] vendor/k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.Service: failed to list *v1.Service: Get "https://[::1]:6443/api/v1/services?limit=500&resourceVersion=0": x509: certificate has expired or is not yet valid: current time 2022-06-29T03:33:35Z is before 2022-06-29T10:33:30Z
E0629 03:33:35.702646      16 reflector.go:138] vendor/k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.Node: failed to list *v1.Node: Get "https://[::1]:6443/api/v1/nodes?limit=500&resourceVersion=0": x509: certificate has expired or is not yet valid: current time 2022-06-29T03:33:35Z is before 2022-06-29T10:33:30Z
E0629 03:33:35.703559      16 reflector.go:138] vendor/k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.ClusterRoleBinding: failed to list *v1.ClusterRoleBinding: Get "https://[::1]:6443/apis/rbac.authorization.k8s.io/v1/clusterrolebindings?limit=500&resourceVersion=0": x509: certificate has expired or is not yet valid: current time 2022-06-29T03:33:35Z is before 2022-06-29T10:33:30Z
$
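
The x509 errors above show the node clock (2022-06-29T03:33:35Z) roughly seven hours behind the certificate's NotBefore time (2022-06-29T10:33:30Z), which suggests the node clocks fell out of sync across the reboots. A quick way to compare the clock and chrony state on every node (a diagnostic sketch; assumes chronyd is the time source on the RHCOS hosts):

$ for node in $(oc get nodes -o name); do
    echo "== $node =="
    oc debug "$node" -- chroot /host sh -c 'date -u; chronyc tracking | head -n 5'
  done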

Comment 1 Abu Kashem 2022-07-06 17:10:44 UTC
- does this happen with (4.10, alibaba)? I want to know if this is a regression.
- can you confirm that this is "Always" reproducible?
- have you found any workaround that resolves the issue?

Comment 2 Wally 2022-07-11 12:20:34 UTC
I am setting blocker- for now, until we determine whether this impacts a broader set of platforms and occurs more frequently.

Comment 3 Jianli Wei 2022-07-13 07:57:02 UTC
(In reply to Abu Kashem from comment #1)
> - does this happen with (4.10, alibaba)? I want to know if this is a
> regression.
> - can you confirm that this is "Always" reproducible?
> - have you found any workaround that resolves the issue?

No such issue with 4.10 on Alibaba; see below (from the original description):
>FYI We also tried the scenario with 4.10.20-x86_64, and did not hit this issue.
>FYI We also tried the scenario with 4.11.0-0.nightly-2022-06-28-160049 on GCP, and did not hit this issue.

Yes, we tried multiple times and hit the issue every time.

We have not found any workaround so far.

Comment 4 Michal Fojtik 2023-01-16 11:33:33 UTC
Dear reporter, we greatly appreciate the bug you have reported here. Unfortunately, due to the migration to a new issue-tracking system (https://issues.redhat.com/), we cannot continue triaging bugs reported in Bugzilla. Since this bug has been stale for multiple days, we have therefore decided to close it.

If you think this is a mistake, or this bug should have a higher priority or severity than it is set to today, please feel free to reopen it and tell us why. We will move every re-opened bug to https://issues.redhat.com.

Thank you for your patience and understanding.

