Created attachment 1893328 [details]
logs of 2 problem pods

Version:
$ openshift-install version
openshift-install 4.11.0-0.nightly-2022-06-28-160049
built from commit 6daed68b9863a9b2ecebdf8a4056800aa5c60ad3
release image registry.ci.openshift.org/ocp/release@sha256:b79b1be6aa4f9f62c691c043e0911856cf1c11bb81c8ef94057752c6e5a8478a
release architecture amd64
$

Platform: alibabacloud

Please specify: IPI

What happened?
After rebooting the compute nodes and then the control-plane nodes one by one, the cluster operators "network" and "kube-apiserver" became degraded. Note that everything was fine before the nodes were rebooted.

What did you expect to happen?
All cluster operators should remain stable and ready, with none becoming degraded.

How to reproduce it (as minimally and precisely as possible)?
Always (a rough sketch of the reboot sequence we used is below).

Anything else we need to know?
>FYI We also tried the scenario with 4.10.20-x86_64, where there was no such issue.
>FYI We also tried the scenario with 4.11.0-0.nightly-2022-06-28-160049 on GCP, where there was no such issue.
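For reference, a minimal sketch of the reboot sequence used to reproduce this, assuming the standard oc drain/debug workflow; the drain flags, the sleep, and the timeout value are our assumptions, not part of the product or the original report:

# Sketch: reboot every worker, then every master, one node at a time.
# Flags and timeouts below are assumptions; adjust for your cluster.
for node in $(oc get nodes -l node-role.kubernetes.io/worker -o name); do
  oc adm drain "$node" --ignore-daemonsets --delete-emptydir-data --force
  oc debug "$node" -- chroot /host systemctl reboot
  sleep 120   # give the node time to actually go down before polling Ready
  oc wait "$node" --for=condition=Ready --timeout=15m
  oc adm uncordon "$node"
done
# then repeat the same loop with -l node-role.kubernetes.io/master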
$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-06-28-160049   True        False         49m     Error while reconciling 4.11.0-0.nightly-2022-06-28-160049: an unknown error has occurred: MultipleErrors
$ oc get nodes
NAME                                             STATUS   ROLES    AGE   VERSION
jiwei-2822411871-szqwg-master-0                  Ready    master   65m   v1.24.0+9ddc8b1
jiwei-2822411871-szqwg-master-1                  Ready    master   67m   v1.24.0+9ddc8b1
jiwei-2822411871-szqwg-master-2                  Ready    master   67m   v1.24.0+9ddc8b1
jiwei-2822411871-szqwg-worker-us-east-1a-jr66l   Ready    worker   56m   v1.24.0+9ddc8b1
jiwei-2822411871-szqwg-worker-us-east-1b-q9tdq   Ready    worker   57m   v1.24.0+9ddc8b1
$ oc get co | grep -Ev "True False False"
NAME             VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
kube-apiserver   4.11.0-0.nightly-2022-06-28-160049   True        False         True       52m     StaticPodsDegraded: pod/kube-apiserver-jiwei-2822411871-szqwg-master-1 container "kube-apiserver" is waiting: CreateContainerError: error reserving ctr name k8s_kube-apiserver_kube-apiserver-jiwei-2822411871-szqwg-master-1_openshift-kube-apiserver_51a2e7955439eae96d60f88e7b0f3a70_2 for id f82c4afbbf9b55a6e5e7db185c26412f1329a9df5680198485877eb16e1d4ff7: name is reserved...
network          4.11.0-0.nightly-2022-06-28-160049   True        True          True       65m     DaemonSet "/openshift-sdn/sdn" rollout is not making progress - last change 2022-06-29T03:32:51Z
$ oc get pods -n openshift-sdn -o wide
NAME                   READY   STATUS    RESTARTS   AGE   IP             NODE                                             NOMINATED NODE   READINESS GATES
sdn-5dwzj              1/2     Running   2          57m   10.0.105.72    jiwei-2822411871-szqwg-worker-us-east-1b-q9tdq   <none>           <none>
sdn-controller-hjrf6   2/2     Running   2          67m   10.0.105.69    jiwei-2822411871-szqwg-master-2                  <none>           <none>
sdn-controller-ltq9q   2/2     Running   2          67m   10.0.176.206   jiwei-2822411871-szqwg-master-1                  <none>           <none>
sdn-controller-zflnm   2/2     Running   2          66m   10.0.105.70    jiwei-2822411871-szqwg-master-0                  <none>           <none>
sdn-dgf8z              2/2     Running   2          66m   10.0.105.69    jiwei-2822411871-szqwg-master-2                  <none>           <none>
sdn-h2jmx              1/2     Running   6          56m   10.0.176.207   jiwei-2822411871-szqwg-worker-us-east-1a-jr66l   <none>           <none>
sdn-l6bmw              1/2     Running   2          66m   10.0.105.70    jiwei-2822411871-szqwg-master-0                  <none>           <none>
sdn-m8vfq              2/2     Running   2          66m   10.0.176.206   jiwei-2822411871-szqwg-master-1                  <none>           <none>
$ oc get pods -n openshift-kube-apiserver -o wide | grep -Ev "Completed"
NAME                                                   READY   STATUS                 RESTARTS      AGE   IP             NODE                              NOMINATED NODE   READINESS GATES
apiserver-watcher-jiwei-2822411871-szqwg-master-0      1/1     Running                1             66m   10.0.105.70    jiwei-2822411871-szqwg-master-0   <none>           <none>
apiserver-watcher-jiwei-2822411871-szqwg-master-1      1/1     Running                1             67m   10.0.176.206   jiwei-2822411871-szqwg-master-1   <none>           <none>
apiserver-watcher-jiwei-2822411871-szqwg-master-2      1/1     Running                1             67m   10.0.105.69    jiwei-2822411871-szqwg-master-2   <none>           <none>
kube-apiserver-guard-jiwei-2822411871-szqwg-master-0   1/1     Running                1             51m   10.130.0.8     jiwei-2822411871-szqwg-master-0   <none>           <none>
kube-apiserver-guard-jiwei-2822411871-szqwg-master-1   1/1     Running                1             63m   10.128.0.20    jiwei-2822411871-szqwg-master-1   <none>           <none>
kube-apiserver-guard-jiwei-2822411871-szqwg-master-2   1/1     Running                1             52m   10.129.0.10    jiwei-2822411871-szqwg-master-2   <none>           <none>
kube-apiserver-jiwei-2822411871-szqwg-master-0         4/5     CreateContainerError   5 (25m ago)   51m   10.0.105.70    jiwei-2822411871-szqwg-master-0   <none>           <none>
kube-apiserver-jiwei-2822411871-szqwg-master-1         3/5     CreateContainerError   5 (29m ago)   54m   10.0.176.206   jiwei-2822411871-szqwg-master-1   <none>           <none>
kube-apiserver-jiwei-2822411871-szqwg-master-2         3/5     CreateContainerError   5 (27m ago)   53m   10.0.105.69    jiwei-2822411871-szqwg-master-2   <none>           <none>
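The "error reserving ctr name ... name is reserved" CreateContainerError typically means CRI-O still holds a name reservation left over from a container that was not fully cleaned up across the reboot. A minimal inspection sketch on one affected master; crictl and journalctl are the standard node tools here, but the specific filters are our assumptions:

$ oc debug node/jiwei-2822411871-szqwg-master-1
sh-4.4# chroot /host
# all kube-apiserver containers, including exited ones that may still hold the name
sh-4.4# crictl ps -a --name kube-apiserver
# CRI-O's view of the container id named in the error message
sh-4.4# crictl inspect f82c4afbbf9b55a6e5e7db185c26412f1329a9df5680198485877eb16e1d4ff7
# CRI-O logs around the failed reservation
sh-4.4# journalctl -u crio --since "1 hour ago" | grep -i "name is reserved"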
$ oc logs -n openshift-kube-apiserver kube-apiserver-jiwei-2822411871-szqwg-master-0 | grep E0629
E0629 03:33:35.671323      16 reflector.go:138] vendor/k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.ResourceQuota: failed to list *v1.ResourceQuota: Get "https://[::1]:6443/api/v1/resourcequotas?limit=500&resourceVersion=0": x509: certificate has expired or is not yet valid: current time 2022-06-29T03:33:35Z is before 2022-06-29T10:33:30Z
E0629 03:33:35.671887      16 reflector.go:138] pkg/client/informers/externalversions/factory.go:117: Failed to watch *v1.APIService: failed to list *v1.APIService: Get "https://[::1]:6443/apis/apiregistration.k8s.io/v1/apiservices?limit=500&resourceVersion=0": x509: certificate has expired or is not yet valid: current time 2022-06-29T03:33:35Z is before 2022-06-29T10:33:30Z
E0629 03:33:35.672363      16 reflector.go:138] vendor/k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.Service: failed to list *v1.Service: Get "https://[::1]:6443/api/v1/services?limit=500&resourceVersion=0": x509: certificate has expired or is not yet valid: current time 2022-06-29T03:33:35Z is before 2022-06-29T10:33:30Z
E0629 03:33:35.702646      16 reflector.go:138] vendor/k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.Node: failed to list *v1.Node: Get "https://[::1]:6443/api/v1/nodes?limit=500&resourceVersion=0": x509: certificate has expired or is not yet valid: current time 2022-06-29T03:33:35Z is before 2022-06-29T10:33:30Z
E0629 03:33:35.703559      16 reflector.go:138] vendor/k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.ClusterRoleBinding: failed to list *v1.ClusterRoleBinding: Get "https://[::1]:6443/apis/rbac.authorization.k8s.io/v1/clusterrolebindings?limit=500&resourceVersion=0": x509: certificate has expired or is not yet valid: current time 2022-06-29T03:33:35Z is before 2022-06-29T10:33:30Z
$
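Note that the x509 errors say the current time (2022-06-29T03:33:35Z) is before the certificate's NotBefore (2022-06-29T10:33:30Z), i.e. the serving certificate on [::1]:6443 appears to have been minted roughly seven hours in the future, which suggests clock skew on or between the masters. A minimal check sketch, assuming chronyc and openssl are available on the hosts (they normally are on RHCOS):

# Sketch: compare each master's wall clock and NTP state against the
# validity window of the certificate actually served on localhost:6443.
for node in $(oc get nodes -l node-role.kubernetes.io/master -o name); do
  echo "== $node =="
  oc debug "$node" -- chroot /host sh -c '
    date -u                  # node wall clock
    chronyc tracking         # NTP sync state and measured offset
    echo | openssl s_client -connect localhost:6443 2>/dev/null \
      | openssl x509 -noout -dates   # NotBefore/NotAfter of the live cert'
done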
- does this happen with (4.10, alibaba)? I want to know if this is a regression.
- can you confirm that this is "Always" reproducible?
- have you found any workaround that resolves the issue?
I am setting blocker- for now, until we determine whether this impacts a broader set of platforms and occurs more frequently.
(In reply to Abu Kashem from comment #1)
> - does this happen with (4.10, alibaba)? I want to know if this is a regression.
> - can you confirm that this is "Always" reproducible?
> - have you found any workaround that resolves the issue?

No such issue with 4.10 on Alibaba; see below (from the original description):
> FYI We also tried the scenario with 4.10.20-x86_64, where there was no such issue.
> FYI We also tried the scenario with 4.11.0-0.nightly-2022-06-28-160049 on GCP, where there was no such issue.

Yes, we tried multiple times and hit the issue every time. We have not found a workaround so far.
Dear reporter, we greatly appreciate the bug you have reported here. Unfortunately, due to the migration to a new issue-tracking system (https://issues.redhat.com/), we cannot continue triaging bugs reported in Bugzilla. Since this bug has been stale for multiple days, we have decided to close it. If you think this is a mistake, or this bug deserves a higher priority or severity than is set today, please feel free to reopen it and tell us why. We are going to move every re-opened bug to https://issues.redhat.com. Thank you for your patience and understanding.