This error occurs occasionally; the install log and bootstrap logs are attached.

Error from the install log:

level=info msg="Waiting up to 30m0s for bootstrapping to complete..."
E1026 09:47:21.258406     885 reflector.go:307] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to watch *v1.ConfigMap: Get "https://api.yunjiang-26dprr.qe.devcluster.openshift.com:6443/api/v1/namespaces/kube-system/configmaps?allowWatchBookmarks=true&fieldSelector=metadata.name%3Dbootstrap&resourceVersion=13936&timeoutSeconds=465&watch=true": Service Unavailable
E1026 09:47:22.314589     885 reflector.go:153] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to list *v1.ConfigMap: Get "https://api.yunjiang-26dprr.qe.devcluster.openshift.com:6443/api/v1/namespaces/kube-system/configmaps?fieldSelector=metadata.name%3Dbootstrap&limit=500&resourceVersion=0": Service Unavailable
E1026 09:47:23.366783     885 reflector.go:153] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to list *v1.ConfigMap: Get "https://api.yunjiang-26dprr.qe.devcluster.openshift.com:6443/api/v1/namespaces/kube-system/configmaps?fieldSelector=metadata.name%3Dbootstrap&limit=500&resourceVersion=0": Service Unavailable
E1026 09:47:24.428891     885 reflector.go:153] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to list *v1.ConfigMap: Get "https://api.yunjiang-26dprr.qe.devcluster.openshift.com:6443/api/v1/namespaces/kube-system/configmaps?fieldSelector=metadata.name%3Dbootstrap&limit=500&resourceVersion=0": Service Unavailable
level=error msg="Cluster operator authentication Degraded is True with IngressStateEndpoints_MissingSubsets::OAuthServiceCheckEndpointAccessibleController_SyncError::OAuthServiceEndpointsCheckEndpointAccessibleController_SyncError: OAuthServiceCheckEndpointAccessibleControllerDegraded: Get \"https://172.30.211.114:443/healthz\": dial tcp 172.30.211.114:443: connect: connection refused\nOAuthServiceEndpointsCheckEndpointAccessibleControllerDegraded: oauth service endpoints are not ready\nIngressStateEndpointsDegraded: No subsets found for the endpoints of oauth-server"
level=info msg="Cluster operator authentication Progressing is Unknown with NoData: "
level=info msg="Cluster operator authentication Available is False with OAuthServiceCheckEndpointAccessibleController_EndpointUnavailable::OAuthServiceEndpointsCheckEndpointAccessibleController_EndpointUnavailable::ReadyIngressNodes_NoReadyIngressNodes: OAuthServiceEndpointsCheckEndpointAccessibleControllerAvailable: Failed to get oauth-openshift enpoints\nReadyIngressNodesAvailable: Authentication require functional ingress which requires at least one schedulable and ready node. Got 2 worker nodes and 3 master nodes (none are schedulable or ready for ingress pods).\nOAuthServiceCheckEndpointAccessibleControllerAvailable: Get \"https://172.30.211.114:443/healthz\": dial tcp 172.30.211.114:443: connect: connection refused"
level=error msg="Cluster operator etcd Degraded is True with InstallerController_Error::StaticPods_Error: InstallerControllerDegraded: Internal error occurred: admission plugin \"MutatingAdmissionWebhook\" failed to complete mutation in 13s\nStaticPodsDegraded: pods \"etcd-ip-10-0-71-163.us-east-2.compute.internal\" not found\nStaticPodsDegraded: pods \"etcd-ip-10-0-49-162.us-east-2.compute.internal\" not found"
level=info msg="Cluster operator etcd Progressing is True with NodeInstaller: NodeInstallerProgressing: 2 nodes are at revision 0; 1 nodes are at revision 2"

NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          21h     Unable to apply 4.6.0: an unknown error has occurred: MultipleErrors

NAME                                        STATUS     ROLES    AGE   VERSION
ip-10-0-49-162.us-east-2.compute.internal   Ready      master   21h   v1.19.0+d59ce34
ip-10-0-54-178.us-east-2.compute.internal   NotReady   worker   21h   v1.19.0+d59ce34
ip-10-0-55-251.us-east-2.compute.internal   Ready      master   21h   v1.19.0+d59ce34
ip-10-0-69-151.us-east-2.compute.internal   NotReady   worker   21h   v1.19.0+d59ce34
ip-10-0-71-163.us-east-2.compute.internal   Ready      master   21h   v1.19.0+d59ce34

NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                                       False       Unknown       True       22h
cloud-credential                           4.6.0     True        False         False      22h
cluster-autoscaler                         4.6.0     True        False         False      22h
config-operator                            4.6.0     True        False         False      22h
console
csi-snapshot-controller                    4.6.0     True        False         False      51m
dns                                        4.6.0     True        False         False      22h
etcd                                       4.6.0     True        True          True       22h
image-registry
ingress                                              False       True          True       22h
insights                                   4.6.0     True        False         True       22h
kube-apiserver                                       False       True          True       22h
kube-controller-manager                              False       True          True       22h
kube-scheduler                             4.6.0     False       True          True       22h
kube-storage-version-migrator              4.6.0     False       False         False      22h
machine-api                                4.6.0     True        False         False      22h
machine-approver                           4.6.0     True        False         False      22h
machine-config                             4.6.0     True        False         False      22h
marketplace                                4.6.0     True        False         False      22h
monitoring                                           False       True          True       21h
network                                    4.6.0     True        False         False      22h
node-tuning                                4.6.0     True        False         False      22h
openshift-apiserver                        4.6.0     False       False         False      22h
openshift-controller-manager                         False       True          False      22h
openshift-samples
operator-lifecycle-manager                 4.6.0     True        False         False      22h
operator-lifecycle-manager-catalog         4.6.0     True        False         False      22h
operator-lifecycle-manager-packageserver             False       True          False      22h
service-ca                                 4.6.0     True        False         False      22h
storage                                    4.6.0     True        False         False      22h

oc describe node/ip-10-0-54-178.us-east-2.compute.internal
<—snip—>
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Tue, 27 Oct 2020 03:25:56 -0400   Mon, 26 Oct 2020 05:39:43 -0400   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Tue, 27 Oct 2020 03:25:56 -0400   Mon, 26 Oct 2020 05:39:43 -0400   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Tue, 27 Oct 2020 03:25:56 -0400   Mon, 26 Oct 2020 05:39:43 -0400   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            False   Tue, 27 Oct 2020 03:25:56 -0400   Mon, 26 Oct 2020 05:39:43 -0400   KubeletNotReady              runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?
<—snip—>

Checked etcd logs:

2020-10-26 09:25:56.696635 I | etcdserver/api: enabled capabilities for version 3.4
2020-10-26 09:34:30.420782 I | embed: rejected connection from "10.0.49.162:48474" (error "remote error: tls: bad certificate", ServerName "")
2020-10-26 09:34:31.427203 I | embed: rejected connection from "10.0.49.162:48484" (error "remote error: tls: bad certificate", ServerName "")

Version-Release number of the following components:
4.6.0-x86_64

How reproducible:
Occasionally

Steps to Reproduce:
1. Create a disconnected and private cluster (CCO configured in manual mode, no proxy)

Actual results:
Cluster creation failed.

Expected results:
Cluster creation succeeds.

Additional info:
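For reference, a minimal sketch of the install-config.yaml shape for this topology (all values below are placeholders rather than the exact ones from this run; credentialsMode: Manual and publish: Internal are the standard 4.6 fields for manual CCO and a private cluster, and the disconnected mirror-registry sections are elided):

$ cat > install-config.yaml <<'EOF'
apiVersion: v1
baseDomain: qe.devcluster.openshift.com
metadata:
  name: example-cluster            # placeholder
credentialsMode: Manual            # CCO in manual mode
publish: Internal                  # private cluster: no public endpoints
platform:
  aws:
    region: us-east-2
    subnets:                       # pre-existing private subnets (placeholders)
    - subnet-0example1
    - subnet-0example2
# imageContentSources and additionalTrustBundle for the disconnected
# mirror registry elided
pullSecret: '...'
sshKey: '...'
EOF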
Created attachment 1724664 [details]
install log
bootstrap logs: https://drive.google.com/file/d/17yWENjj2JAjGaWUhunQifxb1dWjhjdFp/view?usp=sharing
I don't think it's a blocker, since this error only occurs occasionally.
Hi,

I need to see the logs from the worker nodes when the install fails. I don't see them in the log bundle. Can you please run must-gather and be sure to include the worker node(s) that fail?

Thanks,
Victor
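(For anyone collecting this, a sketch of the commands, assuming the cluster API is still reachable; the node name is a placeholder taken from the node list above:)

$ oc adm must-gather
# If the must-gather pod cannot be scheduled, per-node kubelet journals can
# still be pulled without scheduling anything on the cluster:
$ oc adm node-logs ip-10-0-54-178.us-east-2.compute.internal -u kubelet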
Created attachment 1733283 [details]
01_vpc_disconnected_aws_with_privatelink.yaml
Thanks Yunfei! I am debugging on this cluster now. I appreciate the detailed steps to set up the cluster, and also, thanks for setting this cluster up for me to debug. Much appreciated!
The daemonsets in this cluster are all failing to create pods. It looks like the connection to the API server is not working. Still investigating to figure out why this connection is not working.

[vpickard@rippleRider$][~/bz1892129]$ oc describe ds sdn -n openshift-sdn
Events:
  Type     Reason        Age                    From                  Message
  ----     ------        ----                   ----                  -------
  Warning  FailedCreate  163m (x15 over 3h)     daemonset-controller  Error creating: Internal error occurred: admission plugin "MutatingAdmissionWebhook" failed to complete mutation in 13s
  Warning  FailedCreate  161m                   daemonset-controller  Error creating: Internal error occurred: admission plugin "MutatingAdmissionWebhook" failed to complete mutation in 13s
  Warning  FailedCreate  142m (x15 over 158m)   daemonset-controller  Error creating: Internal error occurred: admission plugin "MutatingAdmissionWebhook" failed to complete mutation in 13s
  Warning  FailedCreate  140m (x9 over 140m)    daemonset-controller  Error creating: Post "https://localhost:6443/api/v1/namespaces/openshift-sdn/pods": dial tcp [::1]:6443: connect: connection refused
  Warning  FailedCreate  126m (x11 over 139m)   daemonset-controller  Error creating: Internal error occurred: admission plugin "MutatingAdmissionWebhook" failed to complete mutation in 13s
  Warning  FailedCreate  120m (x10 over 120m)   daemonset-controller  Error creating: Post "https://localhost:6443/api/v1/namespaces/openshift-sdn/pods": dial tcp [::1]:6443: connect: connection refused
  Warning  FailedCreate  105m (x11 over 119m)   daemonset-controller  Error creating: Internal error occurred: admission plugin "MutatingAdmissionWebhook" failed to complete mutation in 13s
  Warning  FailedCreate  81m (x15 over 97m)     daemonset-controller  Error creating: Internal error occurred: admission plugin "MutatingAdmissionWebhook" failed to complete mutation in 13s
  Warning  FailedCreate  79m (x9 over 79m)      daemonset-controller  Error creating: Post "https://localhost:6443/api/v1/namespaces/openshift-sdn/pods": dial tcp [::1]:6443: connect: connection refused
  Warning  FailedCreate  62m (x11 over 79m)     daemonset-controller  Error creating: Internal error occurred: admission plugin "MutatingAdmissionWebhook" failed to complete mutation in 13s
  Warning  FailedCreate  41m (x15 over 57m)     daemonset-controller  Error creating: Internal error occurred: admission plugin "MutatingAdmissionWebhook" failed to complete mutation in 13s
  Warning  FailedCreate  21m (x14 over 36m)     daemonset-controller  Error creating: Internal error occurred: admission plugin "MutatingAdmissionWebhook" failed to complete mutation in 13s
  Warning  FailedCreate  19m                    daemonset-controller  Error creating: Post "https://localhost:6443/api/v1/namespaces/openshift-sdn/pods": http2: server sent GOAWAY and closed the connection; LastStreamID=1187, ErrCode=NO_ERROR, debug=""
  Warning  FailedCreate  19m (x9 over 19m)      daemonset-controller  Error creating: Post "https://localhost:6443/api/v1/namespaces/openshift-sdn/pods": dial tcp [::1]:6443: connect: connection refused
  Warning  FailedCreate  4m12s (x10 over 18m)   daemonset-controller  Error creating: Internal error occurred: admission plugin "MutatingAdmissionWebhook" failed to complete mutation in 13s

[vpickard@rippleRider$][~/bz1892129]$ oc describe ds dns-default -n openshift-dns
Events:
  Type     Reason        Age                    From                  Message
  ----     ------        ----                   ----                  -------
  Warning  FailedCreate  170m (x15 over 3h6m)   daemonset-controller  Error creating: Internal error occurred: admission plugin "MutatingAdmissionWebhook" failed to complete mutation in 13s
  Warning  FailedCreate  150m (x14 over 165m)   daemonset-controller  Error creating: Internal error occurred: admission plugin "MutatingAdmissionWebhook" failed to complete mutation in 13s
  Warning  FailedCreate  148m (x9 over 148m)    daemonset-controller  Error creating: Post "https://localhost:6443/api/v1/namespaces/openshift-dns/pods": dial tcp [::1]:6443: connect: connection refused
  Warning  FailedCreate  136m (x9 over 147m)    daemonset-controller  Error creating: Internal error occurred: admission plugin "MutatingAdmissionWebhook" failed to complete mutation in 13s
  Warning  FailedCreate  128m (x10 over 128m)   daemonset-controller  Error creating: Post "https://localhost:6443/api/v1/namespaces/openshift-dns/pods": dial tcp [::1]:6443: connect: connection refused
  Warning  FailedCreate  111m (x9 over 126m)    daemonset-controller  Error creating: Internal error occurred: admission plugin "MutatingAdmissionWebhook" failed to complete mutation in 13s
  Warning  FailedCreate  107m                   daemonset-controller  Error creating: Internal error occurred: admission plugin "MutatingAdmissionWebhook" failed to complete mutation in 13s
  Warning  FailedCreate  88m (x15 over 105m)    daemonset-controller  Error creating: Internal error occurred: admission plugin "MutatingAdmissionWebhook" failed to complete mutation in 13s
  Warning  FailedCreate  87m                    daemonset-controller  Error creating: Post "https://localhost:6443/api/v1/namespaces/openshift-dns/pods": unexpected EOF
  Warning  FailedCreate  87m (x8 over 87m)      daemonset-controller  Error creating: Post "https://localhost:6443/api/v1/namespaces/openshift-dns/pods": dial tcp [::1]:6443: connect: connection refused
  Warning  FailedCreate  76m (x9 over 86m)      daemonset-controller  Error creating: Internal error occurred: admission plugin "MutatingAdmissionWebhook" failed to complete mutation in 13s
  Warning  FailedCreate  49m (x15 over 65m)     daemonset-controller  Error creating: Internal error occurred: admission plugin "MutatingAdmissionWebhook" failed to complete mutation in 13s
  Warning  FailedCreate  27m (x15 over 44m)     daemonset-controller  Error creating: Internal error occurred: admission plugin "MutatingAdmissionWebhook" failed to complete mutation in 13s
  Warning  FailedCreate  26m (x10 over 26m)     daemonset-controller  Error creating: Post "https://localhost:6443/api/v1/namespaces/openshift-dns/pods": dial tcp [::1]:6443: connect: connection refused
  Warning  FailedCreate  11m (x9 over 25m)      daemonset-controller  Error creating: Internal error occurred: admission plugin "MutatingAdmissionWebhook" failed to complete mutation in 13s
  Warning  FailedCreate  2s (x5 over 4m36s)     daemonset-controller  Error creating: Internal error occurred: admission plugin "MutatingAdmissionWebhook" failed to complete mutation in 13s

And you can see that the DESIRED replica count for most of the daemonsets is 3 when it should be 6:

This failed cluster
===================
[vpickard@rippleRider$][~/bz1892129]$ oc get ds -A
NAMESPACE                                NAME                              DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                  AGE
openshift-cluster-csi-drivers            aws-ebs-csi-driver-node           3         3         3       3            3           kubernetes.io/os=linux                         15h
openshift-cluster-node-tuning-operator   tuned                             3         3         3       3            3           kubernetes.io/os=linux                         15h
openshift-controller-manager             controller-manager                3         3         0       0            0           node-role.kubernetes.io/master=                15h
openshift-dns                            dns-default                       3         3         3       3            3           kubernetes.io/os=linux                         15h
openshift-image-registry                 node-ca                           3         3         3       3            3           kubernetes.io/os=linux                         15h
openshift-machine-api                    machine-api-termination-handler   0         0         0       0            0           machine.openshift.io/interruptible-instance=   15h
openshift-machine-config-operator        machine-config-daemon             3         3         3       3            3           kubernetes.io/os=linux                         15h
openshift-machine-config-operator        machine-config-server             3         3         3       3            3           node-role.kubernetes.io/master=                15h
openshift-monitoring                     node-exporter                     3         3         3       3            3           kubernetes.io/os=linux                         15h
openshift-multus                         multus                            3         3         3       3            3           kubernetes.io/os=linux                         15h
openshift-multus                         multus-admission-controller       3         3         3       3            3           node-role.kubernetes.io/master=                15h
openshift-multus                         network-metrics-daemon            3         3         3       3            3           kubernetes.io/os=linux                         15h
openshift-sdn                            ovs                               3         3         3       3            3           kubernetes.io/os=linux                         15h
openshift-sdn                            sdn                               3         3         3       3            3           kubernetes.io/os=linux                         15h
openshift-sdn                            sdn-controller                    3         3         3       3            3           node-role.kubernetes.io/master=                15h
[vpickard@rippleRider$][~/bz1892129]$

Working cluster
===============
[vpickard@rippleRider$][~]$ oc get ds -A
NAMESPACE                                NAME                              DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                  AGE
openshift-cluster-csi-drivers            aws-ebs-csi-driver-node           6         6         6       6            6           kubernetes.io/os=linux                         75m
openshift-cluster-node-tuning-operator   tuned                             6         6         6       6            6           kubernetes.io/os=linux                         75m
openshift-controller-manager             controller-manager                3         3         3       3            3           node-role.kubernetes.io/master=                75m
openshift-dns                            dns-default                       6         6         6       6            6           kubernetes.io/os=linux                         74m
openshift-image-registry                 node-ca                           6         6         6       6            6           kubernetes.io/os=linux                         74m
openshift-machine-api                    machine-api-termination-handler   0         0         0       0            0           machine.openshift.io/interruptible-instance=   69m
openshift-machine-config-operator        machine-config-daemon             6         6         6       6            6           kubernetes.io/os=linux                         75m
openshift-machine-config-operator        machine-config-server             3         3         3       3            3           node-role.kubernetes.io/master=                73m
openshift-monitoring                     node-exporter                     6         6         6       6            6           kubernetes.io/os=linux                         75m
openshift-multus                         multus                            6         6         6       6            6           kubernetes.io/os=linux                         76m
openshift-multus                         multus-admission-controller       3         3         3       3            3           node-role.kubernetes.io/master=                76m
openshift-multus                         network-metrics-daemon            6         6         6       6            6           kubernetes.io/os=linux                         76m
openshift-sdn                            ovs                               6         6         6       6            6           kubernetes.io/os=linux                         76m
openshift-sdn                            sdn                               6         6         6       6            6           kubernetes.io/os=linux                         76m
openshift-sdn                            sdn-controller                    3         3         3       3            3           node-role.kubernetes.io/master=                76m
[vpickard@rippleRider$][~]$
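Since every FailedCreate above is either the MutatingAdmissionWebhook 13s timeout or a direct apiserver connection failure, here is a sketch of how to inspect the registered mutating webhooks and the fields that decide whether a broken webhook blocks pod creation (standard admissionregistration.k8s.io/v1 fields; <name> is a placeholder):

$ oc get mutatingwebhookconfigurations
# failurePolicy and timeoutSeconds per webhook in one configuration:
$ oc get mutatingwebhookconfiguration <name> -o jsonpath='{range .webhooks[*]}{.name}{"\t"}{.failurePolicy}{"\t"}{.timeoutSeconds}{"\n"}{end}'

A webhook with failurePolicy: Fail and no namespaceSelector blocks pod creation cluster-wide whenever its backing service is unreachable, which matches the symptom here.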
It looks like the apiserver did not start. From the kube-apiserver-operator logs, I see this:

oc logs kube-apiserver-operator-787b8d6458-8ztt6 -n openshift-kube-apiserver-operator | more
I1125 18:08:05.811870       1 cmd.go:200] Using service-serving-cert provided certificates
I1125 18:08:05.821573       1 observer_polling.go:159] Starting file observer
W1125 18:08:05.835340       1 builder.go:207] unable to get owner reference (falling back to namespace): Get "https://172.30.0.1:443/api/v1/namespaces/openshift-kube-apiserver-operator/pods/kube-apiserver-operator-787b8d6458-8ztt6": dial tcp 172.30.0.1:443: connect: connection refused
I1125 18:08:05.835521       1 builder.go:238] kube-apiserver-operator version v4.0.0-alpha.0-1126-g358d3e91-358d3e915b2e7df4e1557f4c73c3a911a151b456
W1125 18:08:27.021574       1 requestheader_controller.go:193] Unable to get configmap/extension-apiserver-authentication in kube-system. Usually fixed by 'kubectl create rolebinding -n kube-system ROLEBINDING_NAME --role=extension-apiserver-authentication-reader --serviceaccount=YOUR_NS:YOUR_SA'

[vpickard@rippleRider$][~/bz1892129]$ oc get pods -A | grep apiserver
openshift-apiserver-operator        openshift-apiserver-operator-7585c9f557-vv9dn   1/1   Running   13   17h
openshift-kube-apiserver-operator   kube-apiserver-operator-787b8d6458-8ztt6        1/1   Running   13   17h
[vpickard@rippleRider$][~/bz1892129]$
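(A sketch of the follow-up checks I would run here, assuming the kubeconfig still works; "cluster" is the fixed name of the kubeapiserver operator CR:)

$ oc get pods -n openshift-kube-apiserver -o wide    # are the static pods present on the masters at all?
$ oc get kubeapiserver cluster -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.message}{"\n"}{end}'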
I've been looking at this more today. It looks like the root of the problem may be that the user is being set to system:anonymous because the certificate is signed by an unknown authority, as seen in the scheduler logs below.

In the kubelet logs on two of the nodes, I see errors because the user is system:anonymous, like these:

kubelet.log on 10.0.49.162
==========================
user is system:anonymous... what should it be? Lots of these errors:

Oct 26 09:32:48 ip-10-0-49-162 hyperkube[1504]: E1026 09:32:48.979745    1504 reflector.go:127] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.Service: failed to list *v1.Service: services is forbidden: User "system:anonymous" cannot list resource "services" in API group "" at the cluster scope
Oct 26 09:32:48 ip-10-0-49-162 hyperkube[1504]: I1026 09:32:48.979781    1504 manager.go:987] Added container: "/system.slice/systemd-journal-flush.service" (aliases: [], namespace: "")
Oct 26 09:32:48 ip-10-0-49-162 hyperkube[1504]: E1026 09:32:48.979970    1504 reflector.go:127] k8s.io/kubernetes/pkg/kubelet/kubelet.go:438: Failed to watch *v1.Node: failed to list *v1.Node: nodes "ip-10-0-49-162.us-east-2.compute.internal" is forbidden: User "system:anonymous" cannot list resource "nodes" in API group "" at the cluster scope

kubelet.log on 10.0.55.251
==========================
Oct 26 09:32:49 ip-10-0-55-251 hyperkube[1508]: E1026 09:32:49.754065    1508 kubelet_node_status.go:92] Unable to register node "ip-10-0-55-251.us-east-2.compute.internal" with API server: nodes is forbidden: User "system:anonymous" cannot create resource "nodes" in API group "" at the cluster scope

And from the scheduler log on 10.0.49.162, I see these logs indicating there is a cert issue:

log-bundle-20201027031917/control-plane/10.0.49.162/containers/kube-scheduler-446df2e00c661ee9d4ddba97b7435a03e651b7b1bd665c334c4b5c8ab3435372.log
==================================================================================================================================================
W1026 09:35:15.409139       1 authentication.go:294] Error looking up in-cluster authentication configuration: Get "https://api-int.yunjiang-26dprr.qe.devcluster.openshift.com:6443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication": x509: certificate signed by unknown authority
W1026 09:35:15.409224       1 authentication.go:295] Continuing without authentication configuration. This may treat all requests as anonymous.
E1026 09:35:15.440876       1 reflector.go:127] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.StatefulSet: failed to list *v1.StatefulSet: Get "https://api-int.yunjiang-26dprr.qe.devcluster.openshift.com:6443/apis/apps/v1/statefulsets?limit=500&resourceVersion=0": x509: certificate signed by unknown authority
E1026 09:35:15.445589       1 reflector.go:127] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.ReplicationController: failed to list *v1.ReplicationController: Get "https://api-int.yunjiang-26dprr.qe.devcluster.openshift.com:6443/api/v1/replicationcontrollers?limit=500&resourceVersion=0": x509: certificate signed by unknown authority
E1026 09:35:16.879068       1 reflector.go:127] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.Pod: failed to list *v1.Pod: Get "https://api-int.yunjiang-26dprr.qe.devcluster.openshift.com:6443/api/v1/pods?limit=500&resourceVersion=0": x509: certificate signed by unknown authority
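(A quick way to confirm which CA actually signed the serving certificate that the scheduler is rejecting; plain openssl, with the api-int hostname taken from the logs above:)

$ echo | openssl s_client -connect api-int.yunjiang-26dprr.qe.devcluster.openshift.com:6443 2>/dev/null \
    | openssl x509 -noout -issuer -subject -dates

Comparing the issuer printed here against the CA bundle referenced by the scheduler and kubelet kubeconfigs should narrow down which side of the trust chain is wrong.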
Yunfei,

Can you please set up another cluster today when you get in? I will continue debugging tomorrow morning my time. It would be good to capture the output of:

oc get csr

And also a must-gather for the system in the failed state.

Thanks in advance!
Victor, the must-gather command (oc adm must-gather) failed; since too many operators are abnormal, the must-gather pod may not be able to run. You may need to check logs on a live cluster.

>> oc get csr
NAME        AGE     SIGNERNAME                                    REQUESTOR                                                                   CONDITION
csr-7lkqv   4h33m   kubernetes.io/kubelet-serving                 system:node:ip-10-0-58-116.us-east-2.compute.internal                       Approved,Issued
csr-7rhpq   4h39m   kubernetes.io/kubelet-serving                 system:node:ip-10-0-77-49.us-east-2.compute.internal                        Approved,Issued
csr-8cbx2   4h40m   kubernetes.io/kubelet-serving                 system:node:ip-10-0-77-196.us-east-2.compute.internal                       Approved,Issued
csr-fqmkr   4h33m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-klt89   4h33m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-ks7sw   4h40m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-lprdm   4h40m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-mf97n   4h40m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-nlgpg   4h40m   kubernetes.io/kubelet-serving                 system:node:ip-10-0-56-214.us-east-2.compute.internal                       Approved,Issued
csr-rqdth   4h33m   kubernetes.io/kubelet-serving                 system:node:ip-10-0-57-1.us-east-2.compute.internal                         Approved,Issued
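(When the must-gather pod cannot run, a lighter-weight sketch that collects similar data client-side, without scheduling anything on the cluster:)

$ oc adm inspect clusteroperator/etcd clusteroperator/kube-apiserver --dest-dir=./inspect
$ oc adm node-logs --role=master -u kubelet > master-kubelet.log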
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.
Hello Stefan,

Any progress on this bug? The issue is still there; let me know if you need further information.
The LifecycleStale keyword was removed because the bug got commented on recently. The bug assignee was notified.
Today I tried twice; both attempts failed. I am adding the "testblocker" keyword.
[root@preserve-jialiu-ansible ~]# oc get node
NAME                                        STATUS   ROLES    AGE   VERSION
ip-10-0-49-35.us-east-2.compute.internal    Ready    master   97m   v1.19.0+7070803
ip-10-0-58-213.us-east-2.compute.internal   Ready    master   97m   v1.19.0+7070803
ip-10-0-79-220.us-east-2.compute.internal   Ready    master   98m   v1.19.0+7070803

[root@preserve-jialiu-ansible ~]# oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                                       False       Unknown       True       97m
cloud-credential                           4.6.9     True        False         False      96m
cluster-autoscaler                         4.6.9     True        False         False      95m
config-operator                            4.6.9     True        False         False      97m
console
csi-snapshot-controller                    4.6.9     True        False         False      97m
dns                                        4.6.9     True        False         False      96m
etcd                                       4.6.9     False       True          True       97m
image-registry
ingress                                              False       True          True       96m
insights                                   4.6.9     True        False         True       97m
kube-apiserver                                       False       True          True       97m
kube-controller-manager                              False       True          True       97m
kube-scheduler                             4.6.9     False       True          True       97m
kube-storage-version-migrator              4.6.9     False       False         False      97m
machine-api                                4.6.9     True        False         False      93m
machine-approver                           4.6.9     True        False         False      96m
machine-config                                       False       True          True       86m
marketplace                                4.6.9     True        False         False      96m
monitoring                                           False       True          True       92m
network                                    4.6.9     True        False         False      96m
node-tuning                                4.6.9     True        False         False      97m
openshift-apiserver                        4.6.9     False       False         False      97m
openshift-controller-manager                         False       True          False      97m
openshift-samples
operator-lifecycle-manager                 4.6.9     True        False         False      96m
operator-lifecycle-manager-catalog         4.6.9     True        False         False      96m
operator-lifecycle-manager-packageserver             False       True          False      96m
service-ca                                 4.6.9     True        False         False      97m
storage                                    4.6.9     True        False         False      96m

[root@preserve-jialiu-ansible ~]# oc get machine -n openshift-machine-api
NAME                                               PHASE         TYPE        REGION      ZONE         AGE
auto-jialiu-616184-tspvm-master-0                  Running       m5.xlarge   us-east-2   us-east-2a   104m
auto-jialiu-616184-tspvm-master-1                  Running       m5.xlarge   us-east-2   us-east-2b   104m
auto-jialiu-616184-tspvm-master-2                  Running       m5.xlarge   us-east-2   us-east-2a   104m
auto-jialiu-616184-tspvm-worker-us-east-2a-cl9zt   Provisioned   m5.large    us-east-2   us-east-2a   97m
auto-jialiu-616184-tspvm-worker-us-east-2a-t678x   Provisioned   m5.large    us-east-2   us-east-2a   97m
auto-jialiu-616184-tspvm-worker-us-east-2b-r8lrg   Provisioned   m5.large    us-east-2   us-east-2b   97m

The workers are provisioned but never become running because "Daemonset machine-config-server is not ready", which means they cannot fetch the worker ignition file from https://api-int.auto-jialiu-616184.qe.devcluster.openshift.com:22623/config/worker to boot the system.

[root@preserve-jialiu-ansible ~]# oc get po -n openshift-machine-config-operator
NAME                                         READY   STATUS    RESTARTS   AGE
machine-config-controller-5548b9c88f-tzjvd   1/1     Running   5          98m
machine-config-daemon-f9h6f                  2/2     Running   0          99m
machine-config-daemon-q556r                  2/2     Running   0          99m
machine-config-daemon-ww5mk                  2/2     Running   0          99m
machine-config-operator-7677b5fd8b-mp5g2     1/1     Running   5          106m

It seems this cluster is totally broken once the apiserver and etcd operators get into a degraded state.
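(A direct check of whether the machine-config-server endpoint the workers boot from is serving anything; the URL is the one from the comment above, and -k is needed because the certificate is signed by the cluster-internal CA. A 200 with an Ignition payload is the expected healthy result:)

$ curl -k -o /dev/null -w '%{http_code}\n' https://api-int.auto-jialiu-616184.qe.devcluster.openshift.com:22623/config/worker
$ oc get ds machine-config-server -n openshift-machine-config-operator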
[root@preserve-jialiu-ansible ~]# oc describe co machine-config
Name:         machine-config
Namespace:
Labels:       <none>
Annotations:  exclude.release.openshift.io/internal-openshift-hosted: true
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2021-01-19T13:06:55Z
  Generation:          1
  Managed Fields:
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:exclude.release.openshift.io/internal-openshift-hosted:
      f:spec:
      f:status:
        .:
        f:versions:
    Manager:      cluster-version-operator
    Operation:    Update
    Time:         2021-01-19T13:06:55Z
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        f:conditions:
        f:extension:
          .:
          f:master:
          f:worker:
        f:relatedObjects:
    Manager:         machine-config-operator
    Operation:       Update
    Time:            2021-01-19T14:42:15Z
  Resource Version:  41218
  Self Link:         /apis/config.openshift.io/v1/clusteroperators/machine-config
  UID:               5b940e9e-1e52-4576-9b59-efcd11a3d510
Spec:
Status:
  Conditions:
    Last Transition Time:  2021-01-19T13:13:57Z
    Message:               Working towards 4.6.9
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2021-01-19T13:25:23Z
    Message:               Unable to apply 4.6.9: timed out waiting for the condition during waitForDaemonsetRollout: Daemonset machine-config-server is not ready. status: (desired: 0, updated: 0, ready: 0, unavailable: 0)
    Reason:                MachineConfigServerFailed
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2021-01-19T13:25:23Z
    Message:               Cluster not available for 4.6.9
    Status:                False
    Type:                  Available
    Last Transition Time:  2021-01-19T13:25:23Z
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable
  Extension:
    Master:  all 3 nodes are at latest configuration rendered-master-372fb00019365868033c7aeb39ad30de
    Worker:  all 0 nodes are at latest configuration rendered-worker-ebfa863b048487ff5638da9807018019
  Related Objects:
    Group:
    Name:      openshift-machine-config-operator
    Resource:  namespaces
    Group:     machineconfiguration.openshift.io
    Name:
    Resource:  machineconfigpools
    Group:     machineconfiguration.openshift.io
    Name:
    Resource:  controllerconfigs
    Group:     machineconfiguration.openshift.io
    Name:
    Resource:  kubeletconfigs
    Group:     machineconfiguration.openshift.io
    Name:
    Resource:  containerruntimeconfigs
    Group:     machineconfiguration.openshift.io
    Name:
    Resource:  machineconfigs
    Group:
    Name:
    Resource:  nodes
Events:  <none>

[root@preserve-jialiu-ansible ~]# oc describe co etcd
Name:         etcd
Namespace:
Labels:       <none>
Annotations:  exclude.release.openshift.io/internal-openshift-hosted: true
              include.release.openshift.io/self-managed-high-availability: true
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2021-01-19T13:06:54Z
  Generation:          1
  Managed Fields:
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:exclude.release.openshift.io/internal-openshift-hosted:
          f:include.release.openshift.io/self-managed-high-availability:
      f:spec:
      f:status:
        .:
        f:extension:
        f:relatedObjects:
    Manager:      cluster-version-operator
    Operation:    Update
    Time:         2021-01-19T13:06:54Z
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        f:conditions:
        f:versions:
    Manager:         cluster-etcd-operator
    Operation:       Update
    Time:            2021-01-19T14:31:17Z
  Resource Version:  37043
  Self Link:         /apis/config.openshift.io/v1/clusteroperators/etcd
  UID:               703578de-988f-499d-8e00-8dab22b1c5cb
Spec:
Status:
  Conditions:
    Last Transition Time:  2021-01-19T13:17:04Z
    Message:               InstallerControllerDegraded: Internal error occurred: admission plugin "MutatingAdmissionWebhook" failed to complete mutation in 13s
                           NodeInstallerDegraded: 1 nodes are failing on revision 2:
                           NodeInstallerDegraded: static pod of revision 2 has been installed, but is not ready while new revision 3 is pending
                           StaticPodsDegraded: pods "etcd-ip-10-0-58-213.us-east-2.compute.internal" not found
                           StaticPodsDegraded: pods "etcd-ip-10-0-79-220.us-east-2.compute.internal" not found
    Reason:                InstallerController_Error::NodeInstaller_InstallerPodFailed::StaticPods_Error
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2021-01-19T13:14:57Z
    Message:               NodeInstallerProgressing: 3 nodes are at revision 0; 0 nodes have achieved new revision 4
    Reason:                NodeInstaller
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2021-01-19T13:14:16Z
    Message:               StaticPodsAvailable: 0 nodes are active; 3 nodes are at revision 0; 0 nodes have achieved new revision 4
    Reason:                StaticPods_ZeroNodesActive
    Status:                False
    Type:                  Available
    Last Transition Time:  2021-01-19T13:14:18Z
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable
  Extension:  <nil>
  Related Objects:
    Group:     operator.openshift.io
    Name:      cluster
    Resource:  etcds
    Group:
    Name:      openshift-config
    Resource:  namespaces
    Group:
    Name:      openshift-config-managed
    Resource:  namespaces
    Group:
    Name:      openshift-etcd-operator
    Resource:  namespaces
    Group:
    Name:      openshift-etcd
    Resource:  namespaces
  Versions:
    Name:     raw-internal
    Version:  4.6.9
    Name:     operator
    Version:  4.6.9
    Name:     etcd
    Version:  4.6.9
Events:  <none>

[root@preserve-jialiu-ansible ~]# oc describe co kube-apiserver
Name:         kube-apiserver
Namespace:
Labels:       <none>
Annotations:  exclude.release.openshift.io/internal-openshift-hosted: true
              include.release.openshift.io/self-managed-high-availability: true
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2021-01-19T13:06:54Z
  Generation:          1
  Managed Fields:
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:exclude.release.openshift.io/internal-openshift-hosted:
          f:include.release.openshift.io/self-managed-high-availability:
      f:spec:
      f:status:
        .:
        f:extension:
        f:relatedObjects:
    Manager:      cluster-version-operator
    Operation:    Update
    Time:         2021-01-19T13:06:54Z
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        f:conditions:
        f:versions:
    Manager:         cluster-kube-apiserver-operator
    Operation:       Update
    Time:            2021-01-19T14:51:19Z
  Resource Version:  44335
  Self Link:         /apis/config.openshift.io/v1/clusteroperators/kube-apiserver
  UID:               6dde2415-cc7e-461c-b48b-1da465ab16f8
Spec:
Status:
  Conditions:
    Last Transition Time:  2021-01-19T13:16:19Z
    Message:               StaticPodsDegraded: pods "kube-apiserver-ip-10-0-49-35.us-east-2.compute.internal" not found
                           StaticPodsDegraded: pods "kube-apiserver-ip-10-0-79-220.us-east-2.compute.internal" not found
                           StaticPodsDegraded: pods "kube-apiserver-ip-10-0-58-213.us-east-2.compute.internal" not found
                           InstallerControllerDegraded: Internal error occurred: admission plugin "MutatingAdmissionWebhook" failed to complete mutation in 13s
    Reason:                InstallerController_Error::StaticPods_Error
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2021-01-19T13:15:16Z
    Message:               NodeInstallerProgressing: 3 nodes are at revision 0; 0 nodes have achieved new revision 2
    Reason:                NodeInstaller
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2021-01-19T13:14:21Z
    Message:               StaticPodsAvailable: 0 nodes are active; 3 nodes are at revision 0; 0 nodes have achieved new revision 2
    Reason:                StaticPods_ZeroNodesActive
    Status:                False
    Type:                  Available
    Last Transition Time:  2021-01-19T13:14:20Z
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable
  Extension:  <nil>
  Related Objects:
    Group:     operator.openshift.io
    Name:      cluster
    Resource:  kubeapiservers
    Group:     apiextensions.k8s.io
    Name:
    Resource:  customresourcedefinitions
    Group:     security.openshift.io
    Name:
    Resource:  securitycontextconstraints
    Group:
    Name:      openshift-config
    Resource:  namespaces
    Group:
    Name:      openshift-config-managed
    Resource:  namespaces
    Group:
    Name:      openshift-kube-apiserver-operator
    Resource:  namespaces
    Group:
    Name:      openshift-kube-apiserver
    Resource:  namespaces
    Group:     admissionregistration.k8s.io
    Name:
    Resource:  mutatingwebhookconfigurations
    Group:     admissionregistration.k8s.io
    Name:
    Resource:  validatingwebhookconfigurations
    Group:      controlplane.operator.openshift.io
    Name:
    Namespace:  openshift-kube-apiserver
    Resource:   podnetworkconnectivitychecks
  Versions:
    Name:     raw-internal
    Version:  4.6.9
Events:  <none>

[root@preserve-jialiu-ansible ~]# oc get po -n openshift-etcd
NAME                                                   READY   STATUS      RESTARTS   AGE
etcd-ip-10-0-49-35.us-east-2.compute.internal          3/3     Running     1          104m
installer-2-ip-10-0-49-35.us-east-2.compute.internal   0/1     Completed   0          104m

[root@preserve-jialiu-ansible ~]# oc get all -n openshift-etcd
NAME                                                       READY   STATUS      RESTARTS   AGE
pod/etcd-ip-10-0-49-35.us-east-2.compute.internal          3/3     Running     1          104m
pod/installer-2-ip-10-0-49-35.us-east-2.compute.internal   0/1     Completed   0          104m

NAME                  TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)             AGE
service/etcd          ClusterIP   172.30.110.11   <none>        2379/TCP,9979/TCP   113m
service/host-etcd-2   ClusterIP   None            <none>        2379/TCP            113m
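(Given only one of the three expected etcd pods exists, a sketch for checking cluster membership from inside the running pod; the etcdctl container name and the preconfigured ETCDCTL_* environment are the 4.6 conventions as I understand them, so adjust if they differ:)

$ oc exec -n openshift-etcd etcd-ip-10-0-49-35.us-east-2.compute.internal -c etcdctl -- etcdctl member list -w table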
The original root cause was an unrestricted MutatingAdmissionWebhook, fixed in https://bugzilla.redhat.com/show_bug.cgi?id=1903226. That bug is VERIFIED for 4.7, but not backported to 4.6. Moving this one to cloud team as a reminder to backport https://bugzilla.redhat.com/show_bug.cgi?id=1903226 to 4.6.
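For context on "unrestricted": a mutating webhook becomes a cluster-wide single point of failure when it matches pods in every namespace with failurePolicy: Fail, so core pods (etcd and kube-apiserver installers, SDN, DNS) cannot be created once its backing service is down. As a stopgap on an affected cluster, and only as a sketch (the webhook configuration name is a placeholder, and this fails open rather than fixing the webhook):

$ oc patch mutatingwebhookconfiguration <name> --type=json \
    -p '[{"op": "replace", "path": "/webhooks/0/failurePolicy", "value": "Ignore"}]'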
Setting to blocker-, given stts' comment that this is a backport reminder.
> That bug is VERIFIED for 4.7, but not backported to 4.6

Yeah, agreed; from my test results, this issue mainly happens on 4.6.
*** This bug has been marked as a duplicate of bug 1921901 ***