+++ This bug was initially created as a clone of Bug #1909600 +++
+++ This bug was initially created as a clone of Bug #1858763 +++

Description of problem:

Getting the below error in CI upgrade logs while upgrading to 4.7.z. The log is from the Jenkins upgrade CI job:
http://virt-openshift-05.lab.eng.nay.redhat.com/ci-logs/upgrade_CI/16365/log

08-02 15:28:04.489 => Below status and logs for the different conditions of all abnormal COs are sorted by 'lastTransitionTime':
08-02 15:28:04.489 2021-08-01T22:05:30Z [network] Upgradeable True - -
08-02 15:28:04.489 2021-08-01T22:06:39Z [kube-apiserver] Upgradeable True AsExpected All is well
08-02 15:28:04.489 2021-08-01T22:07:37Z [network] Available True - -
08-02 15:28:04.489 2021-08-02T00:20:17Z [kube-apiserver] Available True AsExpected StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 16; 0 nodes have achieved new revision 17
08-02 15:28:04.489 2021-08-02T00:26:57Z [kube-storage-version-migrator] Upgradeable Unknown NoData -
08-02 15:28:04.489 2021-08-02T00:26:58Z [kube-storage-version-migrator] Degraded False AsExpected All is well
08-02 15:28:04.489 2021-08-02T04:10:47Z [network] Degraded False - -
08-02 15:28:04.489 2021-08-02T04:50:40Z [network] ManagementStateDegraded False - -
08-02 15:28:04.489 2021-08-02T04:51:32Z [kube-apiserver] -->>Progressing True<<-- NodeInstaller NodeInstallerProgressing: 3 nodes are at revision 16; 0 nodes have achieved new revision 17
08-02 15:28:04.489 2021-08-02T04:53:01Z [kube-storage-version-migrator] Progressing False AsExpected All is well
08-02 15:28:04.490 2021-08-02T04:54:02Z [kube-apiserver] -->>Degraded True<<--

Checking the must-gather logs we also found the error below. Multiple operators have problems, and kube-apiserver even reports that one node's kubelet is broken.
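The CI report above flattens every operator's conditions into one list ordered by `lastTransitionTime`. As a minimal sketch of how such a report can be produced, the snippet below sorts conditions across operators from a small, hypothetical sample shaped like `oc get clusteroperator -o json` output (the sample data is trimmed and invented for illustration; only the field names follow the real ClusterOperator schema):

```python
import json
from datetime import datetime

# Hypothetical, trimmed sample of `oc get clusteroperator -o json` output.
sample = json.loads("""
{"items": [
  {"metadata": {"name": "kube-apiserver"},
   "status": {"conditions": [
     {"type": "Degraded", "status": "True",
      "lastTransitionTime": "2021-08-02T04:54:02Z"},
     {"type": "Progressing", "status": "True",
      "lastTransitionTime": "2021-08-02T04:51:32Z"}]}},
  {"metadata": {"name": "network"},
   "status": {"conditions": [
     {"type": "Degraded", "status": "False",
      "lastTransitionTime": "2021-08-02T04:10:47Z"}]}}
]}
""")

# Flatten (timestamp, operator, condition type, status) tuples.
rows = []
for co in sample["items"]:
    name = co["metadata"]["name"]
    for cond in co["status"]["conditions"]:
        rows.append((cond["lastTransitionTime"], name,
                     cond["type"], cond["status"]))

# Sort all conditions across operators by lastTransitionTime,
# as the CI report does.
rows.sort(key=lambda r: datetime.strptime(r[0], "%Y-%m-%dT%H:%M:%SZ"))
for ts, name, ctype, status in rows:
    print(ts, f"[{name}]", ctype, status)
```

Against a live cluster, the same flattening could be fed from `oc get clusteroperator -o json` instead of the inline sample.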
For kube-apiserver-operator:

grep -i '1 nodes are failing on revision' *.log

2021-08-02T05:04:33.887573848Z I0802 05:04:33.887180 1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator", UID:"90c6eefe-f314-11eb-a009-fa163eb7c2a6", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/kube-apiserver changed: Degraded message changed from "InstallerControllerDegraded: missing required resources: [configmaps: aggregator-client-ca,check-endpoints-kubeconfig,client-ca,control-plane-node-kubeconfig, configmaps: bound-sa-token-signing-certs-17,config-17,etcd-serving-ca-17,kube-apiserver-audit-policies-17,kube-apiserver-cert-syncer-kubeconfig-17,kube-apiserver-pod-17,kubelet-serving-ca-17,sa-token-signing-certs-17, secrets: aggregator-client,bound-service-account-signing-key,check-endpoints-client-cert-key,control-plane-node-admin-client-cert-key,external-loadbalancer-serving-certkey,internal-loadbalancer-serving-certkey,localhost-serving-cert-certkey,node-kubeconfigs,service-network-serving-certkey, secrets: etcd-client-17,kubelet-client-17,localhost-recovery-client-token-17,localhost-recovery-serving-certkey-17]\nNodeInstallerDegraded: 1 nodes are failing on revision 17:\nNodeInstallerDegraded: no detailed termination message, see `oc get -n \"openshift-kube-apiserver\" pods/\"installer-17-leap02054839-tf2qv-control-plane-0\" -oyaml`\nStaticPodsDegraded: context canceled" to "InstallerControllerDegraded: missing required resources: [secrets: aggregator-client,bound-service-account-signing-key,check-endpoints-client-cert-key,control-plane-node-admin-client-cert-key,external-loadbalancer-serving-certkey,internal-loadbalancer-serving-certkey,localhost-serving-cert-certkey,node-kubeconfigs,service-network-serving-certkey, secrets: etcd-client-17,kubelet-client-17,localhost-recovery-client-token-17,localhost-recovery-serving-certkey-17]\nNodeInstallerDegraded: 1 nodes are failing on revision 17:\nNodeInstallerDegraded: no detailed termination message, see `oc get -n \"openshift-kube-apiserver\" pods/\"installer-17-leap02054839-tf2qv-control-plane-0\" -oyaml`\nStaticPodsDegraded: context canceled"

current.log:2021-08-02T05:04:33.888754269Z I0802 05:04:33.888675 1 status_controller.go:213] clusteroperator/kube-apiserver diff {"status":{"conditions":[{"lastTransitionTime":"2021-08-02T04:54:02Z","message":"NodeInstallerDegraded: 1 nodes are failing on revision 17:\nNodeInstallerDegraded: no detailed termination message, see `oc get -n \"openshift-kube-apiserver\" pods/\"installer-17-leap02054839-tf2qv-control-plane-0\" -oyaml`\nStaticPodsDegraded: context canceled","reason":"NodeInstaller_InstallerPodFailed::StaticPods_Error","status":"True","type":"Degraded"},{"lastTransitionTime":"2021-08-02T04:51:32Z","message":"NodeInstallerProgressing: 3 nodes are at revision 16; 0 nodes have achieved new revision 17","reason":"NodeInstaller","status":"True","type":"Progressing"},{"lastTransitionTime":"2021-08-02T00:20:17Z","message":"StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 16; 0 nodes have achieved new revision 17","reason":"AsExpected","status":"True","type":"Available"},{"lastTransitionTime":"2021-08-01T22:06:39Z","message":"All is well","reason":"AsExpected","status":"True","type":"Upgradeable"}]}}

current.log:2021-08-02T05:04:33.989561516Z I0802 05:04:33.981194 1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator", UID:"90c6eefe-f314-11eb-a009-fa163eb7c2a6", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/kube-apiserver changed: Degraded message changed from "InstallerControllerDegraded: missing required resources: [secrets: aggregator-client,bound-service-account-signing-key,check-endpoints-client-cert-key,control-plane-node-admin-client-cert-key,external-loadbalancer-serving-certkey,internal-loadbalancer-serving-certkey,localhost-serving-cert-certkey,node-kubeconfigs,service-network-serving-certkey, secrets: etcd-client-17,kubelet-client-17,localhost-recovery-client-token-17,localhost-recovery-serving-certkey-17]\nNodeInstallerDegraded: 1 nodes are failing on revision 17:\nNodeInstallerDegraded: no detailed termination message, see `oc get -n \"openshift-kube-apiserver\" pods/\"installer-17-leap02054839-tf2qv-control-plane-0\" -oyaml`\nStaticPodsDegraded: context canceled" to "NodeInstallerDegraded: 1 nodes are failing on revision 17:\nNodeInstallerDegraded: no detailed termination message, see `oc get -n \"openshift-kube-apiserver\" pods/\"installer-17-leap02054839-tf2qv-control-plane-0\" -oyaml`\nStaticPodsDegraded: context : Degraded message changed from "NodeInstallerDegraded: 1 nodes are failing on revision 17:\nNodeInstallerDegraded: no detailed termination message, see `oc get -n \"openshift-kube-apiserver\" pods/\"installer-17-leap02054839-tf2qv-control-plane-0\" -oyaml`\nNodeControllerDegraded: The master nodes not ready: node \"leap02054839-tf2qv-control-plane-2\" not ready since 2021-08-02 05:06:33 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)" to "NodeInstallerDegraded: 1 nodes are failing on revision 17:\nNodeInstallerDegraded: no detailed termination message, see `oc get -n \"openshift-kube-apiserver\" pods/\"installer-17-leap02054839-tf2qv-control-plane-0\" -oyaml`\nNodeControllerDegraded: The master nodes not ready: node \"leap02054839-tf2qv-control-plane-2\" not ready since 2021-08-02 05:07:52 +0000 UTC because KubeletNotReady ([container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful])"

For kubelet service logs:

grep "PLEG is not healthy" ./host_service_logs/masters/kubelet_service.log

Aug 02 03:43:23.540253 leap02054839-tf2qv-control-plane-2 hyperkube[1681]: I0802 03:43:23.540167 1681 setters.go:555] Node became not ready: {Type:Ready Status:False LastHeartbeatTime:2021-08-02 03:43:23.539650512 +0000 UTC m=+7.539703485 LastTransitionTime:2021-08-02 03:43:23.539650512 +0000 UTC m=+7.539703485 Reason:KubeletNotReady Message:[container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]}
Aug 02 03:43:23.562260 leap02054839-tf2qv-control-plane-2 hyperkube[1681]: E0802 03:43:23.562210 1681 kubelet.go:1804] skipping pod synchronization - [container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]
Aug 02 03:47:18.778711 leap02054839-tf2qv-control-plane-0 hyperkube[1686]: E0802 03:47:18.778560 1686 kubelet.go:1804] skipping pod synchronization - [container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]
Aug 02 03:51:14.748145 leap02054839-tf2qv-control-plane-1 hyperkube[1685]: I0802 03:51:14.747861 1685 setters.go:555] Node became not ready: {Type:Ready Status:False LastHeartbeatTime:2021-08-02 03:51:14.747824702 +0000 UTC m=+7.741865589 LastTransitionTime:2021-08-02 03:51:14.747824702 +0000 UTC m=+7.741865589 Reason:KubeletNotReady Message:[container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]}
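These "Node became not ready" lines are what establish the timeline across the three masters. A small sketch of how to pull out (timestamp, node, reason) tuples from such journal lines with a regex (the two sample lines below are trimmed copies of the entries quoted above, not the full must-gather):

```python
import re

# Two kubelet journal lines, trimmed from the entries quoted in this bug.
log = (
    "Aug 02 03:43:23.540253 leap02054839-tf2qv-control-plane-2 hyperkube[1681]: "
    "I0802 03:43:23.540167 1681 setters.go:555] Node became not ready: "
    "{Type:Ready Status:False Reason:KubeletNotReady Message:[container runtime "
    "status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]}\n"
    "Aug 02 03:51:14.748145 leap02054839-tf2qv-control-plane-1 hyperkube[1685]: "
    "I0802 03:51:14.747861 1685 setters.go:555] Node became not ready: "
    "{Type:Ready Status:False Reason:KubeletNotReady Message:[container runtime "
    "status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]}\n"
)

# journald prefix: "Mon DD HH:MM:SS.ffffff <hostname> ..."; the Reason field
# sits inside the condition struct after "Node became not ready".
pattern = re.compile(
    r"^(?P<ts>\w{3} \d{2} [\d:.]+) (?P<node>\S+) .*Node became not ready.*Reason:(?P<reason>\w+)"
)
events = [m.groupdict() for line in log.splitlines() if (m := pattern.match(line))]
for e in events:
    print(e["ts"], e["node"], e["reason"])
```

Run over the real kubelet_service.log this gives a quick per-node timeline of when each master dropped out of Ready.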
Installer 17 failed because the kubelet was not ready:

NodeInstallerDegraded: 1 nodes are failing on revision 17:\nNodeInstallerDegraded: no detailed termination message, see `oc get -n \"openshift-kube-apiserver\" pods/\"installer-17-leap02054839-tf2qv-control-plane-0\" -oyaml`\nNodeControllerDegraded: The master nodes not ready: node \"leap02054839-tf2qv-control-plane-2\" not ready since 2021-08-02 05:07:52 +0000 UTC because KubeletNotReady ([container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful])

In 4.8 we have an installer retry; in 4.7 we don't, and the backport is not feasible. Moving this to the Node component to understand why the pod failed.
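The 4.8 behaviour referenced here is that the installer controller retries a failed installer pod for the same revision instead of immediately going Degraded. As a rough illustration of the idea only (this is not the library-go code, and `run_installer` is a hypothetical callable standing in for launching an installer pod):

```python
import time

def run_with_retries(run_installer, max_retries=3, backoff_s=0.0):
    """Bounded retry for a flaky installer pod (illustrative sketch).

    run_installer: hypothetical callable returning True on success.
    Returns the attempt number that succeeded; raises if all attempts fail.
    """
    for attempt in range(1, max_retries + 1):
        if run_installer():
            return attempt
        time.sleep(backoff_s)  # back off before retrying the same revision
    raise RuntimeError(f"installer still failing after {max_retries} attempts")

# Simulate the transient failure seen in this bug: the first attempt fails
# (kubelet not ready), the second succeeds once the node recovers.
outcomes = iter([False, True])
attempts = run_with_retries(lambda: next(outcomes))
```

With a retry like this, a transient KubeletNotReady window would not pin the operator on a failed revision; without it (4.7), the single failed installer pod leaves kube-apiserver Degraded.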
This is the 4.8 change: https://github.com/openshift/library-go/pull/979
@Harshal: Can you please check whether these logs are useful?

find . -name kubelet_service.log
./quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-040f48c020420ff93b227216469f6c2971cf10fac2b0b52ea9853e88ec1964a6/host_service_logs/masters/kubelet_service.log
@Harshal: We are re-running the upgrade job. If we are able to replicate the error, we will share the journal system logs; until then, please don't close the bug.
Please find the attached journal logs. https://drive.google.com/file/d/1iGLAIp-0LRjXhkzQdbSZ81C7ivgK6ZRZ/view?usp=sharing
@harshal: Any update on this? It has been pending for a long time.
The network is unavailable because the sdn pod failed to start: kubelet or cri-o was apparently unable to set it up correctly:

Aug 27 01:32:16 rgangwar27114550-xlbt5-control-plane-1 hyperkube[2577]: E0827 05:32:16.921366 2577 pod_workers.go:190] Error syncing pod ab763d18-06eb-11ec-a51c-fa163e167285 ("sdn-q6v8g_openshift-sdn(ab763d18-06eb-11ec-a51c-fa163e167285)"), skipping: failed to "StartContainer" for "sdn" with CreateContainerError: "container create failed: time=\"2021-08-27T05:32:16Z\" level=warning msg=\"exit status 1\"\ntime=\"2021-08-27T05:32:16Z\" level=error msg=\"container_linux.go:349: starting container process caused \\\"process_linux.go:449: container init caused \\\\\\\"rootfs_linux.go:58: mounting \\\\\\\\\\\\\\\"/var/lib/kubelet/pods/ab763d18-06eb-11ec-a51c-fa163e167285/volumes/kubernetes.io~configmap/config\\\\\\\\\\\\\\\" to rootfs \\\\\\\\\\\\\\\"/var/lib/containers/storage/overlay/d2349e5ab6f0c58bbb439f29a0bd42d2e9003486a865d23a33fdc8940ba07c0f/merged\\\\\\\\\\\\\\\" at \\\\\\\\\\\\\\\"/config\\\\\\\\\\\\\\\" caused \\\\\\\\\\\\\\\"stat /var/lib/kubelet/pods/ab763d18-06eb-11ec-a51c-fa163e167285/volumes/kubernetes.io~configmap/config: no such file or directory\\\\\\\\\\\\\\\"\\\\\\\"\\\"\"\ncontainer_linux.go:349: starting container process caused \"process_linux.go:449: container init caused \\\"rootfs_linux.go:58: mounting \\\\\\\"/var/lib/kubelet/pods/ab763d18-06eb-11ec-a51c-fa163e167285/volumes/kubernetes.io~configmap/config\\\\\\\" to rootfs \\\\\\\"/var/lib/containers/storage/overlay/d2349e5ab6f0c58bbb439f29a0bd42d2e9003486a865d23a33fdc8940ba07c0f/merged\\\\\\\" at \\\\\\\"/config\\\\\\\" caused \\\\\\\"stat /var/lib/kubelet/pods/ab763d18-06eb-11ec-a51c-fa163e167285/volumes/kubernetes.io~configmap/config: no such file or directory\\\\\\\"\\\"\"\n"

AFAICT there is no must-gather corresponding to this run.
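The root cause is buried under several layers of backslash escaping in that CreateContainerError. One quick way to surface it is to strip the backslashes and pull out the innermost `stat ... no such file or directory` error; a sketch, using a trimmed copy of the message (the path is verbatim from the log, the surrounding text is shortened):

```python
import re

# Trimmed copy of the nested CreateContainerError; path verbatim from the log.
msg = (r'container create failed: rootfs_linux.go:58: mounting '
       r'\\\\\\\"/var/lib/kubelet/pods/ab763d18-06eb-11ec-a51c-fa163e167285/'
       r'volumes/kubernetes.io~configmap/config\\\\\\\" to rootfs caused '
       r'\\\\\\\"stat /var/lib/kubelet/pods/ab763d18-06eb-11ec-a51c-fa163e167285/'
       r'volumes/kubernetes.io~configmap/config: no such file or directory\\\\\\\"')

# Drop every layer of escaping at once, then grab the innermost stat error.
flat = msg.replace("\\", "")
root = re.search(r'stat [^"]*: no such file or directory', flat).group(0)
print(root)
```

This makes the failure readable at a glance: the configmap volume directory for the sdn pod did not exist when cri-o tried to bind-mount it into the container rootfs.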
In the earlier must-gather all of the sdn pods are fine, so (assuming it was even the same problem both times) the problem is not with Networking. Back to Node. (But is there really a diagnosable bug here at this point?)