Bug 1995911
Summary: | "co/kube-apiserver is still in Progressing" is seen in upgrade CI | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Rahul Gangwar <rgangwar>
Component: | Node | Assignee: | Harshal Patil <harpatil>
Node sub component: | Kubelet | QA Contact: | Sunil Choudhary <schoudha>
Status: | CLOSED WONTFIX | Docs Contact: |
Severity: | high | |
Priority: | high | CC: | aos-bugs, cblecker, danw, harpatil, jokerman, jupierce, kewang, mfojtik, nagrawal, rgangwar, sanchezl, schoudha, sttts, travi, vpickard, wking, xxia
Version: | 4.7 | Keywords: | Reopened
Target Milestone: | --- | |
Target Release: | --- | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | 1909600 | Environment: |
Last Closed: | 2021-09-16 05:02:43 UTC | Type: | ---
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Rahul Gangwar
2021-08-20 06:50:28 UTC
The installer pod for revision 17 failed because the kubelet was not ready:

    NodeInstallerDegraded: 1 nodes are failing on revision 17:
    NodeInstallerDegraded: no detailed termination message, see `oc get -n "openshift-kube-apiserver" pods/"installer-17-leap02054839-tf2qv-control-plane-0" -oyaml`
    NodeControllerDegraded: The master nodes not ready: node "leap02054839-tf2qv-control-plane-2" not ready since 2021-08-02 05:07:52 +0000 UTC because KubeletNotReady ([container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful])

In 4.8 we have an installer retry. In 4.7 we don't, and the backport is not feasible. Moving to Node to understand why the pod failed. This is the 4.8 change: https://github.com/openshift/library-go/pull/979

@Harshal: Can you please check this log and see whether it is useful?

    find . -name kubelet_service.log
    ./quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-040f48c020420ff93b227216469f6c2971cf10fac2b0b52ea9853e88ec1964a6/host_service_logs/masters/kubelet_service.log

@Harshal: We are re-running the upgrade job. If we are able to replicate the error we can share the journal/system logs; until then, please don't close the bug.

Please find the attached journal logs: https://drive.google.com/file/d/1iGLAIp-0LRjXhkzQdbSZ81C7ivgK6ZRZ/view?usp=sharing

@harshal: Any update on this? It has been pending for a long time.

The network is unavailable because the sdn pod failed to start, because kubelet or cri-o was apparently unable to set it up correctly:

    Aug 27 01:32:16 rgangwar27114550-xlbt5-control-plane-1 hyperkube[2577]: E0827 05:32:16.921366 2577 pod_workers.go:190] Error syncing pod ab763d18-06eb-11ec-a51c-fa163e167285 ("sdn-q6v8g_openshift-sdn(ab763d18-06eb-11ec-a51c-fa163e167285)"), skipping: failed to "StartContainer" for "sdn" with CreateContainerError: "container create failed:
    time="2021-08-27T05:32:16Z" level=warning msg="exit status 1"
    time="2021-08-27T05:32:16Z" level=error msg="container_linux.go:349: starting container process caused
      process_linux.go:449: container init caused
      rootfs_linux.go:58: mounting "/var/lib/kubelet/pods/ab763d18-06eb-11ec-a51c-fa163e167285/volumes/kubernetes.io~configmap/config" to rootfs "/var/lib/containers/storage/overlay/d2349e5ab6f0c58bbb439f29a0bd42d2e9003486a865d23a33fdc8940ba07c0f/merged" at "/config" caused
      stat /var/lib/kubelet/pods/ab763d18-06eb-11ec-a51c-fa163e167285/volumes/kubernetes.io~configmap/config: no such file or directory"

AFAICT there is no must-gather corresponding to this run. In the earlier must-gather all of the sdn pods are fine, so (assuming it was even the same problem both times) the problem is not with Networking. Back to Node. (But is there really a diagnosable bug here at this point?)
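For reference, the state behind the NodeInstallerDegraded/NodeControllerDegraded conditions can be inspected with standard `oc` commands. This is a minimal sketch, assuming cluster-admin access to the affected 4.7 cluster while it is still up; the pod and node names are copied from the degraded-condition message above and would differ on another run.

```shell
# Installer pod status and termination message referenced by the
# NodeInstallerDegraded condition (pod name from the message above).
oc get -n openshift-kube-apiserver \
  pods/installer-17-leap02054839-tf2qv-control-plane-0 -o yaml

# Which masters are NotReady, and since when.
oc get nodes -l node-role.kubernetes.io/master

# PLEG / container-runtime readiness messages in the kubelet journal on the
# NotReady master (node name from the NodeControllerDegraded message).
oc debug node/leap02054839-tf2qv-control-plane-2 -- chroot /host \
  journalctl -u kubelet --no-pager | grep -iE 'PLEG|KubeletNotReady|runtime'
```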
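The CreateContainerError quoted above points at a configmap volume directory that cri-o could not stat. A rough sketch of how one might cross-check whether the kubelet ever materialized that volume, assuming the cluster from the 2021-08-27 run is still reachable; the node name and pod UID are taken from the kubelet log line above.

```shell
# Does the configmap volume directory from the error actually exist on the node?
oc debug node/rgangwar27114550-xlbt5-control-plane-1 -- chroot /host \
  ls -l /var/lib/kubelet/pods/ab763d18-06eb-11ec-a51c-fa163e167285/volumes/kubernetes.io~configmap/

# Compare with the volumes the sdn pod is expected to mount.
oc get pod sdn-q6v8g -n openshift-sdn -o yaml

# Recent container-creation failures reported by cri-o on the same node.
oc debug node/rgangwar27114550-xlbt5-control-plane-1 -- chroot /host \
  journalctl -u crio --no-pager | grep -i 'container create failed' | tail -n 20
```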