
Bug 1822865

Summary:	stuck kubelet
Product:	OpenShift Container Platform	Reporter:	Lukasz Szaszkiewicz <lszaszki>
Component:	RHCOS	Assignee:	Micah Abbott <miabbott>
Status:	CLOSED CANTFIX	QA Contact:	Michael Nguyen <mnguyen>
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	4.4	CC:	aos-bugs, bbreard, cglombek, fedoraproject, imcleod, jligon, jokerman, nstielau, smilner
Target Milestone:	---
Target Release:	4.4.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2020-04-14 20:14:01 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Lukasz Szaszkiewicz 2020-04-10 07:59:33 UTC

Description of problem:

While investigating https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.4/658
I realised that a significant number of tests failed while waiting for a pod to become ready.
All of these pods were scheduled on the same machine (10-0-143-25).

Interestingly, I found no trace of these pods in the log file from that machine, as if they had never even been considered for running.
There were also no errors that would indicate a broken connection to the API server.
Additionally, the kubelet's main loop stopped logging.
Maybe it's a matter of the log level, but the last SyncLoop entry on that machine was at 03:13:51.


Expected results:

This warrants further investigation to find out why the kubelet didn't run these pods.

Additional info:

The list of tests that failed on waiting for a pod (incomplete):

- [sig-storage] In-tree Volumes [Driver: local][LocalVolumeType: dir-bindmounted] [Testpattern: Pre-provisioned PV (default fs)] subPath [Top Level] [sig-storage] In-tree Volumes [Driver: local][LocalVolumeType: dir-bindmounted] [Testpattern: Pre-provisioned PV (default fs)] subPath should support existing directories when readOnly specified in the volumeSource [Suite:openshift/conformance/parallel] [Suite:k8s]

- [Area:Networking] services basic functionality [Top Level] [Area:Networking] services basic functionality should allow connections to another pod on a different node via a service IP [Suite:openshift/conformance/parallel]

- [sig-network] Networking Granular Checks: Services [Top Level] [sig-network] Networking Granular Checks: Services should function for node-Service: udp [Suite:openshift/conformance/parallel] [Suite:k8s] 

The last SyncLoop entry on that machine: 
- Mar 25 03:13:51.347987 ip-10-0-143-25 hyperkube[1352]: I0325 03:13:51.332226    1352 kubelet.go:1913] SyncLoop (ADD, "api"): "webserver-99r9t_e2e-k8s-nettest-8443(91d79e1a-dcd0-402d-95b2-6deee8e0ebcd)"
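
For anyone repeating this check on another node's journal, the search for the last SyncLoop entry can be sketched roughly as below. This is only a sketch: the regular expression is inferred from the single log line quoted above (journald prefix followed by a klog header), and the function name last_syncloop_entry is hypothetical, not part of any kubelet or OpenShift tooling.

```python
import re

# Layout assumed from the one line quoted above:
#   <journald timestamp> <host> hyperkube[<pid>]: <klog header> <file:line>] <message>
LINE_RE = re.compile(
    r"(?P<ts>\w{3} +\d+ \d{2}:\d{2}:\d{2}\.\d+) "       # journald timestamp
    r"(?P<host>\S+) hyperkube\[\d+\]: "                 # host and process
    r"[IWEF]\d{4} \d{2}:\d{2}:\d{2}\.\d+\s+\d+ "        # klog severity, time, pid
    r"(?P<src>\S+:\d+)\] (?P<msg>.*)$"                  # source location and message
)


def last_syncloop_entry(lines):
    """Return the last line whose message starts with "SyncLoop", as a dict.

    Returns None if no line matches. (Hypothetical helper for illustration.)
    """
    last = None
    for line in lines:
        m = LINE_RE.search(line.rstrip("\n"))
        if m and m.group("msg").startswith("SyncLoop"):
            last = m.groupdict()
    return last
```

Passing the lines of a node journal (for example an open file object) returns the timestamp, host, source location, and message of the final SyncLoop entry, which can then be compared against the time the failing pods were scheduled.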

Comment 1 Ryan Phillips 2020-04-14 15:56:35 UTC

In the master log:

Mar 25 03:57:23.009597 ip-10-0-148-71 zincati[224827]: Error: Error("missing field `coreos-assembler.basearch`", line: 10, column: 7)
Mar 25 03:57:23.009597 ip-10-0-148-71 zincati[224827]: failed to introspect OS base architecture
Mar 25 03:57:23.009597 ip-10-0-148-71 zincati[224827]: failed to build default identity
Mar 25 03:57:23.009597 ip-10-0-148-71 zincati[224827]: failed to validate agent identity configuration
Mar 25 03:57:23.009597 ip-10-0-148-71 zincati[224827]: failed to assemble configuration settings

I think zincati is RHCOS-related, so I'm moving this BZ over.

Comment 3 Steve Milner 2020-04-14 20:14:01 UTC

Confirmed that this is not an RHCOS issue, but it is present in FCOS. The bug is noted in https://github.com/coreos/fedora-coreos-tracker/issues/392, which also has a workaround for the time being. Since this is already tracked in the Fedora CoreOS namespace, I'm going to close this bug, as it is filed against RHCOS/OCP.