Bug 1788683

Summary: nodes fail to upgrade in 4.2->4.3 upgrade test
Product: OpenShift Container Platform
Reporter: Ben Parees <bparees>
Component: Networking
Sub component: openshift-sdn
Assignee: Casey Callendrello <cdc>
QA Contact: zhaozhanqi <zzhao>
Status: CLOSED DUPLICATE
Severity: high
Priority: unspecified
CC: bbennett, ccoleman, kgarriso, smilner
Version: 4.3.0
Target Milestone: ---
Target Release: 4.3.0
Hardware: Unspecified
OS: Unspecified
Type: Bug
Regression: ---
Last Closed: 2020-01-07 23:25:59 UTC

Description Ben Parees 2020-01-07 19:34:13 UTC
Description of problem:
Upgrade test failed due to worker pools failing to complete their upgrade.

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.2-to-4.3/327

discussion/investigation:
https://coreos.slack.com/archives/CEKNRGF25/p1578421898113000



Possibly related: this job failed due to a master failing to upgrade, and it's possible the master failed for the same reason the workers failed in the job above.

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.2-to-4.3/324


This is not failing consistently (the job has failed twice out of the last 20 runs).

Comment 3 Kirsten Garrison 2020-01-07 19:52:56 UTC
The culprit seems to be a service-test pod that couldn't be evicted, which caused the drain to essentially time out:

I0107 04:48:03.836869   85538 update.go:91] error when evicting pod "service-test-hhr9h" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
....
I0107 05:30:37.947683   85538 update.go:159] Draining failed with: error when evicting pod "service-test-hhr9h": pods "service-test-hhr9h" is forbidden: unable to create new content in namespace e2e-k8s-service-upgrade-9169 because it is being terminated, retrying
I0107 05:30:47.954956   85538 update.go:91] cordoned node "ip-10-0-156-53.ec2.internal"

The actual update does eventually start, but I believe it gets cut off, since the eviction retries above took about 40 minutes:
I0107 05:31:51.503021  174838 run.go:16] Running: podman pull -q --authfile /var/lib/kubelet/config.json registry.svc.ci.openshift.org/ocp/4.3-2020-01-06-150524@sha256:de991faee828834bb95440fec3c866f95af1e9c52946f69889cd675ab3d96e70
2020-01-07 05:31:51.664012344 +0000 UTC m=+0.054593980 system refresh

See:
https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.2-to-4.3/327/artifacts/e2e-aws-upgrade/pods/openshift-machine-config-operator_machine-config-daemon-t77tf_machine-config-daemon.log
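
For context (my illustration, not code from the MCD or from this bug): the log above is the standard drain/eviction retry pattern. An Eviction request that would drop a PodDisruptionBudget below its allowed minimum is rejected by the API server with HTTP 429, and the drainer just keeps retrying. A minimal Go sketch of that loop, assuming client-go's EvictV1 helper; the 5s retry interval and the log wording mirror the MCD output above, everything else is illustrative:

package drainsketch

import (
	"context"
	"log"
	"time"

	policyv1 "k8s.io/api/policy/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// evictWithRetry asks the API server to evict one pod and retries while the
// request is rejected because it would violate the pod's PodDisruptionBudget.
// A PDB violation comes back as 429 TooManyRequests; any other error is fatal.
func evictWithRetry(ctx context.Context, client kubernetes.Interface, namespace, name string) error {
	eviction := &policyv1.Eviction{
		ObjectMeta: metav1.ObjectMeta{Namespace: namespace, Name: name},
	}
	for {
		err := client.CoreV1().Pods(namespace).EvictV1(ctx, eviction)
		if err == nil {
			return nil // evicted; the drain moves on to the next pod
		}
		if !apierrors.IsTooManyRequests(err) {
			return err
		}
		log.Printf("error when evicting pod %q (will retry after 5s): %v", name, err)
		select {
		case <-ctx.Done():
			return ctx.Err() // whatever overall drain timeout applies fires here
		case <-time.After(5 * time.Second):
		}
	}
}

The ctx deadline in the sketch stands in for whatever overall drain timeout the MCD uses; the log shows this retry loop running for roughly 40 minutes before the node was finally cordoned again.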

Comment 4 Kirsten Garrison 2020-01-07 19:55:27 UTC
Can confirm that, other than the unevictable pod, the mco/mcc/mcd/mcs logs are fine.

Comment 5 Kirsten Garrison 2020-01-07 20:27:54 UTC
Some notes (I'm a bit unfamiliar with the test). There seems to be a problem with "service-test-hhr9h" before the upgrade: its readiness probes were failing consistently before the MCD tried to kick off the upgrade. The readiness probe starts failing around 04:35 due to a "Missing CNI default network":

aliases: [k8s_POD_service-test-hhr9h_e2e-k8s-service-upgrade-9169_78425dc5-3104-11ea-9d03-12a298d70101_0 3d0728275defd92e909599a689681447f537522f85b268a982f8329a7dfae213], namespace: "crio"

Jan 07 04:34:59 ip-10-0-156-53 hyperkube[1988]: E0107 04:34:59.284516    1988 pod_workers.go:190] Error syncing pod 78425dc5-3104-11ea-9d03-12a298d70101 ("service-test-hhr9h_e2e-k8s-service-upgrade-9169(78425dc5-3104-11ea-9d03-12a298d70101)"), skipping: network is not ready: runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: Missing CNI default network
Jan 07 04:34:59 ip-10-0-156-53 hyperkube[1988]: I0107 04:34:59.284545    1988 event.go:209] Event(v1.ObjectReference{Kind:"Pod", Namespace:"e2e-k8s-service-upgrade-9169", Name:"service-test-hhr9h", UID:"78425dc5-3104-11ea-9d03-12a298d70101", APIVersion:"v1", ResourceVersion:"16394", FieldPath:""}): type: 'Warning' reason: 'NetworkNotReady' network is not ready: runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: Missing CNI default network
...
Jan 07 04:35:01 ip-10-0-156-53 hyperkube[1988]: I0107 04:35:01.769634    1988 prober.go:112] Readiness probe for "service-test-hhr9h_e2e-k8s-service-upgrade-9169(78425dc5-3104-11ea-9d03-12a298d70101):netexec" failed (failure): Get http://10.129.2.13:80/hostName: dial tcp 10.129.2.13:80: connect: no route to host
...
Jan 07 04:35:05 ip-10-0-156-53 hyperkube[1988]: I0107 04:35:05.637950    1988 prober.go:112] Readiness probe for "service-test-hhr9h_e2e-k8s-service-upgrade-9169(78425dc5-3104-11ea-9d03-12a298d70101):netexec" failed (failure): Get http://10.129.2.13:80/hostName: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Jan 07 04:35:05 ip-10-0-156-53 hyperkube[1988]: I0107 04:35:05.638068    1988 event.go:209] Event(v1.ObjectReference{Kind:"Pod", Namespace:"e2e-k8s-service-upgrade-9169", Name:"service-test-hhr9h", UID:"78425dc5-3104-11ea-9d03-12a298d70101", APIVersion:"v1", ResourceVersion:"16394", FieldPath:"spec.containers{netexec}"}): type: 'Warning' reason: 'Unhealthy' Readiness probe failed: Get http://10.129.2.13:80/hostName: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)


This is all well before the 4:50 attempts to evict. I'm not 100% sure what the tests are doing around then, but the pod never recovers...
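
To make the probe failures above concrete (my sketch, not kubelet source or test code): the kubelet's HTTP readiness probe for the netexec container is essentially a GET against /hostName on the pod IP with a short timeout. With the pod's CNI network gone, the dial fails with "no route to host" or times out, so the pod stays NotReady. The IP, port, and path come from the log; the 1s timeout matches the kubelet default:

package probesketch

import (
	"fmt"
	"net/http"
	"time"
)

// probeOnce mimics a single HTTP readiness probe: GET http://<podIP>:80/hostName
// with a 1s timeout. The kubelet treats a 2xx/3xx response as ready; a dial error
// ("no route to host") or a client timeout, as in the log, marks the pod unready.
func probeOnce(podIP string) error {
	client := &http.Client{Timeout: 1 * time.Second}
	resp, err := client.Get(fmt.Sprintf("http://%s:80/hostName", podIP))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode < 200 || resp.StatusCode >= 400 {
		return fmt.Errorf("readiness probe failed with status %d", resp.StatusCode)
	}
	return nil
}

If the test's PodDisruptionBudget counts this pod as unhealthy, its allowed disruptions presumably drop to zero, which would explain why the eviction in comment 3 was never permitted.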

Comment 6 Kirsten Garrison 2020-01-07 20:34:26 UTC
Also seeing (unsure of how this fits in):
Jan 07 04:35:11 ip-10-0-156-53 hyperkube[1988]: I0107 04:35:11.562245    1988 volume_manager.go:355] Waiting for volumes to attach and mount for pod "dns-default-sd2gz_openshift-dns(5352f7f8-3103-11ea-9c8f-0a9a323322dd)"
Jan 07 04:35:11 ip-10-0-156-53 hyperkube[1988]: I0107 04:35:11.562307    1988 volume_manager.go:388] All volumes are attached and mounted for pod "dns-default-sd2gz_openshift-dns(5352f7f8-3103-11ea-9c8f-0a9a323322dd)"
Jan 07 04:35:11 ip-10-0-156-53 hyperkube[1988]: I0107 04:35:11.562450    1988 kuberuntime_manager.go:600] Container "dns" ({"cri-o" "81495caa148e3ce447c261d35e4301818345c6e09dd5693b7f11049c92f583b2"}) of pod dns-default-sd2gz_openshift-dns(5352f7f8-3103-11ea-9c8f-0a9a323322dd): Container dns failed liveness probe, will be restarted
Jan 07 04:35:11 ip-10-0-156-53 hyperkube[1988]: I0107 04:35:11.562523    1988 kuberuntime_manager.go:621] computePodActions got {KillPod:false CreateSandbox:false SandboxID:a21a3dae8490d6e58d5b27956010f65b1e3c23a946828913bf5fbbce7f6a794e Attempt:0 NextInitContainerToStart:nil ContainersToStart:[0] ContainersToKill:map[{Type:cri-o ID:81495caa148e3ce447c261d35e4301818345c6e09dd5693b7f11049c92f583b2}:{container:0xc00295ec80 name:dns message:Container dns failed liveness probe, will be restarted}]} for pod "dns-default-sd2gz_openshift-dns(5352f7f8-3103-11ea-9c8f-0a9a323322dd)"
Jan 07 04:35:11 ip-10-0-156-53 hyperkube[1988]: I0107 04:35:11.562574    1988 kuberuntime_manager.go:655] Killing unwanted container "dns"(id={"cri-o" "81495caa148e3ce447c261d35e4301818345c6e09dd5693b7f11049c92f583b2"}) for pod "dns-default-sd2gz_openshift-dns(5352f7f8-3103-11ea-9c8f-0a9a323322dd)"
Jan 07 04:35:11 ip-10-0-156-53 hyperkube[1988]: I0107 04:35:11.562622    1988 kuberuntime_container.go:581] Killing container "cri-o://81495caa148e3ce447c261d35e4301818345c6e09dd5693b7f11049c92f583b2" with 30 second grace period
Jan 07 04:35:11 ip-10-0-156-53 hyperkube[1988]: I0107 04:35:11.562687    1988 event.go:209] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-dns", Name:"dns-default-sd2gz", UID:"5352f7f8-3103-11ea-9c8f-0a9a323322dd", APIVersion:"v1", ResourceVersion:"8083", FieldPath:"spec.containers{dns}"}): type: 'Normal' reason: 'Killing' Container dns failed liveness probe, will be restarted

Comment 7 Kirsten Garrison 2020-01-07 21:11:17 UTC
Seems like Networking is the appropriate team to check this out, as it doesn't seem to be an MCO problem. Perhaps they can dig a bit into this?

Comment 8 Ben Bennett 2020-01-07 23:25:59 UTC

*** This bug has been marked as a duplicate of bug 1787581 ***