Bug 1788683
| Summary: | nodes fail to upgrade in 4.2->4.3 upgrade test | | |
| --- | --- | --- | --- |
| Product: | OpenShift Container Platform | Reporter: | Ben Parees <bparees> |
| Component: | Networking | Assignee: | Casey Callendrello <cdc> |
| Networking sub component: | openshift-sdn | QA Contact: | zhaozhanqi <zzhao> |
| Status: | CLOSED DUPLICATE | Docs Contact: | |
| Severity: | high | | |
| Priority: | unspecified | CC: | bbennett, ccoleman, kgarriso, smilner |
| Version: | 4.3.0 | | |
| Target Milestone: | --- | | |
| Target Release: | 4.3.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-01-07 23:25:59 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Ben Parees
2020-01-07 19:34:13 UTC
Culprit seems to be a service-test pod that couldn't be evicted and caused the node drain to essentially time out:

I0107 04:48:03.836869 85538 update.go:91] error when evicting pod "service-test-hhr9h" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
....
I0107 05:30:37.947683 85538 update.go:159] Draining failed with: error when evicting pod "service-test-hhr9h": pods "service-test-hhr9h" is forbidden: unable to create new content in namespace e2e-k8s-service-upgrade-9169 because it is being terminated, retrying
I0107 05:30:47.954956 85538 update.go:91] cordoned node "ip-10-0-156-53.ec2.internal"

The actual update does eventually start, but I believe it gets cut off (since the above took 40 minutes):

I0107 05:31:51.503021 174838 run.go:16] Running: podman pull -q --authfile /var/lib/kubelet/config.json registry.svc.ci.openshift.org/ocp/4.3-2020-01-06-150524@sha256:de991faee828834bb95440fec3c866f95af1e9c52946f69889cd675ab3d96e70
2020-01-07 05:31:51.664012344 +0000 UTC m=+0.054593980 system refresh

See: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.2-to-4.3/327/artifacts/e2e-aws-upgrade/pods/openshift-machine-config-operator_machine-config-daemon-t77tf_machine-config-daemon.log

Can confirm that, other than the unevictable pod, the logs of mco/mcc/mcd/mcs are fine. Some notes (I'm a bit unfamiliar with the test): there seems to be a problem with "service-test-hhr9h" before the upgrade. Readiness probes were failing consistently before the MCD tried to kick off the upgrade. Basically, the readiness probe starts failing around 4:35 due to a "Missing CNI default network" error (excerpts below; a quick way to re-check the pod's Ready condition is sketched after the first excerpt):

aliases: [k8s_POD_service-test-hhr9h_e2e-k8s-service-upgrade-9169_78425dc5-3104-11ea-9d03-12a298d70101_0 3d0728275defd92e909599a689681447f537522f85b268a982f8329a7dfae213], namespace: "crio"
Jan 07 04:34:59 ip-10-0-156-53 hyperkube[1988]: E0107 04:34:59.284516 1988 pod_workers.go:190] Error syncing pod 78425dc5-3104-11ea-9d03-12a298d70101 ("service-test-hhr9h_e2e-k8s-service-upgrade-9169(78425dc5-3104-11ea-9d03-12a298d70101)"), skipping: network is not ready: runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: Missing CNI default network
Jan 07 04:34:59 ip-10-0-156-53 hyperkube[1988]: I0107 04:34:59.284545 1988 event.go:209] Event(v1.ObjectReference{Kind:"Pod", Namespace:"e2e-k8s-service-upgrade-9169", Name:"service-test-hhr9h", UID:"78425dc5-3104-11ea-9d03-12a298d70101", APIVersion:"v1", ResourceVersion:"16394", FieldPath:""}): type: 'Warning' reason: 'NetworkNotReady' network is not ready: runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: Missing CNI default network
...
Jan 07 04:35:01 ip-10-0-156-53 hyperkube[1988]: I0107 04:35:01.769634 1988 prober.go:112] Readiness probe for "service-test-hhr9h_e2e-k8s-service-upgrade-9169(78425dc5-3104-11ea-9d03-12a298d70101):netexec" failed (failure): Get http://10.129.2.13:80/hostName: dial tcp 10.129.2.13:80: connect: no route to host
...
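For reference, a minimal sketch of pulling the Ready condition the kubelet reports for this pod, assuming cluster access and the Python `kubernetes` client; the namespace and pod name are the ones from the excerpt above:

```python
# Minimal sketch: read the Ready condition for the stuck service-test pod.
# Assumes the Python "kubernetes" client and a kubeconfig for the cluster
# under test; namespace/pod name are taken from the log excerpt above.
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when run in-cluster
core = client.CoreV1Api()

pod = core.read_namespaced_pod(
    name="service-test-hhr9h",
    namespace="e2e-k8s-service-upgrade-9169",
)

for cond in pod.status.conditions or []:
    if cond.type == "Ready":
        # A pod whose sandbox never got networking ("Missing CNI default
        # network") stays Ready=False until the SDN recovers.
        print(cond.status, cond.reason, cond.message)
```

The probe-failure excerpt continues below.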
Jan 07 04:35:05 ip-10-0-156-53 hyperkube[1988]: I0107 04:35:05.637950 1988 prober.go:112] Readiness probe for "service-test-hhr9h_e2e-k8s-service-upgrade-9169(78425dc5-3104-11ea-9d03-12a298d70101):netexec" failed (failure): Get http://10.129.2.13:80/hostName: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Jan 07 04:35:05 ip-10-0-156-53 hyperkube[1988]: I0107 04:35:05.638068 1988 event.go:209] Event(v1.ObjectReference{Kind:"Pod", Namespace:"e2e-k8s-service-upgrade-9169", Name:"service-test-hhr9h", UID:"78425dc5-3104-11ea-9d03-12a298d70101", APIVersion:"v1", ResourceVersion:"16394", FieldPath:"spec.containers{netexec}"}): type: 'Warning' reason: 'Unhealthy' Readiness probe failed: Get http://10.129.2.13:80/hostName: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

This is all well before the 4:50 attempts to evict. Not 100% sure of what's happening in the tests around then, but this never recovers... Also seeing (unsure of how this fits in):

Jan 07 04:35:11 ip-10-0-156-53 hyperkube[1988]: I0107 04:35:11.562245 1988 volume_manager.go:355] Waiting for volumes to attach and mount for pod "dns-default-sd2gz_openshift-dns(5352f7f8-3103-11ea-9c8f-0a9a323322dd)"
Jan 07 04:35:11 ip-10-0-156-53 hyperkube[1988]: I0107 04:35:11.562307 1988 volume_manager.go:388] All volumes are attached and mounted for pod "dns-default-sd2gz_openshift-dns(5352f7f8-3103-11ea-9c8f-0a9a323322dd)"
Jan 07 04:35:11 ip-10-0-156-53 hyperkube[1988]: I0107 04:35:11.562450 1988 kuberuntime_manager.go:600] Container "dns" ({"cri-o" "81495caa148e3ce447c261d35e4301818345c6e09dd5693b7f11049c92f583b2"}) of pod dns-default-sd2gz_openshift-dns(5352f7f8-3103-11ea-9c8f-0a9a323322dd): Container dns failed liveness probe, will be restarted
Jan 07 04:35:11 ip-10-0-156-53 hyperkube[1988]: I0107 04:35:11.562523 1988 kuberuntime_manager.go:621] computePodActions got {KillPod:false CreateSandbox:false SandboxID:a21a3dae8490d6e58d5b27956010f65b1e3c23a946828913bf5fbbce7f6a794e Attempt:0 NextInitContainerToStart:nil ContainersToStart:[0] ContainersToKill:map[{Type:cri-o ID:81495caa148e3ce447c261d35e4301818345c6e09dd5693b7f11049c92f583b2}:{container:0xc00295ec80 name:dns message:Container dns failed liveness probe, will be restarted}]} for pod "dns-default-sd2gz_openshift-dns(5352f7f8-3103-11ea-9c8f-0a9a323322dd)"
Jan 07 04:35:11 ip-10-0-156-53 hyperkube[1988]: I0107 04:35:11.562574 1988 kuberuntime_manager.go:655] Killing unwanted container "dns"(id={"cri-o" "81495caa148e3ce447c261d35e4301818345c6e09dd5693b7f11049c92f583b2"}) for pod "dns-default-sd2gz_openshift-dns(5352f7f8-3103-11ea-9c8f-0a9a323322dd)"
Jan 07 04:35:11 ip-10-0-156-53 hyperkube[1988]: I0107 04:35:11.562622 1988 kuberuntime_container.go:581] Killing container "cri-o://81495caa148e3ce447c261d35e4301818345c6e09dd5693b7f11049c92f583b2" with 30 second grace period
Jan 07 04:35:11 ip-10-0-156-53 hyperkube[1988]: I0107 04:35:11.562687 1988 event.go:209] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-dns", Name:"dns-default-sd2gz", UID:"5352f7f8-3103-11ea-9c8f-0a9a323322dd", APIVersion:"v1", ResourceVersion:"8083", FieldPath:"spec.containers{dns}"}): type: 'Normal' reason: 'Killing' Container dns failed liveness probe, will be restarted

Networking seems the appropriate team to check this out, as it doesn't appear to be an MCO problem. Perhaps they can dig a bit on this? (For reference, a sketch of the eviction call the MCD was retrying follows the duplicate note below.)

*** This bug has been marked as a duplicate of bug 1787581 ***
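The drain loop quoted at the top of this bug is the MCD retrying the pod eviction subresource while the API server refuses the request because of the service-test PodDisruptionBudget. A minimal sketch of that interaction, assuming the Python `kubernetes` client (the MCD itself uses the Go drain helpers; this only illustrates how the PDB error surfaces and why the daemon keeps retrying every 5s):

```python
# Minimal sketch of an eviction blocked by a PodDisruptionBudget.
# Assumes the Python "kubernetes" client; pod/namespace names are the ones
# from the MCD log excerpt above. The MCD's real code path is Go (the drain
# helpers behind update.go), so this only illustrates the API behaviour.
import time

from kubernetes import client, config
from kubernetes.client.rest import ApiException

config.load_kube_config()
core = client.CoreV1Api()

eviction = client.V1Eviction(  # named V1beta1Eviction on older client releases
    metadata=client.V1ObjectMeta(
        name="service-test-hhr9h",
        namespace="e2e-k8s-service-upgrade-9169",
    )
)

while True:
    try:
        core.create_namespaced_pod_eviction(
            name="service-test-hhr9h",
            namespace="e2e-k8s-service-upgrade-9169",
            body=eviction,
        )
        break  # eviction accepted, drain can move on to the next pod
    except ApiException as exc:
        if exc.status == 429:
            # The API server answers 429 with "Cannot evict pod as it would
            # violate the pod's disruption budget." -- the same message the
            # update.go:91 lines show -- so the drain sleeps and retries.
            time.sleep(5)
            continue
        raise
```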