Description of problem:
The s390x CI upgrade jobs (4.9 to 4.10) fail with the error "[sig-network] pods should successfully create sandboxes by other".

How reproducible:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-multiarch-master-nightly-4.10-upgrade-from-nightly-4.9-ocp-remote-libvirt-s390x/1480977914789892096

Steps to Reproduce:
https://search.ci.openshift.org/?search=%5C%5Bsig-network%5C%5D+pods+should+successfully+create+sandboxes+by+other&maxAge=48h&context=1&type=bug%2Bjunit&name=s390x&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Actual results:
The test fails.

Expected results:
The test should pass.
We are still observing job failures with this test.
This looks like the default network wasn't ready. Sending along to SDN team for triage. From the provided CI run:

```
ns/openshift-kube-scheduler pod/openshift-kube-scheduler-guard-libvirt-s390x-0-2-708-h989f-master-0 node/libvirt-s390x-0-2-708-h989f-master-0 - 828.97 seconds after deletion - reason/FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_openshift-kube-scheduler-guard-libvirt-s390x-0-2-708-h989f-master-0_openshift-kube-scheduler_8f7882c3-ff56-4cba-8624-d856020c678b_0(b7ded924f1a87d15bd0040c9d6f02fb559814800ab7a9ef9bba279d3c9192ff4): error adding pod openshift-kube-scheduler_openshift-kube-scheduler-guard-libvirt-s390x-0-2-708-h989f-master-0 to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network" failed (add): Multus: [openshift-kube-scheduler/openshift-kube-scheduler-guard-libvirt-s390x-0-2-708-h989f-master-0/8f7882c3-ff56-4cba-8624-d856020c678b]: have you checked that your default network is ready? still waiting for readinessindicatorfile @ /var/run/multus/cni/net.d/80-openshift-network.conf. pollimmediate error: timed out waiting for the condition
```
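The key part of that error is Multus waiting on the default network's readiness indicator file. As a rough illustration of what that wait looks like (a minimal Go sketch using wait.PollImmediate from k8s.io/apimachinery; this is not the actual Multus code, and the intervals are made up), the plugin polls for the file until it appears or the poll times out with exactly the "timed out waiting for the condition" error shown above:

```go
// Sketch only: poll for the default-network readiness indicator file the way
// the error message describes. The file path comes from the CI log above.
package main

import (
	"fmt"
	"os"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func main() {
	readinessFile := "/var/run/multus/cni/net.d/80-openshift-network.conf"

	// Illustrative intervals, not Multus' real defaults: check every second,
	// give up after 10 minutes.
	err := wait.PollImmediate(1*time.Second, 10*time.Minute, func() (bool, error) {
		_, statErr := os.Stat(readinessFile)
		if statErr == nil {
			return true, nil // config file exists -> default network is ready
		}
		if os.IsNotExist(statErr) {
			return false, nil // keep waiting
		}
		return false, statErr // unexpected error -> abort the poll
	})
	if err != nil {
		// This is where the "timed out waiting for the condition" error comes from.
		fmt.Printf("still waiting for readinessindicatorfile @ %s: %v\n", readinessFile, err)
		os.Exit(1)
	}
	fmt.Println("default network is ready")
}
```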
This is likely the same as (or just a duplicate of) https://bugzilla.redhat.com/show_bug.cgi?id=2038481. Notice that the pods in question are "guard" pods and how much time it reports (e.g., 828.97 seconds). Most likely these guard pods are getting drained as part of the node reboot during the upgrade, but they incorrectly come back to the node even though it's marked unschedulable. There was some work done recently to fix 2038481, and I notice this bug was filed almost a month ago. I looked at search.ci and I'm not sure I see this specific error any more. @suryadav, can you check whether this is still happening? If it isn't, you can just close this as a duplicate of 2038481. If it's still happening, I can try to help find the right team to get it resolved.
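For anyone re-triaging this, a quick way to confirm the guard-pod pattern described above is to check whether the node is cordoned while guard pods are still bound to it. The following is a hypothetical triage helper (it assumes client-go and a local kubeconfig; the node name is taken from the CI log above), not part of any fix:

```go
// Hypothetical triage sketch: report whether the node is unschedulable and
// list any guard pods still scheduled to it.
package main

import (
	"context"
	"fmt"
	"strings"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	nodeName := "libvirt-s390x-0-2-708-h989f-master-0" // node from the CI log

	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	node, err := client.CoreV1().Nodes().Get(context.TODO(), nodeName, metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Printf("node %s unschedulable=%v\n", nodeName, node.Spec.Unschedulable)

	// Guard pods that are still bound to this node; if the node is cordoned,
	// these are the pods the comment above is talking about.
	pods, err := client.CoreV1().Pods("").List(context.TODO(), metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + nodeName,
	})
	if err != nil {
		panic(err)
	}
	for _, p := range pods.Items {
		if strings.Contains(p.Name, "-guard-") {
			fmt.Printf("guard pod %s/%s phase=%s\n", p.Namespace, p.Name, p.Status.Phase)
		}
	}
}
```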
Closing this as a duplicate of 2038481. The search provided in the description will still turn up failures in this test case, which looks for failed-sandbox issues, but the handful I looked at just now were not caused by the guard pods that were fixed in 2038481. *** This bug has been marked as a duplicate of bug 2038481 ***