Bug 1948066 - 4.7 to 4.8 update CI failing: pods should successfully create sandboxes by other
Summary: 4.7 to 4.8 update CI failing: pods should successfully create sandboxes by other
Keywords:
Status: CLOSED DUPLICATE of bug 1972490
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Douglas Smith
QA Contact: Weibin Liang
URL:
Whiteboard:
Depends On:
Blocks: 1972490
 
Reported: 2021-04-09 22:26 UTC by W. Trevor King
Modified: 2021-10-27 18:03 UTC
CC List: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1972490 (view as bug list)
Environment:
Last Closed: 2021-06-18 13:42:29 UTC
Target Upstream Version:
Embargoed:



Description W. Trevor King 2021-04-09 22:26:26 UTC
4.7 -> 4.8 CI has been red for a long time now [1].  One leading contributor is:

  [sig-network] pods should successfully create sandboxes by other

with:

  [sig-network] pods should successfully create sandboxes by getting pod

also contributing.  There are also API-server alert issues, but those are probably orthogonal.

I had been expecting bug 1908378 to be the underlying issue, but it has been VERIFIED for a while now, the update issues persist, and Elana has clearly scoped that bug to static pods [2].  Picking on a recent job [3]:

  [sig-network] pods should successfully create sandboxes by other	0s
  4 failures to create the sandbox

  ns/openshift-multus pod/network-metrics-daemon-qdd59 node/ip-10-0-176-46.us-west-2.compute.internal - never deleted - reason/FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_network-metrics-daemon-qdd59_openshift-multus_3cd19768-6f58-45bb-860c-7ca5f89e72bd_0(50d13a74e1d486564f00b2fc458a18e5e502846a1b3528e421e94280a2ad2238): [openshift-multus/network-metrics-daemon-qdd59:openshift-sdn]: error adding container to network "openshift-sdn": failed to find plugin "openshift-sdn" in path [/opt/multus/bin /var/lib/cni/bin /usr/libexec/cni]
ns/openshift-network-diagnostics pod/network-check-target-lc8sm node/ip-10-0-156-130.us-west-2.compute.internal - never deleted - reason/FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_network-check-target-lc8sm_openshift-network-diagnostics_e78b6504-e78e-4588-8d46-36fe1aa65ded_0(e77da0277e4fbd1a3d46eaf8ec4895c64aaad45a2e23400f933fb4eae28b0396): [openshift-network-diagnostics/network-check-target-lc8sm:openshift-sdn]: error adding container to network "openshift-sdn": CNI request failed with status 400: 'Get "https://api-int.ci-op-zbk23sg5-7dd68.origin-ci-int-aws.dev.rhcloud.com:6443/api/v1/namespaces/openshift-network-diagnostics/pods/network-check-target-lc8sm": dial tcp 10.0.200.16:6443: connect: connection refused
  '
  ns/openshift-network-diagnostics pod/network-check-target-lc8sm node/ip-10-0-156-130.us-west-2.compute.internal - never deleted - reason/FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_network-check-target-lc8sm_openshift-network-diagnostics_e78b6504-e78e-4588-8d46-36fe1aa65ded_0(d880c73f032d9cd54421dddd271a359501dc5d78183ff0fe42ca903a6960ecb0): Multus: [openshift-network-diagnostics/network-check-target-lc8sm]: error getting pod: Get "https://[api-int.ci-op-zbk23sg5-7dd68.origin-ci-int-aws.dev.rhcloud.com]:6443/api/v1/namespaces/openshift-network-diagnostics/pods/network-check-target-lc8sm?timeout=1m0s": dial tcp 10.0.200.16:6443: connect: connection refused
ns/e2e-k8s-sig-apps-daemonset-upgrade-480 pod/ds1-52nj5 node/ip-10-0-146-52.us-west-2.compute.internal - never deleted - reason/FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_ds1-52nj5_e2e-k8s-sig-apps-daemonset-upgrade-480_16b8af98-5804-4d28-9d51-83a3e85b1885_0(d8ed34679aba0696456ee3f58888d28a5c8dfc143d1323f8e147d87fc7c6d9e7): Multus: [e2e-k8s-sig-apps-daemonset-upgrade-480/ds1-52nj5]: error getting pod: Get "https://[api-int.ci-op-zbk23sg5-7dd68.origin-ci-int-aws.dev.rhcloud.com]:6443/api/v1/namespaces/e2e-k8s-sig-apps-daemonset-upgrade-480/pods/ds1-52nj5?timeout=1m0s": dial tcp 10.0.147.9:6443: connect: connection refused

and:

  [sig-network] pods should successfully create sandboxes by getting pod	0s
  1 failures to create the sandbox

  ns/openshift-network-diagnostics pod/network-check-target-vnj96 node/ip-10-0-204-38.us-west-2.compute.internal - never deleted - reason/FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_network-check-target-vnj96_openshift-network-diagnostics_bce7e215-af5e-4a2a-9106-584a15015f10_0(80916235c75a63456346be7c2a297a7ab7ad86157d404f269cda41b3fde360fd): Multus: [openshift-network-diagnostics/network-check-target-vnj96]: error getting pod: pods "network-check-target-vnj96" is forbidden: User "system:serviceaccount:openshift-multus:multus" cannot get resource "pods" in API group "" in the namespace "openshift-network-diagnostics": RBAC: [clusterrole.rbac.authorization.k8s.io "multus" not found, clusterrole.rbac.authorization.k8s.io "system:basic-user" not found, clusterrole.rbac.authorization.k8s.io "system:build-strategy-jenkinspipeline" not found, clusterrole.rbac.authorization.k8s.io "system:oauth-token-deleter" not found, clusterrole.rbac.authorization.k8s.io "system:build-strategy-docker" not found, clusterrole.rbac.authorization.k8s.io "system:service-account-issuer-discovery" not found, clusterrole.rbac.authorization.k8s.io "self-access-reviewer" not found, clusterrole.rbac.authorization.k8s.io "system:scope-impersonation" not found, clusterrole.rbac.authorization.k8s.io "system:openshift:public-info-viewer" not found, clusterrole.rbac.authorization.k8s.io "helm-chartrepos-viewer" not found, clusterrole.rbac.authorization.k8s.io "whereabouts-cni" not found, clusterrole.rbac.authorization.k8s.io "system:openshift:discovery" not found, clusterrole.rbac.authorization.k8s.io "basic-user" not found, clusterrole.rbac.authorization.k8s.io "cluster-status" not found, clusterrole.rbac.authorization.k8s.io "system:webhook" not found, clusterrole.rbac.authorization.k8s.io "system:discovery" not found, clusterrole.rbac.authorization.k8s.io "system:public-info-viewer" not found, clusterrole.rbac.authorization.k8s.io "multus-admission-controller-webhook" not found, clusterrole.rbac.authorization.k8s.io "system:build-strategy-source" not found, clusterrole.rbac.authorization.k8s.io "console-extensions-reader" not found]

Setting high severity, because green 4.7 -> 4.8 updates are important and something we want very solid by the time 4.8 GAs.

[1]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-informing#periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=1908378#c30
[3]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1380591032978116608

Comment 1 Alexander Constantinescu 2021-04-12 15:19:39 UTC

*** This bug has been marked as a duplicate of bug 1927264 ***

Comment 2 W. Trevor King 2021-06-16 04:15:44 UTC
Reopening.  Bug 1927264 is now VERIFIED, with the referenced PR landing in master 11 days ago [1].  But "pods should successfully create sandboxes by other" is still wildly popular in CI, so whatever bug 1927264 fixed, there is clearly still more to fix:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&type=junit&search=pods+should+successfully+create+sandboxes+by+other' | grep '4\.[89].*failures match' | sort
periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade-single-node (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-azure-compact (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade-single-node (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade (all) - 8 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade-rollback (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 2 runs, 50% failed, 100% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-ovn-upgrade (all) - 4 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade (all) - 4 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade (all) - 2 runs, 50% failed, 100% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade (all) - 3 runs, 67% failed, 100% of failures match = 67% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-azure-upgrade-single-node (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade (all) - 13 runs, 69% failed, 44% of failures match = 31% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-compact-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade (all) - 14 runs, 100% failed, 57% of failures match = 57% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-upgrade (all) - 13 runs, 92% failed, 75% of failures match = 69% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-azure-ovn-upgrade (all) - 4 runs, 50% failed, 100% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-gcp-ovn-upgrade (all) - 4 runs, 75% failed, 67% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-ovirt-upgrade (all) - 3 runs, 67% failed, 100% of failures match = 67% impact
periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.9-e2e-gcp-csi (all) - 2 runs, 50% failed, 100% of failures match = 50% impact
periodic-ci-openshift-release-master-nightly-4.9-e2e-metal-ipi-upgrade (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.9-upgrade-from-stable-4.8-e2e-metal-ipi-upgrade (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-okd-4.8-e2e-vsphere (all) - 6 runs, 100% failed, 50% of failures match = 50% impact
pull-ci-openshift-machine-api-operator-release-4.8-e2e-aws-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
pull-ci-openshift-machine-api-operator-release-4.8-e2e-metal-ipi-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
pull-ci-openshift-origin-release-4.8-e2e-aws-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
pull-ci-openshift-ovn-kubernetes-master-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade (all) - 7 runs, 100% failed, 29% of failures match = 29% impact
rehearse-15939-periodic-ci-openshift-release-master-ci-4.9-e2e-azure-upgrade-single-node (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
rehearse-15939-periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-ovirt-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
rehearse-15939-periodic-ci-openshift-release-master-nightly-4.9-upgrade-from-stable-4.7-e2e-aws-upgrade-paused (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
rehearse-15939-periodic-ci-openshift-release-master-stable-4.8-upgrade-from-stable-4.6-e2e-aws-upgrade-paused (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
rehearse-17730-pull-ci-openshift-installer-release-4.9-e2e-aws-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
rehearse-17730-pull-ci-openshift-installer-release-4.9-e2e-gcp-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
rehearse-18877-periodic-ci-openshift-release-master-okd-4.8-upgrade-from-4.7-e2e-upgrade-gcp (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
rehearse-19228-periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade (all) - 3 runs, 67% failed, 50% of failures match = 33% impact
rehearse-19239-periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-single-node (all) - 9 runs, 44% failed, 25% of failures match = 11% impact
rehearse-19285-periodic-ci-openshift-release-master-ci-4.9-e2e-aws-upgrade-single-node (all) - 3 runs, 33% failed, 100% of failures match = 33% impact
rehearse-19285-periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade (all) - 3 runs, 33% failed, 100% of failures match = 33% impact
rehearse-19285-pull-ci-openshift-installer-release-4.9-e2e-aws-upgrade (all) - 4 runs, 25% failed, 100% of failures match = 25% impact
release-openshift-ocp-installer-e2e-azure-ovn-4.9 (all) - 3 runs, 67% failed, 50% of failures match = 33% impact
release-openshift-ocp-installer-upgrade-remote-libvirt-ppc64le-4.7-to-4.8 (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
release-openshift-ocp-installer-upgrade-remote-libvirt-s390x-4.7-to-4.8 (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-origin-installer-old-rhcos-e2e-aws-4.9 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
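
As an aside, the same data is available as JSON from the search endpoint, which is a bit easier to slice than the w3m dump; a rough sketch that just counts matching runs per job over the same window (raw run counts, not the impact percentages above):

$ curl -s 'https://search.ci.openshift.org/search?maxAge=24h&type=junit&search=pods+should+successfully+create+sandboxes+by+other' | jq -r 'keys[]' | awk -F/ '{print $(NF-1)}' | sort | uniq -c | sort -n  # keys are run URLs; second-to-last path component is the job name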

[1]: https://github.com/openshift/multus-cni/pull/101#event-4844582487

Comment 3 W. Trevor King 2021-06-16 04:26:08 UTC
A reasonable number of those from recent releases look like [1]:

  ns/openshift-multus pod/network-metrics-daemon-zx2pz node/ci-op-bwcbtfmb-25656-9n58p-master-1 - never deleted - reason/FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_network-metrics-daemon-zx2pz_openshift-multus_968336b3-1fef-4098-8e2d-f37b3cbee8f7_0(6ea40a13af26babba135f17a209ba100ffcb534ff174da10eb569f8a045c36ac): Multus: [openshift-multus/network-metrics-daemon-zx2pz]: error getting pod: Unauthorized
  ns/openshift-multus pod/network-metrics-daemon-rjhfv node/ci-op-bwcbtfmb-25656-9n58p-worker-eastus3-8ssnx - never deleted - reason/FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_network-metrics-daemon-rjhfv_openshift-multus_7342244f-1581-4bd3-b6f1-25d013cc4e34_0(fb86f0f60c86921d1dda1dc977336fbc6a93eec6c03da3e3ee59c6c4a2a991a5): Multus: [openshift-multus/network-metrics-daemon-rjhfv]: error getting pod: Unauthorized

Searching:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&type=junit&search=network-metrics-daemon.*never+deleted.*reason/FailedCreatePodSandBox.*failed+to+create+pod+network+sandbox.*error+getting+pod:+Unauthorized' | grep 'failures match' | grep -v 'pull-ci-\|rehearse-' | sort
periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade-single-node (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade-single-node (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-azure-upgrade-single-node (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade (all) - 13 runs, 69% failed, 22% of failures match = 15% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade (all) - 14 runs, 100% failed, 43% of failures match = 43% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-upgrade (all) - 13 runs, 92% failed, 58% of failures match = 54% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-azure-ovn-upgrade (all) - 4 runs, 50% failed, 100% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-ovirt-upgrade (all) - 3 runs, 67% failed, 100% of failures match = 67% impact
periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.9-e2e-metal-ipi-upgrade (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-okd-installer-e2e-aws-upgrade (all) - 7 runs, 100% failed, 29% of failures match = 29% impact
release-openshift-origin-installer-old-rhcos-e2e-aws-4.9 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact

But I guess that's not 4.7 -> 4.8, so I'll spin it off into a new bug.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade/1404639671656386560
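
The JSON endpoint also accepts a job-name filter, so the same query can be scoped to 4.7 -> 4.8 jobs directly; a sketch that just lists any matching runs:

$ curl -s 'https://search.ci.openshift.org/search?maxAge=24h&type=junit&name=4.8-upgrade-from-.*4.7&search=network-metrics-daemon.*never+deleted.*reason/FailedCreatePodSandBox.*failed+to+create+pod+network+sandbox.*error+getting+pod:+Unauthorized' | jq -r 'keys[]'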

Comment 4 Ben Bennett 2021-06-16 12:50:29 UTC
Perhaps the same as https://bugzilla.redhat.com/show_bug.cgi?id=1972167 ?

Comment 5 W. Trevor King 2021-06-16 20:37:00 UTC
Bug 1972167 seems to manifest as "error getting pod: Unauthorized", and there are a few other bugs in that space, including bug 1972490.  But checking on the 4.7->4.8 updates:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&type=junit&search=pods+should+successfully+create+sandboxes+by+other' | grep '4.8-upgrade-from.*4.7.*failures match' | sort
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade (all) - 8 runs, 100% failed, 88% of failures match = 88% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 3 runs, 67% failed, 50% of failures match = 33% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-ovn-upgrade (all) - 4 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade (all) - 5 runs, 100% failed, 80% of failures match = 80% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade (all) - 3 runs, 67% failed, 100% of failures match = 67% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade (all) - 3 runs, 100% failed, 67% of failures match = 67% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-vsphere-upgrade (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 4 runs, 75% failed, 33% of failures match = 25% impact
pull-ci-openshift-ovn-kubernetes-master-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact

Finding a job:

$ curl -s 'https://search.ci.openshift.org/search?maxAge=24h&type=junit&search=pods+should+successfully+create+sandboxes+by+other&name=periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade' | jq -r 'keys[]'
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1405188888481239040
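
The underlying FailedCreatePodSandBox lines can also be pulled straight from the search API's JSON (the context field carries the matched lines), so the failure text is visible without opening the Prow page; a sketch:

$ curl -s 'https://search.ci.openshift.org/search?maxAge=24h&type=junit&context=0&search=reason/FailedCreatePodSandBox&name=periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade' | jq -r 'to_entries[].value | to_entries[].value[].context[]'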

That job has:

  : [sig-network] pods should successfully create sandboxes by other	0s
    1 failures to create the sandbox

    ns/openshift-network-diagnostics pod/network-check-target-87lq2 node/ip-10-0-222-37.ec2.internal - never deleted - reason/FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = error creating pod sandbox with name "k8s_network-check-target-87lq2_openshift-network-diagnostics_dac0da1d-6e7a-40c9-bd1c-92d2e4076d02_0": error locating item named "manifest-sha256:fa0f2cad0e8d907a10bf91b2fe234659495a694235a9e2ef7015eb450ce9f1ba" for image with ID "c8420102ec4009c486f7a4085fb574c2cc68b6047e871c1206b29b775d6c0a34": file does not exist

Checking to see how common that symptom is:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&type=junit&search=FailedCreatePodSandBox.*file+does+not+exist' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade (all) - 14 runs, 93% failed, 8% of failures match = 7% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 5 runs, 80% failed, 25% of failures match = 20% impact
periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-upgrade (all) - 8 runs, 75% failed, 17% of failures match = 13% impact

So not everything.  Let's move over to GCP:

$ curl -s 'https://search.ci.openshift.org/search?maxAge=24h&type=junit&search=pods+should+successfully+create+sandboxes+by+other&name=periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade' | jq -r 'keys[]'
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade/1404922286883999744
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade/1405215035801735168

  The first of those has five failures like:

  ns/openshift-dns pod/dns-default-z4hp4 node/ci-op-xf1lrxzf-3b3f8-n9rlk-worker-b-6t5sq - never deleted - reason/FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_dns-default-z4hp4_openshift-dns_09f88700-2071-4378-90a1-7c76ef21c3a7_0(19158e53eeb638a87c0ffdf885c4693970d820d669f9e3f3788254340d4d03a4): Multus: [openshift-dns/dns-default-z4hp4]: have you checked that your default network is ready? still waiting for readinessindicatorfile @ /var/run/multus/cni/net.d/80-openshift-network.conf. pollimmediate error: timed out waiting for the condition

The second has a single failure:

  ns/openshift-kube-apiserver pod/revision-pruner-9-ci-op-zgwmkdxz-3b3f8-qzwml-master-1 node/ci-op-zgwmkdxz-3b3f8-qzwml-master-1 - never deleted - reason/FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = container create failed: time="2021-06-16T19:04:58Z" level=error msg="container_linux.go:367: starting container process caused: process_linux.go:495: container init caused: process_linux.go:458: setting cgroup config for procHooks process caused: Unit crio-861e1151e82ecb87a527315b9c027c2130772a4405ed83824a2bb88ced40d277.scope not found."

So I'm not sure there's a consistent pattern.  But we want 4.7->4.8 to be reliably green.  Maybe just keep this open as an umbrella tracker, and circle back once we've fixed the other bugs around this test-case to see what's left?

Comment 6 W. Trevor King 2021-06-16 20:42:08 UTC
A possibly helpful query for ranking the trailing bits of these error messages:

$ curl -s 'https://search.ci.openshift.org/search?maxAge=24h&type=junit&context=0&search=reason/FailedCreatePodSandBox&name=4.8-upgrade-from-.*4.7' | jq -r 'to_entries[].value | to_entries[].value[].context[]' | sed 's/.*://' | sort | uniq -c | sort -n
      1 
      1  '[openshift-multus/network-metrics-daemon-bs9q8 2d5fc65c6b3329c9b87ed040562ad539d8e155c18d9938fe70998fba25f7fc07] [openshift-multus/network-metrics-daemon-bs9q8 2d5fc65c6b3329c9b87ed040562ad539d8e155c18d9938fe70998fba25f7fc07] timed out waiting for annotations
      1  EOF
      1  Unit crio-861e1151e82ecb87a527315b9c027c2130772a4405ed83824a2bb88ced40d277.scope not found."
      1  client connection lost
      1  connection refused
      1  failed to find plugin "openshift-sdn" in path [/opt/multus/bin /var/lib/cni/bin /usr/libexec/cni]
      1  file does not exist
      1  pod "installer-7-ci-op-y4f9imhp-8929c-hh85g-master-0" not found
      1  request timed out
      1 image-puller" not found
      1 scope-impersonation" not found, clusterrole.rbac.authorization.k8s.io "whereabouts-cni" not found, clusterrole.rbac.authorization.k8s.io "self-access-reviewer" not found]
      3  timed out waiting for annotations
     10  timed out waiting for the condition

And ranking over all CI without limiting to 4.7->4.8:

$ curl -s 'https://search.ci.openshift.org/search?maxAge=24h&type=junit&context=0&search=reason/FailedCreatePodSandBox' | jq -r 'to_entries[].value | to_entries[].value[].context[]' | sed 's/.*://;s/\(pods\?\) "[^"]*" not found/\1 "..." not found/' | sort | uniq -c | sort -n | tail 
      8  connection refused
     13  timed out waiting for annotations
     15 
     15  'pods "..." not found
     16  timed out waiting for OVS flows
     21  timed out waiting for the condition
     51  EOF
     81  i/o timeout
    126  pods "..." not found
    154  Unauthorized
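
A possible refinement (untested sketch): also normalize the long crio scope-unit IDs with one more sed substitution, so those messages collapse into a single bucket:

$ curl -s 'https://search.ci.openshift.org/search?maxAge=24h&type=junit&context=0&search=reason/FailedCreatePodSandBox' | jq -r 'to_entries[].value | to_entries[].value[].context[]' | sed 's/.*://;s/\(pods\?\) "[^"]*" not found/\1 "..." not found/;s/crio-[0-9a-f]*\.scope/crio-<id>.scope/' | sort | uniq -c | sort -n | tail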

Comment 7 Douglas Smith 2021-06-18 13:42:29 UTC
We currently believe this is a duplicate of bug 1972167, and it should be verified as such, or reopened if determined otherwise.

*** This bug has been marked as a duplicate of bug 1972167 ***

Comment 8 W. Trevor King 2021-07-02 04:57:28 UTC
A 4.7.19 -> 4.8.0-rc.2 update failed [1] with:

  : [sig-network] pods should successfully create sandboxes by getting pod	0s
    2 failures to create the sandbox

    ns/openshift-controller-manager pod/controller-manager-twwfk node/ip-10-0-161-230.ec2.internal - never deleted - reason/FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_controller-manager-twwfk_openshift-controller-manager_877f87bd-92a5-4599-b880-c15779979c7a_0(cac21b592d2559456968829fc58964a8e62994926b123126a9eab8bc8fb566b9): error adding pod openshift-controller-manager_controller-manager-twwfk to CNI network "multus-cni-network": Multus: [openshift-controller-manager/controller-manager-twwfk]: error getting pod: pods "controller-manager-twwfk" is forbidden: User "system:serviceaccount:openshift-multus:multus" cannot get resource "pods" in API group "" in the namespace "openshift-controller-manager": RBAC: [clusterrole.rbac.authorization.k8s.io "system:build-strategy-source" not found, clusterrole.rbac.authorization.k8s.io "system:scope-impersonation" not found, clusterrole.rbac.authorization.k8s.io "system:build-strategy-docker" not found, clusterrole.rbac.authorization.k8s.io "system:openshift:discovery" not found, clusterrole.rbac.authorization.k8s.io "system:build-strategy-jenkinspipeline" not found, clusterrole.rbac.authorization.k8s.io "console-extensions-reader" not found, clusterrole.rbac.authorization.k8s.io "whereabouts-cni" not found, clusterrole.rbac.authorization.k8s.io "system:openshift:public-info-viewer" not found, clusterrole.rbac.authorization.k8s.io "system:public-info-viewer" not found, clusterrole.rbac.authorization.k8s.io "multus" not found, clusterrole.rbac.authorization.k8s.io "system:oauth-token-deleter" not found, clusterrole.rbac.authorization.k8s.io "system:basic-user" not found, clusterrole.rbac.authorization.k8s.io "cluster-status" not found, clusterrole.rbac.authorization.k8s.io "multus-admission-controller-webhook" not found, clusterrole.rbac.authorization.k8s.io "system:discovery" not found, clusterrole.rbac.authorization.k8s.io "self-access-reviewer" not found, clusterrole.rbac.authorization.k8s.io "system:webhook" not found, clusterrole.rbac.authorization.k8s.io "helm-chartrepos-viewer" not found, clusterrole.rbac.authorization.k8s.io "basic-user" not found, clusterrole.rbac.authorization.k8s.io "system:service-account-issuer-discovery" not found]
    ns/openshift-network-diagnostics pod/network-check-target-27zht node/ip-10-0-161-230.ec2.internal - never deleted - reason/FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_network-check-target-27zht_openshift-network-diagnostics_ba96ba9c-a453-41ee-a262-162eb5284cab_0(b793f8042104005964acc38a8805a10842eb5fbc90d43efd041dc39a2fef82f3): error adding pod openshift-network-diagnostics_network-check-target-27zht to CNI network "multus-cni-network": Multus: [openshift-network-diagnostics/network-check-target-27zht]: error getting pod: pods "network-check-target-27zht" is forbidden: User "system:serviceaccount:openshift-multus:multus" cannot get resource "pods" in API group "" in the namespace "openshift-network-diagnostics": RBAC: [clusterrole.rbac.authorization.k8s.io "system:service-account-issuer-discovery" not found, clusterrole.rbac.authorization.k8s.io "system:scope-impersonation" not found, clusterrole.rbac.authorization.k8s.io "system:build-strategy-source" not found, clusterrole.rbac.authorization.k8s.io "system:build-strategy-docker" not found, clusterrole.rbac.authorization.k8s.io "system:openshift:discovery" not found, clusterrole.rbac.authorization.k8s.io "system:build-strategy-jenkinspipeline" not found, clusterrole.rbac.authorization.k8s.io "console-extensions-reader" not found, clusterrole.rbac.authorization.k8s.io "whereabouts-cni" not found, clusterrole.rbac.authorization.k8s.io "system:openshift:public-info-viewer" not found, clusterrole.rbac.authorization.k8s.io "system:public-info-viewer" not found, clusterrole.rbac.authorization.k8s.io "multus" not found, clusterrole.rbac.authorization.k8s.io "system:oauth-token-deleter" not found, clusterrole.rbac.authorization.k8s.io "system:basic-user" not found, clusterrole.rbac.authorization.k8s.io "cluster-status" not found, clusterrole.rbac.authorization.k8s.io "multus-admission-controller-webhook" not found, clusterrole.rbac.authorization.k8s.io "system:discovery" not found, clusterrole.rbac.authorization.k8s.io "self-access-reviewer" not found, clusterrole.rbac.authorization.k8s.io "system:webhook" not found, clusterrole.rbac.authorization.k8s.io "helm-chartrepos-viewer" not found, clusterrole.rbac.authorization.k8s.io "basic-user" not found]

This is very similar to my comment 0 here, despite bug 1972167 having been VERIFIED in 4.8 for a while.  I'm moving this over to claim it as a duplicate of bug 1972490.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1410786292060393472
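
For anyone chasing this on a live cluster, a rough way to check whether the clusterroles actually go missing during the upgrade (as opposed to an RBAC evaluation hiccup) is to poll one of them while the update runs; an untested sketch:

$ while sleep 5; do date -u; oc get clusterrole multus >/dev/null 2>&1 || echo 'clusterrole multus missing'; done  # swap in whichever clusterrole shows up in the RBAC error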

*** This bug has been marked as a duplicate of bug 1972490 ***

