Bug 2090547 - pods should successfully create sandboxes by adding pod to network (failed (add)); multus-cni-network not found in CNI cache
Keywords:
Status: CLOSED DUPLICATE of bug 2073452
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.11
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Assignee: Douglas Smith
QA Contact: Weibin Liang
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2022-05-26 02:24 UTC by Dennis Periquet
Modified: 2022-11-09 14:23 UTC
CC List: 1 user

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-11-09 14:23:21 UTC
Target Upstream Version:
Embargoed:



Description Dennis Periquet 2022-05-26 02:24:10 UTC
Description of problem:

: [sig-network] pods should successfully create sandboxes by adding pod to network
{  5 failures to create the sandbox

ns/openshift-kube-controller-manager pod/revision-pruner-7-ip-10-0-189-82.us-west-2.compute.internal node/ip-10-0-189-82.us-west-2.compute.internal - 351.61 seconds after deletion - reason/FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_revision-pruner-7-ip-10-0-189-82.us-west-2.compute.internal_openshift-kube-controller-manager_727a0869-3332-4b7a-93cb-302d2211226d_0(3750ff286013e8dce3ecd7020f8ad01b4d19490d062e08d2092dd3bf36488a88): error adding pod openshift-kube-controller-manager_revision-pruner-7-ip-10-0-189-82.us-west-2.compute.internal to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network" failed (add): 
ns/openshift-kube-apiserver pod/revision-pruner-6-ip-10-0-189-82.us-west-2.compute.internal node/ip-10-0-189-82.us-west-2.compute.internal - 355.08 seconds after deletion - reason/FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_revision-pruner-6-ip-10-0-189-82.us-west-2.compute.internal_openshift-kube-apiserver_7c4355bd-69bb-4fe9-8e09-859259cd7aa1_0(4f54968650290a2e6de012aac8679d6ce9c9cc1d9eedd02559660c06ecc5e7f7): error adding pod openshift-kube-apiserver_revision-pruner-6-ip-10-0-189-82.us-west-2.compute.internal to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network" failed (add): netplugin failed but error parsing its diagnostic message "\x00\x00\x00\x00\x00\x00\x00\x00\x00...\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00": invalid character '\x00' looking for beginning of value
ns/openshift-kube-scheduler pod/revision-pruner-6-ip-10-0-189-82.us-west-2.compute.internal node/ip-10-0-189-82.us-west-2.compute.internal - 354.06 seconds after deletion - reason/FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_revision-pruner-6-ip-10-0-189-82.us-west-2.compute.internal_openshift-kube-scheduler_5e8af62a-c6c1-42ba-af79-5f8a8c74d4ff_0(cd0e5bdd588bf0b98da1501eb3ec1071b2c44d0547e739800922d4186a337b56): error adding pod openshift-kube-scheduler_revision-pruner-6-ip-10-0-189-82.us-west-2.compute.internal to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network" failed (add): 
ns/openshift-network-diagnostics pod/network-check-target-wjz9m node/ip-10-0-189-82.us-west-2.compute.internal - never deleted - network rollout - reason/FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_network-check-target-wjz9m_openshift-network-diagnostics_6b59d804-f204-4c5c-a5b0-b8bfa532313a_0(5263496a66d2be3c02b5c92580be9475a4601687b993fd185cb8231203535878): error adding pod openshift-network-diagnostics_network-check-target-wjz9m to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network" failed (add): 
ns/openshift-multus pod/network-metrics-daemon-72xqw node/ip-10-0-189-82.us-west-2.compute.internal - never deleted - network rollout - reason/FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_network-metrics-daemon-72xqw_openshift-multus_e0d8b470-7683-4d0b-91bc-d44336bab958_0(c42fa540057a7516de1e9747be16cdbbb58d4a3145419fa72bbd8013616da8bc): error adding pod openshift-multus_network-metrics-daemon-72xqw to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network" failed (add): }


Version-Release number of selected component (if applicable): 4.11


How reproducible:

This does not happen all the time; we got a payload accepted earlier today.

Steps to Reproduce:
1. Run the relevant blocking job

Actual results:

We get this failure:

  [sig-network] pods should successfully create sandboxes by adding pod to network

Expected results:

No failure

Additional info:

This is one job of several that had a similar symptom: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade/1529373268140625920

Here's the aggregated job that caused the payload to be rejected (this aggregated job has several jobs that had this failure): https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/aggregated-aws-sdn-upgrade-4.11-micro-release-openshift-release-analysis-aggregator/1529373276529233920

This is the payload that was rejected: https://amd64.ocp.releases.ci.openshift.org/releasestream/4.11.0-0.nightly/release/4.11.0-0.nightly-2022-05-25-080235

The failures above also have the symptom tracked in https://bugzilla.redhat.com/show_bug.cgi?id=2090389


This sippy query shows a significant drop in the pass rate of this test: https://sippy.dptools.openshift.org/sippy-ng/tests/4.11/details?filters=%257B%2522items%2522%253A%255B%257B%2522columnField%2522%253A%2522current_runs%2522%252C%2522operatorValue%2522%253A%2522%253E%253D%2522%252C%2522value%2522%253A%25221%2522%257D%252C%257B%2522columnField%2522%253A%2522variants%2522%252C%2522not%2522%253Atrue%252C%2522operatorValue%2522%253A%2522contains%2522%252C%2522value%2522%253A%2522never-stable%2522%257D%252C%257B%2522columnField%2522%253A%2522variants%2522%252C%2522not%2522%253Atrue%252C%2522operatorValue%2522%253A%2522contains%2522%252C%2522value%2522%253A%2522techpreview%2522%257D%252C%257B%2522id%2522%253A99%252C%2522columnField%2522%253A%2522name%2522%252C%2522operatorValue%2522%253A%2522contains%2522%252C%2522value%2522%253A%2522pods%2520should%2520successfully%2520create%2520sandboxes%2520by%2520adding%2520pod%2520to%2520network%2522%257D%255D%252C%2522linkOperator%2522%253A%2522and%2522%257D&sort=asc&sortField=current_working_percentage

I looked at events.json to find the relevant event for the sandbox failure at https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade/1529373268140625920/artifacts/e2e-aws-upgrade/gather-extra/artifacts/events.json.
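A jq filter along these lines pulls the FailedCreatePodSandBox events (with their timestamps and namespaces) straight out of events.json; this is only a sketch, assuming the standard Event fields visible in the excerpt below:

# list the FailedCreatePodSandBox events with timestamp, namespace, and message
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade/1529373268140625920/artifacts/e2e-aws-upgrade/gather-extra/artifacts/events.json | jq -r '.items[] | select(.reason == "FailedCreatePodSandBox") | "\(.lastTimestamp) \(.metadata.namespace) \(.message)"'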

            "kind": "Event",
            "lastTimestamp": "2022-05-25T09:18:16Z",
            "message": "Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_revision-pruner-7-ip-10-0-189-82.us-west-2.compute.internal_openshift-kube-controller-manager_727a0869-3332-4b7a-93cb-302d2211226d_0(3750ff286013e8dce3ecd7020f8ad01b4d19490d062e08d2092dd3bf36488a88): error adding pod openshift-kube-controller-manager_revision-pruner-7-ip-10-0-189-82.us-west-2.compute.internal to CNI network \"multus-cni-network\": plugin type=\"multus\" name=\"multus-cni-network\" failed (add): ",
            "metadata": {
                "creationTimestamp": "2022-05-25T09:18:33Z",
                "name": "revision-pruner-7-ip-10-0-189-82.us-west-2.compute.internal.16f24e797a686b07",
                "namespace": "openshift-kube-controller-manager",
                "resourceVersion": "54053",
                "uid": "261a978f-9553-446a-8eca-dbe127de40b4"
            },

From that, I know the anomaly occurred at 09:18:16Z (05:18:16 EST).

I then looked at clusterversion.json to see when the upgrade happened and determined that the pod sandbox failure occurred during the upgrade:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade/1529373268140625920/artifacts/e2e-aws-upgrade/gather-extra/artifacts/clusterversion.json | jq -r '.items[].status.history[] | .startedTime + " " + (.completionTime // "-") + " " + .state + " " + .version'|sort
2022-05-25T08:22:02Z 2022-05-25T08:43:23Z Completed 4.11.0-0.nightly-2022-05-20-213928
2022-05-25T08:48:13Z 2022-05-25T09:26:55Z Completed 4.11.0-0.nightly-2022-05-25-080235  <-- pod sandbox event occurs during upgrade

The anomaly occurred on node ip-10-0-189-82.us-west-2.compute.internal, as mentioned in the failures in the prow job, so I gathered the multus, sdn, and crio logs for that node (from the gather-extra/nodes subdir).

Here's a series of events with times (in UTC):

The upgrade starts at about 08:48...
node ip-10-0-189-82.us-west-2.compute.internal reboots
the "node is not ready" event runs from 05:15:45 EST to 05:18:03 EST (09:15:45Z to 09:18:13Z); see the first spyglass chart in the prow job or https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade/1529373268140625920/artifacts/e2e-aws-upgrade/openshift-e2e-test/artifacts/junit/e2e-intervals_everything_20220525-084359.json

The multus pod (multus-bv7sw) starts at 09:18:05; log at https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade/1529373268140625920/artifacts/e2e-aws-upgrade/gather-extra/artifacts/pods/openshift-multus_multus-bv7sw_kube-multus.log
The multus pod looks for the master plugin configuration at 09:18:06.
The multus pod finds the master plugin configuration and enters its watch loop at 09:18:14.

Here is the multus-bv7sw pod log, since it's rather short:
Successfully copied files in /usr/src/multus-cni/rhel8/bin/ to /host/opt/cni/bin/
2022-05-25T09:18:05+00:00 WARN: {unknown parameter "-"}
2022-05-25T09:18:05+00:00 Entrypoint skipped copying Multus binary.
2022-05-25T09:18:06+00:00 Generating Multus configuration file using files in /host/var/run/multus/cni/net.d...
2022-05-25T09:18:06+00:00 Attempting to find master plugin configuration, attempt 0
2022-05-25T09:18:11+00:00 Attempting to find master plugin configuration, attempt 5
2022-05-25T09:18:14+00:00 Using MASTER_PLUGIN: 80-openshift-network.conf
2022-05-25T09:18:14+00:00 Nested capabilities string: 
2022-05-25T09:18:14+00:00 Using /host/var/run/multus/cni/net.d/80-openshift-network.conf as a source to generate the Multus configuration
2022-05-25T09:18:14+00:00 Config file created @ /host/etc/cni/net.d/00-multus.conf
{ "cniVersion": "0.3.1", "name": "multus-cni-network", "type": "multus", "namespaceIsolation": true, "globalNamespaces": "default,openshift-multus,openshift-sriov-network-operator", "logLevel": "verbose", "binDir": "/opt/multus/bin", "readinessindicatorfile": "/var/run/multus/cni/net.d/80-openshift-network.conf", "kubeconfig": "/etc/kubernetes/cni/net.d/multus.d/multus.kubeconfig", "delegates": [ { "cniVersion": "0.3.1", "name": "openshift-sdn", "type": "openshift-sdn" } ] }
2022-05-25T09:18:14+00:00 Entering watch loop...


The sdn-wk2qc pod's "sdn" container starts at 09:18:11; log at https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade/1529373268140625920/artifacts/e2e-aws-upgrade/gather-extra/artifacts/pods/openshift-sdn_sdn-wk2qc_sdn.log

These log entries appear in the sdn-wk2qc/sdn container 3 seconds before the pod sandbox problem (not sure whether they are a problem, though they are Error level):

E0525 09:18:13.739146    2171 proxier.go:1640] "can't open port, skipping it" err="listen tcp4 :30274: bind: address already in use" port={Description:nodePort for openshift-ingress/router-default:https IP: IPFamily:4 Port:30274 Protocol:TCP}
I0525 09:18:13.739667    2171 proxier.go:1644] "Opened local port" port={Description:nodePort for e2e-k8s-service-load-balancer-with-pdb-reused-4118/service-test IP: IPFamily:4 Port:31992 Protocol:TCP}
E0525 09:18:13.739710    2171 proxier.go:1640] "can't open port, skipping it" err="listen tcp4 :31992: bind: address already in use" port={Description:nodePort for e2e-k8s-service-load-balancer-with-pdb-reused-4118/service-test IP: IPFamily:4 Port:31992 Protocol:TCP}
I0525 09:18:13.739857    2171 proxier.go:1644] "Opened local port" port={Description:nodePort for e2e-k8s-service-load-balancer-with-pdb-new-8170/service-test IP: IPFamily:4 Port:31031 Protocol:TCP}
E0525 09:18:13.739887    2171 proxier.go:1640] "can't open port, skipping it" err="listen tcp4 :31031: bind: address already in use" port={Description:nodePort for e2e-k8s-service-load-balancer-with-pdb-new-8170/service-test IP: IPFamily:4 Port:31031 Protocol:TCP}
I0525 09:18:13.740071    2171 proxier.go:1644] "Opened local port" port={Description:nodePort for openshift-ingress/router-default:http IP: IPFamily:4 Port:32192 Protocol:TCP}
E0525 09:18:13.740099    2171 proxier.go:1640] "can't open port, skipping it" err="listen tcp4 :32192: bind: address already in use" port={Description:nodePort for openshift-ingress/router-default:http IP: IPFamily:4 Port:32192 Protocol:TCP}

We see several CNI_ADD, CNI_DEL, and OVS "Commit transaction" messages in the sdn log for several minutes, starting at 09:18:14.
During this time, at 09:18:16-17, we see several FailedCreatePodSandBox events in event-filter.html; see https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade/1529373268140625920/artifacts/e2e-aws-upgrade/gather-must-gather/artifacts/event-filter.html

Check the crio logs for node ip-10-0-189-82.us-west-2.compute.internal at https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade/1529373268140625920/artifacts/e2e-aws-upgrade/gather-extra/artifacts/nodes/ip-10-0-189-82.us-west-2.compute.internal/journal
and search with this regex to see logs near the time the anomaly occurs: "25 09:18:1.*revision-pruner-7-ip-10-0-189-82"
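For example (a sketch; zgrep works whether or not the saved journal is gzip-compressed):

# download the node journal locally, then search around the time of the anomaly
$ curl -sL -o ip-10-0-189-82.journal https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade/1529373268140625920/artifacts/e2e-aws-upgrade/gather-extra/artifacts/nodes/ip-10-0-189-82.us-west-2.compute.internal/journal
$ zgrep -aE '25 09:18:1.*revision-pruner-7-ip-10-0-189-82' ip-10-0-189-82.journal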

You will see these logs:

2022-05-25T09:18:39Z {} time="2022-05-25 09:18:15.844480391Z" level=error msg="error loading cached network config: network \"multus-cni-network\" not found in CNI cache"
2022-05-25T09:18:39Z {} time="2022-05-25 09:18:15.844492021Z" level=warning msg="falling back to loading from existing plugins on disk"
2022-05-25T09:18:39Z {} time="2022-05-25 09:18:15.844501850Z" level=info msg="Deleting pod openshift-kube-controller-manager_revision-pruner-7-ip-10-0-189-82.us-west-2.compute.internal from CNI network \"multus-cni-network\" (type=multus)"
2022-05-25T09:18:39Z {} time="2022-05-25 09:18:16.161726800Z" level=error msg="!sandboxdebug network.config: &{Name:multus-cni-network CNIVersion:0.3.1 DisableCheck:false Plugins:[0xc000146660] Bytes:[123 34 99 110 105 86 101 114 115 105 111 110 34 58 34 48 46 51 46 49 34 44 34 110 97 109 101 34 58 34 109 117 108 116 117 115 45 99 110 105 45 110 101 116 119 111 114 107 34 44 34 112 108 117 103 105 110 115 34 58 91 123 34 98 105 110 68 105 114 34 58 34 47 111 112 116 47 109 117 108 116 117 115 47 98 105 110 34 44 34 99 110 105 86 101 114 115 105 111 110 34 58 34 48 46 51 46 49 34 44 34 100 101 108 101 103 97 116 101 115 34 58 91 123 34 99 110 105 86 101 114 115 105 111 110 34 58 34 48 46 51 46 49 34 44 34 110 97 109 101 34 58 34 111 112 101 110 115 104 105 102 116 45 115 100 110 34 44 34 116 121 112 101 34 58 34 111 112 101 110 115 104 105 102 116 45 115 100 110 34 125 93 44 34 103 108 111 98 97 108 78 97 109 101 115 112 97 99 101 115 34 58 34 100 101 102 97 117 108 116 44 111 112 101 110 115 104 105 102 116 45 109 117 108 116 117 115 44 111 112 101 110 115 104 105 102 116 45 115 114 105 111 118 45 110 101 116 119 111 114 107 45 111 112 101 114 97 116 111 114 34 44 34 107 117 98 101 99 111 110 102 105 103 34 58 34 47 101 116 99 47 107 117 98 101 114 110 101 116 101 115 47 99 110 105 47 110 101 116 46 100 47 109 117 108 116 117 115 46 100 47 109 117 108 116 117 115 46 107 117 98 101 99 111 110 102 105 103 34 44 34 108 111 103 76 101 118 101 108 34 58 34 118 101 114 98 111 115 101 34 44 34 110 97 109 101 34 58 34 109 117 108 116 117 115 45 99 110 105 45 110 101 116 119 111 114 107 34 44 34 110 97 109 101 115 112 97 99 101 73 115 111 108 97 116 105 111 110 34 58 116 114 117 101 44 34 114 101 97 100 105 110 101 115 115 105 110 100 105 99 97 116 111 114 102 105 108 101 34 58 34 47 118 97 114 47 114 117 110 47 109 117 108 116 117 115 47 99 110 105 47 110 101 116 46 100 47 56 48 45 111 112 101 110 115 104 105 102 116 45 110 101 116 119 111 114 107 46 99 111 110 102 34 44 34 116 121 112 101 34 58 34 109 117 108 116 117 115 34 125 93 125]}"
2022-05-25T09:18:39Z {} time="2022-05-25 09:18:16.161817427Z" level=error msg="!sandboxdebug network type: multus"
2022-05-25T09:18:39Z {} time="2022-05-25 09:18:16.161825838Z" level=error msg="!sandboxdebug fullpodname: openshift-network-diagnostics_network-check-target-wjz9m"
2022-05-25T09:18:39Z {} time="2022-05-25 09:18:16.161834226Z" level=error msg="!sandboxdebug error: plugin type=\"multus\" name=\"multus-cni-network\" failed (add): "
2022-05-25T09:18:39Z {} time="2022-05-25 09:18:16.161863578Z" level=info msg="NetworkStart: stopping network for sandbox 5263496a66d2be3c02b5c92580be9475a4601687b993fd185cb8231203535878" id=043b2b0b-7881-4889-8264-482e8846bba3 name=/runtime.v1.RuntimeService/RunPodSandbox
2022-05-25T09:18:39Z {} time="2022-05-25 09:18:16.162301325Z" level=info msg="Got pod network &{Name:network-check-target-wjz9m Namespace:openshift-network-diagnostics ID:5263496a66d2be3c02b5c92580be9475a4601687b993fd185cb8231203535878 UID:6b59d804-f204-4c5c-a5b0-b8bfa532313a NetNS:/var/run/netns/10b6b6ec-36d2-44a5-989a-b17aaa96320c Networks:[] RuntimeConfig:map[multus-cni-network:{IP: MAC: PortMappings:[] Bandwidth:<nil> IpRanges:[]}] Aliases:map[]}"

The "Bytes" mentioned above translate to (as a string):
{"cniVersion":"0.3.1","name":"multus-cni-network","plugins":[{"binDir":"/opt/multus/bin","cniVersion":"0.3.1","delegates":[{"cniVersion":"0.3.1","name":"openshift-sdn","type":"openshift-sdn"}],"globalNamespaces":"default,openshift-multus,openshift-sriov-network-operator","kubeconfig":"/etc/kubernetes/cni/net.d/multus.d/multus.kubeconfig","logLevel":"verbose","name":"multus-cni-network","namespaceIsolation":true,"readinessindicatorfile":"/var/run/multus/cni/net.d/80-openshift-network.conf","type":"multus"}]}

This seems to be the problem:
  level=error msg="error loading cached network config: network \"multus-cni-network\" not found in CNI cache"

Comment 2 Dennis Periquet 2022-05-26 13:08:06 UTC
I'm moving the priority up because this is intermittently failing on nightly payloads.

Comment 4 Douglas Smith 2022-11-09 14:23:21 UTC

*** This bug has been marked as a duplicate of bug 2073452 ***

