Bug 2101859 - Failed to create pod in heterogeneous cluster with OVN network
Summary: Failed to create pod in heterogeneous cluster with OVN network
Keywords:
Status: CLOSED DUPLICATE of bug 2101498
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.11
Hardware: All
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Ben Bennett
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2022-06-28 14:45 UTC by liqcui
Modified: 2022-06-29 13:41 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-06-29 13:41:39 UTC
Target Upstream Version:
Embargoed:



Description liqcui 2022-06-28 14:45:35 UTC
Description of problem:
Pod creation fails when running the route-perf scale job on a heterogeneous OCP cluster with 120 worker nodes using the OVN network.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Scale cluster to 120 worker nodes
2. Execute route-perf job

Actual results:
A lot of pods failed to be created.

Expected results:
The job executes successfully.

Additional info:
Some OVN pod logs and related information:
Warning  NetworkNotReady         84m (x18 over 85m)  kubelet            network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?
Warning  FailedCreatePodSandBox  79m                 kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_network-check-target-nq7cv_openshift-network-diagnostics_fb379484-8a04-4773-b58b-d67a442e5591_0(5842cecac99e2ae780dff8e6a6f2646c6d40b4de5c8ecae95adb4302e31c3b67): error adding pod openshift-network-diagnostics_network-check-target-nq7cv to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network" failed (add): [openshift-network-diagnostics/network-check-target-nq7cv/fb379484-8a04-4773-b58b-d67a442e5591:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[openshift-network-diagnostics/network-check-target-nq7cv 5842cecac99e2ae780dff8e6a6f2646c6d40b4de5c8ecae95adb4302e31c3b67] [openshift-network-diagnostics/network-check-target-nq7cv 5842cecac99e2ae780dff8e6a6f2646c6d40b4de5c8ecae95adb4302e31c3b67] failed to configure pod interface: timed out waiting for OVS port binding (ovn-installed) for 0a:58:0a:83:1c:03 [10.131.28.3/23]
'
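
For triage across many pods, the MAC and IP at the end of each FailedCreatePodSandBox message can be pulled out with a small script. The following is a minimal sketch, not part of the original report; the function names are hypothetical, and the MAC check assumes the pattern visible in the events in this report (prefix 0a:58 followed by the four IPv4 octets in hex).

```python
import re

# Matches the tail of the kubelet event text quoted in this report.
PORT_BINDING_RE = re.compile(
    r"timed out waiting for OVS port binding \(ovn-installed\) "
    r"for (?P<mac>(?:[0-9a-f]{2}:){5}[0-9a-f]{2}) \[(?P<cidr>[0-9./]+)\]"
)


def parse_port_binding_timeout(message):
    """Return (mac, cidr) from a FailedCreatePodSandBox message, or None."""
    match = PORT_BINDING_RE.search(message)
    if match is None:
        return None
    return match.group("mac"), match.group("cidr")


def mac_matches_ip(mac, cidr):
    """Check the assumed convention: MAC == 0a:58 + IPv4 octets in hex."""
    ip = cidr.split("/")[0]
    expected = "0a:58:" + ":".join(f"{int(octet):02x}" for octet in ip.split("."))
    return mac == expected


# Sample taken from the first event in this report.
sample = (
    "failed to configure pod interface: timed out waiting for OVS port binding "
    "(ovn-installed) for 0a:58:0a:83:1c:03 [10.131.28.3/23]"
)
result = parse_port_binding_timeout(sample)
print(result, mac_matches_ip(*result))
```

A mismatch between MAC and IP would point at stale annotations rather than a slow port binding, so the check is a cheap way to separate the two cases.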

oc logs -f ovnkube-node-lk8xs  -n openshift-ovn-kubernetes -c ovn-controller
2022-06-28T12:39:21.317Z|00813|ovsdb_cs|INFO|ssl:10.0.145.48:9642: clustered database server is disconnected from cluster; trying another server
2022-06-28T12:39:21.318Z|00814|main|INFO|OVNSB commit failed, force recompute next time.
2022-06-28T12:39:21.318Z|00815|reconnect|INFO|ssl:10.0.145.48:9642: connection attempt timed out
2022-06-28T12:39:21.320Z|00816|reconnect|INFO|ssl:10.0.207.143:9642: connecting...
2022-06-28T12:39:21.756Z|00817|reconnect|INFO|ssl:10.0.207.143:9642: connected
2022-06-28T12:39:22.288Z|00818|inc_proc_eng|INFO|node: logical_flow_output, recompute (forced) took 517ms
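
The ovn-controller lines above use the pipe-separated layout timestamp|sequence|module|level|message. As a rough sketch (hypothetical helper names, not from the original report), the southbound disconnects can be counted per endpoint like this, assuming the endpoint is the text before the first ": " in the message:

```python
from collections import Counter


def parse_vlog_line(line):
    """Split one 'timestamp|seq|module|level|message' line into a dict."""
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    timestamp, seq, module, level, message = parts
    return {
        "timestamp": timestamp,
        "seq": int(seq),
        "module": module,
        "level": level,
        "message": message,
    }


def count_sb_disconnects(lines):
    """Count 'disconnected from cluster' messages per database endpoint."""
    counts = Counter()
    for line in lines:
        record = parse_vlog_line(line)
        if record is None or record["module"] != "ovsdb_cs":
            continue
        if "disconnected from cluster" in record["message"]:
            endpoint = record["message"].split(": ", 1)[0]
            counts[endpoint] += 1
    return counts


# Lines copied from the two ovn-controller logs in this report.
log_lines = [
    "2022-06-28T12:39:21.317Z|00813|ovsdb_cs|INFO|ssl:10.0.145.48:9642: clustered database server is disconnected from cluster; trying another server",
    "2022-06-28T12:39:21.318Z|00814|main|INFO|OVNSB commit failed, force recompute next time.",
    "2022-06-28T12:39:21.318Z|00815|reconnect|INFO|ssl:10.0.145.48:9642: connection attempt timed out",
    "2022-06-28T11:34:53.377Z|00516|ovsdb_cs|INFO|ssl:10.0.172.17:9642: clustered database server is disconnected from cluster; trying another server",
]
print(dict(count_sb_disconnects(log_lines)))
```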

116m        Warning   FailedCreatePodSandBox   pod/network-metrics-daemon-tzz7j          Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_network-metrics-daemon-tzz7j_openshift-multus_dfff197d-3cb0-4fd2-9bc3-72901e18a2a9_0(317f24ba0684e7fa6b1f07d92d7b3f3bce215adcee2dcfcb6f35cd7960ce6e4e): error adding pod openshift-multus_network-metrics-daemon-tzz7j to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network" failed (add): [openshift-multus/network-metrics-daemon-tzz7j/dfff197d-3cb0-4fd2-9bc3-72901e18a2a9:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[openshift-multus/network-metrics-daemon-tzz7j 317f24ba0684e7fa6b1f07d92d7b3f3bce215adcee2dcfcb6f35cd7960ce6e4e] [openshift-multus/network-metrics-daemon-tzz7j 317f24ba0684e7fa6b1f07d92d7b3f3bce215adcee2dcfcb6f35cd7960ce6e4e] failed to configure pod interface: timed out waiting for OVS port binding (ovn-installed) for 0a:58:0a:81:1c:04 [10.129.28.4/23]
'
oc get pods -n openshift-multus -owide |grep ip-10-0-169-235.us-east-2.compute.internal
multus-additional-cni-plugins-t4pgq   1/1     Running             0          127m   10.0.169.235   ip-10-0-169-235.us-east-2.compute.internal   <none>           <none>
multus-m5scr                          1/1     Running             0          127m   10.0.169.235   ip-10-0-169-235.us-east-2.compute.internal   <none>           <none>
network-metrics-daemon-dlh5h          0/2     ContainerCreating   0          127m   <none>         ip-10-0-169-235.us-east-2.compute.internal   <none>           <none>
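
With 120 workers it helps to filter the `oc get pods -owide` listing for pods that never reached Running. This is a minimal sketch with a hypothetical function name; it assumes the whitespace-separated column order shown above (NAME, READY, STATUS, RESTARTS, AGE, IP, NODE, ...) and a RESTARTS column without embedded spaces.

```python
def pods_not_running(get_pods_output):
    """Return (name, ready, status, node) for every pod whose STATUS is not Running."""
    stuck = []
    for line in get_pods_output.splitlines():
        fields = line.split()
        # Skip blank lines, short lines, and an optional header row.
        if len(fields) < 7 or fields[0] == "NAME":
            continue
        name, ready, status, node = fields[0], fields[1], fields[2], fields[6]
        if status != "Running":
            stuck.append((name, ready, status, node))
    return stuck


# Output copied from this report.
listing = """\
multus-additional-cni-plugins-t4pgq   1/1     Running             0          127m   10.0.169.235   ip-10-0-169-235.us-east-2.compute.internal   <none>           <none>
multus-m5scr                          1/1     Running             0          127m   10.0.169.235   ip-10-0-169-235.us-east-2.compute.internal   <none>           <none>
network-metrics-daemon-dlh5h          0/2     ContainerCreating   0          127m   <none>         ip-10-0-169-235.us-east-2.compute.internal   <none>           <none>
"""
for pod in pods_not_running(listing):
    print(pod)
```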

oc logs -f multus-m5scr -n openshift-multus
2022-06-28T11:10:09+00:00 [cnibincopy] Successfully copied files in /usr/src/multus-cni/rhel8/bin/ to /host/opt/cni/bin/upgrade_4616f50f-0485-43ce-a4e8-1ae8b0bfb069
2022-06-28T11:10:09+00:00 [cnibincopy] Successfully moved files in /host/opt/cni/bin/upgrade_4616f50f-0485-43ce-a4e8-1ae8b0bfb069 to /host/opt/cni/bin/
2022-06-28T11:10:09+00:00 WARN: {unknown parameter "-"}
2022-06-28T11:10:09+00:00 Entrypoint skipped copying Multus binary.
2022-06-28T11:10:09+00:00 Generating Multus configuration file using files in /host/var/run/multus/cni/net.d...
2022-06-28T11:10:09+00:00 Attempting to find master plugin configuration, attempt 0
2022-06-28T11:10:14+00:00 Attempting to find master plugin configuration, attempt 5
2022-06-28T11:10:18+00:00 Using MASTER_PLUGIN: 10-ovn-kubernetes.conf
2022-06-28T11:10:18+00:00 Nested capabilities string:
2022-06-28T11:10:18+00:00 Using /host/var/run/multus/cni/net.d/10-ovn-kubernetes.conf as a source to generate the Multus configuration
2022-06-28T11:10:18+00:00 Config file created @ /host/etc/cni/net.d/00-multus.conf
{ "cniVersion": "0.3.1", "name": "multus-cni-network", "type": "multus", "namespaceIsolation": true, "globalNamespaces": "default,openshift-multus,openshift-sriov-network-operator", "logLevel": "verbose", "binDir": "/opt/multus/bin", "readinessindicatorfile": "/var/run/multus/cni/net.d/10-ovn-kubernetes.conf", "kubeconfig": "/etc/kubernetes/cni/net.d/multus.d/multus.kubeconfig", "delegates": [ {"cniVersion":"0.4.0","name":"ovn-kubernetes","type":"ovn-k8s-cni-overlay","ipam":{},"dns":{},"logFile":"/var/log/ovn-kubernetes/ovn-k8s-cni-overlay.log","logLevel":"4","logfile-maxsize":100,"logfile-maxbackups":5,"logfile-maxage":5} ] }
2022-06-28T11:10:18+00:00 Entering watch loop...
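
The generated Multus configuration printed above is plain JSON, so the delegate plugin and the readiness indicator file can be read out directly. A small sketch follows; the function name is hypothetical and the config string is the one from this log.

```python
import json

# Config copied verbatim from the multus log in this report.
MULTUS_CONF = """{ "cniVersion": "0.3.1", "name": "multus-cni-network", "type": "multus", "namespaceIsolation": true, "globalNamespaces": "default,openshift-multus,openshift-sriov-network-operator", "logLevel": "verbose", "binDir": "/opt/multus/bin", "readinessindicatorfile": "/var/run/multus/cni/net.d/10-ovn-kubernetes.conf", "kubeconfig": "/etc/kubernetes/cni/net.d/multus.d/multus.kubeconfig", "delegates": [ {"cniVersion":"0.4.0","name":"ovn-kubernetes","type":"ovn-k8s-cni-overlay","ipam":{},"dns":{},"logFile":"/var/log/ovn-kubernetes/ovn-k8s-cni-overlay.log","logLevel":"4","logfile-maxsize":100,"logfile-maxbackups":5,"logfile-maxage":5} ] }"""


def summarize_multus_conf(text):
    """Return the readiness file path and the (name, type) of each delegate."""
    conf = json.loads(text)
    return {
        "readiness_file": conf.get("readinessindicatorfile"),
        "delegates": [(d.get("name"), d.get("type")) for d in conf.get("delegates", [])],
    }


print(summarize_multus_conf(MULTUS_CONF))
```

Here the only delegate is ovn-kubernetes, which is consistent with the sandbox errors in this report all originating from the "ovn-kubernetes" network rather than from Multus itself.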

oc describe pod network-metrics-daemon-dlh5h  -n openshift-multus
Name:                 network-metrics-daemon-dlh5h
Namespace:            openshift-multus
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 ip-10-0-169-235.us-east-2.compute.internal/10.0.169.235
Start Time:           Tue, 28 Jun 2022 11:09:58 +0000
Labels:               app=network-metrics-daemon
                      component=network
                      controller-revision-hash=55897ff588
                      openshift.io/component=network
                      pod-template-generation=1
                      type=infra
Annotations:          k8s.ovn.org/pod-networks:
                        {"default":{"ip_addresses":["10.129.18.4/23"],"mac_address":"0a:58:0a:81:12:04","gateway_ips":["10.129.18.1"],"ip_address":"10.129.18.4/23...
Status:               Pending
IP:
IPs:                  <none>
Controlled By:        DaemonSet/network-metrics-daemon
Containers:
  network-metrics-daemon:
    Container ID:
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c66bc33d936ebe73656b85f4dd498266880ebbd10ba4210586109b36e36a6258
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Command:
      /usr/bin/network-metrics
    Args:
      --node-name
      $(NODE_NAME)
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:     10m
      memory:  100Mi
    Environment:
      NODE_NAME:   (v1:spec.nodeName)
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-496mw (ro)
  kube-rbac-proxy:
    Container ID:
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b618601c08f13a78710a25221deeb31f4a9281acb0947e3c269984cff706d932
    Image ID:
    Port:          8443/TCP
    Host Port:     0/TCP
    Args:
      --logtostderr
      --secure-listen-address=:8443
      --tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_RSA_WITH_AES_128_CBC_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256,TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256
      --upstream=http://127.0.0.1:9091/
      --tls-private-key-file=/etc/metrics/tls.key
      --tls-cert-file=/etc/metrics/tls.crt
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:        10m
      memory:     20Mi
    Environment:  <none>
    Mounts:
      /etc/metrics from metrics-certs (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-496mw (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  metrics-certs:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  metrics-daemon-secret
    Optional:    false
  kube-api-access-496mw:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   Burstable
Node-Selectors:              kubernetes.io/os=linux
Tolerations:                 op=Exists
Events:
Type     Reason                  Age                   From               Message
--------------------------------------------------------------------------------
Normal   Scheduled               130m                  default-scheduler  Successfully assigned openshift-multus/network-metrics-daemon-dlh5h to ip-10-0-169-235.us-east-2.compute.internal by ip-10-0-145-48
Warning  ErrorAddingLogicalPort  130m (x3 over 130m)   controlplane       addLogicalPort failed for openshift-multus/network-metrics-daemon-dlh5h: unable to parse node L3 gw annotation: k8s.ovn.org/l3-gateway-config annotation not found for node "ip-10-0-169-235.us-east-2.compute.internal"
Warning  FailedMount             130m (x6 over 130m)   kubelet            MountVolume.SetUp failed for volume "metrics-certs" : object "openshift-multus"/"metrics-daemon-secret" not registered
Warning  NetworkNotReady         130m (x11 over 130m)  kubelet            network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?
Warning  FailedCreatePodSandBox  127m                  kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_network-metrics-daemon-dlh5h_openshift-multus_4eec4f3e-0216-493c-8e24-6b71ef3133e0_0(869809c73e7f3303f68e9b9b11a5c1432b59bba7c72d525099109ed6ad43f3d4): error adding pod openshift-multus_network-metrics-daemon-dlh5h to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network" failed (add): [openshift-multus/network-metrics-daemon-dlh5h/4eec4f3e-0216-493c-8e24-6b71ef3133e0:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[openshift-multus/network-metrics-daemon-dlh5h 869809c73e7f3303f68e9b9b11a5c1432b59bba7c72d525099109ed6ad43f3d4] [openshift-multus/network-metrics-daemon-dlh5h 869809c73e7f3303f68e9b9b11a5c1432b59bba7c72d525099109ed6ad43f3d4] failed to configure pod interface: timed out waiting for OVS port binding (ovn-installed) for 0a:58:0a:81:12:04 [10.129.18.4/23]
'
Warning  FailedCreatePodSandBox  125m  kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_network-metrics-daemon-dlh5h_openshift-multus_4eec4f3e-0216-493c-8e24-6b71ef3133e0_0(7194779e01eff0316e777760d9cea6c8dc59f0178e7a98adddfa26bbdf7398fa): error adding pod openshift-multus_network-metrics-daemon-dlh5h to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network" failed (add): [openshift-multus/network-metrics-daemon-dlh5h/4eec4f3e-0216-493c-8e24-6b71ef3133e0:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[openshift-multus/network-metrics-daemon-dlh5h 7194779e01eff0316e777760d9cea6c8dc59f0178e7a98adddfa26bbdf7398fa] [openshift-multus/network-metrics-daemon-dlh5h 7194779e01eff0316e777760d9cea6c8dc59f0178e7a98adddfa26bbdf7398fa] failed to configure pod interface: timed out waiting for OVS port binding (ovn-installed) for 0a:58:0a:81:12:04 [10.129.18.4/23]
'
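
The ErrorAddingLogicalPort event above reports that the k8s.ovn.org/l3-gateway-config annotation was not found on the node. One way to see how many nodes are in that state is to scan the node list as JSON (for example, the output of `oc get nodes -o json`). The sketch below uses a hypothetical function name and a made-up two-node sample; the second node name and the annotation value are placeholders, not data from this cluster.

```python
L3_GW_ANNOTATION = "k8s.ovn.org/l3-gateway-config"


def nodes_missing_l3_gateway_config(node_list):
    """Return names of nodes in a NodeList dict that lack the L3 gateway annotation."""
    missing = []
    for node in node_list.get("items", []):
        metadata = node.get("metadata", {})
        annotations = metadata.get("annotations") or {}
        if L3_GW_ANNOTATION not in annotations:
            missing.append(metadata.get("name"))
    return missing


# Hypothetical sample shaped like a NodeList; only the first node name is from this report.
sample_nodes = {
    "items": [
        {"metadata": {"name": "ip-10-0-169-235.us-east-2.compute.internal", "annotations": {}}},
        {"metadata": {"name": "example-node-with-annotation",
                      "annotations": {L3_GW_ANNOTATION: "{}"}}},  # placeholder value
    ]
}
print(nodes_missing_l3_gateway_config(sample_nodes))
```
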
oc logs -f ovnkube-node-vgkzv -n openshift-ovn-kubernetes -c ovn-controller
2022-06-28T11:33:10.039Z|00504|poll_loop|INFO|Dropped 23 log messages in last 78 seconds (most recently, 77 seconds ago) due to excessive rate
2022-06-28T11:33:10.039Z|00505|poll_loop|INFO|wakeup due to [POLLIN] on fd 16 (<->/var/run/openvswitch/db.sock) at lib/stream-fd.c:157 (99% CPU usage)
2022-06-28T11:33:10.039Z|00506|poll_loop|INFO|wakeup due to [POLLIN][POLLOUT] on fd 21 (<->/var/run/openvswitch/br-int.mgmt) at lib/stream-fd.c:153 (99% CPU usage)
2022-06-28T11:33:10.090Z|00507|poll_loop|INFO|wakeup due to [POLLOUT] on fd 21 (<->/var/run/openvswitch/br-int.mgmt) at lib/stream-fd.c:153 (99% CPU usage)
2022-06-28T11:33:10.107Z|00508|poll_loop|INFO|wakeup due to [POLLOUT] on fd 21 (<->/var/run/openvswitch/br-int.mgmt) at lib/stream-fd.c:153 (99% CPU usage)
2022-06-28T11:33:10.112Z|00509|poll_loop|INFO|wakeup due to [POLLOUT] on fd 21 (<->/var/run/openvswitch/br-int.mgmt) at lib/stream-fd.c:153 (99% CPU usage)
2022-06-28T11:33:10.115Z|00510|poll_loop|INFO|wakeup due to [POLLOUT] on fd 21 (<->/var/run/openvswitch/br-int.mgmt) at lib/stream-fd.c:153 (99% CPU usage)
2022-06-28T11:33:10.119Z|00511|poll_loop|INFO|wakeup due to [POLLOUT] on fd 21 (<->/var/run/openvswitch/br-int.mgmt) at lib/stream-fd.c:153 (99% CPU usage)
2022-06-28T11:33:10.122Z|00512|poll_loop|INFO|wakeup due to [POLLOUT] on fd 21 (<->/var/run/openvswitch/br-int.mgmt) at lib/stream-fd.c:153 (99% CPU usage)
2022-06-28T11:33:10.125Z|00513|poll_loop|INFO|wakeup due to [POLLOUT] on fd 21 (<->/var/run/openvswitch/br-int.mgmt) at lib/stream-fd.c:153 (99% CPU usage)
2022-06-28T11:33:10.129Z|00514|poll_loop|INFO|wakeup due to [POLLOUT] on fd 21 (<->/var/run/openvswitch/br-int.mgmt) at lib/stream-fd.c:153 (99% CPU usage)
2022-06-28T11:33:39.546Z|00515|lflow_cache|INFO|Detected cache inactivity (last active 30006 ms ago): trimming cache
2022-06-28T11:34:53.377Z|00516|ovsdb_cs|INFO|ssl:10.0.172.17:9642: clustered database server is disconnected from cluster; trying another server
2022-06-28T11:34:53.379Z|00517|reconnect|INFO|ssl:10.0.172.17:9642: connection attempt timed out
2022-06-28T11:34:53.380Z|00518|main|INFO|OVNSB commit failed, force recompute next time.

Last State:  Terminated
  Reason:    Error
  Message:    Cert: CACert: CertCommonName: Scheme: ElectionTimer:0 northbound:false exec:<nil>} Gateway:{Mode:shared Interface: EgressGWInterface: NextHop: VLANID:0 NodeportEnable:true DisableSNATMultipleGWs:false V4JoinSubnet:100.64.0.0/16 V6JoinSubnet:fd98::/64 DisablePacketMTUCheck:false RouterSubnet:} MasterHA:{ElectionLeaseDuration:60 ElectionRenewDeadline:30 ElectionRetryPeriod:20} HybridOverlay:{Enabled:false RawClusterSubnets: ClusterSubnets:[] VXLANPort:4789} OvnKubeNode:{Mode:full MgmtPortNetdev: DisableOVNIfaceIdVer:false}}

I0628 10:31:02.316819       1 client.go:325]  "msg"="trying to connect" "database"="OVN_Northbound" "endpoint"="ssl:10.0.145.48:9641"
I0628 10:31:02.328649       1 client.go:781]  "msg"="transacting operations"  "database"="_Server" "operations"="[{Op:select Table:Database Row:map[] Rows:[] Columns:[name model leader sid] Mutations:[] Timeout:<nil> Where:[] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}]"
I0628 10:31:02.329575       1 client.go:260]  "msg"="successfully connected" "database"="OVN_Northbound" "endpoint"="ssl:10.0.145.48:9641" "sid"="aa5bc698-feb5-4bb3-9470-7be6df5b7f15"
I0628 10:31:02.332711       1 client.go:325]  "msg"="trying to connect" "database"="OVN_Southbound" "endpoint"="ssl:10.0.145.48:9642"
I0628 10:31:02.333475       1 client.go:325]  "msg"="trying to connect" "database"="OVN_Southbound" "endpoint"="ssl:10.0.172.17:9642"
I0628 10:31:02.333577       1 client.go:325]  "msg"="trying to connect" "database"="OVN_Southbound" "endpoint"="ssl:10.0.207.143:9642"
F0628 10:31:02.334001       1 ovnkube.go:133] error when trying to initialize libovsdb SB client: unable to connect to any endpoints: failed to connect to ssl:10.0.145.48:9642: failed to open connection: dial tcp 10.0.145.48:9642: connect: connection refused. failed to connect to ssl:10.0.172.17:9642: failed to open connection: dial tcp 10.0.172.17:9642: connect: connection refused. failed to connect to ssl:10.0.207.143:9642: failed to open connection: dial tcp 10.0.207.143:9642: connect: connection refused
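
The fatal line above lists every southbound endpoint that was tried together with the dial error. To compare these across many ovnkube-node pods, the endpoints can be extracted with a regular expression. This is a sketch with a hypothetical function name, written against the message format seen in this one log line only.

```python
import re

# One 'failed to connect to <endpoint>: ... connect: <reason>' clause per endpoint.
ENDPOINT_ERROR_RE = re.compile(
    r"failed to connect to (ssl:[0-9.]+:\d+): failed to open connection: "
    r"dial tcp [0-9.]+:\d+: connect: ([a-z ]+?)(?:\.|$)"
)


def failed_endpoints(message):
    """Return a list of (endpoint, reason) pairs found in the fatal message."""
    return ENDPOINT_ERROR_RE.findall(message)


# Message text copied from the fatal log line in this report.
fatal = (
    "error when trying to initialize libovsdb SB client: unable to connect to any endpoints: "
    "failed to connect to ssl:10.0.145.48:9642: failed to open connection: "
    "dial tcp 10.0.145.48:9642: connect: connection refused. "
    "failed to connect to ssl:10.0.172.17:9642: failed to open connection: "
    "dial tcp 10.0.172.17:9642: connect: connection refused. "
    "failed to connect to ssl:10.0.207.143:9642: failed to open connection: "
    "dial tcp 10.0.207.143:9642: connect: connection refused"
)
for endpoint, reason in failed_endpoints(fatal):
    print(endpoint, reason)
```

All three southbound endpoints refused the connection at 10:31:02, while the northbound connection to 10.0.145.48:9641 succeeded in the same second.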

