Bug 2003558 - network-metrics-daemon not available after install: timed out waiting for OVS port binding (ovn-installed)
Summary: network-metrics-daemon not available after install: timed out waiting for OVS port binding (ovn-installed)
Keywords:
Status: CLOSED DUPLICATE of bug 1997205
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.10.0
Assignee: Martin Kennelly
QA Contact: Anurag saxena
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-09-13 08:41 UTC by Qiujie Li
Modified: 2023-09-15 01:15 UTC
CC: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-09-22 10:06:06 UTC
Target Upstream Version:
Embargoed:



Description Qiujie Li 2021-09-13 08:41:09 UTC
Description of problem:
On a 4.9 OVN AWS cluster, DaemonSet "openshift-multus/network-metrics-daemon" is not available (awaiting 1 nodes). The pod is stuck in 'ContainerCreating' state because of 'failed to configure pod interface: timed out waiting for OVS port binding (ovn-installed)'.


Pod events:

-----------
Events:
  Type     Reason                  Age                     From     Message
  ----     ------                  ----                    ----     -------
  Warning  FailedCreatePodSandBox  4m1s (x7283 over 6d6h)  kubelet  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_network-metrics-daemon-vv98k_openshift-multus_b2d3c2d2-cd90-407a-93f0-ac5c9029871c_0(234906b0fa9818ba3c67cf6ff332f9008abebf4910ea42588018c6605d00941a): error adding pod openshift-multus_network-metrics-daemon-vv98k to CNI network "multus-cni-network": [openshift-multus/network-metrics-daemon-vv98k:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[openshift-multus/network-metrics-daemon-vv98k 234906b0fa9818ba3c67cf6ff332f9008abebf4910ea42588018c6605d00941a] [openshift-multus/network-metrics-daemon-vv98k 234906b0fa9818ba3c67cf6ff332f9008abebf4910ea42588018c6605d00941a] failed to configure pod interface: timed out waiting for OVS port binding (ovn-installed) for 0a:58:0a:83:00:04 [10.131.0.4/23]
-----------

Reliability test: https://github.com/openshift/svt/tree/master/reliability
This test simulates project creation, checks, scale up/down, builds, modification, deletion, and application access, running continuously for 1 week.


Version-Release number of selected component (if applicable):
4.9.0-0.nightly-2021-09-06-055314

How reproducible:

Steps to Reproduce:
1. Create an OVN AWS cluster with 3 master and 3 worker nodes (m5.xlarge instance type).

Actual results:
The network operator is in PROGRESSING state: DaemonSet "openshift-multus/network-metrics-daemon" is not available (awaiting 1 nodes)

Expected results:
The network operator is not in PROGRESSING state

Additional info:

---------------------------

# oc get clusterversions
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0-0.nightly-2021-09-06-055314   True        False         6d6h    Cluster version is 4.9.0-0.nightly-2021-09-06-055314

---------------------------

# oc get nodes
NAME                                         STATUS   ROLES    AGE    VERSION
ip-10-0-130-178.us-east-2.compute.internal   Ready    master   6d6h   v1.22.0-rc.0+75ee307
ip-10-0-135-41.us-east-2.compute.internal    Ready    worker   6d6h   v1.22.0-rc.0+75ee307
ip-10-0-167-28.us-east-2.compute.internal    Ready    master   6d6h   v1.22.0-rc.0+75ee307
ip-10-0-176-130.us-east-2.compute.internal   Ready    worker   6d6h   v1.22.0-rc.0+75ee307
ip-10-0-210-23.us-east-2.compute.internal    Ready    worker   6d6h   v1.22.0-rc.0+75ee307
ip-10-0-223-209.us-east-2.compute.internal   Ready    master   6d6h   v1.22.0-rc.0+75ee307

---------------------------

# oc get pods -A | egrep -v "Completed|Running"
NAMESPACE                                          NAME                                                                  READY   STATUS              RESTARTS       AGE
openshift-multus                                   network-metrics-daemon-vv98k                                          0/2     ContainerCreating   0              6d6h

---------------------------

# oc get pod network-metrics-daemon-vv98k -n openshift-multus -o wide
NAME                           READY   STATUS              RESTARTS   AGE    IP       NODE                                        NOMINATED NODE   READINESS GATES
network-metrics-daemon-vv98k   0/2     ContainerCreating   0          6d6h   <none>   ip-10-0-210-23.us-east-2.compute.internal   <none>           <none>


---------------------------

# oc describe pod network-metrics-daemon-vv98k -n openshift-multus
Name:                 network-metrics-daemon-vv98k
Namespace:            openshift-multus
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 ip-10-0-210-23.us-east-2.compute.internal/10.0.210.23
Start Time:           Tue, 07 Sep 2021 01:48:27 +0000
Labels:               app=network-metrics-daemon
                      component=network
                      controller-revision-hash=58b6b476f5
                      openshift.io/component=network
                      pod-template-generation=1
                      type=infra
Annotations:          k8s.ovn.org/pod-networks:
                        {"default":{"ip_addresses":["10.131.0.4/23"],"mac_address":"0a:58:0a:83:00:04","gateway_ips":["10.131.0.1"],"ip_address":"10.131.0.4/23","...
Status:               Pending
IP:                   
IPs:                  <none>
Controlled By:        DaemonSet/network-metrics-daemon
Containers:
  network-metrics-daemon:
    Container ID:  
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:77cfdefb0c09c686a4d749db754aabc330c28e39975ed4263b4cb4dd82b7d56a
    Image ID:      
    Port:          <none>
    Host Port:     <none>
    Command:
      /usr/bin/network-metrics
    Args:
      --node-name
      $(NODE_NAME)
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:     10m
      memory:  100Mi
    Environment:
      NODE_NAME:   (v1:spec.nodeName)
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-7l5j2 (ro)
  kube-rbac-proxy:
    Container ID:  
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:79ed2aa8d4c6bb63c813bead31c2b7da0f082506315e9b93f6a3bfbc1c44d940
    Image ID:      
    Port:          8443/TCP
    Host Port:     0/TCP
    Args:
      --logtostderr
      --secure-listen-address=:8443
      --tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_RSA_WITH_AES_128_CBC_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256,TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256
      --upstream=http://127.0.0.1:9091/
      --tls-private-key-file=/etc/metrics/tls.key
      --tls-cert-file=/etc/metrics/tls.crt
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:        10m
      memory:     20Mi
    Environment:  <none>
    Mounts:
      /etc/metrics from metrics-certs (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-7l5j2 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  metrics-certs:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  metrics-daemon-secret
    Optional:    false
  kube-api-access-7l5j2:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   Burstable
Node-Selectors:              kubernetes.io/os=linux
Tolerations:                 op=Exists
Events:
  Type     Reason                  Age                     From     Message
  ----     ------                  ----                    ----     -------
  Warning  FailedCreatePodSandBox  4m1s (x7283 over 6d6h)  kubelet  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_network-metrics-daemon-vv98k_openshift-multus_b2d3c2d2-cd90-407a-93f0-ac5c9029871c_0(234906b0fa9818ba3c67cf6ff332f9008abebf4910ea42588018c6605d00941a): error adding pod openshift-multus_network-metrics-daemon-vv98k to CNI network "multus-cni-network": [openshift-multus/network-metrics-daemon-vv98k:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[openshift-multus/network-metrics-daemon-vv98k 234906b0fa9818ba3c67cf6ff332f9008abebf4910ea42588018c6605d00941a] [openshift-multus/network-metrics-daemon-vv98k 234906b0fa9818ba3c67cf6ff332f9008abebf4910ea42588018c6605d00941a] failed to configure pod interface: timed out waiting for OVS port binding (ovn-installed) for 0a:58:0a:83:00:04 [10.131.0.4/23]
'

---------------------------
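
For context on the pod description above: the k8s.ovn.org/pod-networks annotation shows that OVN did allocate networking for the pod (10.131.0.4/23 with MAC 0a:58:0a:83:00:04, matching the address in the failing port-binding message), so the failure is in binding the OVS port, not in IP allocation. A minimal sketch of pulling those fields out of such an annotation; parse_pod_networks is a hypothetical helper, not part of any product code:

```python
import json

def parse_pod_networks(annotation: str) -> dict:
    """Extract the default network's IP/MAC from a
    k8s.ovn.org/pod-networks annotation value (hypothetical helper)."""
    default = json.loads(annotation)["default"]
    return {
        "ip": default["ip_address"],        # e.g. "10.131.0.4/23"
        "mac": default["mac_address"],      # e.g. "0a:58:0a:83:00:04"
        "gateways": default.get("gateway_ips", []),
    }

# Values taken from the pod description in this bug:
ann = ('{"default":{"ip_addresses":["10.131.0.4/23"],'
       '"mac_address":"0a:58:0a:83:00:04",'
       '"gateway_ips":["10.131.0.1"],"ip_address":"10.131.0.4/23"}}')
print(parse_pod_networks(ann))
```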


# oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.9.0-0.nightly-2021-09-06-055314   True        False         False      2d12h   
baremetal                                  4.9.0-0.nightly-2021-09-06-055314   True        False         False      6d6h    
cloud-controller-manager                   4.9.0-0.nightly-2021-09-06-055314   True        False         False      6d6h    
cloud-credential                           4.9.0-0.nightly-2021-09-06-055314   True        False         False      6d6h    
cluster-autoscaler                         4.9.0-0.nightly-2021-09-06-055314   True        False         False      6d6h    
config-operator                            4.9.0-0.nightly-2021-09-06-055314   True        False         False      6d6h    
console                                    4.9.0-0.nightly-2021-09-06-055314   True        False         False      6d6h    
csi-snapshot-controller                    4.9.0-0.nightly-2021-09-06-055314   True        False         False      6d6h    
dns                                        4.9.0-0.nightly-2021-09-06-055314   True        False         False      6d6h    
etcd                                       4.9.0-0.nightly-2021-09-06-055314   True        False         False      6d6h    
image-registry                             4.9.0-0.nightly-2021-09-06-055314   True        False         False      6d6h    
ingress                                    4.9.0-0.nightly-2021-09-06-055314   True        False         False      6d6h    
insights                                   4.9.0-0.nightly-2021-09-06-055314   True        False         False      6d6h    
kube-apiserver                             4.9.0-0.nightly-2021-09-06-055314   True        False         False      6d6h    
kube-controller-manager                    4.9.0-0.nightly-2021-09-06-055314   True        False         False      6d6h    
kube-scheduler                             4.9.0-0.nightly-2021-09-06-055314   True        False         False      6d6h    
kube-storage-version-migrator              4.9.0-0.nightly-2021-09-06-055314   True        False         False      6d6h    
machine-api                                4.9.0-0.nightly-2021-09-06-055314   True        False         False      6d6h    
machine-approver                           4.9.0-0.nightly-2021-09-06-055314   True        False         False      6d6h    
machine-config                             4.9.0-0.nightly-2021-09-06-055314   True        False         False      6d6h    
marketplace                                4.9.0-0.nightly-2021-09-06-055314   True        False         False      6d6h    
monitoring                                 4.9.0-0.nightly-2021-09-06-055314   True        False         False      6d6h    
network                                    4.9.0-0.nightly-2021-09-06-055314   True        True          False      6d6h    DaemonSet "openshift-multus/network-metrics-daemon" is not available (awaiting 1 nodes)
node-tuning                                4.9.0-0.nightly-2021-09-06-055314   True        False         False      6d6h    
openshift-apiserver                        4.9.0-0.nightly-2021-09-06-055314   True        False         False      6d6h    
openshift-controller-manager               4.9.0-0.nightly-2021-09-06-055314   True        False         False      6d6h    
openshift-samples                          4.9.0-0.nightly-2021-09-06-055314   True        False         False      6d6h    
operator-lifecycle-manager                 4.9.0-0.nightly-2021-09-06-055314   True        False         False      6d6h    
operator-lifecycle-manager-catalog         4.9.0-0.nightly-2021-09-06-055314   True        False         False      6d6h    
operator-lifecycle-manager-packageserver   4.9.0-0.nightly-2021-09-06-055314   True        False         False      6d6h    
service-ca                                 4.9.0-0.nightly-2021-09-06-055314   True        False         False      6d6h    
storage                                    4.9.0-0.nightly-2021-09-06-055314   True        False         False      6d6h

Comment 1 Alexander Constantinescu 2021-09-13 15:14:36 UTC
Hi

Could you please provide us with a must-gather or a cluster where this issue is reproduced? 

Thanks in advance,
Alex

Comment 3 Qiujie Li 2021-09-14 02:51:18 UTC
@mkennell @aconstan let me know if you can't see the private Comment 2 for the kubeconfig and must-gather. Bugzilla prompted "The assignee of this bug cannot see private comments!" when I marked it as private. I wonder if that is a false alert, since you should be able to see it if you're members of the redhat group.

Comment 4 Mike Fiedler 2021-09-16 14:39:28 UTC
Possibly related to https://bugzilla.redhat.com/show_bug.cgi?id=1996201

Comment 5 Martin Kennelly 2021-09-16 16:56:57 UTC
@qili I cannot access the kubeconfig: permission denied. Can you correct this and give me access? Thank you.

As Mike said, it may be related to that issue.
Can you check whether there is a large number of veths on the node where the failed pod is scheduled? You can use oc debug node/${NODE_NAME} to get a terminal on the worker host.

On that node run:
# See number of OVS controlled interfaces 
ovs-vsctl --columns=name --data=bare --format=table list interface | wc -l

# See number of interfaces actually on the node 
ifconfig | grep MULTICAST | wc -l 

This will give you an idea of the number of interfaces seen on the host versus the number managed by OVS. A large discrepancy may indicate the veth "leak" from the issue Mike linked above is occurring.

Also, this veth leak cleans itself up after a period of time, so you may not see the veths anymore. You may have to rerun your test suite and watch for a rising number of veths on the host.
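
The comparison described above can be sketched as small helpers operating on captured command output. This is hypothetical illustration code, assuming the ovs-vsctl table format prints a 'name' header plus a dashed separator line, and an arbitrary slack threshold:

```python
def count_ovs_interfaces(ovs_table_output: str) -> int:
    """Count rows in the output of:
    ovs-vsctl --columns=name --data=bare --format=table list interface
    Assumes the first two lines are the 'name' header and a dashed rule."""
    rows = [l for l in ovs_table_output.strip().splitlines() if l.strip()]
    return max(len(rows) - 2, 0)

def count_multicast_interfaces(ifconfig_output: str) -> int:
    """Count host interfaces whose ifconfig flags include MULTICAST."""
    return sum("MULTICAST" in line for line in ifconfig_output.splitlines())

def veth_leak_suspected(host_count: int, ovs_count: int, slack: int = 50) -> bool:
    """Flag a large host-vs-OVS interface discrepancy; the slack value
    is an illustrative choice, not a product-defined threshold."""
    return host_count - ovs_count > slack
```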

Comment 6 Qiujie Li 2021-09-17 03:59:41 UTC
@mkennell Sorry, the test cluster has already been destroyed. This does not reproduce every time in my previous tests; I will try the steps you suggest when I can reproduce it again. Thanks for the triage.

Comment 8 Martin Kennelly 2021-09-17 10:31:30 UTC
Thank you for the update.
As stated by Qiujie Li, network-metrics-daemon fails to roll out during install on one node.
This is not a blocker+ for OVN-K because it does not degrade provisioning of workloads following the failure seen; tests of workloads were still carried out successfully.
I am now trying to understand why network-metrics-daemon failed to provision and whether it is related to OVN-K.

Comment 11 Martin Kennelly 2021-09-17 18:14:50 UTC
The must-gather logs were retrieved 7 days post-install, when this error occurred. The ovnkube-master logs unfortunately do not cover the period of time when this pod's LSP was or was not added.

@qili: Can you reproduce this and get me a must-gather?

Comment 19 Martin Kennelly 2021-09-22 10:06:06 UTC
Chatted with Riccardo. He is further ahead in his understanding of this bug. Thank you Qiujie Li for reproducing this issue. I have passed the details of the cluster to Riccardo.

*** This bug has been marked as a duplicate of bug 1997205 ***

Comment 20 Red Hat Bugzilla 2023-09-15 01:15:01 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days

