Description of problem:

On a 4.9 OVN AWS cluster, DaemonSet "openshift-multus/network-metrics-daemon" is not available (awaiting 1 nodes). The pod is stuck in 'ContainerCreating' state because 'failed to configure pod interface: timed out waiting for OVS port binding (ovn-installed)'.

pod's events
-----------
Events:
  Type     Reason                  Age                     From     Message
  ----     ------                  ----                    ----     -------
  Warning  FailedCreatePodSandBox  4m1s (x7283 over 6d6h)  kubelet  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_network-metrics-daemon-vv98k_openshift-multus_b2d3c2d2-cd90-407a-93f0-ac5c9029871c_0(234906b0fa9818ba3c67cf6ff332f9008abebf4910ea42588018c6605d00941a): error adding pod openshift-multus_network-metrics-daemon-vv98k to CNI network "multus-cni-network": [openshift-multus/network-metrics-daemon-vv98k:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[openshift-multus/network-metrics-daemon-vv98k 234906b0fa9818ba3c67cf6ff332f9008abebf4910ea42588018c6605d00941a] [openshift-multus/network-metrics-daemon-vv98k 234906b0fa9818ba3c67cf6ff332f9008abebf4910ea42588018c6605d00941a] failed to configure pod interface: timed out waiting for OVS port binding (ovn-installed) for 0a:58:0a:83:00:04 [10.131.0.4/23]
-----------

Reliability test: https://github.com/openshift/svt/tree/master/reliability
This test simulates project creation, checks, scale up/down, builds, modification, deletion, and application access, running continuously for 1 week.

Version-Release number of selected component (if applicable):
4.9.0-0.nightly-2021-09-06-055314

How reproducible:

Steps to Reproduce:
1. Create an OVN AWS cluster with 3 master and 3 worker nodes, m5.xlarge type.
2. Run the reliability test above continuously for 1 week.
3. Check the network cluster operator and the pods in openshift-multus.
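For reference, step 1 can be done roughly as below. This is a minimal sketch, not the exact commands used; the --dir name is a placeholder, and the install-config fields are the standard ones for selecting the OVN network type and instance size.

# Generate the install config, adjust it, then install
openshift-install create install-config --dir=ovn-aws
# Edit ovn-aws/install-config.yaml:
#   networking.networkType: OVNKubernetes
#   compute[0]: replicas: 3, platform.aws.type: m5.xlarge
#   controlPlane: replicas: 3, platform.aws.type: m5.xlarge
openshift-install create cluster --dir=ovn-aws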
Actual results:
network operator in PROGRESSING state
DaemonSet "openshift-multus/network-metrics-daemon" is not available (awaiting 1 nodes)

Expected results:
network operator not in PROGRESSING state

Additional info:
---------------------------
# oc get clusterversions
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0-0.nightly-2021-09-06-055314   True        False         6d6h    Cluster version is 4.9.0-0.nightly-2021-09-06-055314
---------------------------
# oc get nodes
NAME                                         STATUS   ROLES    AGE    VERSION
ip-10-0-130-178.us-east-2.compute.internal   Ready    master   6d6h   v1.22.0-rc.0+75ee307
ip-10-0-135-41.us-east-2.compute.internal    Ready    worker   6d6h   v1.22.0-rc.0+75ee307
ip-10-0-167-28.us-east-2.compute.internal    Ready    master   6d6h   v1.22.0-rc.0+75ee307
ip-10-0-176-130.us-east-2.compute.internal   Ready    worker   6d6h   v1.22.0-rc.0+75ee307
ip-10-0-210-23.us-east-2.compute.internal    Ready    worker   6d6h   v1.22.0-rc.0+75ee307
ip-10-0-223-209.us-east-2.compute.internal   Ready    master   6d6h   v1.22.0-rc.0+75ee307
---------------------------
# oc get pods -A | egrep -v "Completed|Running"
NAMESPACE          NAME                           READY   STATUS              RESTARTS   AGE
openshift-multus   network-metrics-daemon-vv98k   0/2     ContainerCreating   0          6d6h
---------------------------
# oc get pod network-metrics-daemon-vv98k -n openshift-multus -o wide
NAME                           READY   STATUS              RESTARTS   AGE    IP       NODE                                        NOMINATED NODE   READINESS GATES
network-metrics-daemon-vv98k   0/2     ContainerCreating   0          6d6h   <none>   ip-10-0-210-23.us-east-2.compute.internal   <none>           <none>
---------------------------
# oc describe pod network-metrics-daemon-vv98k -n openshift-multus
Name:                 network-metrics-daemon-vv98k
Namespace:            openshift-multus
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 ip-10-0-210-23.us-east-2.compute.internal/10.0.210.23
Start Time:           Tue, 07 Sep 2021 01:48:27 +0000
Labels:               app=network-metrics-daemon
                      component=network
                      controller-revision-hash=58b6b476f5
                      openshift.io/component=network
                      pod-template-generation=1
                      type=infra
Annotations:          k8s.ovn.org/pod-networks:
                        {"default":{"ip_addresses":["10.131.0.4/23"],"mac_address":"0a:58:0a:83:00:04","gateway_ips":["10.131.0.1"],"ip_address":"10.131.0.4/23","...
Status:               Pending
IP:
IPs:                  <none>
Controlled By:        DaemonSet/network-metrics-daemon
Containers:
  network-metrics-daemon:
    Container ID:
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:77cfdefb0c09c686a4d749db754aabc330c28e39975ed4263b4cb4dd82b7d56a
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Command:
      /usr/bin/network-metrics
    Args:
      --node-name
      $(NODE_NAME)
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:     10m
      memory:  100Mi
    Environment:
      NODE_NAME:   (v1:spec.nodeName)
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-7l5j2 (ro)
  kube-rbac-proxy:
    Container ID:
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:79ed2aa8d4c6bb63c813bead31c2b7da0f082506315e9b93f6a3bfbc1c44d940
    Image ID:
    Port:          8443/TCP
    Host Port:     0/TCP
    Args:
      --logtostderr
      --secure-listen-address=:8443
      --tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_RSA_WITH_AES_128_CBC_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256,TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256
      --upstream=http://127.0.0.1:9091/
      --tls-private-key-file=/etc/metrics/tls.key
      --tls-cert-file=/etc/metrics/tls.crt
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:     10m
      memory:  20Mi
    Environment:  <none>
    Mounts:
      /etc/metrics from metrics-certs (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-7l5j2 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  metrics-certs:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  metrics-daemon-secret
    Optional:    false
  kube-api-access-7l5j2:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   Burstable
Node-Selectors:              kubernetes.io/os=linux
Tolerations:                 op=Exists
Events:
  Type     Reason                  Age                     From     Message
  ----     ------                  ----                    ----     -------
  Warning  FailedCreatePodSandBox  4m1s (x7283 over 6d6h)  kubelet  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_network-metrics-daemon-vv98k_openshift-multus_b2d3c2d2-cd90-407a-93f0-ac5c9029871c_0(234906b0fa9818ba3c67cf6ff332f9008abebf4910ea42588018c6605d00941a): error adding pod openshift-multus_network-metrics-daemon-vv98k to CNI network "multus-cni-network": [openshift-multus/network-metrics-daemon-vv98k:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[openshift-multus/network-metrics-daemon-vv98k 234906b0fa9818ba3c67cf6ff332f9008abebf4910ea42588018c6605d00941a] [openshift-multus/network-metrics-daemon-vv98k 234906b0fa9818ba3c67cf6ff332f9008abebf4910ea42588018c6605d00941a] failed to configure pod interface: timed out waiting for OVS port binding (ovn-installed) for 0a:58:0a:83:00:04 [10.131.0.4/23] '
---------------------------
# oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.9.0-0.nightly-2021-09-06-055314   True        False         False      2d12h
baremetal                                  4.9.0-0.nightly-2021-09-06-055314   True        False         False      6d6h
cloud-controller-manager                   4.9.0-0.nightly-2021-09-06-055314   True        False         False      6d6h
cloud-credential                           4.9.0-0.nightly-2021-09-06-055314   True        False         False      6d6h
cluster-autoscaler                         4.9.0-0.nightly-2021-09-06-055314   True        False         False      6d6h
config-operator                            4.9.0-0.nightly-2021-09-06-055314   True        False         False      6d6h
console                                    4.9.0-0.nightly-2021-09-06-055314   True        False         False      6d6h
csi-snapshot-controller                    4.9.0-0.nightly-2021-09-06-055314   True        False         False      6d6h
dns                                        4.9.0-0.nightly-2021-09-06-055314   True        False         False      6d6h
etcd                                       4.9.0-0.nightly-2021-09-06-055314   True        False         False      6d6h
image-registry                             4.9.0-0.nightly-2021-09-06-055314   True        False         False      6d6h
ingress                                    4.9.0-0.nightly-2021-09-06-055314   True        False         False      6d6h
insights                                   4.9.0-0.nightly-2021-09-06-055314   True        False         False      6d6h
kube-apiserver                             4.9.0-0.nightly-2021-09-06-055314   True        False         False      6d6h
kube-controller-manager                    4.9.0-0.nightly-2021-09-06-055314   True        False         False      6d6h
kube-scheduler                             4.9.0-0.nightly-2021-09-06-055314   True        False         False      6d6h
kube-storage-version-migrator              4.9.0-0.nightly-2021-09-06-055314   True        False         False      6d6h
machine-api                                4.9.0-0.nightly-2021-09-06-055314   True        False         False      6d6h
machine-approver                           4.9.0-0.nightly-2021-09-06-055314   True        False         False      6d6h
machine-config                             4.9.0-0.nightly-2021-09-06-055314   True        False         False      6d6h
marketplace                                4.9.0-0.nightly-2021-09-06-055314   True        False         False      6d6h
monitoring                                 4.9.0-0.nightly-2021-09-06-055314   True        False         False      6d6h
network                                    4.9.0-0.nightly-2021-09-06-055314   True        True          False      6d6h    DaemonSet "openshift-multus/network-metrics-daemon" is not available (awaiting 1 nodes)
node-tuning                                4.9.0-0.nightly-2021-09-06-055314   True        False         False      6d6h
openshift-apiserver                        4.9.0-0.nightly-2021-09-06-055314   True        False         False      6d6h
openshift-controller-manager               4.9.0-0.nightly-2021-09-06-055314   True        False         False      6d6h
openshift-samples                          4.9.0-0.nightly-2021-09-06-055314   True        False         False      6d6h
operator-lifecycle-manager                 4.9.0-0.nightly-2021-09-06-055314   True        False         False      6d6h
operator-lifecycle-manager-catalog         4.9.0-0.nightly-2021-09-06-055314   True        False         False      6d6h
operator-lifecycle-manager-packageserver   4.9.0-0.nightly-2021-09-06-055314   True        False         False      6d6h
service-ca                                 4.9.0-0.nightly-2021-09-06-055314   True        False         False      6d6h
storage                                    4.9.0-0.nightly-2021-09-06-055314   True        False         False      6d6h
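To confirm which pod is keeping the DaemonSet from becoming available, a quick check like the following can be used (a minimal sketch; the names are taken from the operator message above):

# oc get daemonset network-metrics-daemon -n openshift-multus
# oc get pods -n openshift-multus -o wide | grep -v Running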
Hi,

Could you please provide us with a must-gather or a cluster where this issue is reproduced?

Thanks in advance,
Alex
@mkennell @aconstan let me know if you can't see the private Comment 2 for the kubeconfig and must-gather. Bugzilla prompted "The assignee of this bug cannot see private comments!" when I marked it as private. I wonder if that is a false alert; if you're members of the redhat group, you should be able to see it.
Possibly related to https://bugzilla.redhat.com/show_bug.cgi?id=1996201
@qili I cannot access the kubeconfig. Permission denied. Can you correct this and give me access? Thank you.

As Mike said, it may be related to that issue. Can you check whether there is a large number of veths on the node where this failed pod is scheduled? You may use oc debug node/${NODE_NAME} to get a terminal on the worker host. On that node run:

# See number of OVS-controlled interfaces
ovs-vsctl --columns=name --data=bare --format=table list interface | wc -l

# See number of interfaces actually on the node
ifconfig | grep MULTICAST | wc -l

This will give you an idea of the number of interfaces seen on the host versus the number managed by OVS. If there is a large discrepancy, it may indicate that the veth "leak" Mike linked above is occurring. Also, this veth leak will clean itself up after a period of time, so you may not see the veths anymore. You may have to rerun your test suite and see if there is a rising level of veths present on the host; a periodic count check like the sketch below could help track that.
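A hypothetical sketch of such a periodic check, assuming the node name and the 5-minute interval are placeholders and that the two counts above are the ones worth comparing over time:

# Every 5 minutes, record host interface count vs. OVS interface count on the node
NODE_NAME=ip-10-0-210-23.us-east-2.compute.internal   # placeholder node
while true; do
  date
  oc debug node/${NODE_NAME} -- chroot /host sh -c \
    'echo "host ifaces: $(ip -o link show | wc -l)"; echo "ovs ifaces: $(ovs-vsctl --columns=name --data=bare --format=table list interface | wc -l)"'
  sleep 300
done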
@mkennell Sorry, the test cluster has already been destroyed. This did not reproduce every time in my previous tests; I will try the steps you suggested when I can reproduce it again. Thanks for the triage.
Thank you for the update. As stated by Qiujie Li, network-metrics-daemon fails to roll out during install on one node. This is not a blocker+ for OVN-K because it does not degrade provisioning of workloads following the failure seen; tests of workloads were still carried out successfully. I am now trying to understand why network-metrics-daemon failed to provision and whether it is related to OVN-K.
The must-gather logs were retrieved 7 days post-install, after this error occurred. The ovnkube-master logs unfortunately do not cover the period of time when this pod's LSP was or was not added. @qili: Can you reproduce this and get me a must-gather? Capturing the ovnkube-master logs soon after the failure appears (see the sketch below) would help.
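A minimal sketch of what to collect right after the failure is seen; the label selector and container name are the assumed ones for the OVN-Kubernetes master pods, so adjust as needed:

# Standard must-gather
oc adm must-gather

# Grab the current ovnkube-master logs before they rotate away
for p in $(oc get pods -n openshift-ovn-kubernetes -l app=ovnkube-master -o name); do
  oc logs -n openshift-ovn-kubernetes "$p" -c ovnkube-master --timestamps > "$(basename $p).ovnkube-master.log"
done

# And the description/events of the stuck pod
oc describe pod network-metrics-daemon-vv98k -n openshift-multus > network-metrics-daemon.describe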
Chatted with Riccardo. He is further along in his understanding of this bug. Thank you Qiujie Li for reproducing this issue. I passed the details of the cluster to Riccardo.

*** This bug has been marked as a duplicate of bug 1997205 ***
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days