Bug 2055857

Summary: SNO could not recover from a DHCP outage due to error 'failed to configure pod interface: timed out waiting for OVS port binding (ovn-installed)'
Product: OpenShift Container Platform
Reporter: bzhai
Component: Networking
Assignee: obraunsh
Networking sub component: ovn-kubernetes
QA Contact: Anurag saxena <anusaxen>
Status: CLOSED CURRENTRELEASE
Docs Contact:
Severity: high
Priority: high
CC: achernet, cback, dahernan, eglottma, ffernand, obraunsh, trozet
Version: 4.9
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-11-17 21:53:03 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions: ---
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host: ---
Cloudforms Team: ---
Target Upstream Version: ---
Embargoed:

Description bzhai 2022-02-17 19:57:50 UTC
Description of problem:

An OCP cluster (both SNO and regular multi-node clusters) could not recover automatically from a DHCP outage.


Version-Release number of selected component (if applicable):
4.9.18

How reproducible:
Always

Steps to Reproduce:
1. Stop the DHCP service in the lab and leave it stopped for around 10 hours
2. Start the DHCP service
3. Keep monitoring whether the cluster recovers automatically
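The monitoring in step 3 can be sketched as a small shell loop (illustrative only; assumes a logged-in `oc` session against the affected cluster, and the 60s interval is arbitrary):

```shell
#!/bin/sh
# Poll until every pod is back in Running or Completed state (sketch).
while :; do
    stuck=$(oc get pods -A --no-headers | grep -cvE "Running|Completed")
    echo "$(date -u +%FT%TZ) pods not Running/Completed: ${stuck}"
    [ "${stuck}" -eq 0 ] && break
    sleep 60
done
```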


Actual results:
The cluster cannot be recovered automatically.

Expected results:
The cluster should recover automatically.


Additional info:
1. Some pods are stuck in ContainerCreating status, and newly created pods are also stuck in ContainerCreating
kni@jumphost ~/sno/sno147 $ oc get pods -A |grep -vE "Running|Completed"
NAMESPACE                                          NAME                                                         READY   STATUS              RESTARTS          AGE
default                                            nginx                                                        0/1     ContainerCreating   0                 55m
openshift-marketplace                              certified-operators-9rld8                                    0/1     ContainerCreating   0                 157m
openshift-marketplace                              community-operators-fw4wx                                    0/1     ContainerCreating   0                 157m
openshift-marketplace                              redhat-marketplace-pqvvn                                     0/1     ContainerCreating   0                 157m
openshift-marketplace                              redhat-operators-wb5m4                                       0/1     ContainerCreating   0                 157m
openshift-multus                                   ip-reconciler-27418560--1-nvc5j                              0/1     ContainerCreating   0                 9m17s
openshift-operator-lifecycle-manager               collect-profiles-27418410--1-pgq8q                           0/1     ContainerCreating   0                 151m
openshift-operator-lifecycle-manager               collect-profiles-27418425--1-zbxdm                           0/1     ContainerCreating   0                 144m
openshift-operator-lifecycle-manager               collect-profiles-27418440--1-s4w8t                           0/1     ContainerCreating   0                 129m
openshift-operator-lifecycle-manager               collect-profiles-27418455--1-jkrhc                           0/1     ContainerCreating   0                 114m
openshift-operator-lifecycle-manager               collect-profiles-27418470--1-rj57s                           0/1     ContainerCreating   0                 99m
openshift-operator-lifecycle-manager               collect-profiles-27418485--1-kwqvp                           0/1     ContainerCreating   0                 84m
openshift-operator-lifecycle-manager               collect-profiles-27418500--1-hp9fl                           0/1     ContainerCreating   0                 69m
openshift-operator-lifecycle-manager               collect-profiles-27418515--1-kd2fj                           0/1     ContainerCreating   0                 54m
openshift-operator-lifecycle-manager               collect-profiles-27418530--1-rvp4b                           0/1     ContainerCreating   0                 39m
openshift-operator-lifecycle-manager               collect-profiles-27418545--1-54g2j                           0/1     ContainerCreating   0                 24m

kni@jumphost ~/sno/sno147 $ oc get pods
NAME    READY   STATUS              RESTARTS   AGE
nginx   0/1     ContainerCreating   0          60m

kni@jumphost ~/sno/sno147 $ oc describe pods nginx
……
  Warning  FailedCreatePodSandBox  3m49s (x46 over 50m)  kubelet  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_nginx_default_93b27f51-58bf-4f80-916b-e10ae03ae10e_0(72a6865188234eaf3452a7412771e6b59250365d90603d687581da5f6f829e11): error adding pod default_nginx to CNI network "multus-cni-network": [default/nginx/93b27f51-58bf-4f80-916b-e10ae03ae10e:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[default/nginx 72a6865188234eaf3452a7412771e6b59250365d90603d687581da5f6f829e11] [default/nginx 72a6865188234eaf3452a7412771e6b59250365d90603d687581da5f6f829e11] failed to configure pod interface: timed out waiting for OVS port binding (ovn-installed) for 0a:58:0a:80:00:19 [10.128.0.25/23]
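For context, `ovn-installed` is an `external_ids` key that ovn-controller adds to the pod's OVS interface once the logical switch port is bound, and the CNI waits for it before finishing the sandbox. On the node it can be inspected roughly like this (a sketch; the pod selector and container name are assumptions about the `ovnkube-node` layout, and running it requires access to the node's OVS database):

```shell
# From the ovnkube-node pod on the affected node: list pod interfaces and
# check whether ovn-controller has marked them as installed.
oc -n openshift-ovn-kubernetes exec -c ovn-controller \
    "$(oc -n openshift-ovn-kubernetes get pods -l app=ovnkube-node -o name | head -n1)" -- \
    ovs-vsctl --columns=name,external_ids list Interface | grep -B1 'ovn-installed'
# Healthy ports show external_ids containing ovn-installed="true"; while the
# binding is stuck the key never appears, so CNI ADD times out as above.
```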

This issue is tracked by BZ#2055865

2. Some pods are in Error status; this issue is tracked in BZ#2054791

openshift-operator-lifecycle-manager               collect-profiles-27417165--1-8btqh                           0/1     Error               0                 18h
openshift-operator-lifecycle-manager               collect-profiles-27417165--1-cgg9w                           0/1     Error               0                 17h
openshift-operator-lifecycle-manager               collect-profiles-27417165--1-dw9xx                           0/1     Error               0                 18h
openshift-operator-lifecycle-manager               collect-profiles-27417165--1-jnlhp                           0/1     Error               0                 18h
openshift-operator-lifecycle-manager               collect-profiles-27417165--1-nwj4q                           0/1     Error               0                 18h
openshift-operator-lifecycle-manager               collect-profiles-27417165--1-s2kkc                           0/1     Error               0                 18h
openshift-operator-lifecycle-manager               collect-profiles-27417165--1-wsnbr                           0/1     Error               0              23h


oc logs -f ip-reconciler-27415935--1-25s77   -n openshift-multus
I0215 20:15:04.488950       1 request.go:655] Throttling request took 1.188038381s, request: GET:https://172.30.0.1:443/apis/work.open-cluster-management.io/v1?timeout=32s
I0215 20:15:14.688690       1 request.go:655] Throttling request took 6.998699027s, request: GET:https://172.30.0.1:443/apis/operators.coreos.com/v2?timeout=32s
I0215 20:15:24.886392       1 request.go:655] Throttling request took 17.196413607s, request: GET:https://172.30.0.1:443/apis/admission.hive.openshift.io/v1?timeout=32s
2022-02-15T20:15:31Z [error] failed to retrieve all IP pools: context deadline exceeded
2022-02-15T20:15:31Z [error] failed to create the reconcile looper: failed to retrieve all IP pools: context deadline exceeded

3. For the collect-profiles pods (a cron job), the number of pods kept increasing; eventually this causes the node to reach its pod quota limit and report an OutOfpods error. This issue is tracked in BZ#2055861 to tune the default cronjob settings:

oc get job -n openshift-operator-lifecycle-manager  -o jsonpath="{.items[0].spec.backoffLimit}"
6
oc get cronjob  -n openshift-operator-lifecycle-manager  collect-profiles -o jsonpath={.spec.concurrencyPolicy}
Allow
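BZ#2055861 tracks tuning these defaults; as an illustration of the kind of change involved (not necessarily the fix that shipped for that bug), the cron job could be told not to start overlapping runs:

```shell
# Illustrative mitigation only: forbid concurrent collect-profiles runs so
# failed pods cannot pile up while the node is unhealthy.
oc patch cronjob collect-profiles -n openshift-operator-lifecycle-manager \
    --type merge -p '{"spec":{"concurrencyPolicy":"Forbid"}}'
```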

Comment 8 bzhai 2022-04-08 12:51:55 UTC
There is nothing in this BZ itself that needs a code fix, but it is better to close it only after all the linked bugs are fixed/closed.

//Borball