Description of problem:

An OCP cluster, whether SNO or a regular multi-node cluster, cannot recover automatically from a DHCP outage.

Version-Release number of selected component (if applicable):

4.9.18

How reproducible:

Always

Steps to Reproduce:
1. Stop the DHCP service in the lab and leave it stopped for around 10 hours
2. Start the DHCP service
3. Keep monitoring whether the cluster recovers automatically (a rough sketch of the monitoring loop is included at the end of the additional info below)

Actual results:

The cluster cannot be recovered automatically.

Expected results:

The cluster shall be recovered automatically.

Additional info:

1. Some pods are stuck in ContainerCreating status, and newly created pods also get stuck in ContainerCreating:

kni@jumphost ~/sno/sno147 $ oc get pods -A | grep -vE "Running|Completed"
NAMESPACE                              NAME                                  READY   STATUS              RESTARTS   AGE
default                                nginx                                 0/1     ContainerCreating   0          55m
openshift-marketplace                  certified-operators-9rld8             0/1     ContainerCreating   0          157m
openshift-marketplace                  community-operators-fw4wx             0/1     ContainerCreating   0          157m
openshift-marketplace                  redhat-marketplace-pqvvn              0/1     ContainerCreating   0          157m
openshift-marketplace                  redhat-operators-wb5m4                0/1     ContainerCreating   0          157m
openshift-multus                       ip-reconciler-27418560--1-nvc5j       0/1     ContainerCreating   0          9m17s
openshift-operator-lifecycle-manager   collect-profiles-27418410--1-pgq8q    0/1     ContainerCreating   0          151m
openshift-operator-lifecycle-manager   collect-profiles-27418425--1-zbxdm    0/1     ContainerCreating   0          144m
openshift-operator-lifecycle-manager   collect-profiles-27418440--1-s4w8t    0/1     ContainerCreating   0          129m
openshift-operator-lifecycle-manager   collect-profiles-27418455--1-jkrhc    0/1     ContainerCreating   0          114m
openshift-operator-lifecycle-manager   collect-profiles-27418470--1-rj57s    0/1     ContainerCreating   0          99m
openshift-operator-lifecycle-manager   collect-profiles-27418485--1-kwqvp    0/1     ContainerCreating   0          84m
openshift-operator-lifecycle-manager   collect-profiles-27418500--1-hp9fl    0/1     ContainerCreating   0          69m
openshift-operator-lifecycle-manager   collect-profiles-27418515--1-kd2fj    0/1     ContainerCreating   0          54m
openshift-operator-lifecycle-manager   collect-profiles-27418530--1-rvp4b    0/1     ContainerCreating   0          39m
openshift-operator-lifecycle-manager   collect-profiles-27418545--1-54g2j    0/1     ContainerCreating   0          24m

kni@jumphost ~/sno/sno147 $ oc get pods
NAME    READY   STATUS              RESTARTS   AGE
nginx   0/1     ContainerCreating   0          60m

kni@jumphost ~/sno/sno147 $ oc describe pods nginx
......
  Warning  FailedCreatePodSandBox  3m49s (x46 over 50m)  kubelet  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_nginx_default_93b27f51-58bf-4f80-916b-e10ae03ae10e_0(72a6865188234eaf3452a7412771e6b59250365d90603d687581da5f6f829e11): error adding pod default_nginx to CNI network "multus-cni-network": [default/nginx/93b27f51-58bf-4f80-916b-e10ae03ae10e:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[default/nginx 72a6865188234eaf3452a7412771e6b59250365d90603d687581da5f6f829e11] [default/nginx 72a6865188234eaf3452a7412771e6b59250365d90603d687581da5f6f829e11] failed to configure pod interface: timed out waiting for OVS port binding (ovn-installed) for 0a:58:0a:80:00:19 [10.128.0.25/23]

This issue is tracked by BZ#2055865.
2. Some pods are in Error status; this issue is tracked in BZ#2054791:

openshift-operator-lifecycle-manager   collect-profiles-27417165--1-8btqh   0/1   Error   0   18h
openshift-operator-lifecycle-manager   collect-profiles-27417165--1-cgg9w   0/1   Error   0   17h
openshift-operator-lifecycle-manager   collect-profiles-27417165--1-dw9xx   0/1   Error   0   18h
openshift-operator-lifecycle-manager   collect-profiles-27417165--1-jnlhp   0/1   Error   0   18h
openshift-operator-lifecycle-manager   collect-profiles-27417165--1-nwj4q   0/1   Error   0   18h
openshift-operator-lifecycle-manager   collect-profiles-27417165--1-s2kkc   0/1   Error   0   18h
openshift-operator-lifecycle-manager   collect-profiles-27417165--1-wsnbr   0/1   Error   0   23h

oc logs -f ip-reconciler-27415935--1-25s77 -n openshift-multus
I0215 20:15:04.488950       1 request.go:655] Throttling request took 1.188038381s, request: GET:https://172.30.0.1:443/apis/work.open-cluster-management.io/v1?timeout=32s
I0215 20:15:14.688690       1 request.go:655] Throttling request took 6.998699027s, request: GET:https://172.30.0.1:443/apis/operators.coreos.com/v2?timeout=32s
I0215 20:15:24.886392       1 request.go:655] Throttling request took 17.196413607s, request: GET:https://172.30.0.1:443/apis/admission.hive.openshift.io/v1?timeout=32s
2022-02-15T20:15:31Z [error] failed to retrieve all IP pools: context deadline exceeded
2022-02-15T20:15:31Z [error] failed to create the reconcile looper: failed to retrieve all IP pools: context deadline exceeded

3. For the collect-profiles pods (created by a CronJob), the number of pods kept increasing; eventually the node hits its pod quota and new pods fail with OutOfpods. This is tracked in BZ#2055861, which asks for tuning of the default CronJob settings:

oc get job -n openshift-operator-lifecycle-manager -o jsonpath="{.items[0].spec.backoffLimit}"
6

oc get cronjob -n openshift-operator-lifecycle-manager collect-profiles -o jsonpath={.spec.concurrencyPolicy}
Allow
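The proper tuning of these defaults will come from BZ#2055861. As an interim, manual stop-gap in the lab, something along the following lines could keep the pile-up bounded; this is only a sketch, not the fix, and the chosen policy value is an assumption rather than the value that will land in the product:

# Hypothetical stop-gap only; the real default tuning is tracked in BZ#2055861.
# Stop new collect-profiles runs from piling up on top of stuck ones:
oc patch cronjob collect-profiles -n openshift-operator-lifecycle-manager \
  --type merge -p '{"spec":{"concurrencyPolicy":"Replace"}}'
# Clean up the jobs (and their pods) that have already accumulated:
oc get jobs -n openshift-operator-lifecycle-manager -o name | grep collect-profiles \
  | xargs oc delete -n openshift-operator-lifecycle-manager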
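For reference, the outage and the recovery check in the steps to reproduce can be driven roughly as follows. This is only an illustrative sketch: the dnsmasq service name and the helper host are assumptions about the lab setup, not something stated in this BZ.

# Rough sketch, assuming the lab DHCP is dnsmasq on a helper host (adjust to your setup):
sudo systemctl stop dnsmasq      # step 1: begin the DHCP outage
sleep $((10 * 3600))             # leave DHCP down for ~10 hours
sudo systemctl start dnsmasq     # step 2: bring DHCP back
# step 3: watch for pods that never recover after DHCP is back
watch -n 60 'oc get pods -A | grep -vE "Running|Completed"'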
There is nothing in this BZ that needs to be fixed from a code perspective, but it is better to close it only after all the linked bugs are fixed/closed. //Borball