Description of problem:

An OCP cluster, whether SNO or a regular multi-node cluster, cannot recover automatically from a DHCP outage.

Version-Release number of selected component (if applicable):

4.9.18

How reproducible:

Always

Steps to Reproduce:
1. Stop the DHCP service in the lab and leave it stopped for around 10 hours
2. Start the DHCP service
3. Keep monitoring whether the cluster recovers automatically (a rough sketch of the monitoring loop is included at the end of the additional info below)

Actual results:

The cluster cannot be recovered automatically.

Expected results:

The cluster shall be recovered automatically.

Additional info:

1. Some pods are stuck in ContainerCreating status, and newly created pods also get stuck in ContainerCreating:

kni@jumphost ~/sno/sno147 $ oc get pods -A | grep -vE "Running|Completed"
NAMESPACE                              NAME                                  READY   STATUS              RESTARTS   AGE
default                                nginx                                 0/1     ContainerCreating   0          55m
openshift-marketplace                  certified-operators-9rld8             0/1     ContainerCreating   0          157m
openshift-marketplace                  community-operators-fw4wx             0/1     ContainerCreating   0          157m
openshift-marketplace                  redhat-marketplace-pqvvn              0/1     ContainerCreating   0          157m
openshift-marketplace                  redhat-operators-wb5m4                0/1     ContainerCreating   0          157m
openshift-multus                       ip-reconciler-27418560--1-nvc5j       0/1     ContainerCreating   0          9m17s
openshift-operator-lifecycle-manager   collect-profiles-27418410--1-pgq8q    0/1     ContainerCreating   0          151m
openshift-operator-lifecycle-manager   collect-profiles-27418425--1-zbxdm    0/1     ContainerCreating   0          144m
openshift-operator-lifecycle-manager   collect-profiles-27418440--1-s4w8t    0/1     ContainerCreating   0          129m
openshift-operator-lifecycle-manager   collect-profiles-27418455--1-jkrhc    0/1     ContainerCreating   0          114m
openshift-operator-lifecycle-manager   collect-profiles-27418470--1-rj57s    0/1     ContainerCreating   0          99m
openshift-operator-lifecycle-manager   collect-profiles-27418485--1-kwqvp    0/1     ContainerCreating   0          84m
openshift-operator-lifecycle-manager   collect-profiles-27418500--1-hp9fl    0/1     ContainerCreating   0          69m
openshift-operator-lifecycle-manager   collect-profiles-27418515--1-kd2fj    0/1     ContainerCreating   0          54m
openshift-operator-lifecycle-manager   collect-profiles-27418530--1-rvp4b    0/1     ContainerCreating   0          39m
openshift-operator-lifecycle-manager   collect-profiles-27418545--1-54g2j    0/1     ContainerCreating   0          24m

kni@jumphost ~/sno/sno147 $ oc get pods
NAME    READY   STATUS              RESTARTS   AGE
nginx   0/1     ContainerCreating   0          60m

kni@jumphost ~/sno/sno147 $ oc describe pods nginx
......
  Warning  FailedCreatePodSandBox  3m49s (x46 over 50m)  kubelet  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_nginx_default_93b27f51-58bf-4f80-916b-e10ae03ae10e_0(72a6865188234eaf3452a7412771e6b59250365d90603d687581da5f6f829e11): error adding pod default_nginx to CNI network "multus-cni-network": [default/nginx/93b27f51-58bf-4f80-916b-e10ae03ae10e:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[default/nginx 72a6865188234eaf3452a7412771e6b59250365d90603d687581da5f6f829e11] [default/nginx 72a6865188234eaf3452a7412771e6b59250365d90603d687581da5f6f829e11] failed to configure pod interface: timed out waiting for OVS port binding (ovn-installed) for 0a:58:0a:80:00:19 [10.128.0.25/23]

This issue is tracked by BZ#2055865.
2. Some pods are in Error status; this issue is tracked in BZ#2054791:

openshift-operator-lifecycle-manager   collect-profiles-27417165--1-8btqh   0/1   Error   0   18h
openshift-operator-lifecycle-manager   collect-profiles-27417165--1-cgg9w   0/1   Error   0   17h
openshift-operator-lifecycle-manager   collect-profiles-27417165--1-dw9xx   0/1   Error   0   18h
openshift-operator-lifecycle-manager   collect-profiles-27417165--1-jnlhp   0/1   Error   0   18h
openshift-operator-lifecycle-manager   collect-profiles-27417165--1-nwj4q   0/1   Error   0   18h
openshift-operator-lifecycle-manager   collect-profiles-27417165--1-s2kkc   0/1   Error   0   18h
openshift-operator-lifecycle-manager   collect-profiles-27417165--1-wsnbr   0/1   Error   0   23h

oc logs -f ip-reconciler-27415935--1-25s77 -n openshift-multus
I0215 20:15:04.488950       1 request.go:655] Throttling request took 1.188038381s, request: GET:https://172.30.0.1:443/apis/work.open-cluster-management.io/v1?timeout=32s
I0215 20:15:14.688690       1 request.go:655] Throttling request took 6.998699027s, request: GET:https://172.30.0.1:443/apis/operators.coreos.com/v2?timeout=32s
I0215 20:15:24.886392       1 request.go:655] Throttling request took 17.196413607s, request: GET:https://172.30.0.1:443/apis/admission.hive.openshift.io/v1?timeout=32s
2022-02-15T20:15:31Z [error] failed to retrieve all IP pools: context deadline exceeded
2022-02-15T20:15:31Z [error] failed to create the reconcile looper: failed to retrieve all IP pools: context deadline exceeded

3. For the collect-profiles pods (created by a CronJob), the number of pods kept increasing; eventually the node hits its pod quota and new pods fail with OutOfpods. This is tracked in BZ#2055861, which asks for tuning of the default CronJob settings:

oc get job -n openshift-operator-lifecycle-manager -o jsonpath="{.items[0].spec.backoffLimit}"
6

oc get cronjob -n openshift-operator-lifecycle-manager collect-profiles -o jsonpath={.spec.concurrencyPolicy}
Allow
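The proper tuning of these defaults will come from BZ#2055861. As an interim, manual stop-gap in the lab, something along the following lines could keep the pile-up bounded; this is only a sketch, not the fix, and the chosen policy value is an assumption rather than the value that will land in the product:

# Hypothetical stop-gap only; the real default tuning is tracked in BZ#2055861.
# Stop new collect-profiles runs from piling up on top of stuck ones:
oc patch cronjob collect-profiles -n openshift-operator-lifecycle-manager \
  --type merge -p '{"spec":{"concurrencyPolicy":"Replace"}}'
# Clean up the jobs (and their pods) that have already accumulated:
oc get jobs -n openshift-operator-lifecycle-manager -o name | grep collect-profiles \
  | xargs oc delete -n openshift-operator-lifecycle-manager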
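For reference, the outage and the recovery check in the steps to reproduce can be driven roughly as follows. This is only an illustrative sketch: the dnsmasq service name and the helper host are assumptions about the lab setup, not something stated in this BZ.

# Rough sketch, assuming the lab DHCP is dnsmasq on a helper host (adjust to your setup):
sudo systemctl stop dnsmasq      # step 1: begin the DHCP outage
sleep $((10 * 3600))             # leave DHCP down for ~10 hours
sudo systemctl start dnsmasq     # step 2: bring DHCP back
# step 3: watch for pods that never recover after DHCP is back
watch -n 60 'oc get pods -A | grep -vE "Running|Completed"'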
There is nothing in this BZ that needs to be fixed from a code perspective, but it is better to close it only after all the linked bugs are fixed/closed. //Borball