Bug 2055861

Summary:	cronjob collect-profiles failed leads node reach to OutOfpods status
Product:	OpenShift Container Platform	Reporter:	bzhai
Component:	OLM	Assignee:	Alexander Greene <agreene>
OLM sub component:	OLM	QA Contact:	Jian Zhang <jiazha>
Status:	CLOSED ERRATA	Docs Contact:
Severity:	high
Priority:	high	CC:	achernet, agreene, bzvonar, cback, jiazha, krizza, pegoncal
Version:	4.9
Target Milestone:	---
Target Release:	4.11.0
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:	Cause: The collect-profiles job should only take a few seconds to run. There are instances, such as when the pod cannot be scheduled, where the job will not complete in a reasonable amount of time. Consequence: If enough jobs are scheduled but unable to run, the number of scheduled jobs can exceed pod quota limits. Fix: Given that the collect-profiles job should only take a few seconds to run and that the job is scheduled to run every 15 minutes, set the collect-profiles cronJob's spec.concurrencyPolicy to "Replace" so that only one active collect-profiles pod exists at any time. Result: The collect-profiles job will no longer create an excessive number of pods that leads to a OutOfPods condition.	Story Points:	---
Clone Of:
Clones:	2071941 (view as bug list)		Environment:
Last Closed:	2022-08-10 10:50:22 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	2071941

Description bzhai 2022-02-17 20:03:35 UTC

Description of problem:

The settings in cronjob collect-profiles shall be tuned. 


Version-Release number of selected component (if applicable):
4.9.z

How reproducible:
Always

Steps to Reproduce:
1. Deploy a SNO cluster
2. Bring down the DHCP server for couple of hours
3. Bring up the DHCP server
4. There are many collect-profiles-xxxxxx pods stuck in ContainerCreating state due to BZ#2055857, and those pods kept increasing and eventually leads the SNO node reach to OutOfpods status due to node pod quota limits. 

Actual results:
Should have so many retrying happened, if one job fail somehow, simply wait for next run.


Expected results:
Regardless of BZ#2055857, collect-profiles cron job should not have so many retrying happened, if one job fail somehow, simply wait for next run.


Additional info:
Those default settings shall be reviewed and tuned. 

oc get job -n openshift-operator-lifecycle-manager  -o jsonpath="{.items[0].spec.backoffLimit}"
6
oc get cronjob  -n openshift-operator-lifecycle-manager  collect-profiles -o jsonpath={.spec.concurrencyPolicy}
Allow

Comment 1 Jian Zhang 2022-02-25 06:31:26 UTC

Hi Borball,

> 2. Bring down the DHCP server for couple of hours

Sorry, I'm not familiar with it, did you deploy DHCP server on the SNO cluster? Is it available for you to paste the steps of deploying DHCP server here? Thanks!

Comment 2 bzhai 2022-02-25 21:29:45 UTC

Hi Jian Zhang,

DHCP is the one provides the IP assignment for SNO node, that is the prequiresite[1] for any OCP clusters not only SNO. In our lab we are using dnsmasq.

---
[1]: https://docs.openshift.com/container-platform/4.9/installing/installing_bare_metal_ipi/ipi-install-prerequisites.html#network-requirements-dhcp-reqs_ipi-install-prerequisites

//Borball

Comment 3 Jian Zhang 2022-02-28 09:40:00 UTC

Hi Borball,

> 3. Bring up the DHCP server

After the DHCP server restart, does the cluster node restart? And its IP changed, right? Thanks!

Comment 4 bzhai 2022-02-28 15:54:20 UTC

No, no action or changes on the cluster , and the IP is not changed, it is expetecd that the cluster could recover by itself.

Comment 5 Jian Zhang 2022-03-17 04:06:01 UTC

1, Create an OCP cluster that the fixed PR merged in via cluster-bot.
mac:~ jianzhang$ oc get clusterversion
NAME      VERSION                                                   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.ci.test-2022-03-17-013610-ci-ln-vttlg4b-latest   True        False         30m     Cluster version is 4.11.0-0.ci.test-2022-03-17-013610-ci-ln-vttlg4b-latest

2, Cordon the worker nodes that the collect-profiles job pods running on.
mac:~ jianzhang$ oc adm cordon ci-ln-vttlg4b-72292-cbq8w-worker-a-vkbtj ci-ln-vttlg4b-72292-cbq8w-worker-b-58qcm ci-ln-vttlg4b-72292-cbq8w-worker-c-ztqc7
node/ci-ln-vttlg4b-72292-cbq8w-worker-a-vkbtj cordoned
node/ci-ln-vttlg4b-72292-cbq8w-worker-b-58qcm cordoned
node/ci-ln-vttlg4b-72292-cbq8w-worker-c-ztqc7 cordoned
mac:~ jianzhang$ oc get nodes
NAME                                       STATUS                     ROLES    AGE   VERSION
ci-ln-vttlg4b-72292-cbq8w-master-0         Ready                      master   52m   v1.23.3+0233338
ci-ln-vttlg4b-72292-cbq8w-master-1         Ready                      master   52m   v1.23.3+0233338
ci-ln-vttlg4b-72292-cbq8w-master-2         Ready                      master   52m   v1.23.3+0233338
ci-ln-vttlg4b-72292-cbq8w-worker-a-vkbtj   Ready,SchedulingDisabled   worker   43m   v1.23.3+0233338
ci-ln-vttlg4b-72292-cbq8w-worker-b-58qcm   Ready,SchedulingDisabled   worker   43m   v1.23.3+0233338
ci-ln-vttlg4b-72292-cbq8w-worker-c-ztqc7   Ready,SchedulingDisabled   worker   43m   v1.23.3+0233338

3, Check if more collect-profiles pods generated.
mac:~ jianzhang$ oc get pods -n openshift-operator-lifecycle-manager
NAME                                      READY   STATUS      RESTARTS   AGE
catalog-operator-7cd459f578-lv8wr         1/1     Running     0          139m
collect-profiles-27458100-nb5p8           0/1     Completed   0          63m
collect-profiles-27458115-dd792           0/1     Completed   0          48m
collect-profiles-27458130-shtcl           0/1     Completed   0          33m
collect-profiles-27458160-62scp           0/1     Pending     0          3m20s

As above, only one pod is pending after the job run twice. LGTM, verify it.

olm-operator-587bf9b6cb-wssvk             1/1     Running     0          139m

Comment 8 Per da Silva 2022-04-05 09:34:53 UTC

Hi Chris,

I've created it here: https://bugzilla.redhat.com/show_bug.cgi?id=2071941

Cheers,

Per

Comment 13 errata-xmlrpc 2022-08-10 10:50:22 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069