Bug 2055861 - cronjob collect-profiles failures lead the node to reach OutOfPods status
Summary: cronjob collect-profiles failures lead the node to reach OutOfPods status
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: OLM
Version: 4.9
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.11.0
Assignee: Alexander Greene
QA Contact: Jian Zhang
URL:
Whiteboard:
Depends On:
Blocks: 2071941
Reported: 2022-02-17 20:03 UTC by bzhai
Modified: 2022-11-28 03:37 UTC
CC: 7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The collect-profiles job should only take a few seconds to run, but there are instances, such as when the pod cannot be scheduled, where the job does not complete in a reasonable amount of time.
Consequence: If enough jobs are scheduled but unable to run, the number of scheduled pods can exceed the node's pod quota limit.
Fix: Because the collect-profiles job should only take a few seconds to run and is scheduled every 15 minutes, the collect-profiles CronJob's spec.concurrencyPolicy is set to "Replace" so that only one active collect-profiles pod exists at any time.
Result: The collect-profiles job no longer creates an excessive number of pods leading to an OutOfPods condition.
Clone Of:
Clones: 2071941
Environment:
Last Closed: 2022-08-10 10:50:22 UTC
Target Upstream Version:
Embargoed:


Links
Github openshift/operator-framework-olm pull 255 (open): Bug 2055861: Replace collect-profile jobs that haven't completed (last updated 2022-02-28 22:02:19 UTC)
Red Hat Product Errata RHSA-2022:5069 (last updated 2022-08-10 10:50:35 UTC)

Internal Links: 2055857

Description bzhai 2022-02-17 20:03:35 UTC
Description of problem:

The default settings of the collect-profiles cronjob should be reviewed and tuned.


Version-Release number of selected component (if applicable):
4.9.z

How reproducible:
Always

Steps to Reproduce:
1. Deploy a SNO cluster
2. Bring down the DHCP server for a couple of hours
3. Bring up the DHCP server
4. Many collect-profiles-xxxxxx pods are stuck in ContainerCreating state due to BZ#2055857. Those pods keep increasing and eventually drive the SNO node to OutOfPods status because of the node pod quota limit; the pod build-up can be watched as shown below.
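
For reference, a couple of standard commands that can be used to watch the pod build-up (these are illustrative additions, not part of the original report; <node-name> is a placeholder):

oc get pods -n openshift-operator-lifecycle-manager --no-headers | grep collect-profiles | wc -l    # count of collect-profiles pods
oc describe node <node-name> | grep -i pods    # node pod capacity and allocatable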

Actual results:
The cron job retries repeatedly: collect-profiles pods keep being created until the node runs out of pod capacity.


Expected results:
Regardless of BZ#2055857, the collect-profiles cron job should not retry so aggressively; if one job fails, it should simply wait for the next scheduled run.


Additional info:
These default settings should be reviewed and tuned:

oc get job -n openshift-operator-lifecycle-manager  -o jsonpath="{.items[0].spec.backoffLimit}"
6
oc get cronjob  -n openshift-operator-lifecycle-manager  collect-profiles -o jsonpath={.spec.concurrencyPolicy}
Allow
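
The eventual fix described in the Doc Text above flips the concurrencyPolicy shown here from Allow to Replace. A minimal sketch of the relevant CronJob fragment, assuming the standard batch CronJob schema (the 15-minute schedule is taken from the Doc Text; the rest of the spec is omitted):

spec:
  schedule: "*/15 * * * *"       # every 15 minutes per the Doc Text; exact cron string shown here is illustrative
  concurrencyPolicy: Replace     # was Allow; a new run replaces a job that is still active

With Replace, a pod that cannot be scheduled is superseded by the next run instead of piling up alongside it.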

Comment 1 Jian Zhang 2022-02-25 06:31:26 UTC
Hi Borball,

> 2. Bring down the DHCP server for couple of hours

Sorry, I'm not familiar with it. Did you deploy the DHCP server on the SNO cluster? Could you paste the steps for deploying the DHCP server here? Thanks!

Comment 2 bzhai 2022-02-25 21:29:45 UTC
Hi Jian Zhang,

DHCP is what provides IP assignment for the SNO node; it is a prerequisite[1] for any OCP cluster, not only SNO. In our lab we are using dnsmasq.

---
[1]: https://docs.openshift.com/container-platform/4.9/installing/installing_bare_metal_ipi/ipi-install-prerequisites.html#network-requirements-dhcp-reqs_ipi-install-prerequisites

//Borball

Comment 3 Jian Zhang 2022-02-28 09:40:00 UTC
Hi Borball,

> 3. Bring up the DHCP server

After the DHCP server comes back up, does the cluster node restart? And did its IP change? Thanks!

Comment 4 bzhai 2022-02-28 15:54:20 UTC
No, there were no actions or changes on the cluster, and the IP did not change. It is expected that the cluster can recover by itself.

Comment 5 Jian Zhang 2022-03-17 04:06:01 UTC
1, Create an OCP cluster that includes the fix PR, via cluster-bot.
mac:~ jianzhang$ oc get clusterversion
NAME      VERSION                                                   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.ci.test-2022-03-17-013610-ci-ln-vttlg4b-latest   True        False         30m     Cluster version is 4.11.0-0.ci.test-2022-03-17-013610-ci-ln-vttlg4b-latest

2, Cordon the worker nodes that the collect-profiles job pods run on.
mac:~ jianzhang$ oc adm cordon ci-ln-vttlg4b-72292-cbq8w-worker-a-vkbtj ci-ln-vttlg4b-72292-cbq8w-worker-b-58qcm ci-ln-vttlg4b-72292-cbq8w-worker-c-ztqc7
node/ci-ln-vttlg4b-72292-cbq8w-worker-a-vkbtj cordoned
node/ci-ln-vttlg4b-72292-cbq8w-worker-b-58qcm cordoned
node/ci-ln-vttlg4b-72292-cbq8w-worker-c-ztqc7 cordoned
mac:~ jianzhang$ oc get nodes
NAME                                       STATUS                     ROLES    AGE   VERSION
ci-ln-vttlg4b-72292-cbq8w-master-0         Ready                      master   52m   v1.23.3+0233338
ci-ln-vttlg4b-72292-cbq8w-master-1         Ready                      master   52m   v1.23.3+0233338
ci-ln-vttlg4b-72292-cbq8w-master-2         Ready                      master   52m   v1.23.3+0233338
ci-ln-vttlg4b-72292-cbq8w-worker-a-vkbtj   Ready,SchedulingDisabled   worker   43m   v1.23.3+0233338
ci-ln-vttlg4b-72292-cbq8w-worker-b-58qcm   Ready,SchedulingDisabled   worker   43m   v1.23.3+0233338
ci-ln-vttlg4b-72292-cbq8w-worker-c-ztqc7   Ready,SchedulingDisabled   worker   43m   v1.23.3+0233338

3, Check whether more collect-profiles pods are generated.
mac:~ jianzhang$ oc get pods -n openshift-operator-lifecycle-manager
NAME                                      READY   STATUS      RESTARTS   AGE
catalog-operator-7cd459f578-lv8wr         1/1     Running     0          139m
collect-profiles-27458100-nb5p8           0/1     Completed   0          63m
collect-profiles-27458115-dd792           0/1     Completed   0          48m
collect-profiles-27458130-shtcl           0/1     Completed   0          33m
collect-profiles-27458160-62scp           0/1     Pending     0          3m20s
olm-operator-587bf9b6cb-wssvk             1/1     Running     0          139m

As above, only one pod is pending even after the job has run twice. LGTM, verifying it.
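
On a build that contains the fix, the policy change itself can also be confirmed with the same jsonpath query used in the description (this check is an editorial addition, not part of the original verification steps); the expected output is Replace instead of Allow:

oc get cronjob collect-profiles -n openshift-operator-lifecycle-manager -o jsonpath={.spec.concurrencyPolicy}
Replace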

Comment 8 Per da Silva 2022-04-05 09:34:53 UTC
Hi Chris,

I've created it here: https://bugzilla.redhat.com/show_bug.cgi?id=2071941

Cheers,

Per

Comment 13 errata-xmlrpc 2022-08-10 10:50:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

