Description of problem:

The settings of the collect-profiles cronjob should be tuned.

Version-Release number of selected component (if applicable):

4.9.z

How reproducible:

Always

Steps to Reproduce:
1. Deploy an SNO cluster
2. Bring down the DHCP server for a couple of hours
3. Bring up the DHCP server
4. Many collect-profiles-xxxxxx pods are stuck in ContainerCreating state due to BZ#2055857; those pods keep increasing and eventually the SNO node reaches OutOfpods status because of the node pod quota limit.

Actual results:

The cron job retries many times; the failed pods accumulate until the node runs out of pod capacity.

Expected results:

Regardless of BZ#2055857, the collect-profiles cron job should not retry so many times; if one job fails, it should simply wait for the next scheduled run.

Additional info:

Those default settings should be reviewed and tuned (an illustrative patch is sketched after this comment):

oc get job -n openshift-operator-lifecycle-manager -o jsonpath="{.items[0].spec.backoffLimit}"
6

oc get cronjob -n openshift-operator-lifecycle-manager collect-profiles -o jsonpath={.spec.concurrencyPolicy}
Allow
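For illustration only, one way the two defaults above could be tightened is to stop retrying a failed job (backoffLimit: 0) and to replace, rather than stack, overlapping runs (concurrencyPolicy: Replace). This sketches the shape of the tuning, not necessarily the values chosen by the actual fix; also note that OLM manages this cronjob, so a manual patch may be reconciled back to the defaults:

oc patch cronjob collect-profiles -n openshift-operator-lifecycle-manager --type merge \
  -p '{"spec": {"concurrencyPolicy": "Replace", "jobTemplate": {"spec": {"backoffLimit": 0}}}}'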
Hi Borball,

> 2. Bring down the DHCP server for a couple of hours

Sorry, I'm not familiar with this setup. Did you deploy the DHCP server on the SNO cluster? Could you paste the steps for deploying the DHCP server here? Thanks!
Hi Jian Zhang,

DHCP is what provides the IP assignment for the SNO node; it is a prerequisite[1] for any OCP cluster, not only SNO. In our lab we are using dnsmasq.

---
[1]: https://docs.openshift.com/container-platform/4.9/installing/installing_bare_metal_ipi/ipi-install-prerequisites.html#network-requirements-dhcp-reqs_ipi-install-prerequisites

//Borball
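To make the lab setup concrete, a minimal dnsmasq DHCP configuration for such a deployment might look like the fragment below; the interface name, MAC address, hostname, and addresses are hypothetical, not taken from this report:

# /etc/dnsmasq.conf (illustrative fragment)
interface=ens3
dhcp-range=192.168.1.100,192.168.1.200,12h
# Pin a static lease to the SNO node so its IP stays stable across DHCP restarts
dhcp-host=52:54:00:aa:bb:cc,sno-node,192.168.1.10,infinite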
Hi Borball,

> 3. Bring up the DHCP server

After the DHCP server restarts, does the cluster node restart? And did its IP change? Thanks!
No, there were no actions or changes on the cluster, and the IP did not change. It is expected that the cluster can recover by itself.
1, Create an OCP cluster with the fixed PR merged in, via cluster-bot.

mac:~ jianzhang$ oc get clusterversion
NAME      VERSION                                                    AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.ci.test-2022-03-17-013610-ci-ln-vttlg4b-latest   True        False         30m     Cluster version is 4.11.0-0.ci.test-2022-03-17-013610-ci-ln-vttlg4b-latest

2, Cordon the worker nodes that the collect-profiles job pods run on.

mac:~ jianzhang$ oc adm cordon ci-ln-vttlg4b-72292-cbq8w-worker-a-vkbtj ci-ln-vttlg4b-72292-cbq8w-worker-b-58qcm ci-ln-vttlg4b-72292-cbq8w-worker-c-ztqc7
node/ci-ln-vttlg4b-72292-cbq8w-worker-a-vkbtj cordoned
node/ci-ln-vttlg4b-72292-cbq8w-worker-b-58qcm cordoned
node/ci-ln-vttlg4b-72292-cbq8w-worker-c-ztqc7 cordoned

mac:~ jianzhang$ oc get nodes
NAME                                       STATUS                     ROLES    AGE   VERSION
ci-ln-vttlg4b-72292-cbq8w-master-0         Ready                      master   52m   v1.23.3+0233338
ci-ln-vttlg4b-72292-cbq8w-master-1         Ready                      master   52m   v1.23.3+0233338
ci-ln-vttlg4b-72292-cbq8w-master-2         Ready                      master   52m   v1.23.3+0233338
ci-ln-vttlg4b-72292-cbq8w-worker-a-vkbtj   Ready,SchedulingDisabled   worker   43m   v1.23.3+0233338
ci-ln-vttlg4b-72292-cbq8w-worker-b-58qcm   Ready,SchedulingDisabled   worker   43m   v1.23.3+0233338
ci-ln-vttlg4b-72292-cbq8w-worker-c-ztqc7   Ready,SchedulingDisabled   worker   43m   v1.23.3+0233338

3, Check whether more collect-profiles pods are generated.

mac:~ jianzhang$ oc get pods -n openshift-operator-lifecycle-manager
NAME                                READY   STATUS      RESTARTS   AGE
catalog-operator-7cd459f578-lv8wr   1/1     Running     0          139m
collect-profiles-27458100-nb5p8     0/1     Completed   0          63m
collect-profiles-27458115-dd792     0/1     Completed   0          48m
collect-profiles-27458130-shtcl     0/1     Completed   0          33m
collect-profiles-27458160-62scp     0/1     Pending     0          3m20s
olm-operator-587bf9b6cb-wssvk       1/1     Running     0          139m

As above, only one pod is pending after the job ran twice. LGTM, verifying it.
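A hypothetical follow-up check (not part of the original verification) to confirm the tuned retry and concurrency settings on the fixed cluster could be:

mac:~ jianzhang$ oc get cronjob collect-profiles -n openshift-operator-lifecycle-manager -o jsonpath='{.spec.concurrencyPolicy} {.spec.jobTemplate.spec.backoffLimit}'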
Hi Chris,

I've created it here: https://bugzilla.redhat.com/show_bug.cgi?id=2071941

Cheers,
Per
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069