Bug 2055861
| Field | Value |
|---|---|
| Summary | cronjob collect-profiles failure leads node to reach OutOfPods status |
| Product | OpenShift Container Platform |
| Component | OLM |
| OLM sub component | OLM |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | high |
| Version | 4.9 |
| Target Milestone | --- |
| Target Release | 4.11.0 |
| Hardware | All |
| OS | Linux |
| Reporter | bzhai |
| Assignee | Alexander Greene <agreene> |
| QA Contact | Jian Zhang <jiazha> |
| CC | achernet, agreene, bzvonar, cback, jiazha, krizza, pegoncal |
| Doc Type | Bug Fix |
Doc Text:

Cause: The collect-profiles job should only take a few seconds to run. There are instances, such as when the pod cannot be scheduled, where the job will not complete in a reasonable amount of time.

Consequence: If enough jobs are scheduled but unable to run, the number of scheduled jobs can exceed pod quota limits.

Fix: Given that the collect-profiles job should only take a few seconds to run, and that it is scheduled to run every 15 minutes, set the collect-profiles CronJob's spec.concurrencyPolicy to "Replace" so that only one active collect-profiles pod exists at any time.

Result: The collect-profiles job no longer creates an excessive number of pods, which previously led to an OutOfPods condition.
| Field | Value |
|---|---|
| Story Points | --- |
| Clones | 2071941 (view as bug list) |
| Last Closed | 2022-08-10 10:50:22 UTC |
| Type | Bug |
| Regression | --- |
| Bug Blocks | 2071941 |
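Concretely, the fix described in the Doc Text is a one-field change on the CronJob. Below is a minimal sketch of the relevant fragment, assuming the standard Kubernetes `batch/v1` CronJob schema and the 15-minute schedule mentioned above; the actual container spec shipped by OLM is elided here.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: collect-profiles
  namespace: openshift-operator-lifecycle-manager
spec:
  schedule: "*/15 * * * *"
  # "Replace": if the previous run's pod is still pending or running when
  # the next run fires, it is replaced rather than left to accumulate, so
  # at most one active collect-profiles pod exists at any time.
  concurrencyPolicy: Replace
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers: []   # real container spec omitted in this sketch
```

With the default policy (`Allow`), each 15-minute tick creates another Job even when the previous pod never got scheduled, which is exactly the accumulation described under "Consequence" above.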
Description
bzhai
2022-02-17 20:03:35 UTC
Hi Borball,

> 2. Bring down the DHCP server for a couple of hours

Sorry, I'm not familiar with this. Did you deploy the DHCP server on the SNO cluster? Could you paste the steps for deploying the DHCP server here? Thanks!

Hi Jian Zhang, DHCP is the one that provides IP assignment for the SNO node; that is a prerequisite [1] for any OCP cluster, not only SNO. In our lab we are using dnsmasq.

[1]: https://docs.openshift.com/container-platform/4.9/installing/installing_bare_metal_ipi/ipi-install-prerequisites.html#network-requirements-dhcp-reqs_ipi-install-prerequisites

//Borball

Hi Borball,
> 3. Bring up the DHCP server
After the DHCP server restart, does the cluster node restart? And did its IP change? Thanks!
No, there were no actions or changes on the cluster, and the IP did not change; it is expected that the cluster could recover by itself.

//Borball

Verification steps:

1. Create an OCP cluster with the fix PR merged in, via cluster-bot.

```
mac:~ jianzhang$ oc get clusterversion
NAME      VERSION                                                   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.ci.test-2022-03-17-013610-ci-ln-vttlg4b-latest   True        False         30m     Cluster version is 4.11.0-0.ci.test-2022-03-17-013610-ci-ln-vttlg4b-latest
```

2. Cordon the worker nodes that the collect-profiles job pods are running on.

```
mac:~ jianzhang$ oc adm cordon ci-ln-vttlg4b-72292-cbq8w-worker-a-vkbtj ci-ln-vttlg4b-72292-cbq8w-worker-b-58qcm ci-ln-vttlg4b-72292-cbq8w-worker-c-ztqc7
node/ci-ln-vttlg4b-72292-cbq8w-worker-a-vkbtj cordoned
node/ci-ln-vttlg4b-72292-cbq8w-worker-b-58qcm cordoned
node/ci-ln-vttlg4b-72292-cbq8w-worker-c-ztqc7 cordoned
mac:~ jianzhang$ oc get nodes
NAME                                       STATUS                     ROLES    AGE   VERSION
ci-ln-vttlg4b-72292-cbq8w-master-0         Ready                      master   52m   v1.23.3+0233338
ci-ln-vttlg4b-72292-cbq8w-master-1         Ready                      master   52m   v1.23.3+0233338
ci-ln-vttlg4b-72292-cbq8w-master-2         Ready                      master   52m   v1.23.3+0233338
ci-ln-vttlg4b-72292-cbq8w-worker-a-vkbtj   Ready,SchedulingDisabled   worker   43m   v1.23.3+0233338
ci-ln-vttlg4b-72292-cbq8w-worker-b-58qcm   Ready,SchedulingDisabled   worker   43m   v1.23.3+0233338
ci-ln-vttlg4b-72292-cbq8w-worker-c-ztqc7   Ready,SchedulingDisabled   worker   43m   v1.23.3+0233338
```

3. Check whether more collect-profiles pods are generated.

```
mac:~ jianzhang$ oc get pods -n openshift-operator-lifecycle-manager
NAME                                READY   STATUS      RESTARTS   AGE
catalog-operator-7cd459f578-lv8wr   1/1     Running     0          139m
collect-profiles-27458100-nb5p8     0/1     Completed   0          63m
collect-profiles-27458115-dd792     0/1     Completed   0          48m
collect-profiles-27458130-shtcl     0/1     Completed   0          33m
collect-profiles-27458160-62scp     0/1     Pending     0          3m20s
olm-operator-587bf9b6cb-wssvk       1/1     Running     0          139m
```

As shown above, only one pod is Pending after the job has run twice more. LGTM, verifying it.

Hi Chris, I've created it here: https://bugzilla.redhat.com/show_bug.cgi?id=2071941 Cheers, Per

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069
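As a back-of-the-envelope check on the consequence this bug describes: with the 15-minute schedule and pods that can never be scheduled (e.g. while DHCP is down), stuck pods accumulate at 4 per hour. Assuming, purely for illustration, a per-node pod limit of 250 (a common kubelet `maxPods` default; the exact limit depends on the node configuration), the quota runs out in roughly two and a half days:

```shell
# Stuck collect-profiles pods accumulate at one per 15-minute run.
pods_per_hour=4
# Hypothetical per-node pod limit (kubelet maxPods; 250 is a common default).
node_pod_limit=250
# Integer number of hours until the node reports OutOfPods.
echo "$(( node_pod_limit / pods_per_hour )) hours until OutOfPods"
```

This is why a single unreachable DHCP server for "a couple of hours" does not trigger the bug by itself, but a node that stays unschedulable for days does.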