Bug 2025458
Summary: | [IPI-AWS] cluster-baremetal-operator pod in a crashloop state after patching from 4.7.21 to 4.7.36 | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Priyansh Magotra <pmagotra> |
Component: | Bare Metal Hardware Provisioning | Assignee: | Tomas Sedovic <tsedovic> |
Bare Metal Hardware Provisioning sub component: | cluster-baremetal-operator | QA Contact: | Ori Michaeli <omichael> |
Status: | CLOSED ERRATA | Docs Contact: | |
Severity: | high | ||
Priority: | high | CC: | aguclu, aos-bugs, bpickard, csaggin, dahernan, eparis, lshilin, mmasters, npaez, rbartal, rpittau, sdasu, shardy, yboaron |
Version: | 4.7 | Keywords: | OtherQA, Triaged |
Target Milestone: | --- | ||
Target Release: | 4.9.z | ||
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: |
Cause:
Cluster Baremetal Operator does not disable itself properly on unsupported platforms when api-int DNS lookup fails.
Consequence:
Cluster Baremetal Operator stays in a crash-looping state.
Fix:
Reordered the Cluster Baremetal Operator api-int DNS check so that it runs only on supported platforms and is skipped on unsupported platforms.
Result:
Cluster Baremetal Operator will disable itself properly on unsupported platforms, regardless of the api-int DNS lookup failures that occur in some cases.
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2022-03-10 16:29:53 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 2053581, 2055279 |
Description
Priyansh Magotra
2021-11-22 09:25:46 UTC
CBO is trying to look up the API Server's IP (https://github.com/openshift/cluster-baremetal-operator/blob/f73e5fcb432e4b847cddec5ce8570f8c5c32e902/controllers/provisioning_controller.go#L428) to then later determine if it is a v6/v4 address. But the DNS failures are preventing this call from succeeding. Not sure if it is the SDN team or the mDNS team that needs to take a look. Assigning it to the SDN team to take a look first.

Moving to the DNS team, as this seems like a DNS resolution problem. If you think it's a networking issue, please feel free to send it back.

(In reply to sdasu from comment #1)
> CBO is trying to lookup the API Server's IP
> (https://github.com/openshift/cluster-baremetal-operator/blob/
> f73e5fcb432e4b847cddec5ce8570f8c5c32e902/controllers/provisioning_controller.
> go#L428) to then later determine if it is an v6/v4 address.
> But the DNS failures are preventing this call from succeeding.

CBO should disable itself on AWS, so I wonder if we should reorder these checks, such that we only look up the IP and do the networkStack check on platforms where CBO is actually enabled? That said, it'd be good to clarify why CBO is failing to resolve the api-int endpoint, e.g. is it an issue impacting other pods?

CBO is trying to determine the network stack of the cluster and tries to contact the DNS to determine the IP address of the api-int server. This DNS failure is causing this call to fail. Updated the CBO to not do this lookup on unsupported platforms, and AWS happens to be one of them. This still does not take care of the underlying DNS issue. The DNS team needs to take a look at the DNS logs to figure out the root cause for the failure. I have taken care of improving the behavior of CBO when this occurs.

Can we get a must-gather from a setup where this failure occurs?

Thanks for the must-gather. Passing it along to the DNS team to take a look at the DNS errors. CBO changes are complete at this time.

Not a blocker as this doesn't appear to be a regression.
Assigning to Chad to investigate.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056
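For readers following the thread, the reordering discussed in the comments (check the platform first, and only then resolve api-int and work out the network stack) might look roughly like the sketch below. This is an illustration only, not the code from the CBO fix: `reconcile`, `resolver`, `networkStack`, `supportedPlatforms`, `failingLookup`, `staticLookup`, and the `api-int.example.test` hostname are all hypothetical names, and the single-entry platform allow-list is an assumption made to keep the example short.

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

// NOTE: every identifier here is hypothetical and chosen for illustration;
// none of it is taken from the cluster-baremetal-operator source.

// resolver abstracts the DNS lookup so that it can be stubbed out.
type resolver func(host string) ([]net.IP, error)

const stateDisabled = "disabled"

// supportedPlatforms is an assumed, simplified allow-list.
var supportedPlatforms = map[string]bool{"BareMetal": true}

// networkStack classifies resolved addresses as IPv4, IPv6, or dual-stack.
func networkStack(ips []net.IP) string {
	var v4, v6 bool
	for _, ip := range ips {
		if ip == nil {
			continue // skip entries that failed to parse
		}
		if ip.To4() != nil {
			v4 = true
		} else {
			v6 = true
		}
	}
	switch {
	case v4 && v6:
		return "DualStack"
	case v6:
		return "IPv6"
	case v4:
		return "IPv4"
	default:
		return "Unknown"
	}
}

// reconcile checks the platform before doing any DNS work, so a failing
// api-int lookup cannot stop the operator from disabling itself.
func reconcile(platform, apiIntHost string, lookup resolver) (string, error) {
	if !supportedPlatforms[platform] {
		return stateDisabled, nil // no lookup on unsupported platforms
	}
	ips, err := lookup(apiIntHost)
	if err != nil {
		return "", fmt.Errorf("resolving %s: %w", apiIntHost, err)
	}
	return networkStack(ips), nil
}

// failingLookup simulates the DNS failure described in the report.
func failingLookup(string) ([]net.IP, error) {
	return nil, errors.New("simulated DNS failure")
}

// staticLookup returns a resolver that always answers with the given addresses.
func staticLookup(addrs ...string) resolver {
	return func(string) ([]net.IP, error) {
		ips := make([]net.IP, 0, len(addrs))
		for _, a := range addrs {
			ips = append(ips, net.ParseIP(a))
		}
		return ips, nil
	}
}

func main() {
	// Unsupported platform with broken DNS: the lookup is never attempted.
	state, err := reconcile("AWS", "api-int.example.test", failingLookup)
	fmt.Println(state, err)

	// Supported platform with broken DNS: the error is still surfaced.
	state, err = reconcile("BareMetal", "api-int.example.test", failingLookup)
	fmt.Println(state, err)
}
```

Passing the resolver in as a parameter is a choice made for this sketch so that the DNS failure can be simulated without a real cluster. As noted in the comments, this ordering only changes how the operator behaves when the lookup fails; it does not address the underlying DNS problem.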