Bug 2088428
Summary: clusteroperator/baremetal stays in progressing: Applying metal3 resources state on a fresh install
Product: OpenShift Container Platform
Component: Bare Metal Hardware Provisioning
Sub component: cluster-baremetal-operator
Reporter: Siddhant More <simore>
Assignee: sdasu
QA Contact: Lubov <lshilin>
Status: CLOSED ERRATA
Severity: urgent
Priority: high
Version: 4.8
Target Release: 4.11.0
Keywords: FastFix, OtherQA, Triaged
Hardware: Unspecified
OS: Linux
Doc Type: No Doc Update
Type: Bug
Last Closed: 2022-08-10 11:12:58 UTC
Bug Blocks: 2091738
CC: augol, cgaynor, ealcaniz, eglottma, imelofer, janders, kurathod, lshilin, omichael, openshift-bugs-escalate, pibanezr, pmannidi, rbartal, sdasu, tsedovic, vkochuku, wking, zbitter
Description (Siddhant More, 2022-05-19 12:45:51 UTC)
Hi, from the messages we see in the logs, and the CBO logs mentioning "client connection lost", we have a few assumptions:

    2022-05-17T06:42:11.570481957Z 2022-05-17 06:42:11.569 1 ERROR ironic.drivers.modules.inspector ironic.common.exception.ImageDownloadFailed: Failed to download image http://localhost:6181/images/ironic-python-agent.kernel, reason: HTTPConnectionPool(host='localhost', port=6181): Max retries exceeded with url: /images/ironic-python-agent.kernel (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f2540bd6208>: Failed to establish a new connection: [Errno 111] ECONNREFUSED',))
    2022-05-17T06:42:11.570481957Z 2022-05-17 06:42:11.569 1 ERROR ironic.drivers.modules.inspector
    2022-05-17T06:42:13.479335725Z 2022-05-17 06:42:13.478 1 ERROR ironic.conductor.task_manager [req-40b05d56-7302-4e0e-b066-3ebce9bc1ede ironic-user - - - -] Node d736a30a-56e2-4142-9a59-b7b7066eb833 moved to provision state "inspect failed" from state "inspecting"; target provision state is "manageable": ironic.common.exception.HardwareInspectionFailure: Failed to inspect hardware. Reason: unable to start inspection: Failed to download image http://localhost:6181/images/ironic-python-agent.kernel, reason: HTTPConnectionPool(host='localhost', port=6181): Max retries exceeded with url: /images/ironic-python-agent.kernel (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f2540bca048>: Failed to establish a new connection: [Errno 111] ECONNREFUSED',))

Ironic was unable to download the image, which can be caused by a proxy misconfiguration or by the image cache being down. Maybe there is also some network glitch happening? Are they using multipath? (Just wondering, since they are using the HotFix 4.8.29-assembly.art3875, which contains a kernel fix if I recall correctly.)

It looks like the CBO was successful in deploying all of its resources. However, the code that checks whether the metal3 Deployment is complete simply loops over the conditions in the Deployment and returns the first one whose status is true:

https://github.com/openshift/cluster-baremetal-operator/blob/1146017627352b72e0f2399f9f7e3423f0327eb2/provisioning/baremetal_pod.go#L916-L920

For reasons that are completely unfathomable to me, when a Deployment is finished and no longer progressing, it sets the condition "Progressing: True":

https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#complete-deployment

So the result here will depend on the order of the conditions, and indeed we see that all of the Deployments have the "Progressing" condition before the "Available" condition:

    status:
      availableReplicas: 1
      conditions:
      - lastTransitionTime: "2022-05-17T06:35:32Z"
        lastUpdateTime: "2022-05-17T06:41:26Z"
        message: ReplicaSet "metal3-85d9dc5d56" has successfully progressed.
        reason: NewReplicaSetAvailable
        status: "True"
        type: Progressing
      - lastTransitionTime: "2022-05-17T06:44:00Z"
        lastUpdateTime: "2022-05-17T06:44:00Z"
        message: Deployment has minimum availability.
        reason: MinimumReplicasAvailable
        status: "True"
        type: Available
      observedGeneration: 1
      readyReplicas: 1
      replicas: 1
      updatedReplicas: 1

Thus CBO decides that the Deployment is still progressing, and continues to report that it too is still progressing. It is not clear why the conditions are showing up in a different order than usual, but it is clear that the fault lies in the CBO logic. Nevertheless, this does not represent a real problem with the cluster.
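To illustrate the failure mode, here is a minimal Go sketch of the two checks. This is not the actual CBO source; the function names (firstTrueCondition, isAvailable) are hypothetical, and only the general pattern described above is assumed:

    package main

    import (
    	"fmt"

    	appsv1 "k8s.io/api/apps/v1"
    	corev1 "k8s.io/api/core/v1"
    )

    // firstTrueCondition mirrors the flawed pattern: it returns the first
    // condition whose status is True, so the answer depends entirely on the
    // order in which the conditions happen to be listed. With Progressing
    // ahead of Available (as in the status above), a fully rolled-out
    // Deployment is still reported as progressing.
    func firstTrueCondition(d *appsv1.Deployment) appsv1.DeploymentConditionType {
    	for _, c := range d.Status.Conditions {
    		if c.Status == corev1.ConditionTrue {
    			return c.Type
    		}
    	}
    	return appsv1.DeploymentReplicaFailure
    }

    // isAvailable is an order-independent check: it looks specifically for
    // the Available condition instead of taking whichever true condition
    // happens to come first.
    func isAvailable(d *appsv1.Deployment) bool {
    	for _, c := range d.Status.Conditions {
    		if c.Type == appsv1.DeploymentAvailable {
    			return c.Status == corev1.ConditionTrue
    		}
    	}
    	return false
    }

    func main() {
    	// Conditions in the order observed on the affected cluster.
    	d := &appsv1.Deployment{
    		Status: appsv1.DeploymentStatus{
    			Conditions: []appsv1.DeploymentCondition{
    				{Type: appsv1.DeploymentProgressing, Status: corev1.ConditionTrue,
    					Reason: "NewReplicaSetAvailable"},
    				{Type: appsv1.DeploymentAvailable, Status: corev1.ConditionTrue,
    					Reason: "MinimumReplicasAvailable"},
    			},
    		},
    	}
    	fmt.Println(firstTrueCondition(d)) // Progressing (misleading)
    	fmt.Println(isAvailable(d))        // true
    }

Checking for the Available condition (or for Progressing: True with reason NewReplicaSetAvailable, which per the Kubernetes documentation linked above marks a completed rollout) keeps the result independent of condition ordering.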
As we discussed, since the problem is not reproduced on our setups, verifying this one as OtherQA after reviewing the code changes.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.