Bug 1975711

Summary: ironic hardware inspection fails with NewConnectionError, leaving bare metal nodes stuck
Product: OpenShift Container Platform Reporter: Derek Higgins <derekh>
Component: Bare Metal Hardware Provisioning    Assignee: Derek Higgins <derekh>
Bare Metal Hardware Provisioning sub component: cluster-baremetal-operator QA Contact: elevin
Status: CLOSED ERRATA Docs Contact:
Severity: medium    
Priority: low CC: aos-bugs, asalkeld, ctauchen, derekh, elevin, lshilin, nkononov, rbartal, tsze
Version: 4.8    Keywords: Triaged
Target Milestone: ---   
Target Release: 4.8.z   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version:    Doc Type: Bug Fix
Doc Text:
Previously, when the configuration settings for a bare metal deployment included a value for `provisioningHostIP` even when `provisioningNetwork` was disabled, the metal3 pod would start with a provisioning IP address that was not maintained. Ironic took this provisioning IP address when it started, and would fail when the address stopped working. With this bug fix, the system ignores `provisioningHostIP` when `provisioningNetwork` is disabled. Ironic starts with a properly configured external IP address.
Story Points: ---
Clone Of: 1972753
Last Closed: 2021-10-19 20:35:31 UTC
Bug Depends On: 1972753    
Bug Blocks:    

Description Derek Higgins 2021-06-24 09:25:08 UTC
+++ This bug was initially created as a clone of Bug #1972753 +++

Description of problem:

ocp 4.8.0-fc.9 deployment failed (iDRAC + virtual media) due to the following error:

{"level":"info","ts":1623681701.1185718,"logger":"controllers.BareMetalHost","msg":"inspecting hardware","baremetalhost":"openshift-machine-api/hlxcl2-worker-0","provisioningState":"inspecting"}
{"level":"info","ts":1623681701.118577,"logger":"controllers.BareMetalHost","msg":"inspecting hardware","baremetalhost":"openshift-machine-api/hlxcl2-worker-0","provisioningState":"inspecting"}
{"level":"info","ts":1623681701.1185808,"logger":"provisioner.ironic","msg":"inspecting hardware","host":"openshift-machine-api~hlxcl2-worker-0"}
{"level":"info","ts":1623681701.1474018,"logger":"provisioner.ironic","msg":"updating boot mode before hardware inspection","host":"openshift-machine-api~hlxcl2-worker-0"}
{"level":"error","ts":1623681702.8093436,"logger":"controller-runtime.manager.controller.baremetalhost","msg":"Reconciler error","reconciler group":"metal3.io","reconciler kind":"BareMetalHost","name":"hlxcl2-worker-0","namespace":"openshift-machine-api","error":"action \"inspecting\" failed: hardware inspection failed: failed to update host boot mode settings in ironic: Internal Server Error","errorVerbose":"Internal Server Error\nfailed to update host boot mode settings in ironic\ngithub.com/metal3-io/baremetal-operator/pkg/provisioner/ironic.(*ironicProvisioner).InspectHardware\n\t/go/src/github.com/metal3-io/baremetal-operator/pkg/provisioner/ironic/ironic.go:708\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*BareMetalHostReconciler).actionInspecting\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/baremetalhost_controller.go:671\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).handleInspecting\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/host_state_machine.go:360\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).ReconcileState\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/host_state_machine.go:199\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*BareMetalHostReconciler).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/baremetalhost_controller.go:235\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:263\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:235\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Contro
ller).Start.func1.1\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:198\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:99\nruntime.goexit\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1371\nhardware inspection 
failed\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*BareMetalHostReconciler).actionInspecting\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/baremetalhost_controller.go:678\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).handleInspecting\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/host_state_machine.go:360\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).ReconcileState\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/host_state_machine.go:199\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*BareMetalHostReconciler).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/baremetalhost_controller.go:235\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:263\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:235\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.1\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:198\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.Jitte
rUntil\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:99\nruntime.goexit\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1371\naction \"inspecting\" failed\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*BareMetalHostReconciler).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/baremetalhost_controller.go:239\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:263\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:235\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.1\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:198\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/
wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:99\nruntime.goexit\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1371","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/github.com/go-logr/zapr/zapr.go:132\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:267\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:235\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.1\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:198\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:185\nk8s.io/ap
imachinery/pkg/util/wait.UntilWithContext\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:99"}
{"level":"info","ts":1623681703.8316412,"logger":"controllers.BareMetalHost","msg":"start","baremetalhost":"openshift-machine-api/hlxcl2-worker-1"}
{"level":"info","ts":1623681703.9150536,"logger":"controllers.BareMetalHost","msg":"registering and validating access to management controller","baremetalhost":"openshift-machine-api/hlxcl2-worker-1","provisioningState":"inspecting","credentials":{"credentials":{"name":"hlxcl2-worker-1-bmc-secret","namespace":"openshift-machine-api"},"credentialsVersion":"17565"}}
{"level":"info","ts":1623681703.940474,"logger":"provisioner.ironic","msg":"current provision state","host":"openshift-machine-api~hlxcl2-worker-1","lastError":"Failed to inspect hardware. Reason: unable to start inspection: Failed to download image http://localhost:6181/images/ironic-python-agent.kernel, reason: HTTPConnectionPool(host='localhost', port=6181): Max retries exceeded with url: /images/ironic-python-agent.kernel (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f1c33396048>: Failed to establish a new connection: [Errno 111] ECONNREFUSED',))","current":"inspect failed","target":"manageable"}


Version-Release number of selected component (if applicable):
ocp 4.8.0-fc.9
3 VM masters 
2 BM worker (Dell R740)


How reproducible:
Trigger ocp 4.8.0-fc.9 installation. 


Actual results:
3 masters are UP. 
2 BM workers are stuck due to the ironic issue; OCP installation failed.

Expected results:
OCP installed successfully on cluster 

Additional info:
There is a workaround: the only way to clear this connection error is to restart the metal3 pod. After metal3 restarted, the deployment started working as expected. Sometimes we also need to configure the provisioning IP manually along with the metal3 pod restart. We noticed that the provisioning IP can disappear randomly.
Must-gather logs attached.

--- Additional comment from Derek Higgins on 2021-06-16 16:38:17 IST ---

Your baremetal operator is failing with this error:
2021-06-15T14:18:25.536730720Z {"level":"error","ts":1623766705.5358071,"logger":"controller-runtime.manager.controller.baremetalhost","msg":"Reconciler error","reconciler group":"metal3.io","reconciler kind":"BareMetalHost","name":"hlxcl2-worker-1","namespace":"openshift-machine-api","error":"action \"inspecting\" failed: hardware inspection failed: failed to update host boot mode settings in ironic: Internal Server Error","errorVerbose":"Internal Server Error\nfailed to update host boot mode settings in ironic

This corresponds to this error in your ironic-api log:
2021-06-15T14:18:25.534677648Z 2021-06-15 14:18:25.533 52 ERROR ironic.api.method [req-9e364490-e81c-4783-b495-a3b1dc14b39f ironic-user - - - -] Server-side error: "Unable to establish connection to https://10.46.55.124:8089: HTTPSConnectionPool(host='10.46.55.124', port=8089): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f52f1c1ee80>: Failed to establish a new connection: [Errno 113] EHOSTUNREACH',))". Detail:

The reason, I think, is that your metal3 pod has the container that sets the provisioning IP ("metal3-static-ip-set")
but is missing the container that ensures the IP isn't lost over time ("metal3-static-ip-manager").

Looking at the CBO code, "metal3-static-ip-set" is included if you have a provisioning IP set [1], but
"metal3-static-ip-manager" is only added if you have both set a provisioning IP and ProvisioningNetwork is not Disabled [2].
This seems inconsistent: if you need one, you need the other.

You have:
    provisioningHostIP: 10.46.55.124
    provisioningNetwork: Disabled


So when the pod starts, "metal3-static-ip-set" assigns an IP to the provisioning NIC, but there is no "metal3-static-ip-manager" to keep it there.


If you don't need it, I think unsetting "provisioningHostIP" in your install-config should allow your workers to deploy (ironic will use the external IP).
Can you confirm whether this works? Then we can work on fixing the inconsistencies in the metal3 pod.
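For illustration, a sketch of the suggested install-config change (the surrounding keys are illustrative; only the two fields quoted above come from this report):

```yaml
platform:
  baremetal:
    provisioningNetwork: Disabled
    # provisioningHostIP removed: with the network Disabled, ironic
    # binds to the external IP instead of an unmaintained provisioning IP
```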

1 - https://github.com/openshift/cluster-baremetal-operator/blob/04a2ae2/provisioning/baremetal_pod.go#L238-L240
2 - https://github.com/openshift/cluster-baremetal-operator/blob/04a2ae2/provisioning/baremetal_pod.go#L344-L346
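The inconsistency described above can be sketched in Go. This is not the actual CBO code (the real conditions live in baremetal_pod.go, linked at [1] and [2]); the type and function names here are illustrative, modeling which static-IP containers the metal3 pod gets before and after the fix:

```go
package main

import "fmt"

// Config loosely mirrors the Provisioning spec fields discussed above.
// Field names are illustrative, not the actual CBO types.
type Config struct {
	ProvisioningHostIP  string
	ProvisioningNetwork string // e.g. "Managed", "Unmanaged", or "Disabled"
}

// containers returns the static-IP containers the metal3 pod would get.
// Before the fix, "metal3-static-ip-set" was added whenever a provisioning
// IP was set, even with the network Disabled, while "metal3-static-ip-manager"
// additionally required the network to be enabled — so the IP was assigned
// once and never maintained. After the fix, both are gated on the network
// being enabled, i.e. provisioningHostIP is ignored when it is Disabled.
func containers(c Config, fixed bool) []string {
	var out []string
	networkEnabled := c.ProvisioningNetwork != "Disabled"
	setIncluded := c.ProvisioningHostIP != ""
	if fixed {
		setIncluded = setIncluded && networkEnabled
	}
	if setIncluded {
		out = append(out, "metal3-static-ip-set")
	}
	if c.ProvisioningHostIP != "" && networkEnabled {
		out = append(out, "metal3-static-ip-manager")
	}
	return out
}

func main() {
	cfg := Config{ProvisioningHostIP: "10.46.55.124", ProvisioningNetwork: "Disabled"}
	fmt.Println("before fix:", containers(cfg, false)) // set without manager: IP is lost over time
	fmt.Println("after fix: ", containers(cfg, true))  // neither: ironic uses the external IP
}
```

With the reporter's config (host IP set, network Disabled), the pre-fix logic yields only "metal3-static-ip-set", matching the failure mode in this bug; the post-fix logic yields neither container.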

Comment 2 Raviv Bar-Tal 2021-09-12 12:04:35 UTC
Hey @elevin 
This BZ is a clone of BZ 1972753, which you verified.
Can you please verify this backport as well?

Thanks

Comment 4 elevin 2021-09-14 06:33:44 UTC
Hey @rbartal 
I'll do it once I have an environment.

Comment 12 elevin 2021-10-12 08:08:07 UTC
4.8.14 deployed successfully

Comment 16 errata-xmlrpc 2021-10-19 20:35:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.8.15 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3821

Comment 17 Red Hat Bugzilla 2023-09-15 01:10:30 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days