Bug 1975711 - ironic hardware inspection failed due to NewConnectionError causes bm nodes stuck
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Bare Metal Hardware Provisioning
Version: 4.8
Hardware: All
OS: Linux
Priority: low
Severity: medium
Target Milestone: ---
Target Release: 4.8.z
Assignee: Derek Higgins
QA Contact: elevin
URL:
Whiteboard:
Depends On: 1972753
Blocks:
 
Reported: 2021-06-24 09:25 UTC by Derek Higgins
Modified: 2023-09-15 01:10 UTC
9 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, when the configuration settings for a bare metal deployment included a value for `provisioningHostIP` even when `provisioningNetwork` was disabled, the metal3 pod would start with a provisioning IP address that was not maintained. Ironic took this provisioning IP address when it started, and would fail when the address stopped working. With this bug fix, the system ignores `provisioningHostIP` when `provisioningNetwork` is disabled. Ironic starts with a properly configured external IP address.
Clone Of: 1972753
Environment:
Last Closed: 2021-10-19 20:35:31 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-baremetal-operator pull 166 0 None Merged Bug 1975711: Only start static ip set if provisioning net not disabled 2021-09-17 15:45:50 UTC
Red Hat Product Errata RHBA-2021:3821 0 None None None 2021-10-19 20:35:46 UTC

Description Derek Higgins 2021-06-24 09:25:08 UTC
+++ This bug was initially created as a clone of Bug #1972753 +++

Description of problem:

ocp 4.8.0-fc.9 deployment failed (iDRAC + virtual media) due to the following error:

{"level":"info","ts":1623681701.1185718,"logger":"controllers.BareMetalHost","msg":"inspecting hardware","baremetalhost":"openshift-machine-api/hlxcl2-worker-0","provisioningState":"inspecting"}
{"level":"info","ts":1623681701.118577,"logger":"controllers.BareMetalHost","msg":"inspecting hardware","baremetalhost":"openshift-machine-api/hlxcl2-worker-0","provisioningState":"inspecting"}
{"level":"info","ts":1623681701.1185808,"logger":"provisioner.ironic","msg":"inspecting hardware","host":"openshift-machine-api~hlxcl2-worker-0"}
{"level":"info","ts":1623681701.1474018,"logger":"provisioner.ironic","msg":"updating boot mode before hardware inspection","host":"openshift-machine-api~hlxcl2-worker-0"}
{"level":"error","ts":1623681702.8093436,"logger":"controller-runtime.manager.controller.baremetalhost","msg":"Reconciler error","reconciler group":"metal3.io","reconciler kind":"BareMetalHost","name":"hlxcl2-worker-0","namespace":"openshift-machine-api","error":"action \"inspecting\" failed: hardware inspection failed: failed to update host boot mode settings in ironic: Internal Server Error","errorVerbose":"Internal Server Error\nfailed to update host boot mode settings in ironic\ngithub.com/metal3-io/baremetal-operator/pkg/provisioner/ironic.(*ironicProvisioner).InspectHardware\n\t/go/src/github.com/metal3-io/baremetal-operator/pkg/provisioner/ironic/ironic.go:708\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*BareMetalHostReconciler).actionInspecting\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/baremetalhost_controller.go:671\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).handleInspecting\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/host_state_machine.go:360\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).ReconcileState\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/host_state_machine.go:199\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*BareMetalHostReconciler).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/baremetalhost_controller.go:235\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:263\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:235\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Contro
ller).Start.func1.1\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:198\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:99\nruntime.goexit\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1371\nhardware inspection 
failed\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*BareMetalHostReconciler).actionInspecting\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/baremetalhost_controller.go:678\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).handleInspecting\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/host_state_machine.go:360\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).ReconcileState\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/host_state_machine.go:199\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*BareMetalHostReconciler).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/baremetalhost_controller.go:235\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:263\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:235\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.1\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:198\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.Jitte
rUntil\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:99\nruntime.goexit\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1371\naction \"inspecting\" failed\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*BareMetalHostReconciler).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/baremetalhost_controller.go:239\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:263\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:235\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.1\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:198\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/
wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:99\nruntime.goexit\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1371","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/github.com/go-logr/zapr/zapr.go:132\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:267\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:235\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.1\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:198\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:185\nk8s.io/ap
imachinery/pkg/util/wait.UntilWithContext\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:99"}
{"level":"info","ts":1623681703.8316412,"logger":"controllers.BareMetalHost","msg":"start","baremetalhost":"openshift-machine-api/hlxcl2-worker-1"}
{"level":"info","ts":1623681703.9150536,"logger":"controllers.BareMetalHost","msg":"registering and validating access to management controller","baremetalhost":"openshift-machine-api/hlxcl2-worker-1","provisioningState":"inspecting","credentials":{"credentials":{"name":"hlxcl2-worker-1-bmc-secret","namespace":"openshift-machine-api"},"credentialsVersion":"17565"}}
{"level":"info","ts":1623681703.940474,"logger":"provisioner.ironic","msg":"current provision state","host":"openshift-machine-api~hlxcl2-worker-1","lastError":"Failed to inspect hardware. Reason: unable to start inspection: Failed to download image http://localhost:6181/images/ironic-python-agent.kernel, reason: HTTPConnectionPool(host='localhost', port=6181): Max retries exceeded with url: /images/ironic-python-agent.kernel (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f1c33396048>: Failed to establish a new connection: [Errno 111] ECONNREFUSED',))","current":"inspect failed","target":"manageable"}


Version-Release number of selected component (if applicable):
ocp 4.8.0-fc.9
3 VM masters
2 BM workers (Dell R740)


How reproducible:
Trigger ocp 4.8.0-fc.9 installation. 


Actual results:
3 masters are up.
2 BM workers are stuck due to the Ironic issue. OCP installation failed.

Expected results:
OCP installed successfully on the cluster.

Additional info:
There is a workaround: the only way to clear this connection error is to restart the metal3 pod. After metal3 restarted, deployment started working as expected. Sometimes we also need to configure the provisioning IP manually along with the metal3 pod restart. We noticed that the provisioning IP could disappear randomly.
Mustgather logs attached.

--- Additional comment from Derek Higgins on 2021-06-16 16:38:17 IST ---

Your baremetal operator is failing with this error:
2021-06-15T14:18:25.536730720Z {"level":"error","ts":1623766705.5358071,"logger":"controller-runtime.manager.controller.baremetalhost","msg":"Reconciler error","reconciler group":"metal3.io","reconciler kind":"BareMetalHost","name":"hlxcl2-worker-1","namespace":"openshift-machine-api","error":"action \"inspecting\" failed: hardware inspection failed: failed to update host boot mode settings in ironic: Internal Server Error","errorVerbose":"Internal Server Error\nfailed to update host boot mode settings in ironic

This corresponds to this error in your ironic-api log:
2021-06-15T14:18:25.534677648Z 2021-06-15 14:18:25.533 52 ERROR ironic.api.method [req-9e364490-e81c-4783-b495-a3b1dc14b39f ironic-user - - - -] Server-side error: "Unable to establish connection to https://10.46.55.124:8089: HTTPSConnectionPool(host='10.46.55.124', port=8089): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f52f1c1ee80>: Failed to establish a new connection: [Errno 113] EHOSTUNREACH',))". Detail:

The reason, I think, is that your metal3 pod has the container that sets the provisioning IP ("metal3-static-ip-set"),
but you're missing the container that ensures the IP isn't lost over time ("metal3-static-ip-manager").

Looking at the CBO code, "metal3-static-ip-set" is included if you have a provisioning IP set [1], but
"metal3-static-ip-manager" is only added if you have both a provisioning IP set and ProvisioningNetwork not Disabled [2].
This seems inconsistent: if you need one, you need the other.

you have 
    provisioningHostIP: 10.46.55.124
    provisioningNetwork: Disabled


So when the pod starts, "metal3-static-ip-set" assigns an IP to the provisioning NIC, but you have no "metal3-static-ip-manager" to keep it there.


If you don't need it, I think unsetting "provisioningHostIP" in your install-config should allow your workers to deploy (Ironic will use the external IP).
Can you confirm whether this works? Then we can work on fixing the inconsistency in the metal3 pod.
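For illustration, the suggested install-config change might look like the following sketch (the `platform.baremetal` nesting is assumed here; the field names and values come from this report):

```yaml
platform:
  baremetal:
    # provisioningHostIP: 10.46.55.124   # unset/remove when the provisioning network is Disabled
    provisioningNetwork: Disabled
```

With `provisioningHostIP` removed, the metal3 pod never assigns the unmaintained provisioning address, and Ironic starts on the external IP instead.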

1 - https://github.com/openshift/cluster-baremetal-operator/blob/04a2ae2/provisioning/baremetal_pod.go#L238-L240
2 - https://github.com/openshift/cluster-baremetal-operator/blob/04a2ae2/provisioning/baremetal_pod.go#L344-L346
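The inconsistency described above can be sketched as a pair of predicates. This is a hypothetical simplification of the CBO logic referenced at [1] and [2]; the type, field, and function names are illustrative, not the actual cluster-baremetal-operator code:

```go
package main

import "fmt"

// ProvisioningSpec is an illustrative subset of the Provisioning CR fields
// involved in this bug (names are assumptions, not the real CRD schema).
type ProvisioningSpec struct {
	ProvisioningHostIP  string
	ProvisioningNetwork string // e.g. "Managed", "Unmanaged", or "Disabled"
}

// needsStaticIPSet mirrors the reported behaviour at [1]: the
// "metal3-static-ip-set" container ran whenever a provisioning IP was set.
func needsStaticIPSet(s ProvisioningSpec) bool {
	return s.ProvisioningHostIP != ""
}

// needsStaticIPManager mirrors [2]: the "metal3-static-ip-manager" container
// additionally required the provisioning network not to be Disabled.
func needsStaticIPManager(s ProvisioningSpec) bool {
	return s.ProvisioningHostIP != "" && s.ProvisioningNetwork != "Disabled"
}

func main() {
	// The reporter's configuration: IP set, network Disabled.
	s := ProvisioningSpec{ProvisioningHostIP: "10.46.55.124", ProvisioningNetwork: "Disabled"}
	// The mismatch below is the bug: the IP is set once but never maintained.
	fmt.Println(needsStaticIPSet(s), needsStaticIPManager(s)) // prints: true false
}
```

The fix merged in pull 166 removes the mismatch by also gating "metal3-static-ip-set" on the provisioning network not being Disabled, so either both containers run or neither does.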

Comment 2 Raviv Bar-Tal 2021-09-12 12:04:35 UTC
Hey @elevin,
This BZ is a clone of BZ 1972753, which you verified.
Can you please verify this backport as well?

Thanks

Comment 4 elevin 2021-09-14 06:33:44 UTC
Hey @rbartal 
I'll do it once I have an environment.

Comment 12 elevin 2021-10-12 08:08:07 UTC
4.8.14 deployed successfully

Comment 16 errata-xmlrpc 2021-10-19 20:35:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.8.15 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3821

Comment 17 Red Hat Bugzilla 2023-09-15 01:10:30 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days.

