Bug 1972753 - ironic hardware inspection failed due to NewConnectionError causes bm nodes stuck [NEEDINFO]
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Bare Metal Hardware Provisioning
Version: 4.8
Hardware: All
OS: Linux
Priority: low
Severity: high
Target Milestone: ---
Target Release: 4.9.0
Assignee: Derek Higgins
QA Contact: elevin
Docs Contact: jfrye
URL:
Whiteboard:
Duplicates: 1991568 (view as bug list)
Depends On:
Blocks: 1975711
 
Reported: 2021-06-16 14:53 UTC by Nikita
Modified: 2022-04-21 19:12 UTC (History)
5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, if provisioningHostIP was set, it was assigned to the metal3 pod even when the provisioning network was disabled. This no longer happens.
Clone Of:
: 1975711 (view as bug list)
Environment:
Last Closed: 2021-10-18 17:34:35 UTC
Target Upstream Version:
derekh: needinfo? (elevin)


Attachments
Must gather (10.42 MB, application/gzip)
2021-06-16 14:53 UTC, Nikita
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-baremetal-operator pull 165 0 None open Bug 1972753: Only start static ip set if provisioning net not disabled 2021-06-17 13:35:34 UTC
Red Hat Product Errata RHSA-2021:3759 0 None None None 2021-10-18 17:34:59 UTC

Description Nikita 2021-06-16 14:53:41 UTC
Created attachment 1791564 [details]
Must gather

Description of problem:

OCP 4.8.0-fc.9 deployment failed (iDRAC + virtual media) with the following error:

{"level":"info","ts":1623681701.1185718,"logger":"controllers.BareMetalHost","msg":"inspecting hardware","baremetalhost":"openshift-machine-api/hlxcl2-worker-0","provisioningState":"inspecting"}
{"level":"info","ts":1623681701.118577,"logger":"controllers.BareMetalHost","msg":"inspecting hardware","baremetalhost":"openshift-machine-api/hlxcl2-worker-0","provisioningState":"inspecting"}
{"level":"info","ts":1623681701.1185808,"logger":"provisioner.ironic","msg":"inspecting hardware","host":"openshift-machine-api~hlxcl2-worker-0"}
{"level":"info","ts":1623681701.1474018,"logger":"provisioner.ironic","msg":"updating boot mode before hardware inspection","host":"openshift-machine-api~hlxcl2-worker-0"}
{"level":"error","ts":1623681702.8093436,"logger":"controller-runtime.manager.controller.baremetalhost","msg":"Reconciler error","reconciler group":"metal3.io","reconciler kind":"BareMetalHost","name":"hlxcl2-worker-0","namespace":"openshift-machine-api","error":"action \"inspecting\" failed: hardware inspection failed: failed to update host boot mode settings in ironic: Internal Server Error","errorVerbose":"Internal Server Error\nfailed to update host boot mode settings in ironic\ngithub.com/metal3-io/baremetal-operator/pkg/provisioner/ironic.(*ironicProvisioner).InspectHardware\n\t/go/src/github.com/metal3-io/baremetal-operator/pkg/provisioner/ironic/ironic.go:708\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*BareMetalHostReconciler).actionInspecting\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/baremetalhost_controller.go:671\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).handleInspecting\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/host_state_machine.go:360\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).ReconcileState\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/host_state_machine.go:199\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*BareMetalHostReconciler).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/baremetalhost_controller.go:235\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:263\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:235\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Contro
ller).Start.func1.1\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:198\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:99\nruntime.goexit\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1371\nhardware inspection 
failed\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*BareMetalHostReconciler).actionInspecting\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/baremetalhost_controller.go:678\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).handleInspecting\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/host_state_machine.go:360\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).ReconcileState\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/host_state_machine.go:199\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*BareMetalHostReconciler).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/baremetalhost_controller.go:235\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:263\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:235\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.1\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:198\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.Jitte
rUntil\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:99\nruntime.goexit\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1371\naction \"inspecting\" failed\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*BareMetalHostReconciler).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/baremetalhost_controller.go:239\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:263\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:235\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.1\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:198\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/
wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:99\nruntime.goexit\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1371","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/github.com/go-logr/zapr/zapr.go:132\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:267\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:235\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.1\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:198\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:185\nk8s.io/ap
imachinery/pkg/util/wait.UntilWithContext\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:99"}
{"level":"info","ts":1623681703.8316412,"logger":"controllers.BareMetalHost","msg":"start","baremetalhost":"openshift-machine-api/hlxcl2-worker-1"}
{"level":"info","ts":1623681703.9150536,"logger":"controllers.BareMetalHost","msg":"registering and validating access to management controller","baremetalhost":"openshift-machine-api/hlxcl2-worker-1","provisioningState":"inspecting","credentials":{"credentials":{"name":"hlxcl2-worker-1-bmc-secret","namespace":"openshift-machine-api"},"credentialsVersion":"17565"}}
{"level":"info","ts":1623681703.940474,"logger":"provisioner.ironic","msg":"current provision state","host":"openshift-machine-api~hlxcl2-worker-1","lastError":"Failed to inspect hardware. Reason: unable to start inspection: Failed to download image http://localhost:6181/images/ironic-python-agent.kernel, reason: HTTPConnectionPool(host='localhost', port=6181): Max retries exceeded with url: /images/ironic-python-agent.kernel (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f1c33396048>: Failed to establish a new connection: [Errno 111] ECONNREFUSED',))","current":"inspect failed","target":"manageable"}


Version-Release number of selected component (if applicable):
ocp 4.8.0-fc.9
3 VM masters 
2 BM workers (Dell R740)


How reproducible:
Trigger an OCP 4.8.0-fc.9 installation.


Actual results:
3 masters are up.
2 BM workers are stuck due to the Ironic issue; the OCP installation failed.

Expected results:
OCP installs successfully on the cluster.

Additional info:
There is a workaround: the only way to clear this connection error is to restart the metal3 pod. After metal3 restarted, the deployment started working as expected. Sometimes we also need to configure the provisioning IP manually along with the metal3 pod restart. We noticed that the provisioning IP could disappear randomly.
Must-gather logs are attached.

Comment 1 Derek Higgins 2021-06-16 15:38:17 UTC
Your baremetal-operator is failing with this error:
2021-06-15T14:18:25.536730720Z {"level":"error","ts":1623766705.5358071,"logger":"controller-runtime.manager.controller.baremetalhost","msg":"Reconciler error","reconciler group":"metal3.io","reconciler kind":"BareMetalHost","name":"hlxcl2-worker-1","namespace":"openshift-machine-api","error":"action \"inspecting\" failed: hardware inspection failed: failed to update host boot mode settings in ironic: Internal Server Error","errorVerbose":"Internal Server Error\nfailed to update host boot mode settings in ironic

This corresponds to this error in your ironic-api log:
2021-06-15T14:18:25.534677648Z 2021-06-15 14:18:25.533 52 ERROR ironic.api.method [req-9e364490-e81c-4783-b495-a3b1dc14b39f ironic-user - - - -] Server-side error: "Unable to establish connection to https://10.46.55.124:8089: HTTPSConnectionPool(host='10.46.55.124', port=8089): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f52f1c1ee80>: Failed to establish a new connection: [Errno 113] EHOSTUNREACH',))". Detail:

The reason, I think, is that your metal3 pod has the container that sets the provisioning IP ("metal3-static-ip-set"),
but is missing the container that ensures the IP isn't lost over time ("metal3-static-ip-manager").

Looking at the CBO code, "metal3-static-ip-set" is included if you have a provisioning IP set [1], but
"metal3-static-ip-manager" is only added if you have both a provisioning IP set and ProvisioningNetwork is not Disabled [2].
This seems inconsistent: if you need one, you need the other.

you have 
    provisioningHostIP: 10.46.55.124
    provisioningNetwork: Disabled


So when the pod starts, "metal3-static-ip-set" assigns an IP to the provisioning NIC, but there is no "metal3-static-ip-manager" to keep it there.


If you don't need it, I think unsetting "provisioningHostIP" in your install-config should allow your workers to deploy (the external IP will be used by Ironic).
Can you confirm whether this works? Then we can work on fixing the inconsistencies in the metal3 pod.

1 - https://github.com/openshift/cluster-baremetal-operator/blob/04a2ae2/provisioning/baremetal_pod.go#L238-L240
2 - https://github.com/openshift/cluster-baremetal-operator/blob/04a2ae2/provisioning/baremetal_pod.go#L344-L346
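
The inconsistency described above can be sketched as follows. This is an illustrative Go sketch, not the actual cluster-baremetal-operator code: the type and function names are invented for the example, and the "after" variant reflects the behaviour described in the linked PR (only start the static-ip containers when the provisioning network is not disabled).

```go
package main

import "fmt"

// provisioningConfig is a hypothetical stand-in for the relevant fields of
// the Provisioning CR; the real logic lives in baremetal_pod.go.
type provisioningConfig struct {
	ProvisioningIP      string
	ProvisioningNetwork string // e.g. "Managed", "Unmanaged", or "Disabled"
}

// containersBefore mirrors the reported inconsistency:
// "metal3-static-ip-set" only checks for a provisioning IP, while
// "metal3-static-ip-manager" also checks the network mode.
func containersBefore(c provisioningConfig) []string {
	var containers []string
	if c.ProvisioningIP != "" { // added regardless of network mode
		containers = append(containers, "metal3-static-ip-set")
	}
	if c.ProvisioningIP != "" && c.ProvisioningNetwork != "Disabled" {
		containers = append(containers, "metal3-static-ip-manager")
	}
	return containers
}

// containersAfter sketches the fix: both static-ip containers require the
// provisioning network to not be disabled, so they always travel together.
func containersAfter(c provisioningConfig) []string {
	var containers []string
	if c.ProvisioningIP != "" && c.ProvisioningNetwork != "Disabled" {
		containers = append(containers, "metal3-static-ip-set", "metal3-static-ip-manager")
	}
	return containers
}

func main() {
	// The reporter's configuration: IP set, network disabled.
	cfg := provisioningConfig{ProvisioningIP: "10.46.55.124", ProvisioningNetwork: "Disabled"}
	fmt.Println(containersBefore(cfg)) // [metal3-static-ip-set] — IP set once, never kept
	fmt.Println(containersAfter(cfg))  // [] — neither container, IP left alone
}
```

With the fix, a disabled provisioning network means neither container runs, so no IP is assigned that could later disappear.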

Comment 2 Nikita 2021-06-17 12:14:07 UTC
Hi Derek,

We tried unsetting "provisioningHostIP" and it solved the issue. Without "provisioningHostIP", the deployment completed successfully.
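
The confirmed-working configuration corresponds to an install-config.yaml platform section along these lines (an illustrative fragment, not the reporter's actual file; the field names are the standard baremetal-platform ones):

```yaml
platform:
  baremetal:
    # With the provisioning network disabled, Ironic uses the external
    # network, so provisioningHostIP should be omitted entirely.
    provisioningNetwork: Disabled
    # provisioningHostIP: 10.46.55.124   # removed: with no static-ip-manager
    # container running, this address could silently disappear
```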

Comment 5 elevin 2021-06-30 05:30:05 UTC
Verified on 4.8.0-rc.1

Comment 6 Derek Higgins 2021-07-19 09:05:33 UTC
(In reply to elevin from comment #5)
> Verified on 4.8.0-rc.1

The fix for this hasn't yet merged into 4.8; it needs to be verified on 4.9. Can you verify there?

Comment 7 Andreas Karis 2021-08-10 18:57:57 UTC
*** Bug 1991568 has been marked as a duplicate of this bug. ***

Comment 8 elevin 2021-09-01 06:09:48 UTC
4.9.0-fc.0 deployed successfully

Comment 11 errata-xmlrpc 2021-10-18 17:34:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

