Bug 1854142
Summary: BareMetalHost CRD doesn't validate allowed fields
Product: OpenShift Container Platform
Component: Cloud Compute
Sub component: BareMetal Provider
Reporter: Marko Karg <mkarg>
Assignee: Doug Hellmann <dhellmann>
QA Contact: Raviv Bar-Tal <rbartal>
Status: CLOSED ERRATA
Severity: high
Priority: high
CC: beth.white, dblack, dhellmann, shardy, stbenjam
Version: 4.4
Target Release: 4.6.0
Keywords: Triaged
Hardware: All
OS: Linux
Doc Type: No Doc Update
Type: Bug
Last Closed: 2020-10-27 16:12:20 UTC
Created attachment 1700037 [details]
ironic conductor logs
I took a look at the environment where this happened; some details of what I found below.

Looking at the API output we see e.g.:

    "power_state": "power off",
    "provision_state": "adopt failed",

So there are two questions: why did the adopt fail (and why only for the workers added via scale-out?), and why, in that situation, did Ironic decide to power off the nodes rather than do nothing.

In 0.log.20200706-140219 we see:

    ...24-b5f6-f8253d4f2f14 - - - - -] During sync_power_state, node c11f9bfc-2f57-4072-aba6-11d3c168bf51 state 'power on' does not match expected state. Changing hardware state to 'power off'.

However in the BMH we see:

    errorMessage: 'Host adoption failed: Error while attempting to adopt node c11f9bfc-2f57-4072-aba6-11d3c168bf51: Node c11f9bfc-2f57-4072-aba6-11d3c168bf51 does not have any port associated'
    poweredOn: true
    ID: c11f9bfc-2f57-4072-aba6-11d3c168bf51

So we can see the adoption failed, but the BMH poweredOn field (and the actual state of the physical server) is powered on.
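To make the observed behavior concrete, here is a hypothetical, heavily simplified sketch of the power-state sync decision described in the log above. This is not Ironic's actual implementation; it only illustrates the failure mode where Ironic treats its own database record ('power off') as the source of truth and forces the hardware back to it. The `force_power_state_during_sync` name mirrors Ironic's `[conductor]force_power_state_during_sync` option, but the logic here is an illustrative assumption.

```python
# Hypothetical simplification of Ironic's sync_power_state behaviour as
# observed in the logs above (NOT the real implementation).

def sync_power_state(db_power_state, hw_power_state,
                     force_power_state_during_sync=True):
    """Return the action taken for one power-state sync iteration."""
    if db_power_state == hw_power_state:
        return "no-op"
    if force_power_state_during_sync:
        # The DB record is treated as the source of truth:
        # the hardware is changed to match it.
        return f"set hardware to '{db_power_state}'"
    # Otherwise the DB record is updated to match reality.
    return f"record '{hw_power_state}' in DB"

# The failure mode in this bug: the DB still says 'power off' while the
# operator has just powered the worker back on, so Ironic powers it off again.
print(sync_power_state("power off", "power on"))
```

Under this model, a node whose recorded state never advanced past "power off" (because adoption failed) is powered down every sync cycle, which matches the "shut down within seconds" symptom.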
From 0.log.20200706-113047 we see:

    2020-07-06T11:25:52.086225364+00:00 stderr F 2020-07-06 11:25:52.086 1 DEBUG ironic.conductor.manager [req-71902bb2-1331-46d7-acd6-8f7c0340a77d - - - - -] Conductor master-1.test422.myocp4.com attempting to adopt node c11f9bfc-2f57-4072-aba6-11d3c168bf51 _do_adoption /usr/lib/python3.6/site-packages/ironic/conductor/manager.py:2079
    2020-07-06T11:25:52.086417481+00:00 stderr F 2020-07-06 11:25:52.086 1 ERROR ironic.conductor.manager [req-71902bb2-1331-46d7-acd6-8f7c0340a77d - - - - -] Error while attempting to adopt node c11f9bfc-2f57-4072-aba6-11d3c168bf51: Node c11f9bfc-2f57-4072-aba6-11d3c168bf51 does not have any port associated with it..: ironic.common.exception.MissingParameterValue: Node c11f9bfc-2f57-4072-aba6-11d3c168bf51 does not have any port associated with it.
    2020-07-06T11:25:52.086552122+00:00 stderr F 2020-07-06 11:25:52.086 1 DEBUG ironic.common.states [req-71902bb2-1331-46d7-acd6-8f7c0340a77d - - - - -] Exiting old state 'adopting' in response to event 'fail' on_exit /usr/lib/python3.6/site-packages/ironic/common/states.py:294
    2020-07-06T11:25:52.086665505+00:00 stderr F 2020-07-06 11:25:52.086 1 DEBUG ironic.common.states [req-71902bb2-1331-46d7-acd6-8f7c0340a77d - - - - -] Entering new state 'adopt failed' in response to event 'fail' on_enter /usr/lib/python3.6/site-packages/ironic/common/states.py:300
    2020-07-06T11:25:52.095324152+00:00 stderr F 2020-07-06 11:25:52.095 1 DEBUG oslo_concurrency.processutils [req-0bafc479-63a7-4224-b5f6-f8253d4f2f14 - - - - -] CMD "ipmitool -I lanplus -H mgmt-e17-h20-b01-fc640.rdu2.scalelab.redhat.com -L ADMINISTRATOR -p 623 -U quads -R 12 -N 5 -f /tmp/tmpbyn4p0my power status" returned: 0 in 0.228s execute /usr/lib/python3.6/site-packages/oslo_concurrency/processutils.py:409
    2020-07-06T11:25:52.095543368+00:00 stderr F 2020-07-06 11:25:52.095 1 DEBUG ironic.common.utils [req-0bafc479-63a7-4224-b5f6-f8253d4f2f14 - - - - -] Execution completed, command line is "ipmitool -I lanplus -H mgmt-e17-h20-b01-fc640.rdu2.scalelab.redhat.com -L ADMINISTRATOR -p 623 -U quads -R 12 -N 5 -f /tmp/tmpbyn4p0my power status" execute /usr/lib/python3.6/site-packages/ironic/common/utils.py:77
    2020-07-06T11:25:52.095704170+00:00 stderr F 2020-07-06 11:25:52.095 1 DEBUG ironic.common.utils [req-0bafc479-63a7-4224-b5f6-f8253d4f2f14 - - - - -] Command stdout is: "Chassis Power is off" execute /usr/lib/python3.6/site-packages/ironic/common/utils.py:78
    2020-07-06T11:25:52.095856584+00:00 stderr F 2020-07-06 11:25:52.095 1 DEBUG ironic.common.utils [req-0bafc479-63a7-4224-b5f6-f8253d4f2f14 - - - - -] Command stderr is: "" execute /usr/lib/python3.6/site-packages/ironic/common/utils.py:79
    2020-07-06T11:25:52.096155973+00:00 stderr F 2020-07-06 11:25:52.095 1 DEBUG ironic.conductor.task_manager [req-0bafc479-63a7-4224-b5f6-f8253d4f2f14 - - - - -] Successfully released shared lock for power state sync on node a60ddd23-a82d-4d2e-b7b5-9c48ef380f1a (lock was held 0.28 sec) release_resources /usr/lib/python3.6/site-packages/ironic/conductor/task_manager.py:356
    2020-07-06T11:25:52.097107560+00:00 stderr F 2020-07-06 11:25:52.096 1 ERROR ironic.conductor.task_manager [req-71902bb2-1331-46d7-acd6-8f7c0340a77d - - - - -] Node c11f9bfc-2f57-4072-aba6-11d3c168bf51 moved to provision state "adopt failed" from state "adopting"; target provision state is "active": ironic.common.exception.MissingParameterValue: Node c11f9bfc-2f57-4072-aba6-11d3c168bf51 does not have any port associated with it.

Created attachment 1700042 [details]
yaml file used to add workers
I noticed that for the hosts that failed to adopt, the bootMACAddress is missing in the BMH output, e.g.:

    $ oc get bmh -o json | jq '.items[] | (.metadata.name,.spec.bootMACAddress)' | head -n 20
    "master-0"
    "40:a6:b7:00:5d:20"
    "master-1"
    "3c:fd:fe:e7:95:10"
    "master-2"
    "3c:fd:fe:e7:b5:50"
    "worker-0"
    "3c:fd:fe:e7:98:50"
    "worker-1"
    "40:a6:b7:00:5e:70"
    "worker-2"
    "40:a6:b7:00:63:10"
    "worker-3"
    "40:a6:b7:00:69:60"
    "worker004"
    null
    "worker005"
    null
    "worker006"
    null
    ...

However, in the attachment from comment#3 we see e.g. `bootMacAddress: 40:a6:b7:00:5b:f0`, so I wonder if the difference in capitalization means the data is silently ignored...

Created attachment 1700049 [details]
baremetal operator logs
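The capitalization difference matters because YAML/JSON keys are case-sensitive: `bootMacAddress` is simply an unknown field, and without CRD pruning the API server stores unknown fields instead of rejecting them. The sketch below illustrates the idea with a hypothetical client-side check; the field list is an illustrative subset, not the full BareMetalHost spec, and the `unknown_fields` helper is invented for this example.

```python
# Hypothetical illustration: a typo'd key is just an "unknown field" to a
# case-sensitive schema check. The field set below is an illustrative subset
# of the BareMetalHost spec, not the authoritative list.

KNOWN_SPEC_FIELDS = {"bmc", "bootMACAddress", "online", "consumerRef",
                     "image", "userData", "hardwareProfile"}

def unknown_fields(spec: dict) -> list:
    """Return spec keys that are not part of the known schema, sorted."""
    return sorted(k for k in spec if k not in KNOWN_SPEC_FIELDS)

# A spec resembling the one from the attachment, with the typo'd key.
spec = {
    "bmc": {"address": "ipmi://example-bmc.local",  # placeholder address
            "credentialsName": "worker-004-bmc-secret"},
    "bootMacAddress": "40:a6:b7:00:5b:f0",  # typo: should be bootMACAddress
    "online": True,
}
print(unknown_fields(spec))  # the typo'd key shows up as unknown
```

A check like this (or, better, server-side pruning/validation in the CRD) would have flagged the typo instead of letting the host register without a boot MAC.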
Ok, so digging some more, this may be a BMO problem, because we see the following error:

    $ oc logs metal3-57fbfdfd87-xwhvg --timestamps=true -c metal3-baremetal-operator -n openshift-machine-api | grep "worker004"
    2020-07-06T16:19:18.133541251Z {"level":"info","ts":1594052358.1335247,"logger":"baremetalhost","msg":"Reconciling BareMetalHost","Request.Namespace":"openshift-machine-api","Request.Name":"worker004"}
    2020-07-06T16:19:18.133576083Z {"level":"info","ts":1594052358.1335642,"logger":"baremetalhost","msg":"stopping on host error","Request.Namespace":"openshift-machine-api","Request.Name":"worker004","provisioningState":"registration error","message":"Host adoption failed: Error while attempting to adopt node f6b66ff4-2ffd-44c8-abea-22ad7eccb246: Node f6b66ff4-2ffd-44c8-abea-22ad7eccb246 does not have any port associated with it.."}

However, I don't see any corresponding call to the code which creates the port for worker004 (even after manually patching the BMH to correct the bootMACAddress naming and restarting the metal3 pod): https://github.com/openshift/baremetal-operator/blob/release-4.4/pkg/provisioner/ironic/ironic.go#L210

    [kni@e16-h12-b01-fc640 ~]$ oc logs metal3-57fbfdfd87-xwhvg --timestamps=true -c metal3-baremetal-operator -n openshift-machine-api | grep "validating management access" | grep worker004
    [kni@e16-h12-b01-fc640 ~]$ oc logs metal3-57fbfdfd87-xwhvg --timestamps=true -c metal3-baremetal-operator -n openshift-machine-api | grep "validating management access" | grep worker-3
    2020-07-06T16:19:17.073637151Z {"level":"info","ts":1594052357.0735617,"logger":"baremetalhost_ironic","msg":"validating management access","host":"worker-3"}
    [kni@e16-h12-b01-fc640 ~]$ oc get bmh -o json | jq '.items[] | select(.metadata.name == "worker004") | .spec'
    {
      "bmc": {
        "address": "ipmi://mgmt-e16-h18-b04-fc640.rdu2.scalelab.redhat.com",
        "credentialsName": "worker-004-bmc-secret"
      },
      "bootMACAddress": "40:a6:b7:00:5b:f0",

We also don't see any lookup of an existing node, e.g. https://github.com/openshift/baremetal-operator/blob/release-4.4/pkg/provisioner/ironic/ironic.go#L872:

    [kni@e16-h12-b01-fc640 ~]$ oc logs metal3-57fbfdfd87-xwhvg --timestamps=true -c metal3-baremetal-operator -n openshift-machine-api | grep "existing" | grep worker004
    [kni@e16-h12-b01-fc640 ~]$

So the root cause of this is the typo in the BMH definition, but I raised https://github.com/metal3-io/baremetal-operator/issues/580 to discuss how we might improve validation and also the recovery from this error path. Also raised https://storyboard.openstack.org/#!/story/2007901 to track the issue that Ironic powered down the hosts on "adopt failed", which seems wrong, since adopt is supposed to respect the existing/initial power state.

https://github.com/metal3-io/baremetal-operator/pull/597 has been proposed upstream to enable pruning; this allows stricter client validation, which would catch the typo in the BMH CR that caused this behavior.

https://github.com/openshift/machine-api-operator/pull/661 should resolve this, as it enables pruning in the BMH CRD maintained in the MAO repo.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196
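For context on the pruning fix mentioned above: when a CRD is served as `apiextensions.k8s.io/v1` with a structural OpenAPI schema, the API server drops ("prunes") any field not declared in the schema, so a typo'd key like `bootMacAddress` cannot be silently stored alongside the real spec. The fragment below is a minimal, hypothetical sketch of what such a schema looks like, not the actual BareMetalHost CRD; only a couple of illustrative spec fields are shown.

```yaml
# Minimal, hypothetical CRD fragment (NOT the real BareMetalHost CRD).
# With apiextensions.k8s.io/v1, unknown fields such as `bootMacAddress`
# are pruned instead of being stored, because the structural schema
# declares exactly which properties exist.
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: baremetalhosts.metal3.io
spec:
  group: metal3.io
  names:
    kind: BareMetalHost
    plural: baremetalhosts
    singular: baremetalhost
  scope: Namespaced
  versions:
    - name: v1alpha1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                bootMACAddress:   # the correctly-capitalized field
                  type: string
                online:
                  type: boolean
```

With pruning in place, applying a manifest containing `bootMacAddress` results in that key being dropped before storage, so the mistake surfaces immediately (the stored object lacks a boot MAC) rather than failing later during Ironic adoption.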
Created attachment 1700036 [details]
worker nodes

Description of problem:
After a simulated power outage on a 120-node baremetal cluster, the 3 masters were successfully powered on again and formed a cluster. Powering on the worker nodes does not work; they immediately get powered off again through an IPMI call from Ironic (except for worker nodes that were deployed during the installation).

Version-Release number of selected component (if applicable):
4.4.0-0.nightly-2020-06-01-021027

How reproducible:
Always.

Steps to Reproduce:
1. Set up a 3 master / 117 worker cluster.
2. Pull the power plug.
3. Restart the masters, wait for them to form quorum, then start the workers.

Actual results:
Except for the workers that were deployed during the cluster installation, all workers are shut down within seconds of being powered on.

Expected results:
All workers should boot after being powered on and join the cluster so that they can take workloads again.

Additional info:
The contents of the log directory for the ironic conductor and the output of `curl master-1:6385/v1/nodes | jq .` are attached.