Bug 1854142

Summary: BareMetalHost CRD doesn't validate allowed fields
Product: OpenShift Container Platform          Reporter: Marko Karg <mkarg>
Component: Cloud Compute                       Assignee: Doug Hellmann <dhellmann>
Cloud Compute sub component: BareMetal Provider    QA Contact: Raviv Bar-Tal <rbartal>
Status: CLOSED ERRATA                          Docs Contact:
Severity: high
Priority: high                                 CC: beth.white, dblack, dhellmann, shardy, stbenjam
Version: 4.4                                   Keywords: Triaged
Target Milestone: ---
Target Release: 4.6.0
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version:                              Doc Type: No Doc Update
Last Closed: 2020-10-27 16:12:20 UTC Type: Bug
Attachments:
  - worker nodes
  - ironic conductor logs
  - yaml file used to add workers
  - baremetal operator logs

Description Marko Karg 2020-07-06 14:36:08 UTC
Created attachment 1700036 [details]
worker nodes

Description of problem:

After a simulated power outage on a 120-node baremetal cluster, the 3 masters were successfully powered on again and formed a cluster. Powering on the worker nodes does not work: they are immediately powered off again through an IPMI call from Ironic (except for the worker nodes that were deployed during the installation).

Version-Release number of selected component (if applicable):
4.4.0-0.nightly-2020-06-01-021027

How reproducible:
always

Steps to Reproduce:
1. set up a 3 master / 117 worker cluster
2. pull the power plug
3. restart the masters, wait for them to form quorum, then start the workers

Actual results:
Except for the workers that were deployed during the cluster installation, all workers are shut down within seconds of being powered on.

Expected results:
All workers should boot after being powered on and join the cluster so that they can take workloads again. 


Additional info:

The contents of the log directory for the ironic conductor and the output of `curl master-1:6385/v1/nodes | jq .` are attached.

Comment 1 Marko Karg 2020-07-06 14:36:54 UTC
Created attachment 1700037 [details]
ironic conductor logs

Comment 2 Steven Hardy 2020-07-06 15:02:43 UTC
I took a look at the environment where this happened; some details of what I found are below:

Looking at the API output we see, e.g.:

      "power_state": "power off",
      "provision_state": "adopt failed",

So there are two questions: why did the adoption fail (and why only for the workers added via scale-out?), and why, in that situation, did Ironic decide to power off the nodes instead of doing nothing?
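
For reference, the full set of nodes in this state can be pulled from the same API endpoint used in the description with something like the following (a sketch; the field names come from the excerpt above):

$ curl -s http://master-1:6385/v1/nodes | \
    jq -r '.nodes[] | select(.provision_state == "adopt failed") | [.name // .uuid, .power_state] | @tsv'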

In 0.log.20200706-140219 we see

  24-b5f6-f8253d4f2f14 - - - - -] During sync_power_state, node c11f9bfc-2f57-4072-aba6-11d3c168bf51 state 'power on' does not match expected state. Changing hardware state to 'power  off'.^[[00m

However in the BMH we see:

    errorMessage: 'Host adoption failed: Error while attempting to adopt node c11f9bfc-2f57-4072-aba6-11d3c168bf51:
      Node c11f9bfc-2f57-4072-aba6-11d3c168bf51 does not have any port associated
    poweredOn: true
      ID: c11f9bfc-2f57-4072-aba6-11d3c168bf51

So we can see that the adoption failed, but the BMH reports poweredOn: true (and the physical server really is powered on).

From 0.log.20200706-113047 we see:

2020-07-06T11:25:52.086225364+00:00 stderr F 2020-07-06 11:25:52.086 1 DEBUG ironic.conductor.manager [req-71902bb2-1331-46d7-acd6-8f7c0340a77d - - - - -] Conductor master-1.test422.myocp4.com attempting to adopt node c11f9bfc-2f57-4072-aba6-11d3c168bf51 _do_adoption /usr/lib/python3.6/site-packages/ironic/conductor/manager.py:2079^[[00m
2020-07-06T11:25:52.086417481+00:00 stderr F 2020-07-06 11:25:52.086 1 ERROR ironic.conductor.manager [req-71902bb2-1331-46d7-acd6-8f7c0340a77d - - - - -] Error while attempting to adopt node c11f9bfc-2f57-4072-aba6-11d3c168bf51: Node c11f9bfc-2f57-4072-aba6-11d3c168bf51 does not have any port associated with it..: ironic.common.exception.MissingParameterValue: Node c11f9bfc-2f57-4072-aba6-11d3c168bf51 does not have any port associated with it.^[[00m
2020-07-06T11:25:52.086552122+00:00 stderr F 2020-07-06 11:25:52.086 1 DEBUG ironic.common.states [req-71902bb2-1331-46d7-acd6-8f7c0340a77d - - - - -] Exiting old state 'adopting' in response to event 'fail' on_exit /usr/lib/python3.6/site-packages/ironic/common/states.py:294^[[00m
2020-07-06T11:25:52.086665505+00:00 stderr F 2020-07-06 11:25:52.086 1 DEBUG ironic.common.states [req-71902bb2-1331-46d7-acd6-8f7c0340a77d - - - - -] Entering new state 'adopt failed' in response to event 'fail' on_enter /usr/lib/python3.6/site-packages/ironic/common/states.py:300^[[00m
2020-07-06T11:25:52.095324152+00:00 stderr F 2020-07-06 11:25:52.095 1 DEBUG oslo_concurrency.processutils [req-0bafc479-63a7-4224-b5f6-f8253d4f2f14 - - - - -] CMD "ipmitool -I lanplus -H mgmt-e17-h20-b01-fc640.rdu2.scalelab.redhat.com -L ADMINISTRATOR -p 623 -U quads -R 12 -N 5 -f /tmp/tmpbyn4p0my power status" returned: 0 in 0.228s execute /usr/lib/python3.6/site-packages/oslo_concurrency/processutils.py:409^[[00m
2020-07-06T11:25:52.095543368+00:00 stderr F 2020-07-06 11:25:52.095 1 DEBUG ironic.common.utils [req-0bafc479-63a7-4224-b5f6-f8253d4f2f14 - - - - -] Execution completed, command line is "ipmitool -I lanplus -H mgmt-e17-h20-b01-fc640.rdu2.scalelab.redhat.com -L ADMINISTRATOR -p 623 -U quads -R 12 -N 5 -f /tmp/tmpbyn4p0my power status" execute /usr/lib/python3.6/site-packages/ironic/common/utils.py:77^[[00m
2020-07-06T11:25:52.095704170+00:00 stderr F 2020-07-06 11:25:52.095 1 DEBUG ironic.common.utils [req-0bafc479-63a7-4224-b5f6-f8253d4f2f14 - - - - -] Command stdout is: "Chassis Power is off
2020-07-06T11:25:52.095704170+00:00 stderr F " execute /usr/lib/python3.6/site-packages/ironic/common/utils.py:78^[[00m
2020-07-06T11:25:52.095856584+00:00 stderr F 2020-07-06 11:25:52.095 1 DEBUG ironic.common.utils [req-0bafc479-63a7-4224-b5f6-f8253d4f2f14 - - - - -] Command stderr is: "" execute /usr/lib/python3.6/site-packages/ironic/common/utils.py:79^[[00m
2020-07-06T11:25:52.096155973+00:00 stderr F 2020-07-06 11:25:52.095 1 DEBUG ironic.conductor.task_manager [req-0bafc479-63a7-4224-b5f6-f8253d4f2f14 - - - - -] Successfully released shared lock for power state sync on node a60ddd23-a82d-4d2e-b7b5-9c48ef380f1a (lock was held 0.28 sec) release_resources /usr/lib/python3.6/site-packages/ironic/conductor/task_manager.py:356^[[00m
2020-07-06T11:25:52.097107560+00:00 stderr F 2020-07-06 11:25:52.096 1 ERROR ironic.conductor.task_manager [req-71902bb2-1331-46d7-acd6-8f7c0340a77d - - - - -] Node c11f9bfc-2f57-4072-aba6-11d3c168bf51 moved to provision state "adopt failed" from state "adopting"; target provision state is "active": ironic.common.exception.MissingParameterValue: Node c11f9bfc-2f57-4072-aba6-11d3c168bf51 does not have any port associated with it.^[[00m
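
The same two messages can also be pulled straight out of the attached conductor logs (a rough sketch; it assumes the log files from comment 1 are in the current directory):

# adoption failures caused by the missing port
$ grep -h "does not have any port associated" 0.log.*
# power-state syncs that decided to power hosts back off
$ grep -h "does not match expected state" 0.log.*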

Comment 3 Marko Karg 2020-07-06 15:17:51 UTC
Created attachment 1700042 [details]
yaml file used to add workers

Comment 4 Steven Hardy 2020-07-06 15:36:52 UTC
I noticed that for the hosts that failed to adopt, the bootMACAddress is missing in the BMH output, e.g.:

$ oc get bmh -o json | jq '.items[] | (.metadata.name,.spec.bootMACAddress)' | head -n 20
"master-0"
"40:a6:b7:00:5d:20"
"master-1"
"3c:fd:fe:e7:95:10"
"master-2"
"3c:fd:fe:e7:b5:50"
"worker-0"
"3c:fd:fe:e7:98:50"
"worker-1"
"40:a6:b7:00:5e:70"
"worker-2"
"40:a6:b7:00:63:10"
"worker-3"
"40:a6:b7:00:69:60"
"worker004"
null
"worker005"
null
"worker006"
null
...
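
A variant of the same query that lists only the hosts where the field is missing (a sketch; same cluster access as above):

$ oc get bmh -o json | \
    jq -r '.items[] | select(.spec.bootMACAddress == null) | .metadata.name'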

However, in the attachment from comment #3 we see e.g. `bootMacAddress: 40:a6:b7:00:5b:f0`, so I wonder whether the difference in capitalization means the data is silently ignored...
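
To make that concrete, here is a minimal sketch of the suspected failure mode (the BMC address is a placeholder; the MAC and the two spellings are the ones discussed above): if the CRD neither prunes nor rejects unknown fields, the misspelled key is accepted by the API server but never read by the operator, so no bootMACAddress reaches Ironic and no port is created for the node.

$ oc apply -f - <<'EOF'
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: worker004
  namespace: openshift-machine-api
spec:
  online: true
  bmc:
    address: ipmi://bmc.example.com            # placeholder
    credentialsName: worker-004-bmc-secret
  bootMacAddress: "40:a6:b7:00:5b:f0"          # wrong capitalization: silently ignored
  # bootMACAddress: "40:a6:b7:00:5b:f0"        # the field the operator actually reads
EOF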

Comment 5 Marko Karg 2020-07-06 15:51:26 UTC
Created attachment 1700049 [details]
baremetal operator logs

Comment 6 Steven Hardy 2020-07-06 16:54:47 UTC
OK, so digging some more, this may be a BMO problem, because we see the following error:

$ oc logs metal3-57fbfdfd87-xwhvg --timestamps=true -c metal3-baremetal-operator -n openshift-machine-api | grep "worker004"
2020-07-06T16:19:18.133541251Z {"level":"info","ts":1594052358.1335247,"logger":"baremetalhost","msg":"Reconciling BareMetalHost","Request.Namespace":"openshift-machine-api","Request.Name":"worker004"}
2020-07-06T16:19:18.133576083Z {"level":"info","ts":1594052358.1335642,"logger":"baremetalhost","msg":"stopping on host error","Request.Namespace":"openshift-machine-api","Request.Name":"worker004","provisioningState":"registration error","message":"Host adoption failed: Error while attempting to adopt node f6b66ff4-2ffd-44c8-abea-22ad7eccb246: Node f6b66ff4-2ffd-44c8-abea-22ad7eccb246 does not have any port associated with it.."}


However I don't see any corresponding call to the code which creates the port for worker004 (even after manually patching the BMH to correct the bootMACAddress naming and restarting the metal3 pod):

https://github.com/openshift/baremetal-operator/blob/release-4.4/pkg/provisioner/ironic/ironic.go#L210

[kni@e16-h12-b01-fc640 ~]$ oc logs metal3-57fbfdfd87-xwhvg --timestamps=true -c metal3-baremetal-operator -n openshift-machine-api | grep "validating management access" | grep worker004
[kni@e16-h12-b01-fc640 ~]$ oc logs metal3-57fbfdfd87-xwhvg --timestamps=true -c metal3-baremetal-operator -n openshift-machine-api | grep "validating management access" | grep worker-3
2020-07-06T16:19:17.073637151Z {"level":"info","ts":1594052357.0735617,"logger":"baremetalhost_ironic","msg":"validating management access","host":"worker-3"}

[kni@e16-h12-b01-fc640 ~]$ oc get bmh -o json | jq '.items[] | select(.metadata.name == "worker004") | .spec'
{
  "bmc": {
    "address": "ipmi://mgmt-e16-h18-b04-fc640.rdu2.scalelab.redhat.com",
    "credentialsName": "worker-004-bmc-secret"
  },
  "bootMACAddress": "40:a6:b7:00:5b:f0",


We also don't see any existing-node lookup, e.g.:

https://github.com/openshift/baremetal-operator/blob/release-4.4/pkg/provisioner/ironic/ironic.go#L872

oc logs metal3-57fbfdfd87-xwhvg --timestamps=true -c metal3-baremetal-operator -n openshift-machine-api | grep "existing" | grep worker004
[kni@e16-h12-b01-fc640 ~]$
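
For completeness, the manual workaround mentioned above (correcting the field name on the live object and restarting the metal3 pod) would look roughly like this; the exact commands used are not in the log output, so this is a sketch:

$ oc patch bmh worker004 -n openshift-machine-api --type=merge \
    -p '{"spec":{"bootMACAddress":"40:a6:b7:00:5b:f0"}}'
$ oc delete pod metal3-57fbfdfd87-xwhvg -n openshift-machine-api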

Comment 7 Steven Hardy 2020-07-07 14:47:26 UTC
So the root cause of this is the typo in the BMH definition, but I raised https://github.com/metal3-io/baremetal-operator/issues/580 to discuss how we may improve validation and also the recovery from this error path.

Comment 8 Steven Hardy 2020-07-08 11:07:35 UTC
Also raised https://storyboard.openstack.org/#!/story/2007901 to track the issue that Ironic powered the hosts down when adoption failed, which seems wrong, since adoption is supposed to respect the existing/initial power state.

Comment 9 Steven Hardy 2020-07-27 17:07:01 UTC
https://github.com/metal3-io/baremetal-operator/pull/597 has been proposed upstream to enable pruning; this allows stricter client-side validation, which would catch the typo in the BMH CR that caused this behavior.

Comment 10 Steven Hardy 2020-07-31 09:08:37 UTC
https://github.com/openshift/machine-api-operator/pull/661 should resolve this, as it enables pruning in the BMH CRD maintained in the MAO repo.
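
A quick way to check whether pruning is actually in effect for the BMH CRD on a given cluster (a sketch using the standard apiextensions field; the exact change made by the PR is not reproduced here):

# "true" means unknown fields are preserved (the old behavior);
# empty or "false" means misspelled keys such as bootMacAddress are dropped at
# admission, and stricter client-side validation can flag them.
$ oc get crd baremetalhosts.metal3.io -o jsonpath='{.spec.preserveUnknownFields}'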

Comment 20 errata-xmlrpc 2020-10-27 16:12:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196