Description of problem:
In an IPI baremetal deployment with disks exposed through multiple SAN paths, if one of the paths is passive, cleaning of those disks will fail and the clean phase will be marked as failed.

Version-Release number of selected component (if applicable):
OCP 4.8.29

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:
This was extracted from a failed inspection:

~~~
2022-03-29 07:32:12.322 1 DEBUG ironic.drivers.modules.agent_client [-] Status of agent commands for node 365a2544-613a-4850-b9fe-deb7c62da9f2: get_clean_steps: result "{'clean_steps': {'GenericHardwareManager': [{'step': 'erase_devices', 'priority': 10, 'interface': 'deploy', 'reboot_requested': False, 'abortable': True}, {'step': 'erase_devices_metadata', 'priority': 99, 'interface': 'deploy', 'reboot_requested': False, 'abortable': True}, {'step': 'erase_pstore', 'priority': 0, 'interface': 'deploy', 'reboot_requested': False, 'abortable': True}, {'step': 'delete_configuration', 'priority': 0, 'interface': 'raid', 'reboot_requested': False, 'abortable': True}, {'step': 'create_configuration', 'priority': 0, 'interface': 'raid', 'reboot_requested': False, 'abortable': True}]}, 'hardware_manager_version': {'generic_hardware_manager': '1.1'}}", error "None"; execute_clean_step: result "None", error "{'type': 'CleaningError', 'code': 500, 'message': 'Clean step failed', 'details': 'Error performing clean_step erase_devices_metadata: Error erasing block device: Failed to erase the metadata on the device(s): "/dev/sdd": Unexpected error while running command.\nCommand: dd bs=512 if=/dev/zero of=/dev/sdd count=33\nExit code: 1\nStdout: \'\'\nStderr: "dd: error writing \'/dev/sdd\': Input/output error\\n1+0 records in\\n0+0 records out\\n0 bytes copied, 0.000211819 s, 0.0 kB/s\\n"; "/dev/sdc": Unexpected error while running command.\nCommand: dd bs=512 if=/dev/zero of=/dev/sdc count=33\nExit code: 1\nStdout: \'\'\nStderr: "dd: error writing \'/dev/sdc\': Input/output error\\n1+0 records in\\n0+0 records out\\n0 bytes copied, 0.000180593 s, 0.0 kB/s\\n"'}" get_commands_status /usr/lib/python3.6/site-packages/ironic/drivers/modules/agent_client.py:343
~~~

This is how the disk is seen by the inspector image:

~~~
[ 12.807781] scsi 3:0:0:0: Direct-Access Nimble Server 1.0 PQ: 0 ANSI: 5
[ 12.819297] scsi 3:0:0:0: alua: supports implicit TPGS
[ 12.837085] scsi 3:0:0:0: alua: device t10.Nimble f9004c23913a954c6c9ce90019112fc8 port group 1 rel port 2
[ 12.846215] sd 3:0:0:0: Attached scsi generic sg4 type 0
[ 12.846726] sd 3:0:0:0: Power-on or device reset occurred
[ 12.855666] sd 3:0:0:0: alua: port group 01 state S non-preferred supports tolusna
[ 12.873560] sd 3:0:0:0: alua: port group 01 state S non-preferred supports tolusna
[ 12.875683] sd 3:0:0:0: [sdd] 419430400 512-byte logical blocks: (215 GB/200 GiB)
[ 12.891015] sd 3:0:0:0: [sdd] Write Protect is off
[ 12.899652] sd 3:0:0:0: [sdd] Mode Sense: 9b 00 00 08
[ 12.899763] sd 3:0:0:0: [sdd] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
[ 12.918825] Dev sdd: unable to read RDB block 0
[ 12.934829] sdd: unable to read partition table
[ 12.955206] sd 3:0:0:0: [sdd] Attached SCSI disk
~~~

More logs:

~~~
Mar 30 02:20:15 localhost.localdomain smartd[1166]: Configuration file /etc/smartmontools/smartd.conf was parsed, found DEVICESCAN, scanning devices
Mar 30 02:20:15 localhost.localdomain smartd[1166]: Device: /dev/sdb, opened
Mar 30 02:20:15 localhost.localdomain smartd[1166]: Device: /dev/sdb, [Nimble Server 1.0 ], lu id: 0xf9004c23913a954c6c9ce90019112fc8, S/N: f9004c23913a954c6c9ce90019112fc8, 214 GB
Mar 30 02:20:15 localhost.localdomain smartd[1166]: Device: /dev/sdb, NOT READY (e.g. spun down); skip device
Mar 30 02:20:15 localhost.localdomain smartd[1166]: Device: /dev/sdc, opened
Mar 30 02:20:15 localhost.localdomain smartd[1166]: Device: /dev/sdc, [Nimble Server 1.0 ], lu id: 0xf9004c23913a954c6c9ce90019112fc8, S/N: f9004c23913a954c6c9ce90019112fc8, 214 GB
Mar 30 02:20:15 localhost.localdomain smartd[1166]: Device: /dev/sdc, is SMART capable. Adding to "monitor" list.
Mar 30 02:20:15 localhost.localdomain smartd[1166]: Device: /dev/sdd, opened
Mar 30 02:20:15 localhost.localdomain smartd[1166]: Device: /dev/sdd, [Nimble Server 1.0 ], lu id: 0xf9004c23913a954c6c9ce90019112fc8, S/N: f9004c23913a954c6c9ce90019112fc8, 214 GB
Mar 30 02:20:15 localhost.localdomain smartd[1166]: Device: /dev/sdd, same identity as /dev/sdc, ignored
Mar 30 02:20:15 localhost.localdomain smartd[1166]: Device: /dev/sde, opened
Mar 30 02:20:15 localhost.localdomain smartd[1166]: Device: /dev/sde, [Nimble Server 1.0 ], lu id: 0xf9004c23913a954c6c9ce90019112fc8, S/N: f9004c23913a954c6c9ce90019112fc8, 214 GB
Mar 30 02:20:15 localhost.localdomain smartd[1166]: Device: /dev/sde, same identity as /dev/sdc, ignored

# lsblk -t
NAME ALIGNMENT MIN-IO OPT-IO PHY-SEC LOG-SEC ROTA SCHED       RQ-SIZE  RA WSAME
sdb          0    512      0     512     512    0 none           5884 128    1G
sdc          0    512      0     512     512    0 none           5884 128    1G
sdd          0    512      0     512     512    0 none           5884 128    1G
sde          0    512      0     512     512    0 none           5884 128    1G
sr0          0   2048      0    2048    2048    1 mq-deadline       2 128    0B
~~~

Expected results:
Ideally it should only erase metadata on one disk, but it would also be acceptable to erase metadata from the active devices only (which prevents the failure), even though that means erasing the same disk twice.

Additional info:
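To confirm from the ramdisk that the failing devices are really just passive ALUA paths of the same LUN (rather than broken disks), comparing their WWNs and SCSI/ALUA state is usually enough. A minimal sketch, assuming sg3_utils is available in the image (it is not part of the stock IPA ramdisk):

~~~
#!/bin/bash
# Sketch only: check whether sdb..sde are paths to the same LUN and which
# of them sit on a passive (non-optimized) ALUA port group.
for dev in /dev/sd{b,c,d,e}; do
    echo "== ${dev} =="
    # All paths of a single LUN report the same WWN/serial.
    lsblk -dn -o NAME,WWN,SERIAL,SIZE "${dev}"
    # Kernel-side device state (running/offline).
    cat "/sys/block/$(basename "${dev}")/device/state"
    # ALUA target port group state (active/optimized vs. standby);
    # sg_rtpg comes from sg3_utils, which may not be installed.
    sg_rtpg -v "${dev}" || echo "sg_rtpg unavailable or path not responding"
done
~~~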
For reference, we've encountered a similar issue in UPI-land related to non-optimized paths. In the end, we added full support for installing to multipath devices. Some links:

- https://docs.openshift.com/container-platform/4.8/installing/installing_bare_metal/installing-bare-metal.html#rhcos-enabling-multipath_installing-bare-metal
- https://github.com/coreos/fedora-coreos-config/pull/1011
- https://github.com/openshift/os/blob/master/docs/faq.md#q-does-rhcos-support-multipath

It seems like Ironic needs to learn to do the same thing here: assemble the multipath, and only manipulate the multipathed device itself, not the underlying individual paths. For hooking into RHCOS' support for first-boot multipath, it would also need to add the kernel arguments described in the OpenShift docs above.
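For illustration, this is roughly what the RHCOS-side multipath enablement looks like according to the docs linked above; Ironic/IPA would need to pass the equivalent kernel arguments when provisioning onto the multipathed device. The exact flags, and whether first-boot enablement is supported in a given release, should be verified against the documentation for that release:

~~~
# Sketch, not a tested procedure: install RHCOS onto the assembled multipath
# device and append the multipath kernel arguments from the linked docs.
coreos-installer install /dev/mapper/mpatha \
    --append-karg rd.multipath=default \
    --append-karg root=/dev/disk/by-label/dm-mpath-root \
    --ignition-file worker.ign
~~~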
As a quick and dirty solution, can't we just skip devices that cannot be used, at least as a workaround?
(In reply to Mario Abajo from comment #2)
> As a quick and dirty solution, can't we just skip devices that cannot be
> used, at least as a workaround?

This is technically possible (with a code change to ironic-python-agent), but if we start ignoring errors while cleaning disks, other Ironic users could end up leaking data between customers whenever we fail to clean a disk and ignore the failure.
My $0.02 is to just load up multipathd and hopefully it will recognize the SAN pathing and de-duplicate it. Unfortunately, SAN controllers often behave differently or need special configuration, which is why we have been shy about incorporating it into the ramdisks by default. I'd personally prefer we don't ignore failed devices, since the data-leakage risk is so high once that starts to happen.
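Just to spell out what "load up multipathd" would amount to inside the ramdisk, here is a sketch only; the default IPA image does not ship device-mapper-multipath, and SAN-specific configuration may still be needed:

~~~
# Assumes device-mapper-multipath is installed in the ramdisk (not the default).
mpathconf --enable --with_multipathd y   # generate a default /etc/multipath.conf
systemctl start multipathd               # or start multipathd directly if there is no systemd
multipath -ll                            # verify the duplicate paths coalesce into one dm device
~~~

With the paths coalesced, cleaning would only ever touch the single dm-multipath device instead of each underlying /dev/sdX path.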
Hello everyone, I've talked with Julia today and we think we need to add a new element to the ramdisk (to be able to identify multipath devices). I've pushed the upstream change already, and we will work on backporting from 4.11 down to 4.8.
Instead of using a release image with the modified ironic-ipa-downloader image (which contains https://review.opendev.org/c/openstack/ironic-python-agent/+/837784), we can try manually updating the ironic-ipa-downloader image after the cluster is installed. The procedure I've tested locally, and which worked, is:

After your deployment is up, first check that there are no unmanaged resources in your cluster:

$ oc get -o json clusterversion version | jq .spec.overrides

After verifying that, move the cluster-baremetal-operator-images ConfigMap to unmanaged, which can be done by running the following command:

$ oc patch clusterversion version --namespace openshift-cluster-version --type merge -p '{"spec":{"overrides":[{"kind":"ConfigMap","group":"v1","name":"cluster-baremetal-operator-images","namespace":"openshift-machine-api","unmanaged":true}]}}'
clusterversion.config.openshift.io/version patched

Check that the clusterversion now shows the new resource as unmanaged:

$ oc get -o json clusterversion version | jq .spec.overrides
[
  {
    "group": "v1",
    "kind": "ConfigMap",
    "name": "cluster-baremetal-operator-images",
    "namespace": "openshift-machine-api",
    "unmanaged": true
  }
]

Edit the ConfigMap for cluster-baremetal-operator-images and change the value of baremetalIpaDownloader to the new image, in our case quay.io/imelofer/ipa-multipath@sha256:cead7e5a6fe9ad2c5027282a8a74ec12224aafe6b4524fd879f25c4ecc996485:

$ oc edit ConfigMap cluster-baremetal-operator-images
configmap/cluster-baremetal-operator-images edited

Wait about 3 minutes and double-check that the ConfigMap still contains the right config:

$ oc describe ConfigMap cluster-baremetal-operator-images | grep "imelofer"

Delete the CBO pod and wait for the CVO to bring it back (after the new CBO starts you can try to add new nodes to your cluster):

$ oc get pods -n openshift-machine-api
NAME                                           READY   STATUS    RESTARTS   AGE
cluster-autoscaler-operator-78dbcdbf85-hdp44   2/2     Running   0          78m
cluster-baremetal-operator-58b9dd5c45-pfhwd    2/2     Running   1          78m
machine-api-controllers-5bb58fb7bf-lp4fn       7/7     Running   1          73m
machine-api-operator-658749fccf-rq6c8          2/2     Running   1          78m

$ oc delete po cluster-baremetal-operator-58b9dd5c45-pfhwd -n openshift-machine-api
pod "cluster-baremetal-operator-58b9dd5c45-pfhwd" deleted

$ oc get pods -n openshift-machine-api
NAME                                           READY   STATUS    RESTARTS   AGE
cluster-autoscaler-operator-78dbcdbf85-hdp44   2/2     Running   0          79m
cluster-baremetal-operator-58b9dd5c45-72rsp    2/2     Running   0          25s
machine-api-controllers-5bb58fb7bf-lp4fn       7/7     Running   1          74m
machine-api-operator-658749fccf-rq6c8          2/2     Running   1          79m
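One extra check that may be worth doing after the new CBO is up: confirm the metal3 pod actually picked up the replacement ironic-ipa-downloader image. A sketch, assuming the metal3 pod naming of a default IPI baremetal install:

~~~
# List the images of the metal3 pod's init containers and containers and
# look for the ipa-downloader entry; it should point at the replacement image.
oc -n openshift-machine-api get pods -o name | grep metal3 | while read -r pod; do
    oc -n openshift-machine-api get "${pod}" \
        -o jsonpath='{range .spec.initContainers[*]}{.name}{"\t"}{.image}{"\n"}{end}{range .spec.containers[*]}{.name}{"\t"}{.image}{"\n"}{end}'
done | grep -i ipa
~~~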
*** Bug 2077067 has been marked as a duplicate of this bug. ***
Setting target release for this BZ to 4.8.z, since we already have the bugs for 4.11, 4.10, 4.9
Deployment of 4.8.0-0.nightly-2022-06-15-131405 passed successfully and sanity tests passed.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.8.45 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2022:5167
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days