Description of Problem:

Note: This article is published in the public state at the request of Red Hat.

The auto cleaning step (erase_devices_metadata) in the Prepare stage failed with the following error:

---------------------------
Agent returned error for clean step {'step': 'erase_devices_metadata', 'priority': 10, 'interface': 'deploy', 'reboot_requested': False, 'abortable': True, 'requires_ramdisk': True} on node XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX : Error performing clean_step erase_devices_metadata: Error erasing block device: Failed to erase the metadata on the device(s): "/dev/sda": Unexpected error while running command.
Command: dd bs=512 if=/dev/zero of=/dev/sda count=33
Exit code: 1
Stdout: ''
Stderr: "dd: error writing '/dev/sda': Input/output error\n1+0 records in\n0+0 records out\n0 bytes copied, 0.00108204 s, 0.0 kB/s\n"
---------------------------

A RAID volume cannot be used immediately after its configuration completes; initialization takes a certain amount of time. While initialization is in progress, data cannot be written to the disk, which produces the error above. The node then enters the clean failed state, manual clean and auto clean are re-executed, and the same error occurs again.

The root cause is that Ironic does not wait for the RAID configuration it applies to finish initializing, so such a waiting step needs to be implemented in Ironic.

There is also a problem on the Metal3 side. In Metal3, manual clean and auto clean are both executed in the Preparing stage, and both report failure as "clean failed". A failure of auto clean therefore causes manual clean to be executed again, which leads to an infinite loop. A PR addressing this infinite loop is already open on the Metal3 side:
https://github.com/metal3-io/baremetal-operator/pull/929

Version-Release number of selected component:
This issue was detected in the Pre-GA version.
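As a rough illustration of the waiting behavior described above, the sketch below retries the failing metadata-erase dd command until the RAID volume accepts writes, instead of aborting on the first I/O error. The function name, retry budget, and sleep interval are assumptions for illustration only, not Ironic's actual implementation:

```shell
# Hedged sketch: retry the metadata erase until the RAID volume
# finishes initializing and becomes writable.
# The retry count (30) and interval (10s) are illustrative values.
erase_metadata_when_ready() {
  dev="$1"
  for attempt in $(seq 1 30); do
    # Same command Ironic runs for erase_devices_metadata.
    if dd bs=512 if=/dev/zero of="$dev" count=33 2>/dev/null; then
      echo "erased metadata on $dev after $attempt attempt(s)"
      return 0
    fi
    sleep 10  # RAID initialization still in progress; wait and retry
  done
  echo "device $dev never became writable" >&2
  return 1
}
```

On a device that is writable from the start, the first dd succeeds and the function returns immediately; the loop only matters while the controller is still initializing the volume.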
Red Hat OpenShift Container Platform Version Number: 4.9.0-0.nightly-2021-07-26-071921
Release Number: 4.9
Kubernetes Version: 1.21
Cri-o Version: 0.1.0
Related Component: None
Related Middleware/Application: None
Underlying RHCOS Release Number: 4.9
Underlying RHCOS Architecture: x86_64
Underlying RHCOS Kernel Version: 4.18.0
Drivers or hardware or architecture dependency: This is a common problem with RAID controllers.

How reproducible:
The error occurs if the RAID volume takes time to initialize after the RAID configuration is applied.

Steps to Reproduce:
1. Create install-config.yaml in clusterconfigs (a RAID card is installed in the worker machine):
$ vim ~/clusterconfigs/install-config.yaml
2. Create manifests:
$ openshift-baremetal-install --dir ~/clusterconfigs create manifests
3. Modify manifests: write the RAID configuration into the corresponding YAML file.
$ vim ~/clusterconfigs/99_openshift-cluster-api_hosts-3.yaml
4. Create cluster:
$ openshift-baremetal-install --dir ~/clusterconfigs --log-level debug create cluster

Actual Results:
The auto cleaning step in the Prepare stage failed.

Expected Results:
The auto cleaning step in the Prepare stage does not fail.

Summary of actions taken to resolve issue:
The upstream (Metal3) and downstream (RHOCP) PRs need to be merged.
- Upstream: https://github.com/metal3-io/baremetal-operator/pull/929
- Downstream: https://github.com/openshift/baremetal-operator/pull/166

Location of diagnostic data: None
Hardware configuration:
Model: RX2540 M4
Target Release: RHOCP4.9
Additional Info: None
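For reference, step 3 above writes the hardware RAID configuration into the BareMetalHost manifest. A minimal sketch of such a manifest, using the metal3 `spec.raid.hardwareRAIDVolumes` field — the host name, volume name, RAID level, disk count, and size below are illustrative examples, not values taken from this report:

```yaml
# Illustrative fragment of 99_openshift-cluster-api_hosts-3.yaml.
# All concrete values (name, level, size, disk count) are examples only.
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: worker-3        # example host name
spec:
  raid:
    hardwareRAIDVolumes:
      - name: volume-1              # example volume name
        level: "1"                  # RAID level as a string, e.g. "0", "1", "5"
        numberOfPhysicalDisks: 2    # example disk count
        sizeGibibytes: 500          # example size
```

With such a configuration, the RAID controller builds the volume during the Preparing stage, which is exactly when the initialization delay described above can make /dev/sda temporarily unwritable.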
Please, could you verify this?
Hi, Lubov,

Yes, Fujitsu is going to verify this once a nightly build which includes PR166 is released.

Best Regards,
Yasuhiro Futakawa
(In reply to Fujitsu container team from comment #5)
> Hi, Lubov,
>
> Yes, Fujitsu is going to verify this once nightly build which includes PR166
> is released.
>
> Best Regards,
> Yasuhiro Futakawa

Hi, I believe the PR is in, could you verify it, please?
Hi, Lubov,

> Hi, I believe the PR is in, could you verify it, please?

Correct, but we found the following bugs during verification. We have already sent patches to the Ironic community:
https://review.opendev.org/c/openstack/ironic/+/809022
https://review.opendev.org/c/openstack/ironic/+/809023

We are going to merge the above patches into the next version of OCP and backport them to OCP 4.9.z:
https://bugzilla.redhat.com/show_bug.cgi?id=2005163
https://bugzilla.redhat.com/show_bug.cgi?id=2005165

We plan to verify again after the backport is complete. I believe we can then close this BZ.

Best Regards,
Yasuhiro Futakawa
assigning to @fj-lsoft-rh-cnt.fujitsu.com to clear our backlog
@fj-lsoft-rh-cnt.fujitsu.com @shardy Can someone verify this bug fix on 4.9?
(In reply to Mike Fiedler from comment #10)
> @fj-lsoft-rh-cnt.fujitsu.com @shardy Can someone verify this bug fix
> on 4.9?

Hi Mike,

This isn't yet backported to 4.9, as we need to have https://bugzilla.redhat.com/show_bug.cgi?id=2011753 verified first (which I've been working on in collaboration with our Fujitsu colleagues who submitted the Ironic patches).

I will need to create a new public BZ to cover this issue as well as https://bugzilla.redhat.com/show_bug.cgi?id=1986656 in order to be able to merge the fix, due to GH/OCP bot automation requirements (this BZ can't be used because it is assigned to the Fujitsu group).

When https://bugzilla.redhat.com/show_bug.cgi?id=2011753 is VERIFIED, I will create a new 4.9 BZ and close this and https://bugzilla.redhat.com/show_bug.cgi?id=1986656 as duplicates of the new BZ. I hope this will happen this week or next week.

Meanwhile, reassigning this to myself. @Steve @Fujitsu team, feel free to clear the needinfo unless you want to add anything more to my comment.
Further work on this will continue in https://bugzilla.redhat.com/show_bug.cgi?id=2012798, which is a public version of this BZ (created due to GitHub automation requirements).

*** This bug has been marked as a duplicate of bug 2012798 ***
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days