Bug 1986654 - [OCP4.9 Bug] Auto cleaning step in Prepare stage failed
Summary: [OCP4.9 Bug] Auto cleaning step in Prepare stage failed
Keywords:
Status: CLOSED DUPLICATE of bug 2012798
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Bare Metal Hardware Provisioning
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.9.0
Assignee: Jacob Anders
QA Contact: Fujitsu container team
URL:
Whiteboard:
Depends On:
Blocks: 1920358
Reported: 2021-07-28 02:27 UTC by Fujitsu container team
Modified: 2023-09-15 01:12 UTC
CC: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-10-11 10:44:33 UTC
Target Upstream Version:
Embargoed:




Links:
  Github openshift/baremetal-operator pull 166 (open): Fix auto clean failure in preparing state. Last updated: 2021-07-28 02:27:35 UTC

Description Fujitsu container team 2021-07-28 02:27:35 UTC
Description of Problem:

 Note: This bug report is published publicly at the request of Red Hat.

  The auto cleaning step (erase_devices_metadata) in the Prepare stage failed with the following error:

  ---------------------------
  Agent returned error for clean step {'step': 'erase_devices_metadata',
  'priority': 10, 'interface': 'deploy', 'reboot_requested': False, 'abortable':
  True, 'requires_ramdisk': True} on node XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX :
  Error performing clean_step erase_devices_metadata: Error erasing block device:
  Failed to erase the metadata on the device(s): \"/dev/sda\": Unexpected error
  while running command.\nCommand: dd bs=512 if=/dev/zero of=/dev/sda count=33\n
  Exit code: 1\nStdout: ''\nStderr: \"dd: error writing '/dev/sda': Input/output
  error\\n1+0 records in\\n0+0 records out\\n0 bytes copied, 0.00108204 s, 0.0 kB/s\\n\
  ---------------------------

  A RAID volume cannot be used immediately after its configuration completes; it needs a certain amount of time to initialize.
  During that window, data cannot be written to the disk, and the error above occurs.
  The node then enters the clean failed state, manual clean and auto clean are re-executed, and the same error occurs again.

  The root cause is that Ironic does not wait for the RAID configuration it applies to finish initializing.
  So such a waiting process needs to be implemented in Ironic.
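
  For illustration only, here is a minimal Python sketch of the kind of waiting behaviour needed on the Ironic side: poll the newly created RAID volume with a small write probe until it accepts writes, instead of running erase_devices_metadata right away. The helper name, timeout values and the dd-based probe are assumptions made for this sketch, not actual Ironic code.

  ---------------------------
  import subprocess
  import time

  def wait_for_block_device_writable(device, timeout=600, interval=15):
      """Poll until the given block device accepts writes.

      The probe mirrors the command that fails in the log above (dd writing
      the first 33 sectors). That is acceptable here only because the next
      clean step wipes the metadata anyway.
      """
      deadline = time.monotonic() + timeout
      while True:
          probe = subprocess.run(
              ["dd", "bs=512", "if=/dev/zero", "of=" + device, "count=33"],
              capture_output=True, text=True,
          )
          if probe.returncode == 0:
              return  # device is writable; safe to run erase_devices_metadata
          if time.monotonic() >= deadline:
              raise TimeoutError(
                  "%s still rejects writes after %ss: %s"
                  % (device, timeout, probe.stderr.strip())
              )
          time.sleep(interval)  # RAID volume is still initializing; retry

  # Example: wait_for_block_device_writable("/dev/sda") before the metadata erase.
  ---------------------------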

  On the other hand, there is also a problem on the Metal3 side.
  In Metal3, manual clean and auto clean are both executed in the Preparing stage, and a failure in either one is reported as the same "clean failed" status.
  Therefore, a failure of auto clean also causes manual clean to be executed again.
  This leads to an infinite loop.

  On the Metal3 side, we already have a PR to address this infinite loop (illustrated by the sketch below the link):
  https://github.com/metal3-io/baremetal-operator/pull/929
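
  The actual fix in baremetal-operator PR 929 is Go code; the Python sketch below only models the reasoning. When the operator sees nothing more specific than a generic "clean failed" status, it cannot tell whether the manual or the automated clean failed, so it restarts the whole Preparing sequence. Remembering which phase was running is what breaks the loop. All names in this sketch are hypothetical.

  ---------------------------
  from enum import Enum

  class CleanPhase(Enum):
      MANUAL = "manual"        # user-requested steps, e.g. building the RAID
      AUTOMATED = "automated"  # Ironic auto clean, e.g. erase_devices_metadata

  def next_action(reported_status, failed_phase=None):
      """Decide what the Preparing stage should do after a cleaning failure."""
      if reported_status != "clean failed":
          return "continue provisioning"
      if failed_phase is None:
          # Buggy behaviour described in this BZ: the phase is unknown, so
          # manual clean re-runs, the RAID is rebuilt, the volume is again
          # uninitialized, and the automated clean fails once more -> loop.
          return "retry manual clean, then automated clean"
      if failed_phase is CleanPhase.AUTOMATED:
          # The RAID configuration is already in place; only retry auto clean.
          return "retry automated clean only"
      return "retry manual clean"
  ---------------------------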

Version-Release number of selected component:

  This issue was detected in the Pre-GA version.

  Red Hat OpenShift Container Platform Version Number: 4.9.0-0.nightly-2021-07-26-071921
  Release Number: 4.9
  Kubernetes Version: 1.21
  Cri-o Version: 0.1.0
  Related Component: None
  Related Middleware/Application: None
  Underlying RHCOS Release Number: 4.9
  Underlying RHCOS Architecture: x86_64
  Underlying RHCOS Kernel Version: 4.18.0

Drivers or hardware or architecture dependency:

  This is a common problem with RAID controllers.

How reproducible:

  The error occurs whenever the RAID volume takes time to initialize after the RAID configuration is applied.

Steps to Reproduce:

  1. Create install-config.yaml in clusterconfigs:

     A RAID card is installed in the worker machine.

     $ vim ~/clusterconfigs/install-config.yaml

  2. Create manifests:

     $ openshift-baremetal-install --dir ~/clusterconfigs create manifests

  3. Modify manifests:

     Write the RAID configuration into the corresponding yaml (a hedged example of this section is sketched after these steps).

     $ vim ~/clusterconfigs/99_openshift-cluster-api_hosts-3.yaml

  4. Create cluster:

     $ openshift-baremetal-install --dir ~/clusterconfigs --log-level debug create cluster
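
  As referenced in step 3, below is a rough sketch of the RAID section added to the BareMetalHost entry in 99_openshift-cluster-api_hosts-3.yaml. It is written in Python and printed as YAML to make clear it is only an illustration; the field names (spec.raid.hardwareRAIDVolumes, level, sizeGibibytes, numberOfPhysicalDisks) are a best-effort reading of the Metal3 BareMetalHost API, not values taken from this report, and the sizes and levels are placeholders.

  ---------------------------
  import yaml  # requires pyyaml

  # Assumed shape of the RAID section; field names are best-effort guesses
  # based on the Metal3 BareMetalHost API, not taken from this bug report.
  raid_section = {
      "spec": {
          "raid": {
              "hardwareRAIDVolumes": [
                  {
                      "name": "volume1",
                      "level": "1",               # RAID level as a string
                      "sizeGibibytes": 100,       # placeholder size
                      "numberOfPhysicalDisks": 2,
                  }
              ]
          }
      }
  }

  print(yaml.safe_dump(raid_section, sort_keys=False))
  ---------------------------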

Actual Results:

  Auto cleaning step in Prepare stage failed.

Expected Results:

  Auto cleaning step in Prepare stage does not fail.

Summary of actions taken to resolve issue:

  We need to merge the upstream (Metal3) and downstream (RHOCP) PRs:
  - Upstream: https://github.com/metal3-io/baremetal-operator/pull/929
  - Downstream: https://github.com/openshift/baremetal-operator/pull/166

Location of diagnostic data:

  None

Hardware configuration:

  Model: RX2540 M4

Target Release: 
  
  RHOCP4.9

Additional Info:

  None

Comment 4 Lubov 2021-08-31 05:14:57 UTC
Could you verify this, please?

Comment 5 Fujitsu container team 2021-08-31 05:32:22 UTC
Hi, Lubov,

Yes, Fujitsu is going to verify this once a nightly build which includes PR 166 is released.

Best Regards,
Yasuhiro Futakawa

Comment 7 Lubov 2021-09-30 13:39:57 UTC
(In reply to Fujitsu container team from comment #5)
> Hi, Lubov,
> 
> Yes, Fujitsu is going to verify this once a nightly build which includes
> PR 166 is released.
> 
> Best Regards,
> Yasuhiro Futakawa

Hi, I believe the PR is in, could you verify it, please?

Comment 8 Fujitsu container team 2021-10-01 01:13:32 UTC
Hi, Lubov,

> Hi, I believe the PR is in, could you verify it, please?

Correct, but we found the following bugs during verification.
We have already sent patches to Ironic community.

https://review.opendev.org/c/openstack/ironic/+/809022
https://review.opendev.org/c/openstack/ironic/+/809023

We are going to merge the above patches into the next version of OCP, and backport them to OCP 4.9.z.
https://bugzilla.redhat.com/show_bug.cgi?id=2005163
https://bugzilla.redhat.com/show_bug.cgi?id=2005165

We plan to verify it again after the backport is complete. 
I believe we can then close this BZ.

Best Regards,
Yasuhiro Futakawa

Comment 9 Lubov 2021-10-05 08:39:28 UTC
assigning to @fj-lsoft-rh-cnt.fujitsu.com to clear our backlog

Comment 10 Mike Fiedler 2021-10-07 18:51:58 UTC
@fj-lsoft-rh-cnt.fujitsu.com @shardy Can someone verify this bug fix on 4.9?

Comment 11 Jacob Anders 2021-10-10 22:57:54 UTC
(In reply to Mike Fiedler from comment #10)
> @fj-lsoft-rh-cnt.fujitsu.com @shardy Can someone verify this bug fix
> on 4.9?

Hi Mike,

This isn't yet backported to 4.9 as we need to have https://bugzilla.redhat.com/show_bug.cgi?id=2011753 verified first (which I've been working on in collaboration with our Fujitsu colleagues who submitted the Ironic patches).

I will need to create a new public BZ to cover this issue as well as https://bugzilla.redhat.com/show_bug.cgi?id=1986656 in order to be able to merge the fix due to GH/OCP bot automation requirements (this BZ won't be able to be used as it is assigned to the Fujitsu group).

When https://bugzilla.redhat.com/show_bug.cgi?id=2011753 is VERIFIED, I will create a new 4.9 BZ and close off this and https://bugzilla.redhat.com/show_bug.cgi?id=1986656 as duplicates of the new BZ. I hope this will happen this week or next week.

Meanwhile, reassigning this to myself.

@Steve @Fujitsu team, feel free to clear needinfo unless you wanted to add anything more to my comment?

Comment 12 Jacob Anders 2021-10-11 10:44:33 UTC
Further work on this will continue in https://bugzilla.redhat.com/show_bug.cgi?id=2012798, which is a public version of this BZ (created due to Github automation requirements).

*** This bug has been marked as a duplicate of bug 2012798 ***

Comment 13 Red Hat Bugzilla 2023-09-15 01:12:13 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days

