Bug 1917482 - periodic-ci-openshift-release-master-ocp-4.7-e2e-metal-ipi failing with "cannot go from state 'deploy failed' to state 'manageable'"
Summary: periodic-ci-openshift-release-master-ocp-4.7-e2e-metal-ipi failing with "cannot go from state 'deploy failed' to state 'manageable'"
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Bare Metal Hardware Provisioning
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Derek Higgins
QA Contact: Amit Ugol
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-01-18 14:59 UTC by Seth Jennings
Modified: 2021-07-27 22:36 UTC
CC: 8 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: While provisioning an image to nodes, qemu-img was restricted to 1G of RAM.
Consequence: This sometimes resulted in qemu-img crashing.
Fix: Increased the limit to 2G.
Result: qemu-img now completes provisioning reliably.
Clone Of:
Environment:
Last Closed: 2021-07-27 22:36:15 UTC
Target Upstream Version:
Embargoed:




Links:
- Github: openshift/ironic-ipa-downloader pull 62 (open): Update ipa ramdisk version for OCP 4.8 (last updated 2021-03-11 11:53:47 UTC)
- Red Hat Product Errata: RHSA-2021:2438 (last updated 2021-07-27 22:36:33 UTC)

Description Seth Jennings 2021-01-18 14:59:29 UTC
https://prow.ci.openshift.org/job-history/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.7-e2e-metal-ipi

Example failure:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.7-e2e-metal-ipi/1350998892358930432

It has blocked 3 of the last 5 nightly promotions.

 level=debug msg=module.masters.ironic_deployment.openshift-master-deployment[1]: Creation complete after 1m31s [id=8f4ed831-bbf9-4a54-9187-463cda007444]
level=debug msg=module.masters.ironic_deployment.openshift-master-deployment[2]: Still creating... [1m40s elapsed]
level=debug msg=module.masters.ironic_deployment.openshift-master-deployment[2]: Still creating... [1m50s elapsed]
level=debug msg=module.masters.ironic_deployment.openshift-master-deployment[2]: Still creating... [2m0s elapsed]
level=debug msg=module.masters.ironic_deployment.openshift-master-deployment[2]: Creation complete after 2m1s [id=4abe1cd8-7190-4258-bc1f-107ae6541241]
level=error
level=error msg=Error: cannot go from state 'deploy failed' to state 'manageable'
level=error
level=error msg=  on ../../tmp/openshift-install-939336658/masters/main.tf line 38, in resource "ironic_deployment" "openshift-master-deployment":
level=error msg=  38: resource "ironic_deployment" "openshift-master-deployment" {
level=error
level=error
level=fatal msg=failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed to apply Terraform: failed to complete the change

Comment 1 Zane Bitter 2021-01-19 17:40:27 UTC
There's a high probability that https://github.com/metal3-io/baremetal-operator/pull/745 will fix this particular error message and allow us to retry, but the priority is to investigate why it is failing in the first place; it's quite possible that any retries will also fail for the same reasons.

Comment 2 Derek Higgins 2021-01-20 11:35:55 UTC
I've run the IPv4 job locally several times trying to reproduce this, and it's finished to completion each time. Looking at the job history, there have been no failures in the last 13 runs; it looks like whatever caused this has passed.

Finding the root cause was difficult because we don't capture the metal3 logs for the ironic containers running on the bootstrap node. I'm working on a patch to capture these in the future:
https://github.com/openshift/release/pull/15081/files

But given the problem seems to have been transient, are you OK with closing this?

Comment 3 Seth Jennings 2021-01-20 15:16:27 UTC
It does seem this has cleared up in the last 24hrs.  I'm good with closing for now.

Comment 5 Jakub Hrozek 2021-02-04 19:25:22 UTC
I noticed the issue again during my build watcher shift:
 https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-serial-4.7/1357294675551064064
Given this and comment #4 earlier, I'm reopening the bug. Feel free to close again if this is not appropriate.

Comment 6 Derek Higgins 2021-02-08 12:37:21 UTC
(In reply to jamo luhrsen from comment #4)
> I think this may still be happening. I saw these two jobs when looking at
> why our nightly payload got
> rejected recently.
>  
> https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.7-e2e-metal-ipi-ovn-ipv6/1356109370693259264
^^ This one doesn't have the bootstrap logs saved; I'm going to deal with the CI run
below and assume for now it's the same problem.

>  
> https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.7-e2e-metal-ipi/1356193268836077568

The underlying problem can be found in the ironic conductor bootstrap logs (./bootstrap/pods/c0117539972b.log): qemu-img core dumped.

2021-02-01 11:19:52.184 1 ERROR ironic.conductor.utils [-] Agent returned error for deploy step {'step': 'write_image', 'priority': 80, 'argsinfo': None, 'interface': 'deploy'} on node 5ee1d1c3-7a50-4ed4-a8d6-8b6f1788f16c : Error performing deploy_step write_image: Error writing image to device: Writing image to device /dev/sda failed with exit code 134. stdout: write_image.sh: Erasing existing GPT and MBR data structures from /dev/sda
Creating new GPT entries.
GPT data structures destroyed! You may now partition the disk using fdisk or
other utilities.
write_image.sh: Imaging /tmp/rhcos-47.83.202101161239-0-compressed.x86_64.qcow2 to /dev/sda
. stderr: 33+0 records in
33+0 records out
16896 bytes (17 kB, 16 KiB) copied, 0.000969165 s, 17.4 MB/s
33+0 records in
33+0 records out
16896 bytes (17 kB, 16 KiB) copied, 0.000461273 s, 36.6 MB/s
qemu: qemu_thread_create: Resource temporarily unavailable
/usr/lib/python3.6/site-packages/ironic_python_agent/extensions/../shell/write_image.sh: line 51:  1187 Aborted                 (core dumped) qemu-img convert -t directsync -O host_device $IMAGEFILE $DEVICE
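The "qemu_thread_create: Resource temporarily unavailable" line is the tell: qemu-img spawns worker threads for the conversion, and thread creation fails when the process runs into an address-space cap. A minimal sketch of that failure mode, assuming the limit is applied with ulimit -v (the sizes and paths here are illustrative, not taken from the CI host):

```
# Cap the shell's virtual address space, then run the same conversion that
# write_image.sh runs. With a tight cap, qemu-img's thread creation can fail
# with EAGAIN and the process aborts (exit 134 = 128 + SIGABRT, as in the log).
ulimit -v 1048576   # 1 GiB cap; ulimit -v takes KiB
qemu-img convert -t directsync -O host_device \
    /tmp/rhcos-47.83.202101161239-0-compressed.x86_64.qcow2 /dev/sda
```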


> 
> it might be easier to see that it's happening with this search:
>  
> https://search.ci.openshift.org/?search=cannot+go+from+state+%27deploy+failed%27+to+state+%27manageable%27&maxAge=168h&context=1&type=build-log&name=&maxMatches=5&maxBytes=20971520&groupBy=job

This error message is fairly generic and just indicates that there was a problem with
ironic on the bootstrap node; the underlying problem could be anything provisioning-related.



(In reply to Jakub Hrozek from comment #5)
> I noticed the issue again during my build watcher shift:
> https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-serial-4.7/1357294675551064064
> Given this and comment #4 earlier, I'm reopening the bug. Feel free to close
> again if this is not appropriate.

This isn't an IPI job; did you paste the wrong link?

Comment 7 Iury Gregory Melo Ferreira 2021-02-26 01:37:56 UTC
I think we can increase the memory limit for image convert.

In ironic.conf we can set the image_convert_memory_limit option in the [disk_utils] section. The default is 1024 (https://docs.openstack.org/ironic/victoria/configuration/sample-config.html); in OSP the default is 2048 (used by TripleO):

```
[disk_utils]
image_convert_memory_limit = 2048
```
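As I understand it (an assumption about the implementation, not verified against this environment), ironic enforces that option as an address-space cap on the qemu-img process it spawns, roughly the shell equivalent of:

```
# Rough shell equivalent of image_convert_memory_limit = 2048: cap the
# qemu-img address space at 2 GiB with prlimit (util-linux). The image and
# device names are placeholders.
prlimit --as=$((2048 * 1024 * 1024)) -- \
    qemu-img convert -O host_device source.qcow2 /dev/sda
```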

Comment 8 Derek Higgins 2021-03-01 10:51:49 UTC
(In reply to Iury Gregory Melo Ferreira from comment #7)
> I think we can increase the memory limit for image convert.
> 
> In ironic.conf we can set the image_convert_memory_limit option in the
> [disk_utils] section. The default is 1024
> (https://docs.openstack.org/ironic/victoria/configuration/sample-config.html);
> in OSP the default is 2048 (used by TripleO):
> 
> ```
> [disk_utils]
> image_convert_memory_limit = 2048
> ```

Memory might be the problem (maybe the CI host is running out or something),
but I don't think that config option will help; it controls the memory for
qemu-img when ironic itself runs it. The command with the problem is being run in IPA
(see ./ironic_python_agent/shell/write_image.sh).

Comment 9 Derek Higgins 2021-03-01 17:17:56 UTC
I can't reliably reproduce this, but it does seem to go away when I increase the memory limit enforced in ironic_python_agent/shell/write_image.sh.

I've proposed we double the limit here to 2G:
https://review.opendev.org/c/openstack/ironic-python-agent/+/778035
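A sketch of the shape of that change, assuming write_image.sh sets the cap with ulimit -v before invoking qemu-img (the exact line in the upstream script may differ; see the review above for the real diff):

```
# before: ulimit -v 1048576   # 1 GiB address-space cap (ulimit -v is in KiB)
ulimit -v 2097152   # after: 2 GiB, headroom for qemu-img's worker threads
qemu-img convert -t directsync -O host_device "$IMAGEFILE" "$DEVICE"
```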

Comment 10 Derek Higgins 2021-03-11 11:53:48 UTC
The fix is merged into IPA, and is now being proposed for the downloader image.

Comment 11 Derek Higgins 2021-04-08 11:34:28 UTC
(In reply to Derek Higgins from comment #10)
> The fix is merged into IPA, and is now being proposed for the downloader image.

This patch is now in the release. If you don't see the error any longer, are you happy to close the bug?

Comment 15 errata-xmlrpc 2021-07-27 22:36:15 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

