https://prow.ci.openshift.org/job-history/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.7-e2e-metal-ipi

Example failure: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.7-e2e-metal-ipi/1350998892358930432

It has blocked 3 of the last 5 nightly promotions.

```
level=debug msg=module.masters.ironic_deployment.openshift-master-deployment[1]: Creation complete after 1m31s [id=8f4ed831-bbf9-4a54-9187-463cda007444]
level=debug msg=module.masters.ironic_deployment.openshift-master-deployment[2]: Still creating... [1m40s elapsed]
level=debug msg=module.masters.ironic_deployment.openshift-master-deployment[2]: Still creating... [1m50s elapsed]
level=debug msg=module.masters.ironic_deployment.openshift-master-deployment[2]: Still creating... [2m0s elapsed]
level=debug msg=module.masters.ironic_deployment.openshift-master-deployment[2]: Creation complete after 2m1s [id=4abe1cd8-7190-4258-bc1f-107ae6541241]
level=error
level=error msg=Error: cannot go from state 'deploy failed' to state 'manageable'
level=error
level=error msg= on ../../tmp/openshift-install-939336658/masters/main.tf line 38, in resource "ironic_deployment" "openshift-master-deployment":
level=error msg= 38: resource "ironic_deployment" "openshift-master-deployment" {
level=error
level=error
level=fatal msg=failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed to apply Terraform: failed to complete the change
```
There's a high probability that https://github.com/metal3-io/baremetal-operator/pull/745 will fix this particular error message and allow us to retry, but the priority is to investigate why it is failing in the first place: it's quite possible that any retries will also fail for the same reasons.
I've run the IPv4 job locally several times trying to reproduce this and it has finished to completion each time. Looking at the job history, there have been no failures in the last 13 runs, so it looks like whatever caused this has passed. Finding the root cause was difficult because we don't capture the metal3 logs for the ironic containers running on the bootstrap node; I'm working on a patch to capture these in future: https://github.com/openshift/release/pull/15081/files

But given the problem seems to have been transient, are you OK with closing this?
It does seem this has cleared up in the last 24hrs. I'm good with closing for now.
I think this may still be happening. I saw these two jobs when looking at why our nightly payload got rejected recently:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.7-e2e-metal-ipi-ovn-ipv6/1356109370693259264
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.7-e2e-metal-ipi/1356193268836077568

It might be easier to see that it's happening with this search:

https://search.ci.openshift.org/?search=cannot+go+from+state+%27deploy+failed%27+to+state+%27manageable%27&maxAge=168h&context=1&type=build-log&name=&maxMatches=5&maxBytes=20971520&groupBy=job

Do we need to re-open this bug?
I noticed the issue again during my build watcher shift: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-serial-4.7/1357294675551064064 Given this and comment #4 earlier, I'm reopening the bug. Feel free to close again if this is not appropriate.
(In reply to jamo luhrsen from comment #4)
> I think this may still be happening. I saw these two jobs when looking at
> why our nightly payload got rejected recently.
>
> https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.7-e2e-metal-ipi-ovn-ipv6/1356109370693259264

^^ This one doesn't have the bootstrap logs saved, so I'm going to deal with the CI run below and assume for now it's the same problem.

> https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.7-e2e-metal-ipi/1356193268836077568

The underlying problem can be found in the ironic conductor bootstrap logs (./bootstrap/pods/c0117539972b.log): qemu-img core dumped.

```
2021-02-01 11:19:52.184 1 ERROR ironic.conductor.utils [-] Agent returned error for deploy step {'step': 'write_image', 'priority': 80, 'argsinfo': None, 'interface': 'deploy'} on node 5ee1d1c3-7a50-4ed4-a8d6-8b6f1788f16c : Error performing deploy_step write_image: Error writing image to device: Writing image to device /dev/sda failed with exit code 134.
stdout:
write_image.sh: Erasing existing GPT and MBR data structures from /dev/sda
Creating new GPT entries.
GPT data structures destroyed! You may now partition the disk using fdisk or other utilities.
write_image.sh: Imaging /tmp/rhcos-47.83.202101161239-0-compressed.x86_64.qcow2 to /dev/sda
stderr:
33+0 records in
33+0 records out
16896 bytes (17 kB, 16 KiB) copied, 0.000969165 s, 17.4 MB/s
33+0 records in
33+0 records out
16896 bytes (17 kB, 16 KiB) copied, 0.000461273 s, 36.6 MB/s
qemu: qemu_thread_create: Resource temporarily unavailable
/usr/lib/python3.6/site-packages/ironic_python_agent/extensions/../shell/write_image.sh: line 51: 1187 Aborted (core dumped) qemu-img convert -t directsync -O host_device $IMAGEFILE $DEVICE
```

> it might be easier to see that it's happening with this search:
>
> https://search.ci.openshift.org/?search=cannot+go+from+state+%27deploy+failed%27+to+state+%27manageable%27&maxAge=168h&context=1&type=build-log&name=&maxMatches=5&maxBytes=20971520&groupBy=job

This error message is fairly generic and just indicates that there was a problem with ironic on the bootstrap node; the underlying problem could be anything provisioning-related.

(In reply to Jakub Hrozek from comment #5)
> I noticed the issue again during my build watcher shift:
> https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-serial-4.7/1357294675551064064
> Given this and comment #4 earlier, I'm reopening the bug. Feel free to close
> again if this is not appropriate.

This isn't an IPI job, did you paste the wrong link?
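As a side note, the abort above is the kind of failure you'd expect when qemu-img hits a per-process memory cap while spawning its worker threads. Below is a minimal local sketch of that failure mode, not taken from this bug: the paths, image size, 256 MiB cap, and the use of -O raw (instead of the -t directsync -O host_device invocation from the CI log) are all illustrative assumptions.

```
# Illustrative reproducer only (assumed paths/values): convert a scratch
# qcow2 image under a deliberately small virtual-memory cap.
IMAGEFILE=/tmp/scratch.qcow2   # placeholder path
OUTFILE=/tmp/scratch.raw       # placeholder path

qemu-img create -f qcow2 "$IMAGEFILE" 1G

# ulimit -v takes KiB and only affects the subshell. Depending on the host
# and image, qemu-img may abort with "qemu_thread_create: Resource
# temporarily unavailable" (as in the CI log), fail with a plain allocation
# error, or even succeed for a tiny sparse image; the exact cap needed to
# trigger the failure varies.
(
  ulimit -v 262144             # 256 MiB address-space cap
  qemu-img convert -O raw "$IMAGEFILE" "$OUTFILE"
)
echo "qemu-img exit status: $?"   # 134 == aborted (128 + SIGABRT)
```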
I think we can increase the memory limit for image convert.

In ironic.conf we can set the image_convert_memory_limit option in the [disk_utils] section. The default is 1024 (https://docs.openstack.org/ironic/victoria/configuration/sample-config.html); in OSP the default is 2048 (used by TripleO).

```
[disk_utils]
image_convert_memory_limit = 2048
```
(In reply to Iury Gregory Melo Ferreira from comment #7)
> I think we can increase the memory limit for image convert.
>
> In ironic.conf we can set the image_convert_memory_limit option in the
> [disk_utils] section. The default is 1024
> (https://docs.openstack.org/ironic/victoria/configuration/sample-config.html);
> in OSP the default is 2048 (used by TripleO).
>
> ```
> [disk_utils]
> image_convert_memory_limit = 2048
> ```

Memory might be the problem (maybe the CI host is running out or something), but I don't think that config option will help: it controls the memory for qemu-img runs in ironic itself. The command with the problem is being run in IPA (see ./ironic_python_agent/shell/write_image.sh).
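To make the distinction concrete: the cap in question is applied inside the IPA ramdisk, around the qemu-img invocation itself, rather than read from ironic.conf. A rough sketch of that pattern is below, assuming a ulimit-based cap in a shell wrapper; the variable handling, values, and exact mechanism are illustrative and may not match the real write_image.sh.

```
# Sketch of a per-process memory cap around qemu-img in an IPA-style shell
# script (assumed mechanism; the real write_image.sh may differ).
IMAGEFILE=$1   # source qcow2, e.g. the downloaded RHCOS image
DEVICE=$2      # target block device, e.g. /dev/sda

# ulimit -v takes KiB: 1048576 KiB = 1 GiB. With this mechanism, "doubling
# the limit to 2G" would mean raising the value to 2097152.
ulimit -v 1048576

qemu-img convert -t directsync -O host_device "$IMAGEFILE" "$DEVICE"
```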
I can't reliably reproduce this, but it does seem to go away when I increase the memory limit enforced in ironic_python_agent/shell/write_image.sh. I've proposed we double the limit there to 2G: https://review.opendev.org/c/openstack/ironic-python-agent/+/778035
The fix is merged into IPA and is now being proposed for the downloader image.
(In reply to Derek Higgins from comment #10)
> The fix is merged into IPA and is now being proposed for the downloader image.

This patch is now in the release. If you don't see the error any longer, are you happy to close the bug?
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438