Created attachment 1766250 [details]
ironic-conductor.logs

Description of problem:

A virtual master is not provisioned due to an error from ironic.conductor. When this happens it blocks our CI run.

cnfdd5-master-0 is not provisioned:

[root@cnfdd5-installer ~]# oc get node
NAME                                             STATUS   ROLES            AGE     VERSION
dhcp19-17-115.clus2.t5g.lab.eng.bos.redhat.com   Ready    master,virtual   7h48m   v1.20.0+39c0afe
dhcp19-17-116.clus2.t5g.lab.eng.bos.redhat.com   Ready    master,virtual   7h48m   v1.20.0+39c0afe

[root@cnfdd5-installer ~]# oc get bmh -A
NAMESPACE               NAME              STATE   CONSUMER                ONLINE   ERROR
openshift-machine-api   cnfdd5-master-0           cnfdd5-j2ps7-master-0   true
openshift-machine-api   cnfdd5-master-1           cnfdd5-j2ps7-master-1   true
openshift-machine-api   cnfdd5-master-2           cnfdd5-j2ps7-master-2   true
openshift-machine-api   cnfdd5-worker-0                                   true
openshift-machine-api   cnfdd5-worker-1                                   true
openshift-machine-api   cnfdd5-worker-2                                   true

From the ironic.conductor logs:

2021-03-25 01:18:06.828 1 ERROR ironic.conductor.utils [-] Agent returned error for deploy step {'step': 'write_image', 'priority': 80, 'argsinfo': None, 'interface': 'deploy'} on node ca1ba6d2-8806-48cd-b708-857d248f6a79 : Error performing deploy_step write_image: Error writing image to device: Writing image to device /dev/sda failed with exit code 1.
stdout: write_image.sh: Erasing existing GPT and MBR data structures from /dev/sda
Creating new GPT entries.
GPT data structures destroyed! You may now partition the disk using fdisk or other utilities.
write_image.sh: Imaging /tmp/rhcos-48.83.202103221318-0-compressed.x86_64.qcow2 to /dev/sda
.
stderr: 33+0 records in
33+0 records out
16896 bytes (17 kB, 16 KiB) copied, 0.00319755 s, 5.3 MB/s
33+0 records in
33+0 records out
16896 bytes (17 kB, 16 KiB) copied, 0.000414175 s, 40.8 MB/s
qemu-img: /dev/sda: error while converting host_device: Failed to clear the new image's first sector: Device or resource busy

Version-Release number of selected component (if applicable):
4.8.0-0.nightly-2021-03-24-200346

How reproducible:
Inconsistent; sometimes the cluster is deployed successfully. When the failure does happen it blocks our CI run.

Steps to Reproduce:
1. Deploy an OCP 4.8 cluster with 3 virtual masters.

Actual results:
One master wasn't provisioned due to an error from ironic.conductor.

Expected results:
All the masters are provisioned and join the cluster.

Additional info:
Attaching ironic.conductor logs and must-gather.
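The "Device or resource busy" from qemu-img reads like a transient EBUSY: something (for example udev re-probing /dev/sda right after write_image.sh wiped the GPT) briefly holds the device open when qemu-img tries to clear its first sector. One generic mitigation for that class of failure is to wait for udev to settle and retry the write. The sketch below is purely illustrative and hypothetical — the `retry` helper and the usage lines are not the actual write_image.sh code:

```shell
#!/bin/bash
# Hypothetical retry helper: run a command up to $attempts times with a
# $delay-second pause between tries, to ride out a transient
# "Device or resource busy" while udev finishes re-reading the wiped disk.
retry() {
  local attempts=$1 delay=$2
  shift 2
  local i
  for ((i = 1; i <= attempts; i++)); do
    "$@" && return 0
    echo "attempt $i/$attempts failed, retrying in ${delay}s" >&2
    sleep "$delay"
  done
  return 1
}

# Hypothetical usage (not the actual write_image.sh change; $IMAGE assumed):
#   udevadm settle   # let udev finish processing events for the wiped disk
#   retry 3 5 qemu-img convert -O host_device "$IMAGE" /dev/sda
```

Whether a retry of this shape is what the eventual fix does is not confirmed here; it only demonstrates why a transient holder of the device would make the failure intermittent.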
Created attachment 1766252 [details]
must-gather
Created attachment 1767731 [details]
IPA journal
I don't see any obvious problem in the IPA log (extracted from the conductor logs and attached).

When this problem occurs, is it always on the same master node? Also, can you double-check whether there is any shortage of storage on the virt host?
Hey,

It just happened with another cluster on a different hypervisor.

[root@cnfdc9-installer ~]# oc get node
NAME                                             STATUS   ROLES    AGE   VERSION
dhcp19-17-146.clus2.t5g.lab.eng.bos.redhat.com   Ready    master   13h   v1.20.0+29a606d
dhcp19-17-204.clus2.t5g.lab.eng.bos.redhat.com   Ready    master   13h   v1.20.0+29a606d

[root@cnfdc9-installer ~]# oc get bmh -A
NAMESPACE               NAME              STATE   CONSUMER                ONLINE   ERROR
openshift-machine-api   cnfdc9-master-0           cnfdc9-mt8n5-master-0   true
openshift-machine-api   cnfdc9-master-1           cnfdc9-mt8n5-master-1   true
openshift-machine-api   cnfdc9-master-2           cnfdc9-mt8n5-master-2   true
openshift-machine-api   cnfdc9-worker-0                                   true
openshift-machine-api   cnfdc9-worker-1                                   true

[root@cnfdc9-installer ~]# oc version
Client Version: 4.8.0-0.nightly-2021-03-30-123133
Server Version: 4.8.0-0.nightly-2021-03-30-123133
Kubernetes Version: v1.20.0+29a606d

There is enough storage on the hypervisor:

[kni@cnfdc1 ~]$ df -h
Filesystem                    Size  Used  Avail  Use%  Mounted on
devtmpfs                      189G     0   189G    0%  /dev
tmpfs                         189G  168K   189G    1%  /dev/shm
tmpfs                         189G  4.1G   185G    3%  /run
tmpfs                         189G     0   189G    0%  /sys/fs/cgroup
/dev/mapper/rhel_cnfdc1-root   50G  4.2G    46G    9%  /
/dev/nvme0n1                  1.5T  586G   905G   40%  /storage
/dev/sda1                    1006M  167M   840M   17%  /boot
/dev/mapper/rhel_cnfdc1-home  392G   11G   382G    3%  /home
tmpfs                          38G  828K    38G    1%  /run/user/1000
cnfdb0.clus2.t5g.lab.eng.bos.redhat.com:/var/www/kcli-openshift4-baremetal  3.3T  503G  2.9T  15%  /home/kni/cnf-plans
And again in our CI cluster:

[root@cnfdd5-installer ~]# oc get node
NAME                                             STATUS   ROLES    AGE   VERSION
dhcp19-17-115.clus2.t5g.lab.eng.bos.redhat.com   Ready    master   53m   v1.20.0+29a606d
dhcp19-17-116.clus2.t5g.lab.eng.bos.redhat.com   Ready    master   53m   v1.20.0+29a606d

[root@cnfdd5-installer ~]# oc get bmh -A
NAMESPACE               NAME              STATE   CONSUMER                ONLINE   ERROR
openshift-machine-api   cnfdd5-master-0           cnfdd5-wc7fb-master-0   true
openshift-machine-api   cnfdd5-master-1           cnfdd5-wc7fb-master-1   true
openshift-machine-api   cnfdd5-master-2           cnfdd5-wc7fb-master-2   true
openshift-machine-api   cnfdd5-worker-0                                   true
openshift-machine-api   cnfdd5-worker-1                                   true
openshift-machine-api   cnfdd5-worker-2                                   true

There is enough storage on the hypervisor:

[kni@cnfdd1 ~]$ df -h
Filesystem                      Size  Used  Avail  Use%  Mounted on
devtmpfs                        189G     0   189G    0%  /dev
tmpfs                           189G   84K   189G    1%  /dev/shm
tmpfs                           189G  4.0G   185G    3%  /run
tmpfs                           189G     0   189G    0%  /sys/fs/cgroup
/dev/mapper/rhel_cnfdd100-root   50G  4.1G    46G    9%  /
/dev/mapper/rhel_cnfdd100-home  392G  3.9G   388G    1%  /home
/dev/sda1                      1006M  167M   840M   17%  /boot
tmpfs                            38G   24K    38G    1%  /run/user/1000
/dev/mapper/rhel_cnfdd1-home    1.4T  539G   876G   39%  /storage
cnfdb0.clus2.t5g.lab.eng.bos.redhat.com:/var/www/kcli-openshift4-baremetal  3.3T  502G  2.9T  15%  /home/kni/cnf-plans
tmpfs                            38G     0    38G    0%  /run/user/1001
This one is from another lab located in TLV:

[root@cnfdt08-installer ~]# oc get node
NAME                                  STATUS   ROLES    AGE   VERSION
dhcp-55-185.lab.eng.tlv2.redhat.com   Ready    master   19h   v1.20.0+29a606d
dhcp-55-218.lab.eng.tlv2.redhat.com   Ready    master   19h   v1.20.0+29a606d

[root@cnfdt08-installer ~]# oc version
Client Version: 4.8.0-0.nightly-2021-03-30-123133
Server Version: 4.8.0-0.nightly-2021-03-30-123133
Kubernetes Version: v1.20.0+29a606d

[root@cnfdt08-installer ~]# oc get bmh -A
NAMESPACE               NAME               STATE   CONSUMER                 ONLINE   ERROR
openshift-machine-api   cnfdt08-master-0           cnfdt08-fwqhp-master-0   true
openshift-machine-api   cnfdt08-master-1           cnfdt08-fwqhp-master-1   true
openshift-machine-api   cnfdt08-master-2           cnfdt08-fwqhp-master-2   true
openshift-machine-api   cnfdt08-worker-0                                    true
openshift-machine-api   cnfdt08-worker-1                                    true

[root@cnfdt08-installer ~]# logout
Connection to 10.46.55.195 closed.

[kni@cnfdt03 ~]$ df -h
Filesystem              Size  Used  Avail  Use%  Mounted on
devtmpfs                189G     0   189G    0%  /dev
tmpfs                   189G   84K   189G    1%  /dev/shm
tmpfs                   189G  4.0G   185G    3%  /run
tmpfs                   189G     0   189G    0%  /sys/fs/cgroup
/dev/mapper/vg0-LVroot  435G  5.4G   407G    2%  /
/dev/sda2               976M  184M   726M   21%  /boot
/dev/nvme0n1            1.5T  553G   937G   38%  /var/lib/libvirt/images
tmpfs                    38G   28K    38G    1%  /run/user/1000
cnfdb0.clus2.t5g.lab.eng.bos.redhat.com:/var/www/kcli-openshift4-baremetal  3.3T  505G  2.9T  15%  /home/kni/cnf-plans
I took a look around the host mentioned in comment #6 (cnfdt08). This appears to be a different problem: there is no mention of a busy sda in the ironic logs, and ironic successfully provisioned all 3 masters.

2021-03-30 17:19:16.095 1 INFO ironic.conductor.task_manager [req-b8144aae-ac4f-476c-8a11-7a38da25413c - - - - -] Node 136f1c93-2957-4fb0-b4bc-2c6be19bb0bf moved to provision state "active" from state "deploying"; target provision state is "None"
2021-03-30 17:19:22.351 1 INFO ironic.conductor.task_manager [req-8716fa3a-4758-4c4b-a346-090f8971cc9d - - - - -] Node 3bd66761-807b-4b10-9f31-e9bbf51f9e50 moved to provision state "active" from state "deploying"; target provision state is "None"
2021-03-30 17:19:22.367 1 INFO ironic.conductor.task_manager [req-ee1b565a-4164-438f-b3fb-c9ccdd3780b1 - - - - -] Node e25e5090-aecc-40e6-9d83-8949f880276c moved to provision state "active" from state "deploying"; target provision state is "None"

It looks like master-0 is stuck trying to get the ignition config:

[root@cnfdt03 var]# virsh dumpxml cnfdt08-master-0 | grep service
      <source mode='bind' host='127.0.0.1' service='52837' tls='no'/>
      <source mode='bind' host='127.0.0.1' service='52837' tls='no'/>

[root@cnfdt03 var]# nc localhost 52837
[***   ] A start job is running for Ignition (fetch) (21h 32min 33s / no limit)
[***   ] A start job is running for Ignition (fetch) (21h 32min 35s / no limit)[77559.953921] ignition[778]: GET https://10.46.55.79:22623/config/master: attempt #15513
[77559.958560] ignition[778]: GET error: Get "https://10.46.55.79:22623/config/master": dial tcp 10.46.55.79:22623: connect: network is unreachable
[   **] A start job is running for Ignition (fetch) (21h 32min 37s / no limit)^C

So there is likely a problem after ironic booted the image into RHCOS. Can you open this as a separate issue? (Do you have the complete console log for these nodes? I can't see whether there were any errors further up.) If you pass on the details of one of the other systems with the qemu-img error, I'll take a look.
Reducing the priority as this doesn't appear to be happening as much as was thought. All of the failed clusters we looked at that reproduce a master failing to join the cluster are a result of problems downloading ignition data and have nothing to do with this (/dev/sda is busy) bug. The other problem is now reported here: https://bugzilla.redhat.com/show_bug.cgi?id=1946079

Once this ignition/network problem is resolved we can revisit this (/dev/sda is busy) problem, as it should be easier to identify.
Happened to me a couple of times in a row:

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/17425/rehearse-17425-pull-ci-openshift-assisted-service-master-e2e-metal-assisted-installer-operator/1384102388654149632
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/17425/rehearse-17425-pull-ci-openshift-assisted-service-master-e2e-metal-assisted-installer-operator/1384081617571549184

With the regular baremetalds-devscripts-setup registry step on prow.

Raising the priority as it produces a lot of flakes...
Hi,

It seems to be happening again and it is blocking our CI cluster.

level=error msg=Error: cannot go from state 'deploy failed' to state 'manageable' , last error was 'Agent returned error for deploy step {'step': 'write_image', 'priority': 80, 'argsinfo': None, 'interface': 'deploy'} on node cc00eb0d-f2f0-405d-9ad4-2056a1a550c0 : Error performing deploy_step write_image: Error writing image to device: Writing image to device /dev/sda failed with exit code 1. stdout: write_image.sh: Erasing existing GPT and MBR data structures from /dev/sda
level=error msg=Creating new GPT entries.
level=error msg=GPT data structures destroyed! You may now partition the disk using fdisk or
level=error msg=other utilities.
level=error msg=write_image.sh: Imaging /tmp/cached-rhcos-48.83.202103221318-0-openstack.x86_64.qcow2 to /dev/sda
level=error msg=. stderr: 33+0 records in
level=error msg=33+0 records out
level=error msg=16896 bytes (17 kB, 16 KiB) copied, 0.00231626 s, 7.3 MB/s
level=error msg=33+0 records in
level=error msg=33+0 records out
level=error msg=16896 bytes (17 kB, 16 KiB) copied, 0.000494256 s, 34.2 MB/s
level=error msg=qemu-img: /dev/sda: error while converting host_device: Failed to clear the new image's first sector: Device or resource busy
level=error msg=.'
level=error
level=error msg=  on ../tmp/openshift-install-996911139/masters/main.tf line 38, in resource "ironic_deployment" "openshift-master-deployment":
level=error msg=  38: resource "ironic_deployment" "openshift-master-deployment" {
level=error
level=error
level=fatal msg=failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed to apply Terraform: failed to complete the change
level=debug msg=OpenShift Installer 4.8.0-0.nightly-2021-04-24-234710

[root@cnfdd5-installer manifests]# oc get node
NAME                                             STATUS     ROLES    AGE   VERSION
dhcp19-17-115.clus2.t5g.lab.eng.bos.redhat.com   NotReady   master   11m   v1.21.0-rc.0+bdfd684

[root@cnfdd5-installer manifests]# oc get bmh -A
NAMESPACE               NAME              STATE   CONSUMER                ONLINE   ERROR
openshift-machine-api   cnfdd5-master-0           cnfdd5-cnp4s-master-0   true
openshift-machine-api   cnfdd5-master-1           cnfdd5-cnp4s-master-1   true
openshift-machine-api   cnfdd5-master-2           cnfdd5-cnp4s-master-2   true
openshift-machine-api   cnfdd5-worker-0                                   true
openshift-machine-api   cnfdd5-worker-1                                   true
openshift-machine-api   cnfdd5-worker-2                                   true

[root@cnfdd5-installer manifests]# virsh console cnfdd5-master-2
Connected to domain cnfdd5-master-2
Escape character is ^]
[ ***  ] A start job is running for Ignition (fetch) (10min 43s / no limit)
BTW, the HV is certainly not overloaded:
```
$ free -m
              total        used        free      shared  buff/cache   available
Mem:         385320      166826      106059        3399      112434      212529
Swap:          4095        4080          15
```
Derek, can you add a link to the u/s patch?
(In reply to Yuval Kashtan from comment #18)
> Derek can you add a link to the u/s patch?

The patch to master had already been attached; I've added the backport to the stable branch.
Any objection to adding the tag (IPI) to the whiteboard?
Moving to MODIFIED since the fix should already be available in ironic-ipa-downloader: https://github.com/openshift/ironic-ipa-downloader/pull/68

(This PR covers this BZ and https://bugzilla.redhat.com/show_bug.cgi?id=1955114)
Tried to deploy twice with the latest 4.8 build and both deployments succeeded, so I'm closing the issue as not reproducible in 4.8.

[kni@provisionhost-0-0 ~]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-05-18-072155   True        False         34m     Cluster version is 4.8.0-0.nightly-2021-05-18-072155
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438