Description of problem:
I've encountered situations where the VMs time out while PXE booting after the image deployment stage. I suspect this is caused by high disk I/O load on the undercloud, because I am able to reproduce it with a larger number of nodes: 3 controllers, 2 computes, 1 ceph node.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Deploy overcloud: 3 ctrl + 2 computes + 1 ceph node
2. Check the VMs console

Actual results:
In the initial phase all the VMs PXE boot and do the image deployment. After the first reboot some of the nodes are unable to PXE boot (see attached screenshot). After a manual reboot, the affected VM is able to PXE boot correctly.

Additional info:
I noticed that during the image deployment the undercloud load goes very high, and I suspect this is causing the timeouts. I could see:

[root@undercloud ~]# uptime
 03:34:41 up 2 days, 18:33, 2 users, load average: 13.40, 6.09, 2.95

As a workaround I set max_concurrent_builds to 2 in nova.conf, which limits the number of simultaneous instance builds to 2:

crudini --set /etc/nova/nova.conf DEFAULT max_concurrent_builds 2; openstack-service restart nova

In my opinion this is more of an environmental issue, but I opened this bug in case others hit it as well.
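For anyone trying to confirm the same symptom, the load check above can be scripted. This is only a rough sketch: the function names one_min_load and load_exceeds are made up for illustration, the threshold is arbitrary and should be picked relative to the undercloud's CPU count, and the parsing assumes uptime prints the load average with a dot as the decimal separator, as in the output quoted above.

```shell
# Extract the 1-minute load average from one line of `uptime` output.
one_min_load() {
    printf '%s\n' "$1" | sed -n 's/.*load average: *\([0-9.]*\),.*/\1/p'
}

# Succeed (exit status 0) when the 1-minute load is above the threshold.
# $1 = uptime output line, $2 = threshold (hypothetical, choose per host)
load_exceeds() {
    awk -v load="$(one_min_load "$1")" -v max="$2" \
        'BEGIN { exit !(load > max) }'
}
```

It could be run on the undercloud during the image deployment stage, for example: load_exceeds "$(uptime)" 8 && echo "undercloud load is high"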
Created attachment 1177212: console screenshot
Closing this out: as it's an issue with the environment, no fix is planned.