Created attachment 1546708 [details]
ironic.tar.gz

Description of problem:
Director-deployed OCP 3.11 deployment fails while Ironic is deploying the worker nodes:

(undercloud) [stack@undercloud-0 ~]$ openstack server list
+--------------------------------------+--------------------+--------+------------------------+----------------+--------+
| ID                                   | Name               | Status | Networks               | Image          | Flavor |
+--------------------------------------+--------------------+--------+------------------------+----------------+--------+
| 91f0adcf-ac6f-4283-8ead-87def0ab7fe0 | openshift-worker-1 | BUILD  |                        | overcloud-full | worker |
| 233daa2d-523e-453b-b6bd-2a1e62561920 | openshift-worker-2 | BUILD  |                        | overcloud-full | worker |
| 8d70b9ad-7474-43af-b529-eb834e818949 | openshift-infra-0  | ACTIVE | ctlplane=192.168.24.8  | overcloud-full | infra  |
| ce04f408-8a1d-4fd3-9a59-c5bec1fdeecf | openshift-master-1 | ACTIVE | ctlplane=192.168.24.20 | overcloud-full | master |
| e2b76597-7551-49e5-bc97-022c6dc81db2 | openshift-master-0 | ACTIVE | ctlplane=192.168.24.19 | overcloud-full | master |
| f90e7b89-72d1-496b-8ad3-faa39a25b6f9 | openshift-infra-1  | ACTIVE | ctlplane=192.168.24.11 | overcloud-full | infra  |
| 6022f7a2-c9d2-4da2-9422-3aeec0003e4b | openshift-worker-0 | BUILD  |                        | overcloud-full | worker |
| 22046fab-8ae4-45f1-83dd-9bf624bae02b | openshift-infra-2  | ACTIVE | ctlplane=192.168.24.21 | overcloud-full | infra  |
| f982330b-982e-4596-964e-bb51c04e5ffe | openshift-master-2 | ACTIVE | ctlplane=192.168.24.12 | overcloud-full | master |
+--------------------------------------+--------------------+--------+------------------------+----------------+--------+

Deployment of the openshift-worker nodes never succeeds.

Version-Release number of selected component (if applicable):
openstack-ironic-api:20190320.1

How reproducible:
100%

Steps to Reproduce:
1. Deploy an OCP 3.11 overcloud with 3 x master + 3 x infra + 3 x worker nodes. Worker nodes have 2 additional disks for Gluster storage.

Actual results:
Ironic fails to deploy the worker nodes.

Expected results:
No failures.

Additional info:
Attaching Ironic logs.
2019-03-21 17:01:54.094 1 ERROR ironic.drivers.modules.agent_client [req-4f1bc06c-2bbb-4b8b-9608-0abfc4b7ca9e - - - - -] Failed to connect to the agent running on node 7e542470-ed6a-4b9c-9a19-c47215d57407 for invoking command image.install_bootloader. Error: HTTPConnectionPool(host='192.168.24.17', port=9999): Read timed out. (read timeout=60): requests.exceptions.ReadTimeout: HTTPConnectionPool(host='192.168.24.17', port=9999): Read timed out. (read timeout=60)

Could this be a firewall or DHCP error? Could you check what is going on on the nodes themselves? The deploy logs you attached seem to show success even for the nodes that have failures in the ironic-conductor logs.
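For context on the error itself: ironic-conductor gives the agent a fixed 60-second window to answer each command. A minimal Python sketch of the kind of call involved (simplified and illustrative only; the real client lives in ironic.drivers.modules.agent_client, and the exact URL and payload here are assumptions):

import requests

agent_url = 'http://192.168.24.17:9999'  # the node's IPA endpoint from the log

try:
    # Ask the agent to run a command; the conductor waits at most 60s
    # for the HTTP response.
    requests.post(agent_url + '/v1/commands',
                  json={'name': 'image.install_bootloader', 'params': {}},
                  timeout=60)
except requests.exceptions.ReadTimeout:
    # This is the exception in the conductor log above: the TCP connect
    # succeeds, but no response arrives within the 60s read timeout.
    print('agent accepted the connection but did not answer in time')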
Hi Marius - any update on this?
(In reply to Dmitry Tantsur from comment #1)
> 2019-03-21 17:01:54.094 1 ERROR ironic.drivers.modules.agent_client
> [req-4f1bc06c-2bbb-4b8b-9608-0abfc4b7ca9e - - - - -] Failed to connect to
> the agent running on node 7e542470-ed6a-4b9c-9a19-c47215d57407 for invoking
> command image.install_bootloader. Error:
> HTTPConnectionPool(host='192.168.24.17', port=9999): Read timed out. (read
> timeout=60): requests.exceptions.ReadTimeout:
> HTTPConnectionPool(host='192.168.24.17', port=9999): Read timed out. (read
> timeout=60)
>
> Could this be a firewall or DHCP error? Could you check what is going on on
> the nodes themselves? The deploy logs you attached seem to show success even
> for the nodes that have failures in the ironic-conductor logs.

I tried calling the agent URL while deploying and it took longer than 60s to return, but I don't know how to live-debug what's going on on the node:

time curl 192.168.24.14:9999
{"name": "OpenStack Ironic Python Agent API", "description": "Ironic Python Agent is a provisioning agent for OpenStack Ironic", "versions": [{"id": "v1", "links": [{"href": "http://192.168.24.14:9999/v1/", "rel": "self"}]}], "default_version": {"id": "v1", "links": [{"href": "http://192.168.24.14:9999/v1/", "rel": "self"}]}}

real    2m21.189s

What is odd is that this issue occurs only for nodes which have 4 disks. Deployment of nodes with 1 or 2 disks works fine.

Looking at the deploy logs for the failed node, I can see that it tried running grub2-mkconfig 3 times:

[root@undercloud-0 9516cf39-4aa6-4e66-ae0d-37add36b8d15]# grep grub2-mk journal | grep CMD
Mar 27 20:04:29 host-192-168-24-9 ironic-python-agent[705]: 2019-03-27 20:04:29.684 705 DEBUG oslo_concurrency.processutils [-] CMD "chroot /tmp/tmp1kv7n9k8 /bin/sh -c "grub2-mkconfig -o /boot/grub2/grub.cfg"" returned: 0 in 91.550s execute /usr/lib/python3.6/site-packages/oslo_concurrency/processutils.py:409
Mar 27 20:06:01 host-192-168-24-9 ironic-python-agent[705]: 2019-03-27 20:06:01.674 705 DEBUG oslo_concurrency.processutils [-] CMD "chroot /tmp/tmp00fqhe1u /bin/sh -c "grub2-mkconfig -o /boot/grub2/grub.cfg"" returned: 0 in 91.340s execute /usr/lib/python3.6/site-packages/oslo_concurrency/processutils.py:409
Mar 27 20:07:33 host-192-168-24-9 ironic-python-agent[705]: 2019-03-27 20:07:33.615 705 DEBUG oslo_concurrency.processutils [-] CMD "chroot /tmp/tmpquhri5xn /bin/sh -c "grub2-mkconfig -o /boot/grub2/grub.cfg"" returned: 0 in 91.277s execute /usr/lib/python3.6/site-packages/oslo_concurrency/processutils.py:409

Compared to a node with 2 disks:

[root@undercloud-0 e2761690-9cf0-445a-8456-d76205661d9f]# grep grub2-mk journal | grep CMD
Mar 27 20:03:49 host-192-168-24-22 ironic-python-agent[681]: 2019-03-27 20:03:49.506 681 DEBUG oslo_concurrency.processutils [-] CMD "chroot /tmp/tmp28_55k_m /bin/sh -c "grub2-mkconfig -o /boot/grub2/grub.cfg"" returned: 0 in 51.376s execute /usr/lib/python3.6/site-packages/oslo_concurrency/processutils.py:409

and a node with 1 disk:

[root@undercloud-0 cb458d2c-66b9-4108-ad7b-8671a953c3f2]# grep grub2-mk journal | grep CMD
Mar 27 20:03:29 host-192-168-24-24 ironic-python-agent[702]: 2019-03-27 20:03:29.209 702 DEBUG oslo_concurrency.processutils [-] CMD "chroot /tmp/tmp8owj_d0u /bin/sh -c "grub2-mkconfig -o /boot/grub2/grub.cfg"" returned: 0 in 31.407s execute /usr/lib/python3.6/site-packages/oslo_concurrency/processutils.py:409
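For anyone repeating this analysis, the duration extraction above can be scripted. A small helper, assuming a per-node "journal" file from the attached logs (the script and its output format are mine, not part of IPA):

import re
import sys

# Usage: python3 grub_times.py <journal file from a node's deploy logs>
# Matches the oslo_concurrency.processutils completion lines shown above.
pattern = re.compile(r'grub2-mkconfig.*returned: 0 in ([\d.]+)s')

with open(sys.argv[1]) as journal:
    for line in journal:
        match = pattern.search(line)
        if match:
            print('grub2-mkconfig took %ss' % match.group(1))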
If it blocks eventlet (it shouldn't, but who knows), that could explain the timeouts. However, even 30 seconds sounds insane to me.
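To illustrate the theory: in an eventlet monkey-patched process, any call that never yields to the hub freezes every other green thread, including the one serving the agent's HTTP API. A self-contained sketch of the scheduling effect (not IPA code; the 5-second busy loop stands in for a synchronous grub2-mkconfig wait):

import eventlet
eventlet.monkey_patch()

import time

def blocking_work():
    # A busy loop never yields to the eventlet hub, just like a call
    # that waits synchronously on a long-running subprocess.
    deadline = time.monotonic() + 5
    while time.monotonic() < deadline:
        pass

def heartbeat():
    # Stands in for the green thread answering HTTP requests.
    for _ in range(10):
        print('responsive at %.1f' % time.monotonic())
        eventlet.sleep(1)

hb = eventlet.spawn(heartbeat)
eventlet.spawn(blocking_work)
hb.wait()
# The heartbeat goes silent for ~5 seconds while blocking_work runs:
# during that window an HTTP client would hit its read timeout.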
Do you have any previous information on the disks? If yes, could you try with metadata cleaning enabled?
(In reply to Dmitry Tantsur from comment #6)
> Do you have any previous information on the disks? If yes, could you try
> with metadata cleaning enabled?

No, this is an initial deployment in a virt environment where the disk files have just been created.
Looking at the deploy logs for the various nodes, for some reason each grub2-mkconfig command takes longer on nodes with multiple disks:

1 disk  - 31 seconds
2 disks - 51 seconds
3 disks - 91 seconds

and I'd guess it's longer still with more disks. If this is blocking eventlet, then it explains the "Read timed out. (read timeout=60)", so we'd need to figure out why the command takes so long and why it gets progressively longer the more disks are present.
I got onto a running IPA instance. The grub2-mkconfig command that is taking so long is waiting on an lvs command to complete. Running the lvs command manually inside the same chroot on a running IPA instance gives the following errors:

[root@host-192-168-24-18 /]# lvs --noheadings --separator : -o vg_name,lv_name
  WARNING: Device /dev/vda not initialized in udev database even after waiting 10000000 microseconds.
  WARNING: Device /dev/vda2 not initialized in udev database even after waiting 10000000 microseconds.
  WARNING: Device /dev/vdb not initialized in udev database even after waiting 10000000 microseconds.
  WARNING: Device /dev/vdc not initialized in udev database even after waiting 10000000 microseconds.
  WARNING: Device /dev/vdd not initialized in udev database even after waiting 10000000 microseconds.
  WARNING: Device /dev/vde not initialized in udev database even after waiting 10000000 microseconds.
  WARNING: Device /dev/vdf not initialized in udev database even after waiting 10000000 microseconds.
  WARNING: Device /dev/vda2 not initialized in udev database even after waiting 10000000 microseconds.
  WARNING: Device /dev/vdb not initialized in udev database even after waiting 10000000 microseconds.
  WARNING: Device /dev/vdc not initialized in udev database even after waiting 10000000 microseconds.
  WARNING: Device /dev/vdd not initialized in udev database even after waiting 10000000 microseconds.
  WARNING: Device /dev/vde not initialized in udev database even after waiting 10000000 microseconds.
  WARNING: Device /dev/vdf not initialized in udev database even after waiting 10000000 microseconds.

lvs is trying to access /run/lvm; after bind-mounting /run into the chroot, the lvs command runs without error or delay.

Can you test whether this patch (applied to the IPA image) solves the problem?
https://review.opendev.org/655582
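The gist of the workaround, sketched in Python (illustrative only; the actual change is in the review linked above, the helper name run_grub2_mkconfig is made up, and oslo's processutils is simply what IPA already uses for shelling out):

from oslo_concurrency import processutils

def run_grub2_mkconfig(chroot_path):
    # Bind-mount the host /run into the chroot so lvs (invoked by
    # grub2-mkconfig) can see /run/lvm and /run/udev instead of waiting
    # 10 seconds per device for udev records that never appear.
    processutils.execute('mount', '-o', 'bind', '/run',
                         chroot_path + '/run')
    try:
        processutils.execute('chroot', chroot_path, '/bin/sh', '-c',
                             'grub2-mkconfig -o /boot/grub2/grub.cfg')
    finally:
        processutils.execute('umount', chroot_path + '/run')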
*** Bug 1702413 has been marked as a duplicate of this bug. ***
deployment with RHOS_TRUNK-15.0-RHEL-8-20190520.n.2

rhosp-director-images.noarch             15.0-20190520.1.el8ost   @rhelosp-15.0-trunk
rhosp-director-images-ipa-x86_64.noarch  15.0-20190520.1.el8ost   @rhelosp-15.0-trunk
rhosp-director-images-x86_64.noarch      15.0-20190520.1.el8ost   @rhelosp-15.0-trunk

STACK_COMP="undercloud:1,controller:3,compute:2,ceph:3,ironic:2"

ceph nodes:

brw-rw----. 1 root disk 252,  0 May 21 21:13 /dev/vda
brw-rw----. 1 root disk 252,  1 May 21 21:13 /dev/vda1
brw-rw----. 1 root disk 252,  2 May 21 21:13 /dev/vda2
brw-rw----. 1 root disk 252, 16 May 21 21:42 /dev/vdb
brw-rw----. 1 root disk 252, 32 May 21 21:42 /dev/vdc
brw-rw----. 1 root disk 252, 48 May 21 21:42 /dev/vdd
brw-rw----. 1 root disk 252, 64 May 21 21:42 /dev/vde
brw-rw----. 1 root disk 252, 80 May 21 21:42 /dev/vdf

[root@ceph-0 ~]# pvs
  PV         VG                                          Fmt  Attr PSize  PFree
  /dev/vdb   ceph-70454c2f-88d9-4fd0-a3cd-93eae8147816   lvm2 a--  14.00g    0
  /dev/vdc   ceph-0b1c1299-4c41-4b9a-80e9-4fdab50ab52d   lvm2 a--  14.00g    0
  /dev/vdd   ceph-277d25e0-e1bb-40c7-a051-d353212df02d   lvm2 a--  14.00g    0
  /dev/vde   ceph-a3b319df-1ef7-4b9d-a130-0f2342dca6be   lvm2 a--  14.00g    0
  /dev/vdf   ceph-bfe64677-f8ba-48d3-8320-a144cd683d2c   lvm2 a--  14.00g    0

Deployment succeeded without issue using the images listed.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:2811
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days