Bug 1691551 - Deployment of overcloud nodes with more than 2 disks fails
Summary: Deployment of overcloud nodes with more than 2 disks fails
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-ironic-python-agent
Version: 15.0 (Stein)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: beta
Target Release: 15.0 (Stein)
Assignee: Derek Higgins
QA Contact: mlammon
URL:
Whiteboard:
Duplicates: 1702413
Depends On:
Blocks:
 
Reported: 2019-03-21 21:25 UTC by Marius Cornea
Modified: 2023-09-18 00:15 UTC (History)
CC List: 9 users

Fixed In Version: openstack-ironic-python-agent-3.5.1-0.20190206013118.el8ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-09-21 11:20:49 UTC
Target Upstream Version:
Embargoed:


Attachments
ironic.tar.gz (1.74 MB, application/gzip)
2019-03-21 21:25 UTC, Marius Cornea


Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 655582 0 None MERGED Bind mount /run into chroot when installing grub 2021-01-08 22:40:40 UTC
OpenStack gerrit 655810 0 None MERGED Bind mount /run into chroot when installing grub 2021-01-08 22:40:40 UTC
Red Hat Product Errata RHEA-2019:2811 0 None None None 2019-09-21 11:21:01 UTC

Internal Links: 1702413

Description Marius Cornea 2019-03-21 21:25:33 UTC
Created attachment 1546708 [details]
ironic.tar.gz

Description of problem:

Director-deployed OCP 3.11 deployment fails while Ironic is deploying the worker nodes:

(undercloud) [stack@undercloud-0 ~]$ openstack server list
+--------------------------------------+--------------------+--------+------------------------+----------------+--------+
| ID                                   | Name               | Status | Networks               | Image          | Flavor |
+--------------------------------------+--------------------+--------+------------------------+----------------+--------+
| 91f0adcf-ac6f-4283-8ead-87def0ab7fe0 | openshift-worker-1 | BUILD  |                        | overcloud-full | worker |
| 233daa2d-523e-453b-b6bd-2a1e62561920 | openshift-worker-2 | BUILD  |                        | overcloud-full | worker |
| 8d70b9ad-7474-43af-b529-eb834e818949 | openshift-infra-0  | ACTIVE | ctlplane=192.168.24.8  | overcloud-full | infra  |
| ce04f408-8a1d-4fd3-9a59-c5bec1fdeecf | openshift-master-1 | ACTIVE | ctlplane=192.168.24.20 | overcloud-full | master |
| e2b76597-7551-49e5-bc97-022c6dc81db2 | openshift-master-0 | ACTIVE | ctlplane=192.168.24.19 | overcloud-full | master |
| f90e7b89-72d1-496b-8ad3-faa39a25b6f9 | openshift-infra-1  | ACTIVE | ctlplane=192.168.24.11 | overcloud-full | infra  |
| 6022f7a2-c9d2-4da2-9422-3aeec0003e4b | openshift-worker-0 | BUILD  |                        | overcloud-full | worker |
| 22046fab-8ae4-45f1-83dd-9bf624bae02b | openshift-infra-2  | ACTIVE | ctlplane=192.168.24.21 | overcloud-full | infra  |
| f982330b-982e-4596-964e-bb51c04e5ffe | openshift-master-2 | ACTIVE | ctlplane=192.168.24.12 | overcloud-full | master |
+--------------------------------------+--------------------+--------+------------------------+----------------+--------+

Deployment of openshift-worker nodes never succeeds.


Version-Release number of selected component (if applicable):
openstack-ironic-api:20190320.1 

How reproducible:
100%

Steps to Reproduce:
1. Deploy OCP 3.11 overcloud with 3 x master + 3 x infra + 3 x worker nodes. Worker nodes have 2 additional disks for Gluster storage.

Actual results:
Ironic fails to deploy the worker nodes.

Expected results:
No failures.

Additional info:
Attaching Ironic logs.

Comment 1 Dmitry Tantsur 2019-03-25 10:49:10 UTC
2019-03-21 17:01:54.094 1 ERROR ironic.drivers.modules.agent_client [req-4f1bc06c-2bbb-4b8b-9608-0abfc4b7ca9e - - - - -] Failed to connect to the agent running on node 7e542470-ed6a-4b9c-9a19-c47215d57407 for invoking command image.install_bootloader. Error: HTTPConnectionPool(host='192.168.24.17', port=9999): Read timed out. (read timeout=60): requests.exceptions.ReadTimeout: HTTPConnectionPool(host='192.168.24.17', port=9999): Read timed out. (read timeout=60)

Maybe a firewall or DHCP error? Could you check what is going on on the nodes themselves? The deploy logs you have seem to show success for the nodes that have failures in the ironic-conductor logs.

Comment 2 Bob Fournier 2019-03-27 12:37:05 UTC
Hi Marius - any update on this?

Comment 3 Marius Cornea 2019-03-28 01:05:59 UTC
(In reply to Dmitry Tantsur from comment #1)
> 2019-03-21 17:01:54.094 1 ERROR ironic.drivers.modules.agent_client
> [req-4f1bc06c-2bbb-4b8b-9608-0abfc4b7ca9e - - - - -] Failed to connect to
> the agent running on node 7e542470-ed6a-4b9c-9a19-c47215d57407 for invoking
> command image.install_bootloader. Error:
> HTTPConnectionPool(host='192.168.24.17', port=9999): Read timed out. (read
> timeout=60): requests.exceptions.ReadTimeout:
> HTTPConnectionPool(host='192.168.24.17', port=9999): Read timed out. (read
> timeout=60)
> 
> Maybe a firewall or DHCP error? Could you check what is going on on the nodes
> themselves? The deploy logs you have seem to show success for the nodes that
> have failures in the ironic-conductor logs.

I tried calling the agent URL while deploying and it took longer than 60s to return, but I don't know how to live-debug what's going on on the node:

time curl 192.168.24.14:9999
{"name": "OpenStack Ironic Python Agent API", "description": "Ironic Python Agent is a provisioning agent for OpenStack Ironic", "versions": [{"id": "v1", "links": [{"href": "http://192.168.24.14:9999/v1/",
 "rel": "self"}]}], "default_version": {"id": "v1", "links": [{"href": "http://192.168.24.14:9999/v1/", "rel": "self"}]}}
real 2m21.189s
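
For reference, the conductor-side failure can be illustrated with curl as well (a sketch using the same node address as above, not taken from the logs): the ironic agent client gives up after a 60-second read timeout, so a response that takes over two minutes fails in exactly this way.

# --max-time makes curl abort after 60 seconds (exit code 28), mirroring
# the "Read timed out. (read timeout=60)" error quoted in comment 1
curl --max-time 60 http://192.168.24.14:9999/ ; echo "curl exit: $?"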

What is odd is that this issue occurs only for nodes with 4 disks. Deployment of nodes with 1 or 2 disks works fine.

Looking at the deploy logs for the failed node I can see that it tried running grub2-mkconfig 3 times:

 [root@undercloud-0 9516cf39-4aa6-4e66-ae0d-37add36b8d15]# grep grub2-mk journal  | grep CMD
Mar 27 20:04:29 host-192-168-24-9 ironic-python-agent[705]: 2019-03-27 20:04:29.684 705 DEBUG oslo_concurrency.processutils [-] CMD "chroot /tmp/tmp1kv7n9k8 /bin/sh -c "grub2-mkconfig -o /boot/grub2/grub.cfg"" returned: 0 in 91.550s execute /usr/lib/python3.6/site-packages/oslo_concurrency/processutils.py:409
Mar 27 20:06:01 host-192-168-24-9 ironic-python-agent[705]: 2019-03-27 20:06:01.674 705 DEBUG oslo_concurrency.processutils [-] CMD "chroot /tmp/tmp00fqhe1u /bin/sh -c "grub2-mkconfig -o /boot/grub2/grub.cfg"" returned: 0 in 91.340s execute /usr/lib/python3.6/site-packages/oslo_concurrency/processutils.py:409
Mar 27 20:07:33 host-192-168-24-9 ironic-python-agent[705]: 2019-03-27 20:07:33.615 705 DEBUG oslo_concurrency.processutils [-] CMD "chroot /tmp/tmpquhri5xn /bin/sh -c "grub2-mkconfig -o /boot/grub2/grub.cfg"" returned: 0 in 91.277s execute /usr/lib/python3.6/site-packages/oslo_concurrency/processutils.py:409


Compared to a node with 2 disks:

 [root@undercloud-0 e2761690-9cf0-445a-8456-d76205661d9f]# grep grub2-mk journal  | grep CMD
Mar 27 20:03:49 host-192-168-24-22 ironic-python-agent[681]: 2019-03-27 20:03:49.506 681 DEBUG oslo_concurrency.processutils [-] CMD "chroot /tmp/tmp28_55k_m /bin/sh -c "grub2-mkconfig -o /boot/grub2/grub.cfg"" returned: 0 in 51.376s execute /usr/lib/python3.6/site-packages/oslo_concurrency/processutils.py:409


and a node with 1 disk:

 [root@undercloud-0 cb458d2c-66b9-4108-ad7b-8671a953c3f2]# grep grub2-mk journal  | grep CMD
Mar 27 20:03:29 host-192-168-24-24 ironic-python-agent[702]: 2019-03-27 20:03:29.209 702 DEBUG oslo_concurrency.processutils [-] CMD "chroot /tmp/tmp8owj_d0u /bin/sh -c "grub2-mkconfig -o /boot/grub2/grub.cfg"" returned: 0 in 31.407s execute /usr/lib/python3.6/site-packages/oslo_concurrency/processutils.py:409

Comment 5 Dmitry Tantsur 2019-03-29 15:55:07 UTC
If it blocks eventlet (it shouldn't, but who knows), it could explain the timeouts. However, even 30 seconds sounds insane to me.

Comment 6 Dmitry Tantsur 2019-03-29 16:06:14 UTC
Is there any pre-existing data on the disks? If yes, could you try with metadata cleaning enabled?

Comment 7 Marius Cornea 2019-03-29 20:31:25 UTC
(In reply to Dmitry Tantsur from comment #6)
> Do you have any previous information on the disks? If yes, could you try
> with metadata cleaning enabled?

No, this is an initial deployment on virt environment where the disk files have just been created.

Comment 8 Derek Higgins 2019-04-24 16:05:01 UTC
Looking at the deploy logs for the various nodes, for some reason each grub2-mkconfig command takes longer on nodes with more disks:
1 disk - 31 seconds
2 disks - 51 seconds
3 disks - 91 seconds

and I guess it gets longer still with more disks.

If this is blocking eventlet, then it explains the "Read timed out. (read timeout=60)", so we'd need to figure out why the command takes so long and why it progressively gets longer the more disks are present.

Comment 10 Derek Higgins 2019-04-24 22:35:47 UTC
I got onto a running IPA instance

The grub2-mkconfig command that is taking so long is waiting on an lvs command to complete. Running the lvs command manually on a running IPA instance inside the same chroot gives the following error:

[root@host-192-168-24-18 /]# lvs --noheadings --separator : -o vg_name,lv_name
  WARNING: Device /dev/vda not initialized in udev database even after waiting 10000000 microseconds.
  WARNING: Device /dev/vda2 not initialized in udev database even after waiting 10000000 microseconds.
  WARNING: Device /dev/vdb not initialized in udev database even after waiting 10000000 microseconds.
  WARNING: Device /dev/vdc not initialized in udev database even after waiting 10000000 microseconds.
  WARNING: Device /dev/vdd not initialized in udev database even after waiting 10000000 microseconds.
  WARNING: Device /dev/vde not initialized in udev database even after waiting 10000000 microseconds.
  WARNING: Device /dev/vdf not initialized in udev database even after waiting 10000000 microseconds.
  WARNING: Device /dev/vda2 not initialized in udev database even after waiting 10000000 microseconds.
  WARNING: Device /dev/vdb not initialized in udev database even after waiting 10000000 microseconds.
  WARNING: Device /dev/vdc not initialized in udev database even after waiting 10000000 microseconds.
  WARNING: Device /dev/vdd not initialized in udev database even after waiting 10000000 microseconds.
  WARNING: Device /dev/vde not initialized in udev database even after waiting 10000000 microseconds.
  WARNING: Device /dev/vdf not initialized in udev database even after waiting 10000000 microseconds.

lvs is trying to access /run/lvm; after bind mounting /run into the chroot, the lvs command runs without error or delay.
Can you test whether this patch (applied to the IPA image) solves the problem?
https://review.opendev.org/655582
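
For anyone who wants to verify this by hand on a running IPA ramdisk before the patch lands, a rough sketch of the workaround (the chroot path is only an example lifted from the log lines above; substitute the temporary directory IPA actually created):

# Bind mount /run into the image chroot so lvs can reach /run/lvm and the
# udev database, then re-run the commands that were hanging
mount -o bind /run /tmp/tmp1kv7n9k8/run
chroot /tmp/tmp1kv7n9k8 /bin/sh -c "lvs --noheadings --separator : -o vg_name,lv_name"
chroot /tmp/tmp1kv7n9k8 /bin/sh -c "grub2-mkconfig -o /boot/grub2/grub.cfg"
umount /tmp/tmp1kv7n9k8/run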

Comment 11 Bob Fournier 2019-04-24 23:39:30 UTC
*** Bug 1702413 has been marked as a duplicate of this bug. ***

Comment 16 Alistair Tonner 2019-05-23 14:44:27 UTC
Deployment with RHOS_TRUNK-15.0-RHEL-8-20190520.n.2:

rhosp-director-images.noarch                  15.0-20190520.1.el8ost                               @rhelosp-15.0-trunk
rhosp-director-images-ipa-x86_64.noarch       15.0-20190520.1.el8ost                               @rhelosp-15.0-trunk
rhosp-director-images-x86_64.noarch           15.0-20190520.1.el8ost                               @rhelosp-15.0-trunk

STACK_COMP="undercloud:1,controller:3,compute:2,ceph:3,ironic:2"

ceph nodes:

brw-rw----. 1 root disk 252,  0 May 21 21:13 /dev/vda
brw-rw----. 1 root disk 252,  1 May 21 21:13 /dev/vda1
brw-rw----. 1 root disk 252,  2 May 21 21:13 /dev/vda2
brw-rw----. 1 root disk 252, 16 May 21 21:42 /dev/vdb
brw-rw----. 1 root disk 252, 32 May 21 21:42 /dev/vdc
brw-rw----. 1 root disk 252, 48 May 21 21:42 /dev/vdd
brw-rw----. 1 root disk 252, 64 May 21 21:42 /dev/vde
brw-rw----. 1 root disk 252, 80 May 21 21:42 /dev/vdf
[root@ceph-0 ~]# pvs
  PV         VG                                        Fmt  Attr PSize  PFree
  /dev/vdb   ceph-70454c2f-88d9-4fd0-a3cd-93eae8147816 lvm2 a--  14.00g    0
  /dev/vdc   ceph-0b1c1299-4c41-4b9a-80e9-4fdab50ab52d lvm2 a--  14.00g    0
  /dev/vdd   ceph-277d25e0-e1bb-40c7-a051-d353212df02d lvm2 a--  14.00g    0
  /dev/vde   ceph-a3b319df-1ef7-4b9d-a130-0f2342dca6be lvm2 a--  14.00g    0
  /dev/vdf   ceph-bfe64677-f8ba-48d3-8320-a144cd683d2c lvm2 a--  14.00g    0




Deployment succeeded without issue using the images listed.

Comment 19 errata-xmlrpc 2019-09-21 11:20:49 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:2811

Comment 20 Red Hat Bugzilla 2023-09-18 00:15:44 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

