Description of problem:
Overcloud deployment fails when deploying nodes with more than 2 disks:

(undercloud) [stack@undercloud-0 ~]$ nova list
/usr/lib/python3.6/site-packages/urllib3/connection.py:374: SubjectAltNameWarning: Certificate for 192.168.24.2 has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)
  SubjectAltNameWarning
/usr/lib/python3.6/site-packages/urllib3/connection.py:374: SubjectAltNameWarning: Certificate for 192.168.24.2 has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)
  SubjectAltNameWarning
+--------------------------------------+--------------+--------+------------+-------------+------------------------+
| ID                                   | Name         | Status | Task State | Power State | Networks               |
+--------------------------------------+--------------+--------+------------+-------------+------------------------+
| 9148dc1b-64ed-4721-b827-9ceb374c7b1a | ceph-0       | ERROR  | -          | NOSTATE     |                        |
| 2f519f9f-79be-4dad-acab-2901307074e4 | ceph-1       | BUILD  | scheduling | NOSTATE     |                        |
| 5871d3e1-9f95-46e7-91be-0bedf055d149 | ceph-2       | BUILD  | scheduling | NOSTATE     |                        |
| 304be7f2-9709-4687-8263-58f5549653ec | compute-0    | ACTIVE | -          | Running     | ctlplane=192.168.24.8  |
| 01273806-5f86-48f7-9888-a7fed31453ac | compute-1    | ACTIVE | -          | Running     | ctlplane=192.168.24.15 |
| 50e02de4-4d55-45c2-aadb-ede3f7c8fb43 | compute-2    | ACTIVE | -          | Running     | ctlplane=192.168.24.10 |
| f2936327-1e0f-42ba-8cd7-42ee0ea85b2e | controller-0 | ACTIVE | -          | Running     | ctlplane=192.168.24.9  |
| 386013ee-94d6-4d2e-88e1-90a9d0cc9af1 | controller-1 | ACTIVE | -          | Running     | ctlplane=192.168.24.11 |
| bdd88f96-f8c2-4c65-8a1e-f2575622c26f | controller-2 | ACTIVE | -          | Running     | ctlplane=192.168.24.21 |
+--------------------------------------+--------------+--------+------------+-------------+------------------------+

The ceph nodes have 6 disks with the following configuration:
https://github.com/redhat-openstack/infrared/blob/master/plugins/virsh/defaults/topology/nodes/ceph.yml#L11-L53

Version-Release number of selected component (if applicable):
15 -p RHOS_TRUNK-15.0-RHEL-8-20190418.n.0

How reproducible:
100%

Steps to Reproduce:
1. Deploy overcloud with nodes that have more than 2 disks

Actual results:
Overcloud deployment fails because the nodes with multiple disks fail to get deployed.

Expected results:
Overcloud deployment passes without issues.

Additional info:
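For anyone reproducing this, the per-node failure reasons can be pulled from nova and ironic on the undercloud with something like the following (a rough sketch; it assumes the stackrc credentials are sourced and that ceph-0 is the failed instance):

(undercloud) [stack@undercloud-0 ~]$ openstack server show ceph-0 -f value -c fault
(undercloud) [stack@undercloud-0 ~]$ openstack baremetal node list --fields uuid name provision_state last_error

The second command shows the ironic provision state and last_error for every node, which is usually where the deploy failure reason ends up.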
This error doesn't seem to be related to disks:

2019-04-23 15:14:07.330 8 ERROR ironic.drivers.modules.agent_client [req-a8564393-0c08-4adf-8d0c-af1a2e26dff4 - - - - -] Failed to connect to the agent running on node 81724853-5131-4d60-b568-134931b8b60e for invoking command image.install_bootloader. Error: HTTPConnectionPool(host='192.168.24.16', port=9999): Read timed out. (read timeout=60): requests.exceptions.ReadTimeout: HTTPConnectionPool(host='192.168.24.16', port=9999): Read timed out. (read timeout=60)

It also seems transient, since some commands succeed. Can you confirm that 192.168.24.16 is the correct address and is reachable from the undercloud?
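For example (just a sketch, and it assumes the deploy ramdisk is still running on that node), something like this run from the undercloud would show whether the agent on 192.168.24.16 answers at all:

ping -c 3 192.168.24.16
curl -sS http://192.168.24.16:9999/v1/status

Port 9999 is where ironic-python-agent listens, so a timeout on the curl would match the "Read timed out" error above.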
(In reply to Dmitry Tantsur from comment #2)
> This error doesn't seem to be related to disks:
>
> 2019-04-23 15:14:07.330 8 ERROR ironic.drivers.modules.agent_client
> [req-a8564393-0c08-4adf-8d0c-af1a2e26dff4 - - - - -] Failed to connect to
> the agent running on node 81724853-5131-4d60-b568-134931b8b60e for invoking
> command image.install_bootloader. Error:
> HTTPConnectionPool(host='192.168.24.16', port=9999): Read timed out. (read
> timeout=60): requests.exceptions.ReadTimeout:
> HTTPConnectionPool(host='192.168.24.16', port=9999): Read timed out. (read
> timeout=60)
>
> It also seems transient, since some commands succeed. Can you confirm that
> 192.168.24.16 is the correct address and is reachable from the undercloud?

I don't have this environment anymore, but I'll confirm on the next one. Based on my observations, though, the deployment passes when I leave only 2 disks on the ceph nodes.
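Not confirmed as the fix for this bug, but if the extra disks are indeed what trips the deploy, a common mitigation on multi-disk nodes is to set a root device hint so ironic always writes the image to the same disk, e.g. (the node UUID and /dev/vda are placeholders for a virt setup):

(undercloud) [stack@undercloud-0 ~]$ openstack baremetal node set <node-uuid> --property root_device='{"name": "/dev/vda"}'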
See https://bugzilla.redhat.com/show_bug.cgi?id=1691551#c10
Nice find, Derek! As this has the same symptoms as https://bugzilla.redhat.com/show_bug.cgi?id=1691551, I'm marking this as a duplicate so we have one place to track this issue.

*** This bug has been marked as a duplicate of bug 1691551 ***