Bug 1702413

Summary: Overcloud deployment fails when deploying nodes with more than 2 disks
Product: Red Hat OpenStack
Component: rhosp-director
Version: 15.0 (Stein)
Reporter: Marius Cornea <mcornea>
Assignee: RHOS Maint <rhos-maint>
QA Contact: Sasha Smolyak <ssmolyak>
CC: bfournie, dbecker, derekh, dsneddon, dtantsur, hjensas, johfulto, mburns, morazi
Status: CLOSED DUPLICATE
Severity: urgent
Priority: unspecified
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: ---
Last Closed: 2019-04-24 23:39:30 UTC
Type: Bug

Description Marius Cornea 2019-04-23 17:31:40 UTC
Description of problem:

Overcloud deployment fails when deploying nodes with more than 2 disks:

(undercloud) [stack@undercloud-0 ~]$ nova list
/usr/lib/python3.6/site-packages/urllib3/connection.py:374: SubjectAltNameWarning: Certificate for 192.168.24.2 has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)
  SubjectAltNameWarning
+--------------------------------------+--------------+--------+------------+-------------+------------------------+
| ID                                   | Name         | Status | Task State | Power State | Networks               |
+--------------------------------------+--------------+--------+------------+-------------+------------------------+
| 9148dc1b-64ed-4721-b827-9ceb374c7b1a | ceph-0       | ERROR  | -          | NOSTATE     |                        |
| 2f519f9f-79be-4dad-acab-2901307074e4 | ceph-1       | BUILD  | scheduling | NOSTATE     |                        |
| 5871d3e1-9f95-46e7-91be-0bedf055d149 | ceph-2       | BUILD  | scheduling | NOSTATE     |                        |
| 304be7f2-9709-4687-8263-58f5549653ec | compute-0    | ACTIVE | -          | Running     | ctlplane=192.168.24.8  |
| 01273806-5f86-48f7-9888-a7fed31453ac | compute-1    | ACTIVE | -          | Running     | ctlplane=192.168.24.15 |
| 50e02de4-4d55-45c2-aadb-ede3f7c8fb43 | compute-2    | ACTIVE | -          | Running     | ctlplane=192.168.24.10 |
| f2936327-1e0f-42ba-8cd7-42ee0ea85b2e | controller-0 | ACTIVE | -          | Running     | ctlplane=192.168.24.9  |
| 386013ee-94d6-4d2e-88e1-90a9d0cc9af1 | controller-1 | ACTIVE | -          | Running     | ctlplane=192.168.24.11 |
| bdd88f96-f8c2-4c65-8a1e-f2575622c26f | controller-2 | ACTIVE | -          | Running     | ctlplane=192.168.24.21 |
+--------------------------------------+--------------+--------+------------+-------------+------------------------+


The ceph nodes have 6 disks with the following configuration:

https://github.com/redhat-openstack/infrared/blob/master/plugins/virsh/defaults/topology/nodes/ceph.yml#L11-L53
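
A quick way to confirm how many disks ironic actually recorded for each node is to pull the introspection data. Below is a minimal sketch, assuming introspection has run, the undercloud credentials are sourced, and the openstack CLI with the ironic-inspector plugin is installed; the node names are illustrative:

#!/usr/bin/env python3
"""Minimal sketch: count the disks ironic-inspector recorded per node."""
import json
import subprocess

# Hypothetical node list for illustration; substitute real names or UUIDs.
NODES = ["ceph-0", "ceph-1", "ceph-2"]

for node in NODES:
    # `openstack baremetal introspection data save <node>` prints the stored
    # introspection JSON to stdout.
    raw = subprocess.check_output(
        ["openstack", "baremetal", "introspection", "data", "save", node]
    )
    data = json.loads(raw)
    disks = data.get("inventory", {}).get("disks", [])
    print(f"{node}: {len(disks)} disk(s)")
    for disk in disks:
        # 'name' and 'size' are the per-disk fields inspector normally reports.
        print(f"  {disk.get('name')} {disk.get('size')} bytes")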

Version-Release number of selected component (if applicable):
15.0 (puddle RHOS_TRUNK-15.0-RHEL-8-20190418.n.0)


How reproducible:
100%

Steps to Reproduce:
1. Deploy overcloud with nodes that have more than 2 disks

Actual results:
Overcloud deployment fails because the nodes with multiple disks fail to deploy.

Expected results:
Overcloud deployment passes without issues.

Additional info:

Comment 2 Dmitry Tantsur 2019-04-24 12:47:29 UTC
This error doesn't seem to be related to disks:

2019-04-23 15:14:07.330 8 ERROR ironic.drivers.modules.agent_client [req-a8564393-0c08-4adf-8d0c-af1a2e26dff4 - - - - -] Failed to connect to the agent running on node 81724853-5131-4d60-b568-134931b8b60e for invoking command image.install_bootloader. Error: HTTPConnectionPool(host='192.168.24.16', port=9999): Read timed out. (read timeout=60): requests.exceptions.ReadTimeout: HTTPConnectionPool(host='192.168.24.16', port=9999): Read timed out. (read timeout=60)

It also seems transient, since some commands succeed. Can you confirm that 192.168.24.16 is the correct address and is reachable from the undercloud?
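
A minimal sketch of such a reachability check, assuming the ironic-python-agent REST API on port 9999 (/v1/commands is the endpoint IPA normally serves) and mirroring the conductor's 60-second read timeout from the traceback above:

#!/usr/bin/env python3
"""Minimal sketch: check whether the agent on the node answers in time."""
import requests

AGENT = "http://192.168.24.16:9999"  # host/port taken from the traceback

try:
    # (connect timeout, read timeout); the 60s read timeout matches ironic's.
    resp = requests.get(f"{AGENT}/v1/commands", timeout=(5, 60))
    print(f"agent reachable, HTTP {resp.status_code}")
except requests.exceptions.ReadTimeout:
    print("agent accepted the connection but did not answer within 60s")
except requests.exceptions.ConnectionError as exc:
    print(f"agent unreachable: {exc}")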

Comment 3 Marius Cornea 2019-04-24 12:53:22 UTC
(In reply to Dmitry Tantsur from comment #2)
> This error doesn't seem to be related to disks:
> 
> 2019-04-23 15:14:07.330 8 ERROR ironic.drivers.modules.agent_client
> [req-a8564393-0c08-4adf-8d0c-af1a2e26dff4 - - - - -] Failed to connect to
> the agent running on node 81724853-5131-4d60-b568-134931b8b60e for invoking
> command image.install_bootloader. Error:
> HTTPConnectionPool(host='192.168.24.16', port=9999): Read timed out. (read
> timeout=60): requests.exceptions.ReadTimeout:
> HTTPConnectionPool(host='192.168.24.16', port=9999): Read timed out. (read
> timeout=60)
> 
> It also seems transient, since some commands succeed. Can you confirm that
> 192.168.24.16 is the correct address and is reachable from the undercloud?

I don't have this environment anymore, but I'll confirm on the next one. Based on my observations, though, the deployment passes when I leave only 2 disks on the ceph nodes.

Comment 4 Derek Higgins 2019-04-24 22:36:36 UTC
See https://bugzilla.redhat.com/show_bug.cgi?id=1691551#c10

Comment 5 Bob Fournier 2019-04-24 23:39:30 UTC
Nice find Derek!

As this has the same symptoms as https://bugzilla.redhat.com/show_bug.cgi?id=1691551, I'm marking this as a duplicate so we have one place to track the issue.

*** This bug has been marked as a duplicate of bug 1691551 ***