Bug 1416622

Summary:	Deployment always fails while using shared ILO port.
Product:	Red Hat OpenStack	Reporter:	VIKRANT <vaggarwa>
Component:	openstack-ironic	Assignee:	Dmitry Tantsur <dtantsur>
Status:	CLOSED WONTFIX	QA Contact:	Raviv Bar-Tal <rbartal>
Severity:	high	Docs Contact:
Priority:	high
Version:	10.0 (Newton)	CC:	aschultz, athomas, dbecker, dtantsur, hjensas, mburns, mcornea, morazi, rhel-osp-director-maint, srevivo
Target Milestone:	async
Target Release:	---
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2017-04-26 16:18:48 UTC	Type:	Bug

Description VIKRANT 2017-01-26 04:42:59 UTC

Description of problem:

The OSP 10 deployment fails on physical nodes whose iLO is configured to use a shared LOM port.

eno1 connects to the iLO network.
eno52 is the provisioning interface.

The iLO of each server physically shares the eno1 port: both use the same RJ45 socket.


Version-Release number of selected component (if applicable):
RHEL OSP 10

How reproducible:
Every time, for the customer.

Steps to Reproduce:
1.  Try the deployment using a simple configuration with 3 controller and 3 compute nodes.
2.  The deployment fails.

Actual results:
The deployment fails at different stages on different attempts.

Expected results:
The deployment should complete successfully.

Additional info:

1) Here is the deployment command:

~~~
openstack overcloud deploy --templates \
  -e /home/stack/templates/overcloud/network-environment.yaml \
  -e /home/stack/templates/overcloud/firstboot.yaml \
  -e /home/stack/templates/overcloud/timezone.yaml \
  -e /home/stack/templates/overcloud/rhel-registration/environment-rhel-registration.yaml \
  -e /home/stack/templates/overcloud/rhel-registration/rhel-registration-resource-registry.yaml \
  -e /home/stack/templates/overcloud/cinder-solidfire-environment.yaml \
  -e /home/stack/templates/overcloud/logging-environment.yaml \
  --stack overcloud \
  --control-scale 3 \
  --compute-scale 4 \
  --ceph-storage-scale 0 \
  --ntp-server pool.ntp.org \
  --neutron-network-type vxlan \
  --neutron-tunnel-types vxlan \
  --timeout 600
~~~

2) The introspection process seems to be more resilient to short gaps in the availability of the iLO IP address, for example during reboots.

I do see errors (last_error on nodes) such as: 

~~~
Failed to change power state to 'power on'. Error: HTTPSConnectionPool(host='xx.xx.30.46', port=443): Max retries exceeded with url: /ribcl (Caused by NewConnectionError('<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x5825a10>: Failed to establish a new connection: [Errno 113] EHOSTUNREACH',))

or

Failed to change power state to 'power on'. Error: iLO get_power_status failed, error: EOF occurred in violation of protocol (_ssl.c:579)
~~~

However, the node recovers once the introspection image is up and running.
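
Since these errors look transient (the iLO address is unreachable for a short time and then comes back), one way to check that theory from the undercloud is to retry a reachability probe with increasing delays. The following is only a minimal sketch: `retry_with_backoff` is a hypothetical helper written for this illustration, not part of Ironic or proliantutils, and it does not reflect how the iLO driver itself handles retries.

```shell
#!/bin/bash
# Hypothetical helper (an assumption for illustration, not an Ironic tool):
# run a command until it succeeds or the attempts are exhausted,
# doubling the wait between attempts.
# Usage: retry_with_backoff MAX_ATTEMPTS INITIAL_DELAY_SECONDS COMMAND [ARGS...]
retry_with_backoff() {
    local max_attempts=$1
    local delay=$2
    shift 2
    local attempt=1
    while true; do
        # Run the probe; stop as soon as it succeeds.
        if "$@"; then
            return 0
        fi
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        echo "attempt $attempt failed, retrying in ${delay}s" >&2
        sleep "$delay"
        delay=$((delay * 2))
        attempt=$((attempt + 1))
    done
}
```

For example, `retry_with_backoff 5 2 ping -c 1 -W 2 "$ILO_HOST"` would probe the iLO address up to five times, waiting 2, 4, 8 and 16 seconds between attempts. `ILO_HOST` is a placeholder for the iLO IP address. If the probe only succeeds after a few attempts while a node reboots, that would be consistent with the shared port dropping briefly.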

Comment 8 Dmitry Tantsur 2017-01-31 12:31:55 UTC

Thanks for your report. I'm glad you had some luck with pxe_ipmitool. We don't have much experience with the iLO drivers, so I've escalated this to the proliantutils developers. I'm assigning the bug to myself to track it.