1257355 – RHEL-OSP Installer shutdown causes corosync & pcs problems for deployments

Keywords:
Status:	CLOSED EOL
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	rhel-osp-installer
Sub Component:
Version:	6.0 (Juno)
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	8.0 (Liberty)
Assignee:	Mike Burns
QA Contact:	Omri Hochman
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2015-08-26 21:25 UTC by Dave Cain
Modified:	2016-09-29 13:44 UTC
CC List:	8 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2016-09-29 13:44:54 UTC
Target Upstream Version:
Embargoed:

Description Dave Cain 2015-08-26 21:25:20 UTC

Description: Hi folks. When using the rhel-osp-installer on RHEL-OSP6 (RHEL7.1) to build an OpenStack deployment, it seems that if we shut down the installer server afterwards (or if it physically fails), we observe a cluster communication problem between all of the Controller nodes.

The reason is that the DHCP server running on the Installer server no longer responds, so the physical NIC interfaces on the PXE network cannot renew their leases and lose their addresses. Pacemaker commands then time out because the nodes are effectively isolated from each other, and OpenStack services begin to shut down on each Controller node, causing an outage for the entire environment.

"pcs status" output shows the nodes as offline (no network connectivity), and top output shows the 'corosync' and 'crmd' processes on the controller nodes consuming a large share of CPU:

3264 root      20   0  213316  67460  34328 R  83.3  0.0 374:21.15 corosync
2376 haclust+  20   0  139648   9584   6312 S  56.9  0.0   8:06.99 crmd
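
For anyone who wants to watch for this symptom across several controllers, here is a minimal Python sketch that reads rows in the default top column layout (PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND) and flags busy processes. The function names and the 50% threshold are hypothetical choices for illustration, not part of any Red Hat tooling.

```python
from typing import List, NamedTuple


class TopRow(NamedTuple):
    pid: int
    user: str
    cpu_percent: float
    command: str


def parse_top_row(line: str) -> TopRow:
    """Parse one process row from top's default column layout.

    Assumed columns: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
    """
    fields = line.split(None, 11)
    if len(fields) < 12:
        raise ValueError(f"unexpected top row: {line!r}")
    return TopRow(
        pid=int(fields[0]),
        user=fields[1],
        cpu_percent=float(fields[8]),
        command=fields[11].strip(),
    )


def busy_processes(lines: List[str], threshold: float = 50.0) -> List[TopRow]:
    """Return rows whose %CPU is at or above the (hypothetical) threshold."""
    rows = [parse_top_row(line) for line in lines if line.strip()]
    return [row for row in rows if row.cpu_percent >= threshold]


# The two rows quoted in this report.
SAMPLE = [
    "3264 root      20   0  213316  67460  34328 R  83.3  0.0 374:21.15 corosync",
    "2376 haclust+  20   0  139648   9584   6312 S  56.9  0.0   8:06.99 crmd",
]

if __name__ == "__main__":
    for row in busy_processes(SAMPLE):
        print(f"{row.command} (pid {row.pid}) at {row.cpu_percent}% CPU")
```

Rows could be collected with top in batch mode; the sketch only handles process rows, not the summary header.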

However, once the Installer server is restored and dhclient is run on the PXE physical NIC interfaces so that they have addresses again, pacemaker is happy again, services can be started, etc.

The default lease time seems to be set to 600 seconds, or 10 minutes.  Is there a work-around or something we can do to make sure the Installer server isn't a single point of failure for a deployment if it fails or is shut down for some reason?
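
To put the 600-second lease in context: with the standard DHCP client timers from RFC 2131, a client first tries to renew at half the lease time (T1), falls back to rebinding at 87.5% of it (T2), and must stop using the address when the lease expires. The Python sketch below works through that arithmetic. It assumes the default T1/T2 fractions are in effect and that the last renewal before the outage succeeded at T1; the function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class LeaseTimeline:
    renew_at: float    # T1: client starts unicast renewal attempts
    rebind_at: float   # T2: client starts broadcast rebinding attempts
    expires_at: float  # lease end: client must give up the address


def lease_timeline(lease_seconds: float) -> LeaseTimeline:
    """Timers measured from the moment the lease was granted or renewed.

    Assumes the RFC 2131 defaults: T1 = 0.5 * lease, T2 = 0.875 * lease.
    """
    return LeaseTimeline(
        renew_at=0.5 * lease_seconds,
        rebind_at=0.875 * lease_seconds,
        expires_at=float(lease_seconds),
    )


def address_loss_window(lease_seconds: float) -> Tuple[float, float]:
    """Seconds after the DHCP server disappears until the address is lost.

    Earliest case: the server dies just before a renewal at T1 was due,
    so only (lease - T1) seconds remain.
    Latest case: the server dies right after a successful renewal,
    so the full lease remains.
    """
    timeline = lease_timeline(lease_seconds)
    earliest = timeline.expires_at - timeline.renew_at
    latest = timeline.expires_at
    return earliest, latest


if __name__ == "__main__":
    print(lease_timeline(600))
    print(address_loss_window(600))
```

Under these assumptions, a node with a 600-second lease would lose its address somewhere between 5 and 10 minutes after the DHCP server stops answering.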

Version: 6.0-A3

External links:

Severity (U/H/M/L): H

Business Priority: Must Have

Comment 3 Muhammad Afzal 2015-08-26 22:04:25 UTC

Just to add to Dave's comments above. To isolate the problem: even if you only shut down the dhcpd service, you get the same behavior. The PXE interface on the controller and compute hosts loses its IP address after a while, which also stops communication on the OpenStack management network.

Comment 4 Dave Cain 2015-08-28 21:05:56 UTC

Just as an aside for the Red Hat team:

Will putting the traffic types "Cluster Management, Admin API, Management, and Storage Clustering" on an 802.1Q VLAN-tagged network cause any problems that you know of?

Comment 5 Jaromir Coufal 2016-09-29 13:44:54 UTC

Closing list of bugs for RHEL OSP Installer since its support cycle has already ended [0]. If there is some bug closed by mistake, feel free to re-open.

For new deployments, please, use RHOSP director (starting with version 7).

-- Jaromir Coufal
-- Sr. Product Manager
-- Red Hat OpenStack Platform

[0] https://access.redhat.com/support/policy/updates/openstack/platform
