Bug 1257355

Summary: RHEL-OSP Installer shutdown causes corosync & pcs problems for deployments
Product: Red Hat OpenStack Reporter: Dave Cain <dcain>
Component: rhel-osp-installer Assignee: Mike Burns <mburns>
Status: CLOSED EOL QA Contact: Omri Hochman <ohochman>
Severity: high Docs Contact:
Priority: unspecified    
Version: 6.0 (Juno) CC: ctatman, mburns, muafzal, rhos-maint, scohen, sreichar, srevivo, tkatarki
Target Milestone: --- Keywords: ZStream
Target Release: 8.0 (Liberty)   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-09-29 13:44:54 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Dave Cain 2015-08-26 21:25:20 UTC
Description: Hi folks.  When using the rhel-osp-installer on RHEL-OSP6 (RHEL7.1) to build an OpenStack deployment, it seems that if we shut down the installer server afterwards (or if it physically fails), we observe a cluster communication problem between all of the Controller nodes.

The reason is that the DHCP server running on the Installer server no longer responds on the PXE network, so the physical NIC interfaces attached to that network cannot renew their leases and drop their addresses.  Pacemaker commands then time out because the nodes are essentially isolated from each other, and OpenStack services begin to shut down on each Controller node, causing an outage for the entire environment.

"pcs status" output shows the nodes as offline (no network connectivity), and top output shows the 'corosync' and 'crmd' processes on the controller nodes consuming a large share of CPU:

3264 root      20   0  213316  67460  34328 R  83.3  0.0 374:21.15 corosync
2376 haclust+  20   0  139648   9584   6312 S  56.9  0.0   8:06.99 crmd

However, once the Installer server is restored and dhclient is run on the PXE physical NIC interfaces so that they have addresses again, pacemaker is happy again, services can be started, etc.

The default lease time seems to be set to 600 seconds, or 10 minutes.  Is there a work-around or something we can do to make sure the Installer server isn't a single point of failure for a deployment if it fails or is shut down for some reason?
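
As a rough illustration of why the addresses disappear "after a while", here is a minimal sketch of the lease arithmetic for the 600-second lease mentioned above. It assumes the RFC 2131 default timers (renewal at 0.5 of the lease, rebinding at 0.875) and ignores the random fuzz that real clients such as dhclient add; the function names are hypothetical and not part of any installer tooling.

```python
# Sketch only: estimates when a DHCP client loses its address after the
# DHCP server disappears. Assumes RFC 2131 default timers (T1 = 0.5 * lease,
# T2 = 0.875 * lease) and no randomization, which real clients do apply.


def lease_timers(lease_seconds):
    """Return the renew (T1), rebind (T2), and expiry times for a lease."""
    return {
        "t1_renew": 0.5 * lease_seconds,
        "t2_rebind": 0.875 * lease_seconds,
        "expiry": float(lease_seconds),
    }


def address_loss_window(lease_seconds):
    """Return (earliest, latest) seconds after server failure at which the
    client's lease expires.

    Earliest: the server fails just before a renewal at T1 would succeed,
    so only (lease - T1) seconds of the lease remain.
    Latest: the server fails right after a successful renewal, so the
    full lease remains.
    """
    timers = lease_timers(lease_seconds)
    earliest = timers["expiry"] - timers["t1_renew"]
    latest = timers["expiry"]
    return earliest, latest


if __name__ == "__main__":
    lease = 600  # lease time reported in this bug
    earliest, latest = address_loss_window(lease)
    print("renew attempts start at %.0fs into the lease" % lease_timers(lease)["t1_renew"])
    print("address lost between %.0fs and %.0fs after the server stops" % (earliest, latest))
```

Under these assumptions, each node would lose its PXE address somewhere between 5 and 10 minutes after the Installer server stops answering, which fits the delayed failure described here.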

Version: 6.0-A3

External links:

Severity (U/H/M/L): H

Business Priority: Must Have

Comment 3 Muhammad Afzal 2015-08-26 22:04:25 UTC
Just to add to Dave's comments above: to isolate the problem, even if you only stop the dhcpd service, you get the same behavior. The PXE interface on the controller and compute hosts loses its IP address after a while, which also stops communication on the OpenStack management network.

Comment 4 Dave Cain 2015-08-28 21:05:56 UTC
Just as an aside for the Red Hat team:

Will putting the traffic types of "Cluster Management, Admin API, Management, and Storage Clustering" on an 802.1Q VLAN-tagged network cause any problems that you know of?

Comment 5 Jaromir Coufal 2016-09-29 13:44:54 UTC
Closing list of bugs for RHEL OSP Installer since its support cycle has already ended [0]. If there is some bug closed by mistake, feel free to re-open.

For new deployments, please use RHOSP director (starting with version 7).

-- Jaromir Coufal
-- Sr. Product Manager
-- Red Hat OpenStack Platform

[0] https://access.redhat.com/support/policy/updates/openstack/platform