Bug 1257355 - RHEL-OSP Installer shutdown causes corosync & pcs problems for deployments
Product: Red Hat OpenStack
Classification: Red Hat
Component: rhel-osp-installer
Version: 6.0 (Juno)
Hardware: x86_64 Linux
Priority: unspecified
Severity: high
Target Release: 8.0 (Liberty)
Assigned To: Mike Burns
QA Contact: Omri Hochman
Keywords: ZStream
Reported: 2015-08-26 17:25 EDT by Dave Cain
Modified: 2016-09-29 09:44 EDT
Doc Type: Bug Fix
Last Closed: 2016-09-29 09:44:54 EDT
Type: Bug

Description Dave Cain 2015-08-26 17:25:20 EDT
Description: Hi folks.  When using the rhel-osp-installer on RHEL-OSP6 (RHEL7.1) to build an OpenStack deployment, if we shut down the installer server afterwards (or if it physically fails), we observe a cluster communication problem between all of the Controller nodes.

The reason is that the DHCP server running on the Installer server no longer responds to the physical NIC interfaces on the PXE network, so they fail to renew their leases and their addresses are dropped.  The nodes are then essentially isolated from each other: pacemaker commands time out and OpenStack services begin to shut down on each Controller node, causing an outage for the entire environment.

"pcs status" output shows the nodes as offline (no network connectivity), and top output shows the 'corosync' and 'crmd' processes on the controller nodes keeping the CPU extremely busy:

3264 root      20   0  213316  67460  34328 R  83.3  0.0 374:21.15 corosync
2376 haclust+  20   0  139648   9584   6312 S  56.9  0.0   8:06.99 crmd

However, once the Installer server is restored and dhclient is run on the PXE physical NIC interfaces so that they have addresses again, pacemaker is happy again, services can be started, etc.
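
For anyone who wants to check for the same symptom on their own controllers, here is a minimal Python sketch that parses top lines like the two quoted above and flags the cluster daemons when they are busy. It assumes top's default column order (PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND). The 50% threshold and the names `parse_top_row` and `busy_processes` are made up for illustration and are not part of any tool mentioned in this report.

```python
from typing import Iterable, List, NamedTuple


class TopRow(NamedTuple):
    pid: int
    user: str
    cpu_percent: float
    command: str


def parse_top_row(line: str) -> TopRow:
    """Parse one process line from top, assuming the default 12 columns."""
    # Split into at most 12 fields so a command containing spaces stays whole.
    fields = line.split(None, 11)
    if len(fields) < 12:
        raise ValueError(f"unexpected top line: {line!r}")
    return TopRow(
        pid=int(fields[0]),
        user=fields[1],
        cpu_percent=float(fields[8]),   # %CPU is the 9th column by default
        command=fields[11].strip(),
    )


def busy_processes(
    lines: Iterable[str],
    threshold: float = 50.0,                 # hypothetical cut-off
    watched: tuple = ("corosync", "crmd"),   # daemons named in this report
) -> List[TopRow]:
    """Return watched processes whose %CPU is at or above the threshold."""
    rows = [parse_top_row(line) for line in lines if line.strip()]
    return [r for r in rows if r.command in watched and r.cpu_percent >= threshold]


# The two lines quoted in the description above.
SAMPLE = """\
3264 root      20   0  213316  67460  34328 R  83.3  0.0 374:21.15 corosync
2376 haclust+  20   0  139648   9584   6312 S  56.9  0.0   8:06.99 crmd
"""

if __name__ == "__main__":
    for row in busy_processes(SAMPLE.splitlines()):
        print(f"{row.command} (pid {row.pid}) at {row.cpu_percent}% CPU")
```

If top has been customised to show different columns, the field indexes would need adjusting.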

The default lease time seems to be set to 600 seconds (10 minutes).  Is there a workaround, or something we can do, to make sure the Installer server isn't a single point of failure for a deployment if it fails or is shut down for some reason?
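
To put the 600-second lease in context, here is a rough Python sketch of the lease timers. It assumes the default RFC 2131 fractions (renewal at 50% of the lease, rebinding at 87.5%) and that each client had been renewing successfully until the DHCP server went away. The actual dhclient configuration on these nodes was not checked, and the function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class LeaseTimeline:
    renew_at: float    # T1: client starts trying to renew with its server
    rebind_at: float   # T2: client starts broadcasting to any server
    expires_at: float  # lease ends; the address must be given up


def lease_timeline(
    lease_seconds: float,
    t1_fraction: float = 0.5,     # RFC 2131 default, assumed here
    t2_fraction: float = 0.875,   # RFC 2131 default, assumed here
) -> LeaseTimeline:
    """Seconds after the lease was granted at which each timer fires."""
    return LeaseTimeline(
        renew_at=lease_seconds * t1_fraction,
        rebind_at=lease_seconds * t2_fraction,
        expires_at=lease_seconds,
    )


def outage_window(lease_seconds: float, t1_fraction: float = 0.5) -> Tuple[float, float]:
    """Earliest and latest time, in seconds after the DHCP server stops,
    at which a client loses its address.

    A client that renews at T1 holds a lease aged between 0 and T1 when the
    server disappears, so it has between (lease - T1) and the full lease left.
    """
    t1 = lease_seconds * t1_fraction
    return (lease_seconds - t1, lease_seconds)


if __name__ == "__main__":
    print(lease_timeline(600))
    print(outage_window(600))
```

Under those assumptions, a node would drop its address somewhere between 5 and 10 minutes after the Installer server stops answering.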

Version: 6.0-A3

External links:

Severity (U/H/M/L): H

Business Priority: Must Have
Comment 3 MUHAMMAD AFZAL 2015-08-26 18:04:25 EDT
Just to add to Dave's comments above: to isolate the problem, even if you only shut down the dhcpd service, you get the same behavior. The PXE interfaces on the controller and compute hosts lose their IP addresses after a while, which also stops communication on the OpenStack management network.
Comment 4 Dave Cain 2015-08-28 17:05:56 EDT
Just as an aside for the Red Hat team:

Will putting the traffic types "Cluster Management, Admin API, Management, and Storage Clustering" on an 802.1Q VLAN-tagged network cause any problems that you know of?
Comment 5 Jaromir Coufal 2016-09-29 09:44:54 EDT
Closing list of bugs for RHEL OSP Installer since its support cycle has already ended [0]. If there is some bug closed by mistake, feel free to re-open.

For new deployments, please use RHOSP director (starting with version 7).

-- Jaromir Coufal
-- Sr. Product Manager
-- Red Hat OpenStack Platform

[0] https://access.redhat.com/support/policy/updates/openstack/platform
