Red Hat Bugzilla – Bug 1257355
RHEL-OSP Installer shutdown causes corosync & pcs problems for deployments
Last modified: 2016-09-29 09:44:54 EDT
Description: Hi folks. When using the rhel-osp-installer on RHEL-OSP6 (RHEL7.1) to build an OpenStack deployment, it seems that if we shut down the installer server afterwards (or if it physically fails), we observe a cluster communication problem between all of the Controller nodes.
The reason is that the DHCP server running on the Installer server no longer responds to the physical NIC interfaces on the PXE network, so those interfaces cannot renew their leases and their addresses are dropped. Pacemaker commands also time out because the nodes are essentially isolated from each other, and OpenStack services begin to shut down on each Controller node, causing an outage for the entire environment.
"pcs status" output shows the nodes as offline (no network connectivity), and top output shows that the 'corosync' and 'crmd' processes on the controller nodes are using an extremely high amount of CPU:
3264 root 20 0 213316 67460 34328 R 83.3 0.0 374:21.15 corosync
2376 haclust+ 20 0 139648 9584 6312 S 56.9 0.0 8:06.99 crmd
However, once the Installer server is restored and dhclient is run on the PXE physical NIC interfaces so that they have addresses again, pacemaker is happy again, services can be started, etc.
The default lease time seems to be set to 600 seconds, or 10 minutes. Is there a workaround or something we can do to make sure the Installer server isn't a single point of failure for a deployment if it fails or is shut down for some reason?
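To put rough numbers on the 600-second lease, here is a small Python sketch of when a node would be expected to lose its address after the DHCP server goes away. It assumes the clients use the default RFC 2131 timers (renew at T1 = 50% of the lease, rebind at T2 = 87.5%) and that renewals were succeeding at T1 until the outage; the function and class names are made up for illustration and are not part of any installer or dhclient tooling.

```python
from dataclasses import dataclass


@dataclass
class LeaseTimeline:
    """Offsets in seconds, measured from the last successful lease renewal."""
    renew_at: float    # T1: client starts unicast renewal attempts
    rebind_at: float   # T2: client starts broadcasting to any DHCP server
    expires_at: float  # lease ends; the client must stop using the address


def lease_timeline(lease_seconds: float,
                   t1_fraction: float = 0.5,
                   t2_fraction: float = 0.875) -> LeaseTimeline:
    # RFC 2131 default timer fractions; a server may override them.
    return LeaseTimeline(
        renew_at=lease_seconds * t1_fraction,
        rebind_at=lease_seconds * t2_fraction,
        expires_at=lease_seconds,
    )


def outage_window(lease_seconds: float) -> tuple:
    """Earliest and latest time (seconds after the DHCP server stops) at
    which a client loses its address, assuming it had been renewing
    successfully at T1 up to that point."""
    timeline = lease_timeline(lease_seconds)
    # Earliest: the server died just before the client's next renewal at T1,
    # so only (lease - T1) seconds of the lease remain.
    earliest = timeline.expires_at - timeline.renew_at
    # Latest: the client renewed just before the server died,
    # so it still holds a full lease.
    latest = timeline.expires_at
    return earliest, latest


if __name__ == "__main__":
    print(lease_timeline(600))
    print(outage_window(600))
```

Under those assumptions, each node would drop its PXE address somewhere between 5 and 10 minutes after the Installer server stops answering, which is consistent with the nodes going offline "after a while" rather than immediately.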
Severity (U/H/M/L): H
Business Priority: Must Have
Just to add to Dave's comments above: to isolate the problem, even if you only shut down the dhcpd service, you get the same behavior. The PXE interfaces on the controller and compute hosts lose their IP addresses after a while, which also stops communication on the OpenStack management network.
Just as an aside for the Red Hat team:
Will putting the traffic types of "Cluster Management, Admin API, Management, and Storage Clustering" on an 802.1Q VLAN-tagged network cause any problems that you know of?
Closing the list of bugs for RHEL OSP Installer since its support cycle has already ended. If any bug has been closed by mistake, feel free to re-open it.
For new deployments, please use RHOSP director (starting with version 7).
-- Jaromir Coufal
-- Sr. Product Manager
-- Red Hat OpenStack Platform