Description of problem:

After performing FFU, controllers cannot be restarted gracefully. They just hang on reboot and need to be rebooted with virsh or with nova reboot --hard.

Before executing the reboot, pcs status was showing that all services were ok. After executing the reboot on controller-0, pcs status on the other 2 controllers was showing the following:

Failed Actions:
* rabbitmq-bundle-docker-0_stop_0 on controller-0 'unknown error' (1): call=146, status=Timed Out, exitreason='', last-rc-change='Mon Jul 23 11:49:25 2018', queued=0ms, exec=20003ms
* galera-bundle-docker-0_stop_0 on controller-0 'unknown error' (1): call=144, status=Timed Out, exitreason='', last-rc-change='Mon Jul 23 11:49:05 2018', queued=1ms, exec=20002ms
* redis-bundle-docker-0_stop_0 on controller-0 'unknown error' (1): call=142, status=Timed Out, exitreason='', last-rc-change='Mon Jul 23 11:48:45 2018', queued=0ms, exec=20002ms
* haproxy-bundle-docker-0_stop_0 on controller-0 'unknown error' (1): call=134, status=Timed Out, exitreason='', last-rc-change='Mon Jul 23 11:48:24 2018', queued=0ms, exec=20005ms
* openstack-cinder-volume-docker-0_stop_0 on controller-0 'unknown error' (1): call=136, status=Timed Out, exitreason='', last-rc-change='Mon Jul 23 11:48:24 2018', queued=0ms, exec=20004ms

After a hard recovery of controller-0, pcs status shows:

Failed Actions:
* rabbitmq-bundle-docker-0_monitor_0 on controller-0 'unknown error' (1): call=29, status=complete, exitreason='', last-rc-change='Mon Jul 23 11:59:47 2018', queued=0ms, exec=3126ms
* galera-bundle-docker-0_monitor_0 on controller-0 'unknown error' (1): call=43, status=complete, exitreason='', last-rc-change='Mon Jul 23 11:59:48 2018', queued=0ms, exec=1769ms
* redis-bundle-docker-0_monitor_0 on controller-0 'unknown error' (1): call=57, status=complete, exitreason='', last-rc-change='Mon Jul 23 11:59:49 2018', queued=0ms, exec=1672ms
* haproxy-bundle-docker-0_monitor_0 on controller-0 'unknown error' (1): call=71, status=complete, exitreason='', last-rc-change='Mon Jul 23 11:59:52 2018', queued=0ms, exec=1801ms
* openstack-cinder-volume-docker-0_monitor_0 on controller-0 'unknown error' (1): call=83, status=complete, exitreason='', last-rc-change='Mon Jul 23 11:59:52 2018', queued=0ms, exec=1753ms

And in the logs on controller-0 I can see:

Jul 23 11:49:45 controller-0 lrmd[673335]: warning: rabbitmq-bundle-docker-0_stop_0 process (PID 100247) timed out
Jul 23 11:49:45 controller-0 lrmd[673335]: warning: rabbitmq-bundle-docker-0_stop_0:100247 - timed out after 20000ms
Jul 23 11:49:45 controller-0 crmd[673338]: error: Result of stop operation for rabbitmq-bundle-docker-0 on controller-0: Timed Out
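For reference, a sketch of how the hard reboot was issued; the instance/domain name controller-0 is taken from this environment and is an assumption (the actual nova server name or libvirt domain name may differ in other setups):

    # From the undercloud, force a hard reboot of the hung controller:
    nova reboot --hard controller-0

    # Or, on the virt host, reset the VM directly with virsh:
    virsh reset controller-0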
Can we get sosreports from all the controller nodes, please?
So we didn't have fencing enabled, and this is what prevented the reboot from being clean. To reboot a node without fencing, the pacemaker cluster first needs to be stopped on that node: pcs cluster stop needs to be executed on the node before rebooting.
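A minimal sketch of that sequence, assuming it is run as root on the controller that is about to be rebooted:

    # Stop the pacemaker cluster on this node only, so the bundle
    # resources are stopped cleanly before the OS goes down:
    pcs cluster stop

    # The node should now reboot without hanging:
    reboot

    # After the node is back up, rejoin it to the cluster and verify:
    pcs cluster start
    pcs status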