Description of problem:

After performing FFU, controllers cannot be restarted gracefully. They just hang on reboot and need to be rebooted with virsh or with nova reboot --hard.

Before executing the reboot, pcs status was showing that all services were ok. After executing the reboot on controller-0, pcs status on the other 2 controllers was showing the following:

Failed Actions:
* rabbitmq-bundle-docker-0_stop_0 on controller-0 'unknown error' (1): call=146, status=Timed Out, exitreason='', last-rc-change='Mon Jul 23 11:49:25 2018', queued=0ms, exec=20003ms
* galera-bundle-docker-0_stop_0 on controller-0 'unknown error' (1): call=144, status=Timed Out, exitreason='', last-rc-change='Mon Jul 23 11:49:05 2018', queued=1ms, exec=20002ms
* redis-bundle-docker-0_stop_0 on controller-0 'unknown error' (1): call=142, status=Timed Out, exitreason='', last-rc-change='Mon Jul 23 11:48:45 2018', queued=0ms, exec=20002ms
* haproxy-bundle-docker-0_stop_0 on controller-0 'unknown error' (1): call=134, status=Timed Out, exitreason='', last-rc-change='Mon Jul 23 11:48:24 2018', queued=0ms, exec=20005ms
* openstack-cinder-volume-docker-0_stop_0 on controller-0 'unknown error' (1): call=136, status=Timed Out, exitreason='', last-rc-change='Mon Jul 23 11:48:24 2018', queued=0ms, exec=20004ms

After a hard recovery of controller-0, pcs status shows:

Failed Actions:
* rabbitmq-bundle-docker-0_monitor_0 on controller-0 'unknown error' (1): call=29, status=complete, exitreason='', last-rc-change='Mon Jul 23 11:59:47 2018', queued=0ms, exec=3126ms
* galera-bundle-docker-0_monitor_0 on controller-0 'unknown error' (1): call=43, status=complete, exitreason='', last-rc-change='Mon Jul 23 11:59:48 2018', queued=0ms, exec=1769ms
* redis-bundle-docker-0_monitor_0 on controller-0 'unknown error' (1): call=57, status=complete, exitreason='', last-rc-change='Mon Jul 23 11:59:49 2018', queued=0ms, exec=1672ms
* haproxy-bundle-docker-0_monitor_0 on controller-0 'unknown error' (1): call=71, status=complete, exitreason='', last-rc-change='Mon Jul 23 11:59:52 2018', queued=0ms, exec=1801ms
* openstack-cinder-volume-docker-0_monitor_0 on controller-0 'unknown error' (1): call=83, status=complete, exitreason='', last-rc-change='Mon Jul 23 11:59:52 2018', queued=0ms, exec=1753ms

And in the logs on controller-0 I can see:

Jul 23 11:49:45 controller-0 lrmd[673335]: warning: rabbitmq-bundle-docker-0_stop_0 process (PID 100247) timed out
Jul 23 11:49:45 controller-0 lrmd[673335]: warning: rabbitmq-bundle-docker-0_stop_0:100247 - timed out after 20000ms
Jul 23 11:49:45 controller-0 crmd[673338]: error: Result of stop operation for rabbitmq-bundle-docker-0 on controller-0: Timed Out
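For reference, a sketch of how the hard reboot was issued; the instance/domain name controller-0 is taken from this environment and is an assumption (the actual nova server name or libvirt domain name may differ in other setups):

    # From the undercloud, force a hard reboot of the hung controller:
    nova reboot --hard controller-0

    # Or, on the virt host, reset the VM directly with virsh:
    virsh reset controller-0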
Can we get sosreports from all the controller nodes, please?
So we didn't have fencing enabled, and this is what prevented the reboot from being clean. To reboot a node without fencing, the pacemaker cluster first needs to be stopped on that node: pcs cluster stop needs to be executed on the node before rebooting.
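A minimal sketch of that sequence, assuming it is run as root on the controller that is about to be rebooted:

    # Stop the pacemaker cluster on this node only, so the bundle
    # resources are stopped cleanly before the OS goes down:
    pcs cluster stop

    # The node should now reboot without hanging:
    reboot

    # After the node is back up, rejoin it to the cluster and verify:
    pcs cluster start
    pcs status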