Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1607376

Summary: FFU: After FFU, I cannot restart the controllers
Product: Red Hat OpenStack
Reporter: Yolanda Robla <yroblamo>
Component: openstack-tripleo
Assignee: James Slagle <jslagle>
Status: CLOSED NOTABUG
QA Contact: Arik Chernetsky <achernet>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 13.0 (Queens)
CC: atelang, cfontain, dciabrin, lbezdick, mburns, mcornea, michele, skramaja, yrachman, yroblamo, zgreenbe
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-07-23 13:42:08 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1601472    
Bug Blocks:    

Description Yolanda Robla 2018-07-23 12:04:59 UTC
Description of problem:

After performing FFU, the controllers cannot be restarted gracefully. They simply hang on reboot and need to be hard-rebooted with virsh or with nova reboot --hard (an illustrative command sequence is shown after the output below).
Before executing a reboot, pcs status showed that all services were OK.
After executing the reboot on controller-0, pcs status on the other two controllers showed the following:

Failed Actions:
* rabbitmq-bundle-docker-0_stop_0 on controller-0 'unknown error' (1): call=146, status=Timed Out, exitreason='',
    last-rc-change='Mon Jul 23 11:49:25 2018', queued=0ms, exec=20003ms
* galera-bundle-docker-0_stop_0 on controller-0 'unknown error' (1): call=144, status=Timed Out, exitreason='',
    last-rc-change='Mon Jul 23 11:49:05 2018', queued=1ms, exec=20002ms
* redis-bundle-docker-0_stop_0 on controller-0 'unknown error' (1): call=142, status=Timed Out, exitreason='',
    last-rc-change='Mon Jul 23 11:48:45 2018', queued=0ms, exec=20002ms
* haproxy-bundle-docker-0_stop_0 on controller-0 'unknown error' (1): call=134, status=Timed Out, exitreason='',
    last-rc-change='Mon Jul 23 11:48:24 2018', queued=0ms, exec=20005ms
* openstack-cinder-volume-docker-0_stop_0 on controller-0 'unknown error' (1): call=136, status=Timed Out, exitreason='',
    last-rc-change='Mon Jul 23 11:48:24 2018', queued=0ms, exec=20004ms
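
For reference, the reboot and hard-recovery steps described above were of roughly the following form; the nova instance name and libvirt domain name are illustrative and will differ per deployment:

    # on the controller, before rebooting - all resources were reported OK
    pcs status

    # graceful reboot attempt - this is where the node hangs
    reboot

    # hard recovery, either from the undercloud ...
    nova reboot --hard overcloud-controller-0
    # ... or directly on the hypervisor
    virsh destroy controller-0 && virsh start controller-0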

After a hard recovery of controller-0, what I can see in the logs is:

Failed Actions:
* rabbitmq-bundle-docker-0_monitor_0 on controller-0 'unknown error' (1): call=29, status=complete, exitreason='',
    last-rc-change='Mon Jul 23 11:59:47 2018', queued=0ms, exec=3126ms
* galera-bundle-docker-0_monitor_0 on controller-0 'unknown error' (1): call=43, status=complete, exitreason='',
    last-rc-change='Mon Jul 23 11:59:48 2018', queued=0ms, exec=1769ms
* redis-bundle-docker-0_monitor_0 on controller-0 'unknown error' (1): call=57, status=complete, exitreason='',
    last-rc-change='Mon Jul 23 11:59:49 2018', queued=0ms, exec=1672ms
* haproxy-bundle-docker-0_monitor_0 on controller-0 'unknown error' (1): call=71, status=complete, exitreason='',
    last-rc-change='Mon Jul 23 11:59:52 2018', queued=0ms, exec=1801ms
* openstack-cinder-volume-docker-0_monitor_0 on controller-0 'unknown error' (1): call=83, status=complete, exitreason='',
    last-rc-change='Mon Jul 23 11:59:52 2018', queued=0ms, exec=1753ms


And in the logs on controller-0 I can see:

Jul 23 11:49:45 controller-0 lrmd[673335]: warning: rabbitmq-bundle-docker-0_stop_0 process (PID 100247) timed out
Jul 23 11:49:45 controller-0 lrmd[673335]: warning: rabbitmq-bundle-docker-0_stop_0:100247 - timed out after 20000ms
Jul 23 11:49:45 controller-0 crmd[673338]:   error: Result of stop operation for rabbitmq-bundle-docker-0 on controller-0: Timed Out

Comment 1 Michele Baldessari 2018-07-23 12:18:54 UTC
Can we get sosreports from all the controller nodes please?

Comment 3 Yolanda Robla 2018-07-23 13:42:08 UTC
So we didn't have fencing enabled, and this caused the reboot not to be clean. In order to reboot without fencing, the pacemaker cluster first needs to be stopped on that node:
pcs cluster stop needs to be executed on the node before rebooting.
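
A minimal sketch of that procedure (rejoining the cluster with pcs cluster start after the node comes back is implied here rather than stated explicitly in the comment):

    # on the controller that is about to be rebooted
    pcs cluster stop        # stop pacemaker/corosync on this node only
    reboot

    # once the node is back up
    pcs cluster start       # rejoin the cluster
    pcs status              # verify all resources recover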