
Bug 1598511

Summary: FFU: restore procedure after successful 'openstack overcloud ffwd-upgrade run' fails with galera pcs resource not starting: exitreason='Failed initial monitor action'
Product: Red Hat OpenStack
Reporter: Marius Cornea <mcornea>
Component: documentation
Assignee: RHOS Documentation Team <rhos-docs>
Status: CLOSED DUPLICATE
QA Contact: RHOS Documentation Team <rhos-docs>
Severity: high
Docs Contact:
Priority: high
Version: 13.0 (Queens)
CC: adahms, ccamacho, dbecker, fherrman, jfrancoa, lnatapov, mburns, mcornea, mmagr, morazi, pkilambi, srevivo, ykawada
Target Milestone: zstream
Keywords: Triaged, ZStream
Target Release: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-09-24 05:22:47 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description                Flags
restore commands output    none
sosreport                  none

Description Marius Cornea 2018-07-05 16:44:35 UTC
Created attachment 1456801 [details]
restore commands output

Description of problem:
FFU: the restore procedure after a successful 'openstack overcloud ffwd-upgrade run' fails because the galera pcs resource does not start, with exitreason='Failed initial monitor action':

[root@controller-0 heat-admin]# pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: controller-2 (version 1.1.18-11.el7_5.3-2b07d5c5a9) - partition with quorum
Last updated: Thu Jul  5 16:39:26 2018
Last change: Thu Jul  5 16:36:43 2018 by hacluster via crmd on controller-0

3 nodes configured
19 resources configured (7 DISABLED)

Online: [ controller-0 controller-1 controller-2 ]

Full list of resources:

 ip-192.168.24.8	(ocf::heartbeat:IPaddr2):	Started controller-1
 Clone Set: haproxy-clone [haproxy]
     Started: [ controller-0 controller-1 controller-2 ]
 Master/Slave Set: galera-master [galera]
     galera	(ocf::heartbeat:galera):	FAILED Master controller-0 (blocked)
     Masters: [ controller-1 controller-2 ]
 ip-172.17.4.14	(ocf::heartbeat:IPaddr2):	Started controller-1
 ip-172.17.3.18	(ocf::heartbeat:IPaddr2):	Started controller-2
 Clone Set: rabbitmq-clone [rabbitmq]
     Stopped (disabled): [ controller-0 controller-1 controller-2 ]
 ip-172.17.1.12	(ocf::heartbeat:IPaddr2):	Started controller-2
 ip-10.0.0.108	(ocf::heartbeat:IPaddr2):	Started controller-1
 Master/Slave Set: redis-master [redis]
     Stopped (disabled): [ controller-0 controller-1 controller-2 ]
 ip-172.17.1.17	(ocf::heartbeat:IPaddr2):	Started controller-2
 openstack-cinder-volume	(systemd:openstack-cinder-volume):	Stopped (disabled)

Failed Actions:
* galera_promote_0 on controller-0 'unknown error' (1): call=66, status=complete, exitreason='Failed initial monitor action',
    last-rc-change='Thu Jul  5 16:36:44 2018', queued=0ms, exec=10412ms


Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
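
When triaging output like the above, it can help to pull out only the resources that are not healthy. The following is a minimal Python sketch and is not part of the documented restore procedure. It assumes the `pcs status` text layout quoted in this report; the names `unhealthy_resources` and `SAMPLE` are made up for illustration, and clone-set summary lines (for example the disabled rabbitmq and redis sets) are deliberately not reported.

```python
import re

# Matches primitive resource lines of the form "<name> (<agent>): <state>", e.g.
#   galera  (ocf::heartbeat:galera):  FAILED Master controller-0 (blocked)
# The agent must contain a colon, so clone-set summary lines such as
# "Stopped (disabled): [ ... ]" are not mistaken for resources.
RESOURCE_LINE = re.compile(
    r"^\s*(?P<name>\S+)\s+\((?P<agent>[^)\s]+:[^)\s]+)\):\s+(?P<state>.+?)\s*$"
)


def unhealthy_resources(pcs_status_text):
    """Return (name, state) pairs for primitives reported as FAILED or Stopped."""
    found = []
    for line in pcs_status_text.splitlines():
        match = RESOURCE_LINE.match(line)
        if match and re.search(r"FAILED|Stopped", match.group("state")):
            found.append((match.group("name"), match.group("state")))
    return found


# Excerpt of the output quoted in this report (tabs written as \t).
SAMPLE = """\
 ip-192.168.24.8\t(ocf::heartbeat:IPaddr2):\tStarted controller-1
 Master/Slave Set: galera-master [galera]
     galera\t(ocf::heartbeat:galera):\tFAILED Master controller-0 (blocked)
     Masters: [ controller-1 controller-2 ]
 Clone Set: rabbitmq-clone [rabbitmq]
     Stopped (disabled): [ controller-0 controller-1 controller-2 ]
 openstack-cinder-volume\t(systemd:openstack-cinder-volume):\tStopped (disabled)
"""

if __name__ == "__main__":
    for name, state in unhealthy_resources(SAMPLE):
        print(name, "->", state)
```

On the excerpt, this reports the blocked galera primitive on controller-0 and the disabled openstack-cinder-volume resource.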


Version-Release number of selected component (if applicable):
https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/13/html/fast_forward_upgrades/restoring-the-overcloud

How reproducible:
100%

Steps to Reproduce:
1. Deploy OSP10 with 3 controllers, 2 computes, and 3 Ceph nodes.

2. Back up the controller nodes per:

https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/13/html/fast_forward_upgrades/assembly-preparing_for_openstack-platform_upgrade#backing_up_the_overcloud 

3. Run the FFU procedure until the end of the 'openstack overcloud ffwd-upgrade run' step. Make sure this step is successful.

4. Try to restore the controller nodes to the state backed up in step 2, per the restore procedure:
https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/13/html/fast_forward_upgrades/restoring-the-overcloud

Actual results:
At the step 'On the bootstrap Controller node, set Pacemaker to manage the Galera cluster', the galera resource fails to start:

Failed Actions:
* galera_promote_0 on controller-0 'unknown error' (1): call=66, status=complete, exitreason='Failed initial monitor action',
    last-rc-change='Thu Jul  5 16:36:44 2018', queued=0ms, exec=10412ms
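
For comparing this failure against similar reports, the Failed Actions entry can be split into its fields. The following is a hedged Python sketch that assumes the entry format quoted above (resource, operation, and interval joined by underscores, followed by the node, status text, return code, and exitreason). The names `FAILED_ACTION` and `parse_failed_actions` are hypothetical and do not come from pcs or Pacemaker.

```python
import re

# One failed-action entry, as quoted in this report, starts with:
#   * <resource>_<operation>_<interval> on <node> '<status>' (<rc>): ..., exitreason='<reason>',
FAILED_ACTION = re.compile(
    r"^\*\s+(?P<resource>\S+)_(?P<operation>[a-z]+)_(?P<interval>\d+)"
    r"\s+on\s+(?P<node>\S+)\s+'(?P<status>[^']*)'\s+\((?P<rc>\d+)\):"
    r".*?exitreason='(?P<exitreason>[^']*)'",
    re.MULTILINE,
)


def parse_failed_actions(text):
    """Return one dict of string fields per failed action found in the text."""
    return [match.groupdict() for match in FAILED_ACTION.finditer(text)]


# The entry quoted in this report.
SAMPLE = """\
Failed Actions:
* galera_promote_0 on controller-0 'unknown error' (1): call=66, status=complete, exitreason='Failed initial monitor action',
    last-rc-change='Thu Jul  5 16:36:44 2018', queued=0ms, exec=10412ms
"""

if __name__ == "__main__":
    for action in parse_failed_actions(SAMPLE):
        print(action["resource"], action["operation"], "on", action["node"],
              "rc=" + action["rc"], "reason:", action["exitreason"])
```

Here the fields show that the failing operation is the promote of galera on controller-0, and that the reason given is a failed initial monitor rather than a promote-specific error.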


Expected results:

The galera resource gets started as documented.

Additional info:

Attaching the sosreport and the output of the commands that I ran for the restore procedure.

Comment 1 Marius Cornea 2018-07-05 16:47:16 UTC
Created attachment 1456802 [details]
sosreport

Comment 5 Andrew Dahms 2018-09-24 05:22:47 UTC
Hi Marius,

Thank you for raising this bug.

My name is Andrew, and I am the documentation program manager investigating this issue.

After discussing this issue with the documentation team, we have decided that it must be reviewed by engineering before we can assess the documentation impact.

Because there are several bugs of a similar nature, I will close this bug as a duplicate for now and move the main bug to engineering where it can be reviewed. We will then follow up and track any potential documentation impact coming out of that process.

Kind regards,

Andrew

*** This bug has been marked as a duplicate of bug 1626086 ***