Bug 1503756 - OSP11 -> OSP12 upgrade: upgrade gets stuck on split stack deployments during Deployment_Step2 because the cluster is in maintenance mode
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 12.0 (Pike)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: 12.0 (Pike)
Assignee: Marius Cornea
QA Contact: Marius Cornea
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-10-18 16:10 UTC by Marius Cornea
Modified: 2023-02-22 23:02 UTC
CC List: 12 users

Fixed In Version: openstack-tripleo-heat-templates-7.0.3-12.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-12-13 22:17:19 UTC
Target Upstream Version:


Attachments (Terms of Use)
first pass debug notes from control0 /var/log/messages (2.33 KB, text/plain)
2017-10-19 15:26 UTC, Marios Andreou


Links
System ID Private Priority Status Summary Last Updated
Launchpad 1725175 0 None None None 2017-10-20 09:26:21 UTC
OpenStack gerrit 518597 0 None None None 2017-11-13 15:06:01 UTC
Red Hat Product Errata RHEA-2017:3462 0 normal SHIPPED_LIVE Red Hat OpenStack Platform 12.0 Enhancement Advisory 2018-02-16 01:43:25 UTC

Description Marius Cornea 2017-10-18 16:10:56 UTC
Description of problem:
OSP11 -> OSP12 upgrade: upgrade gets stuck on split stack deployments during Deployment_Step2 because the cluster is in maintenance mode 

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-7.0.3-0.20171014102841.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy OSP11 split stack deployment with 3 ctrl, 3 messaging, 3 db, 2 compute node, 3 ceph nodes
2. Upgrade to OSP12

Actual results:
While running major-upgrade-composable-steps-docker the upgrade gets stuck. Checking the Heat stacks:

(undercloud) [stack@undercloud-0 ~]$ openstack stack list --nested | grep PROGRESS
| 8e40a6c7-ebdb-4ccc-85c6-6275f8d3f3c5 | overcloud-AllNodesDeploySteps-wjo2dcwwmosx-AllNodesPostUpgradeSteps-4i7bmylywdnm-DatabaseDeployedServerDeployment_Step2-xcog6pw4h7ot                                                      | 08da50fc73114b118f112d645e8631dd | CREATE_IN_PROGRESS | 2017-10-18T15:42:08Z | None                 | dab455a8-18d2-4eab-8cea-7cabbb1d2659 |
| dab455a8-18d2-4eab-8cea-7cabbb1d2659 | overcloud-AllNodesDeploySteps-wjo2dcwwmosx-AllNodesPostUpgradeSteps-4i7bmylywdnm                                                                                                          | 08da50fc73114b118f112d645e8631dd | UPDATE_IN_PROGRESS | 2017-10-18T09:09:53Z | 2017-10-18T15:33:00Z | 2edb57b9-eb04-4147-ae07-e3d766052ca2 |
| 2edb57b9-eb04-4147-ae07-e3d766052ca2 | overcloud-AllNodesDeploySteps-wjo2dcwwmosx                                                                                                                                                | 08da50fc73114b118f112d645e8631dd | UPDATE_IN_PROGRESS | 2017-10-18T06:18:14Z | 2017-10-18T15:31:15Z | f63bd95d-d367-49e6-a83e-d223ee13c991 |
| f63bd95d-d367-49e6-a83e-d223ee13c991 | overcloud                                                                                                                                                                                 | 08da50fc73114b118f112d645e8631dd | UPDATE_IN_PROGRESS | 2017-10-18T06:10:20Z | 2017-10-18T15:23:44Z | None                                 |
(undercloud) [stack@undercloud-0 ~]$ 


Going to the database nodes we can see that the mysql_init_bundle has been running for 23 minutes:

[root@database-0 ~]# docker ps
CONTAINER ID        IMAGE                                                          COMMAND                  CREATED             STATUS              PORTS               NAMES
4aa5d1cb91f3        192.168.0.1:8787/rhosp12/openstack-mariadb-docker:20171017.1   "/bin/bash -c 'cp -a "   23 minutes ago      Up 23 minutes                           mysql_init_bundle
b9d4c6209a8c        192.168.0.1:8787/rhosp12/openstack-mariadb-docker:20171017.1   "kolla_start"            23 minutes ago      Up 23 minutes                           clustercheck


[root@database-0 ~]# pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: messaging-1 (version 1.1.16-12.el7_4.4-94ff4df) - partition with quorum
Last updated: Wed Oct 18 16:06:43 2017
Last change: Wed Oct 18 15:43:53 2017 by root via cibadmin on controller-0

18 nodes configured
36 resources configured (1 DISABLED)

              *** Resource management is DISABLED ***
  The cluster will not attempt to start, stop or recover services

Online: [ controller-0 controller-1 controller-2 database-0 database-1 database-2 messaging-0 messaging-1 messaging-2 ]

Full list of resources:

 ip-192.168.0.66	(ocf::heartbeat:IPaddr2):	Started controller-0 (unmanaged)
 ip-172.16.18.27	(ocf::heartbeat:IPaddr2):	Started controller-1 (unmanaged)
 ip-10.0.0.16	(ocf::heartbeat:IPaddr2):	Started controller-2 (unmanaged)
 ip-10.0.0.138	(ocf::heartbeat:IPaddr2):	Started controller-0 (unmanaged)
 ip-10.0.1.14	(ocf::heartbeat:IPaddr2):	Started controller-1 (unmanaged)
 openstack-cinder-volume	(systemd:openstack-cinder-volume):	Stopped (disabled, unmanaged)
 Docker container set: redis-bundle [192.168.0.1:8787/rhosp12/openstack-redis-docker:pcmklatest] (unmanaged)
   redis-bundle-0	(ocf::heartbeat:redis):	Stopped (unmanaged)
   redis-bundle-1	(ocf::heartbeat:redis):	Stopped (unmanaged)
   redis-bundle-2	(ocf::heartbeat:redis):	Stopped (unmanaged)
 Docker container set: rabbitmq-bundle [192.168.0.1:8787/rhosp12/openstack-rabbitmq-docker:pcmklatest] (unmanaged)
   rabbitmq-bundle-0	(ocf::heartbeat:rabbitmq-cluster):	Stopped (unmanaged)
   rabbitmq-bundle-1	(ocf::heartbeat:rabbitmq-cluster):	Stopped (unmanaged)
   rabbitmq-bundle-2	(ocf::heartbeat:rabbitmq-cluster):	Stopped (unmanaged)
 Docker container set: galera-bundle [192.168.0.1:8787/rhosp12/openstack-mariadb-docker:pcmklatest] (unmanaged)
   galera-bundle-0	(ocf::heartbeat:galera):	Stopped (unmanaged)
   galera-bundle-1	(ocf::heartbeat:galera):	Stopped (unmanaged)
   galera-bundle-2	(ocf::heartbeat:galera):	Stopped (unmanaged)
 Docker container set: haproxy-bundle [192.168.0.1:8787/rhosp12/openstack-haproxy-docker:pcmklatest] (unmanaged)
   haproxy-bundle-docker-0	(ocf::heartbeat:docker):	Stopped (unmanaged)
   haproxy-bundle-docker-1	(ocf::heartbeat:docker):	Stopped (unmanaged)
   haproxy-bundle-docker-2	(ocf::heartbeat:docker):	Stopped (unmanaged)

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

[root@database-0 ~]# pcs property list
Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: tripleo_cluster
 dc-version: 1.1.16-12.el7_4.4-94ff4df
 have-watchdog: false
 maintenance-mode: true
 redis_REPL_INFO: controller-0
 stonith-enabled: false
Node Attributes:
 controller-0: cinder-volume-role=true haproxy-role=true redis-role=true
 controller-1: cinder-volume-role=true haproxy-role=true redis-role=true
 controller-2: cinder-volume-role=true haproxy-role=true redis-role=true
 database-0: galera-role=true
 database-1: galera-role=true
 database-2: galera-role=true
 messaging-0: rabbitmq-role=true rmq-node-attr-last-known-rabbitmq=rabbit@messaging-0
 messaging-1: rabbitmq-role=true rmq-node-attr-last-known-rabbitmq=rabbit@messaging-1
 messaging-2: rabbitmq-role=true rmq-node-attr-last-known-rabbitmq=rabbit@messaging-2


Expected results:
Upgrade doesn't get stuck.

Additional info:

After running pcs property set maintenance-mode=false, the upgrade gets unstuck and the resources get started:

Cluster name: tripleo_cluster
Stack: corosync
Current DC: messaging-1 (version 1.1.16-12.el7_4.4-94ff4df) - partition with quorum
Last updated: Wed Oct 18 16:10:37 2017
Last change: Wed Oct 18 16:09:36 2017 by rabbitmq-bundle-2 via crm_attribute on messaging-2

18 nodes configured
36 resources configured (1 DISABLED)

Online: [ controller-0 controller-1 controller-2 database-0 database-1 database-2 messaging-0 messaging-1 messaging-2 ]
GuestOnline: [ galera-bundle-0@database-0 galera-bundle-1@database-1 galera-bundle-2@database-2 rabbitmq-bundle-0@messaging-0 rabbitmq-bundle-1@messaging-1 rabbitmq-bundle-2@messaging-2 redis-bundle-0@controller-2 redis-bundle-1@controller-0 redis-bundle-2@controller-1 ]

Full list of resources:

 ip-192.168.0.66	(ocf::heartbeat:IPaddr2):	Started controller-0
 ip-172.16.18.27	(ocf::heartbeat:IPaddr2):	Started controller-1
 ip-10.0.0.16	(ocf::heartbeat:IPaddr2):	Started controller-2
 ip-10.0.0.138	(ocf::heartbeat:IPaddr2):	Started controller-0
 ip-10.0.1.14	(ocf::heartbeat:IPaddr2):	Started controller-1
 openstack-cinder-volume	(systemd:openstack-cinder-volume):	Stopped (disabled)
 Docker container set: redis-bundle [192.168.0.1:8787/rhosp12/openstack-redis-docker:pcmklatest]
   redis-bundle-0	(ocf::heartbeat:redis):	Slave controller-2
   redis-bundle-1	(ocf::heartbeat:redis):	Master controller-0
   redis-bundle-2	(ocf::heartbeat:redis):	Slave controller-1
 Docker container set: rabbitmq-bundle [192.168.0.1:8787/rhosp12/openstack-rabbitmq-docker:pcmklatest]
   rabbitmq-bundle-0	(ocf::heartbeat:rabbitmq-cluster):	Started messaging-0
   rabbitmq-bundle-1	(ocf::heartbeat:rabbitmq-cluster):	Started messaging-1
   rabbitmq-bundle-2	(ocf::heartbeat:rabbitmq-cluster):	Started messaging-2
 Docker container set: galera-bundle [192.168.0.1:8787/rhosp12/openstack-mariadb-docker:pcmklatest]
   galera-bundle-0	(ocf::heartbeat:galera):	Master database-0
   galera-bundle-1	(ocf::heartbeat:galera):	Master database-1
   galera-bundle-2	(ocf::heartbeat:galera):	Master database-2
 Docker container set: haproxy-bundle [192.168.0.1:8787/rhosp12/openstack-haproxy-docker:pcmklatest]
   haproxy-bundle-docker-0	(ocf::heartbeat:docker):	Started controller-0
   haproxy-bundle-docker-1	(ocf::heartbeat:docker):	Started controller-1
   haproxy-bundle-docker-2	(ocf::heartbeat:docker):	Started controller-2

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

Comment 1 Marius Cornea 2017-10-18 16:24:56 UTC
But upgrade later failed because openstack-cinder-volume was stopped:


2017-10-18 16:17:52Z [overcloud]: UPDATE_FAILED  resources.AllNodesDeploySteps: resources.AllNodesPostUpgradeSteps: Error: resources.ControllerDeployedServerPostConfig.resources.ControllerDeployedServerPostPuppetRestart.resources.ControllerPostPuppetRestartDeployment.resources[0]: Deployment to server f

 Stack overcloud UPDATE_FAILED 

overcloud.AllNodesDeploySteps.AllNodesPostUpgradeSteps.ControllerDeployedServerPostConfig.ControllerDeployedServerPostPuppetRestart.ControllerPostPuppetRestartDeployment.0:
  resource_type: OS::Heat::SoftwareDeployment
  physical_resource_id: 67da640e-b27a-4dfe-a732-daa139053478
  status: CREATE_FAILED
  status_reason: |
    Error: resources[0]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 1
  deploy_stdout: |
     openstack-cinder-volume    (systemd:openstack-cinder-volume):      Stopped (disabled)
    Restarting openstack-cinder-volume...
  deploy_stderr: |
    ...
      corosync: active/enabled
      pacemaker: active/enabled
      pcsd: active/enabled'
    + grep openstack-cinder-volume
    + for service in '$SERVICES_TO_RESTART'
    + echo 'Restarting openstack-cinder-volume...'
    + pcs resource restart --wait=600 openstack-cinder-volume
    Error: Error performing operation: No such device or address
    openstack-cinder-volume is not running anywhere and so cannot be restarted
    
    (truncated, view all with --long)
Heat Stack update failed.
Heat Stack update failed.

Comment 3 Marios Andreou 2017-10-19 15:26:53 UTC
Created attachment 1340815 [details]
first pass debug notes from control0 /var/log/messages

As discussed on the upgrades scrum today, I've assigned this to myself for triage. I had a look at controller-0 from the logs mcornea provided and am attaching some interesting bits here. It would be great if someone from DFG:DF could take a look; it still isn't clear what is failing here.

thanks

Comment 4 Marios Andreou 2017-10-19 15:28:52 UTC
Adding needinfo on TC for DFG:DF: can you please add this to your triage list/rotation/whatever you use? See comment #3; we need help triaging this, and at first look it isn't related to the upgrade_tasks, which seem to have run OK.

Comment 5 James Slagle 2017-10-19 16:07:23 UTC
what was the initial deployment command?
what are all the upgrade commands that have been run?

The pacemaker cluster is set/unset for maintenance-mode on every stack update by:
https://github.com/openstack/tripleo-heat-templates/blob/stable/pike/environments/deployed-server-pacemaker-environment.yaml

Is that still the correct thing to be happening during an upgrade?

This matches what is done in:
https://github.com/openstack/tripleo-heat-templates/blob/stable/pike/environments/puppet-pacemaker.yaml
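
For reference, the mapping in that environment file looks roughly like the following (paraphrased from stable/pike; the exact relative paths are an assumption, check the linked file for the authoritative version):

```yaml
# Approximate shape of environments/deployed-server-pacemaker-environment.yaml
# (stable/pike). The pre/post hooks put the pacemaker cluster into
# maintenance mode before the puppet run and are expected to take it
# out again afterwards.
resource_registry:
  OS::TripleO::Tasks::ControllerDeployedServerPreConfig: ../extraconfig/tasks/pre_puppet_pacemaker.yaml
  OS::TripleO::Tasks::ControllerDeployedServerPostConfig: ../extraconfig/tasks/post_puppet_pacemaker.yaml
```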

Comment 6 James Slagle 2017-10-19 16:19:02 UTC
can we also see extraconfig/tasks/post_puppet_pacemaker.yaml from the upgraded plan?

and the roles data file used during the upgrade?

that's where the resources should be generated that set maintenance-mode=false.

Comment 7 James Slagle 2017-10-19 16:25:24 UTC
looking in /var/log/messages from controller-0, I don't see any instances of ControllerDeployedServerPostConfig being run until Oct 18 16:17:12

Oct 18 16:17:12 controller-0 os-collect-config: [2017-10-18 16:17:12,329] (heat-config) [DEBUG] Running /usr/libexec/heat-config/hooks/script < /var/lib/heat-config/deployed/4b2a4a42-1512-4e54-bd83-e01db27d3c3c.json
Oct 18 16:17:12 controller-0 os-collect-config: [2017-10-18 16:17:12,959] (heat-config) [INFO] {"deploy_stdout": "", "deploy_stderr": "", "deploy_status_code": 0}
Oct 18 16:17:12 controller-0 os-collect-config: [2017-10-18 16:17:12,960] (heat-config) [DEBUG] [2017-10-18 16:17:12,367] (heat-config) [INFO] update_identifier=1508340189
Oct 18 16:17:12 controller-0 os-collect-config: [2017-10-18 16:17:12,368] (heat-config) [INFO] deploy_server_id=6d48feb2-4303-46ea-955c-fdfe888d09cc
Oct 18 16:17:12 controller-0 os-collect-config: [2017-10-18 16:17:12,368] (heat-config) [INFO] deploy_action=CREATE
Oct 18 16:17:12 controller-0 os-collect-config: [2017-10-18 16:17:12,368] (heat-config) [INFO] deploy_stack_id=overcloud-AllNodesDeploySteps-wjo2dcwwmosx-AllNodesPostUpgradeSteps-4i7bmylywdnm-ControllerDeployedServerPostConfig-lvg7e5kptpon-ControllerDeployedServerPostPuppetMaintenanceModeDeployment-73bv2s6gwx6v/9acd167a-1eee-4196-84e6-0793332647ad
Oct 18 16:17:12 controller-0 os-collect-config: [2017-10-18 16:17:12,368] (heat-config) [INFO] deploy_resource_name=0
Oct 18 16:17:12 controller-0 os-collect-config: [2017-10-18 16:17:12,368] (heat-config) [INFO] deploy_signal_transport=TEMP_URL_SIGNAL
Oct 18 16:17:12 controller-0 os-collect-config: [2017-10-18 16:17:12,368] (heat-config) [INFO] deploy_signal_id=https://192.168.0.2:13808/v1/AUTH_08da50fc73114b118f112d645e8631dd/9acd167a-1eee-4196-84e6-0793332647ad/overcloud-AllNodesDeploySteps-wjo2dcwwmosx-AllNodesPostUpgradeSteps-4i7bmylywdnm-ControllerDeployedServerPostConfig-lvg7e5kptpon-ControllerDeployedServerPostPuppetMaintenanceModeDeployment-73bv2s6gwx6v-0-nterhyiyqfy7?temp_url_sig=e888b3946570252a95ddd2644c0184a5fe9589c8&temp_url_expires=2147483586


Looks like this corresponds to the stack update started around Oct 18 15:09:26 (whatever that was).

Whatever upgrade or stack update was started around Oct 18 08:44:08 (as linked by marios) must have had some difference in environments/templates/roles, such that the right resources to take the cluster out of maintenance mode were not generated.

That's probably what we need to focus on: what commands were actually run, at what times (so we can match up the logs), and what templates/environments/roles were used each time.

Comment 8 Marius Cornea 2017-10-19 19:57:11 UTC
(In reply to James Slagle from comment #5)
> what was the initial deployment command?
> what are all the upgrade commands that have been run?
> 
> The pacemaker cluster is set/unset for maintenance-mode on every stack
> update by:
> https://github.com/openstack/tripleo-heat-templates/blob/stable/pike/
> environments/deployed-server-pacemaker-environment.yaml
> 
> Is that still the correct thing to be happening during an upgrade?
> 
> This matches what is done in:
> https://github.com/openstack/tripleo-heat-templates/blob/stable/pike/
> environments/puppet-pacemaker.yaml

OK, now this rings a bell - I think this looks pretty much the same as bug 1470795 where the fix for regular deployments was to noop ControllerPreConfig and ControllerPostConfig in docker-ha -
 https://review.openstack.org/#/c/487313/2/environments/docker-ha.yaml

During upgrade of the split stack env we keep the deployed-server-pacemaker-environment.yaml environment file, where ControllerDeployedServerPreConfig and ControllerDeployedServerPostConfig point to the puppet pacemaker extraconfig, which puts the cluster into maintenance mode.

This is the upgrade command:

source ~/stackrc
export THT=/usr/share/openstack-tripleo-heat-templates/

openstack overcloud deploy --templates $THT \
--disable-validations \
-e $THT/environments/deployed-server-environment.yaml \
-e $THT/environments/deployed-server-bootstrap-environment-rhel.yaml \
-e $THT/environments/deployed-server-pacemaker-environment.yaml \
-r ~/openstack_deployment/roles/roles_data.yaml \
-e $THT/environments/network-isolation.yaml \
-e $THT/environments/network-management.yaml \
-e $THT/environments/ceph-ansible/ceph-ansible.yaml \
-e ~/openstack_deployment/environments/nodes.yaml \
-e ~/openstack_deployment/environments/network-environment.yaml \
-e ~/openstack_deployment/environments/disk-layout.yaml \
-e ~/openstack_deployment/environments/ctlplane-assignments.yaml \
-e ~/openstack_deployment/environments/neutron-settings.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-composable-steps-docker.yaml \
-e /home/stack/ceph-ansible-env.yaml \
-e /home/stack/docker-osp12.yaml \

I'd be inclined to noop https://github.com/openstack/tripleo-heat-templates/blob/stable/pike/environments/deployed-server-pacemaker-environment.yaml#L2-L3 similar to https://review.openstack.org/#/c/487313/ wdyt?
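
A sketch of what that noop would look like, mirroring the docker-ha.yaml change from review 487313 (resource names assumed to match deployed-server-pacemaker-environment.yaml):

```yaml
# Proposed override sketch: map the deployed-server pacemaker
# pre/post hooks to OS::Heat::None so a stack update no longer
# toggles maintenance mode around the puppet run.
resource_registry:
  OS::TripleO::Tasks::ControllerDeployedServerPreConfig: OS::Heat::None
  OS::TripleO::Tasks::ControllerDeployedServerPostConfig: OS::Heat::None
```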

Comment 9 James Slagle 2017-10-19 22:13:36 UTC
(In reply to Marius Cornea from comment #8)
> (In reply to James Slagle from comment #5)
> > what was the initial deployment command?
> > what are all the upgrade commands that have been run?
> > 
> > The pacemaker cluster is set/unset for maintenance-mode on every stack
> > update by:
> > https://github.com/openstack/tripleo-heat-templates/blob/stable/pike/
> > environments/deployed-server-pacemaker-environment.yaml
> > 
> > Is that still the correct thing to be happening during an upgrade?
> > 
> > This matches what is done in:
> > https://github.com/openstack/tripleo-heat-templates/blob/stable/pike/
> > environments/puppet-pacemaker.yaml
> 
> OK, now this rings a bell - I think this looks pretty much the same as bug
> 1470795 where the fix for regular deployments was to noop
> ControllerPreConfig and ControllerPostConfig in docker-ha -
>  https://review.openstack.org/#/c/487313/2/environments/docker-ha.yaml
> 
> During upgrade of the split stack env we keep the
> deployed-server-pacemaker-environment.yaml environment file where
> ControllerDeployedServerPreConfig and ControllerDeployedServerPostConfig
> point to the puppet pacemaker extraconfig which get the cluster into
> maintenance mode.
> 
> This the upgrade command:
> 
> source ~/stackrc
> export THT=/usr/share/openstack-tripleo-heat-templates/
> 
> openstack overcloud deploy --templates $THT \
> --disable-validations \
> -e $THT/environments/deployed-server-environment.yaml \
> -e $THT/environments/deployed-server-bootstrap-environment-rhel.yaml \
> -e $THT/environments/deployed-server-pacemaker-environment.yaml \
> -r ~/openstack_deployment/roles/roles_data.yaml \
> -e $THT/environments/network-isolation.yaml \
> -e $THT/environments/network-management.yaml \
> -e $THT/environments/ceph-ansible/ceph-ansible.yaml \
> -e ~/openstack_deployment/environments/nodes.yaml \
> -e ~/openstack_deployment/environments/network-environment.yaml \
> -e ~/openstack_deployment/environments/disk-layout.yaml \
> -e ~/openstack_deployment/environments/ctlplane-assignments.yaml \
> -e ~/openstack_deployment/environments/neutron-settings.yaml \
> -e
> /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-
> composable-steps-docker.yaml \
> -e /home/stack/ceph-ansible-env.yaml \
> -e /home/stack/docker-osp12.yaml \
> 
> I'd be inclined to noop
> https://github.com/openstack/tripleo-heat-templates/blob/stable/pike/
> environments/deployed-server-pacemaker-environment.yaml#L2-L3 similar to
> https://review.openstack.org/#/c/487313/ wdyt?

Sounds reasonable. The right fix here is probably to not hardcode Controller in docker-ha.yaml, as that would break if you are using custom roles where the pacemaker services are not on a role called "Controller".
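
For illustration, with a hypothetical custom role named PcmkRole carrying the pacemaker services, the override would have to key off that role's name rather than Controller (the names below follow the <RoleName>PreConfig/<RoleName>PostConfig pattern and are assumptions, not shipped template content):

```yaml
# Hypothetical: noop the pre/post hooks for a custom role "PcmkRole"
# that hosts the pacemaker services instead of Controller.
resource_registry:
  OS::TripleO::Tasks::PcmkRolePreConfig: OS::Heat::None
  OS::TripleO::Tasks::PcmkRolePostConfig: OS::Heat::None
```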

Comment 11 Jon Schlueter 2017-11-22 17:38:44 UTC
openstack-tripleo-heat-templates-7.0.3-12.el7ost

Comment 15 errata-xmlrpc 2017-12-13 22:17:19 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:3462

