Description of problem:
OSP11 -> OSP12 upgrade: the upgrade gets stuck on split stack deployments during Deployment_Step2 because the cluster is in maintenance mode.

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-7.0.3-0.20171014102841.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy an OSP11 split stack deployment with 3 controller, 3 messaging, 3 database, 2 compute, and 3 ceph nodes
2. Upgrade to OSP12

Actual results:
While running major-upgrade-composable-steps-docker the upgrade gets stuck. Checking the heat stacks:

(undercloud) [stack@undercloud-0 ~]$ openstack stack list --nested | grep PROGRESS
| 8e40a6c7-ebdb-4ccc-85c6-6275f8d3f3c5 | overcloud-AllNodesDeploySteps-wjo2dcwwmosx-AllNodesPostUpgradeSteps-4i7bmylywdnm-DatabaseDeployedServerDeployment_Step2-xcog6pw4h7ot | 08da50fc73114b118f112d645e8631dd | CREATE_IN_PROGRESS | 2017-10-18T15:42:08Z | None                 | dab455a8-18d2-4eab-8cea-7cabbb1d2659 |
| dab455a8-18d2-4eab-8cea-7cabbb1d2659 | overcloud-AllNodesDeploySteps-wjo2dcwwmosx-AllNodesPostUpgradeSteps-4i7bmylywdnm | 08da50fc73114b118f112d645e8631dd | UPDATE_IN_PROGRESS | 2017-10-18T09:09:53Z | 2017-10-18T15:33:00Z | 2edb57b9-eb04-4147-ae07-e3d766052ca2 |
| 2edb57b9-eb04-4147-ae07-e3d766052ca2 | overcloud-AllNodesDeploySteps-wjo2dcwwmosx | 08da50fc73114b118f112d645e8631dd | UPDATE_IN_PROGRESS | 2017-10-18T06:18:14Z | 2017-10-18T15:31:15Z | f63bd95d-d367-49e6-a83e-d223ee13c991 |
| f63bd95d-d367-49e6-a83e-d223ee13c991 | overcloud | 08da50fc73114b118f112d645e8631dd | UPDATE_IN_PROGRESS | 2017-10-18T06:10:20Z | 2017-10-18T15:23:44Z | None |
(undercloud) [stack@undercloud-0 ~]$

Going to the database nodes we can see that the mysql_init_bundle container has been running for 23 minutes:

[root@database-0 ~]# docker ps
CONTAINER ID  IMAGE                                                         COMMAND                 CREATED         STATUS         PORTS  NAMES
4aa5d1cb91f3  192.168.0.1:8787/rhosp12/openstack-mariadb-docker:20171017.1  "/bin/bash -c 'cp -a "  23 minutes ago  Up 23 minutes         mysql_init_bundle
b9d4c6209a8c  192.168.0.1:8787/rhosp12/openstack-mariadb-docker:20171017.1  "kolla_start"           23 minutes ago  Up 23 minutes         clustercheck

[root@database-0 ~]# pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: messaging-1 (version 1.1.16-12.el7_4.4-94ff4df) - partition with quorum
Last updated: Wed Oct 18 16:06:43 2017
Last change: Wed Oct 18 15:43:53 2017 by root via cibadmin on controller-0

18 nodes configured
36 resources configured (1 DISABLED)

              *** Resource management is DISABLED ***
  The cluster will not attempt to start, stop or recover services

Online: [ controller-0 controller-1 controller-2 database-0 database-1 database-2 messaging-0 messaging-1 messaging-2 ]

Full list of resources:

 ip-192.168.0.66        (ocf::heartbeat:IPaddr2):       Started controller-0 (unmanaged)
 ip-172.16.18.27        (ocf::heartbeat:IPaddr2):       Started controller-1 (unmanaged)
 ip-10.0.0.16   (ocf::heartbeat:IPaddr2):       Started controller-2 (unmanaged)
 ip-10.0.0.138  (ocf::heartbeat:IPaddr2):       Started controller-0 (unmanaged)
 ip-10.0.1.14   (ocf::heartbeat:IPaddr2):       Started controller-1 (unmanaged)
 openstack-cinder-volume        (systemd:openstack-cinder-volume):      Stopped (disabled, unmanaged)
 Docker container set: redis-bundle [192.168.0.1:8787/rhosp12/openstack-redis-docker:pcmklatest] (unmanaged)
   redis-bundle-0       (ocf::heartbeat:redis): Stopped (unmanaged)
   redis-bundle-1       (ocf::heartbeat:redis): Stopped (unmanaged)
   redis-bundle-2       (ocf::heartbeat:redis): Stopped (unmanaged)
 Docker container set: rabbitmq-bundle [192.168.0.1:8787/rhosp12/openstack-rabbitmq-docker:pcmklatest] (unmanaged)
   rabbitmq-bundle-0    (ocf::heartbeat:rabbitmq-cluster):      Stopped (unmanaged)
   rabbitmq-bundle-1    (ocf::heartbeat:rabbitmq-cluster):      Stopped (unmanaged)
   rabbitmq-bundle-2    (ocf::heartbeat:rabbitmq-cluster):      Stopped (unmanaged)
 Docker container set: galera-bundle [192.168.0.1:8787/rhosp12/openstack-mariadb-docker:pcmklatest] (unmanaged)
   galera-bundle-0      (ocf::heartbeat:galera):        Stopped (unmanaged)
   galera-bundle-1      (ocf::heartbeat:galera):        Stopped (unmanaged)
   galera-bundle-2      (ocf::heartbeat:galera):        Stopped (unmanaged)
 Docker container set: haproxy-bundle [192.168.0.1:8787/rhosp12/openstack-haproxy-docker:pcmklatest] (unmanaged)
   haproxy-bundle-docker-0      (ocf::heartbeat:docker):        Stopped (unmanaged)
   haproxy-bundle-docker-1      (ocf::heartbeat:docker):        Stopped (unmanaged)
   haproxy-bundle-docker-2      (ocf::heartbeat:docker):        Stopped (unmanaged)

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

[root@database-0 ~]# pcs property list
Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: tripleo_cluster
 dc-version: 1.1.16-12.el7_4.4-94ff4df
 have-watchdog: false
 maintenance-mode: true
 redis_REPL_INFO: controller-0
 stonith-enabled: false
Node Attributes:
 controller-0: cinder-volume-role=true haproxy-role=true redis-role=true
 controller-1: cinder-volume-role=true haproxy-role=true redis-role=true
 controller-2: cinder-volume-role=true haproxy-role=true redis-role=true
 database-0: galera-role=true
 database-1: galera-role=true
 database-2: galera-role=true
 messaging-0: rabbitmq-role=true rmq-node-attr-last-known-rabbitmq=rabbit@messaging-0
 messaging-1: rabbitmq-role=true rmq-node-attr-last-known-rabbitmq=rabbit@messaging-1
 messaging-2: rabbitmq-role=true rmq-node-attr-last-known-rabbitmq=rabbit@messaging-2

Expected results:
The upgrade doesn't get stuck.
Additional info:
After running "pcs property set maintenance-mode=false" the upgrade gets unstuck and the resources get started:

Cluster name: tripleo_cluster
Stack: corosync
Current DC: messaging-1 (version 1.1.16-12.el7_4.4-94ff4df) - partition with quorum
Last updated: Wed Oct 18 16:10:37 2017
Last change: Wed Oct 18 16:09:36 2017 by rabbitmq-bundle-2 via crm_attribute on messaging-2

18 nodes configured
36 resources configured (1 DISABLED)

Online: [ controller-0 controller-1 controller-2 database-0 database-1 database-2 messaging-0 messaging-1 messaging-2 ]
GuestOnline: [ galera-bundle-0@database-0 galera-bundle-1@database-1 galera-bundle-2@database-2 rabbitmq-bundle-0@messaging-0 rabbitmq-bundle-1@messaging-1 rabbitmq-bundle-2@messaging-2 redis-bundle-0@controller-2 redis-bundle-1@controller-0 redis-bundle-2@controller-1 ]

Full list of resources:

 ip-192.168.0.66        (ocf::heartbeat:IPaddr2):       Started controller-0
 ip-172.16.18.27        (ocf::heartbeat:IPaddr2):       Started controller-1
 ip-10.0.0.16   (ocf::heartbeat:IPaddr2):       Started controller-2
 ip-10.0.0.138  (ocf::heartbeat:IPaddr2):       Started controller-0
 ip-10.0.1.14   (ocf::heartbeat:IPaddr2):       Started controller-1
 openstack-cinder-volume        (systemd:openstack-cinder-volume):      Stopped (disabled)
 Docker container set: redis-bundle [192.168.0.1:8787/rhosp12/openstack-redis-docker:pcmklatest]
   redis-bundle-0       (ocf::heartbeat:redis): Slave controller-2
   redis-bundle-1       (ocf::heartbeat:redis): Master controller-0
   redis-bundle-2       (ocf::heartbeat:redis): Slave controller-1
 Docker container set: rabbitmq-bundle [192.168.0.1:8787/rhosp12/openstack-rabbitmq-docker:pcmklatest]
   rabbitmq-bundle-0    (ocf::heartbeat:rabbitmq-cluster):      Started messaging-0
   rabbitmq-bundle-1    (ocf::heartbeat:rabbitmq-cluster):      Started messaging-1
   rabbitmq-bundle-2    (ocf::heartbeat:rabbitmq-cluster):      Started messaging-2
 Docker container set: galera-bundle [192.168.0.1:8787/rhosp12/openstack-mariadb-docker:pcmklatest]
   galera-bundle-0      (ocf::heartbeat:galera):        Master database-0
   galera-bundle-1      (ocf::heartbeat:galera):        Master database-1
   galera-bundle-2      (ocf::heartbeat:galera):        Master database-2
 Docker container set: haproxy-bundle [192.168.0.1:8787/rhosp12/openstack-haproxy-docker:pcmklatest]
   haproxy-bundle-docker-0      (ocf::heartbeat:docker):        Started controller-0
   haproxy-bundle-docker-1      (ocf::heartbeat:docker):        Started controller-1
   haproxy-bundle-docker-2      (ocf::heartbeat:docker):        Started controller-2

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
But the upgrade later failed because openstack-cinder-volume was stopped:

2017-10-18 16:17:52Z [overcloud]: UPDATE_FAILED  resources.AllNodesDeploySteps: resources.AllNodesPostUpgradeSteps: Error: resources.ControllerDeployedServerPostConfig.resources.ControllerDeployedServerPostPuppetRestart.resources.ControllerPostPuppetRestartDeployment.resources[0]: Deployment to server f

 Stack overcloud UPDATE_FAILED

overcloud.AllNodesDeploySteps.AllNodesPostUpgradeSteps.ControllerDeployedServerPostConfig.ControllerDeployedServerPostPuppetRestart.ControllerPostPuppetRestartDeployment.0:
  resource_type: OS::Heat::SoftwareDeployment
  physical_resource_id: 67da640e-b27a-4dfe-a732-daa139053478
  status: CREATE_FAILED
  status_reason: |
    Error: resources[0]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 1
  deploy_stdout: |
    ...
    openstack-cinder-volume        (systemd:openstack-cinder-volume):      Stopped (disabled)
    Restarting openstack-cinder-volume...
  deploy_stderr: |
    ...
    corosync: active/enabled
    pacemaker: active/enabled
    pcsd: active/enabled'
    + grep openstack-cinder-volume
    + for service in '$SERVICES_TO_RESTART'
    + echo 'Restarting openstack-cinder-volume...'
    + pcs resource restart --wait=600 openstack-cinder-volume
    Error: Error performing operation: No such device or address
    openstack-cinder-volume is not running anywhere and so cannot be restarted
    (truncated, view all with --long)

Heat Stack update failed.
Heat Stack update failed.
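The failure above happens because the post-puppet restart script unconditionally calls "pcs resource restart" on every service in its list, and pcs errors out when the resource is not running anywhere. A minimal sketch of a safer guard, using a hard-coded sample of pcs status output as a stand-in for the real command (an assumption for illustration; the variable names mirror the script trace above but are otherwise hypothetical):

```shell
# PCS_STATUS stands in for the output of "pcs status" (assumption for
# illustration); here it reports cinder-volume as Stopped, as in this bug.
PCS_STATUS='openstack-cinder-volume (systemd:openstack-cinder-volume): Stopped (disabled)'
SERVICES_TO_RESTART='openstack-cinder-volume'

SKIPPED=""
for service in $SERVICES_TO_RESTART; do
  # Only restart services pacemaker reports as Started; a stopped
  # resource "cannot be restarted" and makes pcs exit non-zero.
  if echo "$PCS_STATUS" | grep "$service" | grep -q Started; then
    echo "Restarting $service..."
    # pcs resource restart --wait=600 "$service"   # real call, commented out
  else
    echo "Skipping $service (not running)"
    SKIPPED="$SKIPPED$service "
  fi
done
```

With the sample status above, the loop prints "Skipping openstack-cinder-volume (not running)" instead of failing the deployment step.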
Created attachment 1340815 [details]
first pass debug notes from control0 /var/log/messages

As discussed on the upgrades scrum today, I've assigned this to myself for triage. I had a look at controller-0 from the logs mcornea provided and am attaching some interesting bits here. It would be great if someone from DFG:DF could check; it still isn't clear what is failing here. Thanks.
Adding needinfo on the TC for DFG:DF. Can you please add this to your triage list/rotation? See comment #3 - we need help triaging this, and at first look it isn't something related to the upgrade_tasks, which seem to have run OK.
What was the initial deployment command?
What are all the upgrade commands that have been run?

The pacemaker cluster is set/unset for maintenance-mode on every stack update by:
https://github.com/openstack/tripleo-heat-templates/blob/stable/pike/environments/deployed-server-pacemaker-environment.yaml

Is that still the correct thing to be happening during an upgrade?

This matches what is done in:
https://github.com/openstack/tripleo-heat-templates/blob/stable/pike/environments/puppet-pacemaker.yaml
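For context, the environment file referenced above registers roughly the following mappings (a sketch, not an exact copy - the resource names match those appearing in this bug's traces, but the relative paths to the extraconfig templates are assumptions; check the actual file in the plan):

```yaml
# Sketch of environments/deployed-server-pacemaker-environment.yaml (pike).
# The pre-puppet template sets maintenance-mode=true before puppet runs;
# the post-puppet template is expected to set it back to false afterwards.
resource_registry:
  OS::TripleO::Tasks::ControllerDeployedServerPreConfig: ../extraconfig/tasks/pre_puppet_pacemaker.yaml
  OS::TripleO::Tasks::ControllerDeployedServerPostConfig: ../extraconfig/tasks/post_puppet_pacemaker.yaml
```

If the post-puppet side never runs (or runs against the wrong role), the cluster is left in maintenance mode, which matches the stuck state observed here.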
Can we also see extraconfig/tasks/post_puppet_pacemaker.yaml from the upgraded plan, and the roles data file used during the upgrade? That's where the resources that set maintenance-mode=false should be generated.
Looking in /var/log/messages from controller-0, I don't see any instances of ControllerDeployedServerPostConfig being run until Oct 18 16:17:12:

Oct 18 16:17:12 controller-0 os-collect-config: [2017-10-18 16:17:12,329] (heat-config) [DEBUG] Running /usr/libexec/heat-config/hooks/script < /var/lib/heat-config/deployed/4b2a4a42-1512-4e54-bd83-e01db27d3c3c.json
Oct 18 16:17:12 controller-0 os-collect-config: [2017-10-18 16:17:12,959] (heat-config) [INFO] {"deploy_stdout": "", "deploy_stderr": "", "deploy_status_code": 0}
Oct 18 16:17:12 controller-0 os-collect-config: [2017-10-18 16:17:12,960] (heat-config) [DEBUG] [2017-10-18 16:17:12,367] (heat-config) [INFO] update_identifier=1508340189
Oct 18 16:17:12 controller-0 os-collect-config: [2017-10-18 16:17:12,368] (heat-config) [INFO] deploy_server_id=6d48feb2-4303-46ea-955c-fdfe888d09cc
Oct 18 16:17:12 controller-0 os-collect-config: [2017-10-18 16:17:12,368] (heat-config) [INFO] deploy_action=CREATE
Oct 18 16:17:12 controller-0 os-collect-config: [2017-10-18 16:17:12,368] (heat-config) [INFO] deploy_stack_id=overcloud-AllNodesDeploySteps-wjo2dcwwmosx-AllNodesPostUpgradeSteps-4i7bmylywdnm-ControllerDeployedServerPostConfig-lvg7e5kptpon-ControllerDeployedServerPostPuppetMaintenanceModeDeployment-73bv2s6gwx6v/9acd167a-1eee-4196-84e6-0793332647ad
Oct 18 16:17:12 controller-0 os-collect-config: [2017-10-18 16:17:12,368] (heat-config) [INFO] deploy_resource_name=0
Oct 18 16:17:12 controller-0 os-collect-config: [2017-10-18 16:17:12,368] (heat-config) [INFO] deploy_signal_transport=TEMP_URL_SIGNAL
Oct 18 16:17:12 controller-0 os-collect-config: [2017-10-18 16:17:12,368] (heat-config) [INFO] deploy_signal_id=https://192.168.0.2:13808/v1/AUTH_08da50fc73114b118f112d645e8631dd/9acd167a-1eee-4196-84e6-0793332647ad/overcloud-AllNodesDeploySteps-wjo2dcwwmosx-AllNodesPostUpgradeSteps-4i7bmylywdnm-ControllerDeployedServerPostConfig-lvg7e5kptpon-ControllerDeployedServerPostPuppetMaintenanceModeDeployment-73bv2s6gwx6v-0-nterhyiyqfy7?temp_url_sig=e888b3946570252a95ddd2644c0184a5fe9589c8&temp_url_expires=2147483586

It looks like this corresponds to the stack update started around Oct 18 15:09:26 (whatever that was). Whatever upgrade or stack update was started around Oct 18 08:44:08, as linked by marios, must have had some difference in environments/templates/roles such that the right resources to take the cluster out of maintenance mode were not generated. That's probably what we need to focus on: what commands were actually run, at what times (so we can match up the logs), and what templates/environments/roles were used each time.
(In reply to James Slagle from comment #5)
> what was the initial deployment command?
> what are all the upgrade commands that have been run?
>
> The pacemaker cluster is set/unset for maintenance-mode on every stack
> update by:
> https://github.com/openstack/tripleo-heat-templates/blob/stable/pike/environments/deployed-server-pacemaker-environment.yaml
>
> Is that still the correct thing to be happening during an upgrade?
>
> This matches what is done in:
> https://github.com/openstack/tripleo-heat-templates/blob/stable/pike/environments/puppet-pacemaker.yaml

OK, now this rings a bell - I think this looks pretty much the same as bug 1470795, where the fix for regular deployments was to noop ControllerPreConfig and ControllerPostConfig in docker-ha: https://review.openstack.org/#/c/487313/2/environments/docker-ha.yaml

During upgrade of the split stack env we keep the deployed-server-pacemaker-environment.yaml environment file, where ControllerDeployedServerPreConfig and ControllerDeployedServerPostConfig point to the puppet pacemaker extraconfig, which puts the cluster into maintenance mode.
This is the upgrade command:

source ~/stackrc
export THT=/usr/share/openstack-tripleo-heat-templates/

openstack overcloud deploy --templates $THT \
--disable-validations \
-e $THT/environments/deployed-server-environment.yaml \
-e $THT/environments/deployed-server-bootstrap-environment-rhel.yaml \
-e $THT/environments/deployed-server-pacemaker-environment.yaml \
-r ~/openstack_deployment/roles/roles_data.yaml \
-e $THT/environments/network-isolation.yaml \
-e $THT/environments/network-management.yaml \
-e $THT/environments/ceph-ansible/ceph-ansible.yaml \
-e ~/openstack_deployment/environments/nodes.yaml \
-e ~/openstack_deployment/environments/network-environment.yaml \
-e ~/openstack_deployment/environments/disk-layout.yaml \
-e ~/openstack_deployment/environments/ctlplane-assignments.yaml \
-e ~/openstack_deployment/environments/neutron-settings.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-composable-steps-docker.yaml \
-e /home/stack/ceph-ansible-env.yaml \
-e /home/stack/docker-osp12.yaml

I'd be inclined to noop https://github.com/openstack/tripleo-heat-templates/blob/stable/pike/environments/deployed-server-pacemaker-environment.yaml#L2-L3 similar to https://review.openstack.org/#/c/487313/ - wdyt?
(In reply to Marius Cornea from comment #8)
> OK, now this rings a bell - I think this looks pretty much the same as bug
> 1470795 where the fix for regular deployments was to noop
> ControllerPreConfig and ControllerPostConfig in docker-ha -
> https://review.openstack.org/#/c/487313/2/environments/docker-ha.yaml
>
> During upgrade of the split stack env we keep the
> deployed-server-pacemaker-environment.yaml environment file where
> ControllerDeployedServerPreConfig and ControllerDeployedServerPostConfig
> point to the puppet pacemaker extraconfig which get the cluster into
> maintenance mode.
>
> I'd be inclined to noop
> https://github.com/openstack/tripleo-heat-templates/blob/stable/pike/environments/deployed-server-pacemaker-environment.yaml#L2-L3
> similar to https://review.openstack.org/#/c/487313/ wdyt?

Sounds reasonable. The right fix here is probably to not have Controller hardcoded in docker-ha.yaml, as that would break if you are using custom roles where the pacemaker services are not on a role called "Controller".
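A minimal sketch of the proposed noop, assuming the resource names that appear in this bug's failure traces (ControllerDeployedServerPreConfig/PostConfig); this mirrors the docker-ha.yaml fix from https://review.openstack.org/#/c/487313/ and would go in an environment applied during the docker upgrade (exact placement is an assumption, not a confirmed patch):

```yaml
# Hypothetical sketch: noop the pacemaker maintenance-mode pre/post
# hooks for deployed-server controllers so the containerized upgrade
# path does not leave the cluster in maintenance mode.
resource_registry:
  OS::TripleO::Tasks::ControllerDeployedServerPreConfig: OS::Heat::None
  OS::TripleO::Tasks::ControllerDeployedServerPostConfig: OS::Heat::None
```

Mapping a task resource to OS::Heat::None is the standard TripleO way to disable it without editing the role templates themselves.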
openstack-tripleo-heat-templates-7.0.3-12.el7ost
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2017:3462