Description of problem:
There may be a race condition in the currently proposed migration between shutting down the eventlet Keystone and starting up the WSGI one, which means that the cloud is down for longer than necessary (4.5 minutes in the log below, but that's a virt env so things are generally slower). The cluster eventually recovered without any intervention.

Here's the httpd error log:
[Wed Jun 22 06:32:08.996441 2016] [mpm_prefork:notice] [pid 4039] AH00171: Graceful restart requested, doing restart
(98)Address already in use: AH00072: make_sock: could not bind to address 192.0.2.14:35357
[Wed Jun 22 06:32:09.110418 2016] [mpm_prefork:alert] [pid 4039] no listening sockets available, shutting down
[Wed Jun 22 06:32:09.110425 2016] [:emerg] [pid 4039] AH00019: Unable to open logs, exiting
[Wed Jun 22 06:36:46.084062 2016] [core:notice] [pid 6033] SELinux policy enabled; httpd running as context system_u:system_r:httpd_t:s0
[Wed Jun 22 06:36:46.086082 2016] [suexec:notice] [pid 6033] AH01232: suEXEC mechanism enabled (wrapper: /usr/sbin/suexec)
[Wed Jun 22 06:36:46.099634 2016] [auth_digest:notice] [pid 6033] AH01757: generating secret for digest authentication ...
[Wed Jun 22 06:36:46.104044 2016] [core:warn] [pid 6033] AH00098: pid file /etc/httpd/run/httpd.pid overwritten -- Unclean shutdown of previous Apache run?
[Wed Jun 22 06:36:46.109067 2016] [mpm_prefork:notice] [pid 6033] AH00163: Apache/2.4.6 (Red Hat Enterprise Linux) mod_wsgi/3.4 Python/2.7.5 configured -- resuming normal operations
[Wed Jun 22 06:36:46.109097 2016] [core:notice] [pid 6033] AH00094: Command line: '/usr/sbin/httpd -D FOREGROUND'

Here's the stopping time of the openstack-keystone service:
čen 22 06:33:01 overcloud-controller-0.localdomain systemd[1]: Stopping OpenStack Identity Service (code-named Keystone)...
čen 22 06:33:01 overcloud-controller-0.localdomain systemd[1]: Stopped OpenStack Identity Service (code-named Keystone).
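For anyone triaging similar reports, here is a minimal sketch of pulling the failed bind addresses out of an httpd error log. The function name `bind_failures` is made up for illustration; it only looks for the AH00072 message text shown in the log above and reads the log on stdin.

```shell
# Hypothetical helper (not part of any package): print each address:port
# that httpd reported it could not bind to (message AH00072).
bind_failures() {
    # Capture the token that follows the AH00072 message text.
    sed -n 's/.*AH00072: make_sock: could not bind to address \([^ ]*\).*/\1/p'
}
```

It could be fed the error log of the affected controller on stdin (the log path depends on the deployment).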
Version-Release number of selected component (if applicable): openstack-tripleo-heat-templates-2.0.0-11.el7ost.noarch + applied patch set 17 of https://review.openstack.org/#/c/302235/
We thought that the latest patch set of https://review.openstack.org/#/c/302235/, which got merged to stable/mitaka, had this issue fixed, and perhaps it partially does, but there still seems to be some form of collision on one of the controllers. I even noticed an orphaned openstack-keystone resource in pcs status, though only for a while, and then it disappeared:

openstack-keystone (systemd:openstack-keystone): ORPHANED Started overcloud-controller-0

Workaround:
It's enough to run `pcs resource cleanup` after the migration to get httpd started on the single controller where it didn't start. (The migration itself doesn't fail because of this issue.)
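To decide whether the workaround is needed, something like the following sketch could list the nodes on which a clone is reported as stopped. The function name `stopped_nodes_for_clone` is hypothetical, and the parsing assumes the `pcs status` layout seen in this report ("Clone Set: NAME [...]" followed by "Started: [ ... ]" / "Stopped: [ ... ]" lines); other pcs versions may print this differently.

```shell
# Hypothetical helper: read `pcs status` text on stdin and print the nodes
# listed as Stopped for the clone named in $1 (e.g. httpd-clone).
stopped_nodes_for_clone() {
    awk -v clone="$1" '
        # A new "Clone Set:" header starts or ends the section we care about.
        /Clone Set:/ { in_clone = (index($0, "Clone Set: " clone " ") > 0); next }
        in_clone && /Stopped: \[/ {
            line = $0
            sub(/.*Stopped: \[ */, "", line)   # drop everything up to the node list
            sub(/ *\].*/, "", line)            # drop the closing bracket
            print line
        }
        # Any other kind of line means the clone section is over.
        in_clone && !/Started:|Stopped:/ { in_clone = 0 }
    '
}
```

Usage would be along the lines of `pcs status | stopped_nodes_for_clone httpd-clone`; if that prints anything after the migration, `pcs resource cleanup` is the workaround described above.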
Created attachment 1173909 [details] pcs status
Created attachment 1173910 [details] corosync log from controller 0
Created attachment 1173911 [details] httpd log from controller 0
Created attachment 1173912 [details] httpd log from controller 1 -- ok
We need this file, please:
Jun 29 13:15:43 [1531] overcloud-controller-0.localdomain pengine: error: process_pe_message: Calculated Transition 31: /var/lib/pacemaker/pengine/pe-error-0.bz2
Created attachment 1177246 [details]
pe-input corresponding to the restart transition

So the attached file was obtained and it is the first transition after the httpd restart:
Jul 07 07:49:16 [1123] overcloud-controller-0 pengine: notice: LogActions: Restart httpd:0 (Started overcloud-controller-0)
Jul 07 07:49:16 [1123] overcloud-controller-0 pengine: notice: process_pe_message: Calculated Transition 155: /var/lib/pacemaker/pengine/pe-input-116.bz2

*Do* note that the error message was slightly different this time around. I still think it is the same underlying race, but the message is not the one about binding ports:
Jul 07 07:51:14 overcloud-controller-0 systemd[1]: Unit httpd.service cannot be reloaded because it is inactive.
Jul 07 07:51:14 overcloud-controller-0 os-collect-config[2449]: [2016-07-07 07:51:14,367] (heat-config) [INFO] {"deploy_stdout": "", "deploy_stderr": "Job for httpd.service invalid.\n", "deploy_status_code": 1}
Jul 07 07:51:14 overcloud-controller-0 os-collect-config[2449]: [2016-07-07 07:51:14,368] (heat-config) [DEBUG] [2016-07-07 07:51:14,341] (heat-config) [INFO] deploy_server_id=65622106-1f58-40db-ab06-27e581868b20
Ignore this comment 7, Andrew. It is a slightly separate issue and I know what it is.
So I looked at this more with Mathieu and Andrew this morning (thanks both for your time, btw). Here's a brief recap:

"""
After adding the upgrade step to migrate keystone under httpd, we left two small races in the process:

1) The first race could result in the following error:
Graceful restart requested, doing restart (98)Address already in use: AH00072: make_sock: could not bind to address 192.0.2.14:35357

This is likely caused by removing the keystone resource and changing constraints in a single pacemaker CIB transaction. We are not guaranteed that pacemaker will first remove keystone and only then attempt to restart httpd due to the changed constraints. To address this we unmanage the httpd resource before the constraint changes and remanage it later.

2) The second race exists because, after the cib-push, we were not guaranteed that the later upgrade step that reloads the httpd configuration via 'systemctl reload httpd' would run after httpd was started everywhere, so we could get the following error:
Jul 07 07:51:14 overcloud-controller-0 systemd[1]: Unit httpd.service cannot be reloaded because it is inactive.

We add a 'check_resource httpd started' after we remanage the httpd resource, in order to guarantee that the httpd resource is up and running at this point.
"""

Now for race 1), which is really what this BZ is about, we will test the approach mentioned here and in the review, but PLEASE, if you do hit this specific port binding issue, collect the following file and upload it here. On the DC do the following:
$ grep -e 'pengine:.*bz2' -e "LogActions.*Restart httpd" /var/log/cluster/corosync.log
and grab the first .bz2 file that is listed after the first "Restart httpd" line.

Mathieu and I will test this change this afternoon and will report back here.
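As a convenience, the "first .bz2 after the first Restart httpd line" selection could be scripted roughly as below. This is only a sketch: the function name `first_pe_file_after_restart` is invented here, and it assumes the corosync.log line shapes quoted in this bug (the pengine file path being the last field on the line).

```shell
# Hypothetical helper: read corosync.log text on stdin and print the first
# pengine .bz2 path that appears after the first "Restart httpd" line.
first_pe_file_after_restart() {
    awk '
        /LogActions.*Restart +httpd/ { seen = 1 }
        seen && /pengine:.*\.bz2/ {
            print $NF   # the file path is the last field on the line
            exit
        }
    '
}
```

On the DC this would be run as `first_pe_file_after_restart < /var/log/cluster/corosync.log`; it prints nothing if no restart of httpd was logged.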
In the file you attached, the restarts are caused by:

Clone Set: openstack-core-clone [openstack-core]
Stopped: [ overcloud-controller-0 ]

Removing this bogus restart would presumably go a long way towards addressing the problem. I suggest a phased approach:

1. Create openstack-core-clone AND wait for it to get started
2. Update all the constraints to point to openstack-core-clone instead of keystone
3. Delete keystone AND wait for it to be stopped
4. Update the httpd resource and restart it
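The "AND wait" parts of the phases above boil down to polling with a timeout. A minimal sketch of such a helper follows; `wait_until` is a hypothetical name, not a function from the migration script, and the timeout and interval values in the usage comment are only examples.

```shell
# Hypothetical helper: run a command every INTERVAL seconds until it succeeds
# or TIMEOUT seconds have elapsed. Returns 0 on success, 1 on timeout.
# Usage: wait_until TIMEOUT INTERVAL command [args...]
wait_until() {
    _wu_timeout=$1
    _wu_interval=$2
    shift 2
    _wu_start=$(date +%s)
    until "$@"; do
        _wu_now=$(date +%s)
        if [ $((_wu_now - _wu_start)) -ge "$_wu_timeout" ]; then
            return 1
        fi
        sleep "$_wu_interval"
    done
    return 0
}

# Example for phase 3 (assumes keystone shows up as "keystone-clone" in
# `pcs status`, as in the loop used in the review):
#   keystone_gone() { ! pcs status | grep -q keystone-clone; }
#   wait_until 600 5 keystone_gone || echo "keystone failed to stop" >&2
```

Keeping the polling in one place means each phase only has to supply its own "is it done yet" check.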
Discussed this further with Andrew. For now the unmanage-httpd approach is also okay. Let's keep an eye on any other failure reports from QE/CI in any case.
Created attachment 1179862 [details] sosreport from controller 0
Created attachment 1179871 [details] sos report controller 1
Created attachment 1179877 [details] sos report from controller 2
Hi @bandini and @matbu - I tried the review at https://review.openstack.org/#/c/338879 again today, and this time it ends up with UPDATE_FAILED from heat. As we have already discussed, the issues I am seeing may not be caused by /#/c/338879, so I don't think we should block on what I am seeing. But I can say it also doesn't fix the issues I am seeing. As requested on the review, I attach the sos reports from the controllers here (comments #12, #13 and #14), taken after the heat stack went to UPDATE_FAILED. I see httpd down on controller 0. A pcs resource cleanup does help and I can move on after doing that.

Verbose copy/pasta notes on what/how I ran the keystone migration:

*=* 14:40:55 *=*=*= "DEPLOY"
openstack overcloud deploy --templates /usr/share/openstack-tripleo-heat-templates -e /usr/share/openstack-tripleo-heat-templates/overcloud-resource-registry-puppet.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/puppet-pacemaker.yaml --control-scale 3 --compute-scale 1 --libvirt-type qemu -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/net-single-nic-with-vlans.yaml -e network_env.yaml --ntp-server '0.fedora.pool.ntp.org'
tripleo.sh -- Overcloud pingtest SUCCEEDED

*=* 14:56:42 *=*=*= "openstack undercloud upgrade "
sudo rm -rf /etc/yum.repos.d/*
sudo rhos-release 9-director -d
sudo rhos-release 9 -d
sudo yum clean all && sudo yum clean metadata && sudo yum clean dbcache && sudo yum makecache
sudo yum -y update
sudo systemctl stop openstack-*
sudo systemctl stop neutron-*
openstack undercloud upgrade

*=* 15:20:29 *=*=*= " still ongoing... 
" (probably the yum clean line adds a couple mins but still seemed to take longer today) pingtest *=* 15:23:21 *=*=*= "tripleo.sh -- Overcloud pingtest, SUCCESS" after undercloud upgrade *=* 15:24:38 *=*=*= "setup osp8 repos for the aodh migration on the overcloud:" for i in $(nova list|grep ctlplane|awk -F' ' '{ print $12 }'|awk -F'=' '{ print $2 }'); do ssh heat-admin@$i "hostname; echo ''; sudo yum localinstall -y http://rhos-release.virt.bos.redhat.com/repos/rhos-release/rhos-release-latest.noarch.rpm ; sudo rhos-release 8-director -d ; echo '';"; done No need to apply compute hostname format already applied: [stack@instack ~]$ grep -n -A 3 'ComputeHostnameFormat:' /usr/share/openstack-tripleo-heat-templates/overcloud.yaml 818: ComputeHostnameFormat: 819- type: string 820- description: Format for Compute node hostnames 821- default: '%stackname%-compute-%index%' [stack@instack ~]$ for i in $(nova list|grep ctlplane|awk -F' ' '{ print $12 }'|awk -F'=' '{ print $2 }'); do ssh heat-admin@$i "hostname; echo ''; sudo ls -l /etc/yum.repos.d/; echo '';"; doneovercloud-compute-0.localdomain total 20 -rw-r--r--. 1 root root 358 Mar 3 16:36 redhat.repo -rw-r--r--. 1 root root 2097 Jul 14 12:25 rhos-release-8-director.repo -rw-r--r--. 1 root root 2277 Jul 14 12:25 rhos-release-8.repo -rw-r--r--. 1 root root 278 Jun 28 18:02 rhos-release.repo -rw-r--r--. 1 root root 1237 Jul 14 12:24 rhos-release-rhel-7.2.repo overcloud-controller-0.localdomain total 20 -rw-r--r--. 1 root root 358 Mar 3 16:36 redhat.repo -rw-r--r--. 1 root root 2097 Jul 14 12:26 rhos-release-8-director.repo -rw-r--r--. 1 root root 2277 Jul 14 12:26 rhos-release-8.repo -rw-r--r--. 1 root root 278 Jun 28 18:02 rhos-release.repo -rw-r--r--. 1 root root 1237 Jul 14 12:25 rhos-release-rhel-7.2.repo overcloud-controller-1.localdomain total 20 -rw-r--r--. 1 root root 358 Mar 3 16:36 redhat.repo -rw-r--r--. 1 root root 2097 Jul 14 12:27 rhos-release-8-director.repo -rw-r--r--. 
1 root root 2277 Jul 14 12:27 rhos-release-8.repo -rw-r--r--. 1 root root 278 Jun 28 18:02 rhos-release.repo -rw-r--r--. 1 root root 1237 Jul 14 12:26 rhos-release-rhel-7.2.repo overcloud-controller-2.localdomain total 20 -rw-r--r--. 1 root root 358 Mar 3 16:36 redhat.repo -rw-r--r--. 1 root root 2097 Jul 14 12:28 rhos-release-8-director.repo -rw-r--r--. 1 root root 2277 Jul 14 12:28 rhos-release-8.repo -rw-r--r--. 1 root root 278 Jun 28 18:02 rhos-release.repo -rw-r--r--. 1 root root 1237 Jul 14 12:27 rhos-release-rhel-7.2.repo *=* 15:29:37 *=*=*= "AODH MIGRATION:" openstack overcloud deploy --templates /usr/share/openstack-tripleo-heat-templates -e /usr/share/openstack-tripleo-heat-templates/overcloud-resource-registry-puppet.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/puppet-pacemaker.yaml --control-scale 3 --compute-scale 1 --libvirt-type qemu -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/net-single-nic-with-vlans.yaml -e network_env.yaml --ntp-server '0.fedora.pool.ntp.org' -e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-aodh.yaml 2016-07-11 12:05:19 [1]: SIGNAL_COMPLETE Unknown Stack overcloud UPDATE_COMPLETE *=* 15:39:57 *=*=*= " " 2016-07-14 12:39:30 [0]: SIGNAL_COMPLETE Unknown Stack overcloud UPDATE_COMPLETE Overcloud Endpoint: http://10.0.0.4:5000/v2.0 NO SERVICES DOWN *=* 15:42:23 *=*=*= "tripleo.sh -- Overcloud pingtest, SUCCESS" after aodh migration *=* 15:51:54 *=*=*= " manually apply keystone fixup@ " test matbu fixup for possible races in the keystone migration at https://review.openstack.org/#/c/338879/ sudo vim /usr/share/openstack-tripleo-heat-templates/extraconfig/tasks/major_upgrade_pacemaker_migrations.sh [stack@instack ~]$ diff /usr/share/openstack-tripleo-heat-templates/extraconfig/tasks/major_upgrade_pacemaker_migrations.sh 
/usr/share/openstack-tripleo-heat-templates/extraconfig/tasks/major_upgrade_pacemaker_migrations.sh.backup.orig 28,32d27 < # LP #1599798 < # We unmanage the httpd resource to make sure that pacemaker won't race < # with the keystone deletion/stopping during the CIB transaction that < # will take place later < pcs resource unmanage httpd-clone 60a56,60 > # We push the CIB after removing the keystone resource as we want > # to be sure that the httpd resource is untouched. Otherwise we risk > # httpd being restarted before keystone is stopped which would give > # us a conflicting listening port, because during this step httpd already > # has the keystone wsgi configuration but was not restarted 66,83d65 < < # Let's be 100% sure that the keystone resource is stopped and gone before < # we remanage the httpd resource later below. We cannot reuse check_resource < # as the resource might not exist already in which case the function would fail < tstart=$(date +%s) < while pcs status | grep -q keystone-clone; do < sleep 5 < tnow=$(date +%s) < if (( tnow-tstart > 600)) ; then < echo_error "ERROR: keystone failed to stop during migration" < exit 1 < fi < done < < # We re-manage the httpd resource now and make sure it is fully started < # so that a subsequent reload will not fail < pcs resource manage httpd-clone < check_resource httpd started 1800 *=* 15:55:24 *=*=*= "KEYSTONE MIGRATION:" openstack overcloud deploy --templates /usr/share/openstack-tripleo-heat-templates -e /usr/share/openstack-tripleo-heat-templates/overcloud-resource-registry-puppet.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/puppet-pacemaker.yaml --control-scale 3 --compute-scale 1 --libvirt-type qemu -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/net-single-nic-with-vlans.yaml -e network_env.yaml --ntp-server '0.fedora.pool.ntp.org' -e 
/usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-keystone-liberty-mitaka.yaml *=* 16:01:29 *=*=*= " " 92- openstack-keystone (systemd:openstack-keystone): ORPHANED Started[ overcloud-controller-0 overclou *=* 16:11:51 *=*=*= " " 2016-07-14 13:11:18 [overcloud-UpdateWorkflow-4m3ezwl5yxl6-KeystoneLibertyMitakaPostUpgradeDeployment-2p5w33jxxcj2]: CREATE_FAILED Resource CREATE failed: Error: resources[0]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 1 2016-07-14 13:11:18 [1]: SIGNAL_COMPLETE Unknown 2016-07-14 13:11:19 [1]: SIGNAL_COMPLETE Unknown 2016-07-14 13:11:20 [ControllerDeployment]: SIGNAL_COMPLETE Unknown 2016-07-14 13:11:21 [2]: SIGNAL_COMPLETE Unknown 2016-07-14 13:11:23 [2]: SIGNAL_COMPLETE Unknown Stack overcloud UPDATE_FAILED Deployment failed: Heat Stack update failed. [stack@instack ~]$ heat resource-list overcloud | grep -ni fail WARNING (shell) "heat resource-list" is deprecated, please use "openstack stack resource list" instead 32:| UpdateWorkflow | 70225b72-bf9f-40eb-94e2-829225338f65 | OS::TripleO::Tasks::UpdateWorkflow | UPDATE_FAILED | 2016-07-14T12:57:19 | [stack@instack ~]$ heat resource-show overcloud UpdateWorkflow | resource_status_reason | resources.UpdateWorkflow: Error: resources.KeystoneLibertyMitakaPostUpgradeDeployment.resources[0]: Deployment to server failed: deploy_status_code: Deployment exited with non-zero status code: 1 | *=* 16:12:33 *=*=*= " " Every 2.0s: pcs status | grep -ni stop -C2 Thu Jul 14 13:12:44 2016 74- Clone Set: httpd-clone [httpd] 75- Started: [ overcloud-controller-1 overcloud-controller-2 ] 76: Stopped: [ overcloud-controller-0 ] attach sos reports to bug https://bugzilla.redhat.com/show_bug.cgi?id=1348831
Hi o/ update on my testing of this today. FWIW I got through the keystone migration with heat saying UPDATE_COMPLETE and no stopped services as has previously been the case. Notes on my testing below for reference, but I included both https://review.openstack.org/#/c/342725/ and https://review.openstack.org/#/c/338879/2 copy/pasta notes on my env/what i did: --------------------------------------- *=* 11:26:05 *=*=*= " reset osp8 latest poodle, DEPLOY" openstack overcloud deploy --templates /usr/share/openstack-tripleo-heat-templates -e /usr/share/openstack-tripleo-heat-templates/overcloud-resource-registry-puppet.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/puppet-pacemaker.yaml --control-scale 3 --compute-scale 1 --libvirt-type qemu -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/net-single-nic-with-vlans.yaml -e network_env.yaml --ntp-server '0.fedora.pool.ntp.org' tripleo.sh -- Overcloud pingtest SUCCEEDED *=* 13:20:38 *=*=*= "openstack undercloud upgrade " sudo rm -rf /etc/yum.repos.d/* sudo rhos-release 9-director -d sudo rhos-release 9 -d sudo yum clean all && sudo yum clean metadata && sudo yum clean dbcache && sudo yum makecache sudo yum -y update sudo systemctl stop openstack-* sudo systemctl stop neutron-* *=* 15:03:35 *=*=*= " " openstack undercloud upgrade *=* 15:23:21 *=*=*= "tripleo.sh -- Overcloud pingtest, SUCCESS" after undercloud upgrade *=* 15:28:21 *=*=*= " " "setup osp8 repos for the aodh migration on the overcloud:" for i in $(nova list|grep ctlplane|awk -F' ' '{ print $12 }'|awk -F'=' '{ print $2 }'); do ssh heat-admin@$i "hostname; echo ''; sudo yum localinstall -y http://rhos-release.virt.bos.redhat.com/repos/rhos-release/rhos-release-latest.noarch.rpm ; sudo rhos-release 8-director -d ; echo '';"; done No need to apply compute hostname format already applied: [stack@instack ~]$ grep -n -A 3 'ComputeHostnameFormat:' 
/usr/share/openstack-tripleo-heat-templates/overcloud.yaml 818: ComputeHostnameFormat: 819- type: string 820- description: Format for Compute node hostnames 821- default: '%stackname%-compute-%index%' [stack@instack ~]$ for i in $(nova list|grep ctlplane|awk -F' ' '{ print $12 }'|awk -F'=' '{ print $2 }'); do ssh heat-admin@$i "hostname; echo ''; sudo ls -l /etc/yum.repos.d/; echo '';"; doneovercloud-compute-0.localdomain total 20 -rw-r--r--. 1 root root 358 Mar 3 16:36 redhat.repo -rw-r--r--. 1 root root 2097 Jul 14 12:25 rhos-release-8-director.repo -rw-r--r--. 1 root root 2277 Jul 14 12:25 rhos-release-8.repo -rw-r--r--. 1 root root 278 Jun 28 18:02 rhos-release.repo -rw-r--r--. 1 root root 1237 Jul 14 12:24 rhos-release-rhel-7.2.repo overcloud-controller-0.localdomain total 20 -rw-r--r--. 1 root root 358 Mar 3 16:36 redhat.repo -rw-r--r--. 1 root root 2097 Jul 14 12:26 rhos-release-8-director.repo -rw-r--r--. 1 root root 2277 Jul 14 12:26 rhos-release-8.repo -rw-r--r--. 1 root root 278 Jun 28 18:02 rhos-release.repo -rw-r--r--. 1 root root 1237 Jul 14 12:25 rhos-release-rhel-7.2.repo overcloud-controller-1.localdomain total 20 -rw-r--r--. 1 root root 358 Mar 3 16:36 redhat.repo -rw-r--r--. 1 root root 2097 Jul 14 12:27 rhos-release-8-director.repo -rw-r--r--. 1 root root 2277 Jul 14 12:27 rhos-release-8.repo -rw-r--r--. 1 root root 278 Jun 28 18:02 rhos-release.repo -rw-r--r--. 1 root root 1237 Jul 14 12:26 rhos-release-rhel-7.2.repo overcloud-controller-2.localdomain total 20 -rw-r--r--. 1 root root 358 Mar 3 16:36 redhat.repo -rw-r--r--. 1 root root 2097 Jul 14 12:28 rhos-release-8-director.repo -rw-r--r--. 1 root root 2277 Jul 14 12:28 rhos-release-8.repo -rw-r--r--. 1 root root 278 Jun 28 18:02 rhos-release.repo -rw-r--r--. 
1 root root 1237 Jul 14 12:27 rhos-release-rhel-7.2.repo *=* 15:31:43 *=*=*= "AODH MIGRATION:" openstack overcloud deploy --templates /usr/share/openstack-tripleo-heat-templates -e /usr/share/openstack-tripleo-heat-templates/overcloud-resource-registry-puppet.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/puppet-pacemaker.yaml --control-scale 3 --compute-scale 1 --libvirt-type qemu -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/net-single-nic-with-vlans.yaml -e network_env.yaml --ntp-server '0.fedora.pool.ntp.org' -e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-aodh.yaml 2016-07-18 12:41:04 [1]: SIGNAL_COMPLETE Unknown Stack overcloud UPDATE_COMPLETE Overcloud Endpoint: http://10.0.0.4:5000/v2.0 NO SERVICES DOWN *=* 15:50:30 *=*=*= "tripleo.sh -- Overcloud pingtest SUCCEEDED " AFTER aodh migration *=* 15:51:54 *=*=*= " manually apply keystone fixup@ " test matbu fixup for possible races in the keystone migration at https://review.openstack.org/#/c/338879/ sudo vim /usr/share/openstack-tripleo-heat-templates/extraconfig/tasks/major_upgrade_pacemaker_migrations.sh *=* 15:59:47 *=*=*= "apply openstack-core interleave https://review.openstack.org/#/c/342725/" [root@instack openstack-tripleo-heat-templates]# diff puppet/manifests/overcloud_controller_pacemaker.pp puppet/manifests/overcloud_controller_pacemaker.pp.ORIG 247c247 < clone_params => 'interleave=true', --- > clone_params => true, [root@instack openstack-tripleo-heat-templates]# [stack@instack ~]$ diff /usr/share/openstack-tripleo-heat-templates/extraconfig/tasks/major_upgrade_pacemaker_migrations.sh /usr/share/openstack-tripleo-heat-templates/extraconfig/tasks/major_upgrade_pacemaker_migrations.sh.ORIG 28,33d27 < # LP #1599798 < # We unmanage the httpd resource to make sure that pacemaker won't race < # with the keystone deletion/stopping during the CIB transaction that < 
# will take place later < pcs resource unmanage httpd-clone < 44c38 < $PCS resource create openstack-core ocf:heartbeat:Dummy --clone interleave=true --- > $PCS resource create openstack-core ocf:heartbeat:Dummy --clone 61a56,60 > # We push the CIB after removing the keystone resource as we want > # to be sure that the httpd resource is untouched. Otherwise we risk > # httpd being restarted before keystone is stopped which would give > # us a conflicting listening port, because during this step httpd already > # has the keystone wsgi configuration but was not restarted 67,85d65 < < # Let's be 100% sure that the keystone resource is stopped and gone before < # we remanage the httpd resource later below. We cannot reuse check_resource < # as the resource might not exist already in which case the function would fail < tstart=$(date +%s) < while pcs status | grep -q keystone-clone; do < sleep 5 < tnow=$(date +%s) < if (( tnow-tstart > 600)) ; then < echo_error "ERROR: keystone failed to stop during migration" < exit 1 < fi < done < < # We re-manage the httpd resource now and make sure it is fully started < # so that a subsequent reload will not fail < pcs resource manage httpd-clone < check_resource httpd started 1800 < [stack@instack ~]$ *=* 16:07:11 *=*=*= "there is no openstack-core resource before the migration? this is latest 8 poodle overcloud we are upgrading here. I will sanity check when i reset the env to vanila saved 8 state." 
[root@overcloud-controller-0 ~]# pcs status | grep core [root@overcloud-controller-0 ~]# *=* 16:08:43 *=*=*= "KEYSTONE MIGRATION:" openstack overcloud deploy --templates /usr/share/openstack-tripleo-heat-templates -e /usr/share/openstack-tripleo-heat-templates/overcloud-resource-registry-puppet.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/puppet-pacemaker.yaml --control-scale 3 --compute-scale 1 --libvirt-type qemu -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/net-single-nic-with-vlans.yaml -e network_env.yaml --ntp-server '0.fedora.pool.ntp.org' -e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-keystone-liberty-mitaka.yaml Jul 18 13:18:29 overcloud-controller-0.localdomain systemd[1]: Configuration file /run/systemd/system/openstack-ceilometer-notification.service.d/50-pacemaker.conf is marked world-inaccessible. This has no effect as configuration data is accessible via APIs without restrictions. Proceeding anyway. Jul 18 13:18:29 overcloud-controller-0.localdomain systemd[1]: Stopping OpenStack Identity Service (code-named Keystone)... Jul 18 13:18:29 overcloud-controller-0.localdomain systemd[1]: Stopped OpenStack Identity Service (code-named Keystone). *=* 16:28:37 *=*=*= " " 2016-07-18 13:26:51 [1]: SIGNAL_COMPLETE Unknown Stack overcloud UPDATE_COMPLETE NO SERVICES DOWN!!! \o/ [root@overcloud-controller-0 ~]# pcs status | grep core Clone Set: openstack-core-clone [openstack-core] "AI respond to https://bugzilla.redhat.com/show_bug.cgi?id=1348831#c15 and the reviews " cat > rhos-release-9.yaml << EOF parameter_defaults: UpgradeInitCommand: | set -e rpm -ivh http://rhos-release.virt.bos.redhat.com/repos/rhos-release/rhos-release-latest.noarch.rpm || true # rpm -i will return 1 if already installed #wise to remove any existing rhos-release-x repos, e.g. 
that you setup for the aodh and keystone migrations mv /etc/yum.repos.d/rhos-release* ~ || true rhos-release 9-director -d EOF *=* 16:41:39 *=*=*= " " "UPGRADE INIT:" openstack overcloud deploy --templates /usr/share/openstack-tripleo-heat-templates -e /usr/share/openstack-tripleo-heat-templates/overcloud-resource-registry-puppet.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/puppet-pacemaker.yaml --control-scale 3 --compute-scale 1 --libvirt-type qemu -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/net-single-nic-with-vlans.yaml -e network_env.yaml --ntp-server '0.fedora.pool.ntp.org' -e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-pacemaker-init.yaml -e rhos-release-9.yaml *=* 18:36:48 *=*=*= "update bug and review: " *=* 18:44:24 *=*=*= " tripleo.sh -- Overcloud pingtest SUCCEEDED" after upgrade init: migration
So in the last review iteration I went ahead and implemented the four-phase approach Andrew suggested in c#10. I'd appreciate any feedback or testing on this latest iteration. Thanks, Michele
Patch has been merged upstream
This race happened during the upgrade procedure from osp8 to osp9, correct? Any suggested steps to verify?
Hi Udi, yes, that is correct. Specifically, it happens when executing the keystone migration step. So I would say that if you can do the 8->9 upgrade, and the keystone step concludes successfully with keystone running under httpd via wsgi afterwards, we are good to go. cheers, Michele
openstack-tripleo-heat-templates-2.0.0-24.el7ost upgraded from 8 to 9 without any issues.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHEA-2016-1599.html