Bug 1348831 - Keystone WSGI migration: httpd temporarily collides on port binding
Summary: Keystone WSGI migration: httpd temporarily collides on port binding
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 9.0 (Mitaka)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ga
Target Release: 9.0 (Mitaka)
Assignee: Michele Baldessari
QA Contact: Leonid Natapov
URL:
Whiteboard:
Depends On:
Blocks: 1333977
Reported: 2016-06-22 07:52 UTC by Jiri Stransky
Modified: 2016-08-11 11:33 UTC (History)

Fixed In Version: openstack-tripleo-heat-templates-2.0.0-21.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-08-11 11:33:01 UTC
Target Upstream Version:


Attachments
pcs status (6.42 KB, text/plain), 2016-06-29 13:32 UTC, Jiri Stransky
corosync log from controller 0 (3.50 MB, text/plain), 2016-06-29 13:33 UTC, Jiri Stransky
httpd log from controller 0 (666.72 KB, text/x-vhdl), 2016-06-29 13:33 UTC, Jiri Stransky
httpd log from controller 1 -- ok (669.91 KB, text/x-vhdl), 2016-06-29 13:33 UTC, Jiri Stransky
pe-input corresponding to the restart transition (6.70 KB, application/x-bzip), 2016-07-07 10:32 UTC, Michele Baldessari
sosreport from controller 0 (11.17 MB, application/x-xz), 2016-07-14 13:50 UTC, Marios Andreou
sos report controller 1 (10.45 MB, application/x-xz), 2016-07-14 13:59 UTC, Marios Andreou
sos report from controller 2 (10.24 MB, application/x-xz), 2016-07-14 14:14 UTC, Marios Andreou


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2016:1599 normal SHIPPED_LIVE Red Hat OpenStack Platform 9 director Release Candidate Advisory 2016-08-11 15:25:37 UTC
OpenStack gerrit 338879 'None' 'MERGED' 'Remove two races during the L->M Keystone under httpd migration' 2019-11-29 09:22:20 UTC

Description Jiri Stransky 2016-06-22 07:52:25 UTC
Description of problem:

There might be a race condition in the currently proposed migration between shutting down the eventlet Keystone and starting up the WSGI one, which means that the cloud is down for longer than necessary (about 4.5 minutes in the log below, but that's a virt env, so things are generally slower). The cluster eventually recovered without any intervention.

Here's the httpd error log:

[Wed Jun 22 06:32:08.996441 2016] [mpm_prefork:notice] [pid 4039] AH00171: Graceful restart requested, doing restart
(98)Address already in use: AH00072: make_sock: could not bind to address 192.0.2.14:35357
[Wed Jun 22 06:32:09.110418 2016] [mpm_prefork:alert] [pid 4039] no listening sockets available, shutting down
[Wed Jun 22 06:32:09.110425 2016] [:emerg] [pid 4039] AH00019: Unable to open logs, exiting
[Wed Jun 22 06:36:46.084062 2016] [core:notice] [pid 6033] SELinux policy enabled; httpd running as context system_u:system_r:httpd_t:s0
[Wed Jun 22 06:36:46.086082 2016] [suexec:notice] [pid 6033] AH01232: suEXEC mechanism enabled (wrapper: /usr/sbin/suexec)
[Wed Jun 22 06:36:46.099634 2016] [auth_digest:notice] [pid 6033] AH01757: generating secret for digest authentication ...
[Wed Jun 22 06:36:46.104044 2016] [core:warn] [pid 6033] AH00098: pid file /etc/httpd/run/httpd.pid overwritten -- Unclean shutdown of previous Apache run?
[Wed Jun 22 06:36:46.109067 2016] [mpm_prefork:notice] [pid 6033] AH00163: Apache/2.4.6 (Red Hat Enterprise Linux) mod_wsgi/3.4 Python/2.7.5 configured -- resuming normal operations
[Wed Jun 22 06:36:46.109097 2016] [core:notice] [pid 6033] AH00094: Command line: '/usr/sbin/httpd -D FOREGROUND'
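
As a rough check of the outage window, the gap between the failed graceful restart (06:32:09) and the next successful httpd start (06:36:46) in the log above can be computed directly. This is only a sketch: `seconds_between` is a made-up helper name, GNU `date` is assumed, and sub-second precision is dropped.

```shell
# Seconds between two "YYYY-MM-DD HH:MM:SS" timestamps.
# Assumes GNU date (-d); both timestamps are read as UTC so that
# the local time zone and DST cannot skew the difference.
seconds_between() {
    local start end
    start=$(date -u -d "$1" +%s)
    end=$(date -u -d "$2" +%s)
    echo $(( end - start ))
}

# Timestamps taken from the httpd error log above (fractions dropped).
seconds_between "2016-06-22 06:32:09" "2016-06-22 06:36:46"
```

This prints 277, i.e. a little over four and a half minutes without httpd on this controller.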

Here's the stop time of the openstack-keystone service:

Jun 22 06:33:01 overcloud-controller-0.localdomain systemd[1]: Stopping OpenStack Identity Service (code-named Keystone)...
Jun 22 06:33:01 overcloud-controller-0.localdomain systemd[1]: Stopped OpenStack Identity Service (code-named Keystone).


Version-Release number of selected component (if applicable):

openstack-tripleo-heat-templates-2.0.0-11.el7ost.noarch
+ applied patch set 17 of https://review.openstack.org/#/c/302235/

Comment 1 Jiri Stransky 2016-06-29 13:32:02 UTC
We thought that the latest patch set of https://review.openstack.org/#/c/302235/, which got merged to stable/mitaka, had this issue fixed, and perhaps it partially does. However, there still seems to be some form of collision on one of the controllers, and I even noticed an orphaned openstack-keystone resource in pcs status, though only for a while before it disappeared:

 openstack-keystone     (systemd:openstack-keystone):    ORPHANED Started overcloud-controller-0


Workaround:

It's enough to just run `pcs resource cleanup` after the migration, to get httpd started on the single controller where it didn't start. (The migration itself doesn't fail because of this issue.)
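
To see whether the workaround is needed on a given deployment, the `pcs status` text can be scanned for nodes where the httpd clone is stopped. A minimal sketch, assuming the clone is named httpd-clone and that the output uses the usual `Clone Set:` / `Stopped: [ ... ]` layout; `httpd_stopped_nodes` is an illustrative name, not part of pcs or the templates.

```shell
# Print the nodes on which the httpd clone is reported as Stopped.
# Reads `pcs status` text on stdin; prints nothing if no copy is stopped.
httpd_stopped_nodes() {
    awk '
        # Every clone or master/slave header starts a new resource section.
        /Set: / { in_httpd = ($0 ~ /Clone Set: httpd-clone /); next }
        # Inside the httpd-clone section, keep only the node list.
        in_httpd && /Stopped: \[/ {
            sub(/.*Stopped: \[ */, "")
            sub(/ *\].*/, "")
            print
        }
    '
}
```

On a controller this would be used as `pcs status | httpd_stopped_nodes`; if it prints anything, `pcs resource cleanup` is the step from the workaround above.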

Comment 2 Jiri Stransky 2016-06-29 13:32:28 UTC
Created attachment 1173909 [details]
pcs status

Comment 3 Jiri Stransky 2016-06-29 13:33:00 UTC
Created attachment 1173910 [details]
corosync log from controller 0

Comment 4 Jiri Stransky 2016-06-29 13:33:32 UTC
Created attachment 1173911 [details]
httpd log from controller 0

Comment 5 Jiri Stransky 2016-06-29 13:33:53 UTC
Created attachment 1173912 [details]
httpd log from controller 1 -- ok

Comment 6 Andrew Beekhof 2016-07-07 06:35:07 UTC
We need this file, please:

Jun 29 13:15:43 [1531] overcloud-controller-0.localdomain    pengine:    error: process_pe_message:	Calculated Transition 31: /var/lib/pacemaker/pengine/pe-error-0.bz2

Comment 7 Michele Baldessari 2016-07-07 10:32:06 UTC
Created attachment 1177246 [details]
pe-input corresponding to the restart transition

So the attached file was obtained and it is the first transition after the httpd restart:
Jul 07 07:49:16 [1123] overcloud-controller-0    pengine:   notice: LogActions:	Restart httpd:0	(Started overcloud-controller-0)
Jul 07 07:49:16 [1123] overcloud-controller-0    pengine:   notice: process_pe_message:	Calculated Transition 155: /var/lib/pacemaker/pengine/pe-input-116.bz2

*Do* note that the error message was slightly different this time around. I still think it is the same underlying race, but the message is not the one about
binding ports:
Jul 07 07:51:14 overcloud-controller-0 systemd[1]: Unit httpd.service cannot be reloaded because it is inactive.
Jul 07 07:51:14 overcloud-controller-0 os-collect-config[2449]: [2016-07-07 07:51:14,367] (heat-config) [INFO] {"deploy_stdout": "", "deploy_stderr": "Job for httpd.service invalid.\n", "deploy_status_code": 1}
Jul 07 07:51:14 overcloud-controller-0 os-collect-config[2449]: [2016-07-07 07:51:14,368] (heat-config) [DEBUG] [2016-07-07 07:51:14,341] (heat-config) [INFO] deploy_server_id=65622106-1f58-40db-ab06-27e581868b20

Comment 8 Michele Baldessari 2016-07-07 11:12:22 UTC
Andrew, please ignore comment 7. It is a slightly separate issue and I know what it is.

Comment 9 Michele Baldessari 2016-07-07 12:50:39 UTC
So I looked at this more with Mathieu and Andrew this morning (thanks both for
your time btw.). Here's a brief recap:
"""
After adding the upgrade step to migrate keystone under httpd, we
left two small races in process:
1) The first race could result in the following error:
Graceful restart requested, doing restart (98)Address already in use: AH00072: make_sock: could not bind to address 192.0.2.14:35357

This is likely caused by removing the keystone resource and
changing constraints in a single pacemaker CIB transaction.
We are not guaranteed that pacemaker will first remove keystone
and then attempt to restart httpd due to the changed constraints.
To address this we unmanage the httpd resource before the constraint
changes and we remanage it later.

2) The second race is because after the cib-push we were not
guaranteed that the later upgrade step that reloads the httpd
configuration via 'systemctl reload httpd' was run after
httpd was started everywhere and we could get the following error:
07 07:51:14 overcloud-controller-0 systemd[1]: Unit httpd.service cannot be reloaded because it is inactive.

We add a check_resource httpd started after we remanage the httpd
resource, in order to guarantee that the httpd resource is up and
running at this point.
"""
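
Both fixes boil down to waiting for a cluster state before moving on. A generic polling helper for that could look like the sketch below. `wait_until` is a hypothetical name and is not the `check_resource` function from the templates; the timeout and interval are whatever the caller picks.

```shell
# wait_until TIMEOUT INTERVAL CMD [ARGS...]
# Re-run CMD every INTERVAL seconds until it succeeds.
# Returns 0 on success, 1 once TIMEOUT seconds have elapsed.
wait_until() {
    local timeout_s=$1 interval_s=$2 start now
    shift 2
    start=$(date +%s)
    until "$@"; do
        now=$(date +%s)
        if [ $(( now - start )) -ge "$timeout_s" ]; then
            echo "ERROR: timed out after ${timeout_s}s waiting for: $*" >&2
            return 1
        fi
        sleep "$interval_s"
    done
    return 0
}
```

For race 2, on a single node, this could guard the reload, for example `wait_until 1800 5 systemctl is-active --quiet httpd && systemctl reload httpd`. That only covers the local unit; the cluster-wide "started everywhere" check still needs pacemaker's view.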

Now for race 1), which is really what this BZ is about: we will test the approach mentioned here and in the review. But if you do hit this specific port binding issue, PLEASE collect the following file and upload it here.
On the DC, run the following:
$ grep -e pengine:.*bz2 -e "LogActions.*Restart httpd" /var/log/cluster/corosync.log

and grab the first .bz2 file that is listed after the first "Restart httpd" line
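
Picking the right file by eye can be scripted. A sketch, assuming the corosync log line formats shown in comment 7; `first_pe_after_restart` is an illustrative helper name.

```shell
# Print the path of the first pengine .bz2 file logged after the first
# "Restart httpd" action. Reads corosync.log text on stdin.
first_pe_after_restart() {
    awk '
        # Remember that the first httpd restart action has been seen.
        !seen && /LogActions:.*Restart[ \t]+httpd/ { seen = 1; next }
        # The transition file path is the last field of the pengine line.
        seen && /pengine:.*\.bz2/ { print $NF; exit }
    '
}
```

On the DC this would be run as `first_pe_after_restart < /var/log/cluster/corosync.log`.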

Mathieu and I will test this change this afternoon and will report back here

Comment 10 Andrew Beekhof 2016-07-11 03:08:06 UTC
In the file you attached, the restarts are caused by:

 Clone Set: openstack-core-clone [openstack-core]
     Stopped: [ overcloud-controller-0 ]

Removing this bogus restart would presumably go a long way towards addressing the problem.

I suggest a phased approach:

1. Create openstack-core-clone AND wait for it to get started
2. Update all the constraints to point to openstack-core-clone instead of keystone
3. Delete keystone AND wait for it to be stopped
4. Update the httpd resource and restart it

Comment 11 Michele Baldessari 2016-07-11 09:19:55 UTC
Discussed this further with Andrew. For now the unmanage httpd approach is also
okay. Let's keep an eye on any other failure reports from QE/CI in any case.

Comment 12 Marios Andreou 2016-07-14 13:50:14 UTC
Created attachment 1179862 [details]
sosreport from controller 0

Comment 13 Marios Andreou 2016-07-14 13:59:40 UTC
Created attachment 1179871 [details]
sos report controller 1

Comment 14 Marios Andreou 2016-07-14 14:14:06 UTC
Created attachment 1179877 [details]
sos report from controller 2

Comment 15 Marios Andreou 2016-07-14 14:39:25 UTC
 Hi @bandini and @matbu - I tried the review at https://review.openstack.org/#/c/338879 again today, and this time it ends up with UPDATE_FAILED from heat. As we have already discussed, the issues I am seeing may not be caused by /#/c/338879 so I don't think we should block on what I am seeing. But I can say it also doesn't fix the issues I am seeing.
 
 As requested on the review, I attach the sos reports from the controllers here (comments #12, #13 and #14), taken after the heat stack update failed. I see httpd down on controller 0. A pcs resource cleanup does help and I can move on after doing that.
 
 Verbose copy/pasta notes on what/how I ran the keystone migration:
 


*=* 14:40:55 *=*=*= "DEPLOY"
openstack overcloud deploy --templates /usr/share/openstack-tripleo-heat-templates -e /usr/share/openstack-tripleo-heat-templates/overcloud-resource-registry-puppet.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/puppet-pacemaker.yaml  --control-scale 3 --compute-scale 1 --libvirt-type qemu -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/net-single-nic-with-vlans.yaml -e network_env.yaml --ntp-server '0.fedora.pool.ntp.org'

tripleo.sh -- Overcloud pingtest SUCCEEDED




*=* 14:56:42 *=*=*= "openstack undercloud upgrade " 


        sudo rm -rf /etc/yum.repos.d/*
        sudo rhos-release 9-director -d
        sudo rhos-release 9 -d
        sudo yum clean all && sudo yum clean metadata && sudo yum clean dbcache && sudo yum makecache
        sudo yum -y update
        sudo systemctl stop openstack-*
        sudo systemctl stop neutron-*
        openstack undercloud upgrade
*=* 15:20:29 *=*=*= " still ongoing... " (probably the yum clean line adds a couple mins but still seemed to take longer today)
pingtest 
                
*=* 15:23:21 *=*=*= "tripleo.sh -- Overcloud pingtest, SUCCESS"  after undercloud upgrade
        
        

*=* 15:24:38 *=*=*= "setup osp8 repos for the aodh migration on the overcloud:"

for i in $(nova list|grep ctlplane|awk -F' ' '{ print $12 }'|awk -F'=' '{ print $2 }'); do ssh heat-admin@$i "hostname; echo ''; sudo yum localinstall -y http://rhos-release.virt.bos.redhat.com/repos/rhos-release/rhos-release-latest.noarch.rpm ; sudo rhos-release 8-director -d ; echo '';"; done
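
The loop above depends on the address being the 12th whitespace-separated field of `nova list`. A slightly more tolerant sketch matches the `ctlplane=<address>` token itself. `ctlplane_ips` is a made-up helper name, and only IPv4 addresses are assumed.

```shell
# Print one ctlplane IPv4 address per line from `nova list` output on stdin.
# Matches the "ctlplane=<address>" token wherever it sits in the row.
ctlplane_ips() {
    grep -o 'ctlplane=[0-9.]*' | cut -d= -f2
}
```

It would slot into the same loop, e.g. `for i in $(nova list | ctlplane_ips); do ssh heat-admin@$i hostname; done`.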


No need to apply compute hostname format already applied:

[stack@instack ~]$ grep -n -A 3 'ComputeHostnameFormat:'  /usr/share/openstack-tripleo-heat-templates/overcloud.yaml
818:  ComputeHostnameFormat:
819-    type: string
820-    description: Format for Compute node hostnames
821-    default: '%stackname%-compute-%index%'


[stack@instack ~]$ for i in $(nova list|grep ctlplane|awk -F' ' '{ print $12 }'|awk -F'=' '{ print $2 }'); do ssh heat-admin@$i "hostname; echo ''; sudo ls -l /etc/yum.repos.d/; echo '';"; done
overcloud-compute-0.localdomain

total 20
-rw-r--r--. 1 root root  358 Mar  3 16:36 redhat.repo
-rw-r--r--. 1 root root 2097 Jul 14 12:25 rhos-release-8-director.repo
-rw-r--r--. 1 root root 2277 Jul 14 12:25 rhos-release-8.repo
-rw-r--r--. 1 root root  278 Jun 28 18:02 rhos-release.repo
-rw-r--r--. 1 root root 1237 Jul 14 12:24 rhos-release-rhel-7.2.repo

overcloud-controller-0.localdomain

total 20
-rw-r--r--. 1 root root  358 Mar  3 16:36 redhat.repo
-rw-r--r--. 1 root root 2097 Jul 14 12:26 rhos-release-8-director.repo
-rw-r--r--. 1 root root 2277 Jul 14 12:26 rhos-release-8.repo
-rw-r--r--. 1 root root  278 Jun 28 18:02 rhos-release.repo
-rw-r--r--. 1 root root 1237 Jul 14 12:25 rhos-release-rhel-7.2.repo

overcloud-controller-1.localdomain

total 20
-rw-r--r--. 1 root root  358 Mar  3 16:36 redhat.repo
-rw-r--r--. 1 root root 2097 Jul 14 12:27 rhos-release-8-director.repo
-rw-r--r--. 1 root root 2277 Jul 14 12:27 rhos-release-8.repo
-rw-r--r--. 1 root root  278 Jun 28 18:02 rhos-release.repo
-rw-r--r--. 1 root root 1237 Jul 14 12:26 rhos-release-rhel-7.2.repo

overcloud-controller-2.localdomain

total 20
-rw-r--r--. 1 root root  358 Mar  3 16:36 redhat.repo
-rw-r--r--. 1 root root 2097 Jul 14 12:28 rhos-release-8-director.repo
-rw-r--r--. 1 root root 2277 Jul 14 12:28 rhos-release-8.repo
-rw-r--r--. 1 root root  278 Jun 28 18:02 rhos-release.repo
-rw-r--r--. 1 root root 1237 Jul 14 12:27 rhos-release-rhel-7.2.repo


*=* 15:29:37 *=*=*= "AODH MIGRATION:"
 
openstack overcloud deploy --templates /usr/share/openstack-tripleo-heat-templates -e /usr/share/openstack-tripleo-heat-templates/overcloud-resource-registry-puppet.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/puppet-pacemaker.yaml  --control-scale 3 --compute-scale 1 --libvirt-type qemu -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/net-single-nic-with-vlans.yaml -e network_env.yaml --ntp-server '0.fedora.pool.ntp.org' -e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-aodh.yaml
2016-07-11 12:05:19 [1]: SIGNAL_COMPLETE Unknown
Stack overcloud UPDATE_COMPLETE

*=* 15:39:57 *=*=*= " " 2016-07-14 12:39:30 [0]: SIGNAL_COMPLETE Unknown
Stack overcloud UPDATE_COMPLETE
Overcloud Endpoint: http://10.0.0.4:5000/v2.0

NO SERVICES DOWN

*=* 15:42:23 *=*=*= "tripleo.sh -- Overcloud pingtest, SUCCESS" after aodh migration 

*=* 15:51:54 *=*=*= " manually apply keystone fixup@ " 

test matbu fixup for possible races in the keystone migration
at https://review.openstack.org/#/c/338879/
sudo vim /usr/share/openstack-tripleo-heat-templates/extraconfig/tasks/major_upgrade_pacemaker_migrations.sh

[stack@instack ~]$ diff  /usr/share/openstack-tripleo-heat-templates/extraconfig/tasks/major_upgrade_pacemaker_migrations.sh /usr/share/openstack-tripleo-heat-templates/extraconfig/tasks/major_upgrade_pacemaker_migrations.sh.backup.orig
28,32d27
<         # LP #1599798
<         # We unmanage the httpd resource to make sure that pacemaker won't race
<         # with the keystone deletion/stopping during the CIB transaction that
<         # will take place later
<         pcs resource unmanage httpd-clone 
60a56,60
>         # We push the CIB after removing the keystone resource as we want
>         # to be sure that the httpd resource is untouched. Otherwise we risk
>         # httpd being restarted before keystone is stopped which would give
>         # us a conflicting listening port, because during this step httpd already
>         # has the keystone wsgi configuration but was not restarted
66,83d65
< 
<         # Let's be 100% sure that the keystone resource is stopped and gone before
<         # we remanage the httpd resource later below. We cannot reuse check_resource
<         # as the resource might not exist already in which case the function would fail
<         tstart=$(date +%s)
<         while pcs status | grep -q keystone-clone; do
<             sleep 5
<             tnow=$(date +%s)
<             if (( tnow-tstart > 600)) ; then
<                 echo_error "ERROR: keystone failed to stop during migration"
<                 exit 1
<             fi
<         done
< 
<         # We re-manage the httpd resource now and make sure it is fully started
<         # so that a subsequent reload will not fail
<         pcs resource manage httpd-clone
<         check_resource httpd started 1800

*=* 15:55:24 *=*=*=  "KEYSTONE MIGRATION:" 
 openstack overcloud deploy --templates /usr/share/openstack-tripleo-heat-templates -e /usr/share/openstack-tripleo-heat-templates/overcloud-resource-registry-puppet.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/puppet-pacemaker.yaml  --control-scale 3 --compute-scale 1 --libvirt-type qemu -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/net-single-nic-with-vlans.yaml -e network_env.yaml --ntp-server '0.fedora.pool.ntp.org' -e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-keystone-liberty-mitaka.yaml 


*=* 16:01:29 *=*=*= " " 
92- openstack-keystone  (systemd:openstack-keystone):    ORPHANED Started[ overcloud-controller-0 overclou


*=* 16:11:51 *=*=*= " " 
2016-07-14 13:11:18 [overcloud-UpdateWorkflow-4m3ezwl5yxl6-KeystoneLibertyMitakaPostUpgradeDeployment-2p5w33jxxcj2]: CREATE_FAILED Resource CREATE failed: Error: resources[0]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 1
2016-07-14 13:11:18 [1]: SIGNAL_COMPLETE Unknown
2016-07-14 13:11:19 [1]: SIGNAL_COMPLETE Unknown
2016-07-14 13:11:20 [ControllerDeployment]: SIGNAL_COMPLETE Unknown
2016-07-14 13:11:21 [2]: SIGNAL_COMPLETE Unknown
2016-07-14 13:11:23 [2]: SIGNAL_COMPLETE Unknown
Stack overcloud UPDATE_FAILED
Deployment failed:  Heat Stack update failed.

[stack@instack ~]$ heat resource-list overcloud | grep -ni fail
WARNING (shell) "heat resource-list" is deprecated, please use "openstack stack resource list" instead
32:| UpdateWorkflow                            | 70225b72-bf9f-40eb-94e2-829225338f65          | OS::TripleO::Tasks::UpdateWorkflow                | UPDATE_FAILED   | 2016-07-14T12:57:19 |
[stack@instack ~]$ heat resource-show overcloud UpdateWorkflow
| resource_status_reason | resources.UpdateWorkflow: Error: resources.KeystoneLibertyMitakaPostUpgradeDeployment.resources[0]: Deployment to server failed: deploy_status_code: Deployment exited with non-zero status code: 1 |

*=* 16:12:33 *=*=*= " " 
Every 2.0s: pcs status | grep -ni stop -C2                                        Thu Jul 14 13:12:44 2016

74- Clone Set: httpd-clone [httpd]
75-     Started: [ overcloud-controller-1 overcloud-controller-2 ]
76:     Stopped: [ overcloud-controller-0 ]

attach sos reports to bug https://bugzilla.redhat.com/show_bug.cgi?id=1348831

Comment 16 Marios Andreou 2016-07-18 16:04:00 UTC
Hi o/ update on my testing of this today. FWIW I got through the keystone migration with heat saying UPDATE_COMPLETE and no stopped services as has previously been the case. Notes on my testing below for reference, but I included both https://review.openstack.org/#/c/342725/ and https://review.openstack.org/#/c/338879/2 

copy/pasta notes on my env/what i did:
---------------------------------------
*=* 11:26:05 *=*=*=  " reset osp8 latest poodle, DEPLOY"
openstack overcloud deploy --templates /usr/share/openstack-tripleo-heat-templates -e /usr/share/openstack-tripleo-heat-templates/overcloud-resource-registry-puppet.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/puppet-pacemaker.yaml  --control-scale 3 --compute-scale 1 --libvirt-type qemu -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/net-single-nic-with-vlans.yaml -e network_env.yaml --ntp-server '0.fedora.pool.ntp.org'

tripleo.sh -- Overcloud pingtest SUCCEEDED

*=* 13:20:38 *=*=*= "openstack undercloud upgrade " 

        sudo rm -rf /etc/yum.repos.d/*
        sudo rhos-release 9-director -d
        sudo rhos-release 9 -d
        sudo yum clean all && sudo yum clean metadata && sudo yum clean dbcache && sudo yum makecache
        sudo yum -y update
        sudo systemctl stop openstack-*
        sudo systemctl stop neutron-*
     *=* 15:03:35 *=*=*= " "    openstack undercloud upgrade
                
*=* 15:23:21 *=*=*= "tripleo.sh -- Overcloud pingtest, SUCCESS"  after undercloud upgrade
   
*=* 15:28:21 *=*=*= " "  "setup osp8 repos for the aodh migration on the overcloud:"

for i in $(nova list|grep ctlplane|awk -F' ' '{ print $12 }'|awk -F'=' '{ print $2 }'); do ssh heat-admin@$i "hostname; echo ''; sudo yum localinstall -y http://rhos-release.virt.bos.redhat.com/repos/rhos-release/rhos-release-latest.noarch.rpm ; sudo rhos-release 8-director -d ; echo '';"; done

No need to apply compute hostname format already applied:

[stack@instack ~]$ grep -n -A 3 'ComputeHostnameFormat:'  /usr/share/openstack-tripleo-heat-templates/overcloud.yaml
818:  ComputeHostnameFormat:
819-    type: string
820-    description: Format for Compute node hostnames
821-    default: '%stackname%-compute-%index%'


[stack@instack ~]$ for i in $(nova list|grep ctlplane|awk -F' ' '{ print $12 }'|awk -F'=' '{ print $2 }'); do ssh heat-admin@$i "hostname; echo ''; sudo ls -l /etc/yum.repos.d/; echo '';"; done
overcloud-compute-0.localdomain

total 20
-rw-r--r--. 1 root root  358 Mar  3 16:36 redhat.repo
-rw-r--r--. 1 root root 2097 Jul 14 12:25 rhos-release-8-director.repo
-rw-r--r--. 1 root root 2277 Jul 14 12:25 rhos-release-8.repo
-rw-r--r--. 1 root root  278 Jun 28 18:02 rhos-release.repo
-rw-r--r--. 1 root root 1237 Jul 14 12:24 rhos-release-rhel-7.2.repo

overcloud-controller-0.localdomain

total 20
-rw-r--r--. 1 root root  358 Mar  3 16:36 redhat.repo
-rw-r--r--. 1 root root 2097 Jul 14 12:26 rhos-release-8-director.repo
-rw-r--r--. 1 root root 2277 Jul 14 12:26 rhos-release-8.repo
-rw-r--r--. 1 root root  278 Jun 28 18:02 rhos-release.repo
-rw-r--r--. 1 root root 1237 Jul 14 12:25 rhos-release-rhel-7.2.repo

overcloud-controller-1.localdomain

total 20
-rw-r--r--. 1 root root  358 Mar  3 16:36 redhat.repo
-rw-r--r--. 1 root root 2097 Jul 14 12:27 rhos-release-8-director.repo
-rw-r--r--. 1 root root 2277 Jul 14 12:27 rhos-release-8.repo
-rw-r--r--. 1 root root  278 Jun 28 18:02 rhos-release.repo
-rw-r--r--. 1 root root 1237 Jul 14 12:26 rhos-release-rhel-7.2.repo

overcloud-controller-2.localdomain

total 20
-rw-r--r--. 1 root root  358 Mar  3 16:36 redhat.repo
-rw-r--r--. 1 root root 2097 Jul 14 12:28 rhos-release-8-director.repo
-rw-r--r--. 1 root root 2277 Jul 14 12:28 rhos-release-8.repo
-rw-r--r--. 1 root root  278 Jun 28 18:02 rhos-release.repo
-rw-r--r--. 1 root root 1237 Jul 14 12:27 rhos-release-rhel-7.2.repo

*=* 15:31:43 *=*=*=  "AODH MIGRATION:"
 
openstack overcloud deploy --templates /usr/share/openstack-tripleo-heat-templates -e /usr/share/openstack-tripleo-heat-templates/overcloud-resource-registry-puppet.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/puppet-pacemaker.yaml  --control-scale 3 --compute-scale 1 --libvirt-type qemu -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/net-single-nic-with-vlans.yaml -e network_env.yaml --ntp-server '0.fedora.pool.ntp.org' -e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-aodh.yaml
2016-07-18 12:41:04 [1]: SIGNAL_COMPLETE Unknown
Stack overcloud UPDATE_COMPLETE
Overcloud Endpoint: http://10.0.0.4:5000/v2.0

NO SERVICES DOWN
*=* 15:50:30 *=*=*= "tripleo.sh -- Overcloud pingtest SUCCEEDED "  AFTER aodh migration

*=* 15:51:54 *=*=*= " manually apply keystone fixup@ " 

test matbu fixup for possible races in the keystone migration
at https://review.openstack.org/#/c/338879/
sudo vim /usr/share/openstack-tripleo-heat-templates/extraconfig/tasks/major_upgrade_pacemaker_migrations.sh

*=* 15:59:47 *=*=*= "apply openstack-core interleave https://review.openstack.org/#/c/342725/" 


[root@instack openstack-tripleo-heat-templates]# diff  puppet/manifests/overcloud_controller_pacemaker.pp puppet/manifests/overcloud_controller_pacemaker.pp.ORIG
247c247
<         clone_params   => 'interleave=true',
---
>         clone_params   => true,
[root@instack openstack-tripleo-heat-templates]# 


[stack@instack ~]$ diff  /usr/share/openstack-tripleo-heat-templates/extraconfig/tasks/major_upgrade_pacemaker_migrations.sh /usr/share/openstack-tripleo-heat-templates/extraconfig/tasks/major_upgrade_pacemaker_migrations.sh.ORIG 
28,33d27
<         # LP #1599798
<         # We unmanage the httpd resource to make sure that pacemaker won't race
<         # with the keystone deletion/stopping during the CIB transaction that
<         # will take place later
<         pcs resource unmanage httpd-clone
< 
44c38
<         $PCS resource create openstack-core ocf:heartbeat:Dummy --clone interleave=true
---
>         $PCS resource create openstack-core ocf:heartbeat:Dummy --clone
61a56,60
>         # We push the CIB after removing the keystone resource as we want
>         # to be sure that the httpd resource is untouched. Otherwise we risk
>         # httpd being restarted before keystone is stopped which would give
>         # us a conflicting listening port, because during this step httpd already
>         # has the keystone wsgi configuration but was not restarted
67,85d65
< 
<         # Let's be 100% sure that the keystone resource is stopped and gone before
<         # we remanage the httpd resource later below. We cannot reuse check_resource
<         # as the resource might not exist already in which case the function would fail
<         tstart=$(date +%s)
<         while pcs status | grep -q keystone-clone; do
<             sleep 5
<             tnow=$(date +%s)
<             if (( tnow-tstart > 600)) ; then
<                 echo_error "ERROR: keystone failed to stop during migration"
<                 exit 1
<             fi
<         done
< 
<         # We re-manage the httpd resource now and make sure it is fully started
<         # so that a subsequent reload will not fail
<         pcs resource manage httpd-clone
<         check_resource httpd started 1800
< 
[stack@instack ~]$ 


*=* 16:07:11 *=*=*= "there is no openstack-core resource before the migration? this is the latest 8 poodle overcloud we are upgrading here. I will sanity check when I reset the env to the vanilla saved 8 state." 
[root@overcloud-controller-0 ~]# pcs status | grep core
[root@overcloud-controller-0 ~]# 


*=* 16:08:43 *=*=*= "KEYSTONE MIGRATION:" 
 openstack overcloud deploy --templates /usr/share/openstack-tripleo-heat-templates -e /usr/share/openstack-tripleo-heat-templates/overcloud-resource-registry-puppet.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/puppet-pacemaker.yaml  --control-scale 3 --compute-scale 1 --libvirt-type qemu -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/net-single-nic-with-vlans.yaml -e network_env.yaml --ntp-server '0.fedora.pool.ntp.org' -e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-keystone-liberty-mitaka.yaml 


Jul 18 13:18:29 overcloud-controller-0.localdomain systemd[1]: Configuration file /run/systemd/system/openstack-ceilometer-notification.service.d/50-pacemaker.conf is marked world-inaccessible. This has no effect as configuration data is accessible via APIs without restrictions. Proceeding anyway.
Jul 18 13:18:29 overcloud-controller-0.localdomain systemd[1]: Stopping OpenStack Identity Service (code-named Keystone)...
Jul 18 13:18:29 overcloud-controller-0.localdomain systemd[1]: Stopped OpenStack Identity Service (code-named Keystone).

*=* 16:28:37 *=*=*= " " 2016-07-18 13:26:51 [1]: SIGNAL_COMPLETE Unknown
Stack overcloud UPDATE_COMPLETE

NO SERVICES DOWN!!! \o/

[root@overcloud-controller-0 ~]# pcs status | grep core
 Clone Set: openstack-core-clone [openstack-core]

"AI respond to https://bugzilla.redhat.com/show_bug.cgi?id=1348831#c15  and the reviews "

cat > rhos-release-9.yaml << EOF
parameter_defaults:
  UpgradeInitCommand: |
    set -e
    rpm -ivh http://rhos-release.virt.bos.redhat.com/repos/rhos-release/rhos-release-latest.noarch.rpm || true  # rpm -i will return 1 if already installed
    #wise to remove any existing rhos-release-x repos, e.g. that you setup for the aodh and keystone migrations
    mv /etc/yum.repos.d/rhos-release* ~ || true
    rhos-release 9-director -d
EOF

*=* 16:41:39 *=*=*= " " "UPGRADE INIT:" 
openstack overcloud deploy --templates /usr/share/openstack-tripleo-heat-templates -e /usr/share/openstack-tripleo-heat-templates/overcloud-resource-registry-puppet.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/puppet-pacemaker.yaml  --control-scale 3 --compute-scale 1 --libvirt-type qemu -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/net-single-nic-with-vlans.yaml -e network_env.yaml --ntp-server '0.fedora.pool.ntp.org' -e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-pacemaker-init.yaml -e rhos-release-9.yaml


*=* 18:36:48 *=*=*= "update bug and review:  " 


*=* 18:44:24 *=*=*= " tripleo.sh -- Overcloud pingtest SUCCEEDED" after upgrade init: migration

Comment 17 Michele Baldessari 2016-07-19 07:55:11 UTC
So in the last review iteration I went ahead and implemented the four-phase
approach Andrew suggested in comment 10. I'd appreciate any feedback or testing on
this latest iteration.

Thanks,
Michele

Comment 20 Michele Baldessari 2016-07-25 14:11:06 UTC
Patch has been merged upstream

Comment 22 Udi Shkalim 2016-07-31 13:37:59 UTC
This race happened during upgrade procedure from osp8 to osp9, correct?
Any suggested steps to verify?

Comment 23 Michele Baldessari 2016-08-01 12:48:34 UTC
Hi Udi,

Yes, that is correct. Specifically, it happens when executing the keystone
migration step. So I would say: if you can do the 8->9 upgrade, the keystone
step concludes successfully, and keystone is running under httpd via WSGI afterwards, we are good to go.

cheers,
Michele
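
One way to check the last condition on a controller is to look at which process owns the Keystone ports. A sketch, assuming ports 5000 (from the overcloud endpoint in comment 15) and 35357 (from the bind error in the description), and the usual column layout of `ss -tlnp`; `listeners_on_port` is a hypothetical helper.

```shell
# listeners_on_port PORT
# Reads `ss -tlnp` output on stdin and prints the name of the first process
# listed for each listening socket on PORT (duplicates removed).
listeners_on_port() {
    awk -v port="$1" '
        # Column 4 of `ss -tln` is the local address:port.
        $1 == "LISTEN" && $4 ~ (":" port "$") {
            if (match($0, /users:\(\("[^"]+"/)) {
                name = substr($0, RSTART, RLENGTH)
                sub(/^users:\(\("/, "", name)
                sub(/"$/, "", name)
                print name
            }
        }
    ' | sort -u
}
```

Run as root (so that `ss -p` can show process names), `ss -tlnp | listeners_on_port 35357` would be expected to print httpd once Keystone is served through mod_wsgi.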

Comment 25 Leonid Natapov 2016-08-07 13:44:37 UTC
openstack-tripleo-heat-templates-2.0.0-24.el7ost

upgraded from 8 to 9 without any issues.

Comment 27 errata-xmlrpc 2016-08-11 11:33:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-1599.html

