Description of problem:
In an IPv6 + SSL environment, live migration fails after the 7.3 -> 8 upgrade.

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-0.8.14-7.el7ost.noarch

How reproducible:
Seen on one environment.

Steps to Reproduce:
1. Deploy using 7.3:

export THT=~/templates/my-overcloud-7.3
openstack overcloud deploy --templates $THT \
-e $THT/environments/storage-environment.yaml \
-e $THT/environments/network-isolation-v6.yaml \
-e ~/templates/network-environment-7.3-v6.yaml \
-e ~/templates/enable-tls.yaml \
-e ~/templates/inject-trust-anchor.yaml \
--control-scale 3 \
--compute-scale 1 \
--ceph-storage-scale 2 \
--ntp-server clock.redhat.com \
--libvirt-type qemu

2. Run some instances with volumes attached on the deployed cloud.
3. Upgrade the undercloud.
4. Upgrade the overcloud with workarounds for BZ#1324739 and BZ#1324691.
5. Scale out with an additional compute node and an additional Ceph node:

export THT=~/templates/my-overcloud-8.0
openstack overcloud deploy --templates $THT \
-e $THT/environments/storage-environment.yaml \
-e $THT/environments/network-isolation-v6.yaml \
-e ~/templates/network-environment-8.0-v6.yaml \
-e ~/templates/enable-tls.yaml \
-e ~/templates/inject-trust-anchor.yaml \
-e ~/templates/hostname-wa.yaml \
--control-scale 3 \
--compute-scale 2 \
--ceph-storage-scale 3 \
--ntp-server clock.redhat.com \
--libvirt-type qemu

6. Live migrate an instance to the compute node added in step 5.

Actual results:

stack@instack:~>>> nova live-migration stack01-vm03-eylyi5wp2qx2-my_instance-ob5ab2y2ivry overcloud-compute-1.localdomain
/usr/lib/python2.7/site-packages/requests/packages/urllib3/connection.py:251: SecurityWarning: Certificate has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)
  SecurityWarning
/usr/lib/python2.7/site-packages/requests/packages/urllib3/connection.py:251: SecurityWarning: Certificate has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)
  SecurityWarning
ERROR (ConnectFailure): Unable to establish connection to https://[2001:db8:fd00:1000::10]:13774/v2/4a257a50297344e894f21c358c65bf58/servers/aa56e12a-b58d-4cb3-85a0-108a3b42fc4d/action

Expected results:
Live migration completes ok.

Additional info:
The nova compute log shows the following error:

2016-04-13 13:11:56.161 24179 ERROR nova.volume.cinder [req-510cc26e-b2f8-4ee5-81ba-7b02bf8586fe 1cbe81c519c245b791bee9be7ff1b159 4a257a50297344e894f21c358c65bf58 - - -] Connection between volume 200dc6f2-9141-4964-bd1c-868841f700d8 and host overcloud-compute-0.localdomain might have succeeded, but attempt to terminate connection has failed. Validate the connection and determine if manual cleanup is needed. Error: Gateway Time-out (HTTP 504) Code: 504.
Adding some details here: I tried the same scenario on a fresh 8 install with 1 compute, then scaled out with an additional compute and live migration completed fine.
The working environment with fresh 8 install, was it backed by Ceph too?

I don't have a root cause pinned down, but posting more debugging info:

The full stack trace shows that the error was triggered within the check_can_live_migrate_source method in nova, specifically when executing initialize_connection in cinderclient:

http://fpaste.org/355055/62068146/raw/

(The errors mentioning check_can_live_migrate_source can be found on both compute-0 and compute-1.)

Inspecting the cinder-api logs, it seems that haproxy returned the 504 code before cinder-api got a chance to respond, but the response from cinder-api would have been an error anyway:

http://fpaste.org/355057/60562306/raw/
I found something weird when doing the live migration:
http://paste.openstack.org/show/Ya7G5BVmMsiZhSq6Wbc8/

It is related to this change:
https://github.com/openstack/tripleo-heat-templates/commit/fd0b25b010db428c450b99b50ff3a0d60d263005

I think this commit is not backward compatible with the cinder volumes we created before. cinder service-list is showing 2 services, while it should show only one. I think we need to migrate volumes from the old one to the new one, with a MySQL operation (or maybe using the Cinder API?).

That, I think, is the root issue.
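If the volumes do need to be re-pointed at the new service, Cinder ships a management command for that, so a hand-written MySQL update may not be necessary. What follows is only a minimal sketch, assuming `cinder-manage volume update_host` is available on the controller in this release. The helper name and the OLD_HOST / NEW_HOST arguments are placeholders for illustration, not values confirmed in this bug, and the helper only prints the command so it can be reviewed before anything is changed:

```shell
#!/bin/bash
# Hypothetical helper: print (do not run) the cinder-manage invocation that
# would re-home volumes from one cinder-volume host string to another.
# OLD_HOST / NEW_HOST are placeholders; take the real values from
# `cinder service-list` on the affected cloud.
print_update_host_cmd() {
    local old_host="$1"
    local new_host="$2"
    if [ -z "$old_host" ] || [ -z "$new_host" ]; then
        echo "usage: print_update_host_cmd OLD_HOST NEW_HOST" >&2
        return 1
    fi
    printf 'cinder-manage volume update_host --currenthost %s --newhost %s\n' \
        "$old_host" "$new_host"
}
```

Calling `print_update_host_cmd "$OLD_HOST" "$NEW_HOST"` prints a single command line. Whether running it is the right fix here, as opposed to keeping the old host value in the templates, is a separate question.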
(In reply to Jiri Stransky from comment #3)
> The working environment with fresh 8 install, was it backed by Ceph too?
>

Yes, it is backed by Ceph. I think Emilien is right about the root cause. On the fresh environment I can only see:

+------------------+--------------------------+------+---------+-------+----------------------------+-----------------+
| Binary           | Host                     | Zone | Status  | State | Updated_at                 | Disabled Reason |
+------------------+--------------------------+------+---------+-------+----------------------------+-----------------+
| cinder-scheduler | hostgroup                | nova | enabled | up    | 2016-04-13T18:08:22.000000 | -               |
| cinder-volume    | rbd:volumes@tripleo_ceph | nova | enabled | up    | 2016-04-13T18:08:24.000000 | -               |
+------------------+--------------------------+------+---------+-------+----------------------------+-----------------+

Given that, I believe the issue is not related to either IPv6 or SSL and will show up in every Ceph-backed environment.
(In reply to Emilien Macchi from comment #5)
> cinder service-list is showing 2 services, while it should show only one, I
> think we need to migrate volumes from the old one to the new one, with a
> MySQL operation (or maybe using Cinder API?).
>

Yes, this seems to come from the fact that cinder.conf specifies "host=hostgroup", but "hostgroup" isn't an actual host. (Looking at overcloud-controller-0.)
Thanks Emilien, Marius and Eric for the debugging. I've traced the issue you mention to backwards-incompatible changes in puppet-cinder.

First, a change that unconditionally sets host for cinder backends to a computed, non-overridable value:

https://review.openstack.org/#/c/209412/

Then, a change that migrates from `host` to `backend_host` and makes the value configurable, but keeps the old (wrong, backwards-incompatible) behavior as the default value of the property:

https://review.openstack.org/#/c/231068/

I think both of these should be reverted, but since they already made it into stable/liberty and stable/mitaka, it's probably easiest to just work around this in t-h-t :-/
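To see which of the two options a given controller ends up with, the relevant lines can be pulled out of the config file together with the section they sit in. This is a rough sketch only: the function name is made up for illustration, and the default path /etc/cinder/cinder.conf is an assumption rather than something stated in this bug.

```shell
#!/bin/bash
# Hypothetical helper: print every `host` / `backend_host` assignment found in
# a cinder.conf-style file, prefixed with the [section] it appears under.
show_cinder_host_settings() {
    # Default path is an assumption; pass another file as the first argument.
    local conf="${1:-/etc/cinder/cinder.conf}"
    awk '
        # Remember the most recent [section] header.
        /^[ \t]*\[/ { section = $0; next }
        # Print only the two options discussed above.
        /^[ \t]*(host|backend_host)[ \t]*=/ { print section " " $0 }
    ' "$conf"
}
```

Running `show_cinder_host_settings` on a controller before and after the upgrade should make it visible whether the value moved from `host` to `backend_host`, and whether the value itself changed.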
After upgrade:

stack@instack:~>>> cinder service-list
/usr/lib/python2.7/site-packages/requests/packages/urllib3/connection.py:251: SecurityWarning: Certificate has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)
  SecurityWarning
/usr/lib/python2.7/site-packages/requests/packages/urllib3/connection.py:251: SecurityWarning: Certificate has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)
  SecurityWarning
+------------------+------------------------+------+---------+-------+----------------------------+-----------------+
| Binary           | Host                   | Zone | Status  | State | Updated_at                 | Disabled Reason |
+------------------+------------------------+------+---------+-------+----------------------------+-----------------+
| cinder-scheduler | hostgroup              | nova | enabled | down  | 2016-04-18T12:23:20.000000 | -               |
| cinder-scheduler | hostgroup              | nova | enabled | up    | 2016-04-18T16:41:20.000000 | -               |
| cinder-volume    | hostgroup@tripleo_ceph | nova | enabled | up    | 2016-04-18T16:41:19.000000 | -               |
+------------------+------------------------+------+---------+-------+----------------------------+-----------------+
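For comparing this listing with the one from the fresh install, it can help to reduce each to just the cinder-volume host strings. A small sketch, assuming the table layout shown in this bug; the function name is hypothetical:

```shell
#!/bin/bash
# Hypothetical helper: read `cinder service-list` table output on stdin and
# print the Host column of every cinder-volume row, one per line.
# Border lines and any non-table lines are skipped because they have no
# cinder-volume entry in the first column.
volume_hosts() {
    awk -F'|' '
        $2 ~ /cinder-volume/ {
            host = $3
            gsub(/^ +| +$/, "", host)   # trim the padding around the cell
            print host
        }
    '
}
```

Usage would be `cinder service-list | volume_hosts`. Fed the two tables in this bug, it prints `rbd:volumes@tripleo_ceph` for the fresh install and `hostgroup@tripleo_ceph` for the upgraded environment, which is the mismatch in host strings under discussion.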
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2016-0653.html