Bug 1326823

Summary: scaling up after upgrade from 7.3 to 8.0 brings down cinder
Product: Red Hat OpenStack
Reporter: Marius Cornea <mcornea>
Component: rhosp-director
Assignee: Jiri Stransky <jstransk>
Status: CLOSED ERRATA
QA Contact: Marius Cornea <mcornea>
Severity: urgent
Priority: unspecified
Version: 8.0 (Liberty)
CC: augol, dbecker, eharney, emacchi, jcoufal, mburns, morazi, rhel-osp-director-maint
Target Milestone: async
Target Release: 8.0 (Liberty)
Hardware: Unspecified
OS: Unspecified
Fixed In Version: openstack-tripleo-heat-templates-0.8.14-8.el7ost
Doc Type: Bug Fix
Last Closed: 2016-04-20 13:04:51 UTC
Type: Bug

Description Marius Cornea 2016-04-13 13:23:13 UTC
Description of problem:
In an IPv6 + SSL environment, live migration fails after the 7.3 -> 8 upgrade.

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-0.8.14-7.el7ost.noarch

How reproducible:
on one environment

Steps to Reproduce:
1. Deploy using 7.3:

export THT=~/templates/my-overcloud-7.3
openstack overcloud deploy --templates $THT \
-e $THT/environments/storage-environment.yaml \
-e $THT/environments/network-isolation-v6.yaml \
-e ~/templates/network-environment-7.3-v6.yaml \
-e ~/templates/enable-tls.yaml \
-e ~/templates/inject-trust-anchor.yaml \
--control-scale 3 \
--compute-scale 1 \
--ceph-storage-scale 2 \
--ntp-server clock.redhat.com \
--libvirt-type qemu

2. Run some instances with volumes attached on the deployed cloud

3. Upgrade undercloud

4. Upgrade overcloud with workarounds for BZ#1324739 and BZ#1324691

5. Scale out with an additional compute and ceph nodes

export THT=~/templates/my-overcloud-8.0
openstack overcloud deploy --templates $THT \
-e $THT/environments/storage-environment.yaml \
-e $THT/environments/network-isolation-v6.yaml \
-e ~/templates/network-environment-8.0-v6.yaml \
-e ~/templates/enable-tls.yaml \
-e ~/templates/inject-trust-anchor.yaml \
-e ~/templates/hostname-wa.yaml \
--control-scale 3 \
--compute-scale 2 \
--ceph-storage-scale 3 \
--ntp-server clock.redhat.com \
--libvirt-type qemu

6. Live migrate an instance onto the compute node added in step 5

Actual results:
stack@instack:~>>> nova live-migration stack01-vm03-eylyi5wp2qx2-my_instance-ob5ab2y2ivry overcloud-compute-1.localdomain
/usr/lib/python2.7/site-packages/requests/packages/urllib3/connection.py:251: SecurityWarning: Certificate has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)
  SecurityWarning
/usr/lib/python2.7/site-packages/requests/packages/urllib3/connection.py:251: SecurityWarning: Certificate has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)
  SecurityWarning
ERROR (ConnectFailure): Unable to establish connection to https://[2001:db8:fd00:1000::10]:13774/v2/4a257a50297344e894f21c358c65bf58/servers/aa56e12a-b58d-4cb3-85a0-108a3b42fc4d/action


Expected results:
Live migration completes successfully.

Additional info:
The nova compute log shows the following error:

2016-04-13 13:11:56.161 24179 ERROR nova.volume.cinder [req-510cc26e-b2f8-4ee5-81ba-7b02bf8586fe 1cbe81c519c245b791bee9be7ff1b159 4a257a50297344e894f21c358c65bf58 - - -] Connection between volume 200dc6f2-9141-4964-bd1c-868841f700d8 and host overcloud-compute-0.localdomain might have succeeded, but attempt to terminate connection has failed. Validate the connection and determine if manual cleanup is needed. Error: Gateway Time-out (HTTP 504) Code: 504.

Comment 2 Marius Cornea 2016-04-13 14:56:53 UTC
Adding some details here: I tried the same scenario on a fresh 8 install with 1 compute, then scaled out with an additional compute and live migration completed fine.

Comment 3 Jiri Stransky 2016-04-13 15:58:32 UTC
Was the working environment with the fresh 8 install also backed by Ceph?


I don't have a root cause pinned down, but posting more debugging info:

The full stack trace shows that the error was triggered within check_can_live_migrate_source method in nova, specifically when executing initialize_connection in cinderclient:

http://fpaste.org/355055/62068146/raw/

(The errors mentioning check_can_live_migrate_source can be found on both compute-0 and compute-1.)

Inspecting cinder-api logs, it seems like haproxy returned the 504 code before cinder-api got a chance to respond, but the response from cinder-api would have been an error anyway:

http://fpaste.org/355057/60562306/raw/

Comment 5 Emilien Macchi 2016-04-13 17:59:53 UTC
I found something weird when doing the live migration:
http://paste.openstack.org/show/Ya7G5BVmMsiZhSq6Wbc8/

Which is related to this change:
https://github.com/openstack/tripleo-heat-templates/commit/fd0b25b010db428c450b99b50ff3a0d60d263005

I think this commit is not backward compatible with the cinder volumes we created before.

cinder service-list is showing 2 services, while it should show only one. I think we need to migrate volumes from the old one to the new one with a MySQL operation (or maybe using the Cinder API?).

That, I think, is the root issue.
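For reference, the "MySQL operation" would presumably be an UPDATE on the `host` column of Cinder's `volumes` table. A purely hypothetical sketch follows: the table and column names are standard Cinder schema, but the exact host strings must be read from `cinder service-list` on the affected cloud before running anything like this.

```sql
-- Hypothetical sketch; not a verified procedure for this bug.
-- Repoint volume records from the service host they were created under
-- to the host the cinder-volume service currently registers as.
-- Both host strings below are taken from the service listings elsewhere
-- in this report and may differ on other environments.
UPDATE volumes
   SET host = 'hostgroup@tripleo_ceph'
 WHERE host = 'rbd:volumes@tripleo_ceph';
```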

Comment 6 Marius Cornea 2016-04-13 18:12:51 UTC
(In reply to Jiri Stransky from comment #3)
> The working environment with fresh 8 install, was it backed by Ceph too?
> 

Yes, it is backed by Ceph. I think Emilien is right about the root cause. On the fresh environment I can only see:

+------------------+--------------------------+------+---------+-------+----------------------------+-----------------+
|      Binary      |           Host           | Zone |  Status | State |         Updated_at         | Disabled Reason |
+------------------+--------------------------+------+---------+-------+----------------------------+-----------------+
| cinder-scheduler |        hostgroup         | nova | enabled |   up  | 2016-04-13T18:08:22.000000 |        -        |
|  cinder-volume   | rbd:volumes@tripleo_ceph | nova | enabled |   up  | 2016-04-13T18:08:24.000000 |        -        |
+------------------+--------------------------+------+---------+-------+----------------------------+-----------------+

Given that, I believe the issue is related to neither IPv6 nor SSL and will show up in all Ceph-backed environments.

Comment 7 Eric Harney 2016-04-13 18:14:06 UTC
(In reply to Emilien Macchi from comment #5)
> cinder service-list is showing 2 services, while it should show only one, I
> think we need to migrate volumes from the old one to the new one, with a
> MySQL operation (or maybe using Cinder API?).
> 

Yes, this seems to come from the fact that cinder.conf specifies "host=hostgroup", but "hostgroup" isn't an actual host.  (Looking at overcloud-controller-0.)
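For context, the relevant cinder.conf settings would look roughly like this. This is a sketch, not a verbatim copy from the controller: only `host = hostgroup` is confirmed by this comment; the backend stanza is an assumption based on the `tripleo_ceph` backend name visible in `cinder service-list`.

```ini
# /etc/cinder/cinder.conf (sketch, assumptions noted above)
[DEFAULT]
# "hostgroup" is not a real host; it is a shared service name so that
# any controller can serve the same set of volumes.
host = hostgroup
enabled_backends = tripleo_ceph

[tripleo_ceph]
volume_backend_name = tripleo_ceph
volume_driver = cinder.volume.drivers.rbd.RBDDriver
```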

Comment 8 Jiri Stransky 2016-04-14 09:39:29 UTC
Thanks Emilien, Marius and Eric for the debugging. I've traced the issue you mention to backwards-incompatible changes in puppet-cinder. First, a change that unconditionally sets the host for cinder backends to a computed, non-overridable value:

https://review.openstack.org/#/c/209412/

And a change that migrates from `host` to `backend_host` and makes the value configurable, but keeps the old (wrong, backwards-incompatible) behavior for the default value of the property:

https://review.openstack.org/#/c/231068/

I think these should be both reverted, but since they already made it into stable/liberty and stable/mitaka, it's probably easiest to just work around this in t-h-t :-/
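A t-h-t workaround along the lines Jiri describes would pin the backend host back to the pre-8 value via extra hieradata. The sketch below is an assumption: the parameter name is based on the `backend_host` property introduced by review 231068, and the path actually used in the shipped fix may differ.

```yaml
# Sketch of extra hieradata pinning the Ceph backend's host string back
# to the value used on OSP 7 deployments; parameter name is an assumption
# based on puppet-cinder's backend_host property.
cinder::backend::rbd::backend_host: 'hostgroup'
```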

Comment 11 Marius Cornea 2016-04-18 16:41:41 UTC
After upgrade:

stack@instack:~>>> cinder service-list
/usr/lib/python2.7/site-packages/requests/packages/urllib3/connection.py:251: SecurityWarning: Certificate has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)
  SecurityWarning
/usr/lib/python2.7/site-packages/requests/packages/urllib3/connection.py:251: SecurityWarning: Certificate has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)
  SecurityWarning
+------------------+------------------------+------+---------+-------+----------------------------+-----------------+
|      Binary      |          Host          | Zone |  Status | State |         Updated_at         | Disabled Reason |
+------------------+------------------------+------+---------+-------+----------------------------+-----------------+
| cinder-scheduler |       hostgroup        | nova | enabled |  down | 2016-04-18T12:23:20.000000 |        -        |
| cinder-scheduler |       hostgroup        | nova | enabled |   up  | 2016-04-18T16:41:20.000000 |        -        |
|  cinder-volume   | hostgroup@tripleo_ceph | nova | enabled |   up  | 2016-04-18T16:41:19.000000 |        -        |
+------------------+------------------------+------+---------+-------+----------------------------+-----------------+

Comment 13 errata-xmlrpc 2016-04-20 13:04:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0653.html