Bug 1421347

Summary:	cinder retype --migration-policy on-demand issues in OSP-5
Product:	Red Hat OpenStack	Reporter:	Robin Cernin <rcernin>
Component:	openstack-cinder	Assignee:	Jon Bernard <jobernar>
Status:	CLOSED WONTFIX	QA Contact:	Tzach Shefi <tshefi>
Severity:	high	Docs Contact:
Priority:	high
Version:	5.0 (RHEL 7)	CC:	acanan, asoni, dhill, eglynn, eharney, geguileo, jobernar, jwaterwo, mflusche, nkinder, pgrist, srevivo, tshefi
Target Milestone:	async	Keywords:	Reopened, Triaged, ZStream
Target Release:	5.0 (RHEL 7)
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2018-04-17 23:12:38 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Robin Cernin 2017-02-11 09:41:53 UTC

Description of problem:

We are migrating VM volumes from one NAS storage to another using cinder retype --migration-policy on-demand. The VM has three volumes, but only two are migrated successfully and updated in MariaDB. The third volume is migrated, but MariaDB is not updated with its name_id, volume_type and status.

Version-Release number of selected component (if applicable):

openstack-cinder-2014.1.5-8.el7ost.noarch                   Mon Jul 25 13:53:43 2016
openstack-cinder-doc-2014.1.5-8.el7ost.noarch               Mon Jul 25 13:53:44 2016
python-cinder-2014.1.5-8.el7ost.noarch                      Mon Jul 25 13:53:43 2016
python-cinderclient-1.0.9-2.el7ost.noarch                   Mon Jul 25 13:53:41 2016
python-cinderclient-doc-1.0.9-2.el7ost.noarch               Mon Jul 25 13:53:44 2016

How reproducible:

Executed within script attached to this BZ

commands.getoutput("cinder --os-volume-api-version 2 retype --migration-policy on-demand %s %s" % (volumeid, volume_type))
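
For reference, the same call can be issued through subprocess so that the script also sees the exit status. This is a minimal sketch, not the attached script: the helper names build_retype_command and retype_volume are hypothetical, and it assumes the cinder client is on PATH with credentials already exported in the environment.

```python
import subprocess


def build_retype_command(volume_id, volume_type):
    """Return the argv list for the retype call quoted in this report."""
    return [
        "cinder", "--os-volume-api-version", "2",
        "retype", "--migration-policy", "on-demand",
        volume_id, volume_type,
    ]


def retype_volume(volume_id, volume_type):
    """Run the retype request and return (exit_code, combined_output).

    Assumption: a zero exit code only means the request was accepted;
    the migration itself continues in the background, so the volume
    state still has to be checked afterwards.
    """
    proc = subprocess.Popen(
        build_retype_command(volume_id, volume_type),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    output, _ = proc.communicate()
    return proc.returncode, output
```

Passing the arguments as a list avoids shell quoting problems if a volume type name contains spaces.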

Actual results:

From the procs fd output we can see that all three of the VM's volumes are migrated; however, only two are updated in MariaDB.

Expected results:

All three volumes are migrated and updated in MariaDB.

Additional info:

This behavior is observed in one out of five VMs.

Comment 8 Eric Harney 2017-02-15 16:16:10 UTC

Possibly related to this upstream bug, which sounds like the same symptom and has a proposed fix:

https://bugs.launchpad.net/cinder/+bug/1657806

Comment 17 Jack Waterworth 2017-03-09 19:11:01 UTC

Comparing the libvirt output with the nova output, we saw that different volumes appear to be attached.

nova shows connected:
-----------------
6e8693e5-6979-4a7a-8f65-95b0ff8b8bdb    retyping
8a369f26-7fd9-4d63-a863-eb96118909e1
9b02a589-dae8-4780-9a9d-20124cf52adf
-----------------

libvirt shows:
-----------------
3a600a0e-ef48-4189-8e05-450d8efe30e5
173e7cf1-f59c-4a43-b6a3-a00fbe4d49f1
17cbe359-9866-415d-97c8-a62866cf0721    attaching
-----------------

Looking at the database, it appears that the device that is stuck attaching is the migration target of the volume that is stuck retyping:

-----------------
MariaDB [cinder]> select * from volumes where id LIKE '%f0721'\G
*************************** 1. row ***************************
         created_at: 2017-03-02 05:47:12
         updated_at: 2017-03-02 06:12:14
         deleted_at: NULL
            deleted: 0
                 id: 17cbe359-9866-415d-97c8-a62866cf0721
             ec2_id: NULL
            user_id: fe4729cf6a7e462fb8925daf36ae3c8e
         project_id: ac59371df2ba455b939d6ed2f79c6b04
               host: ha-controller@backend_netapp10
               size: 500
  availability_zone: nova
      instance_uuid: a50d8b51-0361-490e-9281-844b3f3a0e8c
         mountpoint: /dev/vdc
        attach_time: 2017-03-02T05:48:07.657648
             status: attaching
      attach_status: detached
       scheduled_at: 2017-03-02 05:47:12
        launched_at: 2017-03-02 06:12:13
      terminated_at: NULL
       display_name: lgposput00329_vdc_restore_2017-03-02_05:42:08_recovery_
display_description: NULL
  provider_location: gso-e3-affnas01-svm1_lif2.gso.aexp.com:/ge3affnas01_ipc2_cinder9
      provider_auth: NULL
        snapshot_id: d052588d-a0d8-401a-b3fa-de4bab611570
     volume_type_id: 41130358-8d77-447f-af5e-6e276193276b
       source_volid: NULL
           bootable: 0
      attached_host: NULL
  provider_geometry: NULL
           _name_id: NULL
  encryption_key_id: NULL
   migration_status: target:6e8693e5-6979-4a7a-8f65-95b0ff8b8bdb <----
-----------------
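
The link between the two records can also be checked mechanically. The following is a minimal sketch, assuming the rows have already been fetched from the cinder volumes table as dictionaries. The function names are hypothetical, the `target:<volume id>` format is taken from the migration_status value shown above, and the second sample row is made up for illustration.

```python
def migration_source_id(migration_status):
    """Return the source volume ID from a 'target:<id>' value, else None."""
    prefix = "target:"
    if migration_status and migration_status.startswith(prefix):
        return migration_status[len(prefix):]
    return None


def leftover_migration_targets(rows):
    """Map each source volume ID to the ID of its leftover target row."""
    leftovers = {}
    for row in rows:
        source_id = migration_source_id(row.get("migration_status"))
        if source_id is not None:
            leftovers[source_id] = row["id"]
    return leftovers


rows = [
    # Values copied from the query result in this comment.
    {"id": "17cbe359-9866-415d-97c8-a62866cf0721",
     "status": "attaching",
     "migration_status": "target:6e8693e5-6979-4a7a-8f65-95b0ff8b8bdb"},
    # Hypothetical row: a volume with no migration in progress.
    {"id": "8a369f26-7fd9-4d63-a863-eb96118909e1",
     "migration_status": None},
]

for source, target in leftover_migration_targets(rows).items():
    print("source %s still has target row %s" % (source, target))
```

Any source ID reported here that nova still lists as attached, while libvirt shows the target ID instead, matches the mismatch described in this comment.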

I suspect that the failure here is that nova does not correctly report to cinder that the volume has been successfully attached, although I didn't see any obvious sign of this in the nova logs.

The customer's controller logs are lacking, so I have requested that they upload them again.

Comment 37 Scott Lewis 2017-06-30 13:30:58 UTC

Red Hat OpenStack Platform version 5 is now End-of-Life, and as such will not have further updates. See https://access.redhat.com/support/policy/updates/openstack/platform/ for full support lifecycle details.

Comment 42 Paul Grist 2018-04-17 23:12:38 UTC

After much effort and internal testing, we have determined that this operation, and live volume migration in general, is not stable in the OSP-5 release. We did not come up with any viable fixes, because a great deal of rework was done to improve this in later releases.

It's worth at least noting the suggestions around the keystone timer and other settings that could time out these very long and problematic operations:

Set the token expiration time to 2 days by executing something like this on each keystone node:

$ openstack-config --set /etc/keystone/keystone.conf token expiration 172800
$ service openstack-keystone restart
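
The value 172800 is 2 days expressed in seconds (2 * 24 * 60 * 60). As a minimal sketch for confirming the setting on a node, the snippet below parses keystone.conf text with Python's configparser. The function name token_expiration_seconds is hypothetical, and the sample text stands in for the real file contents.

```python
import configparser


def token_expiration_seconds(conf_text):
    """Return [token] expiration from keystone.conf text, or None if unset."""
    # Interpolation is disabled so that '%' characters elsewhere in the
    # file are read literally; strict=False tolerates repeated options.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(conf_text)
    if parser.has_option("token", "expiration"):
        return parser.getint("token", "expiration")
    return None


TWO_DAYS = 2 * 24 * 60 * 60  # seconds

# Stand-in for the contents of /etc/keystone/keystone.conf.
sample = "[token]\nexpiration = 172800\n"
print(token_expiration_seconds(sample) == TWO_DAYS)
```

On a real node the text would come from reading /etc/keystone/keystone.conf; a result of None means the option is not set explicitly in that file.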


That said, the only recommendation here is to do offline migration and look for alternatives to moving data.