Created attachment 1457521 [details]
compute:/var/log/containers/nova/nova-compute.log

Description of problem:

Volumes attached to running VMs cannot be detached. nova-compute.log on the compute node reports errors such as (lines truncated as captured):

DEBUG nova.virt.libvirt.guest [req-5be5ac81-32d6-46bc-aa79-54aa92aa0634 fd79cfc452c04158bd3d99c94a110dc5 d07b58ddf0d84309bffabd6abdddfc36 - default default] Successfully detached device vdb from guest. Persistent? tach_device /usr/lib/python2.7/site-packages/nova/virt/libvirt/guest.py:400
DEBUG oslo.service.loopingcall [req-5be5ac81-32d6-46bc-aa79-54aa92aa0634 fd79cfc452c04158bd3d99c94a110dc5 d07b58ddf0d84309bffabd6abdddfc36 - default default] Exception which is in the suggested list of exceptions nction: nova.virt.libvirt.guest._do_wait_and_retry_detach. _func /usr/lib/python2.7/site-packages/oslo_service/loopingcall.py:456
DEBUG oslo.service.loopingcall [req-5be5ac81-32d6-46bc-aa79-54aa92aa0634 fd79cfc452c04158bd3d99c94a110dc5 d07b58ddf0d84309bffabd6abdddfc36 - default default] Cannot retry nova.virt.libvirt.guest._do_wait_and_retry eption since retry count (7) reached max retry count (7). _func /usr/lib/python2.7/site-packages/oslo_service/loopingcall.py:466
ERROR oslo.service.loopingcall [req-5be5ac81-32d6-46bc-aa79-54aa92aa0634 fd79cfc452c04158bd3d99c94a110dc5 d07b58ddf0d84309bffabd6abdddfc36 - default default] Dynamic interval looping call 'oslo_service.loopingcall chFailed: Device detach failed for vdb: Unable to detach from guest transient domain.
ERROR oslo.service.loopingcall Traceback (most recent call last):
ERROR oslo.service.loopingcall   File "/usr/lib/python2.7/site-packages/oslo_service/loopingcall.py", line 193, in _run_loop
ERROR oslo.service.loopingcall     result = func(*self.args, **self.kw)
ERROR oslo.service.loopingcall   File "/usr/lib/python2.7/site-packages/oslo_service/loopingcall.py", line 471, in _func
ERROR oslo.service.loopingcall     return self._sleep_time
ERROR oslo.service.loopingcall   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
ERROR oslo.service.loopingcall     self.force_reraise()
ERROR oslo.service.loopingcall   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
ERROR oslo.service.loopingcall     six.reraise(self.type_, self.value, self.tb)
ERROR oslo.service.loopingcall   File "/usr/lib/python2.7/site-packages/oslo_service/loopingcall.py", line 450, in _func
ERROR oslo.service.loopingcall     result = f(*args, **kwargs)
ERROR oslo.service.loopingcall   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/guest.py", line 457, in _do_wait_and_retry_detach
ERROR oslo.service.loopingcall     device=alternative_device_name, reason=reason)
ERROR oslo.service.loopingcall DeviceDetachFailed: Device detach failed for vdb: Unable to detach from guest transient domain

In a deployed OSPd14 environment:

$ source overcloudrc
$ openstack volume create TestVOL --size 1
$ openstack volume list --all
+--------------------------------------+---------+-----------+------+-------------+
| ID                                   | Name    | Status    | Size | Attached to |
+--------------------------------------+---------+-----------+------+-------------+
| b15bf82b-e285-4ea1-9834-815971e7580d | TestVOL | available |    1 |             |
+--------------------------------------+---------+-----------+------+-------------+

$ openstack server list --all
+--------------------------------------+------+--------+-----------------------------------+--------------------------------+---------+
| ID                                   | Name | Status | Networks                          | Image                          | Flavor  |
+--------------------------------------+------+--------+-----------------------------------+--------------------------------+---------+
| 1d87ffcd-6bbb-4d1b-bf0c-fbfdd88e750c | test | ACTIVE | private=192.100.100.4, 10.0.0.220 | cirros-0.3.5-x86_64-uec.tar.gz | m1.tiny |
+--------------------------------------+------+--------+-----------------------------------+--------------------------------+---------+

Attach the volume to the VM:

$ openstack server add volume test TestVOL
$ openstack volume list --all
+--------------------------------------+---------+--------+------+-------------------------------+
| ID                                   | Name    | Status | Size | Attached to                   |
+--------------------------------------+---------+--------+------+-------------------------------+
| b15bf82b-e285-4ea1-9834-815971e7580d | TestVOL | in-use |    1 | Attached to test on /dev/vdb  |
+--------------------------------------+---------+--------+------+-------------------------------+

But when I try to detach the volume from the VM:

$ openstack server remove volume test TestVOL
$ openstack volume list --all
+--------------------------------------+---------+-----------+------+-------------------------------+
| ID                                   | Name    | Status    | Size | Attached to                   |
+--------------------------------------+---------+-----------+------+-------------------------------+
| b15bf82b-e285-4ea1-9834-815971e7580d | TestVOL | detaching |    1 | Attached to test on /dev/vdb  |
+--------------------------------------+---------+-----------+------+-------------------------------+

the volume is stuck in the "detaching" state.

How reproducible:
100% for this topology. Any Tempest test exercising this feature (volume-related scenario tests) fails as well.

Steps to Reproduce:
1. Deploy OSP14 using the InfraRed virthost 1:1:1:1 topology, puddle 2018-07-04.3
2. Run full Tempest
3. Any test trying to detach a volume fails
4. Reproduce manually using the commands above

Expected results:
The volume is detached from the VM in a matter of seconds.

Nova packages present:

UC:
openstack-nova-api-18.0.0-0.20180629234416.54343e9.el7ost.noarch
openstack-nova-conductor-18.0.0-0.20180629234416.54343e9.el7ost.noarch
openstack-nova-common-18.0.0-0.20180629234416.54343e9.el7ost.noarch
puppet-nova-13.1.1-0.20180701111130.40d402f.el7ost.noarch
python2-novaclient-10.3.0-0.20180627161921.0cb0ffa.el7ost.noarch
openstack-nova-placement-api-18.0.0-0.20180629234416.54343e9.el7ost.noarch
openstack-nova-compute-18.0.0-0.20180629234416.54343e9.el7ost.noarch
openstack-nova-scheduler-18.0.0-0.20180629234416.54343e9.el7ost.noarch
python-nova-18.0.0-0.20180629234416.54343e9.el7ost.noarch

Controller:
puppet-nova-13.1.1-0.20180701111130.40d402f.el7ost.noarch
python2-novaclient-10.3.0-0.20180627161921.0cb0ffa.el7ost.noarch

Compute:
puppet-nova-13.1.1-0.20180701111130.40d402f.el7ost.noarch
python2-novaclient-10.3.0-0.20180627161921.0cb0ffa.el7ost.noarch
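For anyone hitting the same symptom, a quick sanity check (only a sketch; the container name and instance name below are examples from a typical containerized TripleO compute node and may differ in your environment) is to ask libvirt directly whether the block device is still attached to the guest, independently of what Cinder reports:

# On the compute node; "nova_libvirt" and "instance-00000001" are illustrative names.
$ sudo docker exec -it nova_libvirt virsh list --all
$ sudo docker exec -it nova_libvirt virsh domblklist instance-00000001

If vdb still shows up in the domblklist output after nova has exhausted its detach retries, the detach really never completed on the libvirt side.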
The image checksum is 7ef58c0f9aa6136021cb61a5d4f275e5, which matches cirros-0.3.5-x86_64-uec.tar.gz, yet the image is registered in Glance as qcow2 (which it is not before unpacking). It looks like the wrong image is being used.
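To double-check what actually got registered in Glance versus what the file on the mirror contains, something along these lines can be used (the image name and URL are taken from the listings above; qemu-img reports the real on-disk format regardless of the recorded disk_format):

$ openstack image list
$ openstack image show cirros-0.3.5-x86_64-uec.tar.gz -c name -c disk_format -c checksum
# Compare against the source file itself:
$ curl -sO http://<internal>/images/cirros-0.3.5-x86_64-uec.tar.gz
$ md5sum cirros-0.3.5-x86_64-uec.tar.gz
$ qemu-img info cirros-0.3.5-x86_64-uec.tar.gz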
/usr/bin/discover-tempest-config --deployer-input ~/ir-tempest-deployer-input.conf --debug -v --create \
    --image http://<internal>/images/cirros-0.3.5-x86_64-disk.img \
    identity.uri $OS_AUTH_URL \
    identity.admin_password $OS_PASSWORD \
    scenario.img_dir ~/tempest-dir/etc \
    image.http_image http://<internal>/images/cirros-0.3.5-x86_64-uec.tar.gz \
    identity.region regionOne \
    orchestration.stack_owner_role heat_stack_owner \
    --out ~/tempest-dir/etc/tempest.conf

image.http_image should only be used in tempest.conf for a test that does not depend on the file content (it could be any URL). --image <url> is the image that should be downloaded and uploaded to Glance, and then configured as the generic test image (a second copy is used for the alternate image).
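A quick way to see which image the generated tempest.conf actually ends up pointing at (a sketch; crudini is just one option and the path mirrors the command above) is to read [compute]/image_ref back and compare it with Glance:

$ crudini --get ~/tempest-dir/etc/tempest.conf compute image_ref
$ crudini --get ~/tempest-dir/etc/tempest.conf image http_image
# The image_ref UUID should belong to the --image file, not to the http_image tarball:
$ openstack image show $(crudini --get ~/tempest-dir/etc/tempest.conf compute image_ref) -c name -c disk_format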
Yes, that seems to be the cause here. After a thorough discussion with Martin, who handles python-tempestconf, we came to the conclusion that this is most likely a bug. --image specifies the main image that is supposed to be uploaded to Glance; its ID should then be referenced by the discover-tempest-config run in the [compute] section of tempest.conf. Currently, if the image.http_image value is overridden, it replaces the behaviour of --image, and that is definitely a bug.

As far as I was able to find, image.http_image only plays a role in the Image v1 tests - https://github.com/openstack/tempest/blob/master/tempest/api/image/v1/test_images.py - which, according to the comments at https://github.com/openstack/python-tempestconf/blob/3a40d5fe982b0b8ef49f3b16bfbc278a10a93ef0/config_tempest/constants.py#L70, have been deprecated since Queens in Tempest, but since Newton in OpenStack itself (https://developer.openstack.org/api-ref/image/versions/index.html#what-happened-to-the-v1-api). What is unclear to me is why the http_image variable is kept in the code and not marked for deprecation (e.g. like the comments around image-feature-enabled.api_v1). If only the Image v1 tests use it, shouldn't http_image be removed completely, since it only adds confusion?

All in all, since we also use this override in the OSP13 CI and tempest.conf there is generated with the IDs of the proper qcow2 file (--image), not the http_image one, it looks like this bug was introduced between package versions python2-tempestconf-1.1.5-0.20180326143753.f9d956f.el7ost.noarch (GA package in OSP13) and python2-tempestconf-1.1.5-0.20180629215301.d1107c9.el7ost.noarch (OSP14 latest, puddle 2018-07-09.1).
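For reference, the installed build can be checked on the undercloud with plain rpm queries (package name as listed above):

$ rpm -q python2-tempestconf
$ rpm -q --changelog python2-tempestconf | head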
The patch [1] has been merged in the master branch.

[1] https://review.openstack.org/581344
Verified. With the fixed python-tempestconf package, the value of the "--image" parameter (and its ID) is referenced in tempest.conf as the main image for testing, not image.http_image. http_image is only referenced inside the [image] section, as it should be, and no longer overrides the --image value in tempest.conf.

tempest.conf diff (broken vs. fixed):

21c21
< img_file = cirros-0.3.5-x86_64-uec.tar.gz
---
> img_file = cirros-0.3.5-x86_64-disk.img
52a53,57
> [image]
> image_path = http://rhos-qe-mirror-tlv.usersys.redhat.com/images/cirros-0.3.5-x86_64-disk.img
> region = regionOne
> http_image = http://rhos-qe-mirror-tlv.usersys.redhat.com/images/cirros-0.3.5-x86_64-uec.tar.gz
>
57,62c62,63
< image_ref = 51db4246-a53e-4c5e-92ff-c65550e0fd45
< image_ref_alt = 5486c620-9a5c-48aa-b4ff-8d8dbd540ec2
<
< [image]
< region = regionOne
< http_image = http://rhos-qe-mirror-tlv.usersys.redhat.com/images/cirros-0.3.5-x86_64-uec.tar.gz
---
> image_ref = 82baf3f8-12d0-450a-9e28-cba7f0187af6
> image_ref_alt = 03bb1b21-4c6f-4149-9ca5-2673b481ecaf
76a78,79
> min_microversion = 3.0
> max_microversion = 3.52
115,118d117
< [image-feature-enabled]
< api_v1 = False
< api_v2 = True
<
122a122,125
>
> [image-feature-enabled]
> api_v1 = False
> api_v2 = True
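As an additional sanity check after regenerating tempest.conf with the fixed package, re-running one of the volume-related scenario tests should now pass; the regex below is only an example, any test that attaches and detaches a volume exercises the same path:

$ cd ~/tempest-dir
$ tempest run --regex tempest.scenario.test_volume_boot_pattern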
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:0045