Created attachment 1457521 [details]
compute:/var/log/containers/nova/nova-compute.log

Description of problem:

Volumes attached to running VMs cannot be detached. nova-compute.log on the compute node reports errors such as (lines truncated as captured):

DEBUG nova.virt.libvirt.guest [req-5be5ac81-32d6-46bc-aa79-54aa92aa0634 fd79cfc452c04158bd3d99c94a110dc5 d07b58ddf0d84309bffabd6abdddfc36 - default default] Successfully detached device vdb from guest. Persistent? tach_device /usr/lib/python2.7/site-packages/nova/virt/libvirt/guest.py:400
DEBUG oslo.service.loopingcall [req-5be5ac81-32d6-46bc-aa79-54aa92aa0634 fd79cfc452c04158bd3d99c94a110dc5 d07b58ddf0d84309bffabd6abdddfc36 - default default] Exception which is in the suggested list of exceptions nction: nova.virt.libvirt.guest._do_wait_and_retry_detach. _func /usr/lib/python2.7/site-packages/oslo_service/loopingcall.py:456
DEBUG oslo.service.loopingcall [req-5be5ac81-32d6-46bc-aa79-54aa92aa0634 fd79cfc452c04158bd3d99c94a110dc5 d07b58ddf0d84309bffabd6abdddfc36 - default default] Cannot retry nova.virt.libvirt.guest._do_wait_and_retry eption since retry count (7) reached max retry count (7). _func /usr/lib/python2.7/site-packages/oslo_service/loopingcall.py:466
ERROR oslo.service.loopingcall [req-5be5ac81-32d6-46bc-aa79-54aa92aa0634 fd79cfc452c04158bd3d99c94a110dc5 d07b58ddf0d84309bffabd6abdddfc36 - default default] Dynamic interval looping call 'oslo_service.loopingcall chFailed: Device detach failed for vdb: Unable to detach from guest transient domain.
ERROR oslo.service.loopingcall Traceback (most recent call last):
ERROR oslo.service.loopingcall   File "/usr/lib/python2.7/site-packages/oslo_service/loopingcall.py", line 193, in _run_loop
ERROR oslo.service.loopingcall     result = func(*self.args, **self.kw)
ERROR oslo.service.loopingcall   File "/usr/lib/python2.7/site-packages/oslo_service/loopingcall.py", line 471, in _func
ERROR oslo.service.loopingcall     return self._sleep_time
ERROR oslo.service.loopingcall   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
ERROR oslo.service.loopingcall     self.force_reraise()
ERROR oslo.service.loopingcall   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
ERROR oslo.service.loopingcall     six.reraise(self.type_, self.value, self.tb)
ERROR oslo.service.loopingcall   File "/usr/lib/python2.7/site-packages/oslo_service/loopingcall.py", line 450, in _func
ERROR oslo.service.loopingcall     result = f(*args, **kwargs)
ERROR oslo.service.loopingcall   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/guest.py", line 457, in _do_wait_and_retry_detach
ERROR oslo.service.loopingcall     device=alternative_device_name, reason=reason)
ERROR oslo.service.loopingcall DeviceDetachFailed: Device detach failed for vdb: Unable to detach from guest transient domain

In a deployed OSPd14 environment:

$ source overcloudrc
$ openstack volume create TestVOL --size 1
$ openstack volume list --all
+--------------------------------------+---------+-----------+------+-------------+
| ID                                   | Name    | Status    | Size | Attached to |
+--------------------------------------+---------+-----------+------+-------------+
| b15bf82b-e285-4ea1-9834-815971e7580d | TestVOL | available |    1 |             |
+--------------------------------------+---------+-----------+------+-------------+

$ openstack server list --all
+--------------------------------------+------+--------+-----------------------------------+--------------------------------+---------+
| ID                                   | Name | Status | Networks                          | Image                          | Flavor  |
+--------------------------------------+------+--------+-----------------------------------+--------------------------------+---------+
| 1d87ffcd-6bbb-4d1b-bf0c-fbfdd88e750c | test | ACTIVE | private=192.100.100.4, 10.0.0.220 | cirros-0.3.5-x86_64-uec.tar.gz | m1.tiny |
+--------------------------------------+------+--------+-----------------------------------+--------------------------------+---------+

Attach the volume to the VM:

$ openstack server add volume test TestVOL
$ openstack volume list --all
+--------------------------------------+---------+--------+------+-------------------------------+
| ID                                   | Name    | Status | Size | Attached to                   |
+--------------------------------------+---------+--------+------+-------------------------------+
| b15bf82b-e285-4ea1-9834-815971e7580d | TestVOL | in-use |    1 | Attached to test on /dev/vdb  |
+--------------------------------------+---------+--------+------+-------------------------------+

But when I try to detach the volume from the VM:

$ openstack server remove volume test TestVOL
$ openstack volume list --all
+--------------------------------------+---------+-----------+------+-------------------------------+
| ID                                   | Name    | Status    | Size | Attached to                   |
+--------------------------------------+---------+-----------+------+-------------------------------+
| b15bf82b-e285-4ea1-9834-815971e7580d | TestVOL | detaching |    1 | Attached to test on /dev/vdb  |
+--------------------------------------+---------+-----------+------+-------------------------------+

the volume is stuck in the "detaching" state.

How reproducible:
100% for this topology. Any Tempest test exercising this feature (volume-related scenario tests) fails as well.

Steps to Reproduce:
1. Deploy OSP14 using the InfraRed virthost 1:1:1:1 topology, puddle 2018-07-04.3
2. Run full Tempest
3. Any test trying to detach a volume fails
4. Reproduce manually using the commands above

Expected results:
The volume is detached from the VM in a matter of seconds.

Nova packages present:

UC:
openstack-nova-api-18.0.0-0.20180629234416.54343e9.el7ost.noarch
openstack-nova-conductor-18.0.0-0.20180629234416.54343e9.el7ost.noarch
openstack-nova-common-18.0.0-0.20180629234416.54343e9.el7ost.noarch
puppet-nova-13.1.1-0.20180701111130.40d402f.el7ost.noarch
python2-novaclient-10.3.0-0.20180627161921.0cb0ffa.el7ost.noarch
openstack-nova-placement-api-18.0.0-0.20180629234416.54343e9.el7ost.noarch
openstack-nova-compute-18.0.0-0.20180629234416.54343e9.el7ost.noarch
openstack-nova-scheduler-18.0.0-0.20180629234416.54343e9.el7ost.noarch
python-nova-18.0.0-0.20180629234416.54343e9.el7ost.noarch

Controller:
puppet-nova-13.1.1-0.20180701111130.40d402f.el7ost.noarch
python2-novaclient-10.3.0-0.20180627161921.0cb0ffa.el7ost.noarch

Compute:
puppet-nova-13.1.1-0.20180701111130.40d402f.el7ost.noarch
python2-novaclient-10.3.0-0.20180627161921.0cb0ffa.el7ost.noarch
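For anyone hitting the same symptom, a quick sanity check (only a sketch; the container name and instance name below are examples from a typical containerized TripleO compute node and may differ in your environment) is to ask libvirt directly whether the block device is still attached to the guest, independently of what Cinder reports:

# On the compute node; "nova_libvirt" and "instance-00000001" are illustrative names.
$ sudo docker exec -it nova_libvirt virsh list --all
$ sudo docker exec -it nova_libvirt virsh domblklist instance-00000001

If vdb still shows up in the domblklist output after nova has exhausted its detach retries, the detach really never completed on the libvirt side.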
The image checksum is 7ef58c0f9aa6136021cb61a5d4f275e5, which matches cirros-0.3.5-x86_64-uec.tar.gz, yet the image is registered in Glance as qcow2 (which it is not before unpacking). It looks like the wrong image is being used.
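To double-check what actually got registered in Glance versus what the file on the mirror contains, something along these lines can be used (the image name and URL are taken from the listings above; qemu-img reports the real on-disk format regardless of the recorded disk_format):

$ openstack image list
$ openstack image show cirros-0.3.5-x86_64-uec.tar.gz -c name -c disk_format -c checksum
# Compare against the source file itself:
$ curl -sO http://<internal>/images/cirros-0.3.5-x86_64-uec.tar.gz
$ md5sum cirros-0.3.5-x86_64-uec.tar.gz
$ qemu-img info cirros-0.3.5-x86_64-uec.tar.gz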
/usr/bin/discover-tempest-config --deployer-input ~/ir-tempest-deployer-input.conf --debug -v --create \
    --image http://<internal>/images/cirros-0.3.5-x86_64-disk.img \
    identity.uri $OS_AUTH_URL \
    identity.admin_password $OS_PASSWORD \
    scenario.img_dir ~/tempest-dir/etc \
    image.http_image http://<internal>/images/cirros-0.3.5-x86_64-uec.tar.gz \
    identity.region regionOne \
    orchestration.stack_owner_role heat_stack_owner \
    --out ~/tempest-dir/etc/tempest.conf

image.http_image should only be used in tempest.conf for a test that does not depend on the file content (it could be any URL). --image <url> is the image that should be downloaded and uploaded to Glance, and then configured as the generic test image (a second copy is used for the alternate image).
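A quick way to see which image the generated tempest.conf actually ends up pointing at (a sketch; crudini is just one option and the path mirrors the command above) is to read [compute]/image_ref back and compare it with Glance:

$ crudini --get ~/tempest-dir/etc/tempest.conf compute image_ref
$ crudini --get ~/tempest-dir/etc/tempest.conf image http_image
# The image_ref UUID should belong to the --image file, not to the http_image tarball:
$ openstack image show $(crudini --get ~/tempest-dir/etc/tempest.conf compute image_ref) -c name -c disk_format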
Yes, that seems to be the cause here. After a thorough discussion with Martin, who handles python-tempestconf, we came to the conclusion that this is most likely a bug. --image specifies the main image that is supposed to be uploaded to Glance; its ID should then be referenced by the discover-tempest-config run in the [compute] section of tempest.conf. Currently, if the image.http_image value is overridden, it replaces the behaviour of --image, and that is definitely a bug.

As far as I was able to find, image.http_image only plays a role in the Image v1 tests - https://github.com/openstack/tempest/blob/master/tempest/api/image/v1/test_images.py - which, according to the comments at https://github.com/openstack/python-tempestconf/blob/3a40d5fe982b0b8ef49f3b16bfbc278a10a93ef0/config_tempest/constants.py#L70, have been deprecated since Queens in Tempest, but since Newton in OpenStack itself (https://developer.openstack.org/api-ref/image/versions/index.html#what-happened-to-the-v1-api). What is unclear to me is why the http_image variable is kept in the code and not marked for deprecation (e.g. like the comments around image-feature-enabled.api_v1). If only the Image v1 tests use it, shouldn't http_image be removed completely, since it only adds confusion?

All in all, since we also use this override in the OSP13 CI and tempest.conf there is generated with the IDs of the proper qcow2 file (--image), not the http_image one, it looks like this bug was introduced between package versions python2-tempestconf-1.1.5-0.20180326143753.f9d956f.el7ost.noarch (GA package in OSP13) and python2-tempestconf-1.1.5-0.20180629215301.d1107c9.el7ost.noarch (OSP14 latest, puddle 2018-07-09.1).
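For reference, the installed build can be checked on the undercloud with plain rpm queries (package name as listed above):

$ rpm -q python2-tempestconf
$ rpm -q --changelog python2-tempestconf | head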
The patch [1] has been merged in the master branch.

[1] https://review.openstack.org/581344
Verified. With the fixed python-tempestconf package, the value of the "--image" parameter (and its ID) is referenced in tempest.conf as the main image for testing, not image.http_image. http_image is only referenced inside the [image] section, as it should be, and no longer overrides the --image value in tempest.conf.

tempest.conf diff (broken vs. fixed):

21c21
< img_file = cirros-0.3.5-x86_64-uec.tar.gz
---
> img_file = cirros-0.3.5-x86_64-disk.img
52a53,57
> [image]
> image_path = http://rhos-qe-mirror-tlv.usersys.redhat.com/images/cirros-0.3.5-x86_64-disk.img
> region = regionOne
> http_image = http://rhos-qe-mirror-tlv.usersys.redhat.com/images/cirros-0.3.5-x86_64-uec.tar.gz
>
57,62c62,63
< image_ref = 51db4246-a53e-4c5e-92ff-c65550e0fd45
< image_ref_alt = 5486c620-9a5c-48aa-b4ff-8d8dbd540ec2
<
< [image]
< region = regionOne
< http_image = http://rhos-qe-mirror-tlv.usersys.redhat.com/images/cirros-0.3.5-x86_64-uec.tar.gz
---
> image_ref = 82baf3f8-12d0-450a-9e28-cba7f0187af6
> image_ref_alt = 03bb1b21-4c6f-4149-9ca5-2673b481ecaf
76a78,79
> min_microversion = 3.0
> max_microversion = 3.52
115,118d117
< [image-feature-enabled]
< api_v1 = False
< api_v2 = True
<
122a122,125
>
> [image-feature-enabled]
> api_v1 = False
> api_v2 = True
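As an additional sanity check after regenerating tempest.conf with the fixed package, re-running one of the volume-related scenario tests should now pass; the regex below is only an example, any test that attaches and detaches a volume exercises the same path:

$ cd ~/tempest-dir
$ tempest run --regex tempest.scenario.test_volume_boot_pattern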
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:0045