Bug 1396112 - dm devices left on controller after volume create from image
Summary: dm devices left on controller after volume create from image
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-cinder
Version: 9.0 (Mitaka)
Hardware: x86_64
OS: Linux
Priority: high
Severity: medium
Target Milestone: zstream
Target Release: 9.0 (Mitaka)
Assignee: Gorka Eguileor
QA Contact: Avi Avraham
URL:
Whiteboard:
Duplicates: 1462346 (view as bug list)
Depends On: 1422941
Blocks:
TreeView+ depends on / blocked
 
Reported: 2016-11-17 13:11 UTC by Benjamin Schmaus
Modified: 2020-12-14 07:52 UTC (History)
22 users (show)

Fixed In Version: openstack-cinder-8.1.1-11.el7ost
Doc Type: Bug Fix
Doc Text:
The iSCSI connections have been improved to use the latest os-brick functionality to detach volumes, forcibly when appropriate. For optimal results, use it with iscsi-initiator-utils 6.2.0.874-2 or later.
Clone Of:
Environment:
Last Closed: 2017-10-05 14:03:32 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Launchpad 1663925 0 None None None 2017-02-16 15:47:10 UTC
Launchpad 1663936 0 None None None 2017-02-16 15:47:27 UTC
Launchpad 1664032 0 None None None 2017-02-16 15:47:51 UTC
OpenStack gerrit 459453 0 None MERGED Do proper cleanup if connect volume fails 2020-05-08 14:55:34 UTC
OpenStack gerrit 459454 0 None MERGED Add support for OS-Brick force disconnect 2020-05-08 14:55:34 UTC
Red Hat Product Errata RHBA-2017:2861 0 normal SHIPPED_LIVE openstack-cinder bug fix advisory 2017-10-05 18:03:19 UTC

Description Benjamin Schmaus 2016-11-17 13:11:20 UTC
Description of problem:

The customer is launching a heat stack of 48 instances that boot from volumes backed by an EMC VNX array, which is configured for multipath on both the controller and compute nodes.

The stack completed successfully but still left a few dm devices (two, in this test) behind on the controller nodes.

python-os-brick-1.1.0-3.el7ost.noarch.rpm was being used.

The issue seems similar to BZ#1372431.

Logs are extracted to collab:

 [extracting] sosreport-wcmsc2-l-rh-ocld-2.localdomain.01727836-20161109181020.tar.xz
 [parsing] sosreport for wcmsc2-l-rh-ocld-2.localdomain [7.2][openstack][all-in-one] -- Wed Nov 9 18:12:36 2016
 [scanning] for and moving additional files
 [x-rpm] python-os-brick-1.1.0-3.el7ost.noarch.rpm
 [posting] download and extraction summary to ticket as rhn-support-bschmaus
 [erasing] empty directories
---
 [yank] complete - access files in /cases/01727836
 [browse] the files here: http://collab-shell.usersys.redhat.com/01727836/
please report any issues here: https://gitlab.cee.redhat.com/gss-tools/collab-shell/issues


Version-Release number of selected component (if applicable):
OSP9

How reproducible:
100% reproducible, although the number of leftover DM devices varies.

Steps to Reproduce:
1. Launch a heat stack with many instances that boot from volume.

Actual results:
Leftover DM devices remain on the controller.

Expected results:
DM devices on the controller should be removed once the stack is up and operational.

Additional info:

Comment 2 Elise Gafford 2016-11-23 14:14:36 UTC
Hi Eric,

Could you take a look at this to assess diagnosis and feasibility of a fix?

Comment 8 Gorka Eguileor 2017-02-01 11:49:13 UTC
I believe the issue is caused by a race condition between the `_connect_device` and the `_detach_volume` methods for this specific driver.

Even though the driver does not implement those methods itself and simply inherits them from the base driver, it is still the driver's responsibility to ensure that race conditions do not affect the end result for its storage device.

The reason this may be failing on this specific hardware is that this is one of EMC's devices where a single target exposes multiple LUNs.

The race condition seems to happen when we have disconnected the volume and proceeded to the `terminate_connection` driver call, and while that call is still ongoing:

    2016-11-09 17:24:55.801 44368 DEBUG cinder.volume.drivers.emc.emc_vnx_cli [req-ccdf506b-3c7a-4a50-a70c-3d16a92423c0 4ff130397fd147afb7dda53cc529dde2 66fbd6a0efe6493e9d02b8eb2b9b2ae3 - - -] Get lun_id: 5. get_lun_id /usr/lib/python2.7/site-packages/cinder/volume/drivers/emc/emc_vnx_cli.py:3303
    2016-11-09 17:24:55.803 44368 DEBUG oslo_concurrency.processutils [req-ccdf506b-3c7a-4a50-a70c-3d16a92423c0 4ff130397fd147afb7dda53cc529dde2 66fbd6a0efe6493e9d02b8eb2b9b2ae3 - - -] Running cmd (subprocess): /opt/Navisphere/bin/naviseccli -address fd00:4888:2000:fc01:524:ff2:0:9 -user sysadmin -password sysadmin -scope global storagegroup -removehlu -hlu 58 -gname wcmsc2-l-rh-ocld-2.localdomain -o execute /usr/lib/python2.7/site-packages/oslo_concurrency/processutils.py:344

Then another request starts connecting via os-brick's `connect_volume` method:

    2016-11-09 17:24:55.816 44368 INFO os_brick.initiator.connector [req-512c470b-967e-41ae-ae8d-c3a64aef2672 4ff130397fd147afb7dda53cc529dde2 66fbd6a0efe6493e9d02b8eb2b9b2ae3 - - -] Trying to connect to iSCSI portal 192.168.3.253:3260
    2016-11-09 17:24:55.817 44368 DEBUG oslo_concurrency.processutils [req-512c470b-967e-41ae-ae8d-c3a64aef2672 4ff130397fd147afb7dda53cc529dde2 66fbd6a0efe6493e9d02b8eb2b9b2ae3 - - -] Running cmd (subprocess): sudo cinder-rootwrap /etc/cinder/rootwrap.conf iscsiadm -m node -T iqn.1992-04.com.emc:cx.apm00163100670.b3 -p 192.168.3.253:3260 execute /usr/lib/python2.7/site-packages/oslo_concurrency/processutils.py:344
    2016-11-09 17:24:55.906 44368 DEBUG oslo_concurrency.processutils [req-512c470b-967e-41ae-ae8d-c3a64aef2672 4ff130397fd147afb7dda53cc529dde2 66fbd6a0efe6493e9d02b8eb2b9b2ae3 - - -] CMD "sudo cinder-rootwrap /etc/cinder/rootwrap.conf iscsiadm -m node -T iqn.1992-04.com.emc:cx.apm00163100670.b3 -p 192.168.3.253:3260" returned: 0 in 0.089s execute /usr/lib/python2.7/site-packages/oslo_concurrency/processutils.py:374
       ...

And then the `terminate_connection` call completes its execution:

    2016-11-09 17:24:58.932 44368 DEBUG oslo_concurrency.processutils [req-ccdf506b-3c7a-4a50-a70c-3d16a92423c0 4ff130397fd147afb7dda53cc529dde2 66fbd6a0efe6493e9d02b8eb2b9b2ae3 - - -] CMD "/opt/Navisphere/bin/naviseccli -address fd00:4888:2000:fc01:524:ff2:0:9 -user sysadmin -password sysadmin -scope global storagegroup -removehlu -hlu 58 -gname wcmsc2-l-rh-ocld-2.localdomain -o" returned: 0 in 3.130s execute /usr/lib/python2.7/site-packages/oslo_concurrency/processutils.py:374
    2016-11-09 17:24:58.933 44368 DEBUG cinder.volume.drivers.emc.emc_vnx_cli [req-ccdf506b-3c7a-4a50-a70c-3d16a92423c0 4ff130397fd147afb7dda53cc529dde2 66fbd6a0efe6493e9d02b8eb2b9b2ae3 - - -] EMC: Command: ('/opt/Navisphere/bin/naviseccli', '-address', 'fd00:4888:2000:fc01:524:ff2:0:9', 'storagegroup', '-removehlu', '-hlu', 58, '-gname', 'wcmsc2-l-rh-ocld-2.localdomain', '-o'). Result: \n. _command_execute_on_active_ip /usr/lib/python2.7/site-packages/cinder/volume/drivers/emc/emc_vnx_cli.py:2058
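
For illustration only (this is not the actual driver code): one way a driver can protect itself is to serialize mapping and unmapping operations per storage group with oslo.concurrency locks, so that a `terminate_connection` in flight cannot interleave with a new connection for the same host. The `_do_*` helpers below are hypothetical placeholders:

    from oslo_concurrency import lockutils

    class VNXLikeDriver(object):
        def initialize_connection(self, volume, connector):
            # Serialize storage-group changes for this host so a concurrent
            # terminate_connection cannot interleave with this call.
            with lockutils.lock('emc-sg-%s' % connector['host']):
                return self._do_initialize_connection(volume, connector)

        def terminate_connection(self, volume, connector, **kwargs):
            with lockutils.lock('emc-sg-%s' % connector['host']):
                return self._do_terminate_connection(volume, connector)

Note that `lockutils.lock` only serializes within a single process; coordinating multiple cinder-volume hosts would need a distributed lock.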

Comment 10 Jay Xu 2017-02-08 03:19:24 UTC
This is a known issue for storage like VNX/Unity (multiple LUNs exported from a single target).

Problem:
The race condition happens as follows:

Step 1: disconnect volume
Step 2: terminate connection starts
Step 3: connect volume (for other LUNs)
Step 4: terminate connection ends

Step 3 usually rescans devices that were already removed from the host in step 1; once step 4 (removing the LUN from the VNX storage group) finishes, the rescanned devices become so-called "faulty devices".

Solution:
Given the design above, there is currently no solution that avoids faulty devices completely. It is suggested to use the "vnx-faulty-device-cleanup" script to clean up the devices periodically.

Note: the script is available at https://github.com/emc-openstack/vnx-faulty-device-cleanup
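
As a rough illustration of what such a periodic cleanup does (this is not the script's actual code), a job could flush the multipath maps whose remaining paths are all faulty:

    import re
    import subprocess

    # Matches a path line such as "| `- 3:0:0:5 sdg 8:96 failed faulty running"
    PATH_RE = re.compile(r'\d+:\d+:\d+:\d+\s+sd\w+')

    def faulty_maps():
        """Yield multipath map names whose remaining paths are all faulty."""
        out = subprocess.check_output(['multipath', '-ll'],
                                      universal_newlines=True)
        name, paths, faulty = None, 0, 0
        for line in out.splitlines():
            if PATH_RE.search(line):
                paths += 1
                faulty += 'faulty' in line
            elif line and not line[0].isspace() \
                    and not line.startswith(('size=', '|', '`')):
                # A new map header, e.g. "36006016... dm-2 DGC,VRAID"
                if name and paths and paths == faulty:
                    yield name
                name, paths, faulty = line.split()[0], 0, 0
        if name and paths and paths == faulty:
            yield name

    for m in faulty_maps():
        subprocess.call(['multipath', '-f', m])  # flush the stale dm device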

Comment 16 Gorka Eguileor 2017-02-16 15:53:22 UTC
There are several issues with the scanning of targets, which can be summarized as:

- No retries on "map in use" dm flushing (os-brick)
- iSCSI scans that are too broad
- Automatic iSCSI scans performed by iscsid on AER/AEN packet reception

Due to these, we end up with leftover dms in the system.

There are proposed fixes for all of them, pending review and the backport process. The mitigation for the automatic scans, switching sessions to manual scan mode, is sketched below.
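
A minimal sketch of that mitigation, wrapping the same iscsiadm call that is later used for verification in comment 33 (target and portal values are taken from the logs above):

    import subprocess

    def set_manual_scan(target_iqn, portal):
        """Stop iscsid from automatically scanning this node's sessions."""
        subprocess.check_call([
            'iscsiadm', '-m', 'node', '-T', target_iqn, '-p', portal,
            '--op', 'update', '-n', 'node.session.scan', '-v', 'manual',
        ])

    set_manual_scan('iqn.1992-04.com.emc:cx.apm00163100670.b3',
                    '192.168.3.253:3260')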

Comment 17 Benjamin Schmaus 2017-02-17 13:53:46 UTC
They do not use the script, as one or two runs of `multipath -F` will do it.

Comment 21 Paul Grist 2017-03-30 01:05:08 UTC
In further testing last week, some key additional fixes were identified for this collection, and Gorka is in the process of getting them ready to post upstream. We don't have a specific ETA, but we will update the BZs once the patches are ready.

Comment 22 Paul Grist 2017-04-12 03:26:18 UTC
Patches are posted for review, and comprehensive testing is now passing on iSCSI (FC mpath testing will follow).

The relevant patch set will actually be the following, and we will confirm the proper collection needed, which may vary from the initial set proposed. This BZ will be the right place to track status for the collection.

https://review.openstack.org/#/c/455394/
https://review.openstack.org/#/c/455393/
https://review.openstack.org/#/c/455392/
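
One of the changes tracked here (gerrit 459454, "Add support for OS-Brick force disconnect") lets Cinder force the detach. A minimal sketch using os-brick's connector API; the connection property values are illustrative, copied from the logs above:

    from os_brick.initiator import connector

    conn = connector.InitiatorConnector.factory(
        'ISCSI',
        root_helper='sudo cinder-rootwrap /etc/cinder/rootwrap.conf',
        use_multipath=True)

    connection_properties = {
        'target_iqn': 'iqn.1992-04.com.emc:cx.apm00163100670.b3',
        'target_portal': '192.168.3.253:3260',
        'target_lun': 58,
    }

    # force=True flushes the device even if the dm map is reported "in use";
    # ignore_errors=True keeps cleanup going past individual path failures.
    conn.disconnect_volume(connection_properties, device_info=None,
                           force=True, ignore_errors=True)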

Comment 30 Eric Harney 2017-09-07 14:41:43 UTC
*** Bug 1462346 has been marked as a duplicate of this bug. ***

Comment 33 Avi Avraham 2017-10-03 11:23:18 UTC
Verified.
Cinder package version:
# rpm -qa openstack-cinder
openstack-cinder-8.1.1-11.el7ost.noarch
iSCSI package version:
iscsi-initiator-utils-6.2.0.874-4.el7.x86_64
Ran the following command to verify that the backend supports manual LUN scanning:
iscsiadm -m node -p 10.35.146.129:3260 --op update -n node.session.scan -v manual -T iqn.2008-05.com.xtremio:xio00153500071-514f0c50023f6c00
The expected return code is 0.

Verified according to volume.log:
2017-10-01 11:44:40.582 11723 DEBUG oslo_concurrency.processutils [req-ec56c587-f845-4cbb-ac2c-7d233b681d6a 4327e5472bef4dbdb8c44ff31d2f96ad 1acbfb01956048a08ecbd3f018871f2f - - -] Running cmd (subprocess): sudo cinder-rootwrap /etc/cinder/rootwrap.conf iscsiadm -m node -T iqn.2008-05.com.xtremio:xio00153500071-514f0c50023f6c04 -p 10.35.146.193:3260 --op update -n node.session.scan -v manual execute /usr/lib/python2.7/site-packages/oslo_concurrency/processutils.py:344
2017-10-01 11:44:40.603 11723 DEBUG oslo_concurrency.processutils [req-ec56c587-f845-4cbb-ac2c-7d233b681d6a 4327e5472bef4dbdb8c44ff31d2f96ad 1acbfb01956048a08ecbd3f018871f2f - - -] CMD "sudo cinder-rootwrap /etc/cinder/rootwrap.conf iscsiadm -m node -T iqn.2008-05.com.xtremio:xio00153500071-514f0c50023f6c00 -p 10.35.146.129:3260 --interface default --op new" returned: 0 in 0.280s execute /usr/lib/python2.7/site-packages/oslo_concurrency/processutils.py:374

Comment 35 errata-xmlrpc 2017-10-05 14:03:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2861

