Bug 1721361 - Failing to attach 3par iscsi Cinder volume to instance osbrick reports device not found
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-cinder
Version: 15.0 (Stein)
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: z2
Target Release: 16.1 (Train on RHEL 8.2)
Assignee: Pablo Caruana
QA Contact: Tzach Shefi
Docs Contact: Chuck Copello
URL:
Whiteboard:
Depends On:
Blocks: 1624971 1701172 1708705 1713047
 
Reported: 2019-06-18 05:46 UTC by Tzach Shefi
Modified: 2020-10-28 15:46 UTC
18 users

Fixed In Version: openstack-cinder-15.3.1-0.20200722075512.c9c98d8.el8ost
Doc Type: Bug Fix
Doc Text:
This update includes the following bug fix patches related to fully qualified domain names (FQDN).

* Kaminario: Fix unique_fqdn_network option
Previously, the Kaminario driver accepted the unique_fqdn_network configuration option in the driver-specific section. When this option was moved, a regression was introduced: the parameter was only honored if it was defined in the shared configuration group. This patch fixes the regression and makes it possible to define the option in the shared configuration group as well as in the driver-specific section.

* HPE 3PAR: Support duplicated FQDN in network
The 3PAR driver uses the FQDN of the node that is doing the attach as a unique identifier to map the volume. Because the FQDN is not always unique, in some environments the same FQDN can be found on different systems. In those cases, if both try to attach volumes, the second system will fail. For example, this could happen in a QA environment where VMs share names like controller-.localdomain and compute-0.localdomain. This patch adds the unique_fqdn_network configuration option to the 3PAR driver to prevent failures caused by name duplication between systems. (BZ#1721361)
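As a sketch of how the new option might be set, here is a hypothetical cinder.conf fragment. The backend section name and all other values are illustrative, and the exact semantics and default of unique_fqdn_network should be checked against the released driver documentation:

```ini
# Hypothetical cinder.conf fragment -- section name and values are
# examples, not taken from this report.
[3par_iscsi]
volume_driver = cinder.volume.drivers.hpe.hpe_3par_iscsi.HPE3PARISCSIDriver
# Declare that FQDNs on this network are NOT unique, so the driver
# should not rely on the node FQDN as a host identifier on the array.
unique_fqdn_network = false
```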
Clone Of:
Environment:
Last Closed: 2020-10-28 15:45:55 UTC
Target Upstream Version:
Embargoed:


Attachments
Cinder and Nova logs (892.95 KB, application/gzip)
2019-06-18 05:46 UTC, Tzach Shefi


Links
System ID Private Priority Status Summary Last Updated
Launchpad 1834695 0 None None None 2019-06-28 21:10:50 UTC
OpenStack gerrit 713803 0 None MERGED HPE 3PAR: Support duplicated FQDN in network 2021-02-18 14:31:00 UTC
OpenStack gerrit 724271 0 None MERGED HPE 3PAR: Support duplicated FQDN in network 2021-02-18 14:31:01 UTC
Red Hat Product Errata RHSA-2020:4283 0 None None None 2020-10-28 15:46:43 UTC

Description Tzach Shefi 2019-06-18 05:46:46 UTC
Created attachment 1581592 [details]
Cinder and Nova logs

Description of problem: I keep hitting this issue on OSP 13/14/15: nova volume-attach fails to attach a Cinder 3PAR iSCSI volume.
What's more, on some occasions (OSP 13 or 14) instances on one compute manage to attach, while instances on a second compute all fail to attach.

This might be a generic iSCSI attach problem rather than something 3PAR-specific.
 


Version-Release number of selected component (if applicable):
OSP13/14/15 

How reproducible:
Almost always  

Steps to Reproduce:
1. Create a Cinder 3par iscsi backed volume
(overcloud) [stack@undercloud-0 ~]$ cinder create 1 --volume-type 3par --name 3parvol
+--------------------------------+--------------------------------------+
| Property                       | Value                                |
+--------------------------------+--------------------------------------+
| attachments                    | []                                   |
| availability_zone              | nova                                 |
| bootable                       | false                                |
| consistencygroup_id            | None                                 |
| created_at                     | 2019-06-18T05:20:58.000000           |
| description                    | None                                 |
| encrypted                      | False                                |
| id                             | 6ac25adc-7d1e-4c5b-98ca-d35c7195fdbb |
| metadata                       | {}                                   |
| migration_status               | None                                 |
| multiattach                    | False                                |
| name                           | 3parvol                              |
| os-vol-host-attr:host          | controller-1@3par#SSD_r5             |
| os-vol-mig-status-attr:migstat | None                                 |
| os-vol-mig-status-attr:name_id | None                                 |
| os-vol-tenant-attr:tenant_id   | a6d44b66b8554733aef4e753ccffa20f     |
| replication_status             | None                                 |
| size                           | 1                                    |
| snapshot_id                    | None                                 |
| source_volid                   | None                                 |
| status                         | creating                             |
| updated_at                     | 2019-06-18T05:20:58.000000           |
| user_id                        | 329a146cd6884a3bb549bc5566417e2d     |
| volume_type                    | 3par                                 |
+--------------------------------+--------------------------------------+


Vol is available:
| 6ac25adc-7d1e-4c5b-98ca-d35c7195fdbb | available | 3parvol            | 1    | 3par        | false    |    

2. Boot an instance
| 378717cc-299b-444e-b851-7dc5209d3172 | kuku2 | ACTIVE | -          | Running     | private=100.100.99.112 |


3. Try to attach volume to instance 

At first it looks like it will work:
(overcloud) [stack@undercloud-0 ~]$ nova volume-attach 378717cc-299b-444e-b851-7dc5209d3172 6ac25adc-7d1e-4c5b-98ca-d35c7195fdbb auto
+----------+--------------------------------------+
| Property | Value                                |
+----------+--------------------------------------+
| device   | /dev/vdc                             |
| id       | 6ac25adc-7d1e-4c5b-98ca-d35c7195fdbb |
| serverId | 378717cc-299b-444e-b851-7dc5209d3172 |
| tag      | -                                    |
| volumeId | 6ac25adc-7d1e-4c5b-98ca-d35c7195fdbb |
+----------+--------------------------------------+

(overcloud) [stack@undercloud-0 ~]$ cinder list

| 6ac25adc-7d1e-4c5b-98ca-d35c7195fdbb | attaching | 3parvol            | 1    | 3par        | false    |     

But it hangs in this state until the attach timeout is reached.

In the nova-compute log I see many lines like these:
2019-06-18 05:24:58.592 8 DEBUG os_brick.initiator.linuxscsi [req-edde9f49-64dc-4cea-be41-40956faeea4d 329a146cd6884a3bb549bc5566417e2d a6d44b66b8554733aef4e753ccffa20f - default default] Searching for a device in session 50 and hctl ['4', '*', '*', 1] yield: None device_name_by_hctl /usr/lib/python3.6/site-packages/os_brick/initiator/linuxscsi.py:642
2019-06-18 05:24:58.593 8 DEBUG os_brick.initiator.linuxscsi [req-edde9f49-64dc-4cea-be41-40956faeea4d 329a146cd6884a3bb549bc5566417e2d a6d44b66b8554733aef4e753ccffa20f - default default] Searching for a device in session 51 and hctl ['5', '*', '*', 1] yield: None device_name_by_hctl /usr/lib/python3.6/site-packages/os_brick/initiator/linuxscsi.py:642
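The log lines above show os-brick repeatedly scanning each iSCSI session for a block device whose H:C:T:L matches a wildcard pattern and finding none ("yield: None"). A minimal illustrative sketch of that wildcard match (not the actual os-brick implementation; names are made up):

```python
import fnmatch


def match_hctl(pattern, candidates):
    """Return the first block device whose H:C:T:L matches `pattern`.

    pattern: a list of four values where '*' is a wildcard, like the
        ['4', '*', '*', 1] search in the log above.
    candidates: a mapping of (host, channel, target, lun) tuples to
        block device names, standing in for what sysfs would expose.
    Returns the device name, or None when no matching device has
    appeared yet -- the "yield: None" case seen in the log.
    """
    for hctl, dev in candidates.items():
        if all(fnmatch.fnmatch(str(part), str(pat))
               for part, pat in zip(hctl, pattern)):
            return dev
    return None
```

In the failing attach, this lookup keeps returning None for every session until the overall attach timeout fires.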



Then, after the timeout, the volume returns to available:
| 6ac25adc-7d1e-4c5b-98ca-d35c7195fdbb | available | 3parvol            | 1    | 3par        | false    |   


And around this time we see this traceback in the nova log:

2019-06-18 05:18:20.403 8 DEBUG oslo_concurrency.lockutils [req-dd50ec37-dcbf-4cca-9127-20938ba8d93e 329a146cd6884a3bb549bc5566417e2d a6d44b66b8554733aef4e753ccffa20f - default default] Lock "connect_volume" released by "os_brick.initiator.connectors.iscsi.ISCSIConnector.connect_volume" :: held 427.893s inner /usr/lib/python3.6/site-packages/oslo_concurrency/lockutils.py:339
2019-06-18 05:18:20.404 8 DEBUG os_brick.initiator.connectors.iscsi [req-dd50ec37-dcbf-4cca-9127-20938ba8d93e 329a146cd6884a3bb549bc5566417e2d a6d44b66b8554733aef4e753ccffa20f - default default] <== connect_volume: exception (427893ms) VolumeDeviceNotFound('Volume device not found at .',) trace_logging_wrapper /usr/lib/python3.6/site-packages/os_brick/utils.py:156
2019-06-18 05:18:20.404 8 ERROR nova.virt.block_device [req-dd50ec37-dcbf-4cca-9127-20938ba8d93e 329a146cd6884a3bb549bc5566417e2d a6d44b66b8554733aef4e753ccffa20f - default default] [instance: 378717cc-299b-444e-b851-7dc5209d3172] Driver failed to attach volume 020b06d8-822c-4b22-a241-ff73eec7750c at /dev/vdc: os_brick.exception.VolumeDeviceNotFound: Volume device not found at .
2019-06-18 05:18:20.404 8 ERROR nova.virt.block_device [instance: 378717cc-299b-444e-b851-7dc5209d3172] Traceback (most recent call last):
2019-06-18 05:18:20.404 8 ERROR nova.virt.block_device [instance: 378717cc-299b-444e-b851-7dc5209d3172]   File "/usr/lib/python3.6/site-packages/nova/virt/block_device.py", line 567, in _volume_attach
2019-06-18 05:18:20.404 8 ERROR nova.virt.block_device [instance: 378717cc-299b-444e-b851-7dc5209d3172]     device_type=self['device_type'], encryption=encryption)
2019-06-18 05:18:20.404 8 ERROR nova.virt.block_device [instance: 378717cc-299b-444e-b851-7dc5209d3172]   File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 1495, in attach_volume
2019-06-18 05:18:20.404 8 ERROR nova.virt.block_device [instance: 378717cc-299b-444e-b851-7dc5209d3172]     encryption=encryption)
2019-06-18 05:18:20.404 8 ERROR nova.virt.block_device [instance: 378717cc-299b-444e-b851-7dc5209d3172]   File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 1277, in _connect_volume
2019-06-18 05:18:20.404 8 ERROR nova.virt.block_device [instance: 378717cc-299b-444e-b851-7dc5209d3172]     vol_driver.connect_volume(connection_info, instance)
2019-06-18 05:18:20.404 8 ERROR nova.virt.block_device [instance: 378717cc-299b-444e-b851-7dc5209d3172]   File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/volume/iscsi.py", line 64, in connect_volume
2019-06-18 05:18:20.404 8 ERROR nova.virt.block_device [instance: 378717cc-299b-444e-b851-7dc5209d3172]     device_info = self.connector.connect_volume(connection_info['data'])
2019-06-18 05:18:20.404 8 ERROR nova.virt.block_device [instance: 378717cc-299b-444e-b851-7dc5209d3172]   File "/usr/lib/python3.6/site-packages/os_brick/utils.py", line 150, in trace_logging_wrapper
2019-06-18 05:18:20.404 8 ERROR nova.virt.block_device [instance: 378717cc-299b-444e-b851-7dc5209d3172]     result = f(*args, **kwargs)
2019-06-18 05:18:20.404 8 ERROR nova.virt.block_device [instance: 378717cc-299b-444e-b851-7dc5209d3172]   File "/usr/lib/python3.6/site-packages/oslo_concurrency/lockutils.py", line 328, in inner
2019-06-18 05:18:20.404 8 ERROR nova.virt.block_device [instance: 378717cc-299b-444e-b851-7dc5209d3172]     return f(*args, **kwargs)
2019-06-18 05:18:20.404 8 ERROR nova.virt.block_device [instance: 378717cc-299b-444e-b851-7dc5209d3172]   File "/usr/lib/python3.6/site-packages/os_brick/initiator/connectors/iscsi.py", line 518, in connect_volume
2019-06-18 05:18:20.404 8 ERROR nova.virt.block_device [instance: 378717cc-299b-444e-b851-7dc5209d3172]     self._cleanup_connection(connection_properties, force=True)
2019-06-18 05:18:20.404 8 ERROR nova.virt.block_device [instance: 378717cc-299b-444e-b851-7dc5209d3172]   File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2019-06-18 05:18:20.404 8 ERROR nova.virt.block_device [instance: 378717cc-299b-444e-b851-7dc5209d3172]     self.force_reraise()
2019-06-18 05:18:20.404 8 ERROR nova.virt.block_device [instance: 378717cc-299b-444e-b851-7dc5209d3172]   File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2019-06-18 05:18:20.404 8 ERROR nova.virt.block_device [instance: 378717cc-299b-444e-b851-7dc5209d3172]     six.reraise(self.type_, self.value, self.tb)
2019-06-18 05:18:20.404 8 ERROR nova.virt.block_device [instance: 378717cc-299b-444e-b851-7dc5209d3172]   File "/usr/lib/python3.6/site-packages/six.py", line 693, in reraise
2019-06-18 05:18:20.404 8 ERROR nova.virt.block_device [instance: 378717cc-299b-444e-b851-7dc5209d3172]     raise value
2019-06-18 05:18:20.404 8 ERROR nova.virt.block_device [instance: 378717cc-299b-444e-b851-7dc5209d3172]   File "/usr/lib/python3.6/site-packages/os_brick/initiator/connectors/iscsi.py", line 511, in connect_volume
2019-06-18 05:18:20.404 8 ERROR nova.virt.block_device [instance: 378717cc-299b-444e-b851-7dc5209d3172]     return self._connect_multipath_volume(connection_properties)
2019-06-18 05:18:20.404 8 ERROR nova.virt.block_device [instance: 378717cc-299b-444e-b851-7dc5209d3172]   File "/usr/lib/python3.6/site-packages/os_brick/utils.py", line 61, in _wrapper
2019-06-18 05:18:20.404 8 ERROR nova.virt.block_device [instance: 378717cc-299b-444e-b851-7dc5209d3172]     return r.call(f, *args, **kwargs)
2019-06-18 05:18:20.404 8 ERROR nova.virt.block_device [instance: 378717cc-299b-444e-b851-7dc5209d3172]   File "/usr/lib/python3.6/site-packages/retrying.py", line 229, in call
2019-06-18 05:18:20.404 8 ERROR nova.virt.block_device [instance: 378717cc-299b-444e-b851-7dc5209d3172]     raise attempt.get()
2019-06-18 05:18:20.404 8 ERROR nova.virt.block_device [instance: 378717cc-299b-444e-b851-7dc5209d3172]   File "/usr/lib/python3.6/site-packages/retrying.py", line 261, in get
2019-06-18 05:18:20.404 8 ERROR nova.virt.block_device [instance: 378717cc-299b-444e-b851-7dc5209d3172]     six.reraise(self.value[0], self.value[1], self.value[2])
2019-06-18 05:18:20.404 8 ERROR nova.virt.block_device [instance: 378717cc-299b-444e-b851-7dc5209d3172]   File "/usr/lib/python3.6/site-packages/six.py", line 693, in reraise
2019-06-18 05:18:20.404 8 ERROR nova.virt.block_device [instance: 378717cc-299b-444e-b851-7dc5209d3172]     raise value
2019-06-18 05:18:20.404 8 ERROR nova.virt.block_device [instance: 378717cc-299b-444e-b851-7dc5209d3172]   File "/usr/lib/python3.6/site-packages/retrying.py", line 217, in call
2019-06-18 05:18:20.404 8 ERROR nova.virt.block_device [instance: 378717cc-299b-444e-b851-7dc5209d3172]     attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
2019-06-18 05:18:20.404 8 ERROR nova.virt.block_device [instance: 378717cc-299b-444e-b851-7dc5209d3172]   File "/usr/lib/python3.6/site-packages/os_brick/initiator/connectors/iscsi.py", line 768, in _connect_multipath_volume
2019-06-18 05:18:20.404 8 ERROR nova.virt.block_device [instance: 378717cc-299b-444e-b851-7dc5209d3172]     raise exception.VolumeDeviceNotFound(device='')
2019-06-18 05:18:20.404 8 ERROR nova.virt.block_device [instance: 378717cc-299b-444e-b851-7dc5209d3172] os_brick.exception.VolumeDeviceNotFound: Volume device not found at .
2019-06-18 05:18:20.404 8 ERROR nova.virt.block_device [instance: 378717cc-299b-444e-b851-7dc5209d3172]
2019-06-18 05:18:21.010 8 DEBUG oslo_service.periodic_task [req-9386a516-4586-4dab-b761-12fc2ca4877a - - - - -] Running periodic task ComputeManager._reclaim_queued_deletes run_periodic_tasks /usr/lib/python3.6/site-packages/oslo_service/periodic_task.py:217
2019-06-18 05:18:21.010 8 DEBUG nova.compute.manager [req-9386a516-4586-4dab-b761-12fc2ca4877a - - - - -] CONF.reclaim_instance_interval <= 0, skipping... _reclaim_queued_deletes /usr/lib/python3.6/site-packages/nova/compute/manager.py:8033


Attaching logs for full details.

Comment 2 Lee Yarwood 2019-06-18 07:44:19 UTC
This smells like a misconfigured environment but either way isn't an openstack-nova issue unless the os-brick or cinder folks can point to us calling their APIs incorrectly here.

Comment 15 Pablo Caruana 2019-06-27 10:09:36 UTC
(In reply to Lee Yarwood from comment #2)
> This smells like a misconfigured environment but either way isn't an
> openstack-nova issue unless the os-brick or cinder folks can point to us
> calling their APIs incorrectly here.

Gorka took a look at it, filed an upstream bug [1], and proposed a fix for it [2].

On the configuration side, those particular setups are not using multipath.

The main issue of not detecting the WWN seems to have also affected the WWN verification feature [3]. In that case the new code didn't take into account that we may not find the WWN within the given time.

Even the iSCSI single-path attach fails on slow WWN sysfs population (in some scenarios taking around 16 seconds).
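The timing problem described above amounts to polling sysfs for a WWN with a deadline. A hedged sketch of that pattern, assuming a hypothetical `read_wwn` callable (this is not the os-brick API, just an illustration of the retry-with-timeout behavior the fix needs):

```python
import time


def wait_for_wwn(read_wwn, timeout=20.0, interval=1.0):
    """Poll until the WWN appears in sysfs or the deadline passes.

    read_wwn: a callable returning the WWN string, or None while the
        kernel has not populated sysfs yet (population can take ~16
        seconds on the slow setups described in this comment).
    Returns the WWN, or None on timeout -- the caller must treat a
    missing WWN as non-fatal instead of failing the attach outright.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        wwn = read_wwn()
        if wwn:
            return wwn
        time.sleep(interval)
    return None
```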



[1]: https://bugs.launchpad.net/os-brick/+bug/1834443
[2]: https://review.opendev.org/667830
[3]: https://bugs.launchpad.net/os-brick/+bug/1819577

Comment 16 Pablo Caruana 2019-06-28 09:17:23 UTC
Additionally, from what was observed on one of the RHOSP 13 setups, there is a mismatch in the initiator name: the compute-x host uses iqn.1994-05.com.redhat:982a871546c, but the 3PAR web interface shows the same compute-x with iqn.1994-05.com.redhat:587fe9ef057.

This suggests that the 3PAR driver assumes unique FQDNs (and in lab/QA environments hostnames tend to keep generic/default conventions, meaning more than one node with the same FQDN can connect to the same storage array).


As a workaround, there are two possible options for moving forward:

- From the 3PAR web interface, delete the hosts "compute-x" and "compute-y" and let the Cinder 3PAR driver recreate them correctly on the next attachment operation.

- Restart Nova compute and Cinder volume with unique "host" configurations.

  In Cinder we can use "host" under "[DEFAULT]" or "backend_host" under the specific driver's section.

  In Nova it is also "host" under "[DEFAULT]".
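A sketch of the second workaround as configuration fragments; the hostnames and section names are examples only, not taken from the affected environment:

```ini
# /etc/cinder/cinder.conf (example values)
[DEFAULT]
host = controller-1.example.com

[3par]
# Alternatively, pin a unique per-backend name instead:
backend_host = controller-1-3par

# /etc/nova/nova.conf on each compute -- must be unique per node
[DEFAULT]
host = compute-0.example.com
```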


The real solution is fixing the driver, in whatever hpe_3par_common.py and hpe_3par_iscsi.py are currently doing.

Comment 21 Sneha Rai 2019-11-18 07:08:44 UTC
Checked the logs; the test was performed on 18-Jun.
Also checked launchpad bug 1834443 and the code review.
The related fix (https://review.opendev.org/#/c/667830/) was merged into master on 15-Jul.
This issue has been fixed in openstack/os-brick 2.10.0 (released on 05-Sep).

May we request the Red Hat team to test again with the latest RHOSP that includes os-brick 2.10.0?

Comment 22 Alan Bishop 2019-11-18 14:03:24 UTC
@Sneha,

The patch you linked addresses a different issue. For this BZ, the launchpad bug [1] has a good explanation of the problem, and references a similar bug in another cinder driver [2]. If you follow [2] you'll see how the other cinder driver resolved the problem.

[1] https://bugs.launchpad.net/cinder/+bug/1834695
[2] https://bugs.launchpad.net/cinder/+bug/1720147

Comment 23 Antonios Dakopoulos 2020-01-14 14:55:01 UTC
Hello RH,
HPE 3PAR will review this and will follow-up on this BZ

-Antonios

Comment 24 Sneha Rai 2020-01-24 11:38:28 UTC
Looking into this issue.
Trying in our local setup.


Regards,
Raghavendra Tilay.

Comment 25 Sneha Rai 2020-02-25 12:13:05 UTC
Hi,

We were able to reproduce the issue on our setup.
We checked the logs and code flow and identified the code changes to be made.

Regards,
Raghavendra Tilay.

Comment 26 Sneha Rai 2020-03-19 12:49:30 UTC
Hi,

Performed the code changes, manual testing (in progress; some scenarios remain), and updated the unit test (UT) code.
Submitted a patch for community review:
https://review.opendev.org/#/c/713803/

Regards,
Raghavendra Tilay.

Comment 27 Alan Bishop 2020-03-19 12:54:13 UTC
Thanks, Sneha and Raghavendra!

Comment 28 Sneha Rai 2020-04-16 14:38:46 UTC
Hi,

This is regarding https://review.opendev.org/#/c/713803/
It's under review by community reviewers.
We have already submitted a few patch sets.
Currently we are working on the review comments received on patch set 8.

Regards,
Raghavendra Tilay.

Comment 29 Sneha Rai 2020-04-22 12:01:46 UTC
Hi,

We got approval from the community reviewers.
The fix for this bug has been merged into master (Ussuri release).
https://review.opendev.org/#/c/713803/

Regards,
Raghavendra Tilay.

Comment 30 Alan Bishop 2020-04-22 15:39:27 UTC
Retargeting this for OSP 16.1.

@Sneha or Raghavendra, please submit an upstream patch to backport the fix to stable/train.

Comment 31 Sneha Rai 2020-04-24 04:26:50 UTC
Okay Alan,
We will follow the process for the backport.

Regards,
Raghavendra Tilay.

Comment 32 Sneha Rai 2020-04-28 11:54:38 UTC
Hi,

The patch below has been proposed for backport to stable/train:
https://review.opendev.org/#/c/723520

Regards,
Raghavendra Tilay.

Comment 33 Sneha Rai 2020-05-05 14:45:13 UTC
Hi,

Patch 723520 has been abandoned.
We are now working on the patch below for the backport to stable/train:
https://review.opendev.org/#/c/724271/

Regards,
Raghavendra Tilay.

Comment 34 Sneha Rai 2020-06-30 11:13:05 UTC
Hi,

Apologies for the delay.
We were working on the RHOSP 16 certification of the HPE 3PAR driver (and other tasks),
so this was put on hold.

We have been actively working on the backport patch for the last week:
https://review.opendev.org/#/c/724271
The Zuul tests and HPE CI passed.

But we received some review comments today.
We will update the code accordingly.

Regards,
Raghavendra Tilay.

Comment 35 Sneha Rai 2020-07-14 12:07:06 UTC
Hi,

Backport patch for Train:
[1] https://review.opendev.org/#/c/724271/

Addressed all review comments.
But Gorka discovered a regression bug and submitted a patch for it to master:
[2] https://review.opendev.org/#/c/739033/
Review of this patch is in progress.

Once patch [2] is approved and backported to Ussuri and Train, patch [1] will be approved.

Regards,
Raghavendra Tilay.

Comment 36 Sneha Rai 2020-07-24 05:06:39 UTC
Hi,

The backport patch has been approved and merged into Train:
[1] https://review.opendev.org/#/c/724271/

The patches for the regression bug have also been approved and merged:
[2] https://review.opendev.org/#/c/739033/  [master/victoria]
https://review.opendev.org/#/c/741428/  [ussuri]
https://review.opendev.org/#/c/741879/  [train]


Regards,
Raghavendra Tilay.

Comment 50 errata-xmlrpc 2020-10-28 15:45:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: openstack-cinder security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:4283

