Created attachment 1581592 [details]
Cinder and Nova logs

Description of problem:
I keep hitting this issue on OSP13/14/15: nova volume-attach fails to attach a Cinder 3PAR iSCSI volume. What is odd is that on some occasions (OSP13 or 14) instances on one compute manage to attach while instances on a second compute all fail to attach. This might be a generic iSCSI attach problem rather than something 3PAR-specific.

Version-Release number of selected component (if applicable):
OSP13/14/15

How reproducible:
Almost always

Steps to Reproduce:

1. Create a Cinder 3PAR iSCSI backed volume

(overcloud) [stack@undercloud-0 ~]$ cinder create 1 --volume-type 3par --name 3parvol
+--------------------------------+--------------------------------------+
| Property                       | Value                                |
+--------------------------------+--------------------------------------+
| attachments                    | []                                   |
| availability_zone              | nova                                 |
| bootable                       | false                                |
| consistencygroup_id            | None                                 |
| created_at                     | 2019-06-18T05:20:58.000000           |
| description                    | None                                 |
| encrypted                      | False                                |
| id                             | 6ac25adc-7d1e-4c5b-98ca-d35c7195fdbb |
| metadata                       | {}                                   |
| migration_status               | None                                 |
| multiattach                    | False                                |
| name                           | 3parvol                              |
| os-vol-host-attr:host          | controller-1@3par#SSD_r5             |
| os-vol-mig-status-attr:migstat | None                                 |
| os-vol-mig-status-attr:name_id | None                                 |
| os-vol-tenant-attr:tenant_id   | a6d44b66b8554733aef4e753ccffa20f     |
| replication_status             | None                                 |
| size                           | 1                                    |
| snapshot_id                    | None                                 |
| source_volid                   | None                                 |
| status                         | creating                             |
| updated_at                     | 2019-06-18T05:20:58.000000           |
| user_id                        | 329a146cd6884a3bb549bc5566417e2d     |
| volume_type                    | 3par                                 |
+--------------------------------+--------------------------------------+

The volume becomes available:
| 6ac25adc-7d1e-4c5b-98ca-d35c7195fdbb | available | 3parvol | 1 | 3par | false |

2. Boot an instance
| 378717cc-299b-444e-b851-7dc5209d3172 | kuku2 | ACTIVE | - | Running | private=100.100.99.112 |

3. Try to attach the volume to the instance

At first it looks like it is going to work:

(overcloud) [stack@undercloud-0 ~]$ nova volume-attach 378717cc-299b-444e-b851-7dc5209d3172 6ac25adc-7d1e-4c5b-98ca-d35c7195fdbb auto
+----------+--------------------------------------+
| Property | Value                                |
+----------+--------------------------------------+
| device   | /dev/vdc                             |
| id       | 6ac25adc-7d1e-4c5b-98ca-d35c7195fdbb |
| serverId | 378717cc-299b-444e-b851-7dc5209d3172 |
| tag      | -                                    |
| volumeId | 6ac25adc-7d1e-4c5b-98ca-d35c7195fdbb |
+----------+--------------------------------------+

(overcloud) [stack@undercloud-0 ~]$ cinder list
| 6ac25adc-7d1e-4c5b-98ca-d35c7195fdbb | attaching | 3parvol | 1 | 3par | false |

But the attach hangs like this until we reach the attach timeout. In the nova compute log I see a lot of lines such as:

2019-06-18 05:24:58.592 8 DEBUG os_brick.initiator.linuxscsi [req-edde9f49-64dc-4cea-be41-40956faeea4d 329a146cd6884a3bb549bc5566417e2d a6d44b66b8554733aef4e753ccffa20f - default default] Searching for a device in session 50 and hctl ['4', '*', '*', 1] yield: None device_name_by_hctl /usr/lib/python3.6/site-packages/os_brick/initiator/linuxscsi.py:642
2019-06-18 05:24:58.593 8 DEBUG os_brick.initiator.linuxscsi [req-edde9f49-64dc-4cea-be41-40956faeea4d 329a146cd6884a3bb549bc5566417e2d a6d44b66b8554733aef4e753ccffa20f - default default] Searching for a device in session 51 and hctl ['5', '*', '*', 1] yield: None device_name_by_hctl /usr/lib/python3.6/site-packages/os_brick/initiator/linuxscsi.py:642

Then, after the timeout, the volume returns to available:
| 6ac25adc-7d1e-4c5b-98ca-d35c7195fdbb | available | 3parvol | 1 | 3par | false |

And this is about the time we see this traceback in the nova compute log:

2019-06-18 05:18:20.403 8 DEBUG oslo_concurrency.lockutils [req-dd50ec37-dcbf-4cca-9127-20938ba8d93e 329a146cd6884a3bb549bc5566417e2d a6d44b66b8554733aef4e753ccffa20f - default default] Lock "connect_volume" released by "os_brick.initiator.connectors.iscsi.ISCSIConnector.connect_volume" :: held 427.893s inner /usr/lib/python3.6/site-packages/oslo_concurrency/lockutils.py:339
2019-06-18 05:18:20.404 8 DEBUG os_brick.initiator.connectors.iscsi [req-dd50ec37-dcbf-4cca-9127-20938ba8d93e 329a146cd6884a3bb549bc5566417e2d a6d44b66b8554733aef4e753ccffa20f - default default] <== connect_volume: exception (427893ms) VolumeDeviceNotFound('Volume device not found at .',) trace_logging_wrapper /usr/lib/python3.6/site-packages/os_brick/utils.py:156
2019-06-18 05:18:20.404 8 ERROR nova.virt.block_device [req-dd50ec37-dcbf-4cca-9127-20938ba8d93e 329a146cd6884a3bb549bc5566417e2d a6d44b66b8554733aef4e753ccffa20f - default default] [instance: 378717cc-299b-444e-b851-7dc5209d3172] Driver failed to attach volume 020b06d8-822c-4b22-a241-ff73eec7750c at /dev/vdc: os_brick.exception.VolumeDeviceNotFound: Volume device not found at .
2019-06-18 05:18:20.404 8 ERROR nova.virt.block_device [instance: 378717cc-299b-444e-b851-7dc5209d3172] Traceback (most recent call last):
2019-06-18 05:18:20.404 8 ERROR nova.virt.block_device [instance: 378717cc-299b-444e-b851-7dc5209d3172]   File "/usr/lib/python3.6/site-packages/nova/virt/block_device.py", line 567, in _volume_attach
2019-06-18 05:18:20.404 8 ERROR nova.virt.block_device [instance: 378717cc-299b-444e-b851-7dc5209d3172]     device_type=self['device_type'], encryption=encryption)
2019-06-18 05:18:20.404 8 ERROR nova.virt.block_device [instance: 378717cc-299b-444e-b851-7dc5209d3172]   File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 1495, in attach_volume
2019-06-18 05:18:20.404 8 ERROR nova.virt.block_device [instance: 378717cc-299b-444e-b851-7dc5209d3172]     encryption=encryption)
2019-06-18 05:18:20.404 8 ERROR nova.virt.block_device [instance: 378717cc-299b-444e-b851-7dc5209d3172]   File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 1277, in _connect_volume
2019-06-18 05:18:20.404 8 ERROR nova.virt.block_device [instance: 378717cc-299b-444e-b851-7dc5209d3172]     vol_driver.connect_volume(connection_info, instance)
2019-06-18 05:18:20.404 8 ERROR nova.virt.block_device [instance: 378717cc-299b-444e-b851-7dc5209d3172]   File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/volume/iscsi.py", line 64, in connect_volume
2019-06-18 05:18:20.404 8 ERROR nova.virt.block_device [instance: 378717cc-299b-444e-b851-7dc5209d3172]     device_info = self.connector.connect_volume(connection_info['data'])
2019-06-18 05:18:20.404 8 ERROR nova.virt.block_device [instance: 378717cc-299b-444e-b851-7dc5209d3172]   File "/usr/lib/python3.6/site-packages/os_brick/utils.py", line 150, in trace_logging_wrapper
2019-06-18 05:18:20.404 8 ERROR nova.virt.block_device [instance: 378717cc-299b-444e-b851-7dc5209d3172]     result = f(*args, **kwargs)
2019-06-18 05:18:20.404 8 ERROR nova.virt.block_device [instance: 378717cc-299b-444e-b851-7dc5209d3172]   File "/usr/lib/python3.6/site-packages/oslo_concurrency/lockutils.py", line 328, in inner
2019-06-18 05:18:20.404 8 ERROR nova.virt.block_device [instance: 378717cc-299b-444e-b851-7dc5209d3172]     return f(*args, **kwargs)
2019-06-18 05:18:20.404 8 ERROR nova.virt.block_device [instance: 378717cc-299b-444e-b851-7dc5209d3172]   File "/usr/lib/python3.6/site-packages/os_brick/initiator/connectors/iscsi.py", line 518, in connect_volume
2019-06-18 05:18:20.404 8 ERROR nova.virt.block_device [instance: 378717cc-299b-444e-b851-7dc5209d3172]     self._cleanup_connection(connection_properties, force=True)
2019-06-18 05:18:20.404 8 ERROR nova.virt.block_device [instance: 378717cc-299b-444e-b851-7dc5209d3172]   File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2019-06-18 05:18:20.404 8 ERROR nova.virt.block_device [instance: 378717cc-299b-444e-b851-7dc5209d3172]     self.force_reraise()
2019-06-18 05:18:20.404 8 ERROR nova.virt.block_device [instance: 378717cc-299b-444e-b851-7dc5209d3172]   File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2019-06-18 05:18:20.404 8 ERROR nova.virt.block_device [instance: 378717cc-299b-444e-b851-7dc5209d3172]     six.reraise(self.type_, self.value, self.tb)
2019-06-18 05:18:20.404 8 ERROR nova.virt.block_device [instance: 378717cc-299b-444e-b851-7dc5209d3172]   File "/usr/lib/python3.6/site-packages/six.py", line 693, in reraise
2019-06-18 05:18:20.404 8 ERROR nova.virt.block_device [instance: 378717cc-299b-444e-b851-7dc5209d3172]     raise value
2019-06-18 05:18:20.404 8 ERROR nova.virt.block_device [instance: 378717cc-299b-444e-b851-7dc5209d3172]   File "/usr/lib/python3.6/site-packages/os_brick/initiator/connectors/iscsi.py", line 511, in connect_volume
2019-06-18 05:18:20.404 8 ERROR nova.virt.block_device [instance: 378717cc-299b-444e-b851-7dc5209d3172]     return self._connect_multipath_volume(connection_properties)
2019-06-18 05:18:20.404 8 ERROR nova.virt.block_device [instance: 378717cc-299b-444e-b851-7dc5209d3172]   File "/usr/lib/python3.6/site-packages/os_brick/utils.py", line 61, in _wrapper
2019-06-18 05:18:20.404 8 ERROR nova.virt.block_device [instance: 378717cc-299b-444e-b851-7dc5209d3172]     return r.call(f, *args, **kwargs)
2019-06-18 05:18:20.404 8 ERROR nova.virt.block_device [instance: 378717cc-299b-444e-b851-7dc5209d3172]   File "/usr/lib/python3.6/site-packages/retrying.py", line 229, in call
2019-06-18 05:18:20.404 8 ERROR nova.virt.block_device [instance: 378717cc-299b-444e-b851-7dc5209d3172]     raise attempt.get()
2019-06-18 05:18:20.404 8 ERROR nova.virt.block_device [instance: 378717cc-299b-444e-b851-7dc5209d3172]   File "/usr/lib/python3.6/site-packages/retrying.py", line 261, in get
2019-06-18 05:18:20.404 8 ERROR nova.virt.block_device [instance: 378717cc-299b-444e-b851-7dc5209d3172]     six.reraise(self.value[0], self.value[1], self.value[2])
2019-06-18 05:18:20.404 8 ERROR nova.virt.block_device [instance: 378717cc-299b-444e-b851-7dc5209d3172]   File "/usr/lib/python3.6/site-packages/six.py", line 693, in reraise
2019-06-18 05:18:20.404 8 ERROR nova.virt.block_device [instance: 378717cc-299b-444e-b851-7dc5209d3172]     raise value
2019-06-18 05:18:20.404 8 ERROR nova.virt.block_device [instance: 378717cc-299b-444e-b851-7dc5209d3172]   File "/usr/lib/python3.6/site-packages/retrying.py", line 217, in call
2019-06-18 05:18:20.404 8 ERROR nova.virt.block_device [instance: 378717cc-299b-444e-b851-7dc5209d3172]     attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
2019-06-18 05:18:20.404 8 ERROR nova.virt.block_device [instance: 378717cc-299b-444e-b851-7dc5209d3172]   File "/usr/lib/python3.6/site-packages/os_brick/initiator/connectors/iscsi.py", line 768, in _connect_multipath_volume
2019-06-18 05:18:20.404 8 ERROR nova.virt.block_device [instance: 378717cc-299b-444e-b851-7dc5209d3172]     raise exception.VolumeDeviceNotFound(device='')
2019-06-18 05:18:20.404 8 ERROR nova.virt.block_device [instance: 378717cc-299b-444e-b851-7dc5209d3172] os_brick.exception.VolumeDeviceNotFound: Volume device not found at .
2019-06-18 05:18:20.404 8 ERROR nova.virt.block_device [instance: 378717cc-299b-444e-b851-7dc5209d3172]
2019-06-18 05:18:21.010 8 DEBUG oslo_service.periodic_task [req-9386a516-4586-4dab-b761-12fc2ca4877a - - - - -] Running periodic task ComputeManager._reclaim_queued_deletes run_periodic_tasks /usr/lib/python3.6/site-packages/oslo_service/periodic_task.py:217
2019-06-18 05:18:21.010 8 DEBUG nova.compute.manager [req-9386a516-4586-4dab-b761-12fc2ca4877a - - - - -] CONF.reclaim_instance_interval <= 0, skipping... _reclaim_queued_deletes /usr/lib/python3.6/site-packages/nova/compute/manager.py:8033

Attaching logs for full details.
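For anyone else reading those "Searching for a device ... yield: None" DEBUG lines above: os-brick is waiting for the block device of the newly mapped LUN to appear in sysfs under its SCSI host/channel/target/LUN address, and it never shows up, so the search keeps returning None until the attach times out. A minimal sketch of that kind of lookup, for illustration only (this is not the actual os_brick.initiator.linuxscsi code, and the sysfs glob pattern below is an assumption):

import glob
import os


def find_device_by_hctl(host, channel, target, lun):
    # Illustrative sketch only -- the real lookup is done by os-brick's
    # linuxscsi.device_name_by_hctl(). The sysfs layout assumed here is
    # /sys/bus/scsi/devices/<h:c:t:l>/block/<name>; '*' wildcards for
    # channel/target mirror the hctl values seen in the log above.
    pattern = '/sys/bus/scsi/devices/%s:%s:%s:%s/block/*' % (
        host, channel, target, lun)
    matches = glob.glob(pattern)
    if not matches:
        # This is what the DEBUG lines show: the LUN never appears in
        # sysfs, so the search keeps yielding None until the attach
        # times out on the compute node.
        return None
    return os.path.basename(matches[0])


# e.g. the log above searches SCSI host 4, any channel/target, LUN 1
print(find_device_by_hctl('4', '*', '*', '1'))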
This smells like a misconfigured environment but either way isn't an openstack-nova issue unless the os-brick or cinder folks can point to us calling their APIs incorrectly here.
(In reply to Lee Yarwood from comment #2)
> This smells like a misconfigured environment but either way isn't an
> openstack-nova issue unless the os-brick or cinder folks can point to us
> calling their APIs incorrectly here.

Gorka took a look at it, filed an upstream bug [1] and proposed a fix for it [2].

On the configuration side, those particular setups are not using multipath.

The main issue of not detecting the WWN seems to have also affected the WWN verification feature [3]. In that case it is because the new code didn't take into account that we may not find the WWN within the given time. Even the plain iSCSI single-path attach fails when the WWN sysfs entry is slow to populate (in some scenarios taking around 16 seconds).

[1]: https://bugs.launchpad.net/os-brick/+bug/1834443
[2]: https://review.opendev.org/667830
[3]: https://bugs.launchpad.net/os-brick/+bug/1819577
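To make the timing problem concrete: the WWN/WWID of the new device can be populated in sysfs well after the block device itself appears (around 16 seconds in the scenarios above), so any code that reads it only once will fail. Below is a rough sketch of the kind of poll-and-wait behaviour that is needed; it is only an illustration under the assumption that the WWID is read from /sys/block/<dev>/device/wwid, not the actual os-brick fix in [2]:

import time


def wait_for_wwn(device, timeout=30, interval=1):
    # Illustrative sketch only -- the real handling lives in os-brick
    # (see [2] above); the sysfs path and the timeout/interval values
    # are assumptions for the example.
    wwid_path = '/sys/block/%s/device/wwid' % device
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with open(wwid_path) as f:
                wwn = f.read().strip()
            if wwn:
                return wwn
        except IOError:
            pass  # sysfs entry not populated yet, keep polling
        time.sleep(interval)
    # Callers must be prepared to handle "no WWN yet" instead of
    # failing the whole attach.
    return None

# Example usage: wait_for_wwn('sdb')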
Additionally, from what was observed on one of the RHOSP 13 setups, there is a mismatch in the initiator name: the compute-x host reports iqn.1994-05.com.redhat:982a871546c, but the 3PAR Web interface shows iqn.1994-05.com.redhat:587fe9ef057 for the same compute-x. This suggests the 3PAR driver assumes unique FQDNs (and in lab/QA environments hostnames tend to keep generic/default conventions, meaning more than one deployment may end up connecting to the same storage array).

As a workaround, there are 2 possible options for moving forward:

- From the 3PAR web interface, delete the hosts "compute-x" and "compute-y" and let the Cinder 3PAR driver recreate them correctly on the next attachment operation.
- Restart Nova compute and Cinder volume with unique "host" configurations (see the example excerpts after this comment). In Cinder we can use "host" under "[DEFAULT]" or "backend_host" under the specific driver's section. In Nova it is also "host" under "[DEFAULT]".

The real solution is to fix whatever hpe_3par_common.py and hpe_3par_iscsi.py are currently doing in the driver.
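For the second workaround, illustrative config excerpts of what unique "host" values could look like. The hostnames below are placeholders for this environment; the "[3par]" backend section name is taken from the os-vol-host-attr:host value (controller-1@3par#SSD_r5) shown in the description:

# cinder.conf on the node running cinder-volume (placeholder values)
[DEFAULT]
host = controller-1.localdomain

[3par]
# backend_host takes precedence over [DEFAULT]/host for this backend only
backend_host = controller-1-3par

# nova.conf on each compute node (placeholder value, must be unique per compute)
[DEFAULT]
host = compute-0.localdomain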
Checked logs; the test was performed on 18-Jun. Also checked launchpad bug 1834443 and the code review. The related fix (https://review.opendev.org/#/c/667830/) was merged into master on 15-Jul. This issue has been fixed in openstack/os-brick 2.10.0 (released on 05-Sep). May we request the Red Hat team to test again with the latest RHOSP that includes os-brick 2.10.0?
@Sneha, The patch you linked addresses a different issue. For this BZ, the launchpad bug [1] has a good explanation of the problem, and references a similar bug in another cinder driver [2]. If you follow [2] you'll see how the other cinder driver resolved the problem. [1] https://bugs.launchpad.net/cinder/+bug/1834695 [2] https://bugs.launchpad.net/cinder/+bug/1720147
Hello RH, HPE 3PAR will review this and will follow up on this BZ. -Antonios
Looking into this issue. Trying in our local setup. Regards, Raghavendra Tilay.
Hi, We are able to reproduce the issue on our setup. Checked logs & code flow; we have identified the code changes to be done. Regards, Raghavendra Tilay.
Hi, Performed code changes, manual testing (in progress - some scenarios left) & updated UT (unit test) code. Submitted patch for community review: https://review.opendev.org/#/c/713803/ Regards, Raghavendra Tilay.
Thanks, Sneha and Raghavendra!
Hi, This is regarding https://review.opendev.org/#/c/713803/ It's under review by community reviewers. We have already submitted a few patch sets. Currently we are working on the review comments received on patch set 8. Regards, Raghavendra Tilay.
Hi, Got approval from community reviewers. Fix for this bug has been merged into master (Ussuri release). https://review.opendev.org/#/c/713803/ Regards, Raghavendra Tilay.
Retargeting this for OSP 16.1. @Sneha or Raghavendra, please submit an upstream patch to backport the fix to stable/train.
Okay Alan, Will follow process for backport. Regards, Raghavendra Tilay.
Hi, Below patch has been proposed for backport to stable/train: https://review.opendev.org/#/c/723520 Regards, Raghavendra Tilay.
Hi, Patch 723520 has been abandoned. We are currently working on the below patch for backport to stable/train: https://review.opendev.org/#/c/724271/ Regards, Raghavendra Tilay.
Hi, Apologies for the delay. I was working on RHOSP16 certification of the HPE 3PAR driver (and other items), so this was put on hold. I have been actively working on the backport patch for the last week: https://review.opendev.org/#/c/724271 Zuul tests and HPE CI passed, but I received some review comments today and will update the code accordingly. Regards, Raghavendra Tilay.
Hi, Backport patch for Train:
[1] https://review.opendev.org/#/c/724271/
All review comments have been addressed. However, Gorka discovered a regression bug and submitted a patch for it to master:
[2] https://review.opendev.org/#/c/739033/
Review of this patch is in progress. Once patch [2] is approved and backported to Ussuri and Train, patch [1] can be approved. Regards, Raghavendra Tilay.
Hi, The backport patch has been approved & merged into Train:
[1] https://review.opendev.org/#/c/724271/
The patches for the regression bug have also been approved & merged:
[2] https://review.opendev.org/#/c/739033/ [master/victoria]
https://review.opendev.org/#/c/741428/ [ussuri]
https://review.opendev.org/#/c/741879/ [train]
Regards, Raghavendra Tilay.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: openstack-cinder security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:4283