+++ This bug was initially created as a clone of Bug #1504661 +++

When we run an evacuation off a hypervisor, all the VMs move out at the same time, which means there are a lot of concurrent Cinder operations. Sometimes one of the VMs fails to boot on the new hypervisor, and when this happens there is a VolumeDeviceNotFound in the logs on that new hypervisor. It is preceded by these messages:

~~~
2018-01-29 17:19:46.200 7420 WARNING os_brick.initiator.connectors.iscsi [req-7efe9dc5-ef34-43a7-82fd-b6cc7db52ba9 39f1dca3a061450aa7837e28c7f325f9 2f1fd480e80444d6ae4c2cccfae6b383 - - -] LUN 106 on iSCSI portal 192.168.4.97:3260 not found on sysfs after logging in.
2018-01-29 17:19:46.204 7420 WARNING os_brick.initiator.connectors.iscsi [req-7efe9dc5-ef34-43a7-82fd-b6cc7db52ba9 39f1dca3a061450aa7837e28c7f325f9 2f1fd480e80444d6ae4c2cccfae6b383 - - -] LUN 106 on iSCSI portal 192.168.4.98:3260 not found on sysfs after logging in.
2018-01-29 17:19:46.207 7420 WARNING os_brick.initiator.connectors.iscsi [req-7efe9dc5-ef34-43a7-82fd-b6cc7db52ba9 39f1dca3a061450aa7837e28c7f325f9 2f1fd480e80444d6ae4c2cccfae6b383 - - -] LUN 106 on iSCSI portal 192.168.6.98:3260 not found on sysfs after logging in.
2018-01-29 17:19:46.209 7420 WARNING os_brick.initiator.connectors.iscsi [req-7efe9dc5-ef34-43a7-82fd-b6cc7db52ba9 39f1dca3a061450aa7837e28c7f325f9 2f1fd480e80444d6ae4c2cccfae6b383 - - -] LUN 106 on iSCSI portal 192.168.6.97:3260 not found on sysfs after logging in.
~~~

We were able to pull a tcpdump when this happens; we see the connection to the iSCSI portal, but there is no device attached.
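That warning fires when the expected LUN never appears under sysfs after the iSCSI login. As a rough illustration only (the helper name and the simple glob below are hypothetical; os-brick's real logic also matches the session/target for the specific portal and triggers rescans), the check amounts to polling sysfs for a SCSI address ending in the LUN number:

~~~python
import glob
import time


def wait_for_lun(portal_ip, lun, timeout=10.0, interval=1.0):
    """Poll sysfs for a SCSI device at the given LUN number.

    Hypothetical sketch: real os-brick code also filters by the iSCSI
    session belonging to ``portal_ip`` and issues SCSI bus rescans.
    """
    # sysfs exposes SCSI devices as <host>:<channel>:<target>:<lun>
    pattern = "/sys/class/scsi_device/*:*:*:%d" % lun
    deadline = time.time() + timeout
    while time.time() < deadline:
        devices = glob.glob(pattern)
        if devices:
            return devices
        time.sleep(interval)
    # Empty result corresponds to the "LUN ... not found on sysfs
    # after logging in" warning above.
    return []
~~~

In the failing case the array accepted the login but presented no block device behind the target, so no matter how long this loop waits, the sysfs entry never appears.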
According to the SAN (VNX) logs, we have these messages:

~~~
2018-01-29 10:45:01 [0x4600 APM00172522735 N/A SPA N/A] [Application] 'storagegroup' called by ' Navi User vnxadmin' (10.10.10.10) with result: Success (Navisphere CLI command: ' storagegroup -addhlu -hlu 106 -alu 4 -gname compute-21.domain.com -o ')
2018-01-29 11:14:10 [0x4600 APM00172522735 N/A SPA N/A] [Application] 'storagegroup' called by ' Navi User vnxadmin' (10.10.10.10) with result: Success (Navisphere CLI command: ' storagegroup -removehlu -hlu 106 -gname compute-21.domain.com -o ')
2018-01-29 11:20:47 [0x4600 APM00172522735 N/A SPA N/A] [Application] 'storagegroup' called by ' Navi User vnxadmin' (10.10.10.10) with result: Failed (Navisphere CLI command: ' storagegroup -removehlu -hlu 106 -gname compute-02.domain.com -o '. Error: storagegroup command failed Error returned from A
2018-01-29 11:38:09 [0x4600 APM00172522735 N/A SPA N/A] [Application] 'storagegroup' called by ' Navi User vnxadmin' (10.10.10.10) with result: Failed (Navisphere CLI command: ' storagegroup -removehlu -hlu 106 -gname compute-02.domain.com -o '. Error: storagegroup command failed Error returned from A
~~~

The last message loops for a while. When we activate DEBUG on Cinder, the problem vanishes.
This seems to be a problem in the storage array or in the vendor driver. Rather than enabling DEBUG for the whole Cinder service, it could be a good idea to set only the driver to debug log levels, to see whether we can still reproduce the problem and get more information. I think this can be done by adding ",cinder.volume.drivers.emc=DEBUG" to "default_log_levels", which would end up looking something like this:

~~~
default_log_levels = amqp=WARN,amqplib=WARN,boto=WARN,qpid=WARN,sqlalchemy=WARN,suds=INFO,oslo.messaging=INFO,oslo_messaging=INFO,iso8601=WARN,requests.packages.urllib3.connectionpool=WARN,urllib3.connectionpool=WARN,websocket=WARN,requests.packages.urllib3.util.retry=WARN,urllib3.util.retry=WARN,keystonemiddleware=WARN,routes.middleware=WARN,stevedore=WARN,taskflow=WARN,keystoneauth=WARN,oslo.cache=INFO,dogpile.core.dogpile=INFO,cinder.volume.drivers.emc=DEBUG
~~~

In the TCP dump, were there any scan requests from the initiator after the login?
Hello Gorka,

Thanks for looking into this. When analyzing the captures, I see a "Device Identification Page" inquiry, to which the array replies with:

"001. .... = Qualifier: Device type is supported by server but is not connected to logical unit (0x1)"

When I compare with a capture where the attach works, this is returned instead:

"000. .... = Qualifier: Device type is connected to logical unit (0x0)"

It looks like the SAN is presenting a target, but the target is not backed by a block device. We asked the customer for the SP collect from the SAN, and we don't see the addhlu happening (maybe I missed it, though). I'll ask the customer to activate debug on the driver.

Thank you very much,
DVD
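For reference, the qualifier shown in the capture is the top three bits of byte 0 of the SCSI INQUIRY response (per SPC): 000b means a logical unit is connected at that address, 001b means the device type is supported but no LU is currently attached, which matches "target presented but not backed by a block device". A minimal decode, for illustration:

~~~python
# Peripheral qualifier values from the SPC standard (top 3 bits of
# INQUIRY byte 0); the remaining 5 bits are the peripheral device type.
PERIPHERAL_QUALIFIERS = {
    0b000: "connected to logical unit",
    0b001: "supported by server but not connected to logical unit",
    0b011: "not capable of supporting a device at this logical unit",
}


def peripheral_qualifier(inquiry_byte0):
    """Extract the 3-bit peripheral qualifier from INQUIRY byte 0."""
    return (inquiry_byte0 >> 5) & 0b111
~~~

In the healthy capture byte 0 would be 0x00 (qualifier 0, device type 0 = disk); in the failing capture it would start 0x2_ (qualifier 1), which is exactly the difference observed above.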
Closing the loop: nothing to automate or test here.