Bug 1542607

Summary:        When evacuating a hypervisor, we get a VolumeDeviceNotFound
Product:        Red Hat OpenStack
Component:      openstack-cinder
Version:        10.0 (Newton)
Status:         CLOSED INSUFFICIENT_DATA
Severity:       high
Priority:       medium
Reporter:       David Vallee Delisle <dvd>
Assignee:       Gorka Eguileor <geguileo>
QA Contact:     Avi Avraham <aavraham>
Docs Contact:   RHOS Documentation Team <rhos-docs>
CC:             aavraham, cpaquin, cschwede, cswanson, dciabrin, dhill, dvd, ebeaudoi, geguileo, jthomas, lkuchlan, lmarsh, marjones, mbayer, mburns, pgrist, scohen, srevivo, tshefi
Keywords:       Triaged, ZStream
Flags:          tshefi: automate_bug-
Hardware:       Unspecified
OS:             Unspecified
Clone Of:       1504661
Last Closed:    2019-07-17 14:26:56 UTC
Type:           Bug
Bug Depends On: 1504661, 1504670, 1504671
Bug Blocks:     1464146

Description David Vallee Delisle 2018-02-06 16:11:06 UTC
+++ This bug was initially created as a clone of Bug #1504661 +++

When we evacuate a hypervisor, all the VMs are moved off it at the same time, which means there are a lot of Cinder operations happening concurrently.

Sometimes one of the VMs refuses to boot on the new hypervisor, and when this happens there is a VolumeDeviceNotFound error in the logs on that new hypervisor.

It is preceded by these messages:
~~~
2018-01-29 17:19:46.200 7420 WARNING os_brick.initiator.connectors.iscsi [req-7efe9dc5-ef34-43a7-82fd-b6cc7db52ba9 39f1dca3a061450aa7837e28c7f325f9 2f1fd480e80444d6ae4c2cccfae6b383 - - -] LUN 106 on iSCSI portal 192.168.4.97:3260 not found on sysfs after logging in.
2018-01-29 17:19:46.204 7420 WARNING os_brick.initiator.connectors.iscsi [req-7efe9dc5-ef34-43a7-82fd-b6cc7db52ba9 39f1dca3a061450aa7837e28c7f325f9 2f1fd480e80444d6ae4c2cccfae6b383 - - -] LUN 106 on iSCSI portal 192.168.4.98:3260 not found on sysfs after logging in.
2018-01-29 17:19:46.207 7420 WARNING os_brick.initiator.connectors.iscsi [req-7efe9dc5-ef34-43a7-82fd-b6cc7db52ba9 39f1dca3a061450aa7837e28c7f325f9 2f1fd480e80444d6ae4c2cccfae6b383 - - -] LUN 106 on iSCSI portal 192.168.6.98:3260 not found on sysfs after logging in.
2018-01-29 17:19:46.209 7420 WARNING os_brick.initiator.connectors.iscsi [req-7efe9dc5-ef34-43a7-82fd-b6cc7db52ba9 39f1dca3a061450aa7837e28c7f325f9 2f1fd480e80444d6ae4c2cccfae6b383 - - -] LUN 106 on iSCSI portal 192.168.6.97:3260 not found on sysfs after logging in.
~~~
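
For context, here is a minimal sketch (not the actual os-brick code) of the condition that warning describes: the iSCSI session is logged in, but no block device for the LUN ever shows up under /dev/disk/by-path. The portal address, LUN number, and glob pattern below are illustrative assumptions based on the standard udev by-path naming for iSCSI devices.

~~~
#!/usr/bin/env python
# Hedged sketch: check whether a block device exists for a given LUN after
# iSCSI login. Portal and LUN values are taken from the warnings above and
# are assumptions for illustration only.
import glob
import os

PORTAL = "192.168.4.97:3260"
LUN = 106

# udev normally creates /dev/disk/by-path/ip-<portal>-iscsi-<target_iqn>-lun-<n>
# once the kernel exposes the SCSI disk behind the LUN.
pattern = "/dev/disk/by-path/ip-%s-iscsi-*-lun-%d" % (PORTAL, LUN)
matches = glob.glob(pattern)

if matches:
    for path in matches:
        print("found %s -> %s" % (path, os.path.realpath(path)))
else:
    # This is the situation the warnings describe: logged in, but no device.
    print("no block device for LUN %d on portal %s" % (LUN, PORTAL))
~~~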

We were able to capture a tcpdump when this happens; we can see the connection to the iSCSI portal, but no device is attached.


According to the SAN (VNX) logs, we have these messages:
~~~
2018-01-29 10:45:01 [0x4600 APM00172522735 N/A SPA N/A] [Application] 'storagegroup' called by ' Navi User vnxadmin' (10.10.10.10) with result: Success (Navisphere CLI command: ' storagegroup -addhlu -hlu 106 -alu 4 -gname compute-21.domain.com -o ')

2018-01-29 11:14:10 [0x4600 APM00172522735 N/A SPA N/A] [Application] 'storagegroup' called by ' Navi User vnxadmin' (10.10.10.10) with result: Success (Navisphere CLI command: ' storagegroup -removehlu -hlu 106 -gname compute-21.domain.com -o ')

2018-01-29 11:20:47 [0x4600 APM00172522735 N/A SPA N/A] [Application] 'storagegroup' called by ' Navi User vnxadmin' (10.10.10.10) with result: Failed (Navisphere CLI command: ' storagegroup -removehlu -hlu 106 -gname compute-02.domain.com -o '. Error: storagegroup command failed Error returned from A

2018-01-29 11:38:09 [0x4600 APM00172522735 N/A SPA N/A] [Application] 'storagegroup' called by ' Navi User vnxadmin' (10.10.10.10) with result: Failed (Navisphere CLI command: ' storagegroup -removehlu -hlu 106 -gname compute-02.domain.com -o '. Error: storagegroup command failed Error returned from A
~~~

And the last message loops for a while.

When we enable DEBUG logging on Cinder, the problem vanishes.

Comment 1 Gorka Eguileor 2018-02-08 14:20:11 UTC
This seems to be a problem in the storage array or in the vendor driver. It could be a good idea to set only the driver to debug log level, instead of the whole Cinder service, to see if we can reproduce the problem and get some more information.

I think this can be done by adding ",cinder.volume.drivers.emc=DEBUG" to the "default_log_levels", which would end up looking something like this:

default_log_levels = amqp=WARN,amqplib=WARN,boto=WARN,qpid=WARN,sqlalchemy=WARN,suds=INFO,oslo.messaging=INFO,oslo_messaging=INFO,iso8601=WARN,requests.packages.urllib3.connectionpool=WARN,urllib3.connectionpool=WARN,websocket=WARN,requests.packages.urllib3.util.retry=WARN,urllib3.util.retry=WARN,keystonemiddleware=WARN,routes.middleware=WARN,stevedore=WARN,taskflow=WARN,keystoneauth=WARN,oslo.cache=INFO,dogpile.core.dogpile=INFO,cinder.volume.drivers.emc=DEBUG
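
For reference, a minimal sketch of where that option would live, assuming the stock configuration layout (the file path and section are assumptions; the value itself is the line above):

~~~
# /etc/cinder/cinder.conf -- illustrative placement only; keep whatever
# default_log_levels value the deployment already has and append the
# cinder.volume.drivers.emc=DEBUG entry to it.
[DEFAULT]
default_log_levels = amqp=WARN,amqplib=WARN,boto=WARN,qpid=WARN,sqlalchemy=WARN,suds=INFO,oslo.messaging=INFO,oslo_messaging=INFO,iso8601=WARN,requests.packages.urllib3.connectionpool=WARN,urllib3.connectionpool=WARN,websocket=WARN,requests.packages.urllib3.util.retry=WARN,urllib3.util.retry=WARN,keystonemiddleware=WARN,routes.middleware=WARN,stevedore=WARN,taskflow=WARN,keystoneauth=WARN,oslo.cache=INFO,dogpile.core.dogpile=INFO,cinder.volume.drivers.emc=DEBUG
~~~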


In the TCP dump, were there any scan requests from the initiator after the login?

Comment 2 David Vallee Delisle 2018-02-08 17:25:02 UTC
Hello Gorka,

Thanks for looking into it. When analyzing the captures, I see an INQUIRY for the "Device Identification Page", which is answered with:
"001. .... = Qualifier: Device type is supported by server but is not connected to logical unit (0x1)"

When I compare with a capture where it works, this is returned:
"000. .... = Qualifier: Device type is connected to logical unit (0x0)"

It looks like the SAN is presenting a target, but the target is not backed by block devices.
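
The qualifier values quoted above come from byte 0 of the SCSI INQUIRY data, which encodes the peripheral qualifier in bits 7-5 and the peripheral device type in bits 4-0. A minimal sketch of that decoding (the sample bytes are made up for illustration):

~~~
# Hedged sketch: decode the peripheral qualifier / device type from byte 0
# of SCSI INQUIRY data, per SPC. The sample bytes are illustrative only.
def parse_inquiry_byte0(byte0):
    qualifier = (byte0 >> 5) & 0x07   # 0x0 = connected to a LU, 0x1 = supported but not connected
    device_type = byte0 & 0x1F        # 0x00 = direct-access block device
    return qualifier, device_type

# 0x20 -> qualifier 0x1: the failing case quoted above.
# 0x00 -> qualifier 0x0: the working case quoted above.
for sample in (0x20, 0x00):
    q, t = parse_inquiry_byte0(sample)
    print("byte0=0x%02x qualifier=0x%x device_type=0x%02x" % (sample, q, t))
~~~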

We asked the customer for the SP collect from the SAN, and we don't see the addhlu happening (though maybe I missed it).

I'll ask the customer to activate debug on the driver.

Thank you very much,

DVD

Comment 7 Tzach Shefi 2019-08-26 14:27:52 UTC
Closing the loop: there is nothing to automate or test here.