Bug 1542607

Summary:        When evacuating a hypervisor, we get a VolumeDeviceNotFound
Product:        Red Hat OpenStack
Component:      openstack-cinder
Version:        10.0 (Newton)
Status:         CLOSED INSUFFICIENT_DATA
Severity:       high
Priority:       medium
Reporter:       David Vallee Delisle <dvd>
Assignee:       Gorka Eguileor <geguileo>
QA Contact:     Avi Avraham <aavraham>
Docs Contact:   RHOS Documentation Team <rhos-docs>
CC:             aavraham, cpaquin, cschwede, cswanson, dciabrin, dhill, dvd, ebeaudoi, geguileo, jthomas, lkuchlan, lmarsh, marjones, mbayer, mburns, pgrist, scohen, srevivo, tshefi
Keywords:       Triaged, ZStream
Flags:          tshefi: automate_bug-
Hardware:       Unspecified
OS:             Unspecified
Clone Of:       1504661
Last Closed:    2019-07-17 14:26:56 UTC
Type:           Bug
Bug Depends On: 1504661, 1504670, 1504671
Bug Blocks:     1464146

Description David Vallee Delisle 2018-02-06 16:11:06 UTC
+++ This bug was initially created as a clone of Bug #1504661 +++

When we evacuate a hypervisor, all the VMs are moved off it at the same time, which means there are a lot of Cinder operations happening concurrently.

Sometimes one of the VMs refuses to boot on the new hypervisor, and when this happens there is a VolumeDeviceNotFound error in the logs on that new hypervisor.

It is preceded by these messages:
~~~
2018-01-29 17:19:46.200 7420 WARNING os_brick.initiator.connectors.iscsi [req-7efe9dc5-ef34-43a7-82fd-b6cc7db52ba9 39f1dca3a061450aa7837e28c7f325f9 2f1fd480e80444d6ae4c2cccfae6b383 - - -] LUN 106 on iSCSI portal 192.168.4.97:3260 not found on sysfs after logging in.
2018-01-29 17:19:46.204 7420 WARNING os_brick.initiator.connectors.iscsi [req-7efe9dc5-ef34-43a7-82fd-b6cc7db52ba9 39f1dca3a061450aa7837e28c7f325f9 2f1fd480e80444d6ae4c2cccfae6b383 - - -] LUN 106 on iSCSI portal 192.168.4.98:3260 not found on sysfs after logging in.
2018-01-29 17:19:46.207 7420 WARNING os_brick.initiator.connectors.iscsi [req-7efe9dc5-ef34-43a7-82fd-b6cc7db52ba9 39f1dca3a061450aa7837e28c7f325f9 2f1fd480e80444d6ae4c2cccfae6b383 - - -] LUN 106 on iSCSI portal 192.168.6.98:3260 not found on sysfs after logging in.
2018-01-29 17:19:46.209 7420 WARNING os_brick.initiator.connectors.iscsi [req-7efe9dc5-ef34-43a7-82fd-b6cc7db52ba9 39f1dca3a061450aa7837e28c7f325f9 2f1fd480e80444d6ae4c2cccfae6b383 - - -] LUN 106 on iSCSI portal 192.168.6.97:3260 not found on sysfs after logging in.
~~~
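
For context, here is a minimal sketch (not the actual os-brick code) of the condition that warning describes: the iSCSI session is logged in, but no block device for the LUN ever shows up under /dev/disk/by-path. The portal address, LUN number, and glob pattern below are illustrative assumptions based on the standard udev by-path naming for iSCSI devices.

~~~
#!/usr/bin/env python
# Hedged sketch: check whether a block device exists for a given LUN after
# iSCSI login. Portal and LUN values are taken from the warnings above and
# are assumptions for illustration only.
import glob
import os

PORTAL = "192.168.4.97:3260"
LUN = 106

# udev normally creates /dev/disk/by-path/ip-<portal>-iscsi-<target_iqn>-lun-<n>
# once the kernel exposes the SCSI disk behind the LUN.
pattern = "/dev/disk/by-path/ip-%s-iscsi-*-lun-%d" % (PORTAL, LUN)
matches = glob.glob(pattern)

if matches:
    for path in matches:
        print("found %s -> %s" % (path, os.path.realpath(path)))
else:
    # This is the situation the warnings describe: logged in, but no device.
    print("no block device for LUN %d on portal %s" % (LUN, PORTAL))
~~~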

We were able to capture a tcpdump when this happens; we can see the connection to the iSCSI portal, but no device is attached.


According to the SAN (VNX) logs, we have these messages:
~~~
2018-01-29 10:45:01 [0x4600 APM00172522735 N/A SPA N/A] [Application] 'storagegroup' called by ' Navi User vnxadmin' (10.10.10.10) with result: Success (Navisphere CLI command: ' storagegroup -addhlu -hlu 106 -alu 4 -gname compute-21.domain.com -o ')

2018-01-29 11:14:10 [0x4600 APM00172522735 N/A SPA N/A] [Application] 'storagegroup' called by ' Navi User vnxadmin' (10.10.10.10) with result: Success (Navisphere CLI command: ' storagegroup -removehlu -hlu 106 -gname compute-21.domain.com -o ')

2018-01-29 11:20:47 [0x4600 APM00172522735 N/A SPA N/A] [Application] 'storagegroup' called by ' Navi User vnxadmin' (10.10.10.10) with result: Failed (Navisphere CLI command: ' storagegroup -removehlu -hlu 106 -gname compute-02.domain.com -o '. Error: storagegroup command failed Error returned from A

2018-01-29 11:38:09 [0x4600 APM00172522735 N/A SPA N/A] [Application] 'storagegroup' called by ' Navi User vnxadmin' (10.10.10.10) with result: Failed (Navisphere CLI command: ' storagegroup -removehlu -hlu 106 -gname compute-02.domain.com -o '. Error: storagegroup command failed Error returned from A
~~~

And the last message loops for a while.

When we enable DEBUG logging on Cinder, the problem vanishes.

Comment 1 Gorka Eguileor 2018-02-08 14:20:11 UTC
This seems to be a problem in the storage array or in the vendor driver. It could be a good idea to set only the driver to debug log level, instead of the whole Cinder service, to see if we can reproduce the problem and get some more information.

I think this can be done by adding ",cinder.volume.drivers.emc=DEBUG" to the "default_log_levels", which would end up looking something like this:

default_log_levels = amqp=WARN,amqplib=WARN,boto=WARN,qpid=WARN,sqlalchemy=WARN,suds=INFO,oslo.messaging=INFO,oslo_messaging=INFO,iso8601=WARN,requests.packages.urllib3.connectionpool=WARN,urllib3.connectionpool=WARN,websocket=WARN,requests.packages.urllib3.util.retry=WARN,urllib3.util.retry=WARN,keystonemiddleware=WARN,routes.middleware=WARN,stevedore=WARN,taskflow=WARN,keystoneauth=WARN,oslo.cache=INFO,dogpile.core.dogpile=INFO,cinder.volume.drivers.emc=DEBUG
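
For reference, a minimal sketch of where that option would live, assuming the stock configuration layout (the file path and section are assumptions; the value itself is the line above):

~~~
# /etc/cinder/cinder.conf -- illustrative placement only; keep whatever
# default_log_levels value the deployment already has and append the
# cinder.volume.drivers.emc=DEBUG entry to it.
[DEFAULT]
default_log_levels = amqp=WARN,amqplib=WARN,boto=WARN,qpid=WARN,sqlalchemy=WARN,suds=INFO,oslo.messaging=INFO,oslo_messaging=INFO,iso8601=WARN,requests.packages.urllib3.connectionpool=WARN,urllib3.connectionpool=WARN,websocket=WARN,requests.packages.urllib3.util.retry=WARN,urllib3.util.retry=WARN,keystonemiddleware=WARN,routes.middleware=WARN,stevedore=WARN,taskflow=WARN,keystoneauth=WARN,oslo.cache=INFO,dogpile.core.dogpile=INFO,cinder.volume.drivers.emc=DEBUG
~~~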


In the TCP dump, were there any scan requests from the initiator after the login?

Comment 2 David Vallee Delisle 2018-02-08 17:25:02 UTC
Hello Gorka,

Thanks for looking into it. When analyzing the captures, I see an INQUIRY for the "Device Identification Page", which is answered with:
"001. .... = Qualifier: Device type is supported by server but is not connected to logical unit (0x1)"

When I compare with a capture where it works, this is returned:
"000. .... = Qualifier: Device type is connected to logical unit (0x0)"

It looks like the SAN is presenting a target, but the target is not backed by block devices.
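
The qualifier values quoted above come from byte 0 of the SCSI INQUIRY data, which encodes the peripheral qualifier in bits 7-5 and the peripheral device type in bits 4-0. A minimal sketch of that decoding (the sample bytes are made up for illustration):

~~~
# Hedged sketch: decode the peripheral qualifier / device type from byte 0
# of SCSI INQUIRY data, per SPC. The sample bytes are illustrative only.
def parse_inquiry_byte0(byte0):
    qualifier = (byte0 >> 5) & 0x07   # 0x0 = connected to a LU, 0x1 = supported but not connected
    device_type = byte0 & 0x1F        # 0x00 = direct-access block device
    return qualifier, device_type

# 0x20 -> qualifier 0x1: the failing case quoted above.
# 0x00 -> qualifier 0x0: the working case quoted above.
for sample in (0x20, 0x00):
    q, t = parse_inquiry_byte0(sample)
    print("byte0=0x%02x qualifier=0x%x device_type=0x%02x" % (sample, q, t))
~~~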

We asked the customer for the SP collect from the SAN, and we don't see the addhlu happening (though maybe I missed it).

I'll ask the customer to activate debug on the driver.

Thank you very much,

DVD

Comment 7 Tzach Shefi 2019-08-26 14:27:52 UTC
Closing the loop: there is nothing to automate or test here.