+++ This bug was initially created as a clone of Bug #1504661 +++

When we run an evacuation off a hypervisor, all the VMs move out at the same time, which means there are a lot of concurrent Cinder operations. Sometimes one of the VMs fails to boot on the new hypervisor, and when this happens there is a VolumeDeviceNotFound in the logs on that new hypervisor. It is preceded by these messages:

~~~
2018-01-29 17:19:46.200 7420 WARNING os_brick.initiator.connectors.iscsi [req-7efe9dc5-ef34-43a7-82fd-b6cc7db52ba9 39f1dca3a061450aa7837e28c7f325f9 2f1fd480e80444d6ae4c2cccfae6b383 - - -] LUN 106 on iSCSI portal 192.168.4.97:3260 not found on sysfs after logging in.
2018-01-29 17:19:46.204 7420 WARNING os_brick.initiator.connectors.iscsi [req-7efe9dc5-ef34-43a7-82fd-b6cc7db52ba9 39f1dca3a061450aa7837e28c7f325f9 2f1fd480e80444d6ae4c2cccfae6b383 - - -] LUN 106 on iSCSI portal 192.168.4.98:3260 not found on sysfs after logging in.
2018-01-29 17:19:46.207 7420 WARNING os_brick.initiator.connectors.iscsi [req-7efe9dc5-ef34-43a7-82fd-b6cc7db52ba9 39f1dca3a061450aa7837e28c7f325f9 2f1fd480e80444d6ae4c2cccfae6b383 - - -] LUN 106 on iSCSI portal 192.168.6.98:3260 not found on sysfs after logging in.
2018-01-29 17:19:46.209 7420 WARNING os_brick.initiator.connectors.iscsi [req-7efe9dc5-ef34-43a7-82fd-b6cc7db52ba9 39f1dca3a061450aa7837e28c7f325f9 2f1fd480e80444d6ae4c2cccfae6b383 - - -] LUN 106 on iSCSI portal 192.168.6.97:3260 not found on sysfs after logging in.
~~~

We were able to pull a tcpdump when this happens; we see the connection to the iSCSI portal, but there is no device attached.
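That warning fires when the expected LUN never appears under sysfs after the iSCSI login. As a rough illustration only (the helper name and the simple glob below are hypothetical; os-brick's real logic also matches the session/target for the specific portal and triggers rescans), the check amounts to polling sysfs for a SCSI address ending in the LUN number:

~~~python
import glob
import time


def wait_for_lun(portal_ip, lun, timeout=10.0, interval=1.0):
    """Poll sysfs for a SCSI device at the given LUN number.

    Hypothetical sketch: real os-brick code also filters by the iSCSI
    session belonging to ``portal_ip`` and issues SCSI bus rescans.
    """
    # sysfs exposes SCSI devices as <host>:<channel>:<target>:<lun>
    pattern = "/sys/class/scsi_device/*:*:*:%d" % lun
    deadline = time.time() + timeout
    while time.time() < deadline:
        devices = glob.glob(pattern)
        if devices:
            return devices
        time.sleep(interval)
    # Empty result corresponds to the "LUN ... not found on sysfs
    # after logging in" warning above.
    return []
~~~

In the failing case the array accepted the login but presented no block device behind the target, so no matter how long this loop waits, the sysfs entry never appears.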
According to the SAN (VNX) logs, we have these messages:

~~~
2018-01-29 10:45:01 [0x4600 APM00172522735 N/A SPA N/A] [Application] 'storagegroup' called by ' Navi User vnxadmin' (10.10.10.10) with result: Success (Navisphere CLI command: ' storagegroup -addhlu -hlu 106 -alu 4 -gname compute-21.domain.com -o ')
2018-01-29 11:14:10 [0x4600 APM00172522735 N/A SPA N/A] [Application] 'storagegroup' called by ' Navi User vnxadmin' (10.10.10.10) with result: Success (Navisphere CLI command: ' storagegroup -removehlu -hlu 106 -gname compute-21.domain.com -o ')
2018-01-29 11:20:47 [0x4600 APM00172522735 N/A SPA N/A] [Application] 'storagegroup' called by ' Navi User vnxadmin' (10.10.10.10) with result: Failed (Navisphere CLI command: ' storagegroup -removehlu -hlu 106 -gname compute-02.domain.com -o '. Error: storagegroup command failed Error returned from A
2018-01-29 11:38:09 [0x4600 APM00172522735 N/A SPA N/A] [Application] 'storagegroup' called by ' Navi User vnxadmin' (10.10.10.10) with result: Failed (Navisphere CLI command: ' storagegroup -removehlu -hlu 106 -gname compute-02.domain.com -o '. Error: storagegroup command failed Error returned from A
~~~

The last message loops for a while. When we activate DEBUG on Cinder, the problem vanishes.
This seems to be a problem in the storage array or in the vendor driver. Rather than enabling DEBUG for the whole Cinder service, it could be a good idea to set only the driver to debug log levels, to see whether we can still reproduce the problem and get more information. I think this can be done by adding ",cinder.volume.drivers.emc=DEBUG" to "default_log_levels", which would end up looking something like this:

~~~
default_log_levels = amqp=WARN,amqplib=WARN,boto=WARN,qpid=WARN,sqlalchemy=WARN,suds=INFO,oslo.messaging=INFO,oslo_messaging=INFO,iso8601=WARN,requests.packages.urllib3.connectionpool=WARN,urllib3.connectionpool=WARN,websocket=WARN,requests.packages.urllib3.util.retry=WARN,urllib3.util.retry=WARN,keystonemiddleware=WARN,routes.middleware=WARN,stevedore=WARN,taskflow=WARN,keystoneauth=WARN,oslo.cache=INFO,dogpile.core.dogpile=INFO,cinder.volume.drivers.emc=DEBUG
~~~

In the TCP dump, were there any scan requests from the initiator after the login?
Hello Gorka,

Thanks for looking into this. When analyzing the captures, I see a "Device Identification Page" inquiry, to which the array replies with:

"001. .... = Qualifier: Device type is supported by server but is not connected to logical unit (0x1)"

When I compare with a capture where the attach works, this is returned instead:

"000. .... = Qualifier: Device type is connected to logical unit (0x0)"

It looks like the SAN is presenting a target, but the target is not backed by a block device. We asked the customer for the SP collect from the SAN, and we don't see the addhlu happening (maybe I missed it, though). I'll ask the customer to activate debug on the driver.

Thank you very much,
DVD
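For reference, the qualifier shown in the capture is the top three bits of byte 0 of the SCSI INQUIRY response (per SPC): 000b means a logical unit is connected at that address, 001b means the device type is supported but no LU is currently attached, which matches "target presented but not backed by a block device". A minimal decode, for illustration:

~~~python
# Peripheral qualifier values from the SPC standard (top 3 bits of
# INQUIRY byte 0); the remaining 5 bits are the peripheral device type.
PERIPHERAL_QUALIFIERS = {
    0b000: "connected to logical unit",
    0b001: "supported by server but not connected to logical unit",
    0b011: "not capable of supporting a device at this logical unit",
}


def peripheral_qualifier(inquiry_byte0):
    """Extract the 3-bit peripheral qualifier from INQUIRY byte 0."""
    return (inquiry_byte0 >> 5) & 0b111
~~~

In the healthy capture byte 0 would be 0x00 (qualifier 0, device type 0 = disk); in the failing capture it would start 0x2_ (qualifier 1), which is exactly the difference observed above.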
Closing the loop: nothing to automate or test here.