Bug 1419476 - [vdsm] multipath cannot access physical devices for iSCSI connectStorageServer
Summary: [vdsm] multipath cannot access physical devices for iSCSI connectStorageServer
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: vdsm
Classification: oVirt
Component: Core
Version: 4.19.4
Hardware: x86_64
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ovirt-4.1.1
Target Release: ---
Assignee: Nir Soffer
QA Contact: Raz Tamir
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-02-06 09:54 UTC by Elad
Modified: 2017-02-07 18:23 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-02-07 17:12:32 UTC
oVirt Team: Storage
Embargoed:
rule-engine: ovirt-4.1+
rule-engine: blocker+


Attachments (Terms of Use)
/var/log/ from hypervisor and engine.log (1.55 MB, application/x-gzip)
2017-02-06 09:54 UTC, Elad
no flags Details
host-deploy log (31.67 KB, application/x-gzip)
2017-02-06 15:06 UTC, Elad
no flags Details
all host-deploy logs (149.20 KB, application/x-gzip)
2017-02-06 17:55 UTC, Elad
no flags Details

Description Elad 2017-02-06 09:54:29 UTC
Created attachment 1247983 [details]
/var/log/ from hypervisor and engine.log

Description of problem:
Installed a clean RHV 4.1.0-4 and tried to add an iSCSI domain. The hosts are exposed to a few LUNs via 4 XtremIO iSCSI portals. Discovered and logged in to all targets from Webadmin, but no LUNs were displayed. In vdsm.log, multipath failed to access the exposed volumes with an InaccessiblePhysDev exception.


Version-Release number of selected component (if applicable):
vdsm-4.19.4-1.el7ev.x86_64
vdsm-python-4.19.4-1.el7ev.noarch
vdsm-cli-4.19.4-1.el7ev.noarch
vdsm-hook-ethtool-options-4.19.4-1.el7ev.noarch
vdsm-jsonrpc-4.19.4-1.el7ev.noarch
vdsm-hook-vmfex-dev-4.19.4-1.el7ev.noarch
vdsm-api-4.19.4-1.el7ev.noarch
vdsm-yajsonrpc-4.19.4-1.el7ev.noarch
vdsm-xmlrpc-4.19.4-1.el7ev.noarch
device-mapper-event-libs-1.02.135-1.el7_3.2.x86_64
device-mapper-libs-1.02.135-1.el7_3.2.x86_64
device-mapper-event-1.02.135-1.el7_3.2.x86_64
device-mapper-multipath-0.4.9-99.el7_3.1.x86_64
device-mapper-1.02.135-1.el7_3.2.x86_64
device-mapper-multipath-libs-0.4.9-99.el7_3.1.x86_64
device-mapper-persistent-data-0.6.3-1.el7.x86_64
iscsi-initiator-utils-6.2.0.873-35.el7.x86_64
iscsi-initiator-utils-iscsiuio-6.2.0.873-35.el7.x86_64
libiscsi-1.9.0-7.el7.x86_64


How reproducible:


Steps to Reproduce:
1. Add an iSCSI domain in a newly installed environment


Actual results:
No LUNs are presented, although discovery and login were performed with no errors in Webadmin.

No iSCSI discoverSendTargets or iSCSI login appears in vdsm.log for the following connectStorageServer, as there should be.


2017-02-06 01:38:04,917 INFO  (jsonrpc/3) [dispatcher] Run and protect: connectStorageServer(domType=3, spUUID='00000000-0000-0000-0000-000000000000', conList=[{'id': '00000000-0000-0000-0000-000000000000', 'con
nection': '10.35.146.129', 'iqn': 'iqn.2008-05.com.xtremio:xio00153500071-514f0c50023f6c00', 'user': u'', 'tpgt': '1', 'password': '********', 'port': '3260'}], options=None) (logUtils:49)
2017-02-06 01:38:04,977 INFO  (jsonrpc/3) [storage.ISCSI] iSCSI iface.net_ifacename not provided. Skipping. (iscsi:590)
2017-02-06 01:38:05,072 INFO  (jsonrpc/1) [dispatcher] Run and protect: repoStats(options=None) (logUtils:49)


Multipath exception:

2017-02-06 01:38:11,850 INFO  (check/loop) [storage.SANLock] Acquiring host id for domain db9dad41-fd69-4904-a3fc-d51fd315dbb2 (id: 1) (clusterlock:282)
2017-02-06 01:38:12,049 WARN  (jsonrpc/6) [storage.LVM] lvm pvs failed: 5 [] ['  WARNING: Not using lvmetad because config setting use_lvmetad=0.', '  WARNING: To avoid corruption, rescan devices to make changes visible (pvscan --cache).', '  Failed to find device for physical volume "/dev/mapper/3514f0c5a516008d4".'] (lvm:324)
2017-02-06 01:38:12,049 WARN  (jsonrpc/6) [storage.HSM] getPV failed for guid: 3514f0c5a516008d4 (hsm:1972)
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/hsm.py", line 1969, in _getDeviceList
    pv = lvm.getPV(guid)
  File "/usr/share/vdsm/storage/lvm.py", line 856, in getPV
    raise se.InaccessiblePhysDev((pvName,))
InaccessiblePhysDev: Multipath cannot access physical device(s): "devices=('3514f0c5a516008d4',)"


'multipath -ll' output is empty



Workaround:
After rebooting the hosts, the devices are exposed.
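
To confirm whether the devices really are exposed after the reboot, a minimal check (assuming standard multipath tooling on the host; these commands are illustrative and not taken from the attached logs, and the WWID is the one from the LVM error above):

    # the multipath map LVM complained about should now exist
    ls -l /dev/mapper/3514f0c5a516008d4
    # and its topology should show up instead of an empty 'multipath -ll'
    multipath -ll 3514f0c5a516008d4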

Expected results:
LUNs should be visible 

Additional info:
/var/log/ from hypervisor and engine.log

Comment 1 Elad 2017-02-06 10:06:43 UTC
How reproducible:
100%

Comment 2 Red Hat Bugzilla Rules Engine 2017-02-06 12:15:09 UTC
This bug report has Keywords: Regression or TestBlocker.
Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

Comment 3 Nir Soffer 2017-02-06 14:20:25 UTC
Looks like a bad deploy - vdsm was started without running:

    "vdsm-tool configure  --force"

Feb  6 00:59:07 storage-ge2-vdsm1 vdsmd_init_common.sh: Error:
Feb  6 00:59:07 storage-ge2-vdsm1 vdsmd_init_common.sh: One of the modules is not configured to work with VDSM.
Feb  6 00:59:07 storage-ge2-vdsm1 vdsmd_init_common.sh: To configure the module use the following:
Feb  6 00:59:07 storage-ge2-vdsm1 vdsmd_init_common.sh: 'vdsm-tool configure [--module module-name]'.
Feb  6 00:59:07 storage-ge2-vdsm1 vdsmd_init_common.sh: If all modules are not configured try to use:
Feb  6 00:59:07 storage-ge2-vdsm1 vdsmd_init_common.sh: 'vdsm-tool configure --force'
Feb  6 00:59:07 storage-ge2-vdsm1 vdsmd_init_common.sh: (The force flag will stop the module's service and start it
Feb  6 00:59:07 storage-ge2-vdsm1 vdsmd_init_common.sh: afterwards automatically to load the new configuration.)
Feb  6 00:59:07 storage-ge2-vdsm1 vdsmd_init_common.sh: multipath requires configuration
Feb  6 00:59:07 storage-ge2-vdsm1 vdsmd_init_common.sh: libvirt is not configured for vdsm yet
Feb  6 00:59:07 storage-ge2-vdsm1 vdsmd_init_common.sh: Modules certificates, sebool, multipath, passwd, sanlock, libvirt are not configured

Please attach host-deploy logs from this installation.
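
For reference, a minimal sketch of the configure-and-restart sequence the init script is asking for (assuming a standard RHEL 7 vdsm host; restarting vdsmd via systemctl is my assumption about the usual follow-up step, not something shown in these logs):

    # configure all vdsm modules (multipath, libvirt, sanlock, ...), restarting
    # the affected services so they load the new configuration
    vdsm-tool configure --force
    # start vdsm again with the configured modules
    systemctl restart vdsmd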

Comment 4 Elad 2017-02-06 15:06:24 UTC
Created attachment 1248050 [details]
host-deploy log

Comment 5 Nir Soffer 2017-02-06 16:38:20 UTC
Looking at the host-deploy log, vdsm was deployed successfully at
2017-02-06 01:34:41. The log lines from Feb  6 00:59:07 were
the result of an unconfigured vdsm previously installed on this system
and are not related to the issue with the multipath devices.

2017-02-06 01:34:41 DEBUG otopi.plugins.ovirt_host_deploy.vdsm.packages plugin.execute:921 execute-output: ('/bin/vdsm-tool', 'configure', '--force') stdout:

Checking configuration status...

multipath requires configuration
libvirt is not configured for vdsm yet
FAILED: conflicting vdsm and libvirt-qemu tls configuration.
vdsm.conf with ssl=True requires the following changes:
libvirtd.conf: listen_tcp=0, auth_tcp="sasl", listen_tls=1
qemu.conf: spice_tls=1.

Running configure...
Reconfiguration of sebool is done.
Reconfiguration of multipath is done.
Reconfiguration of passwd is done.
Reconfiguration of sanlock is done.
Reconfiguration of libvirt is done.

Done configuring modules to VDSM.

2017-02-06 01:34:41 DEBUG otopi.plugins.ovirt_host_deploy.vdsm.packages plugin.execute:926 execute-output: ('/bin/vdsm-tool', 'configure', '--force') stderr:

Need to check the rest of the logs after this time.

Comment 6 Elad 2017-02-06 17:52:33 UTC
Tested with the previous build [1]; the issue doesn't reproduce.

[1] 
rhevm-4.1.0.3-0.1.el7.noarch
vdsm-4.19.4-1.el7ev.x86_64
device-mapper-multipath-0.4.9-99.el7_3.1.x86_64
iscsi-initiator-utils-6.2.0.873-35.el7.x86_64

Comment 7 Elad 2017-02-06 17:55:42 UTC
Created attachment 1248106 [details]
all host-deploy logs

Comment 8 Nir Soffer 2017-02-06 20:27:37 UTC
(In reply to Elad from comment #0)
> How reproducible:

Do you know how to reproduce this? I would like to inspect the host
in the state before the reboot.

Can you reproduce it again with the same build?

Comment 10 Nir Soffer 2017-02-07 12:47:38 UTC
Liron, we can see that getDeviceList returns an empty list - no devices are available
on the host:

2017-02-06 01:38:10,710 INFO  (jsonrpc/0) [dispatcher] Run and protect: getDeviceList, Return response: {'devList': []} (logUtils:52)
2017-02-06 01:38:10,710 INFO  (jsonrpc/0) [jsonrpc.JsonRpcServer] RPC call Host.getDeviceList succeeded in 0.22 seconds (__init__:515)

But after that, the engine calls getDeviceList again with a specific GUID:

2017-02-06 01:38:11,816 INFO  (jsonrpc/6) [dispatcher] Run and protect: getDeviceList(storageType=3, guids=['3514f0c5a516008d4'], checkStatus=False, options={}) (logUtils:49)

Regardless of the reason why devices are not available on this host, it seems
that the engine is ignoring the result of getDeviceList and trying to create a VG
with a non-existent device.

Can you explain this?

Comment 11 Nir Soffer 2017-02-07 13:25:44 UTC
Back to /var/log/messages...

Before vdsm was started, multipath was started as expected (required by vdsm):

Feb  6 01:34:42 storage-ge2-vdsm1 kernel: device-mapper: multipath: version 1.11.0 loaded

vdsm was started successfully:

Feb  6 01:34:45 storage-ge2-vdsm1 systemd: Started Virtual Desktop Server Manager network restoration.
Feb  6 01:34:45 storage-ge2-vdsm1 systemd: Starting Virtual Desktop Server Manager...
...
Feb  6 01:34:46 storage-ge2-vdsm1 vdsmd_init_common.sh: vdsm: Running check_is_configured
Feb  6 01:34:47 storage-ge2-vdsm1 vdsmd_init_common.sh: Current revision of multipath.conf detected, preserving
...
Feb  6 01:34:49 storage-ge2-vdsm1 systemd: Started Virtual Desktop Server Manager.

Then vdsm connected to the iSCSI server, and new sessions started:

Feb  6 01:38:05 storage-ge2-vdsm1 kernel: scsi host3: iSCSI Initiator over TCP/IP
...
Feb  6 01:38:05 storage-ge2-vdsm1 iscsid: Connection1:0 to [target: iqn.2008-05.com.xtremio:xio00153500071-514f0c50023f6c00, portal: 10.35.146.129,3260] through [iface: default] is operational now

In the vdsm log, we see that vdsm connected to the iSCSI target:

2017-02-06 01:38:04,917 INFO  (jsonrpc/3) [dispatcher] Run and protect: connectStorageServer(domType=3, spUUID='00000000-0000-0000-0000-000000000000', conList=[{'id': '00000000-0000-0000-000
0-000000000000', 'connection': '10.35.146.129', 'iqn': 'iqn.2008-05.com.xtremio:xio00153500071-514f0c50023f6c00', 'user': u'', 'tpgt': '1', 'password': '********', 'port': '3260'}], options=
None) (logUtils:49)
2017-02-06 01:38:04,977 INFO  (jsonrpc/3) [storage.ISCSI] iSCSI iface.net_ifacename not provided. Skipping. (iscsi:590)
2017-02-06 01:38:05,072 INFO  (jsonrpc/1) [dispatcher] Run and protect: repoStats(options=None) (logUtils:49)
2017-02-06 01:38:05,072 INFO  (jsonrpc/1) [dispatcher] Run and protect: repoStats, Return response: {'0840c75a-8292-4900-91de-51b6599afe86': {'code': 0, 'actual': True, 'version': 4, 'acquir
ed': True, 'delay': '0.000583175', 'lastCheck': '1.7', 'valid': True}, 'db9dad41-fd69-4904-a3fc-d51fd315dbb2': {'code': 0, 'actual': True, 'version': 4, 'acquired': False, 'delay': '0.001194
32', 'lastCheck': '3.2', 'valid': True}, 'ecca1745-5f22-4841-8d1b-fb5b62f0c6ee': {'code': 0, 'actual': True, 'version': 4, 'acquired': True, 'delay': '0.000578069', 'lastCheck': '8.2', 'vali
d': True}} (logUtils:52)
2017-02-06 01:38:05,091 INFO  (jsonrpc/1) [jsonrpc.JsonRpcServer] RPC call Host.getStats succeeded in 0.02 seconds (__init__:515)
2017-02-06 01:38:05,300 INFO  (jsonrpc/3) [dispatcher] Run and protect: connectStorageServer, Return response: {'statuslist': [{'status': 0, 'id': '00000000-0000-0000-0000-000000000000'}]} (
logUtils:52)

(We see a couple of connections.)

Finally vdsm runs getDeviceList - in this flow we do a SCSI rescan,
and we should find the devices exposed by the storage server.

2017-02-06 01:38:10,493 INFO  (jsonrpc/0) [dispatcher] Run and protect: getDeviceList(storageType=3, guids=(), checkStatus=False, options={}) (logUtils:49)
2017-02-06 01:38:10,710 INFO  (jsonrpc/0) [dispatcher] Run and protect: getDeviceList, Return response: {'devList': []} (logUtils:52)

There is no error, so the scan was successful.
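
For anyone hitting a similar state, a hedged sketch of the manual checks that roughly correspond to this flow (list the sessions, rescan them, and look at the resulting multipath maps); these are standard iscsiadm/multipath commands, not commands taken from the attached logs:

    # confirm the iSCSI sessions created by connectStorageServer are up
    iscsiadm -m session
    # rescan all active sessions for newly exposed LUNs
    iscsiadm -m session --rescan
    # the LUNs should now appear as multipath maps
    multipath -ll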

Since we don't see any errors in the vdsm log or in /var/log/messages,
my conclusion is that the devices were not accessible to the
host at this time, and this is an environment issue.

Until we have a host in this state that we can inspect, we cannot do
anything with this bug.

When you reproduce again, please change log levels to DEBUG so we have
more information.

Should be closed as insufficient data.

Comment 12 Elad 2017-02-07 13:39:29 UTC
I'm preparing an environment so you'll be able to examine it.

Comment 13 Elad 2017-02-07 14:13:26 UTC
Environment provided offline

Comment 14 Elad 2017-02-07 14:21:33 UTC
Resetting the need-info for Liron

Comment 15 Yaniv Kaul 2017-02-07 17:05:33 UTC
(In reply to Elad from comment #12)
> I'm preparing an environment so you'll be able to examine

Does it happen against any other storage?

Comment 16 Nir Soffer 2017-02-07 17:12:32 UTC
The issue was an incorrect iSCSI initiator name, because the iscsid service was not
restarted after replacing /etc/iscsi/initiatorname.iscsi.
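
A minimal sketch of how one could verify and fix this condition (assuming standard iscsi-initiator-utils on RHEL 7; the grep pattern is illustrative):

    # initiator name written to the config file during deployment
    cat /etc/iscsi/initiatorname.iscsi
    # initiator name the running sessions are actually using
    iscsiadm -m session -P 3 | grep -i initiatorname
    # restart iscsid so new sessions pick up the updated initiator name
    systemctl restart iscsid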

