Bug 967300 - [storage] Hosts become non-operational when adding the first Storage Domain – Failed ConnectStoragePool - "iscsiadm -R" fails to find new LUNs
Summary: [storage] Hosts become non-operational when adding first Storage Domain – Fa...
Keywords:
Status: CLOSED DUPLICATE of bug 964768
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.2.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.3.0
Assignee: Sergey Gotliv
QA Contact: Aharon Canan
URL:
Whiteboard: storage
Depends On:
Blocks: 1019461
 
Reported: 2013-05-26 12:17 UTC by vvyazmin@redhat.com
Modified: 2016-02-10 17:55 UTC (History)
13 users (show)

Fixed In Version: is9
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-11-10 13:39:23 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:
amureini: Triaged+


Attachments (Terms of Use)
## Logs rhevm, vdsm, libvirt (1.85 MB, application/x-gzip)
2013-05-26 12:17 UTC, vvyazmin@redhat.com
no flags Details
## Logs rhevm, vdsm (14.38 MB, application/x-gzip)
2013-06-03 11:16 UTC, vvyazmin@redhat.com
no flags Details

Description vvyazmin@redhat.com 2013-05-26 12:17:13 UTC
Created attachment 753306 [details]
## Logs rhevm, vdsm, libvirt

Description of problem:
In a scale environment, some hosts become non-operational when adding the first Storage Domain – ConnectStoragePool fails

Version-Release number of selected component (if applicable):
RHEVM 3.2 - SF17.1 environment:

RHEVM: rhevm-3.2.0-11.28.el6ev.noarch
VDSM: vdsm-4.10.2-21.0.el6ev.x86_64
LIBVIRT: libvirt-0.10.2-18.el6_4.5.x86_64
QEMU & KVM: qemu-kvm-rhev-0.12.1.2-2.355.el6_4.3.x86_64
SANLOCK: sanlock-2.6-2.el6.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Create an iSCSI DC with 50 hosts (in my case, 50 fake hosts)
2. Add the first Storage Domain
  
Actual results:
Some hosts become non-operational

Expected results:
The first Storage Domain is added successfully (in a scale environment) without problems

Impact on user:

Workaround:
Reboot hosts

Additional info:

/var/log/ovirt-engine/engine.log

2013-05-26 13:18:55,655 ERROR [org.ovirt.engine.core.vdsbroker.VDSCommandBase] (pool-4-thread-86) [2d63dceb] Command ConnectStoragePoolVDS execution failed. Exception: IRSNoMasterDomainException: IRSGenericException: IRSErrorException: IRSNoMasterDomainException: Cannot find master domain: 'spUUID=c372dcbd-8f8c-4f56-ab6b-68e9ef8ec464, msdUUID=0dfe042a-6268-412d-9d0e-502c3d688119

/var/log/vdsm/vdsm.log

Thread-1051::ERROR::2013-05-26 13:18:55,569::sdc::143::Storage.StorageDomainCache::(_findDomain) domain 0dfe042a-6268-412d-9d0e-502c3d688119 not found
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/sdc.py", line 141, in _findDomain
    dom = findMethod(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 170, in _findUnfetchedDomain
    raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist: ('0dfe042a-6268-412d-9d0e-502c3d688119',)
Thread-1051::DEBUG::2013-05-26 13:18:55,615::resourceManager::538::ResourceManager::(releaseResource) Trying to release resource 'Storage.c372dcbd-8f8c-4f56-ab6b-68e9ef8ec464'
Thread-1051::DEBUG::2013-05-26 13:18:55,616::resourceManager::553::ResourceManager::(releaseResource) Released resource 'Storage.c372dcbd-8f8c-4f56-ab6b-68e9ef8ec464' (0 active users)
Thread-1051::DEBUG::2013-05-26 13:18:55,616::resourceManager::558::ResourceManager::(releaseResource) Resource 'Storage.c372dcbd-8f8c-4f56-ab6b-68e9ef8ec464' is free, finding out if anyone is waiting for it.
Thread-1051::DEBUG::2013-05-26 13:18:55,617::resourceManager::565::ResourceManager::(releaseResource) No one is waiting for resource 'Storage.c372dcbd-8f8c-4f56-ab6b-68e9ef8ec464', Clearing records.
Thread-1051::ERROR::2013-05-26 13:18:55,622::task::850::TaskManager.Task::(_setError) Task=`0e5666f5-84fd-475e-87ae-821e164e3c03`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 857, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 41, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 940, in connectStoragePool
    masterVersion, options)
  File "/usr/share/vdsm/storage/hsm.py", line 987, in _connectStoragePool
    res = pool.connect(hostID, scsiKey, msdUUID, masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 644, in connect
    self.__rebuild(msdUUID=msdUUID, masterVersion=masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 1179, in __rebuild
    self.masterDomain = self.getMasterDomain(msdUUID=msdUUID, masterVersion=masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 1529, in getMasterDomain
    raise se.StoragePoolMasterNotFound(self.spUUID, msdUUID)
StoragePoolMasterNotFound: Cannot find master domain: 'spUUID=c372dcbd-8f8c-4f56-ab6b-68e9ef8ec464, msdUUID=0dfe042a-6268-412d-9d0e-502c3d688119'
Thread-1051::DEBUG::2013-05-26 13:18:55,672::task::869::TaskManager.Task::(_run) Task=`0e5666f5-84fd-475e-87ae-821e164e3c03`::Task._run: 0e5666f5-84fd-475e-87ae-821e164e3c03 ('c372dcbd-8f8c-4f56-ab6b-68e9ef8ec464', 12, 'c372dcbd-8f8c-4f56-ab6b-68e9ef8ec464', '0dfe042a-6268-412d-9d0e-502c3d688119', 1) {} failed - stopping task

Comment 1 Ayal Baron 2013-05-26 14:55:09 UTC
Vlad, are you sure that all your hosts are mapped to the storage properly? In the vdsm log it appears that the connect succeeded, but there is no VG on any of the LUNs.

Thread-964::DEBUG::2013-05-26 13:15:07,205::misc::83::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n /sbin/iscsiadm -m node -T iqn.1992-04.com.emc:cx.ckm00121000438.a6 -I default -p 10.35.160.104:3260 --op=new' (cwd None)
Thread-964::DEBUG::2013-05-26 13:15:07,275::misc::83::Storage.Misc.excCmd::(<lambda>) SUCCESS: <err> = ''; <rc> = 0
Thread-964::DEBUG::2013-05-26 13:15:07,276::misc::83::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n /sbin/iscsiadm -m node -T iqn.1992-04.com.emc:cx.ckm00121000438.a6 -I default -p 10.35.160.104:3260 -l' (cwd None)
Thread-964::DEBUG::2013-05-26 13:15:08,026::misc::83::Storage.Misc.excCmd::(<lambda>) SUCCESS: <err> = ''; <rc> = 0
Thread-964::DEBUG::2013-05-26 13:15:08,027::misc::83::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n /sbin/iscsiadm -m node -T iqn.1992-04.com.emc:cx.ckm00121000438.a6 -I default -p 10.35.160.104:3260 -n node.startup -v manual --op=update' (cwd None)
Thread-964::DEBUG::2013-05-26 13:15:08,233::misc::83::Storage.Misc.excCmd::(<lambda>) SUCCESS: <err> = ''; <rc> = 0
Thread-964::DEBUG::2013-05-26 13:15:08,234::hsm::2263::Storage.HSM::(__prefetchDomains) Found SD uuids: ()
Thread-964::DEBUG::2013-05-26 13:15:08,234::hsm::2312::Storage.HSM::(connectStorageServer) knownSDs: {}
Thread-964::INFO::2013-05-26 13:15:08,234::logUtils::42::dispatcher::(wrapper) Run and protect: connectStorageServer, Return response: {'statuslist': [{'status': 0, 'id': '26c2951b-0b8b-4136-9af0-a0818aed1aa3'}]}

Comment 2 vvyazmin@redhat.com 2013-05-28 08:30:36 UTC
Ayal, yes, you are absolutely right. Yesterday, after investigating with Haim, we found the root cause of this problem.

1. You are right, this is not a scale bug (I have already changed the bug summary); it is a matter of probability. I always(1) get this error when I run the test with more than 50 servers. I think you would get the same error if you ran this test 50 times with one server.

2. Root cause: if the host was already connected to the same storage server before, the iscsiadm command returns an answer that the host is already logged in, and no new LUNs are found. I will open two new bugs that describe this problem and its root cause with more relevant information and logs. For the moment, a similar root cause was already described in BZ964768.


(1) This problem won't appear if, on all servers, you validate that no connections to iSCSI targets exist before you connect the first SD (see the sketch below).
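
A minimal sketch of the pre-check described in (1), assuming it runs on each host before the first SD is connected. It is not part of vdsm or RHEV-M; it only shells out to iscsiadm, which exits non-zero when there are no active sessions:

    #!/usr/bin/env python
    # Hypothetical pre-check, not vdsm code: warn if this host still holds
    # iSCSI sessions before the first Storage Domain is connected.
    import subprocess

    def has_stale_iscsi_sessions():
        # "iscsiadm -m session" prints one line per active session and
        # exits non-zero when no sessions exist.
        proc = subprocess.Popen(['iscsiadm', '-m', 'session'],
                                stdout=subprocess.PIPE,
                                stderr=subprocess.PIPE)
        out, _ = proc.communicate()
        return proc.returncode == 0 and bool(out.strip())

    if __name__ == '__main__':
        if has_stale_iscsi_sessions():
            print('WARNING: host still holds iSCSI sessions; '
                  'clean them up before adding the first Storage Domain')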

Comment 3 vvyazmin@redhat.com 2013-06-03 11:14:21 UTC
More information on this bug (more logs attached):

Steps to Reproduce:
1. Create an iSCSI DC with 50 hosts (in my case, 50 fake hosts) connected to 9 Storage Domains, each consisting of 8 PVs
2. All hosts are in the “up” state, and all SDs are in the “Active” state
3. Move all 50 hosts to Maintenance (no thread problems were found)
4. All hosts are in the “Maintenance” state
5. Attach all hosts to a new iSCSI DC
6. Activate all hosts
7. All hosts are in the “up” state
8. Try to create the first SD
  
Actual results:
Creating the first SD fails.
This happens because, in step 3, some hosts (6 out of 50) still hold iSCSI sessions to the old targets.

Expected results:
1. The first SD is created successfully
2. The create-first-SD action is blocked when some hosts still hold iSCSI sessions to old targets
3. A warning message appears in the UI and in the logs

Impact on user:

Workaround:
Manually disconnect from the iSCSI targets (a sketch that automates this per target follows the commands):
    iscsiadm -m session -u
    iscsiadm -m node -o delete
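
A sketch that automates the two commands above per target, assuming the standard "tcp: [N] portal,tpgt iqn..." output format of iscsiadm -m session; it is an illustration only, not vdsm code:

    # Hypothetical cleanup helper: log out of each active iSCSI session and
    # delete its node record so the next connectStorageServer starts clean.
    import subprocess

    def list_sessions():
        # Session lines look like:
        #   tcp: [1] 10.35.160.104:3260,1 iqn.1992-04.com.emc:cx.ckm00121000438.a6
        proc = subprocess.Popen(['iscsiadm', '-m', 'session'],
                                stdout=subprocess.PIPE,
                                stderr=subprocess.PIPE)
        out, _ = proc.communicate()
        sessions = []
        for line in out.decode('utf-8', 'replace').splitlines():
            fields = line.split()
            if len(fields) >= 4 and fields[0].endswith(':'):
                portal = fields[2].split(',')[0]   # e.g. 10.35.160.104:3260
                target = fields[3]                 # e.g. iqn.1992-04.com.emc:...
                sessions.append((portal, target))
        return sessions

    def cleanup_stale_sessions():
        for portal, target in list_sessions():
            # Log out of the session, then drop the node record.
            subprocess.call(['iscsiadm', '-m', 'node', '-T', target,
                             '-p', portal, '-u'])
            subprocess.call(['iscsiadm', '-m', 'node', '-T', target,
                             '-p', portal, '-o', 'delete'])

    if __name__ == '__main__':
        cleanup_stale_sessions()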

Comment 4 vvyazmin@redhat.com 2013-06-03 11:16:00 UTC
Created attachment 756248 [details]
## Logs rhevm, vdsm

Comment 5 Maor 2013-07-04 11:47:23 UTC
From the log it seems that the filter rejects all devices (the lvs command below is invoked with filter = [ 'r%.*%' ], which rejects every device):

storageRefresh::DEBUG::2013-05-26 14:13:36,490::misc::83::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n /sbin/lvm lvs --config " devices { preferred_names = [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ \'r%.*%\' ] }  global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1 }  backup {  retain_min = 50  retain_days = 0 } " --noheadings --units b --nosuffix --separator | -o uuid,name,vg_name,attr,size,seg_start_pe,devices,tags' (cwd None)



Only after a restart does the filter approve a list of devices:

MainThread::INFO::2013-05-26 14:15:41,319::vdsm::88::vds::(run) I am the actual vdsm 4.10.2-18.0.el6ev dhcp160-212.scl.lab.tlv.redhat.com (2.6.32-358.2.1.el6.x86_64)
storageRefresh::DEBUG::2013-05-26 14:15:44,940::misc::83::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n /sbin/lvm pvs --config " devices { preferred_names = [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ \'a%36006016012823000a801fa6cfec5e211|36006016012823000aa01fa6cfec5e211|36006016012823000ac01fa6cfec5e211|36006016012823000ae01fa6cfec5e211|36006016012823000b001fa6cfec5e211|36006016012823000b201fa6cfec5e211|36006016012823000b401fa6cfec5e211|36006016012823000b601fa6cfec5e211|36006016012823000b801fa6cfec5e211|36006016012823000d8e0bf65fec5e211|36006016012823000dae0bf65fec5e211|36006016012823000dae72a79fec5e211|36006016012823000dce0bf65fec5e211|36006016012823000dce72a79fec5e211|36006016012823000dee0bf65fec5e211|36006016012823000dee72a79fec5e211|36006016012823000e07e0c73fec5e211|36006016012823000e0e0bf65fec5e211|36006016012823000e0e72a79fec5e211|36006016012823000e27e0c73fec5e211|36006016012823000e2e0bf65fec5e211|36006016012823000e2e72a79fec5e211|36006016012823000e47e0c73fec5e211|36006016012823000e4e0bf65fec5e211|36006016012823000e4e72a79fec5e211|36006016012823000e67e0c73fec5e211|36006016012823000e6e0bf65fec5e211|36006016012823000e6e72a79fec5e211|36006016012823000e87e0c73fec5e211|36006016012823000e8e0bf65fec5e211|36006016012823000ea7e0c73fec5e211|36006016012823000eae0bf65fec5e211|36006016012823000ec7e0c73fec5e211|36006016012823000ee7e0c73fec5e211|36006016012823000f07e0c73fec5e211|36006016012823000f27e0c73fec5e211%\', \'r%.*%\' ] }  global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1 }  backup {  retain_min = 50  retain_days = 0 } " --noheadings --units b --nosuffix --separator | -o uuid,name,size,vg_name,vg_uuid,pe_start,pe_count,pe_alloc_count,mda_count,dev_size' (cwd None)

Comment 6 Maor 2013-07-04 11:52:33 UTC
Since the workaround fixes this issue (see comment 3), I am downgrading the severity to high.

Comment 7 Yeela Kaplan 2013-07-28 13:30:41 UTC
As I have commented on BZ#964768, comment#3, engine needs to first validate that the lun is visible from all hosts. That means, check getDeviceList answer from all hosts before sending createSD, which is currently being disregarded.
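
A simplified sketch of the check proposed in this comment (the actual engine code is Java; this is only an illustration, and get_device_list() is a hypothetical stand-in for querying a host's getDeviceList verb):

    # Hypothetical illustration of the proposed validation: before sending
    # createSD, confirm that every host reports the chosen LUN GUIDs.
    def hosts_missing_luns(hosts, lun_guids, get_device_list):
        missing = {}
        for host in hosts:
            visible = set(dev['GUID'] for dev in get_device_list(host))
            absent = set(lun_guids) - visible
            if absent:
                missing[host] = sorted(absent)
        return missing  # empty dict means every host sees all the LUNs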

Comment 8 Ayal Baron 2013-09-02 15:20:58 UTC
(In reply to Yeela Kaplan from comment #7)
> As I have commented on BZ#964768, comment#3, engine needs to first validate
> that the lun is visible from all hosts. That means, check getDeviceList
> answer from all hosts before sending createSD, which is currently being
> disregarded.

This is not correct. createSD does not require all hosts to see the LUNs, since createSD is not performed in the context of a pool (i.e. the user can use it for different cases, so we don't know which group of hosts will need it).
Why are the hosts not 'seeing' the LUNs?

Comment 13 Sergey Gotliv 2013-11-10 13:39:23 UTC

*** This bug has been marked as a duplicate of bug 964768 ***

