Bug 1176517 - [scale] Data Center crashing and contending forever due to unknown device on missing pvs. All SDs are Unknown/Inactive.
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.5.0
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ovirt-3.6.5
Target Release: 3.6.0
Assignee: Nir Soffer
QA Contact: Elad
URL:
Whiteboard:
Depends On:
Blocks: 968349 1181678
 
Reported: 2014-12-22 08:32 UTC by Yuri Obshansky
Modified: 2016-04-25 14:27 UTC
CC List: 17 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 1181678 (view as bug list)
Environment:
Last Closed: 2016-04-25 14:27:55 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:


Attachments
Engine log (3.30 MB, application/x-gzip) - 2014-12-22 08:33 UTC, Yuri Obshansky
Vdsm 1 log (615.84 KB, application/x-xz) - 2014-12-22 08:33 UTC, Yuri Obshansky
Vdsm 2 log (643.86 KB, application/x-xz) - 2014-12-22 08:34 UTC, Yuri Obshansky
Fake Host multipath output (37.10 KB, text/plain) - 2014-12-22 08:36 UTC, Yuri Obshansky
Screenshot (159.46 KB, image/jpeg) - 2014-12-22 08:43 UTC, Yuri Obshansky
Second engine.log (4.35 MB, application/x-gzip) - 2014-12-22 09:36 UTC, Yuri Obshansky
DB dump (1.47 MB, application/x-gzip) - 2014-12-22 10:45 UTC, Yuri Obshansky
SPM host vdsm log 1 (616.16 KB, application/x-xz) - 2014-12-22 10:46 UTC, Yuri Obshansky
SPM host vdsm log 2 (636.53 KB, application/x-xz) - 2014-12-22 10:46 UTC, Yuri Obshansky
SPM host vdsm log 3 (455.45 KB, application/x-xz) - 2014-12-22 10:47 UTC, Yuri Obshansky


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 36352 0 master MERGED lvm: Handle special "unknown device" pv_name Never
oVirt gerrit 36782 0 ovirt-3.5 MERGED lvm: Handle special "unknown device" pv_name Never

Description Yuri Obshansky 2014-12-22 08:32:22 UTC
Description of problem:
Problem found during verification of Bug 968349 - [scale] [storage] ConnectStorageServer failed - The thread pool is out of limit (engine finish its thread pool). It is a blocker for Bug 968349 verification.
 
Created:
2 Data Centers - Real and Fake
Real DC has a physical host which runs 50 VMs (4 CPUs, 2G RAM) and NetApp NFS Storage
Fake DC has 50 fake hosts and 10 iSCSI Storage Domains
Several minutes after activating the last iSCSI Storage Domain,
the Data Center starts crashing and contending forever, and all iSCSI Storage Domains move to Unknown/Inactive status.
Errors found in engine.log:
2014-12-22 03:10:43,678 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetTaskStatusVDSCommand] (DefaultQuartzScheduler_Worker-60) [38c55ac7] Failed in HSMGetTaskStatusVDS method
2014-12-22 03:10:43,678 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (DefaultQuartzScheduler_Worker-60) [38c55ac7] Start SPM Task failed - result: cleanSuccess, message: VDSGenericException: VDSErrorException: Failed to HSMGetTaskStatusVDS, error = [Errno 19] Could not find dm device named `unknown device`, code = 100
2014-12-22 03:10:43,777 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-60) [701e7d60] IrsBroker::Failed::GetStoragePoolInfoVDS due to: IrsSpmStartFailedException: IRSGenericException: IRSErrorException: SpmStart failed 

vdsm.log
Thread-27528::ERROR::2014-12-22 08:01:08,343::domainMonitor::256::Storage.DomainMonitorThread::(_monitorDomain) Error while collecting domain 1b6d6d26-9e81-4b05-8913-20d10a0bb779 monitoring information
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/domainMonitor.py", line 221, in _monitorDomain
    self.domain = sdCache.produce(self.sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 98, in produce
    domain.getRealDomain()
  File "/usr/share/vdsm/storage/sdc.py", line 52, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 122, in _realProduce
    domain = self._findDomain(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 141, in _findDomain
    dom = findMethod(sdUUID)
  File "/usr/share/vdsm/storage/blockSD.py", line 1360, in findDomain
    return BlockStorageDomain(BlockStorageDomain.findDomainPath(sdUUID))
  File "/usr/share/vdsm/storage/blockSD.py", line 421, in __init__
    lvm.checkVGBlockSizes(sdUUID, (self.logBlkSize, self.phyBlkSize))
  File "/usr/share/vdsm/storage/lvm.py", line 1020, in checkVGBlockSizes
    _checkpvsblksize(pvs, vgBlkSize)
  File "/usr/share/vdsm/storage/lvm.py", line 997, in _checkpvsblksize
    pvBlkSize = _getpvblksize(pv)
  File "/usr/share/vdsm/storage/lvm.py", line 991, in _getpvblksize
    dev = devicemapper.getDmId(os.path.basename(pv))
  File "/usr/share/vdsm/storage/devicemapper.py", line 39, in getDmId
    deviceMultipathName)
OSError: [Errno 19] Could not find dm device named `unknown device` 

Uploaded new engine and vdsm log files

Version-Release number of selected component (if applicable):
RHEV-M 3.5.0-0.22.el6ev
RHEL - 6Server - 6.6.0.2.el6
libvirt-0.10.2-46.el6_6.1
vdsm-4.16.7.6-1.el6ev 

How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Yuri Obshansky 2014-12-22 08:33:04 UTC
Created attachment 971909 [details]
Engine log

Comment 2 Yuri Obshansky 2014-12-22 08:33:44 UTC
Created attachment 971910 [details]
Vdsm 1 log

Comment 3 Yuri Obshansky 2014-12-22 08:34:08 UTC
Created attachment 971911 [details]
Vdsm 2 log

Comment 4 Yuri Obshansky 2014-12-22 08:36:18 UTC
Created attachment 971913 [details]
Fake Host multipath output

Comment 5 Yuri Obshansky 2014-12-22 08:43:09 UTC
Created attachment 971915 [details]
Screenshot

Comment 6 Yuri Obshansky 2014-12-22 09:36:59 UTC
Created attachment 971922 [details]
Second engine.log

Comment 7 Allon Mureinik 2014-12-22 10:00:58 UTC
Nir, can you take a look please?

Comment 8 Allon Mureinik 2014-12-22 10:01:40 UTC
Aharon, have we ever seen such behavior on a non-scale env?

Comment 9 Allon Mureinik 2014-12-22 10:05:37 UTC
(In reply to Allon Mureinik from comment #7)
> Nir, can you take a look please?

(In reply to Allon Mureinik from comment #8)
> Aharon, have we ever seen such behavior on a non-scale env?

We need to triage and find the root cause for this.
If it only occurs at extreme scale (100 hosts), we can delay it to 3.5.1 and handle it properly. If there's an actual logical error that may surface in any customer env, we need to address it PDQ.

Comment 10 Vered Volansky 2014-12-22 10:07:33 UTC
The vdsm logs are rotations from the same, randomly selected host.
The hosts are contending for SPM, and none of them succeeds.
We need the SPM logs from before this went south. Please check in the engine log which host got SpmStop and attach the vdsm log(s) from the relevant time.

The logs repeat themselves from the time the problem occurred, which is 19/12, as do the attached logs.

If you can't find the SPM, we'll need all the hosts' logs.

Please also add the DB before you delete your env.

Comment 11 Yuri Obshansky 2014-12-22 10:44:39 UTC
Correction: 50 hosts,
which is not an extreme setup AFAIK.

Comment 12 Yuri Obshansky 2014-12-22 10:45:34 UTC
Created attachment 971950 [details]
DB dump

Comment 13 Yuri Obshansky 2014-12-22 10:46:06 UTC
Created attachment 971951 [details]
SPM host vdsm log 1

Comment 14 Yuri Obshansky 2014-12-22 10:46:39 UTC
Created attachment 971952 [details]
SPM host vdsm log 2

Comment 15 Yuri Obshansky 2014-12-22 10:47:09 UTC
Created attachment 971953 [details]
SPM host vdsm log 3

Comment 16 Yuri Obshansky 2014-12-22 10:49:27 UTC
Uploaded RHEV-M DB dump and 3 vdsm.log files from the SPM host.
Please verify that they are correct and readable.
I'm going to erase my RHEV-M setup.

Comment 17 Aharon Canan 2014-12-22 11:08:11 UTC
(In reply to Allon Mureinik from comment #8)
> Aharon, have we ever seen such behavior on a non-scale env?

Not that I am aware of.
We would probably have an open bug in such a case.

Comment 18 Allon Mureinik 2014-12-22 11:09:45 UTC
(In reply to Aharon Canan from comment #17)
> (In reply to Allon Mureinik from comment #8)
> > Aharon, have we ever seen such behavior on a non-scale env?
> 
> Not that I am aware of.
> We would probably have an open bug in such a case.
Yup - I couldn't find one, but I wanted to double-check with you.

Comment 19 Nir Soffer 2014-12-22 14:13:22 UTC
(In reply to Yuri Obshansky from comment #16)
> Uploaded RHEV-M DB dump and 3 vdsm.log files from SPM Host.

We need to see earlier logs.

I suggest you back up everything in /var/log/ somewhere so we can check all the logs.

Specific logs needed are:

- Older vdsm logs
- /var/log/messages
- /var/log/sanlock.log

Comment 20 Yuri Obshansky 2014-12-22 14:17:00 UTC
Sorry, I no longer have those log files.
The environment has been erased.

Comment 21 Nir Soffer 2014-12-22 14:40:14 UTC
How many times did you reproduce this? What are the steps needed to reproduce it?

Comment 22 Yuri Obshansky 2014-12-22 14:47:58 UTC
Once. 
Create:
80 LUNs on iSCSI storage with predefined InitiatorNames
(for example InitiatorName=iqn.1994-05.com.redhat:f5abf64c500 through
InitiatorName=iqn.1994-05.com.redhat:f5abf64c550)
2 Data Centers - Real and Fake
Real DC has a physical host which runs 50 VMs (4 CPUs, 2G RAM) and NetApp NFS Storage
Fake DC has 50 fake hosts
Change each fake host's InitiatorName to one of the predefined names above.
Add 10 iSCSI Storage Domains and activate them.
Several minutes after activating the last iSCSI Storage Domain,
the Data Center starts crashing and contending forever, and all iSCSI Storage Domains move to Unknown/Inactive status.

Comment 23 Nir Soffer 2014-12-22 15:02:28 UTC
The issue here is that one of the PV names is "unknown device". Vdsm is not the source of this name (it does not initialize anything to this value). However, we can find this string in lvm:

# strings /usr/sbin/lvm | grep 'unknown device'
unknown device

// lib/device/dev-cache.c
1075 const char *dev_name(const struct device *dev)
1076 {
1077     return (dev && dev->aliases.n) ? dm_list_item(dev->aliases.n, struct dm_str_list)->str :
1078         "unknown device";
1079 }

Is it possible that lvm returns "unknown device" as pv_name in the output of the vgs command?

Comment 24 Nir Soffer 2014-12-22 15:03:55 UTC
Zdenek, can you help with comment 23?

Comment 25 Federico Simoncelli 2014-12-22 15:28:44 UTC
I think this happens when one of the PVs (of a VG) is missing. I'll leave it to Zdenek to confirm.

Comment 26 Nir Soffer 2014-12-22 16:27:58 UTC
The suspected lvm command:

Thread-27264::DEBUG::2014-12-19 09:01:07,442::lvm::288::Storage.Misc.excCmd::(cmd) /usr/bin/sudo -n /sbin/lvm vgs --config ' devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_s
tate=0 disable_after_error_count=3 obtain_device_list_from_udev=0 filter = [ '\''a|/dev/mapper/3514f0c54476007cf|/dev/mapper/3514f0c54476007d0|/dev/mapper/3514f0c54476007d1|/dev/mapper/3514f0c54476007d2|/dev/map
per/3514f0c54476007d3|/dev/mapper/3514f0c54476007d4|/dev/mapper/3514f0c54476007d5|/dev/mapper/3514f0c54476007d6|/dev/mapper/3514f0c54476007d7|/dev/mapper/3514f0c54476007d8|/dev/mapper/3514f0c54476007d9|/dev/mapp
er/3514f0c54476007da|/dev/mapper/3514f0c54476007db|/dev/mapper/3514f0c54476007dc|/dev/mapper/3514f0c54476007dd|/dev/mapper/3514f0c54476007de|/dev/mapper/3514f0c54476007df|/dev/mapper/3514f0c54476007e0|/dev/mappe
r/3514f0c54476007e1|/dev/mapper/3514f0c54476007e2|/dev/mapper/3514f0c54476007e3|/dev/mapper/3514f0c54476007e4|/dev/mapper/3514f0c54476007e5|/dev/mapper/3514f0c54476007e6|/dev/mapper/3514f0c54476007e7|/dev/mapper
/3514f0c54476007e8|/dev/mapper/3514f0c54476007e9|/dev/mapper/3514f0c54476007ea|/dev/mapper/3514f0c54476007eb|/dev/mapper/3514f0c54476007ec|/dev/mapper/3514f0c54476007ed|/dev/mapper/3514f0c54476007ee|/dev/mapper/
3514f0c54476007ef|/dev/mapper/3514f0c54476007f0|/dev/mapper/3514f0c54476007f1|/dev/mapper/3514f0c54476007f2|/dev/mapper/3514f0c54476007f3|/dev/mapper/3514f0c54476007f4|/dev/mapper/3514f0c54476007f5|/dev/mapper/3
514f0c54476007f6|/dev/mapper/3514f0c54476007f7|/dev/mapper/3514f0c54476007f8|/dev/mapper/3514f0c54476007f9|/dev/mapper/3514f0c54476007fa|/dev/mapper/3514f0c54476007fb|/dev/mapper/3514f0c54476007fc|/dev/mapper/35
14f0c54476007fd|/dev/mapper/3514f0c54476007fe|/dev/mapper/3514f0c54476007ff|/dev/mapper/3514f0c5447600800|/dev/mapper/3514f0c5447600801|/dev/mapper/3514f0c5447600802|/dev/mapper/3514f0c5447600803|/dev/mapper/351
4f0c5447600804|/dev/mapper/3514f0c5447600805|/dev/mapper/3514f0c5447600806|/dev/mapper/3514f0c5447600807|/dev/mapper/3514f0c5447600808|/dev/mapper/3514f0c5447600809|/dev/mapper/3514f0c544760080a|/dev/mapper/3514
f0c544760080b|/dev/mapper/3514f0c544760080c|/dev/mapper/3514f0c544760080d|/dev/mapper/3514f0c544760080e|/dev/mapper/3514f0c544760080f|/dev/mapper/3514f0c5447600810|/dev/mapper/3514f0c5447600811|/dev/mapper/3514f
0c5447600812|/dev/mapper/3514f0c5447600813|/dev/mapper/3514f0c5447600814|/dev/mapper/3514f0c5447600815|/dev/mapper/3514f0c5447600816|/dev/mapper/3514f0c5447600817|/dev/mapper/3514f0c5447600818|/dev/mapper/3514f0
c5447600819|/dev/mapper/3514f0c544760081a|/dev/mapper/3514f0c544760081b|/dev/mapper/3514f0c544760081c|/dev/mapper/3514f0c544760081d|/dev/mapper/3514f0c544760081e|'\'', '\''r|.*|'\'' ] }  global {  locking_type=1
  prioritise_write_locks=1  wait_for_locks=1  use_lvmetad=0 }  backup {  retain_min = 50  retain_days = 0 } ' --noheadings --units b --nosuffix --separator '|' --ignoreskippedcluster -o uuid,name,attr,size,free,
extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name 1b6d6d26-9e81-4b05-8913-20d10a0bb779 (cwd None)

The command was successful, but we got a few warnings in stderr:

Thread-27264::DEBUG::2014-12-19 09:01:10,250::lvm::288::Storage.Misc.excCmd::(cmd) SUCCESS: <err> = "  Couldn't find device with uuid f4lZyM-vtBO-71ql-E4O8-ls6J-XK6L-oP6wjk.\n  Couldn't find device with uuid xk4
zcV-C3wO-arG3-fU9P-kE1Q-y1B9-sdnEQ3.\n  Couldn't find device with uuid er01Lz-QSzr-EsBn-L7fW-6ndc-Gzcd-AFK4S8.\n  Couldn't find device with uuid VVeaPf-83Zb-iL05-gAGw-NIs7-7q75-ZkFSaL.\n"; <rc> = 0
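
To illustrate where the bad value would enter vdsm: a minimal sketch (not vdsm's actual parser) of splitting the pipe-separated vgs output requested by the command above. The sample row and its values are hypothetical, but the last field is exactly where lvm would place the literal "unknown device" string for a missing PV:

# Minimal sketch, not vdsm code. Field order follows the -o option of
# the vgs command above; the sample values are made up.
VGS_FIELDS = ("uuid", "name", "attr", "size", "free", "extent_size",
              "extent_count", "free_count", "tags", "vg_mda_size",
              "vg_mda_free", "lv_count", "pv_count", "pv_name")

def parse_vgs_line(line):
    """Split one '|'-separated vgs output line into a field dict."""
    values = [v.strip() for v in line.split("|")]
    return dict(zip(VGS_FIELDS, values))

# vgs emits one row per PV of the VG; for an unreachable PV the last
# column contains the literal string "unknown device".
sample = ("uuid-xx|1b6d6d26-9e81-4b05-8913-20d10a0bb779|wz-pn-|0|0|0|0|0|"
          "tags|0|0|0|2|unknown device")
print(parse_vgs_line(sample)["pv_name"])  # -> unknown device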

Comment 27 Allon Mureinik 2014-12-23 10:17:05 UTC
Since we've never seen this in "regular" envs, and since the fix will quite possibly be in LVM, I'm pushing this out to 3.5.1, so as not to block 3.5.0.

Comment 28 Nir Soffer 2014-12-23 12:01:56 UTC
The "unknown device" value was introduced in:

commit 6b4c9ff7409edbcd357d089dae94b48a12b2c3ca
Author: Alasdair Kergon <agk>
Date:   Tue Apr 30 17:12:37 2002 +0000

    Some partial VG support with format_text.

So it seems that vdsm needs to handle this special value.
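
A rough sketch of the kind of guard such handling could add, shown only to illustrate the idea. The identifiers below (UNKNOWN_DEVICE, InaccessiblePhysDev, the helper parameters) are illustrative and not taken from the actual patch (gerrit 36352):

import os

UNKNOWN_DEVICE = "unknown device"  # literal emitted by lvm's dev_name()

class InaccessiblePhysDev(Exception):
    """A PV whose underlying device lvm could not find."""

def pv_block_size(pv_name, get_dm_id, read_block_size):
    """Resolve a PV's block size, failing clearly for a missing device.

    get_dm_id and read_block_size stand in for the device-mapper and
    sysfs lookups seen in the traceback in the description.
    """
    if pv_name == UNKNOWN_DEVICE:
        # Report a meaningful error instead of asking device-mapper for a
        # device literally named "unknown device" (the OSError above).
        raise InaccessiblePhysDev(pv_name)
    return read_block_size(get_dm_id(os.path.basename(pv_name)))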

Comment 29 Nir Soffer 2014-12-23 12:06:02 UTC
Yuri, can you describe the iSCSI storage setup?

- How many storage server targets?
- How many luns on each target?
- How many luns per storage domain?
- Did you have storage domain with luns from different targets?

Comment 30 Yuri Obshansky 2014-12-23 12:12:55 UTC
Aharon, could you describe the iSCSI storage setup?

Comment 31 Aharon Canan 2014-12-23 12:17:17 UTC
4 target ports
All LUNs should be exposed from all targets (we have many more LUNs for other setups, not only this one).

LUNs per storage domain and different targets - Yuri needs to answer as the setup owner.

Comment 32 Yuri Obshansky 2014-12-23 12:19:04 UTC
Sorry, I missed it. 
- How many luns per storage domain?
4 per storage domain
- Did you have storage domain with luns from different targets?
no

Comment 33 Nir Soffer 2014-12-23 12:48:15 UTC
Aharon, in this setup, if one target is disconnected, the lun should still be
available from the other targets - right?

I can reproduce the "unknown device" response from lvm when creating a vg with
2 luns, one accessible only from one target, and the other only from another
target, and then logging out the node from the other target.

This is what it looks like when running vgs:

# vgs --noheadings --units b --nosuffix -o uuid,pv_name 2abf8872-f51c-401b-b747-a0f5313c92de
  PV xe1S32-0xbb-BDiJ-MAcJ-fOk8-GS1O-okRpAf not recognised. Is the device missing?
  0YDWj1-3171-diDk-547L-VMNN-scC4-qdOPv8 /dev/mapper/1IET_00070003
  0YDWj1-3171-diDk-547L-VMNN-scC4-qdOPv8 unknown device           

And in pvs:

# pvs -o name,vg_name | grep 2abf8872-f51c-401b-b747-a0f5313c92de
  PV xe1S32-0xbb-BDiJ-MAcJ-fOk8-GS1O-okRpAf not recognised. Is the device missing?
  PV xe1S32-0xbb-BDiJ-MAcJ-fOk8-GS1O-okRpAf not recognised. Is the device missing?
  PV xe1S32-0xbb-BDiJ-MAcJ-fOk8-GS1O-okRpAf not recognised. Is the device missing?
  /dev/mapper/1IET_00070003                     2abf8872-f51c-401b-b747-a0f5313c92de
  unknown device                                2abf8872-f51c-401b-b747-a0f5313c92de

So we have 2 issues:
1. vdsm does not handle the special "unknown device" value from lvm (a small detection sketch follows below)
2. How did we get into this state in the setup used for this test?
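
For issue 1, a small detection sketch, assuming the pv_count and pv_name fields returned by the vgs query in comment 26: it treats "unknown device" as a missing PV and flags the VG as partial. It mirrors the kind of pv_count vs. pv_names mismatch that vdsm logs after the fix (see the verification in comment 40), but it is not the literal vdsm code:

def vg_is_partial(pv_count, pv_names):
    """Return True if some of the VG's PVs are unreachable."""
    present = [name for name in pv_names if name != "unknown device"]
    return len(present) < int(pv_count)

# Shaped like the vgs output above; this VG has 2 PVs but only one device:
print(vg_is_partial(2, ["/dev/mapper/1IET_00070003", "unknown device"]))
# -> True: one of the two PVs is missing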

Comment 34 Aharon Canan 2014-12-23 12:53:02 UTC
(In reply to Nir Soffer from comment #33)
> Aharon, in this setup, if one target is disconnected, the lun should still be
> available from the other targets - right?

The LUN should be available from the other targets; the question is whether we log in to all targets or only to one...

Comment 35 Yuri Obshansky 2014-12-23 14:01:57 UTC
Maybe it is related to the fake host, which is running on a VM with vdsm-hook-faqemu?

Comment 36 Nir Soffer 2014-12-23 17:37:22 UTC
(In reply to Aharon Canan from comment #34)
> (In reply to Nir Soffer from comment #33)
> > Aharon, in this setup, if one target is disconnected, the lun should still be
> > available from the other targets - right?
> 
> The LUN should be available from the other targets; the question is whether
> we log in to all targets or only to one...

According to the database dump, we are connected to all targets, and all luns are
connected to all targets.

-- Data for Name: lun_storage_server_connection_map; Type: TABLE DATA; Schema: public; Owner: engine

3514f0c54476007f4       64b6ccf2-9e53-4cfb-a54e-eca7d4c2f61f
3514f0c54476007f4       ed8bb70d-2a98-4610-92aa-e0bd64d4fcb2
3514f0c54476007f4       3f9c0525-3c02-48bb-8d57-b36bc9c0a9c0
3514f0c54476007f4       a026c6ef-7c04-43c3-b608-27c1ae19672c
...

-- Data for Name: storage_server_connections; Type: TABLE DATA; Schema: public; Owner: engine

ed8bb70d-2a98-4610-92aa-e0bd64d4fcb2    10.35.146.129   \N      \N      iqn.2008-05.com.xtremio:001e675b8ee0    3260    1       3       \N      \N      \N      \N      \N
a026c6ef-7c04-43c3-b608-27c1ae19672c    10.35.146.161   \N      \N      iqn.2008-05.com.xtremio:001e675b8ee1    3260    1       3       \N      \N      \N      \N      \N
3f9c0525-3c02-48bb-8d57-b36bc9c0a9c0    10.35.146.193   \N      \N      iqn.2008-05.com.xtremio:001e675ba170    3260    1       3       \N      \N      \N      \N      \N
64b6ccf2-9e53-4cfb-a54e-eca7d4c2f61f    10.35.146.225   \N      \N      iqn.2008-05.com.xtremio:001e675ba171    3260    1       3       \N      \N      \N      \N      \N
...

Daniel, can you confirm this?

Comment 37 Nir Soffer 2014-12-24 15:03:43 UTC
I reproduced this issue on both RHEL 7 and RHEL 6.6. The way to reproduce it is to create a storage domain with 2 LUNs, each from a different server, and block access to one of the servers.

Can be reproduced by:
- using iscsiadm to log out from a node
- blocking io to a server using iptables
- using incorrect initiator name

Vdsm now handles the "unknown device" special value from lvm and will
display clear errors about a missing PV.

We don't have enough logs to tell why a PV was missing, but I suspect
that this was an incorrect iSCSI initiator name (maybe it was correct
but the iscsid service was not started after changing it?).

I don't think we can do anything else at this time.

Comment 39 Gil Klein 2016-02-23 17:08:53 UTC
Reducing severity based on comment #37

Comment 40 Elad 2016-04-03 15:07:27 UTC
vdsm reports correctly when a PV is missing from a VG:

Thread-113060::ERROR::2016-04-03 17:59:06,499::lvm::403::Storage.LVM::(_reloadvgs) vg a45d5603-4de8-4a5d-bd50-2c097d28584e has pv_count 2 but pv_names ('/dev/mapper/3514f0c5a51600290',)
Thread-113060::DEBUG::2016-04-03 17:59:06,499::lvm::415::Storage.OperationMutex::(_reloadvgs) Operation 'lvm reload operation' released the operation mutex
Thread-113060::ERROR::2016-04-03 17:59:06,500::monitor::276::Storage.Monitor::(_monitorDomain) Error monitoring domain a45d5603-4de8-4a5d-bd50-2c097d28584e
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/monitor.py", line 272, in _monitorDomain
    self._performDomainSelftest()
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 769, in wrapper
    value = meth(self, *a, **kw)
  File "/usr/share/vdsm/storage/monitor.py", line 339, in _performDomainSelftest
    self.domain.selftest()
  File "/usr/share/vdsm/storage/blockSD.py", line 859, in selftest
    raise se.StorageDomainAccessError(self.sdUUID)
StorageDomainAccessError: Domain is either partially accessible or entirely inaccessible: ('a45d5603-4de8-4a5d-bd50-2c097d28584e',)


Reproduced with a storage domain created from 2 LUNs that reside on different storage servers (I used XtremIO and RHEL tgt as storage backends).
I disconnected the hypervisor from one of the storage servers using iptables and got an 'unknown device' in the vgs output:


[root@blond-vdsh ~]# pvscan --cache && vgs -o +pv_name          
  /dev/mapper/360000000000000000e00000000020002: read failed after 0 of 4096 at 0: Input/output error
  /dev/mapper/360000000000000000e00000000020002: read failed after 0 of 4096 at 104792064: Input/output error
  /dev/mapper/360000000000000000e00000000020002: read failed after 0 of 4096 at 104849408: Input/output error
  /dev/mapper/360000000000000000e00000000020002: read failed after 0 of 4096 at 4096: Input/output error
  /dev/mapper/360000000000000000e00000000020003: read failed after 0 of 4096 at 0: Input/output error
  /dev/mapper/360000000000000000e00000000020003: read failed after 0 of 4096 at 6456082432: Input/output error
  /dev/mapper/360000000000000000e00000000020003: read failed after 0 of 4096 at 6456147968: Input/output error
  /dev/mapper/360000000000000000e00000000020003: read failed after 0 of 4096 at 4096: Input/output error
  WARNING: Device for PV 3u8DBW-9oK4-TbmZ-WI9e-J8xw-9zQ5-5GKJ71 not found or rejected by a filter.
  VG                                   #PV #LV #SN Attr   VSize  VFree  PV                           
  37f88c34-93cc-46c5-ae61-5a9231d0ccac   1  10   0 wz--n- 54.62g 46.50g /dev/mapper/3514f0c5a5160018f
  a45d5603-4de8-4a5d-bd50-2c097d28584e   2   8   0 wz-pn- 40.38g 36.25g /dev/mapper/3514f0c5a51600290
  a45d5603-4de8-4a5d-bd50-2c097d28584e   2   8   0 wz-pn- 40.38g 36.25g unknown device               
  d032a86b-6bb4-46f3-963d-cf655e94bed9   1  13   0 wz--n- 65.62g 45.50g /dev/mapper/3514f0c5a51600195



Verified using:
rhevm-3.6.5-0.1.el6.noarch
vdsm-4.17.25-0.el7ev.noarch

Nir, I added need-info on you just to make sure I did the correct verification, thanks.

Comment 41 Nir Soffer 2016-04-03 15:26:23 UTC
(In reply to Elad from comment #40)
> Nir, I added need-info on you just to make sure I did the correct
> verification, thanks.

Thanks Elad, the verification looks good.

Comment 42 Eyal Edri 2016-04-25 14:27:55 UTC
3.6.5 released

