Created attachment 1348127 [details]
logs from engine and hypervisor

Description of problem:
A Gluster domain whose file system became read-only brings the whole storage pool down, whether or not the Gluster domain is the master.

Version-Release number of selected component (if applicable):
ovirt-engine-4.2.0-0.0.master.20171101152059.git8b85834.el7.centos.noarch
vdsm-jsonrpc-4.20.6-11.git582b910.el7.centos.noarch
vdsm-yajsonrpc-4.20.6-11.git582b910.el7.centos.noarch
vdsm-client-4.20.6-11.git582b910.el7.centos.noarch
vdsm-http-4.20.6-11.git582b910.el7.centos.noarch
vdsm-hook-vfio-mdev-4.20.6-11.git582b910.el7.centos.noarch
vdsm-4.20.6-11.git582b910.el7.centos.x86_64
vdsm-api-4.20.6-11.git582b910.el7.centos.noarch
vdsm-hook-vhostmd-4.20.6-11.git582b910.el7.centos.noarch
vdsm-hook-vmfex-dev-4.20.6-11.git582b910.el7.centos.noarch
vdsm-hook-openstacknet-4.20.6-11.git582b910.el7.centos.noarch
vdsm-python-4.20.6-11.git582b910.el7.centos.noarch
vdsm-hook-fcoe-4.20.6-11.git582b910.el7.centos.noarch
vdsm-hook-ethtool-options-4.20.6-11.git582b910.el7.centos.noarch
glusterfs-client-xlators-3.12.1-2.el7.x86_64
glusterfs-rdma-3.12.1-2.el7.x86_64
glusterfs-3.12.1-2.el7.x86_64
glusterfs-cli-3.12.1-2.el7.x86_64
glusterfs-libs-3.12.1-2.el7.x86_64
glusterfs-api-3.12.1-2.el7.x86_64
glusterfs-fuse-3.12.1-2.el7.x86_64
libvirt-daemon-driver-storage-3.2.0-14.el7_4.3.x86_64
libvirt-python-3.2.0-3.el7_4.1.x86_64
libvirt-daemon-driver-storage-core-3.2.0-14.el7_4.3.x86_64
libvirt-daemon-driver-storage-iscsi-3.2.0-14.el7_4.3.x86_64
libvirt-lock-sanlock-3.2.0-14.el7_4.3.x86_64
libvirt-daemon-driver-lxc-3.2.0-14.el7_4.3.x86_64
libvirt-daemon-3.2.0-14.el7_4.3.x86_64
libvirt-daemon-driver-storage-scsi-3.2.0-14.el7_4.3.x86_64
libvirt-daemon-driver-secret-3.2.0-14.el7_4.3.x86_64
libvirt-libs-3.2.0-14.el7_4.3.x86_64
libvirt-daemon-driver-qemu-3.2.0-14.el7_4.3.x86_64
libvirt-daemon-driver-storage-gluster-3.2.0-14.el7_4.3.x86_64
libvirt-daemon-driver-storage-logical-3.2.0-14.el7_4.3.x86_64
libvirt-daemon-driver-interface-3.2.0-14.el7_4.3.x86_64
libvirt-daemon-kvm-3.2.0-14.el7_4.3.x86_64
libvirt-daemon-config-network-3.2.0-14.el7_4.3.x86_64
libvirt-3.2.0-14.el7_4.3.x86_64
libvirt-daemon-driver-network-3.2.0-14.el7_4.3.x86_64
libvirt-daemon-driver-storage-disk-3.2.0-14.el7_4.3.x86_64
libvirt-daemon-driver-nodedev-3.2.0-14.el7_4.3.x86_64
libvirt-daemon-config-nwfilter-3.2.0-14.el7_4.3.x86_64
libvirt-daemon-driver-storage-mpath-3.2.0-14.el7_4.3.x86_64
libvirt-client-3.2.0-14.el7_4.3.x86_64
libvirt-daemon-driver-nwfilter-3.2.0-14.el7_4.3.x86_64
libvirt-daemon-driver-storage-rbd-3.2.0-14.el7_4.3.x86_64
qemu-kvm-ev-2.9.0-16.el7_4.8.1.x86_64
qemu-img-ev-2.9.0-16.el7_4.8.1.x86_64
ipxe-roms-qemu-20170123-1.git4e85b27.el7_4.1.noarch
qemu-kvm-common-ev-2.9.0-16.el7_4.8.1.x86_64
sanlock-python-3.5.0-1.el7.x86_64
sanlock-3.5.0-1.el7.x86_64
sanlock-lib-3.5.0-1.el7.x86_64
selinux-policy-targeted-3.13.1-166.el7_4.5.noarch
libselinux-2.5-11.el7.x86_64
libselinux-utils-2.5-11.el7.x86_64
libselinux-python-2.5-11.el7.x86_64
selinux-policy-3.13.1-166.el7_4.5.noarch
kernel: 3.10.0-693.5.2.el7.x86_64 #1 SMP Fri Oct 13 10:46:25 EDT 2017 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server 7.4 (Maipo)

How reproducible:
Always

Steps to Reproduce:
1. Have a data center with more than one host, a Gluster domain residing on a replica 3 volume, and another domain of a different type
2. Make 2 of the bricks of the Gluster volume where the domain resides unreachable from the hosts

Actual results:
A read-only file system is reported for the Gluster domain, which resides on a replica 3 volume, while 2 of the volume's bricks are unreachable (which makes the volume read-only):

2017-11-05 14:39:25,886+0200 ERROR (jsonrpc/4) [storage.TaskManager.Task] (Task='21e71231-b46c-4df0-9a56-a03674758f1f') Unexpected error (task:875)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
    return fn(*args, **kargs)
  File "<string>", line 2, in createStorageDomain
  File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 48, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 2571, in createStorageDomain
    storageType, domVersion)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/nfsSD.py", line 83, in create
    version)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/nfsSD.py", line 50, in _preCreateValidation
    fileSD.validateFileSystemFeatures(sdUUID, domPath)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/fileSD.py", line 100, in validateFileSystemFeatures
    oop.getProcessPool(sdUUID).directTouch(testFilePath)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/outOfProcess.py", line 319, in directTouch
    ioproc.touch(path, flags, mode)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 575, in touch
    self.timeout)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 459, in _sendCommand
    raise OSError(errcode, errstr)
OSError: [Errno 30] Read-only file system

SpmStart fails on all the DC hosts.

Engine:
2017-11-05 14:34:10,427+02 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-31) [4628c4ae] IrsBroker::Failed::GetStoragePoolInfoVDS: IRSGenericException: IRSErrorException: SpmStart failed
2017-11-05 14:34:12,334+02 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (EE-ManagedThreadFactory-engine-Thread-34337) [667a7af2] IrsBroker::Failed::DetachStorageDomainVDS: IRSGenericException: IRSErrorException: SpmStart failed
2017-11-05 14:34:12,336+02 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.DetachStorageDomainVDSCommand] (EE-ManagedThreadFactory-engine-Thread-34337) [667a7af2] FINISH, DetachStorageDomainVDSCommand, log id: 5965d2f
2017-11-05 14:34:12,336+02 ERROR [org.ovirt.engine.core.bll.storage.domain.ForceRemoveStorageDomainCommand] (EE-ManagedThreadFactory-engine-Thread-34337) [667a7af2] Could not force detach storage domain 'test_gluster_2': EngineException: org.ovirt.engine.core.vdsbroker.irsbroker.IrsSpmStartFailedException: IRSGenericException: IRSErrorException: SpmStart failed (Failed with error ENGINE and code 5001)

Expected results:
When a Gluster domain's file system becomes read-only, only that domain should be affected.

Additional info:
Gluster volume details:
Type: Replicate
Volume ID: 9c28d910-62b1-41dc-a5d4-3639812b0ac1
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: gluster01.scl.lab.tlv.redhat.com:/gluster_volumes/jenkins_ge12_volume_1
Brick2: gluster02.scl.lab.tlv.redhat.com:/gluster_volumes/jenkins_ge12_volume_1
Brick3: gluster03.scl.lab.tlv.redhat.com:/gluster_volumes/jenkins_ge12_volume_1
Options Reconfigured:
nfs.disable: on
performance.readdir-ahead: on
transport.address-family: inet
storage.owner-gid: 36
storage.owner-uid: 36

[root@gluster01 ~]# gluster peer status
Number of Peers: 2

Hostname: 10.35.83.241
Uuid: ccdbbe2f-d0c2-404b-a50a-c686b99f3f78
State: Peer in Cluster (Disconnected)

Hostname: 10.35.83.242
Uuid: 88604198-b9f1-4796-a970-b85730bf6407
State: Peer in Cluster (Disconnected)
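For reference, the failing check in the traceback above boils down to creating a small probe file on the domain's mount with O_DIRECT. A minimal Python sketch of that probe (the real vdsm code routes this through the ioprocess helper; the exact open flags here are an assumption):

    import os

    def direct_touch(test_path):
        # Open/create the probe file with O_DIRECT to verify the mount is
        # writable and supports direct I/O. On a read-only gluster volume
        # os.open raises OSError(errno.EROFS), i.e. the
        # "[Errno 30] Read-only file system" seen above.
        fd = os.open(test_path, os.O_WRONLY | os.O_CREAT | os.O_DIRECT, 0o644)
        os.close(fd)

    # e.g. direct_touch("/rhev/data-center/mnt/glusterSD/<server>:_<volume>/__DIRECT_IO_TEST__")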
After no host can be selected as SPM, I'd expect the monitoring to conclude that the domain is the problem and set it to inactive.
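To illustrate the expected handling (purely hypothetical names; this is not vdsm's actual monitor code, just a sketch of the behavior asked for above):

    import errno

    def monitor_domain(domain):
        # Hypothetical sketch: a read-only file system detected by one
        # domain's health check should only deactivate that domain,
        # rather than failing SpmStart for the whole pool.
        try:
            domain.validate()                  # e.g. the direct-I/O touch probe
        except OSError as e:
            if e.errno == errno.EROFS:
                domain.set_status("Inactive")  # isolate the broken domain
                return
            raise
        domain.set_status("Active")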
I'm raising the priority for the HCI use case. We need to fix this by the oVirt GA.
Adam, can you please scrub?
Yep... Working on it.
How are you making the two hosts unreachable? I am finding the following errors in vdsm.log, which occur when trying to locate storage domains on storage. This problem appears to block the produce operation for all domains, which prevents the domain monitor from starting.

2017-11-05 15:01:41,170+0200 DEBUG (monitor/4cc7fe5) [storage.Misc.excCmd] /usr/bin/taskset --cpu-list 0-3 /usr/bin/sudo -n /usr/sbin/lvm vgs --config ' devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ '\''a|/dev/mapper/3600a09803830347a625d467a7974572d|/dev/mapper/3600a09803830347a625d467a79745738|/dev/mapper/3600a09803830347a625d467a79745739|/dev/mapper/3600a09803830347a625d467a79745741|/dev/mapper/3600a09803830347a625d467a79745742|/dev/mapper/3600a09803830347a625d467a79745743|/dev/mapper/3600a09803830347a625d467a79746175|/dev/mapper/3600a09803830347a625d467a79746432|/dev/mapper/3600a09803830347a625d467a79746433|'\'', '\''r|.*|'\'' ] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup { retain_min = 50 retain_days = 0 } ' --noheadings --units b --nosuffix --separator '|' --ignoreskippedcluster -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name 4cc7fe5f-f5f5-4732-8ea1-dfea0c391db0 (cwd None) (commands:70)

2017-11-05 15:01:41,272+0200 DEBUG (monitor/4cc7fe5) [storage.Misc.excCmd] FAILED: <err> = '
  /dev/mapper/3600a09803830347a625d467a79745739: read failed after 0 of 4096 at 0: Input/output error
  /dev/mapper/3600a09803830347a625d467a79745739: read failed after 0 of 4096 at 161061208064: Input/output error
  /dev/mapper/3600a09803830347a625d467a79745739: read failed after 0 of 4096 at 161061265408: Input/output error
  WARNING: Error counts reached a limit of 3. Device /dev/mapper/3600a09803830347a625d467a79745739 was disabled
  /dev/mapper/3600a09803830347a625d467a79745738: read failed after 0 of 4096 at 0: Input/output error
  /dev/mapper/3600a09803830347a625d467a79745738: read failed after 0 of 4096 at 161061208064: Input/output error
  /dev/mapper/3600a09803830347a625d467a79745738: read failed after 0 of 4096 at 161061265408: Input/output error
  WARNING: Error counts reached a limit of 3. Device /dev/mapper/3600a09803830347a625d467a79745738 was disabled
  /dev/mapper/3600a09803830347a625d467a7974572d: read failed after 0 of 4096 at 0: Input/output error
  /dev/mapper/3600a09803830347a625d467a7974572d: read failed after 0 of 4096 at 161061208064: Input/output error
  /dev/mapper/3600a09803830347a625d467a7974572d: read failed after 0 of 4096 at 161061265408: Input/output error
  WARNING: Error counts reached a limit of 3. Device /dev/mapper/3600a09803830347a625d467a7974572d was disabled
  /dev/mapper/3600a09803830347a625d467a79745741: read failed after 0 of 4096 at 0: Input/output error
  /dev/mapper/3600a09803830347a625d467a79745741: read failed after 0 of 4096 at 53687025664: Input/output error
  /dev/mapper/3600a09803830347a625d467a79745741: read failed after 0 of 4096 at 53687083008: Input/output error
  WARNING: Error counts reached a limit of 3. Device /dev/mapper/3600a09803830347a625d467a79745741 was disabled
  /dev/mapper/3600a09803830347a625d467a79745742: read failed after 0 of 4096 at 0: Input/output error
  /dev/mapper/3600a09803830347a625d467a79745742: read failed after 0 of 4096 at 53687025664: Input/output error
  /dev/mapper/3600a09803830347a625d467a79745742: read failed after 0 of 4096 at 53687083008: Input/output error
  WARNING: Error counts reached a limit of 3. Device /dev/mapper/3600a09803830347a625d467a79745742 was disabled
  /dev/mapper/3600a09803830347a625d467a79745743: read failed after 0 of 4096 at 0: Input/output error
  /dev/mapper/3600a09803830347a625d467a79745743: read failed after 0 of 4096 at 53687025664: Input/output error
  /dev/mapper/3600a09803830347a625d467a79745743: read failed after 0 of 4096 at 53687083008: Input/output error
  WARNING: Error counts reached a limit of 3. Device /dev/mapper/3600a09803830347a625d467a79745743 was disabled
  /dev/mapper/3600a09803830347a625d467a79746175: read failed after 0 of 4096 at 0: Input/output error
  /dev/mapper/3600a09803830347a625d467a79746175: read failed after 0 of 4096 at 139607343104: Input/output error
  /dev/mapper/3600a09803830347a625d467a79746175: read failed after 0 of 4096 at 139607400448: Input/output error
  WARNING: Error counts reached a limit of 3. Device /dev/mapper/3600a09803830347a625d467a79746175 was disabled
  /dev/mapper/3600a09803830347a625d467a79746432: read failed after 0 of 4096 at 0: Input/output error
  /dev/mapper/3600a09803830347a625d467a79746432: read failed after 0 of 4096 at 16106061824: Input/output error
  /dev/mapper/3600a09803830347a625d467a79746432: read failed after 0 of 4096 at 16106119168: Input/output error
  WARNING: Error counts reached a limit of 3. Device /dev/mapper/3600a09803830347a625d467a79746432 was disabled
  /dev/mapper/3600a09803830347a625d467a79746433: read failed after 0 of 4096 at 0: Input/output error
  /dev/mapper/3600a09803830347a625d467a79746433: read failed after 0 of 4096 at 37585092608: Input/output error
  /dev/mapper/3600a09803830347a625d467a79746433: read failed after 0 of 4096 at 37585149952: Input/output error
  WARNING: Error counts reached a limit of 3. Device /dev/mapper/3600a09803830347a625d467a79746433 was disabled
  WARNING: Error counts reached a limit of 3. Device /dev/mapper/3600a09803830347a625d467a79745739 was disabled
  WARNING: Error counts reached a limit of 3. Device /dev/mapper/3600a09803830347a625d467a79745738 was disabled
  WARNING: Error counts reached a limit of 3. Device /dev/mapper/3600a09803830347a625d467a7974572d was disabled
  WARNING: Error counts reached a limit of 3. Device /dev/mapper/3600a09803830347a625d467a79745741 was disabled
  WARNING: Error counts reached a limit of 3. Device /dev/mapper/3600a09803830347a625d467a79745742 was disabled
  WARNING: Error counts reached a limit of 3. Device /dev/mapper/3600a09803830347a625d467a79745743 was disabled
  WARNING: Error counts reached a limit of 3. Device /dev/mapper/3600a09803830347a625d467a79746175 was disabled
  WARNING: Error counts reached a limit of 3. Device /dev/mapper/3600a09803830347a625d467a79746432 was disabled
  WARNING: Error counts reached a limit of 3. Device /dev/mapper/3600a09803830347a625d467a79746433 was disabled
  Volume group "4cc7fe5f-f5f5-4732-8ea1-dfea0c391db0" not found
  Cannot process volume group 4cc7fe5f-f5f5-4732-8ea1-dfea0c391db0
'; <rc> = 5 (commands:94)
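The LVM failures above are plain EIO on direct reads of the multipath devices. A small, illustrative Python 3 sketch of an equivalent probe (assumed helper, not vdsm code; the device path is taken from the log above):

    import mmap
    import os

    def probe_device(dev, size=4096):
        # Read the first 4 KiB of a block device with O_DIRECT, similar to
        # the scanning reads LVM performs on each PV; on the devices in the
        # log above the same read fails with OSError(errno.EIO).
        buf = mmap.mmap(-1, size)  # anonymous mmap: page-aligned, as O_DIRECT requires
        fd = os.open(dev, os.O_RDONLY | os.O_DIRECT)
        try:
            os.readv(fd, [buf])    # raises OSError on a dead path
        finally:
            os.close(fd)
        return bytes(buf)

    # e.g. probe_device("/dev/mapper/3600a09803830347a625d467a79745739")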
(In reply to Adam Litke from comment #5)
> How are you making the two hosts unreachable? I am finding the following
> errors in vdsm.log, which occur when trying to locate storage domains on
> storage. This problem appears to block the produce operation for all
> domains, which prevents the domain monitor from starting.

It happened due to a malfunction in our Gluster setup, in which 2 of the 3 Gluster peers were down, which caused the Gluster volume's file system to become read-only.

*The domain's Gluster volume is replica 3, consisting of a brick on each of the Gluster peers.
(In reply to Elad from comment #6)
> (In reply to Adam Litke from comment #5)
> > How are you making the two hosts unreachable? I am finding the following
> > errors in vdsm.log, which occur when trying to locate storage domains on
> > storage. This problem appears to block the produce operation for all
> > domains, which prevents the domain monitor from starting.
>
> It happened due to a malfunction in our Gluster setup, in which 2 of the 3
> Gluster peers were down, which caused the Gluster volume's file system to
> become read-only.
>
> *The domain's Gluster volume is replica 3, consisting of a brick on each of
> the Gluster peers.

Did anything else go wrong during this time? The LVM I/O errors are concerning to me (even if not directly related).
The vdsm logs don't seem to match up with the timestamps in the engine logs. The engine starts showing errors at 11 am, but I don't have any logs from the hosts until 3 pm?
I don't have any more logs from the time of this occurrence. I'll reproduce again and provide the logs.
(In reply to Elad from comment #9)
> I don't have any more logs from the time of this occurrence. I'll reproduce
> again and provide the logs.

Re-setting NEEDINFO for reproduction.
I wasn't able to reproduce the issue.
Fred, please try to work with the currently supplied information.
(In reply to Elad from comment #11)
> I wasn't able to reproduce the issue.
> Fred, please try to work with the currently supplied information.

I'm closing for the time being; Adam has investigated it above. It's a very good tier 4 test (which can easily be done with Gluster), so I suggest re-creating it with 2 different replica 3 Gluster setups: keep one healthy, and on the other, bring down 2 nodes. Please re-open when reproduced.
Created attachment 1387669 [details]
logs2

We encountered the bug once again. Our Gluster cluster is based on 3 virtual machines; 2 of these machines entered a Paused state, which caused the replica 3 volumes they provide to become a read-only file system. These volumes are used by our environments (for RHV automation).

2018-01-29 10:44:58,754+0200 ERROR (upgrade/6d1d3bb) [storage.StoragePool] FINISH thread <Thread(upgrade/6d1d3bb, started daemon 139700764600064)> failed (concurrent:201)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 194, in run
    ret = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/securable.py", line 79, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 189, in _upgradePoolDomain
    domain = sdCache.produce(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 110, in produce
    domain.getRealDomain()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 51, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 134, in _realProduce
    domain = self._findDomain(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 151, in _findDomain
    return findMethod(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/glusterSD.py", line 55, in findDomain
    return GlusterStorageDomain(GlusterStorageDomain.findDomainPath(sdUUID))
  File "/usr/lib/python2.7/site-packages/vdsm/storage/fileSD.py", line 391, in __init__
    validateFileSystemFeatures(manifest.sdUUID, manifest.mountpoint)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/fileSD.py", line 104, in validateFileSystemFeatures
    oop.getProcessPool(sdUUID).directTouch(testFilePath)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/outOfProcess.py", line 319, in directTouch
    ioproc.touch(path, flags, mode)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 577, in touch
    self.timeout)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 461, in _sendCommand
    raise OSError(errcode, errstr)
OSError: [Errno 30] Read-only file system

2018-01-29 10:44:58,898+0200 DEBUG (upgrade/84c91da) [storage.StorageDomainManifest] Reading domain in path /rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Storage__NFS_storage__jenkins__ge5__nfs__1/84c91da9-15ec-447f-92b5-a16016d917a2 (fileSD:168)

Hypervisor:
vdsm-4.20.17-1.el7ev.x86_64
libvirt-3.9.0-6.el7.x86_64
glusterfs-3.8.4-53.el7.x86_64

Engine:
rhvm-4.2.1.3-0.1.el7.noarch

Gluster servers:
glusterfs-server-3.12.2-1.el7.x86_64
Are there other domains in the setup? Was the gluster domain the master SD?
(In reply to Allon Mureinik from comment #14)
> Are there other domains in the setup?
Yes, NFS and iSCSI.

> Was the gluster domain the master SD?
The Gluster domain doesn't have to be the master for the bug to reproduce.
Allon, this bug occurs also on 4.1.9. Should we clone it?
(In reply to Elad from comment #16)
> Allon, this bug occurs also on 4.1.9. Should we clone it?

It's an upstream oVirt bug, no need to clone - we can just retarget it if we decide to fix it in 4.1.z. But I'd like to get a sense of the scope of the fix before retargeting. This bug has probably been around forever, and 4.1.10 will probably be the last 4.1.z release. I don't want to introduce anything we aren't 100% sure of.
oVirt Version 4.1.2.2-1.el7.centos

Yesterday I had routing issues with a gateway, though I'm not sure why that would matter, as the oVirt engine and hypervisors are behind the gateway on a private network... But it looks Gluster related. The odd thing is the GlusterFS mount is connected and I can read/write to it without issue:

[root@terri-wolven 8fe3f7f3-15c9-4376-b20e-5e237b31b185]# ls -halt
total 16K
drwxr-xr-x.  4 vdsm kvm 4.0K Jan 31 15:48 .
drwxr-xr-x. 11 vdsm kvm 4.0K Jan 31 15:33 images
drwxr-xr-x.  6 vdsm kvm 4.0K Nov 21 22:31 ..
drwxr-xr-x.  2 vdsm kvm 4.0K Aug  3 13:42 dom_md

2018-01-31 15:41:19,353+0000 ERROR (monitor/8fe3f7f) [storage.Monitor] Setting up monitor for 8fe3f7f3-15c9-4376-b20e-5e237b31b185 failed (monitor:329)
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/monitor.py", line 326, in _setupLoop
    self._setupMonitor()
  File "/usr/share/vdsm/storage/monitor.py", line 349, in _setupMonitor
    self._produceDomain()
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 401, in wrapper
    value = meth(self, *a, **kw)
  File "/usr/share/vdsm/storage/monitor.py", line 367, in _produceDomain
    self.domain = sdCache.produce(self.sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 112, in produce
    domain.getRealDomain()
  File "/usr/share/vdsm/storage/sdc.py", line 53, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 136, in _realProduce
    domain = self._findDomain(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 153, in _findDomain
    return findMethod(sdUUID)
  File "/usr/share/vdsm/storage/glusterSD.py", line 55, in findDomain
    return GlusterStorageDomain(GlusterStorageDomain.findDomainPath(sdUUID))
  File "/usr/share/vdsm/storage/fileSD.py", line 364, in __init__
    validateFileSystemFeatures(manifest.sdUUID, manifest.mountpoint)
  File "/usr/share/vdsm/storage/fileSD.py", line 98, in validateFileSystemFeatures
    oop.getProcessPool(sdUUID).directTouch(testFilePath)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/outOfProcess.py", line 333, in directTouch
    ioproc.touch(path, flags, mode)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 582, in touch
    self.timeout)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 466, in _sendCommand
    raise OSError(errcode, errstr)
OSError: [Errno 5] Input/output error
Specifically, it looks like there is an error with the directTouch(testFilePath) call. I manually ran 'touch' on the GlusterFS mount point and there was no issue.
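The difference between the manual test and the failing probe is buffered vs. direct I/O: plain touch opens the file without O_DIRECT, while vdsm's directTouch uses O_DIRECT. A sketch of the two variants (the exact flags of the real probe are an assumption; the path is taken from the logs in this report):

    import os

    PATH = "/rhev/data-center/mnt/glusterSD/glusterfs.test.org:_gv0/__DIRECT_IO_TEST__"

    # Buffered "touch" -- what the manual test exercised; it succeeded here.
    fd = os.open(PATH, os.O_WRONLY | os.O_CREAT, 0o644)
    os.close(fd)

    # Direct-I/O "touch" -- roughly what vdsm's probe exercises; this is
    # the variant that raised "[Errno 5] Input/output error" in this setup.
    fd = os.open(PATH, os.O_WRONLY | os.O_CREAT | os.O_DIRECT, 0o644)
    os.close(fd)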
I added an INFO log to outOfProcess.py (line 333, where the OSError: [Errno 5] Input/output error comes from):

2018-01-31 16:16:27,654+0000 INFO (tasks/8) [storage.oop] Touching File at: /rhev/data-center/mnt/glusterSD/glusterfs.test.org:_gv0/__DIRECT_IO_TEST__ ... (outOfProcess:333)
2018-01-31 16:16:27,661+0000 ERROR (tasks/8) [storage.StoragePool] Unexpected error (sp:384)

I attempted to 'move' said file to /tmp...

[root@terri-wolven 8fe3f7f3-15c9-4376-b20e-5e237b31b185]# mv /rhev/data-center/mnt/glusterSD/glusterfs.test.org:_gv0/__DIRECT_IO_TEST__ /tmp/
mv: error reading ‘/rhev/data-center/mnt/glusterSD/glusterfs.test.org:_gv0/__DIRECT_IO_TEST__’: Input/output error
mv: failed to extend ‘/tmp/__DIRECT_IO_TEST__’: Input/output error
(In reply to Charlie from comment #20)
> I added an INFO log to outOfProcess.py (line 333, where the
> OSError: [Errno 5] Input/output error comes from):
>
> 2018-01-31 16:16:27,654+0000 INFO (tasks/8) [storage.oop] Touching File at:
> /rhev/data-center/mnt/glusterSD/glusterfs.test.org:_gv0/__DIRECT_IO_TEST__
> ... (outOfProcess:333)
> 2018-01-31 16:16:27,661+0000 ERROR (tasks/8) [storage.StoragePool]
> Unexpected error (sp:384)
>
> I attempted to 'move' said file to /tmp...
>
> [root@terri-wolven 8fe3f7f3-15c9-4376-b20e-5e237b31b185]# mv
> /rhev/data-center/mnt/glusterSD/glusterfs.test.org:_gv0/__DIRECT_IO_TEST__
> /tmp/
> mv: error reading
> ‘/rhev/data-center/mnt/glusterSD/glusterfs.test.org:_gv0/__DIRECT_IO_TEST__’:
> Input/output error
> mv: failed to extend ‘/tmp/__DIRECT_IO_TEST__’: Input/output error

Odd... I just forced a umount of "/rhev/data-center/mnt/glusterSD/glusterfs.test.org:_gv0" and now all my normal NFS storage domains are up; vdsm was still complaining about not being able to touch __DIRECT_IO_TEST__.

I then ran 'rm /rhev/data-center/mnt/glusterSD/glusterfs.test.org:_gv0/__DIRECT_IO_TEST__' and now my GlusterFS storage domain is back up too.

I believe this was related to some kind of I/O lock on the __DIRECT_IO_TEST__ file pointer/inode, and it would appear that forcing a umount and removing that file resolved the issue.
(In reply to Elad from comment #15)
> (In reply to Allon Mureinik from comment #14)
> > Are there other domains in the setup?
> Yes, NFS and iSCSI.
>
> > Was the gluster domain the master SD?
> The Gluster domain doesn't have to be the master for the bug to reproduce.

I did the same test where the gluster SD is not the master, and it worked fine. Only the gluster SD is reported as Inactive. I will do another test where the gluster SD is the master.
It does reproduce in case the gluster SD is the master:
- The gluster SD becomes inactive
- The engine tries to start the SPM with another master domain
- The startSpm verb fails when trying to build the domains. The gluster SD is passed as "active" to VDSM (I guess to try to recover it), but producing the domain fails due to the read-only state.

A quite similar scenario happens if the master domain is an NFS SD and I change the permissions to RO (chmod a-w __DIRECT_IO_TEST__); then, again, the whole DC is down.

The exception in the NFS flow (there are similar exceptions both in getSpmStatus and in connectStoragePool):

Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
    return fn(*args, **kargs)
  File "<string>", line 2, in getSpmStatus
  File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 48, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 633, in getSpmStatus
    status = self._getSpmStatusInfo(pool)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 627, in _getSpmStatusInfo
    (pool.spmRole,) + pool.getSpmStatus()))
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 140, in getSpmStatus
    return self._backend.getSpmStatus()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/spbackends.py", line 432, in getSpmStatus
    lVer, spmId = self.masterDomain.inquireClusterLock()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 48, in __getattr__
    return getattr(self.getRealDomain(), attrName)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 51, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 134, in _realProduce
    domain = self._findDomain(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 151, in _findDomain
    return findMethod(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/nfsSD.py", line 126, in findDomain
    return NfsStorageDomain(NfsStorageDomain.findDomainPath(sdUUID))
  File "/usr/lib/python2.7/site-packages/vdsm/storage/fileSD.py", line 391, in __init__
    validateFileSystemFeatures(manifest.sdUUID, manifest.mountpoint)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/fileSD.py", line 104, in validateFileSystemFeatures
    oop.getProcessPool(sdUUID).directTouch(testFilePath)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/outOfProcess.py", line 319, in directTouch
    ioproc.touch(path, flags, mode)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 575, in touch
    self.timeout)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 459, in _sendCommand
    raise OSError(errcode, errstr)
OSError: [Errno 13] Permission denied
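The traceback makes the propagation chain visible: getSpmStatus must produce the master domain in order to inquire the cluster lock, so a validation failure on that single domain aborts the call and, with it, SPM start for the whole pool. A schematic sketch with hypothetical names (not vdsm's actual code):

    def get_spm_status(pool, sd_cache):
        # Producing the master domain re-runs validateFileSystemFeatures,
        # so EROFS (the gluster case) or EACCES (the NFS repro above)
        # surfaces here and fails the entire getSpmStatus/SpmStart flow.
        master = sd_cache.produce(pool.master_sd_uuid)
        return master.inquire_cluster_lock()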
In case of an 'OSError: [Errno 30] Read-only file system' while a Gluster domain is the master, reconstruct takes place and succeeds.
Also, I went over the latest RHV automation runs (tiers 1-3); no regressions related to the fix here were found.

Used:
vdsm-4.20.23-1.el7ev.x86_64
glusterfs-3.8.4-54.3.el7.x86_64
rhvm-4.2.2.6-0.1.el7.noarch
This bugzilla is included in oVirt 4.2.2 release, published on March 28th 2018. Since the problem described in this bug report should be resolved in oVirt 4.2.2 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.