RHEV 3.1.2: rhevm-3.1.0-43.el6ev.noarch
RHEV-H 6.3 - 20130129.0.el6_3
vdsm-4.9.6-45.2.el6_3

The customer has an NFS DC with 3 Data Domains. All 3 went down for some time, producing the following errors in vdsm.log:

Thread-3481568::WARNING::2013-05-27 13:01:05,696::remoteFileHandler::185::Storage.CrabRPCProxy::(callCrabRPCFunction) Problem with handler, treating as timeout
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 177, in callCrabRPCFunction
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 143, in _recvAll
Timeout

Thread-1309111::ERROR::2013-05-27 13:01:05,697::task::853::TaskManager.Task::(_setError) Task=`e063e1a4-385c-4394-b7cc-b4204f422d4b`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 861, in _run
  File "/usr/share/vdsm/logUtils.py", line 38, in wrapper
  File "/usr/share/vdsm/storage/hsm.py", line 2718, in getVolumeSize
  File "/usr/share/vdsm/storage/volume.py", line 321, in getVSize
  File "/usr/share/vdsm/storage/sdc.py", line 97, in produce
  File "/usr/share/vdsm/storage/sdc.py", line 52, in getRealDomain
  File "/usr/share/vdsm/storage/sdc.py", line 121, in _realProduce
  File "/usr/share/vdsm/storage/sdc.py", line 152, in _findDomain
StorageDomainDoesNotExist: Storage domain does not exist: ('84c8878d-ce58-42ef-bccc-b318da754b53',)

At some point the storage recovered, but the customer cannot start some VMs, which fail with this error:

Thread-3493871::ERROR::2013-05-27 15:45:33,727::dispatcher::69::Storage.Dispatcher.Protect::(run) [Errno 2] No such file or directory: '/rhev/data-center/f4759db8-a99e-11e1-a99c-cf4a75f008a7/a5b877e7-eb1a-48a2-95d9-6dc734f74ea2/images/c4e4f09a-f9ab-4867-81e5-7ff458ef5ccb/80f2743c-5f81-43f1-bcfc-2c18fdb7f856'
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/dispatcher.py", line 61, in run
  File "/usr/share/vdsm/storage/task.py", line 1164, in prepare
OSError: [Errno 2] No such file or directory: '/rhev/data-center/f4759db8-a99e-11e1-a99c-cf4a75f008a7/a5b877e7-eb1a-48a2-95d9-6dc734f74ea2/images/c4e4f09a-f9ab-4867-81e5-7ff458ef5ccb/80f2743c-5f81-43f1-bcfc-2c18fdb7f856'

Thread-3493871::DEBUG::2013-05-27 15:45:33,727::vm::607::vm.Vm::(_startUnderlyingVm) vmId=`61b2210e-e499-4db1-b72c-8fc125ed9be5`::_ongoingCreations released
Thread-3493871::ERROR::2013-05-27 15:45:33,728::vm::631::vm.Vm::(_startUnderlyingVm) vmId=`61b2210e-e499-4db1-b72c-8fc125ed9be5`::The vm start process failed [mku: first occurrence of KeyError: 'truesize']
Traceback (most recent call last):
  File "/usr/share/vdsm/vm.py", line 597, in _startUnderlyingVm
  File "/usr/share/vdsm/libvirtvm.py", line 1373, in _run
  File "/usr/share/vdsm/vm.py", line 454, in buildConfDevices
  File "/usr/share/vdsm/vm.py", line 381, in _normalizeVdsmImg
KeyError: 'truesize'
Thread-3493871::DEBUG::2013-05-27 15:45:33,731::vm::965::vm.Vm::(setDownStatus) vmId=`61b2210e-e499-4db1-b72c-8fc125ed9be5`::Changed state to Down: 'truesize'

-------

After checking for the disk with the ls command, we can see that the direct location works, but the symbolic link does not:

/rhev/data-center/f4759db8-a99e-11e1-a99c-cf4a75f008a7/a5b877e7-eb1a-48a2-95d9-6dc734f74ea2/images/13a616a8-ea90-420c-8a23-0070e34c9cfd/cca97c0e-6f5a-44ff-bde3-7c33da3c321f

We only have the direct path:

customer_storage_server:/vol/storage/rhev /rhev/data-center/mnt/customer_storage_server:export_folder/a5b877e7-eb1a-48a2-95d9-6dc734f74ea2/images/13a616a8-ea90-420c-8a23-0070e34c9cfd/cca97c0e-6f5a-44ff-bde3-7c33da3c321f

After restarting the vdsmd service on the host, the issue was resolved and the VMs started successfully.

Note: putting a host into maintenance is not always possible in production setups.

I would appreciate some guidance in finding the root cause of this issue. I will provide logs soon.
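For reference, a minimal sketch (not vdsm code) of the check described above, assuming the pool/domain/image/volume UUIDs from the logs; the export directory name "customer_storage_server:export_folder" is only illustrative. It resolves the pool-level domain symlink and compares it with the direct mount path:

import os

# UUIDs copied from the log excerpts above; the export directory name is illustrative.
POOL = "f4759db8-a99e-11e1-a99c-cf4a75f008a7"
DOMAIN = "a5b877e7-eb1a-48a2-95d9-6dc734f74ea2"
IMAGE = "13a616a8-ea90-420c-8a23-0070e34c9cfd"
VOLUME = "cca97c0e-6f5a-44ff-bde3-7c33da3c321f"

# For an NFS domain, /rhev/data-center/<pool>/<domain> is a symlink created
# by vdsm when the pool is connected, pointing into /rhev/data-center/mnt/<server:export>/<domain>.
domain_link = os.path.join("/rhev/data-center", POOL, DOMAIN)
volume_via_link = os.path.join(domain_link, "images", IMAGE, VOLUME)
volume_direct = os.path.join(
    "/rhev/data-center/mnt", "customer_storage_server:export_folder",
    DOMAIN, "images", IMAGE, VOLUME)

print("domain link is a symlink:", os.path.islink(domain_link))
if os.path.islink(domain_link):
    target = os.readlink(domain_link)
    print("domain link target:", target, "| target exists:", os.path.exists(target))
print("volume reachable via link tree:", os.path.exists(volume_via_link))
print("volume reachable via mount path:", os.path.exists(volume_direct))

In the state described above, the last two checks would disagree (the mount path exists, the link tree does not) until vdsmd is restarted and recreates the links.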
It sounds to me more like a copy of this bug: https://bugzilla.redhat.com/show_bug.cgi?id=960952

I encountered the same issue today after LSM (live storage migration) failed during the createVolume step; when I tried re-running the VM, it failed to run or migrate on one of the hosts because of wrong linking.

Thread-262::ERROR::2013-05-28 15:55:38,012::vm::704::vm.Vm::(_startUnderlyingVm) vmId=`0780cbb2-fc29-48b1-88d9-0838f6920ec3`::The vm start process failed
Traceback (most recent call last):
  File "/usr/share/vdsm/vm.py", line 664, in _startUnderlyingVm
    self._run()
  File "/usr/share/vdsm/libvirtvm.py", line 1458, in _run
    self.preparePaths(devices[vm.DISK_DEVICES])
  File "/usr/share/vdsm/vm.py", line 725, in preparePaths
    drive['path'] = self.cif.prepareVolumePath(drive, self.id)
  File "/usr/share/vdsm/clientIF.py", line 275, in prepareVolumePath
    raise vm.VolumeError(drive)

This is the link that is sent to one host:

[root@cougar02 ~]# ls -l /rhev/data-center/7fd33b43-a9f4-4eb7-a885-e9583a929ceb/81ef11d0-4c0c-47b4-8953-d61a6af442d8/images/a6d6df52-a2d9-4598-a28d-0d22fe7e02a5/b5038832-5c63-4f9d-a11b-d7aeb93edbb1
ls: cannot access /rhev/data-center/7fd33b43-a9f4-4eb7-a885-e9583a929ceb/81ef11d0-4c0c-47b4-8953-d61a6af442d8/images/a6d6df52-a2d9-4598-a28d-0d22fe7e02a5/b5038832-5c63-4f9d-a11b-d7aeb93edbb1: No such file or directory

This is the correct link:

[root@cougar02 ~]# ls -l /rhev/data-center/7fd33b43-a9f4-4eb7-a885-e9583a929ceb/81ef11d0-4c0c-47b4-8953-d61a6af442d8/images/a6d6df52-a2d9-4598-a28d-0d22fe7e02a5/7a3202f1-a98e-4f6e-b1b8-d8cf02aa61c5
lrwxrwxrwx. 1 vdsm kvm 78 May 28 15:51 /rhev/data-center/7fd33b43-a9f4-4eb7-a885-e9583a929ceb/81ef11d0-4c0c-47b4-8953-d61a6af442d8/images/a6d6df52-a2d9-4598-a28d-0d22fe7e02a5/7a3202f1-a98e-4f6e-b1b8-d8cf02aa61c5 -> /dev/81ef11d0-4c0c-47b4-8953-d61a6af442d8/7a3202f1-a98e-4f6e-b1b8-d8cf02aa61c5

I will attach the logs.
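A quick way to compare the volume link the engine asked for with the links that actually exist is to list the image directory and resolve each entry. This is only a sketch (not vdsm code), using the image directory path from the ls output above; on a block domain each volume entry is a symlink to an LV under /dev/<vg-uuid>/.

import os

# Image directory copied from the ls output above (block domain, so each
# volume entry should be a symlink to a logical volume).
image_dir = ("/rhev/data-center/7fd33b43-a9f4-4eb7-a885-e9583a929ceb/"
             "81ef11d0-4c0c-47b4-8953-d61a6af442d8/images/"
             "a6d6df52-a2d9-4598-a28d-0d22fe7e02a5")

for name in sorted(os.listdir(image_dir)):
    path = os.path.join(image_dir, name)
    if os.path.islink(path):
        # Show where each link points and whether the target is reachable.
        print(name, "->", os.readlink(path),
              "| target exists:", os.path.exists(path))
    else:
        print(name)

Here the listing would show only 7a3202f1-a98e-4f6e-b1b8-d8cf02aa61c5, while the engine asked the host to prepare b5038832-5c63-4f9d-a11b-d7aeb93edbb1, which is why prepareVolumePath raises VolumeError.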
Created attachment 753907
Logs from Dafna's test.