Bug 967676 - After losing connectivity with NFS storage some VMs unable to start with 'truesize' error
Summary: After losing connectivity with NFS storage some VMs unable to start with 'truesize' error
Keywords:
Status: CLOSED DUPLICATE of bug 960952
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.1.2
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 3.3.0
Assignee: Nobody's working on this, feel free to take it
QA Contact:
URL:
Whiteboard: storage
Depends On:
Blocks:
 
Reported: 2013-05-28 01:01 UTC by Marina Kalinin
Modified: 2016-02-10 16:55 UTC
CC: 8 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-05-28 17:08:38 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:


Attachments
logs from dafna's test (1.87 MB, application/x-gzip)
2013-05-28 13:07 UTC, Dafna Ron

Description Marina Kalinin 2013-05-28 01:01:07 UTC
RHEV 3.1.2:
rhevm-3.1.0-43.el6ev.noarch
RHEV-H 6.3 - 20130129.0.el6_3
vdsm-4.9.6-45.2.el6_3

Customer has an NFS DC with 3 Data Domains.
All 3 went down for some time, with the following errors in vdsm.log:
Thread-3481568::WARNING::2013-05-27 13:01:05,696::remoteFileHandler::185::Storage.CrabRPCProxy::(callCrabRPCFunction) Problem with handler, treating as timeout
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 177, in callCrabRPCFunction
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 143, in _recvAll
Timeout
Thread-1309111::ERROR::2013-05-27 13:01:05,697::task::853::TaskManager.Task::(_setError) Task=`e063e1a4-385c-4394-b7cc-b4204f422d4b`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 861, in _run
  File "/usr/share/vdsm/logUtils.py", line 38, in wrapper
  File "/usr/share/vdsm/storage/hsm.py", line 2718, in getVolumeSize
  File "/usr/share/vdsm/storage/volume.py", line 321, in getVSize
  File "/usr/share/vdsm/storage/sdc.py", line 97, in produce
  File "/usr/share/vdsm/storage/sdc.py", line 52, in getRealDomain
  File "/usr/share/vdsm/storage/sdc.py", line 121, in _realProduce
  File "/usr/share/vdsm/storage/sdc.py", line 152, in _findDomain
StorageDomainDoesNotExist: Storage domain does not exist: ('84c8878d-ce58-42ef-bccc-b318da754b53',)

At some point the storage recovered, but the customer could not start some VMs, which failed with this error:
Thread-3493871::ERROR::2013-05-27 15:45:33,727::dispatcher::69::Storage.Dispatcher.Protect::(run) [Errno 2] No such file or directory: '/rhev/data-center/f4759db8-a99e-11e1-a99c-cf4a75f008a7/a5b877e7-eb1a-48a2-95d9-6dc734f74ea2/images/c4e4f09a-f9ab-4867-81e5-7ff458ef5ccb/80f2743c-5f81-43f1-bcfc-2c18fdb7f856'
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/dispatcher.py", line 61, in run
  File "/usr/share/vdsm/storage/task.py", line 1164, in prepare
OSError: [Errno 2] No such file or directory: '/rhev/data-center/f4759db8-a99e-11e1-a99c-cf4a75f008a7/a5b877e7-eb1a-48a2-95d9-6dc734f74ea2/images/c4e4f09a-f9ab-4867-81e5-7ff458ef5ccb/80f2743c-5f81-43f1-bcfc-2c18fdb7f856'
Thread-3493871::DEBUG::2013-05-27 15:45:33,727::vm::607::vm.Vm::(_startUnderlyingVm) vmId=`61b2210e-e499-4db1-b72c-8fc125ed9be5`::_ongoingCreations released
Thread-3493871::ERROR::2013-05-27 15:45:33,728::vm::631::vm.Vm::(_startUnderlyingVm) vmId=`61b2210e-e499-4db1-b72c-8fc125ed9be5`::The vm start process failed
[mku: first occurrence of KeyError: 'truesize']
Traceback (most recent call last):
  File "/usr/share/vdsm/vm.py", line 597, in _startUnderlyingVm
  File "/usr/share/vdsm/libvirtvm.py", line 1373, in _run
  File "/usr/share/vdsm/vm.py", line 454, in buildConfDevices
  File "/usr/share/vdsm/vm.py", line 381, in _normalizeVdsmImg
KeyError: 'truesize'
Thread-3493871::DEBUG::2013-05-27 15:45:33,731::vm::965::vm.Vm::(setDownStatus) vmId=`61b2210e-e499-4db1-b72c-8fc125ed9be5`::Changed state to Down: 'truesize'
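
For context, a minimal sketch of how this can surface (illustrative Python only, not the actual vdsm source; the drive dict contents are assumed): while the domain was unreachable, the volume size query failed, so the drive dict was never populated with a 'truesize' key, and a plain dict lookup later raises the KeyError seen above.

# Illustrative only -- not vdsm code. The key is missing because the
# volume size query failed while the storage was down.
drive = {'volumeID': '80f2743c-5f81-43f1-bcfc-2c18fdb7f856',
         'device': 'disk'}        # 'truesize' was never filled in
truesize = drive['truesize']      # raises KeyError: 'truesize'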
-------

After checking the disk presence with the ls command, we can see that the direct path works, but the symbolic link does not:
/rhev/data-center/f4759db8-a99e-11e1-a99c-cf4a75f008a7/a5b877e7-eb1a-48a2-95d9-6dc734f74ea2/images/13a616a8-ea90-420c-8a23-0070e34c9cfd/cca97c0e-6f5a-44ff-bde3-7c33da3c321f
We only have the direct path:
customer_storage_server:/vol/storage/rhev 
/rhev/data-center/mnt/customer_storage_server:export_folder/a5b877e7-eb1a-48a2-95d9-6dc734f74ea2/images/13a616a8-ea90-420c-8a23-0070e34c9cfd/cca97c0e-6f5a-44ff-bde3-7c33da3c321f
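
A quick way to tell a dangling symlink apart from a genuinely missing file is the Python standard library (a minimal sketch; the path is the one quoted above):

import os

path = ('/rhev/data-center/f4759db8-a99e-11e1-a99c-cf4a75f008a7/'
        'a5b877e7-eb1a-48a2-95d9-6dc734f74ea2/images/'
        '13a616a8-ea90-420c-8a23-0070e34c9cfd/'
        'cca97c0e-6f5a-44ff-bde3-7c33da3c321f')

print(os.path.lexists(path))   # True if the link itself exists
print(os.path.exists(path))    # False if the link or its target is missing
print(os.path.realpath(path))  # where the link actually resolves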

After restarting the vdsmd service on the host, the issue was resolved and the VMs started successfully. Note: putting a host into maintenance is not always possible in production setups.
I would appreciate some guidance in finding the root cause of this issue.
Providing logs soon.

Comment 3 Dafna Ron 2013-05-28 13:06:21 UTC
This sounds to me more like a duplicate of this bug: https://bugzilla.redhat.com/show_bug.cgi?id=960952

I encountered the same issue today after LSM failed during the createVolume step. When I tried re-running the VM, it failed to run or migrate on one of the hosts because of wrong linking.

Thread-262::ERROR::2013-05-28 15:55:38,012::vm::704::vm.Vm::(_startUnderlyingVm) vmId=`0780cbb2-fc29-48b1-88d9-0838f6920ec3`::The vm start process failed
Traceback (most recent call last):
  File "/usr/share/vdsm/vm.py", line 664, in _startUnderlyingVm
    self._run()
  File "/usr/share/vdsm/libvirtvm.py", line 1458, in _run
    self.preparePaths(devices[vm.DISK_DEVICES])
  File "/usr/share/vdsm/vm.py", line 725, in preparePaths
    drive['path'] = self.cif.prepareVolumePath(drive, self.id)
  File "/usr/share/vdsm/clientIF.py", line 275, in prepareVolumePath
    raise vm.VolumeError(drive)

This is the link that was sent to one host:

[root@cougar02 ~]# ls -l /rhev/data-center/7fd33b43-a9f4-4eb7-a885-e9583a929ceb/81ef11d0-4c0c-47b4-8953-d61a6af442d8/images/a6d6df52-a2d9-4598-a28d-0d22fe7e02a5/b5038832-5c63-4f9d-a11b-d7aeb93edbb1
ls: cannot access /rhev/data-center/7fd33b43-a9f4-4eb7-a885-e9583a929ceb/81ef11d0-4c0c-47b4-8953-d61a6af442d8/images/a6d6df52-a2d9-4598-a28d-0d22fe7e02a5/b5038832-5c63-4f9d-a11b-d7aeb93edbb1: No such file or directory


This is the correct link:

[root@cougar02 ~]# ls -l /rhev/data-center/7fd33b43-a9f4-4eb7-a885-e9583a929ceb/81ef11d0-4c0c-47b4-8953-d61a6af442d8/images/a6d6df52-a2d9-4598-a28d-0d22fe7e02a5/7a3202f1-a98e-4f6e-b1b8-d8cf02aa61c5 
lrwxrwxrwx. 1 vdsm kvm 78 May 28 15:51 /rhev/data-center/7fd33b43-a9f4-4eb7-a885-e9583a929ceb/81ef11d0-4c0c-47b4-8953-d61a6af442d8/images/a6d6df52-a2d9-4598-a28d-0d22fe7e02a5/7a3202f1-a98e-4f6e-b1b8-d8cf02aa61c5 -> /dev/81ef11d0-4c0c-47b4-8953-d61a6af442d8/7a3202f1-a98e-4f6e-b1b8-d8cf02aa61c5
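
To spot the mismatch quickly, one can list what is actually linked under the image directory and compare it against the volume the VM requests (an illustrative sketch, not vdsm code; paths are taken from above):

import os

image_dir = ('/rhev/data-center/7fd33b43-a9f4-4eb7-a885-e9583a929ceb/'
             '81ef11d0-4c0c-47b4-8953-d61a6af442d8/images/'
             'a6d6df52-a2d9-4598-a28d-0d22fe7e02a5')
requested = 'b5038832-5c63-4f9d-a11b-d7aeb93edbb1'  # volume ID the VM asks for

names = os.listdir(image_dir)
for name in names:
    print(name, '->', os.path.realpath(os.path.join(image_dir, name)))
print('requested volume link present:', requested in names)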

I will attach the logs.

Comment 4 Dafna Ron 2013-05-28 13:07:39 UTC
Created attachment 753907 [details]
logs from dafna's test

