Created attachment 712611 [details]
logs

Description of problem:
deleteImage failed with a timeout and then vdsm crashed:

Thread-4438::ERROR::2013-03-18 15:28:22,968::task::842::TaskManager.Task::(_setError) Task=`5af6236e-0ff7-45a1-83eb-27697b260725`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 849, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 38, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 1470, in deleteImage
    sdUUID, imgUUID, volsByImg)
  File "/usr/share/vdsm/storage/hsm.py", line 762, in _spmSchedule
    name, func, *args)
  File "/usr/share/vdsm/storage/taskManager.py", line 64, in scheduleJob
    task.setPersistence(store, cleanPolicy=TaskCleanType.manual)
  File "/usr/share/vdsm/storage/task.py", line 1085, in setPersistence
    self.persist()
  File "/usr/share/vdsm/storage/task.py", line 1107, in persist
    self._save(self.store)
  File "/usr/share/vdsm/storage/task.py", line 754, in _save
    getProcPool().fileUtils.fsyncPath(origTaskDir)
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 275, in callCrabRPCFunction
    *args, **kwargs)
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 180, in callCrabRPCFunction
    rawLength = self._recvAll(LENGTH_STRUCT_LENGTH, timeout)
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 146, in _recvAll
    raise Timeout()
Timeout

Version-Release number of selected component (if applicable):
sf10
vdsm-4.10.2-11.0.el6ev.x86_64

How reproducible:

Steps to Reproduce:
1. Create 2 VMs on iSCSI storage with 2 thin provision disks on each VM.
2. Run the VMs and live migrate all the disks twice (wait for the first move to end, then move the disks again).
3.

Actual results:
deleteImage fails with a timeout and vdsm crashes.

Expected results:
We should succeed in migrating the disks.

Additional info:
logs
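For context on the traceback: the failing call goes through VDSM's out-of-process file handler, which reads a length-prefixed reply from a helper process and gives up after a fixed deadline. Below is a minimal sketch of that pattern, not the actual remoteFileHandler code; the names recv_exact and pipe_fd are illustrative.

import os
import select
import struct


class Timeout(Exception):
    """Raised when the helper process does not reply within the deadline."""


LENGTH_STRUCT = struct.Struct("Q")  # illustrative fixed-size length prefix


def recv_exact(fd, length, timeout):
    """Read exactly `length` bytes from `fd`, raising Timeout if the
    descriptor does not become readable within `timeout` seconds."""
    data = b""
    while len(data) < length:
        ready, _, _ = select.select([fd], [], [], timeout)
        if not ready:
            # This is what surfaces as the bare `Timeout` at the bottom of
            # the traceback: the helper is stuck on slow storage and no
            # reply arrives before the deadline.
            raise Timeout()
        chunk = os.read(fd, length - len(data))
        if not chunk:
            raise IOError("helper process closed the pipe")
        data += chunk
    return data

# Hypothetical usage, mirroring the _recvAll call in the traceback:
#   raw_length = recv_exact(pipe_fd, LENGTH_STRUCT.size, timeout=60)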
Fede, please take a look to see what's going on here.
The provided logs are:

vdsm.log.1: from 2013-03-18 07:01:01 to 2013-03-18 15:54:03
vdsm.log: from 2013-03-18 17:28:51 to 2013-03-19 12:44:10

I believe a log is missing (15:54:03 - 17:28:51).

The traceback reported in the bug description belongs to vdsm.log.1:

Thread-4438::ERROR::2013-03-18 15:28:22,968::task::842::TaskManager.Task::(_setError) Task=`5af6236e-0ff7-45a1-83eb-27697b260725`::Unexpected error

But in vdsm.log.1 there is no trace of any live storage migration:

$ egrep -c "(call (vmSnapshot|vmDiskReplicate))|(Run and protect: (cloneImageStructure|syncImageData))" vdsm.log.1
0

What I can see instead is a series of createVolume/deleteImage calls:

$ egrep -c "Run and protect: (createVolume|deleteImage)\(" vdsm.log.1
173

Probably a scale test (note that all the deleteImage calls had postZero='true').

During this time a few deleteImage tasks were not able to update and persist their status within the 1 minute timeout (process_pool_timeout):

Thread-4343::WARNING::2013-03-18 15:25:36,625::task::588::TaskManager.Task::(_updateState) Task=`fa2c4108-dfec-4b78-9705-d0483ca91a25`::Task._updateState: failed persisting task fa2c4108-dfec-4b78-9705-d0483ca91a25
Thread-4385::WARNING::2013-03-18 15:26:56,873::task::588::TaskManager.Task::(_updateState) Task=`b58395ca-67e2-47c5-b9e8-e01020db4716`::Task._updateState: failed persisting task b58395ca-67e2-47c5-b9e8-e01020db4716
Thread-4438::ERROR::2013-03-18 15:28:22,968::task::842::TaskManager.Task::(_setError) Task=`5af6236e-0ff7-45a1-83eb-27697b260725`::Unexpected error

I assume that the excessive load on the storage both prevented the tasks from being persisted within the timeout and also triggered the auto fencing mechanism that killed vdsm:

MainThread::INFO::2013-03-18 15:28:32,825::logUtils::37::dispatcher::(wrapper) Run and protect: prepareForShutdown(options=None)
MainThread::DEBUG::2013-03-18 15:28:33,198::clientIF::231::vds::(prepareForShutdown) cannot run prepareForShutdown concurrently
MainThread::INFO::2013-03-18 15:28:33,680::logUtils::39::dispatcher::(wrapper) Run and protect: prepareForShutdown, Return response: None
MainThread::INFO::2013-03-18 15:28:33,947::vdsm::89::vds::(run) I am the actual vdsm 4.10-11.0 gold-vdsc.qa.lab.tlv.redhat.com (2.6.32-358.2.1.el6.x86_64)

Then, during the recovery process, bug 922515 appeared:

MainThread::ERROR::2013-03-18 15:28:34,534::clientIF::263::vds::(_initIRS) Error initializing IRS
...
AttributeError: 'list' object has no attribute 'split'

Now, considering vdsm.log, we can notice some calls related to live storage migration (unrelated to the traceback mentioned in the bug description):

$ egrep -c "(call (vmSnapshot|vmDiskReplicate))|(Run and protect: (cloneImageStructure|syncImageData))" vdsm.log
42

If we consider only the heavy tasks involved in live storage migration, there were 4 syncImageData calls running at the same time:

$ grep "Run and protect: syncImageData(" vdsm.log | cut -c 31-118
21:52:35,196::logUtils::37::dispatcher::(wrapper) Run and protect: syncImageData(spUUID=
21:52:37,929::logUtils::37::dispatcher::(wrapper) Run and protect: syncImageData(spUUID=
21:53:07,057::logUtils::37::dispatcher::(wrapper) Run and protect: syncImageData(spUUID=
21:53:46,993::logUtils::37::dispatcher::(wrapper) Run and protect: syncImageData(spUUID=

on top of which we have to consider the VM data replication.
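For reference, the operation that hit the timeout in the original traceback is a directory fsync of the task dir on the master domain, executed through the remote handler pool. A minimal sketch of what such an fsync amounts to (illustrative code, not the actual fileUtils.fsyncPath implementation); on an overloaded domain this single call alone can stall beyond the 1 minute deadline.

import os


def fsync_path(path):
    """Open a file or directory read-only and fsync it, forcing its data
    and metadata out to storage; on a heavily loaded storage domain this
    can block for a long time."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)

# Hypothetical usage against a task directory like the ones in the logs:
#   fsync_path('/rhev/data-center/<pool>/mastersd/master/tasks/<task-id>')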
I also did not notice any deleteImage during the live storage migration (contrary to the bug description), since the first one appears ~13 hours later:

Thread-85::INFO::2013-03-19 10:50:43,152::logUtils::37::dispatcher::(wrapper) Run and protect: deleteImage(sdUUID='a9ebf0f9-b784-4d6f-837f-f2580a58841d', spUUID='9bfcecb2-417d-462d-96c4-3013793b7fe9', imgUUID='e6e6bcbc-9a18-4e12-a558-7ae595e2ef4b', postZero='false', force='false')

This probably again triggered the excessive load on the storage that prevented the tasks from being persisted:

Thread-7112::WARNING::2013-03-18 21:55:32,852::task::588::TaskManager.Task::(_updateState) Task=`fbc6922e-1c4d-4636-98e3-734bee50db05`::Task._updateState: failed persisting task fbc6922e-1c4d-4636-98e3-734bee50db05

Finally, autofencing was triggered again:

MainThread::INFO::2013-03-18 21:55:50,593::logUtils::37::dispatcher::(wrapper) Run and protect: prepareForShutdown(options=None)
MainThread::DEBUG::2013-03-18 21:55:50,965::clientIF::231::vds::(prepareForShutdown) cannot run prepareForShutdown concurrently
MainThread::INFO::2013-03-18 21:55:51,158::logUtils::39::dispatcher::(wrapper) Run and protect: prepareForShutdown, Return response: None
MainThread::INFO::2013-03-18 21:55:51,338::vdsm::89::vds::(run) I am the actual vdsm 4.10-11.0 gold-vdsc.qa.lab.tlv.redhat.com (2.6.32-358.2.1.el6.x86_64)

and during the recovery bug 922515 appeared:

MainThread::ERROR::2013-03-18 21:55:51,991::clientIF::263::vds::(_initIRS) Error initializing IRS
...
AttributeError: 'list' object has no attribute 'split'

The same errors are present in both logs during different flows, and there is nothing specific to live storage migration. If the fencing mechanism hadn't been triggered I could have considered an issue in crabRpc (at least as far as persisting the tasks is concerned), but everything makes me think of a real storage overload (or problem) that caused a high latency.
Dafna, can you attach the missing log + sanlock log? Also, can you describe the full flow?
Also, why is this a regression?
This bug is not a load issue. I have reproduced the scenario again and am attaching the logs.

iSCSI storage, 2 VMs with two thin provision disks of 10GB each (NOT wipe-after-delete disks); each VM is running on a different host.

Select the VM and move both disks (for each of the VMs). Once the move is done, move the disks again.

Results: we fail to move the disks the second time for both VMs. The VM running on the HSM fails the move because we failed to remove the volume during the first move. The VM running on the SPM fails because vdsm crashed.

Full logs attached, including sanlock, libvirt and the VMs' qemu logs.
Created attachment 731058 [details]
logs
I found multiple issues in the attached logs.

With regard to:

Thread-449::ERROR::2013-04-03 10:56:24,726::BindingXMLRPC::932::vds::(wrapper) unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/BindingXMLRPC.py", line 918, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/BindingXMLRPC.py", line 345, in vmDiskReplicateStart
    return vm.diskReplicateStart(srcDisk, dstDisk)
  File "/usr/share/vdsm/API.py", line 520, in diskReplicateStart
    return v.diskReplicateStart(srcDisk, dstDisk)
  File "/usr/share/vdsm/libvirtvm.py", line 2271, in diskReplicateStart
    self._setDiskReplica(srcDrive, dstDisk)
  File "/usr/share/vdsm/libvirtvm.py", line 2241, in _setDiskReplica
    self.saveState()
  File "/usr/share/vdsm/libvirtvm.py", line 2509, in saveState
    vm.Vm.saveState(self)
  File "/usr/share/vdsm/vm.py", line 761, in saveState
    toSave = deepcopy(self.status())
  File "/usr/lib64/python2.6/copy.py", line 162, in deepcopy
    y = copier(x, memo)
  File "/usr/lib64/python2.6/copy.py", line 255, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/usr/lib64/python2.6/copy.py", line 162, in deepcopy
    y = copier(x, memo)
  File "/usr/lib64/python2.6/copy.py", line 228, in _deepcopy_list
    y.append(deepcopy(a, memo))
  File "/usr/lib64/python2.6/copy.py", line 162, in deepcopy
    y = copier(x, memo)
  File "/usr/lib64/python2.6/copy.py", line 254, in _deepcopy_dict
    for key, value in x.iteritems():
RuntimeError: dictionary changed size during iteration

VDSM failed to update (save) the state of a VM because two concurrent vmDiskReplicateStart calls modified the VM configuration at the same time:

Thread-449::DEBUG::2013-04-03 10:56:24,717::BindingXMLRPC::913::vds::(wrapper) client [10.35.161.131]::call vmDiskReplicateStart with ('c3cdb482-8472-4f8a-b2ee-332118d467d1', {'device': 'disk', 'domainID': '6829602d-352a-40a4-af70-376f6e498f85', 'volumeID': '7ec4cbcc-1f0a-48b4-8ba3-000b42a0701f', 'poolID': '574d2c32-013c-4210-ab82-334188bd6171', 'imageID': '88929489-c525-49f9-9b1b-4efe28c4a706'}, {'device': 'disk', 'domainID': 'a3282596-8f78-4930-bb76-bebeb657babf', 'volumeID': '7ec4cbcc-1f0a-48b4-8ba3-000b42a0701f', 'poolID': '574d2c32-013c-4210-ab82-334188bd6171', 'imageID': '88929489-c525-49f9-9b1b-4efe28c4a706'}) {} flowID [2904eb87]

Thread-450::DEBUG::2013-04-03 10:56:24,724::BindingXMLRPC::913::vds::(wrapper) client [10.35.161.131]::call vmDiskReplicateStart with ('c3cdb482-8472-4f8a-b2ee-332118d467d1', {'device': 'disk', 'domainID': '6829602d-352a-40a4-af70-376f6e498f85', 'volumeID': 'b0463b99-16a0-4ca7-b9b3-ff370dc200e4', 'poolID': '574d2c32-013c-4210-ab82-334188bd6171', 'imageID': 'f4eca5b2-1f0f-4d6e-9da7-fbee2d9e532e'}, {'device': 'disk', 'domainID': 'a3282596-8f78-4930-bb76-bebeb657babf', 'volumeID': 'b0463b99-16a0-4ca7-b9b3-ff370dc200e4', 'poolID': '574d2c32-013c-4210-ab82-334188bd6171', 'imageID': 'f4eca5b2-1f0f-4d6e-9da7-fbee2d9e532e'}) {} flowID [2904eb87]

That can be easily resolved in VDSM, but I suggest opening a bug on the engine too, as this exception should have been handled (one of the vmDiskReplicateStart calls failed => retry or roll back to the source). If needed, I can provide a custom VDSM that triggers the exception without having to reproduce the entire scenario.

Besides that, I found again traces of storage overload:

MainThread::INFO::2013-04-03 11:03:14,647::logUtils::37::dispatcher::(wrapper) Run and protect: prepareForShutdown(options=None)
...
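The failure mode can be reproduced outside VDSM: deep-copying a dict while another thread mutates it raises exactly this RuntimeError. Below is a minimal standalone sketch (illustrative only; the dict and threads merely stand in for the VM configuration and the two vmDiskReplicateStart threads), and since it is a race it does not fire on every run.

import threading
from copy import deepcopy

# Stand-in for the VM configuration that saveState() deep-copies while
# vmDiskReplicateStart mutates it from another thread.
vm_conf = {'devices': [{'index': i} for i in range(1000)]}


def mutator():
    # Simulates the second replication call adding drive entries while
    # the first one is still copying the configuration.
    for i in range(100000):
        vm_conf['drive%d' % i] = {'replica': i}


t = threading.Thread(target=mutator)
t.start()
try:
    while t.is_alive():
        # May raise "RuntimeError: dictionary changed size during
        # iteration" when the mutator resizes vm_conf mid-copy.
        deepcopy(vm_conf)
except RuntimeError as e:
    print('reproduced: %s' % e)
t.join()

The VDSM-side fix presumably amounts to serializing saveState() against the configuration updates (e.g. copying under the VM's lock), which is what "easily resolved in VDSM" refers to.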
6bbc5822-3e98-4d13-8c7c-92e62d4006a6::WARNING::2013-04-03 11:03:53,802::task::579::TaskManager.Task::(_updateState) Task=`6bbc5822-3e98-4d13-8c7c-92e62d4006a6`::Task._updateState: failed persisting task 6bbc5822-3e98-4d13-8c7c-92e62d4006a6
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 576, in _updateState
    self.persist()
  File "/usr/share/vdsm/storage/task.py", line 1098, in persist
    self._save(self.store)
  File "/usr/share/vdsm/storage/task.py", line 717, in _save
    raise se.TaskDirError("_save: no such task dir '%s'" % origTaskDir)
TaskDirError: can't find/access task dir: ("_save: no such task dir '/rhev/data-center/574d2c32-013c-4210-ab82-334188bd6171/mastersd/master/tasks/6bbc5822-3e98-4d13-8c7c-92e62d4006a6'",)
...
MainThread::INFO::2013-04-03 11:03:54,328::vdsm::89::vds::(run) I am the actual vdsm 4.10-12.0 cougar01.scl.lab.tlv.redhat.com (2.6.32-358.2.1.el6.x86_64)

2013-04-03 11:03:12+0300 1114 [6957]: s1 check_our_lease warning 78 last_success 1036
2013-04-03 11:03:13+0300 1115 [6957]: s1 check_our_lease warning 79 last_success 1036
2013-04-03 11:03:14+0300 1116 [6957]: s1 check_our_lease failed 80
2013-04-03 11:03:14+0300 1116 [6957]: s1 kill 10381 sig 15 count 1
2013-04-03 11:03:15+0300 1117 [6957]: s1 kill 10381 sig 15 count 2
...
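Note that the "failed persisting task" lines are warnings rather than fatal errors because the state transition catches the persistence failure and only logs it. A simplified sketch of that behaviour (not the actual task.py code; TaskDirError here is a local stand-in for the VDSM storage exception):

import logging
import os


class TaskDirError(Exception):
    """Local stand-in for the VDSM storage exception of the same name."""


class Task(object):
    """Tiny sketch of the persistence behaviour visible in the logs above;
    the real VDSM Task class is far larger."""

    log = logging.getLogger("TaskManager.Task")

    def __init__(self, task_id, store):
        self.id = task_id
        self.store = store  # e.g. .../mastersd/master/tasks
        self.state = "init"

    def _save(self):
        task_dir = os.path.join(self.store, self.id)
        if not os.path.isdir(task_dir):
            # Matches the TaskDirError in the traceback: the directory on
            # the master domain is gone (or never became reachable).
            raise TaskDirError("_save: no such task dir '%s'" % task_dir)
        # The real code also serializes the task metadata and fsyncs the
        # directory through the remote file handler pool.

    def _updateState(self, new_state):
        self.state = new_state
        try:
            self._save()
        except Exception:
            # Persistence failures during a state change are swallowed and
            # surface only as the "failed persisting task" warning.
            self.log.warning("Task._updateState: failed persisting task %s",
                             self.id, exc_info=True)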
The storage overload might be a different issue, since the scenario is done with only 2 VMs with 2 thin provision disks on each VM. This is not a load environment and should not cause a problem.
As discussed with Ayal, I'm moving this to the backend. This issue should be fixed there by handling the possible return status {'message': 'Unexpected exception', 'code': 16} (to be treated as {'code': 55, 'message': 'Drive replication error'}).
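As a language-neutral illustration of the suggested handling (the actual fix belongs in the engine; the error codes are quoted from this comment, everything else below is hypothetical):

# Error codes quoted in the comment above; the rest is an illustrative
# sketch, not engine or VDSM code.
UNEXPECTED_EXCEPTION = 16
DRIVE_REPLICATION_ERROR = 55


def normalize_replicate_status(status):
    """Map the generic 'Unexpected exception' returned by
    vmDiskReplicateStart to the specific drive replication error that the
    caller already knows how to handle (retry or roll back to the source)."""
    if status.get('code') == UNEXPECTED_EXCEPTION:
        return {'code': DRIVE_REPLICATION_ERROR,
                'message': 'Drive replication error'}
    return status


# Hypothetical usage:
#   normalize_replicate_status({'message': 'Unexpected exception', 'code': 16})
#   -> {'code': 55, 'message': 'Drive replication error'}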
sf13.1: tested according to the Steps to Reproduce.
3.2 has been released