Created attachment 712611 [details]
logs

Description of problem:
deleteImage failed with a timeout and then vdsm crashed:

Thread-4438::ERROR::2013-03-18 15:28:22,968::task::842::TaskManager.Task::(_setError) Task=`5af6236e-0ff7-45a1-83eb-27697b260725`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 849, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 38, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 1470, in deleteImage
    sdUUID, imgUUID, volsByImg)
  File "/usr/share/vdsm/storage/hsm.py", line 762, in _spmSchedule
    name, func, *args)
  File "/usr/share/vdsm/storage/taskManager.py", line 64, in scheduleJob
    task.setPersistence(store, cleanPolicy=TaskCleanType.manual)
  File "/usr/share/vdsm/storage/task.py", line 1085, in setPersistence
    self.persist()
  File "/usr/share/vdsm/storage/task.py", line 1107, in persist
    self._save(self.store)
  File "/usr/share/vdsm/storage/task.py", line 754, in _save
    getProcPool().fileUtils.fsyncPath(origTaskDir)
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 275, in callCrabRPCFunction
    *args, **kwargs)
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 180, in callCrabRPCFunction
    rawLength = self._recvAll(LENGTH_STRUCT_LENGTH, timeout)
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 146, in _recvAll
    raise Timeout()
Timeout

Version-Release number of selected component (if applicable):
sf10
vdsm-4.10.2-11.0.el6ev.x86_64

How reproducible:

Steps to Reproduce:
1. Create 2 VMs on iSCSI storage with 2 thin provision disks on each VM.
2. Run the VMs and live migrate all the disks twice (wait for the first move to end, then move the disks again).
3.

Actual results:
deleteImage fails with a timeout and vdsm crashes.

Expected results:
We should succeed in migrating the disks.

Additional info:
logs
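For context on the traceback: the failing call goes through VDSM's out-of-process file handler, which reads a length-prefixed reply from a helper process and gives up after a fixed deadline. Below is a minimal sketch of that pattern, not the actual remoteFileHandler code; the names recv_exact and pipe_fd are illustrative.

import os
import select
import struct


class Timeout(Exception):
    """Raised when the helper process does not reply within the deadline."""


LENGTH_STRUCT = struct.Struct("Q")  # illustrative fixed-size length prefix


def recv_exact(fd, length, timeout):
    """Read exactly `length` bytes from `fd`, raising Timeout if the
    descriptor does not become readable within `timeout` seconds."""
    data = b""
    while len(data) < length:
        ready, _, _ = select.select([fd], [], [], timeout)
        if not ready:
            # This is what surfaces as the bare `Timeout` at the bottom of
            # the traceback: the helper is stuck on slow storage and no
            # reply arrives before the deadline.
            raise Timeout()
        chunk = os.read(fd, length - len(data))
        if not chunk:
            raise IOError("helper process closed the pipe")
        data += chunk
    return data

# Hypothetical usage, mirroring the _recvAll call in the traceback:
#   raw_length = recv_exact(pipe_fd, LENGTH_STRUCT.size, timeout=60)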
Fede, please take a look to see what's going on here.
The provided logs are:

vdsm.log.1: from 2013-03-18 07:01:01 to 2013-03-18 15:54:03
vdsm.log: from 2013-03-18 17:28:51 to 2013-03-19 12:44:10

I believe a log is missing (15:54:03 - 17:28:51).

The traceback reported in the bug description belongs to vdsm.log.1:

Thread-4438::ERROR::2013-03-18 15:28:22,968::task::842::TaskManager.Task::(_setError) Task=`5af6236e-0ff7-45a1-83eb-27697b260725`::Unexpected error

But in vdsm.log.1 there is no trace of any live storage migration:

$ egrep -c "(call (vmSnapshot|vmDiskReplicate))|(Run and protect: (cloneImageStructure|syncImageData))" vdsm.log.1
0

What I can see instead is a series of createVolume/deleteImage calls:

$ egrep -c "Run and protect: (createVolume|deleteImage)\(" vdsm.log.1
173

Probably a scale test (note that all the deleteImage calls had postZero='true').

During this time a few deleteImage tasks were not able to update and persist their status within the 1 minute timeout (process_pool_timeout):

Thread-4343::WARNING::2013-03-18 15:25:36,625::task::588::TaskManager.Task::(_updateState) Task=`fa2c4108-dfec-4b78-9705-d0483ca91a25`::Task._updateState: failed persisting task fa2c4108-dfec-4b78-9705-d0483ca91a25
Thread-4385::WARNING::2013-03-18 15:26:56,873::task::588::TaskManager.Task::(_updateState) Task=`b58395ca-67e2-47c5-b9e8-e01020db4716`::Task._updateState: failed persisting task b58395ca-67e2-47c5-b9e8-e01020db4716
Thread-4438::ERROR::2013-03-18 15:28:22,968::task::842::TaskManager.Task::(_setError) Task=`5af6236e-0ff7-45a1-83eb-27697b260725`::Unexpected error

I assume that the excessive load on the storage both prevented the tasks from being persisted within the timeout and also triggered the auto fencing mechanism that killed vdsm:

MainThread::INFO::2013-03-18 15:28:32,825::logUtils::37::dispatcher::(wrapper) Run and protect: prepareForShutdown(options=None)
MainThread::DEBUG::2013-03-18 15:28:33,198::clientIF::231::vds::(prepareForShutdown) cannot run prepareForShutdown concurrently
MainThread::INFO::2013-03-18 15:28:33,680::logUtils::39::dispatcher::(wrapper) Run and protect: prepareForShutdown, Return response: None
MainThread::INFO::2013-03-18 15:28:33,947::vdsm::89::vds::(run) I am the actual vdsm 4.10-11.0 gold-vdsc.qa.lab.tlv.redhat.com (2.6.32-358.2.1.el6.x86_64)

Then, during the recovery process, bug 922515 appeared:

MainThread::ERROR::2013-03-18 15:28:34,534::clientIF::263::vds::(_initIRS) Error initializing IRS
...
AttributeError: 'list' object has no attribute 'split'

Now, considering vdsm.log, we can notice some calls related to live storage migration (unrelated to the traceback mentioned in the bug description):

$ egrep -c "(call (vmSnapshot|vmDiskReplicate))|(Run and protect: (cloneImageStructure|syncImageData))" vdsm.log
42

If we consider only the heavy tasks involved in live storage migration, there were 4 syncImageData calls running at the same time:

$ grep "Run and protect: syncImageData(" vdsm.log | cut -c 31-118
21:52:35,196::logUtils::37::dispatcher::(wrapper) Run and protect: syncImageData(spUUID=
21:52:37,929::logUtils::37::dispatcher::(wrapper) Run and protect: syncImageData(spUUID=
21:53:07,057::logUtils::37::dispatcher::(wrapper) Run and protect: syncImageData(spUUID=
21:53:46,993::logUtils::37::dispatcher::(wrapper) Run and protect: syncImageData(spUUID=

on top of which we have to consider the VM data replication.
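For reference, the operation that hit the timeout in the original traceback is a directory fsync of the task dir on the master domain, executed through the remote handler pool. A minimal sketch of what such an fsync amounts to (illustrative code, not the actual fileUtils.fsyncPath implementation); on an overloaded domain this single call alone can stall beyond the 1 minute deadline.

import os


def fsync_path(path):
    """Open a file or directory read-only and fsync it, forcing its data
    and metadata out to storage; on a heavily loaded storage domain this
    can block for a long time."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)

# Hypothetical usage against a task directory like the ones in the logs:
#   fsync_path('/rhev/data-center/<pool>/mastersd/master/tasks/<task-id>')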
I also did not notice any deleteImage during the live storage migration (contrary to the bug description), since the first one appears ~13 hours later:

Thread-85::INFO::2013-03-19 10:50:43,152::logUtils::37::dispatcher::(wrapper) Run and protect: deleteImage(sdUUID='a9ebf0f9-b784-4d6f-837f-f2580a58841d', spUUID='9bfcecb2-417d-462d-96c4-3013793b7fe9', imgUUID='e6e6bcbc-9a18-4e12-a558-7ae595e2ef4b', postZero='false', force='false')

This probably again triggered the excessive load on the storage that prevented the tasks from being persisted:

Thread-7112::WARNING::2013-03-18 21:55:32,852::task::588::TaskManager.Task::(_updateState) Task=`fbc6922e-1c4d-4636-98e3-734bee50db05`::Task._updateState: failed persisting task fbc6922e-1c4d-4636-98e3-734bee50db05

Finally, autofencing was triggered again:

MainThread::INFO::2013-03-18 21:55:50,593::logUtils::37::dispatcher::(wrapper) Run and protect: prepareForShutdown(options=None)
MainThread::DEBUG::2013-03-18 21:55:50,965::clientIF::231::vds::(prepareForShutdown) cannot run prepareForShutdown concurrently
MainThread::INFO::2013-03-18 21:55:51,158::logUtils::39::dispatcher::(wrapper) Run and protect: prepareForShutdown, Return response: None
MainThread::INFO::2013-03-18 21:55:51,338::vdsm::89::vds::(run) I am the actual vdsm 4.10-11.0 gold-vdsc.qa.lab.tlv.redhat.com (2.6.32-358.2.1.el6.x86_64)

and during the recovery bug 922515 appeared:

MainThread::ERROR::2013-03-18 21:55:51,991::clientIF::263::vds::(_initIRS) Error initializing IRS
...
AttributeError: 'list' object has no attribute 'split'

The same errors are present in both logs during different flows, and there is nothing specific to live storage migration. If the fencing mechanism hadn't been triggered I could have considered an issue in crabRpc (at least as far as persisting the tasks is concerned), but everything makes me think of a real storage overload (or problem) that caused a high latency.
Dafna, can you attach the missing log + sanlock log? Also, can you describe the full flow?
Also, why is this a regression?
This bug is not a load issue. I have reproduced the scenario again and am attaching the logs.

iSCSI storage, 2 VMs with two thin provision disks of 10GB each (NOT wipe-after-delete disks); each VM is running on a different host.

Select the VM and move both disks (for each of the VMs). Once the move is done, move the disks again.

Results: we fail to move the disks the second time for both VMs. The VM running on the HSM fails the move because we failed to remove the volume during the first move. The VM running on the SPM fails because vdsm crashed.

Full logs attached, including sanlock, libvirt and the VMs' qemu logs.
Created attachment 731058 [details]
logs
I found multiple issues in the attached logs.

With regard to:

Thread-449::ERROR::2013-04-03 10:56:24,726::BindingXMLRPC::932::vds::(wrapper) unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/BindingXMLRPC.py", line 918, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/BindingXMLRPC.py", line 345, in vmDiskReplicateStart
    return vm.diskReplicateStart(srcDisk, dstDisk)
  File "/usr/share/vdsm/API.py", line 520, in diskReplicateStart
    return v.diskReplicateStart(srcDisk, dstDisk)
  File "/usr/share/vdsm/libvirtvm.py", line 2271, in diskReplicateStart
    self._setDiskReplica(srcDrive, dstDisk)
  File "/usr/share/vdsm/libvirtvm.py", line 2241, in _setDiskReplica
    self.saveState()
  File "/usr/share/vdsm/libvirtvm.py", line 2509, in saveState
    vm.Vm.saveState(self)
  File "/usr/share/vdsm/vm.py", line 761, in saveState
    toSave = deepcopy(self.status())
  File "/usr/lib64/python2.6/copy.py", line 162, in deepcopy
    y = copier(x, memo)
  File "/usr/lib64/python2.6/copy.py", line 255, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/usr/lib64/python2.6/copy.py", line 162, in deepcopy
    y = copier(x, memo)
  File "/usr/lib64/python2.6/copy.py", line 228, in _deepcopy_list
    y.append(deepcopy(a, memo))
  File "/usr/lib64/python2.6/copy.py", line 162, in deepcopy
    y = copier(x, memo)
  File "/usr/lib64/python2.6/copy.py", line 254, in _deepcopy_dict
    for key, value in x.iteritems():
RuntimeError: dictionary changed size during iteration

VDSM failed to update (save) the state of a VM because two concurrent vmDiskReplicateStart calls modified the VM configuration at the same time:

Thread-449::DEBUG::2013-04-03 10:56:24,717::BindingXMLRPC::913::vds::(wrapper) client [10.35.161.131]::call vmDiskReplicateStart with ('c3cdb482-8472-4f8a-b2ee-332118d467d1', {'device': 'disk', 'domainID': '6829602d-352a-40a4-af70-376f6e498f85', 'volumeID': '7ec4cbcc-1f0a-48b4-8ba3-000b42a0701f', 'poolID': '574d2c32-013c-4210-ab82-334188bd6171', 'imageID': '88929489-c525-49f9-9b1b-4efe28c4a706'}, {'device': 'disk', 'domainID': 'a3282596-8f78-4930-bb76-bebeb657babf', 'volumeID': '7ec4cbcc-1f0a-48b4-8ba3-000b42a0701f', 'poolID': '574d2c32-013c-4210-ab82-334188bd6171', 'imageID': '88929489-c525-49f9-9b1b-4efe28c4a706'}) {} flowID [2904eb87]

Thread-450::DEBUG::2013-04-03 10:56:24,724::BindingXMLRPC::913::vds::(wrapper) client [10.35.161.131]::call vmDiskReplicateStart with ('c3cdb482-8472-4f8a-b2ee-332118d467d1', {'device': 'disk', 'domainID': '6829602d-352a-40a4-af70-376f6e498f85', 'volumeID': 'b0463b99-16a0-4ca7-b9b3-ff370dc200e4', 'poolID': '574d2c32-013c-4210-ab82-334188bd6171', 'imageID': 'f4eca5b2-1f0f-4d6e-9da7-fbee2d9e532e'}, {'device': 'disk', 'domainID': 'a3282596-8f78-4930-bb76-bebeb657babf', 'volumeID': 'b0463b99-16a0-4ca7-b9b3-ff370dc200e4', 'poolID': '574d2c32-013c-4210-ab82-334188bd6171', 'imageID': 'f4eca5b2-1f0f-4d6e-9da7-fbee2d9e532e'}) {} flowID [2904eb87]

That can be easily resolved in VDSM, but I suggest opening a bug on the engine too, as this exception should have been handled (one of the vmDiskReplicateStart calls failed => retry or roll back to the source). If needed, I can provide a custom VDSM that triggers the exception without having to reproduce the entire scenario.

Besides that, I found again traces of storage overload:

MainThread::INFO::2013-04-03 11:03:14,647::logUtils::37::dispatcher::(wrapper) Run and protect: prepareForShutdown(options=None)
...
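The failure mode can be reproduced outside VDSM: deep-copying a dict while another thread mutates it raises exactly this RuntimeError. Below is a minimal standalone sketch (illustrative only; the dict and threads merely stand in for the VM configuration and the two vmDiskReplicateStart threads), and since it is a race it does not fire on every run.

import threading
from copy import deepcopy

# Stand-in for the VM configuration that saveState() deep-copies while
# vmDiskReplicateStart mutates it from another thread.
vm_conf = {'devices': [{'index': i} for i in range(1000)]}


def mutator():
    # Simulates the second replication call adding drive entries while
    # the first one is still copying the configuration.
    for i in range(100000):
        vm_conf['drive%d' % i] = {'replica': i}


t = threading.Thread(target=mutator)
t.start()
try:
    while t.is_alive():
        # May raise "RuntimeError: dictionary changed size during
        # iteration" when the mutator resizes vm_conf mid-copy.
        deepcopy(vm_conf)
except RuntimeError as e:
    print('reproduced: %s' % e)
t.join()

The VDSM-side fix presumably amounts to serializing saveState() against the configuration updates (e.g. copying under the VM's lock), which is what "easily resolved in VDSM" refers to.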
6bbc5822-3e98-4d13-8c7c-92e62d4006a6::WARNING::2013-04-03 11:03:53,802::task::579::TaskManager.Task::(_updateState) Task=`6bbc5822-3e98-4d13-8c7c-92e62d4006a6`::Task._updateState: failed persisting task 6bbc5822-3e98-4d13-8c7c-92e62d4006a6
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 576, in _updateState
    self.persist()
  File "/usr/share/vdsm/storage/task.py", line 1098, in persist
    self._save(self.store)
  File "/usr/share/vdsm/storage/task.py", line 717, in _save
    raise se.TaskDirError("_save: no such task dir '%s'" % origTaskDir)
TaskDirError: can't find/access task dir: ("_save: no such task dir '/rhev/data-center/574d2c32-013c-4210-ab82-334188bd6171/mastersd/master/tasks/6bbc5822-3e98-4d13-8c7c-92e62d4006a6'",)
...
MainThread::INFO::2013-04-03 11:03:54,328::vdsm::89::vds::(run) I am the actual vdsm 4.10-12.0 cougar01.scl.lab.tlv.redhat.com (2.6.32-358.2.1.el6.x86_64)

2013-04-03 11:03:12+0300 1114 [6957]: s1 check_our_lease warning 78 last_success 1036
2013-04-03 11:03:13+0300 1115 [6957]: s1 check_our_lease warning 79 last_success 1036
2013-04-03 11:03:14+0300 1116 [6957]: s1 check_our_lease failed 80
2013-04-03 11:03:14+0300 1116 [6957]: s1 kill 10381 sig 15 count 1
2013-04-03 11:03:15+0300 1117 [6957]: s1 kill 10381 sig 15 count 2
...
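Note that the "failed persisting task" lines are warnings rather than fatal errors because the state transition catches the persistence failure and only logs it. A simplified sketch of that behaviour (not the actual task.py code; TaskDirError here is a local stand-in for the VDSM storage exception):

import logging
import os


class TaskDirError(Exception):
    """Local stand-in for the VDSM storage exception of the same name."""


class Task(object):
    """Tiny sketch of the persistence behaviour visible in the logs above;
    the real VDSM Task class is far larger."""

    log = logging.getLogger("TaskManager.Task")

    def __init__(self, task_id, store):
        self.id = task_id
        self.store = store  # e.g. .../mastersd/master/tasks
        self.state = "init"

    def _save(self):
        task_dir = os.path.join(self.store, self.id)
        if not os.path.isdir(task_dir):
            # Matches the TaskDirError in the traceback: the directory on
            # the master domain is gone (or never became reachable).
            raise TaskDirError("_save: no such task dir '%s'" % task_dir)
        # The real code also serializes the task metadata and fsyncs the
        # directory through the remote file handler pool.

    def _updateState(self, new_state):
        self.state = new_state
        try:
            self._save()
        except Exception:
            # Persistence failures during a state change are swallowed and
            # surface only as the "failed persisting task" warning.
            self.log.warning("Task._updateState: failed persisting task %s",
                             self.id, exc_info=True)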
The storage overload might be a different issue, since the scenario is done with only 2 VMs with 2 thin provision disks on each VM. This is not a load environment and should not cause a problem.
As discussed with Ayal, I'm moving this to the backend. This issue should be fixed there by handling the possible return status {'message': 'Unexpected exception', 'code': 16} (to be treated as {'code': 55, 'message': 'Drive replication error'}).
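As a language-neutral illustration of the suggested handling (the actual fix belongs in the engine; the error codes are quoted from this comment, everything else below is hypothetical):

# Error codes quoted in the comment above; the rest is an illustrative
# sketch, not engine or VDSM code.
UNEXPECTED_EXCEPTION = 16
DRIVE_REPLICATION_ERROR = 55


def normalize_replicate_status(status):
    """Map the generic 'Unexpected exception' returned by
    vmDiskReplicateStart to the specific drive replication error that the
    caller already knows how to handle (retry or roll back to the source)."""
    if status.get('code') == UNEXPECTED_EXCEPTION:
        return {'code': DRIVE_REPLICATION_ERROR,
                'message': 'Drive replication error'}
    return status


# Hypothetical usage:
#   normalize_replicate_status({'message': 'Unexpected exception', 'code': 16})
#   -> {'code': 55, 'message': 'Drive replication error'}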
sf13.1: tested according to the Steps to Reproduce.
3.2 has been released