Bug 948940 - [vdsm] concurrent live storage migration of multiple disks might result in a saveState exception
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.2.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 3.2.0
Assignee: Federico Simoncelli
QA Contact: Elad
URL:
Whiteboard: storage
Depends On: 923194
Blocks: 948448
 
Reported: 2013-04-05 14:40 UTC by Federico Simoncelli
Modified: 2022-07-09 06:00 UTC (History)
11 users

Fixed In Version: vdsm-4.10.2-18.0.el6ev
Doc Type: Bug Fix
Doc Text:
Previously, concurrent live storage migration of multiple disks sometimes resulted in a saveState exception. With vdsm-4.10.2-18.0.el6ev, saveState no longer throws this exception during concurrent live storage migration, so concurrent storage migration of multiple disks now succeeds.
Clone Of: 923194
Environment:
Last Closed: 2013-06-10 20:47:55 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHV-47088 0 None None None 2022-07-09 06:00:25 UTC
Red Hat Product Errata RHSA-2013:0886 0 normal SHIPPED_LIVE Moderate: rhev 3.2 - vdsm security and bug fix update 2013-06-11 00:25:02 UTC
oVirt gerrit 13624 0 None None None Never

Description Federico Simoncelli 2013-04-05 14:40:37 UTC
This is the vdsm part to avoid the "RuntimeError: dictionary changed size during iteration" exception.
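The actual patch is linked from oVirt gerrit 13624 below; as a rough sketch of the general approach (hypothetical class and attribute names, not the vdsm code itself), serializing the config mutation and the saveState() call under one lock prevents another thread from resizing the dict while deepcopy is iterating it:

```python
import threading
from copy import deepcopy

class VmStub:
    """Minimal stand-in for the vdsm Vm class (names are illustrative)."""

    def __init__(self):
        self._confLock = threading.Lock()  # serializes config changes
        self._conf = {'devices': [{'type': 'disk', 'index': 0}]}

    def _setDiskReplica(self, replica):
        # Mutate the config and persist it under a single lock, so no
        # concurrent vmDiskReplicateStart can resize the dict while
        # saveState() is deep-copying it.
        with self._confLock:
            self._conf['diskReplicate'] = replica
            self.saveState()

    def saveState(self):
        # deepcopy walks every nested dict; mutation from another
        # thread mid-copy is what raised the RuntimeError.
        return deepcopy(self._conf)
```

With the lock in place, concurrent replicate-start calls queue up instead of racing the copy.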


+++ This bug was initially created as a clone of Bug #923194 +++

--- Additional comment from Federico Simoncelli on 2013-04-03 07:42:17 EDT ---

I found multiple issues in the attached logs.

With regard to:

Thread-449::ERROR::2013-04-03 10:56:24,726::BindingXMLRPC::932::vds::(wrapper) unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/BindingXMLRPC.py", line 918, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/BindingXMLRPC.py", line 345, in vmDiskReplicateStart
    return vm.diskReplicateStart(srcDisk, dstDisk)
  File "/usr/share/vdsm/API.py", line 520, in diskReplicateStart
    return v.diskReplicateStart(srcDisk, dstDisk)
  File "/usr/share/vdsm/libvirtvm.py", line 2271, in diskReplicateStart
    self._setDiskReplica(srcDrive, dstDisk)
  File "/usr/share/vdsm/libvirtvm.py", line 2241, in _setDiskReplica
    self.saveState()
  File "/usr/share/vdsm/libvirtvm.py", line 2509, in saveState
    vm.Vm.saveState(self)
  File "/usr/share/vdsm/vm.py", line 761, in saveState
    toSave = deepcopy(self.status())
  File "/usr/lib64/python2.6/copy.py", line 162, in deepcopy
    y = copier(x, memo)
  File "/usr/lib64/python2.6/copy.py", line 255, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/usr/lib64/python2.6/copy.py", line 162, in deepcopy
    y = copier(x, memo)
  File "/usr/lib64/python2.6/copy.py", line 228, in _deepcopy_list
    y.append(deepcopy(a, memo))
  File "/usr/lib64/python2.6/copy.py", line 162, in deepcopy
    y = copier(x, memo)
  File "/usr/lib64/python2.6/copy.py", line 254, in _deepcopy_dict
    for key, value in x.iteritems():
RuntimeError: dictionary changed size during iteration

VDSM failed to update (save) the state of a VM because of two concurrent vmDiskReplicateStart calls that modified the VM configuration at the same time:

Thread-449::DEBUG::2013-04-03 10:56:24,717::BindingXMLRPC::913::vds::(wrapper) client [10.35.161.131]::call vmDiskReplicateStart with ('c3cdb482-8472-4f8a-b2ee-332118d467d1', {'device': 'disk', 'domainID': '6829602d-352a-40a4-af70-376f6e498f85', 'volumeID': '7ec4cbcc-1f0a-48b4-8ba3-000b42a0701f', 'poolID': '574d2c32-013c-4210-ab82-334188bd6171', 'imageID': '88929489-c525-49f9-9b1b-4efe28c4a706'}, {'device': 'disk', 'domainID': 'a3282596-8f78-4930-bb76-bebeb657babf', 'volumeID': '7ec4cbcc-1f0a-48b4-8ba3-000b42a0701f', 'poolID': '574d2c32-013c-4210-ab82-334188bd6171', 'imageID': '88929489-c525-49f9-9b1b-4efe28c4a706'}) {} flowID [2904eb87]
Thread-450::DEBUG::2013-04-03 10:56:24,724::BindingXMLRPC::913::vds::(wrapper) client [10.35.161.131]::call vmDiskReplicateStart with ('c3cdb482-8472-4f8a-b2ee-332118d467d1', {'device': 'disk', 'domainID': '6829602d-352a-40a4-af70-376f6e498f85', 'volumeID': 'b0463b99-16a0-4ca7-b9b3-ff370dc200e4', 'poolID': '574d2c32-013c-4210-ab82-334188bd6171', 'imageID': 'f4eca5b2-1f0f-4d6e-9da7-fbee2d9e532e'}, {'device': 'disk', 'domainID': 'a3282596-8f78-4930-bb76-bebeb657babf', 'volumeID': 'b0463b99-16a0-4ca7-b9b3-ff370dc200e4', 'poolID': '574d2c32-013c-4210-ab82-334188bd6171', 'imageID': 'f4eca5b2-1f0f-4d6e-9da7-fbee2d9e532e'}) {} flowID [2904eb87]

That can be easily resolved in VDSM, but I suggest opening a bug against the engine too, as this exception should have been handled (one of the vmDiskReplicateStart calls failed => retry or roll back to the source). If needed, I can provide a custom VDSM build that triggers the exception without having to reproduce the entire scenario.
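The RuntimeError itself is the standard CPython guard against resizing a dict while it is being iterated; here deepcopy hit it because the second replicate thread mutated the VM config mid-copy. A minimal single-threaded sketch (not the vdsm code, just the same guard triggered deterministically):

```python
d = {'a': 1, 'b': 2}
try:
    for key in d:
        d['replica-' + key] = {}  # resizes the dict mid-iteration
except RuntimeError as exc:
    # CPython raises on the next iteration step after the resize
    print(exc)
```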

Besides that, I again found traces of storage overload:

MainThread::INFO::2013-04-03 11:03:14,647::logUtils::37::dispatcher::(wrapper) Run and protect: prepareForShutdown(options=None)
...
6bbc5822-3e98-4d13-8c7c-92e62d4006a6::WARNING::2013-04-03 11:03:53,802::task::579::TaskManager.Task::(_updateState) Task=`6bbc5822-3e98-4d13-8c7c-92e62d4006a6`::Task._updateState: failed persisting task 6bbc5822-3e98-4d13-8c7c-92e62d4006a6
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 576, in _updateState
    self.persist()
  File "/usr/share/vdsm/storage/task.py", line 1098, in persist
    self._save(self.store)
  File "/usr/share/vdsm/storage/task.py", line 717, in _save
    raise se.TaskDirError("_save: no such task dir '%s'" % origTaskDir)
TaskDirError: can't find/access task dir: ("_save: no such task dir '/rhev/data-center/574d2c32-013c-4210-ab82-334188bd6171/mastersd/master/tasks/6bbc5822-3e98-4d13-8c7c-92e62d4006a6'",)
...
MainThread::INFO::2013-04-03 11:03:54,328::vdsm::89::vds::(run) I am the actual vdsm 4.10-12.0 cougar01.scl.lab.tlv.redhat.com (2.6.32-358.2.1.el6.x86_64)

2013-04-03 11:03:12+0300 1114 [6957]: s1 check_our_lease warning 78 last_success 1036
2013-04-03 11:03:13+0300 1115 [6957]: s1 check_our_lease warning 79 last_success 1036
2013-04-03 11:03:14+0300 1116 [6957]: s1 check_our_lease failed 80
2013-04-03 11:03:14+0300 1116 [6957]: s1 kill 10381 sig 15 count 1
2013-04-03 11:03:15+0300 1117 [6957]: s1 kill 10381 sig 15 count 2
...

Comment 3 Elad 2013-05-13 08:49:33 UTC
Verified on RHEVM-3.2-SF16:

vdsm-4.10.2-18.0.el6ev.x86_64
rhevm-3.2.0-10.25.beta3.el6ev.noarch
libvirt-0.10.2-18.el6_4.4.x86_64

Concurrent storage migration of multiple disks succeeded.

Comment 5 errata-xmlrpc 2013-06-10 20:47:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0886.html

