Bug 883939 - 3.1.z - vdsm [Storage Live Migration]: vm changes state to pause for a few seconds during storage live migration
Summary: 3.1.z - vdsm [Storage Live Migration]: vm changes state to pause for a few seconds during storage live migration
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: vdsm
Version: 6.3
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: 6.3
Assignee: Federico Simoncelli
QA Contact: Dafna Ron
URL:
Whiteboard: storage
Depends On: 873145
Blocks: 896511
 
Reported: 2012-12-05 15:46 UTC by Chris Pelland
Modified: 2013-02-28 07:01 UTC
CC: 14 users

Fixed In Version: vdsm-4.9.6-44.2
Doc Type: Bug Fix
Doc Text:
Virtual machines were sometimes found to pause for several seconds during storage live migration. To prevent this, VDSM has been updated to increase the amount of storage space requested by each call to extend the storage volumes involved in the migration. This allows storage live migration to occur without pausing the virtual machine(s).
Clone Of:
Clones: 896511
Environment:
Last Closed: 2013-01-15 14:52:42 UTC
Target Upstream Version:
Embargoed:


Attachments
## Logs vdsm, rhevm, libvirt, screen-shots (924.37 KB, application/x-gzip), 2012-12-11 12:38 UTC, vvyazmin@redhat.com
logs (1.51 MB, application/x-gzip), 2012-12-12 16:53 UTC, Dafna Ron


Links
System: Red Hat Product Errata, ID: RHBA-2013:0159, Private: 0, Priority: normal, Status: SHIPPED_LIVE, Summary: vdsm bug fix update, Last Updated: 2013-01-15 19:50:47 UTC

Description Chris Pelland 2012-12-05 15:46:49 UTC
This bug has been copied from bug #873145 and has been proposed
to be backported to 6.3 z-stream (EUS).

Comment 4 Federico Simoncelli 2012-12-05 15:49:24 UTC
commit 679b45308ac69a5baed98318a11c16771e339336
Author: Federico Simoncelli <fsimonce>
Date:   Mon Nov 12 08:18:12 2012 -0500

    vm: increase the volume extension on storage migration
    
    During live migration VDSM needs to subsequently extend two volumes
    instead of one; by doubling the size of the chunk to extend, the
    watermark limit is doubled and VDSM has more time to accomplish the
    operations.
    
    Change-Id: Ib61375613712feb7118a80c50b73e678d257f251
    Signed-off-by: Federico Simoncelli <fsimonce>

http://gerrit.ovirt.org/#/c/9200/
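
For illustration, a minimal sketch of the idea behind the patch (the function and constant names, and the default chunk size, are assumptions for this sketch, not the actual VDSM code):

CHUNK_SIZE_MB = 1024  # assumed default extension chunk size, not VDSM's actual value

def extension_size_mb(drive_is_replicating):
    """Return how much space (in MB) to request for the next volume extension.

    When a drive is being replicated (live storage migration), two volumes
    (source and replica) must be extended instead of one, so the requested
    chunk is doubled; this also doubles the effective watermark headroom.
    """
    if drive_is_replicating:
        return 2 * CHUNK_SIZE_MB
    return CHUNK_SIZE_MB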

Comment 6 Stephen Gordon 2012-12-10 17:13:53 UTC
Federico, please review the 'Doc Text' to ensure it remains an accurate reflection of the change; I have attempted to write it in a way that will make sense to customers.

Comment 7 Federico Simoncelli 2012-12-11 11:21:55 UTC
The "Doc Text" is accurate.

Comment 8 vvyazmin@redhat.com 2012-12-11 12:38:50 UTC
Created attachment 661413 [details]
## Logs vdsm, rhevm, libvirt, screen-shots

Comment 9 vvyazmin@redhat.com 2012-12-11 12:43:41 UTC
Failed on RHEVM 3.1 - SI25 environment 

RHEVM: rhevm-3.1.0-33.el6ev.noarch
VDSM: vdsm-4.9.6-44.2.el6_3.x86_64
LIBVIRT: libvirt-0.9.10-21.el6_3.6.x86_64
QEMU & KVM: qemu-kvm-rhev-0.12.1.2-2.295.el6_3.8.x86_64
SANLOCK: sanlock-2.3-4.el6_3.x86_64

The storage I used was FC.
The VM has one disk with an OS installed.

Time: ~14:17
VM name: vm-T-RHEL-2000

Comment 10 Federico Simoncelli 2012-12-12 09:48:32 UTC
In the log there are several extend requests with the correct size:

Thread-5144::DEBUG::2012-12-11 14:21:17,287::vm::760::vm.Vm::(extendDriveVolume) vmId=`f8bd9fd3-b4fc-4d12-8743-83437c623c2e`::Requesting an extension for the volume replication: {'newSize': 5120, 'domainID': 'de22c4e0-4c8a-468a-8810-675fcb25efb1', 'name': u'vda', 'volumeID': 'f91ddf5d-c81f-4b45-9500-d3eff6aa317f', 'imageID': 'd395408a-30a3-466f-8a4b-4f0506232c20', 'poolID': '3622182b-86cd-4812-ba02-5cd42d2d1c38'}

But they're never satisfied by the SPM (same machine) because a spmStop was issued:

Thread-4119::INFO::2012-12-11 13:50:51,999::logUtils::37::dispatcher::(wrapper) Run and protect: spmStop(spUUID='3622182b-86cd-4812-ba02-5cd42d2d1c38', options=None)
[...]
Dummy-918::INFO::2012-12-11 13:50:54,865::storage_mailbox::692::Storage.MailBox.SpmMailMonitor::(run) SPM_MailMonitor - Incoming mail monitoring thread stopped
[...]
Thread-4122::INFO::2012-12-11 13:50:56,143::logUtils::37::dispatcher::(wrapper) Run and protect: getSpmStatus(spUUID='3622182b-86cd-4812-ba02-5cd42d2d1c38', options=None)
Thread-4122::INFO::2012-12-11 13:50:56,144::logUtils::39::dispatcher::(wrapper) Run and protect: getSpmStatus, Return response: {'spm_st': {'spmId': -1, 'spmStatus': 'Free', 'spmLver': 0}}

The engine stopped the spm because the host wasn't able to see the domain 72520460-b313-47a8-809f-0b63659597c1:

Thread-4116::ERROR::2012-12-11 13:50:51,912::task::853::TaskManager.Task::(_setError) Task=`60ff069b-6e46-437f-bb56-d068c8cb2281`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 861, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 38, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 1008, in attachStorageDomain
    pool.attachSD(sdUUID)
  File "/usr/share/vdsm/storage/securable.py", line 63, in wrapper
    return f(self, *args, **kwargs)
  File "/usr/share/vdsm/storage/sp.py", line 932, in attachSD
    dom = sdCache.produce(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 97, in produce
    domain.getRealDomain()
  File "/usr/share/vdsm/storage/sdc.py", line 52, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 121, in _realProduce
    domain = self._findDomain(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 152, in _findDomain
    raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist: ('72520460-b313-47a8-809f-0b63659597c1',)

2012-12-11 13:50:53,609 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (ajp-/127.0.0.1:8702-5) [3a58b15a] Failed in AttachStorageDomainVDS method
2012-12-11 13:50:53,610 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (ajp-/127.0.0.1:8702-5) [3a58b15a] Error code StorageDomainDoesNotExist and error message IRSGenericException: IRSErrorException: Failed to AttachStorageDomainVDS, error = Storage domain does not exist: ('72520460-b313-47a8-809f-0b63659597c1',)
2012-12-11 13:50:53,610 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (ajp-/127.0.0.1:8702-5) [3a58b15a] IrsBroker::Failed::AttachStorageDomainVDS due to: IRSErrorException: IRSGenericException: IRSErrorException: Failed to AttachStorageDomainVDS, error = Storage domain does not exist: ('72520460-b313-47a8-809f-0b63659597c1',)
2012-12-11 13:50:53,647 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (ajp-/127.0.0.1:8702-5) [3a58b15a] START, SpmStopVDSCommand(HostName = green-vdsb, HostId = 5e1f2e6c-437a-11e2-8a95-001a4a1697a6, storagePoolId = 3622182b-86cd-4812-ba02-5cd42d2d1c38), log id: 6345e175
2012-12-11 13:50:53,667 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (ajp-/127.0.0.1:8702-5) [3a58b15a] SpmStopVDSCommand::Stopping SPM on vds green-vdsb, pool id 3622182b-86cd-4812-ba02-5cd42d2d1c38
2012-12-11 13:50:54,212 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (ajp-/127.0.0.1:8702-5) [3a58b15a] FINISH, SpmStopVDSCommand, log id: 6345e175

Moving back to ON_QA to be tested with an active SPM.
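
For anyone re-checking logs like these, a small throwaway helper (hypothetical, not part of VDSM or the RHEV tooling) can list the extension requests and SPM stop/status events from a vdsm.log so they can be correlated by timestamp:

import re
import sys

# Patterns taken from the vdsm.log lines quoted above.
PATTERNS = [
    ("extend request", re.compile(r"Requesting an extension for the volume")),
    ("spmStop", re.compile(r"Run and protect: spmStop\(")),
    ("spm status", re.compile(r"getSpmStatus, Return response")),
]

def scan(path):
    with open(path) as log:
        for line in log:
            for label, pattern in PATTERNS:
                if pattern.search(line):
                    # vdsm.log lines start with "Thread-N::LEVEL::timestamp::..."
                    print("%-15s %s" % (label, line.strip()[:140]))
                    break

if __name__ == "__main__":
    scan(sys.argv[1] if len(sys.argv) > 1 else "vdsm.log")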

Comment 11 Dafna Ron 2012-12-12 16:50:28 UTC
Tested on vdsm-4.9.6-44.2.el6_3.x86_64.
I am still seeing the status change from Up to Paused (will attach full logs).

Look at domain: 019c3116-e094-45ce-a974-ab7c5059d849


2012-12-12 18:19:40,719 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (QuartzScheduler_Worker-48) VM RHEL6-4 019c3116-e094-45ce-a974-ab7c5059d849 moved from Up --> Paused
2012-12-12 18:19:40,996 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.DeleteImageGroupVDSCommand] (pool-4-thread-43) START, DeleteImageGroupVDSCommand( storagePoolId = 6d873e0e-80d5-4390-a3dd-f77f59bfdb09, ignoreFailoverLimit = false, compatabilityVersion = null, storageDomainId = a738573e-35dd-41ee-bd5a-c83e6f5a0f7e, imageGroupId = 2392ab0f-864d-4ef5-8495-24b9b0771f6c, postZeros = false, forceDelete = false), log id: 28dbf282
2012-12-12 18:19:42,926 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (QuartzScheduler_Worker-58) VM RHEL6-4 019c3116-e094-45ce-a974-ab7c5059d849 moved from Paused --> Up
2012-12-12 18:19:43,805 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.SyncImageGroupDataVDSCommand] (pool-4-thread-44) FINISH, SyncImageGroupDataVDSCommand, log id: 50679c28

Comment 12 Dafna Ron 2012-12-12 16:53:24 UTC
Created attachment 662485 [details]
logs

Comment 13 Federico Simoncelli 2012-12-13 12:23:36 UTC
From what I can see, domain 019c3116-e094-45ce-a974-ab7c5059d849 started the live storage migration at:

Thread-2329::DEBUG::2012-12-12 18:19:36,293::BindingXMLRPC::894::vds::(wrapper) client [10.35.97.65]::call vmDiskReplicateStart with ('019c3116-e094-45ce-a974-ab7c5059d849', {'device': 'disk', 'domainID': 'a738573e-35dd-41ee-bd5a-c83e6f5a0f7e', 'volumeID': '04e82649-3f17-4752-86cf-94cc59f12577', 'poolID': '6d873e0e-80d5-4390-a3dd-f77f59bfdb09', 'imageID': '0f7a0f00-5f64-4640-94b2-e5f2786e9ab9'}, {'device': 'disk', 'domainID': 'f1ec1456-736f-4fa6-8b8e-639ed994bd25', 'volumeID': '04e82649-3f17-4752-86cf-94cc59f12577', 'poolID': '6d873e0e-80d5-4390-a3dd-f77f59bfdb09', 'imageID': '0f7a0f00-5f64-4640-94b2-e5f2786e9ab9'}) {}

And the VM paused 3 seconds later (too soon to be an ENOSPC):

Thread-2333::DEBUG::2012-12-12 18:19:39,992::BindingXMLRPC::900::vds::(wrapper) return vmGetStats with {'status': {'message': 'Done', 'code': 0}, 'statsList': [{'status': 'Paused', 'username': 'Unknown', 'memUsage': '0', 'acpiEnable': 'true', 'pid': '28387', 'displayIp': '0', 'displayPort': u'5901', 'session': 'Unknown', 'displaySecurePort': '-1', 'timeOffset': '-43200', 'hash': '5104074509452472763', 'pauseCode': 'NOERR', 'clientIp': '', 'kvmEnable': 'true', 'network': {u'vnet1': {'macAddr': '00:1a:4a:23:61:72', 'rxDropped': '0', 'rxErrors': '0', 'txDropped': '0', 'txRate': '0.0', 'rxRate': '0.0', 'txErrors': '0', 'state': 'unknown', 'speed': '1000', 'name': u'vnet1'}}, 'vmId': '019c3116-e094-45ce-a974-ab7c5059d849', 'displayType': 'vnc', 'cpuUser': '0.20', 'disks': {u'hdc': {'flushLatency': '0', 'readLatency': '0', 'writeLatency': '0'}, u'hda': {'readLatency': '606963', 'apparentsize': '1073741824', 'writeLatency': '-3475122', 'imageID': '0f7a0f00-5f64-4640-94b2-e5f2786e9ab9', 'flushLatency': '180835656', 'readRate': '-1168154.53', 'truesize': '1073741824', 'writeRate': '-41718.19'}}, 'monitorResponse': '0', 'statsAge': '1.45', 'cpuIdle': '99.80', 'elapsedTime': '402', 'vmType': 'kvm', 'cpuSys': '0.00', 'appsList': [], 'guestIPs': '', 'nice': ''}]}

Digging in some libvirt logs that weren't attached:

2012-12-12 16:19:37.533+0000: 2067: debug : qemuMonitorSend:823 : QEMU_MONITOR_SEND_MSG: mon=0x7fec80001e40 msg={"execute":"__com.redhat_drive-mirror","arguments":{"device":"drive-ide0-0-0","target":"/rhev/data-center/6d873e0e-80d5-4390-a3dd-f77f59bfdb09/f1ec1456-736f-4fa6-8b8e-639ed994bd25/images/0f7a0f00-5f64-4640-94b2-e5f2786e9ab9/04e82649-3f17-4752-86cf-94cc59f12577","full":false,"speed":0,"mode":"existing","format":"qcow2"},"id":"libvirt-213"}

2012-12-12 16:19:39.608+0000: 2066: debug : virDomainSuspend:2379 : dom=0x7fec98001b20, (VM: name=RHEL6-4, uuid=019c3116-e094-45ce-a974-ab7c5059d849)
2012-12-12 16:19:39.608+0000: 2066: debug : qemuDomainObjBeginJobInternal:754 : Starting job: suspend (async=none)
2012-12-12 16:19:39.624+0000: 2066: debug : qemuMonitorRef:201 : QEMU_MONITOR_REF: mon=0x7fec80001e40 refs=3
2012-12-12 16:19:39.624+0000: 2066: debug : qemuMonitorStopCPUs:1081 : mon=0x7fec80001e40
2012-12-12 16:19:39.624+0000: 2066: debug : qemuMonitorSend:823 : QEMU_MONITOR_SEND_MSG: mon=0x7fec80001e40 msg={"execute":"stop","id":"libvirt-215"}

For some reason a virDomainSuspend was issued to libvirt. I'm still investigating why.
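
As a side note, the pause reason can also be queried from libvirt directly; a sketch using the libvirt Python bindings (the reason mapping shown is an illustrative subset, and the VM name is taken from the logs above):

import libvirt

# Subset of virDomainPausedReason values mapped to readable names.
PAUSE_REASONS = {
    libvirt.VIR_DOMAIN_PAUSED_USER: "user/API request (e.g. virDomainSuspend)",
    libvirt.VIR_DOMAIN_PAUSED_IOERROR: "I/O error (e.g. ENOSPC on the storage)",
    libvirt.VIR_DOMAIN_PAUSED_MIGRATION: "migration",
}

conn = libvirt.open("qemu:///system")
dom = conn.lookupByName("RHEL6-4")  # VM name from the logs above
state, reason = dom.state()
if state == libvirt.VIR_DOMAIN_PAUSED:
    print("paused: " + PAUSE_REASONS.get(reason, "other (%d)" % reason))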

Comment 14 Federico Simoncelli 2012-12-17 12:51:44 UTC
At the moment vdsm needs to pause the VM for a short moment during the volume refresh (after an extend). In certain cases (or if the refresh takes a few seconds) it might also be reported to the engine. I already looked into removing this limitation (http://gerrit.ovirt.org/#/c/8382/) but it looks too risky for 3.1.
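
A rough sketch of that pause/refresh/resume sequence (hypothetical helper, not the actual VDSM flow; here dom would be a libvirt domain and refresh_volume any callable that refreshes the extended volume on the host):

def refresh_after_extend(dom, refresh_volume):
    # Pause briefly so the guest cannot write while the host refreshes the
    # volume mapping, then resume. If the refresh takes a few seconds, the
    # engine polling in between will observe (and report) the Paused state.
    dom.suspend()
    try:
        refresh_volume()
    finally:
        dom.resume()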

The important thing is that we don't hit enospc/eother errors and we extend in time. Analyzing the logs, this part looks fine (no enospc/eother errors).

In my opinion we can either proceed to ON_QA again or alternatively, based on the logs we already have (please double-check everything), mark it as verified.

Comment 15 Yaniv Kaul 2012-12-17 16:30:32 UTC
(In reply to comment #14)
> At the moment vdsm needs to pause the VM for a short moment during the
> volume refresh (after an extend). In certain cases (or if the refresh takes
> a few seconds) it might also be reported to the engine. I already looked
> into removing this limitation (http://gerrit.ovirt.org/#/c/8382/) but it
> looks too risky for 3.1.
> 
> The important thing is that we don't hit enospc/eother errors and we extend
> in time. Analyzing the logs, this part looks fine (no enospc/eother errors).
> 
> In my opinion we can either proceed to ON_QA again or alternatively, based
> on the logs we already have (please double-check everything), mark it as
> verified.

The bug is still there.
You can revert the patch, which did nothing from the user's perspective, or keep it (we did not see it causing regressions) and postpone the complete fix to the next version, or decide not to fix it and close it as WONTFIX.
But it cannot be moved back to QE nor marked as VERIFIED.

Comment 16 Federico Simoncelli 2012-12-17 20:34:12 UTC
(In reply to comment #15)
> (In reply to comment #14)
> > At the moment vdsm needs to pause the VM for a short moment during the
> > volume refresh (after an extend). In certain cases (or if the refresh
> > takes a few seconds) it might also be reported to the engine. I already
> > looked into removing this limitation (http://gerrit.ovirt.org/#/c/8382/)
> > but it looks too risky for 3.1.
> > 
> > The important thing is that we don't hit enospc/eother errors and we
> > extend in time. Analyzing the logs, this part looks fine (no enospc/eother
> > errors).
> > 
> > In my opinion we can either proceed to ON_QA again or alternatively,
> > based on the logs we already have (please double-check everything), mark
> > it as verified.
> 
> The bug is still there.
> You can revert the patch, which did nothing from the user's perspective, or
> keep it (we did not see it causing regressions) and postpone the complete
> fix to the next version, or decide not to fix it and close it as WONTFIX.
> But it cannot be moved back to QE nor marked as VERIFIED.

In terms of what the process allows, I don't think we can keep code that doesn't fix a bug. If we can, I might be fine with closing/postponing the bug (even though I think it's a stretch and I dislike it).

That said, the bug was fixed (there are no enospc/eother errors); what you hit now is a problem that is also present during regular VM operations and is not related to live storage migration.

Your options are not viable:

1. Remove the patch, keep the bug open -> we're back to the original issue; the bug is still present, and we have reached an impasse because the working fix is blocked on something else that we cannot fix now.
2. Leave the patch, keep the bug open -> the issue is resolved but we have no formal trace of that (I don't even think it's allowed).

Option 2 would also require you to properly modify the bug, because it's not related to live storage migration anymore.

Comment 17 Ayal Baron 2012-12-17 23:24:53 UTC
Fede, is there anything in lsm which makes it more likely to hit this bug?

Comment 18 Federico Simoncelli 2012-12-18 09:04:54 UTC
(In reply to comment #17)
> Fede, is there anything in lsm which makes it more likely to hit this bug?

I couldn't spot anything; otherwise I would have agreed to fix it as a live storage migration bug. Obviously we're making two extension requests, but that's not the point; the issue is there regardless.

Comment 19 Ayal Baron 2012-12-19 10:02:34 UTC
As discussed with Kaul, marking as VERIFIED.
Dafna, please open a new bug on the other problem (comment 14).

Comment 21 errata-xmlrpc 2013-01-15 14:52:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-0159.html

