Bug 883939

Summary: 3.1.z - vdsm [Storage Live Migration]: vm changes state to pause for a few seconds during storage live migration
Product: Red Hat Enterprise Linux 6
Component: vdsm
Version: 6.3
Hardware: x86_64
OS: Linux
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
Target Milestone: rc
Target Release: 6.3
Keywords: ZStream
Whiteboard: storage
Fixed In Version: vdsm-4.9.6-44.2
Reporter: Chris Pelland <cpelland>
Assignee: Federico Simoncelli <fsimonce>
QA Contact: Dafna Ron <dron>
CC: abaron, amureini, bazulay, cpelland, dron, fsimonce, hateya, iheim, ilvovsky, lpeer, pm-eus, sgordon, vvyazmin, ykaul
Doc Type: Bug Fix
Doc Text:
Virtual machines were found to sometimes pause for several seconds during storage live migration. To prevent this, VDSM has been updated to increase the amount of storage space requested by each call to extend the storage volumes involved in the migration. This allows storage live migration to occur without pausing the virtual machines.
Last Closed: 2013-01-15 14:52:42 UTC
Bug Depends On: 873145
Bug Blocks: 896511
Attachments:
  ## Logs vdsm, rhevm, libvirt, screen-shots
  logs

Description Chris Pelland 2012-12-05 15:46:49 UTC
This bug has been copied from bug #873145 and has been proposed
to be backported to 6.3 z-stream (EUS).

Comment 4 Federico Simoncelli 2012-12-05 15:49:24 UTC
commit 679b45308ac69a5baed98318a11c16771e339336
Author: Federico Simoncelli <fsimonce>
Date:   Mon Nov 12 08:18:12 2012 -0500

    vm: increase the volume extension on storage migration
    
    During live storage migration VDSM needs to extend two volumes in
    sequence instead of one; by doubling the size of the chunk requested
    for each extension, the watermark limit is also doubled and VDSM has
    more time to complete the operations.
    
    Change-Id: Ib61375613712feb7118a80c50b73e678d257f251
    Signed-off-by: Federico Simoncelli <fsimonce>

http://gerrit.ovirt.org/#/c/9200/
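
To make the change concrete, here is a minimal sketch in Python of the idea (illustrative only: the constants, the names, and the way the watermark limit is derived from the chunk are assumptions for the example, not VDSM's actual code):

GiB = 1024 ** 3
MiB = 1024 ** 2

VOLUME_EXTENSION_CHUNK = 1 * GiB   # space added by a single extension request
WATERMARK_RATIO = 0.5              # extend when free space < ratio * chunk


def extension_chunk(replicating):
    # During live storage migration two volumes (source and replica) must
    # be extended instead of one, so the chunk is doubled; since the
    # watermark limit is derived from the chunk, it is doubled as well,
    # leaving VDSM more time to complete both extensions.
    return VOLUME_EXTENSION_CHUNK * (2 if replicating else 1)


def needs_extension(allocated, highest_write, replicating):
    # True when the guest's highest write offset is close enough to the
    # end of the allocated space that a new extension should be requested.
    free = allocated - highest_write
    return free < WATERMARK_RATIO * extension_chunk(replicating)


if __name__ == '__main__':
    allocated = 4 * GiB
    highest_write = allocated - 300 * MiB     # 300 MiB of headroom left
    if needs_extension(allocated, highest_write, replicating=True):
        new_size = allocated + extension_chunk(True)
        print('request extension to %.0f GiB' % (new_size / float(GiB)))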

Comment 6 Stephen Gordon 2012-12-10 17:13:53 UTC
Federico, please review the 'Doc Text' to ensure it remains an accurate reflection of the change; I have attempted to write it in a way that will make sense to customers.

Comment 7 Federico Simoncelli 2012-12-11 11:21:55 UTC
The "Doc Text" is accurate.

Comment 8 vvyazmin@redhat.com 2012-12-11 12:38:50 UTC
Created attachment 661413 [details]
## Logs vdsm, rhevm, libvirt, screen-shots

Comment 9 vvyazmin@redhat.com 2012-12-11 12:43:41 UTC
Failed on RHEVM 3.1 - SI25 environment 

RHEVM: rhevm-3.1.0-33.el6ev.noarch
VDSM: vdsm-4.9.6-44.2.el6_3.x86_64
LIBVIRT: libvirt-0.9.10-21.el6_3.6.x86_64
QEMU & KVM: qemu-kvm-rhev-0.12.1.2-2.295.el6_3.8.x86_64
SANLOCK: sanlock-2.3-4.el6_3.x86_64

The storage I used was FC.
The VM has one disk with an OS installed.

Time: ~14:17
VM name: vm-T-RHEL-2000

Comment 10 Federico Simoncelli 2012-12-12 09:48:32 UTC
In the log there are several extend requests with the correct size:

Thread-5144::DEBUG::2012-12-11 14:21:17,287::vm::760::vm.Vm::(extendDriveVolume) vmId=`f8bd9fd3-b4fc-4d12-8743-83437c623c2e`::Requesting an extension for the volume replication: {'newSize': 5120, 'domainID': 'de22c4e0-4c8a-468a-8810-675fcb25efb1', 'name': u'vda', 'volumeID': 'f91ddf5d-c81f-4b45-9500-d3eff6aa317f', 'imageID': 'd395408a-30a3-466f-8a4b-4f0506232c20', 'poolID': '3622182b-86cd-4812-ba02-5cd42d2d1c38'}

But they're never satisfied by the SPM (same machine) because a spmStop was issued:

Thread-4119::INFO::2012-12-11 13:50:51,999::logUtils::37::dispatcher::(wrapper) Run and protect: spmStop(spUUID='3622182b-86cd-4812-ba02-5cd42d2d1c38', options=None)
[...]
Dummy-918::INFO::2012-12-11 13:50:54,865::storage_mailbox::692::Storage.MailBox.SpmMailMonitor::(run) SPM_MailMonitor - Incoming mail monitoring thread stopped
[...]
Thread-4122::INFO::2012-12-11 13:50:56,143::logUtils::37::dispatcher::(wrapper) Run and protect: getSpmStatus(spUUID='3622182b-86cd-4812-ba02-5cd42d2d1c38', options=None)
Thread-4122::INFO::2012-12-11 13:50:56,144::logUtils::39::dispatcher::(wrapper) Run and protect: getSpmStatus, Return response: {'spm_st': {'spmId': -1, 'spmStatus': 'Free', 'spmLver': 0}}
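
(For context on why the requests go unanswered: on block storage the actual extension is carried out by the SPM, which receives the requests through the storage mailbox. The sketch below uses hypothetical names, not VDSM's real classes, just to illustrate the dependency: once spmStop shuts the mail monitor down, anything still queued is never extended.)

class SpmMailMonitor(object):
    # Hypothetical stand-in for the SPM-side mail monitor; invented names.
    def __init__(self):
        self.running = True
        self.pending = []            # extension requests queued by the hosts

    def submit(self, request):
        self.pending.append(request)

    def stop(self):
        # Equivalent of spmStop: the incoming-mail monitoring thread stops,
        # so requests still queued (like the ones in the log above) are
        # never serviced.
        self.running = False

    def process(self):
        while self.running and self.pending:
            request = self.pending.pop(0)
            print('extending volume %(volumeID)s to newSize=%(newSize)s' % request)


if __name__ == '__main__':
    monitor = SpmMailMonitor()
    monitor.submit({'volumeID': 'f91ddf5d-c81f-4b45-9500-d3eff6aa317f',
                    'newSize': 5120})
    monitor.stop()       # spmStop arrives first...
    monitor.process()    # ...so nothing is extended and the request is lost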

The engine stopped the SPM because the host wasn't able to see the domain 72520460-b313-47a8-809f-0b63659597c1:

Thread-4116::ERROR::2012-12-11 13:50:51,912::task::853::TaskManager.Task::(_setError) Task=`60ff069b-6e46-437f-bb56-d068c8cb2281`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 861, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 38, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 1008, in attachStorageDomain
    pool.attachSD(sdUUID)
  File "/usr/share/vdsm/storage/securable.py", line 63, in wrapper
    return f(self, *args, **kwargs)
  File "/usr/share/vdsm/storage/sp.py", line 932, in attachSD
    dom = sdCache.produce(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 97, in produce
    domain.getRealDomain()
  File "/usr/share/vdsm/storage/sdc.py", line 52, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 121, in _realProduce
    domain = self._findDomain(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 152, in _findDomain
    raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist: ('72520460-b313-47a8-809f-0b63659597c1',)

2012-12-11 13:50:53,609 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (ajp-/127.0.0.1:8702-5) [3a58b15a] Failed in AttachStorageDomainVDS method
2012-12-11 13:50:53,610 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (ajp-/127.0.0.1:8702-5) [3a58b15a] Error code StorageDomainDoesNotExist and error message IRSGenericException: IRSErrorException: Failed to AttachStorageDomainVDS, error = Storage domain does not exist: ('72520460-b313-47a8-809f-0b63659597c1',)
2012-12-11 13:50:53,610 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (ajp-/127.0.0.1:8702-5) [3a58b15a] IrsBroker::Failed::AttachStorageDomainVDS due to: IRSErrorException: IRSGenericException: IRSErrorException: Failed to AttachStorageDomainVDS, error = Storage domain does not exist: ('72520460-b313-47a8-809f-0b63659597c1',)
2012-12-11 13:50:53,647 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (ajp-/127.0.0.1:8702-5) [3a58b15a] START, SpmStopVDSCommand(HostName = green-vdsb, HostId = 5e1f2e6c-437a-11e2-8a95-001a4a1697a6, storagePoolId = 3622182b-86cd-4812-ba02-5cd42d2d1c38), log id: 6345e175
2012-12-11 13:50:53,667 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (ajp-/127.0.0.1:8702-5) [3a58b15a] SpmStopVDSCommand::Stopping SPM on vds green-vdsb, pool id 3622182b-86cd-4812-ba02-5cd42d2d1c38
2012-12-11 13:50:54,212 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (ajp-/127.0.0.1:8702-5) [3a58b15a] FINISH, SpmStopVDSCommand, log id: 6345e175

Moving back to ON_QA to be tested with an active SPM.

Comment 11 Dafna Ron 2012-12-12 16:50:28 UTC
Tested on vdsm-4.9.6-44.2.el6_3.x86_64.
I am still seeing the status change from Up to Paused (will attach full logs).

Look at VM: 019c3116-e094-45ce-a974-ab7c5059d849


2012-12-12 18:19:40,719 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (QuartzScheduler_Worker-48) VM RHEL6-4 019c3116-e094-45ce-a974-ab7c5059d849 moved from Up --> Paused
2012-12-12 18:19:40,996 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.DeleteImageGroupVDSCommand] (pool-4-thread-43) START, DeleteImageGroupVDSCommand( storagePoolId = 6d873e0e-80d5-4390-a3dd-f77f59bfdb09, ignoreFailoverLimit = false, compatabilityVersion = null, storageDomainId = a738573e-35dd-41ee-bd5a-c83e6f5a0f7e, imageGroupId = 2392ab0f-864d-4ef5-8495-24b9b0771f6c, postZeros = false, forceDelete = false), log id: 28dbf282
2012-12-12 18:19:42,926 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (QuartzScheduler_Worker-58) VM RHEL6-4 019c3116-e094-45ce-a974-ab7c5059d849 moved from Paused --> Up
2012-12-12 18:19:43,805 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.SyncImageGroupDataVDSCommand] (pool-4-thread-44) FINISH, SyncImageGroupDataVDSCommand, log id: 50679c28

Comment 12 Dafna Ron 2012-12-12 16:53:24 UTC
Created attachment 662485 [details]
logs

Comment 13 Federico Simoncelli 2012-12-13 12:23:36 UTC
From what I can see, VM 019c3116-e094-45ce-a974-ab7c5059d849 started the live storage migration at:

Thread-2329::DEBUG::2012-12-12 18:19:36,293::BindingXMLRPC::894::vds::(wrapper) client [10.35.97.65]::call vmDiskReplicateStart with ('019c3116-e094-45ce-a974-ab7c5059d849', {'device': 'disk', 'domainID': 'a738573e-35dd-41ee-bd5a-c83e6f5a0f7e', 'volumeID': '04e82649-3f17-4752-86cf-94cc59f12577', 'poolID': '6d873e0e-80d5-4390-a3dd-f77f59bfdb09', 'imageID': '0f7a0f00-5f64-4640-94b2-e5f2786e9ab9'}, {'device': 'disk', 'domainID': 'f1ec1456-736f-4fa6-8b8e-639ed994bd25', 'volumeID': '04e82649-3f17-4752-86cf-94cc59f12577', 'poolID': '6d873e0e-80d5-4390-a3dd-f77f59bfdb09', 'imageID': '0f7a0f00-5f64-4640-94b2-e5f2786e9ab9'}) {}

And the VM paused 3 seconds later (too soon to be an ENOSPC):

Thread-2333::DEBUG::2012-12-12 18:19:39,992::BindingXMLRPC::900::vds::(wrapper) return vmGetStats with {'status': {'message': 'Done', 'code': 0}, 'statsList': [{'status': 'Paused', 'username': 'Unknown', 'memUsage': '0', 'acpiEnable': 'true', 'pid': '28387', 'displayIp': '0', 'displayPort': u'5901', 'session': 'Unknown', 'displaySecurePort': '-1', 'timeOffset': '-43200', 'hash': '5104074509452472763', 'pauseCode': 'NOERR', 'clientIp': '', 'kvmEnable': 'true', 'network': {u'vnet1': {'macAddr': '00:1a:4a:23:61:72', 'rxDropped': '0', 'rxErrors': '0', 'txDropped': '0', 'txRate': '0.0', 'rxRate': '0.0', 'txErrors': '0', 'state': 'unknown', 'speed': '1000', 'name': u'vnet1'}}, 'vmId': '019c3116-e094-45ce-a974-ab7c5059d849', 'displayType': 'vnc', 'cpuUser': '0.20', 'disks': {u'hdc': {'flushLatency': '0', 'readLatency': '0', 'writeLatency': '0'}, u'hda': {'readLatency': '606963', 'apparentsize': '1073741824', 'writeLatency': '-3475122', 'imageID': '0f7a0f00-5f64-4640-94b2-e5f2786e9ab9', 'flushLatency': '180835656', 'readRate': '-1168154.53', 'truesize': '1073741824', 'writeRate': '-41718.19'}}, 'monitorResponse': '0', 'statsAge': '1.45', 'cpuIdle': '99.80', 'elapsedTime': '402', 'vmType': 'kvm', 'cpuSys': '0.00', 'appsList': [], 'guestIPs': '', 'nice': ''}]}
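
Note the 'pauseCode': 'NOERR' in the stats above: the pause carries no I/O error. As a rough illustration (the field names follow the log line; the mapping of codes to causes is an assumption for the example):

def classify_pause(vm_stats):
    # Rough illustration only: distinguish an out-of-space pause from a
    # pause that carries no I/O error, using the pauseCode field shown in
    # the vmGetStats output above.
    if vm_stats.get('status') != 'Paused':
        return 'not paused'
    code = vm_stats.get('pauseCode', '')
    if code == 'ENOSPC':
        return 'paused on ENOSPC: the volume was not extended in time'
    if code in ('NOERR', ''):
        return 'paused without an I/O error (e.g. suspended during a refresh)'
    return 'paused with code %s' % code


if __name__ == '__main__':
    print(classify_pause({'status': 'Paused', 'pauseCode': 'NOERR'}))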

Digging into some libvirt logs that weren't attached:

2012-12-12 16:19:37.533+0000: 2067: debug : qemuMonitorSend:823 : QEMU_MONITOR_SEND_MSG: mon=0x7fec80001e40 msg={"execute":"__com.redhat_drive-mirror","arguments":{"device":"drive-ide0-0-0","target":"/rhev/data-center/6d873e0e-80d5-4390-a3dd-f77f59bfdb09/f1ec1456-736f-4fa6-8b8e-639ed994bd25/images/0f7a0f00-5f64-4640-94b2-e5f2786e9ab9/04e82649-3f17-4752-86cf-94cc59f12577","full":false,"speed":0,"mode":"existing","format":"qcow2"},"id":"libvirt-213"}

2012-12-12 16:19:39.608+0000: 2066: debug : virDomainSuspend:2379 : dom=0x7fec98001b20, (VM: name=RHEL6-4, uuid=019c3116-e094-45ce-a974-ab7c5059d849)
2012-12-12 16:19:39.608+0000: 2066: debug : qemuDomainObjBeginJobInternal:754 : Starting job: suspend (async=none)
2012-12-12 16:19:39.624+0000: 2066: debug : qemuMonitorRef:201 : QEMU_MONITOR_REF: mon=0x7fec80001e40 refs=3
2012-12-12 16:19:39.624+0000: 2066: debug : qemuMonitorStopCPUs:1081 : mon=0x7fec80001e40
2012-12-12 16:19:39.624+0000: 2066: debug : qemuMonitorSend:823 : QEMU_MONITOR_SEND_MSG: mon=0x7fec80001e40 msg={"execute":"stop","id":"libvirt-215"}

For some reason a virDomainSuspend was issued to libvirt. I'm still investigating why.

Comment 14 Federico Simoncelli 2012-12-17 12:51:44 UTC
At the moment VDSM needs to pause the VM for a short moment during the volume refresh (after an extend). In certain cases (or if the refresh takes a few seconds) the pause might also be reported to the engine. I already looked into removing this limitation (http://gerrit.ovirt.org/#/c/8382/) but it looks too risky for 3.1.
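
The sequence looks roughly like the sketch below (an assumed flow for illustration, not the actual VDSM code; it uses the libvirt Python bindings and LVM's lvchange --refresh, with the VG/LV names taken from the storage domain and volume UUIDs in the logs):

import subprocess

import libvirt


def refresh_extended_volume(dom, vg_name, lv_name):
    # Assumed flow, for illustration only: after a volume has been extended
    # on the SPM, the host running the VM refreshes the LV; the guest is
    # briefly suspended around the refresh, which can surface as the short
    # Up -> Paused -> Up transition seen in the engine log.
    dom.suspend()                                 # may be reported as Paused
    try:
        subprocess.check_call(
            ['lvchange', '--refresh', '%s/%s' % (vg_name, lv_name)])
    finally:
        dom.resume()                              # VM goes back to Up


if __name__ == '__main__':
    conn = libvirt.open('qemu:///system')
    dom = conn.lookupByName('RHEL6-4')            # VM name from the logs
    # In VDSM the VG name is the storage domain UUID and the LV name is the
    # volume UUID; the values below are taken from the logs in comment 10.
    refresh_extended_volume(dom,
                            'de22c4e0-4c8a-468a-8810-675fcb25efb1',
                            'f91ddf5d-c81f-4b45-9500-d3eff6aa317f')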

The important thing is that we don't hit ENOSPC/EOTHER errors and that we extend in time. Analyzing the logs, this part looks fine (no ENOSPC/EOTHER errors).

In my opinion we can either move this back to ON_QA for another round of testing or, based on the logs we already have (please double-check everything), mark it as verified.

Comment 15 Yaniv Kaul 2012-12-17 16:30:32 UTC
(In reply to comment #14)
> At the moment vdsm needs to pause the VM for a short moment during the
> volume refresh (after an extend). In certain cases (or if the refresh takes
> few seconds) it might be also reported to the engine. I already looked into
> removing this limitation (http://gerrit.ovirt.org/#/c/8382/) but it looks
> too risky for 3.1.
> 
> The important thing is that we don't hit enospc/eother errors and we extend
> in time. Analyzing the logs this part looks fine (no enospc/eother errors).
> 
> In my opinion we can either proceed to ON_QA again or eventually based on
> the logs we already have (please double check everything) mark it as
> verified.

The bug is still there.
You can revert the patch, which did nothing from the user's perspective; keep it (we did not see it causing regressions) and postpone the complete fix to the next version; or decide not to fix it and close it as WONTFIX.
But it cannot move back to QE nor be marked as VERIFIED.

Comment 16 Federico Simoncelli 2012-12-17 20:34:12 UTC
(In reply to comment #15)
> (In reply to comment #14)
> > [...]
> 
> The bug is still there.
> You can revert the patch that did nothing from user perspective, or keep it
> (we did not see it causing regressions) and postpone the complete fix to
> next version, or decide not to fix it and close-wontfix it.
> But it cannot move back to QE nor marked as VERIFIED.

In terms of what the process allows, I don't think we can keep code that doesn't fix a bug. If we can, I might be fine with closing/postponing the bug (even though I think it's a stretch and I dislike it).

That said, the bug was fixed (there are no ENOSPC/EOTHER errors); what you hit now is a problem that is also present during regular VM operations and is not related to live storage migration.

Your options are not viable:

1. Remove the patch and leave the bug open -> we're back to the original issue; the bug is still present and we reach an impasse because the working fix is blocked on something else that we cannot fix now.
2. Leave the patch and keep the bug open -> the issue is resolved but we have no formal trace of that (I don't even think it's allowed).

Option 2 would also require you to modify the bug accordingly, because it's not related to live storage migration anymore.

Comment 17 Ayal Baron 2012-12-17 23:24:53 UTC
Fede, is there anything in lsm which makes it more likely to hit this bug?

Comment 18 Federico Simoncelli 2012-12-18 09:04:54 UTC
(In reply to comment #17)
> Fede, is there anything in lsm which makes it more likely to hit this bug?

I couldn't spot anything; otherwise I would have agreed to fix it as a live storage migration bug. Obviously we're making two extension requests, but that's not the point; the issue is there regardless.

Comment 19 Ayal Baron 2012-12-19 10:02:34 UTC
As discussed with Kaul, marking as VERIFIED.
Dafna, please open a new bug on the other problem (comment 14).

Comment 21 errata-xmlrpc 2013-01-15 14:52:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-0159.html