Bug 883939
Summary: | 3.1.z - vdsm [Storage Live Migration]: VM changes state to paused for a few seconds during storage live migration
---|---
Product: | Red Hat Enterprise Linux 6
Component: | vdsm
Version: | 6.3
Hardware: | x86_64
OS: | Linux
Status: | CLOSED ERRATA
Severity: | urgent
Priority: | urgent
Reporter: | Chris Pelland <cpelland>
Assignee: | Federico Simoncelli <fsimonce>
QA Contact: | Dafna Ron <dron>
CC: | abaron, amureini, bazulay, cpelland, dron, fsimonce, hateya, iheim, ilvovsky, lpeer, pm-eus, sgordon, vvyazmin, ykaul
Target Milestone: | rc
Target Release: | 6.3
Keywords: | ZStream
Whiteboard: | storage
Fixed In Version: | vdsm-4.9.6-44.2
Doc Type: | Bug Fix
Doc Text: | Virtual machines were found to sometimes pause for several seconds during storage live migration. To prevent this, VDSM has been updated to increase the amount of storage space requested by each call to extend the storage volumes involved in the migration. This allows storage live migration to occur without pausing the virtual machines.
Bug Depends On: | 873145
Bug Blocks: | 896511
Clones: | 896511 (view as bug list)
Last Closed: | 2013-01-15 14:52:42 UTC
Description (Chris Pelland, 2012-12-05 15:46:49 UTC)
```
commit 679b45308ac69a5baed98318a11c16771e339336
Author: Federico Simoncelli <fsimonce>
Date:   Mon Nov 12 08:18:12 2012 -0500

    vm: increase the volume extension on storage migration

    During live migration VDSM needs to subsequently extend two volumes
    instead of one; by doubling the size of the chunk to extend, the
    watermark limit is doubled and VDSM has more time to accomplish the
    operations.

    Change-Id: Ib61375613712feb7118a80c50b73e678d257f251
    Signed-off-by: Federico Simoncelli <fsimonce>
```

http://gerrit.ovirt.org/#/c/9200/

Federico, please review the 'Doc Text' to ensure it remains an accurate reflection of the change; I have attempted to write it in a way that will make sense to customers.

The "Doc Text" is accurate.

Created attachment 661413 [details]
Logs: vdsm, rhevm, libvirt, screen-shots
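The fix described in the commit above is essentially a sizing change in the thin-provisioning extension logic. A minimal sketch of the idea, assuming hypothetical names and values (`DEFAULT_CHUNK_MB`, `WATERMARK_RATIO`, `Drive`, `new_size_mb`) rather than the actual vdsm identifiers:

```python
# Illustrative sketch only: the constants and names below are hypothetical,
# not the actual vdsm identifiers or values.

DEFAULT_CHUNK_MB = 1024   # how much a thin-provisioned volume grows per extend
WATERMARK_RATIO = 0.5     # request an extend when free space drops below
                          # this fraction of the chunk size


class Drive:
    def __init__(self, apparent_size_mb, allocated_mb, is_replicating=False):
        self.apparent_size_mb = apparent_size_mb   # current LV size
        self.allocated_mb = allocated_mb           # highest written offset (watermark)
        self.is_replicating = is_replicating       # True during storage live migration


def chunk_mb(drive):
    """Double the extension chunk while the drive is being replicated,
    since both the source and the destination volume must be extended."""
    return DEFAULT_CHUNK_MB * 2 if drive.is_replicating else DEFAULT_CHUNK_MB


def needs_extension(drive):
    """True once free space falls below the watermark limit."""
    free_mb = drive.apparent_size_mb - drive.allocated_mb
    return free_mb < chunk_mb(drive) * WATERMARK_RATIO


def new_size_mb(drive):
    """Size to request from the SPM on the next extend."""
    return drive.apparent_size_mb + chunk_mb(drive)


# Example: a replicating 4 GiB drive with 3.5 GiB written triggers an
# extension request for 6 GiB (4 GiB + a doubled 2 GiB chunk).
drive = Drive(apparent_size_mb=4096, allocated_mb=3584, is_replicating=True)
print(needs_extension(drive), new_size_mb(drive))   # -> True 6144
```

Doubling the chunk for a replicating drive also doubles the free-space threshold at which the next request fires, which is presumably what the commit message means by the watermark limit being doubled: VDSM gets more headroom, and therefore more time, to extend both volumes before the guest hits ENOSPC.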
Failed on RHEVM 3.1 - SI25 environment.

RHEVM: rhevm-3.1.0-33.el6ev.noarch
VDSM: vdsm-4.9.6-44.2.el6_3.x86_64
LIBVIRT: libvirt-0.9.10-21.el6_3.6.x86_64
QEMU & KVM: qemu-kvm-rhev-0.12.1.2-2.295.el6_3.8.x86_64
SANLOCK: sanlock-2.3-4.el6_3.x86_64

The storage I used was FC. The VM has one disk with the OS installed.
Time: ~14:17. VM name: vm-T-RHEL-2000

In the log there are several extend requests with the correct size:

```
Thread-5144::DEBUG::2012-12-11 14:21:17,287::vm::760::vm.Vm::(extendDriveVolume) vmId=`f8bd9fd3-b4fc-4d12-8743-83437c623c2e`::Requesting an extension for the volume replication: {'newSize': 5120, 'domainID': 'de22c4e0-4c8a-468a-8810-675fcb25efb1', 'name': u'vda', 'volumeID': 'f91ddf5d-c81f-4b45-9500-d3eff6aa317f', 'imageID': 'd395408a-30a3-466f-8a4b-4f0506232c20', 'poolID': '3622182b-86cd-4812-ba02-5cd42d2d1c38'}
```

But they're never satisfied by the SPM (same machine) because a spmStop was issued:

```
Thread-4119::INFO::2012-12-11 13:50:51,999::logUtils::37::dispatcher::(wrapper) Run and protect: spmStop(spUUID='3622182b-86cd-4812-ba02-5cd42d2d1c38', options=None)
[...]
Dummy-918::INFO::2012-12-11 13:50:54,865::storage_mailbox::692::Storage.MailBox.SpmMailMonitor::(run) SPM_MailMonitor - Incoming mail monitoring thread stopped
[...]
Thread-4122::INFO::2012-12-11 13:50:56,143::logUtils::37::dispatcher::(wrapper) Run and protect: getSpmStatus(spUUID='3622182b-86cd-4812-ba02-5cd42d2d1c38', options=None)
Thread-4122::INFO::2012-12-11 13:50:56,144::logUtils::39::dispatcher::(wrapper) Run and protect: getSpmStatus, Return response: {'spm_st': {'spmId': -1, 'spmStatus': 'Free', 'spmLver': 0}}
```

The engine stopped the SPM because the host wasn't able to see the domain 72520460-b313-47a8-809f-0b63659597c1:

```
Thread-4116::ERROR::2012-12-11 13:50:51,912::task::853::TaskManager.Task::(_setError) Task=`60ff069b-6e46-437f-bb56-d068c8cb2281`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 861, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 38, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 1008, in attachStorageDomain
    pool.attachSD(sdUUID)
  File "/usr/share/vdsm/storage/securable.py", line 63, in wrapper
    return f(self, *args, **kwargs)
  File "/usr/share/vdsm/storage/sp.py", line 932, in attachSD
    dom = sdCache.produce(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 97, in produce
    domain.getRealDomain()
  File "/usr/share/vdsm/storage/sdc.py", line 52, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 121, in _realProduce
    domain = self._findDomain(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 152, in _findDomain
    raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist: ('72520460-b313-47a8-809f-0b63659597c1',)
```

```
2012-12-11 13:50:53,609 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (ajp-/127.0.0.1:8702-5) [3a58b15a] Failed in AttachStorageDomainVDS method
2012-12-11 13:50:53,610 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (ajp-/127.0.0.1:8702-5) [3a58b15a] Error code StorageDomainDoesNotExist and error message IRSGenericException: IRSErrorException: Failed to AttachStorageDomainVDS, error = Storage domain does not exist: ('72520460-b313-47a8-809f-0b63659597c1',)
2012-12-11 13:50:53,610 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (ajp-/127.0.0.1:8702-5) [3a58b15a] IrsBroker::Failed::AttachStorageDomainVDS due to: IRSErrorException: IRSGenericException: IRSErrorException: Failed to AttachStorageDomainVDS, error = Storage domain does not exist: ('72520460-b313-47a8-809f-0b63659597c1',)
2012-12-11 13:50:53,647 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (ajp-/127.0.0.1:8702-5) [3a58b15a] START, SpmStopVDSCommand(HostName = green-vdsb, HostId = 5e1f2e6c-437a-11e2-8a95-001a4a1697a6, storagePoolId = 3622182b-86cd-4812-ba02-5cd42d2d1c38), log id: 6345e175
2012-12-11 13:50:53,667 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (ajp-/127.0.0.1:8702-5) [3a58b15a] SpmStopVDSCommand::Stopping SPM on vds green-vdsb, pool id 3622182b-86cd-4812-ba02-5cd42d2d1c38
2012-12-11 13:50:54,212 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (ajp-/127.0.0.1:8702-5) [3a58b15a] FINISH, SpmStopVDSCommand, log id: 6345e175
```

Moving back to ON_QA to be tested with an active SPM.

Tested on vdsm-4.9.6-44.2.el6_3.x86_64. I am still seeing the status change from Up to Paused (will attach full logs). Look at domain 019c3116-e094-45ce-a974-ab7c5059d849:

```
2012-12-12 18:19:40,719 INFO [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (QuartzScheduler_Worker-48) VM RHEL6-4 019c3116-e094-45ce-a974-ab7c5059d849 moved from Up --> Paused
2012-12-12 18:19:40,996 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.DeleteImageGroupVDSCommand] (pool-4-thread-43) START, DeleteImageGroupVDSCommand( storagePoolId = 6d873e0e-80d5-4390-a3dd-f77f59bfdb09, ignoreFailoverLimit = false, compatabilityVersion = null, storageDomainId = a738573e-35dd-41ee-bd5a-c83e6f5a0f7e, imageGroupId = 2392ab0f-864d-4ef5-8495-24b9b0771f6c, postZeros = false, forceDelete = false), log id: 28dbf282
2012-12-12 18:19:42,926 INFO [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (QuartzScheduler_Worker-58) VM RHEL6-4 019c3116-e094-45ce-a974-ab7c5059d849 moved from Paused --> Up
2012-12-12 18:19:43,805 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.SyncImageGroupDataVDSCommand] (pool-4-thread-44) FINISH, SyncImageGroupDataVDSCommand, log id: 50679c28
```

Created attachment 662485 [details]
logs
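Before going through the new logs, a note on method: the SPM analysis above boils down to comparing the timestamps of the extension requests with the SPM start/stop events. A small, hypothetical helper (not part of vdsm) that pulls both out of a vdsm.log excerpt; the patterns are based on the lines quoted above, and real logs may need broader expressions:

```python
import re

# Hypothetical log-triage helper: collect volume-extension requests and SPM
# start/stop commands from a vdsm.log excerpt and sort them by timestamp so
# gaps like the one described above stand out.
PATTERNS = {
    "extend-request": re.compile(r"Requesting an extension for the volume"),
    "spm-command": re.compile(r"Run and protect: spm(Start|Stop)\("),
}


def scan(path="vdsm.log"):
    events = []
    with open(path, errors="replace") as log:
        for line in log:
            for kind, pattern in PATTERNS.items():
                if pattern.search(line):
                    # vdsm lines look like "Thread-N::LEVEL::timestamp::module::..."
                    timestamp = line.split("::")[2]
                    events.append((timestamp, kind, line.strip()))
    return sorted(events)


if __name__ == "__main__":
    for timestamp, kind, line in scan():
        print(timestamp, kind, line[:100])
```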
From what I see, domain 019c3116-e094-45ce-a974-ab7c5059d849 started the live storage migration at:

```
Thread-2329::DEBUG::2012-12-12 18:19:36,293::BindingXMLRPC::894::vds::(wrapper) client [10.35.97.65]::call vmDiskReplicateStart with ('019c3116-e094-45ce-a974-ab7c5059d849', {'device': 'disk', 'domainID': 'a738573e-35dd-41ee-bd5a-c83e6f5a0f7e', 'volumeID': '04e82649-3f17-4752-86cf-94cc59f12577', 'poolID': '6d873e0e-80d5-4390-a3dd-f77f59bfdb09', 'imageID': '0f7a0f00-5f64-4640-94b2-e5f2786e9ab9'}, {'device': 'disk', 'domainID': 'f1ec1456-736f-4fa6-8b8e-639ed994bd25', 'volumeID': '04e82649-3f17-4752-86cf-94cc59f12577', 'poolID': '6d873e0e-80d5-4390-a3dd-f77f59bfdb09', 'imageID': '0f7a0f00-5f64-4640-94b2-e5f2786e9ab9'}) {}
```

And the VM paused 3 seconds later (too soon to be an ENOSPC):

```
Thread-2333::DEBUG::2012-12-12 18:19:39,992::BindingXMLRPC::900::vds::(wrapper) return vmGetStats with {'status': {'message': 'Done', 'code': 0}, 'statsList': [{'status': 'Paused', 'username': 'Unknown', 'memUsage': '0', 'acpiEnable': 'true', 'pid': '28387', 'displayIp': '0', 'displayPort': u'5901', 'session': 'Unknown', 'displaySecurePort': '-1', 'timeOffset': '-43200', 'hash': '5104074509452472763', 'pauseCode': 'NOERR', 'clientIp': '', 'kvmEnable': 'true', 'network': {u'vnet1': {'macAddr': '00:1a:4a:23:61:72', 'rxDropped': '0', 'rxErrors': '0', 'txDropped': '0', 'txRate': '0.0', 'rxRate': '0.0', 'txErrors': '0', 'state': 'unknown', 'speed': '1000', 'name': u'vnet1'}}, 'vmId': '019c3116-e094-45ce-a974-ab7c5059d849', 'displayType': 'vnc', 'cpuUser': '0.20', 'disks': {u'hdc': {'flushLatency': '0', 'readLatency': '0', 'writeLatency': '0'}, u'hda': {'readLatency': '606963', 'apparentsize': '1073741824', 'writeLatency': '-3475122', 'imageID': '0f7a0f00-5f64-4640-94b2-e5f2786e9ab9', 'flushLatency': '180835656', 'readRate': '-1168154.53', 'truesize': '1073741824', 'writeRate': '-41718.19'}}, 'monitorResponse': '0', 'statsAge': '1.45', 'cpuIdle': '99.80', 'elapsedTime': '402', 'vmType': 'kvm', 'cpuSys': '0.00', 'appsList': [], 'guestIPs': '', 'nice': ''}]}
```

Digging in some libvirt logs that weren't attached:

```
2012-12-12 16:19:37.533+0000: 2067: debug : qemuMonitorSend:823 : QEMU_MONITOR_SEND_MSG: mon=0x7fec80001e40 msg={"execute":"__com.redhat_drive-mirror","arguments":{"device":"drive-ide0-0-0","target":"/rhev/data-center/6d873e0e-80d5-4390-a3dd-f77f59bfdb09/f1ec1456-736f-4fa6-8b8e-639ed994bd25/images/0f7a0f00-5f64-4640-94b2-e5f2786e9ab9/04e82649-3f17-4752-86cf-94cc59f12577","full":false,"speed":0,"mode":"existing","format":"qcow2"},"id":"libvirt-213"}
2012-12-12 16:19:39.608+0000: 2066: debug : virDomainSuspend:2379 : dom=0x7fec98001b20, (VM: name=RHEL6-4, uuid=019c3116-e094-45ce-a974-ab7c5059d849)
2012-12-12 16:19:39.608+0000: 2066: debug : qemuDomainObjBeginJobInternal:754 : Starting job: suspend (async=none)
2012-12-12 16:19:39.624+0000: 2066: debug : qemuMonitorRef:201 : QEMU_MONITOR_REF: mon=0x7fec80001e40 refs=3
2012-12-12 16:19:39.624+0000: 2066: debug : qemuMonitorStopCPUs:1081 : mon=0x7fec80001e40
2012-12-12 16:19:39.624+0000: 2066: debug : qemuMonitorSend:823 : QEMU_MONITOR_SEND_MSG: mon=0x7fec80001e40 msg={"execute":"stop","id":"libvirt-215"}
```

For some reason a virDomainSuspend was issued to libvirt. I'm still investigating why.

At the moment vdsm needs to pause the VM for a short moment during the volume refresh (after an extend). In certain cases (or if the refresh takes a few seconds) it might also be reported to the engine.
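To make that last point concrete, here is a rough sketch of the pause/refresh/resume sequence being described, written with libvirt-python; the helper name `refresh_volume` is a placeholder, and this is an illustration of the observable behaviour, not the actual vdsm code path:

```python
import libvirt

# Illustration only: vdsm's real implementation differs, but the observable
# effect is the same - the guest CPUs are stopped for the duration of the
# volume refresh that follows an LV extension.


def refresh_volume(volume_id):
    """Placeholder for the storage-side refresh (e.g. re-reading the LV size)."""
    pass


def refresh_after_extend(dom, volume_id):
    paused = False
    try:
        if dom.isActive():
            dom.suspend()          # libvirt sends "stop" to QEMU -> VM shows Paused
            paused = True
        refresh_volume(volume_id)  # if this takes a few seconds, the engine can
                                   # observe and report the Paused state
    finally:
        if paused:
            dom.resume()           # VM goes back to Up


if __name__ == "__main__":
    conn = libvirt.open("qemu:///system")
    dom = conn.lookupByName("RHEL6-4")   # VM name taken from the logs above
    refresh_after_extend(dom, "04e82649-3f17-4752-86cf-94cc59f12577")
```

Removing this brief pause was already investigated (http://gerrit.ovirt.org/#/c/8382/) but, as noted next, it was judged too risky for 3.1.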
I already looked into removing this limitation (http://gerrit.ovirt.org/#/c/8382/) but it looks too risky for 3.1.

The important thing is that we don't hit ENOSPC/EOTHER errors and that we extend in time. Analyzing the logs, this part looks fine (no ENOSPC/EOTHER errors).

In my opinion we can either proceed to ON_QA again or, based on the logs we already have (please double check everything), mark it as verified.

(In reply to comment #14)

The bug is still there. You can revert the patch, which did nothing from the user's perspective; or keep it (we did not see it causing regressions) and postpone the complete fix to the next version; or decide not to fix it and close it as WONTFIX. But it cannot move back to QE nor be marked as VERIFIED.

(In reply to comment #15)

In terms of what the process allows, I don't think we can keep code that doesn't fix a bug. If we can, I might be fine with closing/postponing the bug (even though I think it's a stretch and I dislike it).

That said, the bug was fixed (there are no ENOSPC/EOTHER errors); what you hit now is a problem that is also present during regular VM operations and is not related to live storage migration.

Your options are not viable:

1. Remove the patch, bug open -> we're back to the original issue; the bug is still present and we've reached an impasse because the working fix is blocked on something else that we cannot fix now.
2. Leave the patch, bug open -> the issue is resolved but we have no formal trace of that (I don't even think it's allowed).

Option 2 would also require you to modify the bug accordingly, because it's not related to live storage migration anymore.

Fede, is there anything in LSM which makes it more likely to hit this bug?

(In reply to comment #17)
> Fede, is there anything in LSM which makes it more likely to hit this bug?

I couldn't spot anything, otherwise I would have agreed on fixing it as a live storage migration bug.
Obviously we're making two extension requests, but that's not the point; the issue is there regardless.

As discussed with Kaul, marking as VERIFIED. Dafna, please open a new bug on the other problem (comment 14).

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-0159.html