Bug 1657762 - Extend volume failed with error "VM <vm_id> has been paused due to no Storage space error." and "VM <vm_id> is down with error. Exit message: Lost connection with qemu process."
Summary: Extend volume failed with error "VM <vm_id> has been paused due to no Storage...
Keywords:
Status: NEW
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Storage
Version: 4.2.8
Hardware: x86_64
OS: Unspecified
medium
medium vote
Target Milestone: ovirt-4.5.0
: ---
Assignee: Benny Zlotnik
QA Contact: Avihai
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2018-12-10 11:52 UTC by Yosi Ben Shimon
Modified: 2019-10-07 03:27 UTC (History)
3 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
oVirt Team: Storage
pm-rhel: ovirt-4.5?


Attachments (Terms of Use)
logs (764.07 KB, application/gzip)
2018-12-10 11:52 UTC, Yosi Ben Shimon
no flags Details
logs_2 (1.14 MB, application/zip)
2018-12-12 13:44 UTC, Yosi Ben Shimon
no flags Details
rhv-4.2.8.7_TestCase5063_logs (768.78 KB, application/zip)
2019-02-17 09:28 UTC, Avihai
no flags Details

Description Yosi Ben Shimon 2018-12-10 11:52:16 UTC
Created attachment 1513030 [details]
logs

Description of problem:
An automation test case failed to resize disk on iscsi storage domain.
Looks like the same issue as here:
https://access.redhat.com/solutions/3423481

From the engine log:

2018-12-05 02:59:39,250+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-2) [] VM 'bdc67223-59c9-4b53-96f4-7aeb5ca8852f'(vm_TestCase5063_0502555271) moved from 'Up' --> 'Paused'
2018-12-05 02:59:39,309+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-2) [] EVENT_ID: VM_PAUSED(1,025), VM vm_TestCase5063_0502555271 has been paused.
2018-12-05 02:59:39,320+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-2) [] EVENT_ID: VM_PAUSED_ENOSPC(138), VM vm_TestCase5063_0502555271 has been paused due to no Storage space error.
2018-12-05 02:59:55,130+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-30) [] VM 'bdc67223-59c9-4b53-96f4-7aeb5ca8852f' was reported as Down on VDS 'a3ebe986-710b-4d72-849a-8a8bd2f16738'(host_mixed_2)
2018-12-05 02:59:55,133+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-30) [] START, DestroyVDSCommand(HostName = host_mixed_2, DestroyVmVDSCommandParameters:{hostId='a3ebe986-710b-4d72-849a-8a8bd2f16738', vmId='bdc67223-59c9-4b53-96f4-7aeb5ca8852f', secondsToWait='0', gracefully='false', reason='', ignoreNoVm='true'}), log id: 22ff82b0
2018-12-05 02:59:55,138+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-30) [] FINISH, DestroyVDSCommand, log id: 22ff82b0
2018-12-05 02:59:55,138+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-30) [] VM 'bdc67223-59c9-4b53-96f4-7aeb5ca8852f'(vm_TestCase5063_0502555271) moved from 'Paused' --> 'Down'
2018-12-05 02:59:55,153+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-30) [] EVENT_ID: VM_DOWN_ERROR(119), VM vm_TestCase5063_0502555271 is down with error. Exit message: Lost connection with qemu process.


From VDSM log (same log block multiple times):

2018-12-05 02:59:40,953+0200 DEBUG (libvirt/events) [storage.TaskManager.Task] (Task='67f91f50-e668-4335-8939-68a8a03014df') moving from state init -> state preparing (task:602)
2018-12-05 02:59:40,953+0200 INFO  (libvirt/events) [vdsm.api] START sendExtendMsg(spUUID='c6dc7fd0-295a-4f1e-bc4b-ac3c2c7e767f', volDict={'newSize': 2147483648, 'domainID': 'e45546fb-005c-4c1c-b9be-4bde6fe7bbb9', 'name': 'vdb', 'poolID': 'c6dc7fd0-295a-4f1e-bc4b-ac3c2c7e767f', 'clock': <Clock(total=0.00*, extend-volume=0.00*)>, 'internal': False, 'volumeID': '83e6d3db-9410-4af2-b657-fc6640d49b99', 'imageID': 'b6cdbca9-10f4-4df3-aad7-005f1820d783'}, newSize=2147483648, callbackFunc=<bound method Vm.__afterVolumeExtension of <vdsm.virt.vm.Vm object at 0x7f25d6dd8ed0>>) from=internal, task_id=67f91f50-e668-4335-8939-68a8a03014df (api:46)
2018-12-05 02:59:40,953+0200 DEBUG (libvirt/events) [storage.SPM.Messages.Extend] new extend msg created: domain: e45546fb-005c-4c1c-b9be-4bde6fe7bbb9, volume: 83e6d3db-9410-4af2-b657-fc6640d49b99 (mailbox:123)
2018-12-05 02:59:40,954+0200 INFO  (libvirt/events) [vdsm.api] FINISH sendExtendMsg return=None from=internal, task_id=67f91f50-e668-4335-8939-68a8a03014df (api:52)
2018-12-05 02:59:40,954+0200 DEBUG (libvirt/events) [storage.TaskManager.Task] (Task='67f91f50-e668-4335-8939-68a8a03014df') finished: None (task:1201)
2018-12-05 02:59:40,954+0200 DEBUG (libvirt/events) [storage.TaskManager.Task] (Task='67f91f50-e668-4335-8939-68a8a03014df') moving from state preparing -> state finished (task:602)
2018-12-05 02:59:40,955+0200 DEBUG (libvirt/events) [storage.ResourceManager.Owner] Owner.releaseAll requests {} resources {} (resourceManager:910)
2018-12-05 02:59:40,955+0200 DEBUG (libvirt/events) [storage.ResourceManager.Owner] Owner.cancelAll requests {} (resourceManager:947)
2018-12-05 02:59:40,955+0200 DEBUG (libvirt/events) [storage.TaskManager.Task] (Task='67f91f50-e668-4335-8939-68a8a03014df') ref 0 aborting False (task:1002)
2018-12-05 02:59:40,963+0200 INFO  (libvirt/events) [virt.vm] (vmId='bdc67223-59c9-4b53-96f4-7aeb5ca8852f') abnormal vm stop device ua-b6cdbca9-10f4-4df3-aad7-005f1820d783 error enospc (vm:5158)
2018-12-05 02:59:40,963+0200 INFO  (libvirt/events) [virt.vm] (vmId='bdc67223-59c9-4b53-96f4-7aeb5ca8852f') CPU stopped: onIOError (vm:6199)
2018-12-05 02:59:40,968+0200 INFO  (libvirt/events) [virt.vm] (vmId='bdc67223-59c9-4b53-96f4-7aeb5ca8852f') Requesting extension for volume 83e6d3db-9410-4af2-b657-fc6640d49b99 on domain e45546fb-005c-4c1c-b9be-4bde6fe7bbb9 (apparent: 1073741824, capacity: 2147483648, allocated: 1078329344, physical: 1073741824 threshold_state: exceeded) (vm:1277)

.....

2018-12-05 02:59:46,014+0200 INFO  (mailbox-hsm/3) [virt.vm] (vmId='bdc67223-59c9-4b53-96f4-7aeb5ca8852f') Extend volume 83e6d3db-9410-4af2-b657-fc6640d49b99 completed <Clock(total=4.59, extend-volume=4.03, refresh-volume=0.38)> (vm:1400)
2018-12-05 02:59:46,015+0200 INFO  (mailbox-hsm/3) [virt.vm] (vmId='bdc67223-59c9-4b53-96f4-7aeb5ca8852f') setting block threshold to 1610612736 bytes for drive 'vdb' (apparentsize 2147483648) (drivemonitor:108)
2018-12-05 02:59:46,015+0200 ERROR (mailbox-hsm/3) [storage.TaskManager.Task] (Task='e9d2f78f-bb41-48e8-81e9-0296c9903c1e') Unexpected error (task:875)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 1406, in __afterVolumeExtension
    self._update_drive_volume_size(drive, volSize)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 1422, in _update_drive_volume_size
    self.drive_monitor.set_threshold(drive, volsize.apparentsize)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/drivemonitor.py", line 113, in set_threshold
    self._vm._dom.setBlockThreshold(drive.name, threshold)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 47, in __getattr__
    % self.vmid)
NotConnectedError: VM 'bdc67223-59c9-4b53-96f4-7aeb5ca8852f' was not defined yet or was undefined
2018-12-05 02:59:46,114+0200 DEBUG (mailbox-hsm/3) [storage.TaskManager.Task] (Task='e9d2f78f-bb41-48e8-81e9-0296c9903c1e') Task._run: e9d2f78f-bb41-48e8-81e9-0296c9903c1e ({'newSize': 2147483648, 'domainID': 'e45546fb-005c-4c1c-b9be-4bde6fe7bbb9', 'name': 'vdb', 'poolID': 'c6dc7fd0-295a-4f1e-bc4b-ac3c2c7e767f', 'clock': <Clock(total=4.59, extend-volume=4.03, refresh-volume=0.38)>, 'internal': False, 'volumeID': '83e6d3db-9410-4af2-b657-fc6640d49b99', 'imageID': 'b6cdbca9-10f4-4df3-aad7-005f1820d783'},) {} failed - stopping task (task:894)
2018-12-05 02:59:46,114+0200 DEBUG (mailbox-hsm/3) [storage.TaskManager.Task] (Task='e9d2f78f-bb41-48e8-81e9-0296c9903c1e') stopping in state preparing (force False) (task:1256)
2018-12-05 02:59:46,114+0200 DEBUG (mailbox-hsm/3) [storage.TaskManager.Task] (Task='e9d2f78f-bb41-48e8-81e9-0296c9903c1e') ref 1 aborting True (task:1002)
2018-12-05 02:59:46,115+0200 INFO  (mailbox-hsm/3) [storage.TaskManager.Task] (Task='e9d2f78f-bb41-48e8-81e9-0296c9903c1e') aborting: Task is aborted: u"VM 'bdc67223-59c9-4b53-96f4-7aeb5ca8852f' was not defined yet or was undefined" - code 100 (task:1181)

...

2018-12-05 02:59:46,117+0200 ERROR (mailbox-hsm/3) [storage.ThreadPool.WorkerThread] FINISH task 83439718-dd74-475e-8e91-9fd22ded39cd failed (callback=None, cmd=<function runTask at 0x7f260970d7d0>, args=(<bound method Vm.__afterVolumeExtension of <vdsm.virt.vm.Vm object at 0x7f25d6dd8ed0>>, {'newSize': 2147483648, 'domainID': 'e45546fb-005c-4c1c-b9be-4bde6fe7bbb9', 'name': 'vdb', 'poolID': 'c6dc7fd0-295a-4f1e-bc4b-ac3c2c7e767f', 'clock': <Clock(total=4.59, extend-volume=4.03, refresh-volume=0.38)>, 'internal': False, 'volumeID': '83e6d3db-9410-4af2-b657-fc6640d49b99', 'imageID': 'b6cdbca9-10f4-4df3-aad7-005f1820d783'})) (threadPool:222)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/threadPool.py", line 209, in _processNextTask
    cmd(args)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/mailbox.py", line 80, in runTask
    ctask.prepare(cmd, *args)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 108, in wrapper
    return m(self, *a, **kw)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 1189, in prepare
    raise self.error
NotConnectedError: VM 'bdc67223-59c9-4b53-96f4-7aeb5ca8852f' was not defined yet or was undefined



Version-Release number of selected component (if applicable):
ovirt-engine-4.2.8-0.1.el7ev.noarch
vdsm-4.20.44-1.el7ev.x86_64
libvirt-4.5.0-10.el7_6.3.x86_64
qemu-img-rhev-2.12.0-18.el7_6.1.x86_64


How reproducible:
Unknown as I failed to reproduce it both manually and using automation.

Steps to Reproduce:
1.
2.
3.

Actual results:
The resize failed.

Expected results:
The disk should have been resized successfully.

Additional info:
Attached relevant logs.

Comment 1 Yosi Ben Shimon 2018-12-12 13:43:49 UTC
Same happened again in other test case when extending the disk.
Errors a bit different but Looks like it's related to the same issue.

From the engine log:

2018-12-10 13:45:40,367+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-59) [] VM '460e606c-fb09-4a21-aab8-6165490ebeb9'(vm_TestCase5061_1013432638) moved from 'Up' --> 'Paused'
2018-12-10 13:45:40,386+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-59) [] EVENT_ID: VM_PAUSED(1,025), VM vm_TestCase5061_1013432638 has been paused.


From the VDSM log:

2018-12-10 13:45:55,425+0200 ERROR (jsonrpc/2) [jsonrpc.JsonRpcServer] Internal server error (__init__:611)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 606, in _handle_request
    res = method(**params)
  File "/usr/lib/python2.7/site-packages/vdsm/rpc/Bridge.py", line 197, in _dynamicMethod
    result = fn(*methodArgs)
  File "<string>", line 2, in dumpxmls
  File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 48, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/API.py", line 1430, in dumpxmls
    for vmId in vmList}
  File "/usr/lib/python2.7/site-packages/vdsm/API.py", line 1430, in <dictcomp>
    for vmId in vmList}
KeyError: u'460e606c-fb09-4a21-aab8-6165490ebeb9'


Attaching logs for this TC execution (logs_2.zip)

Comment 2 Yosi Ben Shimon 2018-12-12 13:44:40 UTC
Created attachment 1513679 [details]
logs_2

Comment 3 Yosi Ben Shimon 2018-12-19 13:50:36 UTC
Steps to reproduce according to the test cases:

TestCase5063 (the 1st test case - bug Description):
1. Create VM and start it
2. Add thin provision qcow disk of size 1G and attach to the VM
3. Extend the disk by 1G
4. Start to write using dd
5. Check that the disk is actually growing


TestCase5061 (2nd test case - comment #1):
1. Create VM and start it
2. Add multiple disk permutation of size 1G, attach them to the VM and activate
3. Create snapshot (with memory)
4. Extend each disk by 1G (synchronized)

Comment 4 Sandro Bonazzola 2019-01-28 09:34:04 UTC
This bug has not been marked as blocker for oVirt 4.3.0.
Since we are releasing it tomorrow, January 29th, this bug has been re-targeted to 4.3.1.

Comment 5 Avihai 2019-02-17 09:27:16 UTC
Another reproduction of this issue on rhv-4.2.8-7(ovirt-engine-4.2.8.3-0.1.el7ev.noarch) automation tier1 TestCase5063.
Logs are attached.

Engine log:
2019-02-10 06:57:34,272+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-7) [] EVENT_ID: VM_PAUSED_ENOSPC(138), VM vm_TestCase5063_1006540117 has been paused due to no Storage space error.
2019-02-10 06:57:50,637+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-63) [] VM '66596aa9-4775-4416-94ed-abf01f7dd178' was reported as Down on VDS '950c516f-2c3e-4943-8525-b0a519997293'(host_mixed_2)
2019-02-10 06:57:50,638+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-63) [] START, DestroyVDSCommand(HostName = host_mixed_2, DestroyVmVDSCommandParameters:{hostId='950c516f-2c3e-4943-8525-b0a519997293', vmId='66596aa9-4775-4416-94ed-abf01f7dd178', secondsToWait='0', gracefully='false', reason='', ignoreNoVm='true'}), log id: 69d79ebd
2019-02-10 06:57:50,644+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-63) [] FINISH, DestroyVDSCommand, log id: 69d79ebd
2019-02-10 06:57:50,644+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-63) [] VM '66596aa9-4775-4416-94ed-abf01f7dd178'(vm_TestCase5063_1006540117) moved from 'Paused' --> 'Down'
2019-02-10 06:57:50,661+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-63) [] EVENT_ID: VM_DOWN_ERROR(119), VM vm_TestCase5063_1006540117 is down with error. Exit message: Lost connection with qemu process.

Comment 6 Avihai 2019-02-17 09:28:05 UTC
Created attachment 1535639 [details]
rhv-4.2.8.7_TestCase5063_logs


Note You need to log in before you can comment on or make changes to this bug.