Description of problem:
After upgrading a host from vdsm-4.19.31 to vdsm-4.20.9 and migrating several VMs to it, I started hitting issues: first the VMs changed to Unknown status, then the host went non-responsive. After that I got several errors like this in the event log:

VDSM ovirt-srv01 command GetCapabilitiesVDS failed: Not enough resources: {'reason': 'Too many tasks', 'resource': 'jsonrpc', 'current_tasks': 80}

Later VDSM restarted and the host went back online.

Version-Release number of selected component (if applicable):
vdsm-4.20.9-1.el7.centos.x86_64

How reproducible:
I hit this after live-migrating 7 VMs residing on an iSCSI storage domain.

Steps to Reproduce:
1. Upgrade a host to vdsm-4.20.9
2. Live-migrate VMs from a vdsm-4.19.31 host

Actual results:
VDSM goes unresponsive, supposedly due to an unexpectedly high number of tasks.

Expected results:
Live migration finishes successfully.

Additional info:
Will attach the VDSM log in a bit.
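To make the failure mode concrete, here is a minimal sketch of the kind of admission control that produces an error like the one above, assuming a hard cap on concurrent jsonrpc tasks. The names (Dispatcher, Throttled, TASK_LIMIT) are illustrative, not vdsm's actual API:

import threading

TASK_LIMIT = 80  # the cap hit above: 'current_tasks': 80

class Throttled(Exception):
    pass

class Dispatcher:
    """Tracks in-flight tasks and rejects new ones past a cap."""

    def __init__(self, limit=TASK_LIMIT):
        self._limit = limit
        self._current = 0
        self._lock = threading.Lock()

    def dispatch(self, task):
        with self._lock:
            if self._current >= self._limit:
                # Surfaces to the caller much like the event-log error:
                # {'reason': 'Too many tasks', 'resource': 'jsonrpc',
                #  'current_tasks': 80}
                raise Throttled("Too many tasks: %d" % self._current)
            self._current += 1
        try:
            task()
        finally:
            with self._lock:
                self._current -= 1

Once workers stop draining tasks, the in-flight count never comes back down, so every subsequent request (including GetCapabilitiesVDS from the engine) is rejected.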
I can reproduce this by migrating a few more VMs to the host, so it does seem related to live migration. On a side note, attempts to migrate VMs off vdsm-4.20.9 back to vdsm-4.19.31 cause the VMs to crash, but that's probably material for another BZ.
It seems to be related to the well-known MOM issue:

2017-12-06 16:07:42,853+0000 WARN (vdsm.Scheduler) [Executor] Worker blocked: <Worker name=jsonrpc/4 running <Task <JsonRpcTask {'params': {}, 'jsonrpc': '2.0', 'method': u'Host.getAllVmIoTunePolicies', 'id': u'55585710-3120-4d94-86c8-3383db23f794'} at 0x3f17990> timeout=60, duration=60 at 0x3f17290> task#=930 at 0x3233290> (executor:358)
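For illustration, this is roughly how a scheduler detects a blocked worker like the one in the WARN above: each worker records when its current task started, and a watchdog warns once the task has run past its timeout. A minimal sketch with illustrative names (Worker, watchdog), not vdsm's actual Executor code:

import logging
import time

class Worker:
    """Runs one task at a time and records when it started."""

    def __init__(self, name, timeout=60):
        self.name = name
        self.timeout = timeout
        self.task = None
        self.started = None

    def run(self, task):
        self.task, self.started = task, time.monotonic()
        try:
            task()
        finally:
            self.task = self.started = None

def watchdog(workers, interval=10):
    # Scan workers periodically; warn about any task that has been
    # running longer than its timeout, i.e. a blocked worker.
    while True:
        for w in workers:
            started = w.started
            if started is not None:
                duration = time.monotonic() - started
                if duration >= w.timeout:
                    logging.warning(
                        "Worker blocked: <Worker name=%s running %r "
                        "timeout=%d, duration=%d>",
                        w.name, w.task, w.timeout, duration)
        time.sleep(interval)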
Yes, but supposedly that's happening due to the worker queue being full of DriveWatermarkMonitors:

<Executor periodic workers=4 max_workers=30 <TaskQueue periodic max_tasks=400 tasks(400)

Could be a regression in high watermark events; it might also be related to bug 1522901 as a trigger. Evgheni, you can try to rule out or confirm the watermark changes by flipping the enable_block_threshold_event option in vdsm.conf to false.
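To spell out the mechanism: the periodic executor has a bounded queue (max_tasks=400 above), and every monitoring cycle enqueues a watermark check per running VM. If the 4 workers are blocked, a few cycles over many VMs are enough to hit the cap, after which all submissions fail. A minimal sketch, assuming illustrative names (PeriodicQueue, TooManyTasks, check_drive_watermark); not vdsm's actual implementation:

import queue

class TooManyTasks(Exception):
    pass

class PeriodicQueue:
    """Bounded queue for periodic operations, like max_tasks=400 above."""

    def __init__(self, max_tasks=400):
        self._tasks = queue.Queue(maxsize=max_tasks)

    def submit(self, task):
        try:
            self._tasks.put_nowait(task)
        except queue.Full:
            raise TooManyTasks("periodic queue full (%d tasks)"
                               % self._tasks.qsize())

def check_drive_watermark(vm):
    pass  # stand-in for the real per-drive high-watermark check

def monitor_cycle(q, vms):
    # One periodic cycle enqueues one watermark check per running VM.
    # With blocked workers, the queue only ever grows.
    for vm in vms:
        q.submit(lambda vm=vm: check_drive_watermark(vm))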
(In reply to Michal Skrivanek from comment #4)
> yes, but supposedly that's happening due to the worker queue being full of
> DriveWatermarkMonitors

I remember asking for a feature to dump all workers when the Q is full... No idea where it is.

> <Executor periodic workers=4 max_workers=30 <TaskQueue periodic
> max_tasks=400 tasks(400)
>
> Could be a regression in high watermark events, might be also related to bug
> 1522901 as a trigger. Evgheni, you can try to rule out or confirm the
> watermark changes by flipping the enable_block_threshold_event option in
> vdsm.conf to false
(In reply to Michal Skrivanek from comment #4)
> Could be a regression in high watermark events, might be also related to bug
> 1522901 as a trigger. Evgheni, you can try to rule out or confirm the
> watermark changes by flipping the enable_block_threshold_event option in
> vdsm.conf to false

I set enable_block_threshold_event = false in the [vars] section of vdsm.conf and sent an inbound live migration to force a restart of VDSM on the host; I assume at that point it re-reads the config file. In any case, nothing seems to have changed: after the host came back up, incoming migrations still cause VDSM restarts.
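For reference, the change described above amounts to this fragment of /etc/vdsm/vdsm.conf (picked up when vdsmd restarts):

[vars]
enable_block_threshold_event = false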
Patch applied to affected host. Incoming migration no longer causes VDSM to restart.
(In reply to Yaniv Kaul from comment #5)
> (In reply to Michal Skrivanek from comment #4)
> > yes, but supposedly that's happening due to the worker queue being full of
> > DriveWatermarkMonitors
>
> I remember asking for a feature to dump all workers when the Q is full...
> No idea where it is.

So that's there; in the log you can see it's full of DriveWatermarkMonitor threads.
(In reply to Michal Skrivanek from comment #10)
> (In reply to Yaniv Kaul from comment #5)
> > (In reply to Michal Skrivanek from comment #4)
> > > yes, but supposedly that's happening due to the worker queue being full of
> > > DriveWatermarkMonitors
> >
> > I remember asking for a feature to dump all workers when the Q is full...
> > No idea where it is.
>
> So that's there, in the log you can see it's full of DriveWatermarkMonitor
> threads

Yep: https://gerrit.ovirt.org/#/c/81624/
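Sketching the idea behind that change (illustrative names, not the actual patch): when the queue is full, log a snapshot of the queued task types before rejecting, so the log shows directly what filled the queue, e.g. hundreds of DriveWatermarkMonitor entries.

import logging
import queue

class TooManyTasks(Exception):
    pass

class DumpingQueue:
    """Bounded queue that logs its contents before rejecting a task."""

    def __init__(self, name, max_tasks=400):
        self.name = name
        self._tasks = queue.Queue(maxsize=max_tasks)

    def submit(self, task):
        try:
            self._tasks.put_nowait(task)
        except queue.Full:
            # queue.Queue keeps its items in a deque guarded by a mutex
            # (a CPython implementation detail); take the mutex so the
            # snapshot is consistent while we log it.
            with self._tasks.mutex:
                pending = [type(t).__name__ for t in self._tasks.queue]
            logging.warning("Queue %s is full (%d tasks): %s",
                            self.name, len(pending), pending)
            raise TooManyTasks(self.name)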
Verified with:

Engine: 4.2.1.3-0.1.el7

Host 4.1:
OS Version: RHEL - 7.4 - 18.el7
Kernel Version: 3.10.0 - 693.el7.x86_64
KVM Version: 2.9.0 - 16.el7_4.14
LIBVIRT Version: libvirt-3.2.0-14.el7_4.9
VDSM Version: vdsm-4.19.45-1.el7ev

Host 4.2:
OS Version: RHEL - 7.4 - 18.el7
Kernel Version: 3.10.0 - 693.17.1.el7.x86_64
KVM Version: 2.9.0 - 16.el7_4.13.1
LIBVIRT Version: libvirt-3.2.0-14.el7_4.7
VDSM Version: vdsm-4.20.17-1.el7ev

(The host labels were swapped in the original comment; vdsm-4.19.x is the 4.1 host and vdsm-4.20.x is the 4.2 host.)

Steps:
Migrate a VM from the 4.1 host to the 4.2 host.

Note: I found https://bugzilla.redhat.com/show_bug.cgi?id=1542117 while migrating the VM.
This bug is included in the oVirt 4.2.0 release, published on Dec 20th 2017. Since the problem described in this bug report should be resolved in that release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.