Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1600507

Summary: Host not responding
Product: [oVirt] vdsm
Component: General
Version: 4.20.31
Hardware: x86_64
OS: Linux
Status: CLOSED INSUFFICIENT_DATA
Severity: high
Priority: low
Reporter: oliver.albl
Assignee: Roy Golan <rgolan>
QA Contact: Daniel Gur <dagur>
Docs Contact:
CC: bugs, christian.grundmann, dfediuck, eedri, mzamazal, oliver.albl, ratamir, rgolan
Keywords: Performance
Target Milestone: ovirt-4.2.8
Target Release: ---
Flags: rule-engine: ovirt-4.2+, rule-engine: ovirt-4.3+
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-12-30 21:51:48 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Scale
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1627734
Bug Blocks:

Attachments:
- Engine and VDSM logs
- VDSM logs

Description oliver.albl 2018-07-12 12:11:06 UTC
Created attachment 1458381 [details]
Engine and VDSM logs

Description of problem:
I run a cluster of 10 hosts for automated testing purposes. Under load, a host and its VMs sometimes stop responding for several minutes.

Version-Release number of selected component (if applicable):
oVirt 4.2.4.5-1.el7
vdsm-4.20.32-1.el7.x86_64

How reproducible:
Generate load on the hosts by automatically creating VMs, running VM workloads, and deleting VMs.

Steps to Reproduce:
1.
2.
3.

Actual results:
Host reports "VM xxx is not responding" for all VMs.
Engine reports "Host xxx is not responding".
VMs stay available and keep running their workloads, but automatic tasks (startup, shutdown, destroy, ...) cannot continue.

Expected results:
Automatic tasks (startup, shutdown, ...) should be possible.

Additional info:
Uploaded engine.log and vdsm.log from the affected host; the issue occurred around 21:38.

Comment 1 Doron Fediuck 2018-07-16 10:51:15 UTC
We need some more information.
How many VMs are running on the host?
What is the actual load (CPU utilization, memory consumption)?
Are you overcommitting memory?

Comment 2 oliver.albl 2018-07-16 11:28:04 UTC
During the issue there were about 40 VMs running on that host.
Memory usage is about 50 percent, CPU usage around 70 percent.

Comment 3 Roy Golan 2018-07-17 15:20:16 UTC
Every shutdown command takes a lock on the VM and spins off another API call from outside, from the vhostmd hook. The lock (aka vmContainerLock) is shared with other flows (such as protecting the external VMs list).

The jsonrpc server has 8 threads. Does this mean we can't handle more than 4 concurrent shutdowns properly? Let's see.

engine.log says your VMs are not responding. Usually this means we didn't get the statistics for those VMs.

vdsm.log shows that monitoring of those VMs exceeded 60 seconds, so they are considered non-responsive (a single VM shown for brevity):

--- 2018-07-11 21:38:36,330+0200 WARN  (jsonrpc/7) [virt.vm] (vmId='e4caddb3-0883-4d95-84bc-712f7a9e77b1') monitor became unresponsive (command timeout, age=78.9800000004) (vm:6151)

The internal periodic task for that VM is actually blocked and can't get the lock

--- 2018-07-11 21:38:33,872+0200 WARN  (vdsm.Scheduler) [Executor] Worker blocked: <Worker name=periodic/15 running <Task <Operation action=<function <lambda> at 0x7faf84297d70> at 0x7faf842af89
... (stacktrace deleted for brevity)
 File: "/usr/lib/python2.7/site-packages/vdsm/clientIF.py", line 159, in 
getVMs
  with self.vmContainerLock:

And the reason it is blocked is that it is waiting for the vmContainerLock, taken by the shutdown command in progress for this VM, which in turn is trying to invoke getVMFullList but has no free thread to run it, because the jsonrpc pool has only 8 workers:

  File "/usr/libexec/vdsm/hooks/after_vm_destroy/50_vhostmd", line 35, in <module>
    res = cli.Host.getVMFullList()
  File "/usr/lib/python2.7/site-packages/vdsm/client.py", line 278, in _call
    raise TimeoutError(method, kwargs, timeout)
vdsm.client.TimeoutError: Request Host.getVMFullList with args {} timed out after 60 seconds


To visualize:

Shutdown VM thread:
   - takes the vmContainerLock
     - calls cli getVMFullList
       - waits for another free jsonrpc thread (none available)
---
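The starvation pattern described above can be reproduced in a minimal, self-contained Python sketch. The names (vm_container_lock, get_vm_full_list, shutdown_vm) are illustrative stand-ins, not vdsm's actual API, and the pool is shrunk to a single worker to force the exhaustion that 8 busy workers would show under load:

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

POOL_SIZE = 1  # stand-in for the 8 jsonrpc workers, all busy
pool = ThreadPoolExecutor(max_workers=POOL_SIZE)
vm_container_lock = threading.Lock()
results = []

def get_vm_full_list():
    # Needs a free pool worker to run; none is available while shutdown runs.
    return ["vm-1"]

def shutdown_vm():
    with vm_container_lock:            # shutdown takes the container lock...
        # ...then issues a nested RPC into the same, now-full pool
        # (the analogue of the vhostmd hook calling Host.getVMFullList).
        nested = pool.submit(get_vm_full_list)
        try:
            nested.result(timeout=0.5)  # analogue of the 60s client timeout
        except FutureTimeout:
            results.append("rpc timed out")

def periodic_monitor():
    # The periodic monitoring task cannot take the lock while shutdown holds it.
    if vm_container_lock.acquire(timeout=0.2):
        vm_container_lock.release()
        results.append("monitor ran")
    else:
        results.append("monitor blocked")

fut = pool.submit(shutdown_vm)
time.sleep(0.05)  # let the shutdown thread grab the lock first
mon = threading.Thread(target=periodic_monitor)
mon.start()
mon.join()
fut.result()
pool.shutdown(wait=True)
print(results)
```

With one worker, the nested call can never be scheduled, so the shutdown's inner RPC times out while the monitor is stuck behind the lock — the same two symptoms seen in the logs (the client TimeoutError and the "Worker blocked" periodic task).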

Comment 4 Roy Golan 2018-07-18 06:47:42 UTC
I suggest a workaround here that will also test my analysis:
- uninstall the vdsm-hook-vhostmd package (you are not running SAP workloads, right?). That will spare the extra cli call during VM shutdown.
- increase the number of jsonrpc workers by adding this to /etc/vdsm/vdsm.conf:
[rpc]
worker_threads = 16
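A sketch of applying this setting as a drop-in instead of editing /etc/vdsm/vdsm.conf directly (the drop-in file name is illustrative, and the restart step assumes a systemd-managed vdsmd service):

```shell
# Write the [rpc] override as a drop-in (any *.conf under vdsm.conf.d works).
cat > /etc/vdsm/vdsm.conf.d/99-worker-threads.conf <<'EOF'
[rpc]
worker_threads = 16
EOF
# Restart vdsm so the larger jsonrpc pool takes effect.
systemctl restart vdsmd
```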

Comment 5 oliver.albl 2018-07-18 06:50:30 UTC
Do I need to place the hosts in maintenance mode, or can this be done online with VMs running?

Comment 6 Roy Golan 2018-07-18 07:06:49 UTC
No, just restart the service after the change; the running VMs will stay intact.

Comment 7 oliver.albl 2018-07-18 07:15:52 UTC
I will add the following to vdsm.conf.d/50... file:

[rpc]
worker_threads = 16

Removing vdsm-hook-vhostmd would also remove the packages cockpit-ovirt-dashboard, ovirt-host and ovirt-hosted-engine-setup, so I would stick with changing worker_threads in the first step only.

Comment 8 Roy Golan 2018-07-18 08:01:25 UTC
(In reply to oliver.albl from comment #7)
> Removing vdsm-hook-vhostmd would also remove packages
> cockpit-ovirt-dashboard, ovirt-host and ovirt-hosted-engine-setup, so would
> stay with changing worker_threads in the first step only.

It is important to disable the hook, but instead of uninstalling the package we can simply move this file to another directory:
/usr/libexec/vdsm/hooks/after_vm_destroy/50_vhostmd

Comment 9 oliver.albl 2018-07-18 08:06:28 UTC
So you want me to remove /usr/libexec/vdsm/hooks/after_vm_destroy/50_vhostmd (or move it to a backup folder) and restart vdsm, correct?

Comment 10 Roy Golan 2018-07-18 08:36:22 UTC
Yes, move it to a backup folder. Anyway, I don't think you need it, as it is specific to SAP workloads.

Comment 11 oliver.albl 2018-07-18 11:02:02 UTC
Do I have to restart vdsm as well, or can this be done without restarting? (I need some kind of automation, so the next yum update will not put the file back.)

Comment 12 Roy Golan 2018-07-18 11:46:51 UTC
No, you don't have to restart when you displace the file.
As for future yum updates, we proposed a patch to update this hook so it won't affect non-SAP workloads.
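The hook displacement discussed above can be sketched as follows. The backup directory is illustrative; per the comments, no vdsm restart is needed when only the hook file is moved:

```shell
# Displace the vhostmd after_vm_destroy hook so it is no longer invoked.
HOOK=/usr/libexec/vdsm/hooks/after_vm_destroy/50_vhostmd
BACKUP=/root/vdsm-hook-backup   # illustrative backup location
mkdir -p "$BACKUP"
mv "$HOOK" "$BACKUP/"
# To restore later:
#   mv "$BACKUP/50_vhostmd" "$HOOK"
```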

Comment 13 oliver.albl 2018-08-30 10:15:16 UTC
Created attachment 1479748 [details]
VDSM logs

It seems we could fix the "Host not responding" symptom, but we still see "VM not responding" states and blocked workers. I am uploading VDSM logs from today.

Comment 14 Roy Golan 2018-09-05 06:11:10 UTC
Milan, I think Dan's fix is the right way to go. Can you take it over and push it?

Comment 15 Milan Zamazal 2018-09-05 15:32:05 UTC
Dan's fix has been merged.

Roy, do you think we need any more fixes for this bug?

Comment 16 Roy Golan 2018-09-07 06:42:17 UTC
Perfect Milan, thanks.

According to Oliver's comment 13 there may be more trouble; can you have a look at attachment 1479748 [details]?

Comment 17 Roy Golan 2018-09-07 06:43:40 UTC
I'm raising the priority of this; the fix is important to all installations.

Comment 18 Milan Zamazal 2018-09-11 11:34:51 UTC
It looks like vmContainerLock is still the problem and periodic operations of various executors get stuck on it. This looks suspicious:

  2018-08-30 11:47:21,477+0200 WARN  (vdsm.Scheduler) [Executor] Worker blocked: <Worker name=jsonrpc/7 running <Task <JsonRpcTask {'params': {u'vmID': u'99b3fff6-f852-4b4d-8fae-259ff78fe2df'}, 'jsonrpc': '2.0', 'method': u'VM.destroy', 'id': u'eb0549e3-919c-4119-a352-52c5e0dd7fab'} at 0x7f4c2c586a50> timeout=60, duration=60 at 0x7f4c2c586f50> task#=62852 at 0x7f4c4474a350>, traceback:
  [...]
  File: "/usr/lib64/python2.7/site-packages/libvirt.py", line 1167, in destroyFlags
    ret = libvirtmod.virDomainDestroyFlags(self._o, flags) (executor:363)

Unless it's just a timing coincidence, the libvirt call might be the problem.

I'm afraid further investigation is required into why that happens under load, so I'm moving the bug back to the Scale team. I created a separate bug (https://bugzilla.redhat.com/1627734) to track the vhostmd hook issue.

Comment 19 oliver.albl 2018-09-20 08:34:37 UTC
We had to reduce worker threads back to the default because we were running into https://bugzilla.redhat.com/show_bug.cgi?id=1620044, where we lose (or cannot recover) VMs!

Comment 20 Sandro Bonazzola 2018-10-23 11:43:13 UTC
Retargeting to 4.2.8, as this was not identified as a blocker for 4.2.7.

Comment 21 Roy Golan 2018-10-25 08:33:22 UTC
Milan, should this be closed as a clone of bug 1620044?

Comment 22 Milan Zamazal 2018-10-25 14:29:13 UTC
I don't think so. Different locks are involved in the two bugs, and there seem to be no paused VMs in this bug, so these are likely separate issues.

Comment 23 Roy Golan 2018-10-31 08:07:54 UTC
Raz, it looks like the 4.2 downport of this bug has been verified. Can someone verify this upstream so we can close it? https://bugzilla.redhat.com/show_bug.cgi?id=1627734

Comment 24 Raz Tamir 2018-10-31 09:09:15 UTC
Hi Roy,

I don't see the relation to the bug you mentioned in comment #23.

Can you please clarify?

Comment 25 Roy Golan 2018-10-31 09:34:54 UTC
Sorry for the mess, let's go again.

This bug has diverged a bit. The patch that was added here addresses the problem described in the description. Milan opened Bug 1627734 to track the fix for the host non-responding state caused by the bad hook.

Milan, in comment 18 you wrote that it happens again - what are the details of this environment?

Comment 26 Milan Zamazal 2018-10-31 09:56:45 UTC
Roy, this is just an examination of the logs provided in attachment 1479748 [details]. I haven't encountered the problem myself.

Comment 27 RHV bug bot 2018-11-28 14:38:47 UTC
INFO: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Tag 'v4.20.44' doesn't contain patch 'https://gerrit.ovirt.org/93073']
gitweb: https://gerrit.ovirt.org/gitweb?p=vdsm.git;a=shortlog;h=refs/tags/v4.20.44

For more info please contact: infra

Comment 28 Eyal Edri 2018-12-13 15:54:01 UTC
If this bug fixes a 4.2 issue, please attach the relevant patch for 4.2 and move to ON_QA if it was part of the last build.

Comment 29 Roy Golan 2018-12-30 21:51:48 UTC
In the end, the fix was made on bug 1627734. This bug doesn't have enough information to continue.