Bug 1600507
| Summary: | Host not responding | | |
|---|---|---|---|
| Product: | [oVirt] vdsm | Reporter: | oliver.albl |
| Component: | General | Assignee: | Roy Golan <rgolan> |
| Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | Daniel Gur <dagur> |
| Severity: | high | Docs Contact: | |
| Priority: | low | | |
| Version: | 4.20.31 | CC: | bugs, christian.grundmann, dfediuck, eedri, mzamazal, oliver.albl, ratamir, rgolan |
| Target Milestone: | ovirt-4.2.8 | Keywords: | Performance |
| Target Release: | --- | Flags: | rule-engine: ovirt-4.2+, rule-engine: ovirt-4.3+ |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-12-30 21:51:48 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Scale | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1627734 | | |
| Bug Blocks: | | | |
| Attachments: | | | |
We need some more information. How many VMs are running on the host? What is the actual load (CPU utilization, memory consumption)? Are you overcommitting memory?

During the issue there were about 40 VMs running on that host. Memory usage was about 50 percent, CPU usage around 70 percent.

Every shutdown command takes a lock on the VM and spins off another API call from outside, from the vhostmd hook. The lock (aka VmContainerLock) is shared with other flows (such as protecting the external VMs list). The jsonrpc server has 8 threads, which means we can't handle more than 4 concurrent shutdowns properly, since each shutdown occupies one worker and needs a second one for its nested call. Let's see.
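That arithmetic can be checked with a minimal, self-contained model. This is a hedged sketch in plain Python, not vdsm code, and all names are made up: each "shutdown" occupies one pool worker and then waits for a second worker to serve its nested hook call, so with 8 workers at most 4 shutdowns can make progress at once.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, TimeoutError

POOL_SIZE = 8  # models vdsm's 8 jsonrpc worker threads

def nested_call():
    # stands in for the vhostmd hook's Host.getVMFullList request,
    # which must be served by the same bounded pool
    return True

def run(n_shutdowns):
    # the barrier makes the worst case deterministic: every shutdown
    # grabs a worker before any of them submits its nested call
    barrier = threading.Barrier(n_shutdowns)

    def shutdown(pool):
        barrier.wait()
        # hold this worker until a *second* worker finishes the nested call
        return pool.submit(nested_call).result(timeout=2)

    with ThreadPoolExecutor(max_workers=POOL_SIZE) as pool:
        futures = [pool.submit(shutdown, pool) for _ in range(n_shutdowns)]
        try:
            return all(f.result(timeout=5) for f in futures)
        except TimeoutError:
            return False

print(run(4))  # True: 4 workers busy, 4 still free to serve the nested calls
print(run(8))  # False: all 8 workers wait on nested calls that can never start
```

With 4 shutdowns there are still free workers for the nested calls; with 8, every worker is parked waiting and the queued nested calls starve until the timeout fires.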
engine.log says your VMs are non-responding. Usually this means we didn't get the statistics for those VMs.

vdsm.log shows that the monitoring of those VMs exceeded 60 seconds and is therefore considered non-responsive (single VM for brevity):
--- 2018-07-11 21:38:36,330+0200 WARN (jsonrpc/7) [virt.vm] (vmId='e4caddb3-0883-4d95-84bc-712f7a9e77b1') monitor became unresponsive (command timeout, age=78.9800000004) (vm:6151)
The internal periodic task for that VM is actually blocked and can't get the lock:
--- 2018-07-11 21:38:33,872+0200 WARN (vdsm.Scheduler) [Executor] Worker blocked: <Worker name=periodic/15 running <Task <Operation action=<function <lambda> at 0x7faf84297d70> at 0x7faf842af89
... (stacktrace deleted for brevity)
File: "/usr/lib/python2.7/site-packages/vdsm/clientIF.py", line 159, in getVMs
    with self.vmContainerLock:
And the reason it is blocked is that it is waiting for the VmContainerLock, taken by the shutdown command in progress for this VM, which in turn is trying to invoke getVMFullList but doesn't have a thread to serve it, because the jsonrpc pool has only 8 workers:
File "/usr/libexec/vdsm/hooks/after_vm_destroy/50_vhostmd", line 35, in <module>
res = cli.Host.getVMFullList()
File "/usr/lib/python2.7/site-packages/vdsm/client.py", line 278, in _call
raise TimeoutError(method, kwargs, timeout)
vdsm.client.TimeoutError: Request Host.getVMFullList with args {} timed out after 60 seconds
To visualize, the shutdown VM thread:

- takes the vmContainerLock
- calls Host.getVMFullList through the client
- waits for another jsonrpc thread to serve that call, while still holding the lock
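Those three steps can be reproduced in miniature. This is a hedged sketch, not vdsm code: the lock and the one-worker pool merely stand in for VmContainerLock and a fully busy jsonrpc pool, and all names are made up.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, TimeoutError

vm_container_lock = threading.Lock()
rpc_pool = ThreadPoolExecutor(max_workers=1)  # a pool with no spare capacity

def get_vm_full_list():
    # like clientIF.getVMs: every reader needs the shared container lock
    with vm_container_lock:
        return ["vm-1"]

def shutdown_vm():
    with vm_container_lock:                         # step 1: take the lock
        future = rpc_pool.submit(get_vm_full_list)  # step 2: nested RPC
        try:
            # step 3: wait for a worker; it blocks on the lock *we* hold
            return future.result(timeout=1)         # vdsm's client waits 60 s
        except TimeoutError:
            return "timed out"

print(shutdown_vm())  # → timed out
```

The nested call can never acquire the lock its own caller is holding, so the client request times out exactly as in the Host.getVMFullList traceback above.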
---
I suggest a workaround here that will also test my analysis:

- Uninstall the vdsm-hook-vhostmd package (you are not running SAP workloads, right?). That will spare the extra client call during shutdown of a VM.
- Increase the number of jsonrpc workers by adding this to /etc/vdsm/vdsm.conf:

      [rpc]
      worker_threads = 16

Do I need to place the hosts in maintenance mode, or can this be done online with VMs running?

No, just restart the service after the change; the running VMs will stay intact.

I will add the following to a vdsm.conf.d/50... file:

      [rpc]
      worker_threads = 16

Removing vdsm-hook-vhostmd would also remove the packages cockpit-ovirt-dashboard, ovirt-host and ovirt-hosted-engine-setup, so I would stay with changing worker_threads in the first step only.

(In reply to oliver.albl from comment #7)
> Removing vdsm-hook-vhostmd would also remove packages
> cockpit-ovirt-dashboard, ovirt-host and ovirt-hosted-engine-setup, so would
> stay with changing worker_threads in the first step only.

It is not important that you remove the package; we can disable the hook by just moving this file to another directory: /usr/libexec/vdsm/hooks/after_vm_destroy/50_vhostmd

So you want me to remove /usr/libexec/vdsm/hooks/after_vm_destroy/50_vhostmd (or move it to a backup folder) and restart vdsm, correct?

Yes, move it to a backup folder. Anyway, I don't think you need it, since it is specific to SAP workloads.

Do I have to restart vdsm as well, or can this be done without restarting? (I need some kind of automatism, so the next yum update will not place the file again.)

No, you don't have to restart when you displace the file. For the future yum update, we proposed a patch to update this hook so it won't affect non-SAP workloads.

Created attachment 1479748 [details]
VDSM logs
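A concrete staging of the workaround discussed above might look as follows. This is a hedged sketch: the drop-in filename is our own invention (the comment truncates it to "50..."), and the copy/move/restart steps are left as comments since they touch a live host.

```shell
# Stage the [rpc] drop-in in a scratch directory first.
staging=$(mktemp -d)
cat > "$staging/50-worker-threads.conf" <<'EOF'
[rpc]
worker_threads = 16
EOF
cat "$staging/50-worker-threads.conf"

# Then, on the host as root (not executed here):
#   cp "$staging/50-worker-threads.conf" /etc/vdsm/vdsm.conf.d/
#   systemctl restart vdsmd          # needed for the worker_threads change
#   mkdir -p /root/vdsm-hook-backup
#   mv /usr/libexec/vdsm/hooks/after_vm_destroy/50_vhostmd /root/vdsm-hook-backup/
#   # per the comments above, no restart is needed for the hook move
```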
It seems we could fix the "Host not responding" symptom, but we still see "VM not responding" states and blocked workers. I uploaded VDSM logs from today.

Milan, I think Dan's fix is the right way to go. Can you take it over and push it?

Dan's fix has been merged. Roy, do you think we need any more fixes for this bug?

Perfect, Milan, thanks. According to Oliver's comment 13 perhaps there are more troubles; can you have a look at attachment 1479748 [details]?

I'm raising the priority of this; this fix is important to all installations.

It looks like vmContainerLock is still the problem, and periodic operations of various executors get stuck on it. This looks suspicious:
2018-08-30 11:47:21,477+0200 WARN (vdsm.Scheduler) [Executor] Worker blocked: <Worker name=jsonrpc/7 running <Task <JsonRpcTask {'params': {u'vmID': u'99b3fff6-f852-4b4d-8fae-259ff78fe2df'}, 'jsonrpc': '2.0', 'method': u'VM.destroy', 'id': u'eb0549e3-919c-4119-a352-52c5e0dd7fab'} at 0x7f4c2c586a50> timeout=60, duration=60 at 0x7f4c2c586f50> task#=62852 at 0x7f4c4474a350>, traceback:
[...]
File: "/usr/lib64/python2.7/site-packages/libvirt.py", line 1167, in destroyFlags
ret = libvirtmod.virDomainDestroyFlags(self._o, flags) (executor:363)
Unless it's just a timing coincidence, the libvirt call might be the problem.
I'm afraid it requires further investigation into why that happens under load, so I'm moving the bug back to the Scale team. I created a separate bug (https://bugzilla.redhat.com/1627734) to track the vhostmd hook issue.
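For reference, the "Worker blocked" warnings quoted throughout this bug come from a scheduler thread that samples each worker and dumps its stack once a task overruns its timeout. Here is a minimal model of that mechanism (hypothetical names; not vdsm's actual Executor code):

```python
import sys
import threading
import time
import traceback

report = {}

def watchdog(worker, timeout):
    # after `timeout` seconds, capture the worker's current stack if it is
    # still busy, the way vdsm's Executor logs "Worker blocked"
    time.sleep(timeout)
    if worker.is_alive():
        frame = sys._current_frames()[worker.ident]
        report["blocked"] = "Worker blocked: %s\n%s" % (
            worker.name, "".join(traceback.format_stack(frame)))

def slow_task():
    time.sleep(0.5)  # stands in for a libvirt call that does not return

worker = threading.Thread(target=slow_task, name="jsonrpc/7")
dog = threading.Thread(target=watchdog, args=(worker, 0.1))
worker.start()
dog.start()
worker.join()
dog.join()
print(report["blocked"].splitlines()[0])  # → Worker blocked: jsonrpc/7
```

The captured stack pinpoints the blocking frame, which is how the `libvirtmod.virDomainDestroyFlags` line above was identified as the suspect.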
We had to reduce worker threads to the default again, because we were running into https://bugzilla.redhat.com/show_bug.cgi?id=1620044, where we lose (or cannot recover) VMs!

Retargeting to 4.2.8; not identified as a blocker for 4.2.7.

Milan, should this be closed as a clone of bug 1620044?

I don't think so. There are different locks involved in the two bugs, and there seem to be no paused VMs in this bug. So it's likely these are separate issues.

Raz, it looks like the 4.2 bug downport has been verified. Can someone verify this upstream so we can close it? https://bugzilla.redhat.com/show_bug.cgi?id=1627734

Hi Roy, I don't see the relation to the bug you mentioned in comment #23. Can you please clarify?

Sorry for the mess, let's go again. This bug has diverged a bit. The patch that was added here addresses the problem described in the description. Milan opened bug 1627734 to track the fix for the host non-responding state caused by the bad hook. Milan, in comment 18 you wrote that it happens again. What are the details of that environment?

Roy, this is just an examination of the logs provided in attachment 1479748 [details]. I haven't met the problem myself.
INFO: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason: [Tag 'v4.20.44' doesn't contain patch 'https://gerrit.ovirt.org/93073'] gitweb: https://gerrit.ovirt.org/gitweb?p=vdsm.git;a=shortlog;h=refs/tags/v4.20.44 For more info please contact: infra

If this bug fixes a 4.2 issue, please attach the relevant patch for 4.2 and move to ON_QA if it was part of the last build.

In the end the fix was made on bug 1627734. This bug doesn't have enough info to continue.
Created attachment 1458381 [details]
Engine and VDSM logs

Description of problem:
I run a cluster with 10 hosts for automated testing purposes. Sometimes a host and its VMs are not responding for some minutes when under load.

Version-Release number of selected component (if applicable):
oVirt 4.2.4.5-1.el7
vdsm-4.20.32-1.el7.x86_64

How reproducible:
Generate load on hosts by automatically creating VMs, running VM workloads, and deleting VMs.

Steps to Reproduce:
1.
2.
3.

Actual results:
Host reports "VM xxx is not responding" for all VMs. Engine reports "Host xxx is not responding". VMs stay available and run their workload, but automatic tasks (startup, shutdown, destroy, ...) cannot continue.

Expected results:
Automatic tasks (startup, shutdown, ...) should be possible.

Additional info:
Uploaded engine.log and vdsm.log from the affected host; the issue occurred around 21:38.