Bug 1250839 (vdsm_thread_leak_w_unrespon_storag)

Summary: [scale] VDSM thread leak with unresponsive storage
Product: [oVirt] vdsm Reporter: Francesco Romani <fromani>
Component: General Assignee: Francesco Romani <fromani>
Status: CLOSED CURRENTRELEASE QA Contact: guy chen <guchen>
Severity: high Docs Contact:
Priority: high    
Version: --- CC: alitke, amureini, bazulay, bugs, danken, fromani, lsurette, mgoldboi, michal.skrivanek, rbalakri, s.kieske, srevivo, ycui, ykaul
Target Milestone: ovirt-3.6.5 Flags: rule-engine: ovirt-3.6.z+
mgoldboi: testing_plan_complete?
mgoldboi: planning_ack+
michal.skrivanek: devel_ack+
rule-engine: testing_ack+
Target Release: 4.17.11   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
A bug in the VDSM monitoring subsystem caused VDSM to periodically spawn new threads to poll unresponsive storage, resulting in a thread leak.
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-04-21 14:42:21 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Virt RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Francesco Romani 2015-08-06 06:46:39 UTC
Description of problem:
Early testing of the new VDSM 4.17.0, which uses bulk stats, produced failure scenarios in which VDSM does not detect and react properly to blocked storage.
VDSM kept replacing stuck threads without taking blocking operations into account.

The information is not clear and we lack a reproducer. The report may be caused by an unclean environment.

This BZ is to track investigation about this scenario.


Version-Release number of selected component (if applicable):
3.6.0


How reproducible:
still under investigation


Actual results:
VDSM relentlessly spawns threads when a storage-related operation blocks, consuming an excessive amount of libvirt and system resources.


Expected results:
VDSM should detect stuck storage-related operations and react accordingly, avoiding leaking threads by retrying when it makes no sense to do so.

Comment 1 Francesco Romani 2015-08-06 14:11:28 UTC
Partially reproduced. Partial reproduction is enough to confirm the bug.

The new sampling code in VDSM aggressively replaces threads which don't respond quickly enough.
In the worst case, a new periodic operation (disk high water mark) is done every 2 seconds.
Depending on the number of VMs experiencing storage trouble, and hence accumulating stuck calls, the interplay with the libvirt timeout (or lack thereof) can create too many thread replacements.

This creates a huge number of threads, most of them stuck and waiting to die, but wasteful nevertheless.
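
For illustration only, a minimal Python sketch (not VDSM's actual executor code) of why this replacement policy leaks threads while the underlying call stays blocked: each period the old worker is abandoned and a new one is started, but the abandoned worker never exits.

  import threading
  import time

  def run_periodic(operation, period, timeout):
      # Sketch of an aggressive replacement policy: give up on a worker
      # that does not finish within `timeout` and start a fresh one.
      # While `operation` stays blocked, abandoned workers pile up.
      abandoned = []
      while True:
          worker = threading.Thread(target=operation, daemon=True)
          worker.start()
          worker.join(timeout)
          if worker.is_alive():
              abandoned.append(worker)   # stuck: discarded but still alive
              print("leaked workers so far:", len(abandoned))
          time.sleep(period)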

There is a safety net in place: each operation checks whether libvirt seems clogged
using virDomainGetControlInfo(), and goes ahead if and only if the API reports the domain as free.

Apparently, using this call is not free and makes things worse under extreme conditions.

Reproduction:

Have an oVirt setup with a number of VMs greater than the libvirt worker thread pool size (5). I used 32 VMs. I used iSCSI storage, but it should make no difference.

Make sure each VM is doing some I/O. I used the 'stress' tool with mildly aggressive settings, autostarted at VM boot with systemd:

  stress -d 4 -i 4 --hdd-bytes 128M

Let the VMs do some traffic, then kill the connection to storage using iptables.

The periodic thread replacement kicks in, and the number of VDSM threads rapidly increases. In my test, it reached ~400.

Eventually, in my test, the timeout triggered and the stuck threads died, bringing the VDSM thread count back to sane values (~34).

Further investigations will now explore if:
- the behaviour of virDomainGetControlInfo is compliant
- the usage in VDSM of virDomainGetControlInfo is safe and sane
- we need a more cautious approach to assess libvirt busyness, like the PoC in
  https://gerrit.ovirt.org/#/c/44401/

Comment 2 Nir Soffer 2015-08-09 11:14:35 UTC
(In reply to Francesco Romani from comment #1)
What is the libvirt timeout, or does it have any timeout? can a libvirt call block
forever?

I think the system should behave like this:

1. Start with X executor threads
2. Once a thread times out, the stuck thread should be discarded, starting
   a new executor
3. The vm which is blocked should not get any new call until the stuck
   thread is finished

In the worst case, we would have N stuck threads for N vms + X executor threads,
until libvirt times out.
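
For illustration, a minimal Python sketch (not the actual VDSM executor; the class name is made up) of the bookkeeping point 3 implies: a VM that still has a call in flight is skipped instead of getting a new worker.

  import threading

  class InFlightTracker:
      # Tracks which VMs still have a stuck call, so the periodic
      # scheduler can skip them instead of piling up new workers.

      def __init__(self):
          self._in_flight = set()
          self._lock = threading.Lock()

      def try_acquire(self, vm_id):
          with self._lock:
              if vm_id in self._in_flight:
                  return False   # previous call not finished: skip this cycle
              self._in_flight.add(vm_id)
              return True

      def release(self, vm_id):
          with self._lock:
              self._in_flight.discard(vm_id)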

Comment 3 Francesco Romani 2015-08-12 08:44:27 UTC
Hi Nir, thanks for chiming in!

I'll do my best to explain the situation in a terse way, apologies if the comment comes out long.

(In reply to Nir Soffer from comment #2)
> (In reply to Francesco Romani from comment #1)
> What is the libvirt timeout, or does it have any timeout? can a libvirt call
> block forever?

Short version:
Yes, there is a timeout; it is 30s, and the problem is that it is too high for our purposes. See below.

Long version:
There are two sides here. One is the daemon side, libvirtd. On this side, libvirtd accepts RPC calls from clients (transparently done using the Python bindings).

Reference: http://libvirt.org/git/?p=libvirt.git;a=blob;f=daemon/THREADS.txt;h=762ca25c555d3f14c48a1c54e167071a8ffc9774;hb=HEAD

Here I need further investigation to understand what happens if the libvirtd worker thread pool is exhausted. I (used to) think the call would just fail immediately, but I want to make sure.

On the implementation side, speaking of QEMU, libvirtd speaks to the hypervisor using a UNIX domain socket and a JSON protocol. Last time I checked, AFAIU the protocol is synchronous, meaning that requests and responses cannot be interleaved, and each request must wait for the previous one to complete - I will double-check.

Libvirt has a timeout here. It waits for 30 seconds before giving up:
http://libvirt.org/git/?p=libvirt.git;a=blob;f=src/qemu/qemu_domain.c;h=abf52c9c38fa10ac9a61abb9dfe2d44e012755ee;hb=HEAD#l1433

but this covers the case in which the monitor is already busy and another thread is trying to use it, which is mostly good enough for us.


> I think the system should behave like this:
> 
> 1. Start with X executor threads
> 2. Once a thread times out, the stuck thread should be discarded, starting
>    a new executor
> 3. The vm which is blocked should not get any new call until the stuck
>    thread is finished
> 
> In the worst case, we would have N stuck threads for N vms + X executor
> threads,
> until libvirt times out.

This is correct and I 100% agree.
This is also what was implemented, but point 3 failed under massive scale.

To implement point 3 I used this API:

http://libvirt.org/html/libvirt-libvirt-domain.html#virDomainGetControlInfo

used like this:

https://github.com/oVirt/vdsm/blob/master/vdsm/virt/vm.py#L2703

The gist here is: avoid stressing libvirt even further at the slightest clue of clogging. This is the reason why we check only for VIR_DOMAIN_CONTROL_OK; it is the safest of the states (no job working on the domain).
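
For reference, a minimal sketch of this kind of check using the libvirt Python bindings (virDomainGetControlInfo is exposed as virDomain.controlInfo(); the helper name here is illustrative, not VDSM's exact code):

  import libvirt

  def domain_ready_for_commands(dom):
      # Return True only when libvirt reports the domain as free
      # (VIR_DOMAIN_CONTROL_OK), i.e. no job is working on it.
      try:
          state, details, state_time = dom.controlInfo()
      except libvirt.libvirtError:
          return False          # treat errors as "not ready"
      return state == libvirt.VIR_DOMAIN_CONTROL_OK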

I peeked at the implementation, and it actually just fetches a few state variables:

http://libvirt.org/git/?p=libvirt.git;a=blob;f=src/qemu/qemu_driver.c;h=2e445009fbdd883d539d927a1a8a6574ab80bb7f;hb=HEAD#l2696

hence it seemed a good idea at the time. Using this call, we reduce the duplication of libvirt state (libvirtd already knows whether it's busy or not!).

Finally, the problem we're addressing here:
- massive setup with 240+ cores and 100+ VMs
- unexpected storage outage - this is also the easiest way to trigger QEMU blocks.
- VDSM started to leak threads because the protection ensured by isDomainReadyForCommands was ineffective: virDomainGetControlInfo returned OK or blocked itself (!)
  Every 30s:
  15 calls to DriveWatermarkMonitor
   1 call to UpdateVolumes (can use libvirt)
   1 call to BlockJobMonitor
   1 call to VMBulkSampler
  (NumaInfoMonitor doesn't directly access libvirt - this is another bug :))
  So 18 threads per VM that can be discarded (and they were!) and 1 thread that times out and is killed. We lost 34 threads per VM per minute[1].

This is probably the worst possible scenario ever, but we need to handle it, full stop.

My problem now is that I don't have a clear reproducer yet, so I cannot yet file a proper libvirt bug.

While investigating, I'm experimenting with alternative approaches to replace isDomainReadyForCommands/virDomainGetControlInfo: see the early proof of concept here:
https://gerrit.ovirt.org/#/q/status:open+project:vdsm+branch:master+topic:bulk_stats_thread_leak2

Comment 4 Red Hat Bugzilla Rules Engine 2015-10-21 18:27:04 UTC
This bug is not marked for z-stream, yet the milestone is for a z-stream version, therefore the milestone has been reset.
Please set the correct milestone or add the z-stream flag.

Comment 5 Yaniv Lavi 2015-10-29 12:26:15 UTC
In oVirt, testing is done on a single release by default. Therefore I'm removing the 4.0 flag. If you think this bug must be tested in 4.0 as well, please re-add the flag. Please note we might not have the testing resources to handle the 4.0 clone.

Comment 6 Francesco Romani 2015-11-10 09:41:01 UTC
https://gerrit.ovirt.org/#/c/48332/ seems to fix this issue. Reproduction instructions will follow.

Comment 7 Francesco Romani 2015-11-10 09:52:38 UTC
Found a good reproducer while investigating the VDSM/virt part of https://bugzilla.redhat.com/show_bug.cgi?id=1261980

Goal: make QEMU get stuck in *I/O* (SIGSTOP'ing it is not good enough), and observe whether or not VDSM detects the stuck QEMU.

Desired behaviour: after a little while, VDSM should detect the unresponsive QEMU and stop sending queries to it, avoiding more and more stuck calls, which in turn trigger thread replacement in the executor and thus thread leaking.

Buggy behaviour: VDSM keeps querying the stuck QEMU, or fails to detect the stuck QEMU at all.
VDSM keeps querying the blocked QEMU, stuck calls pile up, the executor keeps replacing
stuck worker threads, and the VDSM process leaks threads. See https://bugzilla.redhat.com/show_bug.cgi?id=1250839#c3 for more details.

Reproduction steps:
1. Have a 3.6.x oVirt environment with two Storage Domains: iSCSI and NFS. It should also work with a single NFS-only Storage Domain, but this was not tried.
2. Prepare two VMs, one with its disk on NFS, one with its disk on iSCSI.
3. Run the two VMs.
4. Inside the VMs, generate some I/O traffic. Make sure the I/O is not cached. Example:

  # dd if=/dev/urandom of=/dev/vda oflag=direct
5. [OPTIONAL] Check on the host that the VMs are generating traffic (vdsClient getAllVmStats).
6. Wait a few seconds to make sure the traffic is flowing.
7. Block NFS using iptables (or any other means), to make one QEMU get stuck; an example follows.
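
For example (assuming the host reaches the NFS server at the hypothetical address 192.0.2.10; substitute your own):

  # on the host: drop all traffic to the NFS server
  iptables -A OUTPUT -d 192.0.2.10 -j DROP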

Comment 11 Sandro Bonazzola 2015-11-24 16:43:36 UTC
Please set target release or I can't move the bug to ON_QA automatically.

Comment 12 Red Hat Bugzilla Rules Engine 2015-11-24 18:08:44 UTC
Bug tickets that are moved to testing must have target release set to make sure tester knows what to test. Please set the correct target release before moving to ON_QA.

Comment 13 Francesco Romani 2015-11-25 09:12:24 UTC
Patch 49143 is not needed to solve this BZ; it "only" makes troubleshooting easier.

Restoring BZ state

Comment 14 Francesco Romani 2016-01-20 08:52:08 UTC
Not sure we should advertise this bug in the docs. Added doc_text just in case.

Comment 15 Francesco Romani 2016-03-07 12:34:16 UTC
Please note that patch http://gerrit.ovirt.org/53998 is not needed to solve this bug. It just makes troubleshooting easier.
Only patch http://gerrit.ovirt.org/48348 fixes this bug, and this was merged long ago.

Comment 16 guy chen 2016-04-12 10:42:55 UTC
Was verified on VDSM 4.17.25.

Ran 2 VMs on 1 host, 1 VM connected to iSCSI and 1 VM connected to NFS.
At the beginning of the test, VDSM had 57 threads.
After starting I/O load on both VMs, the VDSM thread count increased to 135.
After blocking the NFS storage, the number of threads stayed the same for about 30 seconds, then dropped to 60.