Bug 965169 - Unable to move tasks from domain cgroup to emulator cgroup
Status: CLOSED ERRATA
Product: Fedora
Classification: Fedora
Component: libvirt
Version: 18
Hardware: All
Priority: unspecified
Severity: high
Assigned To: Eric Blake
QA Contact: Fedora Extras Quality Assurance
Reported: 2013-05-20 11:20 EDT by IBM Bug Proxy
Modified: 2013-06-24 23:24 EDT
Fixed In Version: libvirt-0.10.2.6-1.fc18
Doc Type: Bug Fix
Last Closed: 2013-06-24 23:24:58 EDT


Attachments:
qemu cgroup trace (594.89 KB, application/octet-stream), 2013-05-20 11:21 EDT, IBM Bug Proxy


External Trackers:
IBM Linux Technology Center 91788
Description IBM Bug Proxy 2013-05-20 11:20:39 EDT
---uname output---
Linux r35lp09 3.8.6-60.x.20130411-s390xrhel
 
Machine Type = zSeries 

---Steps to Reproduce---
 
- I was running 14 kvm guests with workload:
e20 - 4G memory - 1,4G swap - vcpu 2 - running - fmempig 4000, stress -c 1
e21 - 4G memory - 1,4G swap - vcpu 2 - running - fmempig 3000, stress -c 1
e22 - 4G memory - 1,4G swap - vcpu 2 - running - fmempig 4000
e23 - 4G memory - 1,4G swap - vcpu 2 - running - stress -c 2
e23 - 2G memory - 1,4G swap - vcpu 2 - running - stress -c 2
e10 - 1G memory - no swap - vcpu 1 - running - fmempig 600
e11 - 1G memory - no swap - vcpu 1 - running - fmempig 600
s40 - 4G memory - 1,4G swap - vcpu 2 - running - stress -c 1
s41 - 4G memory - 1,4G swap - vcpu 2 - running - fmempig 3000, stress -c 1
s42 - 4G memory - 1,4G swap - vcpu 2 - running - fmempig 5000
s43 - 4G memory - 1,4G swap - vcpu 2 - running - stress -c 1
s44 - 4G memory - 1,4G swap - vcpu 2 - running - stress -c 2
s45 - 4G memory - 1,4G swap - vcpu 2 - running - stress -c 2
e23 - 4G memory - 1,4G swap - vcpu 2 - running - stress -c 2

- Then I tried to start 3 additional guests, with the following result:
[root@r35lp09 ?]# ./zdomain.sh
Domain e25 started

(Time: 55042.598 ms)

error: Failed to start domain e26
error: Unable to move tasks from domain cgroup to emulator cgroup in controller 0 for e26: No such process

(Time: 69792.559 ms)

Domain e27 started

(Time: 57426.563 ms)

- I took a qemu cgroup trace (see the attachment).

- Analysis of this problem:

I understand now what happens, but you won't like it ;).
The system seems to be under extreme I/O pressure, and this exposes a race condition in the interaction between libvirt and qemu.
Since this is a conceptual problem that will happen under high I/O load on any architecture, it is necessary to open an LTC bugzilla against it.

Here's what happens: When the QEMU process is started for e26 all thread ids are contained in /sys/fs/cgroup/cpu,cpuacct/system/libvirtd.service/libvirt/qemu/e26/tasks.
Libvirt now tries to move the CPU threads into /sys/fs/cgroup/cpu,cpuacct/system/libvirtd.service/libvirt/qemu/e26/vcpu*/tasks (this succeeds) and the remaining threads into
/sys/fs/cgroup/cpu,cpuacct/system/libvirtd.service/libvirt/qemu/e26/emulator/tasks.
Now comes the race: one of the threads originally in e26/tasks has ended (which is perfectly OK), but libvirt still wants to place it in e26/emulator/tasks, and this cannot work.

Here's the relevant part of the trace...thread id 51865 is gone by the time libvirt wants to move it.

2013-04-04 09:58:08.221+0000: 51161: debug : virCgroupGetValueStr:361 : Get value /sys/fs/cgroup/cpu,cpuacct/system/libvirtd.service/libvirt/qemu/e26/tasks
2013-04-04 09:58:08.221+0000: 51161: debug : virCgroupSetValueStr:331 : Set value '/sys/fs/cgroup/cpu,cpuacct/system/libvirtd.service/libvirt/qemu/e26/emulator/tasks' to '51805'
2013-04-04 09:58:11.070+0000: 51161: debug : virCgroupSetValueStr:331 : Set value '/sys/fs/cgroup/cpu,cpuacct/system/libvirtd.service/libvirt/qemu/e26/emulator/tasks' to '51809'
2013-04-04 09:58:12.740+0000: 51161: debug : virCgroupSetValueStr:331 : Set value '/sys/fs/cgroup/cpu,cpuacct/system/libvirtd.service/libvirt/qemu/e26/emulator/tasks' to '51865'
2013-04-04 09:58:12.740+0000: 51161: debug : virCgroupSetValueStr:335 : Failed to write value '51865': No such process
2013-04-04 09:58:12.740+0000: 51161: debug : virCgroupSetValueStr:331 : Set value '/sys/fs/cgroup/cpu,cpuacct/system/libvirtd.service/libvirt/qemu/e26/tasks' to '51805'
2013-04-04 09:58:14.660+0000: 51161: debug : virCgroupSetValueStr:331 : Set value '/sys/fs/cgroup/cpu,cpuacct/system/libvirtd.service/libvirt/qemu/e26/tasks' to '51809'
2013-04-04 09:58:16.660+0000: 51161: debug : virCgroupSetValueStr:331 : Set value '/sys/fs/cgroup/cpu,cpuacct/system/libvirtd.service/libvirt/qemu/e26/tasks' to '51865'
2013-04-04 09:58:16.660+0000: 51161: debug : virCgroupSetValueStr:335 : Failed to write value '51865': No such process
2013-04-04 09:58:16.660+0000: 51161: error : virCgroupMoveTask:911 : Cannot recover cgroup /sys/fs/cgroup/cpu,cpuacct from /sys/fs/cgroup/cpu,cpuacct
2013-04-04 09:58:16.660+0000: 51161: error : qemuSetupCgroupForEmulator:713 : Unable to move tasks from domain cgroup to emulator cgroup in controller 0 for e26: No such process
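The failure mode in the trace can be sketched as follows. This is a Python simulation, not libvirt's actual C code (the real logic lives in virCgroupMoveTask); the helpers `read_tids` and `write_tid` are hypothetical stand-ins for reading and writing the cgroup `tasks` files.

```python
import errno

def move_tasks(read_source_tids, write_dest_tid):
    """Sketch of the pre-fix behaviour: read every tid from the source
    cgroup's tasks file, then write each one into the destination.
    Any write failure -- including ESRCH for a thread that exited after
    the read -- aborts the whole move."""
    for tid in read_source_tids():
        try:
            write_dest_tid(tid)
        except OSError as e:
            # Pre-fix libvirt treated this as fatal, producing
            # "Unable to move tasks ... : No such process".
            raise RuntimeError("Failed to write value '%d': %s"
                               % (tid, e.strerror))

# Simulated cgroup state, using the tids from the trace:
# 51865 exits between the read and the write.
alive = {51805, 51809}
dest = []

def read_tids():
    return [51805, 51809, 51865]   # snapshot taken while 51865 still ran

def write_tid(tid):
    if tid not in alive:
        raise OSError(errno.ESRCH, "No such process")
    dest.append(tid)
```

Calling `move_tasks(read_tids, write_tid)` moves 51805 and 51809 but fails on 51865, mirroring the trace above.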
Comment 1 IBM Bug Proxy 2013-05-20 11:21:01 EDT
Created attachment 750629 [details]
qemu cgroup trace
Comment 2 Daniel Berrange 2013-05-20 11:23:51 EDT
See also upstream thread about this problem: https://www.redhat.com/archives/libvir-list/2013-May/msg01360.html
Comment 3 Eric Blake 2013-05-20 11:27:42 EDT
Known issue, and I'm working on the fix.  There's a race in libvirt (race one in this thread: https://www.redhat.com/archives/libvir-list/2013-May/msg01360.html).  Sometimes, qemu starts short-lived threads (perhaps glibc is spawning a thread to do aio work while reading a disk image); the race is that if the temporary qemu thread exits between the time libvirt reads the tids from the source cgroup and the time it writes them to the destination cgroup, then the write for the already-exited tid will fail, and libvirt turns that failure into a catastrophic cascade that prevents the domain from starting.

It looks like the fix will be teaching libvirt to ignore failure on moving an (exited) process into another cgroup.
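A minimal sketch of that idea, assuming a hypothetical `write_dest_tid` helper (the actual patch is to libvirt's C cgroup code):

```python
import errno

def move_tid(write_dest_tid, tid):
    """Sketch of the fix: ESRCH on the write means the thread already
    exited, so there is nothing left to move -- skip it and continue.
    Any other error remains fatal."""
    try:
        write_dest_tid(tid)
        return True        # tid was moved
    except OSError as e:
        if e.errno == errno.ESRCH:
            return False   # thread is gone; harmless, ignore it
        raise

def write_fails_esrch(tid):
    # Simulates the kernel rejecting a write for an exited thread.
    raise OSError(errno.ESRCH, "No such process")
```

With this change, the exited thread 51865 from the trace would simply be skipped instead of aborting the domain start.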
Comment 5 Eric Blake 2013-05-20 23:08:24 EDT
Upstream patch posted:
https://www.redhat.com/archives/libvir-list/2013-May/msg01478.html
Comment 6 IBM Bug Proxy 2013-05-21 10:59:44 EDT
------- Comment From aliguori@us.ibm.com 2013-05-21 14:31 EDT-------
Hi Eric,

You should CC qemu-devel if you resubmit.  These threads are our AIO pool.  It has a fixed size and we have logic to tear down idle threads and respawn threads as needed.

We could also add an option to not tear down idle threads if it made cgroup management more deterministic...
Comment 7 Daniel Berrange 2013-05-21 13:40:15 EDT
I don't think there's any need for QEMU to change what it's doing. We now iterate until the original cgroup tasks file is empty, so we are guaranteed to move all QEMU threads, even if it spawns more while we're working. QEMU isn't spawning these threads so fast that this approach is a problem.
Comment 8 Eric Blake 2013-05-21 14:04:28 EDT
Furthermore, if all your threads are being spawned by a master thread (all helper threads share a common parent) rather than the alternative of spawning each new thread from the most-recent thread (later threads are separated from the original parent thread by intermediate threads), then the moment we have moved the parent thread, all further threads that the parent spawns will already be in the right group, at which point libvirt's looping code will generally iterate at most twice before picking up all threads, no matter how fast your master thread is spawning them.  I agree that qemu doesn't need to change its policy on thread usage at this time.
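The iterate-until-empty approach from comments 7 and 8 can be sketched like this. The helpers, the `max_passes` safety bound, and the simulated tid 51900 are all hypothetical; the real loop is in libvirt's C cgroup code.

```python
def drain_cgroup(read_source_tids, move_one, max_passes=10):
    """Keep re-reading the source tasks file and moving whatever is there,
    so threads spawned mid-move are picked up on a later pass.  Once the
    parent thread has been moved, new helper threads start out in the
    right cgroup, so this usually converges in one or two passes."""
    for _ in range(max_passes):
        tids = read_source_tids()
        if not tids:
            return True    # source cgroup is empty; all threads moved
        for tid in tids:
            move_one(tid)
    return False

# Simulated source cgroup that gains one new thread during the first pass.
source = [51805, 51809]
moved = []
spawned = {"done": False}

def read_tids():
    return list(source)

def move_one(tid):
    source.remove(tid)
    moved.append(tid)
    if not spawned["done"]:       # a helper thread appears mid-move
        spawned["done"] = True
        source.append(51900)      # hypothetical new tid
```

In this simulation the thread spawned mid-move is caught on the second pass, and the loop terminates once the source tasks file reads back empty.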
Comment 9 Eric Blake 2013-05-21 16:08:57 EDT
Patch in comment 5 is now commit 83e4c775 upstream, and I have already backported it to v0.10.2-maint (F18) and v1.0.5-maint (F19).
Comment 10 Fedora Update System 2013-06-12 18:14:15 EDT
libvirt-0.10.2.6-1.fc18 has been submitted as an update for Fedora 18.
https://admin.fedoraproject.org/updates/libvirt-0.10.2.6-1.fc18
Comment 11 Fedora Update System 2013-06-13 22:26:56 EDT
Package libvirt-0.10.2.6-1.fc18:
* should fix your issue,
* was pushed to the Fedora 18 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing libvirt-0.10.2.6-1.fc18'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2013-10805/libvirt-0.10.2.6-1.fc18
then log in and leave karma (feedback).
Comment 12 IBM Bug Proxy 2013-06-19 03:10:58 EDT
------- Comment From kamaleshb@in.ibm.com 2013-06-19 07:07 EDT-------
Hi,

Tested successfully with libvirt-1.0.6-487.kvm.20130610.ga191a2b.s390x on fc18.

thanks

Agi
Comment 13 Fedora Update System 2013-06-24 23:24:58 EDT
libvirt-0.10.2.6-1.fc18 has been pushed to the Fedora 18 stable repository.  If problems still persist, please make note of it in this bug report.
