Bug 965169 - Unable to move tasks from domain cgroup to emulator cgroup
Status: CLOSED ERRATA
Product: Fedora
Classification: Fedora
Component: libvirt
Version: 18
Hardware: All
Priority: unspecified
Severity: high
Assigned To: Eric Blake
QA Contact: Fedora Extras Quality Assurance
Reported: 2013-05-20 11:20 EDT by IBM Bug Proxy
Modified: 2013-06-24 23:24 EDT
Fixed In Version: libvirt-0.10.2.6-1.fc18
Doc Type: Bug Fix
Last Closed: 2013-06-24 23:24:58 EDT


Attachments:
qemu cgroup trace (594.89 KB, application/octet-stream), 2013-05-20 11:21 EDT, IBM Bug Proxy


External Trackers:
IBM Linux Technology Center 91788
Description IBM Bug Proxy 2013-05-20 11:20:39 EDT
---uname output---
Linux r35lp09 3.8.6-60.x.20130411-s390xrhel
 
Machine Type = zSeries 

---Steps to Reproduce---
 
- I was running 14 kvm guests with workload:
e20 - 4G memory - 1,4G swap - vcpu 2 - running - fmempig 4000, stress -c 1
e21 - 4G memory - 1,4G swap - vcpu 2 - running - fmempig 3000, stress -c 1
e22 - 4G memory - 1,4G swap - vcpu 2 - running - fmempig 4000
e23 - 4G memory - 1,4G swap - vcpu 2 - running - stress -c 2
e23 - 2G memory - 1,4G swap - vcpu 2 - running - stress -c 2
e10 - 1G memory - no swap - vcpu 1 - running - fmempig 600
e11 - 1G memory - no swap - vcpu 1 - running - fmempig 600
s40 - 4G memory - 1,4G swap - vcpu 2 - running - stress -c 1
s41 - 4G memory - 1,4G swap - vcpu 2 - running - fmempig 3000, stress -c 1
s42 - 4G memory - 1,4G swap - vcpu 2 - running - fmempig 5000
s43 - 4G memory - 1,4G swap - vcpu 2 - running - stress -c 1
s44 - 4G memory - 1,4G swap - vcpu 2 - running - stress -c 2
s45 - 4G memory - 1,4G swap - vcpu 2 - running - stress -c 2
e23 - 4G memory - 1,4G swap - vcpu 2 - running - stress -c 2

- Then I tried to start 3 additional guests, with the following result:
[root@r35lp09 ?]# ./zdomain.sh
Domain e25 started

(Time: 55042.598 ms)

error: Failed to start domain e26
error: Unable to move tasks from domain cgroup to emulator cgroup in controller 0 for e26: No such process

(Time: 69792.559 ms)

Domain e27 started

(Time: 57426.563 ms)

- I took a qemu cgroup trace (see the attachment).

- Analysis of this problem:

I understand now what happens, but you won't like it ;).
The system seems to be under extreme I/O pressure, and this exposes a race condition in the interaction between libvirt and qemu.
Since this is a conceptual problem that will happen under high I/O load on any architecture, it is necessary to open an LTC bugzilla against it.

Here's what happens: When the QEMU process is started for e26 all thread ids are contained in /sys/fs/cgroup/cpu,cpuacct/system/libvirtd.service/libvirt/qemu/e26/tasks.
Libvirt now tries to move the CPU threads into /sys/fs/cgroup/cpu,cpuacct/system/libvirtd.service/libvirt/qemu/e26/vcpu*/tasks (this succeeds) and the remaining threads into
/sys/fs/cgroup/cpu,cpuacct/system/libvirtd.service/libvirt/qemu/e26/emulator/tasks.
Now comes the race: one of the threads originally in e26/tasks has ended (which is perfectly OK), but libvirt still wants to place it in e26/emulator/tasks, and this cannot work.

Here's the relevant part of the trace...thread id 51865 is gone by the time libvirt wants to move it.

2013-04-04 09:58:08.221+0000: 51161: debug : virCgroupGetValueStr:361 : Get value /sys/fs/cgroup/cpu,cpuacct/system/libvirtd.service/libvirt/qemu/e26/tasks
2013-04-04 09:58:08.221+0000: 51161: debug : virCgroupSetValueStr:331 : Set value '/sys/fs/cgroup/cpu,cpuacct/system/libvirtd.service/libvirt/qemu/e26/emulator/tasks' to '51805'
2013-04-04 09:58:11.070+0000: 51161: debug : virCgroupSetValueStr:331 : Set value '/sys/fs/cgroup/cpu,cpuacct/system/libvirtd.service/libvirt/qemu/e26/emulator/tasks' to '51809'
2013-04-04 09:58:12.740+0000: 51161: debug : virCgroupSetValueStr:331 : Set value '/sys/fs/cgroup/cpu,cpuacct/system/libvirtd.service/libvirt/qemu/e26/emulator/tasks' to '51865'
2013-04-04 09:58:12.740+0000: 51161: debug : virCgroupSetValueStr:335 : Failed to write value '51865': No such process
2013-04-04 09:58:12.740+0000: 51161: debug : virCgroupSetValueStr:331 : Set value '/sys/fs/cgroup/cpu,cpuacct/system/libvirtd.service/libvirt/qemu/e26/tasks' to '51805'
2013-04-04 09:58:14.660+0000: 51161: debug : virCgroupSetValueStr:331 : Set value '/sys/fs/cgroup/cpu,cpuacct/system/libvirtd.service/libvirt/qemu/e26/tasks' to '51809'
2013-04-04 09:58:16.660+0000: 51161: debug : virCgroupSetValueStr:331 : Set value '/sys/fs/cgroup/cpu,cpuacct/system/libvirtd.service/libvirt/qemu/e26/tasks' to '51865'
2013-04-04 09:58:16.660+0000: 51161: debug : virCgroupSetValueStr:335 : Failed to write value '51865': No such process
2013-04-04 09:58:16.660+0000: 51161: error : virCgroupMoveTask:911 : Cannot recover cgroup /sys/fs/cgroup/cpu,cpuacct from /sys/fs/cgroup/cpu,cpuacct
2013-04-04 09:58:16.660+0000: 51161: error : qemuSetupCgroupForEmulator:713 : Unable to move tasks from domain cgroup to emulator cgroup in controller 0 for e26: No such process
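The failure mode in the trace can be sketched as follows. This is a Python simulation, not libvirt's actual C code (the real logic lives in virCgroupMoveTask); the helpers `read_tids` and `write_tid` are hypothetical stand-ins for reading and writing the cgroup `tasks` files.

```python
import errno

def move_tasks(read_source_tids, write_dest_tid):
    """Sketch of the pre-fix behaviour: read every tid from the source
    cgroup's tasks file, then write each one into the destination.
    Any write failure -- including ESRCH for a thread that exited after
    the read -- aborts the whole move."""
    for tid in read_source_tids():
        try:
            write_dest_tid(tid)
        except OSError as e:
            # Pre-fix libvirt treated this as fatal, producing
            # "Unable to move tasks ... : No such process".
            raise RuntimeError("Failed to write value '%d': %s"
                               % (tid, e.strerror))

# Simulated cgroup state, using the tids from the trace:
# 51865 exits between the read and the write.
alive = {51805, 51809}
dest = []

def read_tids():
    return [51805, 51809, 51865]   # snapshot taken while 51865 still ran

def write_tid(tid):
    if tid not in alive:
        raise OSError(errno.ESRCH, "No such process")
    dest.append(tid)
```

Calling `move_tasks(read_tids, write_tid)` moves 51805 and 51809 but fails on 51865, mirroring the trace above.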
Comment 1 IBM Bug Proxy 2013-05-20 11:21:01 EDT
Created attachment 750629 [details]
qemu cgroup trace
Comment 2 Daniel Berrange 2013-05-20 11:23:51 EDT
See also upstream thread about this problem: https://www.redhat.com/archives/libvir-list/2013-May/msg01360.html
Comment 3 Eric Blake 2013-05-20 11:27:42 EDT
Known issue, and I'm working on the fix.  There's a race in libvirt (race one in this thread: https://www.redhat.com/archives/libvir-list/2013-May/msg01360.html).  Sometimes, qemu starts short-lived threads (perhaps glibc is spawning a thread to do aio work while reading a disk image); the race is that if the temporary qemu thread exits between the time libvirt reads the tids from the source cgroup and the time it writes them to the destination cgroup, then the write for the already-exited tid will fail, and libvirt turns that failure into a catastrophic cascade that prevents the domain from starting.

It looks like the fix will be teaching libvirt to ignore failure on moving an (exited) process into another cgroup.
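A minimal sketch of that idea, assuming a hypothetical `write_dest_tid` helper (the actual patch is to libvirt's C cgroup code):

```python
import errno

def move_tid(write_dest_tid, tid):
    """Sketch of the fix: ESRCH on the write means the thread already
    exited, so there is nothing left to move -- skip it and continue.
    Any other error remains fatal."""
    try:
        write_dest_tid(tid)
        return True        # tid was moved
    except OSError as e:
        if e.errno == errno.ESRCH:
            return False   # thread is gone; harmless, ignore it
        raise

def write_fails_esrch(tid):
    # Simulates the kernel rejecting a write for an exited thread.
    raise OSError(errno.ESRCH, "No such process")
```

With this change, the exited thread 51865 from the trace would simply be skipped instead of aborting the domain start.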
Comment 5 Eric Blake 2013-05-20 23:08:24 EDT
Upstream patch posted:
https://www.redhat.com/archives/libvir-list/2013-May/msg01478.html
Comment 6 IBM Bug Proxy 2013-05-21 10:59:44 EDT
------- Comment From aliguori@us.ibm.com 2013-05-21 14:31 EDT-------
Hi Eric,

You should CC qemu-devel if you resubmit.  These threads are our AIO pool.  It has a fixed size and we have logic to tear down idle threads and respawn threads as needed.

We could also add an option to not tear down idle threads if it made cgroup management more deterministic...
Comment 7 Daniel Berrange 2013-05-21 13:40:15 EDT
I don't think there's any need for QEMU to change what it's doing. We now iterate until the original cgroup tasks file is empty, so we are guaranteed to move all QEMU threads, even if it spawns more while we're working. QEMU isn't spawning these threads so fast that this approach is a problem.
Comment 8 Eric Blake 2013-05-21 14:04:28 EDT
Furthermore, if all your threads are being spawned by a master thread (all helper threads share a common parent) rather than the alternative of spawning each new thread from the most-recent thread (later threads are separated from the original parent thread by intermediate threads), then the moment we have moved the parent thread, all further threads that the parent spawns will already be in the right group, at which point libvirt's looping code will generally iterate at most twice before picking up all threads, no matter how fast your master thread is spawning them.  I agree that qemu doesn't need to change its policy on thread usage at this time.
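The iterate-until-empty approach from comments 7 and 8 can be sketched like this. The helpers, the `max_passes` safety bound, and the simulated tid 51900 are all hypothetical; the real loop is in libvirt's C cgroup code.

```python
def drain_cgroup(read_source_tids, move_one, max_passes=10):
    """Keep re-reading the source tasks file and moving whatever is there,
    so threads spawned mid-move are picked up on a later pass.  Once the
    parent thread has been moved, new helper threads start out in the
    right cgroup, so this usually converges in one or two passes."""
    for _ in range(max_passes):
        tids = read_source_tids()
        if not tids:
            return True    # source cgroup is empty; all threads moved
        for tid in tids:
            move_one(tid)
    return False

# Simulated source cgroup that gains one new thread during the first pass.
source = [51805, 51809]
moved = []
spawned = {"done": False}

def read_tids():
    return list(source)

def move_one(tid):
    source.remove(tid)
    moved.append(tid)
    if not spawned["done"]:       # a helper thread appears mid-move
        spawned["done"] = True
        source.append(51900)      # hypothetical new tid
```

In this simulation the thread spawned mid-move is caught on the second pass, and the loop terminates once the source tasks file reads back empty.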
Comment 9 Eric Blake 2013-05-21 16:08:57 EDT
Patch in comment 5 is now commit 83e4c775 upstream, and I have already backported it to v0.10.2-maint (F18) and v1.0.5-maint (F19).
Comment 10 Fedora Update System 2013-06-12 18:14:15 EDT
libvirt-0.10.2.6-1.fc18 has been submitted as an update for Fedora 18.
https://admin.fedoraproject.org/updates/libvirt-0.10.2.6-1.fc18
Comment 11 Fedora Update System 2013-06-13 22:26:56 EDT
Package libvirt-0.10.2.6-1.fc18:
* should fix your issue,
* was pushed to the Fedora 18 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing libvirt-0.10.2.6-1.fc18'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2013-10805/libvirt-0.10.2.6-1.fc18
then log in and leave karma (feedback).
Comment 12 IBM Bug Proxy 2013-06-19 03:10:58 EDT
------- Comment From kamaleshb@in.ibm.com 2013-06-19 07:07 EDT-------
Hi,

Tested successfully with libvirt-1.0.6-487.kvm.20130610.ga191a2b.s390x on fc18.

thanks

Agi
Comment 13 Fedora Update System 2013-06-24 23:24:58 EDT
libvirt-0.10.2.6-1.fc18 has been pushed to the Fedora 18 stable repository.  If problems still persist, please make note of it in this bug report.
