Description of problem:

The libvirt_lxc helper consumes 100% CPU because it sees an event on /dev/ptmx without ever processing it:

-----
[pid 21168] 15:56:43.966141 poll([{fd=3, events=POLLIN}, {fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}, {fd=14, events=POLLIN}, {fd=26, events=POLLIN|POLLOUT}, {fd=15, events=POLLIN}], 7, 4294967295) = 1 ([{fd=26, revents=POLLHUP}])
[pid 21168] 15:56:43.966174 gettid() = 21168
[pid 21168] 15:56:43.966193 gettid() = 21168
[pid 21168] 15:56:43.966212 gettid() = 21168
[pid 21168] 15:56:43.966236 gettid() = 21168
[pid 21168] 15:56:43.966268 gettid() = 21168
[pid 21168] 15:56:43.966293 gettid() = 21168
[pid 21168] 15:56:43.966316 gettid() = 21168
[pid 21168] 15:56:43.966339 gettid() = 21168
[pid 21168] 15:56:43.966362 gettid() = 21168
[pid 21168] 15:56:43.966385 gettid() = 21168
[pid 21168] 15:56:43.966409 gettid() = 21168
[pid 21168] 15:56:43.966432 gettid() = 21168
[pid 21168] 15:56:43.966455 gettid() = 21168
[pid 21168] 15:56:43.966478 gettid() = 21168
[pid 21168] 15:56:43.966501 gettid() = 21168
[pid 21168] 15:56:43.966524 gettid() = 21168
[pid 21168] 15:56:43.966547 gettid() = 21168
[pid 21168] 15:56:43.966570 gettid() = 21168
[pid 21168] 15:56:43.966593 gettid() = 21168
[pid 21168] 15:56:43.966615 gettid() = 21168
[pid 21168] 15:56:43.966638 gettid() = 21168
[pid 21168] 15:56:43.966662 gettid() = 21168
[pid 21168] 15:56:43.966684 gettid() = 21168
[pid 21168] 15:56:43.966708 gettid() = 21168
[pid 21168] 15:56:43.966730 gettid() = 21168
[pid 21168] 15:56:43.966753 gettid() = 21168
[pid 21168] 15:56:43.966776 gettid() = 21168
[pid 21168] 15:56:43.966799 gettid() = 21168
[pid 21168] 15:56:43.966822 gettid() = 21168
[pid 21168] 15:56:43.966845 gettid() = 21168
[pid 21168] 15:56:43.966867 gettid() = 21168
[pid 21168] 15:56:43.966890 gettid() = 21168
[pid 21168] 15:56:43.966913 gettid() = 21168
[pid 21168] 15:56:43.966936 gettid() = 21168
[pid 21168] 15:56:43.966958 poll([{fd=3, events=POLLIN}, {fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}, {fd=14, events=POLLIN}, {fd=26, events=POLLIN|POLLOUT}, {fd=15, events=POLLIN}], 7, 4294967295) = 1 ([{fd=26, revents=POLLHUP}])
----

libvirt_l 21168 root  0u  CHR      1,3                 0t0    1028  /dev/null
libvirt_l 21168 root  1w  REG      253,12              12402  131418 /var/log/libvirt/lxc/XXXXX.log
libvirt_l 21168 root  2w  REG      253,12              12402  131418 /var/log/libvirt/lxc/XXXXX.log
libvirt_l 21168 root  3r  FIFO     0,8                 0t0    137497 pipe
libvirt_l 21168 root  4w  FIFO     0,8                 0t0    137497 pipe
libvirt_l 21168 root  5r  FIFO     0,8                 0t0    139448 pipe
libvirt_l 21168 root  6w  FIFO     0,8                 0t0    139448 pipe
libvirt_l 21168 root  7u  unix     0xffff8802de28d040  0t0    139449 /var/run/libvirt/lxc/XXXXX.sock
libvirt_l 21168 root  8u  unix     0xffff8802f5e1ba80  0t0    136795 socket
libvirt_l 21168 root  9r  FIFO     0,8                 0t0    136829 pipe
libvirt_l 21168 root 10u  unix     0xffff8802f5e1ad80  0t0    136797 socket
libvirt_l 21168 root 11w  FIFO     0,8                 0t0    136829 pipe
libvirt_l 21168 root 12u  CHR      10,229              0t0    12117  /dev/fuse
libvirt_l 21168 root 13u  CHR      5,2                 0t0    2      /var/run/libvirt/lxc/XXXXX.devpts/ptmx
libvirt_l 21168 root 14u  a_inode  0,9                 0      5802   [eventpoll]
libvirt_l 21168 root 15u  unix     0xffff8802ce0830c0  0t0    136830 /var/run/libvirt/lxc/XXXXX.sock
libvirt_l 21168 root 26u  CHR      5,2                 0t0    1121   /dev/ptmx

Version-Release number of selected component (if applicable):
libvirt-daemon-1.0.5.5-1.fc19.x86_64
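For anyone who wants to see the mechanism outside of libvirt: below is a minimal, self-contained demo of mine (not libvirt code) that reproduces the same symptom on Linux. Once the slave side of a pty is closed, poll() on the master reports POLLHUP immediately and forever; a caller that keeps the fd in its poll set without reacting spins exactly like the strace above.

#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    /* Create a pty master, open the slave once, then close it again,
     * mimicking a guest that closed its console. */
    int master = posix_openpt(O_RDWR | O_NOCTTY);
    if (master < 0 || grantpt(master) < 0 || unlockpt(master) < 0) {
        perror("pty setup");
        return 1;
    }
    int slave = open(ptsname(master), O_RDWR | O_NOCTTY);
    if (slave >= 0)
        close(slave);

    struct pollfd pfd = { .fd = master, .events = POLLIN | POLLOUT };
    for (int i = 0; i < 5; i++) {
        /* With no slave open, poll() returns immediately with POLLHUP set,
         * regardless of which events were requested.  A loop that never
         * reacts to this (read, close, or drop the fd from the set)
         * burns 100% CPU, which is what the strace above shows. */
        int n = poll(&pfd, 1, -1);
        printf("poll=%d revents=0x%x\n", n, (unsigned)pfd.revents);
    }
    close(master);
    return 0;
}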
danpb does this sound familiar at all?
I encountered the same problem on a MIPS machine, but on x86 it works fine. To Enrico Scholz: what platform are you using? x86?

I found the poll in virEventPollRunOnce (vireventpoll.c): a POLLHUP event occurred for a pipe descriptor, so the poll always returned immediately because of it (the pipe was broken). In lxc/lxc_controller.c there is:

    if (events & VIR_EVENT_HANDLE_HANGUP) {
        if (watch == console->hostWatch) {
            console->hostClosed = true;
        } else {
            console->contClosed = true;
        }
        VIR_DEBUG("Got EOF on %d %d", watch, fd);
    }

so the hangup case seems to be handled already; I don't know how to go on from here. The above is what I got from tracking it.
2013-10-30 00:31:57.469+0000: 7566: debug : virEventPollRunOnce:638 : Poll got 1 event(s)
2013-10-30 00:31:57.469+0000: 7566: debug : virEventPollDispatchTimeouts:423 : Dispatch 2
2013-10-30 00:31:57.469+0000: 7566: debug : virEventPollDispatchHandles:468 : Dispatch 7
2013-10-30 00:31:57.469+0000: 7566: debug : virEventPollDispatchHandles:482 : i=0 w=1
2013-10-30 00:31:57.469+0000: 7566: debug : virEventPollDispatchHandles:482 : i=1 w=2
2013-10-30 00:31:57.469+0000: 7566: debug : virEventPollDispatchHandles:482 : i=2 w=3
2013-10-30 00:31:57.469+0000: 7566: debug : virEventPollDispatchHandles:482 : i=3 w=4
2013-10-30 00:31:57.469+0000: 7566: debug : virEventPollDispatchHandles:482 : i=4 w=5
2013-10-30 00:31:57.469+0000: 7566: debug : virEventPollDispatchHandles:482 : i=5 w=6
2013-10-30 00:31:57.470+0000: 7566: debug : virEventPollDispatchHandles:496 : EVENT_POLL_DISPATCH_HANDLE: watch=6 events=8
2013-10-30 00:31:57.470+0000: 7566: debug : virLXCControllerConsoleIO:974 : IO event watch=6 fd=18 events=8 fromHost=0 fromcont=1024
2013-10-30 00:31:57.470+0000: 7566: debug : virLXCControllerConsoleIO:1044 : Got EOF on 6 18
2013-10-30 00:31:57.470+0000: 7566: debug : virLXCControllerConsoleUpdateWatch:837 : Container watch 7=0 host watch 6=3
2013-10-30 00:31:57.470+0000: 7566: debug : virEventPollUpdateHandle:147 : EVENT_POLL_UPDATE_HANDLE: watch=7 events=0
2013-10-30 00:31:57.470+0000: 7566: debug : virEventPollInterruptLocked:710 : Skip interrupt, 1 1996885056
2013-10-30 00:31:57.470+0000: 7566: debug : virEventPollUpdateHandle:147 : EVENT_POLL_UPDATE_HANDLE: watch=6 events=3
2013-10-30 00:31:57.470+0000: 7566: debug : virEventPollInterruptLocked:710 : Skip interrupt, 1 1996885056
2013-10-30 00:31:57.470+0000: 7566: debug : virEventPollDispatchHandles:482 : i=7 w=8
2013-10-30 00:31:57.470+0000: 7566: debug : virEventPollCleanupTimeouts:514 : Cleanup 2
2013-10-30 00:31:57.471+0000: 7566: debug : virEventPollCleanupHandles:562 : Cleanup 8
2013-10-30 00:31:57.471+0000: 7566: debug : virEventRunDefaultImpl:270 : running default event implementation
2013-10-30 00:31:57.471+0000: 7566: debug : virEventPollCleanupTimeouts:514 : Cleanup 2
2013-10-30 00:31:57.471+0000: 7566: debug : virEventPollCleanupHandles:562 : Cleanup 8
2013-10-30 00:31:57.471+0000: 7566: debug : virEventPollMakePollFDs:391 : Prepare n=0 w=1, f=3 e=1 d=0
2013-10-30 00:31:57.471+0000: 7566: debug : virEventPollMakePollFDs:391 : Prepare n=1 w=2, f=5 e=1 d=0
2013-10-30 00:31:57.471+0000: 7566: debug : virEventPollMakePollFDs:391 : Prepare n=2 w=3, f=7 e=1 d=0
2013-10-30 00:31:57.471+0000: 7566: debug : virEventPollMakePollFDs:391 : Prepare n=3 w=4, f=9 e=1 d=0
2013-10-30 00:31:57.471+0000: 7566: debug : virEventPollMakePollFDs:391 : Prepare n=4 w=5, f=13 e=1 d=0
2013-10-30 00:31:57.472+0000: 7566: debug : virEventPollMakePollFDs:391 : Prepare n=5 w=6, f=18 e=5 d=0
2013-10-30 00:31:57.472+0000: 7566: debug : virEventPollMakePollFDs:391 : Prepare n=6 w=7, f=12 e=0 d=0
2013-10-30 00:31:57.472+0000: 7566: debug : virEventPollMakePollFDs:391 : Prepare n=7 w=8, f=14 e=1 d=0
2013-10-30 00:31:57.472+0000: 7566: debug : virEventPollCalculateTimeout:332 : Calculate expiry of 2 timers
2013-10-30 00:31:57.472+0000: 7566: debug : virEventPollCalculateTimeout:361 : Timeout at 0 due in -1 ms
2013-10-30 00:31:57.472+0000: 7566: debug : virEventPollRunOnce:627 : EVENT_POLL_RUN: nhandles=7 timeout=-1
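To make the log above easier to follow, here is a rough sketch of the poll-based event-loop shape it walks through (prepare handles -> poll -> dispatch -> cleanup). This is illustrative code of mine, not libvirt's actual vireventpoll.c; the point is that if a dispatched handler leaves the offending watch enabled, the very next poll() reports the same POLLHUP again and the loop spins.

#include <poll.h>
#include <stddef.h>

struct handle {
    int watch;
    int fd;
    int events;                          /* requested events; 0 = disabled */
    void (*cb)(int watch, int fd, int revents);
};

static void run_once(struct handle *handles, size_t nhandles)
{
    struct pollfd fds[64];
    size_t map[64], n = 0;

    /* "Prepare n=... w=... f=... e=..." -- handles with no events are
     * skipped, so they never reach poll() at all. */
    for (size_t i = 0; i < nhandles && n < 64; i++) {
        if (handles[i].events == 0)
            continue;
        fds[n].fd = handles[i].fd;
        fds[n].events = handles[i].events;
        fds[n].revents = 0;
        map[n++] = i;
    }

    /* "EVENT_POLL_RUN: nhandles=... timeout=-1" */
    if (poll(fds, n, -1) <= 0)
        return;

    /* "EVENT_POLL_DISPATCH_HANDLE: watch=... events=..." -- if the callback
     * does not change the watch for a hung-up fd, the next iteration polls
     * it again and returns immediately. */
    for (size_t i = 0; i < n; i++) {
        if (fds[i].revents)
            handles[map[i]].cb(handles[map[i]].watch,
                               fds[i].fd, fds[i].revents);
    }
}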
In the normal situation the poll would wait, but on my machine a pipe seems to be broken, so there is always a hangup event, the poll returns immediately, and we end up in an infinite loop, which causes the 100% CPU usage.
Created attachment 818259 [details]
the debug log

The debug log of libvirt_lxc.
When this 100% CPU problem occurs, can anyone tell me whether they still see a process running as a child of the 'libvirt_lxc' process? Either a zombie or a normal process.

Looking at the code, the only way I can see us getting into this problem is if the guest closed its console (stdio) streams but did not exit. We should always get SIGCHLD if the process exited, so I'm thinking perhaps the container's init process closed stdin/out/err but kept running.
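If it helps anyone reproduce that hypothesis, a trivial container init along these lines (purely illustrative, not taken from any affected container) closes its stdio but keeps running, so the controller sees the console hang up without ever getting SIGCHLD:

#include <unistd.h>

int main(void)
{
    /* Close the console streams, like the suspected guest init ... */
    close(STDIN_FILENO);
    close(STDOUT_FILENO);
    close(STDERR_FILENO);

    /* ... but never exit, so the parent gets no SIGCHLD. */
    for (;;)
        pause();
}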
(In reply to Daniel Berrange from comment #6)
> When this 100% CPU problem occurs, can anyone tell me whether they still
> see a process running as a child of the 'libvirt_lxc' process? Either a
> zombie or a normal process.
>
> Looking at the code, the only way I can see us getting into this problem
> is if the guest closed its console (stdio) streams but did not exit. We
> should always get SIGCHLD if the process exited, so I'm thinking perhaps
> the container's init process closed stdin/out/err but kept running.

Yes, the process tree is the same as when everything is normal.
jj: if you still have a process consuming 100% cpu, can you run 'strace -s 1000 -f -ff -o lxc.trace.txt -p PID-OF-LIBVIRT_LXC' and let it run for a minute or two and then attach the result to this bug.
For me, the problem disappears when I attach a console (virsh console ...). I see some new output there; after detaching, things work well for a while, but then the 100% CPU issue happens again (probably because something is written to the console).
(In reply to Enrico Scholz from comment #9)
> For me, the problem disappears when I attach a console (virsh console ...).
> I see some new output there; after detaching, things work well for a while,
> but then the 100% CPU issue happens again (probably because something is
> written to the console).

Yep, that's the magic bit of info which explains it all. I can see from jj's debug log that data has been received from the container which the controller is trying to send to the host. The host is not connected, and rather than wait for it to reconnect, we just busy-wait, hence 100% CPU.
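For the record, the general shape of a fix for this kind of busy wait (this is only a sketch of the idea, not the attached patch or the eventual upstream commit; the names are illustrative) is to recompute the watch mask so that a side whose peer is gone, or which has nothing to write, is not asked for POLLOUT at all:

#include <poll.h>
#include <stddef.h>

struct console {
    int hostWatch, contWatch;
    int hostClosed, contClosed;   /* set when a POLLHUP/EOF was seen */
    size_t fromHost, fromCont;    /* bytes buffered in each direction */
};

/* Illustrative stand-in for libvirt's virEventUpdateHandle(watch, events). */
static void update_handle(int watch, int events)
{
    (void)watch;
    (void)events;   /* a real event loop would update its pollfd mask here */
}

void console_update_watch(struct console *c)
{
    int hostEvents = 0;
    int contEvents = 0;

    /* Only watch the host pty while something is actually connected to it;
     * otherwise pending container output just stays buffered until the
     * host side reconnects, instead of driving a POLLOUT/POLLHUP spin. */
    if (!c->hostClosed) {
        hostEvents |= POLLIN;
        if (c->fromCont > 0)
            hostEvents |= POLLOUT;
    }
    if (!c->contClosed) {
        contEvents |= POLLIN;
        if (c->fromHost > 0)
            contEvents |= POLLOUT;
    }

    update_handle(c->hostWatch, hostEvents);
    update_handle(c->contWatch, contEvents);
}

Judging from the debug log above (eight handles prepared, but EVENT_POLL_RUN shows nhandles=7, the e=0 handle being dropped), libvirt's poll loop appears to skip handles whose event mask is zero, so clearing the mask keeps a hung-up fd out of poll() entirely.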
Created attachment 818352 [details]
Attempt to fix the busy wait

I believe this patch will fix the problem, but I need to do more testing before sending it upstream.
(In reply to Daniel Berrange from comment #8)
> jj: if you still have a process consuming 100% cpu, can you run 'strace -s
> 1000 -f -ff -o lxc.trace.txt -p PID-OF-LIBVIRT_LXC' and let it run for a
> minute or two and then attach the result to this bug.

Done; see the attachment.
Created attachment 818491 [details]
strace
(In reply to Enrico Scholz from comment #9)
> For me, the problem disappears when I attach a console (virsh console ...).
> I see some new output there; after detaching, things work well for a while,
> but then the 100% CPU issue happens again (probably because something is
> written to the console).

In comment 13 I uploaded the strace log.
(In reply to Daniel Berrange from comment #11)
> Created attachment 818352 [details]
> Attempt to fix the busy wait
>
> I believe this patch will fix the problem, but I need to do more testing
> before sending it upstream.

After testing, it seems to fix the problem.
Upstream now:

commit 5087a5a0092853702eb5e0c0297937a7859bcab3
Author: Daniel P. Berrange <berrange>
Date:   Fri Nov 1 16:24:30 2013 +0000

    Fix busy wait loop in LXC container I/O handling

Pushed to the f19 and f20 libvirt maint branches.
libvirt-1.0.5.8-1.fc19 has been submitted as an update for Fedora 19. https://admin.fedoraproject.org/updates/libvirt-1.0.5.8-1.fc19
Package libvirt-1.0.5.8-1.fc19:
* should fix your issue,
* was pushed to the Fedora 19 testing repository,
* should be available at your local mirror within two days.

Update it with:
# su -c 'yum update --enablerepo=updates-testing libvirt-1.0.5.8-1.fc19'
as soon as you are able to.

Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2013-23453/libvirt-1.0.5.8-1.fc19
then log in and leave karma (feedback).
libvirt-1.0.5.8-1.fc19 has been pushed to the Fedora 19 stable repository. If problems still persist, please make note of it in this bug report.