+++ This bug was initially created as a clone of Bug #1005570 +++

Description of problem:
The libvirt_lxc helper consumes 100% CPU because it sees an event on /dev/ptmx without processing it:

-----
[pid 21168] 15:56:43.966141 poll([{fd=3, events=POLLIN}, {fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}, {fd=14, events=POLLIN}, {fd=26, events=POLLIN|POLLOUT}, {fd=15, events=POLLIN}], 7, 4294967295) = 1 ([{fd=26, revents=POLLHUP}])
[pid 21168] 15:56:43.966174 gettid() = 21168
[pid 21168] 15:56:43.966193 gettid() = 21168
[pid 21168] 15:56:43.966212 gettid() = 21168
[pid 21168] 15:56:43.966236 gettid() = 21168
[pid 21168] 15:56:43.966268 gettid() = 21168
[pid 21168] 15:56:43.966293 gettid() = 21168
[pid 21168] 15:56:43.966316 gettid() = 21168
[pid 21168] 15:56:43.966339 gettid() = 21168
[pid 21168] 15:56:43.966362 gettid() = 21168
[pid 21168] 15:56:43.966385 gettid() = 21168
[pid 21168] 15:56:43.966409 gettid() = 21168
[pid 21168] 15:56:43.966432 gettid() = 21168
[pid 21168] 15:56:43.966455 gettid() = 21168
[pid 21168] 15:56:43.966478 gettid() = 21168
[pid 21168] 15:56:43.966501 gettid() = 21168
[pid 21168] 15:56:43.966524 gettid() = 21168
[pid 21168] 15:56:43.966547 gettid() = 21168
[pid 21168] 15:56:43.966570 gettid() = 21168
[pid 21168] 15:56:43.966593 gettid() = 21168
[pid 21168] 15:56:43.966615 gettid() = 21168
[pid 21168] 15:56:43.966638 gettid() = 21168
[pid 21168] 15:56:43.966662 gettid() = 21168
[pid 21168] 15:56:43.966684 gettid() = 21168
[pid 21168] 15:56:43.966708 gettid() = 21168
[pid 21168] 15:56:43.966730 gettid() = 21168
[pid 21168] 15:56:43.966753 gettid() = 21168
[pid 21168] 15:56:43.966776 gettid() = 21168
[pid 21168] 15:56:43.966799 gettid() = 21168
[pid 21168] 15:56:43.966822 gettid() = 21168
[pid 21168] 15:56:43.966845 gettid() = 21168
[pid 21168] 15:56:43.966867 gettid() = 21168
[pid 21168] 15:56:43.966890 gettid() = 21168
[pid 21168] 15:56:43.966913 gettid() = 21168
[pid 21168] 15:56:43.966936 gettid() = 21168
[pid 21168] 15:56:43.966958 poll([{fd=3, events=POLLIN}, {fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}, {fd=14, events=POLLIN}, {fd=26, events=POLLIN|POLLOUT}, {fd=15, events=POLLIN}], 7, 4294967295) = 1 ([{fd=26, revents=POLLHUP}])
----

libvirt_l 21168 root  0u  CHR                1,3     0t0    1028 /dev/null
libvirt_l 21168 root  1w  REG             253,12   12402  131418 /var/log/libvirt/lxc/XXXXX.log
libvirt_l 21168 root  2w  REG             253,12   12402  131418 /var/log/libvirt/lxc/XXXXX.log
libvirt_l 21168 root  3r  FIFO               0,8     0t0  137497 pipe
libvirt_l 21168 root  4w  FIFO               0,8     0t0  137497 pipe
libvirt_l 21168 root  5r  FIFO               0,8     0t0  139448 pipe
libvirt_l 21168 root  6w  FIFO               0,8     0t0  139448 pipe
libvirt_l 21168 root  7u  unix 0xffff8802de28d040    0t0  139449 /var/run/libvirt/lxc/XXXXX.sock
libvirt_l 21168 root  8u  unix 0xffff8802f5e1ba80    0t0  136795 socket
libvirt_l 21168 root  9r  FIFO               0,8     0t0  136829 pipe
libvirt_l 21168 root 10u  unix 0xffff8802f5e1ad80    0t0  136797 socket
libvirt_l 21168 root 11w  FIFO               0,8     0t0  136829 pipe
libvirt_l 21168 root 12u  CHR             10,229     0t0   12117 /dev/fuse
libvirt_l 21168 root 13u  CHR                5,2     0t0       2 /var/run/libvirt/lxc/XXXXX.devpts/ptmx
libvirt_l 21168 root 14u  a_inode            0,9       0    5802 [eventpoll]
libvirt_l 21168 root 15u  unix 0xffff8802ce0830c0    0t0  136830 /var/run/libvirt/lxc/XXXXX.sock
libvirt_l 21168 root 26u  CHR                5,2     0t0    1121 /dev/ptmx

Version-Release number of selected component (if applicable):
libvirt-daemon-1.0.5.5-1.fc19.x86_64

--- Additional comment from Cole Robinson on 2013-09-08 18:37:09 BST ---

danpb, does this sound familiar at all?

--- Additional comment from jj on 2013-11-01 02:22:46 GMT ---

I encountered the same problem on a MIPS machine, but on x86 it works OK.

To Enrico Scholz: what platform are you using? x86?

I found the poll in virEventPollRunOnce (vireventpoll.c); a POLLHUP event occurred for a pipe descriptor, so the poll always returned because of it (the pipe was broken).
And in lxc/lxc_controller.c:

    if (events & VIR_EVENT_HANDLE_HANGUP) {
        if (watch == console->hostWatch) {
            console->hostClosed = true;
        } else {
            console->contClosed = true;
        }
        VIR_DEBUG("Got EOF on %d %d", watch, fd);
    }

It seems to handle the hangup, but I don't know how to proceed from here. The above is what I got from tracing.

--- Additional comment from jj on 2013-11-01 02:24:29 GMT ---

2013-10-30 00:31:57.469+0000: 7566: debug : virEventPollRunOnce:638 : Poll got 1 event(s)
2013-10-30 00:31:57.469+0000: 7566: debug : virEventPollDispatchTimeouts:423 : Dispatch 2
2013-10-30 00:31:57.469+0000: 7566: debug : virEventPollDispatchHandles:468 : Dispatch 7
2013-10-30 00:31:57.469+0000: 7566: debug : virEventPollDispatchHandles:482 : i=0 w=1
2013-10-30 00:31:57.469+0000: 7566: debug : virEventPollDispatchHandles:482 : i=1 w=2
2013-10-30 00:31:57.469+0000: 7566: debug : virEventPollDispatchHandles:482 : i=2 w=3
2013-10-30 00:31:57.469+0000: 7566: debug : virEventPollDispatchHandles:482 : i=3 w=4
2013-10-30 00:31:57.469+0000: 7566: debug : virEventPollDispatchHandles:482 : i=4 w=5
2013-10-30 00:31:57.469+0000: 7566: debug : virEventPollDispatchHandles:482 : i=5 w=6
2013-10-30 00:31:57.470+0000: 7566: debug : virEventPollDispatchHandles:496 : EVENT_POLL_DISPATCH_HANDLE: watch=6 events=8
2013-10-30 00:31:57.470+0000: 7566: debug : virLXCControllerConsoleIO:974 : IO event watch=6 fd=18 events=8 fromHost=0 fromcont=1024
2013-10-30 00:31:57.470+0000: 7566: debug : virLXCControllerConsoleIO:1044 : Got EOF on 6 18
2013-10-30 00:31:57.470+0000: 7566: debug : virLXCControllerConsoleUpdateWatch:837 : Container watch 7=0 host watch 6=3
2013-10-30 00:31:57.470+0000: 7566: debug : virEventPollUpdateHandle:147 : EVENT_POLL_UPDATE_HANDLE: watch=7 events=0
2013-10-30 00:31:57.470+0000: 7566: debug : virEventPollInterruptLocked:710 : Skip interrupt, 1 1996885056
2013-10-30 00:31:57.470+0000: 7566: debug : virEventPollUpdateHandle:147 : EVENT_POLL_UPDATE_HANDLE: watch=6 events=3
2013-10-30 00:31:57.470+0000: 7566: debug : virEventPollInterruptLocked:710 : Skip interrupt, 1 1996885056
2013-10-30 00:31:57.470+0000: 7566: debug : virEventPollDispatchHandles:482 : i=7 w=8
2013-10-30 00:31:57.470+0000: 7566: debug : virEventPollCleanupTimeouts:514 : Cleanup 2
2013-10-30 00:31:57.471+0000: 7566: debug : virEventPollCleanupHandles:562 : Cleanup 8
2013-10-30 00:31:57.471+0000: 7566: debug : virEventRunDefaultImpl:270 : running default event implementation
2013-10-30 00:31:57.471+0000: 7566: debug : virEventPollCleanupTimeouts:514 : Cleanup 2
2013-10-30 00:31:57.471+0000: 7566: debug : virEventPollCleanupHandles:562 : Cleanup 8
2013-10-30 00:31:57.471+0000: 7566: debug : virEventPollMakePollFDs:391 : Prepare n=0 w=1, f=3 e=1 d=0
2013-10-30 00:31:57.471+0000: 7566: debug : virEventPollMakePollFDs:391 : Prepare n=1 w=2, f=5 e=1 d=0
2013-10-30 00:31:57.471+0000: 7566: debug : virEventPollMakePollFDs:391 : Prepare n=2 w=3, f=7 e=1 d=0
2013-10-30 00:31:57.471+0000: 7566: debug : virEventPollMakePollFDs:391 : Prepare n=3 w=4, f=9 e=1 d=0
2013-10-30 00:31:57.471+0000: 7566: debug : virEventPollMakePollFDs:391 : Prepare n=4 w=5, f=13 e=1 d=0
2013-10-30 00:31:57.472+0000: 7566: debug : virEventPollMakePollFDs:391 : Prepare n=5 w=6, f=18 e=5 d=0
2013-10-30 00:31:57.472+0000: 7566: debug : virEventPollMakePollFDs:391 : Prepare n=6 w=7, f=12 e=0 d=0
2013-10-30 00:31:57.472+0000: 7566: debug : virEventPollMakePollFDs:391 : Prepare n=7 w=8, f=14 e=1 d=0
2013-10-30 00:31:57.472+0000: 7566: debug : virEventPollCalculateTimeout:332 : Calculate expiry of 2 timers
2013-10-30 00:31:57.472+0000: 7566: debug : virEventPollCalculateTimeout:361 : Timeout at 0 due in -1 ms
2013-10-30 00:31:57.472+0000: 7566: debug : virEventPollRunOnce:627 : EVENT_POLL_RUN: nhandles=7 timeout=-1

--- Additional comment from jj on 2013-11-01 02:28:36 GMT ---

Normally the poll would wait, but on my machine a pipe seems to be broken, so there is always a hangup event and the poll returns immediately: an infinite loop, which causes the 100% CPU usage.

--- Additional comment from jj on 2013-11-01 11:48:57 GMT ---

the debug log of libvirt_lxc

--- Additional comment from Daniel Berrange on 2013-11-01 14:32:16 GMT ---

When this 100% CPU problem occurs, can anyone tell me if they still see a process running as a child of the 'libvirt_lxc' process? Whether a zombie or normal process.

Looking at the code, the only way I can see us getting into this problem is if the guest closed its console (stdio) streams but did not exit. We should always get SIGCHLD if the process exited, so I'm thinking perhaps the container's init process closed stdin/out/err but kept running.

--- Additional comment from jj on 2013-11-01 15:38:53 GMT ---

(In reply to Daniel Berrange from comment #6)

Yes, the ps tree is the same as when it is normal.

> When this 100% cpu problem occurs, can anyone tell me if they still see a
> process running as a child of the 'libvirt_lxc' process ? Whether a zombie
> or normal process.
>
> Looking at the code, the only way I can see us getting into this problem, is
> if the guest closed its console (STDIO) streams, but did not exit. We should
> always get SIG_CHILD if the process exited, so I'm thinking perhaps the
> container's init process closed stdin/out/err, but kept running

--- Additional comment from Daniel Berrange on 2013-11-01 15:42:13 GMT ---

jj: if you still have a process consuming 100% CPU, can you run 'strace -s 1000 -f -ff -o lxc.trace.txt -p PID-OF-LIBVIRT_LXC', let it run for a minute or two, and then attach the result to this bug.

--- Additional comment from Enrico Scholz on 2013-11-01 15:54:18 GMT ---

For me, the problem disappears when I attach a console (virsh console ...). I see some new output there; after detaching, things work well for a while but then the 100% CPU issue happens again (probably because something is written to the console).
--- Additional comment from Daniel Berrange on 2013-11-01 15:56:43 GMT ---

(In reply to Enrico Scholz from comment #9)
> for me, the problem disappears when I attach a console (virsh console ...).
> I see some new output there; after detaching things work well for a while
> but then the 100% cpu issue happens again (probably because something is
> given out on the console).

Yep, that's the magic bit of info which explains this all. I can see from jj's debug log that there is data received from the container that it is trying to send to the host. The host is not connected, and rather than wait for it to reconnect, we're just busy waiting, hence 100% CPU.

--- Additional comment from Daniel Berrange on 2013-11-01 16:27:11 GMT ---

I believe this patch will fix the problem, but I need to do more testing before sending it upstream.

--- Additional comment from jj on 2013-11-01 23:06:27 GMT ---

(In reply to Daniel Berrange from comment #8)
> jj: if you still have a process consuming 100% cpu, can you run 'strace -s
> 1000 -f -ff -o lxc.trace.txt -p PID-OF-LIBVIRT_LXC' and let it run for a
> minute or two and then attach the result to this bug.

Done, attachment uploaded.

--- Additional comment from jj on 2013-11-01 23:07:29 GMT ---

strace

--- Additional comment from jj on 2013-11-01 23:08:42 GMT ---

(In reply to Enrico Scholz from comment #9)
> for me, the problem disappears when I attach a console (virsh console ...).
> I see some new output there; after detaching things work well for a while
> but then the 100% cpu issue happens again (probably because something is
> given out on the console).

In comment 13 I uploaded the strace log.

--- Additional comment from jj on 2013-11-02 12:03:53 GMT ---

(In reply to Daniel Berrange from comment #11)
> Created attachment 818352 [details]
> Attempt to fix the busy wait
>
> I believe this patch will fix the problem, but I need todo more testing
> before sending it upstream.

After testing, it seems to fix the problem.
--- Additional comment from Cole Robinson on 2013-11-20 15:35:44 GMT ---

Upstream now:

commit 5087a5a0092853702eb5e0c0297937a7859bcab3
Author: Daniel P. Berrange <berrange>
Date:   Fri Nov 1 16:24:30 2013 +0000

    Fix busy wait loop in LXC container I/O handling

Pushed to f19 and f20 libvirt maint branches.
Fixed upstream by:

commit 5087a5a0092853702eb5e0c0297937a7859bcab3
Author: Daniel P. Berrange <berrange>
Date:   Fri Nov 1 16:24:30 2013 +0000

    Fix busy wait loop in LXC container I/O handling
Reproduced with libvirt-1.1.1-13.el7.x86_64:

1. Start an LXC container named test:

   # virsh -c lxc:/// start test

   Enter test and leave this running:

   # while true ; do echo "a" >> test.txt ; done

   then exit from it. On the host:

   # top
   8932 root 20 0 11736 1696 1344 R 99.8 /*almost 100%*/ 0.0 5:35.46 sh

   # pstree -alp
   |-libvirt_lxc,8928 --name test --console 22 --security=selinux --handshake 26 --background --veth vnet1
   |   |-sh,8932
   |   `-{libvirt_lxc},8931

2. After restarting it, everything goes back to normal.
I made a mistake in comment 4; fixed and verified here with libvirt-1.1.1-16.el7.x86_64.

1. Prepare an LXC container:

<domain type='lxc'>
  <name>test</name>
  <uuid>318c2db7-6dad-4f31-b217-ecad21524713</uuid>
  <memory unit='KiB'>1048576</memory>
  <currentMemory unit='KiB'>1048576</currentMemory>
  <vcpu placement='static'>1</vcpu>
  <resource>
    <partition>/machine</partition>
  </resource>
  <os>
    <type arch='x86_64'>exe</type>
    <init>/bin/sh</init>
  </os>
  <clock offset='utc'/>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>restart</on_crash>
  <devices>
    <emulator>/usr/libexec/libvirt_lxc</emulator>
    <interface type='network'>
      <mac address='00:16:3e:4e:4c:62'/>
      <source network='default'/>
    </interface>
    <console type='pty'>
      <target type='lxc' port='0'/>
    </console>
  </devices>
</domain>

# virsh -c lxc:/// define test.xml
# virsh -c lxc:/// start test

2. Find which console the container is using, then log in:

# virsh -c lxc:/// dumpxml test
...
<console type='pty' tty='/dev/pts/1'>
  <source path='/dev/pts/1'/>
  <target type='lxc' port='0'/>
...

# virsh -c lxc:/// console test

In the container:

sh-4.2# while true ; do echo fill > /dev/pts/0 ; sleep 0.01 ; done

Note: make sure this command does not by itself drive a normal host's CPU to 100%. Then close the terminal so the sh loop keeps running inside the container.

3. Run top and watch the libvirt_lxc process's %CPU.

In libvirt-1.1.1-13, it grows to 100% after a few minutes.

In libvirt-1.1.1-14 and later versions, the %CPU stays at a normal value, the same as when the loop runs on the host, and after a while it drops to 0 unless you log in again.

4. With debug-level logging enabled, you can find entries like "cont closed = 0, host closed != 0"; that condition is the key to triggering the issue.

According to the above, setting VERIFIED.
This request was resolved in Red Hat Enterprise Linux 7.0. Contact your manager or support representative in case you have further questions about the request.