Description of problem:

The libvirt_lxc helper consumes 100% CPU because it sees an event on /dev/ptmx without ever processing it:

-----
[pid 21168] 15:56:43.966141 poll([{fd=3, events=POLLIN}, {fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}, {fd=14, events=POLLIN}, {fd=26, events=POLLIN|POLLOUT}, {fd=15, events=POLLIN}], 7, 4294967295) = 1 ([{fd=26, revents=POLLHUP}])
[pid 21168] 15:56:43.966174 gettid() = 21168
[pid 21168] 15:56:43.966193 gettid() = 21168
[pid 21168] 15:56:43.966212 gettid() = 21168
[pid 21168] 15:56:43.966236 gettid() = 21168
[pid 21168] 15:56:43.966268 gettid() = 21168
[pid 21168] 15:56:43.966293 gettid() = 21168
[pid 21168] 15:56:43.966316 gettid() = 21168
[pid 21168] 15:56:43.966339 gettid() = 21168
[pid 21168] 15:56:43.966362 gettid() = 21168
[pid 21168] 15:56:43.966385 gettid() = 21168
[pid 21168] 15:56:43.966409 gettid() = 21168
[pid 21168] 15:56:43.966432 gettid() = 21168
[pid 21168] 15:56:43.966455 gettid() = 21168
[pid 21168] 15:56:43.966478 gettid() = 21168
[pid 21168] 15:56:43.966501 gettid() = 21168
[pid 21168] 15:56:43.966524 gettid() = 21168
[pid 21168] 15:56:43.966547 gettid() = 21168
[pid 21168] 15:56:43.966570 gettid() = 21168
[pid 21168] 15:56:43.966593 gettid() = 21168
[pid 21168] 15:56:43.966615 gettid() = 21168
[pid 21168] 15:56:43.966638 gettid() = 21168
[pid 21168] 15:56:43.966662 gettid() = 21168
[pid 21168] 15:56:43.966684 gettid() = 21168
[pid 21168] 15:56:43.966708 gettid() = 21168
[pid 21168] 15:56:43.966730 gettid() = 21168
[pid 21168] 15:56:43.966753 gettid() = 21168
[pid 21168] 15:56:43.966776 gettid() = 21168
[pid 21168] 15:56:43.966799 gettid() = 21168
[pid 21168] 15:56:43.966822 gettid() = 21168
[pid 21168] 15:56:43.966845 gettid() = 21168
[pid 21168] 15:56:43.966867 gettid() = 21168
[pid 21168] 15:56:43.966890 gettid() = 21168
[pid 21168] 15:56:43.966913 gettid() = 21168
[pid 21168] 15:56:43.966936 gettid() = 21168
[pid 21168] 15:56:43.966958 poll([{fd=3, events=POLLIN}, {fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}, {fd=14, events=POLLIN}, {fd=26, events=POLLIN|POLLOUT}, {fd=15, events=POLLIN}], 7, 4294967295) = 1 ([{fd=26, revents=POLLHUP}])
----

libvirt_l 21168 root  0u  CHR      1,3                 0t0    1028  /dev/null
libvirt_l 21168 root  1w  REG      253,12              12402  131418 /var/log/libvirt/lxc/XXXXX.log
libvirt_l 21168 root  2w  REG      253,12              12402  131418 /var/log/libvirt/lxc/XXXXX.log
libvirt_l 21168 root  3r  FIFO     0,8                 0t0    137497 pipe
libvirt_l 21168 root  4w  FIFO     0,8                 0t0    137497 pipe
libvirt_l 21168 root  5r  FIFO     0,8                 0t0    139448 pipe
libvirt_l 21168 root  6w  FIFO     0,8                 0t0    139448 pipe
libvirt_l 21168 root  7u  unix     0xffff8802de28d040  0t0    139449 /var/run/libvirt/lxc/XXXXX.sock
libvirt_l 21168 root  8u  unix     0xffff8802f5e1ba80  0t0    136795 socket
libvirt_l 21168 root  9r  FIFO     0,8                 0t0    136829 pipe
libvirt_l 21168 root 10u  unix     0xffff8802f5e1ad80  0t0    136797 socket
libvirt_l 21168 root 11w  FIFO     0,8                 0t0    136829 pipe
libvirt_l 21168 root 12u  CHR      10,229              0t0    12117  /dev/fuse
libvirt_l 21168 root 13u  CHR      5,2                 0t0    2      /var/run/libvirt/lxc/XXXXX.devpts/ptmx
libvirt_l 21168 root 14u  a_inode  0,9                 0      5802   [eventpoll]
libvirt_l 21168 root 15u  unix     0xffff8802ce0830c0  0t0    136830 /var/run/libvirt/lxc/XXXXX.sock
libvirt_l 21168 root 26u  CHR      5,2                 0t0    1121   /dev/ptmx

Version-Release number of selected component (if applicable):
libvirt-daemon-1.0.5.5-1.fc19.x86_64
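For anyone who wants to see the mechanism outside of libvirt: below is a minimal, self-contained demo of mine (not libvirt code) that reproduces the same symptom on Linux. Once the slave side of a pty is closed, poll() on the master reports POLLHUP immediately and forever; a caller that keeps the fd in its poll set without reacting spins exactly like the strace above.

#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    /* Create a pty master, open the slave once, then close it again,
     * mimicking a guest that closed its console. */
    int master = posix_openpt(O_RDWR | O_NOCTTY);
    if (master < 0 || grantpt(master) < 0 || unlockpt(master) < 0) {
        perror("pty setup");
        return 1;
    }
    int slave = open(ptsname(master), O_RDWR | O_NOCTTY);
    if (slave >= 0)
        close(slave);

    struct pollfd pfd = { .fd = master, .events = POLLIN | POLLOUT };
    for (int i = 0; i < 5; i++) {
        /* With no slave open, poll() returns immediately with POLLHUP set,
         * regardless of which events were requested.  A loop that never
         * reacts to this (read, close, or drop the fd from the set)
         * burns 100% CPU, which is what the strace above shows. */
        int n = poll(&pfd, 1, -1);
        printf("poll=%d revents=0x%x\n", n, (unsigned)pfd.revents);
    }
    close(master);
    return 0;
}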
danpb does this sound familiar at all?
I encountered the same problem on a MIPS machine, but on x86 it works fine. To Enrico Scholz: what platform are you using? x86?

I found the poll in virEventPollRunOnce (vireventpoll.c): a POLLHUP event occurred for a pipe descriptor, so the poll always returned immediately because of it (the pipe was broken). In lxc/lxc_controller.c there is:

    if (events & VIR_EVENT_HANDLE_HANGUP) {
        if (watch == console->hostWatch) {
            console->hostClosed = true;
        } else {
            console->contClosed = true;
        }
        VIR_DEBUG("Got EOF on %d %d", watch, fd);
    }

so the hangup case seems to be handled already; I don't know how to go on from here. The above is what I got from tracking it.
2013-10-30 00:31:57.469+0000: 7566: debug : virEventPollRunOnce:638 : Poll got 1 event(s)
2013-10-30 00:31:57.469+0000: 7566: debug : virEventPollDispatchTimeouts:423 : Dispatch 2
2013-10-30 00:31:57.469+0000: 7566: debug : virEventPollDispatchHandles:468 : Dispatch 7
2013-10-30 00:31:57.469+0000: 7566: debug : virEventPollDispatchHandles:482 : i=0 w=1
2013-10-30 00:31:57.469+0000: 7566: debug : virEventPollDispatchHandles:482 : i=1 w=2
2013-10-30 00:31:57.469+0000: 7566: debug : virEventPollDispatchHandles:482 : i=2 w=3
2013-10-30 00:31:57.469+0000: 7566: debug : virEventPollDispatchHandles:482 : i=3 w=4
2013-10-30 00:31:57.469+0000: 7566: debug : virEventPollDispatchHandles:482 : i=4 w=5
2013-10-30 00:31:57.469+0000: 7566: debug : virEventPollDispatchHandles:482 : i=5 w=6
2013-10-30 00:31:57.470+0000: 7566: debug : virEventPollDispatchHandles:496 : EVENT_POLL_DISPATCH_HANDLE: watch=6 events=8
2013-10-30 00:31:57.470+0000: 7566: debug : virLXCControllerConsoleIO:974 : IO event watch=6 fd=18 events=8 fromHost=0 fromcont=1024
2013-10-30 00:31:57.470+0000: 7566: debug : virLXCControllerConsoleIO:1044 : Got EOF on 6 18
2013-10-30 00:31:57.470+0000: 7566: debug : virLXCControllerConsoleUpdateWatch:837 : Container watch 7=0 host watch 6=3
2013-10-30 00:31:57.470+0000: 7566: debug : virEventPollUpdateHandle:147 : EVENT_POLL_UPDATE_HANDLE: watch=7 events=0
2013-10-30 00:31:57.470+0000: 7566: debug : virEventPollInterruptLocked:710 : Skip interrupt, 1 1996885056
2013-10-30 00:31:57.470+0000: 7566: debug : virEventPollUpdateHandle:147 : EVENT_POLL_UPDATE_HANDLE: watch=6 events=3
2013-10-30 00:31:57.470+0000: 7566: debug : virEventPollInterruptLocked:710 : Skip interrupt, 1 1996885056
2013-10-30 00:31:57.470+0000: 7566: debug : virEventPollDispatchHandles:482 : i=7 w=8
2013-10-30 00:31:57.470+0000: 7566: debug : virEventPollCleanupTimeouts:514 : Cleanup 2
2013-10-30 00:31:57.471+0000: 7566: debug : virEventPollCleanupHandles:562 : Cleanup 8
2013-10-30 00:31:57.471+0000: 7566: debug : virEventRunDefaultImpl:270 : running default event implementation
2013-10-30 00:31:57.471+0000: 7566: debug : virEventPollCleanupTimeouts:514 : Cleanup 2
2013-10-30 00:31:57.471+0000: 7566: debug : virEventPollCleanupHandles:562 : Cleanup 8
2013-10-30 00:31:57.471+0000: 7566: debug : virEventPollMakePollFDs:391 : Prepare n=0 w=1, f=3 e=1 d=0
2013-10-30 00:31:57.471+0000: 7566: debug : virEventPollMakePollFDs:391 : Prepare n=1 w=2, f=5 e=1 d=0
2013-10-30 00:31:57.471+0000: 7566: debug : virEventPollMakePollFDs:391 : Prepare n=2 w=3, f=7 e=1 d=0
2013-10-30 00:31:57.471+0000: 7566: debug : virEventPollMakePollFDs:391 : Prepare n=3 w=4, f=9 e=1 d=0
2013-10-30 00:31:57.471+0000: 7566: debug : virEventPollMakePollFDs:391 : Prepare n=4 w=5, f=13 e=1 d=0
2013-10-30 00:31:57.472+0000: 7566: debug : virEventPollMakePollFDs:391 : Prepare n=5 w=6, f=18 e=5 d=0
2013-10-30 00:31:57.472+0000: 7566: debug : virEventPollMakePollFDs:391 : Prepare n=6 w=7, f=12 e=0 d=0
2013-10-30 00:31:57.472+0000: 7566: debug : virEventPollMakePollFDs:391 : Prepare n=7 w=8, f=14 e=1 d=0
2013-10-30 00:31:57.472+0000: 7566: debug : virEventPollCalculateTimeout:332 : Calculate expiry of 2 timers
2013-10-30 00:31:57.472+0000: 7566: debug : virEventPollCalculateTimeout:361 : Timeout at 0 due in -1 ms
2013-10-30 00:31:57.472+0000: 7566: debug : virEventPollRunOnce:627 : EVENT_POLL_RUN: nhandles=7 timeout=-1
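To make the log above easier to follow, here is a rough sketch of the poll-based event-loop shape it walks through (prepare handles -> poll -> dispatch -> cleanup). This is illustrative code of mine, not libvirt's actual vireventpoll.c; the point is that if a dispatched handler leaves the offending watch enabled, the very next poll() reports the same POLLHUP again and the loop spins.

#include <poll.h>
#include <stddef.h>

struct handle {
    int watch;
    int fd;
    int events;                          /* requested events; 0 = disabled */
    void (*cb)(int watch, int fd, int revents);
};

static void run_once(struct handle *handles, size_t nhandles)
{
    struct pollfd fds[64];
    size_t map[64], n = 0;

    /* "Prepare n=... w=... f=... e=..." -- handles with no events are
     * skipped, so they never reach poll() at all. */
    for (size_t i = 0; i < nhandles && n < 64; i++) {
        if (handles[i].events == 0)
            continue;
        fds[n].fd = handles[i].fd;
        fds[n].events = handles[i].events;
        fds[n].revents = 0;
        map[n++] = i;
    }

    /* "EVENT_POLL_RUN: nhandles=... timeout=-1" */
    if (poll(fds, n, -1) <= 0)
        return;

    /* "EVENT_POLL_DISPATCH_HANDLE: watch=... events=..." -- if the callback
     * does not change the watch for a hung-up fd, the next iteration polls
     * it again and returns immediately. */
    for (size_t i = 0; i < n; i++) {
        if (fds[i].revents)
            handles[map[i]].cb(handles[map[i]].watch,
                               fds[i].fd, fds[i].revents);
    }
}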
In the normal situation the poll would wait, but on my machine a pipe seems to be broken, so there is always a hangup event, the poll returns immediately, and we end up in an infinite loop, which causes the 100% CPU usage.
Created attachment 818259 [details]
the debug log

The debug log of libvirt_lxc.
When this 100% CPU problem occurs, can anyone tell me whether they still see a process running as a child of the 'libvirt_lxc' process? Either a zombie or a normal process.

Looking at the code, the only way I can see us getting into this problem is if the guest closed its console (stdio) streams but did not exit. We should always get SIGCHLD if the process exited, so I'm thinking perhaps the container's init process closed stdin/out/err but kept running.
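If it helps anyone reproduce that hypothesis, a trivial container init along these lines (purely illustrative, not taken from any affected container) closes its stdio but keeps running, so the controller sees the console hang up without ever getting SIGCHLD:

#include <unistd.h>

int main(void)
{
    /* Close the console streams, like the suspected guest init ... */
    close(STDIN_FILENO);
    close(STDOUT_FILENO);
    close(STDERR_FILENO);

    /* ... but never exit, so the parent gets no SIGCHLD. */
    for (;;)
        pause();
}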
(In reply to Daniel Berrange from comment #6)
> When this 100% CPU problem occurs, can anyone tell me whether they still
> see a process running as a child of the 'libvirt_lxc' process? Either a
> zombie or a normal process.
>
> Looking at the code, the only way I can see us getting into this problem
> is if the guest closed its console (stdio) streams but did not exit. We
> should always get SIGCHLD if the process exited, so I'm thinking perhaps
> the container's init process closed stdin/out/err but kept running.

Yes, the process tree is the same as when everything is normal.
jj: if you still have a process consuming 100% cpu, can you run 'strace -s 1000 -f -ff -o lxc.trace.txt -p PID-OF-LIBVIRT_LXC' and let it run for a minute or two and then attach the result to this bug.
For me, the problem disappears when I attach a console (virsh console ...). I see some new output there; after detaching, things work well for a while, but then the 100% CPU issue happens again (probably because something is written to the console).
(In reply to Enrico Scholz from comment #9)
> For me, the problem disappears when I attach a console (virsh console ...).
> I see some new output there; after detaching, things work well for a while,
> but then the 100% CPU issue happens again (probably because something is
> written to the console).

Yep, that's the magic bit of info which explains it all. I can see from jj's debug log that data has been received from the container which the controller is trying to send to the host. The host is not connected, and rather than wait for it to reconnect, we just busy-wait, hence 100% CPU.
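For the record, the general shape of a fix for this kind of busy wait (this is only a sketch of the idea, not the attached patch or the eventual upstream commit; the names are illustrative) is to recompute the watch mask so that a side whose peer is gone, or which has nothing to write, is not asked for POLLOUT at all:

#include <poll.h>
#include <stddef.h>

struct console {
    int hostWatch, contWatch;
    int hostClosed, contClosed;   /* set when a POLLHUP/EOF was seen */
    size_t fromHost, fromCont;    /* bytes buffered in each direction */
};

/* Illustrative stand-in for libvirt's virEventUpdateHandle(watch, events). */
static void update_handle(int watch, int events)
{
    (void)watch;
    (void)events;   /* a real event loop would update its pollfd mask here */
}

void console_update_watch(struct console *c)
{
    int hostEvents = 0;
    int contEvents = 0;

    /* Only watch the host pty while something is actually connected to it;
     * otherwise pending container output just stays buffered until the
     * host side reconnects, instead of driving a POLLOUT/POLLHUP spin. */
    if (!c->hostClosed) {
        hostEvents |= POLLIN;
        if (c->fromCont > 0)
            hostEvents |= POLLOUT;
    }
    if (!c->contClosed) {
        contEvents |= POLLIN;
        if (c->fromHost > 0)
            contEvents |= POLLOUT;
    }

    update_handle(c->hostWatch, hostEvents);
    update_handle(c->contWatch, contEvents);
}

Judging from the debug log above (eight handles prepared, but EVENT_POLL_RUN shows nhandles=7, the e=0 handle being dropped), libvirt's poll loop appears to skip handles whose event mask is zero, so clearing the mask keeps a hung-up fd out of poll() entirely.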
Created attachment 818352 [details]
Attempt to fix the busy wait

I believe this patch will fix the problem, but I need to do more testing before sending it upstream.
(In reply to Daniel Berrange from comment #8)
> jj: if you still have a process consuming 100% cpu, can you run 'strace -s
> 1000 -f -ff -o lxc.trace.txt -p PID-OF-LIBVIRT_LXC' and let it run for a
> minute or two and then attach the result to this bug.

Done; see the attachment.
Created attachment 818491 [details]
strace
(In reply to Enrico Scholz from comment #9)
> For me, the problem disappears when I attach a console (virsh console ...).
> I see some new output there; after detaching, things work well for a while,
> but then the 100% CPU issue happens again (probably because something is
> written to the console).

In comment 13 I uploaded the strace log.
(In reply to Daniel Berrange from comment #11)
> Created attachment 818352 [details]
> Attempt to fix the busy wait
>
> I believe this patch will fix the problem, but I need to do more testing
> before sending it upstream.

After testing, it seems to fix the problem.
Upstream now:

commit 5087a5a0092853702eb5e0c0297937a7859bcab3
Author: Daniel P. Berrange <berrange>
Date:   Fri Nov 1 16:24:30 2013 +0000

    Fix busy wait loop in LXC container I/O handling

Pushed to the f19 and f20 libvirt maint branches.
libvirt-1.0.5.8-1.fc19 has been submitted as an update for Fedora 19. https://admin.fedoraproject.org/updates/libvirt-1.0.5.8-1.fc19
Package libvirt-1.0.5.8-1.fc19:
* should fix your issue,
* was pushed to the Fedora 19 testing repository,
* should be available at your local mirror within two days.

Update it with:
# su -c 'yum update --enablerepo=updates-testing libvirt-1.0.5.8-1.fc19'
as soon as you are able to.

Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2013-23453/libvirt-1.0.5.8-1.fc19
then log in and leave karma (feedback).
libvirt-1.0.5.8-1.fc19 has been pushed to the Fedora 19 stable repository. If problems still persist, please make note of it in this bug report.