Bug 1032705 - libvirt_lxc consumes 100% cpu when console client is disconnected and guest writes data
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: libvirt
Version: 7.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assigned To: Daniel Berrange
QA Contact: Virtualization Bugs
Depends On:
Blocks:
Reported: 2013-11-20 10:40 EST by Daniel Berrange
Modified: 2014-06-17 20:59 EDT
CC List: 6 users

See Also:
Fixed In Version: libvirt-1.1.1-14.el7
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 1005570
Environment:
Last Closed: 2014-06-13 06:31:57 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments

None
Description Daniel Berrange 2013-11-20 10:40:27 EST
+++ This bug was initially created as a clone of Bug #1005570 +++

Description of problem:

The libvirt_lxc helper consumes 100% CPU because it sees an event on /dev/ptmx without ever processing it:

-----
[pid 21168] 15:56:43.966141 poll([{fd=3, events=POLLIN}, {fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}, {fd=14, events=POLLIN}, {fd=26, events=POLLIN|POLLOUT}, {fd=15, events=POLLIN}], 7, 4294967295) = 1 ([{fd=26, revents=POLLHUP}])
[pid 21168] 15:56:43.966174 gettid()    = 21168
[pid 21168] 15:56:43.966193 gettid()    = 21168
[pid 21168] 15:56:43.966212 gettid()    = 21168
[pid 21168] 15:56:43.966236 gettid()    = 21168
[pid 21168] 15:56:43.966268 gettid()    = 21168
[pid 21168] 15:56:43.966293 gettid()    = 21168
[pid 21168] 15:56:43.966316 gettid()    = 21168
[pid 21168] 15:56:43.966339 gettid()    = 21168
[pid 21168] 15:56:43.966362 gettid()    = 21168
[pid 21168] 15:56:43.966385 gettid()    = 21168
[pid 21168] 15:56:43.966409 gettid()    = 21168
[pid 21168] 15:56:43.966432 gettid()    = 21168
[pid 21168] 15:56:43.966455 gettid()    = 21168
[pid 21168] 15:56:43.966478 gettid()    = 21168
[pid 21168] 15:56:43.966501 gettid()    = 21168
[pid 21168] 15:56:43.966524 gettid()    = 21168
[pid 21168] 15:56:43.966547 gettid()    = 21168
[pid 21168] 15:56:43.966570 gettid()    = 21168
[pid 21168] 15:56:43.966593 gettid()    = 21168
[pid 21168] 15:56:43.966615 gettid()    = 21168
[pid 21168] 15:56:43.966638 gettid()    = 21168
[pid 21168] 15:56:43.966662 gettid()    = 21168
[pid 21168] 15:56:43.966684 gettid()    = 21168
[pid 21168] 15:56:43.966708 gettid()    = 21168
[pid 21168] 15:56:43.966730 gettid()    = 21168
[pid 21168] 15:56:43.966753 gettid()    = 21168
[pid 21168] 15:56:43.966776 gettid()    = 21168
[pid 21168] 15:56:43.966799 gettid()    = 21168
[pid 21168] 15:56:43.966822 gettid()    = 21168
[pid 21168] 15:56:43.966845 gettid()    = 21168
[pid 21168] 15:56:43.966867 gettid()    = 21168
[pid 21168] 15:56:43.966890 gettid()    = 21168
[pid 21168] 15:56:43.966913 gettid()    = 21168
[pid 21168] 15:56:43.966936 gettid()    = 21168
[pid 21168] 15:56:43.966958 poll([{fd=3, events=POLLIN}, {fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}, {fd=14, events=POLLIN}, {fd=26, events=POLLIN|POLLOUT}, {fd=15, events=POLLIN}], 7, 4294967295) = 1 ([{fd=26, revents=POLLHUP}])
----

libvirt_l 21168 root    0u      CHR                1,3      0t0   1028 /dev/null
libvirt_l 21168 root    1w      REG             253,12    12402 131418 /var/log/libvirt/lxc/XXXXX.log
libvirt_l 21168 root    2w      REG             253,12    12402 131418 /var/log/libvirt/lxc/XXXXX.log
libvirt_l 21168 root    3r     FIFO                0,8      0t0 137497 pipe
libvirt_l 21168 root    4w     FIFO                0,8      0t0 137497 pipe
libvirt_l 21168 root    5r     FIFO                0,8      0t0 139448 pipe
libvirt_l 21168 root    6w     FIFO                0,8      0t0 139448 pipe
libvirt_l 21168 root    7u     unix 0xffff8802de28d040      0t0 139449 /var/run/libvirt/lxc/XXXXX.sock
libvirt_l 21168 root    8u     unix 0xffff8802f5e1ba80      0t0 136795 socket
libvirt_l 21168 root    9r     FIFO                0,8      0t0 136829 pipe
libvirt_l 21168 root   10u     unix 0xffff8802f5e1ad80      0t0 136797 socket
libvirt_l 21168 root   11w     FIFO                0,8      0t0 136829 pipe
libvirt_l 21168 root   12u      CHR             10,229      0t0  12117 /dev/fuse
libvirt_l 21168 root   13u      CHR                5,2      0t0      2 /var/run/libvirt/lxc/XXXXX.devpts/ptmx
libvirt_l 21168 root   14u  a_inode                0,9        0   5802 [eventpoll]
libvirt_l 21168 root   15u     unix 0xffff8802ce0830c0      0t0 136830 /var/run/libvirt/lxc/XXXXX.sock
libvirt_l 21168 root   26u      CHR                5,2      0t0   1121 /dev/ptmx



Version-Release number of selected component (if applicable):

libvirt-daemon-1.0.5.5-1.fc19.x86_64

--- Additional comment from Cole Robinson on 2013-09-08 18:37:09 BST ---

danpb does this sound familiar at all?

--- Additional comment from jj on 2013-11-01 02:22:46 GMT ---

I encountered the same problem on a MIPS machine, but on x86 it works fine.

To Enrico Scholz:
what platform are you using? x86?

I found the poll in virEventPollRunOnce (vireventpoll.c): a POLLHUP event occurred for a pipe descriptor, so the poll always returned immediately because of it (the pipe was broken).

And in lxc/lxc_controller.c:
"
    if (events & VIR_EVENT_HANDLE_HANGUP) {
        if (watch == console->hostWatch) {
            console->hostClosed = true;
        } else {
            console->contClosed = true;
        }
        VIR_DEBUG("Got EOF on %d %d", watch, fd);
    }
"
This seems to handle the hangup, so I don't know how to proceed from here...

The above is what I found from tracing.

--- Additional comment from jj on 2013-11-01 02:24:29 GMT ---

2013-10-30 00:31:57.469+0000: 7566: debug : virEventPollRunOnce:638 : Poll got 1 event(s)
2013-10-30 00:31:57.469+0000: 7566: debug : virEventPollDispatchTimeouts:423 : Dispatch 2
2013-10-30 00:31:57.469+0000: 7566: debug : virEventPollDispatchHandles:468 : Dispatch 7
2013-10-30 00:31:57.469+0000: 7566: debug : virEventPollDispatchHandles:482 : i=0 w=1
2013-10-30 00:31:57.469+0000: 7566: debug : virEventPollDispatchHandles:482 : i=1 w=2
2013-10-30 00:31:57.469+0000: 7566: debug : virEventPollDispatchHandles:482 : i=2 w=3
2013-10-30 00:31:57.469+0000: 7566: debug : virEventPollDispatchHandles:482 : i=3 w=4
2013-10-30 00:31:57.469+0000: 7566: debug : virEventPollDispatchHandles:482 : i=4 w=5
2013-10-30 00:31:57.469+0000: 7566: debug : virEventPollDispatchHandles:482 : i=5 w=6
2013-10-30 00:31:57.470+0000: 7566: debug : virEventPollDispatchHandles:496 : EVENT_POLL_DISPATCH_HANDLE: watch=6 events=8
2013-10-30 00:31:57.470+0000: 7566: debug : virLXCControllerConsoleIO:974 : IO event watch=6 fd=18 events=8 fromHost=0 fromcont=1024
2013-10-30 00:31:57.470+0000: 7566: debug : virLXCControllerConsoleIO:1044 : Got EOF on 6 18
2013-10-30 00:31:57.470+0000: 7566: debug : virLXCControllerConsoleUpdateWatch:837 : Container watch 7=0 host watch 6=3
2013-10-30 00:31:57.470+0000: 7566: debug : virEventPollUpdateHandle:147 : EVENT_POLL_UPDATE_HANDLE: watch=7 events=0
2013-10-30 00:31:57.470+0000: 7566: debug : virEventPollInterruptLocked:710 : Skip interrupt, 1 1996885056
2013-10-30 00:31:57.470+0000: 7566: debug : virEventPollUpdateHandle:147 : EVENT_POLL_UPDATE_HANDLE: watch=6 events=3
2013-10-30 00:31:57.470+0000: 7566: debug : virEventPollInterruptLocked:710 : Skip interrupt, 1 1996885056
2013-10-30 00:31:57.470+0000: 7566: debug : virEventPollDispatchHandles:482 : i=7 w=8
2013-10-30 00:31:57.470+0000: 7566: debug : virEventPollCleanupTimeouts:514 : Cleanup 2
2013-10-30 00:31:57.471+0000: 7566: debug : virEventPollCleanupHandles:562 : Cleanup 8
2013-10-30 00:31:57.471+0000: 7566: debug : virEventRunDefaultImpl:270 : running default event implementation
2013-10-30 00:31:57.471+0000: 7566: debug : virEventPollCleanupTimeouts:514 : Cleanup 2
2013-10-30 00:31:57.471+0000: 7566: debug : virEventPollCleanupHandles:562 : Cleanup 8
2013-10-30 00:31:57.471+0000: 7566: debug : virEventPollMakePollFDs:391 : Prepare n=0 w=1, f=3 e=1 d=0
2013-10-30 00:31:57.471+0000: 7566: debug : virEventPollMakePollFDs:391 : Prepare n=1 w=2, f=5 e=1 d=0
2013-10-30 00:31:57.471+0000: 7566: debug : virEventPollMakePollFDs:391 : Prepare n=2 w=3, f=7 e=1 d=0
2013-10-30 00:31:57.471+0000: 7566: debug : virEventPollMakePollFDs:391 : Prepare n=3 w=4, f=9 e=1 d=0
2013-10-30 00:31:57.471+0000: 7566: debug : virEventPollMakePollFDs:391 : Prepare n=4 w=5, f=13 e=1 d=0
2013-10-30 00:31:57.472+0000: 7566: debug : virEventPollMakePollFDs:391 : Prepare n=5 w=6, f=18 e=5 d=0
2013-10-30 00:31:57.472+0000: 7566: debug : virEventPollMakePollFDs:391 : Prepare n=6 w=7, f=12 e=0 d=0
2013-10-30 00:31:57.472+0000: 7566: debug : virEventPollMakePollFDs:391 : Prepare n=7 w=8, f=14 e=1 d=0
2013-10-30 00:31:57.472+0000: 7566: debug : virEventPollCalculateTimeout:332 : Calculate expiry of 2 timers
2013-10-30 00:31:57.472+0000: 7566: debug : virEventPollCalculateTimeout:361 : Timeout at 0 due in -1 ms
2013-10-30 00:31:57.472+0000: 7566: debug : virEventPollRunOnce:627 : EVENT_POLL_RUN: nhandles=7 timeout=-1

--- Additional comment from jj on 2013-11-01 02:28:36 GMT ---

Normally the poll would block, but on my machine a pipe appears to be broken, so there is always a hangup event. The poll returns immediately every time, producing an infinite loop, which causes the 100% CPU usage.

--- Additional comment from jj on 2013-11-01 11:48:57 GMT ---

the debug log of libvirt_lxc

--- Additional comment from Daniel Berrange on 2013-11-01 14:32:16 GMT ---

When this 100% CPU problem occurs, can anyone tell me whether they still see a process running as a child of the 'libvirt_lxc' process? Either a zombie or a normal process.

Looking at the code, the only way I can see us getting into this problem is if the guest closed its console (stdio) streams but did not exit. We should always get SIGCHLD if the process exited, so I'm thinking perhaps the container's init process closed stdin/stdout/stderr but kept running.

--- Additional comment from jj on 2013-11-01 15:38:53 GMT ---


(In reply to Daniel Berrange from comment #6)

Yes, the process tree is the same as in the normal case.

> When this 100% cpu problem occurs, can anyone tell me if they still see a
> process running as a child of the 'libvirt_lxc' process ? Whether a zombie
> or normal process.
> 
> Looking at the code, the only way I can see us getting into this problem, is
> if the guest closed its console (STDIO) streams, but did not exit. We should
> always get SIG_CHILD if the process exited, so I'm thinking perhaps the
> container's init process closed stdin/out/err, but kept running

--- Additional comment from Daniel Berrange on 2013-11-01 15:42:13 GMT ---

jj: if you still have a process consuming 100% cpu, can you run 'strace -s 1000 -f -ff -o lxc.trace.txt -p PID-OF-LIBVIRT_LXC' and let it run for a minute or two and then attach the result to this bug.

--- Additional comment from Enrico Scholz on 2013-11-01 15:54:18 GMT ---

For me, the problem disappears when I attach a console (virsh console ...). I see some new output there; after detaching, things work well for a while, but then the 100% CPU issue happens again (probably because something is written to the console).

--- Additional comment from Daniel Berrange on 2013-11-01 15:56:43 GMT ---

(In reply to Enrico Scholz from comment #9)
> for me, the problem disappears when I attach a console (virsh console ...).
> I see some new output there; after detaching things work well for a while
> but then the 100% cpu issue happens again (probably because something is
> given out on the console).

Yep, that's the magic bit of info, which explains this all.

I can see from jj's debug log that there is data received from the container that it is trying to send to the host. The host is not connected, and rather than waiting for it to reconnect, we just busy-wait, hence 100% CPU.

--- Additional comment from Daniel Berrange on 2013-11-01 16:27:11 GMT ---

I believe this patch will fix the problem, but I need to do more testing before sending it upstream.

--- Additional comment from jj on 2013-11-01 23:06:27 GMT ---

(In reply to Daniel Berrange from comment #8)
> jj: if you still have a process consuming 100% cpu, can you run 'strace -s
> 1000 -f -ff -o lxc.trace.txt -p PID-OF-LIBVIRT_LXC' and let it run for a
> minute or two and then attach the result to this bug.

Done; see the attachment.

--- Additional comment from jj on 2013-11-01 23:07:29 GMT ---

strace

--- Additional comment from jj on 2013-11-01 23:08:42 GMT ---

(In reply to Enrico Scholz from comment #9)
> for me, the problem disappears when I attach a console (virsh console ...).
> I see some new output there; after detaching things work well for a while
> but then the 100% cpu issue happens again (probably because something is
> given out on the console).

In comment 13 I uploaded the strace log.

--- Additional comment from jj on 2013-11-02 12:03:53 GMT ---

(In reply to Daniel Berrange from comment #11)
> Created attachment 818352 [details]
> Attempt to fix the busy wait
> 
> I believe this patch will fix the problem, but I need todo more testing
> before sending it upstream.

After testing, it seems to fix the problem.

--- Additional comment from Cole Robinson on 2013-11-20 15:35:44 GMT ---

Upstream now:

commit 5087a5a0092853702eb5e0c0297937a7859bcab3
Author: Daniel P. Berrange <berrange@redhat.com>
Date:   Fri Nov 1 16:24:30 2013 +0000

    Fix busy wait loop in LXC container I/O handling

Pushed to f19 and f20 libvirt maint branches
Comment 2 Jiri Denemark 2013-11-26 04:18:29 EST
Fixed upstream by

commit 5087a5a0092853702eb5e0c0297937a7859bcab3
Author: Daniel P. Berrange <berrange@redhat.com>
Date:   Fri Nov 1 16:24:30 2013 +0000

    Fix busy wait loop in LXC container I/O handling
Comment 4 Luwen Su 2013-12-05 04:01:12 EST
Reproduced with libvirt-1.1.1-13.el7.x86_64:

1. Start an LXC guest named test
#virsh -c lxc:/// start test

Enter test and run:

#while true ; do echo "a" >> test.txt ; done

then exit from it.

On the host:
#top
 8932 root      20   0   11736   1696   1344 R  99.8  0.0   5:35.46 sh

(the sh process sits at almost 100% CPU)

#pstree -alp
-libvirt_lxc,8928 --name test --console 22 --security=selinux --handshake 26 --background --veth vnet1
  |   |-sh,8932
  |   `-{libvirt_lxc},8931

2. After restarting it, everything goes back to normal.
Comment 6 Luwen Su 2013-12-26 02:05:48 EST
I made a mistake in comment 4; fixed and verified here with
libvirt-1.1.1-16.el7.x86_64.

1. Prepare an LXC container:
<domain type='lxc'>
  <name>test</name>
  <uuid>318c2db7-6dad-4f31-b217-ecad21524713</uuid>
  <memory unit='KiB'>1048576</memory>
  <currentMemory unit='KiB'>1048576</currentMemory>
  <vcpu placement='static'>1</vcpu>
  <resource>
    <partition>/machine</partition>
  </resource>
  <os>
    <type arch='x86_64'>exe</type>
    <init>/bin/sh</init>
  </os>
  <clock offset='utc'/>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>restart</on_crash>
  <devices>
    <emulator>/usr/libexec/libvirt_lxc</emulator>
    <interface type='network'>
      <mac address='00:16:3e:4e:4c:62'/>
      <source network='default'/>
    </interface>
    <console type='pty'>
      <target type='lxc' port='0'/>
    </console>
  </devices>
</domain>

#virsh -c lxc:/// define test.xml
#virsh -c lxc:/// start test


2. Find which console the container is using, then log in:
#virsh -c lxc:/// dumpxml test
...  
 <console type='pty' tty='/dev/pts/1'>
      <source path='/dev/pts/1'/>
      <target type='lxc' port='0'/>
....

#virsh -c lxc:/// console test

In the container:

sh-4.2#while true ; do echo fill > /dev/pts/0 ; sleep 0.01 ; done

Note: make sure the command itself does not drive a normal host's CPU to 100%.

Then close the terminal so the sh loop keeps running in the container by itself.


3.
#top

Watch the libvirt_lxc process's %CPU.

With libvirt-1.1.1-13, it grows to 100% after a few minutes.

With libvirt-1.1.1-14 and later, the CPU usage stays at a normal value (the same as when the loop runs on the host), and after a while it drops to 0 until you log in again.


4.
You can enable debug-level logging and look for something like
"cont closed = 0, host closed != 0"; that is the key condition that triggers the issue.

Based on the above, setting VERIFIED.
Comment 7 Ludek Smid 2014-06-13 06:31:57 EDT
This request was resolved in Red Hat Enterprise Linux 7.0.

Contact your manager or support representative in case you have further questions about the request.
