Bug 821991 - libvirt domain event handler can not receive messages after several times virKeepAliveTimer warning
Status: CLOSED NOTABUG
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: libvirt
Version: 6.3
Hardware: x86_64 Linux
Priority: medium, Severity: high
Target Milestone: rc
Assigned To: Michal Privoznik
QA Contact: Virtualization Bugs
Reported: 2012-05-15 23:09 EDT by weizhang
Modified: 2013-01-23 05:36 EST
CC: 7 users
Doc Type: Bug Fix
Type: Bug
Last Closed: 2012-05-31 11:32:04 EDT

Attachments:
libvirtd.log (2.10 MB, application/x-bzip2), 2012-05-15 23:09 EDT, weizhang
libvirtd.log (1.19 MB, text/x-log), 2012-05-25 10:04 EDT, Michal Privoznik
libvirt_client.log (657.45 KB, text/x-log), 2012-05-25 10:05 EDT, Michal Privoznik

Description weizhang 2012-05-15 23:09:24 EDT
Created attachment 584831 [details]
libvirtd.log

Description of problem:
When the CPU is busy and a domain operation cannot reply for more than 30 s (the default keepalive timeout), libvirt reports an error such as "error: End of file while reading data: Input/output error" and the libvirt event-python client exits. After the loop has run several times, the event-python client can no longer receive any messages.

Version-Release number of selected component (if applicable):
snapshot 3
libvirt-python-0.9.10-18.el6.x86_64
libvirt-client-0.9.10-18.el6.x86_64
libvirt-0.9.10-18.el6.x86_64


How reproducible:
100%, after running the loop for a long time

Steps to Reproduce:
1. Edit libvirtd.conf to make the error easier to trigger, then restart libvirtd:
keepalive_interval = 2
keepalive_count = 1

2. On another console, run cp in a loop to keep the CPU busy:
# while true; do cp rhel6u3-1 test; rm -rf test; done

3. Run a start->save->restore->destroy loop for a guest:
# while true; do virsh start kvm-rhel6u3-x86_64; sleep 10; virsh save kvm-rhel6u3-x86_64 /var/lib/libvirt/images/kvm-rhel6u3-x86_64.save; virsh restore /var/lib/libvirt/images/kvm-rhel6u3-x86_64.save; virsh destroy kvm-rhel6u3-x86_64; done

4. On a third console, run the Python event example in a loop to receive domain operation messages:
# while true; do python /usr/share/doc/libvirt-python-0.9.10/events-python/event-test.py; done
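The restart-on-exit shell loop in step 4 can also be done in-process. Below is a minimal sketch of such a retry wrapper in pure Python; `run_event_loop` is a hypothetical placeholder for code that opens the libvirt connection and pumps events until the connection dies (the actual libvirt calls are omitted here):

```python
import time

def run_with_reconnect(run_event_loop, max_retries=5, delay=0.0):
    """Call run_event_loop(); on a dropped connection, wait and retry.

    run_event_loop is a placeholder for code that opens a libvirt
    connection and processes events until the connection is closed.
    Returns the number of attempts made.
    """
    attempts = 0
    while attempts < max_retries:
        attempts += 1
        try:
            run_event_loop()
            return attempts          # event loop exited cleanly
        except ConnectionError:
            time.sleep(delay)        # connection dropped; retry
    return attempts                  # gave up after max_retries

# Example: a fake event loop that fails twice, then succeeds.
calls = {"n": 0}
def fake_loop():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("keepalive timeout")

print(run_with_reconnect(fake_loop))  # 3
```

This is the same pattern as the shell `while true` loop, but it lets the client log or back off between reconnects instead of restarting the whole process.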
  
Actual results:
After many iterations of the loop, the libvirt event-python client can no longer receive messages.

Expected results:
The libvirt event-python client keeps receiving messages.

Additional info:
Comment 3 Michal Privoznik 2012-05-25 10:02:08 EDT
Okay, I managed to reproduce this, and I think it is NOTABUG.

What is happening here: the host is indeed under heavy load, so a process, be it libvirtd, may not be scheduled by the kernel as often as we would like. It can therefore fail to reply to keepalive requests, and the connection is closed. In my case it was not uncommon for libvirtd to go unscheduled for 35 seconds, so the python client watching for events decided to close the connection. And that is in fact correct: the server was unresponsive.
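For hosts that are routinely this loaded, one practical mitigation is to make the server-side keepalive limits more tolerant. The reproducer in step 1 already uses these two libvirtd.conf options; the values below are illustrative only, not recommendations:

```
# /etc/libvirt/libvirtd.conf
# Probe less often and tolerate more unanswered probes,
# so a transient scheduling stall does not kill the connection.
keepalive_interval = 10
keepalive_count = 5
```

The trade-off is that a genuinely dead peer is detected correspondingly later.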

From attached libvirtd.log (notice the jump from 13:20:22 to 13:20:57)
2012-05-25 13:20:21.775+0000: 31242: debug : qemuProcessHandleShutdown:632 : vm=0x7f9a3000d300
2012-05-25 13:20:22.174+0000: 31243: debug : virDomainObjRef:1623 : obj=0x7f9a3000d300 refs=3
2012-05-25 13:20:22.174+0000: 31243: debug : qemuDomainObjBeginJobInternal:753 : Starting job: destroy (async=none)
2012-05-25 13:20:57.106+0000: 31243: debug : qemuProcessStop:3763 : Shutting down VM 'f16' pid=11967 migrated=0
2012-05-25 13:20:57.107+0000: 31243: debug : qemuMonitorClose:759 : QEMU_MONITOR_CLOSE: mon=0x7f9a240243a0 refs=4
2012-05-25 13:20:57.107+0000: 31243: debug : qemuMonitorUnref:210 : QEMU_MONITOR_UNREF: mon=0x7f9a240243a0 refs=3

Moreover, we can see timers that are set to fire every 2 seconds actually firing after 36 seconds (again from libvirtd.log):
2012-05-25 13:20:57.616+0000: 31242: debug : virKeepAliveTimer:166 : RPC_KEEPALIVE_TIMEOUT: ka=0x2126400 client=0x21242b0 countToDeath=1 idle=36
2012-05-25 13:20:57.616+0000: 31242: debug : virNetMessageNew:48 : msg=0x21e9340 tracked=0
2012-05-25 13:20:57.616+0000: 31242: debug : virNetMessageEncodePayloadEmpty:441 : Encode length as 28
2012-05-25 13:20:57.616+0000: 31242: debug : virKeepAliveSend:119 : Sending keepalive request to client 0x21242b0
2012-05-25 13:20:57.616+0000: 31242: debug : virKeepAliveSend:122 : RPC_KEEPALIVE_SEND: ka=0x2126400 client=0x21242b0 prog=1801807216 vers=1 proc=1

Meanwhile, the client kept sending keepalive requests (client log):
2012-05-25 13:20:26.206+0000: 11895: debug : virKeepAliveSend:119 : Sending keepalive request to client 0x7f7faa50d010
2012-05-25 13:20:26.206+0000: 11895: debug : virKeepAliveSend:122 : RPC_KEEPALIVE_SEND: ka=0xa45510 client=0x7f7faa50d010 prog=1801807216 vers=1 proc=1
2012-05-25 13:20:26.206+0000: 11895: debug : virNetClientSendInternal:1683 : RPC_CLIENT_MSG_TX_QUEUE: client=0x7f7faa50d010 len=28 prog=1801807216 vers=1 proc=1 type=2 status=0 serial=0
2012-05-25 13:20:26.206+0000: 11895: debug : virNetClientIO:1541 : Outgoing message prog=1801807216 version=1 serial=0 proc=1 type=2 length=28 dispatch=(nil)
2012-05-25 13:20:26.206+0000: 11895: debug : virNetClientIO:1599 : We have the buck 0xa45640 0xa45640

2012-05-25 13:20:31.211+0000: 11895: debug : virKeepAliveTimer:166 : RPC_KEEPALIVE_TIMEOUT: ka=0xa45510 client=0x7f7faa50d010 countToDeath=2 idle=10
2012-05-25 13:20:31.211+0000: 11895: debug : virNetMessageNew:48 : msg=0xa46ec0 tracked=0
2012-05-25 13:20:31.211+0000: 11895: debug : virNetMessageEncodePayloadEmpty:441 : Encode length as 28
2012-05-25 13:20:31.211+0000: 11895: debug : virKeepAliveSend:119 : Sending keepalive request to client 0x7f7faa50d010
2012-05-25 13:20:31.211+0000: 11895: debug : virKeepAliveSend:122 : RPC_KEEPALIVE_SEND: ka=0xa45510 client=0x7f7faa50d010 prog=1801807216 vers=1 proc=1
2012-05-25 13:20:31.211+0000: 11895: debug : virNetClientSendInternal:1683 : RPC_CLIENT_MSG_TX_QUEUE: client=0x7f7faa50d010 len=28 prog=1801807216 vers=1 proc=1 type=2 status=0 serial=0
2012-05-25 13:20:31.211+0000: 11895: debug : virNetClientIO:1541 : Outgoing message prog=1801807216 version=1 serial=0 proc=1 type=2 length=28 dispatch=(nil)
2012-05-25 13:20:31.211+0000: 11895: debug : virNetClientIO:1599 : We have the buck 0xa45640 0xa45640

2012-05-25 13:20:36.216+0000: 11895: debug : virKeepAliveTimer:166 : RPC_KEEPALIVE_TIMEOUT: ka=0xa45510 client=0x7f7faa50d010 countToDeath=1 idle=15
2012-05-25 13:20:36.216+0000: 11895: debug : virNetMessageNew:48 : msg=0xa46ec0 tracked=0
2012-05-25 13:20:36.216+0000: 11895: debug : virNetMessageEncodePayloadEmpty:441 : Encode length as 28
2012-05-25 13:20:36.216+0000: 11895: debug : virKeepAliveSend:119 : Sending keepalive request to client 0x7f7faa50d010
2012-05-25 13:20:36.216+0000: 11895: debug : virKeepAliveSend:122 : RPC_KEEPALIVE_SEND: ka=0xa45510 client=0x7f7faa50d010 prog=1801807216 vers=1 proc=1
2012-05-25 13:20:36.216+0000: 11895: debug : virNetClientSendInternal:1683 : RPC_CLIENT_MSG_TX_QUEUE: client=0x7f7faa50d010 len=28 prog=1801807216 vers=1 proc=1 type=2 status=0 serial=0
2012-05-25 13:20:36.216+0000: 11895: debug : virNetClientIO:1541 : Outgoing message prog=1801807216 version=1 serial=0 proc=1 type=2 length=28 dispatch=(nil)
2012-05-25 13:20:36.216+0000: 11895: debug : virNetClientIO:1599 : We have the buck 0xa45640 0xa45640

2012-05-25 13:20:41.221+0000: 11895: debug : virKeepAliveTimer:166 : RPC_KEEPALIVE_TIMEOUT: ka=0xa45510 client=0x7f7faa50d010 countToDeath=0 idle=20
2012-05-25 13:20:41.221+0000: 11895: warning : virKeepAliveTimer:182 : No response from client 0x7f7faa50d010 after 3 keepalive messages in 20 seconds
2012-05-25 13:20:41.221+0000: 11895: debug : virNetClientClose:521 : client=0x7f7faa50d010


The server does eventually see these requests, but by then it is too late to reply (libvirtd.log):
2012-05-25 13:20:57.627+0000: 31242: debug : virKeepAliveCheckMessage:428 : Got keepalive request from client 0x21242b0
2012-05-25 13:20:57.627+0000: 31242: debug : virKeepAliveScheduleResponse:143 : Scheduling keepalive response to client 0x21242b0
2012-05-25 13:20:57.627+0000: 31242: debug : virNetMessageNew:48 : msg=0x22a9490 tracked=0
2012-05-25 13:20:57.627+0000: 31242: debug : virNetMessageEncodePayloadEmpty:441 : Encode length as 28

Notice that the client closed the connection at 13:20:41, but the server was not scheduled until 13:20:57 (16 seconds later), when it was already too late.
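The timeline above follows directly from the keepalive parameters: with interval I and count C, an unresponsive peer is declared dead roughly (C + 1) * I seconds after the last traffic (the first probe fires after I, then C further timer ticks count down to zero). The arithmetic below is my own reading of the debug output, not a documented formula:

```python
def keepalive_deadline(interval, count):
    """Seconds of peer silence before the connection is closed:
    the first probe fires after `interval`, then `count` further
    timer ticks decrement countToDeath to zero."""
    return (count + 1) * interval

# Client defaults seen in the log: interval=5, count=3 -> closed at idle=20,
# matching "No response ... after 3 keepalive messages in 20 seconds".
print(keepalive_deadline(5, 3))   # 20

# Reproducer config: keepalive_interval=2, keepalive_count=1 -> 4 seconds.
print(keepalive_deadline(2, 1))   # 4
```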
Comment 4 Michal Privoznik 2012-05-25 10:04:36 EDT
Created attachment 586875 [details]
libvirtd.log
Comment 5 Michal Privoznik 2012-05-25 10:05:02 EDT
Created attachment 586876 [details]
libvirt_client.log
Comment 6 Michal Privoznik 2012-05-31 11:32:04 EDT
Okay, since nobody objects, I am closing this as NOTABUG per comment 3. If somebody feels otherwise, don't hesitate to reopen.
