Bug 821991

Summary: libvirt domain event handler cannot receive messages after repeated virKeepAliveTimer warnings
Product: Red Hat Enterprise Linux 6
Component: libvirt
Version: 6.3
Hardware: x86_64
OS: Linux
Status: CLOSED NOTABUG
Severity: high
Priority: medium
Reporter: weizhang <weizhan>
Assignee: Michal Privoznik <mprivozn>
QA Contact: Virtualization Bugs <virt-bugs>
CC: acathrow, dallan, dyasny, dyuan, honzhang, mzhan, rwu
Target Milestone: rc
Target Release: ---
Doc Type: Bug Fix
Last Closed: 2012-05-31 15:32:04 UTC
Type: Bug
Attachments:
  libvirtd.log (no flags)
  libvirtd.log (no flags)
  libvirt_client.log (no flags)

Description weizhang 2012-05-16 03:09:24 UTC
Created attachment 584831 [details]
libvirtd.log

Description of problem:
When the CPU is busy and a domain operation cannot be replied to for more than 30 seconds (the default window: keepalive_interval = 5 with keepalive_count = 5 gives roughly 30 seconds), libvirt reports an error such as "error: End of file while reading data: Input/output error" and the libvirt event-python client exits. After several iterations of the restart loop, the event-python client can no longer receive any messages.

Version-Release number of selected component (if applicable):
snapshot 3
libvirt-python-0.9.10-18.el6.x86_64
libvirt-client-0.9.10-18.el6.x86_64
libvirt-0.9.10-18.el6.x86_64


How reproducible:
100%, given a long enough run

Steps to Reproduce:
1. Lower the keepalive settings in libvirtd.conf so the timeout is easy to trigger (the server probes an idle client every keepalive_interval seconds and drops the connection once keepalive_count probes go unanswered), then restart libvirtd:
keepalive_interval = 2
keepalive_count = 1

2. On a second console, run a copy loop to keep the host busy:
# while true; do cp rhel6u3-1 test; rm -rf test; done

3. Run a start->save->restore->destroy loop for a guest:
# while true; do virsh start kvm-rhel6u3-x86_64; sleep 10; virsh save kvm-rhel6u3-x86_64 /var/lib/libvirt/images/kvm-rhel6u3-x86_64.save; virsh restore /var/lib/libvirt/images/kvm-rhel6u3-x86_64.save; virsh destroy kvm-rhel6u3-x86_64; done

4. On a third console, run the example event client in a loop to receive domain lifecycle messages (a minimal sketch of what that client does follows these steps):
# while true; do python /usr/share/doc/libvirt-python-0.9.10/events-python/event-test.py; done
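
For reference, event-test.py amounts to a default event loop plus a lifecycle callback. A minimal sketch follows; the keepalive values match the 5-second, 3-probe pattern visible in the attached client log (consistent with setKeepAlive(5, 3)), and the URI is illustrative:

import libvirt

# The default event loop must be registered before opening the connection.
libvirt.virEventRegisterDefaultImpl()

def lifecycle_cb(conn, dom, event, detail, opaque):
    # Fires on start/save/restore/destroy transitions of any domain.
    print("domain %s: event=%d detail=%d" % (dom.name(), event, detail))

conn = libvirt.openReadOnly("qemu:///system")
conn.domainEventRegisterAny(None, libvirt.VIR_DOMAIN_EVENT_ID_LIFECYCLE,
                            lifecycle_cb, None)
conn.setKeepAlive(5, 3)  # probe every 5s, give up after 3 unanswered probes

while True:
    # Dispatches pending events; keepalive probes are also driven from here.
    libvirt.virEventRunDefaultImpl()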
  
Actual results:
After many iterations of the loop, the libvirt event-python client stops receiving messages altogether

Expected results:
The libvirt event-python client keeps receiving messages indefinitely

Additional info:

Comment 3 Michal Privoznik 2012-05-25 14:02:08 UTC
Okay, I managed to reproduce. And I think this is NOTABUG.

What's happening here: the host is indeed under heavy load. A process, be it libvirtd, may therefore not be scheduled by the kernel as often as we would like, so it can fail to reply to keepalive requests in time and the connection is closed. In my case it was not uncommon for libvirtd to go unscheduled for 35 seconds, so the python client watching for events decided to close the connection. And that is in fact the right thing to do: the server was unresponsive.
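
Since such disconnects are legitimate under load, a long-running client simply has to be prepared to reopen the connection. A rough sketch of that pattern (the URI and retry delay are illustrative, not taken from the reporter's setup):

import time
import libvirt

libvirt.virEventRegisterDefaultImpl()

def lifecycle_cb(conn, dom, event, detail, opaque):
    print("domain %s: event=%d detail=%d" % (dom.name(), event, detail))

while True:
    conn = None
    try:
        conn = libvirt.openReadOnly("qemu:///system")
        conn.domainEventRegisterAny(None, libvirt.VIR_DOMAIN_EVENT_ID_LIFECYCLE,
                                    lifecycle_cb, None)
        conn.setKeepAlive(5, 3)
        # Once keepalive declares the connection dead, isAlive() returns 0
        # and we drop out of the loop to reconnect.
        while conn.isAlive():
            libvirt.virEventRunDefaultImpl()
    except libvirt.libvirtError:
        pass  # connection broke mid-call; fall through and retry
    if conn:
        try:
            conn.close()
        except libvirt.libvirtError:
            pass
    time.sleep(1)  # illustrative back-off before reconnecting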

From attached libvirtd.log (notice the jump from 13:20:22 to 13:20:57)
2012-05-25 13:20:21.775+0000: 31242: debug : qemuProcessHandleShutdown:632 : vm=0x7f9a3000d300
2012-05-25 13:20:22.174+0000: 31243: debug : virDomainObjRef:1623 : obj=0x7f9a3000d300 refs=3
2012-05-25 13:20:22.174+0000: 31243: debug : qemuDomainObjBeginJobInternal:753 : Starting job: destroy (async=none)
2012-05-25 13:20:57.106+0000: 31243: debug : qemuProcessStop:3763 : Shutting down VM 'f16' pid=11967 migrated=0
2012-05-25 13:20:57.107+0000: 31243: debug : qemuMonitorClose:759 : QEMU_MONITOR_CLOSE: mon=0x7f9a240243a0 refs=4
2012-05-25 13:20:57.107+0000: 31243: debug : qemuMonitorUnref:210 : QEMU_MONITOR_UNREF: mon=0x7f9a240243a0 refs=3

Moreover, we can see timers that are set to fire every 2 seconds actually firing after 36 seconds (again from libvirtd.log):
2012-05-25 13:20:57.616+0000: 31242: debug : virKeepAliveTimer:166 : RPC_KEEPALIVE_TIMEOUT: ka=0x2126400 client=0x21242b0 countToDeath=1 idle=36
2012-05-25 13:20:57.616+0000: 31242: debug : virNetMessageNew:48 : msg=0x21e9340 tracked=0
2012-05-25 13:20:57.616+0000: 31242: debug : virNetMessageEncodePayloadEmpty:441 : Encode length as 28
2012-05-25 13:20:57.616+0000: 31242: debug : virKeepAliveSend:119 : Sending keepalive request to client 0x21242b0
2012-05-25 13:20:57.616+0000: 31242: debug : virKeepAliveSend:122 : RPC_KEEPALIVE_SEND: ka=0x2126400 client=0x21242b0 prog=1801807216 vers=1 proc=1

Even though the client kept sending keepalive requests in the meantime (client log):
2012-05-25 13:20:26.206+0000: 11895: debug : virKeepAliveSend:119 : Sending keepalive request to client 0x7f7faa50d010
2012-05-25 13:20:26.206+0000: 11895: debug : virKeepAliveSend:122 : RPC_KEEPALIVE_SEND: ka=0xa45510 client=0x7f7faa50d010 prog=1801807216 vers=1 proc=1
2012-05-25 13:20:26.206+0000: 11895: debug : virNetClientSendInternal:1683 : RPC_CLIENT_MSG_TX_QUEUE: client=0x7f7faa50d010 len=28 prog=1801807216 vers=1 proc=1 type=2 status=0 serial=0
2012-05-25 13:20:26.206+0000: 11895: debug : virNetClientIO:1541 : Outgoing message prog=1801807216 version=1 serial=0 proc=1 type=2 length=28 dispatch=(nil)
2012-05-25 13:20:26.206+0000: 11895: debug : virNetClientIO:1599 : We have the buck 0xa45640 0xa45640

2012-05-25 13:20:31.211+0000: 11895: debug : virKeepAliveTimer:166 : RPC_KEEPALIVE_TIMEOUT: ka=0xa45510 client=0x7f7faa50d010 countToDeath=2 idle=10
2012-05-25 13:20:31.211+0000: 11895: debug : virNetMessageNew:48 : msg=0xa46ec0 tracked=0
2012-05-25 13:20:31.211+0000: 11895: debug : virNetMessageEncodePayloadEmpty:441 : Encode length as 28
2012-05-25 13:20:31.211+0000: 11895: debug : virKeepAliveSend:119 : Sending keepalive request to client 0x7f7faa50d010
2012-05-25 13:20:31.211+0000: 11895: debug : virKeepAliveSend:122 : RPC_KEEPALIVE_SEND: ka=0xa45510 client=0x7f7faa50d010 prog=1801807216 vers=1 proc=1
2012-05-25 13:20:31.211+0000: 11895: debug : virNetClientSendInternal:1683 : RPC_CLIENT_MSG_TX_QUEUE: client=0x7f7faa50d010 len=28 prog=1801807216 vers=1 proc=1 type=2 status=0 serial=0
2012-05-25 13:20:31.211+0000: 11895: debug : virNetClientIO:1541 : Outgoing message prog=1801807216 version=1 serial=0 proc=1 type=2 length=28 dispatch=(nil)
2012-05-25 13:20:31.211+0000: 11895: debug : virNetClientIO:1599 : We have the buck 0xa45640 0xa45640

2012-05-25 13:20:36.216+0000: 11895: debug : virKeepAliveTimer:166 : RPC_KEEPALIVE_TIMEOUT: ka=0xa45510 client=0x7f7faa50d010 countToDeath=1 idle=15
2012-05-25 13:20:36.216+0000: 11895: debug : virNetMessageNew:48 : msg=0xa46ec0 tracked=0
2012-05-25 13:20:36.216+0000: 11895: debug : virNetMessageEncodePayloadEmpty:441 : Encode length as 28
2012-05-25 13:20:36.216+0000: 11895: debug : virKeepAliveSend:119 : Sending keepalive request to client 0x7f7faa50d010
2012-05-25 13:20:36.216+0000: 11895: debug : virKeepAliveSend:122 : RPC_KEEPALIVE_SEND: ka=0xa45510 client=0x7f7faa50d010 prog=1801807216 vers=1 proc=1
2012-05-25 13:20:36.216+0000: 11895: debug : virNetClientSendInternal:1683 : RPC_CLIENT_MSG_TX_QUEUE: client=0x7f7faa50d010 len=28 prog=1801807216 vers=1 proc=1 type=2 status=0 serial=0
2012-05-25 13:20:36.216+0000: 11895: debug : virNetClientIO:1541 : Outgoing message prog=1801807216 version=1 serial=0 proc=1 type=2 length=28 dispatch=(nil)
2012-05-25 13:20:36.216+0000: 11895: debug : virNetClientIO:1599 : We have the buck 0xa45640 0xa45640

2012-05-25 13:20:41.221+0000: 11895: debug : virKeepAliveTimer:166 : RPC_KEEPALIVE_TIMEOUT: ka=0xa45510 client=0x7f7faa50d010 countToDeath=0 idle=20
2012-05-25 13:20:41.221+0000: 11895: warning : virKeepAliveTimer:182 : No response from client 0x7f7faa50d010 after 3 keepalive messages in 20 seconds
2012-05-25 13:20:41.221+0000: 11895: debug : virNetClientClose:521 : client=0x7f7faa50d010


And the server does in fact see these requests, but by then it is too late to reply (libvirtd.log):
2012-05-25 13:20:57.627+0000: 31242: debug : virKeepAliveCheckMessage:428 : Got keepalive request from client 0x21242b0
2012-05-25 13:20:57.627+0000: 31242: debug : virKeepAliveScheduleResponse:143 : Scheduling keepalive response to client 0x21242b0
2012-05-25 13:20:57.627+0000: 31242: debug : virNetMessageNew:48 : msg=0x22a9490 tracked=0
2012-05-25 13:20:57.627+0000: 31242: debug : virNetMessageEncodePayloadEmpty:441 : Encode length as 28

Notice that the client closed the connection at 13:20:41, but the server was not scheduled until 13:20:57, 16 seconds later, by which time it was too late.
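
For reference, the client-side arithmetic: with a probe interval of 5 seconds and a limit of 3 unanswered probes, the connection is declared dead roughly interval * (count + 1) = 5 * 4 = 20 seconds after the last server activity, which matches the 13:20:21 -> 13:20:41 window in the logs above.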

Comment 4 Michal Privoznik 2012-05-25 14:04:36 UTC
Created attachment 586875 [details]
libvirtd.log

Comment 5 Michal Privoznik 2012-05-25 14:05:02 UTC
Created attachment 586876 [details]
libvirt_client.log

Comment 6 Michal Privoznik 2012-05-31 15:32:04 UTC
Okay, as long as nobody objects I am closing this as NOTABUG per comment 3. If, however, somebody feels otherwise, don't hesitate to reopen.