Bug 821991
| Field | Value | Field | Value |
|---|---|---|---|
| Summary: | libvirt domain event handler cannot receive messages after several virKeepAliveTimer warnings | | |
| Product: | Red Hat Enterprise Linux 6 | Reporter: | weizhang <weizhan> |
| Component: | libvirt | Assignee: | Michal Privoznik <mprivozn> |
| Status: | CLOSED NOTABUG | QA Contact: | Virtualization Bugs <virt-bugs> |
| Severity: | high | Docs Contact: | |
| Priority: | medium | | |
| Version: | 6.3 | CC: | acathrow, dallan, dyasny, dyuan, honzhang, mzhan, rwu |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2012-05-31 15:32:04 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | libvirtd.log, libvirt_client.log | | |
Okay, I managed to reproduce this, and I think it is NOTABUG. What's happening here: the host really is under heavy load, so a process, be it libvirtd, may be scheduled by the kernel far less often than we would like. It can therefore fail to reply to keepalive requests in time, and the connection is closed. In my case it was not uncommon for libvirtd to go unscheduled for 35 seconds. The Python client watching for events therefore decided to close the connection, and that is in fact correct behaviour: the server was unresponsive.

From the attached libvirtd.log (notice the jump from 13:20:22 to 13:20:57):

2012-05-25 13:20:21.775+0000: 31242: debug : qemuProcessHandleShutdown:632 : vm=0x7f9a3000d300
2012-05-25 13:20:22.174+0000: 31243: debug : virDomainObjRef:1623 : obj=0x7f9a3000d300 refs=3
2012-05-25 13:20:22.174+0000: 31243: debug : qemuDomainObjBeginJobInternal:753 : Starting job: destroy (async=none)
2012-05-25 13:20:57.106+0000: 31243: debug : qemuProcessStop:3763 : Shutting down VM 'f16' pid=11967 migrated=0
2012-05-25 13:20:57.107+0000: 31243: debug : qemuMonitorClose:759 : QEMU_MONITOR_CLOSE: mon=0x7f9a240243a0 refs=4
2012-05-25 13:20:57.107+0000: 31243: debug : qemuMonitorUnref:210 : QEMU_MONITOR_UNREF: mon=0x7f9a240243a0 refs=3

Moreover, we can see timers that are set to fire every 2 seconds actually firing after 36 seconds (again libvirtd.log):

2012-05-25 13:20:57.616+0000: 31242: debug : virKeepAliveTimer:166 : RPC_KEEPALIVE_TIMEOUT: ka=0x2126400 client=0x21242b0 countToDeath=1 idle=36
2012-05-25 13:20:57.616+0000: 31242: debug : virNetMessageNew:48 : msg=0x21e9340 tracked=0
2012-05-25 13:20:57.616+0000: 31242: debug : virNetMessageEncodePayloadEmpty:441 : Encode length as 28
2012-05-25 13:20:57.616+0000: 31242: debug : virKeepAliveSend:119 : Sending keepalive request to client 0x21242b0
2012-05-25 13:20:57.616+0000: 31242: debug : virKeepAliveSend:122 : RPC_KEEPALIVE_SEND: ka=0x2126400 client=0x21242b0 prog=1801807216 vers=1 proc=1

Even though the client was trying to send KA requests meanwhile (client log):

2012-05-25 13:20:26.206+0000: 11895: debug : virKeepAliveSend:119 : Sending keepalive request to client 0x7f7faa50d010
2012-05-25 13:20:26.206+0000: 11895: debug : virKeepAliveSend:122 : RPC_KEEPALIVE_SEND: ka=0xa45510 client=0x7f7faa50d010 prog=1801807216 vers=1 proc=1
2012-05-25 13:20:26.206+0000: 11895: debug : virNetClientSendInternal:1683 : RPC_CLIENT_MSG_TX_QUEUE: client=0x7f7faa50d010 len=28 prog=1801807216 vers=1 proc=1 type=2 status=0 serial=0
2012-05-25 13:20:26.206+0000: 11895: debug : virNetClientIO:1541 : Outgoing message prog=1801807216 version=1 serial=0 proc=1 type=2 length=28 dispatch=(nil)
2012-05-25 13:20:26.206+0000: 11895: debug : virNetClientIO:1599 : We have the buck 0xa45640 0xa45640
2012-05-25 13:20:31.211+0000: 11895: debug : virKeepAliveTimer:166 : RPC_KEEPALIVE_TIMEOUT: ka=0xa45510 client=0x7f7faa50d010 countToDeath=2 idle=10
2012-05-25 13:20:31.211+0000: 11895: debug : virNetMessageNew:48 : msg=0xa46ec0 tracked=0
2012-05-25 13:20:31.211+0000: 11895: debug : virNetMessageEncodePayloadEmpty:441 : Encode length as 28
2012-05-25 13:20:31.211+0000: 11895: debug : virKeepAliveSend:119 : Sending keepalive request to client 0x7f7faa50d010
2012-05-25 13:20:31.211+0000: 11895: debug : virKeepAliveSend:122 : RPC_KEEPALIVE_SEND: ka=0xa45510 client=0x7f7faa50d010 prog=1801807216 vers=1 proc=1
2012-05-25 13:20:31.211+0000: 11895: debug : virNetClientSendInternal:1683 : RPC_CLIENT_MSG_TX_QUEUE: client=0x7f7faa50d010 len=28 prog=1801807216 vers=1 proc=1 type=2 status=0 serial=0
2012-05-25 13:20:31.211+0000: 11895: debug : virNetClientIO:1541 : Outgoing message prog=1801807216 version=1 serial=0 proc=1 type=2 length=28 dispatch=(nil)
2012-05-25 13:20:31.211+0000: 11895: debug : virNetClientIO:1599 : We have the buck 0xa45640 0xa45640
2012-05-25 13:20:36.216+0000: 11895: debug : virKeepAliveTimer:166 : RPC_KEEPALIVE_TIMEOUT: ka=0xa45510 client=0x7f7faa50d010 countToDeath=1 idle=15
2012-05-25 13:20:36.216+0000: 11895: debug : virNetMessageNew:48 : msg=0xa46ec0 tracked=0
2012-05-25 13:20:36.216+0000: 11895: debug : virNetMessageEncodePayloadEmpty:441 : Encode length as 28
2012-05-25 13:20:36.216+0000: 11895: debug : virKeepAliveSend:119 : Sending keepalive request to client 0x7f7faa50d010
2012-05-25 13:20:36.216+0000: 11895: debug : virKeepAliveSend:122 : RPC_KEEPALIVE_SEND: ka=0xa45510 client=0x7f7faa50d010 prog=1801807216 vers=1 proc=1
2012-05-25 13:20:36.216+0000: 11895: debug : virNetClientSendInternal:1683 : RPC_CLIENT_MSG_TX_QUEUE: client=0x7f7faa50d010 len=28 prog=1801807216 vers=1 proc=1 type=2 status=0 serial=0
2012-05-25 13:20:36.216+0000: 11895: debug : virNetClientIO:1541 : Outgoing message prog=1801807216 version=1 serial=0 proc=1 type=2 length=28 dispatch=(nil)
2012-05-25 13:20:36.216+0000: 11895: debug : virNetClientIO:1599 : We have the buck 0xa45640 0xa45640
2012-05-25 13:20:41.221+0000: 11895: debug : virKeepAliveTimer:166 : RPC_KEEPALIVE_TIMEOUT: ka=0xa45510 client=0x7f7faa50d010 countToDeath=0 idle=20
2012-05-25 13:20:41.221+0000: 11895: warning : virKeepAliveTimer:182 : No response from client 0x7f7faa50d010 after 3 keepalive messages in 20 seconds
2012-05-25 13:20:41.221+0000: 11895: debug : virNetClientClose:521 : client=0x7f7faa50d010

And in fact the server sees these requests, but by then it is too late to reply (libvirtd.log):

2012-05-25 13:20:57.627+0000: 31242: debug : virKeepAliveCheckMessage:428 : Got keepalive request from client 0x21242b0
2012-05-25 13:20:57.627+0000: 31242: debug : virKeepAliveScheduleResponse:143 : Scheduling keepalive response to client 0x21242b0
2012-05-25 13:20:57.627+0000: 31242: debug : virNetMessageNew:48 : msg=0x22a9490 tracked=0
2012-05-25 13:20:57.627+0000: 31242: debug : virNetMessageEncodePayloadEmpty:441 : Encode length as 28

Notice that the client closed the connection at 13:20:41, but the server only got scheduled at 13:20:57 (16 seconds later), when it was too late.

Created attachment 586875 [details]
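The countdown visible in the client log (countToDeath 2 -> 1 -> 0 at idle 10 -> 15 -> 20, then the connection is closed) can be modeled with a small simulation. The function below is an illustrative sketch of that bookkeeping, not libvirt's actual implementation; the parameter names `interval` and `count` mirror the `keepalive_interval`/`keepalive_count` settings, and the timing details are simplified.

```python
def keepalive_timeline(interval, count):
    """Model the client-side keepalive countdown (simplified).

    A probe goes out one interval after the last traffic; each later
    timer tick decrements the remaining-probes counter by one.  When
    it reaches zero the connection is closed.  Returns a list of
    (idle_seconds, count_to_death) pairs, ending with the closing tick.
    """
    count_to_death = count
    idle = interval
    events = [(idle, count_to_death)]  # first probe sent, counter intact
    while count_to_death > 0:
        idle += interval               # next timer tick, still no reply
        count_to_death -= 1
        events.append((idle, count_to_death))
    return events
```

With the values seen in the client log (interval=5, count=3), the model reproduces the observed sequence and closes the connection at idle=20, matching the warning "No response ... after 3 keepalive messages in 20 seconds".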
libvirtd.log
Created attachment 586876 [details]
libvirt_client.log
Created attachment 584831 [details]
libvirtd.log

Description of problem:
When the CPU is busy and a domain operation cannot reply for more than 30 s (the default keepalive timeout), libvirt reports an error such as "error: End of file while reading data: Input/output error" and the libvirt event-test.py client exits. After many iterations of the loop, the event client no longer receives any messages.

Version-Release number of selected component (if applicable):
snapshot 3
libvirt-python-0.9.10-18.el6.x86_64
libvirt-client-0.9.10-18.el6.x86_64
libvirt-0.9.10-18.el6.x86_64

How reproducible:
100% over a long run

Steps to Reproduce:
1. Change libvirtd.conf so the error is easier to hit, then restart libvirtd:
   keepalive_interval = 2
   keepalive_count = 1
2. On another console, keep the CPU busy with a copy loop:
   # while true; do cp rhel6u3-1 test; rm -rf test; done
3. Run a start->save->restore->destroy loop for a guest:
   # while true; do virsh start kvm-rhel6u3-x86_64; sleep 10; virsh save kvm-rhel6u3-x86_64 /var/lib/libvirt/images/kvm-rhel6u3-x86_64.save; virsh restore /var/lib/libvirt/images/kvm-rhel6u3-x86_64.save; virsh destroy kvm-rhel6u3-x86_64; done
4. On a third console, run the event client to watch domain operation messages:
   # while true; do python /usr/share/doc/libvirt-python-0.9.10/events-python/event-test.py; done

Actual results:
After many iterations of the loop, the libvirt event client no longer receives messages.

Expected results:
The libvirt event client always receives messages.

Additional info:
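The `while true` wrapper around event-test.py in step 4 is one way to survive these forced disconnects: treat the connection as disposable and reconnect when keepalive kills it. The sketch below generalizes that idea as a retry loop. The names `watch_events`, `connect`, `watch`, `max_retries`, and `backoff` are mine, not part of the libvirt API; in a real client, `connect` would be something like `libvirt.openReadOnly(None)` after registering an event loop, and `watch` would pump events until the connection drops.

```python
import time

def watch_events(connect, watch, max_retries=5, backoff=1.0):
    """Keep an event watcher alive across dropped connections.

    `connect` opens a connection (raising IOError on failure) and
    `watch` consumes events, raising IOError when the connection dies.
    Both callables are placeholders for real libvirt calls.  Returns
    the number of reconnect attempts that were needed before `watch`
    finished cleanly.
    """
    retries = 0
    while True:
        try:
            conn = connect()
            watch(conn)          # blocks until the connection drops
            return retries
        except IOError:
            retries += 1
            if retries > max_retries:
                raise            # give up after too many failures
            time.sleep(backoff)  # simple fixed backoff before retrying
```

A client structured this way keeps receiving domain events after a keepalive-triggered disconnect instead of exiting, at the cost of possibly missing events that fire while it is reconnecting.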