Description of problem: Do guest-ping as quickly as possible to a vm or many vms(20+ in my test), a vm will reports this "QEMU guest agent is not available due to an error" error after few minutes, and restart libvirtd can fix this issue, the guest-ping agent command can be done correctly again. Version-Release number of selected component (if applicable): the upstream master branch with top commit id 22a92eb0a79f398c88bec4bb5e59804029ad7ca0. How reproducible: Steps to Reproduce: here is the reproduce script of mine: https://github.com/aspirer/study/blob/master/qemu-guest-agent/test2.py you can shorten the sleep time at Line63 to accelerate the reproduce process: https://github.com/aspirer/study/blob/master/qemu-guest-agent/test2.py#L63 Actual results: a vm reports libvirt: QEMU Driver error : Guest agent is not responding: QEMU guest agent is not available due to an error Expected results: all guest-ping commands to all vms are correctly done(return json data '{"return":{}}'). Additional info: qemuAgentIOProcessLine return -1 to qemuAgentIOProcessData qemuAgentIOProcess qemuAgentIO , then the error check codes of qemuAgentIO as below will call the callback function qemuProcessHandleAgentError, the qga is marked as errored one, and can not be used before we restart libvirtd; } else if (error) { void (*errorNotify)(qemuAgentPtr, virDomainObjPtr) = mon->cb->errorNotify; virDomainObjPtr vm = mon->vm; /* Make sure anyone waiting wakes up now */ virCondSignal(&mon->notify); virObjectUnlock(mon); virObjectUnref(mon); VIR_DEBUG("Triggering error callback"); (errorNotify)(mon, vm); Breakpoint 2, qemuAgentIOProcessLine (mon=0x7f17d8000e00, line=0x7f182e6958c0 "{\"return\": {}}", msg=0x0) at qemu/qemu_agent.c:341 341 if (virJSONValueObjectGetNumberUlong(obj, "return", &id) == 0) { (gdb) bt #0 qemuAgentIOProcessLine (mon=0x7f17d8000e00, line=0x7f182e6958c0 "{\"return\": {}}", msg=0x0) at qemu/qemu_agent.c:341 #1 0x00007f1822ab76b4 in qemuAgentIOProcessData (mon=0x7f17d8000e00, data=0x7f182e6958c0 "{\"return\": {}}", len=15, msg=0x0) at qemu/qemu_agent.c:384 #2 0x00007f1822ab77ca in qemuAgentIOProcess (mon=0x7f17d8000e00) at qemu/qemu_agent.c:426 #3 0x00007f1822ab7fe0 in qemuAgentIO (watch=63, fd=42, events=0, opaque=0x7f17d8000e00) at qemu/qemu_agent.c:626 #4 0x00007f182b738bed in virEventPollDispatchHandles (nfds=53, fds=0x7f182e695670) at util/vireventpoll.c:510 #5 0x00007f182b739424 in virEventPollRunOnce () at util/vireventpoll.c:660 #6 0x00007f182b737324 in virEventRunDefaultImpl () at util/virevent.c:308 #7 0x00007f182ceb7440 in virNetServerRun (srv=0x7f182e68c2b0) at rpc/virnetserver.c:1139 #8 0x00007f182ce6bb24 in main (argc=2, argv=0x7fff3bc9cb68) at libvirtd.c:1536 (gdb) l 336 * which is now processing our previous 337 * guest-sync commands. Check if this is 338 * the case and don't report an error but 339 * return silently. 340 */ 341 if (virJSONValueObjectGetNumberUlong(obj, "return", &id) == 0) { 342 VIR_DEBUG("Ignoring delayed reply to guest-sync: %llu", id); 343 ret = 0; 344 goto cleanup; 345 } (gdb) p virJSONValueObjectGetNumberUlong(obj, "return", &id) $1 = -1 (gdb) p id $2 = 5
Created attachment 888960 [details] all threads bt debug info
*** Bug 1090550 has been marked as a duplicate of this bug. ***
I found this bug is caused by the qemuAgentSend function, if it return -2(timeout), this bug will be triggered. So if a short timeout value of qemu agent command is set, such as 1 second, this bug will be triggered in one minute after my test case is run, if I change timeout value to 3 seconds, I have not get this bug after 15 minutes.
https://github.com/aspirer/study/blob/master/qemu-guest-agent/test2.py#L21, I changed the timeout value to 3 parameter at this line, this bug is not triggered after 30 minutes.
I try to change the codes like below, this bug seems like can be fixed, but I don't know there are other issues or not. hzwangpan@10-120-120-7:~/libvirt-ntse$ git diff src/qemu/qemu_agent.c diff --git a/src/qemu/qemu_agent.c b/src/qemu/qemu_agent.c index 678901e..17b754d 100644 --- a/src/qemu/qemu_agent.c +++ b/src/qemu/qemu_agent.c @@ -340,9 +340,9 @@ qemuAgentIOProcessLine(qemuAgentPtr mon, ret = 0; goto cleanup; } - + ret = 0; virReportError(VIR_ERR_INTERNAL_ERROR, _("Unexpected JSON reply '%s'"), line); } } else { virReportError(VIR_ERR_INTERNAL_ERROR,
ping!
I have the same error with libvirt 1.2.16, i don't do many ping, but i'm run custom command (with custom agent) that sometimes timeout, agent successful write response to /dev/vport3p1, and after that does not receive any messages from libvirt. After libvirt is restarted all works fine.
libvirt-1.2.17 also has this problem
Sorry this never received a timely response. I'm running libvirt.git (1.3.4 timeframe) and qemu 2.6.0-rc1. I ran the script from comment #1 with 4 running VMs: fedora 21-fedora 24. I couldn't reproduce even shrinking the timeout to one second. I changed libvirt code to randomly shrink the agent timeout every 20th agent command; those commands would fail, but then normal operation would resume for the next 19 commands. If anyone can still reproduce, please do the following: - provide versions of: libvirt, qemu, qemu guest agent inside the VM, VM distro, host distro - run libvirtd like this to get extra debugging: sudo LIBVIRT_LOG_FILTERS="1:qemu.qemu_agent 1:qemu.qemu_monitor" LIBVIRT_LOG_OUTPUTS="1:stderr" libvirtd > logfile 2>&1 Reproduce the issue, then attach that logfile to the bug
our qa member can reproduce this bug under these versions: 1) libvirt 1.2.9 2) qemu 2.1.0 3) qemu agent 1.6.1 or 2.1.0 4) vm debian with 3.2 kernel 5) host debian with 3.10 kernel but I believe this issue is libvirt bug, not a qemu or agent problem. because after patching my change in comment#5, this issue is resolved. the reproduce steps of our qa are(under openstack cloud): 1) create 20 vms or more on one host 2) reboot them with openstack CLI, it is like destroy + define/create with virsh 3) guest-ping them every 10s or less 4) after several hours(may be two or more), one or more vms guest-ping failed with this bug "guest agent is not available due to an error" you can try this, and may this bug is fixed in later version of libvirt, but I don't think so :(.
the patch isn't acceptable as is unfortunately, because while it may 'fix' the issue for you, it's essentially just having libvirt report success even though we didn't get a valid response from the guest agent. this is complicated by the fact that guest-ping doesn't return anything anyways, so it probably _appears_ things are working but there may be no communication with the agent actually happening. With your patch, does the error _never_ happen, or does the occasional command throw an error, but libvirt recovers and future commands work? An interesting bit would be to check if your patch allows a _different_ qmp command to recover and continue to work, something that returns actual info, maybe like guest-get-time Can you get some more debuginfo? sudo LIBVIRT_LOG_FILTERS="1:qemu" LIBVIRT_LOG_OUTPUTS="1:stderr" libvirtd > logfile 2>&1 That will capture all debugging info from the qemu driver into 'logfile'. Reproduce the issue, let the failures run for a couple seconds, then stop libvirtd, and post the output. With unpatched libvirt. But also a second run with your patch may be interesting too, but if so please also add an extra line in there after ret=0 like VIR_ERROR("HIT THE PATCH CONDITION.");
2016-03-21 06:47:46.167+0000: 31474: error : qemuAgentGuestSync:988 : internal error: Guest agent returned ID: 1458542864000 instea d of 1458542866167 2016-03-21 06:47:46.167+0000: 31473: error : qemuAgentIOProcessLine:349 : internal error: Unexpected JSON reply '{"return": {}}' 2016-03-21 06:47:46.167+0000: 31473: error : qemuAgentIOProcessData:386 : bingo, qemuAgentIOProcessLine <0 2016-03-21 06:47:46.167+0000: 31473: error : qemuAgentIOProcess:434 : bingo, qemuAgentIOProcessData len <0 2016-03-16 08:26:06.708+0000: 21969: error : qemuProcessHandleAgentError:181 : internal error: bingo, agentError in qemuProcessHandleAgentError 2016-03-16 08:41:16.873+0000: 21969: error : qemuProcessHandleAgentError:181 : internal error: bingo, agentError in qemuProcessHandleAgentError Libvirt of version 1.2.9 (I compiled the source code with some additional log output) and qemu of version 2.1 reproduced this problem.
Please provide the full debug output, not just the point of error, so I can see what it looks like when things are going correctly as well
It's not approriate to mark the qga as errored one,if the qga returns error once. Maybe next time, the qga will restore nomal status.The only way to eliminate the error mark is to restart the libvirtd.
(In reply to Michael Liu from comment #14) > It's not approriate to mark the qga as errored one,if the qga returns error > once. Maybe next time, the qga will restore nomal status.The only way to > eliminate the error mark is to restart the libvirtd. I assume that logic was added for a reason. It may be that if we get a bogus response from qemu-ga than all future responses are bogus, for a certain class of error. I'm not sure. Full debug output, and usage of a less trivial qemu-ga command like guest-get-time, may help determine if that's the case CCing danpb as well who added the agentError logic in the original qemu-ga support. maybe setting agentError=true on failure to parse the guest-agent response is the wrong thing, or there's something else going on here
In The libvir-list Archives,there is one patch to fix this bug: https://www.redhat.com/archives/libvir-list/2016-July/msg00161.html But for now,it's still pending.
Note, in the latest libvirt release there has been some patches in this area. Any idea if they fix your issue? Example: commit 6ba861ae361168337828430f85fb8c6b6b5b7c7d Author: Nikolay Shirokovskiy <nshirokovskiy> Date: Wed Nov 16 16:43:03 2016 +0300 qemu: agent: cleanup agent error flag correctly
Closing old bug since it was never reliably reproduceable and there have been many code changes since the version against which it was reported.