1090551 – QEMU guest agent is not available due to an error

Bug 1090551 - QEMU guest agent is not available due to an error

Summary: QEMU guest agent is not available due to an error

Keywords:
Status:	CLOSED WORKSFORME
Alias:	None
Product:	Virtualization Tools
Classification:	Community
Component:	libvirt
Sub Component:
Version:	unspecified
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	Libvirt Maintainers
QA Contact:
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1090550 (view as bug list)
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2014-04-23 15:02 UTC by Wangpan
Modified:	2020-04-17 12:22 UTC (History)
CC List:	11 users (show)
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2020-04-17 12:22:04 UTC
Embargoed:
Dependent Products:

Attachments	(Terms of Use)
all threads bt debug info (10.55 KB, text/plain) 2014-04-23 15:08 UTC, Wangpan	no flags	Details
View All

Description Wangpan 2014-04-23 15:02:59 UTC

Description of problem:
Do guest-ping as quickly as possible to a vm or many vms(20+ in my test), a vm will reports this "QEMU guest agent is not available due to an error" error after few minutes, and restart libvirtd can fix this issue, the guest-ping agent command can be done correctly again.

Version-Release number of selected component (if applicable):
the upstream master branch with top commit id 22a92eb0a79f398c88bec4bb5e59804029ad7ca0.

How reproducible:


Steps to Reproduce:
here is the reproduce script of mine:
https://github.com/aspirer/study/blob/master/qemu-guest-agent/test2.py
you can shorten the sleep time at Line63 to accelerate the reproduce process:
https://github.com/aspirer/study/blob/master/qemu-guest-agent/test2.py#L63

Actual results:
a vm reports libvirt: QEMU Driver error : Guest agent is not responding: QEMU guest agent is not available due to an error

Expected results:
all guest-ping commands to all vms are correctly done(return json data '{"return":{}}').

Additional info:
qemuAgentIOProcessLine return -1 to qemuAgentIOProcessData qemuAgentIOProcess qemuAgentIO , then the error check codes of qemuAgentIO as below will call the callback function qemuProcessHandleAgentError, the qga is marked as errored one, and can not be used before we restart libvirtd;
} else if (error) {
        void (*errorNotify)(qemuAgentPtr, virDomainObjPtr)
            = mon->cb->errorNotify;
        virDomainObjPtr vm = mon->vm;

        /* Make sure anyone waiting wakes up now */
        virCondSignal(&mon->notify);
        virObjectUnlock(mon);
        virObjectUnref(mon);
        VIR_DEBUG("Triggering error callback");
        (errorNotify)(mon, vm);

Breakpoint 2, qemuAgentIOProcessLine (mon=0x7f17d8000e00, line=0x7f182e6958c0 "{\"return\": {}}", msg=0x0) at qemu/qemu_agent.c:341
341                 if (virJSONValueObjectGetNumberUlong(obj, "return", &id) == 0) {
(gdb) bt
#0  qemuAgentIOProcessLine (mon=0x7f17d8000e00, line=0x7f182e6958c0 "{\"return\": {}}", msg=0x0) at qemu/qemu_agent.c:341
#1  0x00007f1822ab76b4 in qemuAgentIOProcessData (mon=0x7f17d8000e00, data=0x7f182e6958c0 "{\"return\": {}}", len=15, msg=0x0) at qemu/qemu_agent.c:384
#2  0x00007f1822ab77ca in qemuAgentIOProcess (mon=0x7f17d8000e00) at qemu/qemu_agent.c:426
#3  0x00007f1822ab7fe0 in qemuAgentIO (watch=63, fd=42, events=0, opaque=0x7f17d8000e00) at qemu/qemu_agent.c:626
#4  0x00007f182b738bed in virEventPollDispatchHandles (nfds=53, fds=0x7f182e695670) at util/vireventpoll.c:510
#5  0x00007f182b739424 in virEventPollRunOnce () at util/vireventpoll.c:660
#6  0x00007f182b737324 in virEventRunDefaultImpl () at util/virevent.c:308
#7  0x00007f182ceb7440 in virNetServerRun (srv=0x7f182e68c2b0) at rpc/virnetserver.c:1139
#8  0x00007f182ce6bb24 in main (argc=2, argv=0x7fff3bc9cb68) at libvirtd.c:1536

(gdb) l
336                  * which is now processing our previous
337                  * guest-sync commands. Check if this is
338                  * the case and don't report an error but
339                  * return silently.
340                  */
341                 if (virJSONValueObjectGetNumberUlong(obj, "return", &id) == 0) {
342                     VIR_DEBUG("Ignoring delayed reply to guest-sync: %llu", id);
343                     ret = 0;
344                     goto cleanup;
345                 }
(gdb) p virJSONValueObjectGetNumberUlong(obj, "return", &id)
$1 = -1
(gdb) p id
$2 = 5

Comment 1 Wangpan 2014-04-23 15:08:49 UTC

Created attachment 888960 [details]
all threads bt debug info

Comment 2 Peter Krempa 2014-04-23 15:21:31 UTC

*** Bug 1090550 has been marked as a duplicate of this bug. ***

Comment 3 Wangpan 2014-04-28 03:35:32 UTC

I found this bug is caused by the qemuAgentSend function, if it return -2(timeout), this bug will be triggered.
So if a short timeout value of qemu agent command is set, such as 1 second, this bug will be triggered in one minute after my test case is run, if I change timeout value to 3 seconds, I have not get this bug after 15 minutes.

Comment 4 Wangpan 2014-04-28 04:57:43 UTC

https://github.com/aspirer/study/blob/master/qemu-guest-agent/test2.py#L21, I changed the timeout value to 3 parameter at this line, this bug is not triggered after 30 minutes.

Comment 5 Wangpan 2014-04-28 08:22:48 UTC

I try to change the codes like below, this bug seems like can be fixed, but I don't know there are other issues or not.

hzwangpan@10-120-120-7:~/libvirt-ntse$ git diff src/qemu/qemu_agent.c
diff --git a/src/qemu/qemu_agent.c b/src/qemu/qemu_agent.c
index 678901e..17b754d 100644
--- a/src/qemu/qemu_agent.c
+++ b/src/qemu/qemu_agent.c
@@ -340,9 +340,9 @@ qemuAgentIOProcessLine(qemuAgentPtr mon,
                 ret = 0;
                 goto cleanup;
             }
-
+            ret = 0;
             virReportError(VIR_ERR_INTERNAL_ERROR,
                            _("Unexpected JSON reply '%s'"), line);
         }
     } else {
         virReportError(VIR_ERR_INTERNAL_ERROR,

Comment 6 Vasiliy G Tolstov 2015-10-28 11:08:47 UTC

ping!

Comment 7 Vasiliy G Tolstov 2015-10-28 11:10:44 UTC

I have the same error with libvirt 1.2.16, i don't do many ping, but i'm run custom command (with custom agent) that sometimes timeout, agent successful write response to /dev/vport3p1, and after that does not receive any messages from libvirt. After libvirt is restarted all works fine.

Comment 8 Luo Gangyi 2016-01-25 01:45:40 UTC

libvirt-1.2.17 also has this problem

Comment 9 Cole Robinson 2016-04-13 21:11:39 UTC

Sorry this never received a timely response.

I'm running libvirt.git (1.3.4 timeframe) and qemu 2.6.0-rc1. I ran the script from comment #1 with 4 running VMs: fedora 21-fedora 24. I couldn't reproduce even shrinking the timeout to one second.

I changed libvirt code to randomly shrink the agent timeout every 20th agent command; those commands would fail, but then normal operation would resume for the next 19 commands.

If anyone can still reproduce, please do the following:

- provide versions of: libvirt, qemu, qemu guest agent inside the VM, VM distro, host distro
- run libvirtd like this to get extra debugging: sudo LIBVIRT_LOG_FILTERS="1:qemu.qemu_agent 1:qemu.qemu_monitor" LIBVIRT_LOG_OUTPUTS="1:stderr" libvirtd > logfile 2>&1

Reproduce the issue, then attach that logfile to the bug

Comment 10 Wangpan 2016-04-15 08:29:20 UTC

our qa member can reproduce this bug under these versions:
1) libvirt 1.2.9
2) qemu 2.1.0
3) qemu agent 1.6.1 or 2.1.0
4) vm debian with 3.2 kernel
5) host debian with 3.10 kernel

but I believe this issue is libvirt bug, not a qemu or agent problem.
because after patching my change in comment#5, this issue is resolved.

the reproduce steps of our qa are(under openstack cloud):
1) create 20 vms or more on one host
2) reboot them with openstack CLI, it is like destroy + define/create with virsh
3) guest-ping them every 10s or less
4) after several hours(may be two or more), one or more vms guest-ping failed with this bug "guest agent is not available due to an error"

you can try this, and may this bug is fixed in later version of libvirt, but I don't think so :(.

Comment 11 Cole Robinson 2016-04-15 14:13:33 UTC

the patch isn't acceptable as is unfortunately, because while it may 'fix' the issue for you, it's essentially just having libvirt report success even though we didn't get a valid response from the guest agent. this is complicated by the fact that guest-ping doesn't return anything anyways, so it probably _appears_ things are working but there may be no communication with the agent actually happening.

With your patch, does the error _never_ happen, or does the occasional command throw an error, but libvirt recovers and future commands work? An interesting bit would be to check if your patch allows a _different_ qmp command to recover and continue to work, something that returns actual info, maybe like guest-get-time

Can you get some more debuginfo?

sudo LIBVIRT_LOG_FILTERS="1:qemu" LIBVIRT_LOG_OUTPUTS="1:stderr" libvirtd > logfile 2>&1

That will capture all debugging info from the qemu driver into 'logfile'. Reproduce the issue, let the failures run for a couple seconds, then stop libvirtd, and post the output. With unpatched libvirt. But also a second run with your patch may be interesting too, but if so please also add an extra line in there after ret=0 like

VIR_ERROR("HIT THE PATCH CONDITION.");

Comment 12 QiangGuan 2016-04-18 03:02:36 UTC

2016-03-21 06:47:46.167+0000: 31474: error : qemuAgentGuestSync:988 : internal error: Guest agent returned ID: 1458542864000 instea
d of 1458542866167
2016-03-21 06:47:46.167+0000: 31473: error : qemuAgentIOProcessLine:349 : internal error: Unexpected JSON reply '{"return": {}}'
2016-03-21 06:47:46.167+0000: 31473: error : qemuAgentIOProcessData:386 : bingo, qemuAgentIOProcessLine <0 
2016-03-21 06:47:46.167+0000: 31473: error : qemuAgentIOProcess:434 : bingo, qemuAgentIOProcessData len <0 

2016-03-16 08:26:06.708+0000: 21969: error : qemuProcessHandleAgentError:181 : internal error: bingo, agentError in qemuProcessHandleAgentError
2016-03-16 08:41:16.873+0000: 21969: error : qemuProcessHandleAgentError:181 : internal error: bingo, agentError in qemuProcessHandleAgentError


Libvirt of version 1.2.9 (I compiled the source code with some additional log output) and qemu of version 2.1 reproduced this problem.

Comment 13 Cole Robinson 2016-04-18 21:19:49 UTC

Please provide the full debug output, not just the point of error, so I can see what it looks like when things are going correctly as well

Comment 14 Michael Liu 2016-05-28 07:16:47 UTC

It's not approriate to mark the qga as errored one,if the qga returns error once. Maybe next time, the qga will restore nomal status.The only way to eliminate the error mark is to restart the libvirtd.

Comment 15 Cole Robinson 2016-05-31 21:56:08 UTC

(In reply to Michael Liu from comment #14)
> It's not approriate to mark the qga as errored one,if the qga returns error
> once. Maybe next time, the qga will restore nomal status.The only way to
> eliminate the error mark is to restart the libvirtd.

I assume that logic was added for a reason. It may be that if we get a bogus response from qemu-ga than all future responses are bogus, for a certain class of error. I'm not sure.

Full debug output, and usage of a less trivial qemu-ga command like guest-get-time, may help determine if that's the case

CCing danpb as well who added the agentError logic in the original qemu-ga support. maybe setting agentError=true on failure to parse the guest-agent response is the wrong thing, or there's something else going on here

Comment 16 BCKernel 2016-08-09 06:46:11 UTC

In The libvir-list Archives，there is one patch to fix this bug:
https://www.redhat.com/archives/libvir-list/2016-July/msg00161.html
But for now,it's still pending.

Comment 17 Cole Robinson 2016-12-22 19:45:28 UTC

Note, in the latest libvirt release there has been some patches in this area. Any idea if they fix your issue? Example:

commit 6ba861ae361168337828430f85fb8c6b6b5b7c7d
Author: Nikolay Shirokovskiy <nshirokovskiy>
Date:   Wed Nov 16 16:43:03 2016 +0300

    qemu: agent: cleanup agent error flag correctly

Comment 18 Daniel Berrangé 2020-04-17 12:22:04 UTC

Closing old bug since it was never reliably reproduceable and there have been many code changes since the version against which it was reported.

Note You need to log in before you can comment on or make changes to this bug.