It is possible to call some QEMU-GA commands through the libvirt interface (e.g. virDomainInterfaceAddresses), but all of these commands block until the call to QEMU-GA finishes, and it is not possible to specify a timeout. This is problematic, for example, when the guest is under load and the call can take some time to finish. For that whole duration the libvirt domain is locked and any other interaction with libvirt is impossible. The libvirt interface should be extended to allow setting a timeout for guest agent calls.
(In reply to Tomáš Golembiovský from comment #0)
> It is possible to call some QEMU-GA commands using libvirt interface (e.g. virDomainInterfaceAddresses) but all the commands block until the call to QEMU-GA finishes and it is not possible to specify a timeout. This is problematic for example in situations when the guest is under load when the call can take some time to finish. For the duration the libvirt domain is locked and any other interaction with libvirt is impossible.

This is a bug in its own right. In most cases the only other APIs that should be blocked are those which also need to use the guest agent. The majority of domain APIs should be allowed, as talking to the guest agent doesn't change libvirt's view of the guest domain's state in most cases.

> The libvirt interface should be extended to allow setting a timeout to the guest agent calls.
(In reply to Daniel Berrangé from comment #1)
> (In reply to Tomáš Golembiovský from comment #0)
> > It is possible to call some QEMU-GA commands using libvirt interface (e.g. virDomainInterfaceAddresses) but all the commands block until the call to QEMU-GA finishes and it is not possible to specify a timeout. This is problematic for example in situations when the guest is under load when the call can take some time to finish. For the duration the libvirt domain is locked and any other interaction with libvirt is impossible.
>
> This is a bug in its own right. In most cases the only other APIs that should be blocked are those which also need to use the guest agent. The majority of domain APIs should be allowed, as talking to the guest agent doesn't change libvirt's view of the guest domain's state in most cases.

As far as I can tell (from observation as well as testing), virDomainInterfaceAddresses() does not actually block other commands on the domain, because it only queries information from the agent (i.e. it calls qemuDomainObjBeginAgentJob() rather than qemuDomainObjBeginJobWithAgent()). For example, I can do the following:
- use virsh from two different terminals
- instrument the guest so that agent commands don't respond immediately (e.g. set a breakpoint on qmp_guest_network_get_interfaces())
- in the first terminal, call 'domifaddr --source agent $domain'
  - virsh waits indefinitely for a reply, which is blocked by the breakpoint set above
- in the second terminal, call 'setmem $domain 3G'
  - the command completes immediately and memory is updated

If you get different results, please let me know.

On the other hand, qemuDomainGetFSInfo() *does* block other commands while it is running. For example, if you repeat the above procedure with 'domfsinfo $domain', the 'setmem' command will block indefinitely until the agent command completes. This is because 'domfsinfo' starts a job for both the agent and the domain.
We do so because virDomainGetFSInfo() attempts to match the filesystems returned by the agent command to disks in the domain's XML definition. I believe this behavior is correct because it prevents the domain definition from changing while we query the filesystem information.

> > The libvirt interface should be extended to allow setting a timeout to the guest agent calls.

So I guess this is the important part of this bug, but it is also a bit tricky. I'll send a proposal patch upstream for discussion.
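To make the difference between the two job types concrete, here is a minimal, hypothetical model in C. The struct and the try_begin_* helpers are invented for illustration and only loosely mirror qemuDomainObjBeginAgentJob(), qemuDomainObjBeginJob() and qemuDomainObjBeginJobWithAgent(); real libvirt waits on a condition instead of returning false, and tracks much more state.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, heavily simplified job state for one domain. */
struct dom_jobs {
    bool job_active;   /* normal job: monitor / domain definition */
    bool agent_active; /* guest agent job */
};

/* Loosely mirrors qemuDomainObjBeginAgentJob(): agent slot only. */
static bool try_begin_agent_job(struct dom_jobs *j)
{
    if (j->agent_active)
        return false; /* real code would wait here instead */
    j->agent_active = true;
    return true;
}

/* Loosely mirrors qemuDomainObjBeginJob(): normal job slot only. */
static bool try_begin_job(struct dom_jobs *j)
{
    if (j->job_active)
        return false;
    j->job_active = true;
    return true;
}

/* Loosely mirrors qemuDomainObjBeginJobWithAgent(): takes both slots,
 * so it shuts out normal jobs for as long as the agent takes to answer. */
static bool try_begin_job_with_agent(struct dom_jobs *j)
{
    if (j->job_active || j->agent_active)
        return false;
    j->job_active = true;
    j->agent_active = true;
    return true;
}
```

In this model, 'domifaddr --source agent' only occupies the agent slot, so a concurrent 'setmem' can still take the normal job slot, while 'domfsinfo' occupies both and 'setmem' has to wait.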
(In reply to Jonathon Jongsma from comment #2)
> On the other hand, qemuDomainGetFSInfo() *does* block other commands while it is running. For example, if you repeat the above procedure for 'domfsinfo $domain', the 'setmem' command will block indefinitely until the agent command completes. This is because 'domfsinfo' starts a job for both the agent and the domain. We do so because virDomainGetFSInfo() attempts to match the filesystems returned by the agent command to disks in the domain's xml definition. I believe this behavior is correct because it prevents the domain definition from changing while we query the filesystem information.

I don't think that level of locking is really necessary. Matching up guest disks to host XML is only ever a best-effort thing; there's no strong guarantee we'll successfully match. Given that, I think it is not necessary to acquire a job on the domain while the agent is running. We should just do the matching with whatever domain XML exists after we get the data back from the guest. If someone has changed storage in the meantime, so be it; we'll simply not match those changed entries.

It is not *ever* acceptable for a guest agent command to block execution of any other part of the libvirt mgmt API, except for other guest agent commands, as this is a denial of service on the mgmt app.

> > > The libvirt interface should be extended to allow setting a timeout to the guest agent calls.
>
> So I guess this is the important part of this bug, but it is also a bit tricky. I'll send a proposal patch upstream for discussion.
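A minimal sketch of the best-effort matching suggested here, assuming a hypothetical host_disk record keyed by disk serial number (libvirt's actual matching criteria may well differ). The lookup runs against whatever definition exists once the agent has replied, and a miss simply yields no host alias instead of an error.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical: one disk from the current domain definition. */
struct host_disk {
    const char *serial; /* matching key assumed for this sketch */
    const char *dst;    /* target name, e.g. "vda" */
};

/* Look up a guest-reported disk in a snapshot of the definition taken
 * after the agent replied. Returns the target name, or NULL when there
 * is no match; the caller then reports the filesystem without a host
 * disk alias rather than failing the whole call. */
static const char *
match_guest_disk(const char *guest_serial,
                 const struct host_disk *disks, size_t ndisks)
{
    if (guest_serial == NULL)
        return NULL;
    for (size_t i = 0; i < ndisks; i++) {
        if (disks[i].serial != NULL &&
            strcmp(disks[i].serial, guest_serial) == 0)
            return disks[i].dst;
    }
    return NULL; /* storage changed meanwhile, or never matchable */
}
```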
(In reply to Daniel Berrangé from comment #3)
> I don't think that level of locking is really necessary. Matching up guest disks to host XML is only ever a best effort thing. There's no strong guarantee we'll successfully match.

OK, I can submit a patch making that change, but...

> It is not *ever* acceptable for a guest agent command to block execution of any other part of the libvirt mgmt API, except for other guest agent commands, as this is a denial of service on the mgmt app.

If this is really a hard-and-fast rule, does that basically mean that every place where we use qemuDomainObjBeginJobWithAgent() is a bug? At the moment, we have 7 such cases:
- qemuDomainShutdownFlags()
- qemuDomainReboot()
- qemuDomainSetVcpusFlags()
- qemuDomainPMSuspendForDuration()
- qemuDomainSetTime()
- qemuDomainGetFSInfo()
- qemuDomainGetGuestInfo()
Yes, I'd consider them all bugs, because we have to assume that the agent can be malicious / hostile and thus protect against a DoS when using the agent.
A new API, virDomainAgentSetResponseTimeout(), has been added to upstream libvirt:

commit 95f5ac9ae52455e9da47afc95fa31c9456ac27ae
Author: Jonathon Jongsma <jjongsma>
Date:   Wed Nov 13 16:06:09 2019 -0600
Tested with libvirt-5.10.0-1.module+el8.2.0+5135+ed3b2489.x86_64

# virsh guest-agent-timeout --help
  NAME
    guest-agent-timeout - Set the guest agent timeout

  SYNOPSIS
    guest-agent-timeout <domain> [--timeout <number>]

  DESCRIPTION
    Set the number of seconds to wait for a response from the guest agent.

  OPTIONS
    [--domain] <string>  domain name, id or uuid
    --timeout <number>  timeout seconds.

Hi, Jonathon
It seems I cannot get the current timeout of the guest agent. Please take a look.
Hi, Jonathon
The timeout can be set even when the guest agent is not configured:

# virsh domtime demo
error: argument unsupported: QEMU guest agent is not configured

# virsh guest-agent-timeout demo --timeout 2147483637

# echo $?
0

Please help check whether this is by design.
(In reply to Lili Zhu from comment #10)
> Hi, Jonathon
> It seems I can not get the current timeout of guest agent. Please have a check

Yes, there is no way to query the current agent timeout. Tomas, do you have a need to query the timeout? If there is a need for this, maybe we should open a new bug, but I don't see a compelling reason to introduce this functionality at the moment.

(In reply to Lili Zhu from comment #11)
> Hi, Jonathon
>
> The timeout can be set even the guest agent is not configured
>
> # virsh domtime demo
> error: argument unsupported: QEMU guest agent is not configured
>
> # virsh guest-agent-timeout demo --timeout 2147483637
>
> # echo $?
> 0
>
> Please help to check whether this is designed like this

Yes, it is intentional that we can set the agent timeout even when an agent is not configured. If an agent is connected in the future, the timeout will automatically apply.
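A minimal sketch of why setting the timeout before any agent exists can work, assuming the value is stored on the domain object and copied onto the agent connection whenever one is established. The struct domain, struct agent_conn and both helpers are hypothetical; libvirt keeps the value in its own per-domain private state.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical agent connection; holds the timeout it will use. */
struct agent_conn {
    int timeout; /* seconds, or one of the special values -2 / -1 / 0 */
};

/* Hypothetical domain object: the timeout lives here, not on the agent. */
struct domain {
    int agent_timeout;        /* starts at -1 (default) */
    struct agent_conn *agent; /* NULL while no agent is connected */
};

/* Always remember the value; apply it immediately only if connected. */
static void set_agent_timeout(struct domain *dom, int timeout)
{
    dom->agent_timeout = timeout;
    if (dom->agent != NULL)
        dom->agent->timeout = timeout;
}

/* Called when the agent channel connects, e.g. after hotplug. */
static void agent_connected(struct domain *dom, struct agent_conn *conn)
{
    conn->timeout = dom->agent_timeout;
    dom->agent = conn;
}
```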
(In reply to Jonathon Jongsma from comment #12)
> (In reply to Lili Zhu from comment #10)
> > Hi, Jonathon
> > It seems I can not get the current timeout of guest agent. Please have a check
>
> Yes, there is no way to query the current agent timeout. Tomas, do you have a need to query the timeout? If there is a need for this, maybe we should open a new bug, but I don't see a compelling reason to introduce this functionality at the moment.

That's fine by me. I will open a new bug if I need the functionality.

> (In reply to Lili Zhu from comment #11)
> > Hi, Jonathon
> >
> > The timeout can be set even the guest agent is not configured
> >
> > # virsh domtime demo
> > error: argument unsupported: QEMU guest agent is not configured
> >
> > # virsh guest-agent-timeout demo --timeout 2147483637
> >
> > # echo $?
> > 0
> >
> > Please help to check whether this is designed like this
>
> Yes, it is intentional that we can set the agent timeout even when an agent is not configured. If an agent is connected in the future, the timeout will automatically apply.

Sounds correct to me.
Tested with:
libvirt-6.0.0-12.el8.x86_64
qemu-kvm-4.2.0-15.module+el8.2.0+6029+618ef2ec.x86_64
qemu-guest-agent-4.2.0-15.module+el8.2.0+6029+618ef2ec.x86_64

Scenario 1:
1. Log into the guest and check the qemu-guest-agent service
# virsh console rhel8.2
Connected to domain rhel8.2
Escape character is ^]
[root@bootp-73-227-181 ~]# systemctl status qemu-guest-agent
● qemu-guest-agent.service - QEMU Guest Agent
   Loaded: loaded (/usr/lib/systemd/system/qemu-guest-agent.service; disabled; >
   Active: active (running) since Wed 2020-03-18 12:19:09 CST; 2h 10min ago
 Main PID: 859 (qemu-ga)
    Tasks: 1
   Memory: 3.2M
   CGroup: /system.slice/qemu-guest-agent.service
           └─859 /usr/bin/qemu-ga --method=virtio-serial --path=/dev/virtio-por>

Mar 18 12:19:09 localhost.localdomain systemd[1]: Started QEMU Guest Agent.

2. Set a breakpoint on qmp_guest_get_fsinfo
# gdb -p `pidof qemu-ga`
.....
(gdb) br qmp_guest_get_fsinfo
Breakpoint 1 at 0x555accf67750
(gdb) c
Continuing.

3. Check the guest's filesystems from another terminal
# date; virsh domfsinfo rhel8.2; date
Wed Mar 18 02:53:18 EDT 2020      <==== virsh hangs indefinitely

4. Let qemu-guest-agent continue running, then set the timeout to -2
# virsh guest-agent-timeout rhel8.2 --timeout -2
# date; virsh domfsinfo rhel8.2; date
Wed Mar 18 02:55:58 EDT 2020      <==== virsh hangs indefinitely

5. Let qemu-guest-agent continue running, then set the timeout to -1
# virsh guest-agent-timeout rhel8.2 --timeout -1
# date; virsh domfsinfo rhel8.2; date
Wed Mar 18 02:59:47 EDT 2020
error: Unable to get filesystem information
error: Guest agent is not responding: Guest agent not available for now
Wed Mar 18 02:59:52 EDT 2020

Hence, the default guest agent timeout is 5 seconds.

6. Let qemu-guest-agent continue running, then set the timeout to 0
# virsh guest-agent-timeout rhel8.2 --timeout 0
# date; virsh domfsinfo rhel8.2; date
Wed Mar 18 03:00:29 EDT 2020
error: Unable to get filesystem information
error: Guest agent is not responding: Guest agent not available for now
Wed Mar 18 03:00:29 EDT 2020

7. Let qemu-guest-agent continue running, then set the timeout to 1
# virsh guest-agent-timeout rhel8.2 --timeout 1
# date; virsh domfsinfo rhel8.2; date
Wed Mar 18 03:06:54 EDT 2020
error: Unable to get filesystem information
error: Guest agent is not responding: Guest agent not available for now
Wed Mar 18 03:06:55 EDT 2020

8. Let qemu-guest-agent continue running, then set the timeout to 60
# virsh guest-agent-timeout rhel8.2 --timeout 60
# date; virsh domfsinfo rhel8.2; date
Wed Mar 18 03:24:53 EDT 2020
error: Unable to get filesystem information
error: Guest agent is not responding: Guest agent not available for now
Wed Mar 18 03:25:53 EDT 2020

9. Set the timeout to 2147483647
# virsh guest-agent-timeout rhel8.2 --timeout 2147483647

10. Try to set the timeout to other values out of range
# virsh guest-agent-timeout rhel8.2 --timeout 2147483648
error: Numeric value '2147483648' for <timeout> option is malformed or out of range
# virsh guest-agent-timeout rhel8.2 --timeout -3
error: invalid argument: guest agent timeout '-3' is less than the minimum '-2'
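The results above can be summarised as a small sketch, under stated assumptions: -2 waits forever, -1 means the 5-second default observed in step 5, 0 fails immediately, and a positive value is a number of seconds. Out-of-range input appears to be rejected in two places: the value 2147483648 while virsh parses the option, and -3 by libvirt's minimum check. libvirt's public header names the special values VIR_DOMAIN_AGENT_RESPONSE_TIMEOUT_BLOCK / _DEFAULT / _NOWAIT; the sketch uses local stand-ins so it compiles without libvirt, and parse_timeout_arg() only imitates, rather than reproduces, virsh's option parsing.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>

/* Local stand-ins for the special timeout values. */
enum {
    AGENT_TIMEOUT_BLOCK   = -2,
    AGENT_TIMEOUT_DEFAULT = -1,
    AGENT_TIMEOUT_NOWAIT  = 0,
};

/* Assumed from the ~5 s delay observed with --timeout -1. */
#define AGENT_DEFAULT_WAIT_SECS 5

/* Client side (hypothetical): parse option text into an int, rejecting
 * trailing junk and anything that does not fit, such as "2147483648". */
static bool parse_timeout_arg(const char *text, int *out)
{
    char *end = NULL;
    errno = 0;
    long v = strtol(text, &end, 10);
    if (errno != 0 || end == text || *end != '\0' ||
        v < INT_MIN || v > INT_MAX)
        return false;
    *out = (int)v;
    return true;
}

/* Daemon side (hypothetical): anything below -2 is invalid. */
static bool timeout_in_range(int timeout)
{
    return timeout >= AGENT_TIMEOUT_BLOCK;
}

/* Seconds to wait for a reply; -1 here means "no deadline at all". */
static int effective_wait_secs(int timeout)
{
    switch (timeout) {
    case AGENT_TIMEOUT_BLOCK:
        return -1;
    case AGENT_TIMEOUT_DEFAULT:
        return AGENT_DEFAULT_WAIT_SECS;
    case AGENT_TIMEOUT_NOWAIT:
        return 0;
    default:
        return timeout; /* positive: wait that many seconds */
    }
}
```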
Tested with:
libvirt-6.0.0-12.el8.x86_64
qemu-kvm-4.2.0-15.module+el8.2.0+6029+618ef2ec.x86_64
qemu-guest-agent-4.2.0-15.module+el8.2.0+6029+618ef2ec.x86_64

Scenario 2:
1. Prepare a guest without a guest agent
2. Set the agent timeout first
# virsh guest-agent-timeout rhel8.2 --timeout 35
3. Hotplug an agent device
# cat agent.xml
<channel type='unix'>
  <source mode='bind' />
  <target type='virtio' name='org.qemu.guest_agent.0'/>
  <address type='virtio-serial'/>
</channel>
# virsh attach-device rhel8.2 agent.xml
Device attached successfully

# virsh dumpxml rhel8.2 |grep agent -A3 -B1
    <channel type='unix'>
      <source mode='bind' path='/var/lib/libvirt/qemu/channel/target/domain-2-rhel8.2/org.qemu.guest_agent.0'/>
      <target type='virtio' name='org.qemu.guest_agent.0' state='connected'/>
      <alias name='channel1'/>
      <address type='virtio-serial' controller='0' bus='0' port='1'/>
    </channel>
4. Log into the guest and check the qemu-guest-agent service
# virsh console rhel8.2
Connected to domain rhel8.2
Escape character is ^]
[root@bootp-73-227-181 ~]# systemctl status qemu-guest-agent
● qemu-guest-agent.service - QEMU Guest Agent
   Loaded: loaded (/usr/lib/systemd/system/qemu-guest-agent.service; disabled; >
   Active: active (running) since Wed 2020-03-18 12:19:09 CST; 2h 10min ago
 Main PID: 859 (qemu-ga)
    Tasks: 1
   Memory: 3.2M
   CGroup: /system.slice/qemu-guest-agent.service
           └─859 /usr/bin/qemu-ga --method=virtio-serial --path=/dev/virtio-por>

Mar 18 12:19:09 localhost.localdomain systemd[1]: Started QEMU Guest Agent.
5. Set a breakpoint on qmp_guest_get_fsinfo
# gdb -p `pidof qemu-ga`
.....
(gdb) br qmp_guest_get_fsinfo
Breakpoint 1 at 0x555accf67750
(gdb) c
Continuing.
6. Check the guest's filesystems from another terminal
# date; virsh domfsinfo rhel8.2 ; date
Wed Mar 18 04:51:03 EDT 2020
error: Unable to get filesystem information
error: Guest agent is not responding: Guest agent not available for now
Wed Mar 18 04:51:38 EDT 2020

The agent timeout set before the agent existed takes effect once the agent appears: the call fails after 35 seconds, matching the configured value.
Tested with:
libvirt-6.0.0-12.el8.x86_64
qemu-kvm-4.2.0-15.module+el8.2.0+6029+618ef2ec.x86_64
qemu-guest-agent-4.2.0-15.module+el8.2.0+6029+618ef2ec.x86_64

Scenario 3: set a breakpoint on other guest agent commands
# gdb -p `pidof qemu-ga`
.....
(gdb) br qmp_guest_fsfreeze_freeze
Breakpoint 2 at 0x563594d64cd0
(gdb) c
Continuing.

Then set the agent timeout to the default value:
# virsh guest-agent-timeout rhel8.2 --timeout -1

Tested other guest agent commands:
1. domfsfreeze
# date; virsh domfsfreeze rhel8.2 ; date
Wed Mar 18 05:26:08 EDT 2020
error: Unable to freeze filesystems
error: Guest agent is not responding: Guest agent not available for now
Wed Mar 18 05:26:13 EDT 2020

2. domfsthaw
# date; virsh domfsthaw rhel8.2 ; date
Wed Mar 18 05:28:24 EDT 2020
error: Unable to thaw filesystems
error: Guest agent is not responding: Guest agent not available for now
Wed Mar 18 05:28:29 EDT 2020

3. domtime
# date; virsh domtime rhel8.2; date
Wed Mar 18 06:05:33 EDT 2020
error: Guest agent is not responding: Guest agent not available for now
Wed Mar 18 06:05:38 EDT 2020
# date; virsh domtime rhel8.2 --sync ; date
Wed Mar 18 22:16:21 EDT 2020
error: Guest agent is not responding: Guest agent not available for now
Wed Mar 18 22:16:26 EDT 2020

4. domfstrim
# date; virsh domfstrim rhel8.2; date
Wed Mar 18 06:06:48 EDT 2020
error: Unable to invoke fstrim
error: Guest agent is not responding: Guest agent not available for now
Wed Mar 18 06:06:53 EDT 2020

5. guestinfo
# date; virsh guestinfo rhel8.2 ; date
Wed Mar 18 08:06:14 EDT 2020
error: Guest agent is not responding: Guest agent not available for now
Wed Mar 18 08:06:19 EDT 2020
# date; virsh guestinfo rhel8.2 --user; date
Wed Mar 18 22:25:57 EDT 2020
error: Guest agent is not responding: Guest agent not available for now
Wed Mar 18 22:26:02 EDT 2020
# date; virsh guestinfo rhel8.2 --os; date
Wed Mar 18 22:27:25 EDT 2020
error: Guest agent is not responding: Guest agent not available for now
Wed Mar 18 22:27:30 EDT 2020
# date; virsh guestinfo rhel8.2 --timezone; date
Wed Mar 18 22:28:17 EDT 2020
error: Guest agent is not responding: Guest agent not available for now
Wed Mar 18 22:28:22 EDT 2020

6. guestvcpus
# date; virsh guestvcpus rhel8.2; date
Wed Mar 18 08:23:42 EDT 2020
error: Guest agent is not responding: Guest agent not available for now
Wed Mar 18 08:23:47 EDT 2020

7. setvcpus --guest
# date; virsh setvcpus rhel8.2 --guest --count 1 ; date
Wed Mar 18 08:49:55 EDT 2020
error: Guest agent is not responding: Guest agent not available for now
Wed Mar 18 08:50:00 EDT 2020

8. domhostname
# date; virsh domhostname rhel8.2 ; date
Wed Mar 18 22:03:29 EDT 2020
error: failed to get hostname
error: Guest agent is not responding: Guest agent not available for now
Wed Mar 18 22:03:34 EDT 2020

9. domifaddr --source agent
# date; virsh domifaddr rhel8.2 --source agent ; date
Wed Mar 18 22:20:38 EDT 2020
error: Failed to query for interfaces addresses
error: Guest agent is not responding: Guest agent not available for now
Wed Mar 18 22:20:43 EDT 2020

10. set-user-password
# date; virsh set-user-password rhel8.2 --user root --password 123456; date
Wed Mar 18 22:22:45 EDT 2020
error: Guest agent is not responding: Guest agent not available for now
Wed Mar 18 22:22:50 EDT 2020
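Every command above fails the same way once the deadline passes, which is consistent with a single shared "wait for the reply until a deadline" step. A minimal sketch of such a step follows; it swaps the condition-variable wait a daemon would more likely use for poll() on a file descriptor so that it stays single-threaded. The wait_for_reply() helper and its result codes are hypothetical, and interrupted waits (EINTR) are simply reported as failures here.

```c
#include <assert.h>
#include <limits.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical result codes for this sketch. */
enum wait_result { REPLY_READY, AGENT_UNRESPONSIVE, WAIT_FAILED };

/* Wait until the agent reply is readable on fd.
 * wait_secs < 0 blocks indefinitely, 0 checks once and returns,
 * n > 0 waits up to n seconds. A timeout corresponds to the
 * "Guest agent is not responding" error seen above. */
static enum wait_result wait_for_reply(int fd, int wait_secs)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int timeout_ms;

    if (wait_secs < 0)
        timeout_ms = -1;              /* poll() without a deadline */
    else if (wait_secs > INT_MAX / 1000)
        timeout_ms = INT_MAX;         /* avoid overflow for huge values */
    else
        timeout_ms = wait_secs * 1000;

    int rc = poll(&pfd, 1, timeout_ms);
    if (rc < 0)
        return WAIT_FAILED;           /* includes EINTR in this sketch */
    if (rc == 0)
        return AGENT_UNRESPONSIVE;    /* deadline passed, no reply */
    return REPLY_READY;
}
```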
Tested with:
libvirt-6.0.0-12.el8.x86_64
qemu-kvm-4.2.0-15.module+el8.2.0+6029+618ef2ec.x86_64
qemu-guest-agent-4.2.0-15.module+el8.2.0+6029+618ef2ec.x86_64

Scenario 4:
1. Log into the guest and set a breakpoint on qmp_guest_sync
[root@dell-per740-24 ~]# virsh console rhel8.2
Connected to domain rhel8.2
Escape character is ^]

Red Hat Enterprise Linux 8.2 Beta (Ootpa)
Kernel 4.18.0-187.el8.x86_64 on an x86_64

bootp-73-227-181 login: root
Password:
Last login: Thu Mar 19 11:06:45 on ttyS0
[root@bootp-73-227-181 ~]# gdb -p `pidof qemu-ga`
(gdb) br qmp_guest_sync
Breakpoint 1 at 0x5623b468a7a0
(gdb) c
Continuing.

Breakpoint 1, 0x00005623b468a7a0 in qmp_guest_sync ()

2. Set the agent timeout to 1s
# virsh guest-agent-timeout rhel8.2 --timeout 1

3. Try to get the guest's filesystems; this hits the breakpoint set above
# date; virsh domfsinfo rhel8.2 ; date
Wed Mar 18 23:24:00 EDT 2020
error: Unable to get filesystem information
error: Guest agent is not responding: Guest agent not available for now
Wed Mar 18 23:24:05 EDT 2020

4. Set the agent timeout to 10s
# virsh guest-agent-timeout rhel8.2 --timeout 10

5. Try to get the guest's filesystems; this hits the breakpoint set above
# date; virsh domfsinfo rhel8.2 ; date
Wed Mar 18 23:28:44 EDT 2020
error: Unable to get filesystem information
error: Guest agent is not responding: Guest agent not available for now
Wed Mar 18 23:28:49 EDT 2020

As can be seen:
1) when the agent timeout is set longer than 5s, the actual timeout for guest-sync is 5s. That is OK.
2) when the agent timeout is set shorter than 5s, the actual timeout for guest-sync is still 5s, which seems inconsistent with the patch description.
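libvirt sends a guest-sync command to resynchronize the agent channel before the actual command, and the results above suggest this step uses its own fixed 5-second wait. One possible shape of a fix, sketched under that assumption and not necessarily what the upstream patch does: keep 5 seconds as the upper bound for guest-sync, but honour a shorter non-negative user timeout. The constant and helper name are hypothetical.

```c
#include <assert.h>

/* Fixed guest-sync wait observed in the scenario above (assumption). */
#define GUEST_SYNC_DEFAULT_SECS 5

/* Hypothetical: seconds to wait for the guest-sync reply.
 * A shorter non-negative user timeout wins; -2 (block), -1 (default)
 * and anything >= 5 keep the 5-second bound, so a hung agent can
 * never stall the sync step forever. */
static int guest_sync_wait_secs(int configured)
{
    if (configured >= 0 && configured < GUEST_SYNC_DEFAULT_SECS)
        return configured;
    return GUEST_SYNC_DEFAULT_SECS;
}
```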
Another thing I am confused about:
1. Set a breakpoint on qmp_guest_shutdown
(gdb) br qmp_guest_shutdown
Breakpoint 3 at 0x5623b468e680
(gdb) c
Continuing.

Breakpoint 3, 0x00005623b468e680 in qmp_guest_shutdown ()

2. Try to shut down the guest in agent mode
# date; virsh shutdown rhel8.2 --mode agent; date
Wed Mar 18 23:38:11 EDT 2020
error: Failed to shutdown domain rhel8.2
error: Guest agent is not responding: Guest agent not available for now
Wed Mar 18 23:39:11 EDT 2020

The timeout for shutdown is 60s. There may be some reason for not applying the same timeout mechanism to the shutdown API; what is it?
Hi, Jonathon Please help to check Comment #17 and Comment #18. Thanks very much.
(In reply to Lili Zhu from comment #17)
> It can be seen,
> 1) when set agent timeout longer than 5s, the actual timeout for guest-sync is 5s. It is Okay.
> 2) when set agent timeout short than 5s, the actual timeout for guest-sync is still 5s, which seems to be not consistent with the patch description.

You're right. I've sent a small patch upstream to fix this bug.

(In reply to Lili Zhu from comment #18)
> The timeout for shutdown is 60s.
> There maybe some reason for not applying the same timeout mechanism to shutdown API, what's it?

Yes, the shutdown timeout is special because the qemu agent does not send a normal response message to this command. We basically just have to wait for the VM to exit.
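A minimal sketch of the special case described here, assuming the 60-second wait observed in the shutdown test is a fixed constant (libvirt-qemu.h defines VIR_DOMAIN_QEMU_AGENT_COMMAND_SHUTDOWN as 60; the sketch uses a local stand-in). Because guest-shutdown produces no reply, the hypothetical helpers below ignore the configured agent timeout and count only the VM's exit, not a response message, as success.

```c
#include <assert.h>
#include <stdbool.h>

/* Local stand-in for the fixed shutdown wait observed above. */
#define AGENT_SHUTDOWN_WAIT_SECS 60

/* Hypothetical outcomes of waiting after sending guest-shutdown. */
enum shutdown_wait_outcome {
    SHUTDOWN_WAIT_TIMED_OUT, /* VM still running when the wait ended */
    SHUTDOWN_WAIT_VM_EXITED  /* VM went away before the deadline */
};

/* The user-configured agent timeout is deliberately ignored here. */
static int shutdown_wait_secs(int configured_timeout)
{
    (void)configured_timeout;
    return AGENT_SHUTDOWN_WAIT_SECS;
}

/* No reply message is expected, so only the VM exiting counts. */
static bool shutdown_succeeded(enum shutdown_wait_outcome outcome)
{
    return outcome == SHUTDOWN_WAIT_VM_EXITED;
}
```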
Filed Bug 1823309 to track the issue in Comment #17. As the other testing results match the expected results, marking this bug as verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2017