Bug 1552092

Summary:	request to qemu-guest-agent prevent other calls to qemu monitor
Product:	Red Hat Enterprise Linux 7	Reporter:	Roman Hodain <rhodain>
Component:	libvirt	Assignee:	Michal Privoznik <mprivozn>
Status:	CLOSED ERRATA	QA Contact:	Yanqiu Zhang <yanqzhan>
Severity:	medium	Docs Contact:
Priority:	unspecified
Version:	7.4	CC:	fjin, gveitmic, michal.skrivanek, mkalinin, mprivozn, pkrempa, pstehlik, rhodain, xuzhang, yafu, yanqzhan
Target Milestone:	rc	Keywords:	Upstream
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	libvirt-4.5.0-1.el7	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2018-10-30 09:53:14 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1613514

Description Roman Hodain 2018-03-06 13:45:14 UTC

Description of problem:
When a qemu-guest-agent call it triggered (in this case fs freeze), qemu monitor calls cannot be triggered (in this case remoteDispatchConnectGetAllDomainStats)

Version-Release number of selected component (if applicable):

    libvirt-daemon-driver-qemu-3.2.0-14.el7_4.7.x86_64
    libvirt-daemon-3.2.0-14.el7_4.7.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Generate a lot of writes to storage on the guest in order to generated a log of dirty pages which has to be writtent to the disk.
2. Freeze the FS and try to get the stats from the monitor

Actual results:
The call to the monitor is delayed 

Expected results:
The call to the monitor is started immediately.

Additional info:
This issue was triggered in the from RHV.

vdsm logs show:
2018-03-06 13:06:19,644+0100 INFO  (jsonrpc/3) [virt.vm] (vmId='93451efa-fef2-4154-b44e-be1cb07e726c') Freezing guest filesystems (vm:3473)
...
2018-03-06 13:07:49,891+0100 INFO  (jsonrpc/3) [virt.vm] (vmId='93451efa-fef2-4154-b44e-be1cb07e726c') 2 guest filesystems frozen (vm:3488)

You can see that the operation took ~30 sec. Another thread is blocked during that time:

2018-03-06 13:06:37,712+0100 INFO  (vdsm.Scheduler) [Executor] Worker discarded: <Worker name=periodic/22 running <Task discardable <Operation action=<vdsm.virt.sampling.VMBulkSampler object at 0x38428d0> at 0x3858090> timeout=7.5, duration=7 at 0x3ce9a10> discarded task#=1004 at 0x3acbed0> (executor:348)

That causes the VM to be marked as non responding in the RHV portal.

The libvirt logs show that the Freeze was called:

2018-03-06 12:06:19.688+0000: 30714: debug : virJSONValueToString:1904 : result={"execute":"guest-fsfreeze-freeze"}
2018-03-06 12:06:19.688+0000: 30714: debug : qemuAgentCommand:1153 : Send command '{"execute":"guest-fsfreeze-freeze"}' for write, seconds = -2


Then the domain stats are called

2018-03-06 12:06:30.216+0000: 30715: debug : virThreadJobSet:96 : Thread 30715 (virNetServerHandleJob) is now running job remoteDispatchConnectGetAllDomainStats

and it times out because of the freeze FS

2018-03-06 12:07:00.216+0000: 30715: warning : qemuDomainObjBeginJobInternal:3827 : Cannot start job (query, none) for domain test-serial; current job is (modify, none) owned by (30714 remoteDispatchDomainFSFreeze, 0 <null>) for (40s, 0s)

During the time when the operation takes place the libvirt reports errors on domcontrol

    error: internal (locking) error

Comment 1 Peter Krempa 2018-03-06 14:14:32 UTC

Not sure we can do much about this. The 'modify' type job while the guest-agent is executing is correct, since the API is modifying state of the VM. On the other hand it should not be a problem to query the actual monitor in this case.

Comment 3 Daniel Berrangé 2018-03-06 14:21:58 UTC

It feels a bit strange to me that we need to prevent changes to the entire virDomainObjPtr while a guest agent command is taking place. 

No matter whether fsfreeze command passes/fails, it is not modifying any aspect of the QEMU guest state that we care about in virDomainObjPtr. All we do is check whether the guest is running or not. 

At least for the virDomainFSFreeze/Thaw APIs I don't see a strong reason why we need to hold the job lock for duration of the agent command in these cases. 

The virDomainSnapshotCreate API is harder though, as that's using the job lock to protect a whole lot of work related to the snapshot, not merely the agent command execution.

For virConnectGetAllDomainStats() perhaps we need to think about some concept like a flag added to the API "VIR_DOMAIN_GET_STATS_BEST_EFFORT" which would allow the app to tell libvirt to just silently not report stats that aren't quickly available due to job lock being held ?

Comment 6 Peter Krempa 2018-03-06 14:35:09 UTC

(In reply to Daniel Berrange from comment #3)
> For virConnectGetAllDomainStats() perhaps we need to think about some
> concept like a flag added to the API "VIR_DOMAIN_GET_STATS_BEST_EFFORT"
> which would allow the app to tell libvirt to just silently not report stats
> that aren't quickly available due to job lock being held ?

Well it's not as easy, since we don't annotate most of the places as allowing concurrent jobs (unless it's an async job). In this particular case it'd be possible since we are talking on a different monitor, but generally it's need some auditing.

Comment 7 Fangge Jin 2018-03-28 10:52:07 UTC

Reproduced with libvirt-3.9.0-14.el7_5.2.x86_64.

Steps:
1.Start a guest

2. Add memory stress in guest

3. Do snapshot and "virsh domstats":
# virsh snapshot-create-as foo  --memspec file=/tmp/sn12.mem --diskspec vda --name sn12 & date;virsh domstats foo;date
[1] 11353
Wed Mar 28 06:48:52 EDT 2018
Domain snapshot sn12 created
Domain: 'foo'
  state.state=1
  state.reason=5
  cpu.time=182688791670
  cpu.user=6480000000
  cpu.system=21650000000
  balloon.current=824000
  balloon.maximum=1024000
  balloon.swap_in=0
  balloon.swap_out=0
  balloon.major_fault=188
  balloon.minor_fault=161033
  balloon.unused=871608
  balloon.available=990712
  balloon.usable=819384
  balloon.last-update=1522230762
  balloon.rss=1055388
  vcpu.current=2
  vcpu.maximum=4
  vcpu.0.state=1
  vcpu.0.time=97670000000
  vcpu.0.wait=0
  vcpu.1.state=1
  vcpu.1.time=55470000000
  vcpu.1.wait=0
  net.count=1
  net.0.name=vnet0
  net.0.rx.bytes=2927519
  net.0.rx.pkts=4004
  net.0.rx.errs=0
  net.0.rx.drop=0
  net.0.tx.bytes=73963
  net.0.tx.pkts=1099
  net.0.tx.errs=0
  net.0.tx.drop=0
  block.count=1
  block.0.name=vda
  block.0.path=/nfs/RHEL-7.5-x86_64-latest.sn12
  block.0.rd.reqs=121401
  block.0.rd.bytes=3790876160
  block.0.rd.times=2379883463253
  block.0.wr.reqs=25336
  block.0.wr.bytes=5261135872
  block.0.wr.times=48958875859986
  block.0.fl.reqs=2818
  block.0.fl.times=22346022905
  block.0.allocation=0
  block.0.capacity=10737418240
  block.0.physical=200704

[1]+  Done                    virsh snapshot-create-as foo --memspec file=/tmp/sn12.mem --diskspec vda --name sn12

Comment 8 Michal Privoznik 2018-06-07 11:59:54 UTC

Patches proposed upstream:

https://www.redhat.com/archives/libvir-list/2018-June/msg00546.html

Comment 9 Michal Privoznik 2018-06-15 08:19:28 UTC

v2:

https://www.redhat.com/archives/libvir-list/2018-June/msg01251.html

Comment 10 Michal Privoznik 2018-06-19 05:12:28 UTC

To POST:

9486ed9071 (HEAD -> master, origin/master, origin/HEAD, domStats_v2) virsh: Introduce --nowait to domstats
0afbeb3740 Introduce VIR_CONNECT_GET_ALL_DOMAINS_STATS_NOWAIT
f38ea75b99 qemu_domain: Introduce qemuDomainObjBeginJobNowait
af0715bb61 qemuDomainObjBeginJobInternal: Remove spurious @ret assignment
dd92df93f4 qemu_domain: Document qemuDomainObjBeginJob

v4.4.0-275-g9486ed9071

Comment 12 Yanqiu Zhang 2018-07-17 14:58:32 UTC

Verify with:
libvirt-4.5.0-3.el7.x86_64
qemu-kvm-rhev-2.12.0-7.el7.x86_64

Steps to verify:
1.# virsh domstats --help|grep nowait
    domstats [--state]... [--nowait]...
    --nowait         report only stats that are accessible instantly

2. # man virsh
        domstats [--raw] [--enforce] [--backing] [--nowait] ...
...
           When collecting stats libvirtd may wait for some time if there's already another job running on given domain for it to finish.  This may cause unnecessary delay in delivering stats. Using
           --nowait suppresses this behaviour. On the other hand some statistics might be missing for such domain.

3.If --nowait not used, as comment 7, domstats will return info after a time interval, here it is about 17s, and domstats prints fully.
# virsh snapshot-create-as V-7.6  --memspec file=/tmp/sp4.mem --diskspec vda --name sp4 & date;virsh domstats V-7.6;date
[1] 19063
Tue Jul 17 02:51:00 EDT 2018
Domain snapshot sp4 created
Domain: 'V-7.6'
  state.state=1
  state.reason=5
  cpu.time=233607095188
  cpu.user=7820000000
  cpu.system=40160000000
  balloon.current=1048576
  balloon.maximum=1048576
  balloon.swap_in=0
  balloon.swap_out=0
  balloon.major_fault=192
  balloon.minor_fault=153997
  balloon.unused=897976
  balloon.available=1014976
  balloon.usable=848624
  balloon.last-update=1531809592
  balloon.rss=1195312
  vcpu.current=1
  vcpu.maximum=1
  vcpu.0.state=1
  vcpu.0.time=187580000000
  vcpu.0.wait=0
  net.count=1
  net.0.name=vnet0
  net.0.rx.bytes=15076943
  net.0.rx.pkts=4489
  net.0.rx.errs=0
  net.0.rx.drop=0
  net.0.tx.bytes=207189
  net.0.tx.pkts=3007
  net.0.tx.errs=0
  net.0.tx.drop=0
  block.count=1
  block.0.name=vda
  block.0.path=/var/lib/libvirt/images/V-7.6.sp4
  block.0.rd.reqs=171597
  block.0.rd.bytes=5586383872
  block.0.rd.times=29725084055
  block.0.wr.reqs=24804
  block.0.wr.bytes=7596663808
  block.0.wr.times=1540675454432
  block.0.fl.reqs=201
  block.0.fl.times=13847416259
  block.0.allocation=327680
  block.0.capacity=10737418240
  block.0.physical=266240

[1]+  Done                    virsh snapshot-create-as V-7.6 --memspec file=/tmp/sp4.mem --diskspec vda --name sp4
Tue Jul 17 02:51:17 EDT 2018


4.If use --nowait, the time interval is about 0s, and it report only stats that are accessible instantly.
# virsh snapshot-create-as V-7.6  --memspec file=/tmp/sp7.mem --diskspec vda --name sp7 & date;virsh domstats V-7.6 --nowait;date
[1] 22690
Tue Jul 17 04:36:59 EDT 2018
Domain: 'V-7.6'
  state.state=1
  state.reason=5
  cpu.time=1654444213784
  cpu.user=19350000000
  cpu.system=98160000000
  balloon.current=1048576
  balloon.maximum=1048576
  vcpu.current=1
  vcpu.maximum=1
  vcpu.0.state=1
  vcpu.0.time=1543480000000
  vcpu.0.wait=0
  net.count=1
  net.0.name=vnet0
  net.0.rx.bytes=15220637
  net.0.rx.pkts=7190
  net.0.rx.errs=0
  net.0.rx.drop=0
  net.0.tx.bytes=214497
  net.0.tx.pkts=3092
  net.0.tx.errs=0
  net.0.tx.drop=0
  block.count=1
  block.0.name=vda
  block.0.path=/var/lib/libvirt/images/V-7.6.sp6

Tue Jul 17 04:36:59 EDT 2018
[root@hp-*** qemu]# Domain snapshot sp7 created

[1]+  Done                    virsh snapshot-create-as V-7.6 --memspec file=/tmp/sp7.mem --diskspec vda --name sp7

5.Diff the qemuDomainObj msg in the two libvirtd.log between step3 and step4, the scenario without'--nowait' has following extra logs:
...
 debug : qemuDomainObjBeginJobInternal:6479 : Waiting for async job (vm=0x7fc8c0000b10 name=V-7.6)
…
 debug : qemuDomainObjBeginJobInternal:6507 : Started job: query (async=none vm=0x7fc8c0000b10 name=V-7.6)
 debug : qemuDomainObjEnterMonitorInternal:6899 : Entering monitor (mon=0x7fc8e0016f40 vm=0x7fc8c0000b10 name=V-7.6)
 debug : qemuDomainObjExitMonitorInternal:6922 : Exited monitor (mon=0x7fc8e0016f40 vm=0x7fc8c0000b10 name=V-7.6)
 debug : qemuDomainObjEnterMonitorInternal:6899 : Entering monitor (mon=0x7fc8e0016f40 vm=0x7fc8c0000b10 name=V-7.6)
 debug : qemuDomainObjExitMonitorInternal:6922 : Exited monitor (mon=0x7fc8e0016f40 vm=0x7fc8c0000b10 name=V-7.6)
 debug : qemuDomainObjEndJob:6783 : Stopping job: query (async=none vm=0x7fc8c0000b10 name=V-7.6)

Comment 15 Yanqiu Zhang 2018-08-06 06:32:47 UTC

Hi Roman,

The issue fixed by patches is verified in comment 12. Do you think it's acceptable and if a rhv bug is needed to apply this "--no-wait" option as default in rhv?

Thx.

Comment 16 Michal Privoznik 2018-08-06 07:18:30 UTC

(In reply to yanqzhan from comment #15)
> Hi Roman,
> 
> The issue fixed by patches is verified in comment 12. Do you think it's
> acceptable and if a rhv bug is needed to apply this "--no-wait" option as
> default in rhv?
> 
> Thx.

From libvirt's perspective it makes sense to use it. Esp. if vdsm is querying for stats every 2 seconds AND is brittle when it comes to hiccups. The reason why it isn't the default in libvirt is that it changes API semantic. If your application was okay with slightly delayed (but complete) stats, with an update all of a sudden you'd get an incomplete (but immediate) stats. That is the reason why we can not switch the default behaviour.

Comment 17 Yanqiu Zhang 2018-08-06 10:02:40 UTC

Thanks Michal.


Hi Pavel,

Pls refer to comment 16:
If RHV product needs complete stats, and can accept slightly delay, keep default call("virsh domstats") is ok;
If RHV product only needs a immediate return to make sure the guest is running well, but not the complete stats，a RHV bug is suggested to open to apply 'virsh domstats --nowait' as default call in RHV. 

Which one do you think is needed for RHV pls?

Thx.

Comment 18 Marina Kalinin 2018-08-07 19:33:12 UTC

So, I filed the RFE for RHV to consume this change: bz#1613514
And this is basically based on another urgent RHV bug:
https://bugzilla.redhat.com/show_bug.cgi?id=1572801#c28

But if you are asking, I am not sure as of today, once we have the fsfreeze problem resolved, we really need this fix in RHV.

Michal / Pavel / Roman / everyone else,
For your consideration:
https://bugzilla.redhat.com/show_bug.cgi?id=1613514

Comment 19 Michal Skrivanek 2018-08-08 04:56:59 UTC

(In reply to Michal Privoznik from comment #16)
> From libvirt's perspective it makes sense to use it. Esp. if vdsm is
> querying for stats every 2 seconds AND is brittle when it comes to hiccups.
> The reason why it isn't the default in libvirt is that it changes API
> semantic. If your application was okay with slightly delayed (but complete)
> stats, with an update all of a sudden you'd get an incomplete (but
> immediate) stats. That is the reason why we can not switch the default
> behaviour.

We cannot change core behavior based on such amvague description. How incomplete those stats could be? Which of them are optional? Will they be missing or contain stale values? What is a “slight delay”?

Comment 20 Michal Privoznik 2018-08-09 06:33:34 UTC

(In reply to Michal Skrivanek from comment #19)
> (In reply to Michal Privoznik from comment #16)
> > From libvirt's perspective it makes sense to use it. Esp. if vdsm is
> > querying for stats every 2 seconds AND is brittle when it comes to hiccups.
> > The reason why it isn't the default in libvirt is that it changes API
> > semantic. If your application was okay with slightly delayed (but complete)
> > stats, with an update all of a sudden you'd get an incomplete (but
> > immediate) stats. That is the reason why we can not switch the default
> > behaviour.
> 
> We cannot change core behavior based on such amvague description. How
> incomplete those stats could be? Which of them are optional? Will they be
> missing or contain stale values? What is a “slight delay”?

So whenever stats api is issued libvirt gathers info from various places, e.g. cgroups, /proc, ovs, etc. But some require talking to qemu monitor (e.g. balloon info). And for those we have to mutually exclude other threads which are trying to talk on monitor too. The flag I added merely does this: if we are fetching stats and as part of that we want to talk to qemu but there's already somebody else (possibly staled) skip to next family of stats.

Therefore, you might be missing stats that require talking to qemu.
Slight delay means (in the light of what I described above) the time it takes for stats querying thread to get to talk on the monitor to qemu.

BTW: as part of the fix for this bug I've broken down the qemu monitor lock into  two smaller locks to enhance parallelism. For instance, talking on guest agent monitor can now run in parallel with talking on qemu monitor.

Comment 21 Michal Skrivanek 2018-08-09 06:56:56 UTC

(In reply to Michal Privoznik from comment #20)
> So whenever stats api is issued libvirt gathers info from various places,
> e.g. cgroups, /proc, ovs, etc. But some require talking to qemu monitor
> (e.g. balloon info). And for those we have to mutually exclude other threads
> which are trying to talk on monitor too. The flag I added merely does this:
> if we are fetching stats and as part of that we want to talk to qemu but
> there's already somebody else (possibly staled) skip to next family of stats.

so it is omitted then, correct? If that concurrent monitor access gets stuck (as it does in the original bug) it's going to skip over indefinitely, I assume?

> Therefore, you might be missing stats that require talking to qemu.

These are the ones we use in oVirt:
    libvirt.VIR_DOMAIN_STATS_STATE |
    libvirt.VIR_DOMAIN_STATS_CPU_TOTAL |
    libvirt.VIR_DOMAIN_STATS_BALLOON |
    libvirt.VIR_DOMAIN_STATS_VCPU |
    libvirt.VIR_DOMAIN_STATS_INTERFACE |
    libvirt.VIR_DOMAIN_STATS_BLOCK

can you qualify them? Or refer to the right place where we can see what could potentially be missing and what we can depend on?

> Slight delay means (in the light of what I described above) the time it
> takes for stats querying thread to get to talk on the monitor to qemu.

right, but that's the same delay as today (when there's no concurrent monitor call), right? I assume we do not skip the monitor call when there's no conflict and so it's going to take the same time with or without the new flag.

> BTW: as part of the fix for this bug I've broken down the qemu monitor lock
> into  two smaller locks to enhance parallelism. For instance, talking on
> guest agent monitor can now run in parallel with talking on qemu monitor.

great, that could be quite a significant improvement. Does it apply to all the calls (with or without the new flag)?

Comment 22 Michal Privoznik 2018-08-09 07:54:49 UTC

(In reply to Michal Skrivanek from comment #21)
> (In reply to Michal Privoznik from comment #20)
> > So whenever stats api is issued libvirt gathers info from various places,
> > e.g. cgroups, /proc, ovs, etc. But some require talking to qemu monitor
> > (e.g. balloon info). And for those we have to mutually exclude other threads
> > which are trying to talk on monitor too. The flag I added merely does this:
> > if we are fetching stats and as part of that we want to talk to qemu but
> > there's already somebody else (possibly staled) skip to next family of stats.
> 
> so it is omitted then, correct? If that concurrent monitor access gets stuck
> (as it does in the original bug) it's going to skip over indefinitely, I
> assume?

Well, this was the reasoning for splitting the mutex. In the original report problem was that one mutex was shared between agent and monitor accesses. Therefore, when guest-fs-freeze was executed (which can take a long time to finish) all monitor accesses had to wait. Now that the mutex is split this is no longer the case. In addition to that, new flag was introduced to stats APIs so that even when a thread is stuck in monitor (this hasn't been observed though. yet) mgmt app can work around this by issuing the stats API with the flag set.

> 
> > Therefore, you might be missing stats that require talking to qemu.
> 
> These are the ones we use in oVirt:
>     libvirt.VIR_DOMAIN_STATS_STATE |
Always in.
>     libvirt.VIR_DOMAIN_STATS_CPU_TOTAL |
Always in.
>     libvirt.VIR_DOMAIN_STATS_BALLOON |
Might miss some fields that require monitor access (swap_in, swap_out, major_fault, minor_fault, unused, available, rss, last-update, usable, disk_caches)
>     libvirt.VIR_DOMAIN_STATS_VCPU |
Always in.
>     libvirt.VIR_DOMAIN_STATS_INTERFACE |
Always in.
>     libvirt.VIR_DOMAIN_STATS_BLOCK
Might be left out completely.
> 
> can you qualify them? Or refer to the right place where we can see what
> could potentially be missing and what we can depend on?

You want to look at qemuDomainGetStatsWorkers[] (in qemu_drier.c) and one by one inspect if the callbacks listed there require HAVE_JOB() in order to fetch stats. For instance, qemuDomainGetStatsState() doesn't but qemuDomainGetStatsBlock does.

> 
> > Slight delay means (in the light of what I described above) the time it
> > takes for stats querying thread to get to talk on the monitor to qemu.
> 
> right, but that's the same delay as today (when there's no concurrent
> monitor call), right? I assume we do not skip the monitor call when there's
> no conflict and so it's going to take the same time with or without the new
> flag.
> 

Correct.

> > BTW: as part of the fix for this bug I've broken down the qemu monitor lock
> > into  two smaller locks to enhance parallelism. For instance, talking on
> > guest agent monitor can now run in parallel with talking on qemu monitor.
> 
> great, that could be quite a significant improvement. Does it apply to all
> the calls (with or without the new flag)?

Yes it does.

Comment 24 errata-xmlrpc 2018-10-30 09:53:14 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:3113