Bug 1613514 - send --nowait to libvirt when we collect qemu stats, to consume bz#1552092
Summary: send --nowait to libvirt when we collect qemu stats, to consume bz#1552092
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 4.2.4
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ovirt-4.4.3
Target Release: 4.4.3
Assignee: Steven Rosenberg
QA Contact: Qin Yuan
URL:
Whiteboard:
Depends On: 1552092 1870500
Blocks:
 
Reported: 2018-08-07 17:34 UTC by Marina Kalinin
Modified: 2023-09-07 19:17 UTC
CC List: 17 users

Fixed In Version:
Doc Type: Enhancement
Doc Text:
This enhancement adds the ‘nowait’ option to the domain stats call so that collecting statistics does not block and cause non-responsiveness in VDSM. As a result, libvirt now receives the ‘nowait’ option when VDSM collects domain statistics.
Clone Of:
Environment:
Last Closed: 2020-11-24 13:09:18 UTC
oVirt Team: Virt
Target Upstream Version:
Embargoed:
qiyuan: needinfo-


Attachments
VDSM Log Created during Create Snapshot (20.95 KB, application/gzip)
2019-11-05 10:31 UTC, Steven Rosenberg
GuestOS console after continuing (12.06 KB, image/png)
2020-01-08 10:10 UTC, Steven Rosenberg
VDSM Log with test results (214.83 KB, application/gzip)
2020-01-08 18:37 UTC, Steven Rosenberg


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 3372411 0 None None None 2018-08-07 20:44:36 UTC
Red Hat Product Errata RHSA-2020:5179 0 None None None 2020-11-24 13:10:34 UTC
oVirt gerrit 104334 0 master MERGED vdsm: Added nowait flag to the domain stats 2021-02-02 18:26:06 UTC
oVirt gerrit 109649 0 None MERGED qga: add context for setting libvirt timeout 2021-02-02 18:26:08 UTC

Description Marina Kalinin 2018-08-07 17:34:27 UTC
This is a request to consider sending --nowait to libvirt when collecting qemu stats via qemu monitor. This --nowait option will be available with bz#1552092.

Why should RHV care?
As explained in KCS: https://access.redhat.com/solutions/3372411
Or in BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1552092

The qemu monitor will not reply while a qemu-guest-agent call is running.
This may cause an RHV guest to be reported as non-responsive in the RHV UI and prevent various operations on the guest, for instance opening a guest console.

It can also cause a live snapshot or live merge to fail or never succeed, as per bz#1572801 (see comment bz#1572801#28).

Comment 1 Marina Kalinin 2018-08-07 20:47:13 UTC
For additional details on this request, also check these comments:
https://bugzilla.redhat.com/show_bug.cgi?id=1552092#c15

~~~
If RHV product needs complete stats, and can accept slightly delay, keep default call("virsh domstats") is ok;
If RHV product only needs a immediate return to make sure the guest is running well, but not the complete stats,a RHV bug is suggested to open to apply 'virsh domstats --nowait' as default call in RHV. 
~~~

Comment 3 Ryan Barry 2019-01-21 14:54:07 UTC
Re-targeting to 4.3.1 since it is missing a patch, an acked blocker flag, or both

Comment 7 Steven Rosenberg 2019-10-29 13:53:14 UTC
I see you worked on this issue on the libvirt side [1].

It seems VDSM uses the value from the stats, as referenced in Michal Skrivanek's comments [2].

I added the nowait value to the bulk list of stats that are sent:

    libvirt.VIR_DOMAIN_STATS_STATE |
    libvirt.VIR_DOMAIN_STATS_CPU_TOTAL |
    libvirt.VIR_DOMAIN_STATS_BALLOON |
    libvirt.VIR_DOMAIN_STATS_VCPU |
    libvirt.VIR_DOMAIN_STATS_INTERFACE |
    libvirt.VIR_DOMAIN_STATS_BLOCK |
    libvirt.VIR_CONNECT_GET_ALL_DOMAINS_STATS_NOWAIT

I added it to the flags as well:

flags = libvirt.VIR_CONNECT_GET_ALL_DOMAINS_STATS_RUNNING | libvirt.VIR_CONNECT_GET_ALL_DOMAINS_STATS_NOWAIT 

I am not sure how to test it, but I find that when I create a snapshot in the engine (the advised mode for testing this) and try to edit the VM, the edit fails until the snapshot creation completes. Afterwards one can edit the VM. Maybe you can advise what I should expect and how to verify that libvirt received the correct value.

I do see that the nowait option is supported on my Host:

virsh domstats --help|grep nowait
    domstats [--state] [--cpu-total] [--balloon] [--vcpu] [--interface] [--block] [--perf] [--list-active] [--list-inactive] [--list-persistent] [--list-transient] [--list-running] [--list-paused] [--list-shutoff] [--list-other] [--raw] [--enforce] [--backing] [--nowait] [<domain>]...
    --nowait         report only stats that are accessible instantly

Versions:

[root@sla-sheldon vdsm]# yum list | grep libvirt
libvirt-bash-completion.x86_64           4.5.0-10.el7_6.12
libvirt-client.x86_64                    4.5.0-10.el7_6.12
libvirt-daemon.x86_64                    4.5.0-10.el7_6.12

yum list | grep qemu-kvm
qemu-kvm-common-ev.x86_64                10:2.12.0-18.el7_6.7.1
qemu-kvm-ev.x86_64                       10:2.12.0-18.el7_6.7.1
qemu-kvm.x86_64                          10:1.5.3-160.el7_6.3

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1552092
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1552092#c21

Comment 8 Michal Privoznik 2019-10-29 14:12:45 UTC
(In reply to Steven Rosenberg from comment #7)
>

As I've replied in e-mail you've sent me earlier, this might be a libvirt bug. Does 'virsh domstats --nowait $dom' get stuck too?

Comment 9 Steven Rosenberg 2019-10-29 15:09:06 UTC
(In reply to Michal Privoznik from comment #8)
> (In reply to Steven Rosenberg from comment #7)
> >
> 
> As I've replied in e-mail you've sent me earlier, this might be a libvirt
> bug. Does 'virsh domstats --nowait $dom' get stuck too?

No that seems to work:

virsh # domstats --nowait VM2

It returns the information very quickly, whether or not one specifies --nowait and whether or not a snapshot is being created.

Also, there is no blocking when the VM is running and we edit and save while creating a snapshot.

The issue is when the VM is not running and we create a snapshot and then attempt to edit and save.

Comment 10 Michal Privoznik 2019-10-30 13:37:56 UTC
(In reply to Steven Rosenberg from comment #9)
> (In reply to Michal Privoznik from comment #8)
> > (In reply to Steven Rosenberg from comment #7)
> > >
> > 
> > As I've replied in e-mail you've sent me earlier, this might be a libvirt
> > bug. Does 'virsh domstats --nowait $dom' get stuck too?
> 

Let's continue the discussion here instead of private e-mails. It's easier for others to find information this way.

> No that seems to work:
> 
> virsh # domstats --nowait VM2
> 
> Gives the information very quickly whether one specifies --nowait or not
> when creating the snapshot or not. 
> 

Ah, so this suggests that the bottleneck / deadlock does not occur in libvirt.

> Also there is no blocking when the VM is running and we edit and save while
> resting a snapshot.
> 
> The issue is when the VM is not running and we create a snapshot and then
> attempt to edit and save.

When a domain is not running, libvirt spawns qemu-img to create a snapshot. This may overload the host with I/O, which may result in longer response times.
However, we need to make sure this is the case. So when you're doing a snapshot on an offline domain, does 'virsh domstats' block too? And what about 'virsh dumpxml'?
What I'm trying to find out is where the bottleneck lies: whether libvirt is slow to respond, or vdsm has some internal lock that is preventing it from allowing a domain edit while a snapshot takes place.
To distinguish these two, we need to see how virsh / python bindings behave. If virsh is responsive then the problem lies in vdsm. If, however, virsh is not responsive then the problem lies in libvirt (it has its own locking which might be too coarse and prevent any other APIs from running on the same domain while a snapshot operation is in place). Or this may be purely the I/O overload problem - might be worth tracking that while creating the snapshot.

Comment 11 Steven Rosenberg 2019-10-31 11:44:53 UTC
(In reply to Michal Privoznik from comment #10)
> (In reply to Steven Rosenberg from comment #9)
> > (In reply to Michal Privoznik from comment #8)
> > > (In reply to Steven Rosenberg from comment #7)
> > > >
> > > 
> > > As I've replied in e-mail you've sent me earlier, this might be a libvirt
> > > bug. Does 'virsh domstats --nowait $dom' get stuck too?
> > 
> 
> Let's continue the discussion here instead of private e-mails. It's easier
> for others to find information this way.
> 
> > No that seems to work:
> > 
> > virsh # domstats --nowait VM2
> > 
> > Gives the information very quickly whether one specifies --nowait or not
> > when creating the snapshot or not. 
> > 
> 
> Ah, so this suggests that the bottleneck / deadlock does not occur in
> libvirt.
> 
> > Also there is no blocking when the VM is running and we edit and save while
> > resting a snapshot.
> > 
> > The issue is when the VM is not running and we create a snapshot and then
> > attempt to edit and save.
> 
> When a domain is not running then libvirt spawns qemu-img to create a
> snapshot. This may overload host with an I/O which may result in longer
> response times.
> However, we need to make sure this is the case. So when you're doing a
> snapshot on an offline domain, does 'virsh domstats' block too? And what
> about 'virsh dumpxml'?
> What I'm trying to find out is where the bottleneck lies. Whether libvirt is
> slow to respond or vdsm has some internal lock that is preventing it from
> allowing a domain edit whilst a snapshot takes place.
> To distinguish these two, we need to see how virsh / pythong bindings
> behave. If virsh is responsive then the problem lies in vdsm. If, however,
> virsh is not responsive then the problem lies in libvirt (it has its own
> locking which might be too rough and prevent any other APIs to run on the
> same domain while snaphot operation is in place). Or this may be purely the
> I/O overload problem - might be worth tracking that while creating the
> snapshot.

'virsh dumpxml' requires a domain, so it applies only when the VM is running, and in that case it does not block. It seems the locking is in the engine and not related to this change.
Thank you.

Comment 12 Steven Rosenberg 2019-11-04 12:53:32 UTC
(In reply to Michal Privoznik from comment #10)
> (In reply to Steven Rosenberg from comment #9)
> > (In reply to Michal Privoznik from comment #8)
> > > (In reply to Steven Rosenberg from comment #7)
> > > >
> > > 
> > > As I've replied in e-mail you've sent me earlier, this might be a libvirt
> > > bug. Does 'virsh domstats --nowait $dom' get stuck too?
> > 
> 
> Let's continue the discussion here instead of private e-mails. It's easier
> for others to find information this way.
> 
> > No that seems to work:
> > 
> > virsh # domstats --nowait VM2
> > 
> > Gives the information very quickly whether one specifies --nowait or not
> > when creating the snapshot or not. 
> > 
> 
> Ah, so this suggests that the bottleneck / deadlock does not occur in
> libvirt.
> 
> > Also there is no blocking when the VM is running and we edit and save while
> > resting a snapshot.
> > 
> > The issue is when the VM is not running and we create a snapshot and then
> > attempt to edit and save.
> 
> When a domain is not running then libvirt spawns qemu-img to create a
> snapshot. This may overload host with an I/O which may result in longer
> response times.
> However, we need to make sure this is the case. So when you're doing a
> snapshot on an offline domain, does 'virsh domstats' block too? And what
> about 'virsh dumpxml'?
> What I'm trying to find out is where the bottleneck lies. Whether libvirt is
> slow to respond or vdsm has some internal lock that is preventing it from
> allowing a domain edit whilst a snapshot takes place.
> To distinguish these two, we need to see how virsh / pythong bindings
> behave. If virsh is responsive then the problem lies in vdsm. If, however,
> virsh is not responsive then the problem lies in libvirt (it has its own
> locking which might be too rough and prevent any other APIs to run on the
> same domain while snaphot operation is in place). Or this may be purely the
> I/O overload problem - might be worth tracking that while creating the
> snapshot.

We would like to have a definitive way of testing this through VDSM, over and above just creating a snapshot, which seems to work the same with and without the flag.

What scenario would you recommend where, without the NOWAIT flag sent via VDSM, we would experience a wait, and with the flag there would be no wait?

Comment 13 Michal Privoznik 2019-11-05 08:27:30 UTC
(In reply to Steven Rosenberg from comment #12)
> We would like to have a definitive way of testing this through the VDSM over
> and above just creating a snapshot, which seems to work the same with and
> without the flag. 

Does vdsm use libvirt for creating snapshots? Or simply, how is a snapshot created? Can you share debug logs - I will try to make some sense out of them and come up with something.

> 
> What would you recommend where without the NOWAIT flags sent via the VDSM we
> would experience a wait and with the flags there would be no wait.

You mean, what are the expected timings for gathering statistics with and without the nowait flag?
Well, with the flag the API should return almost instantly, i.e. within a couple of tens of milliseconds (there is some data transfer involved after all).
Without the flag, the API can return almost instantly, but it can also return after a couple of tens of seconds (qemu is queried multiple times on the monitor, and it depends how quickly it responds; also the stats gathering thread will compete with other threads interested in the domain).

Comment 14 Steven Rosenberg 2019-11-05 10:31:03 UTC
Created attachment 1632921 [details]
VDSM Log Created during Create Snapshot

VDSM Log Created During Create Snapshot

Comment 15 Steven Rosenberg 2019-11-05 10:32:09 UTC
(In reply to Michal Privoznik from comment #13)
> (In reply to Steven Rosenberg from comment #12)
> > We would like to have a definitive way of testing this through the VDSM over
> > and above just creating a snapshot, which seems to work the same with and
> > without the flag. 
> 
> Does vdsm use libvirt for creating snapshots? Or simply, how is a snapshot
> created? Can you share debug logs - I will try to make some sense out of
> them and come up with something.
> 
> > 
> > What would you recommend where without the NOWAIT flags sent via the VDSM we
> > would experience a wait and with the flags there would be no wait.
> 
> You mean, what are expected timings for gathering statistics with and
> without nowait flag?
> Well, with the flag the API should return almost instantly, i.e. couple of
> tens of milliseconds (there is some data transfer involved after all).
> Without the flag, the API can return almost instantly, but it can also
> return after couple of tens of seconds (qemu is queried multiple times on
> the monitor, and it depends how quickly it will respond; also the stats
> gathering thread will compete with other threads interested in the domain).

Please see Comment 14 for the VDSM Log.

Comment 16 Michal Privoznik 2019-11-05 12:55:35 UTC
(In reply to Steven Rosenberg from comment #15)
> (In reply to Michal Privoznik from comment #13)
> > (In reply to Steven Rosenberg from comment #12)
> > > We would like to have a definitive way of testing this through the VDSM over
> > > and above just creating a snapshot, which seems to work the same with and
> > > without the flag. 
> > 
> > Does vdsm use libvirt for creating snapshots? Or simply, how is a snapshot
> > created? Can you share debug logs - I will try to make some sense out of
> > them and come up with something.
> > 
> > > 
> > > What would you recommend where without the NOWAIT flags sent via the VDSM we
> > > would experience a wait and with the flags there would be no wait.
> > 
> > You mean, what are expected timings for gathering statistics with and
> > without nowait flag?
> > Well, with the flag the API should return almost instantly, i.e. couple of
> > tens of milliseconds (there is some data transfer involved after all).
> > Without the flag, the API can return almost instantly, but it can also
> > return after couple of tens of seconds (qemu is queried multiple times on
> > the monitor, and it depends how quickly it will respond; also the stats
> > gathering thread will compete with other threads interested in the domain).
> 
> Please see Comment 14 for the VDSM Log.

I meant libvirtd logs, but it doesn't matter. So you're doing snapshots (I guess those are LVM snapshots?) and you want a way to test whether the NOWAIT flag gives any improvement during that? Well, as I've said earlier, you can use virsh to measure time spent in the API. But I'm afraid I don't know vdsm that closely to tell you what exactly you need to do.

Also, you claimed earlier that the domain doesn't exist (at libvirt level) when snapshot is being created. So I'm not sure I understand how NOWAIT would affect anything - the domain doesn't exist and hence libvirt can't get its stats.

Also, I've just caught it, in comment 7 you have the following piece of code:

    libvirt.VIR_DOMAIN_STATS_STATE |
    libvirt.VIR_DOMAIN_STATS_CPU_TOTAL |
    libvirt.VIR_DOMAIN_STATS_BALLOON |
    libvirt.VIR_DOMAIN_STATS_VCPU |
    libvirt.VIR_DOMAIN_STATS_INTERFACE |
    libvirt.VIR_DOMAIN_STATS_BLOCK |
    libvirt.VIR_CONNECT_GET_ALL_DOMAINS_STATS_NOWAIT

I don't know what it is exactly, but you shouldn't mix VIR_DOMAIN_* flags with VIR_CONNECT_* flags. I mean, the API call should look like this:

conn.getAllDomainStats(stats=libvirt.VIR_DOMAIN_STATS_STATE |
                             libvirt.VIR_DOMAIN_STATS_CPU_TOTAL,
                       flags=libvirt.VIR_CONNECT_GET_ALL_DOMAINS_STATS_NOWAIT)


Note that the VIR_CONNECT_GET_ALL_DOMAINS_STATS_NOWAIT value is not a valid stats value.
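
For reference, a minimal runnable sketch of this separation (the connection URI and the printed summary are illustrative; the stats values select what to collect, while the connect-level flags select how):

    import libvirt

    conn = libvirt.openReadOnly('qemu:///system')

    # What to collect: VIR_DOMAIN_STATS_* values go into 'stats'.
    stats = (libvirt.VIR_DOMAIN_STATS_STATE |
             libvirt.VIR_DOMAIN_STATS_CPU_TOTAL |
             libvirt.VIR_DOMAIN_STATS_BALLOON |
             libvirt.VIR_DOMAIN_STATS_VCPU |
             libvirt.VIR_DOMAIN_STATS_INTERFACE |
             libvirt.VIR_DOMAIN_STATS_BLOCK)

    # How to collect: VIR_CONNECT_GET_ALL_DOMAINS_STATS_* values go into 'flags'.
    flags = (libvirt.VIR_CONNECT_GET_ALL_DOMAINS_STATS_RUNNING |
             libvirt.VIR_CONNECT_GET_ALL_DOMAINS_STATS_NOWAIT)

    bulk_stats = conn.getAllDomainStats(stats=stats, flags=flags)
    for dom, values in bulk_stats:
        print(dom.name(), 'returned', len(values), 'stat keys')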

Comment 17 Steven Rosenberg 2019-11-06 08:45:53 UTC
(In reply to Michal Privoznik from comment #16)
> (In reply to Steven Rosenberg from comment #15)
> > (In reply to Michal Privoznik from comment #13)
> > > (In reply to Steven Rosenberg from comment #12)
> > > > We would like to have a definitive way of testing this through the VDSM over
> > > > and above just creating a snapshot, which seems to work the same with and
> > > > without the flag. 
> > > 
> > > Does vdsm use libvirt for creating snapshots? Or simply, how is a snapshot
> > > created? Can you share debug logs - I will try to make some sense out of
> > > them and come up with something.
> > > 
> > > > 
> > > > What would you recommend where without the NOWAIT flags sent via the VDSM we
> > > > would experience a wait and with the flags there would be no wait.
> > > 
> > > You mean, what are expected timings for gathering statistics with and
> > > without nowait flag?
> > > Well, with the flag the API should return almost instantly, i.e. couple of
> > > tens of milliseconds (there is some data transfer involved after all).
> > > Without the flag, the API can return almost instantly, but it can also
> > > return after couple of tens of seconds (qemu is queried multiple times on
> > > the monitor, and it depends how quickly it will respond; also the stats
> > > gathering thread will compete with other threads interested in the domain).
> > 
> > Please see Comment 14 for the VDSM Log.
> 
> I meant libvirtd logs, but it doesn't matter. So you're doing snapshots (I
> guess those are LVM snapshots?) and you want a way to test whether NOWAIT
> flag gives any improvement during that? Well, as I've said earlier, you can
> use virsh to measure time spent in the API. But I'm afraid I don't know vdsm
> that cosely to tell you what exactly you need to do.
> 
> Also, you claimed earlier that the domain doesn't exist (at libvirt level)
> when snapshot is being created. So I'm not sure I understand how NOWAIT
> would affect anything - the domain doesn't exist and hence libvirt can't get
> its stats.
> 
> Also, I've just caught it, in comment 7 you have the following piece of code:
> 
>     libvirt.VIR_DOMAIN_STATS_STATE |
>     libvirt.VIR_DOMAIN_STATS_CPU_TOTAL |
>     libvirt.VIR_DOMAIN_STATS_BALLOON |
>     libvirt.VIR_DOMAIN_STATS_VCPU |
>     libvirt.VIR_DOMAIN_STATS_INTERFACE |
>     libvirt.VIR_DOMAIN_STATS_BLOCK |
>     libvirt.VIR_CONNECT_GET_ALL_DOMAINS_STATS_NOWAIT
> 
> I don't know what it is exactly, but you shouldn't mix VIR_DOMAIN_* flags
> with VIR_CONNECT_* flags. I mean, the API call should look like this:
> 
> conn.getAllDomainStats(stats=libvirt.VIR_DOMAIN_STATS_STATE |
>                              libvirt.VIR_DOMAIN_STATS_CPU_TOTAL,
>                       
> flags=libvirt.VIR_CONNECT_GET_ALL_DOMAINS_STATS_NOWAIT)
> 
> 
> Note that VIR_CONNECT_GET_ALL_DOMAINS_STATS_NOWAIT value is not valid stats
> value.

Actually I tested it with both stats and flags; Milan confirmed we only need it in the flags.

Comment 30 Steven Rosenberg 2019-12-19 09:49:25 UTC
In reviewing the libvirt code, nowait seems to be used only to decide whether to perform a wait via the virCondWaitUntil function when another job is already pending, as per the nowait flag checks here:

https://github.com/libvirt/libvirt/blob/master/src/qemu/qemu_domain.c#L9411

and here:

https://github.com/libvirt/libvirt/blob/master/src/qemu/qemu_domain.c#L9420

when the functions that check for job statuses return false, such as here for the qemuDomainNestedJobAllowed function:

https://github.com/libvirt/libvirt/blob/master/src/qemu/qemu_domain.c#L9314-L9320


and here for the qemuDomainObjCanSetJob function:

https://github.com/libvirt/libvirt/blob/master/src/qemu/qemu_domain.c#L9328-L9337


Based upon this, with nowait the call just returns instead of performing a wait and retrying within the function.

Therefore testing was performed via competing jobs running simultaneously, as follows (a sketch of the polling loop appears after the list):

1. Running stress within the guest OS at 512M.
2. Running the domstats in a loop with nowait on the Host
3. Running the domstats in a loop without nowait on the Host
4. Running the create snapshot on the command line on the Host.
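
A minimal sketch of the polling half of this test, assuming the domain is named VM2, read-only access to qemu:///system, and a one-second interval (all illustrative):

    import time
    import libvirt

    conn = libvirt.openReadOnly('qemu:///system')
    dom = conn.lookupByName('VM2')

    # Poll the domain stats in a loop and print how long each call takes.
    # Run once with the NOWAIT flag and once with flags=0, and compare the
    # timings while a snapshot is being created on the host.
    while True:
        start = time.monotonic()
        conn.domainListGetStats(
            [dom],
            stats=libvirt.VIR_DOMAIN_STATS_STATE | libvirt.VIR_DOMAIN_STATS_BLOCK,
            flags=libvirt.VIR_CONNECT_GET_ALL_DOMAINS_STATS_NOWAIT)
        print('domstats returned in %.3f s' % (time.monotonic() - start))
        time.sleep(1)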

The results were that one time both domstats failed with the following errors:

error: Disconnected from qemu:///system due to end of file
error: End of file while reading data: Input/output error
31
error: failed to connect to the hypervisor
error: Failed to connect socket to '/var/run/libvirt/libvirt-sock-ro': Connection refused
32


Further testing succeeded without error. During the snapshot creation there was a visible pause in the domstats polling, with the nowait domstats recovering slightly sooner than the one without nowait, which would have been waiting in the wait function until the previous jobs ended, while the nowait call would have returned and been retried on the next poll.

Comment 31 Steven Rosenberg 2020-01-05 14:05:40 UTC
Further testing of nowait with libvirt:

1. Added the nowait code to the vdsm of two hosts.
2. Ran stress within the guest.
3. Ran domstats in a loop on the command line of both hosts.
4. Performed migrations between the hosts successfully.
5. Created a snapshot of the running VM successfully.
6. Cloned the VM from the snapshot successfully.

Note: These are locking operations, and no statistics were lost in a way that affected the functionality.

Further code review:

Within libvirt, the functionality for fetching all of the stats at once is the qemuConnectGetAllDomainStats function [1]. The HAVE_JOB macro only checks whether a job is already pending [2] [3]. The only difference between skipping the VM (not a given stat) until the next polling interval and waiting is whether we set nowait or not [4]. So to clarify, with nowait we do not skip an individual statistic; we skip all of the stats if there is a pending job.

In VDSM, the bulk stats supported are [5]:


libvirt.VIR_DOMAIN_STATS_STATE 
libvirt.VIR_DOMAIN_STATS_CPU_TOTAL 
libvirt.VIR_DOMAIN_STATS_BALLOON
libvirt.VIR_DOMAIN_STATS_VCPU 
libvirt.VIR_DOMAIN_STATS_INTERFACE
libvirt.VIR_DOMAIN_STATS_BLOCK

VDSM receives them all at once [6].

Therefore, the only issue is when all of the stats can be received and whether any statistic can change state in a way where a state is lost. If this can occur, it can occur either while skipping a given VM until the next polling time (with nowait) or during the wait time (without nowait). Therefore, nowait does not seem to be a factor in whether a state may change multiple times before its statistics are sent. If there is a race condition that could cause the loss of a state, the scenario would not depend on whether nowait is used.


[1] https://github.com/libvirt/libvirt/blob/master/src/qemu/qemu_driver.c#L21579
[2] https://github.com/libvirt/libvirt/blob/master/src/qemu/qemu_driver.c#L20609 
[3] https://github.com/libvirt/libvirt/blob/master/src/qemu/qemu_driver.c#L21631
[4] https://github.com/libvirt/libvirt/blob/master/src/qemu/qemu_driver.c#L21643-L21646
[5] https://github.com/oVirt/vdsm/blob/master/lib/vdsm/virt/sampling.py#L387-L394
[6] https://github.com/oVirt/vdsm/blob/master/lib/vdsm/virt/sampling.py#L428-L429

Comment 32 Milan Zamazal 2020-01-06 08:59:33 UTC
(In reply to Steven Rosenberg from comment #31)

> So to clarify, we do not skip
> an individual statistic with no wait, we skip all of the stats if there is a
> pending job.

I think libvirt actually still provides most of the stats that don't require the job. Gathering of the stats is basically the same, just some of the parts are skipped when QEMU_DOMAIN_STATS_HAVE_JOB is not present in domflags. So I assume we can receive incomplete stats and it's important to know what happens in such a case.

Comment 33 Steven Rosenberg 2020-01-06 09:05:37 UTC
(In reply to Milan Zamazal from comment #32)
> (In reply to Steven Rosenberg from comment #31)
> 
> > So to clarify, we do not skip
> > an individual statistic with no wait, we skip all of the stats if there is a
> > pending job.
> 
> I think libvirt actually still provides most of the stats that don't require
> the job. Gathering of the stats is basically the same, just some of the
> parts are skipped when QEMU_DOMAIN_STATS_HAVE_JOB is not present in
> domflags. So I assume we can receive incomplete stats and it's important to
> know what happens in such a case.

Well, it is not clear how to simulate the incomplete-stats scenario and, as per my previous comments, if it could happen it would happen either with or without nowait (where the wait would be within the function). If you have more insight on what test would simulate incomplete stats, please advise accordingly.

Comment 34 Milan Zamazal 2020-01-06 09:39:40 UTC
Michal, do you know about an easy way to make a job, either real or fake, that would block virConnectGetAllDomainStats for a significant amount of time?

Comment 35 Michal Skrivanek 2020-01-06 09:44:28 UTC
(In reply to Milan Zamazal from comment #32)
> I think libvirt actually still provides most of the stats that don't require
> the job. 

actually, from the list we use that's really just STATE, CPU_TOTAL, INTERFACE.
We would be missing VCPU, BALLOON, BLOCK. IIUC the code then skips all of them at once.

Comment 36 Milan Zamazal 2020-01-06 09:58:26 UTC
(In reply to Michal Skrivanek from comment #35)

> We would be missing VCPU, BALLOON, BLOCK. IIUC the code then all of them at
> once.

IIUC then not all of them, as already explained in the e-mail communication. But it's more important to first identify what happens when stats are missing generally and whether there are any stats that are absolutely required in Vdsm or Engine. Simply removing the stats above from the result obtained from the virConnectGetAllDomainStats call in Vdsm code might be a good start.
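
A rough sketch of that simulation (a hypothetical helper, not existing Vdsm code; the vcpu/balloon/block key prefixes follow comment 35 and only approximate what really disappears, as the later domstats output in this bug shows):

    def strip_job_dependent_stats(bulk_stats):
        # bulk_stats is the list of (virDomain, stats-dict) pairs returned by
        # getAllDomainStats; drop the vcpu/balloon/block keys to mimic an
        # incomplete result and see how Vdsm and Engine cope with it.
        reduced = []
        for dom, values in bulk_stats:
            kept = {key: val for key, val in values.items()
                    if not key.startswith(('vcpu.', 'balloon.', 'block.'))}
            reduced.append((dom, kept))
        return reduced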

Comment 37 Michal Privoznik 2020-01-06 10:25:41 UTC
(In reply to Milan Zamazal from comment #34)
> Michal, do you know about an easy way to make a job, either real or fake,
> that would block virConnectGetAllDomainStats for a significant amount of
> time?

I think sending SIGSTOP to the qemu process should simulate this well enough.
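
For reference, a minimal sketch of that simulation (assuming a single qemu-kvm process on the host; in practice one would pick the PID of the specific VM's process):

    import os
    import signal
    import subprocess

    # Freeze the qemu process so its monitor stops answering. Stats collection
    # without NOWAIT should then block on the domain job, while NOWAIT returns
    # only what is instantly accessible.
    pid = int(subprocess.check_output(['pidof', 'qemu-kvm']).split()[0])
    os.kill(pid, signal.SIGSTOP)
    # ... collect stats / attempt the operation under test here ...
    os.kill(pid, signal.SIGCONT)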

Comment 38 Steven Rosenberg 2020-01-07 19:04:21 UTC
(In reply to Michal Privoznik from comment #37)
> (In reply to Milan Zamazal from comment #34)
> > Michal, do you know about an easy way to make a job, either real or fake,
> > that would block virConnectGetAllDomainStats for a significant amount of
> > time?
> 
> I think sending SIGSTOP to the qemu process should simulate this well enough.

I tested this as follows:

1. I ran stress within the Guest OS.
2. I ran polling of the domain stats in a script.
3. I created a snapshot.
4. I stopped the qemu-kvm process.
5. I continued the qemu-kvm process.

What I found was that if I continue the process within a minute, the snapshot is created successfully. If I wait two minutes or more before continuing, the snapshot creation fails. After multiple failures, snapshot creation remains broken even after turning off the domain stats polling and the stress. I needed to stop the engine and the VM for snapshot creation to work properly.

Therefore, yes, stopping the qemu-kvm process for more than one minute does break the system.

Is further investigation (analysis of the logs to validate the cause) worthwhile? Is this a real-world scenario? I assume there are some timeouts involved which would account for much of the failures. The lack of recovery may be a concern, but this may be a state reset issue that could be time-consuming to narrow down.

Concerning the suggestion in Comment 36, I assume blocking some stats would break functionality such as creating snapshots. The question may be what we would do to compensate other than to just time out and fail the task (such as failing the create snapshot if that is the functionality under test).

Comment 39 Milan Zamazal 2020-01-08 08:34:58 UTC
Steven, the point to check is not what bad happens when a QEMU process is stopped, but what happens in Vdsm & Engine when some stats are missing as discussed above. Stopping a QEMU process or another trick should simulate the missing stats, which should ease the examination. The things to check are e.g. whether there is some crash in Vdsm due to a missing stats item, broken handling of the stats in Engine, failing storage functionality, etc.

Comment 40 Steven Rosenberg 2020-01-08 10:04:28 UTC
(In reply to Milan Zamazal from comment #39)
> Steven, the point to check is not what bad happens when a QEMU process is
> stopped, but what happens in Vdsm & Engine when some stats are missing as
> discussed above. Stopping a QEMU process or another trick should simulate
> the missing stats, which should ease the examination. The things to check
> are e.g. whether there is some crash in Vdsm due to a missing stats item,
> broken handling of the stats in Engine, failing storage functionality, etc.

OK, so as per my previous comment, the failure in the vdsm is due to a timeout:

2020-01-08 11:53:13,900+0200 ERROR (jsonrpc/7) [virt.vm] (vmId='5a4bdbf4-7cf1-4631-912e-689ec80e8eab') Unable to take snapshot (vm:4167)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/virt/virdomain.py", line 101, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/common/libvirtconnection.py", line 131, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/common/function.py", line 94, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python3.6/site-packages/libvirt.py", line 2620, in snapshotCreateXML
    if ret is None:raise libvirtError('virDomainSnapshotCreateXML() failed', dom=self)
libvirt.libvirtError: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchConnectGetAllDomainStats)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 4164, in snapshot
    self._dom.snapshotCreateXML(snapxml, snapFlags)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/virdomain.py", line 109, in f
    raise toe
vdsm.virt.virdomain.TimeoutError: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchConnectGetAllDomainStats)
2020-01-08 11:53:13,900+0200 INFO  (jsonrpc/7) [virt.vm] (vmId='5a4bdbf4-7cf1-4631-912e-689ec80e8eab') Enabling drive monitoring (drivemonitor:56)
2020-01-08 11:53:13,901+0200 INFO  (jsonrpc/7) [vdsm.api] START teardownImage(sdUUID='797bd8a1-625d-443d-b10f-cec4ed74b185', spUUID='baed6eba-1f55-11ea-9ef1-8c16450ea519', imgUUID='9a524772-e3df-4108-b3c2-8539b6901b4f', volUUID=None) from=::ffff:10.35.0.142,39560, flow_id=f5b3bd03-e282-4c23-83f8-7992f9302180, task_id=0df34609-f590-43fc-9e35-4d57b0d6588d (api:48)

The errors in the engine reflect a failure after that:

2020-01-08 11:53:13,906+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SnapshotVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-61) [f5b3bd03-e282-4c23-83f8-7992f9302180] Failed in 'SnapshotVDS' method
2020-01-08 11:53:13,921+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-61) [f5b3bd03-e282-4c23-83f8-7992f9302180] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM Host1 command SnapshotVDS failed: Snapshot failed



Then the snapshot creation is aborted.

Comment 41 Steven Rosenberg 2020-01-08 10:10:35 UTC
Created attachment 1650628 [details]
GuestOS console after continuing

The attached screenshot reflects the time the process was inactive between when the stop was performed and the continue. The continue was run after the create snapshot failed.

Also, this is what the domain stats return when the qemu-kvm process is stopped:

virsh -r domstats VMTest --nowait
Domain: 'VMTest'
  state.state=1
  state.reason=5
  cpu.time=8619153066141
  cpu.user=1420300000000
  cpu.system=1300930000000
  balloon.current=1048576
  balloon.maximum=1048576
  vcpu.current=2
  vcpu.maximum=32
  vcpu.0.state=1
  vcpu.0.time=2630320000000
  vcpu.0.wait=0
  vcpu.1.state=1
  vcpu.1.time=2714580000000
  vcpu.1.wait=0
  net.count=1
  net.0.name=vnet0
  net.0.rx.bytes=60520571
  net.0.rx.pkts=902380
  net.0.rx.errs=0
  net.0.rx.drop=0
  net.0.tx.bytes=309121
  net.0.tx.pkts=3711
  net.0.tx.errs=0
  net.0.tx.drop=0
  block.count=2
  block.0.name=sdc
  block.1.name=sda
  block.1.path=/rhev/data-center/mnt/vserver-spider.eng.lab.tlv.redhat.com:_pub_srosenbe_engine__master/797bd8a1-625d-443d-b10f-cec4ed74b185/images/040e79d4-5f62-4a52-b5fd-6dd8afbff18c/5b01f998-c7e7-4e66-89f2-f5475b34b147


It does not look like any of the statistics were lost during the time the process was stopped.

Comment 42 Steven Rosenberg 2020-01-08 10:11:41 UTC
(In reply to Steven Rosenberg from comment #41)
> Created attachment 1650628 [details]
> GuestOS console after continuing
> 
> The attached screen shot reflects the time the process was inactive between
> the time the stop was performed and the continue. The continue was run after
> the create snapshot failed.
> 
> Also this is what the domain stats returns when the qeum-kvm process is
> stopped:
> 
> virsh -r domstats VMTest --nowait
> Domain: 'VMTest'
>   state.state=1
>   state.reason=5
>   cpu.time=8619153066141
>   cpu.user=1420300000000
>   cpu.system=1300930000000
>   balloon.current=1048576
>   balloon.maximum=1048576
>   vcpu.current=2
>   vcpu.maximum=32
>   vcpu.0.state=1
>   vcpu.0.time=2630320000000
>   vcpu.0.wait=0
>   vcpu.1.state=1
>   vcpu.1.time=2714580000000
>   vcpu.1.wait=0
>   net.count=1
>   net.0.name=vnet0
>   net.0.rx.bytes=60520571
>   net.0.rx.pkts=902380
>   net.0.rx.errs=0
>   net.0.rx.drop=0
>   net.0.tx.bytes=309121
>   net.0.tx.pkts=3711
>   net.0.tx.errs=0
>   net.0.tx.drop=0
>   block.count=2
>   block.0.name=sdc
>   block.1.name=sda
>  
> block.1.path=/rhev/data-center/mnt/vserver-spider.eng.lab.tlv.redhat.com:
> _pub_srosenbe_engine__master/797bd8a1-625d-443d-b10f-cec4ed74b185/images/
> 040e79d4-5f62-4a52-b5fd-6dd8afbff18c/5b01f998-c7e7-4e66-89f2-f5475b34b147
> 
> 
> It does not look like any of the statistics were lost during the time the
> process was stopped.


Please refer to this information as well.

Comment 43 Milan Zamazal 2020-01-08 10:23:11 UTC
(In reply to Steven Rosenberg from comment #41)

> Also this is what the domain stats returns when the qeum-kvm process is
> stopped:
> 
> virsh -r domstats VMTest --nowait
> Domain: 'VMTest'
>   state.state=1
>   state.reason=5
>   cpu.time=8619153066141
>   cpu.user=1420300000000
>   cpu.system=1300930000000
>   balloon.current=1048576
>   balloon.maximum=1048576
>   vcpu.current=2
>   vcpu.maximum=32
>   vcpu.0.state=1
>   vcpu.0.time=2630320000000
>   vcpu.0.wait=0
>   vcpu.1.state=1
>   vcpu.1.time=2714580000000
>   vcpu.1.wait=0
>   net.count=1
>   net.0.name=vnet0
>   net.0.rx.bytes=60520571
>   net.0.rx.pkts=902380
>   net.0.rx.errs=0
>   net.0.rx.drop=0
>   net.0.tx.bytes=309121
>   net.0.tx.pkts=3711
>   net.0.tx.errs=0
>   net.0.tx.drop=0
>   block.count=2
>   block.0.name=sdc
>   block.1.name=sda
>  
> block.1.path=/rhev/data-center/mnt/vserver-spider.eng.lab.tlv.redhat.com:
> _pub_srosenbe_engine__master/797bd8a1-625d-443d-b10f-cec4ed74b185/images/
> 040e79d4-5f62-4a52-b5fd-6dd8afbff18c/5b01f998-c7e7-4e66-89f2-f5475b34b147

Great, so the stats I suspected as missing in the preceding e-mail communication are indeed missing. Now the question is whether it can cause any problem, see Comment 39. What happens to snapshots is not important here, unless it is related to stats missing in the report.

Comment 44 Steven Rosenberg 2020-01-08 10:28:35 UTC
(In reply to Milan Zamazal from comment #43)
> (In reply to Steven Rosenberg from comment #41)
> 
> > Also this is what the domain stats returns when the qeum-kvm process is
> > stopped:
> > 
> > virsh -r domstats VMTest --nowait
> > Domain: 'VMTest'
> >   state.state=1
> >   state.reason=5
> >   cpu.time=8619153066141
> >   cpu.user=1420300000000
> >   cpu.system=1300930000000
> >   balloon.current=1048576
> >   balloon.maximum=1048576
> >   vcpu.current=2
> >   vcpu.maximum=32
> >   vcpu.0.state=1
> >   vcpu.0.time=2630320000000
> >   vcpu.0.wait=0
> >   vcpu.1.state=1
> >   vcpu.1.time=2714580000000
> >   vcpu.1.wait=0
> >   net.count=1
> >   net.0.name=vnet0
> >   net.0.rx.bytes=60520571
> >   net.0.rx.pkts=902380
> >   net.0.rx.errs=0
> >   net.0.rx.drop=0
> >   net.0.tx.bytes=309121
> >   net.0.tx.pkts=3711
> >   net.0.tx.errs=0
> >   net.0.tx.drop=0
> >   block.count=2
> >   block.0.name=sdc
> >   block.1.name=sda
> >  
> > block.1.path=/rhev/data-center/mnt/vserver-spider.eng.lab.tlv.redhat.com:
> > _pub_srosenbe_engine__master/797bd8a1-625d-443d-b10f-cec4ed74b185/images/
> > 040e79d4-5f62-4a52-b5fd-6dd8afbff18c/5b01f998-c7e7-4e66-89f2-f5475b34b147
> 
> Great, so the stats I suspected as missing in the preceding e-mail
> communication are indeed missing. Now the question is whether it can cause
> any problem, see Comment 39. What happens to snapshots is not important
> here, unless it is related to stats missing in the report.



OK, I see that with nowait, stats are missing. This is the output without nowait; when qemu-kvm is stopped, domstats does not return until after the continue:

[root@sla-sheldon sampling]# virsh -r domstats VMTest
Domain: 'VMTest'
  state.state=1
  state.reason=5
  cpu.time=10408549977943
  cpu.user=1487130000000
  cpu.system=1336560000000
  balloon.current=1048576
  balloon.maximum=1048576
  balloon.swap_in=13119180
  balloon.swap_out=11624432
  balloon.major_fault=468937
  balloon.minor_fault=14331684
  balloon.unused=61376
  balloon.available=769204
  balloon.usable=22532
  balloon.last-update=1578478802
  balloon.rss=1030260
  vcpu.current=2
  vcpu.maximum=32
  vcpu.0.state=1
  vcpu.0.time=3472570000000
  vcpu.0.wait=0
  vcpu.1.state=1
  vcpu.1.time=3557100000000
  vcpu.1.wait=0
  net.count=1
  net.0.name=vnet0
  net.0.rx.bytes=61667900
  net.0.rx.pkts=917935
  net.0.rx.errs=0
  net.0.rx.drop=0
  net.0.tx.bytes=322033
  net.0.tx.pkts=3861
  net.0.tx.errs=0
  net.0.tx.drop=0
  block.count=2
  block.0.name=sdc
  block.0.rd.reqs=18
  block.0.rd.bytes=404
  block.0.rd.times=40481
  block.0.wr.reqs=0
  block.0.wr.bytes=0
  block.0.wr.times=0
  block.0.fl.reqs=0
  block.0.fl.times=0
  block.0.allocation=0
  block.0.physical=0
  block.1.name=sda
  block.1.path=/rhev/data-center/mnt/vserver-spider.eng.lab.tlv.redhat.com:_pub_srosenbe_engine__master/797bd8a1-625d-443d-b10f-cec4ed74b185/images/040e79d4-5f62-4a52-b5fd-6dd8afbff18c/5b01f998-c7e7-4e66-89f2-f5475b34b147
  block.1.rd.reqs=511482
  block.1.rd.bytes=16216216576
  block.1.rd.times=2589150671134
  block.1.wr.reqs=37502
  block.1.wr.bytes=11995513856
  block.1.wr.times=7144934782907
  block.1.fl.reqs=10704940
  block.1.fl.times=133231216335
  block.1.allocation=13293400064
  block.1.capacity=16106127360
  block.1.physical=3502891008


So nowait does return immediately, with limited stats.

The result is that if the continue is performed within, say, a minute, the process does seem to recover and the missing stats do not seem to break the system.
Once there is a timeout, the create snapshot is aborted. So this seems to answer Comment 39.

Comment 45 Milan Zamazal 2020-01-08 10:39:48 UTC
(In reply to Steven Rosenberg from comment #44)

> The results are if the continue is performed within say a minute, the
> process does seem to recover and the missing stats do not seem to break the
> system.

Could you please elaborate on how exactly you have checked that nothing breaks? We must be careful; some problems may occur after a longer time, and the storage team should confirm that the missing block stats can't cause problems with storage handling.

Comment 46 Steven Rosenberg 2020-01-08 12:56:00 UTC
(In reply to Milan Zamazal from comment #45)
> (In reply to Steven Rosenberg from comment #44)
> 
> > The results are if the continue is performed within say a minute, the
> > process does seem to recover and the missing stats do not seem to break the
> > system.
> 
> Could you please elaborate on how exactly you have checked that nothing
> breaks? We must be careful, some problems may occur after longer time and
> the storage team should confirm that the missing block stats can't cause
> problems with storage handling.

Could you review Comment 44 and verify whether the block stats missing from the nowait return (which do appear in the domain stats when nowait is not used) will have any effect on storage handling, as per Milan's request?

Comment 47 Steven Rosenberg 2020-01-08 18:32:18 UTC
(In reply to Milan Zamazal from comment #45)
> (In reply to Steven Rosenberg from comment #44)
> 
> > The results are if the continue is performed within say a minute, the
> > process does seem to recover and the missing stats do not seem to break the
> > system.
> 
> Could you please elaborate on how exactly you have checked that nothing
> breaks? We must be careful, some problems may occur after longer time and
> the storage team should confirm that the missing block stats can't cause
> problems with storage handling.

OK, it is not clear what a long time is; maybe you can elaborate for further testing.

This is the testing I performed:

1. I modified the vdsm's VMBulkstatsMonitor function to print the return stats after the getAllDomainStats function call accordingly [1]
2. I checked the vdsm log file and verified that the stats are printing normally via the new log [2] 
3. I stopped the qemu-kvm process
4. I observe that there are warnings [3] 
5. I observe that the nowait does return a sub set of stats [4] 
6. After a few minutes I allow the qemu-kvm process to continue.
7. I observe that the normal amount of stats are printed to the log again [5]
8. I create a snapshot which succeeds.
9. I repeat the procedure with similar results.
10. I delete the snapshot which succeeds.
11. I repeated the previous test from Comment 38, stopping the qemu-kvm process during the creation of a snapshot and continuing within a minute. 
12. The create snapshot succeeds.
13. During the time the qemu-kvm process was stopped, no stats are received by the vdsm [6]
14. Reviewing the logs after the qemu-kvm process was continued, the stats return to normal [7]

Therefore the testing shows that the partial stats [4] do not seem to break the system; they are received and processed while the qemu-kvm process is down.

When the VM is locked and performing a process, it seems no stats are received at all [6]

[1]  bulk_stats = self._conn.getAllDomainStats(
         stats=self._stats_types, flags=flags)
     self._log.info("DEBUGGING STATS: bulk stats: %r...", bulk_stats)

[2] 2020-01-08 19:38:54,840+0200 INFO  (periodic/1) [virt.sampling.VMBulkstatsMonitor] DEBUGGING STATS: bulk stats: [(<libvirt.virDomain object at 0x7f2c940a6828>, {'state.state': 1, 'state.reason': 5, 'cpu.time': 66454537207418, 'cpu.user': 4160670000000, 'cpu.system': 2966690000000, 'balloon.current': 1048576, 'balloon.maximum': 1048576, 'balloon.swap_in': 108147860, 'balloon.swap_out': 87138392, 'balloon.major_fault': 4170222, 'balloon.minor_fault': 48919030, 'balloon.unused': 55788, 'balloon.available': 769204, 'balloon.usable': 24136, 'balloon.last-update': 1578505134, 'balloon.rss': 1035940, 'vcpu.current': 2, 'vcpu.maximum': 32, 'vcpu.0.state': 1, 'vcpu.0.time': 29611820000000, 'vcpu.0.wait': 0, 'vcpu.1.state': 1, 'vcpu.1.time': 29697400000000, 'vcpu.1.wait': 0, 'net.count': 1, 'net.0.name': 'vnet0', 'net.0.rx.bytes': 98674451, 'net.0.rx.pkts': 1367088, 'net.0.rx.errs': 0, 'net.0.rx.drop': 0, 'net.0.tx.bytes': 492660, 'net.0.tx.pkts': 5975, 'net.0.tx.errs': 0, 'net.0.tx.drop': 0, 'block.count': 2, 'block.0.name': 'sdc', 'block.0.rd.reqs': 18, 'block.0.rd.bytes': 404, 'block.0.rd.times': 40481, 'block.0.wr.reqs': 0, 'block.0.wr.bytes': 0, 'block.0.wr.times': 0, 'block.0.fl.reqs': 0, 'block.0.fl.times': 0, 'block.0.allocation': 0, 'block.0.physical': 0, 'block.1.name': 'sda', 'block.1.path': '/rhev/data-center/mnt/vserver-spider.eng.lab.tlv.redhat.com:_pub_srosenbe_engine__master/797bd8a1-625d-443d-b10f-cec4ed74b185/images/040e79d4-5f62-4a52-b5fd-6dd8afbff18c/1c69f0dd-6a61-495b-9e2d-5ebc3b9d9ea3', 'block.1.rd.reqs': 4426791, 'block.1.rd.bytes': 127238597632, 'block.1.rd.times': 19482466902774, 'block.1.wr.reqs': 254574, 'block.1.wr.bytes': 89391591936, 'block.1.wr.times': 40078326356706, 'block.1.fl.reqs': 100186075, 'block.1.fl.times': 486525545003, 'block.1.allocation': 303038464, 'block.1.capacity': 16106127360, 'block.1.physical': 304177152})]... (sampling:425)


[3] 2020-01-08 19:40:24,690+0200 WARN  (qgapoller/3) [virt.periodic.VmDispatcher] could not run <function QemuGuestAgentPoller.start.<locals>.per_vm_operation.<locals>.<lambda> at 0x7f2cbc063a60> on ['5a4bdbf4-7cf1-4631-912e-689ec80e8eab'] (periodic:289)
2020-01-08 19:40:24,704+0200 WARN  (qgapoller/0) [virt.periodic.VmDispatcher] could not run <function QemuGuestAgentPoller.start.<locals>.per_vm_operation.<locals>.<lambda> at 0x7f2cbc0639d8> on ['5a4bdbf4-7cf1-4631-912e-689ec80e8eab'] (periodic:289)
2020-01-08 19:40:24,710+0200 WARN  (qgapoller/1) [virt.periodic.VmDispatcher] could not run <function QemuGuestAgentPoller.start.<locals>.per_vm_operation.<locals>.<lambda> at 0x7f2cbc063ae8> on ['5a4bdbf4-7cf1-4631-912e-689ec80e8eab'] (periodic:289)
2020-01-08 19:40:24,821+0200 WARN  (periodic/1) [virt.periodic.VmDispatcher] could not run <class 'vdsm.virt.periodic.UpdateVolumes'> on ['5a4bdbf4-7cf1-4631-912e-689ec80e8eab'] (periodic:289)


[4] 2020-01-08 19:40:39,837+0200 INFO  (periodic/1) [virt.sampling.VMBulkstatsMonitor] DEBUGGING STATS: bulk stats: [(<libvirt.virDomain object at 0x7f2cbc02e9b0>, {'state.state': 1, 'state.reason': 5, 'cpu.time': 66605030190665, 'cpu.user': 4172660000000, 'cpu.system': 2972800000000, 'balloon.current': 1048576, 'balloon.maximum': 1048576, 'vcpu.current': 2, 'vcpu.maximum': 32, 'vcpu.0.state': 1, 'vcpu.0.time': 29678780000000, 'vcpu.0.wait': 0, 'vcpu.1.state': 1, 'vcpu.1.time': 29764360000000, 'vcpu.1.wait': 0, 'net.count': 1, 'net.0.name': 'vnet0', 'net.0.rx.bytes': 98765165, 'net.0.rx.pkts': 1368498, 'net.0.rx.errs': 0, 'net.0.rx.drop': 0, 'net.0.tx.bytes': 493062, 'net.0.tx.pkts': 5980, 'net.0.tx.errs': 0, 'net.0.tx.drop': 0, 'block.count': 2, 'block.0.name': 'sdc', 'block.1.name': 'sda', 'block.1.path': '/rhev/data-center/mnt/vserver-spider.eng.lab.tlv.redhat.com:_pub_srosenbe_engine__master/797bd8a1-625d-443d-b10f-cec4ed74b185/images/040e79d4-5f62-4a52-b5fd-6dd8afbff18c/1c69f0dd-6a61-495b-9e2d-5ebc3b9d9ea3'})]... (sampling:425)

[5] 2020-01-08 19:44:24,842+0200 INFO  (periodic/2) [virt.sampling.VMBulkstatsMonitor] DEBUGGING STATS: bulk stats: [(<libvirt.virDomain object at 0x7f2c9411a7b8>, {'state.state': 1, 'state.reason': 5, 'cpu.time': 66756207353544, 'cpu.user': 4179490000000, 'cpu.system': 2976880000000, 'balloon.current': 1048576, 'balloon.maximum': 1048576, 'balloon.swap_in': 108639492, 'balloon.swap_out': 87697104, 'balloon.major_fault': 4190390, 'balloon.minor_fault': 49125143, 'balloon.unused': 48636, 'balloon.available': 769204, 'balloon.usable': 45124, 'balloon.last-update': 1578505464, 'balloon.rss': 1048908, 'vcpu.current': 2, 'vcpu.maximum': 32, 'vcpu.0.state': 1, 'vcpu.0.time': 29749060000000, 'vcpu.0.wait': 0, 'vcpu.1.state': 1, 'vcpu.1.time': 29834430000000, 'vcpu.1.wait': 0, 'net.count': 1, 'net.0.name': 'vnet0', 'net.0.rx.bytes': 98975956, 'net.0.rx.pkts': 1371656, 'net.0.rx.errs': 0, 'net.0.rx.drop': 818, 'net.0.tx.bytes': 494900, 'net.0.tx.pkts': 6001, 'net.0.tx.errs': 0, 'net.0.tx.drop': 0, 'block.count': 2, 'block.0.name': 'sdc', 'block.0.rd.reqs': 18, 'block.0.rd.bytes': 404, 'block.0.rd.times': 40481, 'block.0.wr.reqs': 0, 'block.0.wr.bytes': 0, 'block.0.wr.times': 0, 'block.0.fl.reqs': 0, 'block.0.fl.times': 0, 'block.0.allocation': 0, 'block.0.physical': 0, 'block.1.name': 'sda', 'block.1.path': '/rhev/data-center/mnt/vserver-spider.eng.lab.tlv.redhat.com:_pub_srosenbe_engine__master/797bd8a1-625d-443d-b10f-cec4ed74b185/images/040e79d4-5f62-4a52-b5fd-6dd8afbff18c/1c69f0dd-6a61-495b-9e2d-5ebc3b9d9ea3',


[6] 2020-01-08 20:17:24,862+0200 INFO  (periodic/0) [virt.sampling.VMBulkstatsMonitor] DEBUGGING STATS: bulk stats: []... (sampling:425)

[7] 2020-01-08 19:28:24,836+0200 INFO  (periodic/0) [virt.sampling.VMBulkstatsMonitor] DEBUGGING STATS: bulk stats: [(<libvirt.virDomain object at 0x7f2c9413b9e8>, {'state.state': 1, 'state.reason': 5, 'cpu.time': 65330583101645, 'cpu.user': 4113070000000, 'cpu.system': 2943820000000, 'balloon.current': 1048576, 'balloon.maximum': 1048576, 'balloon.swap_in': 106304088, 'balloon.swap_out': 85772272, 'balloon.major_fault': 4092964, 'balloon.minor_fault': 48354738, 'balloon.unused': 60100, 'balloon.available': 769204, 'balloon.usable': 34176, 'balloon.last-update': 1578504500, 'balloon.rss': 1036380, 'vcpu.current': 2, 'vcpu.maximum': 32, 'vcpu.0.state': 1, 'vcpu.0.time': 29086520000000, 'vcpu.0.wait': 0, 'vcpu.1.state': 1, 'vcpu.1.time': 29172020000000, 'vcpu.1.wait': 0, 'net.count': 1, 'net.0.name': 'vnet0', 'net.0.rx.bytes': 98013257, 'net.0.rx.pkts': 1356975, 'net.0.rx.errs': 0, 'net.0.rx.drop': 0, 'net.0.tx.bytes': 489332, 'net.0.tx.pkts': 5935, 'net.0.tx.errs': 0, 'net.0.tx.drop': 0, 'block.count': 2, 'block.0.name': 'sdc', 'block.0.rd.reqs': 18, 'block.0.rd.bytes': 404, 'block.0.rd.times': 40481, 'block.0.wr.reqs': 0, 'block.0.wr.bytes': 0, 'block.0.wr.times': 0, 'block.0.fl.reqs': 0, 'block.0.fl.times': 0, 'block.0.allocation': 0, 'block.0.physical': 0, 'block.1.name': 'sda', 'block.1.path': '/rhev/data-center/mnt/vserver-spider.eng.lab.tlv.redhat.com:_pub_srosenbe_engine__master/797bd8a1-625d-443d-b10f-cec4ed74b185/images/040e79d4-5f62-4a52-b5fd-6dd8afbff18c/5b01f998-c7e7-4e66-89f2-f5475b34b147', 'block.1.rd.reqs': 4348007, 'block.1.rd.bytes': 125299288064, 'block.1.rd.times': 18992369953435, 'block.1.wr.reqs': 248532, 'block.1.wr.bytes': 87992325120, 'block.1.wr.times': 39537595944654, 'block.1.fl.reqs': 98603796, 'block.1.fl.times': 475717383426, 'block.1.allocation': 13293580288, 'block.1.capacity': 16106127360, 'block.1.physical': 3757133824})]... (sampling:425)

Comment 48 Steven Rosenberg 2020-01-08 18:37:49 UTC
Created attachment 1650808 [details]
VDSM Log with test results

The logs with the results described in Comment 47.

Comment 49 Milan Zamazal 2020-01-09 12:23:39 UTC
> OK, it is not clear what a long time is, maybe you can elaborate for further testing.

I'm not sure, perhaps >10 minutes would be long enough, but I'd simply leave it in the reduced state for an hour and see if everything is still fine.

> Therefore the testing shows that the partial stats [4] do not seem to break the system, they are received and processed while the qemu-kvm process is down.

Did you check in Engine and Vdsm logs that there are no related errors reported?

> When the VM is locked and performing a process, it seems no stats are received at all [6]

Interesting. The question is what to do in such a case. Perhaps attempting again and, if the stats are still empty, then returning an error?

I'd additionally suggest looking into Engine code, what happens with the obtained stats there. We should know that, it will also help us with other decisions related to nowait, such as whether it would be any useful to indicate to Engine that the stats are incomplete. Also running OST on a modified Vdsm that returns only the reduced set of keys all the time might be a useful test.

I talked to Nir and he confirmed that the stats are not used in Vdsm storage code. But he doesn't know if Engine uses them for anything important, so you should look into the code and/or ask Engine storage developers.
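
A rough sketch of the retry idea mentioned above (a hypothetical helper, not how Vdsm currently behaves):

    import time

    def get_bulk_stats_or_fail(conn, stats, flags, retries=3, delay=2.0):
        # Retry the bulk-stats call a few times when it returns an empty list
        # (e.g. while a long-running job holds the domain), then report an error.
        for _ in range(retries):
            bulk_stats = conn.getAllDomainStats(stats=stats, flags=flags)
            if bulk_stats:
                return bulk_stats
            time.sleep(delay)
        raise RuntimeError('bulk stats still empty after %d attempts' % retries)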

Comment 50 Steven Rosenberg 2020-01-09 14:05:34 UTC
(In reply to Milan Zamazal from comment #49)
> > OK, it is not clear what a long time is, maybe you can elaborate for further testing.
> 
> I'm not sure, perhaps >10 minutes would be long enough, but I'd simply leave
> it in the reduced stay for an hour and see if there is still everything fine.
> 
> > Therefore the testing shows that the partial stats [4] do not seem to break the system, they are received and processed while the qemu-kvm process is down.
> 
> Did you check in Engine and Vdsm logs that there are no related errors
> reported?
> 
> > When the VM is locked and performing a process, it seems no stats are received at all [6]
> 
> Interesting. The question is what to do in such a case. Perhaps attempting
> again and if the stats are still empty than returning an error?
> 
> I'd additionally suggest looking into Engine code, what happens with the
> obtained stats there. We should know that, it will also help us with other
> decisions related to nowait, such as whether it would be any useful to
> indicate to Engine that the stats are incomplete. Also running OST on a
> modified Vdsm that returns only the reduced set of keys all the time might
> be a useful test.
> 
> I talked to Nir and he confirmed that the stats are not used in Vdsm storage
> code. But he doesn't know if Engine uses them for anything important, so you
> should look into the code and/or ask Engine storage developers.


The engine basically just has timeout errors; judging from the timestamps, it seems they occurred only during the create snapshot test, not during the other tests:

2020-01-08 19:48:36,075+02 ERROR [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connection timeout for host '10.35.0.35', last response arrived 287154 ms ago.
2020-01-08 19:48:36,075+02 INFO  [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] No interaction with host '10.35.0.34' for 20000 ms.
2020-01-08 19:48:36,075+02 ERROR [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connection timeout for host '10.35.0.34', last response arrived 282571 ms ago.
2020-01-08 19:48:36,075+02 INFO  [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] No interaction with host '10.35.0.34' for 20000 ms.
2020-01-08 19:48:36,075+02 ERROR [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connection timeout for host '10.35.0.34', last response arrived 282679 ms ago.
2020-01-08 19:48:36,075+02 INFO  [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] No interaction with host '10.35.0.35' for 20000 ms.
2020-01-08 19:48:36,075+02 ERROR [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connection timeout for host '10.35.0.35', last response arrived 284847 ms ago.
2020-01-08 19:48:37,088+02 INFO  [org.ovirt.engine.core.bll.ConcurrentChildCommandsExecutionCallback] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-33) [ba0774ea-d657-4814-8ab6-6d87fbe0ec77] Command 'RemoveSnapshot' (id: 'bda6e9f8-f2bb-4a56-a31a-121fde09058c') waiting on child command id: '2155fed9-2825-44da-af99-d53151b1ec5a' type:'RemoveSnapshotSingleDiskLive' to complete
2020-01-08 19:48:37,928+02 INFO  [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to /10.35.0.35
2020-01-08 19:48:37,929+02 WARN  [org.ovirt.vdsm.jsonrpc.client.utils.retry.Retryable] (SSL Stomp Reactor) [] Retry failed
2020-01-08 19:48:37,930+02 ERROR [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-42) [] Exception during connection
2020-01-08 19:48:37,931+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStatusVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-42) [] Command 'SpmStatusVDSCommand(HostName = Host1, SpmStatusVDSCommandParameters:{hostId='8e8c9454-3c22-44ec-a429-1b995bb27dfe', storagePoolId='baed6eba-1f55-11ea-9ef1-8c16450ea519'})' execution failed: java.net.SocketException: Network is unreachable

I will try to run the test for an hour or so before continuing the qemu-kvm process accordingly.
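For reference, the pause/continue step can be scripted; a minimal Python sketch follows (the pgrep pattern and the one-hour sleep are assumptions for illustration, not the exact commands used in the test):

    import os
    import signal
    import subprocess
    import time

    def qemu_pid(vm_name):
        # Find the qemu-kvm process serving the VM; assumes a unique match on
        # the '-name guest=<vm_name>' part of the qemu command line.
        out = subprocess.check_output(['pgrep', '-f', 'guest=%s' % vm_name])
        return int(out.split()[0])

    def pause_and_resume(vm_name, seconds=3600):
        pid = qemu_pid(vm_name)
        os.kill(pid, signal.SIGSTOP)   # qemu monitor stops answering
        time.sleep(seconds)            # leave it stopped for about an hour
        os.kill(pid, signal.SIGCONT)   # continue the guest afterwards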

Comment 51 Steven Rosenberg 2020-01-12 12:38:27 UTC
(In reply to Milan Zamazal from comment #49)
> > OK, it is not clear what a long time is, maybe you can elaborate for further testing.
> 
> I'm not sure, perhaps >10 minutes would be long enough, but I'd simply leave
> it in the reduced state for an hour and see if there is still everything fine.

Further testing entailed stopping the qemu-kvm process for about an hour. Within that time the limited stats were logged [1], as opposed to the full stats logged when the qemu-kvm process is running [2].

After continuing the qemu-kvm process, I was able to create and remove a snapshot, showing that the system recovered successfully.

[1] 2020-01-12 12:44:27,890+0200 INFO  (periodic/1) [virt.sampling.VMBulkstatsMonitor] DEBUGGING STATS: bulk stats: [(<libvirt.virDomain object at 0x7f495c2bd860>, {'state.state': 1, 'state.reason': 5, 'cpu.time': 744691289490290, 'cpu.user': 37046340000000, 'cpu.system': 23946970000000, 'balloon.current': 1048576, 'balloon.maximum': 1048576, 'vcpu.current': 2, 'vcpu.maximum': 32, 'vcpu.0.state': 1, 'vcpu.0.time': 345484570000000, 'vcpu.0.wait': 0, 'vcpu.1.state': 1, 'vcpu.1.time': 345615880000000, 'vcpu.1.wait': 0, 'net.count': 1, 'net.0.name': 'vnet0', 'net.0.rx.bytes': 441349462, 'net.0.rx.pkts': 6458267, 'net.0.rx.errs': 0, 'net.0.rx.drop': 53143, 'net.0.tx.bytes': 1980320, 'net.0.tx.pkts': 23975, 'net.0.tx.errs': 0, 'net.0.tx.drop': 0, 'block.count': 2, 'block.0.name': 'sdc', 'block.1.name': 'sda', 'block.1.path': '/rhev/data-center/mnt/vserver-spider.eng.lab.tlv.redhat.com:_pub_srosenbe_engine__master/797bd8a1-625d-443d-b10f-cec4ed74b185/images/040e79d4-5f62-4a52-b5fd-6dd8afbff18c/978695fa-d430-403e-af73-8446b7d27040'})]... (sampling:425)



[2] 2020-01-12 14:35:27,994+0200 INFO  (periodic/3) [virt.sampling.VMBulkstatsMonitor] DEBUGGING STATS: bulk stats: [(<libvirt.virDomain object at 0x7f495c2cbf28>, {'state.state': 1, 'state.reason': 5, 'cpu.time': 750487671675854, 'cpu.user': 37336070000000, 'cpu.system': 24156890000000, 'balloon.current': 1048576, 'balloon.maximum': 1048576, 'balloon.swap_in': 1212734144, 'balloon.swap_out': 1002810364, 'balloon.major_fault': 51632662, 'balloon.minor_fault': 451941072, 'balloon.unused': 73976, 'balloon.available': 769204, 'balloon.usable': 37976, 'balloon.last-update': 1578832527, 'balloon.rss': 1036120, 'vcpu.current': 2, 'vcpu.maximum': 32, 'vcpu.0.state': 1, 'vcpu.0.time': 348184940000000, 'vcpu.0.wait': 0, 'vcpu.1.state': 1, 'vcpu.1.time': 348316660000000, 'vcpu.1.wait': 0, 'net.count': 1, 'net.0.name': 'vnet0', 'net.0.rx.bytes': 444630062, 'net.0.rx.pkts': 6506201, 'net.0.rx.errs': 0, 'net.0.rx.drop': 117771, 'net.0.tx.bytes': 2009082, 'net.0.tx.pkts': 24318, 'net.0.tx.errs': 0, 'net.0.tx.drop': 0, 'block.count': 2, 'block.0.name': 'sdc', 'block.0.rd.reqs': 18, 'block.0.rd.bytes': 404, 'block.0.rd.times': 40481, 'block.0.wr.reqs': 0, 'block.0.wr.bytes': 0, 'block.0.wr.times': 0, 'block.0.fl.reqs': 0, 'block.0.fl.times': 0, 'block.0.allocation': 0, 'block.0.physical': 0, 'block.1.name': 'sda', 'block.1.path': '/rhev/data-center/mnt/vserver-spider.eng.lab.tlv.redhat.com:_pub_srosenbe_engine__master/797bd8a1-625d-443d-b10f-cec4ed74b185/images/040e79d4-5f62-4a52-b5fd-6dd8afbff18c/5b01f998-c7e7-4e66-89f2-f5475b34b147', 'block.1.rd.reqs': 54855621, 'block.1.rd.bytes': 1432697240064, 'block.1.rd.times': 239370012814018, 'block.1.wr.reqs': 6706696, 'block.1.wr.bytes': 1027886173184, 'block.1.wr.times': 488072237532809, 'block.1.fl.reqs': 1194923012, 'block.1.fl.times': 5369490751387, 'block.1.allocation': 13308526592, 'block.1.capacity': 16106127360, 'block.1.physical': 3764207616})]... (sampling:425)
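A hedged sketch of how the two samples above could be told apart on the Vdsm side; the helper name is hypothetical and the marker key is only a heuristic taken from these logs (balloon.rss appears in the full sample [2] but is missing from the reduced sample [1]):

    def is_partial_sample(stats):
        # The detailed balloon counters come from the qemu monitor, so their
        # absence suggests libvirt answered with NOWAIT while qemu was busy.
        return 'balloon.rss' not in stats

    def classify(bulk_stats):
        # Tag every sampled domain so later code can react to partial data.
        return {dom.name(): is_partial_sample(stats) for dom, stats in bulk_stats}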

Comment 52 Steven Rosenberg 2020-01-12 12:48:46 UTC
(In reply to Milan Zamazal from comment #49)
> I'd additionally suggest looking into Engine code, what happens with the
> obtained stats there. We should know that, it will also help us with other
> decisions related to nowait, such as whether it would be any useful to
> indicate to Engine that the stats are incomplete. Also running OST on a
> modified Vdsm that returns only the reduced set of keys all the time might
> be a useful test.
> 
> I talked to Nir and he confirmed that the stats are not used in Vdsm storage
> code. But he doesn't know if Engine uses them for anything important, so you
> should look into the code and/or ask Engine storage developers.

I reviewed the VmAnalyzer update statistics code [1] as well as the StorageDomainDynamic code [2], as per Milan's request, but did not find that the statistics handled there represent the dom stats that vdsm receives from the libvirt / qemu-kvm modules. Maybe Tal can provide further answers.



[1] https://github.com/oVirt/ovirt-engine/blob/master/backend/manager/modules/vdsbroker/src/main/java/org/ovirt/engine/core/vdsbroker/monitoring/VmAnalyzer.java#L893
[2] https://github.com/oVirt/ovirt-engine/blob/master/backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/storage/domain/StorageDomainCommandBase.java#L466

Comment 53 Milan Zamazal 2020-01-14 15:46:41 UTC
I've made a testing patch for OST, which emulates missing stats all the time: https://gerrit.ovirt.org/106290/. Unfortunately OST failed on it for an unrelated reason; I'll try to re-run it later.

Comment 54 Tal Nisan 2020-01-15 10:20:22 UTC
(In reply to Steven Rosenberg from comment #52)
> I reviewed the VmAnalyzer update statistics code [1] as well as the
> StorageDomainDynamic code [2], as per Milan's request, but did not find that
> the statistics handled there represent the dom stats that vdsm receives from
> the libvirt / qemu-kvm modules. Maybe Tal can provide further answers.

The storage domain statistics are mostly about available and used size, nothing that should be important for the running of a VM unless we've reached the critical free size limit.

Comment 55 Milan Zamazal 2020-01-17 14:25:54 UTC
(In reply to Tal Nisan from comment #54)
 
> The storage domain statistics are mostly about available and used size,
> nothing that should be important for the running of a VM unless we've
> reached the critical free size limit.

Tal, what do you mean by "unless we've reached critical free size limit"? If the stats are unavailable, Engine can't react to such situations in any case; the difference with using NOWAIT is that Engine gets incomplete stats rather than no stats. If it is not worse than having no stats at all (such as something crashing on a missing item), we should be fine.

There is just one caveat: If incomplete stats are available, Engine may present the VM as OK to the user despite reaching storage limits or ballooning not working. We should probably indicate in Engine that the VM may not be behaving as expected when this happens. Steven, please try to address that in your patches.

Other than that, I'd say if OST passes with incomplete stats (it seems to be still generally broken), we can add the NOWAIT flag.
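One possible shape for the indication mentioned above, sketched in Python on the Vdsm side (the 'statsPartial' field name is purely hypothetical and is not an existing Vdsm/Engine API key; detection reuses the balloon.rss heuristic sketched under comment 51):

    def annotate_vm_stats(vm_stats, bulk_sample):
        # Copy the per-VM stats dict and mark it when the underlying libvirt
        # bulk sample was incomplete, so Engine could warn the user instead of
        # presenting the VM as fully monitored.
        result = dict(vm_stats)
        result['statsPartial'] = 'balloon.rss' not in bulk_sample
        return result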

Comment 56 Tal Nisan 2020-03-08 12:32:22 UTC
(In reply to Milan Zamazal from comment #55)
> Tal, what do you mean by "unless we've reached critical free size limit"? If
> the stats are unavailable, Engine can't react to such situations in any
> case; the difference with using NOWAIT is that Engine gets incomplete stats
> rather than no stats. If it is not worse than having no stats at all (such
> as something crashing on a missing item), we should be fine.

Those stats are relevant when creating a new disk, for instance; they are not critical in the sense that something bad would happen if they were not returned in some of the statistics collection cycles.

Comment 57 Qin Yuan 2020-07-01 15:59:44 UTC
Tested with:
vdsm-4.40.20-1.el8ev.x86_64
libvirt-6.0.0-24.module+el8.2.1+6997+c666f621.x86_64
qemu-kvm-4.2.0-25.module+el8.2.1+6985+9fd9d514.x86_64
ovirt-engine-4.4.1.5-0.17.el8ev.noarch

Steps & results:
Pre:
1. Modified the vdsm's VMBulkstatsMonitor function to print the return stats
   self._log.info("DEBUGGING STATS: bulk stats: %r...", bulk_stats)
2. Set LiveSnapshotPerformFreezeInEngine=true on engine via engine-config

Part 1, generate lots of dirty pages on VM
Steps:
1. Create and run a VM(rhel8.2)
2. Run the following cmds/script in VM to generate a lot of writes to storage on the guest in order to generate a lot of dirty pages.
1) modify dirty page settings in VM:
   # vi /etc/sysctl.conf
   vm.dirty_background_ratio=90
   vm.dirty_ratio=90
2) reload dirty page settings:
   # sysctl -p
3) generate writes:
   #!/bin/bash

   for ((i=0; i<100; i++))
   do
       dd if=/dev/zero of=$i.txt bs=1M count=40
   done
3. Create a live snapshot without memory

Results:
1. It took about 2 mins to freeze guest filesystems
2. During freezing, VMBulkstatsMonitor got full stats
3. During snapshot creation, VM status changed from Up to NotResponding for some time, then changed back to Up
4. Snapshot creation succeeded.

vdsm log:
2020-07-01 11:43:47,447+0000 INFO  (jsonrpc/4) [virt.vm] (vmId='7835bc4a-bcf4-4f48-bacc-8d1907045ba7') Freezing guest filesystems (vm:3881)
...
2020-07-01 11:43:58,022+0000 INFO  (periodic/2) [virt.sampling.VMBulkstatsMonitor] DEBUGGING STATS: bulk stats: [(<libvirt.virDomain object at 0x7f2b8c5aa940>, {'state.state': 1, 'state.reason': 1, 'cpu.time': 34983704708, 'cpu.user': 2130000000, 'cpu.system': 4770000000, 'cpu.cache.monitor.count': 0, 'balloon.current': 25165824, 'balloon.maximum': 25165824, 'balloon.swap_in': 0, 'balloon.swap_out': 0, 'balloon.major_fault': 736, 'balloon.minor_fault': 386046, 'balloon.unused': 19965972, 'balloon.available': 24509528, 'balloon.usable': 23935688, 'balloon.last-update': 1593603833, 'balloon.disk_caches': 4271764, 'balloon.hugetlb_pgalloc': 0, 'balloon.hugetlb_pgfail': 0, 'balloon.rss': 5302932, 'vcpu.current': 1, 'vcpu.maximum': 16, 'vcpu.0.state': 1, 'vcpu.0.time': 32670000000, 'vcpu.0.wait': 0, 'net.count': 1, 'net.0.name': 'vnet0', 'net.0.rx.bytes': 235147, 'net.0.rx.pkts': 3312, 'net.0.rx.errs': 0, 'net.0.rx.drop': 0, 'net.0.tx.bytes': 48678, 'net.0.tx.pkts': 286, 'net.0.tx.errs': 0, 'net.0.tx.drop': 0, 'block.count': 2, 'block.0.name': 'sdc', 'block.0.rd.reqs': 21, 'block.0.rd.bytes': 446, 'block.0.rd.times': 39978, 'block.0.wr.reqs': 0, 'block.0.wr.bytes': 0, 'block.0.wr.times': 0, 'block.0.fl.reqs': 0, 'block.0.fl.times': 0, 'block.1.name': 'sda', 'block.1.path': '/rhev/data-center/mnt/10.x.x.x:_home_nfs_data1/02320133-cbae-41da-88ea-f9a1f636fb73/images/d0defb1a-3a43-4a9d-ad53-6a3e4fb58094/ab4081ef-3d66-495e-805a-c2e3be178a16', 'block.1.backingIndex': 1, 'block.1.rd.reqs': 8283, 'block.1.rd.bytes': 211631104, 'block.1.rd.times': 12207405067, 'block.1.wr.reqs': 586, 'block.1.wr.bytes': 407975424, 'block.1.wr.times': 450986630852, 'block.1.fl.reqs': 82, 'block.1.fl.times': 2729876, 'block.1.allocation': 41582698496, 'block.1.capacity': 53687091200, 'block.1.physical': 20503556096})]... (sampling:452)
...
2020-07-01 11:44:21,872+0000 DEBUG (vmchannels) [virt.vm] (vmId='7835bc4a-bcf4-4f48-bacc-8d1907045ba7') Guest connection timed out (guestagent:569)
2020-07-01 11:44:22,200+0000 DEBUG (vdsm.Scheduler) [Executor] Starting worker qgapoller/4 (executor:286)
2020-07-01 11:44:22,200+0000 DEBUG (qgapoller/4) [Executor] START thread <Thread(qgapoller/4, started daemon 139825571145472)> (func=<bound method _Worker._run of <Worker name=qgapoller/4 waiting task#=0 at 0x7f2b8c799cf8>>, args=(), kwargs={}) (concurrent:258)
2020-07-01 11:44:22,200+0000 WARN  (vdsm.Scheduler) [Executor] executor state: count=5 workers={<Worker name=qgapoller/1 waiting task#=45 at 0x7f2bcc2fc240>, <Worker name=qgapoller/0 running <Task discardable <Operation action=<bound method QemuGuestAgentPoller._poller of <vdsm.virt.qemuguestagent.QemuGuestAgentPoller object at 0x7f2bcc36cc88>> at 0x7f2bcc2fc198> timeout=30, duration=30.00 at 0x7f2bd491f828> discarded task#=45 at 0x7f2bcc34ae80>, <Worker name=qgapoller/4 waiting task#=0 at 0x7f2b8c799cf8>, <Worker name=qgapoller/3 waiting task#=44 at 0x7f2bcc2fc358>, <Worker name=qgapoller/2 waiting task#=45 at 0x7f2bcc2fc1d0>} (executor:213)
2020-07-01 11:44:22,202+0000 INFO  (qgapoller/0) [vds] Failed to get guest info for vm=7835bc4a-bcf4-4f48-bacc-8d1907045ba7, error: Timed out during operation: cannot acquire state change lock (held by agent=remoteDispatchDomainFSFreeze) (qemuguestagent:400)
2020-07-01 11:44:22,202+0000 INFO  (qgapoller/0) [Executor] Worker was discarded (executor:305)
2020-07-01 11:44:22,202+0000 INFO  (vdsm.Scheduler) [Executor] Worker discarded: <Worker name=qgapoller/0 waiting discarded task#=46 at 0x7f2bcc34ae80> (executor:355)
...
2020-07-01 11:45:32,995+0000 INFO  (jsonrpc/4) [virt.vm] (vmId='7835bc4a-bcf4-4f48-bacc-8d1907045ba7') 2 guest filesystems frozen (vm:3896)
...
2020-07-01 11:45:43,836+0000 INFO  (snap_abort/65649373) [virt.vm] (vmId='7835bc4a-bcf4-4f48-bacc-8d1907045ba7') Snapshot job already completed (snapshot:651)
...
2020-07-01 11:45:48,479+0000 INFO  (jsonrpc/1) [virt.vm] (vmId='7835bc4a-bcf4-4f48-bacc-8d1907045ba7') Thawing guest filesystems (vm:3904)
2020-07-01 11:45:48,486+0000 INFO  (jsonrpc/1) [virt.vm] (vmId='7835bc4a-bcf4-4f48-bacc-8d1907045ba7') 2 guest filesystems thawed (vm:3919)

engine log:
2020-07-02 03:43:36,328+08 INFO  [org.ovirt.engine.core.bll.snapshots.CreateSnapshotForVmCommand] (EE-ManagedThreadFactory-engine-Thread-23338) [802f28cf-7ee7-4f66-8d3e-32dc75c8491f] Running command: CreateSnapshotForVmCommand internal: false. Entities affected :  ID: 7835bc4a-bcf4-4f48-bacc-8d1907045ba7 Type: VMAction group MANIPULATE_VM_SNAPSHOTS with role type USER
2020-07-02 03:43:36,343+08 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-23338) [802f28cf-7ee7-4f66-8d3e-32dc75c8491f] EVENT_ID: FREEZE_VM_INITIATED(10,766), Freeze of guest filesystems on VM test_vm was initiated.
2020-07-02 03:44:06,774+08 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-14) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM up-stream-bug command Get Host Capabilities failed: Message timeout which can be caused by communication issues
2020-07-02 03:44:06,774+08 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-14) [] Unable to RefreshCapabilities: VDSNetworkException: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues
2020-07-02 03:44:21,962+08 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-11) [] VM '7835bc4a-bcf4-4f48-bacc-8d1907045ba7'(test_vm) moved from 'Up' --> 'NotResponding'
...
2020-07-02 03:44:52,171+08 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-93) [] VM '7835bc4a-bcf4-4f48-bacc-8d1907045ba7'(test_vm) moved from 'NotResponding' --> 'Up'
2020-07-02 03:45:21,916+08 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-23338) [802f28cf-7ee7-4f66-8d3e-32dc75c8491f] EVENT_ID: FREEZE_VM_SUCCESS(10,767), Guest filesystems on VM test_vm have been frozen successfully.
...
2020-07-02 03:45:37,786+08 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-94) [] EVENT_ID: USER_CREATE_SNAPSHOT_FINISHED_SUCCESS(68), Snapshot 'snap' creation for VM 'test_vm' has been completed.


Part 2, stop qemu-kvm process
Steps:
1. Create and run a VM(rhel8.2)
2. Stop qemu-kvm process by running `kill -s STOP $pid`

Results:
1. VM turned to NotResponding after stopping the qemu-kvm process
2. `virsh -r domstats test_vm --nowait` can still return partial stats

vdsm log:
2020-07-01 13:21:19,673+0000 DEBUG (vdsm.Scheduler) [Executor] Starting worker periodic/5 (executor:286)
2020-07-01 13:21:19,673+0000 DEBUG (periodic/5) [Executor] START thread <Thread(periodic/5, started daemon 139821829834496)> (func=<bound method _Worker._run of <Worker name=periodic/5 waiting task#=0 at 0x7f2b8c5de4a8>>, args=(), kwargs={}) (concurrent:258)
2020-07-01 13:21:19,673+0000 WARN  (vdsm.Scheduler) [Executor] executor state: count=5 workers={<Worker name=periodic/3 waiting task#=1287 at 0x7f2b8c799470>, <Worker name=periodic/5 waiting task#=0 at 0x7f2b8c5de4a8>, <Worker name=periodic/2 waiting task#=1305 at 0x7f2b8c799320>, <Worker name=periodic/4 waiting task#=213 at 0x7f2b8c70c7b8>, <Worker name=periodic/1 running <Task discardable <Operation action=<vdsm.virt.sampling.VMBulkstatsMonitor object at 0x7f2b8c7997f0> at 0x7f2b8c7998d0> timeout=7.5, duration=7.50 at 0x7f2b8c681b38> discarded task#=1297 at 0x7f2b8c7993c8>} (executor:213)
2020-07-01 13:21:19,674+0000 INFO  (vdsm.Scheduler) [Executor] Worker discarded: <Worker name=periodic/1 running <Task discardable <Operation action=<vdsm.virt.sampling.VMBulkstatsMonitor object at 0x7f2b8c7997f0> at 0x7f2b8c7998d0> timeout=7.5, duration=7.50 at 0x7f2b8c681b38> discarded task#=1297 at 0x7f2b8c7993c8> (executor:355)
2020-07-01 13:21:19,675+0000 DEBUG (periodic/5) [Executor] Worker started (executor:298)
...
2020-07-01 13:21:27,174+0000 DEBUG (periodic/4) [virt.sampling.VMBulkstatsMonitor] sampled timestamp 6788.159510452 elapsed 0.001 acquired False domains 0 (sampling:451)
2020-07-01 13:21:27,174+0000 INFO  (periodic/4) [virt.sampling.VMBulkstatsMonitor] DEBUGGING STATS: bulk stats: []... (sampling:452)
2020-07-01 13:21:27,941+0000 DEBUG (vmchannels) [virt.vm] (vmId='7835bc4a-bcf4-4f48-bacc-8d1907045ba7') Guest connection timed out (guestagent:569)
...
2020-07-01 13:22:06,310+0000 WARN  (jsonrpc/3) [virt.vm] (vmId='7835bc4a-bcf4-4f48-bacc-8d1907045ba7') monitor became unresponsive (command timeout, age=69.13788864900016) (vm:5536)

engine log:
2020-07-02 05:21:55,089+08 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-98) [] VM '7835bc4a-bcf4-4f48-bacc-8d1907045ba7'(test_vm) moved from 'Up' --> 'NotResponding'
2020-07-02 05:21:55,113+08 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-98) [] EVENT_ID: VM_NOT_RESPONDING(126), VM test_vm is not responding.
2020-07-02 05:22:07,953+08 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-83) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM up-stream-bug command Get Host Capabilities failed: Message timeout which can be caused by communication issues
2020-07-02 05:22:07,953+08 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-83) [] Unable to RefreshCapabilities: VDSNetworkException: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues

# virsh -r domstats test_vm --nowait
Domain: 'test_vm'
  state.state=1
  state.reason=1
  cpu.time=31507798523
  cpu.user=2000000000
  cpu.system=3930000000
  cpu.cache.monitor.count=0
  balloon.current=25165824
  balloon.maximum=25165824
  vcpu.current=1
  vcpu.maximum=16
  vcpu.0.state=1
  vcpu.0.time=29220000000
  vcpu.0.wait=0
  net.count=1
  net.0.name=vnet0
  net.0.rx.bytes=346878
  net.0.rx.pkts=4077
  net.0.rx.errs=0
  net.0.rx.drop=25268
  net.0.tx.bytes=0
  net.0.tx.pkts=0
  net.0.tx.errs=0
  net.0.tx.drop=0
  block.count=2
  block.0.name=sdc
  block.1.name=sda
  block.1.path=/rhev/data-center/mnt/10.x.x.x:_home_nfs_data1/02320133-cbae-41da-88ea-f9a1f636fb73/images/d0defb1a-3a43-4a9d-ad53-6a3e4fb58094/ab4081ef-3d66-495e-805a-c2e3be178a16
  block.1.backingIndex=1


Questions:
1. In Part 1 test, is it normal behavior that the vm became NotResponding for some time and then came back to Up again?
2. In Part 2 test, is it correct that VMBulkstatsMonitor was discarded when qemu-kvm was stopped? In comment #47 step 5, it could get partial stats when qemu-kvm was stopped.

Comment 59 Steven Rosenberg 2020-07-01 16:56:36 UTC
(In reply to Qin Yuan from comment #57)
> Questions:
> 1. In Part 1 test, is it normal behavior that the vm became NotResponding for
> some time and then came back to Up again?

Yes, because this change does not affect performance, only whether the stats are available to send. If all of the stats are available, libvirt sends them without waiting.
 
> 2. In Part 2 test, is it correct that VMBulkstatsMonitor was discarded when
> qemu-kvm was stopped? In comment #47 step 5, it could get partial stats
> when qemu-kvm was stopped.

This was one of our tests: that when qemu is not responding, we would get the partial stats without the process waiting. So you verified that.

Thank you.

Comment 60 Qin Yuan 2020-07-01 21:26:15 UTC
(In reply to Steven Rosenberg from comment #59)

I mean in my test, VMBulkstatsMonitor didn't get any stats when the qemu-kvm process was stopped, but in your test, see comment #47 step 5, it got the partial stats.
So, if partial stats are wanted when the qemu process is not responding, then my test shows there is something wrong.


Comment 61 Steven Rosenberg 2020-07-02 05:33:12 UTC
(In reply to Qin Yuan from comment #60)

It means that the nowait feature is working, because it sends just the stats that are available without waiting for the stats that are not available while the qemu-kvm process is down.


Comment 62 Milan Zamazal 2020-07-03 14:37:00 UTC
> <Worker name=periodic/1 running <Task discardable <Operation action=<vdsm.virt.sampling.VMBulkstatsMonitor object at 0x7f2b8c7997f0> at 0x7f2b8c7998d0> timeout=7.5, duration=7.50 at 0x7f2b8c681b38> discarded task#=1297 at 0x7f2b8c7993c8>

This doesn't look all right. Wasn't it exactly what NOWAIT should prevent? This explains why no (rather than partial) stats are received. Either the NOWAIT flag is missing for some reason, or libvirt still blocks even with the flag (did the virsh --nowait above provide an instant response?), or there is another problem.
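One way to answer that question from Python rather than virsh is to time a bulk-stats call with and without the flag on a read-only connection. This is only a verification sketch, assuming libvirt >= 4.5.0 where VIR_CONNECT_GET_ALL_DOMAINS_STATS_NOWAIT is defined:

    import time
    import libvirt

    def timed_bulk_stats(flags=0):
        conn = libvirt.openReadOnly('qemu:///system')
        try:
            start = time.monotonic()
            result = conn.getAllDomainStats(flags=flags)
            elapsed = time.monotonic() - start
            # The key count per domain shows whether the sample is full or reduced.
            return elapsed, {dom.name(): len(stats) for dom, stats in result}
        finally:
            conn.close()

    print('blocking:', timed_bulk_stats())
    print('nowait:  ', timed_bulk_stats(libvirt.VIR_CONNECT_GET_ALL_DOMAINS_STATS_NOWAIT))

If the nowait call returns immediately with fewer keys while the blocking call stalls, libvirt is honoring the flag and the problem is elsewhere.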

Comment 63 Arik 2020-07-05 12:41:05 UTC
(In reply to Milan Zamazal from comment #62)
> This doesn't look all right. Wasn't it exactly what NOWAIT should prevent?
> This explains why no (rather than partial) stats are received. Either the
> NOWAIT flag is missing for some reason or libvirt still blocks even with the
> flag (did the virsh --nowait above provide instant response?) or there is
> another problem.

+1

And as for the first test, wouldn't it make sense not to poll the guest agent when we freeze the FS?

Comment 64 meital avital 2020-07-07 13:51:34 UTC
According to comment #57 - it FailedQA

Comment 65 Tomáš Golembiovský 2020-07-24 10:42:38 UTC
(In reply to Arik from comment #63)

> And as for the first test, wouldn't it make sense not to poll the guest
> agent when we freeze the FS?

This should no longer be an issue with the patches for setting timeout in libvirt that we have merged. The worker should no longer be blocked and discarded, the failures should now be handled gracefully without impacting the collection of stats from other guests.
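For context, the idea behind that timeout patch can be sketched roughly as follows; this is not the merged Vdsm code, it assumes libvirt >= 5.10.0 (which exposes virDomain.agentSetResponseTimeout()), and the timeout values are illustrative:

    from contextlib import contextmanager

    @contextmanager
    def agent_response_timeout(dom, seconds):
        # Cap how long libvirt waits for the guest agent inside this block, so
        # a frozen agent cannot pin a qgapoller worker until it gets discarded.
        dom.agentSetResponseTimeout(seconds, 0)
        try:
            yield
        finally:
            # -1 is VIR_DOMAIN_AGENT_RESPONSE_TIMEOUT_DEFAULT, restoring the
            # default behaviour for subsequent calls.
            dom.agentSetResponseTimeout(-1, 0)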

Comment 66 Steven Rosenberg 2020-08-04 09:39:54 UTC
(In reply to Tomáš Golembiovský from comment #65)
> (In reply to Arik from comment #63)
> 
> > And as for the first test, wouldn't it make sense not to poll the guest
> > agent when we freeze the FS?
> 
> This should no longer be an issue with the patches for setting timeout in
> libvirt that we have merged. The worker should no longer be blocked and
> discarded, the failures should now be handled gracefully without impacting
> the collection of stats from other guests.

In reviewing this issue as per Comment 57, the warnings are due to discarded worker tasks [1]. The versions seem to be outdated [2].

Please advise what versions are recommended for retesting this issue. 


[1] 2020-07-01 11:44:22,200+0000 WARN  (vdsm.Scheduler) [Executor] executor state: count=5 workers={<Worker name=qgapoller/1 waiting task#=45 at 0x7f2bcc2fc240>, <Worker name=qgapoller/0 running <Task discardable <Operation action=<bound method QemuGuestAgentPoller._poller of <vdsm.virt.qemuguestagent.QemuGuestAgentPoller object at 0x7f2bcc36cc88>> at 0x7f2bcc2fc198> timeout=30, duration=30.00 at 0x7f2bd491f828> discarded task#=45 at 0x7f2bcc34ae80>, <Worker name=qgapoller/4 waiting task#=0 at 0x7f2b8c799cf8>, <Worker name=qgapoller/3 waiting task#=44 at 0x7f2bcc2fc358>, <Worker name=qgapoller/2 waiting task#=45 at 0x7f2bcc2fc1d0>} (executor:213)
2020-07-01 11:44:22,202+0000 INFO  (qgapoller/0) [vds] Failed to get guest info for vm=7835bc4a-bcf4-4f48-bacc-8d1907045ba7, error: Timed out during operation: cannot acquire state change lock (held by agent=remoteDispatchDomainFSFreeze) (qemuguestagent:400)
2020-07-01 11:44:22,202+0000 INFO  (qgapoller/0) [Executor] Worker was discarded (executor:305)

[2] Tested with:
vdsm-4.40.20-1.el8ev.x86_64
libvirt-6.0.0-24.module+el8.2.1+6997+c666f621.x86_64
qemu-kvm-4.2.0-25.module+el8.2.1+6985+9fd9d514.x86_64
ovirt-engine-4.4.1.5-0.17.el8ev.noarch

Comment 67 Tomáš Golembiovský 2020-08-04 10:18:34 UTC
Looks like the changes were included in vdsm version 4.40.23

Comment 68 Steven Rosenberg 2020-08-04 10:36:56 UTC
(In reply to Tomáš Golembiovský from comment #67)
> Looks like the changes were included in vdsm version 4.40.23

It appears the problem is that the vdsm version in QE is outdated.

I am using the following version and cannot reproduce the reported issue:

vdsm.x86_64                                             4.40.23-4.gita8c46f6a3.el8                                    @ovirt-master-snapshot                    

Please update the Host, ensure we are using the proper versions and try to retest accordingly.

Comment 69 Tomáš Golembiovský 2020-08-04 11:19:42 UTC
(In reply to Tomáš Golembiovský from comment #67)
> Looks like the changes were included in vdsm version 4.40.23
The relevant patch is:

    e3da0242f  qga: add context for setting libvirt timeout

Comment 70 Steven Rosenberg 2020-08-04 11:26:48 UTC
(In reply to Tomáš Golembiovský from comment #69)
> (In reply to Tomáš Golembiovský from comment #67)
> > Looks like the changes were included in vdsm version 4.40.23
> The relevant patch is:
> 
>     e3da0242f  qga: add context for setting libvirt timeout

For visibility, vdsm version 4.40.23 should be released this Thursday.

Comment 75 Qin Yuan 2020-09-17 02:14:45 UTC
Verified with:
vdsm-4.40.29-1.el8ev.x86_64
libvirt-6.6.0-4.module+el8.3.0+7883+3d717aa8.x86_64
qemu-kvm-5.1.0-4.module+el8.3.0+7846+ae9b566f.x86_64
ovirt-engine-4.4.3.2-0.19.el8ev.noarch

Steps:
1. Create and run a VM(rhel8.2) with 16G memory.
2. Run the following cmds/script in VM to generate a lot of writes to storage on the guest in order to generate a lot of dirty pages.
1) modify dirty page settings in VM:
   # vi /etc/sysctl.conf
   vm.dirty_background_ratio=90
   vm.dirty_ratio=90
2) reload dirty page settings:
   # sysctl -p
3) generate writes:
   #!/bin/bash

   for ((i=0; i<100; i++))
   do
       dd if=/dev/zero of=$i.txt bs=1M count=70
   done
3. Create a live snapshot without memory

Results:
1. It took 1 min to finish freezing guest filesystems on the VM, and the VM status stayed Up during freezing; it didn't change to NotResponding.
2. Created snapshot successfully.

engine log:
2020-09-17 04:49:04,732+03 INFO  [org.ovirt.engine.core.bll.snapshots.CreateSnapshotForVmCommand] (default task-16) [ee7badc5-1f2d-4a02-a0bc-af19cbf0c853] Lock Acquired to object 'EngineLock:{exclusiveLocks='[e454789a-4866-44e0-97e5-018f43620320=VM]', sharedLocks=''}'
2020-09-17 04:49:04,797+03 INFO  [org.ovirt.engine.core.bll.snapshots.CreateSnapshotForVmCommand] (EE-ManagedThreadFactory-engine-Thread-655) [ee7badc5-1f2d-4a02-a0bc-af19cbf0c853] Running command: CreateSnapshotForVmCommand internal: false. Entities affected :  ID: e454789a-4866-44e0-97e5-018f43620320 Type: VMAction group MANIPULATE_VM_SNAPSHOTS with role type USER
2020-09-17 04:49:04,808+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-655) [ee7badc5-1f2d-4a02-a0bc-af19cbf0c853] EVENT_ID: FREEZE_VM_INITIATED(10,766), Freeze of guest filesystems on VM test was initiated.
2020-09-17 04:49:04,809+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FreezeVDSCommand] (EE-ManagedThreadFactory-engine-Thread-655) [ee7badc5-1f2d-4a02-a0bc-af19cbf0c853] START, FreezeVDSCommand(HostName = host_mixed_1, VdsAndVmIDVDSParametersBase:{hostId='ac895480-a2f0-4392-84be-ee007ab60a17', vmId='e454789a-4866-44e0-97e5-018f43620320'}), log id: 78d3538b
2020-09-17 04:50:06,730+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FreezeVDSCommand] (EE-ManagedThreadFactory-engine-Thread-655) [ee7badc5-1f2d-4a02-a0bc-af19cbf0c853] FINISH, FreezeVDSCommand, return: , log id: 78d3538b
2020-09-17 04:50:06,734+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-655) [ee7badc5-1f2d-4a02-a0bc-af19cbf0c853] EVENT_ID: FREEZE_VM_SUCCESS(10,767), Guest filesystems on VM test have been frozen successfully.
...
2020-09-17 04:50:24,973+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-100) [] EVENT_ID: USER_CREATE_SNAPSHOT_FINISHED_SUCCESS(68), Snapshot 'snapshot1' creation for VM 'test' has been completed.

Comment 82 errata-xmlrpc 2020-11-24 13:09:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Low: Red Hat Virtualization security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5179

