Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1537148

Summary: Guests not responding periodically in Manager
Product: Red Hat Enterprise Virtualization Manager
Reporter: Jaroslav Spanko <jspanko>
Component: vdsm
Assignee: Francesco Romani <fromani>
Status: CLOSED ERRATA
QA Contact: Petr Matyáš <pmatyas>
Severity: high
Priority: high
Docs Contact:
Version: 4.1.6
CC: danken, dfodor, fromani, fsun, jspanko, lsurette, mavital, michal.skrivanek, mtessun, rabraham, sfroemer, srevivo, ycui
Target Milestone: ovirt-4.3.0
Keywords: ZStream
Target Release: 4.3.0
Flags: pmatyas: testing_plan_complete-
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: v4.30.3
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1607978 (view as bug list)
Environment:
Last Closed: 2019-05-08 12:35:59 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Virt
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1607978

Description Jaroslav Spanko 2018-01-22 14:36:21 UTC
Description of problem:
The VMs are periodically shown as not responding in the Manager: roughly 20 times per day, almost all VMs flip between NotResponding and Up.


engine.log
----------------------------------------------------
2018-01-17 10:12:14,242+01 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler6) [33b857dc] VM '2f21a782-0162-40cc-ab19-17d5f0aa1411'(xx) moved from 'NotResponding' --> 'Up'
2018-01-17 10:21:33,511+01 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler3) [6c268bc5] VM 'cf9fc0eb-d729-45d3-842e-0070bc51c800'(xx) moved from 'Up' --> 'NotResponding'
----------------------------------------------------

VMBulkSampler ran fine:
----------------------------------------------------
2018-01-17 10:10:19,559+0100 DEBUG (periodic/990) [virt.sampling.VMBulkSampler] sampled timestamp 5053130.11 elapsed 2.210 acquired True domains all (sampling:508)
2018-01-17 10:10:34,781+0100 DEBUG (periodic/997) [virt.sampling.VMBulkSampler] sampled timestamp 5053145.11 elapsed 2.440 acquired True domains all (sampling:508)
----------------------------------------------------
After that, the sampler stopped:
----------------------------------------------------
2018-01-17 10:11:36,086+0100 WARN  (jsonrpc/4) [virt.vm] (vmId='bb74a7a8-95fb-4e5b-a2d7-2f88be7b722f') monitor became unresponsive (command timeout, age=63.7399999993) (vm:5132)
2018-01-17 10:11:36,088+0100 WARN  (jsonrpc/4) [virt.vm] (vmId='dc097117-ad3e-48a0-9969-6e2500b6f1b2') monitor became unresponsive (command timeout, age=63.7399999993) (vm:5132)
2018-01-17 10:12:07,153+0100 WARN  (jsonrpc/4) [virt.vm] (vmId='acbb7da8-5c64-4cbd-8785-b61e4b4fae93') monitor became unresponsive (command timeout, age=94.8099999996) (vm:5132)
----------------------------------------------------
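The warnings above report an "age" per VM, i.e. how long the monitor has gone without a response. As a minimal sketch (names and the timeout value are illustrative, not vdsm's actual API), the staleness check behind such a warning looks like this:

```python
import time

# Hypothetical names; a minimal sketch of the staleness logic behind a
# "monitor became unresponsive" warning: a VM is flagged when its last
# successful libvirt response is older than a configured timeout.
MONITOR_TIMEOUT = 60.0  # seconds; illustrative value only


def monitor_age(last_response_ts, now=None):
    """Seconds elapsed since the monitor last responded."""
    if now is None:
        now = time.monotonic()
    return now - last_response_ts


def is_monitor_unresponsive(last_response_ts, now=None):
    """True once the monitor's age exceeds the timeout."""
    return monitor_age(last_response_ts, now) > MONITOR_TIMEOUT


# Mirroring the log above: an age of 63.74s exceeds the 60s timeout.
assert is_monitor_unresponsive(last_response_ts=0.0, now=63.74)
assert not is_monitor_unresponsive(last_response_ts=0.0, now=10.0)
```

This matches the log pattern: the ages (63.74s, 94.81s) grow while the sampler is stalled, and the warnings stop once sampling resumes.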

This continued until 10:12:07, when the sampler started working again at the 15-second interval:
----------------------------------------------------
2018-01-17 10:12:10,082+0100 DEBUG (periodic/994) [virt.sampling.VMBulkSampler] sampled timestamp 5053240.45 elapsed 2.400 acquired True domains all (sampling:508)
----------------------------------------------------

And again, the latest VMBulkSampler entry:
----------------------------------------------------
2018-01-17 10:20:34,614+0100 DEBUG (periodic/992) [virt.sampling.VMBulkSampler] sampled timestamp 5053745.15 elapsed 2.230 acquired True domains all (sampling:508)
----------------------------------------------------

And the behavior repeated again and again:
----------------------------------------------------
2018-01-17 10:21:32,631+0100 WARN  (jsonrpc/5) [virt.vm] (vmId='bb74a7a8-95fb-4e5b-a2d7-2f88be7b722f') monitor became unresponsive (command timeout, age=60.25) (vm:5132)
----------------------------------------------------

But no workers were discarded during that period, as there were in
https://bugzilla.redhat.com/show_bug.cgi?id=1451703

Also, the customer is running a newer version of the RHV host/vdsm.

There are delayed calls, but not periodic ones, which could explain
https://bugzilla.redhat.com/show_bug.cgi?id=1430847

Version-Release number of selected component (if applicable):
rhevm-4.1.6
vdsm-4.19.42-1.el7ev.x86_64
libvirt-python-3.2.0-3.el7_4.1.x86_64

How reproducible:
100%

Actual results:
The VMs periodically switch between NotResponding and Up, about twice per hour.

Additional info:
A gcore of vdsm/supervdsm will be uploaded, along with the latest vdsm log.


Thanks

Comment 10 Francesco Romani 2018-02-06 09:29:50 UTC
Summary:
All the (scarce) information we can extract from the logs points in the direction of libvirt slowness, in turn caused by slow storage, so we should be in the same category as https://bugzilla.redhat.com/show_bug.cgi?id=1430847 . Vdsm is reacting as best it can to this slowness, and the reported unresponsive VMs are a symptom rather than the root issue. In this sense, Vdsm is doing its job correctly.

More detailed analysis follows.

+++

What we see now is something really weird. Here the sampling works as expected:

2018-01-19 09:07:39,528+0100 DEBUG (periodic/997) [virt.sampling.VMBulkSampler] sampled timestamp 5222169.96 elapsed 2.330 acquired True domains all (sampling:508)
2018-01-19 09:07:54,524+0100 DEBUG (periodic/997) [virt.sampling.VMBulkSampler] sampled timestamp 5222184.96 elapsed 2.330 acquired True domains all (sampling:508)
2018-01-19 09:08:09,466+0100 DEBUG (periodic/997) [virt.sampling.VMBulkSampler] sampled timestamp 5222199.96 elapsed 2.270 acquired True domains all (sampling:508)
2018-01-19 09:08:24,533+0100 DEBUG (periodic/997) [virt.sampling.VMBulkSampler] sampled timestamp 5222214.96 elapsed 2.340 acquired True domains all (sampling:508)
2018-01-19 09:08:39,569+0100 DEBUG (periodic/997) [virt.sampling.VMBulkSampler] sampled timestamp 5222229.96 elapsed 2.370 acquired True domains all (sampling:508)
2018-01-19 09:08:54,484+0100 DEBUG (periodic/997) [virt.sampling.VMBulkSampler] sampled timestamp 5222244.97 elapsed 2.280 acquired True domains all (sampling:508)
2018-01-19 09:09:09,225+0100 DEBUG (periodic/998) [virt.sampling.VMBulkSampler] sampled timestamp 5222259.97 elapsed 2.020 acquired True domains all (sampling:508)
2018-01-19 09:09:24,228+0100 DEBUG (periodic/998) [virt.sampling.VMBulkSampler] sampled timestamp 5222274.97 elapsed 2.020 acquired True domains all (sampling:508)
2018-01-19 09:09:39,578+0100 DEBUG (periodic/994) [virt.sampling.VMBulkSampler] sampled timestamp 5222289.97 elapsed 2.370 acquired True domains all (sampling:508)
2018-01-19 09:09:54,385+0100 DEBUG (periodic/997) [virt.sampling.VMBulkSampler] sampled timestamp 5222304.98 elapsed 2.170 acquired True domains all (sampling:508)
2018-01-19 09:10:09,170+0100 DEBUG (periodic/994) [virt.sampling.VMBulkSampler] sampled timestamp 5222319.97 elapsed 1.960 acquired True domains all (sampling:508)
2018-01-19 09:10:24,270+0100 DEBUG (periodic/994) [virt.sampling.VMBulkSampler] sampled timestamp 5222334.98 elapsed 2.060 acquired True domains all (sampling:508)
2018-01-19 09:10:39,832+0100 DEBUG (periodic/994) [virt.sampling.VMBulkSampler] sampled timestamp 5222349.97 elapsed 2.630 acquired True domains all (sampling:508)


Then something happens:
2018-01-19 09:10:39,832+0100 DEBUG (periodic/994) [virt.sampling.VMBulkSampler] sampled timestamp 5222349.97 elapsed 2.630 acquired True domains all (sampling:508)
2018-01-19 09:12:24,174+0100 DEBUG (periodic/1000) [virt.sampling.VMBulkSampler] sampled timestamp 5222454.91 elapsed 2.030 acquired True domains all (sampling:508)

Notice the ~2-minute gap; sampling should happen every 15s, not every 120s.

Then we have a burst of samplings:
2018-01-19 09:12:24,924+0100 DEBUG (periodic/997) [virt.sampling.VMBulkSampler] sampled timestamp 5222455.61 elapsed 2.080 acquired False domains 81 (sampling:508)
2018-01-19 09:12:26,054+0100 DEBUG (periodic/998) [virt.sampling.VMBulkSampler] sampled timestamp 5222456.75 elapsed 2.070 acquired False domains 80 (sampling:508)
2018-01-19 09:12:26,112+0100 DEBUG (periodic/994) [virt.sampling.VMBulkSampler] sampled timestamp 5222457.27 elapsed 1.610 acquired False domains 75 (sampling:508)
2018-01-19 09:12:26,134+0100 DEBUG (periodic/1000) [virt.sampling.VMBulkSampler] sampled timestamp 5222457.16 elapsed 1.740 acquired True domains 78 (sampling:508)
2018-01-19 09:12:26,923+0100 DEBUG (periodic/997) [virt.sampling.VMBulkSampler] sampled timestamp 5222458.02 elapsed 1.670 acquired False domains 72 (sampling:508)
2018-01-19 09:12:28,324+0100 DEBUG (periodic/994) [virt.sampling.VMBulkSampler] sampled timestamp 5222459.51 elapsed 1.580 acquired True domains 71 (sampling:508)
2018-01-19 09:12:39,231+0100 DEBUG (periodic/1000) [virt.sampling.VMBulkSampler] sampled timestamp 5222469.98 elapsed 2.020 acquired True domains 71 (sampling:508)

Interestingly enough, when this pattern occurs, the burst is 7-8 operations long, and 8*15s = 120s.
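The arithmetic suggests the executor queued one periodic tick per missed interval and flushed them all at once when it unblocked. An illustrative simulation (not vdsm code) of that catch-up behavior:

```python
# Illustrative simulation (not vdsm code): a fixed-interval scheduler whose
# worker is blocked for `blocked_seconds` accumulates one pending tick per
# missed interval, then flushes them in a burst once the worker unblocks.
INTERVAL = 15.0  # sampling period in seconds, as seen in the logs


def burst_length(blocked_seconds, interval=INTERVAL):
    """Number of queued samplings that fire back-to-back after a stall."""
    return int(blocked_seconds // interval)


# A ~120s stall yields a burst of 8 operations, matching the 7-8
# back-to-back VMBulkSampler lines observed above (8 * 15s = 120s).
assert burst_length(120.0) == 8
```

This is why the burst length tracks the gap duration: the 09:10:39 → 09:12:24 gap of ~105-120s is immediately followed by 7-8 samplings within a few seconds.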

Also note that workers *are* discarded from time to time, but much more rarely than those events.

So we have something that causes the executor to slow down but *not* to discard workers. It must therefore be in the dispatch stage (VmDispatcher), which
1. runs inside the executor but
2. has *no* timeout: it is supposed to be fast and safe.

VmDispatcher is an operation demultiplexer: it takes one single operation and dispatches N more operations, each to be run on a different domain.
Now, Vdsm acknowledges that domains can get blocked, and it tries hard to avoid them when doing this dispatching. To do so, it calls another libvirt API, virDomainGetControlInfo(). This is supposed to be fast and safe, but it looks like it can block/delay if the domain object is locked (e.g. by another virDomainGetAllDomainStats()) - this, however, still needs to be verified.

Please note that this design was established after talking with libvirt developers, so this is among the best we can do.

There are no other viable approaches besides just going ahead and submitting more work to libvirt, at the risk of clogging it even more.

Comment 11 Francesco Romani 2018-02-06 09:31:00 UTC
Another step we could take now: please update to the latest stable 4.1.z Vdsm. This will *not* solve the issue (not sure Vdsm can actually solve it at all, at the moment), but we merged a patch that adds more logging when the executor is starving/stalling, so we can get better insight into the issue.

Comment 12 Steffen Froemer 2018-02-07 00:01:53 UTC
(In reply to Francesco Romani from comment #11)
> Another step we could do now: please update to the last stable 4.1.z Vdsm.
> This will *not* solve the issue (not sure Vdsm can actually solve it at all,
> at the moment), but we merged patch to add more logs when executor is
> starving/stalling, so we can get better insights at the issue.

What information should be collected from the customer?
Do we need to adjust some debug level? Please advise.

Thanks

Comment 13 Yaniv Kaul 2018-02-07 06:32:27 UTC
(In reply to Steffen Froemer from comment #12)
> (In reply to Francesco Romani from comment #11)
> > Another step we could do now: please update to the last stable 4.1.z Vdsm.
> > This will *not* solve the issue (not sure Vdsm can actually solve it at all,
> > at the moment), but we merged patch to add more logs when executor is
> > starving/stalling, so we can get better insights at the issue.
> 
> What Information should be collected from customer?
> Do we adjust some debug-level? Please advice.
> 
> Thanks

Upgrading vdsm and restarting should suffice, as it'll print a list of stuck tasks.

Comment 14 Jaroslav Spanko 2018-02-07 13:53:12 UTC
(In reply to Francesco Romani from comment #11)
> Another step we could do now: please update to the last stable 4.1.z Vdsm.
> This will *not* solve the issue (not sure Vdsm can actually solve it at all,
> at the moment), but we merged patch to add more logs when executor is
> starving/stalling, so we can get better insights at the issue.

Hi Francesco
Are we speaking about ce265f7d6de80a725f1fd0aed0b6dbb826d18c75 ? 
I'm asking because I did not find this commit in 4.1.z vdsm.
Thx

Comment 15 Francesco Romani 2018-02-08 06:30:00 UTC
(In reply to Jaroslav Spanko from comment #14)
> (In reply to Francesco Romani from comment #11)
> > Another step we could do now: please update to the last stable 4.1.z Vdsm.
> > This will *not* solve the issue (not sure Vdsm can actually solve it at all,
> > at the moment), but we merged patch to add more logs when executor is
> > starving/stalling, so we can get better insights at the issue.
> 
> Hi Francesco
> Are we speaking about ce265f7d6de80a725f1fd0aed0b6dbb826d18c75 ? 
> I'm asking because i did not find this commit in 4.1.z vdsm
> Thx

Hi Jaroslav

Indeed that commit will help even more, and indeed it is not backported yet; but I was referring to https://gerrit.ovirt.org/#/q/I23ec1dce83c13fb45dc7097abe08e96d2775c98b

I think (warning: personal opinion only) we will backport ce265f7d6de80a725f1fd0aed0b6dbb826d18c75 later, I'd like it to be a bit more battle-tested.

Comment 16 Francesco Romani 2018-03-06 10:51:00 UTC
NEEDINFO because of https://bugzilla.redhat.com/show_bug.cgi?id=1537148#c12

Comment 19 Michal Skrivanek 2018-04-16 13:10:09 UTC
ping?

Comment 23 Yaniv Kaul 2018-04-24 06:40:40 UTC
Should we use VIR_CONNECT_GET_ALL_DOMAINS_STATS_RUNNING when collecting stats for domains? Why would we collect libvirt stats from any VMs that are not in running state?

Comment 24 Francesco Romani 2018-04-24 06:57:31 UTC
(In reply to Yaniv Kaul from comment #23)
> Should we use VIR_CONNECT_GET_ALL_DOMAINS_STATS_RUNNING when collecting
> stats for domains? Why would we collect libvirt stats from any VMs that are
> not in running state?

It should make no difference in the expected use case, but since it looks more correct, let me prepare a patch for that.
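The proposed change amounts to passing a RUNNING-only flag to the bulk-stats call. A hedged sketch (the wrapper and the fake connection are illustrative; with the real libvirt-python bindings you would pass `libvirt.VIR_CONNECT_GET_ALL_DOMAINS_STATS_RUNNING` to `conn.getAllDomainStats()`):

```python
# Sketch of the proposed change (hypothetical wrapper, not the actual
# patch): restrict the bulk-stats call to running domains only, so
# non-running VMs are never touched during sampling. The constant value
# below is assumed for illustration; use libvirt's
# VIR_CONNECT_GET_ALL_DOMAINS_STATS_RUNNING in real code.
GET_ALL_DOMAINS_STATS_RUNNING = 1 << 4  # assumed value, for illustration


def sample_running_domains(conn):
    """Collect bulk stats for running domains only."""
    return conn.getAllDomainStats(flags=GET_ALL_DOMAINS_STATS_RUNNING)


class FakeConn:
    """Stand-in for a libvirt connection; records the flags it receives."""

    def __init__(self):
        self.last_flags = None

    def getAllDomainStats(self, flags=0):
        self.last_flags = flags
        return []


conn = FakeConn()
sample_running_domains(conn)
assert conn.last_flags == GET_ALL_DOMAINS_STATS_RUNNING
```

As comment 24 notes, this should make no difference in the expected use case (sampled VMs are normally running anyway), but it is the more correct call.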

Comment 25 Yaniv Kaul 2018-05-17 13:45:28 UTC
(In reply to Francesco Romani from comment #24)
> (In reply to Yaniv Kaul from comment #23)
> > Should we use VIR_CONNECT_GET_ALL_DOMAINS_STATS_RUNNING when collecting
> > stats for domains? Why would we collect libvirt stats from any VMs that are
> > not in running state?
> 
> It should make no difference in the expected use case, but since it looks
> more correct, let me prepare a patch for that.

Any news?

Comment 26 Francesco Romani 2018-05-17 13:53:12 UTC
(In reply to Yaniv Kaul from comment #25)
> (In reply to Francesco Romani from comment #24)
> > (In reply to Yaniv Kaul from comment #23)
> > > Should we use VIR_CONNECT_GET_ALL_DOMAINS_STATS_RUNNING when collecting
> > > stats for domains? Why would we collect libvirt stats from any VMs that are
> > > not in running state?
> > 
> > It should make no difference in the expected use case, but since it looks
> > more correct, let me prepare a patch for that.
> 
> Any news?

Yep: https://gerrit.ovirt.org/#/c/90548/
low priority, currently under verification.

Comment 31 RHV bug bot 2018-07-02 15:34:32 UTC
WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Found non-acked flags: '{'rhevm-4.2.z': '?'}', ]

For more info please contact: rhv-devops

Comment 32 Dusan Fodor 2018-07-12 14:24:25 UTC
Please clone properly.

Comment 34 RHV bug bot 2018-07-24 12:27:15 UTC
WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Found non-acked flags: '{'rhevm-4.2.z': '?'}', ]

For more info please contact: rhv-devops

Comment 36 Petr Matyáš 2018-12-06 11:50:16 UTC
Verified on vdsm-4.30.3-1.el7ev.x86_64

Comment 38 errata-xmlrpc 2019-05-08 12:35:59 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:1077