+++ This bug is a downstream clone. The original bug is: +++
+++ bug 1537148 +++
======================================================================

Description of problem:
The VMs periodically appear as not responding in the Manager - about 20 times per day, almost all VMs flip between NotResponding and Up.

engine.log
----------------------------------------------------
2018-01-17 10:12:14,242+01 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler6) [33b857dc] VM '2f21a782-0162-40cc-ab19-17d5f0aa1411'(xx) moved from 'NotResponding' --> 'Up'
2018-01-17 10:21:33,511+01 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler3) [6c268bc5] VM 'cf9fc0eb-d729-45d3-842e-0070bc51c800'(xx) moved from 'Up' --> 'NotResponding'
----------------------------------------------------

VMBulkSampler ran fine:
----------------------------------------------------
2018-01-17 10:10:19,559+0100 DEBUG (periodic/990) [virt.sampling.VMBulkSampler] sampled timestamp 5053130.11 elapsed 2.210 acquired True domains all (sampling:508)
2018-01-17 10:10:34,781+0100 DEBUG (periodic/997) [virt.sampling.VMBulkSampler] sampled timestamp 5053145.11 elapsed 2.440 acquired True domains all (sampling:508)
----------------------------------------------------

After that the sampler stopped:
----------------------------------------------------
2018-01-17 10:11:36,086+0100 WARN (jsonrpc/4) [virt.vm] (vmId='bb74a7a8-95fb-4e5b-a2d7-2f88be7b722f') monitor became unresponsive (command timeout, age=63.7399999993) (vm:5132)
2018-01-17 10:11:36,088+0100 WARN (jsonrpc/4) [virt.vm] (vmId='dc097117-ad3e-48a0-9969-6e2500b6f1b2') monitor became unresponsive (command timeout, age=63.7399999993) (vm:5132)
2018-01-17 10:12:07,153+0100 WARN (jsonrpc/4) [virt.vm] (vmId='acbb7da8-5c64-4cbd-8785-b61e4b4fae93') monitor became unresponsive (command timeout, age=94.8099999996) (vm:5132)
----------------------------------------------------

This continues until 10:12:07, when the sampler started working again at the 15-second interval:
----------------------------------------------------
2018-01-17 10:12:10,082+0100 DEBUG (periodic/994) [virt.sampling.VMBulkSampler] sampled timestamp 5053240.45 elapsed 2.400 acquired True domains all (sampling:508)
----------------------------------------------------

And again the latest VMBulkSampler:
----------------------------------------------------
2018-01-17 10:20:34,614+0100 DEBUG (periodic/992) [virt.sampling.VMBulkSampler] sampled timestamp 5053745.15 elapsed 2.230 acquired True domains all (sampling:508)
----------------------------------------------------

And the behavior repeated again and again:
----------------------------------------------------
2018-01-17 10:21:32,631+0100 WARN (jsonrpc/5) [virt.vm] (vmId='bb74a7a8-95fb-4e5b-a2d7-2f88be7b722f') monitor became unresponsive (command timeout, age=60.25) (vm:5132)
----------------------------------------------------

But no workers are discarded during that, as was the case in https://bugzilla.redhat.com/show_bug.cgi?id=1451703. Also, the customer is running a higher version of RHH/vdsm.
There are delayed calls, but not periodically, which could explain https://bugzilla.redhat.com/show_bug.cgi?id=1430847.

Version-Release number of selected component (if applicable):
rhevm-4.1.6
vdsm-4.19.42-1.el7ev.x86_64
libvirt-python-3.2.0-3.el7_4.1.x86_64

How reproducible:
100%

Actual results:
The VMs are switching between NotResponding -> Up periodically, 2x per hour.

Additional info:
A gcore of vdsm/supervdsm will be uploaded, also the latest vdsm log.

Thanks

(Originally by Jaroslav Spanko)
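For context on the "monitor became unresponsive (command timeout, age=...)" warnings quoted in the description, the sketch below illustrates the kind of age-versus-timeout check that produces such a message: the time since the last successful sampling of a VM is compared against a timeout, and the VM is reported as not responding when it is exceeded. The class, method names and the 60-second threshold are illustrative assumptions, not the actual vdsm code.

# Minimal sketch only: an age-vs-timeout responsiveness check.
# Names, structure and the 60s threshold are assumptions, not vdsm code.
import time
import logging

MONITOR_TIMEOUT = 60.0  # seconds (hypothetical value)

class VmMonitorState:
    def __init__(self, vm_id):
        self.vm_id = vm_id
        self._last_sample = time.monotonic()

    def on_successful_sample(self):
        # Called each time a bulk sampling cycle returned stats for this VM.
        self._last_sample = time.monotonic()

    def is_responsive(self):
        # If no sample arrived for longer than the timeout, the VM is
        # reported unresponsive (shown as NotResponding in the engine).
        age = time.monotonic() - self._last_sample
        if age > MONITOR_TIMEOUT:
            logging.warning("(vmId=%r) monitor became unresponsive "
                            "(command timeout, age=%.2f)", self.vm_id, age)
            return False
        return True

In this reading, the warnings are a downstream symptom: whenever sampling stalls for longer than the timeout, every VM's age crosses the threshold at roughly the same time, which matches the mass NotResponding/Up flipping seen in engine.log.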
Summary: all the (scarce) information we can extract from the logs points in the direction of libvirt slowness, in turn caused by slow storage, so we should be in the same category as https://bugzilla.redhat.com/show_bug.cgi?id=1430847. Vdsm is reacting the best it can to this slowness, but the reported unresponsive VMs are a symptom rather than the root issue. In this sense, Vdsm is doing its job correctly. More detailed analysis follows.

+++

What we see now is something really weird. Here the sampling works as expected:

2018-01-19 09:07:39,528+0100 DEBUG (periodic/997) [virt.sampling.VMBulkSampler] sampled timestamp 5222169.96 elapsed 2.330 acquired True domains all (sampling:508)
2018-01-19 09:07:54,524+0100 DEBUG (periodic/997) [virt.sampling.VMBulkSampler] sampled timestamp 5222184.96 elapsed 2.330 acquired True domains all (sampling:508)
2018-01-19 09:08:09,466+0100 DEBUG (periodic/997) [virt.sampling.VMBulkSampler] sampled timestamp 5222199.96 elapsed 2.270 acquired True domains all (sampling:508)
2018-01-19 09:08:24,533+0100 DEBUG (periodic/997) [virt.sampling.VMBulkSampler] sampled timestamp 5222214.96 elapsed 2.340 acquired True domains all (sampling:508)
2018-01-19 09:08:39,569+0100 DEBUG (periodic/997) [virt.sampling.VMBulkSampler] sampled timestamp 5222229.96 elapsed 2.370 acquired True domains all (sampling:508)
2018-01-19 09:08:54,484+0100 DEBUG (periodic/997) [virt.sampling.VMBulkSampler] sampled timestamp 5222244.97 elapsed 2.280 acquired True domains all (sampling:508)
2018-01-19 09:09:09,225+0100 DEBUG (periodic/998) [virt.sampling.VMBulkSampler] sampled timestamp 5222259.97 elapsed 2.020 acquired True domains all (sampling:508)
2018-01-19 09:09:24,228+0100 DEBUG (periodic/998) [virt.sampling.VMBulkSampler] sampled timestamp 5222274.97 elapsed 2.020 acquired True domains all (sampling:508)
2018-01-19 09:09:39,578+0100 DEBUG (periodic/994) [virt.sampling.VMBulkSampler] sampled timestamp 5222289.97 elapsed 2.370 acquired True domains all (sampling:508)
2018-01-19 09:09:54,385+0100 DEBUG (periodic/997) [virt.sampling.VMBulkSampler] sampled timestamp 5222304.98 elapsed 2.170 acquired True domains all (sampling:508)
2018-01-19 09:10:09,170+0100 DEBUG (periodic/994) [virt.sampling.VMBulkSampler] sampled timestamp 5222319.97 elapsed 1.960 acquired True domains all (sampling:508)
2018-01-19 09:10:24,270+0100 DEBUG (periodic/994) [virt.sampling.VMBulkSampler] sampled timestamp 5222334.98 elapsed 2.060 acquired True domains all (sampling:508)
2018-01-19 09:10:39,832+0100 DEBUG (periodic/994) [virt.sampling.VMBulkSampler] sampled timestamp 5222349.97 elapsed 2.630 acquired True domains all (sampling:508)

Then something happens:

2018-01-19 09:10:39,832+0100 DEBUG (periodic/994) [virt.sampling.VMBulkSampler] sampled timestamp 5222349.97 elapsed 2.630 acquired True domains all (sampling:508)
2018-01-19 09:12:24,174+0100 DEBUG (periodic/1000) [virt.sampling.VMBulkSampler] sampled timestamp 5222454.91 elapsed 2.030 acquired True domains all (sampling:508)

Notice the ~2 minute gap; the sampling should happen every 15s, not every 120s.

Then we have a burst of samplings:

2018-01-19 09:12:24,924+0100 DEBUG (periodic/997) [virt.sampling.VMBulkSampler] sampled timestamp 5222455.61 elapsed 2.080 acquired False domains 81 (sampling:508)
2018-01-19 09:12:26,054+0100 DEBUG (periodic/998) [virt.sampling.VMBulkSampler] sampled timestamp 5222456.75 elapsed 2.070 acquired False domains 80 (sampling:508)
2018-01-19 09:12:26,112+0100 DEBUG (periodic/994) [virt.sampling.VMBulkSampler] sampled timestamp 5222457.27 elapsed 1.610 acquired False domains 75 (sampling:508)
2018-01-19 09:12:26,134+0100 DEBUG (periodic/1000) [virt.sampling.VMBulkSampler] sampled timestamp 5222457.16 elapsed 1.740 acquired True domains 78 (sampling:508)
2018-01-19 09:12:26,923+0100 DEBUG (periodic/997) [virt.sampling.VMBulkSampler] sampled timestamp 5222458.02 elapsed 1.670 acquired False domains 72 (sampling:508)
2018-01-19 09:12:28,324+0100 DEBUG (periodic/994) [virt.sampling.VMBulkSampler] sampled timestamp 5222459.51 elapsed 1.580 acquired True domains 71 (sampling:508)
2018-01-19 09:12:39,231+0100 DEBUG (periodic/1000) [virt.sampling.VMBulkSampler] sampled timestamp 5222469.98 elapsed 2.020 acquired True domains 71 (sampling:508)

Interestingly enough, when this pattern occurs, the burst is 7-8 operations long, and 8*15s = 120s. Also note that workers *are* discarded from time to time, but much more rarely than these events.

So we have something that causes the executor to slow down but *not* to discard workers. So it must be in the dispatch stage (VmDispatcher), which is 1. run inside the executor but 2. does *not* have a timeout: it is supposed to be fast and safe.

VmDispatcher is a one-operation demultiplexer: it takes one single operation and dispatches N more operations, each one to be run on a different domain. Now, Vdsm acknowledges that domains can get blocked, and tries hard to avoid them when doing this dispatching. To do so, it calls another libvirt API, virDomainGetControlInfo(). This is supposed to be fast and safe, but it looks like it can block/delay if the domain object is locked (e.g. by another virDomainGetAllDomainStats()) - this, however, needs to be verified.

Please note that this design was established after talking with libvirt developers, so it is among the best we can do. There are no other viable approaches besides just going ahead and submitting more work to libvirt, at the risk of clogging it even more.

(Originally by Francesco Romani)
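To make the dispatch/skip mechanism described above more concrete, here is a hedged sketch written against the public libvirt-python API: query virDomainGetControlInfo() per domain and only submit bulk stats work for domains whose monitor looks free. This is not the actual VmDispatcher code; the function names and overall structure are illustrative assumptions.

# Hedged sketch of the dispatch pattern discussed above, using the public
# libvirt-python API; not the vdsm VmDispatcher implementation.
import libvirt

def responsive_domains(conn):
    """Return active domains whose QEMU monitor is not currently busy."""
    good = []
    for dom in conn.listAllDomains(libvirt.VIR_CONNECT_LIST_DOMAINS_ACTIVE):
        try:
            # virDomainGetControlInfo(): meant to be fast, but as noted above
            # it may itself be delayed if the domain object is locked.
            state, _details, _state_time = dom.controlInfo()
        except libvirt.libvirtError:
            continue  # treat errors as "do not touch this domain right now"
        if state == libvirt.VIR_DOMAIN_CONTROL_OK:
            good.append(dom)
    return good

def bulk_sample(conn):
    # Collect stats only for domains that look responsive, so that one stuck
    # domain does not hold the whole bulk call hostage.
    doms = responsive_domains(conn)
    return conn.domainListGetStats(doms, 0, 0) if doms else []

if __name__ == "__main__":
    conn = libvirt.openReadOnly("qemu:///system")
    for dom, stats in bulk_sample(conn):
        print(dom.name(), len(stats))

The trade-off this highlights is exactly the one in the comment above: the pre-check is only useful if it is itself cheap; if controlInfo() can block while libvirt is busy with a slow bulk stats call, the dispatch stage slows down without ever hitting a worker timeout.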
Another step we could do now: please update to the latest stable 4.1.z Vdsm. This will *not* solve the issue (not sure Vdsm can actually solve it at all, at the moment), but we merged a patch to add more logging when the executor is starving/stalling, so we can get better insight into the issue.

(Originally by Francesco Romani)
(In reply to Francesco Romani from comment #11)
> Another step we could do now: please update to the latest stable 4.1.z Vdsm.
> This will *not* solve the issue (not sure Vdsm can actually solve it at all,
> at the moment), but we merged a patch to add more logging when the executor
> is starving/stalling, so we can get better insight into the issue.

What information should be collected from the customer?
Do we adjust some debug level? Please advise.

Thanks

(Originally by Steffen Froemer)
(In reply to Steffen Froemer from comment #12)
> (In reply to Francesco Romani from comment #11)
> > Another step we could do now: please update to the latest stable 4.1.z Vdsm.
> > This will *not* solve the issue (not sure Vdsm can actually solve it at all,
> > at the moment), but we merged a patch to add more logging when the executor
> > is starving/stalling, so we can get better insight into the issue.
>
> What information should be collected from the customer?
> Do we adjust some debug level? Please advise.
>
> Thanks

Upgrading vdsm and restarting should suffice, as it would print a list of stuck tasks.

(Originally by Yaniv Kaul)
(In reply to Francesco Romani from comment #11)
> Another step we could do now: please update to the latest stable 4.1.z Vdsm.
> This will *not* solve the issue (not sure Vdsm can actually solve it at all,
> at the moment), but we merged a patch to add more logging when the executor
> is starving/stalling, so we can get better insight into the issue.

Hi Francesco,
Are we speaking about ce265f7d6de80a725f1fd0aed0b6dbb826d18c75? I'm asking because I did not find this commit in 4.1.z vdsm.
Thx

(Originally by Jaroslav Spanko)
(In reply to Jaroslav Spanko from comment #14)
> (In reply to Francesco Romani from comment #11)
> > Another step we could do now: please update to the latest stable 4.1.z Vdsm.
> > This will *not* solve the issue (not sure Vdsm can actually solve it at all,
> > at the moment), but we merged a patch to add more logging when the executor
> > is starving/stalling, so we can get better insight into the issue.
>
> Hi Francesco,
> Are we speaking about ce265f7d6de80a725f1fd0aed0b6dbb826d18c75?
> I'm asking because I did not find this commit in 4.1.z vdsm.
> Thx

Hi Jaroslav,
Indeed that commit will help even more, and indeed it is not backported yet; but I was referring to https://gerrit.ovirt.org/#/q/I23ec1dce83c13fb45dc7097abe08e96d2775c98b

I think (warning: personal opinion only) we will backport ce265f7d6de80a725f1fd0aed0b6dbb826d18c75 later; I'd like it to be a bit more battle-tested.

(Originally by Francesco Romani)
NEEDINFO because of https://bugzilla.redhat.com/show_bug.cgi?id=1537148#c12 (Originally by Francesco Romani)
ping? (Originally by michal.skrivanek)
Should we use VIR_CONNECT_GET_ALL_DOMAINS_STATS_RUNNING when collecting stats for domains? Why would we collect libvirt stats from any VMs that are not in running state? (Originally by Yaniv Kaul)
(In reply to Yaniv Kaul from comment #23)
> Should we use VIR_CONNECT_GET_ALL_DOMAINS_STATS_RUNNING when collecting
> stats for domains? Why would we collect libvirt stats from any VMs that are
> not in running state?

It should make no difference in the expected use case, but since it looks more correct, let me prepare a patch for that.

(Originally by Francesco Romani)
(In reply to Francesco Romani from comment #24)
> (In reply to Yaniv Kaul from comment #23)
> > Should we use VIR_CONNECT_GET_ALL_DOMAINS_STATS_RUNNING when collecting
> > stats for domains? Why would we collect libvirt stats from any VMs that are
> > not in running state?
>
> It should make no difference in the expected use case, but since it looks
> more correct, let me prepare a patch for that.

Any news?

(Originally by Yaniv Kaul)
(In reply to Yaniv Kaul from comment #25)
> (In reply to Francesco Romani from comment #24)
> > (In reply to Yaniv Kaul from comment #23)
> > > Should we use VIR_CONNECT_GET_ALL_DOMAINS_STATS_RUNNING when collecting
> > > stats for domains? Why would we collect libvirt stats from any VMs that are
> > > not in running state?
> >
> > It should make no difference in the expected use case, but since it looks
> > more correct, let me prepare a patch for that.
>
> Any news?

Yep: https://gerrit.ovirt.org/#/c/90548/ - low priority, currently under verification.

(Originally by Francesco Romani)
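For reference, restricting bulk stats collection to running domains with the flag discussed above looks roughly like the following with the public libvirt-python API. This is only a sketch of the flag's use, not the content of the gerrit change linked above.

# Sketch of VIR_CONNECT_GET_ALL_DOMAINS_STATS_RUNNING in use; illustrative
# only, not the gerrit change referenced in the comment above.
import libvirt

conn = libvirt.openReadOnly("qemu:///system")

# stats=0 asks for every stats group the hypervisor supports; the flag
# limits the call to domains currently in the running state.
results = conn.getAllDomainStats(
    stats=0,
    flags=libvirt.VIR_CONNECT_GET_ALL_DOMAINS_STATS_RUNNING)

for dom, stats in results:
    print(dom.name(), len(stats))

As noted in comment #24, this should make no practical difference in the expected use case (vdsm only monitors running VMs anyway), but it expresses the intent more precisely.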
WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason: [Found non-acked flags: '{'rhevm-4.2.z': '?'}', ] For more info please contact: rhv-devops

(Originally by rhv-bugzilla-bot)
Please clone properly. (Originally by Dusan Fodor)
Verified on vdsm-4.20.35-1.el7ev.x86_64
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2319