+++ This bug is a downstream clone. The original bug is: +++
+++ bug 1537148 +++
======================================================================

Description of problem:
The VMs periodically appear as not responding in the Manager - about 20 times per day, almost all VMs flip between NotResponding and Up.

engine.log
----------------------------------------------------
2018-01-17 10:12:14,242+01 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler6) [33b857dc] VM '2f21a782-0162-40cc-ab19-17d5f0aa1411'(xx) moved from 'NotResponding' --> 'Up'
2018-01-17 10:21:33,511+01 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler3) [6c268bc5] VM 'cf9fc0eb-d729-45d3-842e-0070bc51c800'(xx) moved from 'Up' --> 'NotResponding'
----------------------------------------------------

VMBulkSampler ran fine:
----------------------------------------------------
2018-01-17 10:10:19,559+0100 DEBUG (periodic/990) [virt.sampling.VMBulkSampler] sampled timestamp 5053130.11 elapsed 2.210 acquired True domains all (sampling:508)
2018-01-17 10:10:34,781+0100 DEBUG (periodic/997) [virt.sampling.VMBulkSampler] sampled timestamp 5053145.11 elapsed 2.440 acquired True domains all (sampling:508)
----------------------------------------------------

After that the sampler stopped:
----------------------------------------------------
2018-01-17 10:11:36,086+0100 WARN (jsonrpc/4) [virt.vm] (vmId='bb74a7a8-95fb-4e5b-a2d7-2f88be7b722f') monitor became unresponsive (command timeout, age=63.7399999993) (vm:5132)
2018-01-17 10:11:36,088+0100 WARN (jsonrpc/4) [virt.vm] (vmId='dc097117-ad3e-48a0-9969-6e2500b6f1b2') monitor became unresponsive (command timeout, age=63.7399999993) (vm:5132)
2018-01-17 10:12:07,153+0100 WARN (jsonrpc/4) [virt.vm] (vmId='acbb7da8-5c64-4cbd-8785-b61e4b4fae93') monitor became unresponsive (command timeout, age=94.8099999996) (vm:5132)
----------------------------------------------------

This continues until 10:12:07, when the sampler started working again at the 15-second interval:
----------------------------------------------------
2018-01-17 10:12:10,082+0100 DEBUG (periodic/994) [virt.sampling.VMBulkSampler] sampled timestamp 5053240.45 elapsed 2.400 acquired True domains all (sampling:508)
----------------------------------------------------

And again the latest VMBulkSampler:
----------------------------------------------------
2018-01-17 10:20:34,614+0100 DEBUG (periodic/992) [virt.sampling.VMBulkSampler] sampled timestamp 5053745.15 elapsed 2.230 acquired True domains all (sampling:508)
----------------------------------------------------

And the behavior repeated again and again:
----------------------------------------------------
2018-01-17 10:21:32,631+0100 WARN (jsonrpc/5) [virt.vm] (vmId='bb74a7a8-95fb-4e5b-a2d7-2f88be7b722f') monitor became unresponsive (command timeout, age=60.25) (vm:5132)
----------------------------------------------------

But no workers are discarded during that, as was the case in https://bugzilla.redhat.com/show_bug.cgi?id=1451703. Also, the customer is running a higher version of RHH/vdsm.
There are delayed calls, but not periodically, which could explain https://bugzilla.redhat.com/show_bug.cgi?id=1430847.

Version-Release number of selected component (if applicable):
rhevm-4.1.6
vdsm-4.19.42-1.el7ev.x86_64
libvirt-python-3.2.0-3.el7_4.1.x86_64

How reproducible:
100%

Actual results:
The VMs are switching between NotResponding -> Up periodically, 2x per hour.

Additional info:
A gcore of vdsm/supervdsm will be uploaded, also the latest vdsm log.

Thanks

(Originally by Jaroslav Spanko)
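For context on the "monitor became unresponsive (command timeout, age=...)" warnings quoted in the description, the sketch below illustrates the kind of age-versus-timeout check that produces such a message: the time since the last successful sampling of a VM is compared against a timeout, and the VM is reported as not responding when it is exceeded. The class, method names and the 60-second threshold are illustrative assumptions, not the actual vdsm code.

# Minimal sketch only: an age-vs-timeout responsiveness check.
# Names, structure and the 60s threshold are assumptions, not vdsm code.
import time
import logging

MONITOR_TIMEOUT = 60.0  # seconds (hypothetical value)

class VmMonitorState:
    def __init__(self, vm_id):
        self.vm_id = vm_id
        self._last_sample = time.monotonic()

    def on_successful_sample(self):
        # Called each time a bulk sampling cycle returned stats for this VM.
        self._last_sample = time.monotonic()

    def is_responsive(self):
        # If no sample arrived for longer than the timeout, the VM is
        # reported unresponsive (shown as NotResponding in the engine).
        age = time.monotonic() - self._last_sample
        if age > MONITOR_TIMEOUT:
            logging.warning("(vmId=%r) monitor became unresponsive "
                            "(command timeout, age=%.2f)", self.vm_id, age)
            return False
        return True

In this reading, the warnings are a downstream symptom: whenever sampling stalls for longer than the timeout, every VM's age crosses the threshold at roughly the same time, which matches the mass NotResponding/Up flipping seen in engine.log.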
Summary: all the (scarce) information we can extract from the logs points in the direction of libvirt slowness, in turn caused by slow storage, so we should be in the same category as https://bugzilla.redhat.com/show_bug.cgi?id=1430847. Vdsm is reacting the best it can to this slowness, but the reported unresponsive VMs are a symptom rather than the root issue. In this sense, Vdsm is doing its job correctly. More detailed analysis follows.

+++

What we see now is something really weird. Here the sampling works as expected:

2018-01-19 09:07:39,528+0100 DEBUG (periodic/997) [virt.sampling.VMBulkSampler] sampled timestamp 5222169.96 elapsed 2.330 acquired True domains all (sampling:508)
2018-01-19 09:07:54,524+0100 DEBUG (periodic/997) [virt.sampling.VMBulkSampler] sampled timestamp 5222184.96 elapsed 2.330 acquired True domains all (sampling:508)
2018-01-19 09:08:09,466+0100 DEBUG (periodic/997) [virt.sampling.VMBulkSampler] sampled timestamp 5222199.96 elapsed 2.270 acquired True domains all (sampling:508)
2018-01-19 09:08:24,533+0100 DEBUG (periodic/997) [virt.sampling.VMBulkSampler] sampled timestamp 5222214.96 elapsed 2.340 acquired True domains all (sampling:508)
2018-01-19 09:08:39,569+0100 DEBUG (periodic/997) [virt.sampling.VMBulkSampler] sampled timestamp 5222229.96 elapsed 2.370 acquired True domains all (sampling:508)
2018-01-19 09:08:54,484+0100 DEBUG (periodic/997) [virt.sampling.VMBulkSampler] sampled timestamp 5222244.97 elapsed 2.280 acquired True domains all (sampling:508)
2018-01-19 09:09:09,225+0100 DEBUG (periodic/998) [virt.sampling.VMBulkSampler] sampled timestamp 5222259.97 elapsed 2.020 acquired True domains all (sampling:508)
2018-01-19 09:09:24,228+0100 DEBUG (periodic/998) [virt.sampling.VMBulkSampler] sampled timestamp 5222274.97 elapsed 2.020 acquired True domains all (sampling:508)
2018-01-19 09:09:39,578+0100 DEBUG (periodic/994) [virt.sampling.VMBulkSampler] sampled timestamp 5222289.97 elapsed 2.370 acquired True domains all (sampling:508)
2018-01-19 09:09:54,385+0100 DEBUG (periodic/997) [virt.sampling.VMBulkSampler] sampled timestamp 5222304.98 elapsed 2.170 acquired True domains all (sampling:508)
2018-01-19 09:10:09,170+0100 DEBUG (periodic/994) [virt.sampling.VMBulkSampler] sampled timestamp 5222319.97 elapsed 1.960 acquired True domains all (sampling:508)
2018-01-19 09:10:24,270+0100 DEBUG (periodic/994) [virt.sampling.VMBulkSampler] sampled timestamp 5222334.98 elapsed 2.060 acquired True domains all (sampling:508)
2018-01-19 09:10:39,832+0100 DEBUG (periodic/994) [virt.sampling.VMBulkSampler] sampled timestamp 5222349.97 elapsed 2.630 acquired True domains all (sampling:508)

Then something happens:

2018-01-19 09:10:39,832+0100 DEBUG (periodic/994) [virt.sampling.VMBulkSampler] sampled timestamp 5222349.97 elapsed 2.630 acquired True domains all (sampling:508)
2018-01-19 09:12:24,174+0100 DEBUG (periodic/1000) [virt.sampling.VMBulkSampler] sampled timestamp 5222454.91 elapsed 2.030 acquired True domains all (sampling:508)

Notice the ~2 minute gap; the sampling should happen every 15s, not every 120s.

Then we have a burst of samplings:

2018-01-19 09:12:24,924+0100 DEBUG (periodic/997) [virt.sampling.VMBulkSampler] sampled timestamp 5222455.61 elapsed 2.080 acquired False domains 81 (sampling:508)
2018-01-19 09:12:26,054+0100 DEBUG (periodic/998) [virt.sampling.VMBulkSampler] sampled timestamp 5222456.75 elapsed 2.070 acquired False domains 80 (sampling:508)
2018-01-19 09:12:26,112+0100 DEBUG (periodic/994) [virt.sampling.VMBulkSampler] sampled timestamp 5222457.27 elapsed 1.610 acquired False domains 75 (sampling:508)
2018-01-19 09:12:26,134+0100 DEBUG (periodic/1000) [virt.sampling.VMBulkSampler] sampled timestamp 5222457.16 elapsed 1.740 acquired True domains 78 (sampling:508)
2018-01-19 09:12:26,923+0100 DEBUG (periodic/997) [virt.sampling.VMBulkSampler] sampled timestamp 5222458.02 elapsed 1.670 acquired False domains 72 (sampling:508)
2018-01-19 09:12:28,324+0100 DEBUG (periodic/994) [virt.sampling.VMBulkSampler] sampled timestamp 5222459.51 elapsed 1.580 acquired True domains 71 (sampling:508)
2018-01-19 09:12:39,231+0100 DEBUG (periodic/1000) [virt.sampling.VMBulkSampler] sampled timestamp 5222469.98 elapsed 2.020 acquired True domains 71 (sampling:508)

Interestingly enough, when this pattern occurs, the burst is 7-8 operations long, and 8*15s = 120s. Also note that workers *are* discarded from time to time, but much more rarely than these events.

So we have something that causes the executor to slow down but *not* to discard workers. So it must be in the dispatch stage (VmDispatcher), which is 1. run inside the executor but 2. does *not* have a timeout: it is supposed to be fast and safe.

VmDispatcher is a one-operation demultiplexer: it takes one single operation and dispatches N more operations, each one to be run on a different domain. Now, Vdsm acknowledges that domains can get blocked, and tries hard to avoid them when doing this dispatching. To do so, it calls another libvirt API, virDomainGetControlInfo(). This is supposed to be fast and safe, but it looks like it can block/delay if the domain object is locked (e.g. by another virDomainGetAllDomainStats()) - this, however, needs to be verified.

Please note that this design was established after talking with libvirt developers, so it is among the best we can do. There are no other viable approaches besides just going ahead and submitting more work to libvirt, at the risk of clogging it even more.

(Originally by Francesco Romani)
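To make the dispatch/skip mechanism described above more concrete, here is a hedged sketch written against the public libvirt-python API: query virDomainGetControlInfo() per domain and only submit bulk stats work for domains whose monitor looks free. This is not the actual VmDispatcher code; the function names and overall structure are illustrative assumptions.

# Hedged sketch of the dispatch pattern discussed above, using the public
# libvirt-python API; not the vdsm VmDispatcher implementation.
import libvirt

def responsive_domains(conn):
    """Return active domains whose QEMU monitor is not currently busy."""
    good = []
    for dom in conn.listAllDomains(libvirt.VIR_CONNECT_LIST_DOMAINS_ACTIVE):
        try:
            # virDomainGetControlInfo(): meant to be fast, but as noted above
            # it may itself be delayed if the domain object is locked.
            state, _details, _state_time = dom.controlInfo()
        except libvirt.libvirtError:
            continue  # treat errors as "do not touch this domain right now"
        if state == libvirt.VIR_DOMAIN_CONTROL_OK:
            good.append(dom)
    return good

def bulk_sample(conn):
    # Collect stats only for domains that look responsive, so that one stuck
    # domain does not hold the whole bulk call hostage.
    doms = responsive_domains(conn)
    return conn.domainListGetStats(doms, 0, 0) if doms else []

if __name__ == "__main__":
    conn = libvirt.openReadOnly("qemu:///system")
    for dom, stats in bulk_sample(conn):
        print(dom.name(), len(stats))

The trade-off this highlights is exactly the one in the comment above: the pre-check is only useful if it is itself cheap; if controlInfo() can block while libvirt is busy with a slow bulk stats call, the dispatch stage slows down without ever hitting a worker timeout.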
Another step we could do now: please update to the latest stable 4.1.z Vdsm. This will *not* solve the issue (not sure Vdsm can actually solve it at all, at the moment), but we merged a patch to add more logging when the executor is starving/stalling, so we can get better insight into the issue.

(Originally by Francesco Romani)
(In reply to Francesco Romani from comment #11)
> Another step we could do now: please update to the latest stable 4.1.z Vdsm.
> This will *not* solve the issue (not sure Vdsm can actually solve it at all,
> at the moment), but we merged a patch to add more logging when the executor
> is starving/stalling, so we can get better insight into the issue.

What information should be collected from the customer?
Do we adjust some debug level? Please advise.

Thanks

(Originally by Steffen Froemer)
(In reply to Steffen Froemer from comment #12)
> (In reply to Francesco Romani from comment #11)
> > Another step we could do now: please update to the latest stable 4.1.z Vdsm.
> > This will *not* solve the issue (not sure Vdsm can actually solve it at all,
> > at the moment), but we merged a patch to add more logging when the executor
> > is starving/stalling, so we can get better insight into the issue.
>
> What information should be collected from the customer?
> Do we adjust some debug level? Please advise.
>
> Thanks

Upgrading vdsm and restarting should suffice, as it would print a list of stuck tasks.

(Originally by Yaniv Kaul)
(In reply to Francesco Romani from comment #11)
> Another step we could do now: please update to the latest stable 4.1.z Vdsm.
> This will *not* solve the issue (not sure Vdsm can actually solve it at all,
> at the moment), but we merged a patch to add more logging when the executor
> is starving/stalling, so we can get better insight into the issue.

Hi Francesco,
Are we speaking about ce265f7d6de80a725f1fd0aed0b6dbb826d18c75? I'm asking because I did not find this commit in 4.1.z vdsm.
Thx

(Originally by Jaroslav Spanko)
(In reply to Jaroslav Spanko from comment #14)
> (In reply to Francesco Romani from comment #11)
> > Another step we could do now: please update to the latest stable 4.1.z Vdsm.
> > This will *not* solve the issue (not sure Vdsm can actually solve it at all,
> > at the moment), but we merged a patch to add more logging when the executor
> > is starving/stalling, so we can get better insight into the issue.
>
> Hi Francesco,
> Are we speaking about ce265f7d6de80a725f1fd0aed0b6dbb826d18c75?
> I'm asking because I did not find this commit in 4.1.z vdsm.
> Thx

Hi Jaroslav,
Indeed that commit will help even more, and indeed it is not backported yet; but I was referring to https://gerrit.ovirt.org/#/q/I23ec1dce83c13fb45dc7097abe08e96d2775c98b

I think (warning: personal opinion only) we will backport ce265f7d6de80a725f1fd0aed0b6dbb826d18c75 later; I'd like it to be a bit more battle-tested.

(Originally by Francesco Romani)
NEEDINFO because of https://bugzilla.redhat.com/show_bug.cgi?id=1537148#c12 (Originally by Francesco Romani)
ping? (Originally by michal.skrivanek)
Should we use VIR_CONNECT_GET_ALL_DOMAINS_STATS_RUNNING when collecting stats for domains? Why would we collect libvirt stats from any VMs that are not in running state? (Originally by Yaniv Kaul)
(In reply to Yaniv Kaul from comment #23)
> Should we use VIR_CONNECT_GET_ALL_DOMAINS_STATS_RUNNING when collecting
> stats for domains? Why would we collect libvirt stats from any VMs that are
> not in running state?

It should make no difference in the expected use case, but since it looks more correct, let me prepare a patch for that.

(Originally by Francesco Romani)
(In reply to Francesco Romani from comment #24)
> (In reply to Yaniv Kaul from comment #23)
> > Should we use VIR_CONNECT_GET_ALL_DOMAINS_STATS_RUNNING when collecting
> > stats for domains? Why would we collect libvirt stats from any VMs that are
> > not in running state?
>
> It should make no difference in the expected use case, but since it looks
> more correct, let me prepare a patch for that.

Any news?

(Originally by Yaniv Kaul)
(In reply to Yaniv Kaul from comment #25)
> (In reply to Francesco Romani from comment #24)
> > (In reply to Yaniv Kaul from comment #23)
> > > Should we use VIR_CONNECT_GET_ALL_DOMAINS_STATS_RUNNING when collecting
> > > stats for domains? Why would we collect libvirt stats from any VMs that are
> > > not in running state?
> >
> > It should make no difference in the expected use case, but since it looks
> > more correct, let me prepare a patch for that.
>
> Any news?

Yep: https://gerrit.ovirt.org/#/c/90548/ - low priority, currently under verification.

(Originally by Francesco Romani)
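For reference, restricting bulk stats collection to running domains with the flag discussed above looks roughly like the following with the public libvirt-python API. This is only a sketch of the flag's use, not the content of the gerrit change linked above.

# Sketch of VIR_CONNECT_GET_ALL_DOMAINS_STATS_RUNNING in use; illustrative
# only, not the gerrit change referenced in the comment above.
import libvirt

conn = libvirt.openReadOnly("qemu:///system")

# stats=0 asks for every stats group the hypervisor supports; the flag
# limits the call to domains currently in the running state.
results = conn.getAllDomainStats(
    stats=0,
    flags=libvirt.VIR_CONNECT_GET_ALL_DOMAINS_STATS_RUNNING)

for dom, stats in results:
    print(dom.name(), len(stats))

As noted in comment #24, this should make no practical difference in the expected use case (vdsm only monitors running VMs anyway), but it expresses the intent more precisely.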
WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason: [Found non-acked flags: '{'rhevm-4.2.z': '?'}', ] For more info please contact: rhv-devops

(Originally by rhv-bugzilla-bot)
Please clone properly. (Originally by Dusan Fodor)
Verified on vdsm-4.20.35-1.el7ev.x86_64
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2319