Description of problem:

sampling.HostMonitor is an exclusive operation, which prevents it from being rescheduled when it gets stuck. But in the case where the operation fails not because of a timeout but because the executor queue is full, the operation is not rescheduled and host stats are not collected. Rescheduling the operation in this case is a legitimate approach even though it is an exclusive operation, since it is clear that the operation is not running.

Version-Release number of selected component (if applicable):
vdsm-4.19.31-1

How reproducible:
100%

Steps to Reproduce:
1. Make the executor queue full

Actual results:
HostMonitor is not rescheduled

Expected results:
HostMonitor is rescheduled

Additional info:

Error message:
WARN (vdsm.Scheduler) [virt.periodic.Operation] could not run <vdsm.virt.sampling.HostMonitor object at 0x215f450>, executor queue full (periodic:235)

Code (lib/vdsm/virt/periodic.py):

    ...
    def _dispatch(self):
        """
        Send `func' to Executor to be run as soon as possible.
        """
        self._call = None
        try:
            self._executor.dispatch(self, self._timeout,
                                    discard=self._discard)
        except executor.TooManyTasks:
            self._log.warning('could not run %s, executor queue full',
                              self._func)
        finally:
            if not self._exclusive:
                self._reschedule()

The operation should get rescheduled when executor.TooManyTasks is caught.
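For illustration, a minimal, self-contained sketch of the rescheduling logic described above (the classes are simplified stand-ins, not the actual vdsm code): exclusive operations are only rescheduled from __call__, i.e. after the executor has actually run them, so when _dispatch hits TooManyTasks the operation is never put back on the scheduler.

    class TooManyTasks(Exception):
        pass

    class FakeExecutor:
        """Simplified stand-in for vdsm's executor; not the real class."""
        def __init__(self, full=False):
            self.full = full

        def dispatch(self, callable_, timeout, discard=True):
            if self.full:
                raise TooManyTasks()
            callable_()  # run inline; the real executor queues it for a worker

    class Operation:
        """Simplified stand-in for vdsm.virt.periodic.Operation."""
        def __init__(self, func, executor, exclusive=False):
            self._func = func
            self._executor = executor
            self._exclusive = exclusive
            self.rescheduled = False

        def _reschedule(self):
            self.rescheduled = True  # the real code re-registers with the Scheduler

        def _dispatch(self):
            try:
                self._executor.dispatch(self, timeout=1.0, discard=True)
            except TooManyTasks:
                print('could not run %s, executor queue full' % self._func)
            finally:
                if not self._exclusive:
                    self._reschedule()

        def __call__(self):
            # Exclusive operations are rescheduled only here, i.e. only after
            # the executor actually ran them.
            try:
                self._func()
            finally:
                if self._exclusive:
                    self._reschedule()

    # Queue full + exclusive=True: __call__ never runs, so nothing reschedules.
    op = Operation(lambda: None, FakeExecutor(full=True), exclusive=True)
    op._dispatch()
    print('rescheduled:', op.rescheduled)  # False -> monitoring silently stops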
logs? Why is it stuck?
(In reply to Michal Skrivanek from comment #2)
> logs? Why is it stuck?

Let me answer because Roman is not online yet.

It's not stuck. We think it fails to run (TooManyTasks) and is not rescheduled again, because this operation is set with 'exclusive=True', which only reschedules on __call__ and not in _dispatch, as it fails the check in the finally block when it gets an exception.

We don't have many logs except:

WARN (vdsm.Scheduler) [virt.periodic.Operation] could not run <vdsm.virt.sampling.HostMonitor object at 0x215f450>, executor queue full (periodic:235)

And the fact that all host stats are outdated in vdsm-client Host getStats.
Operation(
    sampling.HostMonitor(cif=cif),
    [...]
    exclusive=True  <--- HostMonitor uses exclusive=True since BZ 1419856
)

Something like this:

     def _dispatch(self):
         """
         Send `func' to Executor to be run as soon as possible.
         """
         self._call = None
         try:
             self._executor.dispatch(self, self._timeout,
                                     discard=self._discard)
         except executor.TooManyTasks:
             self._log.warning('could not run %s, executor queue full',
                               self._func)
+            if self._exclusive:
+                self._reschedule()
         finally:
             if not self._exclusive:
                 self._reschedule()

Or change the finally... Does it make sense?
The customer wants this bug fixed in RHV 4.1; can this be released in a z-stream?
(In reply to Germano Veit Michel from comment #3)
> (In reply to Michal Skrivanek from comment #2)
> > logs? Why is it stuck?
>
> Let me answer because Roman is not online yet.
>
> It's not stuck. We think it fails to run (TooManyTasks) and is not
> rescheduled again, because this operation is set with 'exclusive=True',
> which only reschedules on __call__ and not in _dispatch, as it fails the
> check in the finally block when it gets an exception.
>
> We don't have many logs except:
>
> WARN (vdsm.Scheduler) [virt.periodic.Operation] could not run
> <vdsm.virt.sampling.HostMonitor object at 0x215f450>, executor queue full
> (periodic:235)
>
> And the fact that all host stats are outdated in vdsm-client Host getStats.

Hi,

strange that you don't have more information, because we backported a fix for this:
https://bugzilla.redhat.com/show_bug.cgi?id=1490810
with this patch:
https://gerrit.ovirt.org/#/c/81788/
so you should be able to see which operations depleted the executor pool.
(In reply to Germano Veit Michel from comment #4)
> Operation(
>     sampling.HostMonitor(cif=cif),
>     [...]
>     exclusive=True  <--- HostMonitor uses exclusive=True since BZ 1419856
> )
>
> Something like this:
>
>      def _dispatch(self):
>          """
>          Send `func' to Executor to be run as soon as possible.
>          """
>          self._call = None
>          try:
>              self._executor.dispatch(self, self._timeout,
>                                      discard=self._discard)
>          except executor.TooManyTasks:
>              self._log.warning('could not run %s, executor queue full',
>                                self._func)
> +            if self._exclusive:
> +                self._reschedule()
>          finally:
>              if not self._exclusive:
>                  self._reschedule()
>
> Or change the finally... Does it make sense?

We made the HostMonitor exclusive for a reason:
https://gerrit.ovirt.org/#/c/73133/
so we need a change that accommodates both cases.

One thing that bothers me is why the executor got depleted in the first place; there is something wrong there. Both the VM periodic operations and the host monitoring rely on the same executor, so I'm afraid we could have some interference here.

Besides, there are knobs we can use to tune the executor behaviour without code changes, in vdsm.conf:

[sampling]

periodic_workers:
    Number of worker threads to serve the periodic tasks (default 4)

periodic_task_per_worker:
    Max number of tasks which can be queued on workers (default 100)

max_workers:
    Maximum number of worker threads to serve the periodic tasks at the same time (default 30)

Have you tried tuning those parameters? Could you share the results?
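For reference, such a tuning attempt would go into vdsm.conf or a drop-in file roughly like the snippet below; the file name and the values are only an example, not a recommendation from this bug:

    # Example only - illustrative values, not a recommendation.
    # e.g. /etc/vdsm/vdsm.conf.d/50-periodic.conf (then restart vdsmd)
    [sampling]
    periodic_workers = 8
    periodic_task_per_worker = 100
    max_workers = 30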
(In reply to Francesco Romani from comment #12)
> (In reply to Germano Veit Michel from comment #3)
> > (In reply to Michal Skrivanek from comment #2)
> > > logs? Why is it stuck?
> >
> > Let me answer because Roman is not online yet.
> >
> > It's not stuck. We think it fails to run (TooManyTasks) and is not
> > rescheduled again, because this operation is set with 'exclusive=True',
> > which only reschedules on __call__ and not in _dispatch, as it fails the
> > check in the finally block when it gets an exception.
> >
> > We don't have many logs except:
> >
> > WARN (vdsm.Scheduler) [virt.periodic.Operation] could not run
> > <vdsm.virt.sampling.HostMonitor object at 0x215f450>, executor queue full
> > (periodic:235)
> >
> > And the fact that all host stats are outdated in vdsm-client Host getStats.
>
> Hi,
>
> strange that you don't have more information, because we backported a fix
> for this:
> https://bugzilla.redhat.com/show_bug.cgi?id=1490810
> with this patch:
> https://gerrit.ovirt.org/#/c/81788/
> so you should be able to see which operations depleted the executor pool.

Never mind, it was fixed in 4.1.7, but this bug is reported against 4.1.6 :(

If you update to a more recent version of Vdsm, however, we have a better chance of understanding the issue. Is this possible, maybe on a canary box?
Tried to reproduce this in our lab. Even overloading the host, I can't get the executor flooded. Tons of VMBulkSampler and DriveWatermarkMonitor operations are being discarded though, like the customer saw.

I will ask the customer to upgrade so we can have a look at the state of his executor too.
(In reply to Germano Veit Michel from comment #16)
> Tried to reproduce this in our lab. Even overloading the host, I can't get
> the executor flooded. Tons of VMBulkSampler and DriveWatermarkMonitor
> operations are being discarded though, like the customer saw.
>
> I will ask the customer to upgrade so we can have a look at the state of
> his executor too.

Thanks, this will help.

However, HostMonitor - even with the exclusive flag - is supposed to be self-healing, so it should get back to normal automatically once the host is no longer overloaded. This is worth investigating; I'll have a look myself.
(In reply to Germano Veit Michel from comment #4)
> Operation(
>     sampling.HostMonitor(cif=cif),
>     [...]
>     exclusive=True  <--- HostMonitor uses exclusive=True since BZ 1419856
> )
>
> Something like this:
>
>      def _dispatch(self):
>          """
>          Send `func' to Executor to be run as soon as possible.
>          """
>          self._call = None
>          try:
>              self._executor.dispatch(self, self._timeout,
>                                      discard=self._discard)
>          except executor.TooManyTasks:
>              self._log.warning('could not run %s, executor queue full',
>                                self._func)
> +            if self._exclusive:
> +                self._reschedule()
>          finally:
>              if not self._exclusive:
>                  self._reschedule()
>
> Or change the finally... Does it make sense?

OK, after a careful inspection of the code: this is just a bug. Exclusive operations are not correctly rescheduled if they happen to find the executor queue full.

The fix is in this area, but I'm afraid it will need more work. Will work on something.
Thanks Francesco, that's what I thought: it needs to be rescheduled. Clearing the needinfo; let me know if you need anything else.

I'll open a different BZ if something fishy shows up in the executor dump (if we receive it).
Hi Francesco and Michal,

Just for completeness of this BZ.

What fills the executor on these mass VM migrations/startups is DriveWatermarkMonitor: 400 out of 400 tasks ;)

Once the load stabilizes (CPU no longer at 100%), things go back to normal and DriveWatermarkMonitor seems to complete in time, no longer filling the executor.

So I understand:
1) Mass migrations/startups -> host overloaded
2) DriveWatermarkMonitor tasks pile up (it is a quite frequent operation, the host is overloaded, and they can't complete)

The only thing that worries me is that we have 2 DriveWatermarkMonitor tasks queued for each VM in the executor. I understand this is scheduled every 2s, so the timeout should be 2s/2 = 1s. So how do we get 2 stuck operations for each VM if, 1s before running a new one, the previous one should have expired? If it weren't for this, instead of 400 tasks we would have half of that, because all VMs have 2 DriveWatermarkMonitor tasks stuck. Or is this just a sign that the executor workers are starving too?

Let me know if you want a new BZ for this. I'm not sure.

Thanks
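As a side note on the "400 out of 400" figure: if the executor queue limit is periodic_workers * periodic_task_per_worker (that relationship is an assumption here, based on the defaults quoted earlier, 4 and 100), the capacity works out to 400 tasks, and two queued DriveWatermarkMonitor tasks per VM would fill it with roughly 200 VMs:

    # Back-of-the-envelope check; assumes the queue limit is
    # periodic_workers * periodic_task_per_worker (defaults 4 and 100).
    periodic_workers = 4
    periodic_task_per_worker = 100
    queue_capacity = periodic_workers * periodic_task_per_worker  # 400

    tasks_per_vm = 2  # two DriveWatermarkMonitor tasks observed per VM
    print(queue_capacity, queue_capacity // tasks_per_vm)  # 400 tasks, 200 VMs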
> 2) DriveWatermarkMonitor tasks pile up (it is a quite frequent operation,
> the host is overloaded, and they can't complete)

Yeah, that's not surprising. Perhaps we do not really need a new BZ; we moved to high watermark events in 4.2 and no longer poll that heavily. If you have such an overloaded system, could you try the 4.2 beta?
(In reply to Germano Veit Michel from comment #26)
> Hi Francesco and Michal,
>
> Just for completeness of this BZ.
>
> What fills the executor on these mass VM migrations/startups is
> DriveWatermarkMonitor: 400 out of 400 tasks ;)

On migrations we should suspend that. On startup it may indeed be caused by an overloaded host. Well, that also applies to migrations, but it should be less severe.

> Once the load stabilizes (CPU no longer at 100%), things go back to normal
> and DriveWatermarkMonitor seems to complete in time, no longer filling the
> executor.
>
> So I understand:
> 1) Mass migrations/startups -> host overloaded
> 2) DriveWatermarkMonitor tasks pile up (it is a quite frequent operation,
> the host is overloaded, and they can't complete)

Yes, every 2s per VM per disk. No wonder the load is high. We have greatly improved this in 4.2, as Michal said, however.

> The only thing that worries me is that we have 2 DriveWatermarkMonitor
> tasks queued for each VM in the executor. I understand this is scheduled
> every 2s, so the timeout should be 2s/2 = 1s. So how do we get 2 stuck
> operations for each VM if, 1s before running a new one, the previous one
> should have expired? If it weren't for this, instead of 400 tasks we would
> have half of that, because all VMs have 2 DriveWatermarkMonitor tasks stuck.

This is indeed interesting. The most likely explanation is that libvirtd is responding (e.g. not stuck), but slow. This fits with the hypothesis of a severely overloaded host.
This is a production env, so I'll update them regarding upgrading to 4.2 as soon as GA is out. Thanks for your comments!
Hello Michael,

Could you please provide validation instructions for this bug?

We were thinking:
Having Host1 and Host2:
- Start VMs on Host1
- At the same time, start migration of VMs from Host1 to Host2
or
- At the same time, start migration of VMs from Host2 to Host1
(In reply to Daniel Gur from comment #31)
> Hello Michael,
>
> Could you please provide validation instructions for this bug?
>
> We were thinking:
> Having Host1 and Host2:
> - Start VMs on Host1
> - At the same time, start migration of VMs from Host1 to Host2
> or
> - At the same time, start migration of VMs from Host2 to Host1

Migration is not really critical for this bug.

I'd start by reducing the periodic workers in /etc/vdsm/vdsm.conf (or a configlet under vdsm.conf.d):

[sampling]
periodic_workers = 1

(restart vdsmd)

This will make it more likely to fill the executor queue. Then run some VMs - the more the better - and make sure the periodic operations are slow; blocking access to storage should help with this.
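Put together, the suggested verification could look roughly like the sequence below; the drop-in file name is made up for the example, and blocking storage with iptables is only one possible way to slow the periodic operations down:

    # Reduce the periodic workers (example drop-in file name).
    cat > /etc/vdsm/vdsm.conf.d/99-verify.conf <<'EOF'
    [sampling]
    periodic_workers = 1
    EOF
    systemctl restart vdsmd

    # Start as many VMs as practical, then slow the periodic operations,
    # for example by temporarily blocking the storage server:
    #   iptables -A OUTPUT -d <storage-server-ip> -j DROP
    #
    # Watch /var/log/vdsm/vdsm.log for "executor queue full" warnings and
    # confirm that HostMonitor keeps running, i.e. host stats stay fresh:
    vdsm-client Host getStats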
Bug was verified with the following:

Lab topology:
- 1 host
- 1 engine

4.2 build: 4.2.2-0.1
200 VMs on the host

Scenario:
Restarted 200 VMs on the host; the host was overloaded with 100% CPU and "executor queue full" errors in the VDSM log. After the VMs started to come back up and CPU usage on the server dropped, the errors stopped and monitoring of the server resumed.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2018:1489