Description of problem:

VMs monitoring stops randomly on an Engine recompiled from sources on the master branch. I was performance-testing some VDSM patches and noticed suspicious load drops on VDSM. During a broad-range investigation, I discovered that the Engine sometimes randomly stops polling VMs.

engine.log is filled with:

2015-02-25 07:40:19,343 ERROR [org.ovirt.engine.core.utils.timer.SchedulerUtilQuartzImpl] (DefaultQuartzScheduler_Worker-93) [75a247ad] Failed to invoke scheduled method vmsMonitoring: null
2015-02-25 07:40:24,067 ERROR [org.ovirt.engine.core.utils.timer.SchedulerUtilQuartzImpl] (DefaultQuartzScheduler_Worker-14) [] Failed to invoke scheduled method vmsMonitoring: null

In server.log we have errors like:

2015-02-25 07:40:24,068 INFO [org.quartz.core.JobRunShell] (DefaultQuartzScheduler_Worker-14) Job DEFAULT.org.ovirt.engine.core.vdsbroker.VdsManager.vmsMonitoring#-9223372036854775775 threw a JobExecutionException: : org.quartz.JobExecutionException: failed to execute job
    at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source) [:1.7.0_75]
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.7.0_75]
    at java.lang.reflect.Method.invoke(Method.java:606) [rt.jar:1.7.0_75]
    at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:60) [scheduler.jar:]
    at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz.jar:]
    at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557) [quartz.jar:]

I had the same issue once in the past, also on master; I updated the sources, recompiled, and apparently it worked well for a little while. Unfortunately, for that reason I can't be more precise.

Version-Release number of selected component (if applicable):
The attached logs and dumps are against master commit 8a14792ad19d0b3a09763facbe06f5654a61e9bd

How reproducible:
Not sure; it seems pretty random. It occurs in some runs but not in others, and seems positively correlated with the Engine's uptime.

Steps to Reproduce:
1. Prepare an environment with the Engine and one or more hypervisor hosts
2. Fire up some VMs
3. Let the Engine just poll them for hours

Actual results:
The Engine stops polling at random times and emits log lines like:

2015-02-25 07:40:19,343 ERROR [org.ovirt.engine.core.utils.timer.SchedulerUtilQuartzImpl] (DefaultQuartzScheduler_Worker-93) [75a247ad] Failed to invoke scheduled method vmsMonitoring: null
2015-02-25 07:40:24,067 ERROR [org.ovirt.engine.core.utils.timer.SchedulerUtilQuartzImpl] (DefaultQuartzScheduler_Worker-14) [] Failed to invoke scheduled method vmsMonitoring: null

Expected results:
The Engine polls forever.

Additional info:
This may well be a local issue: being a development environment, it is quite messed up.
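For context on why the error line ends in "null": the scheduled method is invoked via reflection, so whatever it throws is wrapped in an InvocationTargetException, and logging the wrapped exception's message prints "null" whenever the underlying exception (for example a bare NullPointerException) carries no message. A minimal standalone sketch of that mechanism, not the actual oVirt JobWrapper code:

import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;

public class ReflectiveJobSketch {

    // Stand-in for the scheduled method; a bare NPE has no message,
    // so getMessage() on it returns null.
    static void vmsMonitoring() {
        throw new NullPointerException();
    }

    public static void main(String[] args) throws Exception {
        Method m = ReflectiveJobSketch.class.getDeclaredMethod("vmsMonitoring");
        try {
            m.invoke(null); // reflective call, as in the stack trace above
        } catch (InvocationTargetException e) {
            // Mirrors "Failed to invoke scheduled method vmsMonitoring: null"
            System.out.println("Failed to invoke scheduled method "
                    + m.getName() + ": " + e.getCause().getMessage());
        }
    }
}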
Created attachment 995001 [details] the DB dump of the affected environment
Created attachment 995002 [details] big snippet of engine.log
Created attachment 995003 [details] big snippet of server.log
Adding logs around 2015-02-24 13:45:51. There is a good chance VDSM was actually down around that timeframe, as the logs indeed suggest. I don't have a precise record because I was testing VDSM patches, so VDSM itself was restarted quite a few times.
Created attachment 995133 [details]
vdsm log 2014-20-24 (1/2)

The log starts at MainThread::INFO::2015-02-24 13:48:17,344 and ends at Dummy-350::DEBUG::2015-02-24 15:01:01,475
(In reply to Francesco Romani from comment #5)
> Created attachment 995133 [details]
> vdsm log 2014-20-24 (1/2)

Of course that should be 2015-02-24
Created attachment 995134 [details]
vdsm log 2015-02-24 (2/2)

The log starts at periodic.Executor-worker-2::WARNING::2015-02-24 12:01:02,602 and ends at Thread-27::DEBUG::2015-02-24 13:01:00,625::blockSD

Please ignore log lines like

periodic.Executor-worker-2::WARNING::2015-02-24 12:01:02,602::periodic::254::periodic.VmDispatcher::(__call__) could not run <class 'virt.periodic.BlockjobMonitor'>

They are false positives (the stats are actually still available).
Initial investigation on VDSM side suggests this is just noise. Will do more investigation to make sure.
Not sure how or if this is related, but most often when this error appears in the logs, the number of UP VMs in the host panel is also misreported (higher than it should be). E.g. today's example: the host can't be put into maintenance because the VM count was reported as 2, while actually no VMs were running on said host.
maybe we run out of threads?
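One way to check that hypothesis would be to count the live Quartz worker threads inside the Engine JVM. Below is a minimal diagnostic sketch using the standard java.lang.management API; the thread-name prefix is taken from the log lines above, and how you'd run it inside the Engine (debug endpoint, attached agent, etc.) is left open:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class QuartzThreadCount {
    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long workers = 0;
        for (ThreadInfo info : mx.getThreadInfo(mx.getAllThreadIds())) {
            // Thread-name prefix matches the engine.log excerpts above
            if (info != null
                    && info.getThreadName().startsWith("DefaultQuartzScheduler_Worker")) {
                workers++;
            }
        }
        System.out.println("Live Quartz worker threads: " + workers);
    }
}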
Update: this still appears at random, and only on Engine master (I never experienced it on the 3.5.x branch). Unfortunately, I can't yet pinpoint a sequence of events that triggers this behaviour. It is also quite possible that I tainted the environment: being a developer playground, I often play dirty, editing the DB manually and so on.
the server.log might reveal more info. can you attach it?
(In reply to Roy Golan from comment #12)
> the server.log might reveal more info. can you attach it?

I added a big chunk here:
https://bugzilla.redhat.com/attachment.cgi?id=995003

Is this sufficient?
If not, I need to reproduce this issue, and since it appears to be pretty random, that can take a while.
Created attachment 1005737 [details]
full engine and server logs

(In reply to Francesco Romani from comment #13)
> (In reply to Roy Golan from comment #12)
> > the server.log might reveal more info. can you attach it?
>
> I added a big chunk here:
> https://bugzilla.redhat.com/attachment.cgi?id=995003
>
> Is this sufficient?
> If not, I need to reproduce this issue, and since it appears to be pretty
> random, that can take a while.

Maybe we are lucky: it seems to be happening right now. What I did: I was benchmarking VDSM resource consumption, so
- I let the Engine run for ~20 hours against one hypervisor
- that single hypervisor was running up to 250 VMs
- I restarted VDSM quite often due to the need to add patches/debug logs
Investigated a bit...

It seems the network exception is not being handled correctly in the new monitoring code. The error line you see is the result of an exception being thrown from vmsMonitoring. This can (and probably did) happen easily while fetching data: VmsListFetcher.fetch(), running the VDS command, does _not_ catch exceptions from the command. I suppose it's meant to execute onError() instead.
(In reply to Michal Skrivanek from comment #15)
> Investigated a bit...
>
> It seems the network exception is not being handled correctly in the new
> monitoring code. The error line you see is the result of an exception
> being thrown from vmsMonitoring. This can (and probably did) happen easily
> while fetching data: VmsListFetcher.fetch(), running the VDS command, does
> _not_ catch exceptions from the command. I suppose it's meant to execute
> onError() instead.

Exceptions are caught and fired as an event (with the exception as payload), and are observed by VdsEventListener.onError(), which passes the handling to VdsManager.handleNetworkException. Hope this helps.
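To make the intended flow concrete, here is a minimal sketch of the dispatch pattern described in the two comments above: catch the command's exception in the fetcher and hand it off to an onError() handler instead of letting it escape to the Quartz job. Apart from the fetch()/onError() names taken from the discussion, every class and signature here is an assumption for illustration, not the actual Engine code:

public class VmsFetchSketch {

    // Hypothetical stand-in for the VDS command run by the fetcher
    interface VdsCommand {
        void execute();
    }

    // Hypothetical stand-in for VdsEventListener.onError() ->
    // VdsManager.handleNetworkException
    static void onError(RuntimeException cause) {
        System.out.println("network error handled, skipping this cycle: "
                + cause.getMessage());
    }

    // The point under discussion: if fetch() lets the command's exception
    // escape, vmsMonitoring fails with the Quartz error seen above; routing
    // it to onError() instead keeps the monitoring job alive.
    static void fetch(VdsCommand getAllVmStats) {
        try {
            getAllVmStats.execute();
            // ... process the fetched VM list ...
        } catch (RuntimeException e) {
            onError(e); // fire as an event with the exception as payload
        }
    }

    public static void main(String[] args) {
        fetch(() -> {
            throw new RuntimeException("VDSNetworkException: Heartbeat exeeded");
        });
    }
}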
ovirt-3.6.0-3 release
Verified with rhevm-3.6.1.2-0.1.el6.noarch. It seems that my host had some connectivity issues and went to a non-responsive state for a very short period of time. I got the expected error in the log: "Failed to fetch vms info for host 'host_mixed_1' - skipping VMs monitoring."

A few prior entries in the log:

2015-12-10 10:12:11,018 ERROR [org.ovirt.engine.core.bll.IsoDomainListSyncronizer] (DefaultQuartzScheduler_Worker-76) [] The following storage domains had a problem retrieving data from VDSM: ( Storage Pool Id: 838d693b-28dd-4f1d-a987-c70eb06d506e Storage domain Id: 9d3b50b9-2ec6-42e3-869f-83f2b75e224e File type: Unknown)
2015-12-10 10:12:11,048 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-76) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Refresh image list failed for domain(s): netappISO (Unknown file type). Please check domain activity.
2015-12-10 10:16:39,408 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (org.ovirt.thread.pool-7-thread-19) [] Host 'host_mixed_1' is not responding.
2015-12-10 10:16:39,435 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (DefaultQuartzScheduler_Worker-54) [] Command 'GetAllVmStatsVDSCommand(HostName = host_mixed_1, VdsIdAndVdsVDSCommandParametersBase:{runAsync='true', hostId='25a0fd3f-9815-4750-b6da-b935fb5a8655', vds='Host[host_mixed_1,25a0fd3f-9815-4750-b6da-b935fb5a8655]'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exeeded
2015-12-10 10:16:39,437 INFO [org.ovirt.engine.core.vdsbroker.PollVmStatsRefresher] (DefaultQuartzScheduler_Worker-54) [] Failed to fetch vms info for host 'host_mixed_1' - skipping VMs monitoring.
oVirt 3.6.0 has been released and the BZ verified, so moving to CLOSED CURRENTRELEASE.