Bug 1196040

Summary: VMS Monitoring stops with error: Failed to invoke scheduled method vmsMonitoring: null

| Field | Value |
|---|---|
| Product | [oVirt] ovirt-engine |
| Component | General |
| Status | CLOSED CURRENTRELEASE |
| Severity | medium |
| Priority | high |
| Reporter | Francesco Romani <fromani> |
| Assignee | Michal Skrivanek <michal.skrivanek> |
| QA Contact | sefi litmanovich <slitmano> |
| CC | bugs, ecohen, fromani, gklein, lsurette, michal.skrivanek, rbalakri, rgolan, yeylon |
| Target Milestone | ovirt-3.6.0-rc |
| Target Release | 3.6.0 |
| Flags | rule-engine: ovirt-3.6.0+, ylavi: planning_ack+, rule-engine: devel_ack+, mavital: testing_ack+ |
| Hardware | Unspecified |
| OS | Unspecified |
| Whiteboard | virt |
| Doc Type | Bug Fix |
| Type | Bug |
| oVirt Team | Virt |
| Last Closed | 2015-12-22 13:29:03 UTC |
Description (Francesco Romani, 2015-02-25 07:17:51 UTC)
Created attachment 995001 [details]
the DB dump of the affected environment

Created attachment 995002 [details]
big snippet of engine.log

Created attachment 995003 [details]
big snippet of server.log
Add logs around 2015-02-24 13:45:51.

There is a good chance VDSM was actually down around that timeframe, as the logs indeed suggest. I don't have a precise record because I was testing VDSM patches, so VDSM itself was restarted quite a few times.

Created attachment 995133 [details]
vdsm log 2014-20-24 (1/2)
The log starts at MainThread::INFO::2015-02-24 13:48:17,344
and ends at Dummy-350::DEBUG::2015-02-24 15:01:01,475
(In reply to Francesco Romani from comment #5)
> Created attachment 995133 [details]
> vdsm log 2014-20-24 (1/2)

Of course, that should be 2015-02-24.

Created attachment 995134 [details]
vdsm log 2015-02-24 (2/2)
The log starts at periodic.Executor-worker-2::WARNING::2015-02-24 12:01:02,602
and ends at Thread-27::DEBUG::2015-02-24 13:01:00,625::blockSD:
Please ignore log entries like

periodic.Executor-worker-2::WARNING::2015-02-24 12:01:02,602::periodic::254::periodic.VmDispatcher::(__call__) could not run <class 'virt.periodic.BlockjobMonitor'>

They are false positives (the stats are actually still available).
Initial investigation on the VDSM side suggests this is just noise. Will do more investigation to make sure.

Not sure how and if this is related, but most often when this error is found in the logs, the number of UP VMs in the host panel is also misdetected (higher than it should be). E.g. in today's example, a host could not be put into maintenance because the VM count was reported as 2, while actually no VMs were running on said host.

Maybe we run out of threads?

Update: this still appears at random, and only on Engine master (never experienced in the 3.5.x branch). Unfortunately, I can't yet pinpoint a sequence of events that triggers this behaviour. It is also very possible that I tainted the environment. Being a developer playground, I often play dirty, editing the DB manually and so on.

the server.log might reveal more info. can you attach it?

(In reply to Roy Golan from comment #12)
> the server.log might reveal more info. can you attach it?

I added a big chunk here:
https://bugzilla.redhat.com/attachment.cgi?id=995003

Is this sufficient? If not, I need to reproduce this issue, and since it appears to be pretty random, that can take a while.

Created attachment 1005737 [details]
full engine and server logs

(In reply to Francesco Romani from comment #13)
> I added a big chunk here:
> https://bugzilla.redhat.com/attachment.cgi?id=995003
>
> Is this sufficient? If not, I need to reproduce this issue, and since it
> appears to be pretty random, that can take a while.

Maybe we are lucky. It seems to be happening right now. What I did: I was benchmarking VDSM resource consumption, so

- I let Engine run for ~20 hours against one hypervisor
- the lone hypervisor was running up to 250 VMs
- I restarted VDSM quite often due to the need to add patches/debug logs

Investigated a bit... it seems the exception from the network is not being handled correctly in the new monitoring code. The error line you see is the result of an exception being thrown from vmsMonitoring. This can (and probably does) happen easily when fetching data: VmsListFetcher.fetch(), which runs the VDS command, does _not_ catch exceptions from the command. I suppose it is meant to execute onError() instead.

(In reply to Michal Skrivanek from comment #15)
> VmsListFetcher.fetch(), which runs the VDS command, does _not_ catch
> exceptions from the command. I suppose it is meant to execute onError()
> instead.

Exceptions are caught and fired as an event (with the exception as payload) and are observed by VdsEventListener.onError(), which passes the handling to VdsManager.handleNetworkException. Hope this helps.
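To make the error path discussed above concrete, here is a minimal, hypothetical Java sketch of a fetcher whose fetch() catches the network failure and hands it to an onError() observer instead of letting the exception kill the scheduled vmsMonitoring invocation. This is not the oVirt engine code: only the names VmsListFetcher.fetch(), VdsEventListener.onError(), and VdsManager.handleNetworkException come from the comments above; every other class, method, and message is invented for illustration.

```java
// Hypothetical illustration only -- NOT the oVirt engine source.
// Only fetch(), onError(), and the idea of delegating to
// handleNetworkException come from the bug discussion; all other
// identifiers and messages are made up.
public class VmsListFetcherSketch {

    /** Observer notified when fetching VM data from a host fails. */
    interface VdsEventListenerSketch {
        void onError(Exception cause);
    }

    private final VdsEventListenerSketch listener;

    VmsListFetcherSketch(VdsEventListenerSketch listener) {
        this.listener = listener;
    }

    /**
     * fetch() must not let a network exception escape into the scheduler;
     * otherwise the scheduled method dies with
     * "Failed to invoke scheduled method vmsMonitoring: null".
     */
    public void fetch() {
        try {
            runGetAllVmStatsCommand(); // may fail, e.g. on a heartbeat timeout
        } catch (RuntimeException networkError) {
            // Hand the failure to the observer (in the engine this would end
            // up in the host's network-exception handling) and skip this
            // monitoring cycle instead of propagating the exception.
            listener.onError(networkError);
        }
    }

    private void runGetAllVmStatsCommand() {
        // Placeholder for the real VDS command; simulate a network failure.
        throw new RuntimeException("VDSNetworkException: heartbeat exceeded");
    }

    public static void main(String[] args) {
        VmsListFetcherSketch fetcher = new VmsListFetcherSketch(
                cause -> System.out.println(
                        "Failed to fetch vms info - skipping VMs monitoring: "
                                + cause.getMessage()));
        fetcher.fetch(); // logs and returns instead of crashing the scheduler
    }
}
```

The point of routing the failure through the listener is that the periodic scheduler stays alive and the host state machine can decide how to react to the unresponsive host, which matches the "skipping VMs monitoring" behaviour seen in the verification below.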
ovirt-3.6.0-3 release

Verified with rhevm-3.6.1.2-0.1.el6.noarch. It seems that my host had some connectivity issues and went to a non-responsive state for a very short period of time. Got the expected error in the log: "Failed to fetch vms info for host 'host_mixed_1' - skipping VMs monitoring."

A few prior entries in the log:

2015-12-10 10:12:11,018 ERROR [org.ovirt.engine.core.bll.IsoDomainListSyncronizer] (DefaultQuartzScheduler_Worker-76) [] The following storage domains had a problem retrieving data from VDSM: ( Storage Pool Id: 838d693b-28dd-4f1d-a987-c70eb06d506e Storage domain Id: 9d3b50b9-2ec6-42e3-869f-83f2b75e224e File type: Unknown)
2015-12-10 10:12:11,048 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-76) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Refresh image list failed for domain(s): netappISO (Unknown file type). Please check domain activity.
2015-12-10 10:16:39,408 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (org.ovirt.thread.pool-7-thread-19) [] Host 'host_mixed_1' is not responding.
2015-12-10 10:16:39,435 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (DefaultQuartzScheduler_Worker-54) [] Command 'GetAllVmStatsVDSCommand(HostName = host_mixed_1, VdsIdAndVdsVDSCommandParametersBase:{runAsync='true', hostId='25a0fd3f-9815-4750-b6da-b935fb5a8655', vds='Host[host_mixed_1,25a0fd3f-9815-4750-b6da-b935fb5a8655]'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exeeded
2015-12-10 10:16:39,437 INFO [org.ovirt.engine.core.vdsbroker.PollVmStatsRefresher] (DefaultQuartzScheduler_Worker-54) [] Failed to fetch vms info for host 'host_mixed_1' - skipping VMs monitoring.

oVirt 3.6.0 has been released and the bz verified; moving to closed current release.