Bug 1196040

Summary: VMS Monitoring stops with error: Failed to invoke scheduled method vmsMonitoring: null

| Field | Value |
|---|---|
| Product | [oVirt] ovirt-engine |
| Component | General |
| Status | CLOSED CURRENTRELEASE |
| Severity | medium |
| Priority | high |
| Reporter | Francesco Romani <fromani> |
| Assignee | Michal Skrivanek <michal.skrivanek> |
| QA Contact | sefi litmanovich <slitmano> |
| CC | bugs, ecohen, fromani, gklein, lsurette, michal.skrivanek, rbalakri, rgolan, yeylon |
| Target Milestone | ovirt-3.6.0-rc |
| Target Release | 3.6.0 |
| Flags | rule-engine: ovirt-3.6.0+, ylavi: planning_ack+, rule-engine: devel_ack+, mavital: testing_ack+ |
| Hardware | Unspecified |
| OS | Unspecified |
| Whiteboard | virt |
| Doc Type | Bug Fix |
| Type | Bug |
| oVirt Team | Virt |
| Last Closed | 2015-12-22 13:29:03 UTC |
Description (Francesco Romani, 2015-02-25 07:17:51 UTC)
Created attachment 995001 [details]
the DB dump of the affected environment

Created attachment 995002 [details]
big snippet of engine.log

Created attachment 995003 [details]
big snippet of server.log
Add logs around 2015-02-24 13:45:51.

There is a good chance VDSM was actually down around that timeframe, as the logs indeed suggest. I don't have a precise record because I was testing VDSM patches, so VDSM itself was restarted quite a few times.

Created attachment 995133 [details]
vdsm log 2014-20-24 (1/2)
The log starts at MainThread::INFO::2015-02-24 13:48:17,344
and ends at Dummy-350::DEBUG::2015-02-24 15:01:01,475
(In reply to Francesco Romani from comment #5)
> Created attachment 995133 [details]
> vdsm log 2014-20-24 (1/2)

Of course, that should be 2015-02-24.

Created attachment 995134 [details]
vdsm log 2015-02-24 (2/2)
The log starts at periodic.Executor-worker-2::WARNING::2015-02-24 12:01:02,602
and ends at Thread-27::DEBUG::2015-02-24 13:01:00,625::blockSD:
Please ignore log entries like

periodic.Executor-worker-2::WARNING::2015-02-24 12:01:02,602::periodic::254::periodic.VmDispatcher::(__call__) could not run <class 'virt.periodic.BlockjobMonitor'>

They are false positives (the stats are actually still available).
Initial investigation on the VDSM side suggests this is just noise. Will do more investigation to make sure.

Not sure how and if this is related, but most often when this error is found in the logs, the number of UP VMs in the host panel is also misdetected (higher than it should be). E.g. in today's example, a host could not be put into maintenance because the VM count was reported as 2, while actually no VMs were running on said host.

Maybe we run out of threads?

Update: this still appears at random, and only on Engine master (never experienced in the 3.5.x branch). Unfortunately, I can't yet pinpoint a sequence of events that triggers this behaviour. It is also very possible that I tainted the environment. Being a developer playground, I often play dirty, editing the DB manually and so on.

the server.log might reveal more info. can you attach it?

(In reply to Roy Golan from comment #12)
> the server.log might reveal more info. can you attach it?

I added a big chunk here:
https://bugzilla.redhat.com/attachment.cgi?id=995003

Is this sufficient? If not, I need to reproduce this issue, and since it appears to be pretty random, that can take a while.

Created attachment 1005737 [details]
full engine and server logs

(In reply to Francesco Romani from comment #13)
> I added a big chunk here:
> https://bugzilla.redhat.com/attachment.cgi?id=995003
>
> Is this sufficient? If not, I need to reproduce this issue, and since it
> appears to be pretty random, that can take a while.

Maybe we are lucky. It seems to be happening right now. What I did: I was benchmarking VDSM resource consumption, so

- I let Engine run for ~20 hours against one hypervisor
- the lone hypervisor was running up to 250 VMs
- I restarted VDSM quite often due to the need to add patches/debug logs

Investigated a bit... it seems the exception from the network is not being handled correctly in the new monitoring code. The error line you see is the result of an exception being thrown from vmsMonitoring. This can (and probably does) happen easily when fetching data: VmsListFetcher.fetch(), which runs the VDS command, does _not_ catch exceptions from the command. I suppose it is meant to execute onError() instead.

(In reply to Michal Skrivanek from comment #15)
> VmsListFetcher.fetch(), which runs the VDS command, does _not_ catch
> exceptions from the command. I suppose it is meant to execute onError()
> instead.

Exceptions are caught and fired as an event (with the exception as payload) and are observed by VdsEventListener.onError(), which passes the handling to VdsManager.handleNetworkException. Hope this helps.
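To make the error path discussed above concrete, here is a minimal, hypothetical Java sketch of a fetcher whose fetch() catches the network failure and hands it to an onError() observer instead of letting the exception kill the scheduled vmsMonitoring invocation. This is not the oVirt engine code: only the names VmsListFetcher.fetch(), VdsEventListener.onError(), and VdsManager.handleNetworkException come from the comments above; every other class, method, and message is invented for illustration.

```java
// Hypothetical illustration only -- NOT the oVirt engine source.
// Only fetch(), onError(), and the idea of delegating to
// handleNetworkException come from the bug discussion; all other
// identifiers and messages are made up.
public class VmsListFetcherSketch {

    /** Observer notified when fetching VM data from a host fails. */
    interface VdsEventListenerSketch {
        void onError(Exception cause);
    }

    private final VdsEventListenerSketch listener;

    VmsListFetcherSketch(VdsEventListenerSketch listener) {
        this.listener = listener;
    }

    /**
     * fetch() must not let a network exception escape into the scheduler;
     * otherwise the scheduled method dies with
     * "Failed to invoke scheduled method vmsMonitoring: null".
     */
    public void fetch() {
        try {
            runGetAllVmStatsCommand(); // may fail, e.g. on a heartbeat timeout
        } catch (RuntimeException networkError) {
            // Hand the failure to the observer (in the engine this would end
            // up in the host's network-exception handling) and skip this
            // monitoring cycle instead of propagating the exception.
            listener.onError(networkError);
        }
    }

    private void runGetAllVmStatsCommand() {
        // Placeholder for the real VDS command; simulate a network failure.
        throw new RuntimeException("VDSNetworkException: heartbeat exceeded");
    }

    public static void main(String[] args) {
        VmsListFetcherSketch fetcher = new VmsListFetcherSketch(
                cause -> System.out.println(
                        "Failed to fetch vms info - skipping VMs monitoring: "
                                + cause.getMessage()));
        fetcher.fetch(); // logs and returns instead of crashing the scheduler
    }
}
```

The point of routing the failure through the listener is that the periodic scheduler stays alive and the host state machine can decide how to react to the unresponsive host, which matches the "skipping VMs monitoring" behaviour seen in the verification below.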
ovirt-3.6.0-3 release

Verified with rhevm-3.6.1.2-0.1.el6.noarch. It seems that my host had some connectivity issues and went to a non-responsive state for a very short period of time. Got the expected error in the log: "Failed to fetch vms info for host 'host_mixed_1' - skipping VMs monitoring."

A few prior entries in the log:

2015-12-10 10:12:11,018 ERROR [org.ovirt.engine.core.bll.IsoDomainListSyncronizer] (DefaultQuartzScheduler_Worker-76) [] The following storage domains had a problem retrieving data from VDSM: ( Storage Pool Id: 838d693b-28dd-4f1d-a987-c70eb06d506e Storage domain Id: 9d3b50b9-2ec6-42e3-869f-83f2b75e224e File type: Unknown)
2015-12-10 10:12:11,048 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-76) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Refresh image list failed for domain(s): netappISO (Unknown file type). Please check domain activity.
2015-12-10 10:16:39,408 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (org.ovirt.thread.pool-7-thread-19) [] Host 'host_mixed_1' is not responding.
2015-12-10 10:16:39,435 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (DefaultQuartzScheduler_Worker-54) [] Command 'GetAllVmStatsVDSCommand(HostName = host_mixed_1, VdsIdAndVdsVDSCommandParametersBase:{runAsync='true', hostId='25a0fd3f-9815-4750-b6da-b935fb5a8655', vds='Host[host_mixed_1,25a0fd3f-9815-4750-b6da-b935fb5a8655]'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exeeded
2015-12-10 10:16:39,437 INFO [org.ovirt.engine.core.vdsbroker.PollVmStatsRefresher] (DefaultQuartzScheduler_Worker-54) [] Failed to fetch vms info for host 'host_mixed_1' - skipping VMs monitoring.

oVirt 3.6.0 has been released and the bz verified; moving to closed current release.