Bug 1488416 - Command 'org.ovirt.engine.core.bll.AddUnmanagedVmsCommand' failed: null
Summary: Command 'org.ovirt.engine.core.bll.AddUnmanagedVmsCommand' failed: null
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: Backend.Core
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ovirt-4.2.0
Assignee: Milan Zamazal
QA Contact: Vitalii Yerys
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-09-05 10:59 UTC by Natalie Gavrielov
Modified: 2018-03-29 10:44 UTC
CC: 5 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2018-03-29 10:44:54 UTC
oVirt Team: Virt
Embargoed:
rule-engine: ovirt-4.2+
rule-engine: planning_ack+
rule-engine: devel_ack+
mavital: testing_ack+


Attachments
logs: engine,vdsm (17.52 MB, application/zip), 2017-09-05 10:59 UTC, Natalie Gavrielov
first 2 hosts (11.90 MB, application/zip), 2017-09-07 09:20 UTC, Natalie Gavrielov
third host vdsm.log + output for virsh.. (7.65 MB, application/zip), 2017-09-07 09:22 UTC, Natalie Gavrielov


Links
oVirt gerrit 81529 (master, MERGED): recovery: Set basic VM parameters on recovery, last updated 2020-05-05 13:39:14 UTC

Description Natalie Gavrielov 2017-09-05 10:59:28 UTC
Created attachment 1322162 [details]
logs: engine,vdsm

Description of problem:
Not sure about the exact scenario; it is most likely related to blocking the connection from a host to a storage domain while the host is running a VM that has a disk on the blocked storage domain.

Version-Release number of selected component:
ovirt-engine-4.2.0-0.0.master.20170828065003.git0619c76.el7.centos.noarch

How reproducible:
Not sure.

Steps to Reproduce:
Not sure about the exact scenario; most likely it involves blocking the connection from a host to a storage domain while the host is running a VM that has a disk on the blocked storage domain.

Actual results:
The engine.log is flooded with these errors (the messages below are the first occurrences):
2017-08-29 15:35:39,835+03 ERROR [org.ovirt.engine.core.bll.AddUnmanagedVmsCommand] (DefaultQuartzScheduler2) [5c6544c5] Command 'org.ovirt.engine.core.bll.AddUnmanagedVmsCommand' failed: null

2017-08-29 15:35:39,835+03 ERROR [org.ovirt.engine.core.bll.AddUnmanagedVmsCommand] (DefaultQuartzScheduler2) [5c6544c5] Exception: java.lang.NumberFormatException: null
 
These error messages appear every ~15 seconds.

The vdsm.log for that host has many errors (not sure these are related, but I would guess they are):
[virt.periodic.Operation] <vdsm.virt.sampling.VMBulkstatsMonitor object at 0x2c9cad0> operation failed (periodic:202

This error message also appears every ~15 seconds.
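
(For context: the bulk-stats sampling in Vdsm queries libvirt for all domains in one call, and a guest stuck on blocked storage can make that call fail, which is how the periodic operation above ends up reported as failed. A minimal sketch of the kind of libvirt call involved, for illustration only and not the actual Vdsm code:)

import libvirt  # libvirt Python bindings

def sample_bulk_stats(uri="qemu:///system"):
    # One read-only connection, one bulk call covering every domain.
    conn = libvirt.openReadOnly(uri)
    try:
        # Returns (domain, stats-dict) pairs; a guest blocked on storage
        # can make this call fail or time out.
        for dom, stats in conn.getAllDomainStats():
            print(dom.name(), stats.get("cpu.time"))
    except libvirt.libvirtError as exc:
        print("bulk stats sampling failed:", exc)
    finally:
        conn.close()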

Expected results:
No failures, or at least failures without 'null'.

Additional info:
1. Unblocking the connection (removing the rule from iptables) didn't solve the problem.
2. Restarting vdsm / the host / the engine didn't solve the problem (the errors just keep coming).
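
(For reference, the block/unblock above is typically a single iptables rule on the host. A hedged sketch of what "removing the rule from iptables" refers to; the storage server address below is hypothetical:)

import subprocess

STORAGE_IP = "10.35.0.10"  # hypothetical address of the storage server

def block_storage():
    # Drop outgoing traffic from the host to the storage server.
    subprocess.run(
        ["iptables", "-A", "OUTPUT", "-d", STORAGE_IP, "-j", "DROP"], check=True)

def unblock_storage():
    # Delete the same rule again to restore connectivity.
    subprocess.run(
        ["iptables", "-D", "OUTPUT", "-d", STORAGE_IP, "-j", "DROP"], check=True)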

Comment 1 Milan Zamazal 2017-09-06 16:26:44 UTC
Natalie, what is the libvirt version on the host where you experienced the problem? And is your cluster version 4.2?

Comment 2 Natalie Gavrielov 2017-09-07 08:03:58 UTC
(In reply to Milan Zamazal from comment #1)
> Natalie, what is the libvirt version on the host where you experienced the
> problem?
libvirt-3.2.0-14.el7.x86_64 (on all 3 hosts).
> And is your cluster version 4.2?
Yes.

Comment 3 Milan Zamazal 2017-09-07 08:33:08 UTC
Thank you for the information. Can you reproduce the error from a clean setup? If so, I'd be interested in the Vdsm logs and `virsh -r list --all' output from the hosts.
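
(A small helper to capture that output on each host; a sketch only, assuming virsh is on PATH and Python 3.7+:)

import subprocess

def collect_virsh_state(outfile="virsh-list-all.txt"):
    # -r opens a read-only connection, --all includes shut-off domains.
    result = subprocess.run(
        ["virsh", "-r", "list", "--all"],
        capture_output=True, text=True, check=True)
    with open(outfile, "w") as f:
        f.write(result.stdout)
    return result.stdout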

Comment 4 Natalie Gavrielov 2017-09-07 09:20:46 UTC
Created attachment 1323011 [details]
first 2 hosts

(In reply to Milan Zamazal from comment #3)
> Thank you for the information. Can you reproduce the error from a clean
> setup? If so, I'd be interested in the Vdsm logs and `virsh -r list --all'
> output from the hosts.

Attaching vdsm.log and output for 'virsh -r list --all'

Issue started at 2017-09-06 11:52:55

From engine.log: 
2017-09-06 11:52:55,762+03 ERROR [org.ovirt.engine.core.bll.AddUnmanagedVmsCommand] (DefaultQuartzScheduler10) [1c2abee5] Command 'org.ovirt.engine.core.bll.AddUnmanagedVmsCommand' failed: null

Note: this is not my environment, but the scenario seems similar; it has to do with blocking a connection.

Comment 5 Natalie Gavrielov 2017-09-07 09:22:09 UTC
Created attachment 1323013 [details]
third host vdsm.log + output for virsh..

Comment 6 Milan Zamazal 2017-09-07 10:09:29 UTC
Thank you. Unfortunately, the provided logs contain only the recovery of the listed shut-off VMs; they don't contain any information about why those shut-off VMs are present on the host (apparently they have been there for a long time). Were those VMs created by oVirt? If yes, I'd need the Vdsm logs (at DEBUG level) from the time when they stopped running.

Comment 8 Milan Zamazal 2017-09-07 12:56:29 UTC
Inspecting the logs, I can see the VMs running, then the log is suddenly interrupted for 6 days, then Vdsm is started, and the VMs are apparently in DOWN state when being recovered. It looks like they stopped while Vdsm wasn't running. It's suspicious that the VMs were not known to the Engine afterwards and it attempted to add them as external; could they have been removed in some way while being unreachable?
So this doesn't indicate problems with VM cleanup in Vdsm. But it might be a good idea to undefine or ignore VMs in DOWN status on Vdsm recovery.
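
(A minimal sketch of that idea using the libvirt Python bindings, for illustration only; this is not the actual Vdsm recovery code, and the patch eventually linked in this bug instead sets basic VM parameters on recovery:)

import libvirt

def recover_domains(uri="qemu:///system"):
    conn = libvirt.open(uri)
    try:
        for dom in conn.listAllDomains():
            if not dom.isActive():
                # A leftover shut-off domain: either skip it during recovery...
                print("ignoring shut-off domain:", dom.name())
                # ...or drop its definition entirely, as suggested above:
                # dom.undefine()
                continue
            print("recovering running domain:", dom.name())
    finally:
        conn.close()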

Comment 9 Natalie Gavrielov 2017-09-12 08:34:48 UTC
(In reply to Milan Zamazal from comment #8)
> It's suspicious that the VMs were not known to the Engine afterwards
> and it attempted to add them as external; could they have been removed
> in some way while being unreachable?

I asked the environment's owner and he says that the VMs were not removed in any way.

Comment 10 Milan Zamazal 2017-09-12 10:51:38 UTC
(In reply to Natalie Gavrielov from comment #9)
> I asked the environment's owner and he says that the VMs were not removed
> in any way.

OK, so we don't know the origin of the VMs unknown to the Engine. Unless their creation can be reproduced from a clean setup, there is nothing we can do about it.

Comment 11 Vitalii Yerys 2018-03-16 17:39:43 UTC
Closing the issue as the steps are not clear enough and I was not able to reproduce this behaviour. Steps used in my testing:

1. Boot a VM.

2. Block the connection to the storage domain (NFS in my case) on the host where the VM is located. The connection was blocked via firewalld, by blocking the nfs service and the IP of the NFS server (both ways were tested).

3. Verify that the errors described by the reporter are not reproduced.

During the connection block there were repeated attempts to connect to NFS, which I suppose is expected; these messages stopped appearing after the connection was re-established.

Verified with:

ovirt-engine-4.2.2.4-0.1.el7.noarch
vdsm-4.20.20-1.el7ev.x86_64

Unless we get some new info on the steps, there is not much we can do here. BZ verified.

Comment 12 Sandro Bonazzola 2018-03-29 10:44:54 UTC
This bugzilla is included in the oVirt 4.2.0 release, published on Dec 20th 2017.

Since the problem described in this bug report should be resolved in that release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.
