Bug 1055455 - [engine-backend] engine asks for vdsManager of a host that was removed from the system
Summary: [engine-backend] engine asks for vdsManager of a host that was removed from the system
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: oVirt
Classification: Retired
Component: ovirt-engine-core
Version: 3.4
Hardware: x86_64
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.4.1
Assignee: Yaniv Bronhaim
QA Contact: bugs@ovirt.org
URL:
Whiteboard: infra
Depends On:
Blocks:
 
Reported: 2014-01-20 10:05 UTC by Elad
Modified: 2014-05-08 13:36 UTC
CC List: 11 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-05-08 13:36:01 UTC
oVirt Team: ---
Embargoed:


Attachments
sos report (18.19 MB, application/x-xz)
2014-01-20 10:05 UTC, Elad
sos report (2) (7.33 MB, application/x-gzip)
2014-02-05 12:50 UTC, Elad


Links
oVirt gerrit 24907 (MERGED): core: allow putting host on maintenance if lonley in dc also for SPM host
oVirt gerrit 25307 (MERGED): core: allow putting host on maintenance if lonley in dc also for SPM host

Description Elad 2014-01-20 10:05:42 UTC
Created attachment 852673 [details]
sos report

Description of problem:
Engine keeps asking for GetCapabilities from a host which was removed from the setup.

Version-Release number of selected component (if applicable):
ovirt-3.4.0-alpha1
ovirt-engine-3.4.0-0.2.master.20140106180914.el6.noarch


How reproducible:
Unknown

Steps to Reproduce:
In my case it happened after the following scenario:
1. I removed qemu from the host without putting it into maintenance (yum remove qemu); this removed libvirt and vdsm as well
2. confirmed the host had been rebooted and put it into maintenance


Actual results:

Engine removes host from DB:

2014-01-20 11:18:36,441 INFO  [org.ovirt.engine.core.bll.RemoveVdsCommand] (org.ovirt.thread.pool-6-thread-22) [2eeec968] Running command: RemoveVdsCommand internal: false. Entities affected :  ID: 2726cb30-4120-4767-b9f7-578e97ab22c1 Type: VDS
2014-01-20 11:18:36,499 INFO  [org.ovirt.engine.core.vdsbroker.RemoveVdsVDSCommand] (org.ovirt.thread.pool-6-thread-22) [2eeec968] START, RemoveVdsVDSCommand( HostId = 2726cb30-4120-4767-b9f7-578e97ab22c1), log id: a90022d
2014-01-20 11:18:36,500 INFO  [org.ovirt.engine.core.vdsbroker.VdsManager] (org.ovirt.thread.pool-6-thread-22) [2eeec968] vdsManager::disposing
2014-01-20 11:18:36,501 INFO  [org.ovirt.engine.core.vdsbroker.RemoveVdsVDSCommand] (org.ovirt.thread.pool-6-thread-22) [2eeec968] FINISH, RemoveVdsVDSCommand, log id: a90022d

==========

Host no longer exists in DB:


                vds_id
--------------------------------------
 8e47776b-00a4-4fc3-a7c6-25c546113d4a
 12a77d94-7483-4435-b3f0-45e911158a0b
 d1f150ab-52fd-4ed9-a546-39884f162386
(3 rows)

==========

The engine tries to get the vdsManager of vdsid=2726cb30-4120-4767-b9f7-578e97ab22c1, which doesn't exist in the DB, and fails:


2014-01-20 11:18:38,321 ERROR [org.ovirt.engine.core.vdsbroker.ResourceManager] (DefaultQuartzScheduler_Worker-84) [1c2a4fb2] Cannot get vdsManager for vdsid=2726cb30-4120-4767-b9f7-578e97ab22c1
2014-01-20 11:18:38,323 ERROR [org.ovirt.engine.core.vdsbroker.ResourceManager] (DefaultQuartzScheduler_Worker-84) [1c2a4fb2] CreateCommand failed: org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: Vds with id: 2726cb30-4120-4767-b9f7-578e97ab22c1 was not found (Failed with error RESOURCE_MANAGER_VDS_NOT_FOUND and code 5004)

=========

This error repeated until I restarted the ovirt-engine service.

Expected results:
If a host was removed from the DB, the engine shouldn't look for it.


Additional info:
sos report from engine

Comment 1 Elad 2014-01-20 12:36:23 UTC
Steps to Reproduce:
In my case it happened after the following scenario:
1. I removed qemu from the host without putting it into maintenance (yum remove qemu); this removed libvirt and vdsm as well - host state: non-responsive
2. confirmed the host had been rebooted and put it into maintenance
3. removed the host

Comment 2 Barak 2014-01-20 12:45:44 UTC
I assume that before the host removal its status was non-responsive ... and then you moved it to maintenance?

Comment 3 Elad 2014-01-20 12:58:05 UTC
(In reply to Barak from comment #2)
> I assume that before the host removal its status was non-responsive ... and
> then you moved it to maintenance?

Yes, I added it in comment #1

Comment 4 Yaniv Bronhaim 2014-02-04 17:45:57 UTC
Is it reproducible for you? I tried the same steps, and after the removal ends I don't get any attempt to create a vdsManager for that host.

Comment 5 Elad 2014-02-05 12:50:00 UTC
Created attachment 859638 [details]
sos report (2)

(In reply to Yaniv Bronhaim from comment #4)
> Is it reproducible for you? I tried the same steps, and after the removal
> ends I don't get any attempt to create a vdsManager for that host.
Yes. Try to do it while the host is SPM.

2014-02-05 14:44:07,312 ERROR [org.ovirt.engine.core.vdsbroker.ResourceManager] (DefaultQuartzScheduler_Worker-72) [6a30683b] Cannot get vdsManager for vdsid=5a849a31-4431-407f-9c6a-216b442169c4
2014-02-05 14:44:07,314 ERROR [org.ovirt.engine.core.vdsbroker.ResourceManager] (DefaultQuartzScheduler_Worker-72) [6a30683b] CreateCommand failed: org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: Vds with id: 5a849a31-4431-407f-9c6a-216b442169c4 was not found (Failed with error RESOURCE_MANAGER_VDS_NOT_FOUND and code 5004)

Attaching the logs

Comment 6 Elad 2014-02-05 13:45:46 UTC
An easier way to reproduce the issue is to block connectivity between the SPM and the engine --> wait until the host becomes non-responsive --> "confirm host has been rebooted" --> put the host into maintenance --> remove the host from the setup

Comment 7 Yaniv Bronhaim 2014-02-06 12:26:25 UTC
From what I see, when clicking on "Confirm host has been rebooted" on an SPM host with no other host in the setup, the engine initiates FenceVdsManualy, which fails at start with "Error while executing action: Due to intermittent connectivity to this Host, fence operations are not allowed at this time. The system is trying to reconnect, please try again in 30 seconds."

After it gets to non-responsive it works, but reports "Manual fence did not revoke the selected SPM (aaa) since the master storage domain was not active or could not use another host for the fence operation."

Now I can put the host into maintenance and delete it, which to my understanding should not be allowed.

Any recent changes in this area? What is the expected behavior in that case?

Comment 8 Eli Mesika 2014-02-06 13:21:40 UTC
(In reply to Yaniv Bronhaim from comment #7)
> from what I see while clicking on "config host has been reboot" on SPM host
Yaniv, as far as I understand, your check reported in comment 7 works as expected.
The question is whether, after you had deleted the host, you keep getting messages for it from VDSM.
When I look in the ResourceManager class I see a _vdsManagersDict structure that caches a VdsManager instance per host; the fact that the behavior reported in this bug does not occur after restarting the engine points to the possibility that the VdsManager for the removed host is not cleared from the cache for some reason.
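
For illustration only, a minimal sketch of the suspected caching behavior; apart from the ResourceManager/_vdsManagersDict idea it describes, the class and member names are hypothetical, not the actual engine code:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical, simplified stand-in for ResourceManager and its _vdsManagersDict cache.
class ResourceManagerSketch {
    // one cached VdsManager (here just an Object placeholder) per host id
    private final Map<String, Object> vdsManagersByHostId = new ConcurrentHashMap<>();

    void registerVds(String vdsId, Object vdsManager) {
        vdsManagersByHostId.put(vdsId, vdsManager);
    }

    Object getVdsManager(String vdsId) {
        Object manager = vdsManagersByHostId.get(vdsId);
        if (manager == null) {
            // the path that keeps logging "Cannot get vdsManager for vdsid=..."
            System.err.println("Cannot get vdsManager for vdsid=" + vdsId);
        }
        return manager;
    }

    // expected to run as part of the RemoveVds flow ("vdsManager::disposing");
    // if some periodic job still holds the removed host's id, it will keep calling
    // getVdsManager() and hitting the error above until the engine is restarted
    void removeVds(String vdsId) {
        vdsManagersByHostId.remove(vdsId);
    }
}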

Comment 9 Yaniv Bronhaim 2014-02-06 14:49:25 UTC
I know, that's exactly the problem. I wonder if the UI should not let you put the host into maintenance in the case where your host was SPM and the soft fence fails, even though the user clicked on "Confirm host has been rebooted".

Comment 10 Yaniv Bronhaim 2014-02-09 17:45:12 UTC
Patch http://gerrit.ovirt.org/#/c/13045 exposed this issue. If the SPM status is kept, MaintenanceNumberOfVdssCommand fails on VDS_CANNOT_MAINTENANCE_VDS_IS_NOT_RESPONDING_AND_IS_SPM, which is the treatment when other hosts are available in the cluster. AFAIU bug 837539 is not a bug but the desired behavior; the fix for that bug means that by "confirm Host had been rebooted" we can harm the master mount by proceeding with the removal of the host while the mount is still on it. The RESOURCE_MANAGER_VDS_NOT_FOUND error is raised when the engine checks for the spmStatus but the host had been removed; the vds id of this SPM is still in the DB. So it's either removing it from there as well, or reverting this change, which can cause other hazards.
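
As a rough, hypothetical sketch of the failing loop described above (names are illustrative, not the actual engine classes): the periodic SPM-status check keeps using the SPM vds id recorded for the storage pool, which was never cleared when the host was removed, so every scheduler cycle fails with RESOURCE_MANAGER_VDS_NOT_FOUND:

import java.util.Map;

// Illustrative only: a periodic task that still holds the removed host's id as SPM.
class SpmStatusCheckSketch {
    private final Map<String, Object> vdsManagersByHostId; // stand-in for the ResourceManager cache
    private final String spmVdsId;                          // stale id left behind in storage_pool

    SpmStatusCheckSketch(Map<String, Object> vdsManagersByHostId, String spmVdsId) {
        this.vdsManagersByHostId = vdsManagersByHostId;
        this.spmVdsId = spmVdsId;
    }

    // called from the scheduler on every cycle
    void onTimer() {
        Object manager = vdsManagersByHostId.get(spmVdsId);
        if (manager == null) {
            // this is where the engine repeatedly logs
            // "Cannot get vdsManager for vdsid=..." / RESOURCE_MANAGER_VDS_NOT_FOUND (code 5004)
            return;
        }
        // otherwise it would query the SPM status on that host
    }
}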

Comment 11 Yaniv Bronhaim 2014-02-23 13:35:39 UTC
Fixing this issue raises a main storage concern: it allows users to clear the SPM without having another host in the DC.

This is easy to fix. AFAIU the fix in patch [1] was not complete. All we need to do is change the lines that were added in [1] to call resetIrs() when activateDataCenter() fails to find another running vds ([1] only cleans the storage_pool table and leaves the IrsBrokerCommand cached data).
But there is a strong reason not to allow that: we can harm the MSD and break the DC availability. I suggest reconsidering the right approach to handle the issue raised by bug 837539.

It's either fixing [1] by changing activateDataCenter() to call resetIrs() when no up hosts are available, or removing [1] altogether and keeping the DC blocked (not allowing putting this host into maintenance / removing it) until another host is added and available.

Allon, adding you to consider the alternatives.

[1] http://gerrit.ovirt.org/#/c/13045
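
A rough sketch of the first alternative (calling resetIrs() when activateDataCenter() finds no running host); the resetIrs/activateDataCenter names follow the comment above, everything else is hypothetical and simplified, not the actual engine code:

// Illustrative only: reset the cached IRS/SPM state when no other running vds can take over.
class ActivateDataCenterSketch {

    interface IrsProxy {
        boolean tryElectNewSpm(); // stands in for searching for another running vds in the DC
        void resetIrs();          // clears the cached SPM vds id and IrsBrokerCommand data
    }

    // When the last (SPM) host in the DC is removed and no other running vds exists,
    // reset the IRS state instead of leaving the removed host's vds id behind,
    // so the engine stops polling a host that no longer exists in the DB.
    static void activateDataCenter(IrsProxy irsProxy) {
        boolean newSpmElected = irsProxy.tryElectNewSpm();
        if (!newSpmElected) {
            irsProxy.resetIrs();
        }
    }
}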

Comment 12 Yaniv Bronhaim 2014-03-02 11:58:00 UTC
The patch http://gerrit.ovirt.org/24907 fixes the patch that was done for bug 837539.

Comment 13 Sandro Bonazzola 2014-03-04 09:26:25 UTC
This is an automated message.
Re-targeting all non-blocker bugs still open on 3.4.0 to 3.4.1.

Comment 14 Sandro Bonazzola 2014-05-08 13:36:01 UTC
This is an automated message

oVirt 3.4.1 has been released:
 * should fix your issue
 * should be available at your local mirror within two days.

If problems still persist, please make note of it in this bug report.

