Created attachment 852673 [details]
sos report

Description of problem:
Engine keeps asking for GetCapabilities from a host which was removed from the setup.

Version-Release number of selected component (if applicable):
ovirt-3.4.0-alpha1
ovirt-engine-3.4.0-0.2.master.20140106180914.el6.noarch

How reproducible:
Unknown

Steps to Reproduce:
In my case it happened after the following scenario:
1. Removed qemu from the host without putting it into maintenance (yum remove qemu); this removed libvirt and vdsm as well.
2. Confirmed the host had been rebooted and put it into maintenance.

Actual results:
Engine removes the host from the DB:

2014-01-20 11:18:36,441 INFO  [org.ovirt.engine.core.bll.RemoveVdsCommand] (org.ovirt.thread.pool-6-thread-22) [2eeec968] Running command: RemoveVdsCommand internal: false. Entities affected : ID: 2726cb30-4120-4767-b9f7-578e97ab22c1 Type: VDS
2014-01-20 11:18:36,499 INFO  [org.ovirt.engine.core.vdsbroker.RemoveVdsVDSCommand] (org.ovirt.thread.pool-6-thread-22) [2eeec968] START, RemoveVdsVDSCommand( HostId = 2726cb30-4120-4767-b9f7-578e97ab22c1), log id: a90022d
2014-01-20 11:18:36,500 INFO  [org.ovirt.engine.core.vdsbroker.VdsManager] (org.ovirt.thread.pool-6-thread-22) [2eeec968] vdsManager::disposing
2014-01-20 11:18:36,501 INFO  [org.ovirt.engine.core.vdsbroker.RemoveVdsVDSCommand] (org.ovirt.thread.pool-6-thread-22) [2eeec968] FINISH, RemoveVdsVDSCommand, log id: a90022d

==========
The host no longer exists in the DB:

                vds_id
--------------------------------------
 8e47776b-00a4-4fc3-a7c6-25c546113d4a
 12a77d94-7483-4435-b3f0-45e911158a0b
 d1f150ab-52fd-4ed9-a546-39884f162386
(3 rows)

==========
The engine tries to get the vdsManager for vdsid=2726cb30-4120-4767-b9f7-578e97ab22c1, which doesn't exist in the DB, and fails:

2014-01-20 11:18:38,321 ERROR [org.ovirt.engine.core.vdsbroker.ResourceManager] (DefaultQuartzScheduler_Worker-84) [1c2a4fb2] Cannot get vdsManager for vdsid=2726cb30-4120-4767-b9f7-578e97ab22c1
2014-01-20 11:18:38,323 ERROR [org.ovirt.engine.core.vdsbroker.ResourceManager] (DefaultQuartzScheduler_Worker-84) [1c2a4fb2] CreateCommand failed: org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: Vds with id: 2726cb30-4120-4767-b9f7-578e97ab22c1 was not found (Failed with error RESOURCE_MANAGER_VDS_NOT_FOUND and code 5004)

=========
This error repeated until I restarted the ovirt-engine service.

Expected results:
If a host was removed from the DB, the engine shouldn't look for it.

Additional info:
sos report from the engine is attached.
Updated Steps to Reproduce:
1. Removed qemu from the host without putting it into maintenance (yum remove qemu); this removed libvirt and vdsm as well. Host state: Non Responsive.
2. Confirmed the host had been rebooted and put it into maintenance.
3. Removed the host.
I assume that before the host removal its status was Non Responsive ... and then you moved it to maintenance?
(In reply to Barak from comment #2) > I assume that before the host removal it's status was non-responsive ... and > than you have moved it to maintenance ? Yes, I added it in comment #1
Is it reproducible for you? I tried the same steps, and after the removal completes I don't see any attempt to create a vdsManager for that host.
Created attachment 859638 [details]
sos report (2)

(In reply to Yaniv Bronhaim from comment #4)
> is it reproducible to you ? I tried the same steps and after the removal
> ends I don't get any attempt of creating vdsManager for that host

Yes. Try to do it while the host is SPM.

2014-02-05 14:44:07,312 ERROR [org.ovirt.engine.core.vdsbroker.ResourceManager] (DefaultQuartzScheduler_Worker-72) [6a30683b] Cannot get vdsManager for vdsid=5a849a31-4431-407f-9c6a-216b442169c4
2014-02-05 14:44:07,314 ERROR [org.ovirt.engine.core.vdsbroker.ResourceManager] (DefaultQuartzScheduler_Worker-72) [6a30683b] CreateCommand failed: org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: Vds with id: 5a849a31-4431-407f-9c6a-216b442169c4 was not found (Failed with error RESOURCE_MANAGER_VDS_NOT_FOUND and code 5004)

Attaching the logs.
An easier way to reproduce the issue: block connectivity between the SPM host and the engine --> wait until the host becomes Non Responsive --> "Confirm host has been rebooted" --> put the host into maintenance --> remove the host from the setup.
From what I see, when clicking "Confirm host has been rebooted" on an SPM host with no other host in the setup, the engine initiates FenceVdsManualy, which at first fails with: "Error while executing action: Due to intermittent connectivity to this Host, fence operations are not allowed at this time. The system is trying to reconnect, please try again in 30 seconds."

After the host becomes Non Responsive it works, but reports: "Manual fence did not revoke the selected SPM (aaa) since the master storage domain was not active or could not use another host for the fence operation."

Now I can put the host into maintenance and delete it, which in my understanding should not be allowed. Were there any recent changes in this area? What is the expected behavior in that case?
(In reply to Yaniv Bronhaim from comment #7)
> from what I see while clicking on "config host has been reboot" on SPM host

Yaniv, as far as I understand, the check you reported in comment 7 works as expected. The question is whether, after you deleted the host, you keep getting messages for it from VDSM.

When I look at the ResourceManager class I see a _vdsManagersDict structure that caches a VdsManager instance per host. The fact that the behavior reported in this bug no longer occurs after restarting the engine points to the possibility that the VdsManager for the removed host is not cleared from the cache for some reason.
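The stale-cache hypothesis above can be sketched as follows. This is a minimal illustration only: the class, its methods, and the monitoring check are invented for the example and are not the actual oVirt engine code; the sketch just shows why an entry that survives host removal keeps the periodic monitoring alive until an engine restart rebuilds the cache from the DB.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch (illustrative names only) of a per-host cache like
// _vdsManagersDict: it stays consistent only if host removal also evicts
// the cached entry; otherwise the entry's monitoring keeps running for a
// host the DB no longer knows about.
public class VdsManagerCacheDemo {
    private final Map<String, String> vdsManagersDict = new ConcurrentHashMap<>();

    void addVds(String vdsId) {
        vdsManagersDict.put(vdsId, "VdsManager-" + vdsId);
    }

    // Correct removal path: dispose the manager and evict it from the cache.
    void removeVds(String vdsId) {
        vdsManagersDict.remove(vdsId);
    }

    // Suspected buggy path: the DB row is deleted, but the cache entry
    // (and its scheduled monitoring job) is left behind.
    void removeVdsWithoutEviction(String vdsId) {
        // DB deletion would happen here; the cache is left untouched.
    }

    // Simulates an engine restart: the cache is rebuilt from the DB, which
    // is why the errors stopped after restarting ovirt-engine.
    void restart() {
        vdsManagersDict.clear();
    }

    boolean isMonitored(String vdsId) {
        return vdsManagersDict.containsKey(vdsId);
    }

    public static void main(String[] args) {
        VdsManagerCacheDemo engine = new VdsManagerCacheDemo();
        String host = "2726cb30-4120-4767-b9f7-578e97ab22c1";
        engine.addVds(host);
        engine.removeVdsWithoutEviction(host);
        System.out.println(engine.isMonitored(host)); // true: stale entry still monitored
        engine.restart();
        System.out.println(engine.isMonitored(host)); // false: gone after restart
    }
}
```

Under this reading, the fix is simply to make the removal flow take the evicting path, so no restart is needed.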
I know, that's exactly the problem. I wonder whether the UI should prevent putting the host into maintenance in the case where the host was SPM and the soft fence failed, even though the user clicked "Confirm host has been rebooted".
Patch http://gerrit.ovirt.org/#/c/13045 exposed this issue. If the SPM status is kept, MaintenanceNumberOfVdssCommand fails with VDS_CANNOT_MAINTENANCE_VDS_IS_NOT_RESPONDING_AND_IS_SPM, which is the treatment when other hosts are available in the cluster.

AFAIU, bug 837539 is not a bug but the desired behavior; the fix for that bug means that via "Confirm host has been rebooted" we can harm the master mount by proceeding with the removal of the host while the mount is still on it.

The RESOURCE_MANAGER_VDS_NOT_FOUND error is raised when the engine checks the spmStatus after the host has been removed: the vds id of this SPM is still in the DB. So we either remove it from there as well, or revert this change, which can cause other hazards.
Fixing this issue raises a major storage concern: it would allow users to clear the SPM without having another host in the DC. That part is easy to fix. AFAIU the fix in patch [1] was not complete: all we need to do is change the lines added in [1] so that activateDataCenter() calls resetIrs() when it fails to find another running vds ([1] only cleans the storage_pool table and leaves the IrsBrokerCommand cached data).

But there is a strong reason not to allow that: we can harm the MSD and break DC availability. I suggest reconsidering the right approach to the issue raised by bug 837539. Either fix [1] by changing activateDataCenter() to call resetIrs() when no Up hosts are available, or remove [1] entirely and keep the DC blocked (do not allow putting this host into maintenance/removing it) until another host is added and available.

Allon, adding you to consider the alternatives.

[1] http://gerrit.ovirt.org/#/c/13045
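The first alternative (have activateDataCenter() fall back to resetIrs() when no other running vds exists) can be sketched roughly as follows. This is a hedged illustration: the SpmFailover class and all stub bodies are invented for the example, only the method names activateDataCenter() and resetIrs() come from the discussion above, and the sketch does not claim to match the real engine control flow.

```java
// Hedged sketch of the proposed fix: when no other Up host can take over the
// SPM role, reset the cached IRS/SPM state instead of only clearing the
// storage_pool table (which is what [1] reportedly does, leaving the
// in-memory SPM reference behind). Illustrative code, not the engine's.
public class SpmFailover {
    private String cachedSpmVdsId; // stands in for the IrsBrokerCommand cached data

    SpmFailover(String spmVdsId) {
        this.cachedSpmVdsId = spmVdsId;
    }

    // Returns the id of another running vds, or null when none is available.
    String findAnotherRunningVds() {
        return null; // the scenario in this bug: no other host in the DC
    }

    void clearStoragePoolTable() {
        // DB cleanup already performed by patch [1]; nothing in-memory changes.
    }

    // The missing step: also drop the in-memory SPM reference, so periodic
    // spmStatus checks stop looking up a host that no longer exists.
    void resetIrs() {
        cachedSpmVdsId = null;
    }

    void activateDataCenter() {
        String candidate = findAnotherRunningVds();
        if (candidate != null) {
            cachedSpmVdsId = candidate; // elect the new SPM
        } else {
            clearStoragePoolTable();
            resetIrs(); // proposed addition when no running vds is found
        }
    }

    String getCachedSpmVdsId() {
        return cachedSpmVdsId;
    }

    public static void main(String[] args) {
        SpmFailover failover = new SpmFailover("5a849a31-4431-407f-9c6a-216b442169c4");
        failover.activateDataCenter();
        System.out.println(failover.getCachedSpmVdsId()); // null: no stale SPM reference left
    }
}
```

Without the resetIrs() call in the else branch, cachedSpmVdsId would keep pointing at the removed host, which mirrors the repeated RESOURCE_MANAGER_VDS_NOT_FOUND errors in the logs.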
The patch http://gerrit.ovirt.org/24907 fixes the patch for bug 837539.
This is an automated message. Re-targeting all non-blocker bugs still open on 3.4.0 to 3.4.1.
This is an automated message.

oVirt 3.4.1 has been released:
* should fix your issue
* should be available at your local mirror within two days

If problems still persist, please make note of it in this bug report.