Bug 1720908 - Remove host fails when host is in maintenance as it's locked due to DisconnectHostFromStoragePoolServersCommand - host in maintenance should not be locked
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Storage
Version: 4.3.4.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ovirt-4.3.6
Target Release: 4.3.6
Assignee: Ahmad Khiet
QA Contact: Evelina Shames
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-06-16 13:08 UTC by Avihai
Modified: 2019-09-26 19:43 UTC (History)
7 users

Fixed In Version: ovirt-engine-4.3.6
Clone Of:
Environment:
Last Closed: 2019-09-26 19:43:18 UTC
oVirt Team: Storage
Embargoed:
pm-rhel: ovirt-4.3+
pm-rhel: blocker?


Attachments
engine and vdsm logs (1.79 MB, application/zip)
2019-06-16 13:08 UTC, Avihai
no flags


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 101344 0 master MERGED engine: remove and maintenance lock conflict 2020-12-15 16:18:04 UTC
oVirt gerrit 101382 0 ovirt-engine-4.3 MERGED engine: remove and maintenance lock conflict 2020-12-15 16:18:04 UTC

Description Avihai 2019-06-16 13:08:28 UTC
Created attachment 1581112 [details]
engine and vdsm logs

Description of problem:
Following the Bug 1667723 fix patch https://gerrit.ovirt.org/#/c/99378, we now have a state in which a host is already in Maintenance, but the host object is still locked until DisconnectHostFromStoragePoolServersCommand is done.

During this time the user cannot remove/delete the host.

I would expect the host to enter the Maintenance state only once DisconnectHostFromStoragePoolServersCommand is done.

Engine log:
#Host is in maintenance state:
2019-06-12 06:44:00,402+03 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (EE-ManagedThreadFactory-engineScheduled-Thread-33) [] Updated host status from 'Preparing for Maintenance' to 'Maintenance' in database, host 'host_mixed_2'(aa4e6411-db68-4f57-94e1-4dd17230f86e)
2019-06-12 06:44:00,432+03 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy] (EE-ManagedThreadFactory-engine-Thread-2907) [] Clearing cache of pool: 'c23a9d84-8ffe-4858-b5d3-bdb021ff62a0' for problematic entities of VDS: 'host_mixed_2'.
2019-06-12 06:44:00,432+03 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy] (EE-ManagedThreadFactory-engine-Thread-2907) [] Removing vds '[aa4e6411-db68-4f57-94e1-4dd17230f86e]' from the domain in maintenance cache

#Afterwards, trying to remove the host fails, as DisconnectHostFromStoragePoolServersCommand is still in progress:

2019-06-12 06:44:06,720+03 INFO  [org.ovirt.engine.core.bll.storage.pool.DisconnectHostFromStoragePoolServersCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-33) [305d2e2e] Running command: DisconnectHostFromStoragePoolServersCommand internal: true. Entities affected :  ID: c23a9d84-8ffe-4858-b5d3-bdb021ff62a0 Type: StoragePool
2019-06-12 06:44:06,779+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DisconnectStorageServerVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-33) [305d2e2e] START, DisconnectStorageServerVDSCommand(HostName = host_mixed_2, StorageServerConnectionManagementVDSParameters:{hostId='aa4e6411-db68-4f57-94e1-4dd17230f86e', storagePoolId='c23a9d84-8ffe-4858-b5d3-bdb021ff62a0', storageType='NFS', connectionList='[StorageServerConnections:{id='24d3d010-7ae2-4208-b0b7-e81a45622016', connection='mantis-nfs-lif2.lab.eng.tlv2.redhat.com:/nas01/ge_7_nfs_2', iqn='null', vfsType='null', mountOptions='null', nfsVersion='null', nfsRetrans='null', nfsTimeo='null', iface='null', netIfaceName='null'}, StorageServerConnections:{id='2fddd09b-133b-40d4-95b8-5d7073a7812a', connection='mantis-nfs-lif2.lab.eng.tlv2.redhat.com:/nas01/ge_7_nfs_1', iqn='null', vfsType='null', mountOptions='null', nfsVersion='null', nfsRetrans='null', nfsTimeo='null', iface='null', netIfaceName='null'}, StorageServerConnections:{id='f399a10b-5363-44a3-b43a-cffa71312f45', connection='mantis-nfs-lif2.lab.eng.tlv2.redhat.com:/nas01/ge_7_nfs_0', iqn='null', vfsType='null', mountOptions='null', nfsVersion='null', nfsRetrans='null', nfsTimeo='null', iface='null', netIfaceName='null'}, StorageServerConnections:{id='1b81ac3d-2cb2-466b-b4d0-b72ee2a40c5e', connection='mantis-nfs-lif2.lab.eng.tlv2.redhat.com:/nas01/ge_7_export', iqn='null', vfsType='null', mountOptions='null', nfsVersion='null', nfsRetrans='null', nfsTimeo='null', iface='null', netIfaceName='null'}]', sendNetworkEventOnFailure='true'}), log id: 23f78d92
2019-06-12 06:44:08,519+03 INFO  [org.ovirt.engine.core.bll.RemoveVdsCommand] (default task-24) [hosts_delete_9e65b7cc-9177-485d] Failed to Acquire Lock to object 'EngineLock:{exclusiveLocks='[aa4e6411-db68-4f57-94e1-4dd17230f86e=VDS, VDS_POOL_AND_STORAGE_CONNECTIONSaa4e6411-db68-4f57-94e1-4dd17230f86e=VDS_POOL_AND_STORAGE_CONNECTIONS]', sharedLocks=''}'
2019-06-12 06:44:08,519+03 WARN  [org.ovirt.engine.core.bll.RemoveVdsCommand] (default task-24) [hosts_delete_9e65b7cc-9177-485d] Validation of action 'RemoveVds' failed for user admin@internal-authz. Reasons: VAR__ACTION__REMOVE,VAR__TYPE__HOST,ACTION_TYPE_FAILED_OBJECT_LOCKED
2019-06-12 06:44:08,521+03 ERROR [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (default task-24) [] Operation Failed: [Cannot remove Host. Related operation is currently in progress. Please try again later.]
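
To make the failure mode above concrete: RemoveVdsCommand needs an exclusive lock on the VDS, and the maintenance flow still holds that lock while DisconnectHostFromStoragePoolServersCommand runs. Below is a minimal, self-contained Java sketch of that conflict; this is not the engine's actual LockManager, and all class and method names here are illustrative only:

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class LockConflictDemo {
    // One exclusive lock per entity key: a simplified stand-in for EngineLock.
    static final Set<String> exclusiveLocks = ConcurrentHashMap.newKeySet();

    static boolean tryAcquire(String key) {
        // add() returns false if the key is already locked by another command
        return exclusiveLocks.add(key);
    }

    static void release(String key) {
        exclusiveLocks.remove(key);
    }

    public static void main(String[] args) {
        String hostKey = "VDS_POOL_AND_STORAGE_CONNECTIONS:aa4e6411-db68-4f57-94e1-4dd17230f86e";

        // The maintenance flow still holds the lock while
        // DisconnectHostFromStoragePoolServersCommand runs...
        tryAcquire(hostKey);

        // ...so RemoveVdsCommand, issued right after the host shows
        // 'Maintenance', fails validation exactly as in the log above:
        if (!tryAcquire(hostKey)) {
            System.out.println("Failed to Acquire Lock to object '" + hostKey + "'");
        }

        release(hostKey); // disconnect finishes; a later remove would now succeed
    }
}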


 
Version-Release number of selected component (if applicable):
Introduced in 4.3.4.2, when Bug 1667723 was fixed.

How reproducible:
100%
Fails every time automation TestCase18976 is run.


Steps to Reproduce:
1. Move host to maintenance
2. As soon as the host reaches the Maintenance state, try to remove the host.


Actual results:
Removing the host fails, as the host is still locked by DisconnectHostFromStoragePoolServersCommand.

Expected results:
The host should be moved to Maintenance only after all operations, including DisconnectHostFromStoragePoolServersCommand, have finished.

Additional info:
A workaround is to wait until DisconnectHostFromStoragePoolServersCommand finishes, but this is really annoying: there is no indication that the host is locked, and a user who sees the host in the 'maintenance' state expects to be able to remove it, yet can't.
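
For automation, that workaround can at least be scripted: retry the remove until the engine releases the lock. A minimal Java sketch against the REST API follows; the engine URL, credentials, and host id are placeholders, and treating the "operation in progress" rejection simply as a non-2xx response that warrants a retry is an assumption:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;

public class RemoveHostWithRetry {
    public static void main(String[] args) throws Exception {
        // Placeholders: engine FQDN, credentials, and host id are examples only.
        String hostUrl = "https://engine.example.com/ovirt-engine/api/hosts/aa4e6411-db68-4f57-94e1-4dd17230f86e";
        String auth = Base64.getEncoder()
                .encodeToString("admin@internal:password".getBytes());

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest delete = HttpRequest.newBuilder(URI.create(hostUrl))
                .header("Authorization", "Basic " + auth)
                .DELETE()
                .build();

        // Retry for about a minute while DisconnectHostFromStoragePoolServersCommand
        // may still hold the host lock.
        for (int attempt = 0; attempt < 12; attempt++) {
            HttpResponse<String> resp = client.send(delete, HttpResponse.BodyHandlers.ofString());
            if (resp.statusCode() / 100 == 2) {
                System.out.println("Host removed");
                return;
            }
            System.out.println("Remove rejected (" + resp.statusCode() + "), retrying...");
            Thread.sleep(5_000);
        }
        throw new IllegalStateException("Host still locked after retries");
    }
}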

Comment 1 RHEL Program Management 2019-06-17 14:12:32 UTC
This bug report has Keywords: Regression or TestBlocker.
Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

Comment 2 Ahmad Khiet 2019-06-20 11:58:11 UTC
After debugging the issue described in this bug and discussing it with Benny,
we found that the status recorded in the job table in the database reflects
the real status of the host, while the status shown in the log is not accurate.

So we have the following options:

1- Check the job table; if the job status is 'finished', the remove command
can be issued (a rough sketch of this check follows below).

2- Revert the fix for Bug 1667723 (https://bugzilla.redhat.com/show_bug.cgi?id=1667723).

3- Stay with the current behavior and make no change.
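
As a rough illustration of option 1, here is a plain-JDBC sketch of the job-table check. The table and column names (job, status, description), the status values, and matching jobs to a host via the description are assumptions about the engine schema for illustration only, not verified:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class JobTableCheck {
    // Returns true if no unfinished job still references the given host.
    // Table/column names are illustrative assumptions about the engine DB.
    static boolean hostJobsFinished(Connection conn, String hostId) throws Exception {
        String sql = "SELECT count(*) FROM job "
                   + "WHERE description LIKE ? AND status NOT IN ('FINISHED', 'FAILED', 'ABORTED')";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, "%" + hostId + "%");
            try (ResultSet rs = ps.executeQuery()) {
                rs.next();
                return rs.getInt(1) == 0;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Connection details are placeholders.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/engine", "engine", "password")) {
            boolean canRemove = hostJobsFinished(conn, "aa4e6411-db68-4f57-94e1-4dd17230f86e");
            System.out.println(canRemove ? "safe to issue RemoveVds" : "jobs still running");
        }
    }
}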


what do you think?

Comment 3 Avihai 2019-06-25 13:51:35 UTC
(In reply to Ahmad Khiet from comment #2)
> After debugging the issue described in this bug and discussing it with
> Benny, we found that the status recorded in the job table in the database
> reflects the real status of the host, while the status shown in the log is
> not accurate.
What do you mean by that? Is the host state I get from the REST API wrong/different from the state in the DB?
If so, that issue should be fixed, and that fix should be solution #0.

In automation, I wait until the host is in 'maintenance' and only then try to remove the host.
But since related tasks/jobs are still running at that point, the operation fails.


> So we have the following options:
> 
> 1- Check the job table; if the job status is 'finished', the remove command
> can be issued.
This is a workaround, not a fix for the bug.
We need the host state to reflect its true state:

host DB state = host state in the logs = REST host state (which users rely on for automation/Ansible)


> 2- Revert the fix for Bug 1667723
> (https://bugzilla.redhat.com/show_bug.cgi?id=1667723).
Also not good, as that would reintroduce the original issue of the host being removed while it is still disconnecting from storage.

> 3- Stay with the current behavior and make no change.
No way: most of our users now use Ansible automation, wait until the host is in Maintenance, and expect the DB state to be reflected in the logs/REST.

How hard would it be to implement the logical solution, which is:
wait until all related jobs/tasks are done, and only then move the host to Maintenance?
That would address the root cause of this bug.

If the host state I get from the REST API is wrong/different from the one in the DB, then that issue should be addressed.

Comment 4 Ahmad Khiet 2019-06-30 07:12:51 UTC
As Moti wrote on the mailing list,
I have added a lock-and-wait with a timeout of 1 minute to RemoveVdsCommand.
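
Conceptually, that means RemoveVdsCommand now waits up to one minute for the host lock instead of failing validation immediately. A simplified sketch of the acquire-with-timeout semantics, using a plain ReentrantLock rather than the engine's actual locking infrastructure (which the merged gerrit patches go through):

import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class LockWaitDemo {
    // Stands in for the engine lock on the host; the real fix goes through
    // the engine's lock manager, not a raw ReentrantLock.
    static final ReentrantLock hostLock = new ReentrantLock();

    public static void main(String[] args) throws InterruptedException {
        // Background thread plays DisconnectHostFromStoragePoolServersCommand:
        // it holds the host lock for a few seconds.
        Thread disconnect = new Thread(() -> {
            hostLock.lock();
            try {
                TimeUnit.SECONDS.sleep(3);
            } catch (InterruptedException ignored) {
            } finally {
                hostLock.unlock();
            }
        });
        disconnect.start();
        TimeUnit.MILLISECONDS.sleep(100); // let the disconnect thread grab the lock

        // RemoveVds now waits up to 1 minute for the lock instead of
        // failing validation immediately.
        if (hostLock.tryLock(1, TimeUnit.MINUTES)) {
            try {
                System.out.println("lock acquired, removing host");
            } finally {
                hostLock.unlock();
            }
        } else {
            System.out.println("timed out, host still locked");
        }
    }
}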

Comment 5 Evelina Shames 2019-08-12 07:26:22 UTC
Verified on engine 4.3.6-0.1.

Comment 6 Sandro Bonazzola 2019-09-26 19:43:18 UTC
This bug fix is included in the oVirt 4.3.6 release, published on September 26th 2019.

Since the problem described in this bug report should be
resolved in the oVirt 4.3.6 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

