Created attachment 1581112 [details]
engine and vdsm logs

Description of problem:
Following the Bug 1667723 fix patch https://gerrit.ovirt.org/#/c/99378, we now have a state where a host is already in Maintenance status but the host object remains locked until DisconnectHostFromStoragePoolServersCommand completes. During this time the user cannot remove/delete the host.
I would expect the host to reach Maintenance status only once DisconnectHostFromStoragePoolServersCommand is done.

Engine log:

# Host is in Maintenance status:
2019-06-12 06:44:00,402+03 INFO [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (EE-ManagedThreadFactory-engineScheduled-Thread-33) [] Updated host status from 'Preparing for Maintenance' to 'Maintenance' in database, host 'host_mixed_2'(aa4e6411-db68-4f57-94e1-4dd17230f86e)
2019-06-12 06:44:00,432+03 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy] (EE-ManagedThreadFactory-engine-Thread-2907) [] Clearing cache of pool: 'c23a9d84-8ffe-4858-b5d3-bdb021ff62a0' for problematic entities of VDS: 'host_mixed_2'.
2019-06-12 06:44:00,432+03 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy] (EE-ManagedThreadFactory-engine-Thread-2907) [] Removing vds '[aa4e6411-db68-4f57-94e1-4dd17230f86e]' from the domain in maintenance cache

# Afterwards, trying to remove the host fails because DisconnectHostFromStoragePoolServersCommand is still in progress:
2019-06-12 06:44:06,720+03 INFO [org.ovirt.engine.core.bll.storage.pool.DisconnectHostFromStoragePoolServersCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-33) [305d2e2e] Running command: DisconnectHostFromStoragePoolServersCommand internal: true. Entities affected : ID: c23a9d84-8ffe-4858-b5d3-bdb021ff62a0 Type: StoragePool
2019-06-12 06:44:06,779+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.DisconnectStorageServerVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-33) [305d2e2e] START, DisconnectStorageServerVDSCommand(HostName = host_mixed_2, StorageServerConnectionManagementVDSParameters:{hostId='aa4e6411-db68-4f57-94e1-4dd17230f86e', storagePoolId='c23a9d84-8ffe-4858-b5d3-bdb021ff62a0', storageType='NFS', connectionList='[StorageServerConnections:{id='24d3d010-7ae2-4208-b0b7-e81a45622016', connection='mantis-nfs-lif2.lab.eng.tlv2.redhat.com:/nas01/ge_7_nfs_2', iqn='null', vfsType='null', mountOptions='null', nfsVersion='null', nfsRetrans='null', nfsTimeo='null', iface='null', netIfaceName='null'}, StorageServerConnections:{id='2fddd09b-133b-40d4-95b8-5d7073a7812a', connection='mantis-nfs-lif2.lab.eng.tlv2.redhat.com:/nas01/ge_7_nfs_1', iqn='null', vfsType='null', mountOptions='null', nfsVersion='null', nfsRetrans='null', nfsTimeo='null', iface='null', netIfaceName='null'}, StorageServerConnections:{id='f399a10b-5363-44a3-b43a-cffa71312f45', connection='mantis-nfs-lif2.lab.eng.tlv2.redhat.com:/nas01/ge_7_nfs_0', iqn='null', vfsType='null', mountOptions='null', nfsVersion='null', nfsRetrans='null', nfsTimeo='null', iface='null', netIfaceName='null'}, StorageServerConnections:{id='1b81ac3d-2cb2-466b-b4d0-b72ee2a40c5e', connection='mantis-nfs-lif2.lab.eng.tlv2.redhat.com:/nas01/ge_7_export', iqn='null', vfsType='null', mountOptions='null', nfsVersion='null', nfsRetrans='null', nfsTimeo='null', iface='null', netIfaceName='null'}]', sendNetworkEventOnFailure='true'}), log id: 23f78d92
2019-06-12 06:44:08,519+03 INFO [org.ovirt.engine.core.bll.RemoveVdsCommand] (default task-24) [hosts_delete_9e65b7cc-9177-485d] Failed to Acquire Lock to object 'EngineLock:{exclusiveLocks='[aa4e6411-db68-4f57-94e1-4dd17230f86e=VDS, VDS_POOL_AND_STORAGE_CONNECTIONSaa4e6411-db68-4f57-94e1-4dd17230f86e=VDS_POOL_AND_STORAGE_CONNECTIONS]', sharedLocks=''}'
2019-06-12 06:44:08,519+03 WARN [org.ovirt.engine.core.bll.RemoveVdsCommand] (default task-24) [hosts_delete_9e65b7cc-9177-485d] Validation of action 'RemoveVds' failed for user admin@internal-authz. Reasons: VAR__ACTION__REMOVE,VAR__TYPE__HOST,ACTION_TYPE_FAILED_OBJECT_LOCKED
2019-06-12 06:44:08,521+03 ERROR [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (default task-24) [] Operation Failed: [Cannot remove Host. Related operation is currently in progress. Please try again later.]

Version-Release number of selected component (if applicable):
Introduced in 4.3.4.2, when Bug 1667723 was fixed.

How reproducible:
100% - fails on every run of automation TestCase18976.

Steps to Reproduce:
1. Move the host to maintenance.
2. As soon as the host reaches Maintenance status, try to remove it.

Actual results:
Removing the host fails because the host is locked by DisconnectHostFromStoragePoolServersCommand.

Expected results:
The host should be moved to Maintenance status only after all operations, including DisconnectHostFromStoragePoolServersCommand, have finished.

Additional info:
A workaround is to wait until DisconnectHostFromStoragePoolServersCommand is finished, but this is really annoying: there is no indication that the host is locked, and a user who sees the host in Maintenance status expects to be able to remove it.
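A minimal sketch of the workaround from an automation client, assuming the oVirt Python SDK (ovirtsdk4); the engine URL, credentials and host name are placeholders, and the retry count/interval are arbitrary:

# Workaround sketch: retry RemoveVds until the engine releases the host lock.
import time
import ovirtsdk4 as sdk

connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='password',
    insecure=True,
)
hosts_service = connection.system_service().hosts_service()
host = hosts_service.list(search='name=host_mixed_2')[0]
host_service = hosts_service.host_service(host.id)

# The host already reports 'maintenance', but the remove may still fail with
# "Related operation is currently in progress" while the engine is running
# DisconnectHostFromStoragePoolServersCommand, so retry for a while.
for _ in range(30):
    try:
        host_service.remove()
        break
    except sdk.Error:
        time.sleep(10)

connection.close()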
This bug report has Keywords: Regression or TestBlocker. Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.
After debugging the issue described in this bug and discussing it with Benny, we found that the status recorded in the job table in the database reflects the real status of the host, while the status reported in the log is not accurate, as mentioned in this bug. So we have the following options:

1. Check the job table; once the job status is in the 'finished' state, the remove command can be issued.
2. Revert the fix for Bug 1667723 (https://bugzilla.redhat.com/show_bug.cgi?id=1667723).
3. Stay with the current state and make no change.

What do you think?
(In reply to Ahmad Khiet from comment #2)
> After debugging the mentioned issue in this bug, and discussed it with Benny,
> we found that the status updated in the job table in the database reflects
> the real status of the host, and what the status in the log is not accurate
> as mentioned in this bug.

What do you mean by that? Is the host status I get from the REST API wrong/different from the one in the DB? If so, that mismatch is the issue that should be solved, and fixing it would be solution #0. In automation I wait until the host is in 'maintenance' and only then try to remove it, but because related tasks/jobs are still running, the operation fails.

> so we have the following solutions:
>
> 1- check job table if the status in 'finished' state, then remove command
> can be issued.

This is a workaround, not a fix for the bug. We need the host status to reflect its true state: host DB status = host status in the logs = host status in REST (which is what users rely on in automation/Ansible).

> 2- revert the bug Bug 1667723
> https://bugzilla.redhat.com/show_bug.cgi?id=1667723

Also not good, as that would bring back the issue of the host being removed while it is still disconnecting from storage, which is bad.

> 3- stay with the current state and make no change.

No way; most of our users now use Ansible automation, wait until the host is in Maintenance, and expect the DB status to match what the logs and REST report.

How hard is it to implement the straightforward logic: wait until all related jobs/tasks are done and only then move the host to Maintenance? That is the root cause of this bug. If the host status I get from the REST API is wrong/different from the DB, then that is the issue that should be addressed.
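For completeness, a hedged sketch of option 1 (poll the engine jobs before issuing the remove), assuming the jobs collection the engine exposes at /ovirt-engine/api/jobs via ovirtsdk4; correlating jobs to the host by matching the host name in the job description is an assumption for illustration only:

# Option 1 sketch: wait until no unfinished engine job mentions the host.
import time
import ovirtsdk4.types as types

def wait_for_host_jobs(connection, host_name, timeout=120, interval=5):
    """Poll /jobs until no started job mentions the host, or the timeout expires."""
    jobs_service = connection.system_service().jobs_service()
    deadline = time.time() + timeout
    while time.time() < deadline:
        pending = [
            job for job in jobs_service.list()
            if host_name in (job.description or '')
            and job.status == types.JobStatus.STARTED
        ]
        if not pending:
            return True
        time.sleep(interval)
    return False

# Usage with the connection from the earlier sketch:
#   if wait_for_host_jobs(connection, 'host_mixed_2'):
#       host_service.remove()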
Following what Moti sent to the mailing list, I have added a lock-and-wait with a timeout of 1 minute to RemoveVdsCommand.
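For readers unfamiliar with the pattern: the actual change lives in the engine's Java RemoveVdsCommand, but the lock-and-wait-with-timeout idea looks roughly like this Python illustration (names and structure are purely illustrative, not engine code):

# Illustration only: wait up to 60 seconds for the per-host lock instead of
# failing immediately while the disconnect flow still holds it.
import threading

host_lock = threading.Lock()  # stands in for the engine's per-host lock

def remove_vds():
    if not host_lock.acquire(timeout=60):
        raise RuntimeError('Cannot remove host: related operation still in progress')
    try:
        pass  # perform the actual host removal here
    finally:
        host_lock.release()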
Verified on engine 4.3.6-0.1.
This bugzilla is included in oVirt 4.3.6 release, published on September 26th 2019. Since the problem described in this bug report should be resolved in oVirt 4.3.6 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.