Created attachment 1743178 [details]
logs

Description of problem:

Since the ability to assign a new master SD via the 'SwitchMasterStorageDomainCommand' command was introduced (BZ 1576923), there are two ways of assigning a new master SD:

1. Via the REST API, which invokes 'SwitchMasterStorageDomainCommand' on the engine side:

   POST /ovirt-engine/api/datacenters/123/setmaster

   <action>
     <storage_domain id="456"/>
   </action>

2. The old way: putting the master SD into maintenance and waiting until some other SD takes over the master role.

When we use the first way to re-assign the 'master' role to another SD of the same storage type (GlusterFS), the operation fails. The engine initiates 'SwitchMasterStorageDomainCommand' [1], apparently fails to complete it [2], and triggers a reconstruction of the storage domains [3]. A chain of ERRORs is then triggered on the engine [4] and on the SPM host [7]. You will notice that at this stage the SPM host has an orphaned task [5], and the reconstruction won't end until we clear this task [6]. After the engine finishes reconstructing the SDs, there is still an orphaned async task in the engine's DB [8].

The same issue can be reproduced via the 2nd way as well, but it requires some additional effort to hit it again, by activating and deactivating domains of the GlusterFS type.

[1]:
2020-12-30 10:23:28,962+02 INFO  [org.ovirt.engine.core.bll.storage.pool.SwitchMasterStorageDomainCommand] (default task-35) [d4c45b3d-6d41-473a-b5c5-8a1b2e2c7aeb] Lock Acquired to object 'EngineLock:{exclusiveLocks='[7b7f56b4-3957-41d2-9c27-da539d6af836=STORAGE, 2a0d3c24-3357-4677-b9b2-35486af464a3=STORAGE, dfad0b2f-c1e9-4a0d-9ede-30414f6bee36=POOL]', sharedLocks=''}'
2020-12-30 10:23:28,970+02 INFO  [org.ovirt.engine.core.bll.storage.pool.SwitchMasterStorageDomainCommand] (default task-35) [d4c45b3d-6d41-473a-b5c5-8a1b2e2c7aeb] Running command: SwitchMasterStorageDomainCommand internal: false. Entities affected : ID: dfad0b2f-c1e9-4a0d-9ede-30414f6bee36 Type: StoragePoolAction group MANIPULATE_STORAGE_DOMAIN with role type ADMIN
2020-12-30 10:23:28,970+02 INFO  [org.ovirt.engine.core.bll.storage.pool.SwitchMasterStorageDomainCommand] (default task-35) [d4c45b3d-6d41-473a-b5c5-8a1b2e2c7aeb] Locking the following storage domains: test_gluster_0, test_gluster_1

[2]:
2020-12-30 10:23:32,305+02 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyImpl] (EE-ManagedThreadFactory-engine-Thread-63727) [] Master domain is not in sync between DB and VDSM. Domain test_gluster_1 marked as master in DB and not in the storage
2020-12-30 10:23:32,312+02 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-63727) [] EVENT_ID: SYSTEM_MASTER_DOMAIN_NOT_IN_SYNC(990), Sync Error on Master Domain between Host host_mixed_1 and oVirt Engine. Domain: test_gluster_1 is marked as Master in oVirt Engine database but not on the Storage side. Please consult with Support on how to fix this issue.

[3]:
2020-12-30 10:23:32,330+02 INFO  [org.ovirt.engine.core.bll.storage.pool.ReconstructMasterDomainCommand] (EE-ManagedThreadFactory-engine-Thread-63727) [541ee6e7] Running command: ReconstructMasterDomainCommand internal: true. Entities affected : ID: 2a0d3c24-3357-4677-b9b2-35486af464a3 Type: Storage

[4]:
2020-12-30 10:23:32,392+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetAllTasksStatusesVDSCommand] (EE-ManagedThreadFactory-engine-Thread-63727) [541ee6e7] Failed in 'HSMGetAllTasksStatusesVDS' method
2020-12-30 10:23:32,397+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-63727) [541ee6e7] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM host_mixed_1 command HSMGetAllTasksStatusesVDS failed: value=(1, 0, b'', b'') abortedcode=100
2020-12-30 10:23:32,398+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (EE-ManagedThreadFactory-engine-Thread-63727) [541ee6e7] SpmStopVDSCommand::Not stopping SPM on vds 'host_mixed_1', pool id 'dfad0b2f-c1e9-4a0d-9ede-30414f6bee36' as there are uncleared tasks 'Task '29240337-9e23-4500-ac1a-abe495ae2b68', status 'finished''
2020-12-30 10:23:32,402+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-63727) [541ee6e7] EVENT_ID: VDS_ALERT_NOT_STOPPING_SPM_UNCLEARED_TASKS(9,030), Not stopping SPM on vds host_mixed_1, pool id dfad0b2f-c1e9-4a0d-9ede-30414f6bee36 as there are uncleared tasks Task '29240337-9e23-4500-ac1a-abe495ae2b68', status 'finished'

[5]:
[root@storage-ge5-vdsm1 ~]# vdsm-client Host getAllTasksStatuses
{
    "63855d34-1358-4cc8-b44d-363616ed42d7": {
        "code": 100,
        "message": "value=(1, 0, b'', b'') abortedcode=100",
        "taskID": "63855d34-1358-4cc8-b44d-363616ed42d7",
        "taskResult": "cleanSuccess",
        "taskState": "finished"
    }
}

[6]:
[root@storage-ge5-vdsm1 ~]# vdsm-client Task clear taskID=63855d34-1358-4cc8-b44d-363616ed42d7
true

[7]:
2020-12-30 10:23:31,468+0200 ERROR (tasks/2) [storage.StoragePool] Migration to new master 2a0d3c24-3357-4677-b9b2-35486af464a3 failed (sp:903)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sp.py", line 891, in masterMigrate
    exclude=('./lost+found',))
  File "/usr/lib/python3.6/site-packages/vdsm/storage/fileUtils.py", line 71, in tarCopy
    raise TarCopyFailed(tsrc.returncode, tdst.returncode, out, err)
vdsm.storage.fileUtils.TarCopyFailed: (1, 0, b'', b'')

[8]:
engine=# select * from async_tasks;
-[ RECORD 1 ]---+-------------------------------------
task_id         | 6f8549f0-b62a-4c15-86d3-3cbafa8d7b5d
action_type     | 1048
status          | 2
result          | 0
step_id         | aa47a5b8-61bf-4bba-8e41-f7a2510e082f
command_id      | 1ae62c43-39c3-476f-a23b-e99ee613d65f
started_at      | 2020-12-30 10:23:29.42+02
storage_pool_id | dfad0b2f-c1e9-4a0d-9ede-30414f6bee36
task_type       | 19
vdsm_task_id    | 29240337-9e23-4500-ac1a-abe495ae2b68
root_command_id | 1ae62c43-39c3-476f-a23b-e99ee613d65f
user_id         | d0f3cb14-48fd-11eb-bf93-001a4a231747

Version-Release number of selected component (if applicable):
rhv-release-4.4.4-6

How reproducible:
The 1st way always reproduces the issue; the 2nd way requires some additional tries.

Steps to Reproduce:
1. Have an environment with more than one GlusterFS SD.
2. Assign the master role from GlusterFS SD A to GlusterFS SD B.
3. If the issue has not appeared at this point, assign the master role to another SD of the same GlusterFS type.

Actual results:
Assigning the new master SD fails. Orphaned tasks in the engine DB and on the SPM host lead to an infinite (or very long) reconstruction of the SDs.

Expected results:
The master role should be re-assigned to the other SD.

Additional info:
Attaching the engine log, the SPM log, and the logs of the rest of the VDSM hosts.
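For reference, the first method above can be exercised with a plain HTTP client; the following is a minimal sketch of building the setmaster request (the function name is illustrative, and the IDs are the placeholder values from the example above, not real ones):

```python
import xml.etree.ElementTree as ET


def build_setmaster_request(datacenter_id, storage_domain_id):
    """Build the path and XML body for the data center 'setmaster' action."""
    path = "/ovirt-engine/api/datacenters/{}/setmaster".format(datacenter_id)
    action = ET.Element("action")
    ET.SubElement(action, "storage_domain", {"id": storage_domain_id})
    body = ET.tostring(action, encoding="unicode")
    return path, body


if __name__ == "__main__":
    # Placeholder IDs matching the example in the description.
    path, body = build_setmaster_request("123", "456")
    print("POST", path)
    print(body)
```

An actual invocation would additionally need the engine's base URL, HTTPS, authentication, and Content-Type/Accept headers (e.g. via the oVirt Python SDK or any HTTP library).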
Looks like a problem related to the 2nd way of moving the master role (via maintenance) appeared in the past as well: https://bugzilla.redhat.com/show_bug.cgi?id=1360456, https://bugzilla.redhat.com/show_bug.cgi?id=1298724
Shir, please share your recent findings regarding this.
(In reply to Tal Nisan from comment #2)
> Shir, please share your recent findings regarding this.

Reproduced on 4.3 as well:
ovirt-engine-4.3.11.4-0.1.el7.noarch
vdsm-4.30.50-1.el7ev.x86_64
Hi Shani, can you please explain why this is on QA? Was the issue described in this bug actually fixed? I can only find a patch that adds a validation to avoid switching the master storage domain from/to Gluster domains. So if I understand correctly, there is no real fix here, just a workaround.
Until bug 1913764 is fixed, we've blocked switching the master to and from Gluster domains; this block should be removed once the root cause is fixed.
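Conceptually, the temporary block amounts to a storage-type check before the switch is allowed to run; a hypothetical sketch of such a validation (function name, constant, and messages are illustrative, not the actual engine code):

```python
GLUSTERFS = "glusterfs"


def can_switch_master(current_master_type, new_master_type):
    """Reject switch-master operations involving a GlusterFS domain.

    Temporary block until bug 1913764 (TarCopyFailed during master
    migration on GlusterFS) is resolved.
    """
    if GLUSTERFS in (current_master_type, new_master_type):
        return False, "Switching the master to or from a GlusterFS domain is currently blocked"
    return True, ""
```

With this check in place, a switch between, say, two NFS domains proceeds, while any combination involving GlusterFS is refused up front instead of failing mid-migration.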
Moving to 'Verified'.
This bugzilla is included in oVirt 4.4.4 release, published on December 21st 2020. Since the problem described in this bug report should be resolved in oVirt 4.4.4 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.