Bug 1911597 - Block manually switch master domain for Gluster domains
Summary: Block manually switch master domain for Gluster domains
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Storage
Version: 4.4.4
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.4.4-1
Target Release: ---
Assignee: shani
QA Contact: Evelina Shames
URL:
Whiteboard:
Depends On:
Blocks: 1576923
 
Reported: 2020-12-30 09:19 UTC by Ilan Zuckerman
Modified: 2021-01-12 16:24 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1913764
Environment:
Last Closed: 2021-01-12 16:24:00 UTC
oVirt Team: Storage
Embargoed:
aoconnor: blocker+


Attachments
logs (1.12 MB, application/zip)
2020-12-30 09:19 UTC, Ilan Zuckerman


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 112823 0 master MERGED core: avoid switching master role for gluster storage domains 2021-02-08 05:58:47 UTC
oVirt gerrit 112861 0 ovirt-engine-4.4.4.z MERGED core: avoid switching master role for gluster storage domains 2021-02-08 05:58:47 UTC

Description Ilan Zuckerman 2020-12-30 09:19:33 UTC
Created attachment 1743178
logs

Description of problem:

Since the ability to assign the master SD via the 'SwitchMasterStorageDomainCommand' was introduced (BZ 1576923), there are two ways of assigning a new master SD:

1. With the help of the RESTful API, which invokes 'SwitchMasterStorageDomainCommand' on the engine side (a scripted sketch follows this list):
 POST /ovirt-engine/api/datacenters/123/setmaster
  <action>
    <storage_domain id="456"/>
  </action>

2. Via the old way: putting the master SD into maintenance and waiting until some other SD takes over the master role.
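
For reference, the first way can be driven from a script. Below is a minimal sketch in Python using the 'requests' library; only the /datacenters/{id}/setmaster endpoint and the XML body come from this report, while the engine URL, credentials, CA path, and IDs are placeholders.

# Minimal sketch of way 1. The endpoint and XML body are as shown above;
# everything marked "placeholder" is an assumption, not taken from this bug.
import requests

ENGINE_API = "https://engine.example.com/ovirt-engine/api"  # placeholder
DC_ID = "123"  # data center ID (placeholder)
SD_ID = "456"  # storage domain that should become master (placeholder)

response = requests.post(
    "%s/datacenters/%s/setmaster" % (ENGINE_API, DC_ID),
    data='<action><storage_domain id="%s"/></action>' % SD_ID,
    headers={"Content-Type": "application/xml"},
    auth=("admin@internal", "password"),    # placeholder credentials
    verify="/etc/pki/ovirt-engine/ca.pem",  # CA bundle path may differ
)
response.raise_for_status()
print(response.text)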

When we use the first way to re-assign the 'master' role to another SD of the same storage type (GlusterFS), the operation fails.
The engine initiates the 'SwitchMasterStorageDomainCommand' [1],
apparently fails to complete it [2],
and triggers a reconstruction of the storage domains [3].
Then a chain of ERRORs is triggered on the engine [4] and on the SPM host [7].
You will notice that at this stage the SPM host has an orphaned task [5], and the reconstruction won't end until we clear this task [6].
After the engine finishes reconstructing the SDs, there is still an orphaned async task in the engine's DB [8].

The same issue can be reproduced via the second way as well, but it requires some additional effort to hit: repeatedly activating and deactivating domains of the GlusterFS type.


[1]:
2020-12-30 10:23:28,962+02 INFO  [org.ovirt.engine.core.bll.storage.pool.SwitchMasterStorageDomainCommand] (default task-35) [d4c45b3d-6d41-473a-b5c5-8a1b2e2c7aeb] Lock Acquired to object 'EngineLock:{exclusiveLocks='[7b7f56b4-3957-41d2-9c27-da539d6af836=STORAGE, 2a0d3c24-3357-4677-b9b2-35486af464a3=STORAGE, dfad0b2f-c1e9-4a0d-9ede-30414f6bee36=POOL]', sharedLocks=''}'
2020-12-30 10:23:28,970+02 INFO  [org.ovirt.engine.core.bll.storage.pool.SwitchMasterStorageDomainCommand] (default task-35) [d4c45b3d-6d41-473a-b5c5-8a1b2e2c7aeb] Running command: SwitchMasterStorageDomainCommand internal: false. Entities affected :  ID: dfad0b2f-c1e9-4a0d-9ede-30414f6bee36 Type: StoragePoolAction group MANIPULATE_STORAGE_DOMAIN with role type ADMIN
2020-12-30 10:23:28,970+02 INFO  [org.ovirt.engine.core.bll.storage.pool.SwitchMasterStorageDomainCommand] (default task-35) [d4c45b3d-6d41-473a-b5c5-8a1b2e2c7aeb] Locking the following storage domains: test_gluster_0, test_gluster_1


[2]:
2020-12-30 10:23:32,305+02 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyImpl] (EE-ManagedThreadFactory-engine-Thread-63727) [] Master domain is not in sync between DB and VDSM. Domain test_gluster_1 marked as master in DB and not in the storage
2020-12-30 10:23:32,312+02 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-63727) [] EVENT_ID: SYSTEM_MASTER_DOMAIN_NOT_IN_SYNC(990), Sync Error on Master Domain between Host host_mixed_1 and oVirt Engine. Domain: test_gluster_1 is marked as Master in oVirt Engine database but not on the Storage side. Please consult with Support on how to fix this issue.


[3]:
2020-12-30 10:23:32,330+02 INFO  [org.ovirt.engine.core.bll.storage.pool.ReconstructMasterDomainCommand] (EE-ManagedThreadFactory-engine-Thread-63727) [541ee6e7] Running command: ReconstructMasterDomainCommand internal: true. Entities affected :  ID: 2a0d3c24-3357-4677-b9b2-35486af464a3 Type: Storage


[4]:
2020-12-30 10:23:32,392+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetAllTasksStatusesVDSCommand] (EE-ManagedThreadFactory-engine-Thread-63727) [541ee6e7] Failed in 'HSMGetAllTasksStatusesVDS' method
2020-12-30 10:23:32,397+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-63727) [541ee6e7] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM host_mixed_1 command HSMGetAllTasksStatusesVDS failed: value=(1, 0, b'', b'') abortedcode=100
2020-12-30 10:23:32,398+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (EE-ManagedThreadFactory-engine-Thread-63727) [541ee6e7] SpmStopVDSCommand::Not stopping SPM on vds 'host_mixed_1', pool id 'dfad0b2f-c1e9-4a0d-9ede-30414f6bee36' as there are uncleared tasks 'Task '29240337-9e23-4500-ac1a-abe495ae2b68', status 'finished''
2020-12-30 10:23:32,402+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-63727) [541ee6e7] EVENT_ID: VDS_ALERT_NOT_STOPPING_SPM_UNCLEARED_TASKS(9,030), Not stopping SPM on vds host_mixed_1, pool id dfad0b2f-c1e9-4a0d-9ede-30414f6bee36 as there are uncleared tasks Task '29240337-9e23-4500-ac1a-abe495ae2b68', status 'finished'


[5]:
[root@storage-ge5-vdsm1 ~]# vdsm-client Host getAllTasksStatuses
{
    "63855d34-1358-4cc8-b44d-363616ed42d7": {
        "code": 100,
        "message": "value=(1, 0, b'', b'') abortedcode=100",
        "taskID": "63855d34-1358-4cc8-b44d-363616ed42d7",
        "taskResult": "cleanSuccess",
        "taskState": "finished"
    }
}


[6]:
[root@storage-ge5-vdsm1 ~]# vdsm-client Task clear taskID=63855d34-1358-4cc8-b44d-363616ed42d7
true
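
Steps [5] and [6] can be combined into a small script. This is only a sketch: it wraps the same vdsm-client commands shown above, assumes their JSON output as seen in [5], and clears every task in state 'finished', so it should be run on the SPM host with care.

# List the SPM host's tasks and clear the ones stuck in state 'finished'.
import json
import subprocess

out = subprocess.check_output(["vdsm-client", "Host", "getAllTasksStatuses"])
tasks = json.loads(out)

for task_id, status in tasks.items():
    if status.get("taskState") == "finished":
        print("clearing orphaned task %s" % task_id)
        subprocess.check_call(
            ["vdsm-client", "Task", "clear", "taskID=%s" % task_id])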


[7]:
2020-12-30 10:23:31,468+0200 ERROR (tasks/2) [storage.StoragePool] Migration to new master 2a0d3c24-3357-4677-b9b2-35486af464a3 failed (sp:903)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sp.py", line 891, in masterMigrate
    exclude=('./lost+found',))
  File "/usr/lib/python3.6/site-packages/vdsm/storage/fileUtils.py", line 71, in tarCopy
    raise TarCopyFailed(tsrc.returncode, tdst.returncode, out, err)
vdsm.storage.fileUtils.TarCopyFailed: (1, 0, b'', b'')
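
For context, the TarCopyFailed tuple is (source tar return code, destination tar return code, stdout, stderr), so (1, 0, b'', b'') means the source-side tar exited with status 1 while the destination side was fine. Below is a simplified sketch of what fileUtils.tarCopy appears to do, reconstructed from the traceback above rather than from the actual vdsm source.

# Sketch: stream a 'tar create' in the source directory into a
# 'tar extract' in the destination; raise TarCopyFailed(src_rc, dst_rc,
# out, err) if either side fails.
import subprocess

class TarCopyFailed(Exception):
    pass

def tar_copy(src, dst, exclude=()):
    excludes = ["--exclude=%s" % pattern for pattern in exclude]
    tsrc = subprocess.Popen(["tar", "-cf", "-"] + excludes + ["."],
                            cwd=src, stdout=subprocess.PIPE)
    tdst = subprocess.Popen(["tar", "-xf", "-"], cwd=dst, stdin=tsrc.stdout,
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    tsrc.stdout.close()  # so the source tar gets SIGPIPE if the reader dies
    out, err = tdst.communicate()
    tsrc.wait()
    if tsrc.returncode != 0 or tdst.returncode != 0:
        raise TarCopyFailed(tsrc.returncode, tdst.returncode, out, err)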


[8]:
engine=# select * from async_tasks;
-[ RECORD 1 ]---+-------------------------------------
task_id         | 6f8549f0-b62a-4c15-86d3-3cbafa8d7b5d
action_type     | 1048
status          | 2
result          | 0
step_id         | aa47a5b8-61bf-4bba-8e41-f7a2510e082f
command_id      | 1ae62c43-39c3-476f-a23b-e99ee613d65f
started_at      | 2020-12-30 10:23:29.42+02
storage_pool_id | dfad0b2f-c1e9-4a0d-9ede-30414f6bee36
task_type       | 19
vdsm_task_id    | 29240337-9e23-4500-ac1a-abe495ae2b68
root_command_id | 1ae62c43-39c3-476f-a23b-e99ee613d65f
user_id         | d0f3cb14-48fd-11eb-bf93-001a4a231747
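
Note that the vdsm_task_id in this record matches the uncleared task the SPM refused to stop for in [4], which is how the engine-side leftover can be correlated with the SPM task. A sketch of that lookup, assuming local psycopg2 access to the 'engine' database (the connection details are placeholders):

# Find engine async_tasks rows pointing at the stuck vdsm task.
import psycopg2

VDSM_TASK_ID = "29240337-9e23-4500-ac1a-abe495ae2b68"  # from [4] and [8]

conn = psycopg2.connect(dbname="engine", user="engine",
                        password="password", host="localhost")  # placeholders
with conn, conn.cursor() as cur:
    cur.execute("SELECT task_id, action_type, status, started_at"
                " FROM async_tasks WHERE vdsm_task_id = %s", (VDSM_TASK_ID,))
    for row in cur.fetchall():
        print(row)
conn.close()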





Version-Release number of selected component (if applicable):
rhv-release-4.4.4-6


How reproducible:
The 1st way always reproduces the issue.
The 2nd way requires some additional tries.


Steps to Reproduce:
1. Have an env with more than one GlusterFS SD.
2. Re-assign the master role from GlusterFS SD A to GlusterFS SD B.
3. If you haven't seen the issue at this point, assign the master role to yet another SD of the same GlusterFS type.

Actual results:
Assigning the new master SD fails.
Orphaned tasks in the engine DB and on the SPM host lead to infinite (or very long) reconstruction of the SDs.

Expected results:
The master role should be re-assigned to the selected SD.

Additional info:
Attaching the engine log, the SPM log, and the rest of the VDSM hosts' logs.

Comment 1 Ilan Zuckerman 2020-12-30 09:25:11 UTC
It looks like a problem related to the 2nd way of moving the master role (via maintenance) has appeared in the past:
https://bugzilla.redhat.com/show_bug.cgi?id=1360456, https://bugzilla.redhat.com/show_bug.cgi?id=1298724

Comment 2 Tal Nisan 2020-12-30 18:07:40 UTC
Shir, please share your recent findings regarding this.

Comment 3 Evelina Shames 2020-12-31 14:04:00 UTC
(In reply to Tal Nisan from comment #2)
> Shir, please share your recent findings regarding this.

Reproduced on 4.3 as well:
ovirt-engine-4.3.11.4-0.1.el7.noarch
vdsm-4.30.50-1.el7ev.x86_64

Comment 4 Ilan Zuckerman 2021-01-07 06:14:47 UTC
Hi Shani, can you please explain why this is on QA?
Was the issue described in the description of this bug fixed?
I can only find a patch that adds a validation to avoid switching the master storage domain from/to Gluster domains. So if I understand correctly there is no real fix here, just a workaround (WA).

Comment 6 Tal Nisan 2021-01-07 14:41:26 UTC
Until bug 1913764 is fixed, we've blocked switching the master to and from Gluster domains; this block should be removed once the root cause is fixed.

Comment 10 Evelina Shames 2021-01-11 10:13:37 UTC
Moving to 'Verified'.

Comment 11 Sandro Bonazzola 2021-01-12 16:24:00 UTC
This bugzilla is included in oVirt 4.4.4 release, published on December 21st 2020.

Since the problem described in this bug report should be resolved in oVirt 4.4.4 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

