Bug 1895277 - [DR] Remote data sync to the secondary site never completes
Summary: [DR] Remote data sync to the secondary site never completes
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Gluster
Version: 4.4.3.10
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ovirt-4.4.3-2
Assignee: Kaustav Majumder
QA Contact: SATHEESARAN
URL:
Whiteboard:
Depends On:
Blocks: 1894758
 
Reported: 2020-11-06 08:16 UTC by SATHEESARAN
Modified: 2021-01-22 12:51 UTC
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1894758
Environment:
Last Closed: 2021-01-22 12:51:43 UTC
oVirt Team: Gluster
Embargoed:
sasundar: ovirt-4.4?
aoconnor: blocker+
sasundar: planning_ack?
godas: devel_ack+
sasundar: testing_ack+




Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 112099 0 None MERGED gluster: Added null check for 'remoteHost' attribute coming from georep status. 2021-02-18 11:03:48 UTC
oVirt gerrit 112189 0 None MERGED gluster: Added null check for 'remoteHost' attribute coming from georep status. 2021-02-18 11:03:48 UTC

Description SATHEESARAN 2020-11-06 08:16:48 UTC
Description of problem:
------------------------
The RHHI-V DR mechanism uses gluster geo-replication to sync data to the remote site. I see that this works well and the checkpoint is reached, which means the data has been synced successfully to the secondary site. However, the RHV Manager at the primary site fails to recognize the completion of the geo-rep data sync and waits indefinitely.

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
RHV Manager 4.4.3
RHVH 4.4.3
RHGS 3.5.3

How reproducible:
-------------------
Always

Steps to Reproduce:
-------------------
1. Create a primary site with a 3-node RHHI-V deployment
2. Create a secondary site with a 3-node RHHI-V deployment, with no storage domains created, only the volumes
3. Create a VM with a 40 GB OS disk and install RHEL 8.3 on it
4. Create a geo-rep session from the primary site to the secondary site
5. Create a schedule to sync the data to the secondary site
6. Wait for the scheduled geo-rep session sync to get triggered

Actual results:
---------------
The geo-rep session starts and syncs the data successfully, but the RHV Manager/engine fails to recognize that the sync completed.

Expected results:
-----------------
Once gluster geo-replication successfully completes the data sync, the engine should recognize this and trigger the appropriate events.
--- Additional comment from SATHEESARAN on 2020-11-05 01:48:27 UTC ---

Suspected snippet from engine.log:

<snip>
2020-11-05 01:41:09,406Z INFO  [org.ovirt.engine.core.vdsbroker.gluster.GlusterTasksListVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-42) [] START, GlusterTasksListVDSCommand(HostName = rhsqa-grafton8-nic2.lab.eng.blr.redhat.com, VdsIdVDSCommandParametersBase:{hostId='ec261014-f90d-4044-ae99-18b6d51defa9'}), log id: 99d5563
2020-11-05 01:41:09,882Z ERROR [org.ovirt.engine.core.vdsbroker.gluster.GetGlusterVolumeGeoRepSessionStatusVDSCommand] (EE-ManagedThreadFactory-engine-Thread-55904) [] Failed in 'GetGlusterVolumeGeoRepSessionStatusVDS' method, for vds: 'rhsqa-grafton8-nic2.lab.eng.blr.redhat.com'; host: 'rhsqa-grafton8-nic2.lab.eng.blr.redhat.com': null
2020-11-05 01:41:09,882Z ERROR [org.ovirt.engine.core.vdsbroker.gluster.GetGlusterVolumeGeoRepSessionStatusVDSCommand] (EE-ManagedThreadFactory-engine-Thread-55904) [] Command 'GetGlusterVolumeGeoRepSessionStatusVDSCommand(HostName = rhsqa-grafton8-nic2.lab.eng.blr.redhat.com, GlusterVolumeGeoRepSessionVDSParameters:{hostId='ec261014-f90d-4044-ae99-18b6d51defa9', volumeName='vmstore'})' execution failed: null
2020-11-05 01:41:09,882Z INFO  [org.ovirt.engine.core.vdsbroker.gluster.GetGlusterVolumeGeoRepSessionStatusVDSCommand] (EE-ManagedThreadFactory-engine-Thread-55904) [] FINISH, GetGlusterVolumeGeoRepSessionStatusVDSCommand, return: , log id: 7573d0bf
2020-11-05 01:41:09,882Z ERROR [org.ovirt.engine.core.bll.gluster.GlusterGeoRepSyncJob] (EE-ManagedThreadFactory-engine-Thread-55904) [] Exception getting geo-rep status from vds EngineException: java.lang.NullPointerException (Failed with error ENGINE and code 5001)
2020-11-05 01:41:09,882Z INFO  [org.ovirt.engine.core.bll.gluster.GlusterGeoRepSyncJob] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-26) [] Geo-replication session details not updated for session 'ccd5800e-135d-4ef1-a868-f744eee88967:ssh://rhsqa-grafton10-nic2.lab.eng.blr.redhat.com::backupvol:ab084f86-cf8c-4cc6-89f5-10c53df60893' as there was error returning data from VDS
</snip>

At the same instant, from supervdsm.log on that VDS host:
----------------------------------------------------------
<snip>
MainProcess|jsonrpc/4::DEBUG::2020-11-05 01:41:09,894::commands::98::common.commands::(run) SUCCESS: <err> = b''; <rc> = 0
MainProcess|jsonrpc/4::DEBUG::2020-11-05 01:41:09,895::supervdsm_server::100::SuperVdsm.ServerCallback::(wrapper) return volumeGeoRepStatus with {'vmstore': {'sessions': [{'sessionKey': 'ccd5800e-135d-4ef1-a868-f744eee88967:ssh://rhsqa-grafton10-nic2.lab.eng.blr.redhat.com::backupvol:ab084f86-cf8c-4cc6-89f5-10c53df60893', 'remoteVolumeName': 'backupvol', 'bricks': [{'host': 'rhsqa-grafton8.lab.eng.blr.redhat.com', 'hostUuid': 'e5de2108-b3e1-4221-b00b-badc39137c9c', 'brickName': '/gluster_bricks/vmstore/vmstore', 'remoteHost': None, 'remoteUserName': 'root', 'status': 'Passive', 'crawlStatus': 'N/A', 'timeZone': 'UTC', 'lastSynced': 'N/A', 'checkpointTime': 'N/A', 'checkpointCompletionTime': 'N/A', 'entry': 'N/A', 'data': 'N/A', 'meta': 'N/A', 'failures': 'N/A', 'checkpointCompleted': 'N/A'}, {'host': 'rhsqa-grafton9.lab.eng.blr.redhat.com', 'hostUuid': '6d7102be-e874-417f-8270-5ef6b01be89e', 'brickName': '/gluster_bricks/vmstore/vmstore', 'remoteHost': None, 'remoteUserName': 'root', 'status': 'Passive', 'crawlStatus': 'N/A', 'timeZone': 'UTC', 'lastSynced': 'N/A', 'checkpointTime': 'N/A', 'checkpointCompletionTime': 'N/A', 'entry': 'N/A', 'data': 'N/A', 'meta': 'N/A', 'failures': 'N/A', 'checkpointCompleted': 'N/A'}, {'host': 'rhsqa-grafton7.lab.eng.blr.redhat.com', 'hostUuid': 'ccd5800e-135d-4ef1-a868-f744eee88967', 'brickName': '/gluster_bricks/vmstore/vmstore', 'remoteHost': None, 'remoteUserName': 'root', 'status': 'Active', 'crawlStatus': 'Changelog Crawl', 'timeZone': 'UTC', 'lastSynced': 1604540461, 'checkpointTime': 1604494210, 'checkpointCompletionTime': 1604494543, 'entry': '2525', 'data': '0', 'meta': '0', 'failures': '0', 'checkpointCompleted': 'Yes'}]}]}}
MainProcess|jsonrpc/5::DEBUG::2020-11-05 01:41:09,906::commands::98::common.commands::(run) SUCCESS: <err> = b''; <rc> = 0
MainProcess|jsonrpc/5::DEBUG::2020-11-05 01:41:09,906::supervdsm_server::100::SuperVdsm.ServerCallback::(wrapper) return tasksList with {}
</snip>


--- Additional comment from Kaustav Majumder on 2020-11-06 06:43:24 UTC ---

According to the geo-rep status XML from the gluster CLI, the <slave_node/> element is empty, so the JSON received on the engine side contains "remoteHost": null (from the engine debug logs).
Now checking the engine code at https://github.com/oVirt/ovirt-engine/blob/ede62008318d924556bc9dfc5710d90e9519670d/backend/manager/modules/vdsbroker/src/main/java/org/ovirt/engine/core/vdsbroker/gluster/GlusterVolumeGeoRepStatus.java#L63
Since the key 'remoteHost' is present but its value is null, invoking toString() on that null value throws the NullPointerException that results in the error.

The gluster CLI is not returning the correct XML, which is why this fails. The fix is trivial and can be handled on both the engine and vdsm sides; a minimal sketch of an engine-side guard follows below.
Someone from the gluster CLI team needs to verify this.
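
The linked gerrit patches (112099, 112189) describe adding a null check for the 'remoteHost' attribute coming from the geo-rep status. Below is a minimal sketch of that kind of guard, assuming the brick status arrives on the engine side as a map of attribute names to values; the class and helper names are illustrative, not the actual GlusterVolumeGeoRepStatus code.

// Minimal sketch only: illustrative names, not the actual engine implementation.
import java.util.HashMap;
import java.util.Map;

public class GeoRepRemoteHostNullCheck {

    // Guarded lookup: return the value as a String, or null when the key is
    // missing or its value is null, instead of calling toString() on null.
    static String safeString(Map<String, Object> brickStatus, String key) {
        Object value = brickStatus.get(key);
        return value == null ? null : value.toString();
    }

    public static void main(String[] args) {
        // Simulated brick entry from a Passive node, where the gluster CLI
        // reported an empty <slave_node/> and vdsm forwarded a null remoteHost.
        Map<String, Object> brickStatus = new HashMap<>();
        brickStatus.put("host", "rhsqa-grafton8.lab.eng.blr.redhat.com");
        brickStatus.put("status", "Passive");
        brickStatus.put("remoteHost", null);

        // Unguarded access, brickStatus.get("remoteHost").toString(), would throw
        // a NullPointerException; the guarded lookup returns null instead.
        System.out.println("remoteHost = " + safeString(brickStatus, "remoteHost"));
    }
}

With such a guard, a Passive brick that reports no slave node would simply carry a null remoteHost instead of aborting the whole session status update with a NullPointerException.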

Comment 1 SATHEESARAN 2020-11-09 06:52:17 UTC
The fix affects the geo-rep functionality related to the hyperconverged solution.
It does not impact any other engine-side functionality and does not need any further testing in that area.

Comment 5 SATHEESARAN 2020-12-04 10:55:27 UTC
Tested with 4.4.3.12-0.1.el8ev and glusterfs-6.0-49.el8rhgs, with the glusterfs-selinux package.

Geo-replication successfully syncs the data from the primary gluster volume to the secondary gluster volume
using rsync as the sync method.

Also, after the sync, the disaster-recovery roles work as expected and the VMs start successfully on the
secondary site.

