Description of problem:
------------------------
The RHHI-V DR mechanism uses gluster geo-replication to sync data to the remote site. The sync itself works: the checkpoint is reached, which means the data is synced successfully to the secondary site. However, RHV Manager at the primary site fails to recognize the completion of the geo-rep data sync and waits indefinitely.

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
RHV Manager 4.4.3
RHVH 4.4.3
RHGS 3.5.3

How reproducible:
-------------------
Always

Steps to Reproduce:
-------------------
1. Create a primary site with a 3-node RHHI-V deployment
2. Create a secondary site with a 3-node RHHI-V deployment, with no storage domains created, only the volumes
3. Create a VM with a 40GB OS disk and install RHEL 8.3 on it
4. Create a geo-rep session from the primary site to the secondary site
5. Create a schedule to sync the data to the secondary site
6. Wait for the geo-rep session schedule to be triggered

Actual results:
---------------
The geo-rep session starts and syncs the data successfully, but the RHV Manager/Engine fails to interpret the completion.

Expected results:
-----------------
Once gluster geo-replication successfully completes the data sync, the engine should recognize this and trigger the appropriate events.
--- Additional comment from SATHEESARAN on 2020-11-05 01:48:27 UTC ---

Suspected snippet from engine.log:

<snip>
2020-11-05 01:41:09,406Z INFO [org.ovirt.engine.core.vdsbroker.gluster.GlusterTasksListVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-42) [] START, GlusterTasksListVDSCommand(HostName = rhsqa-grafton8-nic2.lab.eng.blr.redhat.com, VdsIdVDSCommandParametersBase:{hostId='ec261014-f90d-4044-ae99-18b6d51defa9'}), log id: 99d5563
2020-11-05 01:41:09,882Z ERROR [org.ovirt.engine.core.vdsbroker.gluster.GetGlusterVolumeGeoRepSessionStatusVDSCommand] (EE-ManagedThreadFactory-engine-Thread-55904) [] Failed in 'GetGlusterVolumeGeoRepSessionStatusVDS' method, for vds: 'rhsqa-grafton8-nic2.lab.eng.blr.redhat.com'; host: 'rhsqa-grafton8-nic2.lab.eng.blr.redhat.com': null
2020-11-05 01:41:09,882Z ERROR [org.ovirt.engine.core.vdsbroker.gluster.GetGlusterVolumeGeoRepSessionStatusVDSCommand] (EE-ManagedThreadFactory-engine-Thread-55904) [] Command 'GetGlusterVolumeGeoRepSessionStatusVDSCommand(HostName = rhsqa-grafton8-nic2.lab.eng.blr.redhat.com, GlusterVolumeGeoRepSessionVDSParameters:{hostId='ec261014-f90d-4044-ae99-18b6d51defa9', volumeName='vmstore'})' execution failed: null
2020-11-05 01:41:09,882Z INFO [org.ovirt.engine.core.vdsbroker.gluster.GetGlusterVolumeGeoRepSessionStatusVDSCommand] (EE-ManagedThreadFactory-engine-Thread-55904) [] FINISH, GetGlusterVolumeGeoRepSessionStatusVDSCommand, return: , log id: 7573d0bf
2020-11-05 01:41:09,882Z ERROR [org.ovirt.engine.core.bll.gluster.GlusterGeoRepSyncJob] (EE-ManagedThreadFactory-engine-Thread-55904) [] Exception getting geo-rep status from vds EngineException: java.lang.NullPointerException (Failed with error ENGINE and code 5001)
2020-11-05 01:41:09,882Z INFO [org.ovirt.engine.core.bll.gluster.GlusterGeoRepSyncJob] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-26) [] Geo-replication session details not updated for session
'ccd5800e-135d-4ef1-a868-f744eee88967:ssh://rhsqa-grafton10-nic2.lab.eng.blr.redhat.com::backupvol:ab084f86-cf8c-4cc6-89f5-10c53df60893' as there was an error returning data from VDS
</snip>

At the same instant, from supervdsm.log on that VDS host:
----------------------------------------------------------

<snip>
MainProcess|jsonrpc/4::DEBUG::2020-11-05 01:41:09,894::commands::98::common.commands::(run) SUCCESS: <err> = b''; <rc> = 0
MainProcess|jsonrpc/4::DEBUG::2020-11-05 01:41:09,895::supervdsm_server::100::SuperVdsm.ServerCallback::(wrapper) return volumeGeoRepStatus with {'vmstore': {'sessions': [{'sessionKey': 'ccd5800e-135d-4ef1-a868-f744eee88967:ssh://rhsqa-grafton10-nic2.lab.eng.blr.redhat.com::backupvol:ab084f86-cf8c-4cc6-89f5-10c53df60893', 'remoteVolumeName': 'backupvol', 'bricks': [{'host': 'rhsqa-grafton8.lab.eng.blr.redhat.com', 'hostUuid': 'e5de2108-b3e1-4221-b00b-badc39137c9c', 'brickName': '/gluster_bricks/vmstore/vmstore', 'remoteHost': None, 'remoteUserName': 'root', 'status': 'Passive', 'crawlStatus': 'N/A', 'timeZone': 'UTC', 'lastSynced': 'N/A', 'checkpointTime': 'N/A', 'checkpointCompletionTime': 'N/A', 'entry': 'N/A', 'data': 'N/A', 'meta': 'N/A', 'failures': 'N/A', 'checkpointCompleted': 'N/A'}, {'host': 'rhsqa-grafton9.lab.eng.blr.redhat.com', 'hostUuid': '6d7102be-e874-417f-8270-5ef6b01be89e', 'brickName': '/gluster_bricks/vmstore/vmstore', 'remoteHost': None, 'remoteUserName': 'root', 'status': 'Passive', 'crawlStatus': 'N/A', 'timeZone': 'UTC', 'lastSynced': 'N/A', 'checkpointTime': 'N/A', 'checkpointCompletionTime': 'N/A', 'entry': 'N/A', 'data': 'N/A', 'meta': 'N/A', 'failures': 'N/A', 'checkpointCompleted': 'N/A'}, {'host': 'rhsqa-grafton7.lab.eng.blr.redhat.com', 'hostUuid': 'ccd5800e-135d-4ef1-a868-f744eee88967', 'brickName': '/gluster_bricks/vmstore/vmstore', 'remoteHost': None, 'remoteUserName': 'root', 'status': 'Active', 'crawlStatus': 'Changelog Crawl', 'timeZone': 'UTC', 'lastSynced': 1604540461, 'checkpointTime': 1604494210,
'checkpointCompletionTime': 1604494543, 'entry': '2525', 'data': '0', 'meta': '0', 'failures': '0', 'checkpointCompleted': 'Yes'}]}]}}
MainProcess|jsonrpc/5::DEBUG::2020-11-05 01:41:09,906::commands::98::common.commands::(run) SUCCESS: <err> = b''; <rc> = 0
MainProcess|jsonrpc/5::DEBUG::2020-11-05 01:41:09,906::supervdsm_server::100::SuperVdsm.ServerCallback::(wrapper) return tasksList with {}
</snip>

--- Additional comment from Kaustav Majumder on 2020-11-06 06:43:24 UTC ---

In the XML returned by the gluster CLI, the <slave_node/> element is empty (null). This shows up in the JSON received on the engine side as "remoteHost": null (from engine debug logs).

See the engine code at https://github.com/oVirt/ovirt-engine/blob/ede62008318d924556bc9dfc5710d90e9519670d/backend/manager/modules/vdsbroker/src/main/java/org/ovirt/engine/core/vdsbroker/gluster/GlusterVolumeGeoRepStatus.java#L63

Since the key 'remoteHost' is present but its value is null, invoking toString() on it throws the NullPointerException that produces the error. The gluster CLI is not emitting the correct XML, which is why the engine fails. Although this is trivial and could be handled on either the engine or vdsm side, someone from the gluster CLI team needs to verify this.
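The failure mode above can be sketched in Python. This is a hypothetical illustration, not the actual vdsm or engine code: the element names mirror the gluster geo-rep `status --xml` output, but `parse_brick_pair` and `text_or_default` are invented helpers. It shows how an empty, self-closed `<slave_node/>` yields a None text value, and how defaulting it (rather than converting None to a string later) avoids the crash:

```python
import xml.etree.ElementTree as ET

# A fragment resembling one brick pair from the gluster geo-rep status XML,
# with the problematic self-closed <slave_node/> element.
SAMPLE_XML = """
<pair>
  <master_node>rhsqa-grafton8.lab.eng.blr.redhat.com</master_node>
  <master_brick>/gluster_bricks/vmstore/vmstore</master_brick>
  <slave_user>root</slave_user>
  <slave_node/>
  <status>Passive</status>
  <crawl_status>N/A</crawl_status>
</pair>
"""

def text_or_default(elem, tag, default="N/A"):
    """Return the stripped text of a child element, or `default` when the
    element is missing, self-closed (text is None), or whitespace-only."""
    child = elem.find(tag)
    if child is None or child.text is None or not child.text.strip():
        return default
    return child.text.strip()

def parse_brick_pair(xml_text):
    """Build a status dict for one brick pair, never leaking None values."""
    pair = ET.fromstring(xml_text)
    return {
        "host": text_or_default(pair, "master_node"),
        "brickName": text_or_default(pair, "master_brick"),
        "remoteUserName": text_or_default(pair, "slave_user"),
        # The bug: <slave_node/> is empty, so naive .text access yields None
        # and a later string conversion raises. Default it instead.
        "remoteHost": text_or_default(pair, "slave_node"),
        "status": text_or_default(pair, "status"),
        "crawlStatus": text_or_default(pair, "crawl_status"),
    }

if __name__ == "__main__":
    brick = parse_brick_pair(SAMPLE_XML)
    print(brick["remoteHost"])  # prints "N/A" rather than None
```

The same guard applies on the Java side: checking the value for null before calling toString() would let the engine degrade gracefully even when the CLI emits the empty element.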
The fix affects only the geo-rep functionality related to the hyperconverged solution. It has no impact on any other engine-side functionality and does not require any additional testing there.
Tested with 4.4.3.12-0.1.el8ev and glusterfs-6.0-49.el8rhgs, with the glusterfs-selinux package installed. Geo-replication successfully syncs the data from the primary gluster volume to the secondary gluster volume using rsync as the sync method. After the sync, the disaster-recovery roles also work as expected, and the VMs start successfully on the secondary site.