1580215 – [geo-rep]: Lot of changelogs retries and "dict is null" errors in geo-rep logs

Bug 1580215 - [geo-rep]: Lot of changelogs retries and "dict is null" errors in geo-rep logs

Summary: [geo-rep]: Lot of changelogs retries and "dict is null" errors in geo-rep logs

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	GlusterFS
Classification:	Community
Component:	distribute
Sub Component:
Version:	4.1
Hardware:	All
OS:	All
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Mohit Agrawal
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:	1499520 1565577 1600671
Blocks:	1571069 1576767
TreeView+	depends on / blocked

Reported:	2018-05-21 03:42 UTC by Mohit Agrawal
Modified:	2018-07-12 18:16 UTC (History)
CC List:	9 users (show)
Fixed In Version:	glusterfs-v4.1.0
Clone Of:	1565577
Environment:
Last Closed:	2018-06-20 18:06:39 UTC
Regression:	---
Mount Type:	---
Documentation:	---
CRM:
Verified Versions:
Embargoed:
Dependent Products:

Attachments	(Terms of Use)

Description Mohit Agrawal 2018-05-21 03:42:06 UTC

+++ This bug was initially created as a clone of Bug #1565577 +++

Description of problem:
=======================
Observed excessive 'dict is null' errors on the master and the slave:

Master:
-------
[2018-04-10 06:57:16.611887] E [MSGID: 101046] [dht-common.c:749:dht_discover_complete] 0-master-dht: dict is null
The message "E [MSGID: 101046] [dht-common.c:749:dht_discover_complete] 0-master-dht: dict is null" repeated 6693 times between [2018-04-10 06:57:16.611887] and [2018-04-10 06:58:16.426846]
[2018-04-10 06:58:34.040023] E [MSGID: 101046] [dht-common.c:749:dht_discover_complete] 0-master-dht: dict is null
The message "E [MSGID: 101046] [dht-common.c:749:dht_discover_complete] 0-master-dht: dict is null" repeated 11449 times between [2018-04-10 06:58:34.040023] and [2018-04-10 07:00:12.429952]
[2018-04-10 07:00:26.919063] E [MSGID: 101046] [dht-common.c:749:dht_discover_complete] 0-master-dht: dict is null
The message "E [MSGID: 101046] [dht-common.c:749:dht_discover_complete] 0-master-dht: dict is null" repeated 9760 times between [2018-04-10 07:00:26.919063] and [2018-04-10 07:01:52.179336]


Slave:
------
The message "E [MSGID: 101046] [dht-common.c:749:dht_discover_complete] 0-slave-dht: dict is null" repeated 51 times between [2018-04-10 06:24:36.408769] and [2018-04-10 06:24:37.168309]
[2018-04-10 06:24:37.179356] I [MSGID: 109063] [dht-layout.c:713:dht_layout_normalize] 0-slave-dht: Found anomalies in (null) (gfid = aee3b531-d3ea-4a1b-a030-91f8e98b566c). Holes=1 overlaps=0
[2018-04-10 06:24:37.213912] E [MSGID: 101046] [dht-common.c:749:dht_discover_complete] 0-slave-dht: dict is null
The message "E [MSGID: 101046] [dht-common.c:749:dht_discover_complete] 0-slave-dht: dict is null" repeated 2 times between [2018-04-10 06:24:37.213912] and [2018-04-10 06:24:37.233486]
[2018-04-10 06:24:37.244772] I [MSGID: 109063] [dht-layout.c:713:dht_layout_normalize] 0-slave-dht: Found anomalies in (null) (gfid = 152cd127-929e-432c-af50-9e2f5de008bd). Holes=1 overlaps=0
[2018-04-10 06:24:37.256863] E [MSGID: 101046] [dht-common.c:749:dht_discover_complete] 0-slave-dht: dict is null
The message "E [MSGID: 101046] [dht-common.c:749:dht_discover_complete] 0-slave-dht: dict is null" repeated 2 times between [2018-04-10 06:24:37.256863] and [2018-04-10 06:24:37.275916]




Version-Release number of selected component (if applicable):
==============================================================
[root@dhcp42-58 geo-replication-slaves]# rpm -qa | grep gluster
glusterfs-3.12.2-7.el7rhgs.x86_64
vdsm-gluster-4.19.43-2.3.el7rhgs.noarch
libvirt-daemon-driver-storage-gluster-3.9.0-14.el7.x86_64
glusterfs-api-3.12.2-7.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-7.el7rhgs.x86_64
glusterfs-fuse-3.12.2-7.el7rhgs.x86_64
python2-gluster-3.12.2-7.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
glusterfs-client-xlators-3.12.2-7.el7rhgs.x86_64
glusterfs-server-3.12.2-7.el7rhgs.x86_64
glusterfs-rdma-3.12.2-7.el7rhgs.x86_64
gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
glusterfs-cli-3.12.2-7.el7rhgs.x86_64
glusterfs-libs-3.12.2-7.el7rhgs.x86_6

How reproducible:
=================
2/2

Steps to Reproduce:
===================
1.Create and start master and slave volumes
2.Create data on the master mount
3.Create and start a geo-replication session
4.Calculate the checksum of master and slave (matches)
5.rm -rf * on master
6.Checksum matches.

--- Additional comment from Red Hat Bugzilla Rules Engine on 2018-04-10 06:54:33 EDT ---

This bug is automatically being proposed for the release of Red Hat Gluster Storage 3 under active development and open for bug fixes, by setting the release flag 'rhgs‑3.4.0' to '?'. 

If this bug should be proposed for a different release, please manually change the proposed release flag.

--- Additional comment from Rochelle on 2018-04-10 23:57:32 EDT ---

sosreports at : http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/rallan/1565577/

--- Additional comment from Kotresh HR on 2018-04-11 01:02:24 EDT ---

This is introduced because of the fix [1] in dht. Hence assigned it to Mohit.

[1]: https://code.engineering.redhat.com/gerrit/#/c/132383/

--- Additional comment from Kotresh HR on 2018-04-11 01:25:42 EDT ---

Hi Mohit,

The bug also captures problems of geo-rep retries for directories. I had raised the concern with the bug [1] long back when the fix landed in upstream and there was no root cause or fix for the same. This should be root caused and fixed as it's causing lot of changelog retries which will slow down geo-rep significantly for directory operation workloads.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1499520

Thanks,
Kotresh HR

--- Additional comment from Rochelle on 2018-04-11 01:47:10 EDT ---

This issue is seen on the latest 3.4.0 bits : glusterfs-fuse-3.12.2-7.el7rhgs.x86_64

Geo-replication logs shows retrial of changelogs but its been processed already.
This is happening for every directory sync.

--- Additional comment from Red Hat Bugzilla Rules Engine on 2018-04-23 08:12:47 EDT ---

This bug is automatically being provided 'pm_ack+' for the release flag 'rhgs‑3.4.0', having been appropriately marked for the release, and having been provided ACK from Development and QE

--- Additional comment from Red Hat Bugzilla Rules Engine on 2018-04-24 07:26:44 EDT ---

Since this bug has has been approved for the RHGS 3.4.0 release of Red Hat Gluster Storage 3, through release flag 'rhgs-3.4.0+', and through the Internal Whiteboard entry of '3.4.0', the Target Release is being automatically set to 'RHGS 3.4.0'

--- Additional comment from Mohit Agrawal on 2018-04-30 01:16:13 EDT ---

Hi,

Patch is posted on upstream

https://review.gluster.org/#/c/19930/

Regards
Mohit Agrawal

--- Additional comment from Sunil Kumar Acharya on 2018-05-07 08:22:25 EDT ---

Downstream patch: https://code.engineering.redhat.com/gerrit/#/c/137779/

--- Additional comment from auto-bz-updater on 2018-05-09 11:46:24 EDT ---

Bug report changed to ON_QA status by bugzilla-updater. Bug has been added to an advisory.
https://errata.devel.redhat.com/advisory/32725

--- Additional comment from Rochelle on 2018-05-10 07:20:56 EDT ---

with build : 

glusterfs-cli-3.12.2-9.el7rhgs.x86_64
libvirt-daemon-driver-storage-gluster-3.9.0-14.el7_5.2.x86_64
python2-gluster-3.12.2-9.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
glusterfs-libs-3.12.2-9.el7rhgs.x86_64
glusterfs-fuse-3.12.2-9.el7rhgs.x86_64
glusterfs-events-3.12.2-9.el7rhgs.x86_64
vdsm-gluster-4.19.43-2.3.el7rhgs.noarch
glusterfs-3.12.2-9.el7rhgs.x86_64
glusterfs-api-3.12.2-9.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-9.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-9.el7rhgs.x86_64
glusterfs-rdma-3.12.2-9.el7rhgs.x86_64
gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
glusterfs-server-3.12.2-9.el7rhgs.x86_64




Changelog retries are not seen in the geo-rep logs as expected but 'dict is null' errors are still seen in the master and slave:

master:
-------
ssh%3A%2F%2Froot%4010.70.42.164%3Agluster%3A%2F%2F127.0.0.1%3Aslave.%2Frhs%2Fbrick1%2Fb1.gluster.log:[2018-05-10 10:57:39.026498] E [MSGID: 101046] [dht-common.c:1904:dht_revalidate_cbk] 0-master-dht: dict is null
ssh%3A%2F%2Froot%4010.70.42.164%3Agluster%3A%2F%2F127.0.0.1%3Aslave.%2Frhs%2Fbrick2%2Fb4.gluster.log:[2018-05-10 10:57:37.256890] E [MSGID: 101046] [dht-common.c:1904:dht_revalidate_cbk] 0-master-dht: dict is null


slave:
-----
common.c:751:dht_discover_complete] 0-slave-dht: dict is null
ebfab113-66ca-4f01-b3fa-f5210b6d0bc7:10.70.42.53.%2Frhs%2Fbrick2%2Fb4.slave.gluster.log:[2018-05-10 10:53:25.076222] E [MSGID: 101046] [dht-common.c:751:dht_discover_complete] 0-slave-dht: dict is null
ebfab113-66ca-4f01-b3fa-f5210b6d0bc7:10.70.42.53.%2Frhs%2Fbrick2%2Fb4.slave.gluster.log:[2018-05-10 10:53:25.113736] E [MSGID: 101046] [dht-common.c:751:dht_discover_complete] 0-slave-dht: dict is null

Moving this but to Failed_QA

--- Additional comment from Mohit Agrawal on 2018-05-10 07:37:20 EDT ---

Hi,

Last time at the time of fixing the issue I missed to fix the issue specific to avoid logs
in DHT.

I have posted a patch on upstream to resolve the same
https://review.gluster.org/20001


Regards
Mohit Agrawal

--- Additional comment from Sunil Kumar Acharya on 2018-05-14 02:47:28 EDT ---

Downstream Patch: https://code.engineering.redhat.com/gerrit/#/c/138490/

--- Additional comment from errata-xmlrpc on 2018-05-15 09:11:02 EDT ---

Bug report changed from MODIFIED to ON_QA status by the Errata System: 
Advisory RHEA-2018:32725-01: 
Changed by: Milind Changire (mchangir)
https://errata.devel.redhat.com/advisory/32725

--- Additional comment from Rochelle on 2018-05-16 02:13:03 EDT ---

Changelog retries are no longer seen thereby improving the time it takes for syncing.
 dict is null' errors are no longer seen as well.

[root@dhcp43-122 master]# rpm -qa | grep gluster
glusterfs-events-3.12.2-10.el7rhgs.x86_64
glusterfs-3.12.2-10.el7rhgs.x86_64
python2-gluster-3.12.2-10.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
libvirt-daemon-driver-storage-gluster-3.9.0-14.el7_5.4.x86_64
glusterfs-fuse-3.12.2-10.el7rhgs.x86_64
glusterfs-cli-3.12.2-10.el7rhgs.x86_64
vdsm-gluster-4.19.43-2.3.el7rhgs.noarch
glusterfs-server-3.12.2-10.el7rhgs.x86_64
glusterfs-rdma-3.12.2-10.el7rhgs.x86_64
glusterfs-libs-3.12.2-10.el7rhgs.x86_64
glusterfs-api-3.12.2-10.el7rhgs.x86_64
gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-10.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-10.el7rhgs.x86_64


Moving this bug to verified.

Comment 2 Worker Ant 2018-05-22 04:18:14 UTC

REVIEW: https://review.gluster.org/20041 (dht: Excessive 'dict is null' logs in dht_discover_complete) posted (#2) for review on release-4.1 by N Balachandran

Comment 3 Worker Ant 2018-05-22 10:25:45 UTC

COMMIT: https://review.gluster.org/20041 committed in release-4.1 by "Shyamsundar Ranganathan" <srangana> with a commit message- dht: Excessive 'dict is null' logs in dht_discover_complete

Problem: In Geo-Rep setup excessive "dict is null" logs in
         dht_discover_complete while xattr is NULL

Solution: To avoid the logs update a condition in dht_discover_complete



BUG: 1580215
Change-Id: Ic7aad712d9b6d69b85b76e4fdf2881adb0512237
Signed-off-by: Mohit Agrawal <moagrawa>

Comment 4 Shyamsundar 2018-06-20 18:06:39 UTC

This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-v4.1.0, please open a new bug report.

glusterfs-v4.1.0 has been announced on the Gluster mailinglists [1], packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailinglist [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2018-June/000102.html
[2] https://www.gluster.org/pipermail/gluster-users/

Note You need to log in before you can comment on or make changes to this bug.