1571069 – [geo-rep]: Lot of changelogs retries and "dict is null" errors in geo-rep logs

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	GlusterFS
Classification:	Community
Component:	distribute
Sub Component:
Version:	mainline
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Assignee:	Mohit Agrawal
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:	1499520 1565577 1576767 1580215 1600671
Blocks:

Reported:	2018-04-24 04:18 UTC by Mohit Agrawal
Modified:	2018-07-12 18:16 UTC
CC List:	11 users
Fixed In Version:	glusterfs-v4.1.0
Clone Of:	1565577
Environment:
Last Closed:	2018-06-20 18:05:40 UTC
Regression:	---
Mount Type:	---
Documentation:	---
CRM:
Verified Versions:
Embargoed:
Dependent Products:

Description Mohit Agrawal 2018-04-24 04:18:23 UTC

+++ This bug was initially created as a clone of Bug #1565577 +++

Description of problem:
=======================
Observed excessive 'dict is null' errors on the master and the slave:

Master:
-------
[2018-04-10 06:57:16.611887] E [MSGID: 101046] [dht-common.c:749:dht_discover_complete] 0-master-dht: dict is null
The message "E [MSGID: 101046] [dht-common.c:749:dht_discover_complete] 0-master-dht: dict is null" repeated 6693 times between [2018-04-10 06:57:16.611887] and [2018-04-10 06:58:16.426846]
[2018-04-10 06:58:34.040023] E [MSGID: 101046] [dht-common.c:749:dht_discover_complete] 0-master-dht: dict is null
The message "E [MSGID: 101046] [dht-common.c:749:dht_discover_complete] 0-master-dht: dict is null" repeated 11449 times between [2018-04-10 06:58:34.040023] and [2018-04-10 07:00:12.429952]
[2018-04-10 07:00:26.919063] E [MSGID: 101046] [dht-common.c:749:dht_discover_complete] 0-master-dht: dict is null
The message "E [MSGID: 101046] [dht-common.c:749:dht_discover_complete] 0-master-dht: dict is null" repeated 9760 times between [2018-04-10 07:00:26.919063] and [2018-04-10 07:01:52.179336]


Slave:
------
The message "E [MSGID: 101046] [dht-common.c:749:dht_discover_complete] 0-slave-dht: dict is null" repeated 51 times between [2018-04-10 06:24:36.408769] and [2018-04-10 06:24:37.168309]
[2018-04-10 06:24:37.179356] I [MSGID: 109063] [dht-layout.c:713:dht_layout_normalize] 0-slave-dht: Found anomalies in (null) (gfid = aee3b531-d3ea-4a1b-a030-91f8e98b566c). Holes=1 overlaps=0
[2018-04-10 06:24:37.213912] E [MSGID: 101046] [dht-common.c:749:dht_discover_complete] 0-slave-dht: dict is null
The message "E [MSGID: 101046] [dht-common.c:749:dht_discover_complete] 0-slave-dht: dict is null" repeated 2 times between [2018-04-10 06:24:37.213912] and [2018-04-10 06:24:37.233486]
[2018-04-10 06:24:37.244772] I [MSGID: 109063] [dht-layout.c:713:dht_layout_normalize] 0-slave-dht: Found anomalies in (null) (gfid = 152cd127-929e-432c-af50-9e2f5de008bd). Holes=1 overlaps=0
[2018-04-10 06:24:37.256863] E [MSGID: 101046] [dht-common.c:749:dht_discover_complete] 0-slave-dht: dict is null
The message "E [MSGID: 101046] [dht-common.c:749:dht_discover_complete] 0-slave-dht: dict is null" repeated 2 times between [2018-04-10 06:24:37.256863] and [2018-04-10 06:24:37.275916]
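
To gauge how noisy these logs are, the "repeated N times" summary lines can be totalled per message. The following is a minimal sketch, assuming the summary-line format shown in the excerpts above; the names `REPEAT_RE`, `count_repeats` and `MASTER_SAMPLE` are illustrative and are not part of GlusterFS.

```python
import re

# Matches the log-suppression summary lines shown above:
#   The message "<text>" repeated <N> times between [<start>] and [<end>]
REPEAT_RE = re.compile(
    r'^The message "(?P<msg>.*)" repeated (?P<count>\d+) times between'
)


def count_repeats(lines):
    """Sum the 'repeated N times' counts per distinct message text."""
    totals = {}
    for line in lines:
        match = REPEAT_RE.match(line.strip())
        if match:
            msg = match.group("msg")
            totals[msg] = totals.get(msg, 0) + int(match.group("count"))
    return totals


# The three summary lines from the master log excerpt above.
MASTER_SAMPLE = [
    'The message "E [MSGID: 101046] [dht-common.c:749:dht_discover_complete] 0-master-dht: dict is null" repeated 6693 times between [2018-04-10 06:57:16.611887] and [2018-04-10 06:58:16.426846]',
    'The message "E [MSGID: 101046] [dht-common.c:749:dht_discover_complete] 0-master-dht: dict is null" repeated 11449 times between [2018-04-10 06:58:34.040023] and [2018-04-10 07:00:12.429952]',
    'The message "E [MSGID: 101046] [dht-common.c:749:dht_discover_complete] 0-master-dht: dict is null" repeated 9760 times between [2018-04-10 07:00:26.919063] and [2018-04-10 07:01:52.179336]',
]

if __name__ == "__main__":
    for msg, total in count_repeats(MASTER_SAMPLE).items():
        print(total, msg)
```

For the master excerpt this gives 27902 suppressed repeats of the same error in under five minutes of log time. The individually logged first occurrences are not included in that figure.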




Version-Release number of selected component (if applicable):
==============================================================
[root@dhcp42-58 geo-replication-slaves]# rpm -qa | grep gluster
glusterfs-3.12.2-7.el7rhgs.x86_64
vdsm-gluster-4.19.43-2.3.el7rhgs.noarch
libvirt-daemon-driver-storage-gluster-3.9.0-14.el7.x86_64
glusterfs-api-3.12.2-7.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-7.el7rhgs.x86_64
glusterfs-fuse-3.12.2-7.el7rhgs.x86_64
python2-gluster-3.12.2-7.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
glusterfs-client-xlators-3.12.2-7.el7rhgs.x86_64
glusterfs-server-3.12.2-7.el7rhgs.x86_64
glusterfs-rdma-3.12.2-7.el7rhgs.x86_64
gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
glusterfs-cli-3.12.2-7.el7rhgs.x86_64
glusterfs-libs-3.12.2-7.el7rhgs.x86_6

How reproducible:
=================
2/2

Steps to Reproduce:
===================
1. Create and start the master and slave volumes.
2. Create data on the master mount.
3. Create and start a geo-replication session.
4. Calculate the checksums of the master and the slave (they match).
5. Run rm -rf * on the master mount.
6. Compare the checksums again (they match).

--- Additional comment from Red Hat Bugzilla Rules Engine on 2018-04-10 06:54:33 EDT ---

This bug is automatically being proposed for the release of Red Hat Gluster Storage 3 under active development and open for bug fixes, by setting the release flag 'rhgs‑3.4.0' to '?'. 

If this bug should be proposed for a different release, please manually change the proposed release flag.

--- Additional comment from Rochelle on 2018-04-10 23:57:32 EDT ---

sosreports at : http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/rallan/1565577/

--- Additional comment from Kotresh HR on 2018-04-11 01:02:24 EDT ---

This was introduced by the fix [1] in dht, so I am assigning it to Mohit.

[1]: https://code.engineering.redhat.com/gerrit/#/c/132383/

--- Additional comment from Kotresh HR on 2018-04-11 01:25:42 EDT ---

Hi Mohit,

This bug also covers the geo-rep retries seen for directories. I raised this concern in bug [1] long ago, when the fix landed upstream, but it was never root-caused or fixed. It should be root-caused and fixed now, because it causes a lot of changelog retries, which will slow geo-rep down significantly for directory-operation workloads.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1499520

Thanks,
Kotresh HR

--- Additional comment from Rochelle on 2018-04-11 01:47:10 EDT ---

This issue is seen on the latest 3.4.0 bits : glusterfs-fuse-3.12.2-7.el7rhgs.x86_64

The geo-replication logs show changelogs being retried even though they have already been processed.
This happens for every directory sync.

--- Additional comment from Red Hat Bugzilla Rules Engine on 2018-04-23 08:12:47 EDT ---

This bug is automatically being provided 'pm_ack+' for the release flag 'rhgs‑3.4.0', having been appropriately marked for the release, and having been provided ACK from Development and QE

Comment 2 Worker Ant 2018-04-24 04:25:33 UTC

REVIEW: https://review.gluster.org/19930 (posix: Avoid changelog retries for geo-rep) posted (#1) for review on master by MOHIT AGRAWAL

Comment 3 Worker Ant 2018-05-03 15:43:23 UTC

COMMIT: https://review.gluster.org/19930 committed in master by "Amar Tumballi" <amarts> with a commit message- posix: Avoid changelog retries for geo-rep

Problem: geo-rep is slow to migrate directories
         from the master volume to the slave volume
         because of a lot of changelog retries

Solution: Update the condition in posix_getxattr to
          ignore MDS_INTERNAL_XATTR, just as posix
          already ignores other internal xattrs

BUG: 1571069
Change-Id: I4d91ec73e5b1ca1cb3ecf0825ab9f49e261da70e
fixes: bz#1571069
Signed-off-by: Mohit Agrawal <moagrawa>
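
The idea in the commit message, hiding an internal xattr from the result of a getxattr-style lookup, can be illustrated in a few lines. This is only a sketch of the filtering pattern, not the actual patch: the real change is in the C function posix_getxattr and is not shown in this report. The key strings and the names `OTHER_INTERNAL_XATTRS` and `visible_xattrs` are assumptions made for the example.

```python
# Assumed key names, for illustration only; the real identifiers and
# their values are defined in the GlusterFS sources.
MDS_INTERNAL_XATTR = "trusted.glusterfs.dht.mds"
OTHER_INTERNAL_XATTRS = frozenset({"trusted.example.internal"})


def visible_xattrs(xattrs, ignore_mds=True):
    """Return the xattrs with internal keys filtered out.

    With ignore_mds=False the MDS key leaks through, which stands in for
    the behaviour before the fix; with ignore_mds=True it is hidden like
    the other internal keys.
    """
    hidden = set(OTHER_INTERNAL_XATTRS)
    if ignore_mds:
        hidden.add(MDS_INTERNAL_XATTR)
    return {key: value for key, value in xattrs.items() if key not in hidden}


if __name__ == "__main__":
    on_disk = {
        MDS_INTERNAL_XATTR: b"0",
        "trusted.example.internal": b"x",
        "user.comment": b"hello",
    }
    print(sorted(visible_xattrs(on_disk, ignore_mds=False)))
    print(sorted(visible_xattrs(on_disk)))
```

Keeping the hidden keys in one set means the MDS key is handled by the same check as every other internal key, which matches the intent stated in the commit message.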

Comment 4 Shyamsundar 2018-06-20 18:05:40 UTC

This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-v4.1.0, please open a new bug report.

glusterfs-v4.1.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2018-June/000102.html
[2] https://www.gluster.org/pipermail/gluster-users/
