Bug 1671603

Summary:	flooding of "dict is NULL" logging & crash of client process
Product:	[Community] GlusterFS	Reporter:	Amar Tumballi <atumball>
Component:	core	Assignee:	bugs <bugs>
Status:	CLOSED CURRENTRELEASE	QA Contact:
Severity:	low	Docs Contact:
Priority:	low
Version:	5	CC:	amgad.saleh, archon810, bugs, pasik, vpvainio
Target Milestone:	---	Keywords:	Triaged, ZStream
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	glusterfs-6.x	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:	1313567
Clones:	1674225 (view as bug list)		Environment:
Last Closed:	2019-07-10 06:13:47 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	1313567
Bug Blocks:	1667103, 1674225

Description Amar Tumballi 2019-02-01 03:17:34 UTC

+++ This bug was initially created as a clone of Bug #1313567 +++

Description of problem:
following logs flood the log files

[2016-03-01 10:45:51.688339] W [dict.c:1282:dict_foreach_match] (-->/usr/lib64/libglusterfs.so.0(dict_foreach_match+0x65) [0x7ff139e1e5d5] -->/usr/lib64/glusterfs/3.7.8/xlator/features/index.so(+0x3950) [0x7ff12de49950] -->/usr/lib64/libglusterfs.so.0(dict_foreach_match+0xe1) [0x7ff139e1e651] ) 0-dict: dict|match|action is NULL [Invalid argument]

Version-Release number of selected component (if applicable):
glusterfs-3.7.8

How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:
I have seen the older one 
https://bugzilla.redhat.com/show_bug.cgi?id=1289893
but since i am using the latest version (3.7.8) fix is there. Could this be related to another part of index.c ?

--- Additional comment from Red Hat Bugzilla Rules Engine on 2016-03-01 23:14:38 UTC ---

This bug is automatically being proposed for the current z-stream release of Red Hat Gluster Storage 3 by setting the release flag 'rhgs‑3.1.z' to '?'. 

If this bug should be proposed for a different release, please manually change the proposed release flag.

--- Additional comment from Nithya Balachandran on 2016-03-09 04:16:05 UTC ---

This looks like it refers to an upstream release (3.7.8). Changing the product to reflect this.

--- Additional comment from evangelos on 2016-03-09 09:18:21 UTC ---

is there any update on this ?

thank you very much!

--- Additional comment from Nithya Balachandran on 2016-03-09 10:43:09 UTC ---

Moving this to Anuradha who worked on the original patch.

--- Additional comment from evangelos on 2016-04-12 11:20:00 UTC ---

is there any update on this ?

--- Additional comment from Anuradha on 2016-06-23 10:01:53 UTC ---

Hi evangelos,

That fix was made in 3.7.5. You say you have been the old issue. Did you upgrade from 3.7.5 to 3.7.8 and are seeing problem or was this volume freshly created based on 3.7.8?

As far as I know all the fixes for dict is NULL in index translator are sent. But there is an issue when volfiles are not updated during an upgrade.
If you had upgraded the volume, could you please provided the steps that you used to upgrade?

Also, could you also verify one thing for me from brick volfiles of you volume?
The brick volfiles are supposed to have the following lines:

volume test-index
    type features/index
    option xattrop-pending-watchlist trusted.afr.test-    <--------(1)
    option xattrop-dirty-watchlist trusted.afr.dirty      <--------(2)
    option index-base /export/test/brick2/.glusterfs/indices
    subvolumes test-barrier
end-volume

The two options mentioned above should exist. Otherwise you will see this problem. You can find volfiles at /var/lib/glusterd/vols/<volname>.

Thanks.

--- Additional comment from evangelos on 2016-07-07 19:28:59 UTC ---

Hi Anuradha, 

in the mean time due to various issues we had we decided to downgrade to 3.6 branch so currently i do not have a 3.7 deployment up and running. But thx for the hint i will keep this in mind for the future.
In the meantime you can close this bugzilla.

thank you

--- Additional comment from Anuradha on 2016-07-11 09:09:11 UTC ---

Hi Evangelos,

Thanks for the update.

Closing this bug as per comment#7

Thanks,
Anuradha.

--- Additional comment from Emerson Gomes on 2019-01-27 15:42:59 UTC ---

This error is still reproduceable in 5.3 when upgrading from a 3.x volume.
I had to recreate volume from scratch in 5.3 and copy data back in order to avoid it.

--- Additional comment from Artem Russakovskii on 2019-01-30 20:23:44 UTC ---

I just started seeing this error after upgrading from 4.1 to 5.3.

[2019-01-30 20:23:24.481581] W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) [0x7fd966fcd329] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument]

And it floods like crazy with these.

--- Additional comment from Emerson Gomes on 2019-01-30 20:33:12 UTC ---

I "solved" the issue after upgrading to 5.3 by creating a new volume and moving all data to it.

Apparently something is missing on the volumes after upgrade.

--- Additional comment from Artem Russakovskii on 2019-01-30 20:37:13 UTC ---

I just sent a message to the gluster mailing list about this because that's not how this problem should be resolved. I'm curious to hear what they say.

--- Additional comment from Emerson Gomes on 2019-01-30 20:39:04 UTC ---

Absolutely. That's the second big issue I had after upgrading. The first one is https://bugzilla.redhat.com/show_bug.cgi?id=1651246
Still unsolved (open for more than 2 months now)

--- Additional comment from Artem Russakovskii on 2019-01-30 20:40:29 UTC ---

You know, I was *just* going to comment in a follow-up reply about whether the issue here is possibly related to the one you just linked. Seeing tons of those too, though at least the dupes are suppressed.



==> mnt-SITE_data1.log <==
[2019-01-30 20:38:20.783713] W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) [0x7fd966fcd329] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument]

==> mnt-SITE_data3.log <==
The message "E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch handler" repeated 413 times between [2019-01-30 20:36:23.881090] and [2019-01-30 20:38:20.015593]
The message "I [MSGID: 108031] [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data3-replicate-0: selecting local read_child SITE_data3-client-0" repeated 42 times between [2019-01-30 20:36:23.290287] and [2019-01-30 20:38:20.280306]

==> mnt-SITE_data1.log <==
The message "I [MSGID: 108031] [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: selecting local read_child SITE_data1-client-0" repeated 50 times between [2019-01-30 20:36:22.247367] and [2019-01-30 20:38:19.459789]
The message "E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch handler" repeated 2654 times between [2019-01-30 20:36:22.667327] and [2019-01-30 20:38:20.546355]
[2019-01-30 20:38:21.492319] I [MSGID: 108031] [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: selecting local read_child SITE_data1-client-0

==> mnt-SITE_data3.log <==
[2019-01-30 20:38:22.349689] I [MSGID: 108031] [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data3-replicate-0: selecting local read_child SITE_data3-client-0

==> mnt-SITE_data1.log <==
[2019-01-30 20:38:22.762941] E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch handler

--- Additional comment from Emerson Gomes on 2019-01-30 20:48:52 UTC ---

Yeah, both arrised after upgrading from 3.x to 5.1, persisting in 5.2 and 5.3.

The other issue is even more critical.
It causes crashes, making the mount point being inacessible ("Transport endpoint is not connected" error) - Requiring a new manual umount/mount.
For now I have a crontab entry doing this, but I will have to downgrade if a fix is not issued soon...

--- Additional comment from Artem Russakovskii on 2019-01-31 18:00:40 UTC ---

Damn, you weren't kidding, I wish I saw these bug reports before I updated from rock solid 4.1.

Less than 24 hours after upgrading, I already got a crash that you referenced:


[2019-01-31 09:38:04.317604] W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) [0x7fcccafcd329] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument]
[2019-01-31 09:38:04.319308] W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) [0x7fcccafcd329] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument]
[2019-01-31 09:38:04.320047] W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) [0x7fcccafcd329] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument]
[2019-01-31 09:38:04.320677] W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) [0x7fcccafcd329] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument]
The message "I [MSGID: 108031] [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: selecting local read_child SITE_data1-client-3" repeated 5 times between [2019-01-31 09:37:54.751905] and [2019-01-31 09:38:03.958061]
The message "E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch handler" repeated 72 times between [2019-01-31 09:37:53.746741] and [2019-01-31 09:38:04.696993]
pending frames:
frame : type(1) op(READ)
frame : type(1) op(OPEN)
frame : type(0) op(0)
patchset: git://git.gluster.org/glusterfs.git
signal received: 6
time of crash:
2019-01-31 09:38:04
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 5.3
/usr/lib64/libglusterfs.so.0(+0x2764c)[0x7fccd706664c]
/usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7fccd7070cb6]
/lib64/libc.so.6(+0x36160)[0x7fccd622d160]
/lib64/libc.so.6(gsignal+0x110)[0x7fccd622d0e0]
/lib64/libc.so.6(abort+0x151)[0x7fccd622e6c1]
/lib64/libc.so.6(+0x2e6fa)[0x7fccd62256fa]
/lib64/libc.so.6(+0x2e772)[0x7fccd6225772]
/lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7fccd65bb0b8]
/usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x32c4d)[0x7fcccbb01c4d]
/usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x65778)[0x7fcccbdd1778]
/usr/lib64/libgfrpc.so.0(+0xe820)[0x7fccd6e31820]
/usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7fccd6e31b6f]
/usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fccd6e2e063]
/usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7fccd0b7e0b2]
/usr/lib64/libglusterfs.so.0(+0x854c3)[0x7fccd70c44c3]
/lib64/libpthread.so.0(+0x7559)[0x7fccd65b8559]
/lib64/libc.so.6(clone+0x3f)[0x7fccd62ef81f]
---------

Comment 1 Artem Russakovskii 2019-02-02 20:15:09 UTC

The fuse crash happened again yesterday, to another volume. Are there any mount options that could help mitigate this?

In the meantime, I set up a monit (https://mmonit.com/monit/) task to watch and restart the mount, which works and recovers the mount point within a minute. Not ideal, but a temporary workaround.

By the way, the way to reproduce this "Transport endpoint is not connected" condition for testing purposes is to kill -9 the right "glusterfs --process-name fuse" process.


monit check:
check filesystem glusterfs_data1 with path /mnt/glusterfs_data1
  start program  = "/bin/mount  /mnt/glusterfs_data1"
  stop program  = "/bin/umount /mnt/glusterfs_data1"
  if space usage > 90% for 5 times within 15 cycles
    then alert else if succeeded for 10 cycles then alert


stack trace:
[2019-02-01 23:22:00.312894] W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) [0x7fa0249e4329] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) [0x7fa024bf5af5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) [0x7fa02cf5b218] ) 0-dict: dict is NULL [Invalid argument]
[2019-02-01 23:22:00.314051] W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) [0x7fa0249e4329] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) [0x7fa024bf5af5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) [0x7fa02cf5b218] ) 0-dict: dict is NULL [Invalid argument]
The message "E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler" repeated 26 times between [2019-02-01 23:21:20.857333] and [2019-02-01 23:21:56.164427]
The message "I [MSGID: 108031] [afr-common.c:2543:afr_local_discovery_cbk] 0-SITE_data3-replicate-0: selecting local read_child SITE_data3-client-3" repeated 27 times between [2019-02-01 23:21:11.142467] and [2019-02-01 23:22:03.474036]
pending frames:
frame : type(1) op(LOOKUP)
frame : type(0) op(0)
patchset: git://git.gluster.org/glusterfs.git
signal received: 6
time of crash:
2019-02-01 23:22:03
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 5.3
/usr/lib64/libglusterfs.so.0(+0x2764c)[0x7fa02cf6664c]
/usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7fa02cf70cb6]
/lib64/libc.so.6(+0x36160)[0x7fa02c12d160]
/lib64/libc.so.6(gsignal+0x110)[0x7fa02c12d0e0]
/lib64/libc.so.6(abort+0x151)[0x7fa02c12e6c1]
/lib64/libc.so.6(+0x2e6fa)[0x7fa02c1256fa]
/lib64/libc.so.6(+0x2e772)[0x7fa02c125772]
/lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7fa02c4bb0b8]
/usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x5dc9d)[0x7fa025543c9d]
/usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x70ba1)[0x7fa025556ba1]
/usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x58f3f)[0x7fa0257dbf3f]
/usr/lib64/libgfrpc.so.0(+0xe820)[0x7fa02cd31820]
/usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7fa02cd31b6f]
/usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fa02cd2e063]
/usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7fa02694e0b2]
/usr/lib64/libglusterfs.so.0(+0x854c3)[0x7fa02cfc44c3]
/lib64/libpthread.so.0(+0x7559)[0x7fa02c4b8559]
/lib64/libc.so.6(clone+0x3f)[0x7fa02c1ef81f]

Comment 2 Amar Tumballi 2019-07-10 06:13:47 UTC

This got fixed with system flags which made spinlock properly. 

Also, the dict is NULL logs are now not coming up in the deployments.