Description of problem:
The following logs flood the log files:

[2016-03-01 10:45:51.688339] W [dict.c:1282:dict_foreach_match] (-->/usr/lib64/libglusterfs.so.0(dict_foreach_match+0x65) [0x7ff139e1e5d5] -->/usr/lib64/glusterfs/3.7.8/xlator/features/index.so(+0x3950) [0x7ff12de49950] -->/usr/lib64/libglusterfs.so.0(dict_foreach_match+0xe1) [0x7ff139e1e651] ) 0-dict: dict|match|action is NULL [Invalid argument]

Version-Release number of selected component (if applicable):
glusterfs-3.7.8

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
I have seen the older report https://bugzilla.redhat.com/show_bug.cgi?id=1289893, but since I am using the latest version (3.7.8), that fix should already be included. Could this be related to another part of index.c?
This looks like it refers to an upstream release (3.7.8). Changing the product to reflect this.
Is there any update on this? Thank you very much!
Moving this to Anuradha, who worked on the original patch.
Is there any update on this?
Hi Evangelos,

That fix was made in 3.7.5. You say you have seen the old issue. Did you upgrade from 3.7.5 to 3.7.8 and are now seeing this problem, or was this volume freshly created on 3.7.8?

As far as I know, all the fixes for "dict is NULL" in the index translator have been sent. But there is an issue when volfiles are not updated during an upgrade. If you upgraded the volume, could you please provide the steps that you used to upgrade?

Also, could you verify one thing for me from the brick volfiles of your volume? The brick volfiles are supposed to have the following lines:

volume test-index
    type features/index
    option xattrop-pending-watchlist trusted.afr.test-    <--------(1)
    option xattrop-dirty-watchlist trusted.afr.dirty      <--------(2)
    option index-base /export/test/brick2/.glusterfs/indices
    subvolumes test-barrier
end-volume

The two options marked above should exist; otherwise you will see this problem. You can find the volfiles at /var/lib/glusterd/vols/<volname>.

Thanks.
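If it helps, here is a quick way to check for those options (a minimal sketch; the grep pattern is only illustrative, the path is the one mentioned above, and <volname> is your volume's name):

# look for the two watchlist options in every brick volfile of the volume
grep -n "xattrop-pending-watchlist\|xattrop-dirty-watchlist" /var/lib/glusterd/vols/<volname>/*.vol

Each brick volfile should show both options; if either is missing, the volfile was not updated during the upgrade and you will see this log flood.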
Hi Anuradha, in the meantime, due to various issues we had, we decided to downgrade to the 3.6 branch, so currently I do not have a 3.7 deployment up and running. But thanks for the hint; I will keep this in mind for the future. For now you can close this Bugzilla. Thank you.
Hi Evangelos,

Thanks for the update. Closing this bug as per comment #7.

Thanks,
Anuradha.
This error is still reproducible in 5.3 when upgrading from a 3.x volume. I had to recreate the volume from scratch in 5.3 and copy the data back in order to avoid it.
I just started seeing this error after upgrading from 4.1 to 5.3.

[2019-01-30 20:23:24.481581] W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) [0x7fd966fcd329] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument]

And it floods like crazy with these.
I "solved" the issue after upgrading to 5.3 by creating a new volume and moving all data to it. Apparently something is missing on the volumes after upgrade.
I just sent a message to the gluster mailing list about this because that's not how this problem should be resolved. I'm curious to hear what they say.
Absolutely. That's the second big issue I had after upgrading. The first one is https://bugzilla.redhat.com/show_bug.cgi?id=1651246, which is still unsolved (open for more than 2 months now).
You know, I was *just* going to comment in a follow-up reply about whether the issue here is possibly related to the one you just linked. Seeing tons of those too, though at least the dupes are suppressed.

==> mnt-SITE_data1.log <==
[2019-01-30 20:38:20.783713] W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) [0x7fd966fcd329] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument]

==> mnt-SITE_data3.log <==
The message "E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch handler" repeated 413 times between [2019-01-30 20:36:23.881090] and [2019-01-30 20:38:20.015593]
The message "I [MSGID: 108031] [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data3-replicate-0: selecting local read_child SITE_data3-client-0" repeated 42 times between [2019-01-30 20:36:23.290287] and [2019-01-30 20:38:20.280306]

==> mnt-SITE_data1.log <==
The message "I [MSGID: 108031] [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: selecting local read_child SITE_data1-client-0" repeated 50 times between [2019-01-30 20:36:22.247367] and [2019-01-30 20:38:19.459789]
The message "E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch handler" repeated 2654 times between [2019-01-30 20:36:22.667327] and [2019-01-30 20:38:20.546355]
[2019-01-30 20:38:21.492319] I [MSGID: 108031] [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: selecting local read_child SITE_data1-client-0

==> mnt-SITE_data3.log <==
[2019-01-30 20:38:22.349689] I [MSGID: 108031] [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data3-replicate-0: selecting local read_child SITE_data3-client-0

==> mnt-SITE_data1.log <==
[2019-01-30 20:38:22.762941] E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch handler
Yeah, both arose after upgrading from 3.x to 5.1, and they persist in 5.2 and 5.3.

The other issue is even more critical. It causes crashes that make the mount point inaccessible ("Transport endpoint is not connected" error), requiring a manual umount/mount. For now I have a crontab entry doing this, but I will have to downgrade if a fix is not issued soon...
Damn, you weren't kidding. I wish I had seen these bug reports before I updated from the rock-solid 4.1. Less than 24 hours after upgrading, I already got the crash you referenced:

[2019-01-31 09:38:04.317604] W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) [0x7fcccafcd329] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument]
[2019-01-31 09:38:04.319308] W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) [0x7fcccafcd329] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument]
[2019-01-31 09:38:04.320047] W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) [0x7fcccafcd329] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument]
[2019-01-31 09:38:04.320677] W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) [0x7fcccafcd329] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument]
The message "I [MSGID: 108031] [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: selecting local read_child SITE_data1-client-3" repeated 5 times between [2019-01-31 09:37:54.751905] and [2019-01-31 09:38:03.958061]
The message "E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch handler" repeated 72 times between [2019-01-31 09:37:53.746741] and [2019-01-31 09:38:04.696993]
pending frames:
frame : type(1) op(READ)
frame : type(1) op(OPEN)
frame : type(0) op(0)
patchset: git://git.gluster.org/glusterfs.git
signal received: 6
time of crash:
2019-01-31 09:38:04
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 5.3
/usr/lib64/libglusterfs.so.0(+0x2764c)[0x7fccd706664c]
/usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7fccd7070cb6]
/lib64/libc.so.6(+0x36160)[0x7fccd622d160]
/lib64/libc.so.6(gsignal+0x110)[0x7fccd622d0e0]
/lib64/libc.so.6(abort+0x151)[0x7fccd622e6c1]
/lib64/libc.so.6(+0x2e6fa)[0x7fccd62256fa]
/lib64/libc.so.6(+0x2e772)[0x7fccd6225772]
/lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7fccd65bb0b8]
/usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x32c4d)[0x7fcccbb01c4d]
/usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x65778)[0x7fcccbdd1778]
/usr/lib64/libgfrpc.so.0(+0xe820)[0x7fccd6e31820]
/usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7fccd6e31b6f]
/usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fccd6e2e063]
/usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7fccd0b7e0b2]
/usr/lib64/libglusterfs.so.0(+0x854c3)[0x7fccd70c44c3]
/lib64/libpthread.so.0(+0x7559)[0x7fccd65b8559]
/lib64/libc.so.6(clone+0x3f)[0x7fccd62ef81f]
---------
The fuse crash happened again yesterday, to another volume. Are there any mount options that could help mitigate this?

In the meantime, I set up a monit (https://mmonit.com/monit/) task to watch and restart the mount, which works and recovers the mount point within a minute. Not ideal, but a temporary workaround.

By the way, the way to reproduce this "Transport endpoint is not connected" condition for testing purposes is to kill -9 the right "glusterfs --process-name fuse" process.

monit check:
check filesystem glusterfs_data1 with path /mnt/glusterfs_data1
  start program = "/bin/mount /mnt/glusterfs_data1"
  stop program = "/bin/umount /mnt/glusterfs_data1"
  if space usage > 90% for 5 times within 15 cycles
    then alert else if succeeded for 10 cycles then alert

stack trace:
[2019-02-01 23:22:00.312894] W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) [0x7fa0249e4329] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) [0x7fa024bf5af5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) [0x7fa02cf5b218] ) 0-dict: dict is NULL [Invalid argument]
[2019-02-01 23:22:00.314051] W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) [0x7fa0249e4329] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) [0x7fa024bf5af5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) [0x7fa02cf5b218] ) 0-dict: dict is NULL [Invalid argument]
The message "E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler" repeated 26 times between [2019-02-01 23:21:20.857333] and [2019-02-01 23:21:56.164427]
The message "I [MSGID: 108031] [afr-common.c:2543:afr_local_discovery_cbk] 0-SITE_data3-replicate-0: selecting local read_child SITE_data3-client-3" repeated 27 times between [2019-02-01 23:21:11.142467] and [2019-02-01 23:22:03.474036]
pending frames:
frame : type(1) op(LOOKUP)
frame : type(0) op(0)
patchset: git://git.gluster.org/glusterfs.git
signal received: 6
time of crash:
2019-02-01 23:22:03
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 5.3
/usr/lib64/libglusterfs.so.0(+0x2764c)[0x7fa02cf6664c]
/usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7fa02cf70cb6]
/lib64/libc.so.6(+0x36160)[0x7fa02c12d160]
/lib64/libc.so.6(gsignal+0x110)[0x7fa02c12d0e0]
/lib64/libc.so.6(abort+0x151)[0x7fa02c12e6c1]
/lib64/libc.so.6(+0x2e6fa)[0x7fa02c1256fa]
/lib64/libc.so.6(+0x2e772)[0x7fa02c125772]
/lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7fa02c4bb0b8]
/usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x5dc9d)[0x7fa025543c9d]
/usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x70ba1)[0x7fa025556ba1]
/usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x58f3f)[0x7fa0257dbf3f]
/usr/lib64/libgfrpc.so.0(+0xe820)[0x7fa02cd31820]
/usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7fa02cd31b6f]
/usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fa02cd2e063]
/usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7fa02694e0b2]
/usr/lib64/libglusterfs.so.0(+0x854c3)[0x7fa02cfc44c3]
/lib64/libpthread.so.0(+0x7559)[0x7fa02c4b8559]
/lib64/libc.so.6(clone+0x3f)[0x7fa02c1ef81f]
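For anyone trying to reproduce it, a minimal sketch of that kill step (the mount point is the one from the monit check above; the ps pattern is illustrative, so confirm the PID before killing anything, since several glusterfs processes may be running):

# find the fuse client process backing the mount
ps aux | grep "[g]lusterfs --process-name fuse"
# kill it; the mount point then reports "Transport endpoint is not connected"
kill -9 <pid>
# recover manually (or let monit remount it)
umount /mnt/glusterfs_data1 && mount /mnt/glusterfs_data1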
(In reply to Artem Russakovskii from comment #17)
> The fuse crash happened again yesterday, to another volume. Are there any
> mount options that could help mitigate this?
> [...]

Please mount the volume using the option lru-limit=0 and see if the crashes go away.
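In case it is useful to others, a minimal sketch of what that looks like on the client side (server, volume, and mount-point names are placeholders; the same option should also work in the options column of an fstab entry):

# fuse mount with the inode LRU limit disabled
mount -t glusterfs -o lru-limit=0 <server>:/<volname> /mnt/<mountpoint>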
We are currently working on analysing some coredumps and will update once we have a fix.
(In reply to Nithya Balachandran from comment #18)

Before I make this change, what is the expected effect on performance on a busy system with lots of small and large files (millions)?

Also, is it possible to set the lru-limit to 0 via gluster volume options (network.inode-lru-limit?), or will that be sufficiently different from using the fuse mount option?

Do the gluster docs have clear instructions on how to capture a core dump? I haven't found any other than the instructions to provide one (but not how) in case of a crash.
(In reply to Artem Russakovskii from comment #19)
> (In reply to Nithya Balachandran from comment #18)
>
> Before I make this change, what is the expected effect on performance on a
> busy system with lots of small and large files (millions)?

This turns off a feature that was introduced in 5.3 to clear unused inodes from memory for the fuse client gluster process and prevent its memory usage from rising. Turning it off would cause the client to behave the same as in the earlier releases.

> Also, is it possible to set the lru-limit to 0 via gluster volume options
> (network.inode-lru-limit?) or will it be sufficiently different from using
> the fuse mount option?

Not at the moment. This is a mount option for the gluster fuse client. It is not the same as the network.inode-lru-limit option, which controls the inodes in the brick processes.

> Do gluster docs have clear instructions on how to capture a core dump
> because I haven't found any other than the instructions to provide one (but
> not how) in case of a crash?

To configure coredumps, you need to specify the size and location. The following link describes it well:
https://stackoverflow.com/questions/17965/how-to-generate-a-core-dump-in-linux-on-a-segmentation-fault

Make sure you have sufficient space in that location for a large coredump (it can run into several GBs).
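As a starting point, a minimal sketch of the usual Linux-side setup (the core directory and filename pattern are illustrative; for a systemd-managed service, LimitCORE=infinity in the unit file serves the same purpose as the ulimit call):

# allow cores of unlimited size in the current shell
ulimit -c unlimited
# write cores to a dedicated directory, with program name, pid and timestamp in the filename
mkdir -p /var/crash/cores
echo '/var/crash/cores/core.%e.%p.%t' > /proc/sys/kernel/core_pattern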
Also seeing this issue in 4.1.7 with the Samba VFS client logs. We have worked around it by setting the client log level to error ('gluster volume set vol01 diagnostics.client-log-level ERROR'). Could you please advise on a long-term solution?

Many thanks,
Ryan
Added core dump and gdb output here: https://bugzilla.redhat.com/show_bug.cgi?id=1651246
Probably the same issue as this one.
Hello,

Is there any progress with this? We've had multiple systems consume their entire root volume because this log file fills it.
I confirm all issues are gone after upgrading to 5.4.