Description of problem:
=======================
During automated geo-replication regression testing, 8 core files were found on 3 slave nodes.

Setup:
Master Volume: Tiered (cold tier: Distributed-Replicate 4x2, hot tier: Distributed-Replicate 2x2)
Slave Volume: Distributed-Replicate (4x2) from a 6-node slave cluster

Core files:
===========

Slave Node1:
============
[root@dhcp37-99 ~]# file /core*
/core.11201: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from '/usr/sbin/glusterfs --aux-gfid-mount --acl --log-file=/var/log/glusterfs/geo-re'
/core.11342: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from '/usr/sbin/glusterfs --aux-gfid-mount --acl --log-file=/var/log/glusterfs/geo-re'
[root@dhcp37-99 ~]#

Slave Node4:
============
[root@dhcp37-199 ~]# file /core*
/core.8970: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from '/usr/sbin/glusterfs --aux-gfid-mount --acl --log-file=/var/log/glusterfs/geo-re'
/core.9067: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from '/usr/sbin/glusterfs --aux-gfid-mount --acl --log-file=/var/log/glusterfs/geo-re'
/core.9113: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from '/usr/sbin/glusterfs --aux-gfid-mount --acl --log-file=/var/log/glusterfs/geo-re'
/core.9161: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from '/usr/sbin/glusterfs --aux-gfid-mount --acl --log-file=/var/log/glusterfs/geo-re'
/core.9210: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from '/usr/sbin/glusterfs --aux-gfid-mount --acl --log-file=/var/log/glusterfs/geo-re'
[root@dhcp37-199 ~]#

Slave Node5:
============
[root@dhcp37-162 ~]# file /core*
/core.9392: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from '/usr/sbin/glusterfs --aux-gfid-mount --acl --log-file=/var/log/glusterfs/geo-re'
[root@dhcp37-162 ~]#

Backtraces:
===========
# gdb /core.8970
(gdb) bt
#0 0x00007f5bdbf92696 in __memcpy_ssse3_back () from /lib64/libc.so.6
#1 0x00007f5bcedd2b84 in memcpy (__len=<optimized out>, __src=<optimized out>, __dest=<optimized out>) at /usr/include/bits/string3.h:51
#2 qr_content_extract (xdata=xdata@entry=0x7f5bdda61990) at quick-read.c:278
#3 0x00007f5bcedd2f54 in qr_lookup_cbk (frame=0x7f5bdb2a13bc, cookie=<optimized out>, this=0x7f5bc801a6f0, op_ret=0, op_errno=116, inode_ret=0x7f5bc73230dc, buf=0x7f5bce051864, xdata=0x7f5bdda61990, postparent=0x7f5bce051a94) at quick-read.c:422
#4 0x00007f5bcefde91c in ioc_lookup_cbk (frame=0x7f5bdb29c520, cookie=<optimized out>, this=<optimized out>, op_ret=<optimized out>, op_errno=<optimized out>, inode=0x7f5bc73230dc, stbuf=0x7f5bce051864, xdata=0x7f5bdda61990, postparent=0x7f5bce051a94) at io-cache.c:260
#5 0x00007f5bcf83c184 in dht_discover_complete (this=this@entry=0x7f5bc8014130, discover_frame=discover_frame@entry=0x7f5bdb29dd50) at dht-common.c:371
#6 0x00007f5bcf83cd1a in dht_discover_cbk (frame=0x7f5bdb29dd50, cookie=0x7f5bdb29f37c, this=0x7f5bc8014130, op_ret=<optimized out>, op_errno=117, inode=0x7f5bc73230dc, stbuf=0x7f5bb80721d0, xattr=0x7f5bdda61990, postparent=0x7f5bb8072240) at dht-common.c:506
#7 0x00007f5bcfac54de in afr_discover_done (this=<optimized out>, frame=0x7f5bdb29f37c) at afr-common.c:2166
#8 afr_discover_cbk (frame=0x7f5bdb29f37c, cookie=<optimized out>, this=<optimized out>, op_ret=<optimized out>, op_errno=<optimized out>, inode=<optimized out>, buf=0x7f5bd0724930, xdata=0x7f5bdda61a1c, postparent=0x7f5bd07249a0) at afr-common.c:2211
#9 0x00007f5bcfd0d467 in client3_3_lookup_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, myframe=0x7f5bdb2a1264) at client-rpc-fops.c:2981
#10 0x00007f5bdd55eb20 in rpc_clnt_handle_reply (clnt=clnt@entry=0x7f5bc813f4d0, pollin=pollin@entry=0x7f5bc8228cb0) at rpc-clnt.c:766
#11 0x00007f5bdd55eddf in rpc_clnt_notify (trans=<optimized out>, mydata=0x7f5bc813f500, event=<optimized out>, data=0x7f5bc8228cb0) at rpc-clnt.c:907
#12 0x00007f5bdd55a913 in rpc_transport_notify (this=this@entry=0x7f5bc814f150, event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=data@entry=0x7f5bc8228cb0) at rpc-transport.c:545
#13 0x00007f5bd21f34b6 in socket_event_poll_in (this=this@entry=0x7f5bc814f150) at socket.c:2236
#14 0x00007f5bd21f63a4 in socket_event_handler (fd=fd@entry=11, idx=idx@entry=9, data=0x7f5bc814f150, poll_in=1, poll_out=0, poll_err=0) at socket.c:2349
#15 0x00007f5bdd7f18ca in event_dispatch_epoll_handler (event=0x7f5bd0724e80, event_pool=0x7f5bde206c90) at event-epoll.c:575
#16 event_dispatch_epoll_worker (data=0x7f5bde25d440) at event-epoll.c:678
#17 0x00007f5bdc5f8dc5 in start_thread () from /lib64/libpthread.so.0
#18 0x00007f5bdbf3f1cd in clone () from /lib64/libc.so.6
(gdb)

# gdb /core.9113
(gdb) bt
#0 0x00007f73dd520d19 in __memcpy_ssse3_back () from /lib64/libc.so.6
#1 0x00007f73d03632c9 in memcpy (__len=17584, __src=<optimized out>, __dest=<optimized out>) at /usr/include/bits/string3.h:51
#2 qr_readv_cached (frame=frame@entry=0x7f73dc835cf8, qr_inode=0x7f73c401a050, size=size@entry=20480, offset=offset@entry=0, flags=flags@entry=32768, xdata=xdata@entry=0x0) at quick-read.c:580
#3 0x00007f73d036351a in qr_readv (frame=0x7f73dc835cf8, this=0x7f73cc01a6f0, fd=0x7f73cc2139f8, size=20480, offset=0, flags=32768, xdata=0x0) at quick-read.c:617
#4 0x00007f73ded30d48 in default_readv_resume (frame=0x7f73dc836b14, this=0x7f73cc01bb10, fd=0x7f73cc2139f8, size=20480, offset=0, flags=32768, xdata=0x0) at defaults.c:1405
#5 0x00007f73ded4fb82 in call_resume_wind (stub=<optimized out>) at call-stub.c:2118
#6 0x00007f73ded5017d in call_resume (stub=0x7f73dc2d206c) at call-stub.c:2576
#7 0x00007f73d0159648 in open_and_resume (this=this@entry=0x7f73cc01bb10, fd=fd@entry=0x7f73cc2139f8, stub=stub@entry=0x7f73dc2d206c) at open-behind.c:242
#8 0x00007f73d015b8af in ob_readv (frame=0x7f73dc836b14, this=0x7f73cc01bb10, fd=<optimized out>, size=<optimized out>, offset=0, flags=32768, xdata=0x0) at open-behind.c:392
#9 0x00007f73cbdf63c3 in mdc_readv (frame=0x7f73dc839008, this=0x7f73cc01cee0, fd=0x7f73cc21398c, size=20480, offset=0, flags=32768, xdata=0x0) at md-cache.c:1499
#10 0x00007f73cbbda83d in io_stats_readv (frame=0x7f73dc834f88, this=0x7f73cc01e2b0, fd=0x7f73cc21398c, size=20480, offset=0, flags=32768, xdata=0x0) at io-stats.c:2141
#11 0x00007f73ded246a0 in default_readv (frame=0x7f73dc834f88, this=0x7f73cc01f7b0, fd=0x7f73cc21398c, size=20480, offset=0, flags=32768, xdata=0x0) at defaults.c:1846
#12 0x00007f73ded246a0 in default_readv (frame=0x7f73dc834f88, this=0x7f73cc020890, fd=0x7f73cc21398c, size=20480, offset=0, flags=32768, xdata=0x0) at defaults.c:1846
#13 0x00007f73ded246a0 in default_readv (frame=0x7f73dc834f88, this=0x7f73cc021c70, fd=0x7f73cc21398c, size=20480, offset=0, flags=32768, xdata=0x0) at defaults.c:1846
#14 0x00007f73cb5aa07e in meta_readv (frame=0x7f73dc834f88, this=0x7f73cc021c70, fd=0x7f73cc21398c, size=20480, offset=0, flags=32768, xdata=0x0) at meta.c:79
#15 0x00007f73d59ae490 in fuse_readv_resume (state=0x7f73bc04e650) at fuse-bridge.c:2231
#16 0x00007f73d59a6cb5 in fuse_resolve_done (state=<optimized out>) at fuse-resolve.c:642
#17 fuse_resolve_all (state=<optimized out>) at fuse-resolve.c:669
#18 0x00007f73d59a69f8 in fuse_resolve (state=0x7f73bc04e650) at fuse-resolve.c:633
#19 0x00007f73d59a6cfe in fuse_resolve_all (state=<optimized out>) at fuse-resolve.c:665
#20 0x00007f73d59a6283 in fuse_resolve_continue (state=state@entry=0x7f73bc04e650) at fuse-resolve.c:685
#21 0x00007f73d59a6998 in fuse_resolve_fd (state=0x7f73bc04e650) at fuse-resolve.c:545
#22 fuse_resolve (state=0x7f73bc04e650) at fuse-resolve.c:622
#23 0x00007f73d59a6cde in fuse_resolve_all (state=<optimized out>) at fuse-resolve.c:658
#24 0x00007f73d59a6d20 in fuse_resolve_and_resume (state=0x7f73bc04e650, fn=0x7f73d59ae270 <fuse_readv_resume>) at fuse-resolve.c:697
#25 0x00007f73d59b9a0e in fuse_thread_proc (data=0x7f73df7eda30) at fuse-bridge.c:4935
#26 0x00007f73ddb88dc5 in start_thread () from /lib64/libpthread.so.0
#27 0x00007f73dd4cf1cd in clone () from /lib64/libc.so.6
(gdb)

Both cores dump inside memcpy() in the quick-read translator: the first on the lookup path (qr_content_extract(), quick-read.c:278) and the second while serving a read from the quick-read cache (qr_readv_cached(), quick-read.c:580). See the sketch after the steps to reproduce below.

Version-Release number of selected component (if applicable):
=============================================================
glusterfs-3.7.5-10.el7rhgs.x86_64

How reproducible:
=================
1/1

Steps to Reproduce:
===================
1. Run the geo-rep distaf automation for the following test cases:
testcases="changelog-test-create changelog-test-chmod changelog-test-chown changelog-test-chgrp changelog-test-symlink changelog-test-truncate changelog-test-rename changelog-test-remove xsync-test-create xsync-test-chmod xsync-test-chown xsync-test-chgrp xsync-test-symlink xsync-test-truncate history-test-create history-test-chmod history-test-chown history-test-chgrp history-test-symlink history-test-truncate history-test-rename history-test-remove history-dynamic-create history-dynamic-chmod history-dynamic-chown history-dynamic-chgrp history-dynamic-symlink history-dynamic-truncate history-dynamic-rename history-dynamic-remove"
2. python main.py -d "geo_rep" -t "$testcases"
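One way to read these frames is that the copy length comes from one source (the content carried in the lookup's xdata dict, or the bytes cached for the readv) while the destination buffer is sized from another (the file's stat size), and the two disagree. The following is a minimal, hypothetical C sketch of the kind of defensive copy that avoids this class of overrun; it is not the GlusterFS source, and the names extract_content, dict_value_len and stat_size are invented for illustration only.

/*
 * Minimal, hypothetical sketch: clamp the copy length to the smaller of
 * the two advertised sizes instead of trusting either one alone.
 */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

static void *
extract_content(const char *dict_value, size_t dict_value_len, uint64_t stat_size)
{
    if (dict_value == NULL || dict_value_len == 0)
        return NULL;

    /* Never copy more than the dict actually carries, and never more
     * than the size the stat information claims the file has. */
    size_t copy_len = dict_value_len;
    if (stat_size < copy_len)
        copy_len = (size_t)stat_size;

    void *content = malloc(copy_len);
    if (content == NULL)
        return NULL;

    memcpy(content, dict_value, copy_len);
    return content;   /* caller frees */
}

The point is only that a copy length taken from one metadata source must be checked against the buffer sized from the other (or the request failed) rather than passed to memcpy() unchecked; whether the product fix takes exactly this form is not claimed here.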
Observed this again with build glusterfs-3.7.5-15.el7rhgs.x86_64:

(gdb) bt
#0 0x00007f0f548cadfb in __memcpy_sse2 () from /lib64/libc.so.6
#1 0x00007f0f435c7b84 in qr_content_extract () from /usr/lib64/glusterfs/3.7.5/xlator/performance/quick-read.so
#2 0x00007f0f435c7f54 in qr_lookup_cbk () from /usr/lib64/glusterfs/3.7.5/xlator/performance/quick-read.so
#3 0x00007f0f437d391c in ioc_lookup_cbk () from /usr/lib64/glusterfs/3.7.5/xlator/performance/io-cache.so
#4 0x00007f0f4822d364 in dht_discover_complete () from /usr/lib64/glusterfs/3.7.5/xlator/cluster/distribute.so
#5 0x00007f0f4822defa in dht_discover_cbk () from /usr/lib64/glusterfs/3.7.5/xlator/cluster/distribute.so
#6 0x00007f0f484b79ee in afr_discover_cbk () from /usr/lib64/glusterfs/3.7.5/xlator/cluster/replicate.so
#7 0x00007f0f486ff477 in client3_3_lookup_cbk () from /usr/lib64/glusterfs/3.7.5/xlator/protocol/client.so
#8 0x00007f0f55f51b20 in rpc_clnt_handle_reply () from /lib64/libgfrpc.so.0
#9 0x00007f0f55f51ddf in rpc_clnt_notify () from /lib64/libgfrpc.so.0
#10 0x00007f0f55f4d913 in rpc_transport_notify () from /lib64/libgfrpc.so.0
#11 0x00007f0f4abe64b6 in socket_event_poll_in () from /usr/lib64/glusterfs/3.7.5/rpc-transport/socket.so
#12 0x00007f0f4abe93a4 in socket_event_handler () from /usr/lib64/glusterfs/3.7.5/rpc-transport/socket.so
#13 0x00007f0f561e48ca in event_dispatch_epoll_worker () from /lib64/libglusterfs.so.0
#14 0x00007f0f54febdc5 in start_thread () from /lib64/libpthread.so.0
#15 0x00007f0f5493221d in clone () from /lib64/libc.so.6
(gdb) quit
Occasionally hitting this crash on a tiered volume during the geo-rep automation run:

(gdb) bt
#0 0x00007fbf73a4d5f7 in raise () from /lib64/libc.so.6
#1 0x00007fbf73a4ece8 in abort () from /lib64/libc.so.6
#2 0x00007fbf73a8d317 in __libc_message () from /lib64/libc.so.6
#3 0x00007fbf73a93184 in malloc_printerr () from /lib64/libc.so.6
#4 0x00007fbf73a96877 in _int_malloc () from /lib64/libc.so.6
#5 0x00007fbf73a9787c in malloc () from /lib64/libc.so.6
#6 0x00007fbf75391ecb in __gf_malloc () from /lib64/libglusterfs.so.0
#7 0x00007fbf753921e3 in gf_vasprintf () from /lib64/libglusterfs.so.0
#8 0x00007fbf753922d4 in gf_asprintf () from /lib64/libglusterfs.so.0
#9 0x00007fbf75361c97 in gf_glusterlog_log_repetitions.isra.3 () from /lib64/libglusterfs.so.0
#10 0x00007fbf75362073 in gf_log_flush_message () from /lib64/libglusterfs.so.0
#11 0x00007fbf75362159 in gf_log_flush_list () from /lib64/libglusterfs.so.0
#12 0x00007fbf753623dd in gf_log_set_log_buf_size () from /lib64/libglusterfs.so.0
#13 0x00007fbf75362437 in gf_log_disable_suppression_before_exit () from /lib64/libglusterfs.so.0
#14 0x00007fbf7537b1c5 in gf_print_trace () from /lib64/libglusterfs.so.0
#15 <signal handler called>
#16 0x00007fbf73aa6dc7 in __memcpy_sse2 () from /lib64/libc.so.6
#17 0x00007fbf6699fb84 in qr_content_extract () from /usr/lib64/glusterfs/3.7.5/xlator/performance/quick-read.so
#18 0x00007fbf6699ff54 in qr_lookup_cbk () from /usr/lib64/glusterfs/3.7.5/xlator/performance/quick-read.so
#19 0x00007fbf66bab91c in ioc_lookup_cbk () from /usr/lib64/glusterfs/3.7.5/xlator/performance/io-cache.so
#20 0x00007fbf6740a1d4 in dht_discover_complete () from /usr/lib64/glusterfs/3.7.5/xlator/cluster/distribute.so
#21 0x00007fbf6740ad6a in dht_discover_cbk () from /usr/lib64/glusterfs/3.7.5/xlator/cluster/distribute.so
#22 0x00007fbf676939ee in afr_discover_cbk () from /usr/lib64/glusterfs/3.7.5/xlator/cluster/replicate.so
#23 0x00007fbf678db477 in client3_3_lookup_cbk () from /usr/lib64/glusterfs/3.7.5/xlator/protocol/client.so
#24 0x00007fbf7512db20 in rpc_clnt_handle_reply () from /lib64/libgfrpc.so.0
#25 0x00007fbf7512dddf in rpc_clnt_notify () from /lib64/libgfrpc.so.0
#26 0x00007fbf75129913 in rpc_transport_notify () from /lib64/libgfrpc.so.0
#27 0x00007fbf69dc24b6 in socket_event_poll_in () from /usr/lib64/glusterfs/3.7.5/rpc-transport/socket.so
#28 0x00007fbf69dc53a4 in socket_event_handler () from /usr/lib64/glusterfs/3.7.5/rpc-transport/socket.so
#29 0x00007fbf753c08ca in event_dispatch_epoll_worker () from /lib64/libglusterfs.so.0
#30 0x00007fbf741c7dc5 in start_thread () from /lib64/libpthread.so.0
#31 0x00007fbf73b0e21d in clone () from /lib64/libc.so.6
(gdb)

Build: glusterfs-3.7.5-16.el7rhgs.x86_64

The crash is observed on the slave client. Ran the same test suite 4 times on a non-tiered volume without hitting it, but hit it twice in 4 runs with a tiered volume.
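A note on reading this trace: frames #0-#14 are the crash handler itself (gf_print_trace) flushing buffered log messages, which allocates memory and aborts inside malloc(); the original SIGSEGV is below "<signal handler called>", again in the qr_content_extract() memcpy path (frames #16-#18), so this appears to be the same underlying crash. As a hypothetical aside (not GlusterFS code), a fatal-signal handler that avoids the heap can usually still emit a backtrace even when the allocator state has been damaged by the original overrun. A minimal sketch, assuming only the documented glibc backtrace facilities:

/*
 * Minimal, hypothetical sketch of a heap-free fatal-signal handler.
 * backtrace_symbols_fd() writes straight to a file descriptor and does
 * not call malloc(); backtrace() itself may allocate the first time it
 * runs, so a real handler would typically prime it once at startup.
 */
#include <execinfo.h>
#include <signal.h>
#include <unistd.h>

static void
crash_handler(int sig)
{
    static void *frames[64];            /* static storage, no malloc here */
    int n = backtrace(frames, 64);

    static const char msg[] = "fatal signal caught, backtrace:\n";
    (void)write(STDERR_FILENO, msg, sizeof(msg) - 1);
    backtrace_symbols_fd(frames, n, STDERR_FILENO);

    signal(sig, SIG_DFL);               /* restore default and re-raise   */
    raise(sig);                         /* so the kernel still dumps core */
}

int
main(void)
{
    void *prime[1];
    backtrace(prime, 1);                /* prime backtrace() outside the handler */
    signal(SIGSEGV, crash_handler);
    /* ... application work ... */
    return 0;
}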
Remounted the volume and performed an arequal checksum (which does lookups); it crashed again:

[root@dj slave1]# df -h
df: ‘/mnt/slave’: Transport endpoint is not connected
Filesystem                Size  Used Avail Use% Mounted on
/dev/mapper/rhel_dj-root   45G  2.7G   42G   7% /
devtmpfs                  1.9G     0  1.9G   0% /dev
tmpfs                     1.9G     0  1.9G   0% /dev/shm
tmpfs                     1.9G  8.5M  1.9G   1% /run
tmpfs                     1.9G     0  1.9G   0% /sys/fs/cgroup
/dev/mapper/rhel_dj-home   22G   33M   22G   1% /home
/dev/vda1                 497M  210M  287M  43% /boot
tmpfs                     380M     0  380M   0% /run/user/0
10.70.37.165:/master       50G  4.3G   46G   9% /mnt/glusterfs
10.70.37.99:/slave         20G  2.8G   18G  14% /mnt/slave1
[root@dj slave1]# ls /core*
/core.9952
[root@dj slave1]# cd
[root@dj ~]# cd scripts/
[root@dj scripts]# ./arequal-checksum -p /mnt/slave1
md5sum: /mnt/slave1/thread9/level02/level12/level22/level32/level42/level52/level62/level72/level82/level92/hardlink_to_files/569d427e%%NK9ZD0IAOW: Software caused connection abort
/mnt/slave1/thread9/level02/level12/level22/level32/level42/level52/level62/level72/level82/level92/hardlink_to_files/569d427e%%NK9ZD0IAOW: short read
ftw (-p) returned -1 (Success), terminating
[root@dj scripts]# ls /core.*
/core.12047  /core.9952
[root@dj scripts]# df -h /mnt/slave1
df: ‘/mnt/slave1’: Transport endpoint is not connected
[root@dj scripts]#
Verified with build: glusterfs-geo-replication-3.8.4-27.el7rhgs.x86_64

With the default performance.quick-read ON at the slave, the crashes are no longer seen when the master is a tiered volume and the slave is a distributed-replicate (DR) volume. The use case was tried at least 3 times; since it is not reproducible with the latest 3.3.0 build, this bug should be considered fixed.

[root@dhcp37-71 geo-replication-slaves]# rpm -qa | grep gluster | grep geo
glusterfs-geo-replication-3.8.4-27.el7rhgs.x86_64
[root@dhcp37-71 geo-replication-slaves]# gluster volume get slave quick-read
Option                                  Value
------                                  -----
performance.quick-read                  on
[root@dhcp37-71 geo-replication-slaves]#

[root@dhcp37-150 master]# gluster volume rebalance master tier status
Node                 Promoted files       Demoted files        Status
---------            ---------            ---------            ---------
localhost            381                  709                  in progress
10.70.37.171         340                  668                  in progress
10.70.37.105         327                  643                  in progress
10.70.37.194         413                  672                  in progress
10.70.37.42          361                  0                    in progress
10.70.37.190         371                  0                    in progress
Tiering Migration Functionality: master: success
[root@dhcp37-150 master]# gluster volume info master

Volume Name: master
Type: Tier
Volume ID: 7f7b81d8-4f1f-4f1d-9ee7-a60918a5623c
Status: Started
Snapshot Count: 0
Number of Bricks: 16
Transport-type: tcp
Hot Tier :
Hot Tier Type : Distributed-Replicate
Number of Bricks: 2 x 2 = 4
Brick1: 10.70.37.194:/rhs/brick3/t4
Brick2: 10.70.37.105:/rhs/brick3/t3
Brick3: 10.70.37.171:/rhs/brick3/t2
Brick4: 10.70.37.150:/rhs/brick3/t1
Cold Tier:
Cold Tier Type : Distributed-Disperse
Number of Bricks: 2 x (4 + 2) = 12
Brick5: 10.70.37.150:/rhs/brick1/b1
Brick6: 10.70.37.171:/rhs/brick1/b2
Brick7: 10.70.37.105:/rhs/brick1/b3
Brick8: 10.70.37.194:/rhs/brick1/b4
Brick9: 10.70.37.42:/rhs/brick1/b5
Brick10: 10.70.37.190:/rhs/brick1/b6
Brick11: 10.70.37.150:/rhs/brick2/b7
Brick12: 10.70.37.171:/rhs/brick2/b8
Brick13: 10.70.37.105:/rhs/brick2/b9
Brick14: 10.70.37.194:/rhs/brick2/b10
Brick15: 10.70.37.42:/rhs/brick2/b11
Brick16: 10.70.37.190:/rhs/brick2/b12
Options Reconfigured:
cluster.watermark-hi: 20
cluster.watermark-low: 2
changelog.changelog: on
geo-replication.ignore-pid-check: on
geo-replication.indexing: on
cluster.tier-mode: cache
features.ctr-enabled: on
transport.address-family: inet
nfs.disable: on
cluster.enable-shared-storage: enable
[root@dhcp37-150 master]# gluster volume get master quick-read
Option                                  Value
------                                  -----
performance.quick-read                  on
[root@dhcp37-150 master]#
Based on comment 20, moving this bug to verified state.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:2774