Bug 1291617
Summary: | Multiple crashes observed during "qr_lookup_cbk" and "qr_readv" on slave side of geo-replication setup | |
---|---|---|---
Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Rahul Hinduja <rhinduja>
Component: | quick-read | Assignee: | Milind Changire <mchangir>
Status: | CLOSED ERRATA | QA Contact: | storage-qa-internal <storage-qa-internal>
Severity: | urgent | Docs Contact: |
Priority: | unspecified | |
Version: | rhgs-3.1 | CC: | amukherj, asrivast, mzywusko, nbalacha, rgowdapp, rhinduja, rhs-bugs, sankarshan, sarumuga, sasundar
Target Milestone: | --- | |
Target Release: | RHGS 3.3.0 | |
Hardware: | x86_64 | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | glusterfs-3.8.4-19 | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2017-09-21 04:25:52 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 1417147 | |
Description
Rahul Hinduja 2015-12-15 10:04:05 UTC
Observed this again with build: glusterfs-3.7.5-15.el7rhgs.x86_64

```
(gdb) bt
#0  0x00007f0f548cadfb in __memcpy_sse2 () from /lib64/libc.so.6
#1  0x00007f0f435c7b84 in qr_content_extract () from /usr/lib64/glusterfs/3.7.5/xlator/performance/quick-read.so
#2  0x00007f0f435c7f54 in qr_lookup_cbk () from /usr/lib64/glusterfs/3.7.5/xlator/performance/quick-read.so
#3  0x00007f0f437d391c in ioc_lookup_cbk () from /usr/lib64/glusterfs/3.7.5/xlator/performance/io-cache.so
#4  0x00007f0f4822d364 in dht_discover_complete () from /usr/lib64/glusterfs/3.7.5/xlator/cluster/distribute.so
#5  0x00007f0f4822defa in dht_discover_cbk () from /usr/lib64/glusterfs/3.7.5/xlator/cluster/distribute.so
#6  0x00007f0f484b79ee in afr_discover_cbk () from /usr/lib64/glusterfs/3.7.5/xlator/cluster/replicate.so
#7  0x00007f0f486ff477 in client3_3_lookup_cbk () from /usr/lib64/glusterfs/3.7.5/xlator/protocol/client.so
#8  0x00007f0f55f51b20 in rpc_clnt_handle_reply () from /lib64/libgfrpc.so.0
#9  0x00007f0f55f51ddf in rpc_clnt_notify () from /lib64/libgfrpc.so.0
#10 0x00007f0f55f4d913 in rpc_transport_notify () from /lib64/libgfrpc.so.0
#11 0x00007f0f4abe64b6 in socket_event_poll_in () from /usr/lib64/glusterfs/3.7.5/rpc-transport/socket.so
#12 0x00007f0f4abe93a4 in socket_event_handler () from /usr/lib64/glusterfs/3.7.5/rpc-transport/socket.so
#13 0x00007f0f561e48ca in event_dispatch_epoll_worker () from /lib64/libglusterfs.so.0
#14 0x00007f0f54febdc5 in start_thread () from /lib64/libpthread.so.0
#15 0x00007f0f5493221d in clone () from /lib64/libc.so.6
(gdb) quit
```
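The faulting frame is the memcpy() called from qr_content_extract(), the quick-read routine that copies file content delivered inline with the lookup reply. As a hedged illustration only (this is not the actual glusterfs source or the shipped fix, and the names lookup_reply, extract_content, and stat_size below are hypothetical): crashes of this shape typically arise when the copy length and the destination buffer size come from two different fields that can fall out of sync, for example while a file is being migrated between tiers.

```c
/* Hypothetical, self-contained sketch -- NOT glusterfs code. It models
 * the general bug class behind a fault like frames #0/#1 above: the
 * destination buffer is sized from one field while the copy length
 * comes from another, and the two can disagree. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct lookup_reply {
    const char *content;      /* file data carried inline in the reply */
    size_t      content_len;  /* bytes actually present in 'content'   */
};

/* 'stat_size' stands in for a file size learned separately; if the file
 * changed in between, it may no longer match content_len. */
static char *extract_content(const struct lookup_reply *reply,
                             size_t stat_size)
{
    char *copy = malloc(stat_size);
    if (copy == NULL)
        return NULL;

    /* The buggy pattern would be:
     *     memcpy(copy, reply->content, reply->content_len);
     * which writes out of bounds whenever content_len > stat_size.
     * Clamping to the allocated size avoids that. */
    size_t n = reply->content_len < stat_size ? reply->content_len
                                              : stat_size;
    memcpy(copy, reply->content, n);
    return copy;
}

int main(void)
{
    char data[32] = "hello from the lookup reply";
    struct lookup_reply reply = { data, sizeof(data) };

    /* Deliberately mismatched size: the clamp keeps this safe. */
    char *copy = extract_content(&reply, 8);
    if (copy != NULL) {
        printf("copied %.8s...\n", copy);
        free(copy);
    }
    return 0;
}
```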
Occasionally hitting this crash on tiered volume during geo-rep automation run:

```
(gdb) bt
#0  0x00007fbf73a4d5f7 in raise () from /lib64/libc.so.6
#1  0x00007fbf73a4ece8 in abort () from /lib64/libc.so.6
#2  0x00007fbf73a8d317 in __libc_message () from /lib64/libc.so.6
#3  0x00007fbf73a93184 in malloc_printerr () from /lib64/libc.so.6
#4  0x00007fbf73a96877 in _int_malloc () from /lib64/libc.so.6
#5  0x00007fbf73a9787c in malloc () from /lib64/libc.so.6
#6  0x00007fbf75391ecb in __gf_malloc () from /lib64/libglusterfs.so.0
#7  0x00007fbf753921e3 in gf_vasprintf () from /lib64/libglusterfs.so.0
#8  0x00007fbf753922d4 in gf_asprintf () from /lib64/libglusterfs.so.0
#9  0x00007fbf75361c97 in gf_glusterlog_log_repetitions.isra.3 () from /lib64/libglusterfs.so.0
#10 0x00007fbf75362073 in gf_log_flush_message () from /lib64/libglusterfs.so.0
#11 0x00007fbf75362159 in gf_log_flush_list () from /lib64/libglusterfs.so.0
#12 0x00007fbf753623dd in gf_log_set_log_buf_size () from /lib64/libglusterfs.so.0
#13 0x00007fbf75362437 in gf_log_disable_suppression_before_exit () from /lib64/libglusterfs.so.0
#14 0x00007fbf7537b1c5 in gf_print_trace () from /lib64/libglusterfs.so.0
#15 <signal handler called>
#16 0x00007fbf73aa6dc7 in __memcpy_sse2 () from /lib64/libc.so.6
#17 0x00007fbf6699fb84 in qr_content_extract () from /usr/lib64/glusterfs/3.7.5/xlator/performance/quick-read.so
#18 0x00007fbf6699ff54 in qr_lookup_cbk () from /usr/lib64/glusterfs/3.7.5/xlator/performance/quick-read.so
#19 0x00007fbf66bab91c in ioc_lookup_cbk () from /usr/lib64/glusterfs/3.7.5/xlator/performance/io-cache.so
#20 0x00007fbf6740a1d4 in dht_discover_complete () from /usr/lib64/glusterfs/3.7.5/xlator/cluster/distribute.so
#21 0x00007fbf6740ad6a in dht_discover_cbk () from /usr/lib64/glusterfs/3.7.5/xlator/cluster/distribute.so
#22 0x00007fbf676939ee in afr_discover_cbk () from /usr/lib64/glusterfs/3.7.5/xlator/cluster/replicate.so
#23 0x00007fbf678db477 in client3_3_lookup_cbk () from /usr/lib64/glusterfs/3.7.5/xlator/protocol/client.so
#24 0x00007fbf7512db20 in rpc_clnt_handle_reply () from /lib64/libgfrpc.so.0
#25 0x00007fbf7512dddf in rpc_clnt_notify () from /lib64/libgfrpc.so.0
#26 0x00007fbf75129913 in rpc_transport_notify () from /lib64/libgfrpc.so.0
#27 0x00007fbf69dc24b6 in socket_event_poll_in () from /usr/lib64/glusterfs/3.7.5/rpc-transport/socket.so
#28 0x00007fbf69dc53a4 in socket_event_handler () from /usr/lib64/glusterfs/3.7.5/rpc-transport/socket.so
#29 0x00007fbf753c08ca in event_dispatch_epoll_worker () from /lib64/libglusterfs.so.0
#30 0x00007fbf741c7dc5 in start_thread () from /lib64/libpthread.so.0
#31 0x00007fbf73b0e21d in clone () from /lib64/libc.so.6
(gdb)
```
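Reading this second trace bottom-up: frames #16-#17 are the same memcpy() fault in qr_content_extract(), frame #15 marks the SIGSEGV handler being invoked, and frames #0-#14 show gf_print_trace() flushing buffered log messages when glibc's malloc() detected corrupted heap metadata and aborted, which suggests the out-of-bounds copy had already damaged the heap before the signal fired. The standalone program below (hypothetical, not glusterfs code) reproduces only the corruption-then-abort half of that pattern, not the initial segfault:

```c
/* Demonstration of heap corruption surfacing at the *next* malloc(),
 * mirroring frames #0-#5 of the trace above. It deliberately corrupts
 * the heap, so glibc is expected to abort() the process. */
#include <stdlib.h>
#include <string.h>

int main(void)
{
    char *buf = malloc(16);   /* small heap chunk */
    char  src[64];

    if (buf == NULL)
        return 1;
    memset(src, 'A', sizeof(src));

    /* Out-of-bounds memcpy: 64 bytes into a 16-byte allocation
     * clobbers the neighbouring chunk's allocator metadata. */
    memcpy(buf, src, sizeof(src));

    /* glibc typically reports a "malloc(): ... corrupted" error and
     * aborts here rather than at the memcpy itself -- the same way the
     * slave mount process died inside its own crash handler. */
    char *next = malloc(32);

    free(next);
    free(buf);
    return 0;
}
```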
Build: glusterfs-3.7.5-16.el7rhgs.x86_64

The crash is observed on the slave client. Tried the same test suite 4 times on a non-tiered volume and haven't seen it; with a tiered volume, hit it twice in 4 trials. Remounted the volume and performed an arequal checksum, which does lookups. It crashed again:

```
[root@dj slave1]# df -h
df: ‘/mnt/slave’: Transport endpoint is not connected
Filesystem                Size  Used Avail Use% Mounted on
/dev/mapper/rhel_dj-root   45G  2.7G   42G   7% /
devtmpfs                  1.9G     0  1.9G   0% /dev
tmpfs                     1.9G     0  1.9G   0% /dev/shm
tmpfs                     1.9G  8.5M  1.9G   1% /run
tmpfs                     1.9G     0  1.9G   0% /sys/fs/cgroup
/dev/mapper/rhel_dj-home   22G   33M   22G   1% /home
/dev/vda1                 497M  210M  287M  43% /boot
tmpfs                     380M     0  380M   0% /run/user/0
10.70.37.165:/master       50G  4.3G   46G   9% /mnt/glusterfs
10.70.37.99:/slave         20G  2.8G   18G  14% /mnt/slave1
[root@dj slave1]# ls /core*
/core.9952
[root@dj slave1]# cd
[root@dj ~]# cd scripts/
[root@dj scripts]# ./arequal-checksum -p /mnt/slave1
md5sum: /mnt/slave1/thread9/level02/level12/level22/level32/level42/level52/level62/level72/level82/level92/hardlink_to_files/569d427e%%NK9ZD0IAOW: Software caused connection abort
/mnt/slave1/thread9/level02/level12/level22/level32/level42/level52/level62/level72/level82/level92/hardlink_to_files/569d427e%%NK9ZD0IAOW: short read
ftw (-p) returned -1 (Success), terminating
[root@dj scripts]# ls /core.*
/core.12047  /core.9952
[root@dj scripts]# df -h /mnt/slave1
df: ‘/mnt/slave1’: Transport endpoint is not connected
[root@dj scripts]#
```

Verified with build: glusterfs-geo-replication-3.8.4-27.el7rhgs.x86_64

With the default performance.quick-read ON at the slave, haven't seen the crashes when the master is a tiered volume and the slave is DR. Tried the use case at least 3 times; given it is not reproducible with the latest 3.3.0, this bug should be considered fixed.

```
[root@dhcp37-71 geo-replication-slaves]# rpm -qa | grep gluster | grep geo
glusterfs-geo-replication-3.8.4-27.el7rhgs.x86_64
[root@dhcp37-71 geo-replication-slaves]# gluster volume get slave quick-read
Option                  Value
------                  -----
performance.quick-read  on
[root@dhcp37-71 geo-replication-slaves]#

[root@dhcp37-150 master]# gluster volume rebalance master tier status
Node          Promoted files  Demoted files  Status
---------     ---------       ---------      ---------
localhost     381             709            in progress
10.70.37.171  340             668            in progress
10.70.37.105  327             643            in progress
10.70.37.194  413             672            in progress
10.70.37.42   361             0              in progress
10.70.37.190  371             0              in progress
Tiering Migration Functionality: master: success
[root@dhcp37-150 master]# gluster volume info master

Volume Name: master
Type: Tier
Volume ID: 7f7b81d8-4f1f-4f1d-9ee7-a60918a5623c
Status: Started
Snapshot Count: 0
Number of Bricks: 16
Transport-type: tcp
Hot Tier :
Hot Tier Type : Distributed-Replicate
Number of Bricks: 2 x 2 = 4
Brick1: 10.70.37.194:/rhs/brick3/t4
Brick2: 10.70.37.105:/rhs/brick3/t3
Brick3: 10.70.37.171:/rhs/brick3/t2
Brick4: 10.70.37.150:/rhs/brick3/t1
Cold Tier:
Cold Tier Type : Distributed-Disperse
Number of Bricks: 2 x (4 + 2) = 12
Brick5: 10.70.37.150:/rhs/brick1/b1
Brick6: 10.70.37.171:/rhs/brick1/b2
Brick7: 10.70.37.105:/rhs/brick1/b3
Brick8: 10.70.37.194:/rhs/brick1/b4
Brick9: 10.70.37.42:/rhs/brick1/b5
Brick10: 10.70.37.190:/rhs/brick1/b6
Brick11: 10.70.37.150:/rhs/brick2/b7
Brick12: 10.70.37.171:/rhs/brick2/b8
Brick13: 10.70.37.105:/rhs/brick2/b9
Brick14: 10.70.37.194:/rhs/brick2/b10
Brick15: 10.70.37.42:/rhs/brick2/b11
Brick16: 10.70.37.190:/rhs/brick2/b12
Options Reconfigured:
cluster.watermark-hi: 20
cluster.watermark-low: 2
changelog.changelog: on
geo-replication.ignore-pid-check: on
geo-replication.indexing: on
cluster.tier-mode: cache
features.ctr-enabled: on
transport.address-family: inet
nfs.disable: on
cluster.enable-shared-storage: enable
[root@dhcp37-150 master]# gluster volume get master quick-read
Option                  Value
------                  -----
performance.quick-read  on
[root@dhcp37-150 master]#
```

Based on comment 20, moving this bug to verified state.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2774