Bug 1608507

Summary: glusterd segfault - memcpy () at /usr/include/bits/string3.h:51
Product: [Red Hat Storage] Red Hat Gluster Storage
Component: glusterd
Version: rhgs-3.3
Hardware: x86_64
OS: Linux
Severity: urgent
Priority: high
Status: CLOSED ERRATA
Reporter: John Strunk <jstrunk>
Assignee: Sanju <srakonde>
QA Contact: Bala Konda Reddy M <bmekala>
CC: amukherj, apaladug, jstrunk, kiyer, nbalacha, nravinas, rhinduja, rhs-bugs, sankarshan, sarora, sheggodu, srakonde, srangana, storage-qa-internal, vbellur, vdas
Keywords: ZStream
Target Release: RHGS 3.4.z Batch Update 3
Fixed In Version: glusterfs-3.12.2-33
Doc Type: If docs needed, set a value
Type: Bug
Bug Blocks: 1615385 (view as bug list)
Last Closed: 2019-02-04 07:41:25 UTC

Description John Strunk 2018-07-25 16:35:05 UTC
Description of problem:
Glusterd encountered a segfault.

Version-Release number of selected component (if applicable):

[ec2-user@ip-192-168-0-14 ~]$ rpm -qa | grep gluster
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
glusterfs-cli-3.8.4-54.10.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-54.10.el7rhgs.x86_64
glusterfs-debuginfo-3.8.4-54.10.el7rhgs.x86_64
glusterfs-client-xlators-3.8.4-54.10.el7rhgs.x86_64
glusterfs-api-3.8.4-54.10.el7rhgs.x86_64
python-gluster-3.8.4-54.10.el7rhgs.noarch
gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
pcp-pmda-gluster-4.1.0-0.201805281909.git68ab4b18.el7.x86_64
glusterfs-libs-3.8.4-54.10.el7rhgs.x86_64
glusterfs-fuse-3.8.4-54.10.el7rhgs.x86_64
vdsm-gluster-4.17.33-1.2.el7rhgs.noarch
libvirt-daemon-driver-storage-gluster-3.9.0-14.el7_5.5.x86_64
glusterfs-3.8.4-54.10.el7rhgs.x86_64
glusterfs-server-3.8.4-54.10.el7rhgs.x86_64
glusterfs-rdma-3.8.4-54.10.el7rhgs.x86_64


How reproducible:

I do not have a reproducer. No admin actions were in progress other than regular monitoring via the pcp gluster PMDA and a standard I/O workload.


Steps to Reproduce:
1. Unknown

Actual results:
glusterd crashed with a segmentation fault.

Expected results:
glusterd should not crash.

Additional info:


The SIGSEGV is the result of a bad length being passed to memcpy(); see the backtrace below.


BT of relevant thread(s):
Thread 3 (Thread 0x7f2dabfff700 (LWP 29173)):
#0  0x00007f2e4bb4ae9d in poll () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007f2e3f116fa9 in poll (__timeout=-1, __nfds=2, __fds=0x7f2dabffee80) at /usr/include/bits/poll2.h:46
#2  socket_poller (ctx=0x7f2e28126310) at socket.c:2563
#3  0x00007f2e4c28cdd5 in start_thread (arg=0x7f2dabfff700) at pthread_create.c:308
#4  0x00007f2e4bb55b3d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Thread 2 (Thread 0x7f2e2f7fe700 (LWP 21641)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00007f2e4d46359b in __synclock_lock (lock=lock@entry=0x7f2e4d7a0818) at syncop.c:924
#2  0x00007f2e4d466ad6 in synclock_lock (lock=lock@entry=0x7f2e4d7a0818) at syncop.c:950
#3  0x00007f2e41f1981b in glusterd_big_locked_cbk (req=0x7f2e283f59a0, iov=0x7f2e283f59e0, count=1, myframe=0x7f2e283d1fd0, fn=0x7f2e41f177b0 <__glusterd_commit_op_cbk>)
    at glusterd-rpc-ops.c:214
#4  0x00007f2e4d1f5770 in rpc_clnt_handle_reply (clnt=clnt@entry=0x55718e4f8a50, pollin=pollin@entry=0x7f2e26b77310) at rpc-clnt.c:794
#5  0x00007f2e4d1f5a57 in rpc_clnt_notify (trans=<optimized out>, mydata=0x55718e4f8a80, event=<optimized out>, data=0x7f2e26b77310) at rpc-clnt.c:987
#6  0x00007f2e4d1f1913 in rpc_transport_notify (this=this@entry=0x55718e4f9430, event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=data@entry=0x7f2e26b77310) at rpc-transport.c:538
#7  0x00007f2e3f114366 in socket_event_poll_in (this=this@entry=0x55718e4f9430, notify_handled=notify_handled@entry=_gf_false) at socket.c:2306
#8  0x00007f2e3f117042 in socket_poller (ctx=0x55718e4f9430) at socket.c:2581
#9  0x00007f2e4c28cdd5 in start_thread (arg=0x7f2e2f7fe700) at pthread_create.c:308
#10 0x00007f2e4bb55b3d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Thread 1 (Thread 0x7f2e06ffd700 (LWP 13885)):
#0  0x00007f2e4d423798 in memcpy (__len=18446744072547383994, __src=0x7f2e26b9f0b0, __dest=0x7f2e4c4506b0) at /usr/include/bits/string3.h:51
#1  data_to_int32_ptr (data=0x7f2e26b9f000, val=val@entry=0x7f2e06ffc21c) at dict.c:1518
#2  0x00007f2e4d4265fd in dict_get_int32 (this=this@entry=0x7f2e19efbfc0, key=key@entry=0x7f2e41fd5a2c "count", val=val@entry=0x7f2e06ffc21c) at dict.c:1776
#3  0x00007f2e41f04681 in glusterd_profile_volume_use_rsp_dict (aggr=0x7f2e19efbfc0, rsp_dict=0x7f2da64d1550) at glusterd-utils.c:9544
#4  0x00007f2e41f17c2e in __glusterd_commit_op_cbk (req=req@entry=0x7f2e283f6690, iov=iov@entry=0x7f2e283f66d0, count=count@entry=1, myframe=myframe@entry=0x7f2e283d2140)
    at glusterd-rpc-ops.c:1430
#5  0x00007f2e41f1982a in glusterd_big_locked_cbk (req=0x7f2e283f6690, iov=0x7f2e283f66d0, count=1, myframe=0x7f2e283d2140, fn=0x7f2e41f177b0 <__glusterd_commit_op_cbk>)
    at glusterd-rpc-ops.c:215
#6  0x00007f2e4d1f5770 in rpc_clnt_handle_reply (clnt=clnt@entry=0x55718e501280, pollin=pollin@entry=0x7f2da64e85a0) at rpc-clnt.c:794
#7  0x00007f2e4d1f5a57 in rpc_clnt_notify (trans=<optimized out>, mydata=0x55718e5012b0, event=<optimized out>, data=0x7f2da64e85a0) at rpc-clnt.c:987
#8  0x00007f2e4d1f1913 in rpc_transport_notify (this=this@entry=0x55718e501480, event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=data@entry=0x7f2da64e85a0) at rpc-transport.c:538
#9  0x00007f2e3f114366 in socket_event_poll_in (this=this@entry=0x55718e501480, notify_handled=notify_handled@entry=_gf_false) at socket.c:2306
#10 0x00007f2e3f117042 in socket_poller (ctx=0x55718e501480) at socket.c:2581
#11 0x00007f2e4c28cdd5 in start_thread (arg=0x7f2e06ffd700) at pthread_create.c:308
#12 0x00007f2e4bb55b3d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
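
Note on the crash frame: the reported length 18446744072547383994 is 0xFFFFFFFFBABABABA, i.e. the 32-bit pattern 0xBABABABA (-1162167622) sign-extended to size_t. In other words, the length field of the dict value (data_t) being converted by data_to_int32_ptr() appears to have held garbage rather than the length of a short numeric string, which points at a corrupted or overwritten dict entry rather than a genuinely huge value. Below is a minimal standalone sketch of that failure mode; it is not the actual RHGS dict.c source, data_t is simplified, and data_to_int32_sketch() is a hypothetical stand-in for the dict.c conversion helper.

#include <alloca.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for the dict.c data_t: an int32 length plus a
 * pointer to the serialized value (e.g. the string "2" for count=2). */
typedef struct {
    int32_t  len;
    char    *data;
} data_t;

/* Sketch of the integer-conversion path: copy data->len bytes into a
 * stack buffer and parse it.  If len holds garbage such as 0xBABABABA,
 * the implicit int32 -> size_t conversion sign-extends it and memcpy()
 * is asked to copy an absurd number of bytes, which is the SIGSEGV in
 * frame #0.  (Illustration only; never called below with corrupt input.) */
static int
data_to_int32_sketch(data_t *data, int32_t *val)
{
    char *str = alloca((size_t)data->len + 1);
    memcpy(str, data->data, data->len);   /* crash site when len is garbage */
    str[data->len] = '\0';
    *val = (int32_t)strtol(str, NULL, 0);
    return 0;
}

int
main(void)
{
    /* Reproduce the arithmetic seen in the backtrace. */
    int32_t corrupt_len = (int32_t)0xBABABABA;      /* -1162167622 */
    printf("%zu\n", (size_t)corrupt_len);           /* prints 18446744072547383994 */

    /* Sane case: a well-formed entry converts normally. */
    data_t ok = { .len = 2, .data = "2" };
    int32_t v = 0;
    data_to_int32_sketch(&ok, &v);
    printf("count = %d\n", v);
    return 0;
}

Given that thread 2 is blocked in glusterd_big_locked_cbk() for another __glusterd_commit_op_cbk at the same time, one plausible reading is that the aggregate dict is being read while another transaction is modifying or freeing it, but that is an inference from the backtrace, not something confirmed from the core.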


Snip of glusterd log:
...
[2018-07-25 14:46:39.319143] I [MSGID: 106494] [glusterd-handler.c:3087:__glusterd_handle_cli_profile_volume] 0-management: Received volume profile req for volume supervol01
The message "I [MSGID: 106488] [glusterd-handler.c:1540:__glusterd_handle_cli_get_volume] 0-management: Received get vol req" repeated 7 times between [2018-07-25 14:45:39.062451] and [2018-07-25 14:46:39.092824]
[2018-07-25 14:46:39.202692] I [MSGID: 106494] [glusterd-handler.c:3087:__glusterd_handle_cli_profile_volume] 0-management: Received volume profile req for volume gluster_shared_storage
[2018-07-25 14:46:39.228740] E [MSGID: 106524] [glusterd-op-sm.c:1987:glusterd_op_stage_stats_volume] 0-glusterd: Profile on Volume gluster_shared_storage is not started
[2018-07-25 14:46:40.320097] I [MSGID: 106494] [glusterd-handler.c:3087:__glusterd_handle_cli_profile_volume] 0-management: Received volume profile req for volume supervole2a03
[2018-07-25 14:47:14.369772] E [socket.c:2631:socket_poller] 0-socket.management: socket_poller 172.31.71.222:1023 failed (No data available)
[2018-07-25 14:47:39.093345] I [MSGID: 106488] [glusterd-handler.c:1540:__glusterd_handle_cli_get_volume] 0-management: Received get vol req
[2018-07-25 14:47:39.205960] I [MSGID: 106494] [glusterd-handler.c:3087:__glusterd_handle_cli_profile_volume] 0-management: Received volume profile req for volume gluster_shared_storage
[2018-07-25 14:47:39.243975] E [MSGID: 106524] [glusterd-op-sm.c:1987:glusterd_op_stage_stats_volume] 0-glusterd: Profile on Volume gluster_shared_storage is not started
[2018-07-25 14:47:39.244014] E [MSGID: 106301] [glusterd-op-sm.c:4553:glusterd_op_ac_send_stage_op] 0-management: Staging of operation 'Volume Profile' failed on localhost : Profile on Volume gluster_shared_storage is not started
[2018-07-25 14:47:39.369274] I [MSGID: 106494] [glusterd-handler.c:3087:__glusterd_handle_cli_profile_volume] 0-management: Received volume profile req for volume supervol01
[2018-07-25 14:47:39.369370] W [glusterd-locks.c:622:glusterd_mgmt_v3_lock] (-->/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0x24a7b) [0x7f2e41ecba7b] -->/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0x242d2) [0x7f2e41ecb2d2] -->/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0xdda56) [0x7f2e41f84a56] ) 0-management: Lock for supervol01 held by 8bbb0dba-d052-4321-b460-030465d89d70
[2018-07-25 14:47:39.369383] E [MSGID: 106119] [glusterd-handler.c:794:glusterd_op_txn_begin] 0-management: Unable to acquire lock for supervol01
[2018-07-25 14:47:39.482267] I [MSGID: 106494] [glusterd-handler.c:3087:__glusterd_handle_cli_profile_volume] 0-management: Received volume profile req for volume supervole2a03
[2018-07-25 14:48:39.221960] E [MSGID: 106301] [glusterd-op-sm.c:4553:glusterd_op_ac_send_stage_op] 0-management: Staging of operation 'Volume Profile' failed on localhost : Profile on Volume gluster_shared_storage is not started
[2018-07-25 14:48:39.350491] I [MSGID: 106494] [glusterd-handler.c:3087:__glusterd_handle_cli_profile_volume] 0-management: Received volume profile req for volume supervol01
[2018-07-25 14:48:39.350578] W [glusterd-locks.c:622:glusterd_mgmt_v3_lock] (-->/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0x24a7b) [0x7f2e41ecba7b] -->/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0x242d2) [0x7f2e41ecb2d2] -->/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0xdda56) [0x7f2e41f84a56] ) 0-management: Lock for supervol01 held by 8bbb0dba-d052-4321-b460-030465d89d70
[2018-07-25 14:48:39.350597] E [MSGID: 106119] [glusterd-handler.c:794:glusterd_op_txn_begin] 0-management: Unable to acquire lock for supervol01
The message "I [MSGID: 106488] [glusterd-handler.c:1540:__glusterd_handle_cli_get_volume] 0-management: Received get vol req" repeated 7 times between [2018-07-25 14:47:39.093345] and [2018-07-25 14:48:39.075899]
[2018-07-25 14:48:39.185182] I [MSGID: 106494] [glusterd-handler.c:3087:__glusterd_handle_cli_profile_volume] 0-management: Received volume profile req for volume gluster_shared_storage
[2018-07-25 14:48:39.221954] E [MSGID: 106524] [glusterd-op-sm.c:1987:glusterd_op_stage_stats_volume] 0-glusterd: Profile on Volume gluster_shared_storage is not started
[2018-07-25 14:48:39.464297] I [MSGID: 106494] [glusterd-handler.c:3087:__glusterd_handle_cli_profile_volume] 0-management: Received volume profile req for volume supervole2a03
[2018-07-25 14:48:47.279277] E [rpc-clnt.c:200:call_bail] 0-management: bailing out frame type(brick operations) op(--(2)) xid = 0xe3a2 sent = 2018-07-25 14:38:39.424347. timeout = 600 for /var/run/gluster/ac7892a429909ee0963384af107bfd9e.socket
[2018-07-25 14:48:47.284340] I [socket.c:3659:socket_submit_reply] 0-socket.management: not connected (priv->connected = -1)
[2018-07-25 14:48:47.284349] E [rpcsvc.c:1333:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0x1, Program: GlusterD svc cli, ProgVers: 2, Proc: 23) to rpc-transport (socket.management)
[2018-07-25 14:48:47.284381] E [MSGID: 106430] [glusterd-utils.c:539:glusterd_submit_reply] 0-glusterd: Reply submission failed
[2018-07-25 14:49:39.094685] I [MSGID: 106488] [glusterd-handler.c:1540:__glusterd_handle_cli_get_volume] 0-management: Received get vol req
[2018-07-25 14:49:39.207428] I [MSGID: 106494] [glusterd-handler.c:3087:__glusterd_handle_cli_profile_volume] 0-management: Received volume profile req for volume gluster_shared_storage
[2018-07-25 14:49:39.245229] E [MSGID: 106524] [glusterd-op-sm.c:1987:glusterd_op_stage_stats_volume] 0-glusterd: Profile on Volume gluster_shared_storage is not started
[2018-07-25 14:49:39.245269] E [MSGID: 106301] [glusterd-op-sm.c:4553:glusterd_op_ac_send_stage_op] 0-management: Staging of operation 'Volume Profile' failed on localhost : Profile on Volume gluster_shared_storage is not started
[2018-07-25 14:49:39.369411] I [MSGID: 106494] [glusterd-handler.c:3087:__glusterd_handle_cli_profile_volume] 0-management: Received volume profile req for volume supervol01
[2018-07-25 14:49:39.404238] E [MSGID: 106275] [glusterd-rpc-ops.c:878:glusterd_mgmt_v3_lock_peers_cbk_fn] 0-management: Received mgmt_v3 lock RJT from uuid: d5c6e41c-01ec-4879-8b78-f8a8d2573d7c
[2018-07-25 14:49:39.404692] E [MSGID: 106275] [glusterd-rpc-ops.c:878:glusterd_mgmt_v3_lock_peers_cbk_fn] 0-management: Received mgmt_v3 lock RJT from uuid: 7e853480-f053-46f2-9534-1f59a330f4d4
[2018-07-25 14:49:39.536180] I [MSGID: 106494] [glusterd-handler.c:3087:__glusterd_handle_cli_profile_volume] 0-management: Received volume profile req for volume supervole2a03
[2018-07-25 14:50:39.243598] E [MSGID: 106524] [glusterd-op-sm.c:1987:glusterd_op_stage_stats_volume] 0-glusterd: Profile on Volume gluster_shared_storage is not started
[2018-07-25 14:50:39.243603] E [MSGID: 106301] [glusterd-op-sm.c:4553:glusterd_op_ac_send_stage_op] 0-management: Staging of operation 'Volume Profile' failed on localhost : Profile on Volume gluster_shared_storage is not started
[2018-07-25 14:50:39.372516] I [MSGID: 106494] [glusterd-handler.c:3087:__glusterd_handle_cli_profile_volume] 0-management: Received volume profile req for volume supervol01

Comment 18 Anand Paladugu 2018-08-20 13:29:58 UTC
I am OK with deferring if we have not been able to reproduce the issue. Would John be able to reproduce it?

Comment 19 Anand Paladugu 2018-08-20 14:29:55 UTC
Based on the comment that we have not been able to reproduce this after trying the same procedure as John, I am OK with deferring it.

Comment 20 Atin Mukherjee 2018-08-20 14:38:56 UTC
A QE assessment is required here.

Comment 23 John Strunk 2018-09-10 23:25:28 UTC
I have now hit this in RHGS 3.4.0.

Comment 24 Atin Mukherjee 2018-09-11 04:57:51 UTC
(In reply to John Strunk from comment #23)
> I have now hit this in RHGS 3.4.0.

John - in the email exchange you highlighted that the core wasn't available. I'm just trying to understand whether you have observed any pattern that would let us conclude we hit the same backtrace.

Comment 25 John Strunk 2018-09-11 16:58:42 UTC
I was able to fix the config problem and get cores after my initial e-mail. Backtrace:

Core was generated by `/usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007f185e867128 in memcpy (__len=18446744072547383994, __src=0x7f18400760d0, __dest=0x7f18941ea6b0) at /usr/include/bits/string3.h:51
51	/usr/include/bits/string3.h: No such file or directory.
Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.6-13.el7.x86_64 elfutils-libelf-0.170-4.el7.x86_64 elfutils-libs-0.170-4.el7.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-19.el7.x86_64 libattr-2.4.46-13.el7.x86_64 libcap-2.22-9.el7.x86_64 libcom_err-1.42.9-12.el7_5.x86_64 libselinux-2.5-12.el7.x86_64 libsepol-2.5-8.1.el7.x86_64 pcre-8.32-17.el7.x86_64 systemd-libs-219-57.el7_5.1.x86_64 xz-libs-5.2.2-1.el7.x86_64
(gdb) bt
#0  0x00007f185e867128 in memcpy (__len=18446744072547383994, __src=0x7f18400760d0, __dest=0x7f18941ea6b0) at /usr/include/bits/string3.h:51
#1  data_to_int32_ptr (data=0x7f1840076020, val=val@entry=0x7f184ed9621c) at dict.c:1555
#2  0x00007f185e86a78d in dict_get_int32 (this=this@entry=0x7f181001f160, key=key@entry=0x7f1853412f5b "count", val=val@entry=0x7f184ed9621c) at dict.c:1813
#3  0x00007f1853335191 in glusterd_profile_volume_use_rsp_dict (aggr=0x7f181001f160, rsp_dict=0x7f184802fff0) at glusterd-utils.c:9950
#4  0x00007f185334a4ee in __glusterd_commit_op_cbk (req=req@entry=0x7f18380b3bf0, iov=iov@entry=0x7f18380b3c30, count=count@entry=1, myframe=myframe@entry=0x7f183809d5d0)
    at glusterd-rpc-ops.c:1436
#5  0x00007f185334c0ea in glusterd_big_locked_cbk (req=0x7f18380b3bf0, iov=0x7f18380b3c30, count=1, myframe=0x7f183809d5d0, fn=0x7f185334a070 <__glusterd_commit_op_cbk>)
    at glusterd-rpc-ops.c:223
#6  0x00007f185e638960 in rpc_clnt_handle_reply (clnt=clnt@entry=0x561848c16c50, pollin=pollin@entry=0x7f184801a820) at rpc-clnt.c:778
#7  0x00007f185e638d03 in rpc_clnt_notify (trans=<optimized out>, mydata=0x561848c16c80, event=<optimized out>, data=0x7f184801a820) at rpc-clnt.c:971
#8  0x00007f185e634a73 in rpc_transport_notify (this=this@entry=0x561848c17990, event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=data@entry=0x7f184801a820) at rpc-transport.c:538
#9  0x00007f1850542566 in socket_event_poll_in (this=this@entry=0x561848c17990, notify_handled=notify_handled@entry=_gf_false) at socket.c:2315
#10 0x00007f1850545242 in socket_poller (ctx=0x561848c17990) at socket.c:2590
#11 0x00007f185d6cfdd5 in start_thread (arg=0x7f184ed97700) at pthread_create.c:308
#12 0x00007f185cf98b3d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
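
For reference, with the matching glusterfs debuginfo installed, the corrupt dict entry can be inspected directly in the core from frame #1. An illustrative gdb session (commands only, not output captured from this core):

(gdb) frame 1
(gdb) print *data
(gdb) print/x data->len
(gdb) x/16xb data->data

print/x data->len should show the same bad value as __len in frame #0 once sign-extended to 64 bits, and the byte dump shows what is sitting where the serialized "count" string is expected to be.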


This appears to be caused by an interaction with other processes on the machine. Last night, glusterd was repeatedly hitting this, crashing every 5-10 minutes. I was able to stabilize glusterd by rebooting the server.


# rpm -qa | grep gluster
glusterfs-fuse-3.12.2-18.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
glusterfs-client-xlators-3.12.2-18.el7rhgs.x86_64
glusterfs-rdma-3.12.2-18.el7rhgs.x86_64
python2-gluster-3.12.2-18.el7rhgs.x86_64
glusterfs-cli-3.12.2-18.el7rhgs.x86_64
glusterfs-api-3.12.2-18.el7rhgs.x86_64
glusterfs-server-3.12.2-18.el7rhgs.x86_64
libvirt-daemon-driver-storage-gluster-3.9.0-14.el7_5.7.x86_64
pcp-pmda-gluster-4.1.0-0.201805281909.git68ab4b18.el7.x86_64
gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
glusterfs-3.12.2-18.el7rhgs.x86_64
vdsm-gluster-4.19.43-2.3.el7rhgs.noarch
glusterfs-libs-3.12.2-18.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-18.el7rhgs.x86_64
glusterfs-debuginfo-3.12.2-18.el7rhgs.x86_64

Comment 30 Sanju 2018-12-04 11:13:29 UTC
upstream patch: https://review.gluster.org/#/c/glusterfs/+/21736/

Comment 55 Sanju 2019-01-08 06:14:05 UTC
@Bala: Here, we need to ensure that profile commands work as expected while doing an in-service upgrade.

@Nithya, please add if you want anything else to be tested.
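
For illustration, the kind of check meant here is roughly the following (a sketch, not the formal test plan; VOLNAME is a placeholder), run repeatedly while the cluster is at mixed versions during the in-service upgrade:

# gluster volume profile VOLNAME start
# gluster volume profile VOLNAME info
# gluster volume profile VOLNAME stop

Each command should succeed when issued from both upgraded and not-yet-upgraded nodes, and glusterd should stay up on every peer throughout.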

Comment 56 Nithya Balachandran 2019-01-08 10:07:40 UTC
(In reply to Sanju from comment #55)
> @Bala: Here, we need to ensure that profile commands work as expected while
> doing an in-service upgrade.
> 
> @Nithya, please add if you want anything else to be tested.

There is nothing else that I can think of.

Comment 59 errata-xmlrpc 2019-02-04 07:41:25 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0263