Bug 1639632 - glustershd coredump generated
Summary: glustershd coredump generated
Status: CLOSED UPSTREAM
Product: GlusterFS
Classification: Community
Component: selfheal
Version: mainline
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Assignee: Mohammed Rafi KC
 
Reported: 2018-10-16 09:14 UTC by zhou lin
Modified: 2020-03-12 12:39 UTC (History)

Last Closed: 2020-03-12 12:39:11 UTC


Attachments
coredump file of glustershd process (4.00 MB, application/octet-stream)
2018-10-16 09:14 UTC, zhou lin
attached is sn log (1.54 MB, application/zip)
2018-10-17 06:13 UTC, zhou lin

Description zhou lin 2018-10-16 09:14:20 UTC
Created attachment 1494315 [details]
coredump file of glustershd process

Description of problem:

glustershd sometimes crashes and generates a coredump.
Version-Release number of selected component (if applicable):


How reproducible:

Create a split-brain while glustershd is running; glustershd occasionally crashes and generates a coredump.
Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/glusterfs -s sn-0.local --volfile-id gluster/glustershd -p /var/run/g'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f1b5e5d7d24 in client3_3_lookup_cbk (req=0x7f1b44002300, iov=0x7f1b44002340, count=1, myframe=0x7f1b4401c850) at client-rpc-fops.c:2802
2802	client-rpc-fops.c: No such file or directory.
[Current thread is 1 (Thread 0x7f1b5f00c700 (LWP 1818))]
Missing separate debuginfos, use: dnf debuginfo-install rcp-pack-glusterfs-1.2.0_1_g54e6196-RCP2.wf29.x86_64
(gdb) bt
#0  0x00007f1b5e5d7d24 in client3_3_lookup_cbk (req=0x7f1b44002300, iov=0x7f1b44002340, count=1, myframe=0x7f1b4401c850) at client-rpc-fops.c:2802
#1  0x00007f1b64553d47 in rpc_clnt_handle_reply (clnt=0x7f1b5808bbb0, pollin=0x7f1b580c6620) at rpc-clnt.c:778
#2  0x00007f1b645542e5 in rpc_clnt_notify (trans=0x7f1b5808bde0, mydata=0x7f1b5808bbe0, event=RPC_TRANSPORT_MSG_RECEIVED, data=0x7f1b580c6620) at rpc-clnt.c:971
#3  0x00007f1b64550319 in rpc_transport_notify (this=0x7f1b5808bde0, event=RPC_TRANSPORT_MSG_RECEIVED, data=0x7f1b580c6620) at rpc-transport.c:538
#4  0x00007f1b5f49734d in socket_event_poll_in (this=0x7f1b5808bde0, notify_handled=_gf_true) at socket.c:2315
#5  0x00007f1b5f497992 in socket_event_handler (fd=25, idx=15, gen=7, data=0x7f1b5808bde0, poll_in=1, poll_out=0, poll_err=0) at socket.c:2471
#6  0x00007f1b647fe5ac in event_dispatch_epoll_handler (event_pool=0x230cb00, event=0x7f1b5f00be84) at event-epoll.c:583
#7  0x00007f1b647fe883 in event_dispatch_epoll_worker (data=0x23543d0) at event-epoll.c:659
#8  0x00007f1b6354a5da in start_thread () from /lib64/libpthread.so.0
#9  0x00007f1b62e20cbf in clone () from /lib64/libc.so.6


(gdb) print *(call_frame_t*)myframe
$1 = {root = 0x100000000, parent = 0x100000005, frames = {next = 0x7f1b4401c8a8, prev = 0x7f1b44010190}, local = 0x0, this = 0x0, ret = 0x0, ref_count = 0, lock = {spinlock = 0, mutex = {__data = {
        __lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x7f1b44010190, __next = 0x0}}, 
      __size = '\000' <repeats 24 times>, "\220\001\001D\033\177\000\000\000\000\000\000\000\000\000", __align = 0}}, cookie = 0x7f1b4401ccf0, complete = _gf_false, op = GF_FOP_NULL, begin = {
    tv_sec = 139755081730912, tv_usec = 139755081785872}, end = {tv_sec = 448811404, tv_usec = 21474836481}, wind_from = 0x0, wind_to = 0x0, unwind_from = 0x0, unwind_to = 0x0}
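
A quick sanity check on the frame dump above, as a minimal sketch. The helper name `looks_like_heap_pointer` and the "same high bits as `myframe`" heuristic are assumptions made for illustration; they are not part of gdb or GlusterFS. The script converts the `begin` timestamps to hex and compares them with the `myframe` address from the backtrace. A `tv_usec` should stay below 1,000,000, so values that instead fall in the same address range as `myframe` would be consistent with the frame memory having been freed or overwritten, although this alone does not prove it.

```python
# Values copied verbatim from the gdb output in this report.
MYFRAME = 0x7F1B4401C850          # myframe argument of client3_3_lookup_cbk
BEGIN_TV_SEC = 139755081730912    # begin.tv_sec from the frame dump
BEGIN_TV_USEC = 139755081785872   # begin.tv_usec from the frame dump

USEC_LIMIT = 1_000_000            # a valid tv_usec is always below this


def looks_like_heap_pointer(value, reference=MYFRAME, shift=24):
    """Hypothetical heuristic: True if value shares its high bits with reference."""
    return (value >> shift) == (reference >> shift)


def describe(name, value):
    """Format one field as hex together with the heuristic's verdict."""
    return f"{name} = {value:#x} pointer-like={looks_like_heap_pointer(value)}"


if __name__ == "__main__":
    print(describe("begin.tv_sec", BEGIN_TV_SEC))
    print(describe("begin.tv_usec", BEGIN_TV_USEC))
    print("tv_usec in valid range:", BEGIN_TV_USEC < USEC_LIMIT)
```

The same check can be applied to any other field of the dump; `root = 0x100000000` and `parent = 0x100000005`, for example, do not share the high bits of `myframe`.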


Time when the glustershd coredump was generated: Oct 12 13:33:35.233839

The glustershd log contains no entries from the time of the crash, possibly because the process died abruptly; logging stops several seconds before the coredump.


[2018-09-26 13:04:35.788472] E [MSGID: 108008] [afr-self-heal-common.c:336:afr_gfid_split_brain_source] 0-log-replicate-0: Gfid mismatch detected for <gfid:00000000-0000-0000-0000-000000000001>/tmp3.log>, c7c6e434-ea21-4e5d-bf38-aef0cef586d4 on log-client-1 and 4b46e66b-728f-4419-9852-46f233a1327e on log-client-0.
[2018-09-26 13:04:35.788490] E [MSGID: 108008] [afr-self-heal-entry.c:260:afr_selfheal_detect_gfid_and_type_mismatch] 0-log-replicate-0: Skipping conservative merge on the file.
[2018-09-26 13:04:35.798852] E [MSGID: 108008] [afr-self-heal-common.c:213:afr_gfid_split_brain_source] 0-log-replicate-0: All the bricks should be up to resolve the gfid split barin
[2018-09-26 13:04:35.798884] E [MSGID: 108008] [afr-self-heal-common.c:336:afr_gfid_split_brain_source] 0-log-replicate-0: Gfid mismatch detected for <gfid:00000000-0000-0000-0000-000000000001>/tmpdir2\test>, f9ce3cd5-3d2c-48fc-bdbe-1e478e7a6169 on log-client-1 and 0756665f-3481-4558-bc92-00e1d21d94a5 on log-client-0.
[2018-09-26 13:04:35.798902] E [MSGID: 108008] [afr-self-heal-entry.c:260:afr_selfheal_detect_gfid_and_type_mismatch] 0-log-replicate-0: Skipping conservative merge on the file.
[2018-09-26 13:04:35.812233] I [rpc-clnt.c:1986:rpc_clnt_reconfig] 0-ccs-client-2: changing port to 49152 (from 0)
[2018-09-26 13:04:35.816120] I [MSGID: 114057] [client-handshake.c:1478:select_server_supported_programs] 0-ccs-client-2: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2018-09-26 13:04:35.818343] I [MSGID: 114046] [client-handshake.c:1231:client_setvolume_cbk] 0-ccs-client-2: Connected to ccs-client-2, attached to remote volume '/mnt/bricks/ccs/brick'.
[2018-09-26 13:04:35.818374] I [MSGID: 114047] [client-handshake.c:1242:client_setvolume_cbk] 0-ccs-client-2: Server and Client lk-version numbers are not same, reopening the fds
[2018-09-26 13:04:35.818712] I [MSGID: 114035] [client-handshake.c:202:client_set_lk_version_cbk] 0-ccs-client-2: Server lk version = 1
[2018-09-26 13:04:35.823312] E [MSGID: 108008] [afr-self-heal-common.c:213:afr_gfid_split_brain_source] 0-log-replicate-0: All the bricks should be up to resolve the gfid split barin
[2018-09-26 13:04:35.823371] E [MSGID: 108008] [afr-self-heal-common.c:336:afr_gfid_split_brain_source] 0-log-replicate-0: Gfid mismatch detected for <gfid:00000000-0000-0000-0000-000000000001>/tmp9_soft2.log>, e0c47659-8b6a-4aee-a91f-489865c5d51d on log-client-1 and f3f69269-3995-44c9-9922-96cfadf7fed1 on log-client-0.
[2018-09-26 13:04:35.823389] E [MSGID: 108008] [afr-self-heal-entry.c:260:afr_selfheal_detect_gfid_and_type_mismatch] 0-log-replicate-0: Skipping conservative merge on the file.
[2018-09-26 13:04:35.825338] I [rpc-clnt.c:1986:rpc_clnt_reconfig] 0-export-client-2: changing port to 49153 (from 0)
[2018-09-26 13:04:35.828874] I [MSGID: 114057] [client-handshake.c:1478:select_server_supported_programs] 0-export-client-2: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2018-09-26 13:04:35.829371] I [MSGID: 114046] [client-handshake.c:1231:client_setvolume_cbk] 0-export-client-2: Connected to export-client-2, attached to remote volume '/mnt/bricks/export/brick'.
[2018-09-26 13:04:35.829390] I [MSGID: 114047] [client-handshake.c:1242:client_setvolume_cbk] 0-export-client-2: Server and Client lk-version numbers are not same, reopening the fds
[2018-09-26 13:04:35.829587] I [MSGID: 114035] [client-handshake.c:202:client_set_lk_version_cbk] 0-export-client-2: Server lk version = 1
[2018-09-26 13:04:35.855548] I [rpc-clnt.c:1986:rpc_clnt_reconfig] 0-log-client-2: changing port to 49154 (from 0)
[2018-09-26 13:04:35.860969] I [MSGID: 114057] [client-handshake.c:1478:select_server_supported_programs] 0-log-client-2: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2018-09-26 13:04:35.863599] I [MSGID: 114046] [client-handshake.c:1231:client_setvolume_cbk] 0-log-client-2: Connected to log-client-2, attached to remote volume '/mnt/bricks/log/brick'.
[2018-09-26 13:04:35.863620] I [MSGID: 114047] [client-handshake.c:1242:client_setvolume_cbk] 0-log-client-2: Server and Client lk-version numbers are not same, reopening the fds
[2018-09-26 13:04:35.864266] I [MSGID: 114035] [client-handshake.c:202:client_set_lk_version_cbk] 0-log-client-2: Server lk version = 1
[2018-09-26 13:04:35.871037] I [rpc-clnt.c:1986:rpc_clnt_reconfig] 0-mstate-client-2: changing port to 49155 (from 0)
[2018-09-26 13:04:35.879356] E [MSGID: 108008] [afr-self-heal-common.c:213:afr_gfid_split_brain_source] 0-mstate-replicate-0: All the bricks should be up to resolve the gfid split barin
[2018-09-26 13:04:35.879395] E [MSGID: 108008] [afr-self-heal-common.c:336:afr_gfid_split_brain_source] 0-mstate-replicate-0: Gfid mismatch detected for <gfid:00000000-0000-0000-0000-000000000001>/tmpdir4>, b0aa432d-38a0-426d-98b9-aa4304176d87 on mstate-client-1 and 54a3fb44-34e4-4d9e-b36d-7aaf4fd5f9bf on mstate-client-0.
[2018-09-26 13:04:35.879410] E [MSGID: 108008] [afr-self-heal-entry.c:260:afr_selfheal_detect_gfid_and_type_mismatch] 0-mstate-replicate-0: Skipping conservative merge on the file.
[2018-09-26 13:04:35.881894] I [MSGID: 114057] [client-handshake.c:1478:select_server_supported_programs] 0-mstate-client-2: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2018-09-26 13:04:35.882558] I [rpc-clnt.c:1986:rpc_clnt_reconfig] 0-services-client-2: changing port to 49156 (from 0)
[2018-09-26 13:04:35.888949] I [MSGID: 114057] [client-handshake.c:1478:select_server_supported_programs] 0-services-client-2: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2018-09-26 13:04:35.891470] I [MSGID: 114046] [client-handshake.c:1231:client_setvolume_cbk] 0-services-client-2: Connected to services-client-2, attached to remote volume '/mnt/bricks/services/brick'.
[2018-09-26 13:04:35.891577] I [MSGID: 114047] [client-handshake.c:1242:client_setvolume_cbk] 0-services-client-2: Server and Client lk-version numbers are not same, reopening the fds
[2018-09-26 13:04:35.892489] E [MSGID: 108008] [afr-self-heal-common.c:213:afr_gfid_split_brain_source] 0-mstate-replicate-0: All the bricks should be up to resolve the gfid split barin
[2018-09-26 13:04:35.892520] E [MSGID: 108008] [afr-self-heal-common.c:336:afr_gfid_split_brain_source] 0-mstate-replicate-0: Gfid mismatch detected for <gfid:00000000-0000-0000-0000-000000000001>/tmp3.log>, c45aca32-d5e0-42ca-9a49-413d34df5be3 on mstate-client-1 and 763bf6d3-fcc5-4ede-b214-135c82dbe388 on mstate-client-0.
[2018-09-26 13:04:35.892536] E [MSGID: 108008] [afr-self-heal-entry.c:260:afr_selfheal_detect_gfid_and_type_mismatch] 0-mstate-replicate-0: Skipping conservative merge on the file.
[2018-09-26 13:04:35.892661] I [MSGID: 114035] [client-handshake.c:202:client_set_lk_version_cbk] 0-services-client-2: Server lk version = 1
[2018-09-26 13:04:35.902781] E [MSGID: 108008] [afr-self-heal-common.c:213:afr_gfid_split_brain_source] 0-mstate-replicate-0: All the bricks should be up to resolve the gfid split barin
[2018-09-26 13:04:35.903188] E [MSGID: 108008] [afr-self-heal-common.c:336:afr_gfid_split_brain_source] 0-mstate-replicate-0: Gfid mismatch detected for <gfid:00000000-0000-0000-0000-000000000001>/tmpdir2>, 8047eda2-e006-4720-b230-2dd197fa83da on mstate-client-1 and ba7636e9-01d9-44ba-85ac-708c7b588c27 on mstate-client-0.
[2018-09-26 13:04:35.903213] E [MSGID: 108008] [afr-self-heal-entry.c:260:afr_selfheal_detect_gfid_and_type_mismatch] 0-mstate-replicate-0: Skipping conservative merge on the file.
[2018-09-26 13:04:35.915219] E [MSGID: 108008] [afr-self-heal-common.c:213:afr_gfid_split_brain_source] 0-mstate-replicate-0: All the bricks should be up to resolve the gfid split barin
[2018-09-26 13:04:35.915253] E [MSGID: 108008] [afr-self-heal-common.c:336:afr_gfid_split_brain_source] 0-mstate-replicate-0: Gfid mismatch detected for <gfid:00000000-0000-0000-0000-000000000001>/tmp9_soft2.log>, 7e5dc038-0ae6-4ee1-b052-9f492d061071 on mstate-client-1 and 98cc1652-93f6-4c1f-9a04-c8b4daba01c9 on mstate-client-0.
[2018-09-26 13:04:35.915269] E [MSGID: 108008] [afr-self-heal-entry.c:260:afr_selfheal_detect_gfid_and_type_mismatch] 0-mstate-replicate-0: Skipping conservative merge on the file.
[2018-09-26 13:04:35.917248] I [MSGID: 114046] [client-handshake.c:1231:client_setvolume_cbk] 0-mstate-client-2: Connected to mstate-client-2, attached to remote volume '/mnt/bricks/mstate/brick'.
[2018-09-26 13:04:35.922713] I [MSGID: 114047] [client-handshake.c:1242:client_setvolume_cbk] 0-mstate-client-2: Server and Client lk-version numbers are not same, reopening the fds
[2018-09-26 13:04:35.923249] I [MSGID: 114035] [client-handshake.c:202:client_set_lk_version_cbk] 0-mstate-client-2: Server lk version = 1
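
To get an overview of a log excerpt like the one above, the entries can be grouped by translator. The following is a minimal sketch, assuming the line layout seen in this excerpt (timestamp, level, optional MSGID, source location, translator name, message). The regular expression and the function name `count_gfid_mismatches` are illustrative assumptions, not a GlusterFS tool or API.

```python
import re
from collections import Counter

# Layout assumed from the excerpt:
# [timestamp] LEVEL [MSGID: n] [file:line:function] translator: message
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] (?P<level>[A-Z]) "
    r"(?:\[MSGID: (?P<msgid>\d+)\] )?"
    r"\[(?P<src>[^\]]+)\] (?P<xlator>\S+): (?P<msg>.*)$"
)


def count_gfid_mismatches(lines):
    """Count 'Gfid mismatch detected' messages per translator name."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and match.group("msg").startswith("Gfid mismatch detected"):
            counts[match.group("xlator")] += 1
    return counts


# Three lines copied from the excerpt in this report.
SAMPLE = [
    r"[2018-09-26 13:04:35.788472] E [MSGID: 108008] [afr-self-heal-common.c:336:afr_gfid_split_brain_source] 0-log-replicate-0: Gfid mismatch detected for <gfid:00000000-0000-0000-0000-000000000001>/tmp3.log>, c7c6e434-ea21-4e5d-bf38-aef0cef586d4 on log-client-1 and 4b46e66b-728f-4419-9852-46f233a1327e on log-client-0.",
    r"[2018-09-26 13:04:35.812233] I [rpc-clnt.c:1986:rpc_clnt_reconfig] 0-ccs-client-2: changing port to 49152 (from 0)",
    r"[2018-09-26 13:04:35.879395] E [MSGID: 108008] [afr-self-heal-common.c:336:afr_gfid_split_brain_source] 0-mstate-replicate-0: Gfid mismatch detected for <gfid:00000000-0000-0000-0000-000000000001>/tmpdir4>, b0aa432d-38a0-426d-98b9-aa4304176d87 on mstate-client-1 and 54a3fb44-34e4-4d9e-b36d-7aaf4fd5f9bf on mstate-client-0.",
]

if __name__ == "__main__":
    for xlator, n in sorted(count_gfid_mismatches(SAMPLE).items()):
        print(xlator, n)
```

Passing the lines of a full glustershd log file to `count_gfid_mismatches` in place of `SAMPLE` would show which replicate translators report the most mismatches.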

Comment 1 zhou lin 2018-10-16 09:15:18 UTC
GlusterFS version 3.12.3 with a 3-brick configuration:
# gluster v info mstate
 
Volume Name: mstate
Type: Replicate
Volume ID: cdff5a42-3a64-498e-b74b-63659807a063
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: sn-0.local:/mnt/bricks/mstate/brick
Brick2: sn-1.local:/mnt/bricks/mstate/brick
Brick3: sn-2.local:/mnt/bricks/mstate/brick (arbiter)
Options Reconfigured:
performance.client-io-threads: off
server.allow-insecure: on
cluster.quorum-type: auto
network.ping-timeout: 42
cluster.consistent-metadata: on
cluster.favorite-child-policy: mtime
cluster.quorum-reads: no
cluster.server-quorum-type: none
transport.address-family: inet
nfs.disable: on
cluster.server-quorum-ratio: 51%
[root@sn-0:/home/robot]
#
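
The `Number of Bricks: 1 x (2 + 1) = 3` line above describes one replica set made of two data bricks plus one arbiter brick. As a minimal sketch, the layout can be pulled out of such a line as follows; the function name `parse_brick_layout` and the returned key names are hypothetical and chosen only for illustration.

```python
import re

# Matches the arbiter form "<subvolumes> x (<data> + <arbiter>) = <total>".
LAYOUT_RE = re.compile(r"(\d+) x \((\d+) \+ (\d+)\) = (\d+)")


def parse_brick_layout(line):
    """Return the brick layout as a dict, or None if the line has another form."""
    match = LAYOUT_RE.search(line)
    if match is None:
        return None
    subvolumes, data, arbiter, total = (int(g) for g in match.groups())
    return {
        "subvolumes": subvolumes,
        "data_bricks": data,
        "arbiter_bricks": arbiter,
        "total": total,
    }


if __name__ == "__main__":
    print(parse_brick_layout("Number of Bricks: 1 x (2 + 1) = 3"))
```

Lines in any other form, such as a plain replica count without an arbiter, are deliberately left unparsed and yield `None`.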

Comment 2 Ravishankar N 2018-10-16 09:28:10 UTC
Hi Cynthia, could you attach all the /var/log/glusterfs/* logs from all 3 nodes too? Thanks.

Comment 3 zhou lin 2018-10-17 06:13:51 UTC
Created attachment 1494713 [details]
attached is sn log

Comment 4 Shyamsundar 2018-10-23 14:55:02 UTC
Release 3.12 has been EOLd and this bug was still found to be in the NEW state, hence moving the version to mainline, to triage the same and take appropriate actions.

Comment 5 Worker Ant 2020-03-12 12:39:11 UTC
This bug has been moved to https://github.com/gluster/glusterfs/issues/919 and will be tracked there from now on. Visit the GitHub issue URL for further details.

