Bug 1731513

Summary: snapd crashes sometimes
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Raghavendra Bhat <rabhat>
Component: snapshot
Assignee: Mohammed Rafi KC <rkavunga>
Status: CLOSED ERRATA
QA Contact: Mugdha Soni <musoni>
Severity: high
Docs Contact:
Priority: high
Version: unspecified
CC: amukherj, bugs, musoni, pprakash, rhs-bugs, rkavunga, sheggodu, storage-qa-internal, sunkumar
Target Milestone: ---
Keywords: ZStream
Target Release: RHGS 3.5.z Batch Update 1
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: glusterfs-6.0-23
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: 1726783
Environment:
Last Closed: 2020-01-30 06:42:47 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1726783
Bug Blocks: 1731509, 1731512

Description Raghavendra Bhat 2019-07-19 15:18:45 UTC
+++ This bug was initially created as a clone of Bug #1726783 +++

Description of problem:

This crash was found in the regression runs. It happens intermittently, not on every run.


Core was generated by `/build/install/sbin/glusterfsd -s localhost --volfile-id snapd/patchy -p /var/r'.
Program terminated with signal SIGSEGV, Segmentation fault.

(gdb) bt
#0  0x00007f0a5d00b016 in __strcmp_sse42 () from ./lib64/libc.so.6
#1  0x00007f0a501deceb in mgmt_get_snapinfo_cbk (req=0x7f0a44001808, iov=0x7f0a44001840, count=1, myframe=0x7f0a440014b8)
    at /home/raghu/work/src/glusterfs/xlators/features/snapview-server/src/snapview-server-mgmt.c:393
#2  0x00007f0a5e67a58c in rpc_clnt_handle_reply (clnt=0xa4d850, pollin=0x7f0a4c002080) at /home/raghu/work/src/glusterfs/rpc/rpc-lib/src/rpc-clnt.c:764
#3  0x00007f0a5e67aab5 in rpc_clnt_notify (trans=0xa4da80, mydata=0xa4d880, event=RPC_TRANSPORT_MSG_RECEIVED, data=0x7f0a4c002080)
    at /home/raghu/work/src/glusterfs/rpc/rpc-lib/src/rpc-clnt.c:931
#4  0x00007f0a5e676a9c in rpc_transport_notify (this=0xa4da80, event=RPC_TRANSPORT_MSG_RECEIVED, data=0x7f0a4c002080)
    at /home/raghu/work/src/glusterfs/rpc/rpc-lib/src/rpc-transport.c:547
#5  0x00007f0a52c85a19 in socket_event_poll_in (this=0xa4da80, notify_handled=true) at /home/raghu/work/src/glusterfs/rpc/rpc-transport/socket/src/socket.c:2582
#6  0x00007f0a52c86a2e in socket_event_handler (fd=9, idx=1, gen=1, data=0xa4da80, poll_in=1, poll_out=0, poll_err=0, event_thread_died=0 '\000')
    at /home/raghu/work/src/glusterfs/rpc/rpc-transport/socket/src/socket.c:2984
#7  0x00007f0a5e9574f2 in event_dispatch_epoll_handler (event_pool=0x9ff830, event=0x7f0a50be7140)
    at /home/raghu/work/src/glusterfs/libglusterfs/src/event-epoll.c:648
#8  0x00007f0a5e957a0b in event_dispatch_epoll_worker (data=0xa4faf0) at /home/raghu/work/src/glusterfs/libglusterfs/src/event-epoll.c:761
#9  0x00007f0a5d702dd5 in start_thread () from ./lib64/libpthread.so.0
#10 0x00007f0a5cfcaead in clone () from ./lib64/libc.so.6

(gdb) f 1
#1  0x00007f0a501deceb in mgmt_get_snapinfo_cbk (req=0x7f0a44001808, iov=0x7f0a44001840, count=1, myframe=0x7f0a440014b8)
    at /home/raghu/work/src/glusterfs/xlators/features/snapview-server/src/snapview-server-mgmt.c:393
393	                if ((!strcmp(old_dirents[i].name, dirents[j].name)) &&
(gdb) l
388	    LOCK(&priv->snaplist_lock);
389	    {
390	        oldcount = priv->num_snaps;
391	        for (i = 0; i < priv->num_snaps; i++) {
392	            for (j = 0; j < snapcount; j++) {
393	                if ((!strcmp(old_dirents[i].name, dirents[j].name)) &&
394	                    (!strcmp(old_dirents[i].uuid, dirents[j].uuid))) {
395	                    dirents[j].fs = old_dirents[i].fs;
396	                    old_dirents[i].fs = NULL;
397	                    break;
(gdb) p old_dirents
$1 = (snap_dirent_t *) 0x0
(gdb) p snap->dirents
No symbol "snap" in current context.
(gdb) p priv->dirents
$2 = (snap_dirent_t *) 0x7f0a440031e0
(gdb) p *priv->dirents
$3 = {name = "snap1", '\000' <repeats 249 times>, uuid = "da58a428-260e-4c28-abae-6d426f7417fa", snap_volname = "11536174454641028b9d901c7962abca", '\000' <repeats 222 times>, fs = 0x0}
(gdb) p priv->dirents[0]
$4 = {name = "snap1", '\000' <repeats 249 times>, uuid = "da58a428-260e-4c28-abae-6d426f7417fa", snap_volname = "11536174454641028b9d901c7962abca", '\000' <repeats 222 times>, fs = 0x0}
(gdb) p priv->dirents[1]
$5 = {name = "snap2", '\000' <repeats 249 times>, uuid = "985b0bcc-279a-40d3-8a21-1cb7d250ae25", snap_volname = "202c92fdf9624836931590897482a09b", '\000' <repeats 222 times>, fs = 0x0}
(gdb) p priv->dirents[2]
$6 = {name = "snap3", '\000' <repeats 249 times>, uuid = "9c001fcf-b010-48d6-b6f4-b70f120b69f8", snap_volname = "11f4f20ef1514be5a572b602ffe8d96d", '\000' <repeats 222 times>, fs = 0x0}
(gdb) p priv->dirents[3]
$7 = {name = "snap4", '\000' <repeats 249 times>, uuid = "3b2eadfe-cad4-4c6f-ba2e-d2c89710d960", snap_volname = "d28715725bb24411a7f4eebdff70e41a", '\000' <repeats 222 times>, fs = 0x0}
(gdb) p priv->dirents[4]
$8 = {name = "\000\000\000\000\000\000\000\000a\325\001", '\000' <repeats 243 times>, uuid = '\000' <repeats 36 times>, snap_volname = '\000' <repeats 254 times>, fs = 0x0}
(gdb) p dirents
$9 = (snap_dirent_t *) 0x7f0a4c024190
(gdb) p dirents[0]
$10 = {name = "snap1", '\000' <repeats 249 times>, uuid = "da58a428-260e-4c28-abae-6d426f7417fa", snap_volname = "11536174454641028b9d901c7962abca", '\000' <repeats 222 times>, fs = 0x0}
(gdb) p dirents[1]
$11 = {name = "snap2", '\000' <repeats 249 times>, uuid = "985b0bcc-279a-40d3-8a21-1cb7d250ae25", snap_volname = "202c92fdf9624836931590897482a09b", '\000' <repeats 222 times>, fs = 0x0}
(gdb) p dirents[2]
$12 = {name = "snap3", '\000' <repeats 249 times>, uuid = "9c001fcf-b010-48d6-b6f4-b70f120b69f8", snap_volname = "11f4f20ef1514be5a572b602ffe8d96d", '\000' <repeats 222 times>, fs = 0x0}
(gdb) p dirents[3]
$13 = {name = "snap4", '\000' <repeats 249 times>, uuid = "3b2eadfe-cad4-4c6f-ba2e-d2c89710d960", snap_volname = "d28715725bb24411a7f4eebdff70e41a", '\000' <repeats 222 times>, fs = 0x0}
(gdb) p dirents[4]
$14 = {
  name = "\000\000\000\000\000\000\000\000\245\004\000\000\000\000\000\000\000@\315I\n\177\000\000\020\f\000L\n\177\000\000؍\356I\n\177\000\000\000\000\000\000\000\000\000\000@\311\001L\n\177\000\000`J\002L\n\177\000\000\000\000\000\000\000\000\000\000\320N\002L\n\177\000\000\000\000\000\000\000\000\000\000؍\356I\n\177\000\000\270\216\356I\n\177\000\000\250\216\356I\n\177\000\000\000\000\000\000\000\000\000\000h\216\356I\n\177\000\000x\216\356I\n\177\000\000\350\216\356I\n\177\000\000\370\216\356I\n\177\000\000\b\217\356I\n\177\000\000\210\216\356I\n\177\000\000\230\216\356I\n\177\000\000\370\215\356I\n\177\000\000\b\216\356I\n\177\000\000\350\215\356I\n\177", '\000' <repeats 42 times>..., 
  uuid = "\000\000\000\000\000\000\000\000\000؎\356I\n\177\000\000\000\000\000\000\000\000\000\000\030\216\356I\n\177\000\000\070\216\356I", 
  snap_volname = "\n\177\000\000(\216\356I\n\177\000\000H\216\356I\n\177", '\000' <repeats 42 times>, "H\217\356I\n\177\000\000\070\217\356I\n\177\000\000(\217\356I\n\177\000\000\030\217\356I\n\177", '\000' <repeats 18 times>, "h\217\356I\n\177", '\000' <repeats 66 times>, "X\217\356I\n\177", '\000' <repeats 68 times>, fs = 0x0}
(gdb) p snapcount
$15 = 4
(gdb) p oldcount
$16 = 4



Version-Release number of selected component (if applicable):


How reproducible:
Not always


Additional info:

Reason:

In the snapshot daemon, mgmt_get_snapinfo_cbk is the callback for the RPC call made to glusterd to get the list of available snapshots. The RPC call to glusterd is made in two places: once when the snapview-server xlator is spawned, in its init function, and again whenever a connection is established between snapd and glusterd (in svs_rpc_notify).

But whenever snapd comes up, the xlator is both initialized and connected to glusterd, so two RPC calls to glusterd can be in flight at the same time, and the order in which their responses arrive can lead to the above crash.
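
To make the race concrete, below is a small self-contained C program (a generic illustration of the pattern only, not glusterfs code; every name in it is a hypothetical stand-in): one thread installs the list and its count under a lock, while another captures the list pointer outside the lock but reads the count inside it. Over enough iterations it occasionally observes a NULL list together with a non-zero count, which is exactly the combination that crashes mgmt_get_snapinfo_cbk.

/* Generic, self-contained illustration of the race class described above
 * (not glusterfs code; all names here are hypothetical stand-ins).
 * Build with: gcc -pthread race-demo.c -o race-demo
 * The reader captures the shared list pointer OUTSIDE the lock but reads
 * the count INSIDE it, so an installer running in between can leave it
 * holding list == NULL with count == 4 -- the combination seen in the
 * backtrace (old_dirents == NULL, oldcount == 4). It may take many
 * iterations before this interleaving is observed. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct shared {
    pthread_mutex_t lock;
    char **items;   /* stands in for priv->dirents   */
    int count;      /* stands in for priv->num_snaps */
};

static struct shared g = {PTHREAD_MUTEX_INITIALIZER, NULL, 0};

static void *
installer(void *arg)
{
    char **list = calloc(4, sizeof(*list));

    pthread_mutex_lock(&g.lock);
    g.items = list;
    g.count = 4;
    pthread_mutex_unlock(&g.lock);
    return arg;
}

static void *
racy_reader(void *arg)
{
    char **old = g.items; /* captured outside the lock: may still be NULL */
    int oldcount = 0;

    pthread_mutex_lock(&g.lock);
    oldcount = g.count;   /* may already be 4 by the time we get here */
    pthread_mutex_unlock(&g.lock);

    if (old == NULL && oldcount > 0)
        printf("inconsistent view: old == NULL, oldcount == %d "
               "(dereferencing old here would segfault)\n", oldcount);
    return arg;
}

int
main(void)
{
    for (int i = 0; i < 100000; i++) {
        pthread_t a, b;

        g.items = NULL;
        g.count = 0;
        pthread_create(&a, NULL, installer, NULL);
        pthread_create(&b, NULL, racy_reader, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        free(g.items);
    }
    return 0;
}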

In mgmt_get_snapinfo_cbk the callback does the following:

1) Get the list of existing snapshots from the local information private to the snapview-server xlator.
2) Process the list of snapshots that glusterd sent, comparing it against the local list.

Ideally, mgmt_get_snapinfo_cbk should read the current list of snapshots from its private structure under the lock. If it does not, the local list pointer can be captured while it is still empty (NULL), yet by the time the callback processes and compares the local list with the new list obtained via RPC, the other callback may already have installed its own list and updated the snapshot count. Iterating over the NULL local list with a non-zero count then results in the NULL pointer access seen above (old_dirents is NULL while oldcount is 4).

So the fix is to ensure that the local list of snapshots is obtained from the private structure inside the lock.
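
For reference, here is a minimal sketch of the pattern the fix describes, using hypothetical trimmed-down stand-ins for the snapview-server types (svs_private_t, svs_update_snaplist) and a plain pthread mutex in place of gluster's LOCK()/UNLOCK(); it is not the actual patch, and freeing of the old list is omitted. The point is that old_dirents and oldcount are captured in the same critical section, so they can never disagree:

#include <pthread.h>
#include <string.h>

/* Hypothetical, trimmed-down stand-ins for the snapview-server structures,
 * only to show the locking pattern; the real code uses gluster's LOCK()/
 * UNLOCK() on priv->snaplist_lock. */
typedef struct {
    char name[256];
    char uuid[37];
    void *fs;
} snap_dirent_t;

typedef struct {
    pthread_mutex_t snaplist_lock;
    snap_dirent_t *dirents;
    int num_snaps;
} svs_private_t;

/* Fixed pattern: old_dirents and oldcount are both read under the lock,
 * so a concurrent callback cannot leave us with a NULL list and a
 * non-zero count. Freeing of the old list is omitted here. */
static void
svs_update_snaplist(svs_private_t *priv, snap_dirent_t *dirents, int snapcount)
{
    snap_dirent_t *old_dirents = NULL;
    int oldcount = 0;
    int i, j;

    pthread_mutex_lock(&priv->snaplist_lock);
    {
        old_dirents = priv->dirents; /* read inside the lock (the fix) */
        oldcount = priv->num_snaps;

        for (i = 0; i < oldcount; i++) {
            for (j = 0; j < snapcount; j++) {
                if (!strcmp(old_dirents[i].name, dirents[j].name) &&
                    !strcmp(old_dirents[i].uuid, dirents[j].uuid)) {
                    /* carry over the already-initialised fs handle */
                    dirents[j].fs = old_dirents[i].fs;
                    old_dirents[i].fs = NULL;
                    break;
                }
            }
        }

        priv->dirents = dirents;     /* install the new list */
        priv->num_snaps = snapcount;
    }
    pthread_mutex_unlock(&priv->snaplist_lock);
}

Because the old fields are read and the new list is installed under the same lock, either of the two concurrent callbacks can complete first; the later one simply sees the earlier one's list as the old list.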

--- Additional comment from Worker Ant on 2019-07-03 18:07:23 UTC ---

REVIEW: https://review.gluster.org/22989 (features/snapview-server: obtain the list of snapshots inside the lock) posted (#1) for review on master by Raghavendra Bhat

--- Additional comment from Worker Ant on 2019-07-03 18:08:58 UTC ---

REVIEW: https://review.gluster.org/22990 (features/snapview-server: obtain the list of snapshots inside the lock) posted (#1) for review on master by Raghavendra Bhat

--- Additional comment from Worker Ant on 2019-07-12 13:37:34 UTC ---

REVIEW: https://review.gluster.org/22990 (features/snapview-server: obtain the list of snapshots inside the lock) merged (#2) on master by mohammed rafi  kc

Comment 9 Mugdha Soni 2019-12-03 05:39:57 UTC
Raghavendra 

It would be a great help if you could provide the steps to reproduce the issue.

Thanks 
Mugdha Soni

Comment 11 Mugdha Soni 2019-12-20 10:25:00 UTC
Ran tests/basic/uss.t from the glusterfs codebase, as suggested by Raghavendra. Ran the test 130 times; it neither failed nor hung.
Is there anything I need to take care of for verifying the bug?

Comment 12 Mohammed Rafi KC 2019-12-23 07:30:43 UTC
(In reply to Mugdha Soni from comment #11)
> Ran tests/basic/uss.t from the glusterfs codebase, as suggested by
> Raghavendra. Ran the test 130 times; it neither failed nor hung.
> Is there anything I need to take care of for verifying the bug?

Nope, I think that is good enough to validate the fix.

Comment 13 Mugdha Soni 2019-12-23 08:27:16 UTC
Since no further effort is required, moving this bug to VERIFIED on the basis of comment #11.

Comment 15 errata-xmlrpc 2020-01-30 06:42:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0288