Bug 1731512

Summary: snapd crashes sometimes
Product: [Community] GlusterFS
Reporter: Raghavendra Bhat <rabhat>
Component: snapshot
Assignee: bugs <bugs>
Status: CLOSED NEXTRELEASE
Severity: high
Priority: high
Version: 7
CC: atumball, bugs
Hardware: Unspecified
OS: Unspecified
Clone Of: 1726783
Last Closed: 2019-07-24 11:04:30 UTC
Bug Depends On: 1726783, 1731513
Bug Blocks: 1731509

Description Raghavendra Bhat 2019-07-19 15:13:52 UTC
+++ This bug was initially created as a clone of Bug #1726783 +++

Description of problem:

This crash was found in the regression runs; it happens once in a while, not always.


Core was generated by `/build/install/sbin/glusterfsd -s localhost --volfile-id snapd/patchy -p /var/r'.
Program terminated with signal SIGSEGV, Segmentation fault.

(gdb) bt
#0  0x00007f0a5d00b016 in __strcmp_sse42 () from ./lib64/libc.so.6
#1  0x00007f0a501deceb in mgmt_get_snapinfo_cbk (req=0x7f0a44001808, iov=0x7f0a44001840, count=1, myframe=0x7f0a440014b8)
    at /home/raghu/work/src/glusterfs/xlators/features/snapview-server/src/snapview-server-mgmt.c:393
#2  0x00007f0a5e67a58c in rpc_clnt_handle_reply (clnt=0xa4d850, pollin=0x7f0a4c002080) at /home/raghu/work/src/glusterfs/rpc/rpc-lib/src/rpc-clnt.c:764
#3  0x00007f0a5e67aab5 in rpc_clnt_notify (trans=0xa4da80, mydata=0xa4d880, event=RPC_TRANSPORT_MSG_RECEIVED, data=0x7f0a4c002080)
    at /home/raghu/work/src/glusterfs/rpc/rpc-lib/src/rpc-clnt.c:931
#4  0x00007f0a5e676a9c in rpc_transport_notify (this=0xa4da80, event=RPC_TRANSPORT_MSG_RECEIVED, data=0x7f0a4c002080)
    at /home/raghu/work/src/glusterfs/rpc/rpc-lib/src/rpc-transport.c:547
#5  0x00007f0a52c85a19 in socket_event_poll_in (this=0xa4da80, notify_handled=true) at /home/raghu/work/src/glusterfs/rpc/rpc-transport/socket/src/socket.c:2582
#6  0x00007f0a52c86a2e in socket_event_handler (fd=9, idx=1, gen=1, data=0xa4da80, poll_in=1, poll_out=0, poll_err=0, event_thread_died=0 '\000')
    at /home/raghu/work/src/glusterfs/rpc/rpc-transport/socket/src/socket.c:2984
#7  0x00007f0a5e9574f2 in event_dispatch_epoll_handler (event_pool=0x9ff830, event=0x7f0a50be7140)
    at /home/raghu/work/src/glusterfs/libglusterfs/src/event-epoll.c:648
#8  0x00007f0a5e957a0b in event_dispatch_epoll_worker (data=0xa4faf0) at /home/raghu/work/src/glusterfs/libglusterfs/src/event-epoll.c:761
#9  0x00007f0a5d702dd5 in start_thread () from ./lib64/libpthread.so.0
#10 0x00007f0a5cfcaead in clone () from ./lib64/libc.so.6

(gdb) f 1
#1  0x00007f0a501deceb in mgmt_get_snapinfo_cbk (req=0x7f0a44001808, iov=0x7f0a44001840, count=1, myframe=0x7f0a440014b8)
    at /home/raghu/work/src/glusterfs/xlators/features/snapview-server/src/snapview-server-mgmt.c:393
393	                if ((!strcmp(old_dirents[i].name, dirents[j].name)) &&
(gdb) l
388	    LOCK(&priv->snaplist_lock);
389	    {
390	        oldcount = priv->num_snaps;
391	        for (i = 0; i < priv->num_snaps; i++) {
392	            for (j = 0; j < snapcount; j++) {
393	                if ((!strcmp(old_dirents[i].name, dirents[j].name)) &&
394	                    (!strcmp(old_dirents[i].uuid, dirents[j].uuid))) {
395	                    dirents[j].fs = old_dirents[i].fs;
396	                    old_dirents[i].fs = NULL;
397	                    break;
(gdb) p old_dirents
$1 = (snap_dirent_t *) 0x0
(gdb) p snap->dirents
No symbol "snap" in current context.
(gdb) p priv->dirents
$2 = (snap_dirent_t *) 0x7f0a440031e0
(gdb) p *priv->dirents
$3 = {name = "snap1", '\000' <repeats 249 times>, uuid = "da58a428-260e-4c28-abae-6d426f7417fa", snap_volname = "11536174454641028b9d901c7962abca", '\000' <repeats 222 times>, fs = 0x0}
(gdb) p priv->dirents[0]
$4 = {name = "snap1", '\000' <repeats 249 times>, uuid = "da58a428-260e-4c28-abae-6d426f7417fa", snap_volname = "11536174454641028b9d901c7962abca", '\000' <repeats 222 times>, fs = 0x0}
(gdb) p priv->dirents[1]
$5 = {name = "snap2", '\000' <repeats 249 times>, uuid = "985b0bcc-279a-40d3-8a21-1cb7d250ae25", snap_volname = "202c92fdf9624836931590897482a09b", '\000' <repeats 222 times>, fs = 0x0}
(gdb) p priv->dirents[2]
$6 = {name = "snap3", '\000' <repeats 249 times>, uuid = "9c001fcf-b010-48d6-b6f4-b70f120b69f8", snap_volname = "11f4f20ef1514be5a572b602ffe8d96d", '\000' <repeats 222 times>, fs = 0x0}
(gdb) p priv->dirents[3]
$7 = {name = "snap4", '\000' <repeats 249 times>, uuid = "3b2eadfe-cad4-4c6f-ba2e-d2c89710d960", snap_volname = "d28715725bb24411a7f4eebdff70e41a", '\000' <repeats 222 times>, fs = 0x0}
(gdb) p priv->dirents[4]
$8 = {name = "\000\000\000\000\000\000\000\000a\325\001", '\000' <repeats 243 times>, uuid = '\000' <repeats 36 times>, snap_volname = '\000' <repeats 254 times>, fs = 0x0}
(gdb) p dirents
$9 = (snap_dirent_t *) 0x7f0a4c024190
(gdb) p dirents[0]
$10 = {name = "snap1", '\000' <repeats 249 times>, uuid = "da58a428-260e-4c28-abae-6d426f7417fa", snap_volname = "11536174454641028b9d901c7962abca", '\000' <repeats 222 times>, fs = 0x0}
(gdb) p dirents[1]
$11 = {name = "snap2", '\000' <repeats 249 times>, uuid = "985b0bcc-279a-40d3-8a21-1cb7d250ae25", snap_volname = "202c92fdf9624836931590897482a09b", '\000' <repeats 222 times>, fs = 0x0}
(gdb) p dirents[2]
$12 = {name = "snap3", '\000' <repeats 249 times>, uuid = "9c001fcf-b010-48d6-b6f4-b70f120b69f8", snap_volname = "11f4f20ef1514be5a572b602ffe8d96d", '\000' <repeats 222 times>, fs = 0x0}
(gdb) p dirents[3]
$13 = {name = "snap4", '\000' <repeats 249 times>, uuid = "3b2eadfe-cad4-4c6f-ba2e-d2c89710d960", snap_volname = "d28715725bb24411a7f4eebdff70e41a", '\000' <repeats 222 times>, fs = 0x0}
(gdb) p dirents[4]
$14 = {
  name = "\000\000\000\000\000\000\000\000\245\004\000\000\000\000\000\000\000@\315I\n\177\000\000\020\f\000L\n\177\000\000؍\356I\n\177\000\000\000\000\000\000\000\000\000\000@\311\001L\n\177\000\000`J\002L\n\177\000\000\000\000\000\000\000\000\000\000\320N\002L\n\177\000\000\000\000\000\000\000\000\000\000؍\356I\n\177\000\000\270\216\356I\n\177\000\000\250\216\356I\n\177\000\000\000\000\000\000\000\000\000\000h\216\356I\n\177\000\000x\216\356I\n\177\000\000\350\216\356I\n\177\000\000\370\216\356I\n\177\000\000\b\217\356I\n\177\000\000\210\216\356I\n\177\000\000\230\216\356I\n\177\000\000\370\215\356I\n\177\000\000\b\216\356I\n\177\000\000\350\215\356I\n\177", '\000' <repeats 42 times>..., 
  uuid = "\000\000\000\000\000\000\000\000\000؎\356I\n\177\000\000\000\000\000\000\000\000\000\000\030\216\356I\n\177\000\000\070\216\356I", 
  snap_volname = "\n\177\000\000(\216\356I\n\177\000\000H\216\356I\n\177", '\000' <repeats 42 times>, "H\217\356I\n\177\000\000\070\217\356I\n\177\000\000(\217\356I\n\177\000\000\030\217\356I\n\177", '\000' <repeats 18 times>, "h\217\356I\n\177", '\000' <repeats 66 times>, "X\217\356I\n\177", '\000' <repeats 68 times>, fs = 0x0}
(gdb) p snapcount
$15 = 4
(gdb) p oldcount
$16 = 4



Version-Release number of selected component (if applicable):


How reproducible:
Not always


Additional info:

Reason:

In the snapshot daemon, mgmt_get_snapinfo_cbk is the callback for the RPC call made to glusterd to fetch the list of available snapshots. This RPC call is made from two places: once in the init function, when the snapview-server xlator is spawned, and again whenever a connection is established between snapd and glusterd (in svs_rpc_notify).

But whenever snapd comes up, it is both initialized and connected to glusterd. So two such RPC calls to glusterd can be outstanding at the same time, and the order in which their responses arrive can lead to the above crash.
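
An illustrative interleaving (a sketch inferred from the state captured in the core above, where old_dirents is read from priv->dirents before snaplist_lock is taken):

    callback A (from init)                   callback B (from svs_rpc_notify)
    ----------------------                   --------------------------------
    old_dirents = priv->dirents  /* NULL */
                                             LOCK(&priv->snaplist_lock)
                                             priv->dirents = <4 new entries>
                                             priv->num_snaps = 4
                                             UNLOCK(&priv->snaplist_lock)
    LOCK(&priv->snaplist_lock)
    oldcount = priv->num_snaps   /* now 4 */
    strcmp(old_dirents[0].name, ...)  /* NULL dereference -> SIGSEGV */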

In mgmt_get_snapinfo_cbk we do this:

1) Get the list of existing snapshots from the local information private to the snapview-server xlator.
2) Process the list of snapshots that glusterd sent, comparing it against that local list.

Ideally, mgmt_get_snapinfo_cbk should fetch the current list of snapshots from its private structure under a lock. Because it does not, the local list can be empty at the time it is read, yet by the time the local list is compared with the new list obtained via RPC, the other callback may already have updated the private structure (note oldcount == 4 but old_dirents == NULL in the session above). Iterating over the empty (NULL) list then results in a NULL pointer access.

So the fix is to ensure that the local list of snapshots is obtained from the private structure inside the lock.
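
A minimal sketch of that fix (simplified from the snapview-server-mgmt.c listing above; the merged change is in the reviews linked below). The pointer to the old list is read under the same snaplist_lock as the count, so the two can never go out of sync:

    LOCK(&priv->snaplist_lock);
    {
        /* Read the list pointer and its count together, under the lock,
         * so a concurrent callback cannot update one without the other. */
        old_dirents = priv->dirents;
        oldcount = priv->num_snaps;
        for (i = 0; i < oldcount; i++) {
            for (j = 0; j < snapcount; j++) {
                if (old_dirents && !strcmp(old_dirents[i].name, dirents[j].name) &&
                    !strcmp(old_dirents[i].uuid, dirents[j].uuid)) {
                    /* Reuse the already-initialized snapshot fs handle
                     * instead of creating it again. */
                    dirents[j].fs = old_dirents[i].fs;
                    old_dirents[i].fs = NULL;
                    break;
                }
            }
        }
    }
    UNLOCK(&priv->snaplist_lock);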

--- Additional comment from Worker Ant on 2019-07-03 18:07:23 UTC ---

REVIEW: https://review.gluster.org/22989 (features/snapview-server: obtain the list of snapshots inside the lock) posted (#1) for review on master by Raghavendra Bhat

--- Additional comment from Worker Ant on 2019-07-03 18:08:58 UTC ---

REVIEW: https://review.gluster.org/22990 (features/snapview-server: obtain the list of snapshots inside the lock) posted (#1) for review on master by Raghavendra Bhat

--- Additional comment from Worker Ant on 2019-07-12 13:37:34 UTC ---

REVIEW: https://review.gluster.org/22990 (features/snapview-server: obtain the list of snapshots inside the lock) merged (#2) on master by mohammed rafi  kc

Comment 1 Worker Ant 2019-07-19 15:15:53 UTC
REVIEW: https://review.gluster.org/23082 (features/snapview-server: obtain the list of snapshots inside the lock) posted (#1) for review on release-7 by Raghavendra Bhat

Comment 2 Worker Ant 2019-07-24 11:04:30 UTC
REVIEW: https://review.gluster.org/23082 (features/snapview-server: obtain the list of snapshots inside the lock) merged (#2) on release-7 by Rinku Kothiya