Description of problem: When a volume is started, glusterd_op_start_volume() increments the volume's refcount, but the logic under 'out:' that should decrement it is faulty. As a result, every time a volume is stopped and started, its refcount increases by 1. This is serious because the refcount is what decides when a volume is actually deleted. When a volume that has gone through this is deleted, it no longer appears in vol info, because we explicitly remove it from the list, but the volinfo stays in memory. When reverting a failed snapshot restore this has catastrophic consequences: the stale volinfo not only stays in memory but also corrupts the volume list.
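The leak pattern described above can be sketched in isolation. This is a minimal model, not the glusterd source: struct volinfo, volinfo_ref(), volinfo_unref(), the start_volume_* functions and the helpers are all hypothetical names, and the inverted check under 'out:' is only an assumed shape for the faulty logic, since the report does not quote the actual condition.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for glusterd's volinfo: only the refcount is modelled. */
struct volinfo {
    int refcnt;
};

static void volinfo_ref(struct volinfo *v) { v->refcnt++; }

/* NULL-safe so the faulty variant below cannot crash this sketch. */
static void volinfo_unref(struct volinfo *v) { if (v) v->refcnt--; }

typedef int (*start_fn)(struct volinfo *);

/* Faulty variant: takes a reference, but the cleanup never releases it. */
static int start_volume_faulty(struct volinfo *v)
{
    int ret = -1;

    if (!v)
        goto out;
    volinfo_ref(v);          /* reference held while the volume is started */
    ret = 0;
out:
    if (!v)                  /* ASSUMED bug: inverted check, unref is skipped */
        volinfo_unref(v);
    return ret;
}

/* Fixed variant: every reference taken is released on the way out. */
static int start_volume_fixed(struct volinfo *v)
{
    int ret = -1;

    if (!v)
        goto out;
    volinfo_ref(v);
    ret = 0;
out:
    if (v)
        volinfo_unref(v);
    return ret;
}

/* Refcount after `cycles` stop/start cycles; the volume list holds 1 ref. */
static int refcnt_after_cycles(start_fn start, int cycles)
{
    struct volinfo v = { .refcnt = 1 };

    assert(cycles >= 0);
    for (int i = 0; i < cycles; i++)
        start(&v);
    return v.refcnt;
}

/* Delete drops the list's reference; memory is freed only at zero. */
static int delete_frees_volinfo(int refcnt)
{
    return refcnt - 1 == 0;
}
```

In this model, refcnt_after_cycles(start_volume_faulty, 3) returns 4 while the fixed variant stays at 1, and delete_frees_volinfo() is true only for the latter, which mirrors the stale volinfo that is unlisted but never freed.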
Master URL: http://review.gluster.org/16108
Release 3.9 URL: http://review.gluster.org/#/c/16113/
Release 3.8 URL: http://review.gluster.org/#/c/16114/
RHGS 3.2.0 URL: https://code.engineering.redhat.com/gerrit/#/c/92782/
@Avra, are there steps to reproduce and verify this issue?
You need to run gdb on glusterd, put a breakpoint on glusterd_op_start_volume(), and check the value of volinfo->refcnt. This value should not increase each time the volume is stopped and started again.
Verified this issue using glusterfs-3.8.4-9, and reproduced it without the fix as well. The fix is working well; below are the gdb results with and without the fix.

Without Fix,
============
2568    ret = glusterd_volinfo_find (volname, &volinfo);
(gdb)
2569    if (ret) {
(gdb) p volinfo->refcnt
$1 = 7
(gdb) c
Continuing.
Detaching after fork from child process 31326.
Detaching after fork from child process 31345.
(gdb)
2568    ret = glusterd_volinfo_find (volname, &volinfo);
(gdb) p volinfo->refcnt
$2 = 8
(gdb) c
Continuing.
Detaching after fork from child process 31416.
Detaching after fork from child process 31435.
2568    ret = glusterd_volinfo_find (volname, &volinfo);
(gdb)
2569    if (ret) {
(gdb) p volinfo->refcnt
$3 = 9
(gdb) c
Continuing.
Detaching after fork from child process 31506.

With Fix,
=========
2569    if (ret) {
(gdb) p volinfo->refcnt
$1 = 1
(gdb) c
Continuing.
Detaching after fork from child process 32164.
Detaching after fork from child process 32183.
Detaching after fork from child process 32202.
Detaching after fork from child process 32205.
Detaching after fork from child process 32229.
Detaching after fork from child process 32231.
Breakpoint 1, glusterd_op_start_volume (dict=dict@entry=0x7f1cf06b2988, op_errstr=op_errstr@entry=0x7f1cd83843c0) at glusterd-volume-ops.c:2544
2544    {
(gdb) next
2569    if (ret) {
(gdb) p volinfo->refcnt
$2 = 1
(gdb)
$3 = 1
(gdb) c
Continuing.
Detaching after fork from child process 32248.
Detaching after fork from child process 32267.
Detaching after fork from child process 32286.
Detaching after fork from child process 32289.
Detaching after fork from child process 32319.
Detaching after fork from child process 32321.
[Switching to Thread 0x7f1ce8f01700 (LWP 31781)]
Breakpoint 1, glusterd_op_start_volume (dict=dict@entry=0x7f1cf06b3198, op_errstr=op_errstr@entry=0x7f1cd83843c0) at glusterd-volume-ops.c:2544
2544    {
2568    ret = glusterd_volinfo_find (volname, &volinfo);
(gdb)
2569    if (ret) {
(gdb) p volinfo->refcnt
$4 = 1
(gdb) c
Continuing.
Detaching after fork from child process 32338.

Based on the above details, moving this bug to the verified state.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2017-0486.html