Bug 1600790

Summary: Segmentation fault while using gfapi to get volume utilization
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Shubhendu Tripathi <shtripat>
Component: rpc
Assignee: Mohit Agrawal <moagrawa>
Status: CLOSED ERRATA
QA Contact: Upasana <ubansal>
Severity: high
Docs Contact:
Priority: unspecified
Version: rhgs-3.4
CC: amukherj, apaladug, dahorak, jthottan, moagrawa, rhs-bugs, sankarshan, sheggodu, shtripat, skoduri, storage-qa-internal, ubansal
Target Milestone: ---   
Target Release: RHGS 3.4.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: glusterfs-3.12.2-15
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1607783 (view as bug list)
Environment:
Last Closed: 2018-09-04 06:50:20 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1607783    
Bug Blocks: 1503137, 1600092    
Attachments:
gfapi-segfault.txt (flags: none)

Description Shubhendu Tripathi 2018-07-13 03:58:49 UTC
Description of problem:
We have a 24-node gluster cluster with a distributed-disperse volume whose 48 bricks span all of the nodes. While using gfapi to get the volume utilization, the call throws a segmentation fault.

The volume info for the affected volume is as below:

# gluster v info volume_gama_disperse_4_plus_2x2
 
Volume Name: volume_gama_disperse_4_plus_2x2
Type: Distributed-Disperse
Volume ID: b7947c8d-c0e6-458a-a3d5-47221a5a0e63
Status: Stopped
Snapshot Count: 0
Number of Bricks: 8 x (4 + 2) = 48
Transport-type: tcp
Bricks:
Brick1: dahorak-usm3-gl01.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick2: dahorak-usm3-gl02.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick3: dahorak-usm3-gl03.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick4: dahorak-usm3-gl04.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick5: dahorak-usm3-gl05.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick6: dahorak-usm3-gl06.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick7: dahorak-usm3-gl07.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick8: dahorak-usm3-gl08.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick9: dahorak-usm3-gl09.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick10: dahorak-usm3-gl10.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick11: dahorak-usm3-gl11.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick12: dahorak-usm3-gl12.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick13: dahorak-usm3-gl13.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick14: dahorak-usm3-gl14.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick15: dahorak-usm3-gl15.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick16: dahorak-usm3-gl16.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick17: dahorak-usm3-gl17.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick18: dahorak-usm3-gl18.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick19: dahorak-usm3-gl19.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick20: dahorak-usm3-gl20.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick21: dahorak-usm3-gl21.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick22: dahorak-usm3-gl22.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick23: dahorak-usm3-gl23.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick24: dahorak-usm3-gl24.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick25: dahorak-usm3-gl01.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick26: dahorak-usm3-gl02.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick27: dahorak-usm3-gl03.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick28: dahorak-usm3-gl04.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick29: dahorak-usm3-gl05.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick30: dahorak-usm3-gl06.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick31: dahorak-usm3-gl07.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick32: dahorak-usm3-gl08.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick33: dahorak-usm3-gl09.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick34: dahorak-usm3-gl10.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick35: dahorak-usm3-gl11.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick36: dahorak-usm3-gl12.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick37: dahorak-usm3-gl13.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick38: dahorak-usm3-gl14.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick39: dahorak-usm3-gl15.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick40: dahorak-usm3-gl16.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick41: dahorak-usm3-gl17.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick42: dahorak-usm3-gl18.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick43: dahorak-usm3-gl19.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick44: dahorak-usm3-gl20.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick45: dahorak-usm3-gl21.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick46: dahorak-usm3-gl22.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick47: dahorak-usm3-gl23.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick48: dahorak-usm3-gl24.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
transport.address-family: inet
nfs.disable: on


Version-Release number of selected component (if applicable):

glusterfs-fuse-3.12.2-13.el7rhgs.x86_64
vdsm-gluster-4.19.43-2.3.el7rhgs.noarch
tendrl-gluster-integration-1.6.4-7.fc23.noarch
gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
glusterfs-rdma-3.12.2-13.el7rhgs.x86_64
glusterfs-cli-3.12.2-13.el7rhgs.x86_64
python2-gluster-3.12.2-13.el7rhgs.x86_64
libvirt-daemon-driver-storage-gluster-3.9.0-14.el7_5.6.x86_64
glusterfs-geo-replication-3.12.2-13.el7rhgs.x86_64
glusterfs-libs-3.12.2-13.el7rhgs.x86_64
glusterfs-3.12.2-13.el7rhgs.x86_64
glusterfs-events-3.12.2-13.el7rhgs.x86_64
glusterfs-server-3.12.2-13.el7rhgs.x86_64
glusterfs-api-3.12.2-13.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-13.el7rhgs.x86_64

The WA code that fetches the volume utilization via gfapi is at https://github.com/Tendrl/gluster-integration/blob/master/tendrl/gluster_integration/gfapi.py
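
For orientation, below is a minimal C sketch of what such a utilization query looks like through libgfapi; the Tendrl code linked above is Python over ctypes, so this is only an illustration of the same call sequence. The volume and host names are taken from this report, and the header path and pkg-config name are assumed from a standard glusterfs-api devel install.

/* Minimal illustrative sketch (NOT the Tendrl code, which is Python/ctypes):
 * query volume utilization through libgfapi the same way the WA utility does.
 * Build with e.g.:
 *   gcc vol-util.c -o vol-util $(pkg-config --cflags --libs glusterfs-api)
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/statvfs.h>
#include <glusterfs/api/glfs.h>

int main(void)
{
    /* Volume and host names are taken from this report; adjust as needed. */
    glfs_t *fs = glfs_new("volume_gama_disperse_4_plus_2x2");
    if (!fs)
        return EXIT_FAILURE;

    glfs_set_volfile_server(fs, "tcp",
                            "dahorak-usm3-gl01.usmqe.lab.eng.blr.redhat.com", 24007);
    glfs_set_logging(fs, "/tmp/gfapi-vol-util.log", 4);

    if (glfs_init(fs) != 0) {
        perror("glfs_init");
        glfs_fini(fs);
        return EXIT_FAILURE;
    }

    struct statvfs st;
    if (glfs_statvfs(fs, "/", &st) != 0) {
        perror("glfs_statvfs");
        glfs_fini(fs);
        return EXIT_FAILURE;
    }

    double total = (double)st.f_blocks * st.f_frsize;
    double avail = (double)st.f_bfree * st.f_frsize;
    double used  = total - avail;
    printf("total=%.0f used=%.0f pcnt_used=%.2f\n",
           total, used, total > 0 ? used * 100.0 / total : 0.0);

    glfs_fini(fs);
    return EXIT_SUCCESS;
}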

How reproducible:
Always

Steps to Reproduce:
1. Create a 24-node gluster cluster
2. Create an 8 x (4+2) distributed-disperse volume with bricks from all 24 nodes
3. Run the volume utilization utility

Actual results:
The utility throws a segmentation fault. For other volumes, of type distributed-replicate, it shows the volume utilization as expected.

Expected results:
It should show the volume utilization details for all the volumes.

Additional info:
The trace from the segmentation fault is attached for reference.

Comment 2 Shubhendu Tripathi 2018-07-13 04:00:30 UTC
Created attachment 1458625 [details]
gfapi-segfault.txt

Comment 3 Shubhendu Tripathi 2018-07-13 04:19:03 UTC
If I create a distributed-disperse volume with a smaller number of bricks from fewer nodes, the volume utilization details are shown properly, as shown below:

# gluster v info test-disp
 
Volume Name: test-disp
Type: Distributed-Disperse
Volume ID: b2d2d004-34be-4448-9320-6a952b562447
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x (2 + 1) = 6
Transport-type: tcp
Bricks:
Brick1: dahorak-usm3-gl01.usmqe.lab.eng.blr.redhat.com:/root/gluster_bricks/test-disp_b1
Brick2: dahorak-usm3-gl01.usmqe.lab.eng.blr.redhat.com:/root/gluster_bricks/test-disp_b2
Brick3: dahorak-usm3-gl01.usmqe.lab.eng.blr.redhat.com:/root/gluster_bricks/test-disp_b3
Brick4: dahorak-usm3-gl01.usmqe.lab.eng.blr.redhat.com:/root/gluster_bricks/test-disp_b4
Brick5: dahorak-usm3-gl01.usmqe.lab.eng.blr.redhat.com:/root/gluster_bricks/test-disp_b5
Brick6: dahorak-usm3-gl01.usmqe.lab.eng.blr.redhat.com:/root/gluster_bricks/test-disp_b6
Options Reconfigured:
transport.address-family: inet
nfs.disable: on

# tendrl-gluster-vol-utilization test-disp
{"test-disp": {"pcnt_used": 16.85206375567077, "used": 2130144.0, "used_inode": 22788, "free": 10510112.0, "pcnt_inode_used": 0.7207355373730537, "total_inode": 3161770, "total": 12640256.0}}

Comment 4 Poornima G 2018-07-20 06:19:30 UTC
Can you provide the core dump to debug this further? Without the core it's not possible to analyse what caused the crash. Also, installing the debuginfo packages and pasting the backtrace from the core would be more helpful.

Comment 12 Mohit Agrawal 2018-07-24 09:05:20 UTC
Hi,

RCA: The gfapi client program crashes in rpc_clnt_connection_cleanup while
     destroying the saved frames on the connection, because the saved frames
     have already been destroyed by rpc_clnt_destroy. To avoid this race,
     saved_frames is set to NULL inside the critical section in rpc_clnt_destroy.

     I ran the client program under valgrind, and at the time of destroying the
     frame it reported an access "0 bytes inside a block" that had already been
     free'd, as shown below:

==9735==  Address 0x18abbe70 is 0 bytes inside a block of size 272 free'd
==9735==    at 0x4C2ACBD: free (vg_replace_malloc.c:530)
==9735==    by 0x5645B9D: rpc_clnt_destroy (rpc-clnt.c:1777)
==9735==    by 0x5645B9D: rpc_clnt_notify (rpc-clnt.c:950)
==9735==    by 0x56419AB: rpc_transport_unref (rpc-transport.c:517)
==9735==    by 0x5644A38: rpc_clnt_trigger_destroy (rpc-clnt.c:1766)
==9735==    by 0x5644A38: rpc_clnt_unref (rpc-clnt.c:1803)
==9735==    by 0x5644E3F: call_bail (rpc-clnt.c:197)
==9735==    by 0x5AA6981: gf_timer_proc (timer.c:165)
==9735==    by 0x689DDD4: start_thread (pthread_create.c:308)
==9735==    by 0x515DB3C: clone (clone.S:113)


Regards
Mohit Agrawal
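
To illustrate the pattern described in the RCA above, here is a standalone C sketch of the "detach under the lock, free outside it" approach; the struct and function names are simplified stand-ins, not the actual rpc-clnt.c patch. Whichever teardown path runs first takes ownership of saved_frames inside the critical section and sets the pointer to NULL, so the competing path sees NULL and does not free the frames a second time.

/* Standalone illustration only; names are simplified stand-ins.
 * Build with: gcc race-sketch.c -o race-sketch -pthread
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct saved_frames {
    int count;
};

struct connection {
    pthread_mutex_t lock;
    struct saved_frames *saved_frames;
};

/* Called from more than one teardown path; must be safe to run twice. */
static void destroy_saved_frames(struct connection *conn, const char *who)
{
    struct saved_frames *frames;

    pthread_mutex_lock(&conn->lock);
    frames = conn->saved_frames;
    conn->saved_frames = NULL;   /* competing path now sees NULL and skips the free */
    pthread_mutex_unlock(&conn->lock);

    if (frames) {
        printf("%s: freeing %d saved frames\n", who, frames->count);
        free(frames);
    } else {
        printf("%s: nothing to free\n", who);
    }
}

int main(void)
{
    struct connection conn = { PTHREAD_MUTEX_INITIALIZER, NULL };

    conn.saved_frames = calloc(1, sizeof(*conn.saved_frames));
    if (!conn.saved_frames)
        return EXIT_FAILURE;
    conn.saved_frames->count = 3;

    destroy_saved_frames(&conn, "destroy path");  /* analogous to rpc_clnt_destroy   */
    destroy_saved_frames(&conn, "cleanup path");  /* analogous to connection cleanup */
    return EXIT_SUCCESS;
}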

Comment 23 errata-xmlrpc 2018-09-04 06:50:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2607