Bug 1600790 - Segmentation fault while using gfapi while getting volume utilization
Summary: Segmentation fault while using gfapi while getting volume utilization
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: rpc
Version: rhgs-3.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: RHGS 3.4.0
Assignee: Mohit Agrawal
QA Contact: Upasana
URL:
Whiteboard:
Depends On: 1607783
Blocks: 1503137 1600092
Reported: 2018-07-13 03:58 UTC by Shubhendu Tripathi
Modified: 2018-09-18 10:24 UTC (History)

Fixed In Version: glusterfs-3.12.2-15
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1607783 (view as bug list)
Environment:
Last Closed: 2018-09-04 06:50:20 UTC
Target Upstream Version:


Attachments
gfapi-segfault.txt (6.22 KB, text/plain)
2018-07-13 04:00 UTC, Shubhendu Tripathi


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2018:2607 None None None 2018-09-04 06:51:44 UTC
Red Hat Bugzilla 1600092 None CLOSED Importing bigger cluster failing: Timing out import job, Cluster data still not fully updated 2019-03-12 04:47:35 UTC

Internal Links: 1600092

Description Shubhendu Tripathi 2018-07-13 03:58:49 UTC
Description of problem:
We have a 24-node gluster cluster with a distribute-disperse volume that has bricks from all the nodes (48 bricks). Getting the volume utilization through gfapi results in a segmentation fault.

The volume info for the affected volume is as follows:

# gluster v info volume_gama_disperse_4_plus_2x2
 
Volume Name: volume_gama_disperse_4_plus_2x2
Type: Distributed-Disperse
Volume ID: b7947c8d-c0e6-458a-a3d5-47221a5a0e63
Status: Stopped
Snapshot Count: 0
Number of Bricks: 8 x (4 + 2) = 48
Transport-type: tcp
Bricks:
Brick1: dahorak-usm3-gl01.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick2: dahorak-usm3-gl02.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick3: dahorak-usm3-gl03.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick4: dahorak-usm3-gl04.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick5: dahorak-usm3-gl05.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick6: dahorak-usm3-gl06.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick7: dahorak-usm3-gl07.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick8: dahorak-usm3-gl08.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick9: dahorak-usm3-gl09.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick10: dahorak-usm3-gl10.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick11: dahorak-usm3-gl11.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick12: dahorak-usm3-gl12.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick13: dahorak-usm3-gl13.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick14: dahorak-usm3-gl14.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick15: dahorak-usm3-gl15.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick16: dahorak-usm3-gl16.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick17: dahorak-usm3-gl17.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick18: dahorak-usm3-gl18.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick19: dahorak-usm3-gl19.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick20: dahorak-usm3-gl20.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick21: dahorak-usm3-gl21.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick22: dahorak-usm3-gl22.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick23: dahorak-usm3-gl23.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick24: dahorak-usm3-gl24.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick25: dahorak-usm3-gl01.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick26: dahorak-usm3-gl02.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick27: dahorak-usm3-gl03.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick28: dahorak-usm3-gl04.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick29: dahorak-usm3-gl05.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick30: dahorak-usm3-gl06.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick31: dahorak-usm3-gl07.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick32: dahorak-usm3-gl08.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick33: dahorak-usm3-gl09.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick34: dahorak-usm3-gl10.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick35: dahorak-usm3-gl11.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick36: dahorak-usm3-gl12.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick37: dahorak-usm3-gl13.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick38: dahorak-usm3-gl14.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick39: dahorak-usm3-gl15.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick40: dahorak-usm3-gl16.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick41: dahorak-usm3-gl17.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick42: dahorak-usm3-gl18.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick43: dahorak-usm3-gl19.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick44: dahorak-usm3-gl20.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick45: dahorak-usm3-gl21.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick46: dahorak-usm3-gl22.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick47: dahorak-usm3-gl23.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick48: dahorak-usm3-gl24.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
transport.address-family: inet
nfs.disable: on


Version-Release number of selected component (if applicable):

glusterfs-fuse-3.12.2-13.el7rhgs.x86_64
vdsm-gluster-4.19.43-2.3.el7rhgs.noarch
tendrl-gluster-integration-1.6.4-7.fc23.noarch
gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
glusterfs-rdma-3.12.2-13.el7rhgs.x86_64
glusterfs-cli-3.12.2-13.el7rhgs.x86_64
python2-gluster-3.12.2-13.el7rhgs.x86_64
libvirt-daemon-driver-storage-gluster-3.9.0-14.el7_5.6.x86_64
glusterfs-geo-replication-3.12.2-13.el7rhgs.x86_64
glusterfs-libs-3.12.2-13.el7rhgs.x86_64
glusterfs-3.12.2-13.el7rhgs.x86_64
glusterfs-events-3.12.2-13.el7rhgs.x86_64
glusterfs-server-3.12.2-13.el7rhgs.x86_64
glusterfs-api-3.12.2-13.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-13.el7rhgs.x86_64

The WA code which invokes volume utilization using gfapi is at https://github.com/Tendrl/gluster-integration/blob/master/tendrl/gluster_integration/gfapi.py

How reproducible:
Always

Steps to Reproduce:
1. Create a 24-node gluster cluster
2. Create an 8 x (4+2) distribute-disperse volume with bricks from all 24 nodes
3. Run the volume utilization utility

Actual results:
Throws a segmentation fault. For other volumes of type distribute-replicate it shows the volume utilization as expected.

Expected results:
It should show the volume utilization details for all the volumes

Additional info:
The trace from the segmentation fault is attached for reference.

Comment 2 Shubhendu Tripathi 2018-07-13 04:00:30 UTC
Created attachment 1458625
gfapi-segfault.txt

Comment 3 Shubhendu Tripathi 2018-07-13 04:19:03 UTC
If I create a distribute-disperse volume with a smaller number of bricks, taken from only a few nodes, the volume utilization details are shown properly, as shown below:

# gluster v info test-disp
 
Volume Name: test-disp
Type: Distributed-Disperse
Volume ID: b2d2d004-34be-4448-9320-6a952b562447
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x (2 + 1) = 6
Transport-type: tcp
Bricks:
Brick1: dahorak-usm3-gl01.usmqe.lab.eng.blr.redhat.com:/root/gluster_bricks/test-disp_b1
Brick2: dahorak-usm3-gl01.usmqe.lab.eng.blr.redhat.com:/root/gluster_bricks/test-disp_b2
Brick3: dahorak-usm3-gl01.usmqe.lab.eng.blr.redhat.com:/root/gluster_bricks/test-disp_b3
Brick4: dahorak-usm3-gl01.usmqe.lab.eng.blr.redhat.com:/root/gluster_bricks/test-disp_b4
Brick5: dahorak-usm3-gl01.usmqe.lab.eng.blr.redhat.com:/root/gluster_bricks/test-disp_b5
Brick6: dahorak-usm3-gl01.usmqe.lab.eng.blr.redhat.com:/root/gluster_bricks/test-disp_b6
Options Reconfigured:
transport.address-family: inet
nfs.disable: on

# tendrl-gluster-vol-utilization test-disp
{"test-disp": {"pcnt_used": 16.85206375567077, "used": 2130144.0, "used_inode": 22788, "free": 10510112.0, "pcnt_inode_used": 0.7207355373730537, "total_inode": 3161770, "total": 12640256.0}}
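
The fields in the output above are consistent with one another: "used" equals "total" minus "free", and the two percentages follow from those counters. The sketch below reproduces that arithmetic. It assumes the utility derives its figures from statvfs-style counters, which this report does not confirm; the function name summarize_utilization is hypothetical and is not taken from gfapi.py.

```python
def summarize_utilization(total, free, total_inode, used_inode):
    """Derive used space and percentages from raw counters.

    Hypothetical helper for illustration only; the inputs are assumed to be
    the kind of block and inode counters a statvfs-style call returns.
    """
    used = total - free
    return {
        "total": float(total),
        "free": float(free),
        "used": float(used),
        # Guard against an empty volume to avoid dividing by zero.
        "pcnt_used": used / total * 100 if total else 0.0,
        "total_inode": total_inode,
        "used_inode": used_inode,
        "pcnt_inode_used": used_inode / total_inode * 100 if total_inode else 0.0,
    }


# Counters taken from the test-disp output above.
report = summarize_utilization(
    total=12640256, free=10510112, total_inode=3161770, used_inode=22788
)
print(report)
```

Feeding in the test-disp counters gives back the same "used" and percentage values that the utility printed.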

Comment 4 Poornima G 2018-07-20 06:19:30 UTC
Can you provide the core dump so that we can debug this further? Without the core, it's not possible to analyse what caused the crash. Installing the debuginfo packages and pasting the backtrace from the core would also be helpful.

Comment 12 Mohit Agrawal 2018-07-24 09:05:20 UTC
Hi,

RCA: The gf_client program crashes in rpc_clnt_connection_cleanup while
     destroying the saved frames on the connection, because the saved frames
     have already been destroyed by rpc_clnt_destroy. To avoid this race,
     set saved_frames to NULL inside the critical section in rpc_clnt_destroy.

     I ran the client program under valgrind and it reported "0 bytes inside a block" at the time the frame is destroyed, as shown below:

==9735==  Address 0x18abbe70 is 0 bytes inside a block of size 272 free'd
==9735==    at 0x4C2ACBD: free (vg_replace_malloc.c:530)
==9735==    by 0x5645B9D: rpc_clnt_destroy (rpc-clnt.c:1777)
==9735==    by 0x5645B9D: rpc_clnt_notify (rpc-clnt.c:950)
==9735==    by 0x56419AB: rpc_transport_unref (rpc-transport.c:517)
==9735==    by 0x5644A38: rpc_clnt_trigger_destroy (rpc-clnt.c:1766)
==9735==    by 0x5644A38: rpc_clnt_unref (rpc-clnt.c:1803)
==9735==    by 0x5644E3F: call_bail (rpc-clnt.c:197)
==9735==    by 0x5AA6981: gf_timer_proc (timer.c:165)
==9735==    by 0x689DDD4: start_thread (pthread_create.c:308)
==9735==    by 0x515DB3C: clone (clone.S:113)


Regards
Mohit Agrawal
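
The idea behind the fix in the RCA above (detach the shared pointer inside the critical section so that only one path can tear it down) can be illustrated with a small Python analogue. This is a sketch of the general pattern, not the glusterfs patch: the Connection class, its methods, and the destroy_count counter are all hypothetical and exist only for this example.

```python
import threading


class Connection:
    """Hypothetical stand-in for an RPC connection that owns saved frames."""

    def __init__(self):
        self.lock = threading.Lock()
        self.saved_frames = ["frame-1", "frame-2"]
        self.destroy_count = 0  # how many times the frames were torn down

    def _take_saved_frames(self):
        # Detach the frames inside the critical section. Only the first
        # caller gets them; every later caller sees None.
        with self.lock:
            frames = self.saved_frames
            self.saved_frames = None
        return frames

    def _destroy_frames(self, frames):
        # Teardown happens outside the lock, on a reference no one else holds.
        frames.clear()
        with self.lock:
            self.destroy_count += 1

    def cleanup(self):
        # Analogue of the connection cleanup path.
        frames = self._take_saved_frames()
        if frames is not None:
            self._destroy_frames(frames)

    def destroy(self):
        # Analogue of the client destroy path.
        frames = self._take_saved_frames()
        if frames is not None:
            self._destroy_frames(frames)


conn = Connection()
threads = [
    threading.Thread(target=conn.cleanup),
    threading.Thread(target=conn.destroy),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(conn.destroy_count)
```

Whichever thread takes the lock first performs the teardown; the other finds None and skips it, which is the behaviour the RCA aims for in place of a second free of the same block.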

Comment 23 errata-xmlrpc 2018-09-04 06:50:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2607

