Description of problem:
We have a 24-node gluster cluster with a distributed-disperse volume whose bricks come from all the nodes (48 bricks). While using gfapi to get the volume utilization, it throws a segmentation fault.

The volume info for the concerned volume is as below:

# gluster v info volume_gama_disperse_4_plus_2x2

Volume Name: volume_gama_disperse_4_plus_2x2
Type: Distributed-Disperse
Volume ID: b7947c8d-c0e6-458a-a3d5-47221a5a0e63
Status: Stopped
Snapshot Count: 0
Number of Bricks: 8 x (4 + 2) = 48
Transport-type: tcp
Bricks:
Brick1: dahorak-usm3-gl01.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick2: dahorak-usm3-gl02.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick3: dahorak-usm3-gl03.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick4: dahorak-usm3-gl04.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick5: dahorak-usm3-gl05.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick6: dahorak-usm3-gl06.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick7: dahorak-usm3-gl07.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick8: dahorak-usm3-gl08.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick9: dahorak-usm3-gl09.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick10: dahorak-usm3-gl10.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick11: dahorak-usm3-gl11.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick12: dahorak-usm3-gl12.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick13: dahorak-usm3-gl13.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick14: dahorak-usm3-gl14.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick15: dahorak-usm3-gl15.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick16: dahorak-usm3-gl16.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick17: dahorak-usm3-gl17.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick18: dahorak-usm3-gl18.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick19: dahorak-usm3-gl19.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick20: dahorak-usm3-gl20.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick21: dahorak-usm3-gl21.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick22: dahorak-usm3-gl22.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick23: dahorak-usm3-gl23.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick24: dahorak-usm3-gl24.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick25: dahorak-usm3-gl01.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick26: dahorak-usm3-gl02.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick27: dahorak-usm3-gl03.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick28: dahorak-usm3-gl04.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick29: dahorak-usm3-gl05.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick30: dahorak-usm3-gl06.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick31: dahorak-usm3-gl07.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick32: dahorak-usm3-gl08.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick33: dahorak-usm3-gl09.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick34: dahorak-usm3-gl10.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick35: dahorak-usm3-gl11.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick36: dahorak-usm3-gl12.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick37: dahorak-usm3-gl13.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick38: dahorak-usm3-gl14.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick39: dahorak-usm3-gl15.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick40: dahorak-usm3-gl16.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick41: dahorak-usm3-gl17.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick42: dahorak-usm3-gl18.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick43: dahorak-usm3-gl19.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick44: dahorak-usm3-gl20.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick45: dahorak-usm3-gl21.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick46: dahorak-usm3-gl22.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick47: dahorak-usm3-gl23.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick48: dahorak-usm3-gl24.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
transport.address-family: inet
nfs.disable: on

Version-Release number of selected component (if applicable):
glusterfs-fuse-3.12.2-13.el7rhgs.x86_64
vdsm-gluster-4.19.43-2.3.el7rhgs.noarch
tendrl-gluster-integration-1.6.4-7.fc23.noarch
gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
glusterfs-rdma-3.12.2-13.el7rhgs.x86_64
glusterfs-cli-3.12.2-13.el7rhgs.x86_64
python2-gluster-3.12.2-13.el7rhgs.x86_64
libvirt-daemon-driver-storage-gluster-3.9.0-14.el7_5.6.x86_64
glusterfs-geo-replication-3.12.2-13.el7rhgs.x86_64
glusterfs-libs-3.12.2-13.el7rhgs.x86_64
glusterfs-3.12.2-13.el7rhgs.x86_64
glusterfs-events-3.12.2-13.el7rhgs.x86_64
glusterfs-server-3.12.2-13.el7rhgs.x86_64
glusterfs-api-3.12.2-13.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-13.el7rhgs.x86_64

The WA code which invokes volume utilization using gfapi is at
https://github.com/Tendrl/gluster-integration/blob/master/tendrl/gluster_integration/gfapi.py

How reproducible:
Always

Steps to Reproduce:
1. Create a 24-node gluster cluster
2. Create an 8 x (4+2) distributed-disperse volume with bricks from all 24 nodes
3. Run the volume utilization utility

Actual results:
Throws a segmentation fault. For other volumes of type distributed-replicate, the volume utilization is shown as expected.

Expected results:
It should show the volume utilization details for all the volumes.

Additional info:
The trace from the segmentation fault is attached for reference.
Created attachment 1458625 [details] gfapi-segfault.txt
If I create a distributed-disperse volume with a smaller number of bricks from fewer nodes, the volume utilization details are shown properly, as shown below:

# gluster v info test-disp

Volume Name: test-disp
Type: Distributed-Disperse
Volume ID: b2d2d004-34be-4448-9320-6a952b562447
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x (2 + 1) = 6
Transport-type: tcp
Bricks:
Brick1: dahorak-usm3-gl01.usmqe.lab.eng.blr.redhat.com:/root/gluster_bricks/test-disp_b1
Brick2: dahorak-usm3-gl01.usmqe.lab.eng.blr.redhat.com:/root/gluster_bricks/test-disp_b2
Brick3: dahorak-usm3-gl01.usmqe.lab.eng.blr.redhat.com:/root/gluster_bricks/test-disp_b3
Brick4: dahorak-usm3-gl01.usmqe.lab.eng.blr.redhat.com:/root/gluster_bricks/test-disp_b4
Brick5: dahorak-usm3-gl01.usmqe.lab.eng.blr.redhat.com:/root/gluster_bricks/test-disp_b5
Brick6: dahorak-usm3-gl01.usmqe.lab.eng.blr.redhat.com:/root/gluster_bricks/test-disp_b6
Options Reconfigured:
transport.address-family: inet
nfs.disable: on

# tendrl-gluster-vol-utilization test-disp
{"test-disp": {"pcnt_used": 16.85206375567077, "used": 2130144.0, "used_inode": 22788, "free": 10510112.0, "pcnt_inode_used": 0.7207355373730537, "total_inode": 3161770, "total": 12640256.0}}
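For reference, the fields in that JSON are simple ratios over statvfs-style counters. A minimal sketch of the arithmetic (volume_utilization is a hypothetical helper written for illustration, not the actual tendrl function):

```python
def volume_utilization(total_kb, free_kb, total_inodes, free_inodes):
    """Derive utilization fields (as in the JSON above) from
    statvfs-style block and inode counters."""
    used_kb = total_kb - free_kb
    used_inodes = total_inodes - free_inodes
    return {
        "total": total_kb,
        "free": free_kb,
        "used": used_kb,
        "pcnt_used": used_kb / total_kb * 100.0,
        "total_inode": total_inodes,
        "used_inode": used_inodes,
        "pcnt_inode_used": used_inodes / float(total_inodes) * 100.0,
    }

# Plugging in the test-disp numbers from the output above:
stats = volume_utilization(12640256.0, 10510112.0, 3161770, 3161770 - 22788)
```

With the test-disp counters this reproduces pcnt_used of about 16.85 and pcnt_inode_used of about 0.72, matching the utility's output.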
Can you provide the core dump to debug this further? Without the core it is not possible to analyse what caused the crash. Also, installing the debuginfo packages and pasting the backtrace from the core would be more helpful.
Hi,

RCA: The gf_client program crashes in rpc_clnt_connection_cleanup while destroying the saved frames on the connection, because the saved frames have already been destroyed by rpc_clnt_destroy. To avoid this race, set saved_frames to NULL inside the critical section in rpc_clnt_destroy.

I ran the client program under valgrind and found an access "0 bytes inside a block" that had already been freed, at the time of destroying the frame:

==9735== Address 0x18abbe70 is 0 bytes inside a block of size 272 free'd
==9735==    at 0x4C2ACBD: free (vg_replace_malloc.c:530)
==9735==    by 0x5645B9D: rpc_clnt_destroy (rpc-clnt.c:1777)
==9735==    by 0x5645B9D: rpc_clnt_notify (rpc-clnt.c:950)
==9735==    by 0x56419AB: rpc_transport_unref (rpc-transport.c:517)
==9735==    by 0x5644A38: rpc_clnt_trigger_destroy (rpc-clnt.c:1766)
==9735==    by 0x5644A38: rpc_clnt_unref (rpc-clnt.c:1803)
==9735==    by 0x5644E3F: call_bail (rpc-clnt.c:197)
==9735==    by 0x5AA6981: gf_timer_proc (timer.c:165)
==9735==    by 0x689DDD4: start_thread (pthread_create.c:308)
==9735==    by 0x515DB3C: clone (clone.S:113)

Regards,
Mohit Agrawal
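The fix pattern described in the RCA (clear the saved_frames pointer inside the critical section, so whichever teardown path loses the race sees NULL and skips the free) can be sketched with a toy Python model. Connection, cleanup and destroy below are illustrative stand-ins, not the actual rpc-clnt code:

```python
import threading

class Connection:
    """Toy model of the rpc_clnt connection state (illustrative only)."""

    def __init__(self):
        self.lock = threading.Lock()
        self.saved_frames = ["frame-1", "frame-2"]
        self.destroy_count = 0  # how many times the frames were torn down

    def _destroy_frames(self, frames):
        # Stand-in for saved_frames_destroy(); counting lets us detect
        # a double free in this model.
        self.destroy_count += 1

    def _take_frames(self):
        # The fix: swap the pointer to None *inside* the critical
        # section, so exactly one racing caller obtains the frames.
        with self.lock:
            frames, self.saved_frames = self.saved_frames, None
        return frames

    def cleanup(self):
        """Models rpc_clnt_connection_cleanup."""
        frames = self._take_frames()
        if frames is not None:
            self._destroy_frames(frames)

    def destroy(self):
        """Models rpc_clnt_destroy racing against cleanup."""
        frames = self._take_frames()
        if frames is not None:
            self._destroy_frames(frames)

conn = Connection()
t1 = threading.Thread(target=conn.cleanup)
t2 = threading.Thread(target=conn.destroy)
t1.start(); t2.start()
t1.join(); t2.join()
```

However the two threads interleave, the swap-under-lock guarantees the frame list is destroyed exactly once, which is the invariant the actual C fix restores.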
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2018:2607