Bug 1600790

Summary: Segmentation fault while using gfapi to get volume utilization
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Shubhendu Tripathi <shtripat>
Component: rpc
Assignee: Mohit Agrawal <moagrawa>
Status: CLOSED ERRATA
QA Contact: Upasana <ubansal>
Severity: high
Docs Contact:
Priority: unspecified
Version: rhgs-3.4
CC: amukherj, apaladug, dahorak, jthottan, moagrawa, rhs-bugs, sankarshan, sheggodu, shtripat, skoduri, storage-qa-internal, ubansal
Target Milestone: ---   
Target Release: RHGS 3.4.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: glusterfs-3.12.2-15
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1607783 (view as bug list)
Environment:
Last Closed: 2018-09-04 06:50:20 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1607783    
Bug Blocks: 1503137, 1600092    
Attachments:
gfapi-segfault.txt (flags: none)

Description Shubhendu Tripathi 2018-07-13 03:58:49 UTC
Description of problem:
We have a 24-node gluster cluster with a distributed-disperse volume whose 48 bricks span all of the nodes. While using gfapi to get the volume utilization, the call throws a segmentation fault.

The volume info for the affected volume is as below:

# gluster v info volume_gama_disperse_4_plus_2x2
 
Volume Name: volume_gama_disperse_4_plus_2x2
Type: Distributed-Disperse
Volume ID: b7947c8d-c0e6-458a-a3d5-47221a5a0e63
Status: Stopped
Snapshot Count: 0
Number of Bricks: 8 x (4 + 2) = 48
Transport-type: tcp
Bricks:
Brick1: dahorak-usm3-gl01.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick2: dahorak-usm3-gl02.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick3: dahorak-usm3-gl03.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick4: dahorak-usm3-gl04.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick5: dahorak-usm3-gl05.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick6: dahorak-usm3-gl06.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick7: dahorak-usm3-gl07.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick8: dahorak-usm3-gl08.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick9: dahorak-usm3-gl09.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick10: dahorak-usm3-gl10.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick11: dahorak-usm3-gl11.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick12: dahorak-usm3-gl12.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick13: dahorak-usm3-gl13.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick14: dahorak-usm3-gl14.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick15: dahorak-usm3-gl15.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick16: dahorak-usm3-gl16.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick17: dahorak-usm3-gl17.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick18: dahorak-usm3-gl18.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick19: dahorak-usm3-gl19.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick20: dahorak-usm3-gl20.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick21: dahorak-usm3-gl21.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick22: dahorak-usm3-gl22.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick23: dahorak-usm3-gl23.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick24: dahorak-usm3-gl24.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick25: dahorak-usm3-gl01.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick26: dahorak-usm3-gl02.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick27: dahorak-usm3-gl03.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick28: dahorak-usm3-gl04.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick29: dahorak-usm3-gl05.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick30: dahorak-usm3-gl06.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick31: dahorak-usm3-gl07.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick32: dahorak-usm3-gl08.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick33: dahorak-usm3-gl09.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick34: dahorak-usm3-gl10.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick35: dahorak-usm3-gl11.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick36: dahorak-usm3-gl12.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick37: dahorak-usm3-gl13.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick38: dahorak-usm3-gl14.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick39: dahorak-usm3-gl15.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick40: dahorak-usm3-gl16.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick41: dahorak-usm3-gl17.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick42: dahorak-usm3-gl18.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick43: dahorak-usm3-gl19.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick44: dahorak-usm3-gl20.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick45: dahorak-usm3-gl21.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick46: dahorak-usm3-gl22.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick47: dahorak-usm3-gl23.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick48: dahorak-usm3-gl24.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
transport.address-family: inet
nfs.disable: on


Version-Release number of selected component (if applicable):

glusterfs-fuse-3.12.2-13.el7rhgs.x86_64
vdsm-gluster-4.19.43-2.3.el7rhgs.noarch
tendrl-gluster-integration-1.6.4-7.fc23.noarch
gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
glusterfs-rdma-3.12.2-13.el7rhgs.x86_64
glusterfs-cli-3.12.2-13.el7rhgs.x86_64
python2-gluster-3.12.2-13.el7rhgs.x86_64
libvirt-daemon-driver-storage-gluster-3.9.0-14.el7_5.6.x86_64
glusterfs-geo-replication-3.12.2-13.el7rhgs.x86_64
glusterfs-libs-3.12.2-13.el7rhgs.x86_64
glusterfs-3.12.2-13.el7rhgs.x86_64
glusterfs-events-3.12.2-13.el7rhgs.x86_64
glusterfs-server-3.12.2-13.el7rhgs.x86_64
glusterfs-api-3.12.2-13.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-13.el7rhgs.x86_64

The WA code that fetches the volume utilization via gfapi is at https://github.com/Tendrl/gluster-integration/blob/master/tendrl/gluster_integration/gfapi.py
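
For orientation, below is a minimal C sketch of what such a utilization query looks like through libgfapi; the Tendrl code linked above is Python over ctypes, so this is only an illustration of the same call sequence. The volume and host names are taken from this report, and the header path and pkg-config name are assumed from a standard glusterfs-api devel install.

/* Minimal illustrative sketch (NOT the Tendrl code, which is Python/ctypes):
 * query volume utilization through libgfapi the same way the WA utility does.
 * Build with e.g.:
 *   gcc vol-util.c -o vol-util $(pkg-config --cflags --libs glusterfs-api)
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/statvfs.h>
#include <glusterfs/api/glfs.h>

int main(void)
{
    /* Volume and host names are taken from this report; adjust as needed. */
    glfs_t *fs = glfs_new("volume_gama_disperse_4_plus_2x2");
    if (!fs)
        return EXIT_FAILURE;

    glfs_set_volfile_server(fs, "tcp",
                            "dahorak-usm3-gl01.usmqe.lab.eng.blr.redhat.com", 24007);
    glfs_set_logging(fs, "/tmp/gfapi-vol-util.log", 4);

    if (glfs_init(fs) != 0) {
        perror("glfs_init");
        glfs_fini(fs);
        return EXIT_FAILURE;
    }

    struct statvfs st;
    if (glfs_statvfs(fs, "/", &st) != 0) {
        perror("glfs_statvfs");
        glfs_fini(fs);
        return EXIT_FAILURE;
    }

    double total = (double)st.f_blocks * st.f_frsize;
    double avail = (double)st.f_bfree * st.f_frsize;
    double used  = total - avail;
    printf("total=%.0f used=%.0f pcnt_used=%.2f\n",
           total, used, total > 0 ? used * 100.0 / total : 0.0);

    glfs_fini(fs);
    return EXIT_SUCCESS;
}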

How reproducible:
Always

Steps to Reproduce:
1. Create a 24-node gluster cluster
2. Create an 8 x (4+2) distributed-disperse volume with bricks from all 24 nodes
3. Run the volume utilization utility

Actual results:
The utility throws a segmentation fault. For other volumes, of type distributed-replicate, it shows the volume utilization as expected.

Expected results:
It should show the volume utilization details for all the volumes.

Additional info:
The trace from the segmentation fault is attached for reference.

Comment 2 Shubhendu Tripathi 2018-07-13 04:00:30 UTC
Created attachment 1458625 [details]
gfapi-segfault.txt

Comment 3 Shubhendu Tripathi 2018-07-13 04:19:03 UTC
If I create a distributed-disperse volume with a smaller number of bricks from fewer nodes, the volume utilization details are shown properly, as shown below:

# gluster v info test-disp
 
Volume Name: test-disp
Type: Distributed-Disperse
Volume ID: b2d2d004-34be-4448-9320-6a952b562447
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x (2 + 1) = 6
Transport-type: tcp
Bricks:
Brick1: dahorak-usm3-gl01.usmqe.lab.eng.blr.redhat.com:/root/gluster_bricks/test-disp_b1
Brick2: dahorak-usm3-gl01.usmqe.lab.eng.blr.redhat.com:/root/gluster_bricks/test-disp_b2
Brick3: dahorak-usm3-gl01.usmqe.lab.eng.blr.redhat.com:/root/gluster_bricks/test-disp_b3
Brick4: dahorak-usm3-gl01.usmqe.lab.eng.blr.redhat.com:/root/gluster_bricks/test-disp_b4
Brick5: dahorak-usm3-gl01.usmqe.lab.eng.blr.redhat.com:/root/gluster_bricks/test-disp_b5
Brick6: dahorak-usm3-gl01.usmqe.lab.eng.blr.redhat.com:/root/gluster_bricks/test-disp_b6
Options Reconfigured:
transport.address-family: inet
nfs.disable: on

# tendrl-gluster-vol-utilization test-disp
{"test-disp": {"pcnt_used": 16.85206375567077, "used": 2130144.0, "used_inode": 22788, "free": 10510112.0, "pcnt_inode_used": 0.7207355373730537, "total_inode": 3161770, "total": 12640256.0}}

Comment 4 Poornima G 2018-07-20 06:19:30 UTC
Can you provide the core dump to debug this further? Without the core it's not possible to analyse what caused the crash. Also, installing the debuginfo packages and pasting the backtrace from the core would be more helpful.

Comment 12 Mohit Agrawal 2018-07-24 09:05:20 UTC
Hi,

RCA: The gfapi client program crashes in rpc_clnt_connection_cleanup while
     destroying the saved frames on the connection, because the saved frames
     have already been destroyed by rpc_clnt_destroy. To avoid this race,
     saved_frames is set to NULL inside the critical section in rpc_clnt_destroy.

     I ran the client program under valgrind, and at the time of destroying the
     frame it reported an access "0 bytes inside a block" that had already been
     free'd, as shown below:

==9735==  Address 0x18abbe70 is 0 bytes inside a block of size 272 free'd
==9735==    at 0x4C2ACBD: free (vg_replace_malloc.c:530)
==9735==    by 0x5645B9D: rpc_clnt_destroy (rpc-clnt.c:1777)
==9735==    by 0x5645B9D: rpc_clnt_notify (rpc-clnt.c:950)
==9735==    by 0x56419AB: rpc_transport_unref (rpc-transport.c:517)
==9735==    by 0x5644A38: rpc_clnt_trigger_destroy (rpc-clnt.c:1766)
==9735==    by 0x5644A38: rpc_clnt_unref (rpc-clnt.c:1803)
==9735==    by 0x5644E3F: call_bail (rpc-clnt.c:197)
==9735==    by 0x5AA6981: gf_timer_proc (timer.c:165)
==9735==    by 0x689DDD4: start_thread (pthread_create.c:308)
==9735==    by 0x515DB3C: clone (clone.S:113)


Regards
Mohit Agrawal
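
To illustrate the pattern described in the RCA above, here is a standalone C sketch of the "detach under the lock, free outside it" approach; the struct and function names are simplified stand-ins, not the actual rpc-clnt.c patch. Whichever teardown path runs first takes ownership of saved_frames inside the critical section and sets the pointer to NULL, so the competing path sees NULL and does not free the frames a second time.

/* Standalone illustration only; names are simplified stand-ins.
 * Build with: gcc race-sketch.c -o race-sketch -pthread
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct saved_frames {
    int count;
};

struct connection {
    pthread_mutex_t lock;
    struct saved_frames *saved_frames;
};

/* Called from more than one teardown path; must be safe to run twice. */
static void destroy_saved_frames(struct connection *conn, const char *who)
{
    struct saved_frames *frames;

    pthread_mutex_lock(&conn->lock);
    frames = conn->saved_frames;
    conn->saved_frames = NULL;   /* competing path now sees NULL and skips the free */
    pthread_mutex_unlock(&conn->lock);

    if (frames) {
        printf("%s: freeing %d saved frames\n", who, frames->count);
        free(frames);
    } else {
        printf("%s: nothing to free\n", who);
    }
}

int main(void)
{
    struct connection conn = { PTHREAD_MUTEX_INITIALIZER, NULL };

    conn.saved_frames = calloc(1, sizeof(*conn.saved_frames));
    if (!conn.saved_frames)
        return EXIT_FAILURE;
    conn.saved_frames->count = 3;

    destroy_saved_frames(&conn, "destroy path");  /* analogous to rpc_clnt_destroy   */
    destroy_saved_frames(&conn, "cleanup path");  /* analogous to connection cleanup */
    return EXIT_SUCCESS;
}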

Comment 23 errata-xmlrpc 2018-09-04 06:50:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2607