Description of problem:
We have a 24-node gluster cluster with a distributed-disperse volume whose bricks come from all the nodes (48 bricks). While using gfapi to get the volume utilization, it throws a segmentation fault.

The volume info for the concerned volume is as below:

# gluster v info volume_gama_disperse_4_plus_2x2

Volume Name: volume_gama_disperse_4_plus_2x2
Type: Distributed-Disperse
Volume ID: b7947c8d-c0e6-458a-a3d5-47221a5a0e63
Status: Stopped
Snapshot Count: 0
Number of Bricks: 8 x (4 + 2) = 48
Transport-type: tcp
Bricks:
Brick1: dahorak-usm3-gl01.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick2: dahorak-usm3-gl02.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick3: dahorak-usm3-gl03.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick4: dahorak-usm3-gl04.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick5: dahorak-usm3-gl05.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick6: dahorak-usm3-gl06.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick7: dahorak-usm3-gl07.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick8: dahorak-usm3-gl08.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick9: dahorak-usm3-gl09.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick10: dahorak-usm3-gl10.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick11: dahorak-usm3-gl11.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick12: dahorak-usm3-gl12.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick13: dahorak-usm3-gl13.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick14: dahorak-usm3-gl14.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick15: dahorak-usm3-gl15.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick16: dahorak-usm3-gl16.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick17: dahorak-usm3-gl17.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick18: dahorak-usm3-gl18.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick19: dahorak-usm3-gl19.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick20: dahorak-usm3-gl20.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick21: dahorak-usm3-gl21.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick22: dahorak-usm3-gl22.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick23: dahorak-usm3-gl23.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick24: dahorak-usm3-gl24.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_1/1
Brick25: dahorak-usm3-gl01.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick26: dahorak-usm3-gl02.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick27: dahorak-usm3-gl03.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick28: dahorak-usm3-gl04.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick29: dahorak-usm3-gl05.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick30: dahorak-usm3-gl06.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick31: dahorak-usm3-gl07.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick32: dahorak-usm3-gl08.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick33: dahorak-usm3-gl09.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick34: dahorak-usm3-gl10.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick35: dahorak-usm3-gl11.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick36: dahorak-usm3-gl12.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick37: dahorak-usm3-gl13.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick38: dahorak-usm3-gl14.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick39: dahorak-usm3-gl15.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick40: dahorak-usm3-gl16.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick41: dahorak-usm3-gl17.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick42: dahorak-usm3-gl18.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick43: dahorak-usm3-gl19.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick44: dahorak-usm3-gl20.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick45: dahorak-usm3-gl21.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick46: dahorak-usm3-gl22.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick47: dahorak-usm3-gl23.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Brick48: dahorak-usm3-gl24.usmqe.lab.eng.blr.redhat.com:/mnt/brick_gama_disperse_2/2
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
transport.address-family: inet
nfs.disable: on

Version-Release number of selected component (if applicable):
glusterfs-fuse-3.12.2-13.el7rhgs.x86_64
vdsm-gluster-4.19.43-2.3.el7rhgs.noarch
tendrl-gluster-integration-1.6.4-7.fc23.noarch
gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
glusterfs-rdma-3.12.2-13.el7rhgs.x86_64
glusterfs-cli-3.12.2-13.el7rhgs.x86_64
python2-gluster-3.12.2-13.el7rhgs.x86_64
libvirt-daemon-driver-storage-gluster-3.9.0-14.el7_5.6.x86_64
glusterfs-geo-replication-3.12.2-13.el7rhgs.x86_64
glusterfs-libs-3.12.2-13.el7rhgs.x86_64
glusterfs-3.12.2-13.el7rhgs.x86_64
glusterfs-events-3.12.2-13.el7rhgs.x86_64
glusterfs-server-3.12.2-13.el7rhgs.x86_64
glusterfs-api-3.12.2-13.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-13.el7rhgs.x86_64

The WA code which invokes volume utilization using gfapi is at
https://github.com/Tendrl/gluster-integration/blob/master/tendrl/gluster_integration/gfapi.py

How reproducible:
Always

Steps to Reproduce:
1. Create a 24-node gluster cluster
2. Create an 8 x (4+2) distributed-disperse volume with bricks from all 24 nodes
3. Run the volume utilization utility

Actual results:
Throws a segmentation fault. For other volumes of type distributed-replicate, the volume utilization is shown as expected.

Expected results:
It should show the volume utilization details for all the volumes.

Additional info:
The trace from the segmentation fault is attached for reference.
Created attachment 1458625 [details] gfapi-segfault.txt
If I create a distributed-disperse volume with a smaller number of bricks from fewer nodes, the volume utilization details are shown properly, as shown below:

# gluster v info test-disp

Volume Name: test-disp
Type: Distributed-Disperse
Volume ID: b2d2d004-34be-4448-9320-6a952b562447
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x (2 + 1) = 6
Transport-type: tcp
Bricks:
Brick1: dahorak-usm3-gl01.usmqe.lab.eng.blr.redhat.com:/root/gluster_bricks/test-disp_b1
Brick2: dahorak-usm3-gl01.usmqe.lab.eng.blr.redhat.com:/root/gluster_bricks/test-disp_b2
Brick3: dahorak-usm3-gl01.usmqe.lab.eng.blr.redhat.com:/root/gluster_bricks/test-disp_b3
Brick4: dahorak-usm3-gl01.usmqe.lab.eng.blr.redhat.com:/root/gluster_bricks/test-disp_b4
Brick5: dahorak-usm3-gl01.usmqe.lab.eng.blr.redhat.com:/root/gluster_bricks/test-disp_b5
Brick6: dahorak-usm3-gl01.usmqe.lab.eng.blr.redhat.com:/root/gluster_bricks/test-disp_b6
Options Reconfigured:
transport.address-family: inet
nfs.disable: on

# tendrl-gluster-vol-utilization test-disp
{"test-disp": {"pcnt_used": 16.85206375567077, "used": 2130144.0, "used_inode": 22788, "free": 10510112.0, "pcnt_inode_used": 0.7207355373730537, "total_inode": 3161770, "total": 12640256.0}}
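For reference, the fields in that JSON are simple ratios over statvfs-style counters. A minimal sketch of the arithmetic (volume_utilization is a hypothetical helper written for illustration, not the actual tendrl function):

```python
def volume_utilization(total_kb, free_kb, total_inodes, free_inodes):
    """Derive utilization fields (as in the JSON above) from
    statvfs-style block and inode counters."""
    used_kb = total_kb - free_kb
    used_inodes = total_inodes - free_inodes
    return {
        "total": total_kb,
        "free": free_kb,
        "used": used_kb,
        "pcnt_used": used_kb / total_kb * 100.0,
        "total_inode": total_inodes,
        "used_inode": used_inodes,
        "pcnt_inode_used": used_inodes / float(total_inodes) * 100.0,
    }

# Plugging in the test-disp numbers from the output above:
stats = volume_utilization(12640256.0, 10510112.0, 3161770, 3161770 - 22788)
```

With the test-disp counters this reproduces pcnt_used of about 16.85 and pcnt_inode_used of about 0.72, matching the utility's output.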
Can you provide the core dump to debug this further? Without the core it is not possible to analyse what caused the crash. Also, installing the debuginfo packages and pasting the backtrace from the core would be more helpful.
Hi,

RCA: The gf_client program crashes in rpc_clnt_connection_cleanup while destroying the saved frames on the connection, because the saved frames have already been destroyed by rpc_clnt_destroy. To avoid this race, set saved_frames to NULL inside the critical section in rpc_clnt_destroy.

I ran the client program under valgrind and found an access "0 bytes inside a block" that had already been freed, at the time of destroying the frame:

==9735== Address 0x18abbe70 is 0 bytes inside a block of size 272 free'd
==9735==    at 0x4C2ACBD: free (vg_replace_malloc.c:530)
==9735==    by 0x5645B9D: rpc_clnt_destroy (rpc-clnt.c:1777)
==9735==    by 0x5645B9D: rpc_clnt_notify (rpc-clnt.c:950)
==9735==    by 0x56419AB: rpc_transport_unref (rpc-transport.c:517)
==9735==    by 0x5644A38: rpc_clnt_trigger_destroy (rpc-clnt.c:1766)
==9735==    by 0x5644A38: rpc_clnt_unref (rpc-clnt.c:1803)
==9735==    by 0x5644E3F: call_bail (rpc-clnt.c:197)
==9735==    by 0x5AA6981: gf_timer_proc (timer.c:165)
==9735==    by 0x689DDD4: start_thread (pthread_create.c:308)
==9735==    by 0x515DB3C: clone (clone.S:113)

Regards,
Mohit Agrawal
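The fix pattern described in the RCA (clear the saved_frames pointer inside the critical section, so whichever teardown path loses the race sees NULL and skips the free) can be sketched with a toy Python model. Connection, cleanup and destroy below are illustrative stand-ins, not the actual rpc-clnt code:

```python
import threading

class Connection:
    """Toy model of the rpc_clnt connection state (illustrative only)."""

    def __init__(self):
        self.lock = threading.Lock()
        self.saved_frames = ["frame-1", "frame-2"]
        self.destroy_count = 0  # how many times the frames were torn down

    def _destroy_frames(self, frames):
        # Stand-in for saved_frames_destroy(); counting lets us detect
        # a double free in this model.
        self.destroy_count += 1

    def _take_frames(self):
        # The fix: swap the pointer to None *inside* the critical
        # section, so exactly one racing caller obtains the frames.
        with self.lock:
            frames, self.saved_frames = self.saved_frames, None
        return frames

    def cleanup(self):
        """Models rpc_clnt_connection_cleanup."""
        frames = self._take_frames()
        if frames is not None:
            self._destroy_frames(frames)

    def destroy(self):
        """Models rpc_clnt_destroy racing against cleanup."""
        frames = self._take_frames()
        if frames is not None:
            self._destroy_frames(frames)

conn = Connection()
t1 = threading.Thread(target=conn.cleanup)
t2 = threading.Thread(target=conn.destroy)
t1.start(); t2.start()
t1.join(); t2.join()
```

However the two threads interleave, the swap-under-lock guarantees the frame list is destroyed exactly once, which is the invariant the actual C fix restores.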
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2018:2607