Bug 1601791 - Core dump in CTR xlator while running pv create, delete and gluster volume heal in parallel
Summary: Core dump in CTR xlator while running pv create, delete and gluster volume heal in parallel
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: rhgs-server-container
Version: cns-3.9
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: high
Target Milestone: ---
Target Release: OCS 3.11.z Batch Update 4
Assignee: Raghavendra Talur
QA Contact: Sri Vignesh Selvan
URL:
Whiteboard:
Depends On: 1601841
Blocks: 1707226
 
Reported: 2018-07-17 09:06 UTC by vinutha
Modified: 2019-10-30 12:33 UTC (History)
9 users

Fixed In Version: rhgs-server-container-3.11.4-1
Doc Type: Bug Fix
Doc Text:
Previously, a race condition caused the CTR translator to crash when bricks were added to and removed from the brick multiplexing process, leaving the crashed bricks offline for the volume. With this fix, the CTR translator is no longer loaded for volumes that do not need it, and the crash is no longer observed.
Clone Of:
: 1601841 (view as bug list)
Environment:
Last Closed: 2019-10-30 12:32:53 UTC
Target Upstream Version:
sselvan: needinfo-




Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2019:3257 None None None 2019-10-30 12:33:10 UTC

Description vinutha 2018-07-17 09:06:51 UTC
Description of problem:
In a CNS 4-node setup, observed a core file generated on one of the gluster pods while creating and deleting PVs in a loop along with running gluster volume heal on all gluster pods.

Version-Release number of selected component (if applicable):
# rpm -qa| grep openshift
openshift-ansible-roles-3.9.31-1.git.34.154617d.el7.noarch
atomic-openshift-excluder-3.9.31-1.git.0.ef9737b.el7.noarch
atomic-openshift-master-3.9.31-1.git.0.ef9737b.el7.x86_64
atomic-openshift-sdn-ovs-3.9.31-1.git.0.ef9737b.el7.x86_64
atomic-openshift-3.9.31-1.git.0.ef9737b.el7.x86_64
openshift-ansible-docs-3.9.31-1.git.34.154617d.el7.noarch
openshift-ansible-playbooks-3.9.31-1.git.34.154617d.el7.noarch
atomic-openshift-docker-excluder-3.9.31-1.git.0.ef9737b.el7.noarch
atomic-openshift-node-3.9.31-1.git.0.ef9737b.el7.x86_64
atomic-openshift-clients-3.9.31-1.git.0.ef9737b.el7.x86_64
openshift-ansible-3.9.31-1.git.34.154617d.el7.noarch

# oc rsh glusterfs-storage-mrfh4 
sh-4.2# rpm -qa| grep gluster 
glusterfs-libs-3.8.4-54.10.el7rhgs.1.HOTFIX.CASE02129707.BZ1484412.x86_64
glusterfs-3.8.4-54.10.el7rhgs.1.HOTFIX.CASE02129707.BZ1484412.x86_64
glusterfs-api-3.8.4-54.10.el7rhgs.1.HOTFIX.CASE02129707.BZ1484412.x86_64
glusterfs-fuse-3.8.4-54.10.el7rhgs.1.HOTFIX.CASE02129707.BZ1484412.x86_64
glusterfs-server-3.8.4-54.10.el7rhgs.1.HOTFIX.CASE02129707.BZ1484412.x86_64
gluster-block-0.2.1-14.1.el7rhgs.x86_64
glusterfs-client-xlators-3.8.4-54.10.el7rhgs.1.HOTFIX.CASE02129707.BZ1484412.x86_64
glusterfs-cli-3.8.4-54.10.el7rhgs.1.HOTFIX.CASE02129707.BZ1484412.x86_64
glusterfs-geo-replication-3.8.4-54.10.el7rhgs.1.HOTFIX.CASE02129707.BZ1484412.x86_64
glusterfs-debuginfo-3.8.4-54.10.el7rhgs.1.HOTFIX.CASE02129707.BZ1484412.x86_64

# oc rsh heketi-storage-1-55bw4 
sh-4.2# rpm -qa| grep heketi
python-heketi-6.0.0-7.4.el7rhgs.x86_64
heketi-client-6.0.0-7.4.el7rhgs.x86_64
heketi-6.0.0-7.4.el7rhgs.x86_64


How reproducible:
1/1

Steps to Reproduce:
CNS 4-node setup, each node having one 1TB device, CPU = 32 (4 cores), Memory = 72GB

1. Created 100 1GB mongodb pods and ran I/O (using dd)

2. Upgraded the system from the 3.9 live build to the Experian hotfix build

3. After all 4 gluster pods had spun up and were in 1/1 Running state. All mongodb pods were also in Running state.

4. Initiated creation and deletion of 200 PVs along with running gluster volume heal on all 4 gluster pods.

---- creation and deletion of pvs ----------
while true
do
    for i in {101..300}
    do
        ./pvc_create.sh c$i 1; sleep 30;
    done

    sleep 40

    for i in {101..300}
    do
        oc delete pvc c$i; sleep 20;
    done
done
---------------------pv creation/deletion-------------
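The `./pvc_create.sh` helper above is the reporter's site-specific script and was not attached. A minimal stand-in, assuming a StorageClass named `glusterfs-storage` (the usual name in a CNS deployment, but an assumption here), might look like this; it emits a PVC manifest for the given claim name and size in GiB, which would then be piped to `oc create -f -`:

```shell
# Hypothetical equivalent of ./pvc_create.sh <name> <size-gb>; the real
# script was not provided with this bug. Emits a PVC manifest on stdout.
pvc_manifest() {
    local name=$1 size_gb=$2
    cat <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ${name}
  annotations:
    volume.beta.kubernetes.io/storage-class: glusterfs-storage
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: ${size_gb}Gi
EOF
}

# Print the manifest; the cluster command is commented out so the
# sketch is self-contained:
pvc_manifest c101 1
# pvc_manifest c101 1 | oc create -f -
```

On OCP 3.9 the `volume.beta.kubernetes.io/storage-class` annotation and the `spec.storageClassName` field are both accepted; the annotation form is shown here.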

Running gluster volume heal in a loop: while true; do for i in $(gluster v list | grep vol); do gluster v heal $i; sleep 2; done; done

5. A core file is generated on one of the gluster pods.


Actual results:
A core file is generated on one of the gluster pods, and 2 gluster pods are in 0/1 state.

Expected results:
No core files should be generated and all gluster pods should be in 1/1 Running state. 

Additional info:

Comment 3 Amar Tumballi 2018-07-17 09:42:10 UTC
Considering that the issue is in the 'fini' path of the CTR xlator, which is used only by tiering, we should consider removing it from the volgen in CNS builds altogether. That should fix it, and for backward compatibility in RHGS (for the 1-2% of customers who use it), we can consider making it an option.

Mohit's patch in this regard should help: https://review.gluster.org/#/c/20501/

Comment 8 vinutha 2018-07-17 19:47:31 UTC
(In reply to comment #5)
Karthick has changed the component to CNS as per comment #7. Hence clearing the needinfo.

Comment 25 errata-xmlrpc 2019-10-30 12:32:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:3257

