Bug 1601791

Summary: Core dump in CTR xlator while running pv create, delete and gluster volume heal in parallel
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: vinutha <vinug>
Component: rhgs-server-container
Assignee: Raghavendra Talur <rtalur>
Status: CLOSED ERRATA
QA Contact: Sri Vignesh Selvan <sselvan>
Severity: high
Priority: urgent
Version: cns-3.9
CC: amukherj, asriram, knarra, kramdoss, madam, rhs-bugs, rtalur, sselvan, storage-qa-internal
Target Milestone: ---
Keywords: ZStream
Target Release: OCS 3.11.z Batch Update 4
Flags: sselvan: needinfo-
Hardware: x86_64
OS: Linux
Fixed In Version: rhgs-server-container-3.11.4-1
Doc Type: Bug Fix
Doc Text:
Previously, a race condition caused the CTR translator to crash when bricks were added to and removed from the brick multiplexing process, so the crashed bricks were not available online for the volume. With this fix, the CTR translator is not loaded on volumes that do not need it, and the crash is no longer observed.
Story Points: ---
Clones: 1601841 (view as bug list)
Last Closed: 2019-10-30 12:32:53 UTC
Type: Bug
Bug Depends On: 1601841
Bug Blocks: 1707226

Description vinutha 2018-07-17 09:06:51 UTC
Description of problem:
In a 4-node CNS setup, a core file was observed on one of the gluster pods while creating and deleting PVs in a loop along with running gluster volume heal on all gluster pods.

Version-Release number of selected component (if applicable):
# rpm -qa| grep openshift
openshift-ansible-roles-3.9.31-1.git.34.154617d.el7.noarch
atomic-openshift-excluder-3.9.31-1.git.0.ef9737b.el7.noarch
atomic-openshift-master-3.9.31-1.git.0.ef9737b.el7.x86_64
atomic-openshift-sdn-ovs-3.9.31-1.git.0.ef9737b.el7.x86_64
atomic-openshift-3.9.31-1.git.0.ef9737b.el7.x86_64
openshift-ansible-docs-3.9.31-1.git.34.154617d.el7.noarch
openshift-ansible-playbooks-3.9.31-1.git.34.154617d.el7.noarch
atomic-openshift-docker-excluder-3.9.31-1.git.0.ef9737b.el7.noarch
atomic-openshift-node-3.9.31-1.git.0.ef9737b.el7.x86_64
atomic-openshift-clients-3.9.31-1.git.0.ef9737b.el7.x86_64
openshift-ansible-3.9.31-1.git.34.154617d.el7.noarch

# oc rsh glusterfs-storage-mrfh4 
sh-4.2# rpm -qa| grep gluster 
glusterfs-libs-3.8.4-54.10.el7rhgs.1.HOTFIX.CASE02129707.BZ1484412.x86_64
glusterfs-3.8.4-54.10.el7rhgs.1.HOTFIX.CASE02129707.BZ1484412.x86_64
glusterfs-api-3.8.4-54.10.el7rhgs.1.HOTFIX.CASE02129707.BZ1484412.x86_64
glusterfs-fuse-3.8.4-54.10.el7rhgs.1.HOTFIX.CASE02129707.BZ1484412.x86_64
glusterfs-server-3.8.4-54.10.el7rhgs.1.HOTFIX.CASE02129707.BZ1484412.x86_64
gluster-block-0.2.1-14.1.el7rhgs.x86_64
glusterfs-client-xlators-3.8.4-54.10.el7rhgs.1.HOTFIX.CASE02129707.BZ1484412.x86_64
glusterfs-cli-3.8.4-54.10.el7rhgs.1.HOTFIX.CASE02129707.BZ1484412.x86_64
glusterfs-geo-replication-3.8.4-54.10.el7rhgs.1.HOTFIX.CASE02129707.BZ1484412.x86_64
glusterfs-debuginfo-3.8.4-54.10.el7rhgs.1.HOTFIX.CASE02129707.BZ1484412.x86_64

# oc rsh heketi-storage-1-55bw4 
sh-4.2# rpm -qa| grep heketi
python-heketi-6.0.0-7.4.el7rhgs.x86_64
heketi-client-6.0.0-7.4.el7rhgs.x86_64
heketi-6.0.0-7.4.el7rhgs.x86_64


How reproducible:
1/1

Steps to Reproduce:
4-node CNS setup, each node having a 1 TB device, CPU = 32 (4 cores), and memory = 72 GB

1. Created 100 1 GB mongodb pods and ran IO (using dd)

2. Upgraded the system from the 3.9 live build to the Experian hotfix build

3. After all 4 gluster pods had spun up and were in 1/1 Running state, all mongodb pods were also in Running state.

4. Initiated creation and deletion of 200 PVs along with running gluster volume heal on all 4 gluster pods (scripts below).

---- creation and deletion of PVs ----------
while true
do
    for i in {101..300}
    do
        ./pvc_create.sh c$i 1; sleep 30;
    done

    sleep 40

    for i in {101..300}
    do
        oc delete pvc c$i; sleep 20;
    done
done
---------------------pv creation/deletion-------------
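
The pvc_create.sh helper itself is not attached to this bug. Below is a minimal sketch of what such a script typically does, assuming a glusterfs-backed StorageClass named "glusterfs-storage"; the script name and arguments are taken from the loop above, everything else is an assumption, not the actual script used in this test.

---- pvc_create.sh (hypothetical sketch, not the actual script) ----------
#!/bin/bash
# Usage: ./pvc_create.sh <claim-name> <size-in-GiB>
# Assumes a glusterfs-provisioned StorageClass named "glusterfs-storage".
NAME=$1
SIZE=${2}Gi

oc create -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ${NAME}
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: ${SIZE}
  storageClassName: glusterfs-storage
EOF
--------------------------------------------------------------------------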

---- running gluster volume heal on all gluster pods ----------
while true
do
    for i in $(gluster v list | grep vol)
    do
        gluster v heal $i; sleep 2;
    done
done
---------------------------------------------------------------

5. A core file is generated on one of the gluster pods.
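
For triage, a typical way to pull the crash backtrace out of the pod is sketched below; the pod name is taken from the version output above, while the core file location and the crashing binary (glusterfsd) are assumptions that depend on the kernel's core_pattern on the host.

---- inspecting the core (sketch; core path and binary are assumptions) ----------
# oc rsh glusterfs-storage-mrfh4
sh-4.2# ls /core*                              # actual location depends on kernel.core_pattern
sh-4.2# gdb /usr/sbin/glusterfsd /core.<pid>   # glusterfs-debuginfo is already installed (see rpm list above)
(gdb) bt                                       # backtrace of the crashing thread
(gdb) thread apply all bt                      # backtraces of all threads
----------------------------------------------------------------------------------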


Actual results:
A core file is generated on one of the gluster pods, and 2 gluster pods are in 0/1 state.

Expected results:
No core files should be generated and all gluster pods should be in 1/1 Running state. 

Additional info:

Comment 3 Amar Tumballi 2018-07-17 09:42:10 UTC
Considering that the issue is in the 'fini' path of the CTR xlator, which is used only by tiering, we should consider removing it from the volgen in CNS builds altogether. That should fix it, and for backward compatibility in RHGS (for the 1-2% of customers who use it), we can consider making it an option.

Mohit's patch in this regard should help: https://review.gluster.org/#/c/20501/
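
Until such a patch lands, one way to check whether CTR is in play on a given volume, and to switch it off, is sketched below. The volume name is a placeholder, and whether features.ctr-enabled can be toggled on a plain (non-tier) volume in the shipped build is an assumption to be verified; this is a triage aid, not the eventual fix described above.

---- checking / disabling CTR on a volume (sketch; volume name is a placeholder) ----------
# look for the changetimerecorder (CTR) xlator in the brick volfiles
sh-4.2# grep -l 'features/changetimerecorder' /var/lib/glusterd/vols/<volname>/*.vol

# CTR is governed by a volume option; turning it off avoids the CTR code paths
sh-4.2# gluster volume set <volname> features.ctr-enabled off
-------------------------------------------------------------------------------------------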

Comment 8 vinutha 2018-07-17 19:47:31 UTC
(In reply to comment #5)
Karthick has changed the component to CNS as per comment #7. Hence clearing the needinfo.

Comment 25 errata-xmlrpc 2019-10-30 12:32:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:3257